Speculative Decoding Settings don't work with llama-cpp backend #9371

@feal87

LocalAI version:
localai/localai:latest-gpu-nvidia-cuda-12 4.1.3

Environment, CPU architecture, OS, and Version:
WSL-2 Windows

Describe the bug
Tried with the following model settings:

"
draft_model: gemma-4-E4B.gguf
n_draft: 8

options:
- spec_type:draft
"

and I get the error below, despite the parameters seemingly being passed:

Apr 15 22:10:37 DEBUG GRPC: Loading model with options options={{{} [] [] 0x395c882c46a8} 0 [] gemma-4-31b-Q6_K.gguf 20480 92994145 512 false false true false false false false 0 4 0 0 0 0 /models/gemma-4-31b-Q6_K.gguf false 0 false 0 0 false gemma-4-E4B.gguf 0 false false 0 0 0 false 0 0 0 0 0 0 0 true false //models [] [] [spec_type:draft spec_p_min:0.8 draft_gpu_layers:99 use_jinja:true] [] false []}

Apr 15 22:10:45 DEBUG GRPC stderr id="Gemma 31B - Q6_K speculative-127.0.0.1:39857" line="no implementations specified for speculative decoding" caller={caller.file="/build/pkg/model/process.go"
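
For anyone triaging, here is a minimal sketch of how the `key:value` options in the first debug line can be pulled out to check what was actually handed to the backend. This is not LocalAI code: the helper name `parse_backend_options` is hypothetical, and it assumes the options appear as space-separated `key:value` tokens inside square brackets, as they do in the log above.

```python
import re


def parse_backend_options(log_line: str) -> dict[str, str]:
    """Collect key:value tokens found inside [...] groups of a log line.

    Hypothetical helper for reading the debug output; it only inspects
    the printed text and knows nothing about LocalAI internals.
    """
    options: dict[str, str] = {}
    # Scan every bracketed group and keep tokens shaped like key:value.
    for group in re.findall(r"\[([^\[\]]*)\]", log_line):
        for token in group.split():
            key, sep, value = token.partition(":")
            if sep and key and value:
                options[key] = value
    return options


# Tail of the "Loading model with options" line from the log above.
line = (
    "true false //models [] [] "
    "[spec_type:draft spec_p_min:0.8 draft_gpu_layers:99 use_jinja:true] "
    "[] false []}"
)
opts = parse_backend_options(line)
print(opts.get("spec_type"))
```

Run against the line above, this shows that `spec_type` is present in the options list sent over gRPC, which matches the observation that the parameters appear to be passed.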

To Reproduce
Load a model with the settings shown above.

Expected behavior
Secondary model loaded and speculative decoding activated.

Logs
See above

Additional context
