feat(llama-cpp): bump to MTP-merge SHA and automatically set MTP defaults #9852
Merged
Update LLAMA_VERSION to 0253fb21 (post ggml-org/llama.cpp#22673 merge, 2026-05-16) to pick up Multi-Token Prediction support.

No grpc-server.cpp changes are required: the existing `spec_type` option delegates to upstream's `common_speculative_types_from_names()`, which already accepts the new `draft-mtp` name. The `n_rs_seq` cparam needed by MTP is auto-derived inside `common_context_params_to_llama` from `params.speculative.need_n_rs_seq()`, and when no `draft_model` is set the upstream server builds the MTP context off the target model itself.

Docs: extend the speculative-decoding section of the model-configuration guide with the new type, both load paths (MTP head embedded in the main GGUF vs. a separate `mtp-*.gguf` sibling), the PR's recommended `spec_n_max:2-3`, and the chained `draft-mtp,ngram-mod` recipe. Also note that the upstream `-hf` auto-discovery of `mtp-*.gguf` siblings is not wired through LocalAI's gRPC layer.

Agent guide: short note explaining that new upstream spec types are picked up automatically and that MTP needs no gRPC plumbing.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
… + load
Detect upstream's `<arch>.nextn_predict_layers` GGUF metadata key (set by
`convert_hf_to_gguf.py` for Qwen3.5/3.6 family models and similar) and,
when present and the user has not configured a `spec_type` explicitly,
auto-append the upstream-recommended speculative-decoding tuple:
- spec_type:draft-mtp
- spec_n_max:6
- spec_p_min:0.75
The 0.75 p_min is pinned defensively because upstream marks the current
default with a "change to 0.0f" TODO; locking it here keeps acceptance
thresholds stable across future llama.cpp bumps.
Detection runs in two places:
- The model importer (`POST /models/import-uri`, the `/import-model`
UI) range-fetches the GGUF header for HuggingFace / direct-URL
imports via `gguf.ParseGGUFFileRemote`, with a 30s timeout and
non-fatal error handling. OCI/Ollama URIs are skipped because the
artifact is not directly streamable; the load-time hook covers them
once the file is on disk.
- The llama-cpp load-time hook (`guessGGUFFromFile`) reads the local
header on every model start and appends the same options if
`spec_type` is not already set.
Both paths share `ApplyMTPDefaults` and respect an explicit user-set
`spec_type:` / `speculative_type:` so YAML overrides win. Ginkgo
specs cover the append, preserve-user-choice, legacy alias, and nil
safety paths.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
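For illustration, a minimal sketch of what the shared `ApplyMTPDefaults` helper described above could look like, assuming the backend options are carried as `key:value` strings (as in `options: [spec_type:draft-mtp, spec_n_max:3]`). Only `ApplyMTPDefaults` and the option names come from the commit message; everything else (package layout, signature, option shape) is an assumption, not the actual implementation:

```go
package config

import "strings"

// Upstream-recommended tuple auto-appended when the GGUF header advertises
// <arch>.nextn_predict_layers. Values mirror the commit message above.
var mtpDefaults = []string{
	"spec_type:draft-mtp",
	"spec_n_max:6",
	"spec_p_min:0.75",
}

// ApplyMTPDefaults appends the MTP defaults unless the user already chose a
// speculative type (including the legacy speculative_type alias), so an
// explicit YAML setting always wins. A nil slice is safe: append allocates.
func ApplyMTPDefaults(options []string) []string {
	for _, opt := range options {
		if strings.HasPrefix(opt, "spec_type:") || strings.HasPrefix(opt, "speculative_type:") {
			return options // preserve the user's choice
		}
	}
	return append(options, mtpDefaults...)
}
```

Keeping the check prefix-based covers both the canonical key and the legacy alias with one pass, which is the behavior the Ginkgo specs mentioned above would exercise (append, preserve-user-choice, legacy alias, nil safety).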
`gguf.ParseGGUFFileRemote` only speaks HTTP(S), but the importer was handing it the raw `huggingface://...` URI directly (and likewise for any other custom downloader scheme). A live test against `huggingface://ggml-org/Qwen3.6-27B-MTP-GGUF/Qwen3.6-27B-MTP-Q8_0.gguf` exposed this: the probe failed with `unsupported protocol scheme "huggingface"`, was caught by the non-fatal error path, and the MTP options were silently never applied to the generated YAML.

Route every candidate URI through `downloader.URI.ResolveURL()` and require the resolved form to be HTTP(S). After the fix the probe successfully reads `<arch>.nextn_predict_layers=1` from the real HF GGUF and the emitted ConfigFile carries `spec_type:draft-mtp`, `spec_n_max:6`, `spec_p_min:0.75` as intended.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
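A sketch of the fixed probe path, under the assumptions that `downloader.URI.ResolveURL()` returns the resolved URL as a string and that the `gguf` package is the gguf-parser-go module named in the commit; the import paths, the function name `probeRemoteGGUF`, and the package name are illustrative, and error handling stays non-fatal in the caller as described above:

```go
package importer

import (
	"context"
	"strings"
	"time"

	gguf "github.com/gpustack/gguf-parser-go"  // assumed import path for ParseGGUFFileRemote
	"github.com/mudler/LocalAI/pkg/downloader" // assumed import path for downloader.URI
)

// probeRemoteGGUF resolves custom schemes (huggingface://, github://, ...)
// to plain HTTP(S) before range-fetching the GGUF header. Non-HTTP(S)
// results (e.g. OCI/Ollama artifacts) are skipped; the load-time hook covers
// them once the file is on disk. The caller then inspects the parsed header
// for <arch>.nextn_predict_layers and, if present, calls ApplyMTPDefaults.
func probeRemoteGGUF(rawURI string) (*gguf.GGUFFile, error) {
	resolved := downloader.URI(rawURI).ResolveURL()
	if !strings.HasPrefix(resolved, "http://") && !strings.HasPrefix(resolved, "https://") {
		return nil, nil // not directly streamable; leave detection to load time
	}

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// Fetches only the header/metadata, not the full model file.
	return gguf.ParseGGUFFileRemote(ctx, resolved)
}
```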
Summary
- Bump `LLAMA_VERSION` in `backend/cpp/llama-cpp/Makefile` from `1348f67c` to `0253fb21` (latest master, post the merge of ggml-org/llama.cpp#22673 on 2026-05-16) to pick up Multi-Token Prediction (MTP) speculative decoding.
- Document the `draft-mtp` speculative type in `docs/content/advanced/model-configuration.md`, including both load paths (MTP head embedded in the main GGUF vs. a separate `mtp-*.gguf` sibling), the recommended `spec_n_max:2-3`, and the chained `draft-mtp,ngram-mod` recipe from the upstream PR notes.
- Update `.agents/llama-cpp-backend.md` to record that new upstream speculative types flow through the existing `spec_type` parser automatically.

Why no grpc-server.cpp changes
The audit of the upstream PR diff against `backend/cpp/llama-cpp/grpc-server.cpp` shows every symbol our code touches is unchanged in signature:
- `spec_type` already delegates to upstream's `common_speculative_types_from_names()`, which gained `draft-mtp` as a new map entry, so it's accepted without code edits.
- `cparams.n_rs_seq` (required for MTP's recurrent-state rollback) is auto-derived inside `common_context_params_to_llama` via the new `params.speculative.need_n_rs_seq()` method.
- When `spec_type=draft-mtp` is set without a `draft_model`, the upstream server-context branch creates the MTP draft context directly off the target model, so the existing gRPC path "just works."
- The upstream `-hf` auto-discovery of `mtp-*.gguf` sibling files runs in `common_params_handle_models` (`common/arg.cpp`), which LocalAI's gRPC layer does not call. Users wanting a separate MTP sibling file need to download it and set `draft_model` explicitly; this is called out in the docs.

Test plan
- `make backends/llama-cpp` builds against the new pin on all variants (avx, avx2, avx512, fallback, grpc).
- Ran `make -C backend/cpp/llama-cpp llama-cpp-fallback` locally; llama.cpp's own cmake config ran clean against the new SHA, but the gRPC sub-project failed at `find_package(absl)` because the dev host is missing `libabsl-dev`. Unrelated to the bump; CI environments have the dependency.
- Run an MTP-capable model with `options: [spec_type:draft-mtp, spec_n_max:3]` and verify that draft acceptance shows up in slot stats and throughput improves vs. baseline.
- `spec_type:draft-mtp,ngram-mod` with `spec_ngram_mod_n_match:24` still parses and runs.
- `spec_type:draft-simple` / `draft-eagle3` / ngram families continue to work (regression).