feat(llama-cpp): bump to MTP-merge SHA and automatically set MTP defaults #9852
Merged
Update LLAMA_VERSION to 0253fb21 (post ggml-org/llama.cpp#22673 merge, 2026-05-16) to pick up Multi-Token Prediction support.

No grpc-server.cpp changes are required: the existing `spec_type` option delegates to upstream's `common_speculative_types_from_names()`, which already accepts the new `draft-mtp` name. The `n_rs_seq` cparam needed by MTP is auto-derived inside `common_context_params_to_llama` from `params.speculative.need_n_rs_seq()`, and when no `draft_model` is set the upstream server builds the MTP context off the target model itself.

Docs: extend the speculative-decoding section of the model-configuration guide with the new type, both load paths (MTP head embedded in the main GGUF vs. a separate `mtp-*.gguf` sibling), the PR's recommended `spec_n_max:2-3`, and the chained `draft-mtp,ngram-mod` recipe. Also note that the upstream `-hf` auto-discovery of `mtp-*.gguf` siblings is not wired through LocalAI's gRPC layer.

Agent guide: short note explaining that new upstream spec types are picked up automatically and that MTP needs no gRPC plumbing.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
… + load
Detect upstream's `<arch>.nextn_predict_layers` GGUF metadata key (set by
`convert_hf_to_gguf.py` for Qwen3.5/3.6 family models and similar) and,
when present and the user has not configured a `spec_type` explicitly,
auto-append the upstream-recommended speculative-decoding tuple:
- spec_type:draft-mtp
- spec_n_max:6
- spec_p_min:0.75
The 0.75 p_min is pinned defensively because upstream marks the current
default with a "change to 0.0f" TODO; locking it here keeps acceptance
thresholds stable across future llama.cpp bumps.
Detection runs in two places:
- The model importer (`POST /models/import-uri`, the `/import-model`
UI) range-fetches the GGUF header for HuggingFace / direct-URL
imports via `gguf.ParseGGUFFileRemote`, with a 30s timeout and
non-fatal error handling. OCI/Ollama URIs are skipped because the
artifact is not directly streamable; the load-time hook covers them
once the file is on disk.
- The llama-cpp load-time hook (`guessGGUFFromFile`) reads the local
header on every model start and appends the same options if
`spec_type` is not already set.
Both paths share `ApplyMTPDefaults` and respect an explicit user-set
`spec_type:` / `speculative_type:` so YAML overrides win. Ginkgo
specs cover the append, preserve-user-choice, legacy alias, and nil
safety paths.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
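For illustration, a minimal sketch of what the shared `ApplyMTPDefaults` helper described above could look like, assuming the backend options are carried as `key:value` strings (as in `options: [spec_type:draft-mtp, spec_n_max:3]`). Only `ApplyMTPDefaults` and the option names come from the commit message; everything else (package layout, signature, option shape) is an assumption, not the actual implementation:

```go
package config

import "strings"

// Upstream-recommended tuple auto-appended when the GGUF header advertises
// <arch>.nextn_predict_layers. Values mirror the commit message above.
var mtpDefaults = []string{
	"spec_type:draft-mtp",
	"spec_n_max:6",
	"spec_p_min:0.75",
}

// ApplyMTPDefaults appends the MTP defaults unless the user already chose a
// speculative type (including the legacy speculative_type alias), so an
// explicit YAML setting always wins. A nil slice is safe: append allocates.
func ApplyMTPDefaults(options []string) []string {
	for _, opt := range options {
		if strings.HasPrefix(opt, "spec_type:") || strings.HasPrefix(opt, "speculative_type:") {
			return options // preserve the user's choice
		}
	}
	return append(options, mtpDefaults...)
}
```

Keeping the check prefix-based covers both the canonical key and the legacy alias with one pass, which is the behavior the Ginkgo specs mentioned above would exercise (append, preserve-user-choice, legacy alias, nil safety).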
`gguf.ParseGGUFFileRemote` only speaks HTTP(S), but the importer was handing it the raw `huggingface://...` URI directly (and likewise for any other custom downloader scheme). A live test against `huggingface://ggml-org/Qwen3.6-27B-MTP-GGUF/Qwen3.6-27B-MTP-Q8_0.gguf` exposed this: the probe failed with `unsupported protocol scheme "huggingface"`, was caught by the non-fatal error path, and the MTP options were silently never applied to the generated YAML.

Route every candidate URI through `downloader.URI.ResolveURL()` and require the resolved form to be HTTP(S). After the fix the probe successfully reads `<arch>.nextn_predict_layers=1` from the real HF GGUF and the emitted ConfigFile carries `spec_type:draft-mtp`, `spec_n_max:6`, `spec_p_min:0.75` as intended.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
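A sketch of the fixed probe path, under the assumptions that `downloader.URI.ResolveURL()` returns the resolved URL as a string and that the `gguf` package is the gguf-parser-go module named in the commit; the import paths, the function name `probeRemoteGGUF`, and the package name are illustrative, and error handling stays non-fatal in the caller as described above:

```go
package importer

import (
	"context"
	"strings"
	"time"

	gguf "github.com/gpustack/gguf-parser-go"  // assumed import path for ParseGGUFFileRemote
	"github.com/mudler/LocalAI/pkg/downloader" // assumed import path for downloader.URI
)

// probeRemoteGGUF resolves custom schemes (huggingface://, github://, ...)
// to plain HTTP(S) before range-fetching the GGUF header. Non-HTTP(S)
// results (e.g. OCI/Ollama artifacts) are skipped; the load-time hook covers
// them once the file is on disk. The caller then inspects the parsed header
// for <arch>.nextn_predict_layers and, if present, calls ApplyMTPDefaults.
func probeRemoteGGUF(rawURI string) (*gguf.GGUFFile, error) {
	resolved := downloader.URI(rawURI).ResolveURL()
	if !strings.HasPrefix(resolved, "http://") && !strings.HasPrefix(resolved, "https://") {
		return nil, nil // not directly streamable; leave detection to load time
	}

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// Fetches only the header/metadata, not the full model file.
	return gguf.ParseGGUFFileRemote(ctx, resolved)
}
```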
Summary
- Bump `LLAMA_VERSION` in `backend/cpp/llama-cpp/Makefile` from `1348f67c` to `0253fb21` (latest master, post the merge of ggml-org/llama.cpp#22673 on 2026-05-16) to pick up Multi-Token Prediction (MTP) speculative decoding.
- Document the `draft-mtp` speculative type in `docs/content/advanced/model-configuration.md`, including both load paths (MTP head embedded in the main GGUF vs. a separate `mtp-*.gguf` sibling), the recommended `spec_n_max:2-3`, and the chained `draft-mtp,ngram-mod` recipe from the upstream PR notes.
- Update `.agents/llama-cpp-backend.md` to record that new upstream speculative types flow through the existing `spec_type` parser automatically.

Why no grpc-server.cpp changes
The audit of the upstream PR diff against `backend/cpp/llama-cpp/grpc-server.cpp` shows every symbol our code touches is unchanged in signature:
- `spec_type` already delegates to upstream's `common_speculative_types_from_names()`, which gained `draft-mtp` as a new map entry, so it's accepted without code edits.
- `cparams.n_rs_seq` (required for MTP's recurrent-state rollback) is auto-derived inside `common_context_params_to_llama` via the new `params.speculative.need_n_rs_seq()` method.
- When `spec_type=draft-mtp` is set without a `draft_model`, the upstream server-context branch creates the MTP draft context directly off the target model, so the existing gRPC path "just works."
- The upstream `-hf` auto-discovery of `mtp-*.gguf` sibling files runs in `common_params_handle_models` (`common/arg.cpp`), which LocalAI's gRPC layer does not call. Users wanting a separate MTP sibling file need to download it and set `draft_model` explicitly; this is called out in the docs.

Test plan
- `make backends/llama-cpp` builds against the new pin on all variants (avx, avx2, avx512, fallback, grpc).
- Ran `make -C backend/cpp/llama-cpp llama-cpp-fallback` locally; llama.cpp's own cmake config ran clean against the new SHA, but the gRPC sub-project failed at `find_package(absl)` because the dev host is missing `libabsl-dev`. Unrelated to the bump; CI environments have the dependency.
- Run an MTP-capable model with `options: [spec_type:draft-mtp, spec_n_max:3]` and verify that draft acceptance shows up in slot stats and throughput improves vs. baseline.
- `spec_type:draft-mtp,ngram-mod` with `spec_ngram_mod_n_match:24` still parses and runs.
- `spec_type:draft-simple` / `draft-eagle3` / ngram families continue to work (regression).