
feat(llama-cpp): bump to MTP-merge SHA and automatically set MTP defaults #9852

Merged
mudler merged 3 commits into master from feat/llama-cpp-mtp-support on May 16, 2026

Conversation

@localai-bot (Collaborator)

Summary

  • Bump LLAMA_VERSION in backend/cpp/llama-cpp/Makefile from 1348f67c to 0253fb21 (latest master, post the merge of ggml-org/llama.cpp#22673 on 2026-05-16) to pick up Multi-Token Prediction (MTP) speculative decoding.
  • Document the new draft-mtp speculative type in docs/content/advanced/model-configuration.md, including both load paths (MTP head embedded in the main GGUF vs. separate mtp-*.gguf sibling), the recommended spec_n_max:2-3, and the chained draft-mtp,ngram-mod recipe from the upstream PR notes (see the configuration sketch after this list).
  • Update .agents/llama-cpp-backend.md to record that new upstream speculative types flow through the existing spec_type parser automatically.
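
For reference, a minimal model-configuration sketch based on the option names documented in this PR; the model name, file name, and exact YAML layout are illustrative rather than copied from the shipped docs:

```yaml
# Hypothetical LocalAI model config: MTP head embedded in the main GGUF.
name: qwen-mtp
backend: llama-cpp
parameters:
  model: model-with-mtp-head.gguf
options:
  - spec_type:draft-mtp   # new speculative type from the upstream MTP merge
  - spec_n_max:3          # PR-recommended range is 2-3
```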

Why no grpc-server.cpp changes

The audit of the upstream PR diff against backend/cpp/llama-cpp/grpc-server.cpp shows every symbol our code touches is unchanged in signature:

  • spec_type already delegates to upstream's common_speculative_types_from_names(), which gained draft-mtp as a new map entry, so it's accepted without code edits.
  • cparams.n_rs_seq (required for MTP's recurrent-state rollback) is auto-derived inside common_context_params_to_llama via the new params.speculative.need_n_rs_seq() method.
  • When spec_type=draft-mtp is set without a draft_model, the upstream server-context branch creates the MTP draft context directly off the target model, so the existing gRPC path "just works."
  • Note: upstream's -hf auto-discovery of mtp-*.gguf sibling files runs in common_params_handle_models (common/arg.cpp), which LocalAI's gRPC layer does not call. Users who want a separate MTP sibling file need to download it and set draft_model explicitly; this is called out in the docs and sketched just below.
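
Where the MTP head ships as a separate sibling file, the explicit setup described in the note above might look like the following; the file names are placeholders and the field layout is an assumption about LocalAI's YAML schema, not taken from the docs change itself:

```yaml
# Hypothetical config: MTP weights in a separate sibling GGUF, downloaded manually.
name: qwen-mtp-sibling
backend: llama-cpp
parameters:
  model: main-model.gguf
draft_model: mtp-main-model.gguf   # set explicitly; -hf auto-discovery is not wired through the gRPC layer
options:
  - spec_type:draft-mtp
  - spec_n_max:3
```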

Test plan

  • CI: make backends/llama-cpp builds against the new pin on all variants (avx, avx2, avx512, fallback, grpc).
  • Local compile verification was attempted via make -C backend/cpp/llama-cpp llama-cpp-fallback; llama.cpp's own CMake config ran clean against the new SHA, but the gRPC sub-project failed at find_package(absl) because the dev host is missing libabsl-dev. This is unrelated to the bump; CI environments have the dependency.
  • Smoke test: load a Qwen3.6 MTP-enabled GGUF with options: [spec_type:draft-mtp, spec_n_max:3] and verify that draft acceptance shows up in the slot stats and that throughput improves vs. baseline.
  • Smoke test: chained spec_type:draft-mtp,ngram-mod with spec_ngram_mod_n_match:24 still parses and runs (see the snippet after this list).
  • Regression check: the existing spec_type:draft-simple / draft-eagle3 / ngram families continue to work.
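
For the chained smoke test, an illustrative options block using the values from the test plan above (the surrounding YAML layout is assumed, as in the earlier sketches):

```yaml
options:
  - spec_type:draft-mtp,ngram-mod    # MTP draft chained with the n-gram modifier
  - spec_ngram_mod_n_match:24
```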

mudler added 3 commits May 16, 2026 16:55
Update LLAMA_VERSION to 0253fb21 (post ggml-org/llama.cpp#22673 merge,
2026-05-16) to pick up Multi-Token Prediction support.

No grpc-server.cpp changes are required: the existing `spec_type` option
delegates to upstream's `common_speculative_types_from_names()`, which
already accepts the new `draft-mtp` name. The `n_rs_seq` cparam needed
by MTP is auto-derived inside `common_context_params_to_llama` from
`params.speculative.need_n_rs_seq()`, and when no `draft_model` is set
the upstream server builds the MTP context off the target model itself.

Docs: extend the speculative-decoding section of the model-configuration
guide with the new type, both load paths (MTP head embedded in the main
GGUF vs. separate `mtp-*.gguf` sibling), the PR's recommended
`spec_n_max:2-3`, and the chained `draft-mtp,ngram-mod` recipe. Also
notes that the upstream `-hf` auto-discovery of `mtp-*.gguf` siblings is
not wired through LocalAI's gRPC layer.

Agent guide: short note explaining that new upstream spec types are
picked up automatically and that MTP needs no gRPC plumbing.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
… + load

Detect upstream's `<arch>.nextn_predict_layers` GGUF metadata key (set by
`convert_hf_to_gguf.py` for Qwen3.5/3.6 family models and similar) and,
when present and the user has not configured a `spec_type` explicitly,
auto-append the upstream-recommended speculative-decoding tuple:

  - spec_type:draft-mtp
  - spec_n_max:6
  - spec_p_min:0.75

The 0.75 p_min is pinned defensively because upstream marks the current
default with a "change to 0.0f" TODO; locking it here keeps acceptance
thresholds stable across future llama.cpp bumps.

Detection runs in two places:

  - The model importer (`POST /models/import-uri`, the `/import-model`
    UI) range-fetches the GGUF header for HuggingFace / direct-URL
    imports via `gguf.ParseGGUFFileRemote`, with a 30s timeout and
    non-fatal error handling. OCI/Ollama URIs are skipped because the
    artifact is not directly streamable; the load-time hook covers them
    once the file is on disk.
  - The llama-cpp load-time hook (`guessGGUFFromFile`) reads the local
    header on every model start and appends the same options if
    `spec_type` is not already set.

Both paths share `ApplyMTPDefaults` and respect an explicit user-set
`spec_type:` / `speculative_type:` so YAML overrides win. Ginkgo
specs cover the append, preserve-user-choice, legacy alias, and nil
safety paths.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
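
The shared helper described in the commit above could look roughly like the following Go sketch; the package, signature, and the boolean flag for the detected metadata key are illustrative assumptions, not the merged implementation:

```go
package config

import "strings"

// mtpDefaults are the options appended when an MTP head is detected and the
// user has not chosen a speculative type themselves (p_min pinned at 0.75,
// per the commit message above).
var mtpDefaults = []string{"spec_type:draft-mtp", "spec_n_max:6", "spec_p_min:0.75"}

// ApplyMTPDefaults appends the MTP speculative-decoding defaults to opts unless
// spec_type / speculative_type is already set. Nil-safe: appending to a nil
// slice allocates a fresh one.
func ApplyMTPDefaults(opts []string, hasNextNPredictLayers bool) []string {
	if !hasNextNPredictLayers {
		return opts
	}
	for _, o := range opts {
		if strings.HasPrefix(o, "spec_type:") || strings.HasPrefix(o, "speculative_type:") {
			return opts // explicit user choice wins
		}
	}
	return append(opts, mtpDefaults...)
}
```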
`gguf.ParseGGUFFileRemote` only speaks HTTP(S), but the importer was
handing it the raw `huggingface://...` URI directly (and similarly for
any other custom downloader scheme). A live test against
`huggingface://ggml-org/Qwen3.6-27B-MTP-GGUF/Qwen3.6-27B-MTP-Q8_0.gguf`
exposed this: the probe failed with `unsupported protocol scheme
"huggingface"`, was caught by the non-fatal error path, and the MTP
options were silently never applied to the generated YAML.

Route every candidate URI through `downloader.URI.ResolveURL()` and
require the resolved form to be HTTP(S). After the fix the probe
successfully reads `<arch>.nextn_predict_layers=1` from the real HF
GGUF and the emitted ConfigFile carries spec_type:draft-mtp,
spec_n_max:6, spec_p_min:0.75 as intended.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
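
A rough Go sketch of the resolve-then-probe flow this fix describes. downloader.URI.ResolveURL() is replaced by a self-contained placeholder so the sketch compiles on its own, the gguf import path and header accessors are assumptions about the parser package, and error handling is simplified:

```go
package importer

import (
	"context"
	"fmt"
	"strings"
	"time"

	gguf "github.com/gpustack/gguf-parser-go" // assumed import path for the parser named above
)

// resolveToHTTP stands in for downloader.URI.ResolveURL(): it rewrites a
// huggingface://<org>/<repo>/<file> URI into a plain HTTPS resolve URL.
// The rewrite rule is illustrative, not LocalAI's actual resolver.
func resolveToHTTP(raw string) (string, error) {
	if rest, ok := strings.CutPrefix(raw, "huggingface://"); ok {
		parts := strings.SplitN(rest, "/", 3)
		if len(parts) != 3 {
			return "", fmt.Errorf("unexpected huggingface URI: %s", raw)
		}
		return fmt.Sprintf("https://huggingface.co/%s/%s/resolve/main/%s", parts[0], parts[1], parts[2]), nil
	}
	return raw, nil
}

// probeRemoteGGUF range-fetches the GGUF header and reports whether it carries
// the "<arch>.nextn_predict_layers" key. The accessors on the parsed file are
// assumptions, not verified against the real gguf parser API.
func probeRemoteGGUF(ctx context.Context, rawURI string) (bool, error) {
	resolved, err := resolveToHTTP(rawURI)
	if err != nil {
		return false, err
	}
	if !strings.HasPrefix(resolved, "http://") && !strings.HasPrefix(resolved, "https://") {
		return false, nil // e.g. OCI/Ollama URIs: left to the load-time hook
	}
	ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
	defer cancel()
	f, err := gguf.ParseGGUFFileRemote(ctx, resolved)
	if err != nil {
		return false, err // callers treat this as non-fatal
	}
	arch := f.Architecture().Architecture
	_, found := f.Header.MetadataKV.Get(arch + ".nextn_predict_layers")
	return found, nil
}
```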
@mudler changed the title from "feat(llama-cpp): bump to MTP-merge SHA and document draft-mtp spec type" to "feat(llama-cpp): bump to MTP-merge SHA and automatically set MTP defaults" on May 16, 2026
@mudler merged commit d77a913 into master on May 16, 2026
64 of 65 checks passed
@mudler deleted the feat/llama-cpp-mtp-support branch on May 16, 2026 at 20:42
