feat(llama-cpp): bump to `1ec7ba0c`, adapt grpc-server, expose new spec-decoding options by localai-bot · Pull Request #9765 · mudler/LocalAI

localai-bot · 2026-05-11T21:50:38Z

Summary

Takes over #9763 and expands its scope:

- Bump llama.cpp to `1ec7ba0c`.
- Adapt `backend/cpp/llama-cpp/grpc-server.cpp` to upstream's parallel-drafting refactor (ggml-org/llama.cpp#22838), the same change the bot's bump PR tripped on:
  - `common_params_speculative::type` (single enum) became `types` (`std::vector<common_speculative_type>`). Both the "default to draft when a draft model is set" branch and the `spec_type` / `speculative_type` option parser now operate on the vector. The parser tolerates comma-separated lists, matching `common_speculative_types_from_names`.
  - `common_params_speculative_draft::n_ctx` is gone (the draft shares the target context size). `draft_ctx_size` is kept as a backward-compatible no-op.
  - `server_context_impl::model` → `model_tgt` at the two reranker / model-metadata call sites.
- Expose new spec-decoding `options:` keys so model configs can tune the new families upstream added in ggml-org/llama.cpp#22838:
  - ngram_mod: `spec_ngram_mod_n_min` / `_n_max` / `_n_match`
  - ngram_map_k: `spec_ngram_map_k_size_n` / `_size_m` / `_min_hits`
  - ngram_map_k4v: `spec_ngram_map_k4v_size_n` / `_size_m` / `_min_hits`
  - ngram_cache: `spec_lookup_cache_static` (alias `lookup_cache_static`), `spec_lookup_cache_dynamic` (alias `lookup_cache_dynamic`)
  - Draft tuning: `draft_cache_type_k` / `_v`, `draft_threads` / `_batch`, `draft_cpu_moe`, `draft_n_cpu_moe`, `draft_override_tensor` (comma-separated `<tensor regex>=<buffer type>`; this one re-implements upstream's static `parse_tensor_buffer_overrides` since it isn't exported).
- Docs: `docs/content/advanced/model-configuration.md` now has per-family tables for the speculative options and a note that `spec_type` accepts a comma-separated list to chain types.

Test plan

- `make docker-build-llama-cpp` builds the linux/amd64 cpu `llama-cpp-avx` variant cleanly with the patch applied; the previously-failing `grpc-server.cpp` compile passes (only a pre-existing unused-parameter warning remains). Rebuilt a second time after adding the new option keys.
- CI: `backend-jobs-multiarch (linux/amd64, ..., -cpu-llama-cpp, ...)` green.
- CI: `tests-llama-cpp-grpc` green.
- CI: `backend-jobs-darwin (llama-cpp, -metal-darwin-arm64-llama-cpp, go)` green.

Closes #9763.

Picks up the upstream `spec : parallel drafting support` change (ggml-org/llama.cpp#22838) which reshapes the speculative-decoding API and `server_context_impl`. Adapt the grpc-server wrapper accordingly:

* `common_params_speculative::type` (single enum) became `types` (`std::vector<common_speculative_type>`). Update both the "default to draft when a draft model is set" branch and the `spec_type`/`speculative_type` option parser. The parser now also tolerates comma-separated lists, mirroring the upstream `common_speculative_types_from_names` semantics.
* `common_params_speculative_draft::n_ctx` is gone (draft now shares the target context size). Keep the `draft_ctx_size` option name for backward compatibility and ignore the value rather than failing.
* `server_context_impl::model` was renamed to `model_tgt`; update the two reranker / model-metadata call sites.

Replaces #9763. Builds cleanly under the linux/amd64 cpu-llama-cpp target locally.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
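The comma-separated `spec_type` handling described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in `grpc-server.cpp`: `spec_type`, `spec_type_from_name`, and `parse_spec_types` are hypothetical stand-ins for upstream's `common_speculative_type` and `common_speculative_types_from_names`, the enum lists only some of the families named in this PR, and silently skipping unknown names is an assumption of the sketch.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-in enum (assumption); upstream's common_speculative_type is richer.
enum class spec_type { draft, ngram_cache, ngram_mod };

// Hypothetical name lookup. Returns false for names the sketch does not know.
bool spec_type_from_name(const std::string & name, spec_type & out) {
    if (name == "draft")       { out = spec_type::draft;       return true; }
    if (name == "ngram_cache") { out = spec_type::ngram_cache; return true; }
    if (name == "ngram_mod")   { out = spec_type::ngram_mod;   return true; }
    return false;
}

// Split "a, b,c" on commas, trim spaces/tabs, and collect the known types
// in the order they were written. Empty and unknown tokens are skipped.
std::vector<spec_type> parse_spec_types(const std::string & value) {
    std::vector<spec_type> types;
    size_t start = 0;
    while (start <= value.size()) {
        size_t end = value.find(',', start);
        if (end == std::string::npos) {
            end = value.size();
        }
        const std::string token = value.substr(start, end - start);
        const size_t first = token.find_first_not_of(" \t");
        if (first != std::string::npos) {
            const size_t last = token.find_last_not_of(" \t");
            spec_type parsed;
            if (spec_type_from_name(token.substr(first, last - first + 1), parsed)) {
                types.push_back(parsed);
            }
        }
        start = end + 1;
    }
    return types;
}
```

Under these assumptions a single name yields a one-element vector, so configs written against the old scalar option keep working unchanged.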

Upstream `spec : parallel drafting support` (ggml-org/llama.cpp#22838) adds the `ngram_mod`, `ngram_map_k`, and `ngram_map_k4v` speculative families and beefs up the draft-model knobs. The previous bump only adapted the API; this exposes the new fields through the grpc-server options dictionary so model configs can drive them.

New `options:` keys (all under `backend: llama-cpp`):

ngram_mod (`ngram_mod` type):
  spec_ngram_mod_n_min / spec_ngram_mod_n_max / spec_ngram_mod_n_match

ngram_map_k (`ngram_map_k` type):
  spec_ngram_map_k_size_n / spec_ngram_map_k_size_m / spec_ngram_map_k_min_hits

ngram_map_k4v (`ngram_map_k4v` type):
  spec_ngram_map_k4v_size_n / spec_ngram_map_k4v_size_m / spec_ngram_map_k4v_min_hits

ngram lookup caches (`ngram_cache` type):
  spec_lookup_cache_static / lookup_cache_static
  spec_lookup_cache_dynamic / lookup_cache_dynamic

Draft-model tuning (active when `spec_type` is `draft`):
  draft_cache_type_k / spec_draft_cache_type_k
  draft_cache_type_v / spec_draft_cache_type_v
  draft_threads / spec_draft_threads
  draft_threads_batch / spec_draft_threads_batch
  draft_cpu_moe / spec_draft_cpu_moe (bool flag)
  draft_n_cpu_moe / spec_draft_n_cpu_moe (first N MoE layers on CPU)
  draft_override_tensor / spec_draft_override_tensor (comma-separated <tensor regex>=<buffer type>; re-implements upstream's static parse_tensor_buffer_overrides since it isn't exported)

`spec_type` already accepted comma-separated lists after the previous commit, matching upstream's `common_speculative_types_from_names`.

Docs: refresh `docs/content/advanced/model-configuration.md` with per-family tables and a note about multi-type chaining.

Builds locally with `make docker-build-llama-cpp` (linux/amd64 cpu-llama-cpp AVX variant).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
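To make the `draft_override_tensor` value format concrete, here is a rough sketch of splitting a comma-separated `<tensor regex>=<buffer type>` string into pairs. It is not the re-implementation in `grpc-server.cpp` and not upstream's `parse_tensor_buffer_overrides`: the `tensor_override` alias and `parse_tensor_overrides` function are hypothetical, the buffer-type name is kept as a plain string rather than resolved to a backend buffer type, and rejecting malformed entries with an exception is an assumption.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical result type: (tensor-name regex, buffer-type name).
using tensor_override = std::pair<std::string, std::string>;

// Parse "regexA=BUF,regexB=BUF" into pairs. Each entry is split on its
// first '='; entries with an empty side are rejected.
std::vector<tensor_override> parse_tensor_overrides(const std::string & value) {
    std::vector<tensor_override> out;
    size_t start = 0;
    while (start < value.size()) {
        size_t end = value.find(',', start);
        if (end == std::string::npos) {
            end = value.size();
        }
        const std::string entry = value.substr(start, end - start);
        start = end + 1;
        if (entry.empty()) {
            continue; // tolerate stray commas
        }
        const size_t eq = entry.find('=');
        if (eq == std::string::npos || eq == 0 || eq + 1 == entry.size()) {
            throw std::invalid_argument(
                "expected <tensor regex>=<buffer type>, got: " + entry);
        }
        out.emplace_back(entry.substr(0, eq), entry.substr(eq + 1));
    }
    return out;
}
```

One limitation of splitting on commas this way is that a regex containing a literal comma (for example a `{1,2}` quantifier) cannot be expressed; whether the real parser shares that limitation is not stated in this PR.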

The previous commits in this series adapted backend/cpp/llama-cpp/grpc-server.cpp to the post-#22838 (parallel drafting) llama.cpp API. The turboquant build reuses the same grpc-server.cpp through backend/cpp/turboquant/Makefile, which copies it into turboquant-<flavor>-build/ and runs patch-grpc-server.sh on the copy. The fork branched before the API refactor, so it errors out on:

* `ctx_server.impl->model_tgt` (fork still has `model`)
* `params.speculative.{ngram_mod,ngram_map_k,ngram_map_k4v,ngram_cache}.*` (none of these sub-structs exist in the fork)
* `params.speculative.draft.{cache_type_k/v, cpuparams[, _batch].n_threads, tensor_buft_overrides}` (fork uses the pre-#22397 flat layout)
* `params.speculative.types` vector / `common_speculative_types_from_names` (fork has a scalar `type` and only the singular helper)

Approach:

1. backend/cpp/llama-cpp/grpc-server.cpp: introduce a single feature switch `LOCALAI_LEGACY_LLAMA_CPP_SPEC`. When defined, the two `speculative.type[s]` discriminations (the "default to draft when a draft model is set" branch and the `spec_type` / `speculative_type` option parser) fall back to the singular scalar form, and the entire new-option block (ngram_mod / map_k / map_k4v / ngram_cache / draft.{cache_type_*, cpuparams*, tensor_buft_overrides}) is preprocessed out. The macro is *not* defined in the source tree; stock llama-cpp builds get the full new API.

2. backend/cpp/turboquant/patch-grpc-server.sh: two new patch steps applied to the per-flavor build copy at turboquant-<flavor>-build/grpc-server.cpp:
   - substitute `ctx_server.impl->model_tgt` -> `ctx_server.impl->model`
   - inject `#define LOCALAI_LEGACY_LLAMA_CPP_SPEC 1` before the first `#include`, so the guarded blocks above drop out for the fork build.

Both patches are idempotent and follow the existing sed/awk pattern in this script (KV cache types, `get_media_marker`, flat speculative renames). Stock llama-cpp's `grpc-server.cpp` is never touched.

Drop both legacy patches once the turboquant fork rebases past ggml-org/llama.cpp#22397 / #22838.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
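The feature switch from step 1 can be pictured with a small self-contained sketch. Only the macro name `LOCALAI_LEGACY_LLAMA_CPP_SPEC` comes from this commit; `speculative_params`, `default_to_draft`, and `uses_draft` are hypothetical stand-ins for the real llama.cpp structs and the grpc-server logic, and the exact condition used for the "default to draft" branch is an assumption.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-in enum (assumption); not upstream's common_speculative_type.
enum class spec_type { none, draft };

// Stand-in for params.speculative: the fork keeps a scalar `type`,
// while post-#22838 upstream uses a `types` vector.
struct speculative_params {
#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
    spec_type type = spec_type::none;
#else
    std::vector<spec_type> types;
#endif
};

// "Default to draft when a draft model is set", written once per API shape.
void default_to_draft(speculative_params & sp, const std::string & draft_model) {
    if (draft_model.empty()) {
        return;
    }
#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
    if (sp.type == spec_type::none) {
        sp.type = spec_type::draft;
    }
#else
    if (sp.types.empty()) {
        sp.types.push_back(spec_type::draft);
    }
#endif
}

// Query helper that hides the scalar/vector difference from callers.
bool uses_draft(const speculative_params & sp) {
#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
    return sp.type == spec_type::draft;
#else
    for (spec_type t : sp.types) {
        if (t == spec_type::draft) {
            return true;
        }
    }
    return false;
#endif
}
```

Compiling the same file with and without `-DLOCALAI_LEGACY_LLAMA_CPP_SPEC=1` exercises both shapes, which mirrors how the injected `#define` selects the legacy path only in the fork's build copy.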

The previous turboquant fix wrapped the new option-handler blocks in `#ifndef LOCALAI_LEGACY_LLAMA_CPP_SPEC ... #endif` but placed the guard in the middle of an `else if` chain: the `} else if` openings of the new blocks were responsible for closing the previous block's brace. With the macro defined the new blocks vanish, draft_ctx_size's `{` loses its closer, the for-loop's `}` is consumed instead, and the file ends with a stray opening brace. clang reports it as `function-definition is not allowed here before '{'` on the next top-level `int main(...)` and `expected '}' at end of input`.

Move the chain split inside the draft_ctx_size branch:

    } else if (... "draft_ctx_size") {
        // ...
    #ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
    }   // legacy: chain ends here
    #else
    } else if (... "spec_ngram_mod_n_min") {   // modern: chain continues
        ...
    } else if (... "draft_override_tensor") {
        ...
    }   // closes last branch
    #endif
    }   // closes for-loop

Brace count is now balanced under both preprocessor branches (verified with `tr -cd '{' | wc -c` against the patched and unpatched outputs).

Local `make docker-build-turboquant` builds the linux/amd64 cpu-llama-cpp `turboquant-avx` variant cleanly.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
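A compilable miniature of the corrected chain may help show why the split has to sit inside the `draft_ctx_size` branch. Everything except the macro name and the option key strings is hypothetical: `options_state`, `option`, and `apply_options` are illustration-only names, and only one modern key is handled.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Illustration-only state; field names are assumptions.
struct options_state {
    bool saw_draft_ctx_size = false;
    int  ngram_mod_n_min    = 0;
};

using option = std::pair<std::string, std::string>;

void apply_options(const std::vector<option> & opts, options_state & st) {
    for (const auto & kv : opts) {
        const std::string & key = kv.first;
        const std::string & val = kv.second;
        (void) val; // unused when the legacy branch is compiled
        if (key == "spec_type") {
            // parsed elsewhere; omitted from this sketch
        } else if (key == "draft_ctx_size") {
            // accepted for backward compatibility, value ignored
            st.saw_draft_ctx_size = true;
#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
        }   // legacy: the chain ends here
#else
        } else if (key == "spec_ngram_mod_n_min") {   // modern: chain continues
            st.ngram_mod_n_min = std::stoi(val);
        }   // closes the last modern branch
#endif
    }   // closes the for-loop in both configurations
}
```

Each preprocessor branch supplies exactly one closer for the `draft_ctx_size` block, so the loop's closing brace is never consumed by the chain regardless of whether the macro is defined.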

…ebuilt

Dockerfile.turboquant's `builder-prebuilt` stage was missing the `ARG AMDGPU_TARGETS` / `ENV AMDGPU_TARGETS=${AMDGPU_TARGETS}` pair that `builder-fromsource` already has (and that `Dockerfile.llama-cpp` mirrors across both stages). When CI uses the prebuilt base image (quay.io/go-skynet/ci-cache:base-grpc-*, the common path) the build-arg passed by the workflow never reaches the env inside the compile stage.

backend/cpp/llama-cpp/Makefile:38 (introduced by #9626) errors out on hipblas builds when AMDGPU_TARGETS is empty, and the turboquant Makefile reuses backend/cpp/llama-cpp via a sibling build dir, so the same check fires from turboquant-fallback under BUILD_TYPE=hipblas:

    Makefile:38: *** AMDGPU_TARGETS is empty — set it to a comma-separated list of gfx targets e.g. gfx1100,gfx1101. Stop.
    make: *** [Makefile:66: turboquant-fallback] Error 2

The bug is latent on master because the docker layer cache stays warm across builds, so the compile step rarely re-runs from scratch. The llama.cpp bump in this PR invalidates the cache, so the missing env var becomes load-bearing and the hipblas turboquant CI job fails.

Mirror the existing pattern from Dockerfile.llama-cpp.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

localai-bot mentioned this pull request May 11, 2026

chore: ⬆️ Update ggml-org/llama.cpp to 1ec7ba0c14f33f17e980daeeda5f35b225d41994 #9763

Closed

localai-bot changed the title ~~chore(llama.cpp): bump to 1ec7ba0c and adapt grpc-server to new speculative API~~ feat(llama-cpp): bump to 1ec7ba0c, adapt grpc-server, expose new spec-decoding options May 11, 2026

mudler added 3 commits May 11, 2026 22:46

mudler merged commit bc4cd3d into master May 12, 2026
76 checks passed

mudler deleted the worktree-fix-llama-cpp-bump-1ec7ba0c branch May 12, 2026 15:22

BrewTestBot mentioned this pull request May 12, 2026

localai 4.2.2 Homebrew/homebrew-core#282322

Merged

mudler merged 5 commits into master from worktree-fix-llama-cpp-bump-1ec7ba0c
