chore(turboquant): bump to 4c1c3ac0 and retire obsolete grpc-server patches by localai-bot · Pull Request #9923 · mudler/LocalAI

localai-bot · 2026-05-21T11:08:14Z

Summary

Replaces #9912. The auto-bot picked 2cbfdc62a1a047b01377948dfdede8cb6a744866 as the new fork pin. CI on that SHA confirmed two things:

The grpc-server.cpp / patch-grpc-server.sh mismatch is real and fixable. The fork has rebased past upstream PRs #21962, #22397 and #22838, so the compatibility shims in patch-grpc-server.sh were actively breaking the build by rewriting the file back to the pre-refactor flat layout. With the shims removed, tests-turboquant-grpc plus every multiarch turboquant build pass.
2cbfdc62 has an unrelated GPU regression. Every singlearch GPU turboquant build (hipblas, all cublas variants, and the sycl variants) fails inside the fork's own ggml/src/ggml-cuda/fattn-mma-f16.cuh with a static-assertion error. The regression was introduced by the May 14 refactor #22880 (`HIP: RDNA3 mma FA`), which grew that file from 1855 to 2049 lines.

So this PR moves the pin back to 4c1c3ac09d2dba0aa9a55b94f6c50c41a92f9c8c (2026-05-13 22:12 UTC), the last commit before #22880. At that SHA the fork still has every API piece our patched grpc-server.cpp depends on:

COMMON_SPECULATIVE_TYPE_DRAFT_SIMPLE
params.speculative.types (vector) + nested draft.X / ngram_*.X fields
server_context_impl::model_tgt
get_media_marker()
common_speculative_types_from_names

COMMON_SPECULATIVE_TYPE_DRAFT_MTP lands later (May 16) and is not referenced by grpc-server.cpp, so MTP support is deferred until the fork ships a GPU-clean post-MTP SHA.
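A presence check for the identifiers listed above could be scripted before a future bump. The following is a minimal sketch, not part of this PR: `missing_symbols` and the `sample` text are hypothetical, and the check is a plain substring search rather than real C++ parsing, so it can only flag identifiers that are absent entirely.

```python
# Identifiers the PR description lists as required by grpc-server.cpp.
REQUIRED_SYMBOLS = (
    "COMMON_SPECULATIVE_TYPE_DRAFT_SIMPLE",
    "model_tgt",
    "get_media_marker",
    "common_speculative_types_from_names",
)


def missing_symbols(source_text, required=REQUIRED_SYMBOLS):
    """Return the required identifiers that do not occur in source_text."""
    return [name for name in required if name not in source_text]


# Hypothetical stand-in for the fork's headers at a candidate SHA.
# The last required identifier is deliberately left out.
sample = """
enum common_speculative_type { COMMON_SPECULATIVE_TYPE_DRAFT_SIMPLE };
struct server_context_impl { void * model_tgt; };
const char * get_media_marker();
"""

print(missing_symbols(sample))
# prints ['common_speculative_types_from_names']
```

An empty list would suggest the candidate SHA still exposes every name; it says nothing about whether signatures or field layouts changed, which only a real build (such as tests-turboquant-grpc) can confirm.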

Changes

Bumps TURBOQUANT_VERSION to 4c1c3ac09d2dba0aa9a55b94f6c50c41a92f9c8c.
Strips patch-grpc-server.sh from ~150 lines to ~13 — only the fork-specific kv_cache_types[] insertion for GGML_TYPE_TURBO2_0 / TURBO3_0 / TURBO4_0 remains.
Leaves the dormant LOCALAI_LEGACY_LLAMA_CPP_SPEC #ifdef blocks in backend/cpp/llama-cpp/grpc-server.cpp as an escape hatch if a future fork pin regresses.
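The one remaining patch step, adding the turbo entries to kv_cache_types[], can be illustrated as an idempotent text edit. This is a sketch in Python under stated assumptions, not the contents of patch-grpc-server.sh (which is a shell script): the function name `add_turbo_types`, the regular expression, and the `sample` declaration are all hypothetical, and the array is assumed to be a single brace-delimited initializer whose last entry ends with a comma.

```python
import re

# Enum entries named in the PR description.
TURBO_TYPES = ["GGML_TYPE_TURBO2_0", "GGML_TYPE_TURBO3_0", "GGML_TYPE_TURBO4_0"]


def add_turbo_types(source):
    """Append the turbo entries to the kv_cache_types initializer.

    Returns the source unchanged if the array is not found, and skips
    entries that are already present, so a second run is a no-op.
    """
    match = re.search(r"kv_cache_types\b[^{;]*\{([^}]*)\}", source)
    if match is None:
        return source
    body = match.group(1)
    new_entries = [t for t in TURBO_TYPES if t not in body]
    if not new_entries:
        return source
    addition = "".join(f"    {t},\n" for t in new_entries)
    # Insert just before the closing brace of the initializer.
    end = match.end(1)
    return source[:end] + addition + source[end:]


# Hypothetical declaration; the real one in the fork may look different.
sample = """static const ggml_type kv_cache_types[] = {
    GGML_TYPE_F32,
    GGML_TYPE_F16,
};
"""

patched = add_turbo_types(sample)
print(patched)
```

Making the edit idempotent means a re-run of the build on an already-patched tree leaves the file alone instead of duplicating entries.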

Test plan

tests-turboquant-grpc passes
All backend-jobs-multiarch *-turboquant (cpu + vulkan, amd64 + arm64) pass
All backend-jobs-singlearch *-turboquant (hipblas, cublas 12/13, sycl_f16/f32) pass
Standard llama-cpp build path unchanged

Assisted-by: Claude:claude-opus-4-7

…atches

The turboquant fork rebased past ggml-org/llama.cpp#21962, #22397 and #22838, so common_params_speculative now uses the nested draft/ngram_simple/ngram_mod layout, server_context_impl exposes model_tgt (not model), and get_media_marker() is provided. The compatibility shims in patch-grpc-server.sh were rewriting the shared grpc-server.cpp to the pre-refactor flat layout, which no longer matches the fork and broke the build (see PR #9912 CI failure).

Keep only the fork-specific kv_cache_types[] insertion for the TURBO2_0 / TURBO3_0 / TURBO4_0 enum entries. The dormant LOCALAI_LEGACY_LLAMA_CPP_SPEC #ifdef blocks in backend/cpp/llama-cpp/grpc-server.cpp stay as an escape hatch if a future fork bump regresses.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

CI on the prior 2cbfdc62 pin confirmed our grpc-server.cpp/patch fix works (tests-turboquant-grpc + all multiarch turboquant builds passed), but every GPU singlearch turboquant build now hits a static-assertion error in the fork's own ggml/src/ggml-cuda/fattn-mma-f16.cuh, a regression introduced by the May 14 #22880 `HIP: RDNA3 mma FA` refactor (file went from 1855 to 2049 lines).

4c1c3ac0 (2026-05-13 22:12 UTC) is the last commit before that refactor and still has every API piece grpc-server.cpp depends on (DRAFT_SIMPLE enum, nested common_params_speculative, model_tgt, get_media_marker(), common_speculative_types_from_names). MTP support landed later (May 16) and is not exercised by grpc-server.cpp.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

localai-bot mentioned this pull request May 21, 2026

chore: ⬆️ Update TheTom/llama-cpp-turboquant to 2cbfdc62a1a047b01377948dfdede8cb6a744866 #9912

Closed

localai-bot changed the title ~~chore(turboquant): bump to 2cbfdc62 and retire obsolete grpc-server patches~~ chore(turboquant): bump to 4c1c3ac0 and retire obsolete grpc-server patches May 21, 2026

localai-bot wants to merge 2 commits into master from fix/turboquant-bump-2cbfdc62

2 participants
