chore(turboquant): bump to 4c1c3ac0 and retire obsolete grpc-server patches #9923
Open
localai-bot wants to merge 2 commits into
…atches

The turboquant fork rebased past ggml-org/llama.cpp#21962, #22397 and #22838, so common_params_speculative now uses the nested draft/ngram_simple/ngram_mod layout, server_context_impl exposes model_tgt (not model), and get_media_marker() is provided. The compatibility shims in patch-grpc-server.sh were rewriting the shared grpc-server.cpp to the pre-refactor flat layout, which no longer matches the fork and broke the build (see PR #9912 CI failure).

Keep only the fork-specific kv_cache_types[] insertion for the TURBO2_0 / TURBO3_0 / TURBO4_0 enum entries. The dormant LOCALAI_LEGACY_LLAMA_CPP_SPEC #ifdef blocks in backend/cpp/llama-cpp/grpc-server.cpp stay as an escape hatch if a future fork bump regresses.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
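As a rough illustration of the escape hatch described in this commit, the sketch below shows how a compile-time switch can keep call sites unchanged while the parameter layout moves from flat to nested. It is a minimal sketch under stated assumptions: the structs are hypothetical stand-ins for `common_params_speculative`, and the member and helper names (`model_path`, `n_max`, `set_draft`, `draft_n_max`) are invented for illustration and are not taken from the fork or from grpc-server.cpp. Only the `LOCALAI_LEGACY_LLAMA_CPP_SPEC` macro name comes from the commit message.

```cpp
#include <string>

// Hypothetical stand-ins for the two shapes of common_params_speculative.
// Member names (model_path, n_max) are illustrative only.
struct spec_params_flat {
    std::string model_path;   // pre-refactor: draft settings at the top level
    int         n_max = 0;
};

struct spec_params_nested {
    struct draft_params {
        std::string model_path;
        int         n_max = 0;
    };
    draft_params draft;       // post-refactor: draft settings grouped together
};

#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
// Escape hatch: build against a fork that still has the flat layout.
using spec_params = spec_params_flat;
static void set_draft(spec_params & p, const std::string & path, int n_max) {
    p.model_path = path;
    p.n_max      = n_max;
}
static int draft_n_max(const spec_params & p) { return p.n_max; }
#else
// Default: the nested layout.
using spec_params = spec_params_nested;
static void set_draft(spec_params & p, const std::string & path, int n_max) {
    p.draft.model_path = path;
    p.draft.n_max      = n_max;
}
static int draft_n_max(const spec_params & p) { return p.draft.n_max; }
#endif

// Call sites look the same whichever branch was compiled in.
static int configure_and_read_back() {
    spec_params p;
    set_draft(p, "draft.gguf", 8);
    return draft_n_max(p);
}
```

With the macro undefined, the nested branch is compiled; defining it at build time would select the flat branch without touching `configure_and_read_back()`.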
CI on the prior 2cbfdc62 pin confirmed our grpc-server.cpp/patch fix works (tests-turboquant-grpc + all multiarch turboquant builds passed), but every GPU singlearch turboquant build now hits a static-assertion error in the fork's own ggml/src/ggml-cuda/fattn-mma-f16.cuh. This is a regression introduced by the May 14 #22880 `HIP: RDNA3 mma FA` refactor (file went from 1855 to 2049 lines).

4c1c3ac0 (2026-05-13 22:12 UTC) is the last commit before that refactor and still has every API piece grpc-server.cpp depends on (DRAFT_SIMPLE enum, nested common_params_speculative, model_tgt, get_media_marker(), common_speculative_types_from_names). MTP support landed later (May 16) and is not exercised by grpc-server.cpp.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Summary
Replaces #9912. The auto-bot picked `2cbfdc62a1a047b01377948dfdede8cb6a744866` as the new fork pin. CI on that SHA confirmed two things:

- The `grpc-server.cpp` / `patch-grpc-server.sh` mismatch is real and fixable. The fork has rebased past upstream PRs #21962, #22397 and #22838, so the compatibility shims in `patch-grpc-server.sh` were actively breaking the build by rewriting the file back to the pre-refactor flat layout. With the shims removed, `tests-turboquant-grpc` plus every multiarch turboquant build pass.
- `2cbfdc62` has an unrelated GPU regression. Every singlearch GPU turboquant build (hipblas + all cublas + sycl variants) fails inside the fork's own `ggml/src/ggml-cuda/fattn-mma-f16.cuh` with a static-assertion error, introduced by the May 14 `#22880 HIP: RDNA3 mma FA` refactor that grew that file from 1855 to 2049 lines.

So this PR moves the pin back to `4c1c3ac09d2dba0aa9a55b94f6c50c41a92f9c8c` (2026-05-13 22:12 UTC), the last commit before #22880. At that SHA the fork still has every API piece our patched `grpc-server.cpp` depends on:

- `COMMON_SPECULATIVE_TYPE_DRAFT_SIMPLE`
- `params.speculative.types` (vector) + nested `draft.X` / `ngram_*.X` fields
- `server_context_impl::model_tgt`
- `get_media_marker()`
- `common_speculative_types_from_names`

`COMMON_SPECULATIVE_TYPE_DRAFT_MTP` lands later (May 16) and is not referenced by `grpc-server.cpp`, so MTP support is deferred until the fork ships a GPU-clean post-MTP SHA.

Changes
- Bump `TURBOQUANT_VERSION` to `4c1c3ac09d2dba0aa9a55b94f6c50c41a92f9c8c`.
- Shrink `patch-grpc-server.sh` from ~150 lines to ~13: only the fork-specific `kv_cache_types[]` insertion for `GGML_TYPE_TURBO2_0` / `TURBO3_0` / `TURBO4_0` remains.
- Keep the dormant `LOCALAI_LEGACY_LLAMA_CPP_SPEC` `#ifdef` blocks in `backend/cpp/llama-cpp/grpc-server.cpp` as an escape hatch if a future fork pin regresses.

Test plan
- `tests-turboquant-grpc` passes
- `backend-jobs-multiarch *-turboquant` (cpu + vulkan, amd64 + arm64) pass
- `backend-jobs-singlearch *-turboquant` (hipblas, cublas 12/13, sycl_f16/f32) pass
- `llama-cpp` build path unchanged

Assisted-by: Claude:claude-opus-4-7
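
For readers wondering why the one remaining patch, the `kv_cache_types[]` insertion, is still needed, the sketch below shows the general pattern of a name-to-type lookup table. It is illustrative only and assumes that cache types are resolved by name against such a table: the enum, the string spellings (`"turbo2_0"` etc.), the `kv_cache_entry` struct and the `lookup_cache_type` helper are all hypothetical and do not reproduce the fork's or LocalAI's actual code.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Mock subset of a ggml_type-style enum; values and spellings are placeholders.
enum mock_type { MOCK_F16, MOCK_Q8_0, MOCK_TURBO2_0, MOCK_TURBO3_0, MOCK_TURBO4_0 };

struct kv_cache_entry {
    const char * name;
    mock_type    type;
};

// Mock table standing in for kv_cache_types[].
static const std::vector<kv_cache_entry> kv_cache_types = {
    {"f16",  MOCK_F16},
    {"q8_0", MOCK_Q8_0},
    // Entries a fork-specific patch would insert; without them the
    // lookup below rejects these names.
    {"turbo2_0", MOCK_TURBO2_0},
    {"turbo3_0", MOCK_TURBO3_0},
    {"turbo4_0", MOCK_TURBO4_0},
};

// Hypothetical helper: resolve a cache type name against the table.
static mock_type lookup_cache_type(const std::string & s) {
    for (const auto & e : kv_cache_types) {
        if (s == e.name) {
            return e.type;
        }
    }
    throw std::invalid_argument("unsupported cache type: " + s);
}
```

In this sketch, a name that is missing from the table fails at lookup time rather than at compile time, which is one plausible reason a small insertion patch can remain necessary even after the larger compatibility shims are gone.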