
[test] chore(turboquant): bump fork pin to rebase/upstream-sync-april-2026 #9493

Closed

mudler wants to merge 1 commit into master from bump/turboquant-upstream-sync-april-2026

Conversation

@mudler (Owner) commented Apr 22, 2026

Move the TurboQuant llama.cpp fork pin from feature/turboquant-kv-cache (627ebbc6) to rebase/upstream-sync-april-2026 (7f320bb8), picking up the upstream-sync work on the fork.

Testing TheTom/llama-cpp-turboquant#101

cc @TheTom

Move the TurboQuant llama.cpp fork pin from feature/turboquant-kv-cache
(627ebbc6) to rebase/upstream-sync-april-2026 (7f320bb8), picking up the
upstream-sync work on the fork.

Assisted-by: Claude:claude-opus-4-7
@apollosenvy commented:

Pulled and sanity-checked on AMD / ROCm 7.2.1 (7900 XTX, gfx1100, wave32). HIP path looks good on the new pin.

Build

cmake -B build-hip -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1100 -DCMAKE_BUILD_TYPE=Release
cmake --build build-hip -j 12

Clean build with zero errors; the only warning is a benign -Wmissing-prototypes in ggml-turbo-quant.c. The three fattn-vec-instance-f16-turbo{2,3,4}_0.cu template instances added in b8b1d49b3 link in correctly, so no more undefined references on ROCm.

Smoke tests (Qwen3-8B-Q4_K_M, -fa on, -ngl 99)

| K type | V type | pp t/s | tg t/s | VRAM used | Output   |
|--------|--------|--------|--------|-----------|----------|
| q8_0   | turbo4 | 223.3  | 89.9   | ~12.2 GiB | coherent |
| f16    | turbo3 | 228.8  | 90.9   | ~11.9 GiB | coherent |

Both ran to the -n limit with sensible completions. No OOM, no crash, no hipError, no NaNs.
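For the record, the exact command line wasn't posted above; a plausible reconstruction, assuming the fork keeps stock llama.cpp flags (`-ctk`/`-ctv` for cache types) and registers `turbo3`/`turbo4` as V-cache types, would look like:

```sh
# Hypothetical invocation for the q8_0-K / turbo4-V row; model path,
# prompt, and -n value are placeholders, not from the report.
./build-hip/bin/llama-cli -m Qwen3-8B-Q4_K_M.gguf \
  -ngl 99 -fa on -ctk q8_0 -ctv turbo4 \
  -n 256 -p "Explain KV-cache quantization in one paragraph."
```

The f16-K / turbo3-V row would swap in `-ctk f16 -ctv turbo3`.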

What this covers from PR #101's community-testing checklist

  • AMD HIP build with variadic shfl macros
  • OOM fix (d7b533446 / 4d754604e) validated: turbo3 + turbo4 V-cache both fit and decode on 24 GiB
  • F16-K + TURBO-V dispatch case (58bbe5518) exercised cleanly
  • turbo V unpad gate on V type (156592051) behaves correctly with turbo V + f16 K

Not covered here: gfx1200 (RX 9060 XT), head_dim > 256 (Gemma 4 full-attention), multi-GPU, prefill-heavy workloads against long contexts.
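On the "both fit on 24 GiB" point, a back-of-envelope KV-cache sizing helps. Every number below is an assumption for illustration only: Qwen3-8B-like dimensions (36 layers, 8 KV heads, head_dim 128), a 32k context, q8_0 at its usual 8.5 bits per element, and a hypothetical ~4.5 bits per element for the fork's turbo4 V-cache type.

```python
# Back-of-envelope KV-cache sizing. Model dims and the turbo4 bit-width
# are illustrative assumptions, not measured values from the fork.

def kv_cache_gib(n_layers, n_kv_heads, head_dim, n_ctx, k_bits, v_bits):
    """Total K+V cache size in GiB for one sequence."""
    elems_per_side = n_layers * n_ctx * n_kv_heads * head_dim
    total_bytes = elems_per_side * (k_bits + v_bits) / 8
    return total_bytes / 2**30

size = kv_cache_gib(n_layers=36, n_kv_heads=8, head_dim=128,
                    n_ctx=32768, k_bits=8.5, v_bits=4.5)
print(f"{size:.2f} GiB")  # roughly 1.8 GiB of KV cache
```

Under those assumptions the cache itself is small next to the ~5 GiB of Q4_K_M weights, which is consistent with the ~12 GiB totals in the table (the remainder being weights plus compute buffers).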

CI note

The only pull_request failure on this PR is backend-jobs (cublas, 13, ...), which timed out at the 6 h GitHub Actions wall-clock ceiling inside the webui cp -rT rename into tools/grpc-server/; that is a timeout, not a build error, and it is unrelated to the pin itself. The fork's own llama.cpp CI matrix (android-arm64 / ggml-ci-* / etc.) is failing upstream-wide, not specifically on this rebase.

LGTM for the HIP slice. Ship it.

cc @TheTom @mudler

@TheTom commented Apr 23, 2026

Thank you!

@mudler (Owner, Author) commented Apr 25, 2026

Closing, as this now seems fixed and already merged in #9497. Thanks for your support!

@mudler mudler closed this Apr 25, 2026
