Add fast CUDA MMVQ GGUF kernels #3463
Merged
lukeme117 added a commit to lukeme117/candle that referenced this pull request on Apr 19, 2026:
The per-device Q8_1 scratch workspace added in huggingface#3463 is allocated via `unsafe dev.alloc::<u8>(bytes)`, leaving the bytes uninitialized. The `launch_mmvq_gguf_quantize_q8_1_*` kernels fill the k-sized portion of each row, but `MATRIX_ROW_PADDING` extends the row stride past that, and the mmvq reader consumes every block including the padded tail. Stale bytes in the tail feed into the dot product, so top-1 is typically correct but ranks 2+ drift and logit magnitudes change run to run (reproducible and bit-stable within a run because the same workspace bytes get reused). Same family as the fix in huggingface#3428 for `mul_mat_vec_via_q8_1` / `mul_mat_via_q8_1` / `load_quantized`. Switching the two alloc sites in `workspace_ensure` from `dev.alloc::<u8>` to `dev.alloc_zeros::<u8>` zero-fills the padded tail once per workspace growth and produces stable, correct logits on a Gemma 4 Q4_K_M GGUF (3060 Ti, CUDA 13.2).
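The failure mode described in this commit message can be reproduced on the host with a minimal sketch. Everything here is an illustrative assumption rather than candle code: `ROW_PADDING` is a made-up small value (the real `MATRIX_ROW_PADDING` is not stated above), plain `i32` values stand in for Q8_1 blocks, and the weights are all ones so that the tail contribution is easy to see.

```rust
// Hypothetical padding granularity, chosen small for readability.
// The actual MATRIX_ROW_PADDING value is not given in this thread.
const ROW_PADDING: usize = 8;

/// Rounds `k` up to the next multiple of ROW_PADDING (the row stride).
fn padded_len(k: usize) -> usize {
    (k + ROW_PADDING - 1) / ROW_PADDING * ROW_PADDING
}

/// Writes only the first `k` elements of the row, the way the quantize
/// kernels are described as filling only the k-sized portion.
fn fill_row(row: &mut [i32], k: usize) {
    for (i, v) in row.iter_mut().take(k).enumerate() {
        *v = i as i32 + 1;
    }
}

/// Consumes the whole padded stride, the way the reader is described
/// as consuming every block including the padded tail.
fn dot_padded(row: &[i32], weights: &[i32]) -> i32 {
    row.iter().zip(weights).map(|(a, b)| a * b).sum()
}

/// Returns (result with a stale tail, result with a zeroed tail).
fn stale_vs_zeroed(k: usize) -> (i32, i32) {
    let n = padded_len(k);
    let weights = vec![1i32; n];

    // Simulates an uninitialized allocation that still holds old data.
    let mut stale = vec![7i32; n];
    fill_row(&mut stale, k);

    // Simulates a zero-filled allocation.
    let mut zeroed = vec![0i32; n];
    fill_row(&mut zeroed, k);

    (dot_padded(&stale, &weights), dot_padded(&zeroed, &weights))
}
```

With `k = 5` the stride is 8, so `stale_vs_zeroed(5)` returns `(36, 15)`: the three leftover tail elements add 21 to the sum, while the zeroed tail contributes nothing.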
Adds `mmvq_gguf.cu` and surrounding infra enabling a fast path for GGUF decode on CUDA. Specifically, it adds:
- `Q8_1` with dedicated `BF16`/`F16`/`F32` quantize kernels
- `Q8_1` scratch workspace that is lazily allocated and reused across calls
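The idea of a scratch workspace that is lazily allocated and reused across calls can be sketched on the host as follows. This is only an illustration under stated assumptions: the `Workspace` type, its `ensure` method, and the `allocations` counter are hypothetical names, and a `Vec<u8>` stands in for a device buffer. It does not reproduce the PR's actual implementation.

```rust
/// Hypothetical host-side stand-in for a per-device scratch buffer.
struct Workspace {
    buf: Vec<u8>,
    /// Counts how many times the buffer had to be (re)allocated.
    allocations: usize,
}

impl Workspace {
    /// Starts empty: nothing is allocated until the first request.
    fn new() -> Self {
        Workspace { buf: Vec::new(), allocations: 0 }
    }

    /// Returns a scratch slice of `bytes` bytes, growing the buffer only
    /// when the request exceeds its current size. Fresh buffers are
    /// zero-filled; reused buffers keep whatever was written before.
    fn ensure(&mut self, bytes: usize) -> &mut [u8] {
        if self.buf.len() < bytes {
            self.buf = vec![0u8; bytes];
            self.allocations += 1;
        }
        &mut self.buf[..bytes]
    }
}

/// Replays a sequence of size requests and reports how many
/// allocations were actually performed.
fn allocations_for(requests: &[usize]) -> usize {
    let mut ws = Workspace::new();
    for &r in requests {
        ws.ensure(r);
    }
    ws.allocations
}
```

For the request sequence `[64, 32, 64, 128]`, only the first and last calls allocate, so `allocations_for` returns 2. The trade-off of this pattern is that the buffer stays at its high-water mark for the lifetime of the workspace.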