Add fast CUDA MMVQ GGUF kernels #3463

Merged: EricLBuehler merged 5 commits into main from mmvq_gguf on Apr 13, 2026

@EricLBuehler
Member

Adds mmvq_gguf.cu and surrounding infra enabling a fast-path for GGUF decode on CUDA.

Specifically, it adds:

  • Native BF16 input/output support (no F32 round-trip!)
  • On-the-fly activation quantization to Q8_1 with dedicated BF16/F16/F32 quantize kernels
  • Per-device Q8_1 scratch workspace that is lazily allocated and reused across calls
  • Batch sizes 1 to 8 with compile-time specialized kernel variants per batch size
  • Automatic fast-path dispatch that falls back to the existing PTX-based path for unsupported types or larger batches
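
For readers unfamiliar with Q8_1, here is a minimal CPU-side sketch of what quantizing one block of activations involves. It is not the code from this PR: the names `BlockQ8_1` and `quantize_block_q8_1` are hypothetical, and `d` and `s` are kept as `f32` for simplicity, whereas ggml stores them as f16. It assumes the usual ggml layout of 32 values per block, a per-block scale `d`, and a sum term `s`.

```rust
/// Values per Q8_1 block (32 in ggml).
const QK8_1: usize = 32;

/// Simplified Q8_1 block. ggml stores `d` and `s` as f16; f32 is an
/// assumption made here to keep the sketch free of extra crates.
#[derive(Debug, Clone, PartialEq)]
struct BlockQ8_1 {
    d: f32,          // per-block scale
    s: f32,          // d * sum(qs), approximates the sum of the inputs
    qs: [i8; QK8_1], // quantized values
}

/// Quantize one block of 32 activations (hypothetical helper).
fn quantize_block_q8_1(x: &[f32; QK8_1]) -> BlockQ8_1 {
    // Largest magnitude in the block determines the scale.
    let amax = x.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    let d = amax / 127.0;
    // Guard against an all-zero block.
    let id = if d > 0.0 { 1.0 / d } else { 0.0 };

    let mut qs = [0i8; QK8_1];
    let mut sum = 0i32;
    for (q, v) in qs.iter_mut().zip(x.iter()) {
        let r = (v * id).round() as i8;
        *q = r;
        sum += r as i32;
    }
    BlockQ8_1 { d, s: d * sum as f32, qs }
}
```

Calling `quantize_block_q8_1(&[1.0; 32])` yields a block whose quants are all 127 and whose scale is 1/127. The dedicated BF16/F16/F32 kernels in the PR presumably perform the equivalent work per block on the GPU, reading the activation dtype directly.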

Comment thread candle-core/src/quantized/fast_mmvq.rs Outdated
Comment thread candle-core/src/quantized/fast_mmvq.rs Outdated
Member

@ivarflakstad ivarflakstad left a comment

lgtm! 👌

@EricLBuehler EricLBuehler merged commit b503458 into main Apr 13, 2026
@EricLBuehler EricLBuehler deleted the mmvq_gguf branch April 13, 2026 22:28
lukeme117 added a commit to lukeme117/candle that referenced this pull request Apr 19, 2026
The per-device Q8_1 scratch workspace added in huggingface#3463 is allocated via
`unsafe dev.alloc::<u8>(bytes)`, leaving the bytes uninitialized. The
`launch_mmvq_gguf_quantize_q8_1_*` kernels fill the k-sized portion of
each row, but `MATRIX_ROW_PADDING` extends the row stride past that, and
the mmvq reader consumes every block including the padded tail. Stale
bytes in the tail feed into the dot product, so top-1 is typically
correct but ranks 2+ drift and logit magnitudes change run to run
(reproducible and bit-stable within a run because the same workspace
bytes get reused).

Same family as the fix in huggingface#3428 for `mul_mat_vec_via_q8_1` /
`mul_mat_via_q8_1` / `load_quantized`. Switching the two alloc sites in
`workspace_ensure` from `dev.alloc::<u8>` to `dev.alloc_zeros::<u8>`
zero-fills the padded tail once per workspace growth and produces stable,
correct logits on a Gemma 4 Q4_K_M GGUF (3060 Ti, CUDA 13.2).
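
The failure mode described in this commit message can be illustrated with a toy CPU model. This is only a sketch under stated assumptions, not candle's implementation: all function names are hypothetical, the value 512 for `MATRIX_ROW_PADDING` is assumed, and the `fill` argument stands in for whatever bytes the allocation happens to contain (0 models `alloc_zeros`, any other value models stale bytes from `alloc`).

```rust
/// Row stride alignment; 512 is an assumed value for this sketch.
const MATRIX_ROW_PADDING: usize = 512;

/// Round `k` up to the next multiple of MATRIX_ROW_PADDING.
fn padded_len(k: usize) -> usize {
    (k + MATRIX_ROW_PADDING - 1) / MATRIX_ROW_PADDING * MATRIX_ROW_PADDING
}

/// Toy scratch workspace for one row. `fill` models the allocation's
/// initial contents: 0 for zeroed memory, nonzero for stale bytes.
fn make_workspace(k: usize, fill: i8) -> Vec<i8> {
    vec![fill; padded_len(k)]
}

/// Models the quantize kernel: only the first `acts.len()` entries are written.
fn write_activations(ws: &mut [i8], acts: &[i8]) {
    ws[..acts.len()].copy_from_slice(acts);
}

/// Models the reader: it consumes the whole padded row, tail included.
fn dot_full_stride(ws: &[i8], weights: &[i8]) -> i32 {
    ws.iter()
        .zip(weights.iter())
        .map(|(a, w)| *a as i32 * *w as i32)
        .sum()
}
```

With `k = 600` the stride is padded to 1024. If the activations and weights are all 1, a zeroed workspace gives a dot product of 600, while a workspace whose tail holds stale 3s gives 600 + 424 × 3 = 1872. Zero-filling once per workspace growth is enough because the writer never touches the tail afterwards.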