Add fast CUDA MMVQ GGUF kernels by EricLBuehler · Pull Request #3463 · huggingface/candle

EricLBuehler · 2026-04-13T02:56:58Z

Adds `mmvq_gguf.cu` and the surrounding infrastructure, enabling a fast path for GGUF decode on CUDA.

Specifically, it adds:

- Native BF16 input/output support (no F32 round-trip!)
- On-the-fly activation quantization to Q8_1, with dedicated BF16/F16/F32 quantize kernels
- A per-device Q8_1 scratch workspace that is lazily allocated and reused across calls
- Batch sizes 1 to 8, with compile-time specialized kernel variants per batch size
- Automatic fast-path dispatch that falls back to the existing PTX-based path for unsupported types or larger batches
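
The dispatch rule described in the list above can be sketched roughly as follows. This is only an illustration of the stated conditions (BF16/F16/F32 activations, batch sizes 1 to 8, fallback otherwise). The names `ActDType`, `Path`, `MAX_FAST_BATCH`, `choose_path`, and the `weight_supported` flag are hypothetical and are not taken from the PR's code.

```rust
/// Activation dtypes, using hypothetical names. The PR description lists
/// BF16, F16, and F32 as having dedicated quantize kernels.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ActDType {
    BF16,
    F16,
    F32,
    /// Any other dtype (assumed to be unsupported by the fast path).
    Other,
}

/// Which matmul path a call would take (hypothetical type).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Path {
    /// Fast MMVQ path, specialized per batch size.
    FastMmvq { batch: usize },
    /// Existing PTX-based path.
    PtxFallback,
}

/// Largest batch size the description says the fast path covers.
pub const MAX_FAST_BATCH: usize = 8;

/// Pick a path. `weight_supported` stands in for "the quantized weight
/// type has a fast kernel", which the description only mentions as
/// "unsupported types" without listing them.
pub fn choose_path(dtype: ActDType, batch: usize, weight_supported: bool) -> Path {
    let dtype_ok = matches!(dtype, ActDType::BF16 | ActDType::F16 | ActDType::F32);
    if weight_supported && dtype_ok && (1..=MAX_FAST_BATCH).contains(&batch) {
        Path::FastMmvq { batch }
    } else {
        Path::PtxFallback
    }
}
```

Under these assumptions, `choose_path(ActDType::BF16, 1, true)` selects the fast path, while a batch of 9 or an unsupported weight type falls back to the PTX-based path.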

ivarflakstad

lgtm! 👌

The per-device Q8_1 scratch workspace added in huggingface#3463 is allocated via `unsafe dev.alloc::<u8>(bytes)`, leaving the bytes uninitialized. The `launch_mmvq_gguf_quantize_q8_1_*` kernels fill the k-sized portion of each row, but `MATRIX_ROW_PADDING` extends the row stride past that, and the mmvq reader consumes every block including the padded tail.

Stale bytes in the tail feed into the dot product, so top-1 is typically correct but ranks 2+ drift and logit magnitudes change run to run (reproducible and bit-stable within a run because the same workspace bytes get reused). Same family as the fix in huggingface#3428 for `mul_mat_vec_via_q8_1` / `mul_mat_via_q8_1` / `load_quantized`.

Switching the two alloc sites in `workspace_ensure` from `dev.alloc::<u8>` to `dev.alloc_zeros::<u8>` zero-fills the padded tail once per workspace growth and produces stable, correct logits on a Gemma 4 Q4_K_M GGUF (3060 Ti, CUDA 13.2).
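
A CPU-only toy model may help show why the padded tail matters. It is a sketch under loud assumptions, not the PR's code: `ROW_PADDING = 8` is a made-up granularity (the real `MATRIX_ROW_PADDING` value is not given here), plain `i8` values stand in for Q8_1 blocks, the weight row is assumed to have non-zero values in the tail, and uninitialized device memory is modeled by pre-filling the buffer with an arbitrary byte. All function names are hypothetical.

```rust
/// Hypothetical padding granularity, chosen small for readability.
pub const ROW_PADDING: usize = 8;

/// Round `k` up to the next multiple of `ROW_PADDING` (the row stride).
pub fn padded_len(k: usize) -> usize {
    (k + ROW_PADDING - 1) / ROW_PADDING * ROW_PADDING
}

/// Stand-in for the quantize kernel: writes only the first `k` entries
/// of the row and leaves the padded tail untouched.
pub fn quantize_row(workspace: &mut [i8], activations: &[i8]) {
    workspace[..activations.len()].copy_from_slice(activations);
}

/// Stand-in for the mmvq reader: consumes the whole padded row,
/// tail included.
pub fn dot_padded(workspace: &[i8], weights: &[i8]) -> i32 {
    workspace
        .iter()
        .zip(weights)
        .map(|(&a, &w)| a as i32 * w as i32)
        .sum()
}

/// Run one "matmul row". `initial_fill` models the workspace contents
/// before quantization: 0 for a zero-filled allocation, anything else
/// for stale bytes left by an uninitialized allocation.
pub fn run(initial_fill: i8, activations: &[i8], weights: &[i8]) -> i32 {
    let stride = padded_len(activations.len());
    assert_eq!(weights.len(), stride, "weights must cover the padded row");
    let mut workspace = vec![initial_fill; stride];
    quantize_row(&mut workspace, activations);
    dot_padded(&workspace, weights)
}
```

With five activations `[1, 2, 3, 4, 5]` and eight weights of `1`, `run(0, ..)` returns the true dot product over the first five entries, while any non-zero `initial_fill` adds a tail contribution. In this model the error depends only on whatever bytes happened to be in the buffer, which matches the described symptom of results that are stable within a run but differ between runs.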

Add fast CUDA MMVQ GGUF kernels

3849961

EricLBuehler requested a review from ivarflakstad April 13, 2026 02:57

ivarflakstad reviewed Apr 13, 2026

Comment thread candle-core/src/quantized/fast_mmvq.rs Outdated

Comment thread candle-core/src/quantized/fast_mmvq.rs Outdated

EricLBuehler added 3 commits April 13, 2026 12:46

Support f16

9d7e706

Apply review comments

831dda2

Format

cdd5288

EricLBuehler requested a review from ivarflakstad April 13, 2026 16:51

Fix

b8a30d5

ivarflakstad approved these changes Apr 13, 2026

EricLBuehler merged commit b503458 into main Apr 13, 2026

EricLBuehler deleted the mmvq_gguf branch April 13, 2026 22:28

lukeme117 mentioned this pull request Apr 19, 2026

Fix uninitialized workspace in fast_mmvq path #3476

Open

3 tasks

EricLBuehler merged 5 commits into main from mmvq_gguf
