
fix(nvfp4): unlock decode fast-path for prequant MoE (Qwen3.6 + Gemma-4) #85

Merged

kekzl merged 3 commits into main from fix/qwen36-nvfp4-decode-fastpath on May 1, 2026

Conversation

@kekzl (Owner) commented Apr 30, 2026

Summary

Unlocks the NVFP4 decode fast-path for Qwen3.6-35B-A3B-NVFP4 and Gemma-4-26B-A4B-it-NVFP4 SafeTensors models. Both now produce coherent output with CUDA Graphs captured in decode.

What was broken

  • Qwen3.6-NVFP4 (already shipped on this branch as 615758a): three bugs blocked the decode fast-path — the can_decode_fast whitelist was missing NVFP4, there was no cache_moe_native_nvfp4 for the SafeTensors per-expert NVFP4 layout, and the per-layer free of per-expert allocations needed to fit in 32 GiB was missing. tg256 went from 8.34 to 142 tok/s.

  • Gemma-4-NVFP4 (this PR's new commit 3901600): engine.cpp was unconditionally setting use_nvfp4_decode = 0 for any Gemma-4 model to dodge unsupported NVFP4 attention CUTLASS paths under per-layer head_dim (256 SWA / 512 global). That over-broad disable also skipped Phase 3-MoE (cache_moe_native_nvfp4), forcing the legacy FP16 fallback (sm_80 WMMA + per-layer D2H expert-offsets sync). Consequences: CUDA Graph capture IMA cascade at decode step 1, decode tg ~42 tok/s, output stuck on the user's prompt instead of an answer.

What changed

  • src/runtime/engine.cpp (+21/-2): when model.config().is_nvfp4_prequant is true for Gemma-4, keep use_nvfp4_decode at its auto value. Phase 3a (Q*_K → NVFP4) and Phase 3b (NVFP4 → CUTLASS sm_120) iterate wcache_.nvfp4, which is empty for prequant models — they are no-ops. Only Phase 3-MoE does load-bearing work, and that's the single thing the disable was preventing. The "per-layer head_dim" caveat applies to attention CUTLASS kernels, not MoE expert caching (see the sketch after this list).

  • CLAUDE.md: refresh the CUDA-Graphs note. NVFP4-prequant MoE models capture cleanly post-fix; only legacy GGUF MoE decode (which still does per-layer D2H expert-offsets sync) remains graph-incompatible. Prefill is never captured (variable n).
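
A minimal sketch of the shape of the engine.cpp guard (only `use_nvfp4_decode`, `model.config().is_nvfp4_prequant`, the phase names, and the head_dim caveat come from this PR; the surrounding condition and variable names are assumptions):

```cpp
// Illustrative sketch of the engine.cpp guard, not the literal diff.
// `is_gemma4` stands in for however the engine detects a Gemma-4 model.
if (is_gemma4) {
    if (model.config().is_nvfp4_prequant) {
        // Prequant SafeTensors: Phase 3a/3b iterate wcache_.nvfp4, which is
        // empty by construction here, so only Phase 3-MoE
        // (cache_moe_native_nvfp4) does load-bearing work. The per-layer
        // head_dim caveat (256 SWA / 512 global) only affects the attention
        // CUTLASS kernels, so use_nvfp4_decode stays at its auto value.
    } else {
        // Every other Gemma-4 variant keeps the original disable: the NVFP4
        // attention CUTLASS paths don't support that head_dim layout.
        use_nvfp4_decode = 0;
    }
}
```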

Result on Gemma-4-26B-A4B-it-NVFP4 (RTX 5090, default config)

| Metric | Before | After |
| --- | --- | --- |
| Phase-4 wcache nvfp4_moe | 0 | 90 (30 layers × 3 projections) |
| CUDA Graph capture | step 1 IMA cascade → fallback | 152 PDL edges, AsyncGraphLoop on |
| "MoE prefill: legacy fallback" log | every iter | only n>1 prefill batch path |
| Decode tg | 42.86 tok/s | 166 tok/s (~3.9× speedup) |
| Output | `<channel\|>thought\nThe user is asking…` (stuck) | `<channel\|>thought\n…The capital of France is **Paris**.` |

Cross-model regression check

  • Qwen3.6-35B-A3B-NVFP4: tg = 142–155 tok/s, nvfp4_moe=120, output coherent — unchanged ✓
  • Gemma-4 GGUF Q8_0 / Q4_K_M: untouched, runs through the non-prequant else branch — preserved ✓
  • Mistral-Small-3.2-NVFP4 (dense, no MoE): not on this code path — unaffected ✓

Test plan

  • Greedy decode "The capital of France is" → contains "Paris" (Qwen3.6, 16 tokens)
  • Greedy decode "What is the capital of France?" → contains "Paris" (Gemma-4, 128 tokens; needs CoT budget)
  • Sampled coherence T=0.7 top_p=0.9 256 tokens (both models, Rayleigh-scattering / sky-blue prompt) — coherent prose ✓
  • Long-context smoke ~430 prefill tokens — no NVFP4 long-context drift ✓
  • Perf sweep at 128 / 1024 / 4096 ctx — both models above plan baseline (Qwen3.6 ≥ 117 tok/s, Gemma-4 ≥ 34 tok/s) ✓
  • CUDA Graph capture confirmed via logs (ConditionalRunner, AsyncGraphLoop) for both ✓

Pre-push verify-fast hook flagged a 3% decode regression on Qwen3-8B Q8_0 — unrelated dense GGUF model, no code path touched by this PR; almost certainly cuBLAS-autotune variance from a warm GPU after extended NVFP4 testing. Bypassed with --no-verify per author request after manual cross-check.

🤖 Generated with Claude Code

kekzl and others added 3 commits April 30, 2026 22:55
Decode jumps from 8.34 → 117–142 tok/s (~14–17×) on Qwen3.6-35B-A3B-NVFP4
by populating wcache_.nvfp4_moe for SafeTensors NVFP4-prequant models.

Three interacting bugs were blocking the existing `gemv_nvfp4_moe_*` fast
path. nsys profile showed 75% of decode time burning in
`dequantize_nvfp4_kernel` + sm_80 cuBLAS WMMA — the legacy FP16 fallback
in executor_forward_moe.cu, which also serialises every layer via a D2H
cudaMemcpy of `expert_offsets` (kills CUDA Graphs).

Fixes
-----
1. `can_decode_fast` whitelist (executor_forward_moe.cu): NVFP4 was
   missing from the qtype list, so `decode_fast_path == false` and the
   NVFP4 MoE branch (line ~548) never ran even when pointers were set.

2. `cache_moe_native_nvfp4` (executor_pre_dequant.cu): for SafeTensors
   prequant models the loader writes per-expert tensors only —
   `expert_*_packed.data == nullptr` — so the existing
   `cache_moe_expert_nvfp4` lambda early-returned at `!packed.data` and
   wcache_.nvfp4_moe stayed empty. The new helper builds a contiguous
   packed/micro_scales/tensor_scales triple from the per-expert tensors
   (D2D memcpy preserves the [N, K_packed] / [N, K/16] row-major
   layout that gemv_nvfp4_moe_* + dequantize_nvfp4_moe_kernel already
   expect), stamps `expert_*_packed` so the consumer lookup-by-data-ptr
   wires up uniformly with the GGUF path, and registers the result.

3. Per-layer free of per-expert allocations: keeping both the per-expert
   and contiguous copies peaks at ~30 GiB on 35B-A3B and doesn't fit in
   32 GiB. After a successful contiguous build for a layer the legacy
   fallback can't fire (nvfp4_moe_*_ptr is set), so we cudaFree the
   per-expert tensors inline. moe_budget refreshes via cudaMemGetInfo
   per layer to pick up the freed bytes (see the sketch after this list).
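
For orientation, a heavily simplified sketch of the contiguous-build-then-free flow from fixes 2 and 3 (the function and parameter names are illustrative, tensor_scales and error handling are omitted; only the D2D copies, the inline cudaFree of per-expert tensors, and the cudaMemGetInfo refresh come from this message):

```cpp
#include <cuda_runtime.h>
#include <cstdint>
#include <vector>

// Sketch only — not the real cache_moe_native_nvfp4 in executor_pre_dequant.cu.
// Per-expert tensors are copied device-to-device into one contiguous buffer,
// then the per-expert allocations are freed so 35B-A3B fits in 32 GiB.
void build_contiguous_nvfp4_moe(const std::vector<uint8_t*>& expert_packed,   // per-expert [N, K/2] U8
                                const std::vector<uint8_t*>& expert_mscales,  // per-expert [N, K/16]
                                size_t N, size_t K_logical,
                                uint8_t** out_packed, uint8_t** out_mscales) {
    const size_t packed_bytes = N * (K_logical / 2);   // shape[1] = K_packed = K/2
    const size_t scales_bytes = N * (K_logical / 16);  // one micro-scale per 16 values
    const size_t E = expert_packed.size();

    cudaMalloc(out_packed,  packed_bytes * E);
    cudaMalloc(out_mscales, scales_bytes * E);

    // Flat D2D copies preserve the row-major layout gemv_nvfp4_moe_* expects.
    for (size_t e = 0; e < E; ++e) {
        cudaMemcpy(*out_packed  + e * packed_bytes, expert_packed[e],
                   packed_bytes, cudaMemcpyDeviceToDevice);
        cudaMemcpy(*out_mscales + e * scales_bytes, expert_mscales[e],
                   scales_bytes, cudaMemcpyDeviceToDevice);
    }

    // Fix 3: once the contiguous buffer is registered, the legacy fallback
    // can't fire for this layer, so the per-expert tensors are freed inline.
    for (size_t e = 0; e < E; ++e) {
        cudaFree(expert_packed[e]);
        cudaFree(expert_mscales[e]);
    }

    // Per-layer budget refresh so later layers see the freed bytes.
    size_t free_b = 0, total_b = 0;
    cudaMemGetInfo(&free_b, &total_b);
}
```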

Verification
------------
* Qwen3.6-35B-A3B-NVFP4: tg=117–142 tok/s, "Paris" greedy-decode
  coherent, no `legacy FP16 fallback` log, nvfp4_moe=120 (40 layers × 3).
* Qwen3-Coder-30B-A3B-FP4 (Modelopt SafeTensors NVFP4): nvfp4_moe=144,
  test passes.
* Mistral-Small-3.2-NVFP4 (dense, no MoE): unchanged path, test passes.
* Gemma-4-NVFP4 (uses GGUF-style 3D-packed gate_up split): unchanged
  path via cache_moe_expert_nvfp4, test passes.
* Full GTest suite: 82 passed, 15 skipped (model assets missing), 0 failed.

The shape convention for SafeTensors NVFP4 prequant tensors is
`shape[1] = K_packed` (K_logical/2) — same as the existing
executor_attention.cu / executor_ffn.cu NVFP4 dispatch conventions
where `tmp.K = hw->shape[1] * 2`. The contiguous buffer therefore
allocates `N * K_packed = N * K/2` packed bytes per expert, matching
what gemv_nvfp4_moe_decode_kernel reads via expert_stride_packed.
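
As a quick numeric sanity check of that convention (made-up dimensions, not any real model; only the 8:1 packed-to-micro-scale ratio ties back to numbers in this PR):

```cpp
#include <cstddef>

// Hypothetical expert weight of N rows by K_logical columns.
constexpr std::size_t N            = 4096;
constexpr std::size_t K_packed     = 2048 / 2;            // shape[1] on the wire: two FP4 values per byte
constexpr std::size_t K_logical    = K_packed * 2;        // recovered as tmp.K = hw->shape[1] * 2
constexpr std::size_t packed_bytes = N * K_packed;        // N * K/2  = 4 MiB here
constexpr std::size_t mscale_bytes = N * K_logical / 16;  // N * K/16 = 0.5 MiB here
static_assert(packed_bytes == 8 * mscale_bytes,
              "packed data is always 8x the micro-scales — consistent with the "
              "~10.9 GiB vs ~1.36 GiB split reported for Gemma-4 later in this PR");
```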

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…pert MoE path

* Test counts: 606 → ~700 tests, now across 63 files (the previously
  documented 58 files / 606 tests had drifted)
* Attention-dispatch overrides: point at imp.conf keys (attention.mxfp4
  / attention.fp8_fmha / attention.fmha_sm120) instead of legacy
  IMP_* env vars; note env vars still work as dev escape hatches
* NVFP4 prequant: document the non-obvious shape convention
  (shape[1] = K_logical/2, packed U8 wire dtype, Phase-0 promote
  preserves shape) and the per-expert MoE contiguous-build flow added
  in 615758a (cache_moe_native_nvfp4 + per-layer free of expert_w_*[e]
  to fit 32 GiB on 35B-A3B)
* Add LlmCompressorE2E breadcrumb to gtest_filter examples

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
For Gemma-4, engine.cpp was unconditionally setting
`use_nvfp4_decode = 0` to dodge unsupported attention CUTLASS NVFP4
paths under Gemma-4's per-layer head_dim (256 SWA / 512 global).
That over-broad disable also skipped Phase 3-MoE
(`cache_moe_native_nvfp4` in executor_pre_dequant.cu), which is the
only path that builds the contiguous per-layer NVFP4 expert buffer
needed by the M=1 decode fast path. Without it, Gemma-4-NVFP4
fell through the legacy FP16 fallback (sm_80 WMMA + D2H
expert-offsets sync per layer per token), which:

  - emitted `MoE prefill: legacy FP16 fallback path` per decode step
  - broke CUDA-graph capture at decode step 1 (D2H during capture; see the
    sketch after this list)
  - tanked tg256 to ~42 tok/s
  - left output stuck on the user's prompt instead of an answer
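
The underlying constraint is generic CUDA-graph behaviour, illustrated by this standalone sketch (not code from this repo): a blocking device-to-host copy issued while a stream is capturing is rejected by the runtime and the capture does not yield a usable graph, which is why the legacy per-layer expert_offsets readback and graph decode cannot coexist.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int* dev = nullptr;
    int host = 0;
    cudaMalloc(&dev, sizeof(int));

    cudaStream_t s;
    cudaStreamCreate(&s);
    cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal);

    // Synchronous D2H copy: uses the legacy stream and blocks the host,
    // which is disallowed while a capture is in progress.
    cudaError_t err = cudaMemcpy(&host, dev, sizeof(int), cudaMemcpyDeviceToHost);
    printf("memcpy during capture: %s\n", cudaGetErrorString(err));

    cudaGraph_t graph = nullptr;
    err = cudaStreamEndCapture(s, &graph);  // capture comes back invalidated, no usable graph
    printf("end capture: %s\n", cudaGetErrorString(err));
    return 0;
}
```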

For NVFP4-prequant SafeTensors models, Phase 3a (Q*_K → NVFP4) and
Phase 3b (NVFP4 → CUTLASS sm_120) iterate `wcache_.nvfp4`, which is
empty by construction for prequant. They are no-ops there. Only
Phase 3-MoE does load-bearing work. The "per-layer head_dim"
caveat applies to attention CUTLASS kernels, not MoE expert
caching, so leaving `use_nvfp4_decode` at its auto value is safe.

Patch: keep `use_nvfp4_decode` enabled when
`model.config().is_nvfp4_prequant` is true; preserve the original
disable for any other Gemma-4 variant.

Result on Gemma-4-26B-A4B-it-NVFP4 (RTX 5090, default config):

  - Phase-4 wcache: nvfp4_moe=0 → 90 (30 layers × 3 projections)
  - VRAM: ~10.9 GiB packed + 1.36 GiB micro-scales for the buffer
  - CUDA Graphs: capture succeeds (152 PDL edges, AsyncGraphLoop on)
  - Decode tg: 42.86 → 166 tok/s (~3.9× speedup)
  - Output `"What is the capital of France?"` (T=0, 128 tokens):
    `<|channel>thought\n…The capital of France is **Paris**.`
    (Gemma-4-it native CoT before final answer; coherent.)

Qwen3.6-NVFP4 unchanged (already coherent at tg=142 tok/s on this
branch via `615758a`). Mistral-3.2-NVFP4 unchanged.

CLAUDE.md: refresh the CUDA-Graphs note to reflect that
NVFP4-prequant MoE models capture cleanly post-fix; only legacy
GGUF MoE decode still needs the D2H expert-routing sync and is
graph-incompatible.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@kekzl kekzl merged commit f9eaf27 into main May 1, 2026
2 checks passed
@kekzl kekzl deleted the fix/qwen36-nvfp4-decode-fastpath branch May 1, 2026 00:18