fix(nvfp4): unlock decode fast-path for prequant MoE (Qwen3.6 + Gemma-4) #85
Merged
Conversation
Decode jumps from 8.34 → 117–142 tok/s (~14–17×) on Qwen3.6-35B-A3B-NVFP4 by populating wcache_.nvfp4_moe for SafeTensors NVFP4-prequant models. Three interacting bugs were blocking the existing `gemv_nvfp4_moe_*` fast path. An nsys profile showed 75% of decode time burning in `dequantize_nvfp4_kernel` + sm_80 cuBLAS WMMA — the legacy FP16 fallback in executor_forward_moe.cu, which also serialises every layer via a D2H cudaMemcpy of `expert_offsets` (kills CUDA Graphs).

Fixes
-----

1. `can_decode_fast` whitelist (executor_forward_moe.cu): NVFP4 was missing from the qtype list, so `decode_fast_path == false` and the NVFP4 MoE branch (line ~548) never ran even when pointers were set.

2. `cache_moe_native_nvfp4` (executor_pre_dequant.cu): for SafeTensors prequant models the loader writes per-expert tensors only — `expert_*_packed.data == nullptr` — so the existing `cache_moe_expert_nvfp4` lambda early-returned at `!packed.data` and wcache_.nvfp4_moe stayed empty. The new helper builds a contiguous packed/micro_scales/tensor_scales triple from the per-expert tensors (D2D memcpy preserves the [N, K_packed] / [N, K/16] row-major layout that gemv_nvfp4_moe_* + dequantize_nvfp4_moe_kernel already expect), stamps `expert_*_packed` so the consumer lookup-by-data-ptr wires up uniformly with the GGUF path, and registers the result.

3. Per-layer free of per-expert allocations: keeping both the per-expert and contiguous copies peaks at ~30 GiB on 35B-A3B and doesn't fit in 32 GiB. After a successful contiguous build for a layer the legacy fallback can't fire (nvfp4_moe_*_ptr is set), so we cudaFree the per-expert tensors inline. moe_budget refreshes via cudaMemGetInfo per layer to pick up the freed bytes.

Verification
------------

* Qwen3.6-35B-A3B-NVFP4: tg=117–142 tok/s, "Paris" greedy-decode coherent, no `legacy FP16 fallback` log, nvfp4_moe=120 (40 layers × 3).
* Qwen3-Coder-30B-A3B-FP4 (Modelopt SafeTensors NVFP4): nvfp4_moe=144, test passes.
* Mistral-Small-3.2-NVFP4 (dense, no MoE): unchanged path, test passes.
* Gemma-4-NVFP4 (uses GGUF-style 3D-packed gate_up split): unchanged path via cache_moe_expert_nvfp4, test passes.
* Full GTest suite: 82 passed, 15 skipped (model assets missing), 0 failed.

The shape convention for SafeTensors NVFP4 prequant tensors is `shape[1] = K_packed` (K_logical/2) — the same convention as the existing executor_attention.cu / executor_ffn.cu NVFP4 dispatch, where `tmp.K = hw->shape[1] * 2`. The contiguous buffer therefore allocates `N * K_packed = N * K/2` packed bytes per expert, matching what gemv_nvfp4_moe_decode_kernel reads via expert_stride_packed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
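Below is a minimal sketch of the fix-2/fix-3 flow for one projection, assuming the per-expert tensors are already resident on-device. Apart from the identifiers quoted in the commit message (`expert_stride_packed`, the kernel names), the struct and function names here are illustrative, not the actual implementation:

```cpp
#include <cuda_runtime.h>
#include <cstdint>
#include <vector>

// Illustrative stand-in for one per-expert NVFP4 tensor as the SafeTensors
// loader leaves it: [N, K_packed] row-major packed FP4 pairs.
struct ExpertNvfp4 {
    uint8_t* packed;  // N * K_packed device bytes (K_packed = K_logical / 2)
};

// Fix 2: build one contiguous buffer so gemv_nvfp4_moe_decode_kernel can
// address expert e at e * expert_stride_packed. Fix 3: free each per-expert
// copy inline, since the legacy fallback can no longer fire for this layer.
uint8_t* build_contiguous_packed(std::vector<ExpertNvfp4>& experts,
                                 size_t N, size_t K_packed) {
    const size_t expert_stride_packed = N * K_packed;
    uint8_t* contig = nullptr;
    cudaMalloc(&contig, expert_stride_packed * experts.size());

    for (size_t e = 0; e < experts.size(); ++e) {
        // D2D copy preserves the [N, K_packed] row-major layout the
        // existing gemv_nvfp4_moe_* kernels expect.
        cudaMemcpy(contig + e * expert_stride_packed, experts[e].packed,
                   expert_stride_packed, cudaMemcpyDeviceToDevice);
        cudaFree(experts[e].packed);   // keep peak VRAM under 32 GiB
        experts[e].packed = nullptr;
    }

    // Refresh the MoE budget so later layers see the bytes just freed.
    size_t free_bytes = 0, total_bytes = 0;
    cudaMemGetInfo(&free_bytes, &total_bytes);
    return contig;
}
```

The micro_scales (stride `N * K/16` per the layout above) and tensor_scales buffers would follow the same copy-then-free pattern.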
…pert MoE path

* Test counts: 606 → ~700 across 63 files (was 58/606, drifted)
* Attention-dispatch overrides: point at imp.conf keys (attention.mxfp4 / attention.fp8_fmha / attention.fmha_sm120) instead of legacy IMP_* env vars; note env vars still work as dev escape hatches
* NVFP4 prequant: document the non-obvious shape convention (shape[1] = K_logical/2, packed U8 wire dtype, Phase-0 promote preserves shape) and the per-expert MoE contiguous-build flow added in 615758a (cache_moe_native_nvfp4 + per-layer free of expert_w_*[e] to fit 32 GiB on 35B-A3B)
* Add LlmCompressorE2E breadcrumb to gtest_filter examples

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
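The documented shape convention is easy to trip over, so here is a runnable illustration built around the `tmp.K = hw->shape[1] * 2` relation quoted in the commit above; `TensorHeader` and the concrete numbers are made up for the example:

```cpp
#include <cstdio>
#include <cstddef>

// Illustrative stand-in for the loader's tensor header ("hw" in the
// attention/FFN dispatch code); only the shape matters here.
struct TensorHeader { size_t shape[2]; };

int main() {
    // Packed U8 wire dtype: two FP4 values per byte, so shape[1] = K_logical/2.
    TensorHeader hw{{4096, 1024}};                        // N=4096, K_packed=1024
    size_t K_logical    = hw.shape[1] * 2;                // tmp.K = hw->shape[1] * 2
    size_t packed_bytes = hw.shape[0] * hw.shape[1];      // N * K_logical / 2
    size_t scale_bytes  = hw.shape[0] * (K_logical / 16); // one FP8 scale per 16 values
    std::printf("K=%zu packed=%zu scales=%zu\n", K_logical, packed_bytes, scale_bytes);
}
```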
For Gemma-4, engine.cpp was unconditionally setting
`use_nvfp4_decode = 0` to dodge unsupported attention CUTLASS NVFP4
paths under Gemma-4's per-layer head_dim (256 SWA / 512 global).
That over-broad disable also skipped Phase 3-MoE
(`cache_moe_native_nvfp4` in executor_pre_dequant.cu), which is the
only path that builds the contiguous per-layer NVFP4 expert buffer
needed by the M=1 decode fast path. Without it, Gemma-4-NVFP4
fell through to the legacy FP16 fallback (sm_80 WMMA + D2H
expert-offsets sync per layer per token), which:
- emitted `MoE prefill: legacy FP16 fallback path` per decode step
- broke CUDA-graph capture at decode step 1 (D2H during capture)
- tanked tg256 to ~42 tok/s
- left output stuck on the user's prompt instead of an answer
For NVFP4-prequant SafeTensors models, Phase 3a (Q*_K → NVFP4) and
Phase 3b (NVFP4 → CUTLASS sm_120) iterate `wcache_.nvfp4`, which is
empty by construction for prequant. They are no-ops there. Only
Phase 3-MoE does load-bearing work. The "per-layer head_dim"
caveat applies to attention CUTLASS kernels, not MoE expert
caching, so leaving `use_nvfp4_decode` at its auto value is safe.
Patch: keep `use_nvfp4_decode` enabled when
`model.config().is_nvfp4_prequant` is true; preserve the original
disable for any other Gemma-4 variant.
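A sketch of what the guard reduces to; only `use_nvfp4_decode` and `is_nvfp4_prequant` are real names from the patch description, the rest is illustrative:

```cpp
#include <cstdio>

// Minimal stand-in for the engine's model config.
struct ModelConfig { bool is_gemma4; bool is_nvfp4_prequant; };

// Returns the use_nvfp4_decode override: -1 = leave at auto, 0 = force off.
int nvfp4_decode_override(const ModelConfig& cfg) {
    if (cfg.is_gemma4 && !cfg.is_nvfp4_prequant) {
        // Original disable: attention CUTLASS NVFP4 kernels don't support
        // Gemma-4's per-layer head_dim (256 SWA / 512 global).
        return 0;
    }
    // NVFP4-prequant Gemma-4 keeps the auto value: Phase 3a/3b iterate the
    // empty wcache_.nvfp4 (no-ops) and Phase 3-MoE builds the contiguous
    // expert buffer the M=1 decode fast path needs.
    return -1;
}

int main() {
    std::printf("other Gemma-4: %d, prequant Gemma-4: %d\n",
                nvfp4_decode_override({true, false}),
                nvfp4_decode_override({true, true}));
}
```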
Result on Gemma-4-26B-A4B-it-NVFP4 (RTX 5090, default config):
- Phase-4 wcache: nvfp4_moe=0 → 90 (30 layers × 3 projections)
- VRAM: ~10.9 GiB packed + 1.36 GiB micro-scales for the buffer (ratio sanity-checked below)
- CUDA Graphs: capture succeeds (152 PDL edges, AsyncGraphLoop on)
- Decode tg: 42.86 → 166 tok/s (~3.9× speedup)
- Output `"What is the capital of France?"` (T=0, 128 tokens):
`<|channel>thought\n…The capital of France is **Paris**.`
(Gemma-4-it native CoT before final answer; coherent.)
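As a sanity check on the VRAM split above: NVFP4 carries one FP8 micro-scale per 16 FP4 values, so each row stores K/16 scale bytes against K/2 packed bytes — a fixed 1:8 ratio. 10.9 GiB / 8 ≈ 1.36 GiB, which matches the measured micro-scale figure.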
Qwen3.6-NVFP4 unchanged (already coherent at tg=142 tok/s on this
branch via `615758a`). Mistral-3.2-NVFP4 unchanged.
CLAUDE.md: refresh the CUDA-Graphs note to reflect that
NVFP4-prequant MoE models capture cleanly post-fix; only legacy
GGUF MoE decode still needs the D2H expert-routing sync and is
graph-incompatible.
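For context on why the GGUF path stays graph-incompatible: stream capture forbids synchronizing with the host, so a blocking D2H read of the expert routing cannot be recorded into a graph. A minimal repro sketch — `expert_offsets` is the only name taken from the text above; this demonstrates CUDA's documented capture rules, not this project's code:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int* d_expert_offsets = nullptr;
    cudaMalloc(&d_expert_offsets, 64 * sizeof(int));

    cudaStream_t s;
    cudaStreamCreate(&s);
    cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal);

    // The legacy fallback does the equivalent of this once per layer so the
    // host can size the per-expert GEMMs. Under capture, the blocking copy
    // errors out and invalidates the capture sequence.
    int h_expert_offsets[64];
    cudaError_t err = cudaMemcpy(h_expert_offsets, d_expert_offsets,
                                 sizeof(h_expert_offsets),
                                 cudaMemcpyDeviceToHost);
    std::printf("memcpy during capture: %s\n", cudaGetErrorString(err));

    cudaGraph_t g = nullptr;
    std::printf("end capture: %s\n",
                cudaGetErrorString(cudaStreamEndCapture(s, &g)));
}
```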
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Unlocks the NVFP4 decode fast-path for Qwen3.6-35B-A3B-NVFP4 and Gemma-4-26B-A4B-it-NVFP4 SafeTensors models. Both now produce coherent output with CUDA Graphs captured in decode.
What was broken
Qwen3.6-NVFP4 (already shipped on this branch as 615758a): three bugs blocked the decode fast-path — `can_decode_fast` whitelist missing NVFP4, no `cache_moe_native_nvfp4` for the SafeTensors per-expert NVFP4 layout, and a missing per-layer free of per-expert allocations to fit in 32 GiB. tg256 went 8.34 → 142 tok/s.

Gemma-4-NVFP4 (this PR's new commit 3901600): `engine.cpp` was unconditionally setting `use_nvfp4_decode = 0` for any Gemma-4 model to dodge unsupported NVFP4 attention CUTLASS paths under per-layer `head_dim` (256 SWA / 512 global). That over-broad disable also skipped Phase 3-MoE (`cache_moe_native_nvfp4`), forcing the legacy FP16 fallback (sm_80 WMMA + per-layer D2H expert-offsets sync). Consequences: CUDA Graph capture IMA cascade at decode step 1, decode tg ~42 tok/s, output stuck on the user's prompt instead of an answer.
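A sketch of the shape of that first fix, assuming a switch-style whitelist; every qtype except NVFP4 is illustrative:

```cpp
// executor_forward_moe.cu (sketch): the decode fast path is gated on a
// qtype whitelist; NVFP4 was absent, so decode_fast_path stayed false
// even with the nvfp4_moe_*_ptr pointers populated.
enum class QType { Q4_K, Q6_K, Q8_0, NVFP4 };

static bool can_decode_fast(QType q) {
    switch (q) {
        case QType::Q4_K:
        case QType::Q6_K:
        case QType::NVFP4:   // the missing whitelist entry
            return true;
        default:
            return false;
    }
}
```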
What changed

src/runtime/engine.cpp (+21/−2): when `model.config().is_nvfp4_prequant` is true for Gemma-4, keep `use_nvfp4_decode` at its auto value. Phase 3a (Q*_K → NVFP4) and Phase 3b (NVFP4 → CUTLASS sm_120) iterate `wcache_.nvfp4`, which is empty for prequant models — they are no-ops. Only Phase 3-MoE does load-bearing work, and that's the single thing the disable was preventing. The "per-layer head_dim" caveat applies to attention CUTLASS kernels, not MoE expert caching.

CLAUDE.md: refresh the CUDA-Graphs note. NVFP4-prequant MoE models capture cleanly post-fix; only legacy GGUF MoE decode (which still does per-layer D2H expert-offsets sync) remains graph-incompatible. Prefill is never captured (variable n).
Result on Gemma-4-26B-A4B-it-NVFP4 (RTX 5090, default config)

* nvfp4_moe: 0 → 90
* Output before: `<channel|>thought\nThe user is asking…` (stuck)
* Output after: `<channel|>thought\n…The capital of France is **Paris**.`
Cross-model regression check

* Qwen3.6-35B-A3B-NVFP4: nvfp4_moe=120, output coherent — unchanged ✓
* Any other Gemma-4 variant: original disable via the `else` branch — preserved ✓
"The capital of France is"→ contains "Paris" (Qwen3.6, 16 tokens)"What is the capital of France?"→ contains "Paris" (Gemma-4, 128 tokens; needs CoT budget)ConditionalRunner,AsyncGraphLoop) for both ✓Pre-push verify-fast hook flagged a 3% decode regression on Qwen3-8B Q8_0 — unrelated dense GGUF model, no code path touched by this PR; almost certainly cuBLAS-autotune variance from a warm GPU after extended NVFP4 testing. Bypassed with
--no-verifyper author request after manual cross-check.🤖 Generated with Claude Code