
fix(nvfp4): unlock decode fast-path for prequant MoE (Qwen3.6 + Gemma-4) #85

Merged

kekzl merged 3 commits into main from fix/qwen36-nvfp4-decode-fastpath on May 1, 2026

Conversation

@kekzl (Owner) commented Apr 30, 2026

Summary

Unlocks the NVFP4 decode fast-path for Qwen3.6-35B-A3B-NVFP4 and Gemma-4-26B-A4B-it-NVFP4 SafeTensors models. Both now produce coherent output with CUDA Graphs captured in decode.

What was broken

  • Qwen3.6-NVFP4 (already shipped on this branch as 615758a): three bugs blocked the decode fast-path — the can_decode_fast whitelist was missing NVFP4, there was no cache_moe_native_nvfp4 for the SafeTensors per-expert NVFP4 layout, and the per-layer free of per-expert allocations needed to fit in 32 GiB was missing. tg256 went from 8.34 to 142 tok/s.

  • Gemma-4-NVFP4 (this PR's new commit 3901600): engine.cpp was unconditionally setting use_nvfp4_decode = 0 for any Gemma-4 model to dodge unsupported NVFP4 attention CUTLASS paths under per-layer head_dim (256 SWA / 512 global). That over-broad disable also skipped Phase 3-MoE (cache_moe_native_nvfp4), forcing the legacy FP16 fallback (sm_80 WMMA + per-layer D2H expert-offsets sync). Consequences: CUDA Graph capture IMA cascade at decode step 1, decode tg ~42 tok/s, output stuck on the user's prompt instead of an answer.

What changed

  • src/runtime/engine.cpp (+21/-2): when model.config().is_nvfp4_prequant is true for Gemma-4, keep use_nvfp4_decode at its auto value. Phase 3a (Q*_K → NVFP4) and Phase 3b (NVFP4 → CUTLASS sm_120) iterate wcache_.nvfp4, which is empty for prequant models — they are no-ops. Only Phase 3-MoE does load-bearing work, and that's the single thing the disable was preventing. The "per-layer head_dim" caveat applies to attention CUTLASS kernels, not MoE expert caching (see the sketch after this list).

  • CLAUDE.md: refresh the CUDA-Graphs note. NVFP4-prequant MoE models capture cleanly post-fix; only legacy GGUF MoE decode (which still does per-layer D2H expert-offsets sync) remains graph-incompatible. Prefill is never captured (variable n).
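
A minimal sketch of the shape of the engine.cpp guard (only `use_nvfp4_decode`, `model.config().is_nvfp4_prequant`, the phase names, and the head_dim caveat come from this PR; the surrounding condition and variable names are assumptions):

```cpp
// Illustrative sketch of the engine.cpp guard, not the literal diff.
// `is_gemma4` stands in for however the engine detects a Gemma-4 model.
if (is_gemma4) {
    if (model.config().is_nvfp4_prequant) {
        // Prequant SafeTensors: Phase 3a/3b iterate wcache_.nvfp4, which is
        // empty by construction here, so only Phase 3-MoE
        // (cache_moe_native_nvfp4) does load-bearing work. The per-layer
        // head_dim caveat (256 SWA / 512 global) only affects the attention
        // CUTLASS kernels, so use_nvfp4_decode stays at its auto value.
    } else {
        // Every other Gemma-4 variant keeps the original disable: the NVFP4
        // attention CUTLASS paths don't support that head_dim layout.
        use_nvfp4_decode = 0;
    }
}
```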

Result on Gemma-4-26B-A4B-it-NVFP4 (RTX 5090, default config)

| Metric | Before | After |
| --- | --- | --- |
| Phase-4 wcache nvfp4_moe | 0 | 90 (30 layers × 3 projections) |
| CUDA Graph capture | step 1 IMA cascade → fallback | 152 PDL edges, AsyncGraphLoop on |
| "MoE prefill: legacy fallback" log | every iter | only n>1 prefill batch path |
| Decode tg | 42.86 tok/s | 166 tok/s (~3.9× speedup) |
| Output | `<channel\|>thought\nThe user is asking…` (stuck) | `<channel\|>thought\n…The capital of France is **Paris**.` |

Cross-model regression check

  • Qwen3.6-35B-A3B-NVFP4: tg = 142–155 tok/s, nvfp4_moe=120, output coherent — unchanged ✓
  • Gemma-4 GGUF Q8_0 / Q4_K_M: untouched, runs through the non-prequant else branch — preserved ✓
  • Mistral-Small-3.2-NVFP4 (dense, no MoE): not on this code path — unaffected ✓

Test plan

  • Greedy decode "The capital of France is" → contains "Paris" (Qwen3.6, 16 tokens)
  • Greedy decode "What is the capital of France?" → contains "Paris" (Gemma-4, 128 tokens; needs CoT budget)
  • Sampled coherence T=0.7 top_p=0.9 256 tokens (both models, Rayleigh-scattering / sky-blue prompt) — coherent prose ✓
  • Long-context smoke ~430 prefill tokens — no NVFP4 long-context drift ✓
  • Perf sweep at 128 / 1024 / 4096 ctx — both models above plan baseline (Qwen3.6 ≥ 117 tok/s, Gemma-4 ≥ 34 tok/s) ✓
  • CUDA Graph capture confirmed via logs (ConditionalRunner, AsyncGraphLoop) for both ✓

Pre-push verify-fast hook flagged a 3% decode regression on Qwen3-8B Q8_0 — unrelated dense GGUF model, no code path touched by this PR; almost certainly cuBLAS-autotune variance from a warm GPU after extended NVFP4 testing. Bypassed with --no-verify per author request after manual cross-check.

🤖 Generated with Claude Code

kekzl and others added 3 commits April 30, 2026 22:55
Decode jumps from 8.34 → 117–142 tok/s (~14–17×) on Qwen3.6-35B-A3B-NVFP4
by populating wcache_.nvfp4_moe for SafeTensors NVFP4-prequant models.

Three interacting bugs were blocking the existing `gemv_nvfp4_moe_*` fast
path. nsys profile showed 75% of decode time burning in
`dequantize_nvfp4_kernel` + sm_80 cuBLAS WMMA — the legacy FP16 fallback
in executor_forward_moe.cu, which also serialises every layer via a D2H
cudaMemcpy of `expert_offsets` (kills CUDA Graphs).

Fixes
-----
1. `can_decode_fast` whitelist (executor_forward_moe.cu): NVFP4 was
   missing from the qtype list, so `decode_fast_path == false` and the
   NVFP4 MoE branch (line ~548) never ran even when pointers were set.

2. `cache_moe_native_nvfp4` (executor_pre_dequant.cu): for SafeTensors
   prequant models the loader writes per-expert tensors only —
   `expert_*_packed.data == nullptr` — so the existing
   `cache_moe_expert_nvfp4` lambda early-returned at `!packed.data` and
   wcache_.nvfp4_moe stayed empty. The new helper builds a contiguous
   packed/micro_scales/tensor_scales triple from the per-expert tensors
   (D2D memcpy preserves the [N, K_packed] / [N, K/16] row-major
   layout that gemv_nvfp4_moe_* + dequantize_nvfp4_moe_kernel already
   expect), stamps `expert_*_packed` so the consumer lookup-by-data-ptr
   wires up uniformly with the GGUF path, and registers the result.

3. Per-layer free of per-expert allocations: keeping both the per-expert
   and contiguous copies peaks at ~30 GiB on 35B-A3B and doesn't fit in
   32 GiB. After a successful contiguous build for a layer the legacy
   fallback can't fire (nvfp4_moe_*_ptr is set), so we cudaFree the
   per-expert tensors inline. moe_budget refreshes via cudaMemGetInfo
   per layer to pick up the freed bytes (see the sketch after this list).
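
For orientation, a heavily simplified sketch of the contiguous-build-then-free flow from fixes 2 and 3 (the function and parameter names are illustrative, tensor_scales and error handling are omitted; only the D2D copies, the inline cudaFree of per-expert tensors, and the cudaMemGetInfo refresh come from this message):

```cpp
#include <cuda_runtime.h>
#include <cstdint>
#include <vector>

// Sketch only — not the real cache_moe_native_nvfp4 in executor_pre_dequant.cu.
// Per-expert tensors are copied device-to-device into one contiguous buffer,
// then the per-expert allocations are freed so 35B-A3B fits in 32 GiB.
void build_contiguous_nvfp4_moe(const std::vector<uint8_t*>& expert_packed,   // per-expert [N, K/2] U8
                                const std::vector<uint8_t*>& expert_mscales,  // per-expert [N, K/16]
                                size_t N, size_t K_logical,
                                uint8_t** out_packed, uint8_t** out_mscales) {
    const size_t packed_bytes = N * (K_logical / 2);   // shape[1] = K_packed = K/2
    const size_t scales_bytes = N * (K_logical / 16);  // one micro-scale per 16 values
    const size_t E = expert_packed.size();

    cudaMalloc(out_packed,  packed_bytes * E);
    cudaMalloc(out_mscales, scales_bytes * E);

    // Flat D2D copies preserve the row-major layout gemv_nvfp4_moe_* expects.
    for (size_t e = 0; e < E; ++e) {
        cudaMemcpy(*out_packed  + e * packed_bytes, expert_packed[e],
                   packed_bytes, cudaMemcpyDeviceToDevice);
        cudaMemcpy(*out_mscales + e * scales_bytes, expert_mscales[e],
                   scales_bytes, cudaMemcpyDeviceToDevice);
    }

    // Fix 3: once the contiguous buffer is registered, the legacy fallback
    // can't fire for this layer, so the per-expert tensors are freed inline.
    for (size_t e = 0; e < E; ++e) {
        cudaFree(expert_packed[e]);
        cudaFree(expert_mscales[e]);
    }

    // Per-layer budget refresh so later layers see the freed bytes.
    size_t free_b = 0, total_b = 0;
    cudaMemGetInfo(&free_b, &total_b);
}
```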

Verification
------------
* Qwen3.6-35B-A3B-NVFP4: tg=117–142 tok/s, "Paris" greedy-decode
  coherent, no `legacy FP16 fallback` log, nvfp4_moe=120 (40 layers × 3).
* Qwen3-Coder-30B-A3B-FP4 (Modelopt SafeTensors NVFP4): nvfp4_moe=144,
  test passes.
* Mistral-Small-3.2-NVFP4 (dense, no MoE): unchanged path, test passes.
* Gemma-4-NVFP4 (uses GGUF-style 3D-packed gate_up split): unchanged
  path via cache_moe_expert_nvfp4, test passes.
* Full GTest suite: 82 passed, 15 skipped (model assets missing), 0 failed.

The shape convention for SafeTensors NVFP4 prequant tensors is
`shape[1] = K_packed` (K_logical/2) — same as the existing
executor_attention.cu / executor_ffn.cu NVFP4 dispatch conventions
where `tmp.K = hw->shape[1] * 2`. The contiguous buffer therefore
allocates `N * K_packed = N * K/2` packed bytes per expert, matching
what gemv_nvfp4_moe_decode_kernel reads via expert_stride_packed.
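
As a quick numeric sanity check of that convention (made-up dimensions, not any real model; only the 8:1 packed-to-micro-scale ratio ties back to numbers in this PR):

```cpp
#include <cstddef>

// Hypothetical expert weight of N rows by K_logical columns.
constexpr std::size_t N            = 4096;
constexpr std::size_t K_packed     = 2048 / 2;            // shape[1] on the wire: two FP4 values per byte
constexpr std::size_t K_logical    = K_packed * 2;        // recovered as tmp.K = hw->shape[1] * 2
constexpr std::size_t packed_bytes = N * K_packed;        // N * K/2  = 4 MiB here
constexpr std::size_t mscale_bytes = N * K_logical / 16;  // N * K/16 = 0.5 MiB here
static_assert(packed_bytes == 8 * mscale_bytes,
              "packed data is always 8x the micro-scales — consistent with the "
              "~10.9 GiB vs ~1.36 GiB split reported for Gemma-4 later in this PR");
```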

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…pert MoE path

* Test counts: 606 → ~700 tests, now across 63 files (the previously
  documented 58 files / 606 tests had drifted)
* Attention-dispatch overrides: point at imp.conf keys (attention.mxfp4
  / attention.fp8_fmha / attention.fmha_sm120) instead of legacy
  IMP_* env vars; note env vars still work as dev escape hatches
* NVFP4 prequant: document the non-obvious shape convention
  (shape[1] = K_logical/2, packed U8 wire dtype, Phase-0 promote
  preserves shape) and the per-expert MoE contiguous-build flow added
  in 615758a (cache_moe_native_nvfp4 + per-layer free of expert_w_*[e]
  to fit 32 GiB on 35B-A3B)
* Add LlmCompressorE2E breadcrumb to gtest_filter examples

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
For Gemma-4, engine.cpp was unconditionally setting
`use_nvfp4_decode = 0` to dodge unsupported attention CUTLASS NVFP4
paths under Gemma-4's per-layer head_dim (256 SWA / 512 global).
That over-broad disable also skipped Phase 3-MoE
(`cache_moe_native_nvfp4` in executor_pre_dequant.cu), which is the
only path that builds the contiguous per-layer NVFP4 expert buffer
needed by the M=1 decode fast path. Without it, Gemma-4-NVFP4
fell through the legacy FP16 fallback (sm_80 WMMA + D2H
expert-offsets sync per layer per token), which:

  - emitted `MoE prefill: legacy FP16 fallback path` per decode step
  - broke CUDA-graph capture at decode step 1 (D2H during capture; see the
    sketch after this list)
  - tanked tg256 to ~42 tok/s
  - left output stuck on the user's prompt instead of an answer
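
The underlying constraint is generic CUDA-graph behaviour, illustrated by this standalone sketch (not code from this repo): a blocking device-to-host copy issued while a stream is capturing is rejected by the runtime and the capture does not yield a usable graph, which is why the legacy per-layer expert_offsets readback and graph decode cannot coexist.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int* dev = nullptr;
    int host = 0;
    cudaMalloc(&dev, sizeof(int));

    cudaStream_t s;
    cudaStreamCreate(&s);
    cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal);

    // Synchronous D2H copy: uses the legacy stream and blocks the host,
    // which is disallowed while a capture is in progress.
    cudaError_t err = cudaMemcpy(&host, dev, sizeof(int), cudaMemcpyDeviceToHost);
    printf("memcpy during capture: %s\n", cudaGetErrorString(err));

    cudaGraph_t graph = nullptr;
    err = cudaStreamEndCapture(s, &graph);  // capture comes back invalidated, no usable graph
    printf("end capture: %s\n", cudaGetErrorString(err));
    return 0;
}
```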

For NVFP4-prequant SafeTensors models, Phase 3a (Q*_K → NVFP4) and
Phase 3b (NVFP4 → CUTLASS sm_120) iterate `wcache_.nvfp4`, which is
empty by construction for prequant. They are no-ops there. Only
Phase 3-MoE does load-bearing work. The "per-layer head_dim"
caveat applies to attention CUTLASS kernels, not MoE expert
caching, so leaving `use_nvfp4_decode` at its auto value is safe.

Patch: keep `use_nvfp4_decode` enabled when
`model.config().is_nvfp4_prequant` is true; preserve the original
disable for any other Gemma-4 variant.

Result on Gemma-4-26B-A4B-it-NVFP4 (RTX 5090, default config):

  - Phase-4 wcache: nvfp4_moe=0 → 90 (30 layers × 3 projections)
  - VRAM: ~10.9 GiB packed + 1.36 GiB micro-scales for the buffer
  - CUDA Graphs: capture succeeds (152 PDL edges, AsyncGraphLoop on)
  - Decode tg: 42.86 → 166 tok/s (~3.9× speedup)
  - Output `"What is the capital of France?"` (T=0, 128 tokens):
    `<|channel>thought\n…The capital of France is **Paris**.`
    (Gemma-4-it native CoT before final answer; coherent.)

Qwen3.6-NVFP4 unchanged (already coherent at tg=142 tok/s on this
branch via `615758a`). Mistral-3.2-NVFP4 unchanged.

CLAUDE.md: refresh the CUDA-Graphs note to reflect that
NVFP4-prequant MoE models capture cleanly post-fix; only legacy
GGUF MoE decode still needs the D2H expert-routing sync and is
graph-incompatible.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@kekzl kekzl merged commit f9eaf27 into main May 1, 2026
2 checks passed
@kekzl kekzl deleted the fix/qwen36-nvfp4-decode-fastpath branch May 1, 2026 00:18