fix(nvfp4): light up CUTLASS NVFP4×NVFP4 prefill for prequant SafeTensors#88
Merged
Conversation
Phase 0 in executor_pre_dequant.cu sets `Tensor.qtype = NVFP4` directly on
the main weight tensors for prequant SafeTensors models (Mistral-3.2-NVFP4,
Gemma-4-NVFP4 llm-compressor, Qwen3.6-NVFP4 dense layers, etc.). But the
existing CUTLASS cache build (Phase 3b) only iterates `wcache_.nvfp4` —
the legacy decode-cache map populated when FP16/Q*_K weights are
runtime-quantized to NVFP4. Prequant tensors were never in `wcache_.nvfp4`, so
they were never converted to CUTLASS layout, and prefill (M>1) fell
through to `gemm_nvfp4` dequant→cuBLAS at executor_kernels.cu:1975 (dispatch
sketch below).
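
A minimal dispatch-side sketch of the miss. Everything here is a hypothetical stand-in except `gemm_nvfp4`, the M>1 condition, and the data-pointer keying, which come from this PR's description:

```cpp
#include <cstddef>
#include <unordered_map>

// Hypothetical stand-ins for the engine types referenced above.
struct Tensor { const void* data = nullptr; /* + NVFP4 sidecars */ };
struct CutlassEntry { /* CUTLASS-layout FP4 weight + SfAtom UE4M3 scales */ };

// CUTLASS prefill cache, keyed by the weight's raw data pointer.
std::unordered_map<const void*, CutlassEntry> cutlass_cache;

void gemm_nvfp4_cutlass(const CutlassEntry&, const void*, void*, int, int, int) {}
void gemm_nvfp4(const Tensor&, const void*, void*, int, int, int) {} // dequant->FP16 + cuBLAS

// Pre-fix behaviour: prequant weights never got a cutlass_cache entry,
// so every M>1 prefill matmul took the slow fallback branch.
void matmul_dispatch(const Tensor& w, const void* act, void* out, int M, int N, int K) {
    auto it = cutlass_cache.find(w.data);
    if (M > 1 && it != cutlass_cache.end())
        gemm_nvfp4_cutlass(it->second, act, out, M, N, K); // NVFP4×NVFP4 prefill
    else
        gemm_nvfp4(w, act, out, M, N, K);                  // "slow dequant-to-FP16 fallback"
}
```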
Added a Phase 0b loop right after the Phase 0 promote: it walks each model
layer's dense slots (wq/wk/wv/wo, w_gate/w_up/w_down, w_*_shared) and
out_proj_, synthesises an NvFP4QuantResult from the Tensor sidecars, calls
convert_nvfp4_to_cutlass(), and registers the result keyed by Tensor.data
(registration sketch below). Per-expert MoE weights (expert_w_*) are
unaffected — those use the CUTLASS 3.x grouped GEMM path, which is
independent of cutlass_nvfp4.
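
A sketch of what the Phase 0b registration looks like under the same assumptions. The slot names, `NvFP4QuantResult`, and `convert_nvfp4_to_cutlass()` are from the description above; the types and field names are invented for illustration:

```cpp
#include <unordered_map>
#include <vector>

// Same hypothetical types as the dispatch sketch above.
struct Tensor { const void* data = nullptr; const void* scales = nullptr;
                const void* weight_scale_2 = nullptr; bool is_nvfp4 = false; };
struct NvFP4QuantResult { const void* packed; const void* block_scales; const void* global_scale; };
struct CutlassEntry {};
CutlassEntry convert_nvfp4_to_cutlass(const NvFP4QuantResult&) { return {}; }

struct Layer { std::vector<Tensor*> dense_slots; }; // wq/wk/wv/wo, w_gate/w_up/w_down, w_*_shared
std::unordered_map<const void*, CutlassEntry> cutlass_cache;

// Phase 0b (sketch): register every prequant NVFP4 dense weight, mirroring
// what Phase 3b already does for runtime-quantized weights.
void build_cutlass_cache_phase0b(std::vector<Layer>& layers, Tensor& out_proj) {
    auto register_prequant = [](Tensor& t) {
        if (!t.is_nvfp4) return;
        // Synthesise the quant result from the tensor's sidecars instead of
        // re-quantizing: the SafeTensors file already ships FP4 data + scales.
        NvFP4QuantResult q{t.data, t.scales, t.weight_scale_2};
        cutlass_cache[t.data] = convert_nvfp4_to_cutlass(q);
    };
    for (Layer& layer : layers)
        for (Tensor* slot : layer.dense_slots) register_prequant(*slot);
    register_prequant(out_proj); // per-expert MoE (expert_w_*) intentionally skipped
}
```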
Measured (Mistral-3.2-NVFP4, RTX 5090, Lorem×11 = 95-token prefill):
- Prefill: 283 → 3122 tok/s (11×; previously hit the "slow dequant-to-FP16
  fallback" warning on every prefill)
- Decode: 37 → 48 tok/s (+30%; decode also benefits from the cache hits)
- Output: "a long established in 199999999999" (numerical garbage)
  → "a dolor sit amet, consectetur adipiscing elit, Quis..."
  (coherent Latin continuing the prefix)
- Short-prompt regression: "The capital of France is" still produces
  "Paris. It is located in the north of the country, on the river" ✓
- Gemma-4-NVFP4: unaffected ✓
- Qwen3-Coder-30B-A3B-Instruct-FP4 (Modelopt MoE): "Paris." ✓ (different
  code path — CUTLASS 3.x grouped — verified untouched)
Note: this is a partial fix for the long-context NVFP4 regression.
Mistral-3.2 long English prose (≥250 tokens) still doesn't always reach
the right answer because of the residual SmoothQuant + FP16-act mismatch.
The recipe-intended path (load per-Linear input_scale + apply as static
prescale) remains open work — see TODO.md.
Memo: memory/nvfp4_prequant_cutlass_cache_2026_05_01.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
kekzl added a commit that referenced this pull request on May 1, 2026
PR #88 lights up the CUTLASS NVFP4×NVFP4 prefill path for prequant SafeTensors (Phase 0b in executor_pre_dequant.cu). Mistral-3.2-NVFP4 prefill 283 → 3122 tok/s; long-context output coherence goes from numerical-hash garbage to coherent (if context-attracted) text.

Long English prose ≥250 tokens still doesn't always reach the LM-head answer — that's the residual SmoothQuant + FP16-act mismatch, which needs the recipe-intended dynamic NVFP4 input activation quantization path (per-Linear input_scale prescale).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tats

IMP_AUDIT_NVFP4_SCALES already dumped weight_scale_2 stats; extend it to also walk input_scale across all NVFP4 prequant Linears and report count, range, mean, and sample values. Useful for distinguishing models that ship per-Linear calibrated input scales (Mistral-3.2-NVFP4: 280/280 present, range 12-8832, mean 2167) from models with purely dynamic input act-quant (Gemma-4-NVFP4 llm-compressor: no input_scale tensors).

Diagnostic only — input_scale is loaded and uploaded to GPU but is not consumed by the dispatch path. For NVFP4 group-size=16 with dynamic local act-quant, the per-block dynamic scale absorbs the static input_scale without numerical difference (mathematically equivalent to prescaling the activation), so wiring input_scale into the GEMM alpha would be an FP no-op; a toy numerical demonstration follows below. Documented for future investigation if a model surfaces where the static prescale matters (e.g. a recipe with static rather than dynamic-local input act-quant).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
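
A toy numerical demonstration of that absorption argument (not the engine's quantizer: a double-precision E2M1 grid, one group-size-16 block, and the Mistral-3.2 mean `input_scale` quoted above as the prescale):

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstdio>

// Nearest representable FP4 (E2M1) magnitude: {0, .5, 1, 1.5, 2, 3, 4, 6}.
double round_fp4(double v) {
    static const double grid[] = {0, 0.5, 1, 1.5, 2, 3, 4, 6};
    double a = std::fabs(v), best = 0;
    for (double g : grid)
        if (std::fabs(g - a) < std::fabs(best - a)) best = g;
    return v < 0 ? -best : best;
}

int main() {
    // One group-size-16 activation block.
    std::array<double, 16> x = { 0.8, -1.2, 3.4, 0.1, -2.2, 5.0, 0.0, 1.1,
                                 -0.6, 2.9, -4.4, 0.3, 1.8, -0.9, 2.0, -3.3 };
    const double s = 2167.0; // static input_scale (the Mistral-3.2 mean above)

    double amax = 0;
    for (double xi : x) amax = std::max(amax, std::fabs(xi));

    for (double xi : x) {
        // Path A: dynamic local act-quant of x as-is.
        double scale_a = amax / 6.0;            // per-block dynamic scale
        double qa = round_fp4(xi / scale_a);

        // Path B: recipe-style prescale by 1/s, quantize, restore s as GEMM alpha.
        double scale_b = (amax / s) / 6.0;      // dynamic scale of the prescaled block
        double qb = round_fp4((xi / s) / scale_b);

        assert(qa == qb);                                             // identical FP4 codes
        assert(std::fabs(qa * scale_a - (qb * scale_b) * s) < 1e-12); // identical dequant
    }
    std::puts("static input_scale fully absorbed by the dynamic per-block scale");
}
```

The codes match because dividing both the activation and its block amax by the same `s` leaves the ratio fed to the rounder unchanged; the `alpha = s` rescale then cancels the prescale in the dequant.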
kekzl added a commit that referenced this pull request on May 1, 2026
Standard `pp512/tg256` bench (RTX 5090, default flags incl. CUDA graphs ON) for all four prequant SafeTensors models post PR #88:

| Model | tg256 | pre-fix | Δ |
|---|---:|---:|---|
| Mistral-3.2-NVFP4 | 101 | 81 | +25% |
| Qwen3.6-35B-A3B-NVFP4 | 217 | 117-142 | +50-85% |
| Gemma-4-26B-A4B-NVFP4 | 213 | 157-180 | +18-35% |
| Qwen3-Coder-30B-A3B-NVFP4 | 272 | 51 (--no-graphs) | 5.3× |

The Qwen3-Coder jump exposes a previously undocumented secondary effect of PR #88: lighting up the CUTLASS prefill path also makes CUDA graphs **safe by default** for prequant SafeTensors. The previous "--no-cuda-graphs required" workaround was forced by the dequant→cuBLAS fallback's per-prefill 40 MiB FP16 scratch allocation, which is graph-incompatible (see the capture sketch below). With Phase 0b lit up, that path no longer fires on the prefill hot loop.

Updates:
- BENCHMARKS.md decode + prefill tables for all four models
- CHANGELOG.md gets PR #88 as the lead fixed-section entry; the stale "Qwen3-Coder still requires --no-cuda-graphs" note is replaced with the graphs-now-safe statement
- TODO.md NVFP4-long-context bug section gets the proper bench matrix

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
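
For context, a minimal sketch of the graph-capture conflict described above. Under `cudaStreamCaptureModeGlobal`, a synchronous `cudaMalloc` issued during capture is a capture-unsafe call and fails instead of being recorded (the 40 MiB figure is the fallback's scratch size from above):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);

    // A per-prefill scratch allocation like the fallback's 40 MiB FP16 buffer.
    void* scratch = nullptr;
    cudaError_t err = cudaMalloc(&scratch, 40u << 20);
    // Synchronous allocation is a capture-unsafe call: it fails here instead
    // of being recorded into the graph.
    std::printf("cudaMalloc during capture: %s\n", cudaGetErrorString(err));

    cudaGraph_t graph = nullptr;
    err = cudaStreamEndCapture(stream, &graph);
    std::printf("end capture: %s\n", cudaGetErrorString(err));
    // Removing the allocation from the prefill hot loop lets the whole step
    // capture cleanly, so graphs can stay ON by default.
    return 0;
}
```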
kekzl added a commit that referenced this pull request on May 3, 2026
…tale numbers

Final pass on the docs/ files left untouched in 81fdea3.

### docs/RECOMMENDED_MODELS.md
- All NVFP4 numbers refreshed to the post-PR#88 state: Qwen3-Coder-30B 51 -> 272, Qwen3.6 117-142 -> 217, Gemma-4 157-180 -> 213, Mistral-3.2 81 -> 101.
- Drop the "--no-cuda-graphs for coherence" caveat (PR #88 made graphs safe by default for prequant SafeTensors).
- Add workstation Blackwell GPU header note + NVFP4-as-primary framing.
- MoE table reordered to lead with the NVFP4 prequant rows.
- Add Qwen3-30B-A3B-NVFP4-Modelopt as the Mistral-3.2 long-context replacement.

### docs/SM120_OPTIMIZATION_STATUS.md
- Header now lists all three GB202 cards (RTX 5090 / PRO 5000 / 6000).
- "What Would Actually Help Decode" updated: NVFP4 prequant is the primary win, not speculative decoding (which TODO.md says is abandoned).
- Tested-models table refreshed with post-#88 numbers; "CUDA graphs for non-fast-path MoE: disabled" stays accurate, but the NVFP4 prequant row now says graphs capture end-to-end.
- "Project B Stage 5" name dropped from the open-items table (the PROJECT_B doc was removed earlier); renamed to a plain "mxf4nvf4.block_scale MMA integration" line item.
- Stale "engine.cpp:547" line ref replaced with the actual issue description (per-layer head_dim FP8 KV write/read kernels).

### docs/MXFP4_QUANTIZATION.md -> docs/quantization.md
- The title misled (mostly NVFP4 content). Renamed to the broader scope and rewritten as a concise "where to get models for each path" guide.
- NVFP4 (primary), MXFP4 (CUTLASS-internal attention only), GGUF K-quants (legacy), and the other KV quants (FP8 / INT8 / INT4 / TurboQuant) each get a focused paragraph with current status and caveats (Mistral-3.2 long-prose, Gemma-4 NVFP4 native tool calls).
- Inference-pipeline detail removed — that's CLAUDE.md's job.
- Stale Qwen3-Coder "38 tok/s" -> current "272 tok/s post #88" context.

### docs/memory-management-comparison.md
- "imp supports CUDA only (Hopper, Blackwell)" was wrong — imp is sm_120f only. Corrected throughout.
- "Blackwell-native features (PDL, Green Contexts, TCGEN05)": imp uses register-based mma.sync, NOT TCGEN05. Removed the false claim, noted the actual MMA shapes.
- KV format table extended: was "FP16, FP8 E4M3", now "FP16 (default), FP8, INT8, INT4, NVFP4, TurboQuant" — reflects what the engine actually supports.
- mmap hints: was "MADV_SEQUENTIAL", now "MAP_POPULATE + MADV_WILLNEED + MADV_SEQUENTIAL" (post-PR #97 cold-cache prefault).
- Pinned pool: was "64 MiB", now "4x128 MiB ring" (post-PR #97 upload pipeline).
- Speculative-decoding cell rewritten to match TODO.md's abandoned-options state instead of the generic "draft+target with KV block rollback".
- Decision matrix: was "Single H100/B200, latency-critical, one model -> imp", but imp doesn't target H100/B200. Replaced with a Blackwell GB202 entry + a "datacenter Hopper / B200/B300 -> use vLLM/TensorRT-LLM" line.
- Native function-calling row added (was missing entirely; new feature post #97).

### docs/memory-traffic-reduction-catalog.md
- W2 EAGLE-3: was "historical dead end, worth a revisit", which contradicts TODO.md's definitive abandoned-options list. Updated to "abandoned — all variants tested, single-5090 decode is bandwidth-bound".
- A6 Fused MoE Routing: was "partial (Gemma-4 fast path)" — outdated. Now "present for NVFP4 prequant MoE", covering Qwen3.6/Gemma-4/Coder-30B; legacy GGUF MoE called out as the remaining open item.
- The "counterintuitive finding: Q6_K beats NVFP4 on decode at 30B" was flipped post-PR#88: NVFP4 is now 272 tok/s vs Q6_K's 234. Section rewritten to explain that the old gap was an implementation artifact, not a format tradeoff. NVFP4 is now the default recommendation.
- Top-candidates list: dropped the W2 EAGLE revisit + W3 Medusa; added A6-generalize-for-GGUF and the mxf4nvf4 MMA integration.
- "What was gained" section prepended with the headline NVFP4 prequant decode jump and the cold-start reduction from PR #97.
- Old `memory/...md` reference list dropped (those are auto-memory files, not in the repo); replaced with pointers to the surviving docs.

### README.md
- Doc-index pointer updated: docs/MXFP4_QUANTIZATION.md -> docs/quantization.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
github-actions Bot pushed a commit that referenced this pull request on May 9, 2026
All decode + prefill numbers re-measured 2026-05-10 with imp:test post PR #156 (chunked-prefill-hybrid) + PR #157 (auto max_seq_len 16K cap). Notable changes vs the prior table:
- Qwen3-4B / Qwen3-8B Q8_0 decode dropped (401→236, 255→149) — the older numbers predate a regression introduced somewhere between #88 and #150. Current state is consistent with tests/perf_baseline.json (~150 tok/s).
- Llama-3.2-3B decode improved (208→306).
- Qwen3.6-35B Q4_K_M decode +70% (143→243), reflecting the PR #150/#151 MoE graphs-gate + fp32_down pre-alloc wins.
- Q3.6-NVFP4 prefill +82% (601→1092).
- Added a Nemotron-3-Nano-30B-A3B NVFP4 row (325 tg256 / 690 pp512).
- Mistral-Small-3.2 NVFP4 + Gemma-4 Q5_K_M kept as italic stale (model files not present locally — last numbers retained as historical).

README headline rewritten — the single "Qwen3-8B at 255 tok/s vs 1.6× llama.cpp" claim no longer holds; replaced with multi-model decode highlights citing the current numbers. The long-context prefill table (pp1024-pp8192) and the KV-cache-quant table (Llama-3.2-3B specific) were NOT re-measured this round; they still reflect the v0.7 / PR #51 measurement series.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Summary

Phase 0 in `executor_pre_dequant.cu` sets `Tensor.qtype = NVFP4` directly on the main weight tensors for prequant SafeTensors models (Mistral-3.2-NVFP4, Gemma-4-NVFP4 llm-compressor, Qwen3.6-NVFP4 dense layers, Qwen3-Coder-30B-A3B-Instruct-FP4 dense+attention). But the existing CUTLASS cache build (Phase 3b) only iterated `wcache_.nvfp4` — the legacy decode-cache map populated when FP16 / Q*_K weights are runtime-quantized to NVFP4. Prequant tensors were never in `wcache_.nvfp4`, so they were never converted to CUTLASS layout, and prefill (M>1) fell through to `gemm_nvfp4` dequant → cuBLAS at `executor_kernels.cu:1975`. That fallback's per-prefill scratch allocation is also why prequant SafeTensors models needed `--no-cuda-graphs` for stability — runtime allocs aren't graph-compatible.

Added a Phase 0b loop right after the Phase 0 promote: it walks each model layer's dense slots (`wq/wk/wv/wo`, `w_gate/w_up/w_down`, `w_*_shared`) and `out_proj_`, synthesises an `NvFP4QuantResult` from the Tensor sidecars, calls `convert_nvfp4_to_cutlass()`, and registers the result keyed by `Tensor.data`. Per-expert MoE weights (`expert_w_*`) are unaffected — those use the CUTLASS 3.x grouped GEMM path, which is independent of `cutlass_nvfp4`.

Second commit on this branch: `IMP_AUDIT_NVFP4_SCALES` extension to also dump `input_scale` stats. Diagnostic only — `input_scale` is loaded, but the per-block-dynamic CUTLASS quantizer makes it an FP no-op for `dynamic: local` recipes, so we don't wire it into the dispatch.

## Measured impact
Standard `pp512/tg256` bench, RTX 5090, default flags (CUDA graphs ON):

| Model | tg256 | pre-fix | Δ |
|---|---:|---:|---|
| Mistral-3.2-NVFP4 | 101 | 81 | +25% |
| Qwen3.6-35B-A3B-NVFP4 | 217 | 117-142 | +50-85% |
| Gemma-4-26B-A4B-NVFP4 | 213 | 157-180 | +18-35% |
| Qwen3-Coder-30B-A3B-NVFP4 | 272 | 51 (--no-graphs) | 5.3× |

The Qwen3-Coder 5.3× jump is mostly from CUDA graphs becoming safe by default — the dequant fallback's per-prefill 40 MiB FP16 scratch allocation was graph-incompatible. With Phase 0b lit up, graphs work end-to-end on prequant SafeTensors. The `--no-cuda-graphs` workaround for Qwen3-Coder NVFP4 is no longer needed.

Lorem×11 prompt (95-token prefill) on Mistral-3.2-NVFP4:
- before: `" a long established in 199999999999"` (numerical garbage)
- after: `" a dolor sit amet, consectetur adipiscing elit, Lorem..."` (coherent Latin)

CUTLASS cache size for Mistral-3.2: 280 tensors, 1325 MiB scale_factors (data is borrowed; only SfAtom-layout UE4M3 scales are owned; a sketch of that split follows below).
## Regression check

- `verify-fast` (Qwen3-4B Q8_0 smoke)

## Partial fix scope
This fixes the kernel-level breakage in the long-context NVFP4 regression — numerical-hash output on Mistral-3.2-NVFP4 with Lorem-ipsum prefixes is replaced by coherent text. Long English prose ≥250 tokens still doesn't always reach the LM-head's intended answer (`memory/nvfp4_long_context_regression_2026_04_28.md`), but `IMP_AUDIT_NVFP4_SCALES` analysis shows that's a model-behaviour / instruction-following issue, not a per-Linear `input_scale` math gap (the static prescale is mathematically equivalent to imp's dynamic-per-block path under `dynamic: local` recipes).

## Test plan
`make verify-fast` (pre-push, both commits): PASS, decode within the 3% threshold, prefill within 5%, smoke prompt distinct=8 contains 'Paris'

## Memo

- `memory/nvfp4_prequant_cutlass_cache_2026_05_01.md` — fix, measurements, follow-up notes
- `memory/nvfp4_long_context_regression_2026_04_28.md` — original bug investigation, now partially resolved

🤖 Generated with Claude Code