
fix(nvfp4): light up CUTLASS NVFP4×NVFP4 prefill for prequant SafeTensors #88

Merged
kekzl merged 2 commits into main from fix/nvfp4-prequant-cutlass-cache-v2 on May 1, 2026

Conversation

kekzl (Owner) commented May 1, 2026

Summary

Phase 0 in executor_pre_dequant.cu sets Tensor.qtype = NVFP4 directly on the main weight tensors for prequant SafeTensors models (Mistral-3.2-NVFP4, Gemma-4-NVFP4 llm-compressor, Qwen3.6-NVFP4 dense layers, Qwen3-Coder-30B-A3B-Instruct-FP4 dense+attention). But the existing CUTLASS cache build (Phase 3b) only iterated wcache_.nvfp4 — the legacy decode-cache map populated when FP16 / Q*_K weights are runtime-quantized to NVFP4. Prequant tensors were never in wcache_.nvfp4, so they were never converted to CUTLASS layout, and prefill (M>1) fell through to gemm_nvfp4 dequant → cuBLAS at executor_kernels.cu:1975:

[WARN] nvfp4_gemm.cu:1459: gemm_nvfp4: using slow dequant-to-FP16 fallback
       for M=16 (CUTLASS/cuBLASLt NVFP4 unavailable). Allocating 40.0 MiB
       dequant buffer for [4096, 5120] weight matrix

That allocation, fired on every prefill, is also why prequant SafeTensors models needed --no-cuda-graphs for stability — runtime allocations aren't CUDA-graph-compatible.
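
For reference, the pre-fix M>1 dispatch had roughly this shape — a minimal sketch with hypothetical names (cutlass_cache_, gemm_nvfp4_cutlass, gemm_nvfp4_dequant_fallback stand in for the real symbols):

```cpp
// Sketch only — illustrates the cache-miss fallthrough described above,
// with hypothetical names; not the actual executor_kernels.cu code.
void linear_prefill(const Tensor& act, const Tensor& w, Tensor& out, int M) {
    if (w.qtype == QType::NVFP4) {
        auto it = cutlass_cache_.find(w.data);            // keyed by weight data pointer
        if (it != cutlass_cache_.end()) {
            gemm_nvfp4_cutlass(act, it->second, out, M);  // NVFP4×NVFP4 CUTLASS path
            return;
        }
        // Prequant SafeTensors weights never made it into the cache, so every
        // prefill landed here: dequantize the weight to FP16 (per-call scratch
        // allocation), then cuBLAS — slow and incompatible with graph capture.
        gemm_nvfp4_dequant_fallback(act, w, out, M);
        return;
    }
    // ... other weight qtypes ...
}
```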

Added a Phase 0b loop right after the Phase 0 promotion: it walks each model layer's dense slots (wq/wk/wv/wo, w_gate/w_up/w_down, w_*_shared) and out_proj_, synthesises an NvFP4QuantResult from the Tensor sidecars, calls convert_nvfp4_to_cutlass(), and registers the result keyed by Tensor.data — see the sketch below. Per-expert MoE weights (expert_w_*) are unaffected: those use the CUTLASS 3.x grouped GEMM path, which is independent of cutlass_nvfp4.
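
A sketch of what that Phase 0b loop amounts to (slot iteration and accessor names are paraphrased from the description above, not copied from executor_pre_dequant.cu):

```cpp
// Phase 0b sketch, hypothetical accessors. For every dense weight promoted to
// NVFP4 in Phase 0, build a CUTLASS-layout entry so the prefill lookup now hits.
for (auto& layer : model_.layers) {
    for (Tensor* w : layer.dense_slots()) {   // wq/wk/wv/wo, w_gate/w_up/w_down, w_*_shared
        if (!w || w->qtype != QType::NVFP4) continue;       // expert_w_* not listed here:
                                                            // grouped-GEMM path, untouched
        NvFP4QuantResult q;
        q.packed       = w->data;                           // FP4 payload (borrowed)
        q.block_scales = w->sidecar("weight_scale");        // UE4M3 per-block scales
        q.global_scale = w->sidecar("weight_scale_2");      // FP32 per-tensor scale
        cutlass_cache_.emplace(w->data,                     // registered keyed by Tensor.data
                               convert_nvfp4_to_cutlass(q)); // owns the SfAtom-layout scales
    }
}
// out_proj_ gets the same treatment outside the per-layer loop.
```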

Second commit on this branch: an IMP_AUDIT_NVFP4_SCALES extension that also dumps input_scale stats. Diagnostic only — input_scale is loaded, but the per-block-dynamic CUTLASS quantizer makes it a floating-point no-op for dynamic: local recipes, so we don't wire it into the dispatch.
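
For the record, the no-op argument in one line of algebra (the max-abs block-scale formula below is the standard NVFP4 recipe, assumed here rather than read from the quantizer; the cancellation is exact up to the UE4M3 rounding of the block scale itself). For a 16-element activation block x and a static prescale α = 1/input_scale:

```math
s = \frac{\max_i |x_i|}{6}, \quad
s' = \frac{\max_i |\alpha x_i|}{6} = \alpha s, \quad
\mathrm{fp4}\!\left(\frac{\alpha x_i}{s'}\right) = \mathrm{fp4}\!\left(\frac{x_i}{s}\right), \quad
\frac{1}{\alpha}\, s'\, \mathrm{fp4}\!\left(\frac{\alpha x_i}{s'}\right) = s\, \mathrm{fp4}\!\left(\frac{x_i}{s}\right)
```

The dynamic per-block scale absorbs the static prescale, and the GEMM alpha would only undo it, so wiring input_scale into the dispatch changes nothing numerically for dynamic: local recipes; it would only matter for a recipe with static activation quantization.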

Measured impact

Standard pp512/tg256 bench, RTX 5090, default flags (CUDA graphs ON):

| Model | tg256 (tok/s) | pre-fix | Δ |
|---|---:|---:|---|
| Mistral-3.2-NVFP4 | 101 | 81 | +25% |
| Qwen3.6-35B-A3B-NVFP4 | 217 | 117–142 | +50–85% |
| Gemma-4-26B-A4B-NVFP4 | 213 | 157–180 | +18–35% |
| Qwen3-Coder-30B-A3B-Instruct-FP4 | 272 | 51 | 5.3× |

| Model | pp512 (tok/s) | pre-fix |
|---|---:|---|
| Mistral-3.2-NVFP4 | 12 804 | not measured (broken output) |
| Qwen3.6-35B-A3B-NVFP4 | 601 | not measured |
| Gemma-4-26B-A4B-NVFP4 | 1 651 | not measured |
| Qwen3-Coder-30B-A3B-NVFP4 | 1 299 | 2 891 (cuBLAS variance ±2.6× per release notes) |

The Qwen3-Coder 5.3× jump is mostly from CUDA graphs becoming safe by default — the dequant fallback's per-prefill 40 MiB FP16 scratch allocation was graph-incompatible. With Phase 0b lit up, graphs work end-to-end on prequant SafeTensors. The --no-cuda-graphs workaround for Qwen3-Coder NVFP4 is no longer needed.
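
The graph-compatibility point is the generic CUDA one rather than anything imp-specific; an illustrative snippet (not project code) of why a per-prefill allocation can't live inside a captured graph:

```cpp
#include <cuda_runtime.h>

// Illustration only: a synchronous allocation issued while a stream is being
// captured invalidates the capture, so any path that allocates scratch per
// prefill (like the dequant fallback's 40 MiB buffer) blocks CUDA graphs.
void capture_sketch(cudaStream_t s) {
    cudaGraph_t graph = nullptr;
    cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal);

    void* scratch = nullptr;
    // cudaMalloc is not a capturable operation; during global capture it
    // typically fails with cudaErrorStreamCaptureUnsupported and poisons
    // the capture.
    cudaMalloc(&scratch, 40u << 20);

    cudaStreamEndCapture(s, &graph);   // returns an error / no usable graph
    // With Phase 0b, CUTLASS-layout weights are built once at load time, so
    // the captured prefill issues no runtime allocations.
}
```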

Lorem×11 prompt (95-token prefill) on Mistral-3.2-NVFP4:

  • pre-fix: 283 tok/s prefill, output " a long established in 199999999999" (numerical garbage)
  • post-fix: 3 147 tok/s prefill (11× speedup), output " a dolor sit amet, consectetur adipiscing elit, Lorem..." (coherent Latin)

CUTLASS cache size for Mistral-3.2: 280 tensors, 1325 MiB scale_factors (data is borrowed; only SfAtom layout UE4M3 scales are owned).

Regression check

| Model | Result |
|---|---|
| Mistral-3.2-NVFP4 (short prompt) | "Paris. It is located in the north..." ✓ |
| Gemma-4-NVFP4 (llm-compressor) | unchanged ✓ |
| Qwen3-Coder-30B-A3B-Instruct-FP4 | "The capital of France is Paris." ✓ |
| Qwen3.6-NVFP4 (default flags) | "Paris." ✓ |
| Pre-push verify-fast (Qwen3-4B Q8_0 smoke) | PASS (both commits) |

Partial fix scope

This fixes the kernel-level breakage behind the long-context NVFP4 regression — numerical-hash output on Mistral-3.2-NVFP4 with Lorem-ipsum prefixes is replaced by coherent text. Long English prose prompts of ≥250 tokens still don't always reach the intended answer (memory/nvfp4_long_context_regression_2026_04_28.md), but IMP_AUDIT_NVFP4_SCALES analysis shows that is a model-behaviour / instruction-following issue, not a per-Linear input_scale math gap (the static prescale is mathematically equivalent to imp's dynamic-per-block path under dynamic: local recipes).

Test plan

  • make verify-fast (pre-push, both commits): PASS, decode within 3% threshold, prefill within 5%, smoke prompt distinct=8 contains 'Paris'
  • Mistral-3.2-NVFP4 short prompt regression: "Paris..." ✓
  • Mistral-3.2-NVFP4 Lorem×11: kernel-garbage → coherent Latin text
  • Gemma-4-NVFP4 / Qwen3-Coder-NVFP4 / Qwen3.6-NVFP4 regression: all coherent, all faster
  • Standardised pp512/tg256 bench across all four models: documented above
  • Build CI: green expected (single .cu source file edit + diagnostic helper)

Memo

  • memory/nvfp4_prequant_cutlass_cache_2026_05_01.md — fix, measurements, follow-up notes
  • memory/nvfp4_long_context_regression_2026_04_28.md — original bug investigation, now partially resolved

🤖 Generated with Claude Code

…sors

Phase 0 in executor_pre_dequant.cu sets `Tensor.qtype = NVFP4` directly on
the main weight tensors for prequant SafeTensors models (Mistral-3.2-NVFP4,
Gemma-4-NVFP4 llm-compressor, Qwen3.6-NVFP4 dense layers, etc.). But the
existing CUTLASS cache build (Phase 3b) only iterates `wcache_.nvfp4` —
the legacy decode-cache map populated when FP16/Q*_K weights are runtime-
quantized to NVFP4. Prequant tensors were never in `wcache_.nvfp4`, so
they were never converted to CUTLASS layout, and prefill (M>1) fell
through to `gemm_nvfp4` dequant→cuBLAS at executor_kernels.cu:1975.

Added Phase 0b loop right after Phase 0 promote: walks each model layer's
dense slots (wq/wk/wv/wo, w_gate/w_up/w_down, w_*_shared) and out_proj_,
synthesises an NvFP4QuantResult from the Tensor sidecars, calls
convert_nvfp4_to_cutlass(), and registers the result keyed by Tensor.data.
Per-expert MoE weights (expert_w_*) are unaffected — those use the
CUTLASS 3.x grouped GEMM path which is independent of cutlass_nvfp4.

Measured (Mistral-3.2-NVFP4, RTX 5090, Lorem×11=95-token prefill):
  Prefill:  283 → 3122 tok/s   (11×, was hitting "slow dequant-to-FP16
                                fallback" warning every prefill)
  Decode:    37 → 48 tok/s     (+30%, decode also benefits from cache hits)
  Output:   "a long established in 199999999999"  (numerical garbage)
         →  "a dolor sit amet, consectetur adipiscing elit, Quis..."
            (coherent Latin continuing the prefix)

Short-prompt regression: "The capital of France is" still produces
"Paris. It is located in the north of the country, on the river" ✓
Gemma-4-NVFP4: unaffected ✓
Qwen3-Coder-30B-A3B-Instruct-FP4 (Modelopt MoE): "Paris." ✓ (different
code path — CUTLASS 3.x grouped — verified untouched).

Note: this is a partial fix for the long-context NVFP4 regression.
Mistral-3.2 long English prose (≥250 tokens) still doesn't always reach
the right answer because of the residual SmoothQuant + FP16-act mismatch.
The recipe-intended path (load per-Linear input_scale + apply as static
prescale) remains open work — see TODO.md.

Memo: memory/nvfp4_prequant_cutlass_cache_2026_05_01.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
kekzl added a commit that referenced this pull request May 1, 2026
PR #88 lights up the CUTLASS NVFP4×NVFP4 prefill path for prequant
SafeTensors (Phase 0b in executor_pre_dequant.cu). Mistral-3.2-NVFP4
prefill 283 → 3122 tok/s, output coherence on long-context goes from
numerical-hash garbage to coherent (if context-attracted) text.

Long English prose ≥250 tokens still doesn't always reach the LM-head
answer — that's the residual SmoothQuant + FP16-act mismatch which needs
the recipe-intended dynamic NVFP4 input activation quantization path
(per-Linear input_scale prescale).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tats

IMP_AUDIT_NVFP4_SCALES already dumped weight_scale_2 stats; extend to
also walk input_scale across all NVFP4 prequant Linears and report
count, range, mean, sample values. Useful for distinguishing models
that ship per-Linear calibrated input scales (Mistral-3.2-NVFP4: 280/280
present, range 12-8832, mean 2167) from models with purely dynamic input
act-quant (Gemma-4-NVFP4 llm-compressor: no input_scale tensors).
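
The audit pass is essentially a reduction over per-Linear scalars; a sketch with made-up accessors (nvfp4_linears(), sidecar(), read_scalar_fp32() are placeholders, not imp APIs):

```cpp
// Sketch of the input_scale walk, hypothetical accessors.
// (Uses <cstdio>, <limits>, <algorithm>.)
size_t present = 0;
double sum = 0.0, lo = std::numeric_limits<double>::infinity(), hi = 0.0;
for (const Tensor* lin : model_.nvfp4_linears()) {
    const Tensor* is = lin->sidecar("input_scale");
    if (!is) continue;                      // e.g. Gemma-4 llm-compressor ships none
    double v = read_scalar_fp32(*is);       // per-Linear calibrated scalar
    lo = std::min(lo, v); hi = std::max(hi, v); sum += v; ++present;
}
fprintf(stderr, "[AUDIT] input_scale: %zu present, range %.0f-%.0f, mean %.0f\n",
        present, lo, hi, present ? sum / present : 0.0);
```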

Diagnostic only — input_scale is loaded and uploaded to GPU but is not
consumed by the dispatch path. For NVFP4 group-size=16 with dynamic
local act-quant the per-block dynamic scale absorbs the static
input_scale without numerical difference (mathematically equivalent to
prescaling the activation), so wiring input_scale into the GEMM alpha
would be a FP no-op. Documented for future investigation if a model
surfaces where the static prescale matters (e.g. recipe with static
input act-quant rather than dynamic local).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
kekzl added a commit that referenced this pull request May 1, 2026
Standard `pp512/tg256` bench (RTX 5090, default flags incl. CUDA graphs
ON) for all four prequant SafeTensors models post PR #88:

| Model                       | tg256 | pre-fix       | Δ           |
|-----------------------------|------:|--------------:|-------------|
| Mistral-3.2-NVFP4           |  101  | 81            | +25%        |
| Qwen3.6-35B-A3B-NVFP4       |  217  | 117-142       | +50-85%     |
| Gemma-4-26B-A4B-NVFP4       |  213  | 157-180       | +18-35%     |
| Qwen3-Coder-30B-A3B-NVFP4   |  272  | 51 (--no-graphs) | 5.3×    |

The Qwen3-Coder jump exposes a previously-undocumented secondary effect
of PR #88: lighting up the CUTLASS prefill path also makes CUDA graphs
**safe by default** for prequant SafeTensors. The previous
"--no-cuda-graphs required" workaround was forced by the dequant→cuBLAS
fallback's per-prefill 40 MiB FP16 scratch allocation, which is
graph-incompatible. With Phase 0b lit up, that path no longer fires on
the prefill hot loop.

Updates:
- BENCHMARKS.md decode + prefill tables for all four models
- CHANGELOG.md gets PR #88 as the lead fixed-section entry; stale
  "Qwen3-Coder still requires --no-cuda-graphs" note replaced with the
  graphs-now-safe statement
- TODO.md NVFP4-long-context bug section gets the proper bench matrix

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
kekzl merged commit 6ab1a6b into main May 1, 2026
2 checks passed
kekzl deleted the fix/nvfp4-prequant-cutlass-cache-v2 branch May 1, 2026 13:01
kekzl added a commit that referenced this pull request May 3, 2026
…tale numbers

Final pass on the docs/ files left untouched in 81fdea3.

### docs/RECOMMENDED_MODELS.md
- All NVFP4 numbers refreshed to post-PR#88 state: Qwen3-Coder-30B
  51 -> 272, Qwen3.6 117-142 -> 217, Gemma-4 157-180 -> 213,
  Mistral-3.2 81 -> 101.
- Drop "--no-cuda-graphs for coherence" caveat (PR #88 made graphs
  safe by default for prequant SafeTensors).
- Add workstation Blackwell GPU header note + NVFP4-as-primary
  framing.
- MoE table reordered to lead with NVFP4 prequant rows.
- Add Qwen3-30B-A3B-NVFP4-Modelopt as Mistral-3.2 long-context
  replacement.

### docs/SM120_OPTIMIZATION_STATUS.md
- Header now lists all three GB202 cards (RTX 5090 / PRO 5000 / 6000).
- "What Would Actually Help Decode" updated: NVFP4 prequant is the
  primary win, not speculative decoding (which TODO.md says is
  abandoned).
- Tested-models table refreshed with post-#88 numbers; "CUDA graphs
  for non-fast-path MoE: disabled" stays accurate but the NVFP4
  prequant row now says graphs capture end-to-end.
- "Project B Stage 5" name dropped from open-items table (the
  PROJECT_B doc was removed earlier; renamed to a plain
  "mxf4nvf4.block_scale MMA integration" line item).
- "engine.cpp:547" stale line ref replaced with the actual issue
  description (per-layer head_dim FP8 KV write/read kernels).

### docs/MXFP4_QUANTIZATION.md -> docs/quantization.md
- Title was misleading (the content is mostly NVFP4). Renamed to the broader
  scope and rewritten as a concise "where to get models for each path" guide.
- NVFP4 (primary), MXFP4 (CUTLASS-internal attention only), GGUF
  K-quants (legacy), other KV quants (FP8 / INT8 / INT4 / TurboQuant)
  each get a focused paragraph with current status and caveats
  (Mistral-3.2 long-prose, Gemma-4 NVFP4 native tool calls).
- Inference-pipeline detail removed - that's CLAUDE.md's job.
- Stale Qwen3-Coder "38 tok/s" -> current "272 tok/s post #88"
  context.

### docs/memory-management-comparison.md
- "imp supports CUDA only (Hopper, Blackwell)" was wrong - imp is
  sm_120f only. Corrected throughout.
- "Blackwell-native features (PDL, Green Contexts, TCGEN05)" - imp
  uses register-based mma.sync, NOT TCGEN05. Removed the false
  claim, noted the actual MMA shapes.
- KV format table extended: was "FP16, FP8 E4M3", now "FP16
  (default), FP8, INT8, INT4, NVFP4, TurboQuant" - reflects what
  the engine actually supports.
- mmap hints: was "MADV_SEQUENTIAL", now "MAP_POPULATE +
  MADV_WILLNEED + MADV_SEQUENTIAL" (post PR #97 cold-cache
  prefault).
- Pinned pool: was "64 MiB", now "4x128 MiB ring" (post PR #97
  upload pipeline).
- Speculative decoding cell rewritten to match TODO.md's
  abandoned-options state instead of generic "draft+target with KV
  block rollback".
- Decision matrix: was "Single H100/B200, latency-critical, one
  model -> imp" - imp doesn't target H100/B200. Replaced with
  Blackwell GB202 entry + a "datacenter Hopper / B200/B300 -> use
  vLLM/TensorRT-LLM" line.
- Native function calling row added (was missing entirely; new
  feature post #97).

### docs/memory-traffic-reduction-catalog.md
- W2 EAGLE-3: was "Dead-End historisch, worth revisit" (historical dead end,
  worth revisiting) - this contradicts TODO.md's definitive abandoned-options
  list. Updated to "abandoned - all variants tested, single-5090 decode
  bandwidth-bound".
- A6 Fused MoE Routing: was "teilweise (Gemma-4 Fast-Path)" (partial, Gemma-4
  fast path only) - outdated. Now "vorhanden für NVFP4 prequant MoE" (present
  for NVFP4 prequant MoE) covering Qwen3.6/Gemma-4/Coder-30B; legacy GGUF MoE
  called out as the remaining open item.
- "Counterintuitive finding: Q6_K beats NVFP4 on decode at 30B" was
  flipped post-PR#88: NVFP4 now 272 tok/s vs Q6_K 234. Section
  rewritten to explain that the old gap was an implementation
  artifact, not a format tradeoff. NVFP4 is now the default
  recommendation.
- Top-Kandidaten (top candidates) list: dropped W2 EAGLE-revisit + W3 Medusa,
  added A6-generalize-for-GGUF and the mxf4nvf4 MMA integration.
- "Was gewonnen wurde" (what was gained) section prepended with the headline
  NVFP4 prequant decode jump and the cold-start reduction from PR #97.
- Old `memory/...md` reference list dropped (those are auto-memory
  files, not in the repo); replaced with pointers to the surviving
  docs.

### README.md
- Doc-index pointer updated: docs/MXFP4_QUANTIZATION.md ->
  docs/quantization.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
github-actions Bot pushed a commit that referenced this pull request May 9, 2026
All decode + prefill numbers re-measured 2026-05-10 with imp:test post
PR #156 (chunked-prefill-hybrid) + PR #157 (auto max_seq_len 16K cap).

Notable changes vs prior table:
- Qwen3-4B / Qwen3-8B Q8_0 decode dropped (401→236, 255→149) — the older
  numbers predate a regression introduced somewhere between #88 and #150.
  Current state is consistent with tests/perf_baseline.json (~150 tok/s).
- Llama-3.2-3B decode improved (208→306).
- Qwen3.6-35B Q4_K_M decode +70% (143→243), reflects PR #150/#151 MoE
  graphs-gate + fp32_down pre-alloc wins.
- Q3.6-NVFP4 prefill +82% (601→1092).
- Added Nemotron-3-Nano-30B-A3B NVFP4 row (325 tg256 / 690 pp512).
- Mistral-Small-3.2 NVFP4 + Gemma-4 Q5_K_M kept as italic stale (model
  files not present locally — last numbers retained as historical).

README headline rewritten — single "Qwen3-8B at 255 tok/s vs 1.6×
llama.cpp" claim no longer holds; replaced with multi-model decode
highlights citing the current numbers.

Long-context prefill table (pp1024-pp8192) and KV-cache-quant table
(Llama-3.2-3B specific) NOT re-measured this round; still reflect the
v0.7 / PR #51 measurement series.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>