
fix(nvfp4): light up CUTLASS NVFP4×NVFP4 prefill for prequant SafeTensors #88

Merged
kekzl merged 2 commits into main from fix/nvfp4-prequant-cutlass-cache-v2 on May 1, 2026

Conversation

kekzl (Owner) commented May 1, 2026

Summary

Phase 0 in executor_pre_dequant.cu sets Tensor.qtype = NVFP4 directly on the main weight tensors for prequant SafeTensors models (Mistral-3.2-NVFP4, Gemma-4-NVFP4 llm-compressor, Qwen3.6-NVFP4 dense layers, Qwen3-Coder-30B-A3B-Instruct-FP4 dense+attention). But the existing CUTLASS cache build (Phase 3b) only iterated wcache_.nvfp4 — the legacy decode-cache map populated when FP16 / Q*_K weights are runtime-quantized to NVFP4. Prequant tensors were never in wcache_.nvfp4, so they were never converted to CUTLASS layout, and prefill (M>1) fell through to gemm_nvfp4 dequant → cuBLAS at executor_kernels.cu:1975:

[WARN] nvfp4_gemm.cu:1459: gemm_nvfp4: using slow dequant-to-FP16 fallback
       for M=16 (CUTLASS/cuBLASLt NVFP4 unavailable). Allocating 40.0 MiB
       dequant buffer for [4096, 5120] weight matrix

That allocation, fired on every prefill, is also why prequant SafeTensors models needed --no-cuda-graphs for stability — runtime allocations aren't CUDA-graph-compatible.
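
For reference, the pre-fix M>1 dispatch had roughly this shape — a minimal sketch with hypothetical names (cutlass_cache_, gemm_nvfp4_cutlass, gemm_nvfp4_dequant_fallback stand in for the real symbols):

```cpp
// Sketch only — illustrates the cache-miss fallthrough described above,
// with hypothetical names; not the actual executor_kernels.cu code.
void linear_prefill(const Tensor& act, const Tensor& w, Tensor& out, int M) {
    if (w.qtype == QType::NVFP4) {
        auto it = cutlass_cache_.find(w.data);            // keyed by weight data pointer
        if (it != cutlass_cache_.end()) {
            gemm_nvfp4_cutlass(act, it->second, out, M);  // NVFP4×NVFP4 CUTLASS path
            return;
        }
        // Prequant SafeTensors weights never made it into the cache, so every
        // prefill landed here: dequantize the weight to FP16 (per-call scratch
        // allocation), then cuBLAS — slow and incompatible with graph capture.
        gemm_nvfp4_dequant_fallback(act, w, out, M);
        return;
    }
    // ... other weight qtypes ...
}
```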

Added a Phase 0b loop right after the Phase 0 promotion: it walks each model layer's dense slots (wq/wk/wv/wo, w_gate/w_up/w_down, w_*_shared) and out_proj_, synthesises an NvFP4QuantResult from the Tensor sidecars, calls convert_nvfp4_to_cutlass(), and registers the result keyed by Tensor.data — see the sketch below. Per-expert MoE weights (expert_w_*) are unaffected: those use the CUTLASS 3.x grouped GEMM path, which is independent of cutlass_nvfp4.
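
A sketch of what that Phase 0b loop amounts to (slot iteration and accessor names are paraphrased from the description above, not copied from executor_pre_dequant.cu):

```cpp
// Phase 0b sketch, hypothetical accessors. For every dense weight promoted to
// NVFP4 in Phase 0, build a CUTLASS-layout entry so the prefill lookup now hits.
for (auto& layer : model_.layers) {
    for (Tensor* w : layer.dense_slots()) {   // wq/wk/wv/wo, w_gate/w_up/w_down, w_*_shared
        if (!w || w->qtype != QType::NVFP4) continue;       // expert_w_* not listed here:
                                                            // grouped-GEMM path, untouched
        NvFP4QuantResult q;
        q.packed       = w->data;                           // FP4 payload (borrowed)
        q.block_scales = w->sidecar("weight_scale");        // UE4M3 per-block scales
        q.global_scale = w->sidecar("weight_scale_2");      // FP32 per-tensor scale
        cutlass_cache_.emplace(w->data,                     // registered keyed by Tensor.data
                               convert_nvfp4_to_cutlass(q)); // owns the SfAtom-layout scales
    }
}
// out_proj_ gets the same treatment outside the per-layer loop.
```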

Second commit on this branch: an IMP_AUDIT_NVFP4_SCALES extension that also dumps input_scale stats. Diagnostic only — input_scale is loaded, but the per-block-dynamic CUTLASS quantizer makes it a floating-point no-op for dynamic: local recipes, so we don't wire it into the dispatch.
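
For the record, the no-op argument in one line of algebra (the max-abs block-scale formula below is the standard NVFP4 recipe, assumed here rather than read from the quantizer; the cancellation is exact up to the UE4M3 rounding of the block scale itself). For a 16-element activation block x and a static prescale α = 1/input_scale:

```math
s = \frac{\max_i |x_i|}{6}, \quad
s' = \frac{\max_i |\alpha x_i|}{6} = \alpha s, \quad
\mathrm{fp4}\!\left(\frac{\alpha x_i}{s'}\right) = \mathrm{fp4}\!\left(\frac{x_i}{s}\right), \quad
\frac{1}{\alpha}\, s'\, \mathrm{fp4}\!\left(\frac{\alpha x_i}{s'}\right) = s\, \mathrm{fp4}\!\left(\frac{x_i}{s}\right)
```

The dynamic per-block scale absorbs the static prescale, and the GEMM alpha would only undo it, so wiring input_scale into the dispatch changes nothing numerically for dynamic: local recipes; it would only matter for a recipe with static activation quantization.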

Measured impact

Standard pp512/tg256 bench, RTX 5090, default flags (CUDA graphs ON):

| Model | tg256 (tok/s) | pre-fix | Δ |
|---|---:|---:|---|
| Mistral-3.2-NVFP4 | 101 | 81 | +25% |
| Qwen3.6-35B-A3B-NVFP4 | 217 | 117–142 | +50–85% |
| Gemma-4-26B-A4B-NVFP4 | 213 | 157–180 | +18–35% |
| Qwen3-Coder-30B-A3B-Instruct-FP4 | 272 | 51 | 5.3× |

| Model | pp512 (tok/s) | pre-fix |
|---|---:|---|
| Mistral-3.2-NVFP4 | 12 804 | not measured (broken output) |
| Qwen3.6-35B-A3B-NVFP4 | 601 | not measured |
| Gemma-4-26B-A4B-NVFP4 | 1 651 | not measured |
| Qwen3-Coder-30B-A3B-NVFP4 | 1 299 | 2 891 (cuBLAS variance ±2.6× per release notes) |

The Qwen3-Coder 5.3× jump is mostly from CUDA graphs becoming safe by default — the dequant fallback's per-prefill 40 MiB FP16 scratch allocation was graph-incompatible. With Phase 0b lit up, graphs work end-to-end on prequant SafeTensors. The --no-cuda-graphs workaround for Qwen3-Coder NVFP4 is no longer needed.
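
The graph-compatibility point is the generic CUDA one rather than anything imp-specific; an illustrative snippet (not project code) of why a per-prefill allocation can't live inside a captured graph:

```cpp
#include <cuda_runtime.h>

// Illustration only: a synchronous allocation issued while a stream is being
// captured invalidates the capture, so any path that allocates scratch per
// prefill (like the dequant fallback's 40 MiB buffer) blocks CUDA graphs.
void capture_sketch(cudaStream_t s) {
    cudaGraph_t graph = nullptr;
    cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal);

    void* scratch = nullptr;
    // cudaMalloc is not a capturable operation; during global capture it
    // typically fails with cudaErrorStreamCaptureUnsupported and poisons
    // the capture.
    cudaMalloc(&scratch, 40u << 20);

    cudaStreamEndCapture(s, &graph);   // returns an error / no usable graph
    // With Phase 0b, CUTLASS-layout weights are built once at load time, so
    // the captured prefill issues no runtime allocations.
}
```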

Lorem×11 prompt (95-token prefill) on Mistral-3.2-NVFP4:

  • pre-fix: 283 tok/s prefill, output " a long established in 199999999999" (numerical garbage)
  • post-fix: 3 147 tok/s prefill (11× speedup), output " a dolor sit amet, consectetur adipiscing elit, Lorem..." (coherent Latin)

CUTLASS cache size for Mistral-3.2: 280 tensors, 1325 MiB scale_factors (data is borrowed; only SfAtom layout UE4M3 scales are owned).

Regression check

| Model | Result |
|---|---|
| Mistral-3.2-NVFP4 (short prompt) | "Paris. It is located in the north..." ✓ |
| Gemma-4-NVFP4 (llm-compressor) | unchanged ✓ |
| Qwen3-Coder-30B-A3B-Instruct-FP4 | "The capital of France is Paris." ✓ |
| Qwen3.6-NVFP4 (default flags) | "Paris." ✓ |
| Pre-push verify-fast (Qwen3-4B Q8_0 smoke) | PASS (both commits) |

Partial fix scope

This fixes the kernel-level breakage behind the long-context NVFP4 regression — numerical-hash output on Mistral-3.2-NVFP4 with Lorem-ipsum prefixes is replaced by coherent text. Long English prose prompts of ≥250 tokens still don't always reach the intended answer (memory/nvfp4_long_context_regression_2026_04_28.md), but IMP_AUDIT_NVFP4_SCALES analysis shows that is a model-behaviour / instruction-following issue, not a per-Linear input_scale math gap (the static prescale is mathematically equivalent to imp's dynamic-per-block path under dynamic: local recipes).

Test plan

  • make verify-fast (pre-push, both commits): PASS, decode within 3% threshold, prefill within 5%, smoke prompt distinct=8 contains 'Paris'
  • Mistral-3.2-NVFP4 short prompt regression: "Paris..." ✓
  • Mistral-3.2-NVFP4 Lorem×11: kernel-garbage → coherent Latin text
  • Gemma-4-NVFP4 / Qwen3-Coder-NVFP4 / Qwen3.6-NVFP4 regression: all coherent, all faster
  • Standardised pp512/tg256 bench across all four models: documented above
  • Build CI: green expected (single .cu source file edit + diagnostic helper)

Memo

  • memory/nvfp4_prequant_cutlass_cache_2026_05_01.md — fix, measurements, follow-up notes
  • memory/nvfp4_long_context_regression_2026_04_28.md — original bug investigation, now partially resolved

🤖 Generated with Claude Code

…sors

Phase 0 in executor_pre_dequant.cu sets `Tensor.qtype = NVFP4` directly on
the main weight tensors for prequant SafeTensors models (Mistral-3.2-NVFP4,
Gemma-4-NVFP4 llm-compressor, Qwen3.6-NVFP4 dense layers, etc.). But the
existing CUTLASS cache build (Phase 3b) only iterates `wcache_.nvfp4` —
the legacy decode-cache map populated when FP16/Q*_K weights are runtime-
quantized to NVFP4. Prequant tensors were never in `wcache_.nvfp4`, so
they were never converted to CUTLASS layout, and prefill (M>1) fell
through to `gemm_nvfp4` dequant→cuBLAS at executor_kernels.cu:1975.

Added Phase 0b loop right after Phase 0 promote: walks each model layer's
dense slots (wq/wk/wv/wo, w_gate/w_up/w_down, w_*_shared) and out_proj_,
synthesises an NvFP4QuantResult from the Tensor sidecars, calls
convert_nvfp4_to_cutlass(), and registers the result keyed by Tensor.data.
Per-expert MoE weights (expert_w_*) are unaffected — those use the
CUTLASS 3.x grouped GEMM path which is independent of cutlass_nvfp4.

Measured (Mistral-3.2-NVFP4, RTX 5090, Lorem×11=95-token prefill):
  Prefill:  283 → 3122 tok/s   (11×, was hitting "slow dequant-to-FP16
                                fallback" warning every prefill)
  Decode:    37 → 48 tok/s     (+30%, decode also benefits from cache hits)
  Output:   "a long established in 199999999999"  (numerical garbage)
         →  "a dolor sit amet, consectetur adipiscing elit, Quis..."
            (coherent Latin continuing the prefix)

Short-prompt regression: "The capital of France is" still produces
"Paris. It is located in the north of the country, on the river" ✓
Gemma-4-NVFP4: unaffected ✓
Qwen3-Coder-30B-A3B-Instruct-FP4 (Modelopt MoE): "Paris." ✓ (different
code path — CUTLASS 3.x grouped — verified untouched).

Note: this is a partial fix for the long-context NVFP4 regression.
Mistral-3.2 long English prose (≥250 tokens) still doesn't always reach
the right answer because of the residual SmoothQuant + FP16-act mismatch.
The recipe-intended path (load per-Linear input_scale + apply as static
prescale) remains open work — see TODO.md.

Memo: memory/nvfp4_prequant_cutlass_cache_2026_05_01.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
kekzl added a commit that referenced this pull request May 1, 2026
PR #88 lights up the CUTLASS NVFP4×NVFP4 prefill path for prequant
SafeTensors (Phase 0b in executor_pre_dequant.cu). Mistral-3.2-NVFP4
prefill 283 → 3122 tok/s, output coherence on long-context goes from
numerical-hash garbage to coherent (if context-attracted) text.

Long English prose ≥250 tokens still doesn't always reach the LM-head
answer — that's the residual SmoothQuant + FP16-act mismatch which needs
the recipe-intended dynamic NVFP4 input activation quantization path
(per-Linear input_scale prescale).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tats

IMP_AUDIT_NVFP4_SCALES already dumped weight_scale_2 stats; extend to
also walk input_scale across all NVFP4 prequant Linears and report
count, range, mean, sample values. Useful for distinguishing models
that ship per-Linear calibrated input scales (Mistral-3.2-NVFP4: 280/280
present, range 12-8832, mean 2167) from models with purely dynamic input
act-quant (Gemma-4-NVFP4 llm-compressor: no input_scale tensors).
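
The audit pass is essentially a reduction over per-Linear scalars; a sketch with made-up accessors (nvfp4_linears(), sidecar(), read_scalar_fp32() are placeholders, not imp APIs):

```cpp
// Sketch of the input_scale walk, hypothetical accessors.
// (Uses <cstdio>, <limits>, <algorithm>.)
size_t present = 0;
double sum = 0.0, lo = std::numeric_limits<double>::infinity(), hi = 0.0;
for (const Tensor* lin : model_.nvfp4_linears()) {
    const Tensor* is = lin->sidecar("input_scale");
    if (!is) continue;                      // e.g. Gemma-4 llm-compressor ships none
    double v = read_scalar_fp32(*is);       // per-Linear calibrated scalar
    lo = std::min(lo, v); hi = std::max(hi, v); sum += v; ++present;
}
fprintf(stderr, "[AUDIT] input_scale: %zu present, range %.0f-%.0f, mean %.0f\n",
        present, lo, hi, present ? sum / present : 0.0);
```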

Diagnostic only — input_scale is loaded and uploaded to GPU but is not
consumed by the dispatch path. For NVFP4 group-size=16 with dynamic
local act-quant the per-block dynamic scale absorbs the static
input_scale without numerical difference (mathematically equivalent to
prescaling the activation), so wiring input_scale into the GEMM alpha
would be a FP no-op. Documented for future investigation if a model
surfaces where the static prescale matters (e.g. recipe with static
input act-quant rather than dynamic local).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
kekzl added a commit that referenced this pull request May 1, 2026
Standard `pp512/tg256` bench (RTX 5090, default flags incl. CUDA graphs
ON) for all four prequant SafeTensors models post PR #88:

| Model                       | tg256 | pre-fix       | Δ           |
|-----------------------------|------:|--------------:|-------------|
| Mistral-3.2-NVFP4           |  101  | 81            | +25%        |
| Qwen3.6-35B-A3B-NVFP4       |  217  | 117-142       | +50-85%     |
| Gemma-4-26B-A4B-NVFP4       |  213  | 157-180       | +18-35%     |
| Qwen3-Coder-30B-A3B-NVFP4   |  272  | 51 (--no-graphs) | 5.3×    |

The Qwen3-Coder jump exposes a previously-undocumented secondary effect
of PR #88: lighting up the CUTLASS prefill path also makes CUDA graphs
**safe by default** for prequant SafeTensors. The previous
"--no-cuda-graphs required" workaround was forced by the dequant→cuBLAS
fallback's per-prefill 40 MiB FP16 scratch allocation, which is
graph-incompatible. With Phase 0b lit up, that path no longer fires on
the prefill hot loop.

Updates:
- BENCHMARKS.md decode + prefill tables for all four models
- CHANGELOG.md gets PR #88 as the lead fixed-section entry; stale
  "Qwen3-Coder still requires --no-cuda-graphs" note replaced with the
  graphs-now-safe statement
- TODO.md NVFP4-long-context bug section gets the proper bench matrix

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
kekzl merged commit 6ab1a6b into main May 1, 2026
2 checks passed
kekzl deleted the fix/nvfp4-prequant-cutlass-cache-v2 branch May 1, 2026 13:01
kekzl added a commit that referenced this pull request May 3, 2026
…tale numbers

Final pass on the docs/ files left untouched in 81fdea3.

### docs/RECOMMENDED_MODELS.md
- All NVFP4 numbers refreshed to post-PR#88 state: Qwen3-Coder-30B
  51 -> 272, Qwen3.6 117-142 -> 217, Gemma-4 157-180 -> 213,
  Mistral-3.2 81 -> 101.
- Drop "--no-cuda-graphs for coherence" caveat (PR #88 made graphs
  safe by default for prequant SafeTensors).
- Add workstation Blackwell GPU header note + NVFP4-as-primary
  framing.
- MoE table reordered to lead with NVFP4 prequant rows.
- Add Qwen3-30B-A3B-NVFP4-Modelopt as Mistral-3.2 long-context
  replacement.

### docs/SM120_OPTIMIZATION_STATUS.md
- Header now lists all three GB202 cards (RTX 5090 / PRO 5000 / 6000).
- "What Would Actually Help Decode" updated: NVFP4 prequant is the
  primary win, not speculative decoding (which TODO.md says is
  abandoned).
- Tested-models table refreshed with post-#88 numbers; "CUDA graphs
  for non-fast-path MoE: disabled" stays accurate but the NVFP4
  prequant row now says graphs capture end-to-end.
- "Project B Stage 5" name dropped from open-items table (the
  PROJECT_B doc was removed earlier; renamed to a plain
  "mxf4nvf4.block_scale MMA integration" line item).
- "engine.cpp:547" stale line ref replaced with the actual issue
  description (per-layer head_dim FP8 KV write/read kernels).

### docs/MXFP4_QUANTIZATION.md -> docs/quantization.md
- Title was misleading (the content is mostly NVFP4). Renamed to the broader
  scope and rewritten as a concise "where to get models for each path" guide.
- NVFP4 (primary), MXFP4 (CUTLASS-internal attention only), GGUF
  K-quants (legacy), other KV quants (FP8 / INT8 / INT4 / TurboQuant)
  each get a focused paragraph with current status and caveats
  (Mistral-3.2 long-prose, Gemma-4 NVFP4 native tool calls).
- Inference-pipeline detail removed - that's CLAUDE.md's job.
- Stale Qwen3-Coder "38 tok/s" -> current "272 tok/s post #88"
  context.

### docs/memory-management-comparison.md
- "imp supports CUDA only (Hopper, Blackwell)" was wrong - imp is
  sm_120f only. Corrected throughout.
- "Blackwell-native features (PDL, Green Contexts, TCGEN05)" - imp
  uses register-based mma.sync, NOT TCGEN05. Removed the false
  claim, noted the actual MMA shapes.
- KV format table extended: was "FP16, FP8 E4M3", now "FP16
  (default), FP8, INT8, INT4, NVFP4, TurboQuant" - reflects what
  the engine actually supports.
- mmap hints: was "MADV_SEQUENTIAL", now "MAP_POPULATE +
  MADV_WILLNEED + MADV_SEQUENTIAL" (post PR #97 cold-cache
  prefault).
- Pinned pool: was "64 MiB", now "4x128 MiB ring" (post PR #97
  upload pipeline).
- Speculative decoding cell rewritten to match TODO.md's
  abandoned-options state instead of generic "draft+target with KV
  block rollback".
- Decision matrix: was "Single H100/B200, latency-critical, one
  model -> imp" - imp doesn't target H100/B200. Replaced with
  Blackwell GB202 entry + a "datacenter Hopper / B200/B300 -> use
  vLLM/TensorRT-LLM" line.
- Native function calling row added (was missing entirely; new
  feature post #97).

### docs/memory-traffic-reduction-catalog.md
- W2 EAGLE-3: was "Dead-End historisch, worth revisit" (historical dead end,
  worth revisiting) - this contradicts TODO.md's definitive abandoned-options
  list. Updated to "abandoned - all variants tested, single-5090 decode
  bandwidth-bound".
- A6 Fused MoE Routing: was "teilweise (Gemma-4 Fast-Path)" (partial, Gemma-4
  fast path only) - outdated. Now "vorhanden für NVFP4 prequant MoE" (present
  for NVFP4 prequant MoE) covering Qwen3.6/Gemma-4/Coder-30B; legacy GGUF MoE
  called out as the remaining open item.
- "Counterintuitive finding: Q6_K beats NVFP4 on decode at 30B" was
  flipped post-PR#88: NVFP4 now 272 tok/s vs Q6_K 234. Section
  rewritten to explain that the old gap was an implementation
  artifact, not a format tradeoff. NVFP4 is now the default
  recommendation.
- Top-Kandidaten (top candidates) list: dropped W2 EAGLE-revisit + W3 Medusa,
  added A6-generalize-for-GGUF and the mxf4nvf4 MMA integration.
- "Was gewonnen wurde" (what was gained) section prepended with the headline
  NVFP4 prequant decode jump and the cold-start reduction from PR #97.
- Old `memory/...md` reference list dropped (those are auto-memory
  files, not in the repo); replaced with pointers to the surviving
  docs.

### README.md
- Doc-index pointer updated: docs/MXFP4_QUANTIZATION.md ->
  docs/quantization.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
github-actions Bot pushed a commit that referenced this pull request May 9, 2026
All decode + prefill numbers re-measured 2026-05-10 with imp:test post
PR #156 (chunked-prefill-hybrid) + PR #157 (auto max_seq_len 16K cap).

Notable changes vs prior table:
- Qwen3-4B / Qwen3-8B Q8_0 decode dropped (401→236, 255→149) — the older
  numbers predate a regression introduced somewhere between #88 and #150.
  Current state is consistent with tests/perf_baseline.json (~150 tok/s).
- Llama-3.2-3B decode improved (208→306).
- Qwen3.6-35B Q4_K_M decode +70% (143→243), reflects PR #150/#151 MoE
  graphs-gate + fp32_down pre-alloc wins.
- Q3.6-NVFP4 prefill +82% (601→1092).
- Added Nemotron-3-Nano-30B-A3B NVFP4 row (325 tg256 / 690 pp512).
- Mistral-Small-3.2 NVFP4 + Gemma-4 Q5_K_M kept as italic stale (model
  files not present locally — last numbers retained as historical).

README headline rewritten — single "Qwen3-8B at 255 tok/s vs 1.6×
llama.cpp" claim no longer holds; replaced with multi-model decode
highlights citing the current numbers.

Long-context prefill table (pp1024-pp8192) and KV-cache-quant table
(Llama-3.2-3B specific) NOT re-measured this round; still reflect the
v0.7 / PR #51 measurement series.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>