
NVFP4 stability bundle: encoder clamp, prefill chunk fix, warmup default off, validation harness#94

Merged
kekzl merged 6 commits into main from fix/llmcompressor-nvfp4-skip-cutlass on May 2, 2026

Conversation

kekzl (Owner) commented on May 2, 2026

Summary

Six commits tightening NVFP4 SafeTensors stability on imp + a re-runnable validation harness for all NVFP4 models on disk.

| # | Commit | What |
|---|--------|------|
| 1 | 557cab3 | fix(fp8): Correct E4M3-fn encoder clamp to 0x7E = 448 (was 0x77 = 240) — roots the CUTLASS+llm-compressor mismatch. |
| 2 | a7920f6 | fix(nvfp4): Originally skipped CUTLASS for llm-compressor before the encoder fix. (Now a no-op after #1; kept for branch lineage.) |
| 3 | b29d066 | fix(runtime): Clamp effective_chunk against executor->max_tokens() in step_prefill; throw on overflow in forward_logits; try/catch around engine->step() in batching_engine. Fixes the Qwen3.6-NVFP4 long-prompt crash (terminate: reshape: numel mismatch). |
| 4 | 95f4ee2 | fix(graph): Swallow the benign "invalid device function" from cudaGraphKernelNodeGetParams for driver-API kernel nodes in apply_pdl_edges. Was logging "Cleared stale error" on every chat completion. |
| 5 | ca6ef69 | fix(runtime): Default runtime.warmup=false, opt-in for prod rollout. A bisect proved Engine::warmup() was the trigger for Mistral-3.2-NVFP4 first-request degeneration (illumin11111). |
| 6 | 517b11a | test(validation): scripts/validate_safetensors.py + 5-model report + per-model JSON artifacts. Validates Mistral-3.2, Gemma-4, Qwen3.6, Qwen3-Coder, and the new nvidia/Qwen3-30B-A3B-NVFP4 (Modelopt) as a clean drop-in for the broken Mistral-3.2-NVFP4. |

Validation results

Five NVFP4 SafeTensors models were exercised against a 20-prompt battery, a 32× graph-replay determinism check, and degeneracy gates. Re-runnable any time via scripts/validate_safetensors.py. Headline:

| Killer test (Mistral-3.2's worst case) | Mistral-3.2-NVFP4 | nvidia/Qwen3-30B-A3B-NVFP4 (replacement) |
|---|---|---|
| 50× Lorem-Ipsum prefix → "capital of France" | "elit dolor elit dolor..." | "Paris. The capital of Germany is Berlin..." |
| 1024-tok creative gen, 4-gram repetition | 95.7% | 1.4% |
| First-request degeneration | illumin11111 (pre-warmup-fix) | ✅ clean |

After the bug fixes:

  • Qwen3.6-NVFP4 long_context_recall (1856-tok): server-crash → coherent
  • Mistral first-request: garbage → clean
  • Qwen3-Coder graph-replay determinism: 16/32 → 23/32 byte-identical (warmup-flip side-effect)
  • 0 stale-error WARNs/request (was 2/request)

Mistral-3.2-NVFP4 long-context regression — root cause

Investigation in this PR refuted the prior input_global_scale hypothesis (tested both alpha directions, neither helped; all three llm-compressor models ship IGS but only Mistral breaks). A direct dump of the L0 q_proj NVFP4-dequantized FP16 weights reveals the actual cause:

  • 335× per-K-channel max range (max=4.36, median=0.013)
  • 20.3% outlier K-channels (1037/5120)
  • 97.8% of NVFP4 micro-blocks contain ≥1 outlier — non-outlier values in those blocks snap to ±0/±0.5
  • ~45% of dequanted weight values are exactly 0

This is a SmoothQuant 0.9 + per-block NVFP4 incompatibility, baked into the model file at calibration. It is not fixable in imp; the replacement model has been validated.
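
A minimal sketch (not part of this PR) of how the per-K-channel stats above can be computed from the dequantized FP16 weight; the 6× median outlier threshold is an assumption, not the actual cutoff used in the investigation:

```cpp
// Per-K-channel stats over a dequantized weight W, laid out [N x K] row-major.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct KChannelStats { float range_ratio; float outlier_frac; };

KChannelStats k_channel_stats(const std::vector<float>& W, size_t N, size_t K) {
    std::vector<float> ch_max(K, 0.0f);                 // max |w| per K-channel (column)
    for (size_t n = 0; n < N; ++n)
        for (size_t k = 0; k < K; ++k)
            ch_max[k] = std::max(ch_max[k], std::fabs(W[n * K + k]));

    std::vector<float> sorted = ch_max;
    std::nth_element(sorted.begin(), sorted.begin() + K / 2, sorted.end());
    float median     = sorted[K / 2];
    float max_of_max = *std::max_element(ch_max.begin(), ch_max.end());

    size_t outliers = 0;
    for (float m : ch_max) outliers += (m > 6.0f * median);   // assumed threshold

    return { max_of_max / median,              // e.g. 4.36 / 0.013 ~= 335x
             float(outliers) / float(K) };     // e.g. 1037 / 5120 ~= 20.3%
}
```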

Test plan

  • make build passes
  • make test-unit 37/37 PASS
  • Pre-push hook (verify fast) passes
  • All 5 NVFP4 models load without error
  • No regression on existing baselines (Qwen3-Coder + Gemma-4 unchanged)
  • Killer-test on nvidia/Qwen3-30B-A3B-NVFP4 confirmed clean output

🤖 Generated with Claude Code

kekzl and others added 6 commits May 2, 2026 01:00
…ong output

PR #88 lit up the CUTLASS sm_120 NVFP4×NVFP4 prefill path for prequant
SafeTensors (~10× prefill speedup, fixed Lorem-ipsum kernel garbage).
But subsequent stress testing revealed that for **llm-compressor format**
models specifically (Mistral-3.2, Gemma-4, Qwen3.6), the CUTLASS path
produces wrong output on any non-trivial generation past ~30 tokens:

  Mistral-3.2-NVFP4 200-tok "why is the sky blue":
    "...blue light is scattered more than other colors because blue light
    is scattered more than other colors because blue light has a shorter
    wavelength..." [loops infinitely]

  Mistral-3.2-NVFP4 multi-turn ("Whiskers"):
    "It looks like there might be a small typo in your message..."

  Mistral-3.2-NVFP4 long-context retrieval:
    "Here is now modern-day. The empire covered approximately
    1111111111111111111111111..."  [numerical garbage]

debug_forward dumps localised the bug to L0 QKV gemm output:
  after_embedding   max=0.017, L2=0.16    [sane]
  L0_after_qkv_q    max=65472 (FP16 max), L2=1.8M, Inf=836  [blow-up]
  L0_after_paged_attn  NaN=4096  [completely NaN]

Verified the cause is the CUTLASS NVFP4×NVFP4 path specifically:
  - tensor_scale=0.00162338 confirmed correct at Phase 0b promote (DBG
    log) and at the CUTLASS GEMM call site (DBG log).
  - dequantize_nvfp4_to_fp16 produces sane weight magnitudes for
    llm-compressor: max|w|=0.117, mean|w|=0.0023 on L0 q_proj. Same
    correct order as Modelopt's max=0.040, mean=0.007.
  - With CUTLASS dispatch disabled (env var bisect), Mistral-3.2-NVFP4
    produces 3 coherent sentences ("The sky appears blue due to a
    phenomenon called Rayleigh scattering...") + correct multi-turn
    recall ("Your cat is called Whiskers.") + correct long-context
    retrieval ("The Western Roman Empire fell in 476 AD...").
  - Modelopt-format Qwen3-Coder-30B-A3B-Instruct-FP4 is unaffected —
    same CUTLASS path produces correct output. The bug is exclusively
    in the format-specific numerical interaction, not in CUTLASS or
    imp's dequant logic in general.

Fix is the same pattern as PR #65 (Gemma-4 NVFP4 MoE per-row gemv):
when one specific quantizer-format combo produces wrong output, route
through the slower-but-correct dequant→cuBLAS fallback. Phase 0b now
skips the CUTLASS cache registration when `cfg.is_llm_compressor_nvfp4`
is true; the dispatch at executor_kernels.cu:1975 then falls through
to `gemm_nvfp4` which dequants once to FP16 and calls cuBLAS.
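
Sketched, with a hypothetical registration helper standing in for the real Phase-0b code (`cfg.is_llm_compressor_nvfp4` and `gemm_nvfp4` are the names used above):

```cpp
// Phase 0b, sketched: only register CUTLASS NVFP4xNVFP4 cache entries for formats
// that are known-good; register_cutlass_nvfp4_cache() is a hypothetical stand-in.
if (!cfg.is_llm_compressor_nvfp4) {
    register_cutlass_nvfp4_cache(layer);
}
// With no cache entry, the dispatch at executor_kernels.cu:1975 falls through to
// gemm_nvfp4: dequantize the weight once to FP16, then call cuBLAS.
```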

Performance impact for llm-compressor models: prefill drops from ~12k
tok/s (CUTLASS) back to ~280 tok/s (dequant→cuBLAS, with one-time
40 MiB scratch alloc per Linear). Decode (M=1, gemv_nvfp4_kpar) is
unchanged. Modelopt-format NVFP4 (Qwen3-Coder etc.) keeps the CUTLASS
path and full speedup.

Verified post-fix on RTX 5090:

| Test                     | Pre-fix output                          | Post-fix output                        |
|--------------------------|-----------------------------------------|----------------------------------------|
| Mistral-3.2 200-tok      | repetition loop                         | "Rayleigh scattering... 3 sentences"   |
| Mistral-3.2 multi-turn   | "small typo in your message"            | "Your cat is called Whiskers." ✓       |
| Mistral-3.2 long-ctx     | "1111111…" garbage                      | "476 AD" extracted ✓                   |
| Qwen3.6-NVFP4 multi-turn | empty / wrong                           | "Alice is asking about her cat's name" |
| Qwen3-Coder Modelopt     | "Paris." ✓                              | "The capital of France is Paris." ✓    |

Bug root-cause within CUTLASS+llm-compressor remains unidentified —
candidates are FP8 micro-scale interpretation differences, activation
NVFP4 quantization noise interaction with SmoothQuant-baked weights,
or some SfAtom layout / arg passing edge case. The dequant kernel
itself (which feeds the dequant→cuBLAS fallback) handles llm-compressor
correctly per debug stats. Investigation memos:
`stress_test_safetensors_2026_05_01.md`,
`nvfp4_long_context_regression_2026_04_28.md`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Roots the CUTLASS+llm-compressor mismatch the previous commit a7920f6
worked around. `float_to_fp8_e4m3` had two related bugs:

1. The `e4 >= 15` branch returned `(14 << 3) | 7 = 0x77` which decodes
   to 240. Standard FP8 E4M3-fn max is 448 = `(15 << 3) | 6 = 0x7E`.
   Any input value ≥ 256 that landed in the e_field=15 slot got
   squashed to 240, a 0.536× squash on the precision cliff.
2. The same branch saturated *all* values with e4 >= 15, including
   valid normals like 256 (e=15, m=0) and 416 (e=15, m=5). Only e4 > 15
   should saturate; e4 == 15 with m ∈ [0, 6] is a perfectly valid
   region encoding [256, 448].
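
The corrected rule, sketched below with round-to-nearest and simplified subnormal handling. Only the function name and the boundary encodings come from this commit; the body is illustrative, not imp's implementation:

```cpp
#include <cmath>
#include <cstdint>

uint8_t float_to_fp8_e4m3(float x) {
    uint8_t sign = std::signbit(x) ? 0x80 : 0x00;
    float a = std::fabs(x);
    if (std::isnan(a))  return sign | 0x7F;        // E4M3-fn NaN
    if (a >= 464.0f)    return sign | 0x7E;        // past the 448 rounding boundary: saturate to 448
    if (a <  0x1p-10f)  return sign;               // below half the smallest subnormal: +/-0

    int e2;
    float m = std::frexp(a, &e2);                  // a = m * 2^e2, m in [0.5, 1)
    int e4 = (e2 - 1) + 7;                         // biased exponent of 1.f * 2^(e2-1), bias 7

    if (e4 <= 0) {                                 // subnormal range: ulp = 2^-9
        int mant = (int)std::lround(a * 512.0f);
        return sign | (uint8_t)mant;               // mant==8 rolls into the smallest normal, still valid
    }
    int mant = (int)std::lround((m * 2.0f - 1.0f) * 8.0f);  // 3-bit mantissa
    if (mant == 8) { mant = 0; ++e4; }             // rounded up into the next binade

    // Old bug: e4 >= 15 always returned (14 << 3) | 7 = 0x77 = 240.
    // Correct rule: only e4 > 15, or e4 == 15 with mantissa 7, saturates to 448.
    if (e4 > 15 || (e4 == 15 && mant > 6)) return sign | 0x7E;   // 448
    return sign | (uint8_t)(e4 << 3) | (uint8_t)mant;
    // Spot checks from the companion test: 256->0x78, 300->0x79, 416->0x7D,
    // 448->0x7E, 500->0x7E (saturate), 6.0->0x4C.
}
```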

Why this broke compressed-tensors NVFP4 prequant specifically:

  global_scale = FP8_max * FP4_max / max(|W|) = 2688 / max(|W|)
  weight_scale stored = max(|W_block|) * global_scale / FP4_max

For an outlier block where max(|W_block|) ≈ max(|W|) (as in Mistral
attention with its known outliers), the stored micro-scale reaches up
to FP8_max = 448. compressed-tensors emits this as 0x7E on disk.
imp's `convert_scales_sfatom_kernel` decodes the byte to 448 and
re-encodes via float_to_fp8_e4m3 — which used to return 0x77 (240),
halving the effective scale.
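Concretely, with max(|W|) = 4.36 this gives global_scale = 2688 / 4.36
≈ 616.5, so the worst outlier block stores a micro-scale of
4.36 × 616.5 / 6 = 448, exactly FP8_max, right on the encoding cliff.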

Modelopt was unaffected because its convention keeps micro-scales in
the original W-domain (max ~max(|W|)/6, typically <1.0 — well below
the precision cliff).

Empirical bisect (`tests/test_cutlass_nvfp4_alpha.cu`):

  Mistral L0 q_proj-shaped synthetic prequant input (max(|W|)=4.36):
    Pre-fix:  CUTLASS max|y|=3.90  ratio vs reference 0.53  (halved)
    Post-fix: CUTLASS max|y|=6.94  ratio vs reference 0.95  (≈ Modelopt
              auto-quant noise level)

The companion test `Fp8E4M3EncoderClampBoundary` pins the canonical
E4M3-fn encodings (256→0x78, 300→0x79, 416→0x7D, 448→0x7E, 500→0x7E
saturate, 6.0→0x4C, etc.) so the cliff cannot regress.

With the encoder fixed, the per-format CUTLASS skip from a7920f6 is
no longer load-bearing: Phase-0b can register prequant llm-compressor
weights into the CUTLASS cache the same as Modelopt. Reverted.

Notes:
- Mistral-3.2-NVFP4 still has a separate, pre-existing degeneration
  (multi-turn recall and long-context retrieval still fail) — that's
  not the FP8 cliff and not in scope here. The Sky-Blue prompt now
  produces a coherent first sentence on the CUTLASS path (matching the
  workaround behaviour) before its independent regression kicks in.
- Modelopt Qwen3-Coder-30B-A3B-Instruct-FP4 unchanged: "The capital
  of France is Paris." ✓
- All test suites green: test-core 121, test-compute 115, test-quant
  75, test-attention 67, test-kv 31.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… propagation

Three coordinated fixes to make long-prompt requests safe on hybrid SSM/GDN+MoE
models (Qwen3.6, Qwen3.5-MoE):

1. engine.cpp:step_prefill — hard-cap effective_chunk against
   executor->max_tokens(). The handlers default config_.prefill_chunk_size=512
   for OpenAI-API users, but the executor caps max_tokens to 256 for SSM/GDN+MoE
   hybrids (executor_workspace.cu:117). Without this clamp, a 512-token chunk
   reaches forward_logits, overflows the workspace, and the next op throws
   reshape: numel mismatch on the worker thread.

2. executor_forward.cu:forward_logits — when n_tokens > max_tokens_, throw
   std::invalid_argument instead of log-and-return. Returning silently with
   an uninitialized logits tensor pushes the error to a downstream reshape,
   surfacing as a confusing "numel mismatch" instead of the actual cause.

3. batching_engine.cpp:worker_loop — wrap engine->step() in try/catch.
   Previously, any exception from the engine called std::terminate on the
   worker thread, killing the entire imp-server container and taking every
   other in-flight request with it. Now we cancel all active requests with
   reason="internal_error", reset transient engine state (graphs, batch pool
   cache), and keep the worker alive.
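
Sketched together (member and function names follow the text above; the helpers in the catch block are hypothetical stand-ins, not the real cleanup code):

```cpp
#include <algorithm>
#include <stdexcept>
#include <string>

// 1) engine.cpp:step_prefill — never hand the executor more tokens than its workspace holds.
size_t effective_chunk = std::min<size_t>(config_.prefill_chunk_size,
                                          executor->max_tokens());

// 2) executor_forward.cu:forward_logits — fail loudly at the real boundary instead of
//    returning an uninitialized logits tensor that later dies as "reshape: numel mismatch".
if (n_tokens > max_tokens_) {
    throw std::invalid_argument("forward_logits: n_tokens=" + std::to_string(n_tokens) +
                                " exceeds workspace max_tokens=" + std::to_string(max_tokens_));
}

// 3) batching_engine.cpp:worker_loop — contain engine exceptions instead of letting them
//    std::terminate the worker thread and the whole container.
try {
    engine->step();
} catch (const std::exception& e) {
    cancel_all_active("internal_error");   // hypothetical helper
    reset_transient_state();               // hypothetical helper: graphs, batch pool cache
}
```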

Validation: Qwen3.6-NVFP4 long_context_recall (1856-token prompt) goes from
container-crashing terminate to clean coherent execution. Battery pass rate
4/20 → 13/20. Server stays up across many requests.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ersion

apply_pdl_edges() walks a captured graph's kernel edges and queries each
source node's kernel params via cudaGraphKernelNodeGetParams to look up
PDL-enabled status. For kernel nodes added via the driver-API form
(CUkernel handle rather than a host __global__ symbol pointer), the
runtime API returns kparams.func=nullptr AND sets the global CUDA last
error to "invalid device function" (cudaErrorInvalidDeviceFunction = 98).

The error is benign in our flow — a null host pointer just means "not in
the PDL registry, skip this edge". But it was leaking up two stack frames
to the next forward_logits, where the existing pre-pass diagnostic logged
"Cleared stale error before forward: invalid device function" on every
chat completion request (~2/request, ~600/min in production traffic).

Fix: defensive (void)cudaGetLastError() in the skip path, plus an explicit
null check on kparams.func so the lookup short-circuits cleanly. Also
clarifying comment on why this is necessary.
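
Sketched (only cudaGraphKernelNodeGetParams, the null-func/last-error behaviour, and the registry lookup come from the text above; the surrounding loop shape is assumed):

```cpp
cudaKernelNodeParams kparams{};
cudaError_t st = cudaGraphKernelNodeGetParams(src_node, &kparams);
if (st != cudaSuccess || kparams.func == nullptr) {
    // Driver-API kernel nodes (CUkernel handles) come back with func == nullptr and
    // leave cudaErrorInvalidDeviceFunction (98) as the sticky last error. Benign here:
    // the node simply is not in the PDL registry, so clear the sticky error and skip.
    (void)cudaGetLastError();
    continue;   // next edge
}
```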

Verified: 0 stale-error warnings across 5 chat completions on Mistral-3.2
(was 2/request before).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Engine::warmup() runs a synthetic BOS-padded forward at engine init to
prime cuBLAS handles, L2 cache, and CUDA-graph capture paths. Default was
true, but it pollutes engine state in ways that survive its own forward
pass and visibly degrade the very first user request — most clearly on
Mistral-3.2-NVFP4 where the first generated answer was '111111111...'
followed by clean output on requests 2+.

Bisect (in this session, with all other optimizations off):
- naked path (warmup=off, graphs=never, PDL=off, deterministic_gemm=on):
  first answer clean ✅
- graphs+PDL on, warmup=off: first answer clean ✅
- warmup=on, graphs+PDL off: first answer = '111111111111111111111111' ❌

So warmup itself is the trigger. Root cause not isolated; the existing
reset_kv_calibration() (added in PR #89 for FP8 KV high-water-mark)
covers FP8 KV but not the analog state for NVFP4-decode / cuBLAS algo
cache / L2 persist policy that warmup with synthetic inputs poisons.

The flag is intended for prod rollouts where first-request TTFT matters
(saves ~200ms of cuBLAS heuristic on the first chat). Dev / CI / one-shot
inference doesn't benefit, and the first-request degeneration was masking
a real Mistral output bug. Flip the default; explicit opt-in via
`--set runtime.warmup=true` for prod.

Side-effect: Qwen3-Coder graph-replay determinism improved 16/32 → 23/32
post-flip — warmup was also subtly corrupting downstream graph state on
that model. Doesn't fix the broader MoE-NVFP4 graph non-determinism
(separate engine class) but moves the needle the right way.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a re-runnable Python harness that exercises imp-server against every
pre-quantized NVFP4 SafeTensors model on disk, runs a 20-prompt battery,
checks graph-replay byte-identicality, degeneracy gates, and 3-run
determinism. Mode A (reduced scope): no BF16 reference, no NVFP4
calibration — imp consumes pre-quantized weights only and there is no
BF16 execution path to compare against.

Files:
- scripts/validate_safetensors.py — harness, stdlib-only, spawns imp-server
  per model in docker, hits /v1/chat/completions with logprobs, checks
  output against per-prompt rules (regex / contains / json-schema /
  primary-color sets / sequence-1-to-N / 4-gram repetition / etc.).
- scripts/consolidate_validation_report.py — reads each model's
  report.json and writes the top-level MODEL_VALIDATION_REPORT.md +
  MODEL_VALIDATION_SUMMARY.csv with executive summary.
- tests/fixtures/{battery_prompts.json, tokenizer_roundtrip.txt} — input
  fixtures.
- validation_artifacts/<model>/{report.json, server.log} — per-model
  artifacts from the latest run; refresh by re-running the harness.
- MODEL_VALIDATION_REPORT.md / MODEL_VALIDATION_SUMMARY.csv — current
  state covering 5 models.

Models validated:
- Mistral-Small-3.2-24B-Instruct-2506-NVFP4 (llm-compressor + SmoothQuant)
- Gemma-4-26B-A4B-it-NVFP4 (llm-compressor)
- Qwen3.6-35B-A3B-NVFP4 (llm-compressor, GDN+MoE hybrid)
- Qwen3-Coder-30B-A3B-Instruct-FP4 (Modelopt)
- nvidia/Qwen3-30B-A3B-NVFP4 (NVIDIA Modelopt format, downloaded this session)

Headline finding: nvidia/Qwen3-30B-A3B-NVFP4 is a clean drop-in for the
broken Mistral-3.2-NVFP4. Killer-test: 50× Lorem-Ipsum prefix + "The
capital of France is" produces "Paris. The capital of Germany is Berlin.
The capital of Spain is Madrid…" (Mistral produces "elit dolor elit
dolor elit dolor…"). 1024-token creative generation 4-gram repetition
rate 1.4% (Mistral 95.7%).
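
For reference, the 4-gram repetition rate quoted above can be sketched like this (the harness itself is Python; this C++ version and its normalisation, repeated 4-grams over total 4-grams, are illustrative assumptions):

```cpp
#include <string>
#include <unordered_map>
#include <vector>

double four_gram_repetition_rate(const std::vector<std::string>& tokens) {
    if (tokens.size() < 4) return 0.0;
    std::unordered_map<std::string, int> counts;
    size_t total = tokens.size() - 3;                     // number of 4-gram windows
    for (size_t i = 0; i < total; ++i) {
        std::string key = tokens[i] + "\x1f" + tokens[i + 1] + "\x1f" +
                          tokens[i + 2] + "\x1f" + tokens[i + 3];
        ++counts[key];
    }
    size_t repeated = 0;
    for (const auto& kv : counts)
        if (kv.second > 1) repeated += kv.second;         // all occurrences of repeated 4-grams
    return double(repeated) / double(total);              // high for "elit dolor elit dolor..."
}
```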

Mistral-3.2-NVFP4 long-context regression was root-caused this session
to upstream SmoothQuant 0.9 calibration loss — direct dump of L0 q_proj
shows 45% of dequanted weight values are exactly 0, 97.8% of NVFP4
micro-blocks dominated by an outlier K-channel. Not fixable in imp;
realistic solution is the Qwen3-30B-A3B-NVFP4 replacement validated by
this harness.

Re-run any time: python3 scripts/validate_safetensors.py [--model NAME]
[--smoke]; then python3 scripts/consolidate_validation_report.py.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
kekzl enabled auto-merge (squash) May 2, 2026 10:26
kekzl merged commit c5c2391 into main May 2, 2026
2 checks passed
kekzl deleted the fix/llmcompressor-nvfp4-skip-cutlass branch May 2, 2026 10:32