
NVFP4 stability bundle: encoder clamp, prefill chunk fix, warmup default off, validation harness#94

Merged
kekzl merged 6 commits into main from fix/llmcompressor-nvfp4-skip-cutlass on May 2, 2026

Conversation

kekzl (Owner) commented on May 2, 2026

Summary

Six commits tightening NVFP4 SafeTensors stability on imp + a re-runnable validation harness for all NVFP4 models on disk.

| # | Commit | What |
|---|--------|------|
| 1 | 557cab3 | fix(fp8): Correct E4M3-fn encoder clamp to 0x7E = 448 (was 0x77 = 240) — roots the CUTLASS+llm-compressor mismatch. |
| 2 | a7920f6 | fix(nvfp4): Originally skipped CUTLASS for llm-compressor before the encoder fix. (Now a no-op after #1; kept for branch lineage.) |
| 3 | b29d066 | fix(runtime): Clamp effective_chunk against executor->max_tokens() in step_prefill; throw on overflow in forward_logits; try/catch around engine->step() in batching_engine. Fixes the Qwen3.6-NVFP4 long-prompt crash (terminate: reshape: numel mismatch). |
| 4 | 95f4ee2 | fix(graph): Swallow the benign "invalid device function" from cudaGraphKernelNodeGetParams for driver-API kernel nodes in apply_pdl_edges. Was logging "Cleared stale error" on every chat completion. |
| 5 | ca6ef69 | fix(runtime): Default runtime.warmup=false, opt-in for prod rollout. A bisect proved Engine::warmup() was the trigger for Mistral-3.2-NVFP4 first-request degeneration (illumin11111). |
| 6 | 517b11a | test(validation): scripts/validate_safetensors.py + 5-model report + per-model JSON artifacts. Validates Mistral-3.2, Gemma-4, Qwen3.6, Qwen3-Coder, and the new nvidia/Qwen3-30B-A3B-NVFP4 (Modelopt) as a clean drop-in for the broken Mistral-3.2-NVFP4. |

Validation results

Five NVFP4 SafeTensors models were exercised against a 20-prompt battery, a 32× graph-replay determinism check, and degeneracy gates. Re-runnable any time via scripts/validate_safetensors.py. Headline:

| Killer test (Mistral-3.2's worst case) | Mistral-3.2-NVFP4 | nvidia/Qwen3-30B-A3B-NVFP4 (replacement) |
|---|---|---|
| 50× Lorem-Ipsum prefix → "capital of France" | "elit dolor elit dolor..." | "Paris. The capital of Germany is Berlin..." |
| 1024-tok creative gen, 4-gram repetition | 95.7% | 1.4% |
| First-request degeneration | illumin11111 (pre-warmup-fix) | ✅ clean |

After the bug fixes:

  • Qwen3.6-NVFP4 long_context_recall (1856-tok): server-crash → coherent
  • Mistral first-request: garbage → clean
  • Qwen3-Coder graph-replay determinism: 16/32 → 23/32 byte-identical (warmup-flip side-effect)
  • 0 stale-error WARNs/request (was 2/request)

Mistral-3.2-NVFP4 long-context regression — root cause

Investigation in this PR refuted the prior input_global_scale hypothesis (tested both alpha directions, neither helped; all three llm-compressor models ship IGS but only Mistral breaks). A direct dump of the L0 q_proj NVFP4-dequantized FP16 weights reveals the actual cause:

  • 335× per-K-channel max range (max=4.36, median=0.013)
  • 20.3% outlier K-channels (1037/5120)
  • 97.8% of NVFP4 micro-blocks contain ≥1 outlier — non-outlier values in those blocks snap to ±0/±0.5
  • ~45% of dequanted weight values are exactly 0

This is a SmoothQuant 0.9 + per-block NVFP4 incompatibility, baked into the model file at calibration. It is not fixable in imp; the replacement model has been validated.
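
A minimal sketch (not part of this PR) of how the per-K-channel stats above can be computed from the dequantized FP16 weight; the 6× median outlier threshold is an assumption, not the actual cutoff used in the investigation:

```cpp
// Per-K-channel stats over a dequantized weight W, laid out [N x K] row-major.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct KChannelStats { float range_ratio; float outlier_frac; };

KChannelStats k_channel_stats(const std::vector<float>& W, size_t N, size_t K) {
    std::vector<float> ch_max(K, 0.0f);                 // max |w| per K-channel (column)
    for (size_t n = 0; n < N; ++n)
        for (size_t k = 0; k < K; ++k)
            ch_max[k] = std::max(ch_max[k], std::fabs(W[n * K + k]));

    std::vector<float> sorted = ch_max;
    std::nth_element(sorted.begin(), sorted.begin() + K / 2, sorted.end());
    float median     = sorted[K / 2];
    float max_of_max = *std::max_element(ch_max.begin(), ch_max.end());

    size_t outliers = 0;
    for (float m : ch_max) outliers += (m > 6.0f * median);   // assumed threshold

    return { max_of_max / median,              // e.g. 4.36 / 0.013 ~= 335x
             float(outliers) / float(K) };     // e.g. 1037 / 5120 ~= 20.3%
}
```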

Test plan

  • make build passes
  • make test-unit 37/37 PASS
  • Pre-push hook (verify fast) passes
  • All 5 NVFP4 models load without error
  • No regression on existing baselines (Qwen3-Coder + Gemma-4 unchanged)
  • Killer-test on nvidia/Qwen3-30B-A3B-NVFP4 confirmed clean output

🤖 Generated with Claude Code

kekzl and others added 6 commits May 2, 2026 01:00
…ong output

PR #88 lit up the CUTLASS sm_120 NVFP4×NVFP4 prefill path for prequant
SafeTensors (~10× prefill speedup, fixed Lorem-ipsum kernel garbage).
But subsequent stress testing revealed that for **llm-compressor format**
models specifically (Mistral-3.2, Gemma-4, Qwen3.6), the CUTLASS path
produces wrong output on any non-trivial generation past ~30 tokens:

  Mistral-3.2-NVFP4 200-tok "why is the sky blue":
    "...blue light is scattered more than other colors because blue light
    is scattered more than other colors because blue light has a shorter
    wavelength..." [loops infinitely]

  Mistral-3.2-NVFP4 multi-turn ("Whiskers"):
    "It looks like there might be a small typo in your message..."

  Mistral-3.2-NVFP4 long-context retrieval:
    "Here is now modern-day. The empire covered approximately
    1111111111111111111111111..."  [numerical garbage]

debug_forward dumps localised the bug to L0 QKV gemm output:
  after_embedding   max=0.017, L2=0.16    [sane]
  L0_after_qkv_q    max=65472 (FP16 max), L2=1.8M, Inf=836  [blow-up]
  L0_after_paged_attn  NaN=4096  [completely NaN]

Verified the cause is the CUTLASS NVFP4×NVFP4 path specifically:
  - tensor_scale=0.00162338 confirmed correct at Phase 0b promote (DBG
    log) and at the CUTLASS GEMM call site (DBG log).
  - dequantize_nvfp4_to_fp16 produces sane weight magnitudes for
    llm-compressor: max|w|=0.117, mean|w|=0.0023 on L0 q_proj. Same
    correct order as Modelopt's max=0.040, mean=0.007.
  - With CUTLASS dispatch disabled (env var bisect), Mistral-3.2-NVFP4
    produces 3 coherent sentences ("The sky appears blue due to a
    phenomenon called Rayleigh scattering...") + correct multi-turn
    recall ("Your cat is called Whiskers.") + correct long-context
    retrieval ("The Western Roman Empire fell in 476 AD...").
  - Modelopt-format Qwen3-Coder-30B-A3B-Instruct-FP4 is unaffected —
    same CUTLASS path produces correct output. The bug is exclusively
    in the format-specific numerical interaction, not in CUTLASS or
    imp's dequant logic in general.

Fix is the same pattern as PR #65 (Gemma-4 NVFP4 MoE per-row gemv):
when one specific quantizer-format combo produces wrong output, route
through the slower-but-correct dequant→cuBLAS fallback. Phase 0b now
skips the CUTLASS cache registration when `cfg.is_llm_compressor_nvfp4`
is true; the dispatch at executor_kernels.cu:1975 then falls through
to `gemm_nvfp4` which dequants once to FP16 and calls cuBLAS.
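
Sketched, with a hypothetical registration helper standing in for the real Phase-0b code (`cfg.is_llm_compressor_nvfp4` and `gemm_nvfp4` are the names used above):

```cpp
// Phase 0b, sketched: only register CUTLASS NVFP4xNVFP4 cache entries for formats
// that are known-good; register_cutlass_nvfp4_cache() is a hypothetical stand-in.
if (!cfg.is_llm_compressor_nvfp4) {
    register_cutlass_nvfp4_cache(layer);
}
// With no cache entry, the dispatch at executor_kernels.cu:1975 falls through to
// gemm_nvfp4: dequantize the weight once to FP16, then call cuBLAS.
```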

Performance impact for llm-compressor models: prefill drops from ~12k
tok/s (CUTLASS) back to ~280 tok/s (dequant→cuBLAS, with one-time
40 MiB scratch alloc per Linear). Decode (M=1, gemv_nvfp4_kpar) is
unchanged. Modelopt-format NVFP4 (Qwen3-Coder etc.) keeps the CUTLASS
path and full speedup.

Verified post-fix on RTX 5090:

| Test                     | Pre-fix output                          | Post-fix output                        |
|--------------------------|-----------------------------------------|----------------------------------------|
| Mistral-3.2 200-tok      | repetition loop                         | "Rayleigh scattering... 3 sentences"   |
| Mistral-3.2 multi-turn   | "small typo in your message"            | "Your cat is called Whiskers." ✓       |
| Mistral-3.2 long-ctx     | "1111111…" garbage                      | "476 AD" extracted ✓                   |
| Qwen3.6-NVFP4 multi-turn | empty / wrong                           | "Alice is asking about her cat's name" |
| Qwen3-Coder Modelopt     | "Paris." ✓                              | "The capital of France is Paris." ✓    |

Bug root-cause within CUTLASS+llm-compressor remains unidentified —
candidates are FP8 micro-scale interpretation differences, activation
NVFP4 quantization noise interaction with SmoothQuant-baked weights,
or some SfAtom layout / arg passing edge case. The dequant kernel
itself (which feeds the dequant→cuBLAS fallback) handles llm-compressor
correctly per debug stats. Investigation memos:
`stress_test_safetensors_2026_05_01.md`,
`nvfp4_long_context_regression_2026_04_28.md`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Roots the CUTLASS+llm-compressor mismatch the previous commit a7920f6
worked around. `float_to_fp8_e4m3` had two related bugs:

1. The `e4 >= 15` branch returned `(14 << 3) | 7 = 0x77` which decodes
   to 240. Standard FP8 E4M3-fn max is 448 = `(15 << 3) | 6 = 0x7E`.
   Any input value ≥ 256 that landed in the e_field=15 slot got
   squashed to 240, a 0.536× squash on the precision cliff.
2. The same branch saturated *all* values with e4 >= 15, including
   valid normals like 256 (e=15, m=0) and 416 (e=15, m=5). Only e4 > 15
   should saturate; e4 == 15 with m ∈ [0, 6] is a perfectly valid
   region encoding [256, 448].
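
The corrected rule, sketched below with round-to-nearest and simplified subnormal handling. Only the function name and the boundary encodings come from this commit; the body is illustrative, not imp's implementation:

```cpp
#include <cmath>
#include <cstdint>

uint8_t float_to_fp8_e4m3(float x) {
    uint8_t sign = std::signbit(x) ? 0x80 : 0x00;
    float a = std::fabs(x);
    if (std::isnan(a))  return sign | 0x7F;        // E4M3-fn NaN
    if (a >= 464.0f)    return sign | 0x7E;        // past the 448 rounding boundary: saturate to 448
    if (a <  0x1p-10f)  return sign;               // below half the smallest subnormal: +/-0

    int e2;
    float m = std::frexp(a, &e2);                  // a = m * 2^e2, m in [0.5, 1)
    int e4 = (e2 - 1) + 7;                         // biased exponent of 1.f * 2^(e2-1), bias 7

    if (e4 <= 0) {                                 // subnormal range: ulp = 2^-9
        int mant = (int)std::lround(a * 512.0f);
        return sign | (uint8_t)mant;               // mant==8 rolls into the smallest normal, still valid
    }
    int mant = (int)std::lround((m * 2.0f - 1.0f) * 8.0f);  // 3-bit mantissa
    if (mant == 8) { mant = 0; ++e4; }             // rounded up into the next binade

    // Old bug: e4 >= 15 always returned (14 << 3) | 7 = 0x77 = 240.
    // Correct rule: only e4 > 15, or e4 == 15 with mantissa 7, saturates to 448.
    if (e4 > 15 || (e4 == 15 && mant > 6)) return sign | 0x7E;   // 448
    return sign | (uint8_t)(e4 << 3) | (uint8_t)mant;
    // Spot checks from the companion test: 256->0x78, 300->0x79, 416->0x7D,
    // 448->0x7E, 500->0x7E (saturate), 6.0->0x4C.
}
```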

Why this broke compressed-tensors NVFP4 prequant specifically:

  global_scale = FP8_max * FP4_max / max(|W|) = 2688 / max(|W|)
  weight_scale stored = max(|W_block|) * global_scale / FP4_max

For an outlier block where max(|W_block|) ≈ max(|W|) (as in Mistral
attention with its known outliers), the stored micro-scale reaches up
to FP8_max = 448. compressed-tensors emits this as 0x7E on disk.
imp's `convert_scales_sfatom_kernel` decodes the byte to 448 and
re-encodes via float_to_fp8_e4m3 — which used to return 0x77 (240),
halving the effective scale.
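Concretely, with max(|W|) = 4.36 this gives global_scale = 2688 / 4.36
≈ 616.5, so the worst outlier block stores a micro-scale of
4.36 × 616.5 / 6 = 448, exactly FP8_max, right on the encoding cliff.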

Modelopt was unaffected because its convention keeps micro-scales in
the original W-domain (max ~max(|W|)/6, typically <1.0 — well below
the precision cliff).

Empirical bisect (`tests/test_cutlass_nvfp4_alpha.cu`):

  Mistral L0 q_proj-shaped synthetic prequant input (max(|W|)=4.36):
    Pre-fix:  CUTLASS max|y|=3.90  ratio vs reference 0.53  (halved)
    Post-fix: CUTLASS max|y|=6.94  ratio vs reference 0.95  (≈ Modelopt
              auto-quant noise level)

The companion test `Fp8E4M3EncoderClampBoundary` pins the canonical
E4M3-fn encodings (256→0x78, 300→0x79, 416→0x7D, 448→0x7E, 500→0x7E
saturate, 6.0→0x4C, etc.) so the cliff cannot regress.

With the encoder fixed, the per-format CUTLASS skip from a7920f6 is
no longer load-bearing: Phase-0b can register prequant llm-compressor
weights into the CUTLASS cache the same as Modelopt. Reverted.

Notes:
- Mistral-3.2-NVFP4 still has a separate, pre-existing degeneration
  (multi-turn recall and long-context retrieval still fail) — that's
  not the FP8 cliff and not in scope here. The Sky-Blue prompt now
  produces a coherent first sentence on the CUTLASS path (matching the
  workaround behaviour) before its independent regression kicks in.
- Modelopt Qwen3-Coder-30B-A3B-Instruct-FP4 unchanged: "The capital
  of France is Paris." ✓
- All test suites green: test-core 121, test-compute 115, test-quant
  75, test-attention 67, test-kv 31.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… propagation

Three coordinated fixes to make long-prompt requests safe on hybrid SSM/GDN+MoE
models (Qwen3.6, Qwen3.5-MoE):

1. engine.cpp:step_prefill — hard-cap effective_chunk against
   executor->max_tokens(). The handlers default config_.prefill_chunk_size=512
   for OpenAI-API users, but the executor caps max_tokens to 256 for SSM/GDN+MoE
   hybrids (executor_workspace.cu:117). Without this clamp, a 512-token chunk
   reaches forward_logits, overflows the workspace, and the next op throws
   reshape: numel mismatch on the worker thread.

2. executor_forward.cu:forward_logits — when n_tokens > max_tokens_, throw
   std::invalid_argument instead of log-and-return. Returning silently with
   an uninitialized logits tensor pushes the error to a downstream reshape,
   surfacing as a confusing "numel mismatch" instead of the actual cause.

3. batching_engine.cpp:worker_loop — wrap engine->step() in try/catch.
   Previously, any exception from the engine called std::terminate on the
   worker thread, killing the entire imp-server container and taking every
   other in-flight request with it. Now we cancel all active requests with
   reason="internal_error", reset transient engine state (graphs, batch pool
   cache), and keep the worker alive.
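
Sketched together (member and function names follow the text above; the helpers in the catch block are hypothetical stand-ins, not the real cleanup code):

```cpp
#include <algorithm>
#include <stdexcept>
#include <string>

// 1) engine.cpp:step_prefill — never hand the executor more tokens than its workspace holds.
size_t effective_chunk = std::min<size_t>(config_.prefill_chunk_size,
                                          executor->max_tokens());

// 2) executor_forward.cu:forward_logits — fail loudly at the real boundary instead of
//    returning an uninitialized logits tensor that later dies as "reshape: numel mismatch".
if (n_tokens > max_tokens_) {
    throw std::invalid_argument("forward_logits: n_tokens=" + std::to_string(n_tokens) +
                                " exceeds workspace max_tokens=" + std::to_string(max_tokens_));
}

// 3) batching_engine.cpp:worker_loop — contain engine exceptions instead of letting them
//    std::terminate the worker thread and the whole container.
try {
    engine->step();
} catch (const std::exception& e) {
    cancel_all_active("internal_error");   // hypothetical helper
    reset_transient_state();               // hypothetical helper: graphs, batch pool cache
}
```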

Validation: Qwen3.6-NVFP4 long_context_recall (1856-token prompt) goes from
container-crashing terminate to clean coherent execution. Battery pass rate
4/20 → 13/20. Server stays up across many requests.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ersion

apply_pdl_edges() walks a captured graph's kernel edges and queries each
source node's kernel params via cudaGraphKernelNodeGetParams to look up
PDL-enabled status. For kernel nodes added via the driver-API form
(CUkernel handle rather than a host __global__ symbol pointer), the
runtime API returns kparams.func=nullptr AND sets the global CUDA last
error to "invalid device function" (cudaErrorInvalidDeviceFunction = 98).

The error is benign in our flow — a null host pointer just means "not in
the PDL registry, skip this edge". But it was leaking up two stack frames
to the next forward_logits, where the existing pre-pass diagnostic logged
"Cleared stale error before forward: invalid device function" on every
chat completion request (~2/request, ~600/min in production traffic).

Fix: defensive (void)cudaGetLastError() in the skip path, plus an explicit
null check on kparams.func so the lookup short-circuits cleanly. Also
clarifying comment on why this is necessary.
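
Sketched (only cudaGraphKernelNodeGetParams, the null-func/last-error behaviour, and the registry lookup come from the text above; the surrounding loop shape is assumed):

```cpp
cudaKernelNodeParams kparams{};
cudaError_t st = cudaGraphKernelNodeGetParams(src_node, &kparams);
if (st != cudaSuccess || kparams.func == nullptr) {
    // Driver-API kernel nodes (CUkernel handles) come back with func == nullptr and
    // leave cudaErrorInvalidDeviceFunction (98) as the sticky last error. Benign here:
    // the node simply is not in the PDL registry, so clear the sticky error and skip.
    (void)cudaGetLastError();
    continue;   // next edge
}
```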

Verified: 0 stale-error warnings across 5 chat completions on Mistral-3.2
(was 2/request before).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Engine::warmup() runs a synthetic BOS-padded forward at engine init to
prime cuBLAS handles, L2 cache, and CUDA-graph capture paths. Default was
true, but it pollutes engine state in ways that survive its own forward
pass and visibly degrade the very first user request — most clearly on
Mistral-3.2-NVFP4 where the first generated answer was '111111111...'
followed by clean output on requests 2+.

Bisect (in this session, with all other optimizations off):
- naked path (warmup=off, graphs=never, PDL=off, deterministic_gemm=on):
  first answer clean ✅
- graphs+PDL on, warmup=off: first answer clean ✅
- warmup=on, graphs+PDL off: first answer = '111111111111111111111111' ❌

So warmup itself is the trigger. Root cause not isolated; the existing
reset_kv_calibration() (added in PR #89 for FP8 KV high-water-mark)
covers FP8 KV but not the analog state for NVFP4-decode / cuBLAS algo
cache / L2 persist policy that warmup with synthetic inputs poisons.

The flag is intended for prod rollouts where first-request TTFT matters
(saves ~200ms of cuBLAS heuristic on the first chat). Dev / CI / one-shot
inference doesn't benefit, and the first-request degeneration was masking
a real Mistral output bug. Flip the default; explicit opt-in via
`--set runtime.warmup=true` for prod.

Side-effect: Qwen3-Coder graph-replay determinism improved 16/32 → 23/32
post-flip — warmup was also subtly corrupting downstream graph state on
that model. Doesn't fix the broader MoE-NVFP4 graph non-determinism
(separate engine class) but moves the needle the right way.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a re-runnable Python harness that exercises imp-server against every
pre-quantized NVFP4 SafeTensors model on disk, runs a 20-prompt battery,
checks graph-replay byte-identicality, degeneracy gates, and 3-run
determinism. Mode A (reduced scope): no BF16 reference, no NVFP4
calibration — imp consumes pre-quantized weights only and there is no
BF16 execution path to compare against.

Files:
- scripts/validate_safetensors.py — harness, stdlib-only, spawns imp-server
  per model in docker, hits /v1/chat/completions with logprobs, checks
  output against per-prompt rules (regex / contains / json-schema /
  primary-color sets / sequence-1-to-N / 4-gram repetition / etc.).
- scripts/consolidate_validation_report.py — reads each model's
  report.json and writes the top-level MODEL_VALIDATION_REPORT.md +
  MODEL_VALIDATION_SUMMARY.csv with executive summary.
- tests/fixtures/{battery_prompts.json, tokenizer_roundtrip.txt} — input
  fixtures.
- validation_artifacts/<model>/{report.json, server.log} — per-model
  artifacts from the latest run; refresh by re-running the harness.
- MODEL_VALIDATION_REPORT.md / MODEL_VALIDATION_SUMMARY.csv — current
  state covering 5 models.

Models validated:
- Mistral-Small-3.2-24B-Instruct-2506-NVFP4 (llm-compressor + SmoothQuant)
- Gemma-4-26B-A4B-it-NVFP4 (llm-compressor)
- Qwen3.6-35B-A3B-NVFP4 (llm-compressor, GDN+MoE hybrid)
- Qwen3-Coder-30B-A3B-Instruct-FP4 (Modelopt)
- nvidia/Qwen3-30B-A3B-NVFP4 (NVIDIA Modelopt format, downloaded this session)

Headline finding: nvidia/Qwen3-30B-A3B-NVFP4 is a clean drop-in for the
broken Mistral-3.2-NVFP4. Killer-test: 50× Lorem-Ipsum prefix + "The
capital of France is" produces "Paris. The capital of Germany is Berlin.
The capital of Spain is Madrid…" (Mistral produces "elit dolor elit
dolor elit dolor…"). 1024-token creative generation 4-gram repetition
rate 1.4% (Mistral 95.7%).
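
For reference, the 4-gram repetition rate quoted above can be sketched like this (the harness itself is Python; this C++ version and its normalisation, repeated 4-grams over total 4-grams, are illustrative assumptions):

```cpp
#include <string>
#include <unordered_map>
#include <vector>

double four_gram_repetition_rate(const std::vector<std::string>& tokens) {
    if (tokens.size() < 4) return 0.0;
    std::unordered_map<std::string, int> counts;
    size_t total = tokens.size() - 3;                     // number of 4-gram windows
    for (size_t i = 0; i < total; ++i) {
        std::string key = tokens[i] + "\x1f" + tokens[i + 1] + "\x1f" +
                          tokens[i + 2] + "\x1f" + tokens[i + 3];
        ++counts[key];
    }
    size_t repeated = 0;
    for (const auto& kv : counts)
        if (kv.second > 1) repeated += kv.second;         // all occurrences of repeated 4-grams
    return double(repeated) / double(total);              // high for "elit dolor elit dolor..."
}
```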

Mistral-3.2-NVFP4 long-context regression was root-caused this session
to upstream SmoothQuant 0.9 calibration loss — direct dump of L0 q_proj
shows 45% of dequanted weight values are exactly 0, 97.8% of NVFP4
micro-blocks dominated by an outlier K-channel. Not fixable in imp;
realistic solution is the Qwen3-30B-A3B-NVFP4 replacement validated by
this harness.

Re-run any time: python3 scripts/validate_safetensors.py [--model NAME]
[--smoke]; then python3 scripts/consolidate_validation_report.py.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
kekzl enabled auto-merge (squash) May 2, 2026 10:26
kekzl merged commit c5c2391 into main May 2, 2026
2 checks passed
kekzl deleted the fix/llmcompressor-nvfp4-skip-cutlass branch May 2, 2026 10:32