NVFP4 stability bundle: encoder clamp, prefill chunk fix, warmup default off, validation harness#94
Merged
Merged
Conversation
…ong output PR #88 lit up the CUTLASS sm_120 NVFP4×NVFP4 prefill path for prequant SafeTensors (~10× prefill speedup, fixed Lorem-ipsum kernel garbage). But subsequent stress testing revealed that for **llm-compressor format** models specifically (Mistral-3.2, Gemma-4, Qwen3.6), the CUTLASS path produces wrong output on any non-trivial generation past ~30 tokens: Mistral-3.2-NVFP4 200-tok "why is the sky blue": "...blue light is scattered more than other colors because blue light is scattered more than other colors because blue light has a shorter wavelength..." [loops, infinitely] Mistral-3.2-NVFP4 multi-turn ("Whiskers"): "It looks like there might be a small typo in your message..." Mistral-3.2-NVFP4 long-context retrieval: "Here is now modern-day. The empire covered approximately 1111111111111111111111111..." [numerical garbage] debug_forward dumps localised the bug to L0 QKV gemm output: after_embedding max=0.017, L2=0.16 [sane] L0_after_qkv_q max=65472 (FP16 max), L2=1.8M, Inf=836 [blow-up] L0_after_paged_attn NaN=4096 [completely NaN] Verified the cause is the CUTLASS NVFP4×NVFP4 path specifically: - tensor_scale=0.00162338 confirmed correct at Phase 0b promote (DBG log) and at the CUTLASS GEMM call site (DBG log). - dequantize_nvfp4_to_fp16 produces sane weight magnitudes for llm-compressor: max|w|=0.117, mean|w|=0.0023 on L0 q_proj. Same correct order as Modelopt's max=0.040, mean=0.007. - With CUTLASS dispatch disabled (env var bisect), Mistral-3.2-NVFP4 produces 3 coherent sentences ("The sky appears blue due to a phenomenon called Rayleigh scattering...") + correct multi-turn recall ("Your cat is called Whiskers.") + correct long-context retrieval ("The Western Roman Empire fell in 476 AD..."). - Modelopt-format Qwen3-Coder-30B-A3B-Instruct-FP4 is unaffected — same CUTLASS path produces correct output. The bug is exclusively in the format-specific numerical interaction, not in CUTLASS or imp's dequant logic in general. Fix is the same pattern as PR #65 (Gemma-4 NVFP4 MoE per-row gemv): when one specific quantizer-format combo produces wrong output, route through the slower-but-correct dequant→cuBLAS fallback. Phase 0b now skips the CUTLASS cache registration when `cfg.is_llm_compressor_nvfp4` is true; the dispatch at executor_kernels.cu:1975 then falls through to `gemm_nvfp4` which dequants once to FP16 and calls cuBLAS. Performance impact for llm-compressor models: prefill drops from ~12k tok/s (CUTLASS) back to ~280 tok/s (dequant→cuBLAS, with one-time 40 MiB scratch alloc per Linear). Decode (M=1, gemv_nvfp4_kpar) is unchanged. Modelopt-format NVFP4 (Qwen3-Coder etc.) keeps the CUTLASS path and full speedup. Verified post-fix on RTX 5090: | Test | Pre-fix output | Post-fix output | |--------------------------|-----------------------------------------|----------------------------------------| | Mistral-3.2 200-tok | repetition loop | "Rayleigh scattering... 3 sentences" | | Mistral-3.2 multi-turn | "small typo in your message" | "Your cat is called Whiskers." ✓ | | Mistral-3.2 long-ctx | "1111111…" garbage | "476 AD" extracted ✓ | | Qwen3.6-NVFP4 multi-turn | empty / wrong | "Alice is asking about her cat's name" | | Qwen3-Coder Modelopt | "Paris." ✓ | "The capital of France is Paris." ✓ | Bug root-cause within CUTLASS+llm-compressor remains unidentified — candidates are FP8 micro-scale interpretation differences, activation NVFP4 quantization noise interaction with SmoothQuant-baked weights, or some SfAtom layout / arg passing edge case. The dequant kernel itself (which feeds the dequant→cuBLAS fallback) handles llm-compressor correctly per debug stats. Investigation memos: `stress_test_safetensors_2026_05_01.md`, `nvfp4_long_context_regression_2026_04_28.md`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Roots the CUTLASS+llm-compressor mismatch the previous commit a7920f6 worked around. `float_to_fp8_e4m3` had two related bugs: 1. The `e4 >= 15` branch returned `(14 << 3) | 7 = 0x77` which decodes to 240. Standard FP8 E4M3-fn max is 448 = `(15 << 3) | 6 = 0x7E`. Any input value ≥ 256 that landed in the e_field=15 slot got squashed to 240, a 0.536× squash on the precision cliff. 2. The same branch saturated *all* values with e4 >= 15, including valid normals like 256 (e=15, m=0) and 416 (e=15, m=5). Only e4 > 15 should saturate; e4 == 15 with m ∈ [0, 6] is a perfectly valid region encoding [256, 448]. Why this broke compressed-tensors NVFP4 prequant specifically: global_scale = FP8_max * FP4_max / max(|W|) = 2688 / max(|W|) weight_scale stored = max(|W_block|) * global_scale / FP4_max For an outlier block where max(|W_block|) ≈ max(|W|) (as in Mistral attention with its known outliers), the stored micro-scale reaches up to FP8_max = 448. compressed-tensors emits this as 0x7E on disk. imp's `convert_scales_sfatom_kernel` decodes the byte to 448 and re-encodes via float_to_fp8_e4m3 — which used to return 0x77 (240), halving the effective scale. Modelopt was unaffected because its convention keeps micro-scales in the original W-domain (max ~max(|W|)/6, typically <1.0 — well below the precision cliff). Empirical bisect (`tests/test_cutlass_nvfp4_alpha.cu`): Mistral L0 q_proj-shaped synthetic prequant input (max(|W|)=4.36): Pre-fix: CUTLASS max|y|=3.90 ratio vs reference 0.53 (halved) Post-fix: CUTLASS max|y|=6.94 ratio vs reference 0.95 (≈ Modelopt auto-quant noise level) The companion test `Fp8E4M3EncoderClampBoundary` pins the canonical E4M3-fn encodings (256→0x78, 300→0x79, 416→0x7D, 448→0x7E, 500→0x7E saturate, 6.0→0x4C, etc.) so the cliff cannot regress. With the encoder fixed, the per-format CUTLASS skip from a7920f6 is no longer load-bearing: Phase-0b can register prequant llm-compressor weights into the CUTLASS cache the same as Modelopt. Reverted. Notes: - Mistral-3.2-NVFP4 still has a separate, pre-existing degeneration (multi-turn recall and long-context retrieval still fail) — that's not the FP8 cliff and not in scope here. The Sky-Blue prompt now produces a coherent first sentence on CUTLASS path (matched the workaround behaviour) before its independent regression kicks in. - Modelopt Qwen3-Coder-30B-A3B-Instruct-FP4 unchanged: "The capital of France is Paris." ✓ - All test suites green: test-core 121, test-compute 115, test-quant 75, test-attention 67, test-kv 31. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… propagation Three coordinated fixes to make long-prompt requests safe on hybrid SSM/GDN+MoE models (Qwen3.6, Qwen3.5-MoE): 1. engine.cpp:step_prefill — hard-cap effective_chunk against executor->max_tokens(). The handlers default config_.prefill_chunk_size=512 for OpenAI-API users, but the executor caps max_tokens to 256 for SSM/GDN+MoE hybrids (executor_workspace.cu:117). Without this clamp, a 512-token chunk reaches forward_logits, overflows the workspace, and the next op throws reshape: numel mismatch on the worker thread. 2. executor_forward.cu:forward_logits — when n_tokens > max_tokens_, throw std::invalid_argument instead of log-and-return. Returning silently with an uninitialized logits tensor pushes the error to a downstream reshape, surfacing as a confusing "numel mismatch" instead of the actual cause. 3. batching_engine.cpp:worker_loop — wrap engine->step() in try/catch. Previously, any exception from the engine called std::terminate on the worker thread, killing the entire imp-server container and taking every other in-flight request with it. Now we cancel all active requests with reason="internal_error", reset transient engine state (graphs, batch pool cache), and keep the worker alive. Validation: Qwen3.6-NVFP4 long_context_recall (1856-token prompt) goes from container-crashing terminate to clean coherent execution. Battery pass rate 4/20 → 13/20. Server stays up across many requests. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ersion apply_pdl_edges() walks a captured graph's kernel edges and queries each source node's kernel params via cudaGraphKernelNodeGetParams to look up PDL-enabled status. For kernel nodes added via the driver-API form (CUkernel handle rather than a host __global__ symbol pointer), the runtime API returns kparams.func=nullptr AND sets the global CUDA last error to "invalid device function" (cudaErrorInvalidDeviceFunction = 98). The error is benign in our flow — a null host pointer just means "not in the PDL registry, skip this edge". But it was leaking up two stack frames to the next forward_logits, where the existing pre-pass diagnostic logged "Cleared stale error before forward: invalid device function" on every chat completion request (~2/request, ~600/min in production traffic). Fix: defensive (void)cudaGetLastError() in the skip path, plus an explicit null check on kparams.func so the lookup short-circuits cleanly. Also clarifying comment on why this is necessary. Verified: 0 stale-error warnings across 5 chat completions on Mistral-3.2 (was 2/request before). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Engine::warmup() runs a synthetic BOS-padded forward at engine init to prime cuBLAS handles, L2 cache, and CUDA-graph capture paths. Default was true, but it pollutes engine state in ways that survive its own forward pass and visibly degrade the very first user request — most clearly on Mistral-3.2-NVFP4 where the first generated answer was '111111111...' followed by clean output on requests 2+. Bisect (in this session, with all other optimizations off): - naked path (warmup=off, graphs=never, PDL=off, deterministic_gemm=on): first answer clean ✅ - graphs+PDL on, warmup=off: first answer clean ✅ - warmup=on, graphs+PDL off: first answer = '111111111111111111111111' ❌ So warmup itself is the trigger. Root cause not isolated; the existing reset_kv_calibration() (added in PR #89 for FP8 KV high-water-mark) covers FP8 KV but not the analog state for NVFP4-decode / cuBLAS algo cache / L2 persist policy that warmup with synthetic inputs poisons. The flag is intended for prod rollouts where first-request TTFT matters (saves ~200ms of cuBLAS heuristic on the first chat). Dev / CI / one-shot inference doesn't benefit, and the first-request degeneration was masking a real Mistral output bug. Flip the default; explicit opt-in via `--set runtime.warmup=true` for prod. Side-effect: Qwen3-Coder graph-replay determinism improved 16/32 → 23/32 post-flip — warmup was also subtly corrupting downstream graph state on that model. Doesn't fix the broader MoE-NVFP4 graph non-determinism (separate engine class) but moves the needle the right way. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a re-runnable Python harness that exercises imp-server against every
pre-quantized NVFP4 SafeTensors model on disk, runs a 20-prompt battery,
checks graph-replay byte-identicality, degeneracy gates, and 3-run
determinism. Mode A (reduced scope): no BF16 reference, no NVFP4
calibration — imp consumes pre-quantized weights only and there is no
BF16 execution path to compare against.
Files:
- scripts/validate_safetensors.py — harness, stdlib-only, spawns imp-server
per model in docker, hits /v1/chat/completions with logprobs, checks
output against per-prompt rules (regex / contains / json-schema /
primary-color sets / sequence-1-to-N / 4-gram repetition / etc.).
- scripts/consolidate_validation_report.py — reads each model's
report.json and writes the top-level MODEL_VALIDATION_REPORT.md +
MODEL_VALIDATION_SUMMARY.csv with executive summary.
- tests/fixtures/{battery_prompts.json, tokenizer_roundtrip.txt} — input
fixtures.
- validation_artifacts/<model>/{report.json, server.log} — per-model
artifacts from the latest run; refresh by re-running the harness.
- MODEL_VALIDATION_REPORT.md / MODEL_VALIDATION_SUMMARY.csv — current
state covering 5 models.
Models validated:
- Mistral-Small-3.2-24B-Instruct-2506-NVFP4 (llm-compressor + SmoothQuant)
- Gemma-4-26B-A4B-it-NVFP4 (llm-compressor)
- Qwen3.6-35B-A3B-NVFP4 (llm-compressor, GDN+MoE hybrid)
- Qwen3-Coder-30B-A3B-Instruct-FP4 (Modelopt)
- Qwen3-30B-A3B-NVFP4 NVIDIA Modelopt format (downloaded this session)
Headline finding: nvidia/Qwen3-30B-A3B-NVFP4 is a clean drop-in for the
broken Mistral-3.2-NVFP4. Killer-test: 50× Lorem-Ipsum prefix + "The
capital of France is" produces "Paris. The capital of Germany is Berlin.
The capital of Spain is Madrid…" (Mistral produces "elit dolor elit
dolor elit dolor…"). 1024-token creative generation 4-gram repetition
rate 1.4% (Mistral 95.7%).
Mistral-3.2-NVFP4 long-context regression was root-caused this session
to upstream SmoothQuant 0.9 calibration loss — direct dump of L0 q_proj
shows 45% of dequanted weight values are exactly 0, 97.8% of NVFP4
micro-blocks dominated by an outlier K-channel. Not fixable in imp;
realistic solution is the Qwen3-30B-A3B-NVFP4 replacement validated by
this harness.
Re-run any time: python3 scripts/validate_safetensors.py [--model NAME]
[--smoke]; then python3 scripts/consolidate_validation_report.py.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Six commits tightening NVFP4 SafeTensors stability on imp + a re-runnable validation harness for all NVFP4 models on disk.
557cab3fix(fp8)0x7E=448(was0x77=240) — roots the CUTLASS+llm-compressor mismatch.a7920f6fix(nvfp4)b29d066fix(runtime)effective_chunkagainstexecutor->max_tokens()instep_prefill; throw on overflow inforward_logits; try/catch aroundengine->step()inbatching_engine. Fixes Qwen3.6-NVFP4 long-prompt crash (terminate: reshape: numel mismatch).95f4ee2fix(graph)invalid device functionfromcudaGraphKernelNodeGetParamsfor driver-API kernel nodes inapply_pdl_edges. Was logging "Cleared stale error" on every chat completion.ca6ef69fix(runtime)runtime.warmup=false, opt-in for prod rollout. Bisect provedEngine::warmup()was the trigger for Mistral-3.2-NVFP4 first-request degeneration (illumin11111).517b11atest(validation)scripts/validate_safetensors.py+ 5-model report + per-model JSON artifacts. Validates Mistral-3.2, Gemma-4, Qwen3.6, Qwen3-Coder, and the newnvidia/Qwen3-30B-A3B-NVFP4(Modelopt) as a clean drop-in for the broken Mistral-3.2-NVFP4.Validation results
5 NVFP4 SafeTensors models exercised against a 20-prompt battery + 32x graph-replay determinism + degeneracy gates. Re-runnable any time via
scripts/validate_safetensors.py. Headline:"elit dolor elit dolor...""Paris. The capital of Germany is Berlin..."illumin11111(pre-warmup-fix)After the bug fixes:
Mistral-3.2-NVFP4 long-context regression — root cause
Investigation in this PR refuted the prior
input_global_scalehypothesis (tested both alpha directions, neither helped; all three llm-compressor models ship IGS but only Mistral breaks). Direct dump of L0 q_proj NVFP4-dequant FP16 reveals the actual cause:This is SmoothQuant 0.9 + per-block-NVFP4 incompatibility, baked into the model file at calibration. Not fixable in imp; replacement validated.
Test plan
make buildpassesmake test-unit37/37 PASSverify fast) passesnvidia/Qwen3-30B-A3B-NVFP4confirmed clean output🤖 Generated with Claude Code