
fix(nvfp4): defensive guard for llm-compressor zero/non-finite tensor_scale + input_scale visibility#113

Merged
kekzl merged 4 commits into main from fix/llm-compressor-nvfp4-defensive on May 7, 2026

Conversation

@kekzl (Owner) commented May 7, 2026

Summary

Cleanup + diagnostic work on the llm-compressor NVFP4 path. There is no actual long-sequence degeneration to fix — the empirical bracket of 2026-05-07 establishes that the previously suspected NVFP4-specific bug is in fact a model-inherent / generic-imp limitation that also reproduces on Q8_0 GGUF and on Modelopt-format NVFP4.

What this PR ships

  1. Zero/non-finite guard on the reciprocal flip (src/graph/executor_pre_dequant.cu) — `w.tensor_scale = 1.0f / h_scale` would silently produce +Inf when `weight_scale_2 == 0`, poisoning a layer's GEMM output via the alpha epilogue. The guard now detects this, substitutes 0.0, and emits a WARN with the offending key. An edge case, but it prevents a silent failure mode.

  2. Stat counters at INFO level — post-promotion log surfaces zero / non-finite scale events without needing IMP_AUDIT_NVFP4_SCALES=1.

  3. Input_scale visibility — llm-compressor's input_scale is loaded into nvfp4_scratch_ but never consumed at inference; this PR makes that fact visible in the boot log.

  4. docs/roadmap.md revised — the previous "llm-compressor NVFP4: degenerate output past ~30 tokens" section is replaced with an honest description tracking what the empirical evidence actually shows.
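The guard in item 1 can be sketched as host-side logic roughly like the following (function and counter names here are illustrative, not imp's actual API):

```cpp
#include <cmath>
#include <cstdio>

// Hypothetical sketch of the reciprocal-flip guard described above.
// llm-compressor stores weight_scale_2 such that imp must flip it to
// 1/h_scale before use; a zero or non-finite value must not reach the GEMM.
float promote_tensor_scale(float h_scale, const char* key,
                           int* zero_count, int* nonfinite_count) {
    if (!std::isfinite(h_scale)) {   // NaN / ±Inf straight from the checkpoint
        ++*nonfinite_count;
        std::fprintf(stderr, "WARN: non-finite weight_scale_2 for %s\n", key);
        return 0.0f;
    }
    float flipped = 1.0f / h_scale;  // the reciprocal flip
    if (!std::isfinite(flipped)) {   // h_scale == 0, or subnormal overflow
        ++*zero_count;
        std::fprintf(stderr, "WARN: zero/degenerate weight_scale_2 for %s\n", key);
        return 0.0f;                 // zero contribution instead of +Inf
    }
    return flipped;
}
```

Substituting 0.0 means the affected Linear contributes nothing rather than NaN-poisoning every downstream token, matching the "zero contribution" behavior the PR describes.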

Empirical refutation of the input_scale hypothesis

Hypothesis (from the roadmap line "Final fix would load and apply the per-Linear input_scale"): apply input_scale as a per-tensor scalar GEMM alpha modifier.

Tested on Gemma-4-26B-A4B-it-NVFP4 via scripts/validate_safetensors.py (full 20-prompt battery):

| Build | phase4 | Typical failure |
| --- | --- | --- |
| current main (skip-guard ON) | 18/20 | (baseline) |
| this PR (defensive only) | 18/20 | no change — only edge-case guards |
| promoted_scale / h_input_scale | 4/20 | "own own own own own…" |
| promoted_scale * h_input_scale | 4/20 | "own- own own own…" |

Both directions break the model identically — refuted. Math derivation in memory/llm_compressor_input_scale_dead_end_2026_05_07.md shows imp's dynamic per-block input quant already lands at the correct GEMM output without input_scale absorption.

Long-context recall failure isn't NVFP4-specific

The one real failure on Gemma-4-NVFP4 (prompt 6, long_context_recall, 2048-token sentinel) reproduces identically on:

  • gemma-4-26B-A4B-it-Q8_0 GGUF (≈2× higher precision per weight, no NVFP4 anywhere) — manual A/B 2026-05-07
  • Qwen3-30B-A3B-NVFP4-Modelopt (Modelopt format, different scale convention) — full battery 2026-05-07

So this is a copy-from-context attention/recall limitation, not a weight quantization issue. The fix (if any) belongs in attention/KV cache, not the NVFP4 path.

What this PR does NOT do

  • Lift the CUTLASS skip-guard (separate concern; memory llm_compressor_cutlass_skip_2026_05_05.md documents the A/B that established it).
  • Apply input_scale (refuted as scalar; per-channel SmoothQuant variant may still apply for Mistral-3.2-NVFP4 but requires that model locally).
  • Fix long_context_recall (separate, not NVFP4-specific work).

Test plan

  • make verify-fast clean — decode tg128 154.91 (+4.78%), prefill pp512 14391 (+8.38%), smoke prompt OK
  • No behavior change for Modelopt-format models
  • scripts/validate_safetensors.py --model Gemma-4-26B-A4B-it-NVFP4: 18/20 phase4 (= baseline, no regression)
  • Hypothesis A/B (input_scale divide/multiply): both 4/20, refuted
  • Cross-format check (Modelopt + Q8_0 GGUF): same long_context_recall failure → not NVFP4-specific

🤖 Generated with Claude Code

kekzl and others added 4 commits May 7, 2026 08:13
The llm-compressor reciprocal flip (`tensor_scale = 1/h_scale`) silently
produces +Inf when `weight_scale_2 == 0`, contaminating the entire
layer's GEMM output with NaN/Inf via the alpha epilogue. Some
llm-compressor exports legitimately emit zero scale on all-zero weight
blocks; without this guard, those Linears poison every downstream token.

Defensive fix: detect h_scale == 0 and non-finite reciprocal results
explicitly. Substitute 0.0 (zero contribution) instead of Inf and emit
a WARN with the offending key. Promotion stats are surfaced post-loop.

Also surfaces input_scale presence at INFO level when an llm-compressor
model carries it: the data is loaded into nvfp4_scratch_ but not yet
consumed at inference (see docs/roadmap.md NVFP4 long-context section).
This makes the gap visible by default rather than only under
IMP_AUDIT_NVFP4_SCALES=1.

Does NOT fix the underlying degeneration on long sequences with
SmoothQuant-calibrated llm-compressor models — that requires per-layer
numerical validation against a Modelopt-format reference (memory/
llm_compressor_cutlass_skip_2026_05_05.md identifies CUTLASS
non-determinism + ElementC=half_t precision as suspect roots).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t-specific

Empirical bracket 2026-05-07 against same prompt + sentinel:
  - Gemma-4-NVFP4 (llm-compressor) — fails (regurgitates doc content)
  - gemma-4-Q8_0 GGUF              — same failure mode
  - Qwen3-30B-A3B-NVFP4-Modelopt   — also fails

Failure on Q8_0 (no NVFP4) and on Modelopt (different scale convention)
refutes the previous "llm-compressor NVFP4 degenerate" framing. The
2048-token sentinel-recall failure is a copy-from-context attention
limitation, not weight quantization noise. Any fix belongs in attention/
KV cache, not in tensor_scale handling.

input_scale absorption as scalar GEMM alpha modifier (the previous
"final fix" line) was tested in both directions and refuted (18/20 →
4/20). Per-channel SmoothQuant correction may still apply for Mistral-3.2,
but that's not the same code path.

See memory/llm_compressor_input_scale_dead_end_2026_05_07.md for the
re-runnable bracket recipe.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…cture

Doc-length sweep 128–2048 tokens (sentinel placed at midpoint, max_tokens=512):
  Gemma-4-NVFP4 (llm-compressor)   — fails ≥768, regurgitates doc content
  gemma-4-Q8_0 GGUF                — same failure mode (no NVFP4 anywhere)
  Qwen3-30B-A3B-NVFP4-Modelopt     — passes ALL sizes 128–2048

Failure is Gemma-4-specific, not NVFP4-specific. Gemma-4 has 5:1 SWA:full
attention with sliding_window=1024 — only 5/30 layers carry long-range
context. Qwen3-30B (full attention everywhere) handles the same prompt
fine through imp.

Code review of executor_attention.cu:491-501 shows imp correctly disables
sliding_window on Gemma-4 global layers and uses per-layer head_dim
(256 SWA / 512 global). No obvious imp bug in the dispatch path.

Two unresolved hypotheses pending llama.cpp reference:
  1. Model-inherent (Gemma-4-26B can't reliably recall via 5 full layers)
  2. imp Gemma-4 bug (partial_rotary_factor=0.25 rope_freqs interaction
     at long context)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…mp bug

Same gemma-4-26B-A4B-it-Q8_0.gguf file, same prompt, same sentinel:

  imp        — fails at all sizes >= 1024 (regurgitates doc, "not in text")
  llama.cpp  — passes ALL sizes 128-2048 (build 9049)

  ghcr.io/ggml-org/llama.cpp:server-cuda v9049
  --model gemma-4-26B-A4B-it-Q8_0.gguf -ngl 99 -c 4096

This refutes the "model-inherent" hypothesis. imp has a Gemma-4-specific
long-context attention bug; llama.cpp on identical weights does not.

Roadmap section reflects the confirmed status. Real fix requires
intermediate-output comparison llama.cpp <-> imp layer-by-layer at the
failure boundary to pinpoint the divergence.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@kekzl kekzl merged commit 5ee219f into main May 7, 2026
2 checks passed
@kekzl kekzl deleted the fix/llm-compressor-nvfp4-defensive branch May 7, 2026 16:57
github-actions Bot pushed a commit that referenced this pull request May 7, 2026
* inventory: roadmap and backlog discovery 2026-05

Phase 0 of the SafeTensors+NVFP4 hardening run. Compiles every roadmap-like
artifact in the repo (docs/roadmap.md, docs/sm120-real-perf-plan.md, the
"truly unresolved" list at the bottom of docs/audit/safetensors_audit.md, plus
git log + open issues/PRs) into a single inventory and classifies each item
FEASIBLE / UNCERTAIN / INFEASIBLE / OBSOLETE under the Quality Gate.

Result: 0 FEASIBLE, 1 UNCERTAIN (native SentencePiece parser, deferred to a
dedicated session in favor of correctness-hardening work this run), 21
INFEASIBLE (multi-week kernel/architecture work or refuted dead-ends), 5
OBSOLETE (already shipped or shelved). Consistent with imp's mature state —
the listed items are by construction either dead-ends or large undertakings.

Per the mission's conditional model, this run focuses on Objective 1
(SafeTensors + NVFP4 hardening) only. Every deferred item is captured in
docs/audit/followups.md with a specific reason and pre-conditions to revisit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* audit: safetensors + nvfp4 loader audit 2026-05

Phase 1 of the SafeTensors+NVFP4 hardening run. Builds on top of the existing
docs/audit/safetensors_audit.md (Phase 1 + Phase 2 / PR #116) and identifies
ten residual hardening findings F1-F10 that survived Phase 2:

- F1 (P0): no reference numerical test against the compressed-tensors NVFP4
  spec. Roundtrip tests cannot catch a paired sign-flip / nibble-order /
  missing-factor bug between imp's quantizer and dequantizer.
- F2 (P1): Modelopt NVFP4 weight_scale_2 lacks the isfinite guard PR #113
  added for the llm-compressor path. NaN/Inf propagates layer-wide.
- F3 (P1): header-size validation has an integer overflow at
  safetensors_loader.cpp:519-524. header_size = UINT64_MAX-4 makes
  8 + header_size wrap to 3, bypassing the check.
- F4 (P1): tensor offsets not validated against shape*dtype. Three sub-bugs:
  no offset_start<=offset_end, no size match, no per-tensor start in-bounds.
- F5-F8 (P2): silent drops on malformed tensor entries, missing NVFP4 packed
  vs weight_scale shape check, missing header-size upper bound, missing
  weight_scale dtype enforcement (NVFP4/MXFP4 cross-misrouting risk).
- F9-F10 (P3): documented and skipped — spec-compliance and UX only.

Also catalogs items already deferred via the prior audit's "truly unresolved"
section (GLM, SentencePiece, AWQ kernel, DeepSeek MLA, multimodal, Tiktoken),
which now live in docs/audit/followups.md.

No code changes — this is the read-only audit commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* plan: master plan loader + roadmap 2026-05

Phase 2 of the run. Combines Phase 1 audit findings F1-F8 (Phase 0 yielded
no FEASIBLE roadmap items, so the plan is loader/NVFP4 hardening only).

Eight items, ordered P0 → P1 → P2. F9-F10 deferred (spec-compliance and UX
only). Two ADRs:
- 0001 — Pure-C++ reference harness for unit-level NVFP4 numerics
- 0002 — 128 MiB SafeTensors header-size soft cap

Each item is small, isolated, has a deterministic synthetic test fixture,
zero new dependencies, and a clear root-cause-vs-symptom delineation.
Test infrastructure reuses two new test files
(test_nvfp4_compressed_tensors_ref.cu, test_safetensors_loader.cpp)
across multiple items to keep churn low.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(nvfp4): reference numerical test for compressed-tensors dequant (F1)

Closes audit finding F1. Adds tests/test_nvfp4_compressed_tensors_ref.cu with
four GTest cases that build a synthetic compressed-tensors NVFP4 weight in
memory exactly per the on-disk spec (uint8 nibble-packed E2M1 + FP8 E4M3
weight_scale at group_size=16 + FP32 weight_scale_2) and assert that imp's
gemv_nvfp4_kpar produces the same Y = W·X as a pure-host reference dequant
following val = e2m1_to_f32(nibble) * fp8_e4m3_to_f32(scale) * weight_scale_2.
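The reference formula quoted above can be sketched as pure host code. The E2M1 value table is the standard FP4 set; low-nibble-first packing is an assumption for this sketch, not something this commit message confirms:

```cpp
#include <cstdint>

// E2M1 (FP4) decode: codes 0..7 map to the positive value set,
// bit 3 is the sign.
static const float kE2M1[8] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};

float e2m1_to_f32(uint8_t nibble) {
    float v = kE2M1[nibble & 0x7];
    return (nibble & 0x8) ? -v : v;
}

// Dequantize one packed byte (two nibbles; low nibble first is an
// assumption here) with a per-block FP8 scale (already decoded to f32)
// and the tensor-level weight_scale_2, per
// val = e2m1_to_f32(nibble) * block_scale * weight_scale_2.
void dequant_pair(uint8_t packed, float block_scale, float tensor_scale,
                  float* lo, float* hi) {
    *lo = e2m1_to_f32(packed & 0xF) * block_scale * tensor_scale;
    *hi = e2m1_to_f32(packed >> 4)  * block_scale * tensor_scale;
}
```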

Existing roundtrip tests (test_nvfp4_quant_ref.cu, test_nvfp4_quant_hw.cu,
test_nvfp4_gemv_kpar_loop.cu) all run imp's quantizer through imp's
dequantizer; a paired sign-flip / nibble-order bug or a missing factor (e.g.
dropping weight_scale_2) would not be detectable. This test starts from the
spec format directly so any future regression that breaks spec compliance
fails CI.

Cases:
- BaselineUnityScales: mixed nibbles, FP8=1.0, tensor_scale=1.0 — catches
  alignment / nibble-decode bugs at unity factors
- TwoLevelScalingVaryingPerBlock: cycling per-block scales + non-trivial
  tensor_scale=0.125 — catches a missing or sign-flipped factor
- ZeroTensorScaleProducesZeroOutput: tensor_scale=0 must produce 0 output,
  not NaN/Inf, regardless of garbage in weight_packed (also pre-validates F2)
- NegativeWeightsSignPreserved: every nibble = -1.0 → output exactly -K,
  catches sign-bit drop in nibble decode

Tolerance is max-abs-diff < 1e-2 in FP16 output (FMA-order divergence between
sequential reference and imp's parallel-warp accumulator dominates; 1e-5 is
unrealistic for K=128 FP16 dot-product reductions).

ADR 0001 records the rationale for choosing pure-C++ reference over the
other two harness options (existing imp Python infra: not wired for unit
numerics; subprocess into user's HF venv: cross-process complexity + runtime
dep). Pure C++ is dependency-free, deterministic, exact for the formula,
and fits inside the existing GTest harness.

All 4 new tests pass; full test-quant binary remains 82/82 green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(nvfp4): defensive zeroing for non-finite Modelopt weight_scale_2 (F2)

Closes audit finding F2. Phase 2 / PR #113 added a zero/non-finite guard for
the llm-compressor NVFP4 promote path; the Modelopt path at the else branch
in executor_pre_dequant.cu was unguarded and would propagate NaN/±Inf
weight_scale_2 into the GEMM, contaminating the entire layer's hidden state
and downstream KV cache.

Refactor: extract the scale-promotion math into nvfp4_promote_weight_scale_2
in quant/nvfp4_quant.h/.cu — pure host function, testable without CUDA. Both
formats now share the defensive logic:
  - non-finite h_scale          → 0.0f (both formats)
  - llm-compressor h_scale=0    → 0.0f (avoids 1/0 = +Inf via reciprocal flip)
  - llm-compressor 1/h_scale non-finite → 0.0f (subnormal flip overflow)
  - Modelopt h_scale=0          → 0.0f (legitimate "null layer", flagged for diag)

executor_pre_dequant.cu's promote() lambda calls the helper; the existing
counter / WARN summary remains, with WARN messages distinguishing the
Modelopt-NaN/Inf case from llm-compressor reciprocal flip.

Tests: 9 new unit tests in NvFP4PromoteWeightScale2 covering NaN, +Inf,
-Inf, zero, denorm-flip, and finite cases for both formats. Test-quant
suite: 82/82 → 91/91 (no regression).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(safetensors): overflow-safe header_size validation + 128 MiB cap (F3 + F7)

Closes audit findings F3 and F7 from docs/audit/safetensors_nvfp4_audit_2026-05.md.

The previous header_size validation at safetensors_loader.cpp:519-524 used
`8 + header_size > file_size`. With a malicious / corrupt
header_size = UINT64_MAX-4 the addition wraps to 3, which is NOT greater
than any file_size — the check silently bypasses. The loader then constructs
JsonParser(json_data, static_cast<size_t>(header_size)) and reads past the
mmap region → SIGSEGV.

Refactor the check into safetensors_internal::validate_header_size, exposed
in safetensors_loader.h for unit testing. Two rules:
- header_size > file_size - 8 (overflow-safe; file_size >= 8 is enforced upstream)
- header_size > kMaxHeaderBytes (128 MiB soft cap per ADR 0002)

The cap rejects pathological inputs that would force the JSON parser to scan
multi-GB regions. Real models have headers below 1 MiB; 128 MiB is far above
legitimate use.
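The two rules above, in overflow-safe form (the function mirrors the described safetensors_internal::validate_header_size; the exact signature is assumed):

```cpp
#include <cstdint>

// 128 MiB soft cap per ADR 0002.
constexpr uint64_t kMaxHeaderBytes = 128ull * 1024 * 1024;

// Caller guarantees file_size >= 8 (the 8-byte length prefix itself),
// so the subtraction below cannot wrap — unlike the old `8 + header_size`
// form, which overflowed for header_size near UINT64_MAX.
bool validate_header_size(uint64_t header_size, uint64_t file_size) {
    if (header_size > file_size - 8) return false;   // header must fit in file
    if (header_size > kMaxHeaderBytes) return false; // reject pathological JSON
    return true;
}
```

Moving the 8 to the right-hand side as a subtraction is the whole fix: the comparison stays in-range for every representable header_size.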

Tests: 6 new unit cases covering truncated files, exact-minimum, typical
size, header > file, UINT64_MAX overflow attack (the F3 bug), and the
128 MiB soft cap boundary. test-core suite: 139/139 → 139/139 (1 skipped is
unchanged TensorKindCoverage.NoUnknownKindsInSmallQwen).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(safetensors): per-tensor offset and size validation (F4)

Closes audit finding F4. The previous in-loop validation at
safetensors_loader.cpp:572-580 only checked
`tensor_data_offset + offset_end > file_size`, leaving three silent
correctness paths open:

1. offset_start > offset_end was not checked. A swapped data_offsets pair
   would compute a negative "size" interpreted as a huge unsigned value;
   downstream kernel reads would walk backwards into adjacent tensor data.
2. offset_end - offset_start was not compared against the byte count implied
   by shape × dtype. A file declaring an FP16 [1024,1024] tensor with
   offsets [0, 1024] would silently load 0.05% of the actual weight data,
   then yield uninitialized bytes for the rest.
3. tensor_data_offset + offset_start in-bounds was never asserted; only the
   end was. A start-past-EOF could be tolerated if the (truncated) end was
   in-bounds — possible with a corrupted-but-self-consistent file.

Add safetensors_internal::validate_tensor_offsets, a host-only helper
exposed via safetensors_loader.h for unit testing. Three rules in the
order the spec implies. Overflow-safe (subtractions only, with the upstream
header_size invariant guaranteeing tensor_data_offset <= file_size).
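The three rules can be sketched as follows (parameter names and the exact signature are assumed; the upstream invariant tensor_data_offset <= file_size is taken as given):

```cpp
#include <cstdint>

// expected_bytes = product(shape) * dtype_size, computed by the caller.
// data_base is the absolute file offset where the tensor-data region starts.
bool validate_tensor_offsets(uint64_t off_start, uint64_t off_end,
                             uint64_t expected_bytes,
                             uint64_t data_base, uint64_t file_size) {
    if (off_start > off_end) return false;                    // rule 1: swapped pair
    if (off_end - off_start != expected_bytes) return false;  // rule 2: shape*dtype
    // rule 3: start in-bounds (subtraction safe: data_base <= file_size upstream)
    if (off_start > file_size - data_base) return false;
    // end in-bounds — the legacy end-only check, in overflow-safe form
    if (off_end > file_size - data_base) return false;
    return true;
}
```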

The per-tensor check is wired into load_shard's tensor enumeration loop:
when wire_dtype_bytes is known (mapping from SafeTensors wire string —
F32, F16, BF16, FP8_E4M3 etc.), the strict 3-rule validation runs;
otherwise the legacy lenient end-only check applies. The "unknown wire
type" path retains existing behavior because safetensors_dtype()'s WARN
already fires for those tensors at the actual emit step.

Tests: 7 new unit cases in SafeTensorsValidateTensorOffsets covering valid,
swap, OOB-end, byte-count mismatch, zero-size, invariant violation, and
exact-boundary cases. test-core 146/146 → 146/146, test-quant 91/91
unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(nvfp4): enforce FP8_E4M3 weight_scale dtype at promote (F8)

Closes audit finding F8. The compressed-tensors NVFP4 spec mandates
float8_e4m3fn for weight_scale. Until now imp's promote step accepted
whatever qtype the loader produced, which opened a NVFP4↔MXFP4
cross-misrouting silent-corruption path: a model misclassified as NVFP4
but shipping U8 (UE8M0) weight_scale bytes would silently load, then
gemv_nvfp4_kpar would interpret the UE8M0 bytes as E4M3 and produce
~2× wrong scales (powers of two interpreted as E4M3 normals).
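The dtype gate itself is a one-line predicate; a sketch with assumed QType enumerator names (imp's real enum is not shown here):

```cpp
// Assumed enumerators for illustration only.
enum class QType { NONE, FP8_E4M3, FP8_E5M2, U8, F16 };

// The compressed-tensors NVFP4 spec mandates float8_e4m3fn for
// weight_scale. Anything else (e.g. U8 carrying UE8M0 bytes from a
// misclassified MXFP4 checkpoint) must be rejected before promote, or
// gemv_nvfp4_kpar would decode those bytes as E4M3 and scale wrongly.
bool nvfp4_validate_weight_scale_dtype(QType t) {
    return t == QType::FP8_E4M3;
}
```

On rejection the weight stays unpromoted and falls back to dequant→cuBLAS, as the commit describes: slower, but numerically correct.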

Add nvfp4_validate_weight_scale_dtype(QType, *err) in nvfp4_quant.h/.cu —
pure host predicate. The promote() lambda calls it before applying the
two-level scaling formula; on rejection, the weight stays in its loaded
state and the dequant→cuBLAS fallback runs (slower, but correct).

A new end-of-load summary line surfaces the count of skipped weights so
the user can detect the cross-misroute case.

Tests: 5 new unit cases in NvFP4ValidateWeightScaleDtype covering the
accepted dtype, the MXFP4 INT8 case, FP8_E5M2 (activation-only), F16
(some pipelines emit this), and the QType::NONE sentinel. test-quant
91/91 → 96/96, no regression.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(nvfp4): validate weight_packed/weight_scale shape pair at promote (F6)

Closes audit finding F6. The NVFP4 GEMV kernel hard-codes group_size=16
(kMicroBlockSize=16 at nvfp4_gemm.cu:31) and reads K/16 micro-scales per row.
Until now, the loader never verified that weight_scale's shape matches that
contract; a checkpoint with group_size=8 or a transposed weight_scale would
silently load and silently produce wrong output (12.5% per-element step
quant noise on roughly half the elements, or scales aligned onto wrong rows).
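The shape contract can be sketched as a predicate. This assumes the conventional NVFP4 layout — packed stores two E2M1 values per byte (K/2 bytes per row) and weight_scale one FP8 per 16-element micro-block (K/16 per row) — which matches the group_size=16 contract described above but is not spelled out in this commit:

```cpp
#include <cstdint>

bool validate_packed_scale_shapes(uint64_t packed_outer, uint64_t packed_inner,
                                  uint64_t scale_outer, uint64_t scale_inner) {
    if (packed_outer != scale_outer) return false;  // transposed weight_scale
    uint64_t k = packed_inner * 2;                  // logical elements per row
    if (k == 0 || k % 16 != 0) return false;        // kernel hard-codes group 16
    return scale_inner == k / 16;                   // rejects group_size 8 / 32
}
```

A group_size=8 checkpoint would present scale_inner = K/8 and fail the final equality, routing the weight to the dequant→cuBLAS fallback instead of silently misaligning scales.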

Add nvfp4_validate_packed_scale_shapes(packed_outer, packed_inner,
scale_outer, scale_inner, *err). The promote() lambda calls it for 2D
weights — the per-expert MoE case has been split to 2D by weight_upload.cu
before promote runs. Mismatches WARN + skip promotion, which routes the
weight to the dequant→cuBLAS fallback (slower but at least correct).

Tests: 7 new unit cases in NvFP4ValidatePackedScaleShapes covering
typical Qwen3 / Gemma-4 expert shapes, transposed scale, group_size=8,
group_size=32, zero-inner-dim, and the tiny F1 baseline test fixture.
test-quant 96/96 → 103/103, no regression.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(safetensors): warn on dropped malformed tensor entries (F5)

Closes audit finding F5. Until now, six paths in load_shard silently dropped
tensor entries with no log line: missing/non-string 'dtype', missing/non-array
'shape', ndim > kMaxDims, missing/wrong-arity 'data_offsets', and the
post-F4 offset-validation rejections. Users with a corrupt checkpoint
would see tensor_map come back partially populated with zero diagnostic
output; downstream null-checks would make load look "successful" with
wrong outputs at inference.

Replace each silent `continue` with a counter-bumped IMP_LOG_WARN line
naming the tensor and the specific reason. Add an end-of-shard summary
line breaking out the counts per reason (no_dtype / no_shape /
too_many_dims / no_offsets / offset_validation), so users can scan a single
log line to see "did this checkpoint load cleanly".

Tests: 2 new unit cases in SafeTensorsMalformedEntryWarnings covering
(a) missing dtype + missing shape on synthetic blob, and (b) byte-count
mismatch from F4. Both verify via gtest's CaptureStderr that the WARN line
includes the tensor name and the rejection reason. test-core 146/146 →
148/148, no regression.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(audit): mark all P0/P1/P2 findings closed + DONE summary

Final phase. Marks every audit finding F1-F8 with its closing commit SHA,
records the final P3 deferrals (F9 spec-compliance, F10 UX), and writes
docs/audit/DONE.md with the full run summary:

- 1/1 P0 closed, 3/3 P1 closed, 4/4 P2 closed, 2/2 P3 deferred
- 0 FEASIBLE roadmap items (all 27 deferred per the conditional model with
  documented reason in followups.md)
- 40 new unit tests, full suite 769/769 pass-or-skip-only with 0 failures
- verify-fast: decode +5.47%, prefill +8.17% over baseline (noise + prior
  commits; this run's changes are loader-time only and do not touch the
  hot path)
- 2 ADRs (pure-C++ ref harness, 128 MiB header cap)
- No new third-party dependencies in any manifest
- CMAKE_CUDA_STANDARD=20 unchanged

progress.log force-added (project gitignore covers *.log; this is an
intentional audit artifact named in the spec).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>