fix(nvfp4): defensive guard for llm-compressor zero/non-finite tensor_scale + input_scale visibility#113
Merged
Merged
Conversation
The llm-compressor reciprocal flip (`tensor_scale = 1/h_scale`) silently produces +Inf when `weight_scale_2 == 0`, contaminating the entire layer's GEMM output with NaN/Inf via the alpha epilogue. Some llm-compressor exports legitimately emit zero scale on all-zero weight blocks; without this guard, those Linears poison every downstream token. Defensive fix: detect h_scale == 0 and non-finite reciprocal results explicitly. Substitute 0.0 (zero contribution) instead of Inf and emit a WARN with the offending key. Promotion stats are surfaced post-loop. Also surfaces input_scale presence at INFO level when an llm-compressor model carries it: the data is loaded into nvfp4_scratch_ but not yet consumed at inference (see docs/roadmap.md NVFP4 long-context section). This makes the gap visible by default rather than only under IMP_AUDIT_NVFP4_SCALES=1. Does NOT fix the underlying degeneration on long sequences with SmoothQuant-calibrated llm-compressor models — that requires per-layer numerical validation against a Modelopt-format reference (memory/ llm_compressor_cutlass_skip_2026_05_05.md identifies CUTLASS non-determinism + ElementC=half_t precision as suspect roots). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t-specific Empirical bracket 2026-05-07 against same prompt + sentinel: - Gemma-4-NVFP4 (llm-compressor) — fails (regurgitates doc content) - gemma-4-Q8_0 GGUF — same failure mode - Qwen3-30B-A3B-NVFP4-Modelopt — also fails Failure on Q8_0 (no NVFP4) and on Modelopt (different scale convention) refutes the previous "llm-compressor NVFP4 degenerate" framing. The 2048-token sentinel-recall failure is a copy-from-context attention limitation, not weight quantization noise. Any fix belongs in attention/ KV cache, not in tensor_scale handling. input_scale absorption as scalar GEMM alpha modifier (the previous "final fix" line) was tested in both directions and refuted (18/20 → 4/20). Per-channel SmoothQuant correction may still apply for Mistral-3.2, but that's not the same code path. See memory/llm_compressor_input_scale_dead_end_2026_05_07.md for the re-runnable bracket recipe. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…cture
Doc-length sweep 128–2048 tokens (sentinel placed at midpoint, max_tokens=512):
Gemma-4-NVFP4 (llm-compressor) — fails ≥768, regurgitates doc content
gemma-4-Q8_0 GGUF — same failure mode (no NVFP4 anywhere)
Qwen3-30B-A3B-NVFP4-Modelopt — passes ALL sizes 128–2048
Failure is Gemma-4-specific, not NVFP4-specific. Gemma-4 has 5:1 SWA:full
attention with sliding_window=1024 — only 5/30 layers carry long-range
context. Qwen3-30B (full attention everywhere) handles the same prompt
fine through imp.
Code review of executor_attention.cu:491-501 shows imp correctly disables
sliding_window on Gemma-4 global layers and uses per-layer head_dim
(256 SWA / 512 global). No obvious imp bug in the dispatch path.
Two unresolved hypotheses pending llama.cpp reference:
1. Model-inherent (Gemma-4-26B can't reliably recall via 5 full layers)
2. imp Gemma-4 bug (partial_rotary_factor=0.25 rope_freqs interaction
at long context)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…mp bug Same gemma-4-26B-A4B-it-Q8_0.gguf file, same prompt, same sentinel: imp — fails at all sizes >= 1024 (regurgitates doc, "not in text") llama.cpp — passes ALL sizes 128-2048 (build 9049) ghcr.io/ggml-org/llama.cpp:server-cuda v9049 --model gemma-4-26B-A4B-it-Q8_0.gguf -ngl 99 -c 4096 This refutes the "model-inherent" hypothesis. imp has a Gemma-4-specific long-context attention bug; llama.cpp on identical weights does not. Roadmap section reflects the confirmed status. Real fix requires intermediate-output comparison llama.cpp <-> imp layer-by-layer at the failure boundary to pinpoint the divergence. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
5 tasks
github-actions Bot
pushed a commit
that referenced
this pull request
May 7, 2026
* inventory: roadmap and backlog discovery 2026-05 Phase 0 of the SafeTensors+NVFP4 hardening run. Compiles every roadmap-like artifact in the repo (docs/roadmap.md, docs/sm120-real-perf-plan.md, the "truly unresolved" list at the bottom of docs/audit/safetensors_audit.md, plus git log + open issues/PRs) into a single inventory and classifies each item FEASIBLE / UNCERTAIN / INFEASIBLE / OBSOLETE under the Quality Gate. Result: 0 FEASIBLE, 1 UNCERTAIN (native SentencePiece parser, deferred to a dedicated session in favor of correctness-hardening work this run), 21 INFEASIBLE (multi-week kernel/architecture work or refuted dead-ends), 5 OBSOLETE (already shipped or shelved). Consistent with imp's mature state — the listed items are by construction either dead-ends or large undertakings. Per the mission's conditional model, this run focuses on Objective 1 (SafeTensors + NVFP4 hardening) only. Every deferred item is captured in docs/audit/followups.md with a specific reason and pre-conditions to revisit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * audit: safetensors + nvfp4 loader audit 2026-05 Phase 1 of the SafeTensors+NVFP4 hardening run. Builds on top of the existing docs/audit/safetensors_audit.md (Phase 1 + Phase 2 / PR #116) and identifies ten residual hardening findings F1-F10 that survived Phase 2: - F1 (P0): no reference numerical test against the compressed-tensors NVFP4 spec. Roundtrip tests cannot catch a paired sign-flip / nibble-order / missing-factor bug between imp's quantizer and dequantizer. - F2 (P1): Modelopt NVFP4 weight_scale_2 lacks the isfinite guard PR #113 added for the llm-compressor path. NaN/Inf propagates layer-wide. - F3 (P1): header-size validation has integer overflow at safetensors_loader.cpp:519-524. UINT64_MAX-4 makes 8+x wrap to 3, bypass. - F4 (P1): tensor offsets not validated against shape*dtype. Three sub-bugs: no offset_start<=offset_end, no size match, no per-tensor start in-bounds. - F5-F8 (P2): silent drops on malformed tensor entries, missing NVFP4 packed vs weight_scale shape check, missing header-size upper bound, missing weight_scale dtype enforcement (NVFP4/MXFP4 cross-misrouting risk). - F9-F10 (P3): documented and skipped — spec-compliance and UX only. Also catalogs items already deferred via the prior audit's "truly unresolved" section (GLM, SentencePiece, AWQ kernel, DeepSeek MLA, multimodal, Tiktoken), which now live in docs/audit/followups.md. No code changes — this is the read-only audit commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * plan: master plan loader + roadmap 2026-05 Phase 2 of the run. Combines Phase 1 audit findings F1-F8 (Phase 0 yielded no FEASIBLE roadmap items, so the plan is loader/NVFP4 hardening only). Eight items, ordered P0 → P1 → P2. F9-F10 deferred (spec-compliance and UX only). Two ADRs: - 0001 — Pure-C++ reference harness for unit-level NVFP4 numerics - 0002 — 128 MiB SafeTensors header-size soft cap Each item is small, isolated, has a deterministic synthetic test fixture, zero new dependencies, and a clear root-cause-vs-symptom delineation. Test infrastructure reuses two new test files (test_nvfp4_compressed_tensors_ref.cu, test_safetensors_loader.cpp) across multiple items to keep churn low. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(nvfp4): reference numerical test for compressed-tensors dequant (F1) Closes audit finding F1. Adds tests/test_nvfp4_compressed_tensors_ref.cu with four GTest cases that build a synthetic compressed-tensors NVFP4 weight in memory exactly per the on-disk spec (uint8 nibble-packed E2M1 + FP8 E4M3 weight_scale at group_size=16 + FP32 weight_scale_2) and assert that imp's gemv_nvfp4_kpar produces the same Y = W·X as a pure-host reference dequant following val = e2m1_to_f32(nibble) * fp8_e4m3_to_f32(scale) * weight_scale_2. Existing roundtrip tests (test_nvfp4_quant_ref.cu, test_nvfp4_quant_hw.cu, test_nvfp4_gemv_kpar_loop.cu) all run imp's quantizer through imp's dequantizer; a paired sign-flip / nibble-order bug or a missing factor (e.g. dropping weight_scale_2) would not be detectable. This test starts from the spec format directly so any future regression that breaks spec compliance fails CI. Cases: - BaselineUnityScales: mixed nibbles, FP8=1.0, tensor_scale=1.0 — catches alignment / nibble-decode bugs at unity factors - TwoLevelScalingVaryingPerBlock: cycling per-block scales + non-trivial tensor_scale=0.125 — catches a missing or sign-flipped factor - ZeroTensorScaleProducesZeroOutput: tensor_scale=0 must produce 0 output, not NaN/Inf, regardless of garbage in weight_packed (also pre-validates F2) - NegativeWeightsSignPreserved: every nibble = -1.0 → output exactly -K, catches sign-bit drop in nibble decode Tolerance is max-abs-diff < 1e-2 in FP16 output (FMA-order divergence between sequential reference and imp's parallel-warp accumulator dominates; 1e-5 is unrealistic for K=128 FP16 dot-product reductions). ADR 0001 records the rationale for choosing pure-C++ reference over the other two harness options (existing imp Python infra: not wired for unit numerics; subprocess into user's HF venv: cross-process complexity + runtime dep). Pure C++ is dependency-free, deterministic, exact for the formula, and fits inside the existing GTest harness. All 4 new tests pass; full test-quant binary remains 82/82 green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(nvfp4): defensive zeroing for non-finite Modelopt weight_scale_2 (F2) Closes audit finding F2. Phase 2 / PR #113 added a zero/non-finite guard for the llm-compressor NVFP4 promote path; the Modelopt path at the else branch in executor_pre_dequant.cu was unguarded and would propagate NaN/±Inf weight_scale_2 into the GEMM, contaminating the entire layer's hidden state and downstream KV cache. Refactor: extract the scale-promotion math into nvfp4_promote_weight_scale_2 in quant/nvfp4_quant.h/.cu — pure host function, testable without CUDA. Both formats now share the defensive logic: - non-finite h_scale → 0.0f (both formats) - llm-compressor h_scale=0 → 0.0f (avoids 1/0 = +Inf via reciprocal flip) - llm-compressor 1/h_scale non-finite → 0.0f (subnormal flip overflow) - Modelopt h_scale=0 → 0.0f (legitimate "null layer", flagged for diag) executor_pre_dequant.cu's promote() lambda calls the helper; the existing counter / WARN summary remains, with WARN messages distinguishing the Modelopt-NaN/Inf case from llm-compressor reciprocal flip. Tests: 9 new unit tests in NvFP4PromoteWeightScale2 covering NaN, +Inf, -Inf, zero, denorm-flip, and finite cases for both formats. Test-quant suite: 82/82 → 91/91 (no regression). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(safetensors): overflow-safe header_size validation + 128 MiB cap (F3 + F7) Closes audit findings F3 and F7 from docs/audit/safetensors_nvfp4_audit_2026-05.md. The previous header_size validation at safetensors_loader.cpp:519-524 used `8 + header_size > file_size`. With a malicious / corrupt header_size = UINT64_MAX-4 the addition wraps to 3, which is NOT greater than any file_size — the check silently bypasses. The loader then constructs JsonParser(json_data, static_cast<size_t>(header_size)) and reads past the mmap region → SIGSEGV. Refactor the check into safetensors_internal::validate_header_size, exposed in safetensors_loader.h for unit testing. Two rules: - header_size > file_size - 8 (overflow-safe; file_size >= 8 is enforced upstream) - header_size > kMaxHeaderBytes (128 MiB soft cap per ADR 0002) The cap rejects pathological inputs that would force the JSON parser to scan multi-GB regions. Real models have headers below 1 MiB; 128 MiB is far above legitimate use. Tests: 6 new unit cases covering truncated files, exact-minimum, typical size, header > file, UINT64_MAX overflow attack (the F3 bug), and the 128 MiB soft cap boundary. test-core suite: 139/139 → 139/139 (1 skipped is unchanged TensorKindCoverage.NoUnknownKindsInSmallQwen). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(safetensors): per-tensor offset and size validation (F4) Closes audit finding F4. The previous in-loop validation at safetensors_loader.cpp:572-580 only checked `tensor_data_offset + offset_end > file_size`, leaving three silent correctness paths open: 1. offset_start > offset_end was not checked. A swapped data_offsets pair would compute a negative "size" interpreted as a huge unsigned value; downstream kernel reads would walk backwards into adjacent tensor data. 2. offset_end - offset_start was not compared against the byte count implied by shape × dtype. A file declaring an FP16 [1024,1024] tensor with offsets [0, 1024] would silently load 0.05% of the actual weight data, then yield uninitialized bytes for the rest. 3. tensor_data_offset + offset_start in-bounds was never asserted; only the end was. A start-past-EOF could be tolerated if the (truncated) end was in-bounds — possible with a corrupted-but-self-consistent file. Add safetensors_internal::validate_tensor_offsets, a host-only helper exposed via safetensors_loader.h for unit testing. Three rules in the order the spec implies. Overflow-safe (subtractions only, with the upstream header_size invariant guaranteeing tensor_data_offset <= file_size). The per-tensor check is wired into load_shard's tensor enumeration loop: when wire_dtype_bytes is known (mapping from SafeTensors wire string — F32, F16, BF16, FP8_E4M3 etc.), the strict 3-rule validation runs; otherwise the legacy lenient end-only check applies. The "unknown wire type" path retains existing behavior because safetensors_dtype()'s WARN already fires for those tensors at the actual emit step. Tests: 7 new unit cases in SafeTensorsValidateTensorOffsets covering valid, swap, OOB-end, byte-count mismatch, zero-size, invariant violation, and exact-boundary cases. test-core 146/146 → 146/146, test-quant 91/91 unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(nvfp4): enforce FP8_E4M3 weight_scale dtype at promote (F8) Closes audit finding F8. The compressed-tensors NVFP4 spec mandates float8_e4m3fn for weight_scale. Until now imp's promote step accepted whatever qtype the loader produced, which opened a NVFP4↔MXFP4 cross-misrouting silent-corruption path: a model misclassified as NVFP4 but shipping U8 (UE8M0) weight_scale bytes would silently load, then gemv_nvfp4_kpar would interpret the UE8M0 bytes as E4M3 and produce ~2× wrong scales (powers of two interpreted as E4M3 normals). Add nvfp4_validate_weight_scale_dtype(QType, *err) in nvfp4_quant.h/.cu — pure host predicate. The promote() lambda calls it before applying the two-level scaling formula; on rejection, the weight stays in its loaded state and the dequant→cuBLAS fallback runs (slower, but correct). A new end-of-load summary line surfaces the count of skipped weights so the user can detect the cross-misroute case. Tests: 5 new unit cases in NvFP4ValidateWeightScaleDtype covering the accepted dtype, the MXFP4 INT8 case, FP8_E5M2 (activation-only), F16 (some pipelines emit this), and the QType::NONE sentinel. test-quant 91/91 → 96/96, no regression. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(nvfp4): validate weight_packed/weight_scale shape pair at promote (F6) Closes audit finding F6. The NVFP4 GEMV kernel hard-codes group_size=16 (kMicroBlockSize=16 at nvfp4_gemm.cu:31) and reads K/16 micro-scales per row. Until now, the loader never verified that weight_scale's shape matches that contract; a checkpoint with group_size=8 or a transposed weight_scale would silently load and silently produce wrong output (12.5% per-element step quant noise on roughly half the elements, or scales aligned onto wrong rows). Add nvfp4_validate_packed_scale_shapes(packed_outer, packed_inner, scale_outer, scale_inner, *err). The promote() lambda calls it for 2D weights — the per-expert MoE case has been split to 2D by weight_upload.cu before promote runs. Mismatches WARN + skip promotion, which routes the weight to the dequant→cuBLAS fallback (slower but at least correct). Tests: 7 new unit cases in NvFP4ValidatePackedScaleShapes covering typical Qwen3 / Gemma-4 expert shapes, transposed scale, group_size=8, group_size=32, zero-inner-dim, and the tiny F1 baseline test fixture. test-quant 96/96 → 103/103, no regression. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(safetensors): warn on dropped malformed tensor entries (F5) Closes audit finding F5. Until now, six paths in load_shard silently dropped tensor entries with no log line: missing/non-string 'dtype', missing/non-array 'shape', ndim > kMaxDims, missing/wrong-arity 'data_offsets', and the post-F4 offset-validation rejections. Users with a corrupt checkpoint would see tensor_map come back partially populated with zero diagnostic output; downstream null-checks would make load look "successful" with wrong outputs at inference. Replace each silent `continue` with a counter-bumped IMP_LOG_WARN line naming the tensor and the specific reason. Add an end-of-shard summary line breaking out the counts per reason (no_dtype / no_shape / too_many_dims / no_offsets / offset_validation), so users can scan a single log line to see "did this checkpoint load cleanly". Tests: 2 new unit cases in SafeTensorsMalformedEntryWarnings covering (a) missing dtype + missing shape on synthetic blob, and (b) byte-count mismatch from F4. Both verify via gtest's CaptureStderr that the WARN line includes the tensor name and the rejection reason. test-core 146/146 → 148/148, no regression. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(audit): mark all P0/P1/P2 findings closed + DONE summary Final phase. Marks every audit finding F1-F8 with its closing commit SHA, records the final P3 deferrals (F9 spec-compliance, F10 UX), and writes docs/audit/DONE.md with the full run summary: - 1/1 P0 closed, 3/3 P1 closed, 4/4 P2 closed, 2/2 P3 deferred - 0 FEASIBLE roadmap items (all 27 deferred per the conditional model with documented reason in followups.md) - 40 new unit tests, full suite 769/769 pass-or-skip-only with 0 failures - verify-fast: decode +5.47%, prefill +8.17% over baseline (noise + prior commits; this run's changes are loader-time only and do not touch the hot path) - 2 ADRs (pure-C++ ref harness, 128 MiB header cap) - No new third-party dependencies in any manifest - CMAKE_CUDA_STANDARD=20 unchanged progress.log force-added (project gitignore covers *.log; this is an intentional audit artifact named in the spec). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Cleanup + diagnostic work on the llm-compressor NVFP4 path. No actual long-sequence degeneration to fix — empirical bracket 2026-05-07 establishes the previously-suspected NVFP4-specific bug is in fact a model-inherent / generic-imp limitation that reproduces on Q8_0 GGUF and on Modelopt-format NVFP4 too.
What this PR ships
Zero/non-finite guard on reciprocal flip (
src/graph/executor_pre_dequant.cu) —w.tensor_scale = 1.0f / h_scalewould silently produce+Infwhenweight_scale_2 == 0, poisoning a layer's GEMM output via the alpha epilogue. Now: detects this, substitutes0.0, emits WARN with offending key. Edge case but failure-mode-preventing.Stat counters at INFO level — post-promotion log surfaces zero / non-finite scale events without needing
IMP_AUDIT_NVFP4_SCALES=1.Input_scale visibility — llm-compressor's
input_scaleis loaded intonvfp4_scratch_but never consumed at inference; this PR makes that fact visible in the boot log.docs/roadmap.mdrevised — the previous "llm-compressor NVFP4: degenerate output past ~30 tokens" section is replaced with an honest description tracking what the empirical evidence actually shows.Empirical refutation of the input_scale hypothesis
Hypothesis: roadmap line "Final fix would load and apply the per-Linear
input_scale", applied as a per-tensor scalar GEMM alpha modifier.Tested on Gemma-4-26B-A4B-it-NVFP4 via
scripts/validate_safetensors.py(full 20-prompt battery):promoted_scale / h_input_scaleown own own own own…promoted_scale * h_input_scaleown- own own own…Both directions break the model identically — refuted. Math derivation in
memory/llm_compressor_input_scale_dead_end_2026_05_07.mdshows imp's dynamic per-block input quant already lands at the correct GEMM output without input_scale absorption.Long-context recall failure isn't NVFP4-specific
The 1 real failure on Gemma-4-NVFP4 (prompt 6
long_context_recall, 2048-token sentinel) reproduces identically on:So this is a copy-from-context attention/recall limitation, not a weight quantization issue. The fix (if any) belongs in attention/KV cache, not the NVFP4 path.
What this PR does NOT do
llm_compressor_cutlass_skip_2026_05_05.mddocuments the A/B that established it).long_context_recall(separate, not NVFP4-specific work).Test plan
make verify-fastclean — decode tg128 154.91 (+4.78%), prefill pp512 14391 (+8.38%), smoke prompt OKscripts/validate_safetensors.py --model Gemma-4-26B-A4B-it-NVFP4: 18/20 phase4 (= baseline, no regression)🤖 Generated with Claude Code