
fix(nvfp4): defensive guard for llm-compressor zero/non-finite tensor_scale + input_scale visibility#113

Merged
kekzl merged 4 commits into main from fix/llm-compressor-nvfp4-defensive on May 7, 2026

Conversation

@kekzl (Owner) commented May 7, 2026

Summary

Cleanup + diagnostic work on the llm-compressor NVFP4 path. There is no actual long-sequence degeneration to fix — the empirical bracket of 2026-05-07 establishes that the previously suspected NVFP4-specific bug is in fact a model-inherent / generic-imp limitation that also reproduces on Q8_0 GGUF and on Modelopt-format NVFP4.

What this PR ships

  1. Zero/non-finite guard on the reciprocal flip (src/graph/executor_pre_dequant.cu) — `w.tensor_scale = 1.0f / h_scale` would silently produce +Inf when `weight_scale_2 == 0`, poisoning a layer's GEMM output via the alpha epilogue. The guard now detects this, substitutes 0.0, and emits a WARN with the offending key. An edge case, but it prevents a silent failure mode.

  2. Stat counters at INFO level — post-promotion log surfaces zero / non-finite scale events without needing IMP_AUDIT_NVFP4_SCALES=1.

  3. Input_scale visibility — llm-compressor's input_scale is loaded into nvfp4_scratch_ but never consumed at inference; this PR makes that fact visible in the boot log.

  4. docs/roadmap.md revised — the previous "llm-compressor NVFP4: degenerate output past ~30 tokens" section is replaced with an honest description tracking what the empirical evidence actually shows.
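The guard in item 1 can be sketched as host-side logic roughly like the following (function and counter names here are illustrative, not imp's actual API):

```cpp
#include <cmath>
#include <cstdio>

// Hypothetical sketch of the reciprocal-flip guard described above.
// llm-compressor stores weight_scale_2 such that imp must flip it to
// 1/h_scale before use; a zero or non-finite value must not reach the GEMM.
float promote_tensor_scale(float h_scale, const char* key,
                           int* zero_count, int* nonfinite_count) {
    if (!std::isfinite(h_scale)) {   // NaN / ±Inf straight from the checkpoint
        ++*nonfinite_count;
        std::fprintf(stderr, "WARN: non-finite weight_scale_2 for %s\n", key);
        return 0.0f;
    }
    float flipped = 1.0f / h_scale;  // the reciprocal flip
    if (!std::isfinite(flipped)) {   // h_scale == 0, or subnormal overflow
        ++*zero_count;
        std::fprintf(stderr, "WARN: zero/degenerate weight_scale_2 for %s\n", key);
        return 0.0f;                 // zero contribution instead of +Inf
    }
    return flipped;
}
```

Substituting 0.0 means the affected Linear contributes nothing rather than NaN-poisoning every downstream token, matching the "zero contribution" behavior the PR describes.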

Empirical refutation of the input_scale hypothesis

Hypothesis (from the roadmap line "Final fix would load and apply the per-Linear input_scale"): apply input_scale as a per-tensor scalar GEMM alpha modifier.

Tested on Gemma-4-26B-A4B-it-NVFP4 via scripts/validate_safetensors.py (full 20-prompt battery):

| Build | phase4 | Typical failure |
| --- | --- | --- |
| current main (skip-guard ON) | 18/20 | (baseline) |
| this PR (defensive only) | 18/20 | no change — only edge-case guards |
| promoted_scale / h_input_scale | 4/20 | "own own own own own…" |
| promoted_scale * h_input_scale | 4/20 | "own- own own own…" |

Both directions break the model identically — refuted. Math derivation in memory/llm_compressor_input_scale_dead_end_2026_05_07.md shows imp's dynamic per-block input quant already lands at the correct GEMM output without input_scale absorption.

Long-context recall failure isn't NVFP4-specific

The one real failure on Gemma-4-NVFP4 (prompt 6, long_context_recall, 2048-token sentinel) reproduces identically on:

  • gemma-4-26B-A4B-it-Q8_0 GGUF (≈2× higher precision per weight, no NVFP4 anywhere) — manual A/B 2026-05-07
  • Qwen3-30B-A3B-NVFP4-Modelopt (Modelopt format, different scale convention) — full battery 2026-05-07

So this is a copy-from-context attention/recall limitation, not a weight quantization issue. The fix (if any) belongs in attention/KV cache, not the NVFP4 path.

What this PR does NOT do

  • Lift the CUTLASS skip-guard (separate concern; memory llm_compressor_cutlass_skip_2026_05_05.md documents the A/B that established it).
  • Apply input_scale (refuted as scalar; per-channel SmoothQuant variant may still apply for Mistral-3.2-NVFP4 but requires that model locally).
  • Fix long_context_recall (separate, not NVFP4-specific work).

Test plan

  • make verify-fast clean — decode tg128 154.91 (+4.78%), prefill pp512 14391 (+8.38%), smoke prompt OK
  • No behavior change for Modelopt-format models
  • scripts/validate_safetensors.py --model Gemma-4-26B-A4B-it-NVFP4: 18/20 phase4 (= baseline, no regression)
  • Hypothesis A/B (input_scale divide/multiply): both 4/20, refuted
  • Cross-format check (Modelopt + Q8_0 GGUF): same long_context_recall failure → not NVFP4-specific

🤖 Generated with Claude Code

kekzl and others added 4 commits May 7, 2026 08:13
The llm-compressor reciprocal flip (`tensor_scale = 1/h_scale`) silently
produces +Inf when `weight_scale_2 == 0`, contaminating the entire
layer's GEMM output with NaN/Inf via the alpha epilogue. Some
llm-compressor exports legitimately emit zero scale on all-zero weight
blocks; without this guard, those Linears poison every downstream token.

Defensive fix: detect h_scale == 0 and non-finite reciprocal results
explicitly. Substitute 0.0 (zero contribution) instead of Inf and emit
a WARN with the offending key. Promotion stats are surfaced post-loop.

Also surfaces input_scale presence at INFO level when an llm-compressor
model carries it: the data is loaded into nvfp4_scratch_ but not yet
consumed at inference (see docs/roadmap.md NVFP4 long-context section).
This makes the gap visible by default rather than only under
IMP_AUDIT_NVFP4_SCALES=1.

Does NOT fix the underlying degeneration on long sequences with
SmoothQuant-calibrated llm-compressor models — that requires per-layer
numerical validation against a Modelopt-format reference (memory/
llm_compressor_cutlass_skip_2026_05_05.md identifies CUTLASS
non-determinism + ElementC=half_t precision as suspect roots).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t-specific

Empirical bracket 2026-05-07 against same prompt + sentinel:
  - Gemma-4-NVFP4 (llm-compressor) — fails (regurgitates doc content)
  - gemma-4-Q8_0 GGUF              — same failure mode
  - Qwen3-30B-A3B-NVFP4-Modelopt   — also fails

Failure on Q8_0 (no NVFP4) and on Modelopt (different scale convention)
refutes the previous "llm-compressor NVFP4 degenerate" framing. The
2048-token sentinel-recall failure is a copy-from-context attention
limitation, not weight quantization noise. Any fix belongs in attention/
KV cache, not in tensor_scale handling.

input_scale absorption as scalar GEMM alpha modifier (the previous
"final fix" line) was tested in both directions and refuted (18/20 →
4/20). Per-channel SmoothQuant correction may still apply for Mistral-3.2,
but that's not the same code path.

See memory/llm_compressor_input_scale_dead_end_2026_05_07.md for the
re-runnable bracket recipe.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…cture

Doc-length sweep 128–2048 tokens (sentinel placed at midpoint, max_tokens=512):
  Gemma-4-NVFP4 (llm-compressor)   — fails ≥768, regurgitates doc content
  gemma-4-Q8_0 GGUF                — same failure mode (no NVFP4 anywhere)
  Qwen3-30B-A3B-NVFP4-Modelopt     — passes ALL sizes 128–2048

Failure is Gemma-4-specific, not NVFP4-specific. Gemma-4 has 5:1 SWA:full
attention with sliding_window=1024 — only 5/30 layers carry long-range
context. Qwen3-30B (full attention everywhere) handles the same prompt
fine through imp.

Code review of executor_attention.cu:491-501 shows imp correctly disables
sliding_window on Gemma-4 global layers and uses per-layer head_dim
(256 SWA / 512 global). No obvious imp bug in the dispatch path.

Two unresolved hypotheses pending llama.cpp reference:
  1. Model-inherent (Gemma-4-26B can't reliably recall via 5 full layers)
  2. imp Gemma-4 bug (partial_rotary_factor=0.25 rope_freqs interaction
     at long context)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…mp bug

Same gemma-4-26B-A4B-it-Q8_0.gguf file, same prompt, same sentinel:

  imp        — fails at all sizes >= 1024 (regurgitates doc, "not in text")
  llama.cpp  — passes ALL sizes 128-2048 (build 9049)

  ghcr.io/ggml-org/llama.cpp:server-cuda v9049
  --model gemma-4-26B-A4B-it-Q8_0.gguf -ngl 99 -c 4096

This refutes the "model-inherent" hypothesis. imp has a Gemma-4-specific
long-context attention bug; llama.cpp on identical weights does not.

Roadmap section reflects the confirmed status. Real fix requires
intermediate-output comparison llama.cpp <-> imp layer-by-layer at the
failure boundary to pinpoint the divergence.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@kekzl kekzl merged commit 5ee219f into main May 7, 2026
2 checks passed
@kekzl kekzl deleted the fix/llm-compressor-nvfp4-defensive branch May 7, 2026 16:57
github-actions Bot pushed a commit that referenced this pull request May 7, 2026
* inventory: roadmap and backlog discovery 2026-05

Phase 0 of the SafeTensors+NVFP4 hardening run. Compiles every roadmap-like
artifact in the repo (docs/roadmap.md, docs/sm120-real-perf-plan.md, the
"truly unresolved" list at the bottom of docs/audit/safetensors_audit.md, plus
git log + open issues/PRs) into a single inventory and classifies each item
FEASIBLE / UNCERTAIN / INFEASIBLE / OBSOLETE under the Quality Gate.

Result: 0 FEASIBLE, 1 UNCERTAIN (native SentencePiece parser, deferred to a
dedicated session in favor of correctness-hardening work this run), 21
INFEASIBLE (multi-week kernel/architecture work or refuted dead-ends), 5
OBSOLETE (already shipped or shelved). Consistent with imp's mature state —
the listed items are by construction either dead-ends or large undertakings.

Per the mission's conditional model, this run focuses on Objective 1
(SafeTensors + NVFP4 hardening) only. Every deferred item is captured in
docs/audit/followups.md with a specific reason and pre-conditions to revisit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* audit: safetensors + nvfp4 loader audit 2026-05

Phase 1 of the SafeTensors+NVFP4 hardening run. Builds on top of the existing
docs/audit/safetensors_audit.md (Phase 1 + Phase 2 / PR #116) and identifies
ten residual hardening findings F1-F10 that survived Phase 2:

- F1 (P0): no reference numerical test against the compressed-tensors NVFP4
  spec. Roundtrip tests cannot catch a paired sign-flip / nibble-order /
  missing-factor bug between imp's quantizer and dequantizer.
- F2 (P1): Modelopt NVFP4 weight_scale_2 lacks the isfinite guard PR #113
  added for the llm-compressor path. NaN/Inf propagates layer-wide.
- F3 (P1): header-size validation has an integer overflow at
  safetensors_loader.cpp:519-524. header_size = UINT64_MAX-4 makes
  8 + header_size wrap to 3, bypassing the check.
- F4 (P1): tensor offsets not validated against shape*dtype. Three sub-bugs:
  no offset_start<=offset_end, no size match, no per-tensor start in-bounds.
- F5-F8 (P2): silent drops on malformed tensor entries, missing NVFP4 packed
  vs weight_scale shape check, missing header-size upper bound, missing
  weight_scale dtype enforcement (NVFP4/MXFP4 cross-misrouting risk).
- F9-F10 (P3): documented and skipped — spec-compliance and UX only.

Also catalogs items already deferred via the prior audit's "truly unresolved"
section (GLM, SentencePiece, AWQ kernel, DeepSeek MLA, multimodal, Tiktoken),
which now live in docs/audit/followups.md.

No code changes — this is the read-only audit commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* plan: master plan loader + roadmap 2026-05

Phase 2 of the run. Combines Phase 1 audit findings F1-F8 (Phase 0 yielded
no FEASIBLE roadmap items, so the plan is loader/NVFP4 hardening only).

Eight items, ordered P0 → P1 → P2. F9-F10 deferred (spec-compliance and UX
only). Two ADRs:
- 0001 — Pure-C++ reference harness for unit-level NVFP4 numerics
- 0002 — 128 MiB SafeTensors header-size soft cap

Each item is small, isolated, has a deterministic synthetic test fixture,
zero new dependencies, and a clear root-cause-vs-symptom delineation.
Test infrastructure reuses two new test files
(test_nvfp4_compressed_tensors_ref.cu, test_safetensors_loader.cpp)
across multiple items to keep churn low.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(nvfp4): reference numerical test for compressed-tensors dequant (F1)

Closes audit finding F1. Adds tests/test_nvfp4_compressed_tensors_ref.cu with
four GTest cases that build a synthetic compressed-tensors NVFP4 weight in
memory exactly per the on-disk spec (uint8 nibble-packed E2M1 + FP8 E4M3
weight_scale at group_size=16 + FP32 weight_scale_2) and assert that imp's
gemv_nvfp4_kpar produces the same Y = W·X as a pure-host reference dequant
following val = e2m1_to_f32(nibble) * fp8_e4m3_to_f32(scale) * weight_scale_2.
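The reference formula quoted above can be sketched as pure host code. The E2M1 value table is the standard FP4 set; low-nibble-first packing is an assumption for this sketch, not something this commit message confirms:

```cpp
#include <cstdint>

// E2M1 (FP4) decode: codes 0..7 map to the positive value set,
// bit 3 is the sign.
static const float kE2M1[8] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};

float e2m1_to_f32(uint8_t nibble) {
    float v = kE2M1[nibble & 0x7];
    return (nibble & 0x8) ? -v : v;
}

// Dequantize one packed byte (two nibbles; low nibble first is an
// assumption here) with a per-block FP8 scale (already decoded to f32)
// and the tensor-level weight_scale_2, per
// val = e2m1_to_f32(nibble) * block_scale * weight_scale_2.
void dequant_pair(uint8_t packed, float block_scale, float tensor_scale,
                  float* lo, float* hi) {
    *lo = e2m1_to_f32(packed & 0xF) * block_scale * tensor_scale;
    *hi = e2m1_to_f32(packed >> 4)  * block_scale * tensor_scale;
}
```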

Existing roundtrip tests (test_nvfp4_quant_ref.cu, test_nvfp4_quant_hw.cu,
test_nvfp4_gemv_kpar_loop.cu) all run imp's quantizer through imp's
dequantizer; a paired sign-flip / nibble-order bug or a missing factor (e.g.
dropping weight_scale_2) would not be detectable. This test starts from the
spec format directly so any future regression that breaks spec compliance
fails CI.

Cases:
- BaselineUnityScales: mixed nibbles, FP8=1.0, tensor_scale=1.0 — catches
  alignment / nibble-decode bugs at unity factors
- TwoLevelScalingVaryingPerBlock: cycling per-block scales + non-trivial
  tensor_scale=0.125 — catches a missing or sign-flipped factor
- ZeroTensorScaleProducesZeroOutput: tensor_scale=0 must produce 0 output,
  not NaN/Inf, regardless of garbage in weight_packed (also pre-validates F2)
- NegativeWeightsSignPreserved: every nibble = -1.0 → output exactly -K,
  catches sign-bit drop in nibble decode

Tolerance is max-abs-diff < 1e-2 in FP16 output (FMA-order divergence between
sequential reference and imp's parallel-warp accumulator dominates; 1e-5 is
unrealistic for K=128 FP16 dot-product reductions).

ADR 0001 records the rationale for choosing pure-C++ reference over the
other two harness options (existing imp Python infra: not wired for unit
numerics; subprocess into user's HF venv: cross-process complexity + runtime
dep). Pure C++ is dependency-free, deterministic, exact for the formula,
and fits inside the existing GTest harness.

All 4 new tests pass; full test-quant binary remains 82/82 green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(nvfp4): defensive zeroing for non-finite Modelopt weight_scale_2 (F2)

Closes audit finding F2. Phase 2 / PR #113 added a zero/non-finite guard for
the llm-compressor NVFP4 promote path; the Modelopt path at the else branch
in executor_pre_dequant.cu was unguarded and would propagate NaN/±Inf
weight_scale_2 into the GEMM, contaminating the entire layer's hidden state
and downstream KV cache.

Refactor: extract the scale-promotion math into nvfp4_promote_weight_scale_2
in quant/nvfp4_quant.h/.cu — pure host function, testable without CUDA. Both
formats now share the defensive logic:
  - non-finite h_scale          → 0.0f (both formats)
  - llm-compressor h_scale=0    → 0.0f (avoids 1/0 = +Inf via reciprocal flip)
  - llm-compressor 1/h_scale non-finite → 0.0f (subnormal flip overflow)
  - Modelopt h_scale=0          → 0.0f (legitimate "null layer", flagged for diag)

executor_pre_dequant.cu's promote() lambda calls the helper; the existing
counter / WARN summary remains, with WARN messages distinguishing the
Modelopt-NaN/Inf case from llm-compressor reciprocal flip.

Tests: 9 new unit tests in NvFP4PromoteWeightScale2 covering NaN, +Inf,
-Inf, zero, denorm-flip, and finite cases for both formats. Test-quant
suite: 82/82 → 91/91 (no regression).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(safetensors): overflow-safe header_size validation + 128 MiB cap (F3 + F7)

Closes audit findings F3 and F7 from docs/audit/safetensors_nvfp4_audit_2026-05.md.

The previous header_size validation at safetensors_loader.cpp:519-524 used
`8 + header_size > file_size`. With a malicious / corrupt
header_size = UINT64_MAX-4 the addition wraps to 3, which is NOT greater
than any file_size — the check silently bypasses. The loader then constructs
JsonParser(json_data, static_cast<size_t>(header_size)) and reads past the
mmap region → SIGSEGV.

Refactor the check into safetensors_internal::validate_header_size, exposed
in safetensors_loader.h for unit testing. Two rules:
- header_size > file_size - 8 (overflow-safe; file_size >= 8 is enforced upstream)
- header_size > kMaxHeaderBytes (128 MiB soft cap per ADR 0002)

The cap rejects pathological inputs that would force the JSON parser to scan
multi-GB regions. Real models have headers below 1 MiB; 128 MiB is far above
legitimate use.
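The two rules above, in overflow-safe form (the function mirrors the described safetensors_internal::validate_header_size; the exact signature is assumed):

```cpp
#include <cstdint>

// 128 MiB soft cap per ADR 0002.
constexpr uint64_t kMaxHeaderBytes = 128ull * 1024 * 1024;

// Caller guarantees file_size >= 8 (the 8-byte length prefix itself),
// so the subtraction below cannot wrap — unlike the old `8 + header_size`
// form, which overflowed for header_size near UINT64_MAX.
bool validate_header_size(uint64_t header_size, uint64_t file_size) {
    if (header_size > file_size - 8) return false;   // header must fit in file
    if (header_size > kMaxHeaderBytes) return false; // reject pathological JSON
    return true;
}
```

Moving the 8 to the right-hand side as a subtraction is the whole fix: the comparison stays in-range for every representable header_size.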

Tests: 6 new unit cases covering truncated files, exact-minimum, typical
size, header > file, UINT64_MAX overflow attack (the F3 bug), and the
128 MiB soft cap boundary. test-core suite: 139/139 → 139/139 (1 skipped is
unchanged TensorKindCoverage.NoUnknownKindsInSmallQwen).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(safetensors): per-tensor offset and size validation (F4)

Closes audit finding F4. The previous in-loop validation at
safetensors_loader.cpp:572-580 only checked
`tensor_data_offset + offset_end > file_size`, leaving three silent
correctness paths open:

1. offset_start > offset_end was not checked. A swapped data_offsets pair
   would compute a negative "size" interpreted as a huge unsigned value;
   downstream kernel reads would walk backwards into adjacent tensor data.
2. offset_end - offset_start was not compared against the byte count implied
   by shape × dtype. A file declaring an FP16 [1024,1024] tensor with
   offsets [0, 1024] would silently load 0.05% of the actual weight data,
   then yield uninitialized bytes for the rest.
3. tensor_data_offset + offset_start in-bounds was never asserted; only the
   end was. A start-past-EOF could be tolerated if the (truncated) end was
   in-bounds — possible with a corrupted-but-self-consistent file.

Add safetensors_internal::validate_tensor_offsets, a host-only helper
exposed via safetensors_loader.h for unit testing. Three rules in the
order the spec implies. Overflow-safe (subtractions only, with the upstream
header_size invariant guaranteeing tensor_data_offset <= file_size).
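The three rules can be sketched as follows (parameter names and the exact signature are assumed; the upstream invariant tensor_data_offset <= file_size is taken as given):

```cpp
#include <cstdint>

// expected_bytes = product(shape) * dtype_size, computed by the caller.
// data_base is the absolute file offset where the tensor-data region starts.
bool validate_tensor_offsets(uint64_t off_start, uint64_t off_end,
                             uint64_t expected_bytes,
                             uint64_t data_base, uint64_t file_size) {
    if (off_start > off_end) return false;                    // rule 1: swapped pair
    if (off_end - off_start != expected_bytes) return false;  // rule 2: shape*dtype
    // rule 3: start in-bounds (subtraction safe: data_base <= file_size upstream)
    if (off_start > file_size - data_base) return false;
    // end in-bounds — the legacy end-only check, in overflow-safe form
    if (off_end > file_size - data_base) return false;
    return true;
}
```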

The per-tensor check is wired into load_shard's tensor enumeration loop:
when wire_dtype_bytes is known (mapping from SafeTensors wire string —
F32, F16, BF16, FP8_E4M3 etc.), the strict 3-rule validation runs;
otherwise the legacy lenient end-only check applies. The "unknown wire
type" path retains existing behavior because safetensors_dtype()'s WARN
already fires for those tensors at the actual emit step.

Tests: 7 new unit cases in SafeTensorsValidateTensorOffsets covering valid,
swap, OOB-end, byte-count mismatch, zero-size, invariant violation, and
exact-boundary cases. test-core 146/146 → 146/146, test-quant 91/91
unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(nvfp4): enforce FP8_E4M3 weight_scale dtype at promote (F8)

Closes audit finding F8. The compressed-tensors NVFP4 spec mandates
float8_e4m3fn for weight_scale. Until now imp's promote step accepted
whatever qtype the loader produced, which opened a NVFP4↔MXFP4
cross-misrouting silent-corruption path: a model misclassified as NVFP4
but shipping U8 (UE8M0) weight_scale bytes would silently load, then
gemv_nvfp4_kpar would interpret the UE8M0 bytes as E4M3 and produce
~2× wrong scales (powers of two interpreted as E4M3 normals).
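The dtype gate itself is a one-line predicate; a sketch with assumed QType enumerator names (imp's real enum is not shown here):

```cpp
// Assumed enumerators for illustration only.
enum class QType { NONE, FP8_E4M3, FP8_E5M2, U8, F16 };

// The compressed-tensors NVFP4 spec mandates float8_e4m3fn for
// weight_scale. Anything else (e.g. U8 carrying UE8M0 bytes from a
// misclassified MXFP4 checkpoint) must be rejected before promote, or
// gemv_nvfp4_kpar would decode those bytes as E4M3 and scale wrongly.
bool nvfp4_validate_weight_scale_dtype(QType t) {
    return t == QType::FP8_E4M3;
}
```

On rejection the weight stays unpromoted and falls back to dequant→cuBLAS, as the commit describes: slower, but numerically correct.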

Add nvfp4_validate_weight_scale_dtype(QType, *err) in nvfp4_quant.h/.cu —
pure host predicate. The promote() lambda calls it before applying the
two-level scaling formula; on rejection, the weight stays in its loaded
state and the dequant→cuBLAS fallback runs (slower, but correct).

A new end-of-load summary line surfaces the count of skipped weights so
the user can detect the cross-misroute case.

Tests: 5 new unit cases in NvFP4ValidateWeightScaleDtype covering the
accepted dtype, the MXFP4 INT8 case, FP8_E5M2 (activation-only), F16
(some pipelines emit this), and the QType::NONE sentinel. test-quant
91/91 → 96/96, no regression.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(nvfp4): validate weight_packed/weight_scale shape pair at promote (F6)

Closes audit finding F6. The NVFP4 GEMV kernel hard-codes group_size=16
(kMicroBlockSize=16 at nvfp4_gemm.cu:31) and reads K/16 micro-scales per row.
Until now, the loader never verified that weight_scale's shape matches that
contract; a checkpoint with group_size=8 or a transposed weight_scale would
silently load and silently produce wrong output (12.5% per-element step
quant noise on roughly half the elements, or scales aligned onto wrong rows).
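The shape contract can be sketched as a predicate. This assumes the conventional NVFP4 layout — packed stores two E2M1 values per byte (K/2 bytes per row) and weight_scale one FP8 per 16-element micro-block (K/16 per row) — which matches the group_size=16 contract described above but is not spelled out in this commit:

```cpp
#include <cstdint>

bool validate_packed_scale_shapes(uint64_t packed_outer, uint64_t packed_inner,
                                  uint64_t scale_outer, uint64_t scale_inner) {
    if (packed_outer != scale_outer) return false;  // transposed weight_scale
    uint64_t k = packed_inner * 2;                  // logical elements per row
    if (k == 0 || k % 16 != 0) return false;        // kernel hard-codes group 16
    return scale_inner == k / 16;                   // rejects group_size 8 / 32
}
```

A group_size=8 checkpoint would present scale_inner = K/8 and fail the final equality, routing the weight to the dequant→cuBLAS fallback instead of silently misaligning scales.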

Add nvfp4_validate_packed_scale_shapes(packed_outer, packed_inner,
scale_outer, scale_inner, *err). The promote() lambda calls it for 2D
weights — the per-expert MoE case has been split to 2D by weight_upload.cu
before promote runs. Mismatches WARN + skip promotion, which routes the
weight to the dequant→cuBLAS fallback (slower but at least correct).

Tests: 7 new unit cases in NvFP4ValidatePackedScaleShapes covering
typical Qwen3 / Gemma-4 expert shapes, transposed scale, group_size=8,
group_size=32, zero-inner-dim, and the tiny F1 baseline test fixture.
test-quant 96/96 → 103/103, no regression.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(safetensors): warn on dropped malformed tensor entries (F5)

Closes audit finding F5. Until now, six paths in load_shard silently dropped
tensor entries with no log line: missing/non-string 'dtype', missing/non-array
'shape', ndim > kMaxDims, missing/wrong-arity 'data_offsets', and the
post-F4 offset-validation rejections. Users with a corrupt checkpoint
would see tensor_map come back partially populated with zero diagnostic
output; downstream null-checks would make load look "successful" with
wrong outputs at inference.

Replace each silent `continue` with a counter-bumped IMP_LOG_WARN line
naming the tensor and the specific reason. Add an end-of-shard summary
line breaking out the counts per reason (no_dtype / no_shape /
too_many_dims / no_offsets / offset_validation), so users can scan a single
log line to see "did this checkpoint load cleanly".

Tests: 2 new unit cases in SafeTensorsMalformedEntryWarnings covering
(a) missing dtype + missing shape on synthetic blob, and (b) byte-count
mismatch from F4. Both verify via gtest's CaptureStderr that the WARN line
includes the tensor name and the rejection reason. test-core 146/146 →
148/148, no regression.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(audit): mark all P0/P1/P2 findings closed + DONE summary

Final phase. Marks every audit finding F1-F8 with its closing commit SHA,
records the final P3 deferrals (F9 spec-compliance, F10 UX), and writes
docs/audit/DONE.md with the full run summary:

- 1/1 P0 closed, 3/3 P1 closed, 4/4 P2 closed, 2/2 P3 deferred
- 0 FEASIBLE roadmap items (all 27 deferred per the conditional model with
  documented reason in followups.md)
- 40 new unit tests, full suite 769/769 pass-or-skip-only with 0 failures
- verify-fast: decode +5.47%, prefill +8.17% over baseline (noise + prior
  commits; this run's changes are loader-time only and do not touch the
  hot path)
- 2 ADRs (pure-C++ ref harness, 128 MiB header cap)
- No new third-party dependencies in any manifest
- CMAKE_CUDA_STANDARD=20 unchanged

progress.log force-added (project gitignore covers *.log; this is an
intentional audit artifact named in the spec).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>