
feat(safetensors): zero-config auto-detect + observability (audit Phase 2)#116

Merged
kekzl merged 11 commits into main from chore/safetensors-audit-phase-2
May 7, 2026

Conversation


@kekzl kekzl commented May 7, 2026

Summary

Phase 2 of the SafeTensors loading audit — closes 15 of 18 actionable items from docs/audit/safetensors_audit.md (also added in this PR). The remaining three (#16-18) land detection + warnings only; the actual implementations are multi-week efforts tracked separately.

The unifying goal: any local SafeTensors model directory should either work out of the box or fail with an actionable warning, never silently produce wrong output.

What this PR does

Real auto-detect / quality fixes (#1-15)

  • Llama-3 RoPE scaling (type: "llama3") per-frequency factor table — long-context Llama-3.x checkpoints no longer silently lose the wavelen ramp. Reuses LongRoPE infrastructure.
  • HF config tri-states for tie_word_embeddings, attention_bias, mlp_bias — loader can now cross-check config flags vs actual tensor presence and warn on mismatches.
  • Multi-element architectures[] arrays now log dropped entries (was silent).
  • Unknown-arch fallback to GENERIC is now an explicit WARN with actionable advice (add a class mapping).
  • Modelopt kv_cache_quant_algo parsed and surfaced informationally (FP8 KV not auto-flipped because correctness varies by family).
  • F8_E5M2 → FP8_E4M3 silent proxy now logs a one-shot WARN.
  • Multi-shard load logs [i/N] mmap'd shard … per worker for visibility on big checkpoints.
  • recipe.yaml non-NVFP4 schemes soften from hard-error to warn-and-fallback (load with wire dtype).
  • GPTQ desc_act plumbed through; warn when config promises desc_act but g_idx tensor is absent (silent miscompute path).
  • NVFP4 input_scale documented as audit-only (refuted as long-context-bug cause); skip prod GPU upload, save VRAM.
  • IMP_GDN_LAYOUT=tiled|grouped env override for cross-converted Qwen3.5/3.6 GDN checkpoints.
  • Llama-4 / Qwen3.5 non-MoE HF class names mapped (previously fell back to GENERIC).
  • MXFP4 SafeTensors detection (no decode path yet — points users to GGUF).
  • SentencePiece-only checkpoints get an actionable error with the conversion recipe instead of a confusing null-tokenizer crash.

Detection + warnings only (#16-18)

  • AWQ — config detection across both nesting layouts (HF standard + AutoAWQ legacy), supports legacy w_bit / q_group_size field names. WARN on load that no AWQ dequant kernel exists yet.
  • DeepSeek MLA — V2/V3 checkpoints with kv_lora_rank > 0 or q_lora_rank > 0 trigger a WARN that imp's DEEPSEEK path uses standard MHA and produces incorrect outputs.
  • Multimodal vision tower — vision_config block presence triggers a WARN that the vision tower will be skipped (only the language head loads).

Audit doc

docs/audit/safetensors_audit.md — 5-section deep dive with file:line citations: model detection, tokenizer loading, weight loading, quantization auto-detection, architecture quirks. Cross-cutting findings + actionable items grouped by user impact.

Test plan

  • make build passes (CUDA 13.2 toolchain inside Docker).
  • make test-unit — 322 tests pass (1 skipped baseline). 9 new tests added in test_hf_config_loader.cpp covering Llama-3 per-pair factor table, tri-state flags, arch fallback, AWQ config, MLA detection, vision config, MXFP4 detection, newly-mapped arch names, and Llama-3 degenerate config skip.
  • Pre-push hook (verify-fast) passes — full test suite + perf gate + smoke prompt (Qwen3-4B Q8_0 distinct=8, contains 'Paris').
  • Recommended manual smoke: load a Llama-3.1-8B SafeTensors directory and confirm long-context coherence (no regressions vs FP16/BF16 baseline).

Out of scope

These remain as future PRs (each multi-week):

  • Native AWQ dequant kernel (or dequant-to-FP16 fallback)
  • DeepSeek MLA attention path
  • Multimodal SafeTensors loader (Qwen-VL, Llava, Pixtral, Gemma-3 vision)
  • Native SentencePiece (.model protobuf) parser

🤖 Generated with Claude Code

kekzl and others added 11 commits May 7, 2026 20:37
Adds the missing pieces from the SafeTensors audit's "cheap auto-detect"
group:

- Llama-3 RoPE scaling: parses `rope_scaling.type=="llama3"` with
  low_freq_factor / high_freq_factor / original_max_position_embeddings
  and reuses the existing LongRoPE per-pair frequency infrastructure.
  Without it, long-context Llama-3.x checkpoints silently lost the
  per-frequency wavelen ramp.

- Tri-state flags (-1 unset, 0 false, 1 true) for `tie_word_embeddings`,
  `attention_bias`, `mlp_bias`. Previously logged-only or not parsed.
  The SafeTensors loader uses these to cross-check actual tensor
  presence (next commit) instead of pure null-detection.

- `kv_cache_quant_hint` (string from Modelopt's `kv_cache_quant_algo`)
  surfaced via NvFP4Config. Engine does NOT auto-flip the KV dtype —
  FP8 KV is known to break several model families even with author
  opt-in (see kv_dtype_tradeoffs memory). Hint is informational.

- `arch_inferred_fallback` flag + improved warn when config.json's
  `architectures[0]` is unknown or absent. Multi-element `architectures`
  arrays now log the dropped entries.

Tests added for Llama-3 RoPE pair-table, tie_word_embeddings tri-state,
arch fallback flag, and degenerate Llama-3 config skip.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
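
For readers unfamiliar with the `llama3` rope_scaling rule, here is a minimal standalone sketch of the per-frequency ramp it describes, following the HF transformers reference behaviour. The function name and interface below are illustrative, not imp's LongRoPE API.

```cpp
// Sketch of the "llama3" rope_scaling per-frequency ramp (HF reference behaviour).
#include <vector>

std::vector<float> llama3_scaled_inv_freq(const std::vector<float>& inv_freq,
                                          float factor,            // rope_scaling.factor
                                          float low_freq_factor,   // rope_scaling.low_freq_factor
                                          float high_freq_factor,  // rope_scaling.high_freq_factor
                                          float orig_max_pos) {    // original_max_position_embeddings
  constexpr float kTwoPi = 6.283185307179586f;
  const float low_freq_wavelen  = orig_max_pos / low_freq_factor;
  const float high_freq_wavelen = orig_max_pos / high_freq_factor;
  std::vector<float> out(inv_freq.size());
  for (size_t i = 0; i < inv_freq.size(); ++i) {
    const float wavelen = kTwoPi / inv_freq[i];
    if (wavelen < high_freq_wavelen) {
      out[i] = inv_freq[i];                          // short wavelengths: unscaled
    } else if (wavelen > low_freq_wavelen) {
      out[i] = inv_freq[i] / factor;                 // long wavelengths: fully scaled
    } else {                                         // in between: smooth wavelen ramp
      const float smooth = (orig_max_pos / wavelen - low_freq_factor) /
                           (high_freq_factor - low_freq_factor);
      out[i] = (1.0f - smooth) * inv_freq[i] / factor + smooth * inv_freq[i];
    }
  }
  return out;
}
```
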
…gress

Consumes the new ModelConfig flags added in the previous commit and
addresses three observability gaps from the SafeTensors audit:

- tie_word_embeddings cross-check: warn when config says `tie=false`
  but lm_head.weight is missing (silent tying would mask the gap).
- attention_bias cross-check: warn at end of load when config promises
  bias but Q/K/V bias tensors are missing per layer.
- arch_inferred_fallback: loud final WARN when config.json detection
  fell back to GENERIC + tensor-name heuristics, so users know their
  model may run incoherently.

Plus:

- F8_E5M2 -> FP8_E4M3 silent proxy now emits a one-shot WARN at first
  occurrence (avoid per-tensor spam).
- Shard load progress: each parallel shard worker logs `[i/N] mmap'd
  shard ...` on completion. Useful when working set spikes on 32 GB
  host with multi-shard NVFP4 / Gemma-4 / Qwen3.6 checkpoints.
- Surfaces Modelopt's kv_cache_quant_hint as an info log (decision to
  honor it stays with the user via --kv-fp8 / --kv-nvfp4).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
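
A reduced sketch of the two cross-checks, assuming the tri-state convention from the previous commit; the tensor-presence booleans stand in for imp's actual tensor map, and the log calls are simplified.

```cpp
// Sketch of the config-vs-tensor cross-checks described above (illustrative names).
#include <cstdio>

void cross_check_after_load(int tie_word_embeddings,   // -1 unset, 0 false, 1 true
                            int attention_bias,        // -1 unset, 0 false, 1 true
                            bool has_lm_head_weight,
                            bool all_layers_have_qkv_bias) {
  if (tie_word_embeddings == 0 && !has_lm_head_weight) {
    std::fprintf(stderr, "WARN: config says tie_word_embeddings=false but "
                         "lm_head.weight is missing; embeddings will be tied silently\n");
  }
  if (attention_bias == 1 && !all_layers_have_qkv_bias) {
    std::fprintf(stderr, "WARN: config promises attention_bias but Q/K/V bias "
                         "tensors are missing in at least one layer\n");
  }
}
```
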
Previously a `recipe.yaml` whose scheme is not NVFP4 / NVFP4_W4A16 hard-
errored at parse time and blocked the entire model load. The SafeTensors
loader treats `load_nvfp4_config()==false` as "no NVFP4 metadata", so
returning false here lets weights load with their wire dtype (FP16/BF16/
FP8) instead of failing outright. Quantization metadata is dropped — the
weights are decoded as raw — but inference can still proceed for
unsupported schemes (W8A8-INT8, FP8, …).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
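
The contract in sketch form; the real `load_nvfp4_config()` signature differs, and only the return-false-means-no-metadata behaviour is taken from this commit.

```cpp
// Sketch only: returning false means "no NVFP4 metadata", so the SafeTensors
// loader proceeds with the wire dtype instead of aborting the model load.
#include <cstdio>
#include <string>

bool load_nvfp4_config_scheme(const std::string& scheme) {   // scheme parsed from recipe.yaml
  if (scheme != "NVFP4" && scheme != "NVFP4_W4A16") {
    std::fprintf(stderr, "WARN: recipe.yaml scheme '%s' is not supported for NVFP4 decode; "
                         "loading weights in their wire dtype instead\n", scheme.c_str());
    return false;   // caller treats this as "no NVFP4 metadata", not as a load failure
  }
  // ... parse the actual NVFP4 metadata here ...
  return true;
}
```
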
Produced via 5 parallel research passes, citations file:line. Maps
imp's current SafeTensors pipeline against vLLM/llama.cpp parity for
zero-config local model loading. Cross-cutting findings split into:

  Hard gaps     — DeepSeek MLA, AWQ, MXFP4-on-SafeTensors,
                  SentencePiece parser, Llama-3 RoPE, multimodal
                  loaders, missing arch enums (GLM, Llama4, Qwen3.5).
  Quality drops — tie_word_embeddings (now wired), GPTQ desc_act
                  ignored, NVFP4 input_scale loaded-but-unused,
                  FP8 E5M2 proxy, bias config flags unvalidated,
                  GDN head layout chosen by loader path.
  UX            — silent GENERIC fallback (now warns), multi-arch
                  array drop (now warns), recipe.yaml hard-error
                  (now soft-fails), no shard progress (now logs).

Phase-2 cheap auto-detect/UX wins are addressed in the three preceding
commits; remaining items are tracked separately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GPTQ models exported with activation reordering ship a g_idx tensor
that maps each K-position to its quantization group. The dequant
kernel uses g_idx if present, else falls back to sequential grouping
(`col / group_size`). Sequential grouping on a desc_act model
silently produces wrong outputs.

Plumb the desc_act flag from `quantize_config.json` through
`GPTQConfig` → `TransformerLayer::GPTQWeight` and warn once at upload
time if the export ships `desc_act:true` without a g_idx tensor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
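
The group-index selection boils down to a single branch; a hedged sketch with illustrative names, not imp's actual GPTQ kernel code.

```cpp
// Sketch of the g_idx-vs-sequential grouping decision and the one-shot warning.
#include <cstdint>
#include <cstdio>

inline int quant_group_for_k(int k, int group_size, const int32_t* g_idx /* may be null */) {
  return g_idx ? g_idx[k] : k / group_size;   // sequential fallback when g_idx is absent
}

// One-shot warning (illustration only, not thread-safe): desc_act promised but
// no g_idx tensor shipped means sequential grouping would silently miscompute.
void warn_if_desc_act_without_g_idx(bool desc_act, const int32_t* g_idx) {
  static bool warned = false;
  if (desc_act && g_idx == nullptr && !warned) {
    warned = true;
    std::fprintf(stderr, "WARN: quantize_config.json has desc_act=true but no g_idx tensor; "
                         "sequential grouping may produce wrong outputs\n");
  }
}
```
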
Some llm-compressor NVFP4 exports ship a SmoothQuant-style activation
rescaling vector (`input_scale`). Imp loads it for diagnostics but no
GEMM kernel reads it. The long-context-bug investigation refuted the
hypothesis that absorbing it would fix Mistral-3.2-NVFP4 drift (see
memory `llm_compressor_input_scale_dead_end_2026_05_07.md`), so its
status is "intentionally unused".

- Document that explicitly in the Phase-0 promote() log line.
- Skip the GPU upload of `input_scale` unless `IMP_AUDIT_NVFP4_SCALES=1`
  is set. Saves ~one FP32 scalar per Linear of VRAM in production
  (small, but zero-value-add today).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
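
A minimal sketch of the env gate; only the variable name IMP_AUDIT_NVFP4_SCALES comes from this commit, the helpers around it are hypothetical.

```cpp
// Sketch of the audit-only gate for input_scale uploads.
#include <cstdlib>
#include <cstring>

inline bool audit_nvfp4_scales_enabled() {
  const char* v = std::getenv("IMP_AUDIT_NVFP4_SCALES");
  return v != nullptr && std::strcmp(v, "1") == 0;
}

// Decide whether an input_scale tensor is uploaded to the GPU at all:
// no production kernel reads it, so skip unless the audit flag is set.
inline bool should_upload_input_scale() { return audit_nvfp4_scales_enabled(); }
```
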
The GDN head layout (grouped vs tiled) is currently chosen by loader
path: HF SafeTensors → grouped, GGUF → tiled. Cross-converted
checkpoints (HF → GGUF → HF, or third-party re-packers) can ship in
the opposite layout, making the per-layer scan kernel mis-route heads
to the wrong groups.

Add `IMP_GDN_LAYOUT=tiled|grouped` to override the SafeTensors default.
Default behaviour is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
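
A sketch of how such an override can be read; the enum and default below are illustrative, only the env var name and its tiled|grouped values come from this commit.

```cpp
// Sketch of the IMP_GDN_LAYOUT override for the SafeTensors path.
#include <cstdlib>
#include <cstring>

enum class GdnLayout { Grouped, Tiled };

inline GdnLayout gdn_layout_for_safetensors() {
  const char* v = std::getenv("IMP_GDN_LAYOUT");
  if (v != nullptr) {
    if (std::strcmp(v, "tiled") == 0)   return GdnLayout::Tiled;
    if (std::strcmp(v, "grouped") == 0) return GdnLayout::Grouped;
    // unknown value: fall through to the default below
  }
  return GdnLayout::Grouped;   // SafeTensors default, unchanged
}
```
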
Two related auto-detect extensions from the SafeTensors audit:

#15 Architecture mappings:
  - Llama4ForCausalLM / Llama4ForConditionalGeneration → ModelArch::LLAMA4
  - Qwen3_5ForCausalLM / Qwen3_5ForConditionalGeneration → ModelArch::QWEN35
  Plus the corresponding `model_type` fallback entries (qwen3_5*, llama4)
  and synced HF class entries in model.cpp parse_model_arch. GLM
  remains unmapped — its forward path needs a real implementation;
  silently routing GLM through LLAMA risked wrong outputs.

#13 MXFP4 SafeTensors detection:
  - New `HFConfigLoader::MxFP4Config` + `load_mxfp4_config()` parses
    `config.json:quantization_config.quant_method == "mxfp4"` (case
    variations included), extracts `block_size` (default 32).
  - ModelConfig gains `is_mxfp4_prequant` / `mxfp4_block_size` flags.
  - SafeTensors loader sets the flags + emits a clear WARN that the
    SafeTensors decode path is not yet implemented (use GGUF for
    actual MXFP4 inference). The metadata is now visible to higher
    layers, even if the decode path is future work.

Tests added for both: NewlyMappedArchClassNames and
Mxfp4QuantConfigDetection.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
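
The MXFP4 detection rule in sketch form; the helper below is hypothetical (not imp's JsonParser API), and only the field names, the case-insensitive match, and the block_size default of 32 come from this commit.

```cpp
// Sketch of the quant_method == "mxfp4" detection with block_size default 32.
#include <algorithm>
#include <cctype>
#include <string>

struct MxFP4ConfigSketch { bool is_mxfp4 = false; int block_size = 32; };

inline MxFP4ConfigSketch detect_mxfp4(std::string quant_method, int block_size_if_present) {
  MxFP4ConfigSketch cfg;
  std::transform(quant_method.begin(), quant_method.end(), quant_method.begin(),
                 [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
  cfg.is_mxfp4 = (quant_method == "mxfp4");
  if (cfg.is_mxfp4 && block_size_if_present > 0) cfg.block_size = block_size_if_present;
  return cfg;
}
```
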
imp's SafeTensors path parses tokenizer.json natively but has no
SentencePiece (.model protobuf) parser yet. Older Llama 1/2, original
Mistral, and similar checkpoints ship only `tokenizer.model`. Before
this commit they would load weights, fail to encode chat input, and
crash with a confusing null-tokenizer dereference.

Now: detect the situation and emit an explicit IMP_LOG_ERROR with the
two known workarounds (regenerate tokenizer.json via transformers, or
convert to GGUF). No-tokenizer-at-all gets a separate WARN.

A native SentencePiece parser remains a deferred audit item (gap #14).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds `HFConfigLoader::load_awq_config()` covering both flavours of AWQ
config:
  1. `quantization_config.quant_method == "awq"` in `config.json`
     (HuggingFace standard layout)
  2. `quant_method == "awq"` at the top of `quant_config.json` (older
     AutoAWQ exports), supporting the legacy `w_bit` / `q_group_size`
     field names too.

Loader sets `cfg.is_awq_prequant` / `cfg.awq_group_size` and emits a
WARN explaining that imp does not yet have an AWQ dequant kernel —
weights will load as their wire dtype and inference produces wrong
outputs. Directs users to GPTQ / NVFP4 alternatives.

Tests added for both config layouts and field-name variations.

This commit's test file also bundles tests for the MLA + vision-tower
detections that follow in the next commit (single shared hunk in
test_hf_config_loader.cpp).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…#18

Two early-warning detections that catch silently-incorrect inference
paths that imp does NOT yet implement:

#17 DeepSeek-V2/V3 MLA (Multi-head Latent Attention):
  Detected via `kv_lora_rank > 0` or `q_lora_rank > 0` in config.json.
  imp's DEEPSEEK forward path is standard MHA, so feeding it an MLA
  checkpoint produces wrong outputs. Loader emits a WARN naming both
  ranks. Fixing this requires a real MLA kernel (separate audit item).

#18 Multimodal vision tower:
  Detected via `vision_config` block presence in config.json. imp's
  SafeTensors loader skips vision tensors today, so chat-with-images
  silently degrades to text-only. WARN names the vision model_type
  (typical: `siglip_vision_model`).

Both detections are pure config inspection — no runtime cost, no
behaviour change. The actual implementation work is tracked as
separate audit items.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
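
Both detections reduce to a few config comparisons; a sketch with illustrative names, assuming the fields have already been parsed out of config.json.

```cpp
// Sketch of the MLA and vision-tower early warnings (pure config inspection).
#include <cstdio>
#include <string>

void warn_on_unsupported_features(int kv_lora_rank, int q_lora_rank,
                                  bool has_vision_config,
                                  const std::string& vision_model_type) {
  if (kv_lora_rank > 0 || q_lora_rank > 0) {
    std::fprintf(stderr, "WARN: DeepSeek MLA checkpoint (kv_lora_rank=%d, q_lora_rank=%d); "
                         "the DEEPSEEK path uses standard MHA and will produce incorrect outputs\n",
                 kv_lora_rank, q_lora_rank);
  }
  if (has_vision_config) {
    std::fprintf(stderr, "WARN: vision_config present (model_type=%s); the vision tower is "
                         "skipped, only the language head will load\n", vision_model_type.c_str());
  }
}
```
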
@kekzl kekzl enabled auto-merge (squash) May 7, 2026 18:43
@kekzl kekzl merged commit 2de77fb into main May 7, 2026
2 checks passed
@kekzl kekzl deleted the chore/safetensors-audit-phase-2 branch May 7, 2026 18:47
github-actions Bot pushed a commit that referenced this pull request May 7, 2026
* inventory: roadmap and backlog discovery 2026-05

Phase 0 of the SafeTensors+NVFP4 hardening run. Compiles every roadmap-like
artifact in the repo (docs/roadmap.md, docs/sm120-real-perf-plan.md, the
"truly unresolved" list at the bottom of docs/audit/safetensors_audit.md, plus
git log + open issues/PRs) into a single inventory and classifies each item
FEASIBLE / UNCERTAIN / INFEASIBLE / OBSOLETE under the Quality Gate.

Result: 0 FEASIBLE, 1 UNCERTAIN (native SentencePiece parser, deferred to a
dedicated session in favor of correctness-hardening work this run), 21
INFEASIBLE (multi-week kernel/architecture work or refuted dead-ends), 5
OBSOLETE (already shipped or shelved). Consistent with imp's mature state —
the listed items are by construction either dead-ends or large undertakings.

Per the mission's conditional model, this run focuses on Objective 1
(SafeTensors + NVFP4 hardening) only. Every deferred item is captured in
docs/audit/followups.md with a specific reason and pre-conditions to revisit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* audit: safetensors + nvfp4 loader audit 2026-05

Phase 1 of the SafeTensors+NVFP4 hardening run. Builds on top of the existing
docs/audit/safetensors_audit.md (Phase 1 + Phase 2 / PR #116) and identifies
ten residual hardening findings F1-F10 that survived Phase 2:

- F1 (P0): no reference numerical test against the compressed-tensors NVFP4
  spec. Roundtrip tests cannot catch a paired sign-flip / nibble-order /
  missing-factor bug between imp's quantizer and dequantizer.
- F2 (P1): Modelopt NVFP4 weight_scale_2 lacks the isfinite guard PR #113
  added for the llm-compressor path. NaN/Inf propagates layer-wide.
- F3 (P1): header-size validation has integer overflow at
  safetensors_loader.cpp:519-524. A header_size of UINT64_MAX-4 makes 8+header_size wrap to 3, bypassing the check.
- F4 (P1): tensor offsets not validated against shape*dtype. Three sub-bugs:
  no offset_start<=offset_end, no size match, no per-tensor start in-bounds.
- F5-F8 (P2): silent drops on malformed tensor entries, missing NVFP4 packed
  vs weight_scale shape check, missing header-size upper bound, missing
  weight_scale dtype enforcement (NVFP4/MXFP4 cross-misrouting risk).
- F9-F10 (P3): documented and skipped — spec-compliance and UX only.

Also catalogs items already deferred via the prior audit's "truly unresolved"
section (GLM, SentencePiece, AWQ kernel, DeepSeek MLA, multimodal, Tiktoken),
which now live in docs/audit/followups.md.

No code changes — this is the read-only audit commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* plan: master plan loader + roadmap 2026-05

Phase 2 of the run. Combines Phase 1 audit findings F1-F8 (Phase 0 yielded
no FEASIBLE roadmap items, so the plan is loader/NVFP4 hardening only).

Eight items, ordered P0 → P1 → P2. F9-F10 deferred (spec-compliance and UX
only). Two ADRs:
- 0001 — Pure-C++ reference harness for unit-level NVFP4 numerics
- 0002 — 128 MiB SafeTensors header-size soft cap

Each item is small, isolated, has a deterministic synthetic test fixture,
zero new dependencies, and a clear root-cause-vs-symptom delineation.
Test infrastructure reuses two new test files
(test_nvfp4_compressed_tensors_ref.cu, test_safetensors_loader.cpp)
across multiple items to keep churn low.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(nvfp4): reference numerical test for compressed-tensors dequant (F1)

Closes audit finding F1. Adds tests/test_nvfp4_compressed_tensors_ref.cu with
four GTest cases that build a synthetic compressed-tensors NVFP4 weight in
memory exactly per the on-disk spec (uint8 nibble-packed E2M1 + FP8 E4M3
weight_scale at group_size=16 + FP32 weight_scale_2) and assert that imp's
gemv_nvfp4_kpar produces the same Y = W·X as a pure-host reference dequant
following val = e2m1_to_f32(nibble) * fp8_e4m3_to_f32(scale) * weight_scale_2.

Existing roundtrip tests (test_nvfp4_quant_ref.cu, test_nvfp4_quant_hw.cu,
test_nvfp4_gemv_kpar_loop.cu) all run imp's quantizer through imp's
dequantizer; a paired sign-flip / nibble-order bug or a missing factor (e.g.
dropping weight_scale_2) would not be detectable. This test starts from the
spec format directly so any future regression that breaks spec compliance
fails CI.

Cases:
- BaselineUnityScales: mixed nibbles, FP8=1.0, tensor_scale=1.0 — catches
  alignment / nibble-decode bugs at unity factors
- TwoLevelScalingVaryingPerBlock: cycling per-block scales + non-trivial
  tensor_scale=0.125 — catches a missing or sign-flipped factor
- ZeroTensorScaleProducesZeroOutput: tensor_scale=0 must produce 0 output,
  not NaN/Inf, regardless of garbage in weight_packed (also pre-validates F2)
- NegativeWeightsSignPreserved: every nibble = -1.0 → output exactly -K,
  catches sign-bit drop in nibble decode

Tolerance is max-abs-diff < 1e-2 in FP16 output (FMA-order divergence between
sequential reference and imp's parallel-warp accumulator dominates; 1e-5 is
unrealistic for K=128 FP16 dot-product reductions).

ADR 0001 records the rationale for choosing pure-C++ reference over the
other two harness options (existing imp Python infra: not wired for unit
numerics; subprocess into user's HF venv: cross-process complexity + runtime
dep). Pure C++ is dependency-free, deterministic, exact for the formula,
and fits inside the existing GTest harness.

All 4 new tests pass; full test-quant binary remains 82/82 green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
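
For context, the reference formula the test asserts against fits in a few lines of host code. This is a self-contained sketch, not the test file itself: the E2M1 magnitude table and E4M3FN decode follow the standard formats, but the nibble order (low nibble = even column) is an assumption; pinning down exactly such conventions is what the spec-format test is for.

```cpp
// Host reference for: val = e2m1_to_f32(nibble) * fp8_e4m3_to_f32(scale) * weight_scale_2
#include <cmath>
#include <cstdint>
#include <limits>
#include <vector>

inline float e2m1_to_f32(uint8_t nibble) {                       // FP4 E2M1: sign + 3 magnitude bits
  static const float kMag[8] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};
  const float mag = kMag[nibble & 0x7];
  return (nibble & 0x8) ? -mag : mag;
}

inline float fp8_e4m3_to_f32(uint8_t v) {                        // FP8 E4M3FN, bias 7, no infinities
  const int exp = (v >> 3) & 0xF, man = v & 0x7;
  float mag;
  if (exp == 0)                      mag = std::ldexp(man / 8.0f, -6);            // subnormal
  else if (exp == 0xF && man == 0x7) mag = std::numeric_limits<float>::quiet_NaN();
  else                               mag = std::ldexp(1.0f + man / 8.0f, exp - 7);
  return (v & 0x80) ? -mag : mag;
}

// y[n] = sum_k W[n,k] * x[k]; W is nibble-packed per row, one FP8 scale per
// 16-element group, one FP32 tensor scale (weight_scale_2) for the whole weight.
std::vector<float> reference_gemv(const std::vector<uint8_t>& packed,  // N x K/2 bytes
                                  const std::vector<uint8_t>& scales,  // N x K/16 bytes
                                  float weight_scale_2,
                                  const std::vector<float>& x, int N, int K) {
  std::vector<float> y(N, 0.0f);
  for (int n = 0; n < N; ++n)
    for (int k = 0; k < K; ++k) {
      const uint8_t byte   = packed[n * (K / 2) + k / 2];
      const uint8_t nibble = (k % 2 == 0) ? uint8_t(byte & 0xF) : uint8_t(byte >> 4);
      const float   scale  = fp8_e4m3_to_f32(scales[n * (K / 16) + k / 16]);
      y[n] += e2m1_to_f32(nibble) * scale * weight_scale_2 * x[k];
    }
  return y;
}
```
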

* fix(nvfp4): defensive zeroing for non-finite Modelopt weight_scale_2 (F2)

Closes audit finding F2. Phase 2 / PR #113 added a zero/non-finite guard for
the llm-compressor NVFP4 promote path; the Modelopt path at the else branch
in executor_pre_dequant.cu was unguarded and would propagate NaN/±Inf
weight_scale_2 into the GEMM, contaminating the entire layer's hidden state
and downstream KV cache.

Refactor: extract the scale-promotion math into nvfp4_promote_weight_scale_2
in quant/nvfp4_quant.h/.cu — pure host function, testable without CUDA. Both
formats now share the defensive logic:
  - non-finite h_scale          → 0.0f (both formats)
  - llm-compressor h_scale=0    → 0.0f (avoids 1/0 = +Inf via reciprocal flip)
  - llm-compressor 1/h_scale non-finite → 0.0f (subnormal flip overflow)
  - Modelopt h_scale=0          → 0.0f (legitimate "null layer", flagged for diag)

executor_pre_dequant.cu's promote() lambda calls the helper; the existing
counter / WARN summary remains, with WARN messages distinguishing the
Modelopt-NaN/Inf case from llm-compressor reciprocal flip.

Tests: 9 new unit tests in NvFP4PromoteWeightScale2 covering NaN, +Inf,
-Inf, zero, denorm-flip, and finite cases for both formats. Test-quant
suite: 82/82 → 91/91 (no regression).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
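
A sketch of the shared guard, mirroring only the four rules listed above; the real helper lives in quant/nvfp4_quant.h/.cu and its signature may differ.

```cpp
// Sketch of the defensive promotion: any value that would poison the GEMM maps to 0.0f.
#include <cmath>

inline float nvfp4_promote_weight_scale_2_sketch(float h_scale, bool is_llm_compressor) {
  if (!std::isfinite(h_scale)) return 0.0f;        // NaN / +-Inf (both formats)
  if (h_scale == 0.0f)        return 0.0f;         // llm-compressor: avoid 1/0; Modelopt: "null layer"
  if (!is_llm_compressor)     return h_scale;      // Modelopt: use scale as-is
  const float flipped = 1.0f / h_scale;            // llm-compressor reciprocal flip
  return std::isfinite(flipped) ? flipped : 0.0f;  // subnormal flip overflow maps to 0
}
```
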

* fix(safetensors): overflow-safe header_size validation + 128 MiB cap (F3 + F7)

Closes audit findings F3 and F7 from docs/audit/safetensors_nvfp4_audit_2026-05.md.

The previous header_size validation at safetensors_loader.cpp:519-524 used
`8 + header_size > file_size`. With a malicious / corrupt
header_size = UINT64_MAX-4 the addition wraps to 3, which is NOT greater
than any file_size — the check silently bypasses. The loader then constructs
JsonParser(json_data, static_cast<size_t>(header_size)) and reads past the
mmap region → SIGSEGV.

Refactor the check into safetensors_internal::validate_header_size, exposed
in safetensors_loader.h for unit testing. Two rules:
- header_size > file_size - 8 (overflow-safe; file_size >= 8 is enforced upstream)
- header_size > kMaxHeaderBytes (128 MiB soft cap per ADR 0002)

The cap rejects pathological inputs that would force the JSON parser to scan
multi-GB regions. Real models have headers below 1 MiB; 128 MiB is far above
legitimate use.

Tests: 6 new unit cases covering truncated files, exact-minimum, typical
size, header > file, UINT64_MAX overflow attack (the F3 bug), and the
128 MiB soft cap boundary. test-core suite: 139/139 → 139/139 (1 skipped is
unchanged TensorKindCoverage.NoUnknownKindsInSmallQwen).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
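
The two rules in sketch form, with a simplified signature; the real helper is safetensors_internal::validate_header_size.

```cpp
// Sketch of the overflow-safe header_size checks.
#include <cstdint>

constexpr uint64_t kMaxHeaderBytes = 128ull * 1024 * 1024;   // 128 MiB soft cap (ADR 0002)

// Precondition (enforced upstream): file_size >= 8, so file_size - 8 cannot wrap.
inline bool validate_header_size(uint64_t header_size, uint64_t file_size) {
  if (header_size > file_size - 8)   return false;   // overflow-safe form of 8 + header_size > file_size
  if (header_size > kMaxHeaderBytes) return false;   // reject pathological JSON headers
  return true;
}
```
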

* fix(safetensors): per-tensor offset and size validation (F4)

Closes audit finding F4. The previous in-loop validation at
safetensors_loader.cpp:572-580 only checked
`tensor_data_offset + offset_end > file_size`, leaving three silent
correctness paths open:

1. offset_start > offset_end was not checked. A swapped data_offsets pair
   would compute a negative "size" interpreted as a huge unsigned value;
   downstream kernel reads would walk backwards into adjacent tensor data.
2. offset_end - offset_start was not compared against the byte count implied
   by shape × dtype. A file declaring an FP16 [1024,1024] tensor with
   offsets [0, 1024] would silently load 0.05% of the actual weight data,
   then yield uninitialized bytes for the rest.
3. tensor_data_offset + offset_start in-bounds was never asserted; only the
   end was. A start-past-EOF could be tolerated if the (truncated) end was
   in-bounds — possible with a corrupted-but-self-consistent file.

Add safetensors_internal::validate_tensor_offsets, a host-only helper
exposed via safetensors_loader.h for unit testing. Three rules in the
order the spec implies. Overflow-safe (subtractions only, with the upstream
header_size invariant guaranteeing tensor_data_offset <= file_size).

The per-tensor check is wired into load_shard's tensor enumeration loop:
when wire_dtype_bytes is known (mapping from SafeTensors wire string —
F32, F16, BF16, FP8_E4M3 etc.), the strict 3-rule validation runs;
otherwise the legacy lenient end-only check applies. The "unknown wire
type" path retains existing behavior because safetensors_dtype()'s WARN
already fires for those tensors at the actual emit step.

Tests: 7 new unit cases in SafeTensorsValidateTensorOffsets covering valid,
swap, OOB-end, byte-count mismatch, zero-size, invariant violation, and
exact-boundary cases. test-core 146/146 → 146/146, test-quant 91/91
unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
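
The three rules in sketch form, assuming the upstream invariant tensor_data_offset <= file_size so every comparison stays subtraction-only; the real helper is safetensors_internal::validate_tensor_offsets and may differ in signature.

```cpp
// Sketch of the per-tensor offset/size validation.
#include <cstdint>

inline bool validate_tensor_offsets(uint64_t offset_start, uint64_t offset_end,
                                    uint64_t expected_bytes,      // product(shape) * dtype size
                                    uint64_t tensor_data_offset,  // end of header
                                    uint64_t file_size) {
  const uint64_t data_region = file_size - tensor_data_offset;      // safe: invariant above
  if (offset_start > offset_end)                   return false;    // rule 1: swapped pair
  if (offset_end - offset_start != expected_bytes) return false;    // rule 2: size matches shape*dtype
  if (offset_start > data_region || offset_end > data_region)
                                                   return false;    // rule 3: start and end in bounds
  return true;
}
```
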

* fix(nvfp4): enforce FP8_E4M3 weight_scale dtype at promote (F8)

Closes audit finding F8. The compressed-tensors NVFP4 spec mandates
float8_e4m3fn for weight_scale. Until now imp's promote step accepted
whatever qtype the loader produced, which opened a NVFP4↔MXFP4
cross-misrouting silent-corruption path: a model misclassified as NVFP4
but shipping U8 (UE8M0) weight_scale bytes would silently load, then
gemv_nvfp4_kpar would interpret the UE8M0 bytes as E4M3 and produce
~2× wrong scales (powers of two interpreted as E4M3 normals).

Add nvfp4_validate_weight_scale_dtype(QType, *err) in nvfp4_quant.h/.cu —
pure host predicate. The promote() lambda calls it before applying the
two-level scaling formula; on rejection, the weight stays in its loaded
state and the dequant→cuBLAS fallback runs (slower, but correct).

A new end-of-load summary line surfaces the count of skipped weights so
the user can detect the cross-misroute case.

Tests: 5 new unit cases in NvFP4ValidateWeightScaleDtype covering the
accepted dtype, the MXFP4 INT8 case, FP8_E5M2 (activation-only), F16
(some pipelines emit this), and the QType::NONE sentinel. test-quant
91/91 → 96/96, no regression.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(nvfp4): validate weight_packed/weight_scale shape pair at promote (F6)

Closes audit finding F6. The NVFP4 GEMV kernel hard-codes group_size=16
(kMicroBlockSize=16 at nvfp4_gemm.cu:31) and reads K/16 micro-scales per row.
Until now, the loader never verified that weight_scale's shape matches that
contract; a checkpoint with group_size=8 or a transposed weight_scale would
silently load and silently produce wrong output (12.5% per-element step
quant noise on roughly half the elements, or scales aligned onto wrong rows).

Add nvfp4_validate_packed_scale_shapes(packed_outer, packed_inner,
scale_outer, scale_inner, *err). The promote() lambda calls it for 2D
weights — the per-expert MoE case has been split to 2D by weight_upload.cu
before promote runs. Mismatches WARN + skip promotion, which routes the
weight to the dequant→cuBLAS fallback (slower but at least correct).

Tests: 7 new unit cases in NvFP4ValidatePackedScaleShapes covering
typical Qwen3 / Gemma-4 expert shapes, transposed scale, group_size=8,
group_size=32, zero-inner-dim, and the tiny F1 baseline test fixture.
test-quant 96/96 → 103/103, no regression.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
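
The shape contract in sketch form: for a logical [N, K] weight at group_size 16, weight_packed is [N, K/2] (two E2M1 nibbles per byte) and weight_scale is [N, K/16]. The helper below only mirrors that relationship and is not imp's actual validator.

```cpp
// Sketch of the packed/scale shape pairing check at group_size = 16.
#include <cstdint>

inline bool validate_packed_scale_shapes(int64_t packed_outer, int64_t packed_inner,
                                         int64_t scale_outer,  int64_t scale_inner) {
  return packed_outer == scale_outer           // row counts must match
      && packed_inner == 8 * scale_inner;      // K/2 == 8 * (K/16)
}
```
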

* fix(safetensors): warn on dropped malformed tensor entries (F5)

Closes audit finding F5. Until now, six paths in load_shard silently dropped
tensor entries with no log line: missing/non-string 'dtype', missing/non-array
'shape', ndim > kMaxDims, missing/wrong-arity 'data_offsets', and the
post-F4 offset-validation rejections. Users with a corrupt checkpoint
would see tensor_map come back partially populated with zero diagnostic
output; downstream null-checks would make load look "successful" with
wrong outputs at inference.

Replace each silent `continue` with a counter-bumped IMP_LOG_WARN line
naming the tensor and the specific reason. Add an end-of-shard summary
line breaking out the counts per reason (no_dtype / no_shape /
too_many_dims / no_offsets / offset_validation), so users can scan a single
log line to see "did this checkpoint load cleanly".

Tests: 2 new unit cases in SafeTensorsMalformedEntryWarnings covering
(a) missing dtype + missing shape on synthetic blob, and (b) byte-count
mismatch from F4. Both verify via gtest's CaptureStderr that the WARN line
includes the tensor name and the rejection reason. test-core 146/146 →
148/148, no regression.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(audit): mark all P0/P1/P2 findings closed + DONE summary

Final phase. Marks every audit finding F1-F8 with its closing commit SHA,
records the final P3 deferrals (F9 spec-compliance, F10 UX), and writes
docs/audit/DONE.md with the full run summary:

- 1/1 P0 closed, 3/3 P1 closed, 4/4 P2 closed, 2/2 P3 deferred
- 0 FEASIBLE roadmap items (all 27 deferred per the conditional model with
  documented reason in followups.md)
- 40 new unit tests, full suite 769/769 pass-or-skip-only with 0 failures
- verify-fast: decode +5.47%, prefill +8.17% over baseline (noise + prior
  commits; this run's changes are loader-time only and do not touch the
  hot path)
- 2 ADRs (pure-C++ ref harness, 128 MiB header cap)
- No new third-party dependencies in any manifest
- CMAKE_CUDA_STANDARD=20 unchanged

progress.log force-added (project gitignore covers *.log; this is an
intentional audit artifact named in the spec).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
