
feat(safetensors): zero-config auto-detect + observability (audit Phase 2)#116

Merged
kekzl merged 11 commits into main from chore/safetensors-audit-phase-2
May 7, 2026

Conversation


@kekzl kekzl commented May 7, 2026

Summary

Phase 2 of the SafeTensors loading audit — closes 15 of 18 actionable items from docs/audit/safetensors_audit.md (also added in this PR). The remaining three (#16-18) land detection + warnings only; the actual implementations are multi-week efforts tracked separately.

The unifying goal: any local SafeTensors model directory should either work out of the box or fail with an actionable warning, never silently produce wrong output.

What this PR does

Real auto-detect / quality fixes (#1-15)

  • Llama-3 RoPE scaling (type: "llama3") per-frequency factor table — long-context Llama-3.x checkpoints no longer silently lose the wavelen ramp. Reuses LongRoPE infrastructure.
  • HF config tri-states for tie_word_embeddings, attention_bias, mlp_bias — loader can now cross-check config flags vs actual tensor presence and warn on mismatches.
  • Multi-element architectures[] arrays now log dropped entries (was silent).
  • Unknown-arch fallback to GENERIC is now an explicit WARN with actionable advice (add a class mapping).
  • Modelopt kv_cache_quant_algo parsed and surfaced informationally (FP8 KV not auto-flipped because correctness varies by family).
  • F8_E5M2 → FP8_E4M3 silent proxy now logs a one-shot WARN.
  • Multi-shard load logs [i/N] mmap'd shard … per worker for visibility on big checkpoints.
  • recipe.yaml non-NVFP4 schemes soften from hard-error to warn-and-fallback (load with wire dtype).
  • GPTQ desc_act plumbed through; warn when config promises desc_act but g_idx tensor is absent (silent miscompute path).
  • NVFP4 input_scale documented as audit-only (refuted as long-context-bug cause); skip prod GPU upload, save VRAM.
  • IMP_GDN_LAYOUT=tiled|grouped env override for cross-converted Qwen3.5/3.6 GDN checkpoints.
  • Llama-4 / Qwen3.5 non-MoE HF class names mapped (previously fell back to GENERIC).
  • MXFP4 SafeTensors detection (no decode path yet — points users to GGUF).
  • SentencePiece-only checkpoints get an actionable error with the conversion recipe instead of a confusing null-tokenizer crash.

Detection + warnings only (#16-18)

  • AWQ — config detection across both nesting layouts (HF standard + AutoAWQ legacy), supports legacy w_bit / q_group_size field names. WARN on load that no AWQ dequant kernel exists yet.
  • DeepSeek MLA — V2/V3 checkpoints with kv_lora_rank > 0 or q_lora_rank > 0 trigger a WARN that imp's DEEPSEEK path uses standard MHA and produces incorrect outputs.
  • Multimodal vision tower — vision_config block presence triggers a WARN that the vision tower will be skipped (only the language head loads).

Audit doc

docs/audit/safetensors_audit.md — 5-section deep dive with file:line citations: model detection, tokenizer loading, weight loading, quantization auto-detection, architecture quirks. Cross-cutting findings + actionable items grouped by user impact.

Test plan

  • make build passes (CUDA 13.2 toolchain inside Docker).
  • make test-unit — 322 tests pass (1 skipped baseline). 9 new tests added in test_hf_config_loader.cpp covering Llama-3 per-pair factor table, tri-state flags, arch fallback, AWQ config, MLA detection, vision config, MXFP4 detection, newly-mapped arch names, and Llama-3 degenerate config skip.
  • Pre-push hook (verify-fast) passes — full test suite + perf gate + smoke prompt (Qwen3-4B Q8_0 distinct=8, contains 'Paris').
  • Recommended manual smoke: load a Llama-3.1-8B SafeTensors directory and confirm long-context coherence (no regressions vs FP16/BF16 baseline).

Out of scope

These remain as future PRs (each multi-week):

  • Native AWQ dequant kernel (or dequant-to-FP16 fallback)
  • DeepSeek MLA attention path
  • Multimodal SafeTensors loader (Qwen-VL, Llava, Pixtral, Gemma-3 vision)
  • Native SentencePiece (.model protobuf) parser

🤖 Generated with Claude Code

kekzl and others added 11 commits May 7, 2026 20:37
Adds the missing pieces from the SafeTensors audit's "cheap auto-detect"
group:

- Llama-3 RoPE scaling: parses `rope_scaling.type=="llama3"` with
  low_freq_factor / high_freq_factor / original_max_position_embeddings
  and reuses the existing LongRoPE per-pair frequency infrastructure.
  Without it, long-context Llama-3.x checkpoints silently lost the
  per-frequency wavelen ramp.

- Tri-state flags (-1 unset, 0 false, 1 true) for `tie_word_embeddings`,
  `attention_bias`, `mlp_bias`. Previously logged-only or not parsed.
  The SafeTensors loader uses these to cross-check actual tensor
  presence (next commit) instead of pure null-detection.

- `kv_cache_quant_hint` (string from Modelopt's `kv_cache_quant_algo`)
  surfaced via NvFP4Config. Engine does NOT auto-flip the KV dtype —
  FP8 KV is known to break several model families even with author
  opt-in (see kv_dtype_tradeoffs memory). Hint is informational.

- `arch_inferred_fallback` flag + improved warn when config.json's
  `architectures[0]` is unknown or absent. Multi-element `architectures`
  arrays now log the dropped entries.

Tests added for Llama-3 RoPE pair-table, tie_word_embeddings tri-state,
arch fallback flag, and degenerate Llama-3 config skip.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
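
For readers unfamiliar with the `llama3` rope_scaling rule, here is a minimal standalone sketch of the per-frequency ramp it describes, following the HF transformers reference behaviour. The function name and interface below are illustrative, not imp's LongRoPE API.

```cpp
// Sketch of the "llama3" rope_scaling per-frequency ramp (HF reference behaviour).
#include <vector>

std::vector<float> llama3_scaled_inv_freq(const std::vector<float>& inv_freq,
                                          float factor,            // rope_scaling.factor
                                          float low_freq_factor,   // rope_scaling.low_freq_factor
                                          float high_freq_factor,  // rope_scaling.high_freq_factor
                                          float orig_max_pos) {    // original_max_position_embeddings
  constexpr float kTwoPi = 6.283185307179586f;
  const float low_freq_wavelen  = orig_max_pos / low_freq_factor;
  const float high_freq_wavelen = orig_max_pos / high_freq_factor;
  std::vector<float> out(inv_freq.size());
  for (size_t i = 0; i < inv_freq.size(); ++i) {
    const float wavelen = kTwoPi / inv_freq[i];
    if (wavelen < high_freq_wavelen) {
      out[i] = inv_freq[i];                          // short wavelengths: unscaled
    } else if (wavelen > low_freq_wavelen) {
      out[i] = inv_freq[i] / factor;                 // long wavelengths: fully scaled
    } else {                                         // in between: smooth wavelen ramp
      const float smooth = (orig_max_pos / wavelen - low_freq_factor) /
                           (high_freq_factor - low_freq_factor);
      out[i] = (1.0f - smooth) * inv_freq[i] / factor + smooth * inv_freq[i];
    }
  }
  return out;
}
```
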
…gress

Consumes the new ModelConfig flags added in the previous commit and
addresses three observability gaps from the SafeTensors audit:

- tie_word_embeddings cross-check: warn when config says `tie=false`
  but lm_head.weight is missing (silent tying would mask the gap).
- attention_bias cross-check: warn at end of load when config promises
  bias but Q/K/V bias tensors are missing per layer.
- arch_inferred_fallback: loud final WARN when config.json detection
  fell back to GENERIC + tensor-name heuristics, so users know their
  model may run incoherently.

Plus:

- F8_E5M2 -> FP8_E4M3 silent proxy now emits a one-shot WARN at first
  occurrence (avoid per-tensor spam).
- Shard load progress: each parallel shard worker logs `[i/N] mmap'd
  shard ...` on completion. Useful when working set spikes on 32 GB
  host with multi-shard NVFP4 / Gemma-4 / Qwen3.6 checkpoints.
- Surfaces Modelopt's kv_cache_quant_hint as an info log (decision to
  honor it stays with the user via --kv-fp8 / --kv-nvfp4).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
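
A reduced sketch of the two cross-checks, assuming the tri-state convention from the previous commit; the tensor-presence booleans stand in for imp's actual tensor map, and the log calls are simplified.

```cpp
// Sketch of the config-vs-tensor cross-checks described above (illustrative names).
#include <cstdio>

void cross_check_after_load(int tie_word_embeddings,   // -1 unset, 0 false, 1 true
                            int attention_bias,        // -1 unset, 0 false, 1 true
                            bool has_lm_head_weight,
                            bool all_layers_have_qkv_bias) {
  if (tie_word_embeddings == 0 && !has_lm_head_weight) {
    std::fprintf(stderr, "WARN: config says tie_word_embeddings=false but "
                         "lm_head.weight is missing; embeddings will be tied silently\n");
  }
  if (attention_bias == 1 && !all_layers_have_qkv_bias) {
    std::fprintf(stderr, "WARN: config promises attention_bias but Q/K/V bias "
                         "tensors are missing in at least one layer\n");
  }
}
```
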
Previously a `recipe.yaml` whose scheme is not NVFP4 / NVFP4_W4A16 hard-
errored at parse time and blocked the entire model load. The SafeTensors
loader treats `load_nvfp4_config()==false` as "no NVFP4 metadata", so
returning false here lets weights load with their wire dtype (FP16/BF16/
FP8) instead of failing outright. Quantization metadata is dropped — the
weights are decoded as raw — but inference can still proceed for
unsupported schemes (W8A8-INT8, FP8, …).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
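
The contract in sketch form; the real `load_nvfp4_config()` signature differs, and only the return-false-means-no-metadata behaviour is taken from this commit.

```cpp
// Sketch only: returning false means "no NVFP4 metadata", so the SafeTensors
// loader proceeds with the wire dtype instead of aborting the model load.
#include <cstdio>
#include <string>

bool load_nvfp4_config_scheme(const std::string& scheme) {   // scheme parsed from recipe.yaml
  if (scheme != "NVFP4" && scheme != "NVFP4_W4A16") {
    std::fprintf(stderr, "WARN: recipe.yaml scheme '%s' is not supported for NVFP4 decode; "
                         "loading weights in their wire dtype instead\n", scheme.c_str());
    return false;   // caller treats this as "no NVFP4 metadata", not as a load failure
  }
  // ... parse the actual NVFP4 metadata here ...
  return true;
}
```
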
Produced via 5 parallel research passes, citations file:line. Maps
imp's current SafeTensors pipeline against vLLM/llama.cpp parity for
zero-config local model loading. Cross-cutting findings split into:

  Hard gaps     — DeepSeek MLA, AWQ, MXFP4-on-SafeTensors,
                  SentencePiece parser, Llama-3 RoPE, multimodal
                  loaders, missing arch enums (GLM, Llama4, Qwen3.5).
  Quality drops — tie_word_embeddings (now wired), GPTQ desc_act
                  ignored, NVFP4 input_scale loaded-but-unused,
                  FP8 E5M2 proxy, bias config flags unvalidated,
                  GDN head layout chosen by loader path.
  UX            — silent GENERIC fallback (now warns), multi-arch
                  array drop (now warns), recipe.yaml hard-error
                  (now soft-fails), no shard progress (now logs).

Phase-2 cheap auto-detect/UX wins are addressed in the three preceding
commits; remaining items are tracked separately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GPTQ models exported with activation reordering ship a g_idx tensor
that maps each K-position to its quantization group. The dequant
kernel uses g_idx if present, else falls back to sequential grouping
(`col / group_size`). Sequential grouping on a desc_act model
silently produces wrong outputs.

Plumb the desc_act flag from `quantize_config.json` through
`GPTQConfig` → `TransformerLayer::GPTQWeight` and warn once at upload
time if the export ships `desc_act:true` without a g_idx tensor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
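
The group-index selection boils down to a single branch; a hedged sketch with illustrative names, not imp's actual GPTQ kernel code.

```cpp
// Sketch of the g_idx-vs-sequential grouping decision and the one-shot warning.
#include <cstdint>
#include <cstdio>

inline int quant_group_for_k(int k, int group_size, const int32_t* g_idx /* may be null */) {
  return g_idx ? g_idx[k] : k / group_size;   // sequential fallback when g_idx is absent
}

// One-shot warning (illustration only, not thread-safe): desc_act promised but
// no g_idx tensor shipped means sequential grouping would silently miscompute.
void warn_if_desc_act_without_g_idx(bool desc_act, const int32_t* g_idx) {
  static bool warned = false;
  if (desc_act && g_idx == nullptr && !warned) {
    warned = true;
    std::fprintf(stderr, "WARN: quantize_config.json has desc_act=true but no g_idx tensor; "
                         "sequential grouping may produce wrong outputs\n");
  }
}
```
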
Some llm-compressor NVFP4 exports ship a SmoothQuant-style activation
rescaling vector (`input_scale`). Imp loads it for diagnostics but no
GEMM kernel reads it. The long-context-bug investigation refuted the
hypothesis that absorbing it would fix Mistral-3.2-NVFP4 drift (see
memory `llm_compressor_input_scale_dead_end_2026_05_07.md`), so its
status is "intentionally unused".

- Document that explicitly in the Phase-0 promote() log line.
- Skip the GPU upload of `input_scale` unless `IMP_AUDIT_NVFP4_SCALES=1`
  is set. Saves ~one FP32 scalar per Linear of VRAM in production
  (small, but zero-value-add today).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
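
A minimal sketch of the env gate; only the variable name IMP_AUDIT_NVFP4_SCALES comes from this commit, the helpers around it are hypothetical.

```cpp
// Sketch of the audit-only gate for input_scale uploads.
#include <cstdlib>
#include <cstring>

inline bool audit_nvfp4_scales_enabled() {
  const char* v = std::getenv("IMP_AUDIT_NVFP4_SCALES");
  return v != nullptr && std::strcmp(v, "1") == 0;
}

// Decide whether an input_scale tensor is uploaded to the GPU at all:
// no production kernel reads it, so skip unless the audit flag is set.
inline bool should_upload_input_scale() { return audit_nvfp4_scales_enabled(); }
```
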
The GDN head layout (grouped vs tiled) is currently chosen by loader
path: HF SafeTensors → grouped, GGUF → tiled. Cross-converted
checkpoints (HF → GGUF → HF, or third-party re-packers) can ship in
the opposite layout, making the per-layer scan kernel mis-route heads
to the wrong groups.

Add `IMP_GDN_LAYOUT=tiled|grouped` to override the SafeTensors default.
Default behaviour is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
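
A sketch of how such an override can be read; the enum and default below are illustrative, only the env var name and its tiled|grouped values come from this commit.

```cpp
// Sketch of the IMP_GDN_LAYOUT override for the SafeTensors path.
#include <cstdlib>
#include <cstring>

enum class GdnLayout { Grouped, Tiled };

inline GdnLayout gdn_layout_for_safetensors() {
  const char* v = std::getenv("IMP_GDN_LAYOUT");
  if (v != nullptr) {
    if (std::strcmp(v, "tiled") == 0)   return GdnLayout::Tiled;
    if (std::strcmp(v, "grouped") == 0) return GdnLayout::Grouped;
    // unknown value: fall through to the default below
  }
  return GdnLayout::Grouped;   // SafeTensors default, unchanged
}
```
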
Two related auto-detect extensions from the SafeTensors audit:

#15 Architecture mappings:
  - Llama4ForCausalLM / Llama4ForConditionalGeneration → ModelArch::LLAMA4
  - Qwen3_5ForCausalLM / Qwen3_5ForConditionalGeneration → ModelArch::QWEN35
  Plus the corresponding `model_type` fallback entries (qwen3_5*, llama4)
  and synced HF class entries in model.cpp parse_model_arch. GLM
  remains unmapped — its forward path needs a real implementation;
  silently routing GLM through LLAMA risked wrong outputs.

#13 MXFP4 SafeTensors detection:
  - New `HFConfigLoader::MxFP4Config` + `load_mxfp4_config()` parses
    `config.json:quantization_config.quant_method == "mxfp4"` (case
    variations included), extracts `block_size` (default 32).
  - ModelConfig gains `is_mxfp4_prequant` / `mxfp4_block_size` flags.
  - SafeTensors loader sets the flags + emits a clear WARN that the
    SafeTensors decode path is not yet implemented (use GGUF for
    actual MXFP4 inference). The metadata is now visible to higher
    layers, even if the decode path is future work.

Tests added for both: NewlyMappedArchClassNames and
Mxfp4QuantConfigDetection.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
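
The MXFP4 detection rule in sketch form; the helper below is hypothetical (not imp's JsonParser API), and only the field names, the case-insensitive match, and the block_size default of 32 come from this commit.

```cpp
// Sketch of the quant_method == "mxfp4" detection with block_size default 32.
#include <algorithm>
#include <cctype>
#include <string>

struct MxFP4ConfigSketch { bool is_mxfp4 = false; int block_size = 32; };

inline MxFP4ConfigSketch detect_mxfp4(std::string quant_method, int block_size_if_present) {
  MxFP4ConfigSketch cfg;
  std::transform(quant_method.begin(), quant_method.end(), quant_method.begin(),
                 [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
  cfg.is_mxfp4 = (quant_method == "mxfp4");
  if (cfg.is_mxfp4 && block_size_if_present > 0) cfg.block_size = block_size_if_present;
  return cfg;
}
```
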
imp's SafeTensors path parses tokenizer.json natively but has no
SentencePiece (.model protobuf) parser yet. Older Llama 1/2, original
Mistral, and similar checkpoints ship only `tokenizer.model`. Before
this commit they would load weights, fail to encode chat input, and
crash with a confusing null-tokenizer dereference.

Now: detect the situation and emit an explicit IMP_LOG_ERROR with the
two known workarounds (regenerate tokenizer.json via transformers, or
convert to GGUF). No-tokenizer-at-all gets a separate WARN.

A native SentencePiece parser remains a deferred audit item (gap #14).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds `HFConfigLoader::load_awq_config()` covering both flavours of AWQ
config:
  1. `quantization_config.quant_method == "awq"` in `config.json`
     (HuggingFace standard layout)
  2. `quant_method == "awq"` at the top of `quant_config.json` (older
     AutoAWQ exports), supporting the legacy `w_bit` / `q_group_size`
     field names too.

Loader sets `cfg.is_awq_prequant` / `cfg.awq_group_size` and emits a
WARN explaining that imp does not yet have an AWQ dequant kernel —
weights will load as their wire dtype and inference produces wrong
outputs. Directs users to GPTQ / NVFP4 alternatives.

Tests added for both config layouts and field-name variations.

This commit's test file also bundles tests for the MLA + vision-tower
detections that follow in the next commit (single shared hunk in
test_hf_config_loader.cpp).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…#18

Two early-warning detections that catch silently-incorrect inference
paths that imp does NOT yet implement:

#17 DeepSeek-V2/V3 MLA (Multi-head Latent Attention):
  Detected via `kv_lora_rank > 0` or `q_lora_rank > 0` in config.json.
  imp's DEEPSEEK forward path is standard MHA, so feeding it an MLA
  checkpoint produces wrong outputs. Loader emits a WARN naming both
  ranks. Fixing this requires a real MLA kernel (separate audit item).

#18 Multimodal vision tower:
  Detected via `vision_config` block presence in config.json. imp's
  SafeTensors loader skips vision tensors today, so chat-with-images
  silently degrades to text-only. WARN names the vision model_type
  (typical: `siglip_vision_model`).

Both detections are pure config inspection — no runtime cost, no
behaviour change. The actual implementation work is tracked as
separate audit items.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
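
Both detections reduce to a few config comparisons; a sketch with illustrative names, assuming the fields have already been parsed out of config.json.

```cpp
// Sketch of the MLA and vision-tower early warnings (pure config inspection).
#include <cstdio>
#include <string>

void warn_on_unsupported_features(int kv_lora_rank, int q_lora_rank,
                                  bool has_vision_config,
                                  const std::string& vision_model_type) {
  if (kv_lora_rank > 0 || q_lora_rank > 0) {
    std::fprintf(stderr, "WARN: DeepSeek MLA checkpoint (kv_lora_rank=%d, q_lora_rank=%d); "
                         "the DEEPSEEK path uses standard MHA and will produce incorrect outputs\n",
                 kv_lora_rank, q_lora_rank);
  }
  if (has_vision_config) {
    std::fprintf(stderr, "WARN: vision_config present (model_type=%s); the vision tower is "
                         "skipped, only the language head will load\n", vision_model_type.c_str());
  }
}
```
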
@kekzl kekzl enabled auto-merge (squash) May 7, 2026 18:43
@kekzl kekzl merged commit 2de77fb into main May 7, 2026
2 checks passed
@kekzl kekzl deleted the chore/safetensors-audit-phase-2 branch May 7, 2026 18:47
github-actions Bot pushed a commit that referenced this pull request May 7, 2026
* inventory: roadmap and backlog discovery 2026-05

Phase 0 of the SafeTensors+NVFP4 hardening run. Compiles every roadmap-like
artifact in the repo (docs/roadmap.md, docs/sm120-real-perf-plan.md, the
"truly unresolved" list at the bottom of docs/audit/safetensors_audit.md, plus
git log + open issues/PRs) into a single inventory and classifies each item
FEASIBLE / UNCERTAIN / INFEASIBLE / OBSOLETE under the Quality Gate.

Result: 0 FEASIBLE, 1 UNCERTAIN (native SentencePiece parser, deferred to a
dedicated session in favor of correctness-hardening work this run), 21
INFEASIBLE (multi-week kernel/architecture work or refuted dead-ends), 5
OBSOLETE (already shipped or shelved). Consistent with imp's mature state —
the listed items are by construction either dead-ends or large undertakings.

Per the mission's conditional model, this run focuses on Objective 1
(SafeTensors + NVFP4 hardening) only. Every deferred item is captured in
docs/audit/followups.md with a specific reason and pre-conditions to revisit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* audit: safetensors + nvfp4 loader audit 2026-05

Phase 1 of the SafeTensors+NVFP4 hardening run. Builds on top of the existing
docs/audit/safetensors_audit.md (Phase 1 + Phase 2 / PR #116) and identifies
ten residual hardening findings F1-F10 that survived Phase 2:

- F1 (P0): no reference numerical test against the compressed-tensors NVFP4
  spec. Roundtrip tests cannot catch a paired sign-flip / nibble-order /
  missing-factor bug between imp's quantizer and dequantizer.
- F2 (P1): Modelopt NVFP4 weight_scale_2 lacks the isfinite guard PR #113
  added for the llm-compressor path. NaN/Inf propagates layer-wide.
- F3 (P1): header-size validation has integer overflow at
  safetensors_loader.cpp:519-524. A header_size of UINT64_MAX-4 makes 8+header_size wrap to 3, bypassing the check.
- F4 (P1): tensor offsets not validated against shape*dtype. Three sub-bugs:
  no offset_start<=offset_end, no size match, no per-tensor start in-bounds.
- F5-F8 (P2): silent drops on malformed tensor entries, missing NVFP4 packed
  vs weight_scale shape check, missing header-size upper bound, missing
  weight_scale dtype enforcement (NVFP4/MXFP4 cross-misrouting risk).
- F9-F10 (P3): documented and skipped — spec-compliance and UX only.

Also catalogs items already deferred via the prior audit's "truly unresolved"
section (GLM, SentencePiece, AWQ kernel, DeepSeek MLA, multimodal, Tiktoken),
which now live in docs/audit/followups.md.

No code changes — this is the read-only audit commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* plan: master plan loader + roadmap 2026-05

Phase 2 of the run. Combines Phase 1 audit findings F1-F8 (Phase 0 yielded
no FEASIBLE roadmap items, so the plan is loader/NVFP4 hardening only).

Eight items, ordered P0 → P1 → P2. F9-F10 deferred (spec-compliance and UX
only). Two ADRs:
- 0001 — Pure-C++ reference harness for unit-level NVFP4 numerics
- 0002 — 128 MiB SafeTensors header-size soft cap

Each item is small, isolated, has a deterministic synthetic test fixture,
zero new dependencies, and a clear root-cause-vs-symptom delineation.
Test infrastructure reuses two new test files
(test_nvfp4_compressed_tensors_ref.cu, test_safetensors_loader.cpp)
across multiple items to keep churn low.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(nvfp4): reference numerical test for compressed-tensors dequant (F1)

Closes audit finding F1. Adds tests/test_nvfp4_compressed_tensors_ref.cu with
four GTest cases that build a synthetic compressed-tensors NVFP4 weight in
memory exactly per the on-disk spec (uint8 nibble-packed E2M1 + FP8 E4M3
weight_scale at group_size=16 + FP32 weight_scale_2) and assert that imp's
gemv_nvfp4_kpar produces the same Y = W·X as a pure-host reference dequant
following val = e2m1_to_f32(nibble) * fp8_e4m3_to_f32(scale) * weight_scale_2.

Existing roundtrip tests (test_nvfp4_quant_ref.cu, test_nvfp4_quant_hw.cu,
test_nvfp4_gemv_kpar_loop.cu) all run imp's quantizer through imp's
dequantizer; a paired sign-flip / nibble-order bug or a missing factor (e.g.
dropping weight_scale_2) would not be detectable. This test starts from the
spec format directly so any future regression that breaks spec compliance
fails CI.

Cases:
- BaselineUnityScales: mixed nibbles, FP8=1.0, tensor_scale=1.0 — catches
  alignment / nibble-decode bugs at unity factors
- TwoLevelScalingVaryingPerBlock: cycling per-block scales + non-trivial
  tensor_scale=0.125 — catches a missing or sign-flipped factor
- ZeroTensorScaleProducesZeroOutput: tensor_scale=0 must produce 0 output,
  not NaN/Inf, regardless of garbage in weight_packed (also pre-validates F2)
- NegativeWeightsSignPreserved: every nibble = -1.0 → output exactly -K,
  catches sign-bit drop in nibble decode

Tolerance is max-abs-diff < 1e-2 in FP16 output (FMA-order divergence between
sequential reference and imp's parallel-warp accumulator dominates; 1e-5 is
unrealistic for K=128 FP16 dot-product reductions).

ADR 0001 records the rationale for choosing pure-C++ reference over the
other two harness options (existing imp Python infra: not wired for unit
numerics; subprocess into user's HF venv: cross-process complexity + runtime
dep). Pure C++ is dependency-free, deterministic, exact for the formula,
and fits inside the existing GTest harness.

All 4 new tests pass; full test-quant binary remains 82/82 green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
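
For context, the reference formula the test asserts against fits in a few lines of host code. This is a self-contained sketch, not the test file itself: the E2M1 magnitude table and E4M3FN decode follow the standard formats, but the nibble order (low nibble = even column) is an assumption; pinning down exactly such conventions is what the spec-format test is for.

```cpp
// Host reference for: val = e2m1_to_f32(nibble) * fp8_e4m3_to_f32(scale) * weight_scale_2
#include <cmath>
#include <cstdint>
#include <limits>
#include <vector>

inline float e2m1_to_f32(uint8_t nibble) {                       // FP4 E2M1: sign + 3 magnitude bits
  static const float kMag[8] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};
  const float mag = kMag[nibble & 0x7];
  return (nibble & 0x8) ? -mag : mag;
}

inline float fp8_e4m3_to_f32(uint8_t v) {                        // FP8 E4M3FN, bias 7, no infinities
  const int exp = (v >> 3) & 0xF, man = v & 0x7;
  float mag;
  if (exp == 0)                      mag = std::ldexp(man / 8.0f, -6);            // subnormal
  else if (exp == 0xF && man == 0x7) mag = std::numeric_limits<float>::quiet_NaN();
  else                               mag = std::ldexp(1.0f + man / 8.0f, exp - 7);
  return (v & 0x80) ? -mag : mag;
}

// y[n] = sum_k W[n,k] * x[k]; W is nibble-packed per row, one FP8 scale per
// 16-element group, one FP32 tensor scale (weight_scale_2) for the whole weight.
std::vector<float> reference_gemv(const std::vector<uint8_t>& packed,  // N x K/2 bytes
                                  const std::vector<uint8_t>& scales,  // N x K/16 bytes
                                  float weight_scale_2,
                                  const std::vector<float>& x, int N, int K) {
  std::vector<float> y(N, 0.0f);
  for (int n = 0; n < N; ++n)
    for (int k = 0; k < K; ++k) {
      const uint8_t byte   = packed[n * (K / 2) + k / 2];
      const uint8_t nibble = (k % 2 == 0) ? uint8_t(byte & 0xF) : uint8_t(byte >> 4);
      const float   scale  = fp8_e4m3_to_f32(scales[n * (K / 16) + k / 16]);
      y[n] += e2m1_to_f32(nibble) * scale * weight_scale_2 * x[k];
    }
  return y;
}
```
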

* fix(nvfp4): defensive zeroing for non-finite Modelopt weight_scale_2 (F2)

Closes audit finding F2. Phase 2 / PR #113 added a zero/non-finite guard for
the llm-compressor NVFP4 promote path; the Modelopt path at the else branch
in executor_pre_dequant.cu was unguarded and would propagate NaN/±Inf
weight_scale_2 into the GEMM, contaminating the entire layer's hidden state
and downstream KV cache.

Refactor: extract the scale-promotion math into nvfp4_promote_weight_scale_2
in quant/nvfp4_quant.h/.cu — pure host function, testable without CUDA. Both
formats now share the defensive logic:
  - non-finite h_scale          → 0.0f (both formats)
  - llm-compressor h_scale=0    → 0.0f (avoids 1/0 = +Inf via reciprocal flip)
  - llm-compressor 1/h_scale non-finite → 0.0f (subnormal flip overflow)
  - Modelopt h_scale=0          → 0.0f (legitimate "null layer", flagged for diag)

executor_pre_dequant.cu's promote() lambda calls the helper; the existing
counter / WARN summary remains, with WARN messages distinguishing the
Modelopt-NaN/Inf case from llm-compressor reciprocal flip.

Tests: 9 new unit tests in NvFP4PromoteWeightScale2 covering NaN, +Inf,
-Inf, zero, denorm-flip, and finite cases for both formats. Test-quant
suite: 82/82 → 91/91 (no regression).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
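
A sketch of the shared guard, mirroring only the four rules listed above; the real helper lives in quant/nvfp4_quant.h/.cu and its signature may differ.

```cpp
// Sketch of the defensive promotion: any value that would poison the GEMM maps to 0.0f.
#include <cmath>

inline float nvfp4_promote_weight_scale_2_sketch(float h_scale, bool is_llm_compressor) {
  if (!std::isfinite(h_scale)) return 0.0f;        // NaN / +-Inf (both formats)
  if (h_scale == 0.0f)        return 0.0f;         // llm-compressor: avoid 1/0; Modelopt: "null layer"
  if (!is_llm_compressor)     return h_scale;      // Modelopt: use scale as-is
  const float flipped = 1.0f / h_scale;            // llm-compressor reciprocal flip
  return std::isfinite(flipped) ? flipped : 0.0f;  // subnormal flip overflow maps to 0
}
```
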

* fix(safetensors): overflow-safe header_size validation + 128 MiB cap (F3 + F7)

Closes audit findings F3 and F7 from docs/audit/safetensors_nvfp4_audit_2026-05.md.

The previous header_size validation at safetensors_loader.cpp:519-524 used
`8 + header_size > file_size`. With a malicious / corrupt
header_size = UINT64_MAX-4 the addition wraps to 3, which is NOT greater
than any file_size — the check silently bypasses. The loader then constructs
JsonParser(json_data, static_cast<size_t>(header_size)) and reads past the
mmap region → SIGSEGV.

Refactor the check into safetensors_internal::validate_header_size, exposed
in safetensors_loader.h for unit testing. Two rules:
- header_size > file_size - 8 (overflow-safe; file_size >= 8 is enforced upstream)
- header_size > kMaxHeaderBytes (128 MiB soft cap per ADR 0002)

The cap rejects pathological inputs that would force the JSON parser to scan
multi-GB regions. Real models have headers below 1 MiB; 128 MiB is far above
legitimate use.

Tests: 6 new unit cases covering truncated files, exact-minimum, typical
size, header > file, UINT64_MAX overflow attack (the F3 bug), and the
128 MiB soft cap boundary. test-core suite: 139/139 → 139/139 (1 skipped is
unchanged TensorKindCoverage.NoUnknownKindsInSmallQwen).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
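
The two rules in sketch form, with a simplified signature; the real helper is safetensors_internal::validate_header_size.

```cpp
// Sketch of the overflow-safe header_size checks.
#include <cstdint>

constexpr uint64_t kMaxHeaderBytes = 128ull * 1024 * 1024;   // 128 MiB soft cap (ADR 0002)

// Precondition (enforced upstream): file_size >= 8, so file_size - 8 cannot wrap.
inline bool validate_header_size(uint64_t header_size, uint64_t file_size) {
  if (header_size > file_size - 8)   return false;   // overflow-safe form of 8 + header_size > file_size
  if (header_size > kMaxHeaderBytes) return false;   // reject pathological JSON headers
  return true;
}
```
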

* fix(safetensors): per-tensor offset and size validation (F4)

Closes audit finding F4. The previous in-loop validation at
safetensors_loader.cpp:572-580 only checked
`tensor_data_offset + offset_end > file_size`, leaving three silent
correctness paths open:

1. offset_start > offset_end was not checked. A swapped data_offsets pair
   would compute a negative "size" interpreted as a huge unsigned value;
   downstream kernel reads would walk backwards into adjacent tensor data.
2. offset_end - offset_start was not compared against the byte count implied
   by shape × dtype. A file declaring an FP16 [1024,1024] tensor with
   offsets [0, 1024] would silently load 0.05% of the actual weight data,
   then yield uninitialized bytes for the rest.
3. tensor_data_offset + offset_start in-bounds was never asserted; only the
   end was. A start-past-EOF could be tolerated if the (truncated) end was
   in-bounds — possible with a corrupted-but-self-consistent file.

Add safetensors_internal::validate_tensor_offsets, a host-only helper
exposed via safetensors_loader.h for unit testing. Three rules in the
order the spec implies. Overflow-safe (subtractions only, with the upstream
header_size invariant guaranteeing tensor_data_offset <= file_size).

The per-tensor check is wired into load_shard's tensor enumeration loop:
when wire_dtype_bytes is known (mapping from SafeTensors wire string —
F32, F16, BF16, FP8_E4M3 etc.), the strict 3-rule validation runs;
otherwise the legacy lenient end-only check applies. The "unknown wire
type" path retains existing behavior because safetensors_dtype()'s WARN
already fires for those tensors at the actual emit step.

Tests: 7 new unit cases in SafeTensorsValidateTensorOffsets covering valid,
swap, OOB-end, byte-count mismatch, zero-size, invariant violation, and
exact-boundary cases. test-core 146/146 → 146/146, test-quant 91/91
unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
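
The three rules in sketch form, assuming the upstream invariant tensor_data_offset <= file_size so every comparison stays subtraction-only; the real helper is safetensors_internal::validate_tensor_offsets and may differ in signature.

```cpp
// Sketch of the per-tensor offset/size validation.
#include <cstdint>

inline bool validate_tensor_offsets(uint64_t offset_start, uint64_t offset_end,
                                    uint64_t expected_bytes,      // product(shape) * dtype size
                                    uint64_t tensor_data_offset,  // end of header
                                    uint64_t file_size) {
  const uint64_t data_region = file_size - tensor_data_offset;      // safe: invariant above
  if (offset_start > offset_end)                   return false;    // rule 1: swapped pair
  if (offset_end - offset_start != expected_bytes) return false;    // rule 2: size matches shape*dtype
  if (offset_start > data_region || offset_end > data_region)
                                                   return false;    // rule 3: start and end in bounds
  return true;
}
```
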

* fix(nvfp4): enforce FP8_E4M3 weight_scale dtype at promote (F8)

Closes audit finding F8. The compressed-tensors NVFP4 spec mandates
float8_e4m3fn for weight_scale. Until now imp's promote step accepted
whatever qtype the loader produced, which opened a NVFP4↔MXFP4
cross-misrouting silent-corruption path: a model misclassified as NVFP4
but shipping U8 (UE8M0) weight_scale bytes would silently load, then
gemv_nvfp4_kpar would interpret the UE8M0 bytes as E4M3 and produce
~2× wrong scales (powers of two interpreted as E4M3 normals).

Add nvfp4_validate_weight_scale_dtype(QType, *err) in nvfp4_quant.h/.cu —
pure host predicate. The promote() lambda calls it before applying the
two-level scaling formula; on rejection, the weight stays in its loaded
state and the dequant→cuBLAS fallback runs (slower, but correct).

A new end-of-load summary line surfaces the count of skipped weights so
the user can detect the cross-misroute case.

Tests: 5 new unit cases in NvFP4ValidateWeightScaleDtype covering the
accepted dtype, the MXFP4 INT8 case, FP8_E5M2 (activation-only), F16
(some pipelines emit this), and the QType::NONE sentinel. test-quant
91/91 → 96/96, no regression.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(nvfp4): validate weight_packed/weight_scale shape pair at promote (F6)

Closes audit finding F6. The NVFP4 GEMV kernel hard-codes group_size=16
(kMicroBlockSize=16 at nvfp4_gemm.cu:31) and reads K/16 micro-scales per row.
Until now, the loader never verified that weight_scale's shape matches that
contract; a checkpoint with group_size=8 or a transposed weight_scale would
silently load and silently produce wrong output (12.5% per-element step
quant noise on roughly half the elements, or scales aligned onto wrong rows).

Add nvfp4_validate_packed_scale_shapes(packed_outer, packed_inner,
scale_outer, scale_inner, *err). The promote() lambda calls it for 2D
weights — the per-expert MoE case has been split to 2D by weight_upload.cu
before promote runs. Mismatches WARN + skip promotion, which routes the
weight to the dequant→cuBLAS fallback (slower but at least correct).

Tests: 7 new unit cases in NvFP4ValidatePackedScaleShapes covering
typical Qwen3 / Gemma-4 expert shapes, transposed scale, group_size=8,
group_size=32, zero-inner-dim, and the tiny F1 baseline test fixture.
test-quant 96/96 → 103/103, no regression.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
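
The shape contract in sketch form: for a logical [N, K] weight at group_size 16, weight_packed is [N, K/2] (two E2M1 nibbles per byte) and weight_scale is [N, K/16]. The helper below only mirrors that relationship and is not imp's actual validator.

```cpp
// Sketch of the packed/scale shape pairing check at group_size = 16.
#include <cstdint>

inline bool validate_packed_scale_shapes(int64_t packed_outer, int64_t packed_inner,
                                         int64_t scale_outer,  int64_t scale_inner) {
  return packed_outer == scale_outer           // row counts must match
      && packed_inner == 8 * scale_inner;      // K/2 == 8 * (K/16)
}
```
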

* fix(safetensors): warn on dropped malformed tensor entries (F5)

Closes audit finding F5. Until now, six paths in load_shard silently dropped
tensor entries with no log line: missing/non-string 'dtype', missing/non-array
'shape', ndim > kMaxDims, missing/wrong-arity 'data_offsets', and the
post-F4 offset-validation rejections. Users with a corrupt checkpoint
would see tensor_map come back partially populated with zero diagnostic
output; downstream null-checks would make load look "successful" with
wrong outputs at inference.

Replace each silent `continue` with a counter-bumped IMP_LOG_WARN line
naming the tensor and the specific reason. Add an end-of-shard summary
line breaking out the counts per reason (no_dtype / no_shape /
too_many_dims / no_offsets / offset_validation), so users can scan a single
log line to see "did this checkpoint load cleanly".

Tests: 2 new unit cases in SafeTensorsMalformedEntryWarnings covering
(a) missing dtype + missing shape on synthetic blob, and (b) byte-count
mismatch from F4. Both verify via gtest's CaptureStderr that the WARN line
includes the tensor name and the rejection reason. test-core 146/146 →
148/148, no regression.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(audit): mark all P0/P1/P2 findings closed + DONE summary

Final phase. Marks every audit finding F1-F8 with its closing commit SHA,
records the final P3 deferrals (F9 spec-compliance, F10 UX), and writes
docs/audit/DONE.md with the full run summary:

- 1/1 P0 closed, 3/3 P1 closed, 4/4 P2 closed, 2/2 P3 deferred
- 0 FEASIBLE roadmap items (all 27 deferred per the conditional model with
  documented reason in followups.md)
- 40 new unit tests, full suite 769/769 pass-or-skip-only with 0 failures
- verify-fast: decode +5.47%, prefill +8.17% over baseline (noise + prior
  commits; this run's changes are loader-time only and do not touch the
  hot path)
- 2 ADRs (pure-C++ ref harness, 128 MiB header cap)
- No new third-party dependencies in any manifest
- CMAKE_CUDA_STANDARD=20 unchanged

progress.log force-added (project gitignore covers *.log; this is an
intentional audit artifact named in the spec).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
