fix(distill): accept APR magic bytes in load_safetensors_weights (PMAT-698f)#1809

Merged
noahgift merged 1 commit into main from fix/distill-apr-format-load-pmat-698f
May 19, 2026
Conversation

@noahgift
Contributor

Summary

Phase 3 dispatch on gx10 reaches the training loop after PMAT-698e (workspace cap) but then fails at:

Validation failed: cuda pipeline.execute failed: Serialization error:
invalid SafeTensors file /home/noah/runs/.../teacher/model.apr: header too large

pipeline.train() calls load_safetensors_weights on both the teacher and student paths. The cuda backend (PMAT-697 wiring) stages both files as APR v2 (via apr import) because CudaTrainerTeacher::for_inference reads APR metadata, not SafeTensors. The SafeTensors deserializer interprets the first 8 bytes of a file as a little-endian u64 header length, so it reads the APR v2 binary header as an implausibly large length and rejects the file.
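
To make the failure concrete, here is a minimal sketch of how an APR magic prefix decodes when read as a SafeTensors header length. The function names and the 100 MB cap are assumptions for illustration, not code from this PR; the four bytes after the magic are not specified here, so zeros stand in for them.

```rust
/// Assumed upper bound on the SafeTensors JSON header size.
/// The exact limit is an assumption for illustration, not taken from this PR.
const ASSUMED_MAX_HEADER_BYTES: u64 = 100_000_000;

/// Decode the first 8 bytes of a file the way a SafeTensors reader does:
/// as a little-endian u64 giving the length of the JSON header.
fn header_len_as_safetensors(prefix: [u8; 8]) -> u64 {
    u64::from_le_bytes(prefix)
}

/// True when the decoded length exceeds the assumed cap, which is the
/// condition that would be reported as "header too large".
fn looks_too_large(prefix: [u8; 8]) -> bool {
    header_len_as_safetensors(prefix) > ASSUMED_MAX_HEADER_BYTES
}
```

With the "APRN" magic, the low four bytes alone decode to 0x4E525041 (about 1.3 billion), so any 8-byte prefix starting with "APRN" exceeds the assumed cap regardless of what follows the magic.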

Fix

Detect APR v1/v2 magic bytes (APR\0 / APRN) in load_safetensors_weights and return empty maps. The cuda path's pipeline.train() uses providers (not these maps) for forward/backward; the empty maps are no-op placeholders for the post-train logit projection.
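
The short-circuit described above can be sketched roughly as follows. This is not the merged diff: `WeightMap`, `has_apr_magic`, `file_has_apr_magic`, and `load_weights_sketch` are hypothetical names, the sketch returns a single map where the real function returns several, and the SafeTensors parsing branch is deliberately left out.

```rust
use std::collections::HashMap;
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// Magic prefixes named in the PR description: "APR\0" and "APRN".
const APR_MAGICS: [[u8; 4]; 2] = [*b"APR\0", *b"APRN"];

/// Hypothetical stand-in for the loader's tensor-name -> values map.
type WeightMap = HashMap<String, Vec<f32>>;

/// True if the byte prefix begins with one of the APR magic values.
fn has_apr_magic(prefix: &[u8]) -> bool {
    APR_MAGICS.iter().any(|magic| prefix.starts_with(magic))
}

/// Read the first four bytes of a file and test them for APR magic.
/// Files shorter than four bytes are treated as "not APR".
fn file_has_apr_magic(path: &Path) -> io::Result<bool> {
    let mut prefix = [0u8; 4];
    let mut file = File::open(path)?;
    match file.read_exact(&mut prefix) {
        Ok(()) => Ok(has_apr_magic(&prefix)),
        Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => Ok(false),
        Err(e) => Err(e),
    }
}

/// Sketch of the loader's control flow: APR files yield an empty
/// placeholder map; everything else would go to the SafeTensors parser,
/// which this sketch omits and reports as an error instead.
fn load_weights_sketch(path: &Path) -> io::Result<WeightMap> {
    if file_has_apr_magic(path)? {
        return Ok(WeightMap::new());
    }
    Err(io::Error::new(
        io::ErrorKind::InvalidData,
        "SafeTensors parsing omitted in this sketch",
    ))
}
```

Checking the magic before handing the file to the SafeTensors parser keeps the existing error behavior for non-APR invalid data, which is what the test plan below relies on.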

Test plan

  • 58 distill lib tests pass (fixture path FALSIFY-APR-DISTILL-TRAIN-001/002 unchanged)
  • Existing test_load_safetensors_weights_invalid_data test passes (non-APR invalid data still errors)
  • Live gx10 dispatch (post-merge) reaches training loop steps (or surfaces NEXT defect)

Relationship to ongoing Phase 3 fixes

This is the THIRD live-dispatch-discovered defect in the cuda path:

  1. PMAT-700-B — skip PTX GEMM pre-warm when cuBLAS active (JIT cache pressure)
  2. PMAT-698e — cap max_position_embeddings at 2048 (workspace sizing)
  3. PMAT-698f — accept APR magic in weight loader (format compatibility)

The defects are independent of one another; each was surfaced only once the previous fix let the dispatch progress further.

🤖 Generated with Claude Code

…T-698f)

Phase 3 dispatch on gx10 reaches the training loop after PMAT-698e
(workspace cap) but then fails at:

  Validation failed: cuda pipeline.execute failed: Serialization error:
  invalid SafeTensors file /home/noah/runs/.../teacher/model.apr:
  header too large

Cause: pipeline.train() helper calls load_safetensors_weights on both
teacher and student paths as a fixture-path sanity check + to seed the
student_weights buffer used for the FALSIFY-APR-DISTILL-TRAIN-001
post-train logit projection assertion. The cuda backend (PMAT-697 wiring)
stages teacher/student as APR v2 files via `apr import` because
CudaTrainerTeacher::for_inference reads APR metadata, not SafeTensors.
SafeTensors's deserializer reads the APR v2 binary header bytes as a u64
header length and rejects the file.

Fix: detect APR v1/v2 magic bytes ("APR\0" / "APRN") and return empty
maps. The cuda path's pipeline.train() uses providers (not these maps)
for forward/backward; the empty maps become no-op placeholders for the
post-train projection. FALSIFY-APR-DISTILL-TRAIN-001 is a fixture-path
contract and remains intact for that path.

Test plan:
- [x] 58 distill lib tests pass (fixture path FALSIFY contracts unchanged)
- [x] APR magic-byte short-circuit verified via the existing weights tests
- [ ] Live gx10 dispatch (post-merge) reaches training loop and starts
      stepping (or surfaces the NEXT defect)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) May 19, 2026 10:01
@noahgift noahgift merged commit 913631e into main May 19, 2026
11 checks passed
@noahgift noahgift deleted the fix/distill-apr-format-load-pmat-698f branch May 19, 2026 10:22
noahgift added a commit that referenced this pull request May 20, 2026
2026-05-20 — real distillation 1.5B teacher → 0.5B student on
Blackwell GB10 with the full PMAT-698e..n + PMAT-700-B cascade active.

  initial_loss = 7.6746
  final_loss   = 7.2036   ← LESS THAN initial
  62 steps, 122.7s, no errors

F-DISTILL-SMOKE-001 ("final_loss < initial_loss") discharged.

Phase 3 of SPEC-DISTILL-001 is COMPLETE.

Evidence:
- evidence/distill-phase-3-real-kd/dispatch.json — dispatch manifest
- evidence/distill-phase-3-real-kd/launch-final-pass.txt — full training log

Run dir on gx10: /home/noah/runs/distill-smoke-20260520-070404/
Trained student checkpoint: student-trained.apr/model.safetensors

Cascade summary (all merged):
- #1804 PMAT-700-B  (cuBLAS prewarm skip)
- #1808 PMAT-698e   (workspace cap)
- #1809 PMAT-698f   (APR magic in weights loader)
- #1810 PMAT-698g   (non-LoRA backward pre-warm)
- #1813 PMAT-698h   (rms_norm_gamma_reduce pre-warm)
- #1815 PMAT-698i   (FWD-CACHE diagnostic logging)
- #1817 PMAT-698j   (THE root cause — warm! macro key)
- #1820 PMAT-698k   (cache-key alignment: rope fwd + rmsnorm eps)
- #1823 PMAT-698m   (smoke setup: non-degenerate batch)
- #1824             (post-mortem doc)
- #1827 PMAT-698n   (rmsnorm pre-warm at both 1e-6 + 1e-5 eps)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>