fix(distill): accept APR magic bytes in load_safetensors_weights (PMAT-698f) by noahgift · Pull Request #1809 · paiml/aprender

noahgift · 2026-05-19T10:01:10Z

Summary

Phase 3 dispatch on gx10 reaches the training loop after PMAT-698e (workspace cap) but then fails at:

Validation failed: cuda pipeline.execute failed: Serialization error:
invalid SafeTensors file /home/noah/runs/.../teacher/model.apr: header too large

pipeline.train() calls load_safetensors_weights on both teacher and student paths. The cuda backend (PMAT-697 wiring) stages files as APR v2 (via apr import) because CudaTrainerTeacher::for_inference reads APR metadata, not SafeTensors. SafeTensors's deserializer reads the APR v2 binary header bytes as a u64 header length and rejects the file.
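
The misread described above can be reproduced in isolation. A SafeTensors file begins with an 8-byte little-endian unsigned integer giving the length of its JSON header, so a file that starts with the ASCII bytes `APRN` yields a declared length of at least 1,314,017,345 bytes, whatever the next four bytes are. The sketch below is illustrative only: `header_len_from_prefix`, `header_too_large`, and the `ASSUMED_MAX_HEADER_SIZE` cap are hypothetical names and values, not aprender or SafeTensors API.

```rust
/// Upper bound on the JSON header size. ASSUMPTION: this value is chosen for
/// illustration; the real limit lives inside the SafeTensors parser.
const ASSUMED_MAX_HEADER_SIZE: u64 = 100_000_000;

/// Interpret the first 8 bytes of a file the way a SafeTensors reader does:
/// as a little-endian u64 giving the length of the JSON header.
/// Returns None when fewer than 8 bytes are available.
fn header_len_from_prefix(prefix: &[u8]) -> Option<u64> {
    let mut first8 = [0u8; 8];
    first8.copy_from_slice(prefix.get(..8)?);
    Some(u64::from_le_bytes(first8))
}

/// Hypothetical helper: true when the declared header length exceeds the
/// assumed cap, which is the condition behind a "header too large" error.
fn header_too_large(prefix: &[u8]) -> bool {
    matches!(header_len_from_prefix(prefix), Some(n) if n > ASSUMED_MAX_HEADER_SIZE)
}
```

Because the magic occupies the four low-order bytes of the integer, the `APRN` prefix alone is enough to push the declared length past any cap of this order.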

Fix

Detect APR v1/v2 magic bytes (APR\0 / APRN) in load_safetensors_weights and return empty maps. The cuda path's pipeline.train() uses providers (not these maps) for forward/backward; the empty maps are no-op placeholders for the post-train logit projection.
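
A minimal sketch of the short-circuit, assuming the loader has the file bytes in hand before parsing. The `WeightMap` alias, `has_apr_magic`, `load_weights_sketch`, and the `parse` callback are hypothetical stand-ins; the real `load_safetensors_weights` signature and return type are not shown in this PR description.

```rust
use std::collections::HashMap;

/// The two magic prefixes named in the PR description.
const APR_MAGICS: [&[u8]; 2] = [b"APR\0", b"APRN"];

/// ASSUMPTION: stand-in for the loader's map type (tensor name -> values).
type WeightMap = HashMap<String, Vec<f32>>;

/// True when the buffer starts with either APR magic prefix.
fn has_apr_magic(bytes: &[u8]) -> bool {
    APR_MAGICS.iter().any(|magic| bytes.starts_with(magic))
}

/// Sketch of the short-circuit: APR input yields an empty placeholder map,
/// anything else goes to the regular parser. `parse` is a placeholder for
/// the real SafeTensors parsing step.
fn load_weights_sketch<F>(bytes: &[u8], parse: F) -> Result<WeightMap, String>
where
    F: FnOnce(&[u8]) -> Result<WeightMap, String>,
{
    if has_apr_magic(bytes) {
        return Ok(WeightMap::new());
    }
    parse(bytes)
}
```

Input that is neither APR nor valid SafeTensors still reaches `parse` and fails there, which matches the test plan's requirement that non-APR invalid data keeps erroring.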

Test plan

58 distill lib tests pass (fixture path FALSIFY-APR-DISTILL-TRAIN-001/002 unchanged)
Existing test_load_safetensors_weights_invalid_data test passes (non-APR invalid data still errors)
Live gx10 dispatch (post-merge) reaches training loop steps (or surfaces NEXT defect)

Relationship to ongoing Phase 3 fixes

This is the THIRD live-dispatch-discovered defect in the cuda path:

PMAT-700-B — skip PTX GEMM pre-warm when cuBLAS active (JIT cache pressure)
PMAT-698e — cap max_position_embeddings at 2048 (workspace sizing)
PMAT-698f — accept APR magic in weight loader (format compatibility)

Each is independent; each was surfaced only after the previous fix unblocked further progress.

🤖 Generated with Claude Code

…T-698f)

Phase 3 dispatch on gx10 reaches the training loop after PMAT-698e (workspace cap) but then fails at:

Validation failed: cuda pipeline.execute failed: Serialization error:
invalid SafeTensors file /home/noah/runs/.../teacher/model.apr: header too large

Cause: pipeline.train() helper calls load_safetensors_weights on both teacher and student paths as a fixture-path sanity check + to seed the student_weights buffer used for the FALSIFY-APR-DISTILL-TRAIN-001 post-train logit projection assertion. The cuda backend (PMAT-697 wiring) stages teacher/student as APR v2 files via `apr import` because CudaTrainerTeacher::for_inference reads APR metadata, not SafeTensors. SafeTensors's deserializer reads the APR v2 binary header bytes as a u64 header length and rejects the file.

Fix: detect APR v1/v2 magic bytes ("APR\0" / "APRN") and return empty maps. The cuda path's pipeline.train() uses providers (not these maps) for forward/backward; the empty maps become no-op placeholders for the post-train projection. FALSIFY-APR-DISTILL-TRAIN-001 is a fixture-path contract and remains intact for that path.

Test plan:
- [x] 58 distill lib tests pass (fixture path FALSIFY contracts unchanged)
- [x] APR magic-byte short-circuit verified via the existing weights tests
- [ ] Live gx10 dispatch (post-merge) reaches training loop and starts stepping (or surfaces the NEXT defect)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-20: real distillation 1.5B teacher → 0.5B student on Blackwell GB10 with the full PMAT-698e..n + PMAT-700-B cascade active.

initial_loss = 7.6746
final_loss = 7.2036 ← LESS THAN initial
62 steps, 122.7s, no errors

F-DISTILL-SMOKE-001 ("final_loss < initial_loss") discharged. Phase 3 of SPEC-DISTILL-001 is COMPLETE.

Evidence:
- evidence/distill-phase-3-real-kd/dispatch.json: dispatch manifest
- evidence/distill-phase-3-real-kd/launch-final-pass.txt: full training log

Run dir on gx10: /home/noah/runs/distill-smoke-20260520-070404/
Trained student checkpoint: student-trained.apr/model.safetensors

Cascade summary (all merged):
- #1804 PMAT-700-B (cuBLAS prewarm skip)
- #1808 PMAT-698e (workspace cap)
- #1809 PMAT-698f (APR magic in weights loader)
- #1810 PMAT-698g (non-LoRA backward pre-warm)
- #1813 PMAT-698h (rms_norm_gamma_reduce pre-warm)
- #1815 PMAT-698i (FWD-CACHE diagnostic logging)
- #1817 PMAT-698j (THE root cause: warm! macro key)
- #1820 PMAT-698k (cache-key alignment: rope fwd + rmsnorm eps)
- #1823 PMAT-698m (smoke setup: non-degenerate batch)
- #1824 (post-mortem doc)
- #1827 PMAT-698n (rmsnorm pre-warm at both 1e-6 + 1e-5 eps)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

noahgift enabled auto-merge (squash) May 19, 2026 10:01

noahgift merged commit 913631e into main May 19, 2026
11 checks passed

noahgift deleted the fix/distill-apr-format-load-pmat-698f branch May 19, 2026 10:22

noahgift merged 1 commit into main from fix/distill-apr-format-load-pmat-698f
