fix(distill): accept APR magic bytes in load_safetensors_weights (PMAT-698f) #1809
Merged
Phase 3 dispatch on gx10 reaches the training loop after PMAT-698e
(workspace cap) but then fails at:
Validation failed: cuda pipeline.execute failed: Serialization error:
invalid SafeTensors file /home/noah/runs/.../teacher/model.apr:
header too large
Cause: pipeline.train() helper calls load_safetensors_weights on both
teacher and student paths as a fixture-path sanity check + to seed the
student_weights buffer used for the FALSIFY-APR-DISTILL-TRAIN-001
post-train logit projection assertion. The cuda backend (PMAT-697 wiring)
stages teacher/student as APR v2 files via `apr import` because
CudaTrainerTeacher::for_inference reads APR metadata, not SafeTensors.
SafeTensors's deserializer reads the APR v2 binary header bytes as a u64
header length and rejects the file.
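Why this surfaces as "header too large": a SafeTensors file starts with an 8-byte little-endian u64 giving the length of its JSON header. The sketch below reads an APR-style prefix the same way. It is illustrative only: `header_len_as_safetensors` is a hypothetical helper, the four bytes after the magic are placeholders rather than the real APR v2 layout, and the 100 MB cap is an assumed limit, not a value taken from this PR.

```rust
/// Assumed upper bound on the SafeTensors JSON header size (illustrative).
const ASSUMED_MAX_HEADER_SIZE: u64 = 100_000_000;

/// Hypothetical helper: interpret the first 8 bytes of a file as a
/// little-endian u64, the way a SafeTensors reader treats its length prefix.
fn header_len_as_safetensors(bytes: &[u8]) -> Option<u64> {
    let mut prefix = [0u8; 8];
    // `get(..8)` returns None when fewer than 8 bytes are available.
    prefix.copy_from_slice(bytes.get(..8)?);
    Some(u64::from_le_bytes(prefix))
}

fn main() {
    // "APRN" magic followed by four placeholder bytes (not the real layout).
    let apr_like: &[u8] = b"APRN\0\0\0\0";
    let len = header_len_as_safetensors(apr_like).expect("at least 8 bytes");
    println!("APR magic read as a header length: {} bytes", len);
    // The ASCII magic alone already pushes the value past the assumed cap.
    assert!(len > ASSUMED_MAX_HEADER_SIZE);
}
```

Because the magic occupies the low four bytes of the little-endian value, any file starting with `APRN` yields a length above 1.3 GB no matter what follows, so the rejection does not depend on the rest of the APR header.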
Fix: detect APR v1/v2 magic bytes ("APR\0" / "APRN") and return empty
maps. The cuda path's pipeline.train() uses providers (not these maps)
for forward/backward; the empty maps become no-op placeholders for the
post-train projection. FALSIFY-APR-DISTILL-TRAIN-001 is a fixture-path
contract and remains intact for that path.
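A minimal sketch of the short-circuit described above, not the code merged in this PR. `has_apr_magic`, `load_weights_sketch`, and the `WeightMap` alias are hypothetical names; the real return type of load_safetensors_weights is not shown here, and the SafeTensors parser is passed in as a closure so the example stays self-contained.

```rust
use std::collections::HashMap;

/// Illustrative weight-map type; the real loader's type may differ.
type WeightMap = HashMap<String, Vec<f32>>;

/// The two magic prefixes named in this change.
const APR_MAGICS: [&[u8]; 2] = [b"APR\0", b"APRN"];

/// True when the buffer begins with either APR magic.
fn has_apr_magic(bytes: &[u8]) -> bool {
    APR_MAGICS.iter().any(|magic| bytes.starts_with(magic))
}

/// Hypothetical stand-in for the loader: APR input short-circuits to an
/// empty map, everything else goes to the supplied SafeTensors parser.
fn load_weights_sketch<F>(bytes: &[u8], parse_safetensors: F) -> Result<WeightMap, String>
where
    F: Fn(&[u8]) -> Result<WeightMap, String>,
{
    if has_apr_magic(bytes) {
        // Placeholder only: the cuda path uses providers, not this map.
        return Ok(WeightMap::new());
    }
    parse_safetensors(bytes)
}

fn main() {
    // Stub parser that rejects everything, standing in for the real one.
    let reject = |_: &[u8]| -> Result<WeightMap, String> {
        Err("invalid SafeTensors file".to_string())
    };
    let apr = load_weights_sketch(b"APRN\0\0\0\0", &reject);
    let junk = load_weights_sketch(b"not a model", &reject);
    println!("apr ok: {}, junk ok: {}", apr.is_ok(), junk.is_ok());
}
```

Checking the magic before parsing keeps the error path for non-APR garbage unchanged, which is what the fixture-path tests rely on.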
Test plan:
- [x] 58 distill lib tests pass (fixture path FALSIFY contracts unchanged)
- [x] APR magic-byte short-circuit verified via the existing weights tests
- [ ] Live gx10 dispatch (post-merge) reaches training loop and starts
stepping (or surfaces the NEXT defect)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This was referenced May 19, 2026
Merged
noahgift added a commit that referenced this pull request May 20, 2026
2026-05-20 — real distillation 1.5B teacher → 0.5B student on
Blackwell GB10 with the full PMAT-698e..n + PMAT-700-B cascade active.
initial_loss = 7.6746
final_loss = 7.2036 ← LESS THAN initial
62 steps, 122.7s, no errors
F-DISTILL-SMOKE-001 ("final_loss < initial_loss") discharged.
Phase 3 of SPEC-DISTILL-001 is COMPLETE.
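The smoke contract quoted above is a single inequality. A small sketch using the numbers from this run; `smoke_passes` and `relative_drop` are hypothetical helpers, not functions from the repository.

```rust
/// F-DISTILL-SMOKE-001 as a predicate: the loss must strictly decrease.
fn smoke_passes(initial_loss: f64, final_loss: f64) -> bool {
    final_loss < initial_loss
}

/// Fractional loss reduction relative to the initial loss.
fn relative_drop(initial_loss: f64, final_loss: f64) -> f64 {
    (initial_loss - final_loss) / initial_loss
}

fn main() {
    // Values reported for the 2026-05-20 run.
    let (initial_loss, final_loss) = (7.6746_f64, 7.2036_f64);
    let (steps, seconds) = (62.0_f64, 122.7_f64);

    assert!(smoke_passes(initial_loss, final_loss));
    println!(
        "loss drop: {:.4} ({:.2}% of initial), {:.2} s/step",
        initial_loss - final_loss,
        100.0 * relative_drop(initial_loss, final_loss),
        seconds / steps
    );
}
```

The check only establishes that training moves the loss in the right direction; it says nothing about convergence or student quality.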
Evidence:
- evidence/distill-phase-3-real-kd/dispatch.json — dispatch manifest
- evidence/distill-phase-3-real-kd/launch-final-pass.txt — full training log
Run dir on gx10: /home/noah/runs/distill-smoke-20260520-070404/
Trained student checkpoint: student-trained.apr/model.safetensors
Cascade summary (all merged):
- #1804 PMAT-700-B (cuBLAS prewarm skip)
- #1808 PMAT-698e (workspace cap)
- #1809 PMAT-698f (APR magic in weights loader)
- #1810 PMAT-698g (non-LoRA backward pre-warm)
- #1813 PMAT-698h (rms_norm_gamma_reduce pre-warm)
- #1815 PMAT-698i (FWD-CACHE diagnostic logging)
- #1817 PMAT-698j (THE root cause — warm! macro key)
- #1820 PMAT-698k (cache-key alignment: rope fwd + rmsnorm eps)
- #1823 PMAT-698m (smoke setup: non-degenerate batch)
- #1824 (post-mortem doc)
- #1827 PMAT-698n (rmsnorm pre-warm at both 1e-6 + 1e-5 eps)
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Phase 3 dispatch on gx10 reaches the training loop after PMAT-698e (workspace cap) but then fails with a SafeTensors serialization error (`header too large`) on the teacher's `model.apr`.
`pipeline.train()` calls `load_safetensors_weights` on both teacher and student paths. The cuda backend (PMAT-697 wiring) stages files as APR v2 (via `apr import`) because `CudaTrainerTeacher::for_inference` reads APR metadata, not SafeTensors. The SafeTensors deserializer reads the APR v2 binary header bytes as a u64 header length and rejects the file.
Fix
Detect APR v1/v2 magic bytes (`APR\0` / `APRN`) in `load_safetensors_weights` and return empty maps. The cuda path's `pipeline.train()` uses providers (not these maps) for forward/backward; the empty maps are no-op placeholders for the post-train logit projection.
Test plan
- `test_load_safetensors_weights_invalid_data` test passes (non-APR invalid data still errors)
Relationship to ongoing Phase 3 fixes
This is the THIRD live-dispatch-discovered defect in the cuda path. Each is independent, and each was surfaced by the previous fix unblocking further progress.
🤖 Generated with Claude Code