fix(distill): Phase 3 dispatch script CLI-flag alignment (PMAT-698b)#1799
Merged
The Phase 3 dispatch script (shipped in #1795, squash-landed via #1797) invoked apr distill with six flags that don't exist on the post-Phase-3-prep CLI: --num-steps, --batch-size, --learning-rate, --student-init, --output-dir, --device. apr distill rejects them, so the smoke run cannot fire as-shipped.

Realign to existing flags:
- --teacher REPO → positional <TEACHER_DIR>
- --student-init REPO → --student <STUDENT_DIR>
- --num-steps 500 → --epochs 17 (round-up of 500/31 default-batch)
- --batch-size, --learning-rate → dropped (the CLI doesn't surface them; PMAT-698c follow-up)
- --output-dir DIR → --output DIR/student.apr
- --device cuda → --backend cuda

HF repo IDs are resolved to local cache snapshot dirs via a shell function inside the SSH heredoc (the apr distill CudaTrainerTeacher::for_inference signature accepts a directory containing model.safetensors or model.apr, which matches the HF cache layout).

Evidence:
- evidence/distill-phase-3-readiness/findings.md documents the original aspirational-flag defect + resolution + deferred PMAT-698c scope.

Unblocks task #124 (Phase 3 real smoke dispatch on gx10).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
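The two mechanical pieces of this commit, the step-to-epoch round-up and the repo-ID lookup, can be sketched roughly as follows. This is a minimal bash sketch, not the script's actual code: `steps_to_epochs` is a hypothetical helper name, the 31 steps per epoch is taken from the "500/31" note above, and the body of `hf_repo_to_dir` assumes the standard Hugging Face hub cache layout (`models--<org>--<name>/snapshots/<rev>/`) and simply picks the most recently modified snapshot.

```bash
set -euo pipefail

# Ceiling division: the smallest epoch count whose total steps cover the target.
# (Hypothetical helper; 31 steps/epoch comes from the "500/31" note.)
steps_to_epochs() {
  local steps=$1 steps_per_epoch=$2
  echo $(( (steps + steps_per_epoch - 1) / steps_per_epoch ))
}

# Resolve an HF repo ID ("org/name") to a snapshot directory in the hub cache.
# Assumption: newest snapshot wins; the real function may choose differently.
hf_repo_to_dir() {
  local repo=$1
  local hub="${HF_HUB_CACHE:-${HF_HOME:-$HOME/.cache/huggingface}/hub}"
  local snapshots="$hub/models--${repo//\//--}/snapshots"
  local dir
  # List snapshot dirs newest-first; tolerate an empty or missing cache.
  dir=$(ls -1dt "$snapshots"/*/ 2>/dev/null | head -n 1) || true
  if [ -z "$dir" ]; then
    echo "no cached snapshot for $repo" >&2
    return 1
  fi
  printf '%s\n' "${dir%/}"
}

steps_to_epochs 500 31   # prints 17
```

The round-up overshoots slightly (17 × 31 = 527 steps rather than 500), which is the price of expressing a step budget through `--epochs` until a step cap exists on the CLI.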
3 tasks
noahgift added a commit that referenced this pull request May 18, 2026
…d) (#1800)

* fix(distill): Phase 3 gx10 dispatch - staging + path layout (PMAT-698d)

Three real defects surfaced by live #1799 dispatch attempt to gx10:

1. /mnt/nvme-raid0/runs/ is lambda-vector layout; it doesn't exist on gx10 (916GB root, no /mnt). Switch default to $HOME/runs, env-overridable via GX10_RUN_PREFIX.
2. The HF cache lookup (hf_repo_to_dir from #1799) targeted the wrong layout. `apr pull` uses pacha, which caches as:
   ~/.cache/pacha/models/<sha>.safetensors
   ~/.cache/pacha/models/<sha>.tokenizer.json
   ...
   not HF hub's snapshots/<sha>/ directory structure.
3. apr distill --backend cuda calls CudaTrainerTeacher::for_inference which expects a directory containing model.apr or model.safetensors. The pacha cache is a flat file. Need to symlink-stage into a dir.

Fixes:
- GX10_RUN_PREFIX env var (default $HOME/runs)
- New stage_repo() shell function inside SSH heredoc that:
  - captures `apr pull` Path: from stdout
  - mkdirs RUN_DIR_REMOTE/teacher and /student stage dirs
  - symlinks pacha-cached files as model.<ext>
  - symlinks companion tokenizer.json / config.json / tokenizer_config.json
  - for GGUF teachers, runs apr import --preserve-q4k to convert to APR
- Default TEACHER_REPO changed to Qwen/Qwen2.5-Coder-1.5B-Instruct (SafeTensors, loads directly into CudaTrainerTeacher). The original paiml/qwen2.5-coder-7b-apache-q4k-v1 (GGUF) needs the apr import conversion path, which works but is slow and disk-intensive on gx10 (58GB free). Defer real-MODEL-1 dispatch to PMAT-698e after smoke validates the pipeline.

Test plan:
- [x] DRY_RUN=1 STEPS=50 bash scripts/... exits cleanly
- [x] bashrs lint: 11 errors (pre-existing heredoc/string mis-parses, no regression)
- [ ] STEPS=50 dispatch on gx10 reaches the training loop (verified live via the PR description, not in CI since this is a script)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(distill): convert pulled models to APR v2 before dispatch (PMAT-698d cont.)

Second live-dispatch attempt revealed another defect: apr distill --backend cuda reads teacher_path AS A FILE via std::fs::read + AprV2Reader::from_bytes, THEN uses its parent directory for for_inference. Two expectations are tied together:
- the file at teacher_path must be a valid APR v2 binary (for metadata)
- the parent directory must contain a loadable checkpoint (model.apr or model.safetensors) for CudaTransformerTrainer

The previous staging symlinked .safetensors at stage_dir/model.safetensors which satisfied for_inference but failed the AprV2Reader read step (symlink target had wrong magic bytes).

Fix: always run `apr import` (with --preserve-q4k for GGUF) to produce a real APR v2 file at stage_dir/model.apr. The teacher_path passed to apr distill is that .apr file. Its parent dir is the stage dir, which now satisfies both expectations.

Also rename the dispatch output from student.apr to student-trained.apr to disambiguate from the staged input checkpoints.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(distill): stage config.json + companions next to source for apr import (PMAT-698d cont.)

Fourth defect surfaced live: apr import requires config.json next to the source file (default search path), but pacha caches them sha-prefixed at:
   ~/.cache/pacha/models/<sha>.config.json
   ~/.cache/pacha/models/<sha>.tokenizer.json

The previous staging copied them to stage_dir/ AFTER running apr import, so import couldn't find them and failed:
   error: Validation failed: Invalid model format: config.json not found at /home/noah/.cache/pacha/models/config.json

Fix: stage all companion files into stage_dir BEFORE apr import, and also symlink the source file itself (.safetensors or .gguf) into stage_dir as source.<ext>. apr import then finds everything in the same directory. Result is still stage_dir/model.apr, which is what apr distill consumes.

Reordering only; no semantic change to the directory layout consumed by apr distill.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(distill): 0.5B=0.5B for Phase 3 smoke to fit GB10 training memory (PMAT-698d cont.)

Fifth defect surfaced live: 1.5B Qwen teacher loaded fine for inference but hit CUDA_ERROR_OUT_OF_MEMORY at "Block 0 upload" during the for_inference GPU upload path. Blackwell's unified 128GB pool reports correctly but the training-time peak (weights + gradients + Adam optimizer state + per-block activations + workspace) overflows the actual VRAM budget for >1B models.

For a Phase 3 SMOKE (whose contract is just "val_loss decreases over N steps"), teacher and student don't have to be different. Using the 0.5B Qwen for both exercises every KD-loop branch (forward, kd_step, gradient, optimizer) at minimal memory. This lets us validate the engineering tower (PMAT-693 through PMAT-697 + 698b + 698d) end-to-end before scoping the GB10 memory budget for the real 7B-Q4K MODEL-1 teacher (deferred to PMAT-698e).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
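The staging order these commit messages converge on (companions first, then the source checkpoint, then import) can be illustrated with a small bash sketch. It is an assumption-laden illustration rather than the real stage_repo(): the function name `stage_companions_and_source` is hypothetical, it takes the cached file path as an argument instead of parsing `apr pull` output, and it stops before the `apr import` step because that command's exact arguments are not shown in the commit messages.

```bash
set -euo pipefail

# stage_companions_and_source <cached_model_file> <stage_dir>
# Symlinks pacha's sha-prefixed files into stage_dir under plain names so that
# a later import step finds config.json etc. next to its source file.
# (Hypothetical helper; the real stage_repo() may differ in detail.)
stage_companions_and_source() {
  local src=$1 stage_dir=$2
  local ext="${src##*.}"   # e.g. safetensors or gguf
  local base="${src%.*}"   # <cache_dir>/<sha>
  local name

  mkdir -p "$stage_dir"

  # 1. Companion files FIRST: <sha>.config.json -> stage_dir/config.json, etc.
  for name in config.json tokenizer.json tokenizer_config.json; do
    if [ -e "$base.$name" ]; then
      ln -sfn "$base.$name" "$stage_dir/$name"
    fi
  done

  # 2. The source checkpoint itself, exposed as source.<ext>.
  ln -sfn "$src" "$stage_dir/source.$ext"

  # 3. Not done here: the import step would now read stage_dir/source.<ext>
  #    and write stage_dir/model.apr, the file handed to apr distill.
  printf '%s\n' "$stage_dir/source.$ext"
}
```

Called as `stage_companions_and_source ~/.cache/pacha/models/<sha>.safetensors "$RUN_DIR/teacher"`, this leaves a directory in which the source file and its companions sit side by side, which is the precondition the config.json-not-found failure above was missing.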
Summary
- The Phase 3 dispatch script invoked `apr distill` with six flags that don't exist on the current CLI: `--num-steps`, `--batch-size`, `--learning-rate`, `--student-init`, `--output-dir`, `--device`
- Realign to existing flags (`--epochs`, `--student`, `--backend`, `--output`) and drop the two without CLI surface (deferred to PMAT-698c)
- Resolve HF repo IDs to local cache snapshot dirs to match `CudaTrainerTeacher::for_inference`'s directory expectation

Unblocks task #124: Phase 3 real smoke dispatch on gx10.

Defect surfaced
PR #1795 shipped `scripts/dispatch-distill-phase-3-gx10.sh` with flags PR #1796/#1797 didn't add. The smoke run cannot fire as-shipped: `apr distill` rejects unknown flags. Detailed audit in `evidence/distill-phase-3-readiness/findings.md`.

Mapping
- `--teacher REPO` → `<TEACHER_DIR>` (resolved)
- `--student-init REPO` → `--student <STUDENT_DIR>` (resolved)
- `--num-steps 500` → `--epochs 17` (round-up of 500/31)
- `--batch-size 4` → dropped (deferred to PMAT-698c)
- `--learning-rate 1.5e-5` → dropped (deferred to PMAT-698c)
- `--output-dir DIR` → `--output DIR/student.apr`
- `--device cuda` → `--backend cuda`

Test plan
- `DRY_RUN=1 bash scripts/dispatch-distill-phase-3-gx10.sh` exits cleanly
- `bashrs lint` error count drops from 14 → 11 (script-local improvement; remaining 11 are pre-existing heredoc/string mis-parses on the SSH block)

Deferred to PMAT-698c
Adding `--max-steps`, `--batch-size`, `--learning-rate` CLI flags so the dispatch can use user-preferred hyperparameters (batch=4, lr=1.5e-5) instead of defaults. Scope: ~150 LOC + falsifier in pipeline.rs to honor `max_steps: Option<u32>` cap.

🤖 Generated with Claude Code
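For readers who want to see how the realigned flags and the `DRY_RUN=1` check from the test plan fit together, here is a minimal bash sketch. The helper names `build_distill_cmd` and `run_or_print`, the `DISTILL_CMD` array, and the placement of the positional teacher argument directly after `apr distill` are assumptions for illustration; only the flags themselves come from the mapping in this PR.

```bash
set -euo pipefail

# build_distill_cmd <teacher_dir> <student_dir> <epochs> <run_dir>
# Fills the global array DISTILL_CMD using only the flags named in the mapping.
# (Hypothetical helper; argument order on the real CLI is an assumption.)
build_distill_cmd() {
  local teacher=$1 student=$2 epochs=$3 run_dir=$4
  DISTILL_CMD=(apr distill "$teacher"
    --student "$student"
    --epochs "$epochs"
    --output "$run_dir/student.apr"
    --backend cuda)
}

# Print the command when DRY_RUN=1, otherwise execute it.
run_or_print() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%q ' "$@"
    printf '\n'
  else
    "$@"
  fi
}

# Dry-run demo with placeholder paths: prints the command, runs nothing.
build_distill_cmd /stage/teacher /stage/student 17 /runs/demo
DRY_RUN=1 run_or_print "${DISTILL_CMD[@]}"
```

Keeping the command in an array rather than a string avoids re-quoting problems when paths contain spaces, and the same array can be printed in a dry run or executed unchanged.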