fix(distill): Phase 3 dispatch script CLI-flag alignment (PMAT-698b) by noahgift · Pull Request #1799 · paiml/aprender

noahgift · 2026-05-18T16:46:56Z

Summary

- The dispatch script (squash-landed via #1797, "feat(distill): apr distill --backend cuda real construction (SPEC-DISTILL-001 Phase 3-prep second half, PMAT-697)") used six aspirational apr distill flags that don't exist on the current CLI: --num-steps, --batch-size, --learning-rate, --student-init, --output-dir, --device.
- Realign to existing flags (--epochs, --student, --backend, --output) and drop the two without CLI surface (--batch-size and --learning-rate, deferred to PMAT-698c).
- HF repo IDs are resolved to local cache snapshot dirs via a shell function inside the SSH heredoc, which matches the directory expectation of CudaTrainerTeacher::for_inference.
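
For readers unfamiliar with the Hugging Face hub cache, the repo-ID resolution described in the summary could look roughly like the sketch below. It is a hypothetical reconstruction, not the function shipped in the dispatch script: the name `hf_repo_to_dir_sketch` and the "first snapshot wins" rule are assumptions, and the sketch only covers the standard hub layout (`hub/models--<org>--<name>/snapshots/<rev>/`).

```shell
set -eu

# Hypothetical reconstruction; the real helper lives in the dispatch script.
# Maps "org/name" to the first snapshot directory in the Hugging Face hub cache.
hf_repo_to_dir_sketch() {
  local repo="$1"
  local hub="${HF_HOME:-$HOME/.cache/huggingface}/hub"
  local base d
  # The hub cache names repo folders "models--<org>--<name>".
  base="$hub/models--$(printf '%s' "$repo" | sed 's|/|--|g')/snapshots"
  for d in "$base"/*/; do
    if [ -d "$d" ]; then
      printf '%s\n' "${d%/}"   # strip the trailing slash left by the glob
      return 0
    fi
  done
  return 1   # nothing cached for this repo
}

# Demo against a throwaway cache so the real one is never touched.
HF_HOME="$(mktemp -d)"
mkdir -p "$HF_HOME/hub/models--Qwen--Qwen2.5-Coder-1.5B-Instruct/snapshots/0123abc"
hf_repo_to_dir_sketch "Qwen/Qwen2.5-Coder-1.5B-Instruct"
rm -rf "$HF_HOME"
```

The function returns non-zero when no snapshot exists, so a caller can fail fast before any remote work starts.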

Unblocks task #124: Phase 3 real smoke dispatch on gx10.

Defect surfaced

PR #1795 shipped scripts/dispatch-distill-phase-3-gx10.sh with flags that PRs #1796 and #1797 never added. The smoke run cannot fire as shipped because apr distill rejects unknown flags. Detailed audit in evidence/distill-phase-3-readiness/findings.md.

Mapping

| Aspirational flag | Fixed to |
|---|---|
| `--teacher REPO` | positional `<TEACHER_DIR>` (resolved) |
| `--student-init REPO` | `--student <STUDENT_DIR>` (resolved) |
| `--num-steps 500` | `--epochs 17` (round-up of 500/31) |
| `--batch-size 4` | dropped; uses default 32 |
| `--learning-rate 1.5e-5` | dropped; uses default 1e-4 |
| `--output-dir DIR` | `--output DIR/student.apr` |
| `--device cuda` | `--backend cuda` |
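
As a quick check on the mapping above, the sketch below reproduces the `--epochs 17` arithmetic with integer ceiling division and assembles the realigned command line. The flag names come from the table; the paths are placeholders, the helper `ceil_div` is hypothetical, and the command is printed rather than executed.

```shell
set -eu

# Ceiling division on positive integers: ceil(a / b) == (a + b - 1) / b.
ceil_div() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

STEPS=500     # from the original --num-steps 500
DIVISOR=31    # the divisor quoted in the mapping table
EPOCHS="$(ceil_div "$STEPS" "$DIVISOR")"

# Placeholder paths; the real script resolves these on the remote host.
TEACHER_DIR="/path/to/teacher"
STUDENT_DIR="/path/to/student"
OUT_DIR="/path/to/run"

# Build the argument list in "$@" so each argument stays a single word.
set -- apr distill "$TEACHER_DIR" \
  --student "$STUDENT_DIR" \
  --epochs "$EPOCHS" \
  --backend cuda \
  --output "$OUT_DIR/student.apr"

# Print instead of executing, in the spirit of DRY_RUN=1.
echo "$*"
```

Keeping the step budget and divisor as variables makes it easy to recompute the epoch count if either value changes.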

Test plan

- DRY_RUN=1 bash scripts/dispatch-distill-phase-3-gx10.sh exits cleanly
- bashrs lint error count drops from 14 → 11 (script-local improvement; remaining 11 are pre-existing heredoc/string mis-parses on the SSH block)
- After merge, gx10 dispatch validates F-DISTILL-SMOKE-001 (val_loss at end < val_loss at start)
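
The F-DISTILL-SMOKE-001 condition (val_loss at end < val_loss at start) can be checked mechanically once a training log exists. The sketch below assumes a hypothetical log format in which lines carry a token such as `val_loss=1.2345`; the actual output format of apr distill is not shown in this PR, so the pattern would need adjusting.

```shell
set -eu

# Returns 0 when the last val_loss in the log is strictly below the first,
# 1 when it is not, and 2 when fewer than two measurements were found.
# Assumes tokens of the form "val_loss=<number>" (hypothetical log format).
val_loss_decreased() {
  awk '
    match($0, /val_loss=[0-9.eE+-]+/) {
      # Skip the 9 characters of "val_loss=" and coerce the rest to a number.
      v = substr($0, RSTART + 9, RLENGTH - 9) + 0
      if (!seen) { first = v; seen = 1 }
      last = v
      n++
    }
    END {
      if (n < 2) exit 2
      exit !(last < first)
    }
  ' "$1"
}

# Demo with a synthetic log.
log="$(mktemp)"
printf 'step=0 val_loss=2.50\nstep=250 val_loss=2.10\nstep=500 val_loss=1.75\n' > "$log"
if val_loss_decreased "$log"; then
  echo "F-DISTILL-SMOKE-001: pass"
else
  echo "F-DISTILL-SMOKE-001: fail"
fi
rm -f "$log"
```

Comparing only the first and last measurements mirrors the contract as stated; it deliberately ignores any noise in between.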

Deferred to PMAT-698c

Adding --max-steps, --batch-size, --learning-rate CLI flags so the dispatch can use user-preferred hyperparameters (batch=4, lr=1.5e-5) instead of defaults. Scope: ~150 LOC + falsifier in pipeline.rs to honor max_steps: Option<u32> cap.

🤖 Generated with Claude Code

The Phase 3 dispatch script (shipped in #1795, squash-landed via #1797) invoked apr distill with six flags that don't exist on the post-Phase-3-prep CLI: --num-steps, --batch-size, --learning-rate, --student-init, --output-dir, --device. apr distill rejects them, so the smoke run cannot fire as-shipped.

Realign to existing flags:

--teacher REPO → positional <TEACHER_DIR>
--student-init REPO → --student <STUDENT_DIR>
--num-steps 500 → --epochs 17 (round-up of 500/31 default-batch)
--batch-size, --learning-rate → dropped (CLI doesn't surface; PMAT-698c follow-up)
--output-dir DIR → --output DIR/student.apr
--device cuda → --backend cuda

HF repo IDs are resolved to local cache snapshot dirs via a shell function inside the SSH heredoc (the apr distill CudaTrainerTeacher::for_inference signature accepts a directory containing model.safetensors or model.apr, which matches the HF cache layout).

Evidence:
- evidence/distill-phase-3-readiness/findings.md documents the original aspirational-flag defect + resolution + deferred PMAT-698c scope.

Unblocks task #124 (Phase 3 real smoke dispatch on gx10).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…d) (#1800)

* fix(distill): Phase 3 gx10 dispatch — staging + path layout (PMAT-698d)

Three real defects surfaced by live #1799 dispatch attempt to gx10:

1. /mnt/nvme-raid0/runs/ is lambda-vector layout; it doesn't exist on gx10 (916GB root, no /mnt). Switch default to $HOME/runs, env-overridable via GX10_RUN_PREFIX.
2. The HF cache lookup (hf_repo_to_dir from #1799) targeted the wrong layout. `apr pull` uses pacha, which caches as:
     ~/.cache/pacha/models/<sha>.safetensors
     ~/.cache/pacha/models/<sha>.tokenizer.json
     ...
   not HF hub's snapshots/<sha>/ directory structure.
3. apr distill --backend cuda calls CudaTrainerTeacher::for_inference which expects a directory containing model.apr or model.safetensors. The pacha cache is a flat file. Need to symlink-stage into a dir.

Fixes:
- GX10_RUN_PREFIX env var (default $HOME/runs)
- New stage_repo() shell function inside SSH heredoc that:
  - captures `apr pull` Path: from stdout
  - mkdirs RUN_DIR_REMOTE/teacher and /student stage dirs
  - symlinks pacha-cached files as model.<ext>
  - symlinks companion tokenizer.json / config.json / tokenizer_config.json
  - for GGUF teachers, runs apr import --preserve-q4k to convert to APR
- Default TEACHER_REPO changed to Qwen/Qwen2.5-Coder-1.5B-Instruct (SafeTensors, loads directly into CudaTrainerTeacher). The original paiml/qwen2.5-coder-7b-apache-q4k-v1 (GGUF) needs the apr import conversion path, which works but is slow and disk-intensive on gx10 (58GB free). Defer real-MODEL-1 dispatch to PMAT-698e after smoke validates the pipeline.

Test plan:
- [x] DRY_RUN=1 STEPS=50 bash scripts/... exits cleanly
- [x] bashrs lint: 11 errors (pre-existing heredoc/string mis-parses, no regression)
- [ ] STEPS=50 dispatch on gx10 reaches the training loop (verified live via the PR description, not in CI since this is a script)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(distill): convert pulled models to APR v2 before dispatch (PMAT-698d cont.)

Second live-dispatch attempt revealed another defect: apr distill --backend cuda reads teacher_path AS A FILE via std::fs::read + AprV2Reader::from_bytes, THEN uses its parent directory for for_inference. Two expectations are tied together:
- the file at teacher_path must be a valid APR v2 binary (for metadata)
- the parent directory must contain a loadable checkpoint (model.apr or model.safetensors) for CudaTransformerTrainer

The previous staging symlinked .safetensors at stage_dir/model.safetensors which satisfied for_inference but failed the AprV2Reader read step (symlink target had wrong magic bytes).

Fix: always run `apr import` (with --preserve-q4k for GGUF) to produce a real APR v2 file at stage_dir/model.apr. The teacher_path passed to apr distill is that .apr file. Its parent dir is the stage dir, which now satisfies both expectations.

Also rename the dispatch output from student.apr to student-trained.apr to disambiguate from the staged input checkpoints.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(distill): stage config.json + companions next to source for apr import (PMAT-698d cont.)

Fourth defect surfaced live: apr import requires config.json next to the source file (default search path), but pacha caches them sha-prefixed at:
  ~/.cache/pacha/models/<sha>.config.json
  ~/.cache/pacha/models/<sha>.tokenizer.json

The previous staging copied them to stage_dir/ AFTER running apr import, so import couldn't find them and failed:
  error: Validation failed: Invalid model format: config.json not found at /home/noah/.cache/pacha/models/config.json

Fix: stage all companion files into stage_dir BEFORE apr import, and also symlink the source file itself (.safetensors or .gguf) into stage_dir as source.<ext>. apr import then finds everything in the same directory. Result is still stage_dir/model.apr, which is what apr distill consumes.

Reordering only; no semantic change to the directory layout consumed by apr distill.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(distill): 0.5B=0.5B for Phase 3 smoke to fit GB10 training memory (PMAT-698d cont.)

Fifth defect surfaced live: 1.5B Qwen teacher loaded fine for inference but hit CUDA_ERROR_OUT_OF_MEMORY at "Block 0 upload" during the for_inference GPU upload path. Blackwell's unified 128GB pool reports correctly but the training-time peak (weights + gradients + Adam optimizer state + per-block activations + workspace) overflows the actual VRAM budget for >1B models.

For a Phase 3 SMOKE (whose contract is just "val_loss decreases over N steps"), teacher and student don't have to be different. Using the 0.5B Qwen for both exercises every KD-loop branch (forward, kd_step, gradient, optimizer) at minimal memory. This lets us validate the engineering tower (PMAT-693 through PMAT-697 + 698b + 698d) end-to-end before scoping the GB10 memory budget for the real 7B-Q4K MODEL-1 teacher (deferred to PMAT-698e).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
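
The staging order that these follow-up commits converge on (companion files first, then the source weights, then conversion) can be sketched as below. This is an illustration under assumptions, not the shipped stage_repo(): the function name `stage_repo_sketch`, its argument order, and the `<sha>.tokenizer_config.json` cache name are hypothetical, and the `apr import` call is left as a comment because its exact arguments are not shown in the thread.

```shell
set -eu

# Stage a flat, sha-prefixed cache entry into a directory layout.
# Arguments: <cache_dir> <sha> <ext> <stage_dir>
# Prints the path where the converted model.apr is expected to land.
stage_repo_sketch() {
  local cache_dir="$1" sha="$2" ext="$3" stage_dir="$4" name
  mkdir -p "$stage_dir"

  # 1. Companion files first, so the importer finds config.json beside the source.
  for name in config.json tokenizer.json tokenizer_config.json; do
    if [ -f "$cache_dir/$sha.$name" ]; then
      ln -sf "$cache_dir/$sha.$name" "$stage_dir/$name"
    fi
  done

  # 2. The weights themselves, under a stable name next to the companions.
  ln -sf "$cache_dir/$sha.$ext" "$stage_dir/source.$ext"

  # 3. Conversion (the thread uses `apr import`) would run here and write
  #    $stage_dir/model.apr; it is omitted so the sketch runs anywhere.
  printf '%s\n' "$stage_dir/model.apr"
}

# Demo with a fake cache in a temp dir; the run prefix honors GX10_RUN_PREFIX.
demo_root="$(mktemp -d)"
mkdir -p "$demo_root/cache"
: > "$demo_root/cache/abc123.safetensors"
: > "$demo_root/cache/abc123.config.json"
: > "$demo_root/cache/abc123.tokenizer.json"

GX10_RUN_PREFIX="$demo_root/runs"
RUN_PREFIX="${GX10_RUN_PREFIX:-$HOME/runs}"
teacher_apr="$(stage_repo_sketch "$demo_root/cache" abc123 safetensors "$RUN_PREFIX/teacher")"
ls "$RUN_PREFIX/teacher"
echo "teacher path for apr distill: $teacher_apr"
rm -rf "$demo_root"
```

Symlinking rather than copying keeps the stage directory cheap on disk, which matters given the limited free space reported on gx10.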

noahgift enabled auto-merge (squash) May 18, 2026 16:47

noahgift merged commit d762012 into main May 18, 2026
11 checks passed

noahgift deleted the fix/distill-phase-3-dispatch-flags-pmat-698b branch May 18, 2026 17:12

noahgift mentioned this pull request May 18, 2026

fix(distill): Phase 3 gx10 dispatch — staging + path layout (PMAT-698d) #1800

Merged

3 tasks
