fix(distill): Phase 3 dispatch script CLI-flag alignment (PMAT-698b)#1799

Merged: noahgift merged 1 commit into main from fix/distill-phase-3-dispatch-flags-pmat-698b on May 18, 2026
@noahgift

Summary

  • Dispatch script (squash-landed via #1797, "feat(distill): apr distill --backend cuda real construction (SPEC-DISTILL-001 Phase 3-prep second half, PMAT-697)") used six aspirational apr distill flags that don't exist on the current CLI: --num-steps, --batch-size, --learning-rate, --student-init, --output-dir, --device
  • Realign to existing flags (--epochs, --student, --backend, --output) and drop the two without CLI surface (deferred to PMAT-698c)
  • HF repo IDs resolved to local cache snapshot dirs via shell function inside the SSH heredoc — matches CudaTrainerTeacher::for_inference's directory expectation
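
The repo-ID resolution in the last bullet can be sketched as below. This is a minimal illustration, not the script's actual code: the function name `hf_repo_to_dir` is taken from the follow-up commit, but its body here is an assumption based on the standard Hugging Face hub cache layout (`models--<org>--<name>/snapshots/<rev>/`).

```shell
#!/usr/bin/env bash
# Sketch (assumed body): map a Hugging Face repo ID to its newest local
# snapshot directory in the hub cache.
hf_repo_to_dir() {
  local repo="$1"
  local hub="${HF_HUB_CACHE:-${HF_HOME:-$HOME/.cache/huggingface}/hub}"
  # The hub cache stores "org/name" as "models--org--name".
  local repo_dir="$hub/models--${repo//\//--}"
  local snap="" d
  # Keep the most recently modified snapshot directory.
  for d in "$repo_dir"/snapshots/*/; do
    [ -d "$d" ] || continue
    if [ -z "$snap" ] || [ "$d" -nt "$snap" ]; then
      snap="$d"
    fi
  done
  [ -n "$snap" ] || { echo "no cached snapshot for $repo" >&2; return 1; }
  printf '%s\n' "${snap%/}"   # strip the trailing slash left by the glob
}
```

A caller would capture the result, for example `TEACHER_DIR="$(hf_repo_to_dir "$TEACHER_REPO")"`, and abort the dispatch when the function returns non-zero.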

Unblocks task #124: Phase 3 real smoke dispatch on gx10.

Defect surfaced

PR #1795 shipped scripts/dispatch-distill-phase-3-gx10.sh with flags PR #1796/#1797 didn't add. The smoke run cannot fire as-shipped — apr distill rejects unknown flags. Detailed audit in evidence/distill-phase-3-readiness/findings.md.

Mapping

| Aspirational flag | Fixed to |
| --- | --- |
| `--teacher REPO` | positional `<TEACHER_DIR>` (resolved) |
| `--student-init REPO` | `--student <STUDENT_DIR>` (resolved) |
| `--num-steps 500` | `--epochs 17` (round-up of 500/31) |
| `--batch-size 4` | dropped (uses default 32) |
| `--learning-rate 1.5e-5` | dropped (uses default 1e-4) |
| `--output-dir DIR` | `--output DIR/student.apr` |
| `--device cuda` | `--backend cuda` |
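
To make the `--epochs 17` row concrete, here is a small sketch of the ceiling division and of the realigned command line. It assumes that 31 is the number of optimizer steps per epoch at the default batch size (the PR gives only the ratio 500/31). Both helper names are hypothetical. The command is printed rather than executed, and the argument order is a guess built from the mapping table.

```shell
#!/usr/bin/env bash
# Ceiling division: smallest epoch count that covers at least `steps` steps.
steps_to_epochs() {
  local steps="$1" per_epoch="$2"
  echo $(( (steps + per_epoch - 1) / per_epoch ))
}

# Assemble the realigned invocation from the mapping table (printed, not run).
build_distill_cmd() {
  local teacher_dir="$1" student_dir="$2" out_dir="$3" epochs="$4"
  printf 'apr distill %s --student %s --epochs %s --backend cuda --output %s/student.apr\n' \
    "$teacher_dir" "$student_dir" "$epochs" "$out_dir"
}

epochs="$(steps_to_epochs 500 31)"   # (500 + 30) / 31 = 17
build_distill_cmd /stage/teacher /stage/student /runs/demo "$epochs"
```

Under that assumption, 17 epochs yield 17 × 31 = 527 steps, slightly more than the 500 originally requested, which is the price of not having a step cap until PMAT-698c.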

Test plan

  • DRY_RUN=1 bash scripts/dispatch-distill-phase-3-gx10.sh exits cleanly
  • bashrs lint error count drops from 14 → 11 (script-local improvement; remaining 11 are pre-existing heredoc/string mis-parses on the SSH block)
  • After merge, gx10 dispatch validates F-DISTILL-SMOKE-001 (val_loss at end < val_loss at start)
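
The F-DISTILL-SMOKE-001 condition in the last bullet is a single floating-point comparison. A minimal sketch follows, assuming the two validation-loss values have already been extracted from the run output. The helper name `val_loss_decreased` is hypothetical, and how the losses are logged is not specified in this PR.

```shell
#!/usr/bin/env bash
# Return 0 when the final validation loss is strictly below the initial one.
val_loss_decreased() {
  local start="$1" end="$2"
  # awk performs the floating-point comparison that shell arithmetic lacks.
  awk -v s="$start" -v e="$end" 'BEGIN { exit !(e + 0 < s + 0) }'
}
```

Usage would look like `val_loss_decreased "$first_val_loss" "$last_val_loss" || echo "F-DISTILL-SMOKE-001 failed" >&2`, where both variables are placeholders for values parsed from the training log.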

Deferred to PMAT-698c

Adding --max-steps, --batch-size, --learning-rate CLI flags so the dispatch can use user-preferred hyperparameters (batch=4, lr=1.5e-5) instead of defaults. Scope: ~150 LOC + falsifier in pipeline.rs to honor max_steps: Option<u32> cap.

🤖 Generated with Claude Code

The Phase 3 dispatch script (shipped in #1795, squash-landed via #1797)
invoked apr distill with six flags that don't exist on the post-Phase-3-prep
CLI: --num-steps, --batch-size, --learning-rate, --student-init, --output-dir,
--device. apr distill rejects them, so the smoke run cannot fire as-shipped.

Realign to existing flags:
  --teacher REPO          → positional <TEACHER_DIR>
  --student-init REPO     → --student <STUDENT_DIR>
  --num-steps 500         → --epochs 17 (round-up of 500/31 default-batch)
  --batch-size 4          → dropped (CLI doesn't surface; PMAT-698c follow-up)
  --learning-rate 1.5e-5  → dropped (CLI doesn't surface; PMAT-698c follow-up)
  --output-dir DIR        → --output DIR/student.apr
  --device cuda           → --backend cuda

HF repo IDs are resolved to local cache snapshot dirs via a shell function
inside the SSH heredoc (the apr distill CudaTrainerTeacher::for_inference
signature accepts a directory containing model.safetensors or model.apr,
which matches the HF cache layout).

Evidence:
- evidence/distill-phase-3-readiness/findings.md documents the original
  aspirational-flag defect + resolution + deferred PMAT-698c scope.

Unblocks task #124 (Phase 3 real smoke dispatch on gx10).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift enabled auto-merge (squash) May 18, 2026 16:47
noahgift merged commit d762012 into main May 18, 2026
11 checks passed
noahgift deleted the fix/distill-phase-3-dispatch-flags-pmat-698b branch May 18, 2026 17:12
noahgift added a commit that referenced this pull request May 18, 2026
…d) (#1800)

* fix(distill): Phase 3 gx10 dispatch — staging + path layout (PMAT-698d)

Three real defects surfaced by live #1799 dispatch attempt to gx10:

1. /mnt/nvme-raid0/runs/ is lambda-vector layout — doesn't exist on gx10
   (916GB root, no /mnt). Switch default to $HOME/runs, env-overridable
   via GX10_RUN_PREFIX.

2. The HF cache lookup (hf_repo_to_dir from #1799) targeted the wrong
   layout. `apr pull` uses pacha, which caches as:
     ~/.cache/pacha/models/<sha>.safetensors
     ~/.cache/pacha/models/<sha>.tokenizer.json
     ...
   not HF hub's snapshots/<sha>/ directory structure.

3. apr distill --backend cuda calls CudaTrainerTeacher::for_inference
   which expects a directory containing model.apr or model.safetensors.
   The pacha cache is a flat file. Need to symlink-stage into a dir.

Fixes:
- GX10_RUN_PREFIX env var (default $HOME/runs)
- New stage_repo() shell function inside SSH heredoc that:
  - captures `apr pull` Path: from stdout
  - mkdirs RUN_DIR_REMOTE/teacher and /student stage dirs
  - symlinks pacha-cached files as model.<ext>
  - symlinks companion tokenizer.json / config.json / tokenizer_config.json
  - for GGUF teachers, runs apr import --preserve-q4k to convert to APR
- Default TEACHER_REPO changed to Qwen/Qwen2.5-Coder-1.5B-Instruct
  (SafeTensors, loads directly into CudaTrainerTeacher). The original
  paiml/qwen2.5-coder-7b-apache-q4k-v1 (GGUF) needs the apr import
  conversion path, which works but is slow and disk-intensive on gx10
  (58GB free). Defer real-MODEL-1 dispatch to PMAT-698e after smoke
  validates the pipeline.
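
The run-prefix default and the symlink staging described in the list above could look roughly like this. It is a sketch under assumptions: `run_prefix` and `stage_model_file` are hypothetical helper names, and the real `stage_repo()` also parses `apr pull` output and handles companion files, which is omitted here.

```shell
#!/usr/bin/env bash
# Run directory root: honors GX10_RUN_PREFIX, else falls back to $HOME/runs.
run_prefix() {
  printf '%s\n' "${GX10_RUN_PREFIX:-$HOME/runs}"
}

# Symlink a flat cached file (<sha>.<ext>) into stage_dir as model.<ext>
# and print the staged path.
stage_model_file() {
  local cached="$1" stage_dir="$2"
  local ext="${cached##*.}"          # text after the last dot
  mkdir -p "$stage_dir" || return 1
  ln -sfn "$cached" "$stage_dir/model.$ext"
  printf '%s\n' "$stage_dir/model.$ext"
}
```

For example, `stage_model_file "$cached_path" "$(run_prefix)/$RUN_ID/teacher"` would leave a `model.safetensors` link in the teacher stage directory. `$cached_path` and `$RUN_ID` are placeholders.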

Test plan:
- [x] DRY_RUN=1 STEPS=50 bash scripts/...  exits cleanly
- [x] bashrs lint: 11 errors (pre-existing heredoc/string mis-parses, no
      regression)
- [ ] STEPS=50 dispatch on gx10 reaches the training loop (verified
      live via the PR description, not in CI since this is a script)
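
A `DRY_RUN=1` mode like the one exercised in the test plan is commonly implemented with a small wrapper around each side-effecting command. The sketch below shows that general pattern. It is not taken from the dispatch script, and the helper name `run` is an assumption.

```shell
#!/usr/bin/env bash
# Execute a command, or only print it (shell-quoted) when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '[dry-run]'
    printf ' %q' "$@"
    printf '\n'
    return 0
  fi
  "$@"
}
```

With this wrapper, `DRY_RUN=1 run ssh gx10 true` prints the command instead of opening a connection, so the script can be checked end to end without touching the remote host.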

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(distill): convert pulled models to APR v2 before dispatch (PMAT-698d cont.)

Second live-dispatch attempt revealed another defect: apr distill --backend
cuda reads teacher_path AS A FILE via std::fs::read + AprV2Reader::from_bytes,
THEN uses its parent directory for for_inference. Two expectations are tied
together:

  - the file at teacher_path must be a valid APR v2 binary (for metadata)
  - the parent directory must contain a loadable checkpoint (model.apr
    or model.safetensors) for CudaTransformerTrainer

The previous staging symlinked .safetensors at stage_dir/model.safetensors
which satisfied for_inference but failed the AprV2Reader read step
(symlink target had wrong magic bytes).

Fix: always run `apr import` (with --preserve-q4k for GGUF) to produce a
real APR v2 file at stage_dir/model.apr. The teacher_path passed to apr
distill is that .apr file. Its parent dir is the stage dir, which now
satisfies both expectations.
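
The two coupled expectations can be checked before dispatch with a cheap preflight. The sketch below only verifies file presence. It does not validate APR v2 contents, which is left to `apr` itself, and the helper name `check_teacher_path` is hypothetical.

```shell
#!/usr/bin/env bash
# Preflight for the path handed to apr distill: it must be a readable file,
# and its parent directory must hold a loadable checkpoint.
check_teacher_path() {
  local teacher_path="$1"
  local parent
  parent="$(dirname "$teacher_path")"
  # Expectation 1 (presence only): teacher_path is a readable regular file.
  if ! { [ -f "$teacher_path" ] && [ -r "$teacher_path" ]; }; then
    echo "not a readable file: $teacher_path" >&2
    return 1
  fi
  # Expectation 2: the parent directory contains model.apr or model.safetensors.
  if ! { [ -f "$parent/model.apr" ] || [ -f "$parent/model.safetensors" ]; }; then
    echo "no model.apr or model.safetensors in $parent" >&2
    return 1
  fi
}
```

Passing `stage_dir/model.apr` satisfies both checks at once, which mirrors the fix described above.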

Also rename the dispatch output from student.apr to student-trained.apr
to disambiguate from the staged input checkpoints.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(distill): stage config.json + companions next to source for apr import (PMAT-698d cont.)

Fourth defect surfaced live: apr import requires config.json next to the
source file (default search path), but pacha caches them sha-prefixed at:

  ~/.cache/pacha/models/<sha>.config.json
  ~/.cache/pacha/models/<sha>.tokenizer.json

The previous staging copied them to stage_dir/ AFTER running apr import,
so import couldn't find them and failed:

  error: Validation failed: Invalid model format: config.json not found
  at /home/noah/.cache/pacha/models/config.json

Fix: stage all companion files into stage_dir BEFORE apr import, and
also symlink the source file itself (.safetensors or .gguf) into stage_dir
as source.<ext>. apr import then finds everything in the same directory.
Result is still stage_dir/model.apr — that's what apr distill consumes.

Reordering only; no semantic change to the directory layout consumed
by apr distill.
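
The reordered staging step amounts to linking every sha-prefixed cache file into the stage directory under its unprefixed name before the import runs. A minimal sketch follows, assuming the flat `<sha>.<suffix>` naming shown above. The function name `stage_for_import` is hypothetical, and the `apr import` call itself is deliberately left out.

```shell
#!/usr/bin/env bash
# Link all cache files sharing a sha prefix into stage_dir, for example:
#   <sha>.config.json  -> stage_dir/config.json
#   <sha>.safetensors  -> stage_dir/source.safetensors
stage_for_import() {
  local cache_dir="$1" sha="$2" stage_dir="$3"
  local f name
  mkdir -p "$stage_dir" || return 1
  for f in "$cache_dir/$sha".*; do
    [ -e "$f" ] || continue          # glob matched nothing
    name="${f##*/}"                  # drop the directory part
    name="${name#"$sha".}"           # drop the "<sha>." prefix
    case "$name" in
      safetensors|gguf) name="source.$name" ;;   # the weights file itself
    esac
    ln -sfn "$f" "$stage_dir/$name"
  done
}
```

After this runs, the import step can be pointed at `stage_dir/source.<ext>` and will find `config.json` and the tokenizer files alongside it.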

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(distill): 0.5B=0.5B for Phase 3 smoke to fit GB10 training memory (PMAT-698d cont.)

Fifth defect surfaced live: 1.5B Qwen teacher loaded fine for inference but
hit CUDA_ERROR_OUT_OF_MEMORY at "Block 0 upload" during the for_inference
GPU upload path. Blackwell's unified 128GB pool reports correctly but the
training-time peak (weights + gradients + Adam optimizer state + per-block
activations + workspace) overflows the actual VRAM budget for >1B models.

For a Phase 3 SMOKE (whose contract is just "val_loss decreases over N
steps"), teacher and student don't have to be different. Using the 0.5B
Qwen for both exercises every KD-loop branch (forward, kd_step, gradient,
optimizer) at minimal memory.

This lets us validate the engineering tower (PMAT-693 through PMAT-697 +
698b + 698d) end-to-end before scoping the GB10 memory budget for the
real 7B-Q4K MODEL-1 teacher (deferred to PMAT-698e).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>