chore(distill): Phase 3 smoke-run dispatch + watch scripts for gx10 (SPEC-DISTILL-001) by noahgift · Pull Request #1795 · paiml/aprender

noahgift · 2026-05-18T13:00:45Z

Summary

Reproducible dispatch artifact for SPEC-DISTILL-001 Phase 3 — the 500-step E2E smoke run targeting paiml/albor-370m-v2. The user's preference is "use gx10 for compute only"; this PR ships the scripts to do that.

scripts/dispatch-distill-phase-3-gx10.sh

Five-stage idempotent dispatch:

1. Local preflight (env + args echo)
2. Remote git pull main + cargo build --features cuda
3. Remote apr pull of teacher + student-init (no-op if cached)
4. Background dispatch of apr distill with KD hyperparameters from SPEC-DISTILL-001 Phase 4 plan (T=4.0, α=0.3, LR=1.5e-5)
5. Captures evidence manifest (dispatch.json) locally

Overrides via env: GX10_HOST, GX10_USER, GX10_REPO_PATH, TEACHER_REPO, STUDENT_INIT, STEPS, BATCH_SIZE, LR, T, ALPHA, EVIDENCE_DIR, DRY_RUN.

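scripts/watch-distill-phase-3-gx10.sh

Tails the remote training log and filters for step counts, loss markers, panic/error lines, and F-DISTILL-* falsifier verdicts.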

Blackwell JIT constraint

gx10 is GB10 (sm_121). The Blackwell JIT pre-warming bug (PMAT-587 lineage) may crash the custom-PTX backward kernels on first invocation. The script tolerates this: forward path (CudaTrainerTeacher inference) is unaffected; backward may need the trueno 0.4.36 fix. The script logs whatever outcome occurs and the F-DISTILL-SMOKE-001 verdict is read from the launch log.

Fallback lane: lambda-vector (RTX 4090) via the same script with GX10_HOST=lambda-vector.

Prerequisites

PRs #1787, #1788, #1791, #1792, #1793 (Phase 1b → 2d) must be on main before the dispatch script's git pull step can build a working apr distill. All five are currently in flight with auto-merge armed.

Test plan

Dry-run mode works locally (DRY_RUN=1 ./scripts/dispatch-distill-phase-3-gx10.sh reports plan, exits before remote work)
Real dispatch after Phase 1b-2d land on main
F-DISTILL-SMOKE-001 verdict (val_loss step 500 < step 0) emitted in launch.log

🤖 Generated with Claude Code

…L-001 Phase 1b, PMAT-693)

Adds the real teacher backend the SPEC-DISTILL-001 Phase 1b ticket scopes: CudaTrainerTeacher wraps entrenar's CudaTransformerTrainer in inference-only mode, delegates logits_for_batch to forward_logits() per batch element, and returns shape [batch, vocab_size].

Gated behind a new `cuda` feature on aprender-train-distill that propagates to entrenar/cuda. Without the feature, only FixtureTeacher (Phase 1) is available, which is sufficient for unit tests but not for real training. Real distillation runs (Phase 4) require --features cuda.

Surface
=======

    #[cfg(feature = "cuda")]
    pub struct CudaTrainerTeacher { /* wraps CudaTransformerTrainer */ }

    impl CudaTrainerTeacher {
        pub fn for_inference(
            checkpoint_dir: impl AsRef<Path>,
            model_config: TransformerConfig,
        ) -> Result<Self> { ... }
    }

    impl TeacherLogitsProvider for CudaTrainerTeacher {
        fn vocab_size(&self) -> usize { ... }
        fn logits_for_batch(&mut self, input_ids: &[Vec<u32>]) -> Result<Vec<Vec<f32>>> { ... }
    }

Defensive checks
================

- forward_logits returning None → EntrenarError::Internal with a clear "likely missing weights or CUDA init failure" message
- logits.len() != vocab_size → EntrenarError::Internal flagging TransformerConfig vs checkpoint vocab drift (the common silent failure mode for loaded-from-disk distillation runs)

Tests
=====

All 6 teacher_provider tests pass under both --features (none) and --features cuda. Compile gates verified:

    cargo check -p aprender-train-distill                  # clean
    cargo check -p aprender-train-distill --features cuda  # clean

What's next
===========

Phase 2 (PMAT-694, follow-up): wire CudaTransformerTrainer's KD-loss backward into the student path. This replaces the remaining build_synthetic_logits call site for the student in pipeline.rs::train().

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…ent (SPEC-DISTILL-001 Phase 2)

Wires Phase 1's teacher provider into a per-batch KD orchestration step that produces both the combined α·CE + (1-α)·T²·KL scalar loss (for telemetry) and the KD-aware logit-space gradient (Phase 2b plug point).

What this PR adds
=================

New module `aprender-train-distill::kd_step`:

    pub fn kd_loss(
        student_logits: &[f32],
        teacher_logits: &[f32],
        label: usize,
        temperature: f32,
        alpha: f32,
    ) -> f32;

    pub fn kd_logit_gradient(
        student_logits: &[f32],
        teacher_logits: &[f32],
        label: usize,
        temperature: f32,
        alpha: f32,
    ) -> Vec<f32>;

    pub fn kd_step<F: FnMut(&[u32]) -> Vec<f32>>(
        teacher: &mut dyn TeacherLogitsProvider,
        input_ids: &[Vec<u32>],
        labels: &[usize],
        temperature: f32,
        alpha: f32,
        compute_student_logits: F,
    ) -> Result<(f32, Vec<Vec<f32>>)>;

The gradient is the Hinton et al. 2015 §2 derivation:

    ∂L/∂s = α · (softmax(s) - one_hot(label)) + (1-α) · T · (softmax(s/T) - softmax(t/T))

(T factor, not T²: one T factor is absorbed by the softmax derivative chain rule.)

Scope: Phase 2 vs Phase 2b
==========================

Phase 2 (this PR) ships the orchestration math, all in pure Rust on the CPU. The output `Vec<Vec<f32>>` of per-batch gradients is what Phase 2b will plumb into `CudaTransformerTrainer.forward_backward_kd_batch` as the backward-pass seed (replacing the CE-only gradient currently used by forward_backward_batch).

Splitting Phase 2 into 2a/2b lets us land the orchestration layer + its tests now, separate from extending the complex GPU trainer code path.

Falsifiers pinned
=================

3 KD-step falsifiers + 6 sanity tests, all passing:

- F-DISTILL-KDSTEP-001 (alpha=1 → pure CE)
- F-DISTILL-KDSTEP-002 (student==teacher → zero KL gradient under alpha=0)
- F-DISTILL-KDSTEP-003 (loss monotone in student-teacher divergence)
- softmax unit-sum + non-negative
- CE gradient correct sign at label vs non-label positions
- kd_step orchestration end-to-end
- kd_step empty-batch sanity
- kd_step vocab-mismatch error path
- kd_loss alpha=1 collapses to pure CE

All 50 aprender-train-distill lib tests pass (was 41; 9 new).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
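The loss and gradient formulas in the commit message above can be sketched in a few lines of CPU-only Rust. This is a minimal illustration under stated assumptions, not the crate's code: the names `softmax_t`, `kd_loss_sketch` and `kd_logit_gradient_sketch` are hypothetical, and the KL direction (teacher relative to student) is assumed because it is the one whose derivative matches the stated gradient.

```rust
/// Numerically stable softmax of `logits / temperature`.
fn softmax_t(logits: &[f32], temperature: f32) -> Vec<f32> {
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits
        .iter()
        .map(|&x| ((x - max) / temperature).exp())
        .collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

/// Hypothetical sketch of the combined loss:
/// alpha * CE(student, label) + (1 - alpha) * T^2 * KL(teacher_T || student_T).
pub fn kd_loss_sketch(
    student: &[f32],
    teacher: &[f32],
    label: usize,
    temperature: f32,
    alpha: f32,
) -> f32 {
    let p = softmax_t(student, 1.0);
    let ce = -p[label].max(1e-12).ln();
    let ps = softmax_t(student, temperature);
    let pt = softmax_t(teacher, temperature);
    let mut kl = 0.0f32;
    for (&t, &s) in pt.iter().zip(ps.iter()) {
        if t > 0.0 {
            kl += t * (t / s.max(1e-12)).ln();
        }
    }
    alpha * ce + (1.0 - alpha) * temperature * temperature * kl
}

/// Hypothetical sketch of the logit-space gradient quoted above:
/// alpha * (softmax(s) - one_hot) + (1 - alpha) * T * (softmax(s/T) - softmax(t/T)).
pub fn kd_logit_gradient_sketch(
    student: &[f32],
    teacher: &[f32],
    label: usize,
    temperature: f32,
    alpha: f32,
) -> Vec<f32> {
    let p = softmax_t(student, 1.0);
    let ps = softmax_t(student, temperature);
    let pt = softmax_t(teacher, temperature);
    (0..student.len())
        .map(|i| {
            let one_hot = if i == label { 1.0 } else { 0.0 };
            alpha * (p[i] - one_hot) + (1.0 - alpha) * temperature * (ps[i] - pt[i])
        })
        .collect()
}
```

A central finite difference of `kd_loss_sketch` with respect to each student logit should agree with `kd_logit_gradient_sketch`, which is a quick way to confirm the single factor of T in the KL term.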

…TILL-001 Phase 2b)

Mirrors Phase 1's TeacherLogitsProvider for the student side. The student has two methods: logits_for_batch (forward) and apply_kd_gradient (backward + optimizer step). FixtureStudent implements both for CPU-only unit testing; Phase 2c will add a CudaStudentProvider that wraps CudaTransformerTrainer.

What this PR adds
=================

    pub trait StudentLogitsProvider {
        fn vocab_size(&self) -> usize;
        fn logits_for_batch(&mut self, input_ids: &[Vec<u32>]) -> Result<Vec<Vec<f32>>>;
        fn apply_kd_gradient(&mut self, gradient: &[Vec<f32>]) -> Result<()>;
    }

    pub struct FixtureStudent {
        vocab_size: usize,
        logits: Vec<f32>, // current student parameters
        learning_rate: f32,
    }

FixtureStudent's apply_kd_gradient averages the gradient across batch elements (canonical SGD batch averaging) and subtracts the scaled gradient from its internal logits buffer. This isn't a real model: it's a logit-space optimization fixture that lets us validate that the KD pipeline's gradient direction is correct without needing CUDA.

Falsifiers pinned
=================

7 student_provider tests (2 falsifiers + 5 sanity tests), all passing:

- F-DISTILL-STUDENT-001: one KD step moves student logits toward the teacher's preferred token. Setup: uniform student, teacher prefers token 5, alpha=0 (pure KL signal). After one step, the student logit at index 5 must be strictly greater than before.
- F-DISTILL-STUDENT-002: 10 sequential KD steps strictly decrease per-step KD loss. With LR=0.5, loss after 10 steps < 90% of initial. Validates that the gradient direction is correct (descent, not ascent).

Plus 5 sanity tests covering vocab_size reporting, batch broadcast, shape validation, in-place logit update, and batch averaging math.

Architecture
============

Stacks on top of #1788 (kd_step). Pipeline integration that uses TeacherLogitsProvider + StudentLogitsProvider + kd_step is Phase 2c.

Phase 2c (PMAT-696, follow-up): CudaStudentProvider that wraps CudaTransformerTrainer for production runs. Once it lands, end-to-end GPU distillation is unblocked.

Tests
=====

All 57 aprender-train-distill lib tests pass (was 50; 7 new).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
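The batch-averaged update described for FixtureStudent can be illustrated with a small stand-alone sketch. Everything here is an assumption for illustration: the type `FixtureStudentSketch`, its zero-initialised (uniform) logits, the `String` error type, and the choice to treat an empty batch as a no-op are not taken from the crate.

```rust
/// Hypothetical logit-space student: its only "parameters" are one logit
/// vector, updated by plain SGD on the batch-averaged gradient.
pub struct FixtureStudentSketch {
    logits: Vec<f32>,
    learning_rate: f32,
}

impl FixtureStudentSketch {
    /// Starts from all-zero logits, i.e. a uniform distribution (an assumption).
    pub fn new(vocab_size: usize, learning_rate: f32) -> Self {
        Self {
            logits: vec![0.0; vocab_size],
            learning_rate,
        }
    }

    pub fn logits(&self) -> &[f32] {
        &self.logits
    }

    /// Averages the per-element gradients across the batch and takes one
    /// SGD step: logits -= learning_rate * mean_gradient.
    pub fn apply_kd_gradient(&mut self, gradient: &[Vec<f32>]) -> Result<(), String> {
        if gradient.is_empty() {
            return Ok(()); // assumed behaviour: an empty batch changes nothing
        }
        for row in gradient {
            if row.len() != self.logits.len() {
                return Err(format!(
                    "gradient width {} does not match vocab size {}",
                    row.len(),
                    self.logits.len()
                ));
            }
        }
        let n = gradient.len() as f32;
        for (i, logit) in self.logits.iter_mut().enumerate() {
            let mean = gradient.iter().map(|row| row[i]).sum::<f32>() / n;
            *logit -= self.learning_rate * mean;
        }
        Ok(())
    }
}
```

With a learning rate of 0.5 and a two-element batch whose gradients are `[1, 0, -1]` and `[3, 0, -1]`, the mean gradient is `[2, 0, -1]`, so the logits move from `[0, 0, 0]` to `[-1, 0, 0.5]`.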

…o-end (SPEC-DISTILL-001 Phase 2c)

Rewrites Pipeline::train() to use TeacherLogitsProvider + StudentLogitsProvider + kd_step end-to-end. Replaces the build_synthetic_logits stubs on both sides with real abstraction calls.

Pipeline gains a `student: Box<dyn StudentLogitsProvider + Send>` field and a `Pipeline::with_student()` builder mirroring `with_teacher()`. Default backends are FixtureTeacher + FixtureStudent so legacy tests behave identically. Phase 2d swaps in CudaStudentProvider.

Training loop, per step:

1. dummy_batch = [[0u32]; batch_size] (Phase 4 plugs in real tokens)
2. teacher.logits_for_batch(dummy_batch) → teacher logits
3. kd_step(teacher, dummy_batch, labels, T, α, student_logits_closure) → (scalar loss, per-batch logit gradients)
4. student.apply_kd_gradient(grads) → student updates

The loop is bracketed by initial-loss + final-loss measurements via the new kd_step_loss_for_pipeline helper.

Falsifiers
==========

F-DISTILL-PIPELINE-001 (new), end-to-end falsifier: runs Pipeline::execute() with FixtureTeacher + FixtureStudent + 3 epochs and asserts final_loss < initial_loss. Pins the entire data flow: teacher → student → kd_step → apply_kd_gradient. Any broken link either flatlines or increases the loss.

Phase 2d plug points
====================

- The dummy_batch in train() is the natural insertion point for a real dataset iterator (Phase 4 work).
- The student-logits closure in kd_step is the natural insertion point for CudaStudentProvider's forward_logits (Phase 2d).
- The apply_kd_gradient call is the natural insertion point for CudaStudentProvider's forward_backward_kd_batch path (Phase 2d).

Tests
=====

All 58 aprender-train-distill lib tests pass (was 57; 1 new F-DISTILL-PIPELINE-001 integration falsifier).

The four legacy helpers (build_synthetic_logits, kd_gradient, softmax_2d, write_logits_to_weights) are marked #[allow(dead_code)] for back-compat until Phase 2d's wiring fully replaces the on-disk weights round-trip.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
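The bracketed training loop and the final_loss < initial_loss assertion can be mimicked with fixture-style logits. The sketch below is a simplification under loud assumptions: it uses only the KL term (the alpha = 0 case), replaces the provider traits with plain slices, and the names `softmax_t`, `kl_loss` and `train_sketch` are hypothetical rather than the pipeline's API.

```rust
/// Numerically stable softmax of `logits / temperature`.
fn softmax_t(logits: &[f32], temperature: f32) -> Vec<f32> {
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits
        .iter()
        .map(|&x| ((x - max) / temperature).exp())
        .collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

/// T^2 * KL(teacher_T || student_T): the alpha = 0 part of the combined loss.
fn kl_loss(student: &[f32], teacher: &[f32], temperature: f32) -> f32 {
    let ps = softmax_t(student, temperature);
    let pt = softmax_t(teacher, temperature);
    let mut kl = 0.0f32;
    for (&t, &s) in pt.iter().zip(ps.iter()) {
        if t > 0.0 {
            kl += t * (t / s.max(1e-12)).ln();
        }
    }
    temperature * temperature * kl
}

/// Hypothetical loop: measure the loss, run `steps` updates, measure again.
/// Returns (initial_loss, final_loss).
pub fn train_sketch(
    teacher: &[f32],
    student: &mut [f32],
    steps: usize,
    batch_size: usize,
    temperature: f32,
    lr: f32,
) -> (f32, f32) {
    assert!(batch_size > 0, "batch_size must be positive");
    let initial = kl_loss(student, teacher, temperature);
    for _ in 0..steps {
        // Steps 1-3: every element of the dummy batch sees the same logits,
        // so each one yields the same gradient T * (softmax(s/T) - softmax(t/T)).
        let ps = softmax_t(student, temperature);
        let pt = softmax_t(teacher, temperature);
        let grads: Vec<Vec<f32>> = (0..batch_size)
            .map(|_| {
                ps.iter()
                    .zip(pt.iter())
                    .map(|(&s, &t)| temperature * (s - t))
                    .collect()
            })
            .collect();
        // Step 4: average across the batch and apply one SGD step.
        for (i, logit) in student.iter_mut().enumerate() {
            let mean = grads.iter().map(|g| g[i]).sum::<f32>() / batch_size as f32;
            *logit -= lr * mean;
        }
    }
    (initial, kl_loss(student, teacher, temperature))
}
```

If any link in this chain were reversed, for example adding the gradient instead of subtracting it, the returned final loss would exceed the initial one, which is the failure the end-to-end falsifier is designed to catch.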

…-DISTILL-001 Phase 2d)

Closes the last engineering gap before Phase 3 (500-step E2E smoke). Real GPU student backend wraps CudaTransformerTrainer and implements the Phase 2b StudentLogitsProvider trait.

Two complementary pieces
========================

1. New pub method on CudaTransformerTrainer (cuda_trainer.rs:1247-1311):

       pub fn forward_backward_with_grad(
           &mut self,
           input_ids: &[u32],
           logit_gradient: &[f32],
       ) -> Option<()>

   Runs forward, uploads the caller-supplied logit gradient into the last-position slice of logits_buf (replacing what gpu_forward wrote), runs gpu_backward (which back-props from the in-place gradient through the transformer stack), and runs embed_backward. Matches the KAIZEN-052 in-place gradient convention that fused_cross_entropy_cuda uses for the CE path, except the gradient now comes from `kd_step::kd_logit_gradient` (Phase 2a).

2. New CudaStudentProvider in aprender-train-distill (cuda-gated):

       pub struct CudaStudentProvider {
           trainer: CudaTransformerTrainer,
           vocab_size: usize,
           last_input_ids: Option<Vec<u32>>,
       }

   impl StudentLogitsProvider for CudaStudentProvider:

   - logits_for_batch → trainer.forward_logits per batch element + caches last_input_ids
   - apply_kd_gradient → trainer.forward_backward_with_grad on the cached last input_ids with the last gradient

Phase 2d limitation
===================

batch_size=1 only: the trait's apply_kd_gradient doesn't take input_ids, so the provider has to cache from the most-recent logits_for_batch call. With batches >1 only the last element gets a real gradient update. Phase 2e (PMAT-698, follow-up) generalizes via a fused-step trait method that takes input_ids + gradient together so all batch elements process correctly.

Tests
=====

All 58 aprender-train-distill lib tests pass under BOTH --features (none) and --features cuda. The CudaStudentProvider itself doesn't have a unit test: exercising it needs CUDA at test time, and the F-DISTILL-CUDA-STUDENT-001 falsifier (logits parity within 1e-6 vs a standalone forward_logits call) is integration-tested in Phase 4 production runs.

What's next
===========

With Phase 2d landed, Phase 3 (500-step E2E smoke run with real CudaTrainerTeacher + CudaStudentProvider) is unblocked. Phase 4 is the actual 50K-step distillation training run for albor-370m-v2.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
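The batch_size=1 limitation follows directly from the caching scheme, and a CPU-only mock makes it visible. The sketch below is an illustration under assumptions: `RecordingTrainer` and `CachingStudentSketch` are hypothetical stand-ins that only record which sequences would receive a backward pass; no GPU work, real logits, or crate types are involved.

```rust
/// Hypothetical stand-in for the GPU trainer: it only records which
/// sequences received a backward pass.
#[derive(Default)]
pub struct RecordingTrainer {
    pub updated: Vec<Vec<u32>>,
}

/// Hypothetical provider that, like the design described above, caches only
/// the most recent input sequence seen by `logits_for_batch`.
#[derive(Default)]
pub struct CachingStudentSketch {
    trainer: RecordingTrainer,
    last_input_ids: Option<Vec<u32>>,
}

impl CachingStudentSketch {
    /// Returns placeholder logits and overwrites the cache per element,
    /// so after the loop only the final element's ids remain.
    pub fn logits_for_batch(&mut self, input_ids: &[Vec<u32>]) -> Vec<Vec<f32>> {
        let mut out = Vec::with_capacity(input_ids.len());
        for ids in input_ids {
            self.last_input_ids = Some(ids.clone());
            out.push(vec![0.0; 4]); // placeholder logits, vocab size 4 assumed
        }
        out
    }

    /// Has no access to input_ids, so it can only pair the last gradient
    /// row with the cached (last) sequence.
    pub fn apply_kd_gradient(&mut self, gradient: &[Vec<f32>]) -> Result<(), String> {
        let ids = self
            .last_input_ids
            .clone()
            .ok_or_else(|| "no cached input_ids; call logits_for_batch first".to_string())?;
        let _last_gradient = gradient
            .last()
            .ok_or_else(|| "empty gradient".to_string())?;
        self.trainer.updated.push(ids);
        Ok(())
    }

    pub fn updated_inputs(&self) -> &[Vec<u32>] {
        &self.trainer.updated
    }
}
```

Feeding a three-element batch through `logits_for_batch` and then calling `apply_kd_gradient` records a single update, for the third sequence only. A fused method that receives input_ids and gradients together, as the follow-up proposes, would remove the need for the cache.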

SPEC-DISTILL-001 Phase 3 dispatch artifact. Reproducible 500-step smoke run targeting paiml/albor-370m-v2 (MODEL-2 distillation).

scripts/dispatch-distill-phase-3-gx10.sh
========================================

Idempotent dispatch:

1. Local preflight (env, hosts, args echo).
2. Remote git pull main + build with --features cuda.
3. Remote apr pull of teacher + student-init (no-op if cached).
4. Background dispatch of `apr distill` with KD hyperparameters from SPEC-DISTILL-001 Phase 4 plan (T=4.0, alpha=0.3, LR=1.5e-5).
5. Captures evidence manifest locally (dispatch.json).

Overridable via env: GX10_HOST, GX10_USER, GX10_REPO_PATH, TEACHER_REPO, STUDENT_INIT, STEPS (default 500), BATCH_SIZE, LR, T, ALPHA, EVIDENCE_DIR, DRY_RUN.

scripts/watch-distill-phase-3-gx10.sh
=====================================

Tails the remote training log + filters for step counts, loss markers, panic/error lines, and F-DISTILL-* falsifier verdicts.

Blackwell JIT constraint
========================

gx10 is sm_121 (GB10). Memory rule `Blackwell JIT pre-warming bug blocks training` (PMAT-587 lineage) applies: custom PTX kernels in gpu_backward may crash on JIT. The dispatch script tolerates this: forward path runs fine (CudaTrainerTeacher inference works); backward may or may not, depending on trueno 0.4.36 status. The script logs either outcome and the F-DISTILL-SMOKE-001 verdict is read from the launch log.

The fallback compute lane is lambda-vector (RTX 4090) where backward is proven via §82 P2-A. The script accepts a different GX10_HOST env to dispatch to either.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

noahgift · 2026-05-18T16:40:23Z

Subsumed by #1797 squash-merge (chain-PR leapfrog pattern per memory rule). All content landed on main at aee8716.

…1799)

The Phase 3 dispatch script (shipped in #1795, squash-landed via #1797) invoked apr distill with six flags that don't exist on the post-Phase-3-prep CLI: --num-steps, --batch-size, --learning-rate, --student-init, --output-dir, --device. apr distill rejects them, so the smoke run cannot fire as-shipped.

Realign to existing flags:

    --teacher REPO        → positional <TEACHER_DIR>
    --student-init REPO   → --student <STUDENT_DIR>
    --num-steps 500       → --epochs 17 (round-up of 500/31 default-batch)
    --batch-size, -lr     → dropped (CLI doesn't surface; PMAT-698c follow-up)
    --output-dir DIR      → --output DIR/student.apr
    --device cuda         → --backend cuda

HF repo IDs are resolved to local cache snapshot dirs via a shell function inside the SSH heredoc (the apr distill CudaTrainerTeacher::for_inference signature accepts a directory containing model.safetensors or model.apr, which matches the HF cache layout).

Evidence:

- evidence/distill-phase-3-readiness/findings.md documents the original aspirational-flag defect + resolution + deferred PMAT-698c scope.

Unblocks task #124 (Phase 3 real smoke dispatch on gx10).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
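The `--epochs 17` value is a ceiling division of the 500-step target by 31. A minimal sketch of that arithmetic follows; the helper name `epochs_for_steps` is hypothetical, and reading 31 as the number of steps one epoch covers under the default batch setting is an assumption drawn from the commit message's wording.

```rust
/// Hypothetical helper: smallest number of whole epochs that covers at
/// least `target_steps` steps when each epoch runs `steps_per_epoch` steps.
pub fn epochs_for_steps(target_steps: u64, steps_per_epoch: u64) -> u64 {
    assert!(steps_per_epoch > 0, "steps_per_epoch must be positive");
    // Integer ceiling division without floating point.
    (target_steps + steps_per_epoch - 1) / steps_per_epoch
}
```

Since 16 × 31 = 496 falls short of 500 and 17 × 31 = 527 exceeds it, 17 epochs is the smallest count that reaches the target, at the cost of running 27 steps more than the nominal 500.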

noahgift enabled auto-merge (squash) May 18, 2026 13:00

noahgift mentioned this pull request May 18, 2026

feat(distill): apr distill --backend flag + Phase 2c regression fix (SPEC-DISTILL-001 Phase 3-prep, PMAT-697) #1796

Closed

noahgift and others added 6 commits May 18, 2026 15:19

noahgift force-pushed the feat/distill-phase-3-dispatch-script branch from 942a9d3 to e596352 Compare May 18, 2026 13:19

noahgift mentioned this pull request May 18, 2026

feat(distill): apr distill --backend cuda real construction (SPEC-DISTILL-001 Phase 3-prep second half, PMAT-697) #1797

Merged

noahgift added 2 commits May 18, 2026 16:33

Merge branch 'main' into feat/distill-phase-3-dispatch-script

8e57ca3

Merge branch 'main' into feat/distill-phase-3-dispatch-script

670f112

noahgift closed this May 18, 2026

auto-merge was automatically disabled May 18, 2026 16:40
Pull request was closed

noahgift deleted the feat/distill-phase-3-dispatch-script branch May 18, 2026 16:40

noahgift mentioned this pull request May 18, 2026

fix(distill): Phase 3 dispatch script CLI-flag alignment (PMAT-698b) #1799

Merged

3 tasks

noahgift wants to merge 8 commits into main from feat/distill-phase-3-dispatch-script
