chore(distill): Phase 3 smoke-run dispatch + watch scripts for gx10 (SPEC-DISTILL-001) #1795

Closed
noahgift wants to merge 8 commits into main from feat/distill-phase-3-dispatch-script

@noahgift
Contributor

Summary

Reproducible dispatch artifact for SPEC-DISTILL-001 Phase 3 — the 500-step E2E smoke run targeting paiml/albor-370m-v2. The user's preference is "use gx10 for compute only"; this PR ships the scripts to do that.

scripts/dispatch-distill-phase-3-gx10.sh

Five-stage idempotent dispatch:

  1. Local preflight (env + args echo)
  2. Remote git pull main + cargo build --features cuda
  3. Remote apr pull of teacher + student-init (no-op if cached)
  4. Background dispatch of apr distill with KD hyperparameters from SPEC-DISTILL-001 Phase 4 plan (T=4.0, α=0.3, LR=1.5e-5)
  5. Captures evidence manifest (dispatch.json) locally

Overrides via env: GX10_HOST, GX10_USER, GX10_REPO_PATH, TEACHER_REPO, STUDENT_INIT, STEPS, BATCH_SIZE, LR, T, ALPHA, EVIDENCE_DIR, DRY_RUN.

scripts/watch-distill-phase-3-gx10.sh

Tails the remote log with line-buffered grep on ^step|initial_loss|final_loss|panic|error|F-DISTILL. The F-DISTILL-SMOKE-001 verdict appears here once the training loop emits its final metrics block.
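One detail of that pattern is easy to misread: in grep alternation the `^` anchor binds only to `step`, while the other alternatives match anywhere in the line. A minimal Rust sketch of the same predicate, assuming plain substring matching is all the watch script needs (the names `MARKERS` and `is_interesting` are hypothetical, not taken from the script):

```rust
/// Markers that may appear anywhere in a log line.
const MARKERS: [&str; 5] = ["initial_loss", "final_loss", "panic", "error", "F-DISTILL"];

/// Mirrors `^step|initial_loss|final_loss|panic|error|F-DISTILL`:
/// "step" must start the line, every other marker is a substring match.
fn is_interesting(line: &str) -> bool {
    line.starts_with("step") || MARKERS.iter().any(|m| line.contains(*m))
}
```

Under this reading an indented `  step 10` line is filtered out unless it also contains one of the other markers.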

Blackwell JIT constraint

gx10 is GB10 (sm_121). The Blackwell JIT pre-warming bug (PMAT-587 lineage) may crash the custom-PTX backward kernels on first invocation. The script tolerates this: forward path (CudaTrainerTeacher inference) is unaffected; backward may need the trueno 0.4.36 fix. The script logs whatever outcome occurs and the F-DISTILL-SMOKE-001 verdict is read from the launch log.

Fallback lane: lambda-vector (RTX 4090) via the same script with GX10_HOST=lambda-vector.

Prerequisites

PRs #1787, #1788, #1791, #1792, #1793 (Phase 1b → 2d) must be on main before the dispatch script's git pull step can build a working apr distill. All five are currently in flight with auto-merge armed.

Test plan

  • Dry-run mode works locally (DRY_RUN=1 ./scripts/dispatch-distill-phase-3-gx10.sh reports plan, exits before remote work)
  • Real dispatch after Phase 1b-2d land on main
  • F-DISTILL-SMOKE-001 verdict (val_loss at step 500 < val_loss at step 0) emitted in launch.log

🤖 Generated with Claude Code

noahgift and others added 6 commits May 18, 2026 15:19
…L-001 Phase 1b, PMAT-693)

Adds the real teacher backend the SPEC-DISTILL-001 Phase 1b ticket
scopes: CudaTrainerTeacher wraps entrenar's CudaTransformerTrainer in
inference-only mode, delegates logits_for_batch to forward_logits()
per batch element, returns shape [batch, vocab_size].

Gated behind a new `cuda` feature on aprender-train-distill that
propagates to entrenar/cuda. Without the feature, only FixtureTeacher
(Phase 1) is available — sufficient for unit tests but not for real
training. Real distillation runs (Phase 4) require --features cuda.

Surface
=======

  #[cfg(feature = "cuda")]
  pub struct CudaTrainerTeacher { /* wraps CudaTransformerTrainer */ }

  impl CudaTrainerTeacher {
      pub fn for_inference(
          checkpoint_dir: impl AsRef<Path>,
          model_config: TransformerConfig,
      ) -> Result<Self> { ... }
  }

  impl TeacherLogitsProvider for CudaTrainerTeacher {
      fn vocab_size(&self) -> usize { ... }
      fn logits_for_batch(&mut self, input_ids: &[Vec<u32>])
          -> Result<Vec<Vec<f32>>> { ... }
  }

Defensive checks
================

- forward_logits returning None → EntrenarError::Internal with a clear
  "likely missing weights or CUDA init failure" message
- logits.len() != vocab_size → EntrenarError::Internal flagging
  TransformerConfig vs checkpoint vocab drift (the common silent failure
  mode for loaded-from-disk distillation runs)
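
The two checks above can be sketched in isolation. This is a minimal illustration, not the code in the PR: `InternalError` is a hypothetical stand-in for `EntrenarError::Internal`, and `check_teacher_logits` is an invented helper name.

```rust
/// Hypothetical stand-in for EntrenarError::Internal.
#[derive(Debug, PartialEq)]
struct InternalError(String);

/// Validates the result of one forward pass before it is returned to the caller.
fn check_teacher_logits(
    logits: Option<Vec<f32>>,
    vocab_size: usize,
) -> Result<Vec<f32>, InternalError> {
    // Check 1: the forward pass produced nothing at all.
    let logits = logits.ok_or_else(|| {
        InternalError(
            "forward_logits returned None: likely missing weights or CUDA init failure"
                .to_string(),
        )
    })?;
    // Check 2: config and checkpoint disagree on the vocabulary size.
    if logits.len() != vocab_size {
        return Err(InternalError(format!(
            "logits.len() = {} but vocab_size = {}: TransformerConfig vs checkpoint vocab drift",
            logits.len(),
            vocab_size
        )));
    }
    Ok(logits)
}
```

Failing here, rather than passing a wrongly sized row downstream, keeps the vocab drift from surfacing later as a confusing shape error in the KD step.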

Tests
=====

All 6 teacher_provider tests pass under both --features (none) and
--features cuda. Compile gates verified:
  cargo check -p aprender-train-distill                  # clean
  cargo check -p aprender-train-distill --features cuda  # clean

What's next
===========

Phase 2 (PMAT-694, follow-up): wire CudaTransformerTrainer's KD-loss
backward into the student path — replaces the remaining
build_synthetic_logits call site for the student in pipeline.rs::train().

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ent (SPEC-DISTILL-001 Phase 2)

Wires Phase 1's teacher provider into a per-batch KD orchestration step
that produces both the combined α·CE + (1-α)·T²·KL scalar loss (for
telemetry) and the KD-aware logit-space gradient (Phase 2b plug point).

What this PR adds
=================

New module `aprender-train-distill::kd_step`:

  pub fn kd_loss(
      student_logits: &[f32],
      teacher_logits: &[f32],
      label: usize,
      temperature: f32,
      alpha: f32,
  ) -> f32;

  pub fn kd_logit_gradient(
      student_logits: &[f32],
      teacher_logits: &[f32],
      label: usize,
      temperature: f32,
      alpha: f32,
  ) -> Vec<f32>;

  pub fn kd_step<F: FnMut(&[u32]) -> Vec<f32>>(
      teacher: &mut dyn TeacherLogitsProvider,
      input_ids: &[Vec<u32>],
      labels: &[usize],
      temperature: f32,
      alpha: f32,
      compute_student_logits: F,
  ) -> Result<(f32, Vec<Vec<f32>>)>;

The gradient is the Hinton et al. 2015 §2 derivation:

    ∂L/∂s = α · (softmax(s) - one_hot(label))
          + (1-α) · T · (softmax(s/T) - softmax(t/T))

(The KL term carries a factor of T, not T²: differentiating softmax(s/T)
with respect to s contributes 1/T through the chain rule, which cancels
one factor of the T² in the loss.)
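
A minimal CPU sketch of the loss and gradient formulas above, useful for checking the T-versus-T² point with finite differences. It follows the signatures listed in this message but is not the shipped code: the `_sketch` suffixes and the `softmax_t` helper are hypothetical, and the KL direction (teacher distribution first) is an assumption consistent with the stated gradient.

```rust
/// Softmax of `logits / temperature`, shifted by the max for stability.
fn softmax_t(logits: &[f32], temperature: f32) -> Vec<f32> {
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits
        .iter()
        .map(|x| ((x - max) / temperature).exp())
        .collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// alpha * CE(s, label) + (1 - alpha) * T^2 * KL(softmax(t/T) || softmax(s/T)).
fn kd_loss_sketch(s: &[f32], t: &[f32], label: usize, temperature: f32, alpha: f32) -> f32 {
    let ce = -softmax_t(s, 1.0)[label].ln();
    let ps = softmax_t(s, temperature);
    let pt = softmax_t(t, temperature);
    let mut kl = 0.0f32;
    for (q, p) in pt.iter().zip(&ps) {
        if *q > 0.0 {
            kl += q * (q / p).ln(); // skip zero-probability teacher entries
        }
    }
    alpha * ce + (1.0 - alpha) * temperature * temperature * kl
}

/// Gradient of the loss above with respect to the student logits `s`.
fn kd_logit_gradient_sketch(
    s: &[f32],
    t: &[f32],
    label: usize,
    temperature: f32,
    alpha: f32,
) -> Vec<f32> {
    let p = softmax_t(s, 1.0);
    let ps = softmax_t(s, temperature);
    let pt = softmax_t(t, temperature);
    (0..s.len())
        .map(|i| {
            let one_hot = if i == label { 1.0 } else { 0.0 };
            // CE term plus the KL term scaled by T (not T^2).
            alpha * (p[i] - one_hot) + (1.0 - alpha) * temperature * (ps[i] - pt[i])
        })
        .collect()
}
```

Perturbing one student logit by a small epsilon and differencing `kd_loss_sketch` should reproduce the matching entry of `kd_logit_gradient_sketch`; with a T² factor in the gradient the two would disagree by a factor of T.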

Scope: Phase 2 vs Phase 2b
==========================

Phase 2 (this PR) ships the orchestration math, all in pure Rust on the
CPU. The output `Vec<Vec<f32>>` holds one logit-space gradient per batch
element; Phase 2b will plumb it into
`CudaTransformerTrainer.forward_backward_kd_batch` as the backward-pass
seed (replacing the CE-only gradient currently used by
forward_backward_batch).

Splitting Phase 2 into 2a/2b lets us land the orchestration layer + its
tests now, separate from extending the complex GPU trainer code path.

Falsifiers pinned
=================

3 KD-step falsifiers + 6 sanity tests, all passing:

- F-DISTILL-KDSTEP-001 (alpha=1 → pure CE)
- F-DISTILL-KDSTEP-002 (student==teacher → zero KL gradient under alpha=0)
- F-DISTILL-KDSTEP-003 (loss monotone in student-teacher divergence)
- softmax unit-sum + non-negative
- CE gradient correct sign at label vs non-label positions
- kd_step orchestration end-to-end
- kd_step empty-batch sanity
- kd_step vocab-mismatch error path
- kd_loss alpha=1 collapses to pure CE

All 50 aprender-train-distill lib tests pass (was 41 — 9 new).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…TILL-001 Phase 2b)

Mirrors Phase 1's TeacherLogitsProvider for the student side. Beyond
vocab_size, the student has two methods: logits_for_batch (forward) and
apply_kd_gradient (backward + optimizer step). FixtureStudent
implements both for CPU-only unit testing; Phase 2c will add a
CudaStudentProvider that wraps CudaTransformerTrainer.

What this PR adds
=================

  pub trait StudentLogitsProvider {
      fn vocab_size(&self) -> usize;
      fn logits_for_batch(&mut self, input_ids: &[Vec<u32>])
          -> Result<Vec<Vec<f32>>>;
      fn apply_kd_gradient(&mut self, gradient: &[Vec<f32>])
          -> Result<()>;
  }

  pub struct FixtureStudent {
      vocab_size: usize,
      logits: Vec<f32>,           // current student parameters
      learning_rate: f32,
  }

FixtureStudent's apply_kd_gradient averages the gradient across batch
elements (canonical SGD batch averaging) and subtracts the scaled
gradient from its internal logits buffer. This isn't a real model —
it's a logit-space optimization fixture that lets us validate the
KD pipeline's gradient direction is correct without needing CUDA.
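
The update rule described above is small enough to sketch directly. This is an illustration under assumptions, not the shipped type: `FixtureStudentSketch` is a hypothetical name, errors are plain `String`s rather than the crate's error type, and treating an empty batch as a no-op is a guess.

```rust
/// Hypothetical stand-in for FixtureStudent: the "parameters" are a single
/// logit vector shared by every batch element.
struct FixtureStudentSketch {
    logits: Vec<f32>,
    learning_rate: f32,
}

impl FixtureStudentSketch {
    fn new(vocab_size: usize, learning_rate: f32) -> Self {
        Self { logits: vec![0.0; vocab_size], learning_rate }
    }

    /// Averages the per-element gradients, then takes one SGD step in logit space.
    fn apply_kd_gradient(&mut self, gradient: &[Vec<f32>]) -> Result<(), String> {
        if gradient.is_empty() {
            return Ok(()); // assumption: an empty batch is a no-op
        }
        let vocab = self.logits.len();
        if gradient.iter().any(|row| row.len() != vocab) {
            return Err(format!("gradient row length does not match vocab_size {vocab}"));
        }
        // lr / batch_size folds the batch average into the step size.
        let scale = self.learning_rate / gradient.len() as f32;
        for row in gradient {
            for (logit, g) in self.logits.iter_mut().zip(row) {
                *logit -= scale * g;
            }
        }
        Ok(())
    }
}
```

With a learning rate of 0.5 and gradient rows `[1, -1]` and `[3, -3]`, the averaged gradient is `[2, -2]`, so zero-initialized logits move to `[-1, 1]`.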

Falsifiers pinned
=================

7 student_provider tests (2 falsifiers + 5 sanity tests), all passing:

- F-DISTILL-STUDENT-001 — one KD step moves student logits toward
  teacher's preferred token. Setup: uniform student, teacher prefers
  token 5, alpha=0 (pure KL signal). After one step, student logit
  at index 5 must be strictly greater than before.

- F-DISTILL-STUDENT-002 — 10 sequential KD steps strictly decrease
  per-step KD loss. With LR=0.5, loss after 10 steps < 90% of initial.
  Validates the gradient direction is correct (descent, not ascent).

Plus 5 sanity tests covering vocab_size reporting, batch broadcast,
shape validation, in-place logit update, and batch averaging math.

Architecture
============

Stacks on top of #1788 (kd_step). Pipeline integration that combines
TeacherLogitsProvider, StudentLogitsProvider, and kd_step is Phase 2c.

Phase 2c (PMAT-696, follow-up): CudaStudentProvider that wraps
CudaTransformerTrainer for production runs. Once it lands, end-to-end
GPU distillation is unblocked.

Tests
=====

All 57 aprender-train-distill lib tests pass (was 50 — 7 new).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…o-end (SPEC-DISTILL-001 Phase 2c)

Rewrites Pipeline::train() to use TeacherLogitsProvider +
StudentLogitsProvider + kd_step end-to-end. Replaces the
build_synthetic_logits stubs on both sides with real abstraction calls.

Pipeline gains a `student: Box<dyn StudentLogitsProvider + Send>` field
and a `Pipeline::with_student()` builder mirroring `with_teacher()`.
Default backends are FixtureTeacher + FixtureStudent so legacy tests
behave identically. Phase 2d swaps in CudaStudentProvider.

Training loop, per step:

  1. dummy_batch = [[0u32]; batch_size]  (Phase 4 plugs in real tokens)
  2. teacher.logits_for_batch(dummy_batch) → teacher logits
  3. kd_step(teacher, dummy_batch, labels, T, α, student_logits_closure)
     → (scalar loss, per-batch logit gradients)
  4. student.apply_kd_gradient(grads) → student updates

bracketed by initial-loss + final-loss measurements via the new
kd_step_loss_for_pipeline helper.
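
The four steps and the loss bracketing can be sketched end to end on the CPU. Everything here is a simplified stand-in: the teacher is a fixed logit vector, the student is a logit-space fixture, and the names `train_sketch`, `kd_loss`, and `kd_grad` are hypothetical rather than the functions `Pipeline::train()` calls.

```rust
/// Softmax of `logits / temperature`, shifted by the max for stability.
fn softmax(logits: &[f32], temperature: f32) -> Vec<f32> {
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|x| ((x - max) / temperature).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// alpha * CE + (1 - alpha) * T^2 * KL for one batch element.
fn kd_loss(s: &[f32], t: &[f32], label: usize, temp: f32, alpha: f32) -> f32 {
    let ce = -softmax(s, 1.0)[label].ln();
    let (ps, pt) = (softmax(s, temp), softmax(t, temp));
    let kl: f32 = pt.iter().zip(&ps).map(|(q, p)| q * (q / p).ln()).sum();
    alpha * ce + (1.0 - alpha) * temp * temp * kl
}

/// Logit-space gradient of `kd_loss` with respect to `s`.
fn kd_grad(s: &[f32], t: &[f32], label: usize, temp: f32, alpha: f32) -> Vec<f32> {
    let (p, ps, pt) = (softmax(s, 1.0), softmax(s, temp), softmax(t, temp));
    (0..s.len())
        .map(|i| {
            let one_hot = if i == label { 1.0 } else { 0.0 };
            alpha * (p[i] - one_hot) + (1.0 - alpha) * temp * (ps[i] - pt[i])
        })
        .collect()
}

/// Runs `steps` KD updates and returns (initial_loss, final_loss).
fn train_sketch(
    teacher_logits: &[f32],
    label: usize,
    steps: usize,
    batch_size: usize,
    temp: f32,
    alpha: f32,
    lr: f32,
) -> (f32, f32) {
    let mut student = vec![0.0f32; teacher_logits.len()];
    let initial_loss = kd_loss(&student, teacher_logits, label, temp, alpha);
    for _ in 0..steps {
        // 1. Dummy batch: one placeholder token id per element.
        let dummy_batch: Vec<Vec<u32>> = vec![vec![0u32]; batch_size];
        // 2 + 3. Teacher logits and KD gradient for every element.
        let grads: Vec<Vec<f32>> = dummy_batch
            .iter()
            .map(|_ids| kd_grad(&student, teacher_logits, label, temp, alpha))
            .collect();
        // 4. Student update: average over the batch, then one SGD step.
        let scale = lr / grads.len() as f32;
        for row in &grads {
            for (w, g) in student.iter_mut().zip(row) {
                *w -= scale * g;
            }
        }
    }
    let final_loss = kd_loss(&student, teacher_logits, label, temp, alpha);
    (initial_loss, final_loss)
}
```

A call such as `train_sketch(&[0.0, 0.0, 3.0, 0.0], 2, 10, 4, 4.0, 0.3, 0.5)` should return a final loss below the initial one, which is the property an end-to-end falsifier on this loop would pin.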

Falsifiers
==========

F-DISTILL-PIPELINE-001 (new) — end-to-end falsifier: runs
Pipeline::execute() with FixtureTeacher + FixtureStudent + 3 epochs
and asserts final_loss < initial_loss. Pins the entire data flow:
teacher → student → kd_step → apply_kd_gradient. Any broken link
either flatlines or increases the loss.

Phase 2d plug points
====================

- The dummy_batch in train() is the natural insertion point for a real
  dataset iterator (Phase 4 work).
- The student-logits closure in kd_step is the natural insertion point
  for CudaStudentProvider's forward_logits (Phase 2d).
- The apply_kd_gradient call is the natural insertion point for
  CudaStudentProvider's forward_backward_kd_batch path (Phase 2d).

Tests
=====

All 58 aprender-train-distill lib tests pass (was 57 — 1 new
F-DISTILL-PIPELINE-001 integration falsifier). The four legacy
helpers (build_synthetic_logits, kd_gradient, softmax_2d,
write_logits_to_weights) are marked #[allow(dead_code)] for back-compat
until Phase 2d's wiring fully replaces the on-disk weights round-trip.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…-DISTILL-001 Phase 2d)

Closes the last engineering gap before Phase 3 (500-step E2E smoke).
Real GPU student backend wraps CudaTransformerTrainer and implements
the Phase 2b StudentLogitsProvider trait.

Two complementary pieces
========================

1. New pub method on CudaTransformerTrainer (cuda_trainer.rs:1247-1311):

     pub fn forward_backward_with_grad(
         &mut self,
         input_ids: &[u32],
         logit_gradient: &[f32],
     ) -> Option<()>

   Runs forward, uploads the caller-supplied logit gradient into the
   last-position slice of logits_buf (replacing what gpu_forward wrote),
   runs gpu_backward (which back-props from the in-place gradient through
   the transformer stack), and runs embed_backward. Matches the KAIZEN-052
   in-place gradient convention that fused_cross_entropy_cuda uses for
   the CE path — except the gradient now comes from
   `kd_step::kd_logit_gradient` (Phase 2a).

2. New CudaStudentProvider in aprender-train-distill (cuda-gated):

     pub struct CudaStudentProvider {
         trainer: CudaTransformerTrainer,
         vocab_size: usize,
         last_input_ids: Option<Vec<u32>>,
     }

   impl StudentLogitsProvider for CudaStudentProvider:
     - logits_for_batch → trainer.forward_logits per batch element +
       caches last_input_ids
     - apply_kd_gradient → trainer.forward_backward_with_grad on the
       cached last input_ids with the last gradient

Phase 2d limitation
===================

batch_size=1 only — the trait's apply_kd_gradient doesn't take
input_ids, so the provider has to cache from the most-recent
logits_for_batch call. With batches >1 only the last element gets a
real gradient update.
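
The limitation can be reproduced with a small mock that records which inputs would receive a backward pass. `CachingStudentSketch` and its `updated` field are hypothetical; the real provider calls into CudaTransformerTrainer where the comments indicate.

```rust
/// Hypothetical mock of the caching behaviour described above.
struct CachingStudentSketch {
    last_input_ids: Option<Vec<u32>>,
    /// Inputs that would have received a real backward pass.
    updated: Vec<Vec<u32>>,
}

impl CachingStudentSketch {
    fn logits_for_batch(&mut self, input_ids: &[Vec<u32>]) {
        for ids in input_ids {
            // A forward pass would run here; only the most recent ids survive.
            self.last_input_ids = Some(ids.clone());
        }
    }

    fn apply_kd_gradient(&mut self, gradient: &[Vec<f32>]) {
        // The trait method receives no input_ids, so fall back to the cache
        // and pair it with the last gradient row.
        if let (Some(ids), Some(_last_grad)) = (self.last_input_ids.clone(), gradient.last()) {
            // A backward pass plus optimizer step would run here.
            self.updated.push(ids);
        }
    }
}
```

Feeding a three-element batch through `logits_for_batch` and then `apply_kd_gradient` leaves `updated` holding only the third input, which is why the smoke run has to use batch_size=1 until a fused-step method exists.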

Phase 2e (PMAT-698, follow-up) generalizes via a fused-step trait
method that takes input_ids + gradient together so all batch elements
process correctly.

Tests
=====

All 58 aprender-train-distill lib tests pass under BOTH
--features (none) and --features cuda. The CudaStudentProvider
itself doesn't have a unit test — exercising it needs CUDA at test
time, and the F-DISTILL-CUDA-STUDENT-001 falsifier (logits parity
within 1e-6 vs a standalone forward_logits call) is integration-tested
in Phase 4 production runs.

What's next
===========

With Phase 2d landed, Phase 3 (500-step E2E smoke run with real
CudaTrainerTeacher + CudaStudentProvider) is unblocked. Phase 4 is
the actual 50K-step distillation training run for albor-370m-v2.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
SPEC-DISTILL-001 Phase 3 dispatch artifact. Reproducible 500-step
smoke run targeting paiml/albor-370m-v2 (MODEL-2 distillation).

scripts/dispatch-distill-phase-3-gx10.sh
========================================

Idempotent dispatch:
  1. Local preflight (env, hosts, args echo).
  2. Remote git pull main + build with --features cuda.
  3. Remote apr pull of teacher + student-init (no-op if cached).
  4. Background dispatch of `apr distill` with KD hyperparameters
     from SPEC-DISTILL-001 Phase 4 plan (T=4.0, alpha=0.3, LR=1.5e-5).
  5. Captures evidence manifest locally (dispatch.json).

Overridable via env:
  GX10_HOST, GX10_USER, GX10_REPO_PATH, TEACHER_REPO, STUDENT_INIT,
  STEPS (default 500), BATCH_SIZE, LR, T, ALPHA, EVIDENCE_DIR, DRY_RUN.

scripts/watch-distill-phase-3-gx10.sh
=====================================

Tails the remote training log + filters for step counts, loss markers,
panic/error lines, and F-DISTILL-* falsifier verdicts.

Blackwell JIT constraint
========================

gx10 is sm_121 (GB10). Memory rule `Blackwell JIT pre-warming bug
blocks training` (PMAT-587 lineage) applies — custom PTX kernels in
gpu_backward may crash on JIT. The dispatch script tolerates this:
forward path runs fine (CudaTrainerTeacher inference works); backward
may or may not run, depending on trueno 0.4.36 status. The script logs
either outcome and the F-DISTILL-SMOKE-001 verdict is read from
the launch log.

The fallback compute lane is lambda-vector (RTX 4090) where backward
is proven via §82 P2-A. The script accepts a different GX10_HOST env
to dispatch to either.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift
Contributor Author

Subsumed by #1797 squash-merge (chain-PR leapfrog pattern per memory rule). All content landed on main at aee8716.

@noahgift noahgift closed this May 18, 2026
auto-merge was automatically disabled May 18, 2026 16:40

Pull request was closed

@noahgift noahgift deleted the feat/distill-phase-3-dispatch-script branch May 18, 2026 16:40
noahgift added a commit that referenced this pull request May 18, 2026
…1799)

The Phase 3 dispatch script (shipped in #1795, squash-landed via #1797)
invoked apr distill with six flags that don't exist on the post-Phase-3-prep
CLI: --num-steps, --batch-size, --learning-rate, --student-init, --output-dir,
--device. apr distill rejects them, so the smoke run cannot fire as shipped.

Realign to existing flags:
  --teacher REPO          → positional <TEACHER_DIR>
  --student-init REPO     → --student <STUDENT_DIR>
  --num-steps 500         → --epochs 17 (round-up of 500/31 default-batch)
  --batch-size, --learning-rate → dropped (CLI doesn't surface them; PMAT-698c follow-up)
  --output-dir DIR        → --output DIR/student.apr
  --device cuda           → --backend cuda
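
The `--epochs 17` value is a ceiling division. A minimal sketch of that arithmetic, reading 31 as the number of optimizer steps one epoch yields at the default batch size (that reading is an assumption; `epochs_for_steps` is a hypothetical helper, not part of the script):

```rust
/// Smallest number of whole epochs that covers `target_steps`.
/// Panics if `steps_per_epoch` is zero.
fn epochs_for_steps(target_steps: usize, steps_per_epoch: usize) -> usize {
    (target_steps + steps_per_epoch - 1) / steps_per_epoch
}
```

For 500 target steps at 31 steps per epoch this gives 17 epochs, or 527 steps, so under that reading the run slightly overshoots the 500-step target.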

HF repo IDs are resolved to local cache snapshot dirs via a shell function
inside the SSH heredoc (the apr distill CudaTrainerTeacher::for_inference
signature accepts a directory containing model.safetensors or model.apr,
which matches the HF cache layout).

Evidence:
- evidence/distill-phase-3-readiness/findings.md documents the original
  aspirational-flag defect + resolution + deferred PMAT-698c scope.

Unblocks task #124 (Phase 3 real smoke dispatch on gx10).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>