feat(distill): KD step orchestration — combined loss + KD logit gradient (SPEC-DISTILL-001 Phase 2) by noahgift · Pull Request #1788 · paiml/aprender

noahgift · 2026-05-18T11:03:08Z

Summary

Wires Phase 1's teacher provider into a per-batch KD orchestration step that produces both:

The combined α·CE + (1-α)·T²·KL scalar loss (for telemetry + logging)
The KD-aware logit-space gradient (Phase 2b plug point)
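
As an illustration of the first item, here is a minimal sketch of how the combined scalar could be computed for one example. It assumes the KL term is KL(teacher ‖ student) over temperature-softened distributions, which is the direction consistent with the gradient given later in this description. The names `softmax_t` and `kd_loss_sketch` and the `EPS` guard are hypothetical and are not taken from the module.

```rust
/// Numerically stable softmax of `logits / temperature`.
fn softmax_t(logits: &[f32], temperature: f32) -> Vec<f32> {
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits
        .iter()
        .map(|&x| ((x - max) / temperature).exp())
        .collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

/// Sketch of alpha * CE + (1 - alpha) * T^2 * KL(teacher || student).
/// Assumption: KL direction and the EPS guard are illustrative choices.
fn kd_loss_sketch(s: &[f32], t: &[f32], label: usize, temperature: f32, alpha: f32) -> f32 {
    const EPS: f32 = 1e-12; // guards ln(0)
    // Hard-label cross-entropy at T = 1.
    let ce = -(softmax_t(s, 1.0)[label] + EPS).ln();
    // Temperature-softened student (q) and teacher (p) distributions.
    let q = softmax_t(s, temperature);
    let p = softmax_t(t, temperature);
    let kl: f32 = p
        .iter()
        .zip(&q)
        .map(|(&pi, &qi)| pi * ((pi + EPS).ln() - (qi + EPS).ln()))
        .sum();
    alpha * ce + (1.0 - alpha) * temperature * temperature * kl
}
```

With `alpha = 1.0` the KL term drops out and the value is plain cross-entropy; with `alpha = 0.0` and identical student and teacher logits it is zero.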

Stacks on top of #1787 (Phase 1b CudaTrainerTeacher). The Phase 2 work landed here is the pure-Rust orchestration math; Phase 2b (PMAT-694 follow-up) wires the gradient into CudaTransformerTrainer.forward_backward_kd_batch so the student actually learns from the KD signal.

Why split Phase 2 into 2a/2b

Phase 2 is 16-24h of engineering per SPEC-DISTILL-001. Splitting lets us land the orchestration layer and its tests (Phase 2a, this PR: 12 deliverables, 3 functions + 9 tests) as a clean, reviewable unit, separate from extending the complex CUDA trainer code. Phase 2b then becomes a focused "add forward_backward_kd_batch to CudaTransformerTrainer using this module's kd_logit_gradient" PR.

New module: `aprender-train-distill::kd_step`

pub fn kd_loss(s_logits, t_logits, label, temperature, alpha) -> f32;
pub fn kd_logit_gradient(s_logits, t_logits, label, temperature, alpha) -> Vec<f32>;
pub fn kd_step<F>(teacher, input_ids, labels, T, α, compute_student_logits) -> Result<(f32, Vec<Vec<f32>>)>;
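
To make the orchestration shape concrete, the following is a self-contained sketch of a `kd_step`-style function under several assumptions: the error type is a plain `String` (`KdResult`), the scalar loss is averaged over the batch, and an empty batch returns zero loss with no gradients. `ConstTeacher`, `loss_and_grad`, and `kd_step_sketch` are hypothetical names; only the trait method names and the argument order follow the signatures above.

```rust
type KdResult<T> = std::result::Result<T, String>; // hypothetical error type

trait TeacherLogitsProvider {
    fn vocab_size(&self) -> usize;
    fn logits_for_batch(&mut self, input_ids: &[Vec<u32>]) -> KdResult<Vec<Vec<f32>>>;
}

/// Hypothetical teacher that returns the same logits for every sequence.
struct ConstTeacher(Vec<f32>);

impl TeacherLogitsProvider for ConstTeacher {
    fn vocab_size(&self) -> usize {
        self.0.len()
    }
    fn logits_for_batch(&mut self, input_ids: &[Vec<u32>]) -> KdResult<Vec<Vec<f32>>> {
        Ok(vec![self.0.clone(); input_ids.len()])
    }
}

fn softmax_t(logits: &[f32], temperature: f32) -> Vec<f32> {
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&x| ((x - max) / temperature).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

/// Per-example combined loss and logit gradient.
fn loss_and_grad(s: &[f32], t: &[f32], label: usize, temp: f32, alpha: f32) -> (f32, Vec<f32>) {
    let (hard, q, p) = (softmax_t(s, 1.0), softmax_t(s, temp), softmax_t(t, temp));
    let ce = -hard[label].max(1e-12).ln();
    let kl: f32 = p
        .iter()
        .zip(&q)
        .map(|(&pi, &qi)| if pi > 0.0 { pi * (pi / qi.max(1e-12)).ln() } else { 0.0 })
        .sum();
    let loss = alpha * ce + (1.0 - alpha) * temp * temp * kl;
    let grad = (0..s.len())
        .map(|i| {
            let one_hot = if i == label { 1.0 } else { 0.0 };
            alpha * (hard[i] - one_hot) + (1.0 - alpha) * temp * (q[i] - p[i])
        })
        .collect();
    (loss, grad)
}

/// Sketch: one teacher call per batch, one student call per sequence.
fn kd_step_sketch<F: FnMut(&[u32]) -> Vec<f32>>(
    teacher: &mut dyn TeacherLogitsProvider,
    input_ids: &[Vec<u32>],
    labels: &[usize],
    temperature: f32,
    alpha: f32,
    mut compute_student_logits: F,
) -> KdResult<(f32, Vec<Vec<f32>>)> {
    if input_ids.len() != labels.len() {
        return Err("input_ids and labels differ in length".to_string());
    }
    if input_ids.is_empty() {
        return Ok((0.0, Vec::new())); // assumption: empty batch is not an error
    }
    let teacher_logits = teacher.logits_for_batch(input_ids)?;
    let mut total = 0.0f32;
    let mut grads = Vec::with_capacity(input_ids.len());
    for ((ids, t), &label) in input_ids.iter().zip(&teacher_logits).zip(labels) {
        let s = compute_student_logits(ids.as_slice());
        if s.len() != t.len() {
            return Err(format!("vocab mismatch: student {} vs teacher {}", s.len(), t.len()));
        }
        if label >= s.len() {
            return Err(format!("label {label} out of range for vocab {}", s.len()));
        }
        let (loss, grad) = loss_and_grad(&s, t, label, temperature, alpha);
        total += loss;
        grads.push(grad);
    }
    Ok((total / input_ids.len() as f32, grads))
}
```

Passing the student forward pass as a closure keeps this layer independent of any particular student backend, which is what lets the math be tested on the CPU.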

The gradient is the Hinton et al. 2015 §2 derivation:

∂L/∂s = α · (softmax(s) - one_hot(label))
      + (1-α) · T · (softmax(s/T) - softmax(t/T))

(T factor not T² — one T factor is absorbed by the softmax chain rule.)
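
The single factor of T can be checked by hand. The derivation below is a sketch that assumes the KL term is KL(p ‖ q), with p = softmax(t/T) for the teacher and q = softmax(s/T) for the student; the direction is not stated above, but it is the one consistent with the gradient shown.

```latex
\begin{aligned}
q_i &= \frac{e^{s_i/T}}{\sum_k e^{s_k/T}}, \qquad
p_i = \frac{e^{t_i/T}}{\sum_k e^{t_k/T}}, \\
\frac{\partial \log q_j}{\partial s_i} &= \frac{1}{T}\left(\delta_{ij} - q_i\right), \\
\frac{\partial}{\partial s_i}\,\mathrm{KL}(p \,\|\, q)
  &= -\sum_j p_j \,\frac{\partial \log q_j}{\partial s_i}
   = -\frac{1}{T}\Big(p_i - q_i \sum_j p_j\Big)
   = \frac{1}{T}\left(q_i - p_i\right), \\
\frac{\partial}{\partial s_i}\Big[(1-\alpha)\,T^2\,\mathrm{KL}(p \,\|\, q)\Big]
  &= (1-\alpha)\,T\,\left(q_i - p_i\right).
\end{aligned}
```

The third line uses the fact that the teacher probabilities sum to 1. The 1/T from differentiating through s/T cancels one of the two T factors in the loss weight.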

Falsifiers pinned

| ID | Statement | Status |
| --- | --- | --- |
| F-DISTILL-KDSTEP-001 | `alpha=1` collapses KD gradient to pure CE gradient | ✓ passing |
| F-DISTILL-KDSTEP-002 | `student==teacher` + `alpha=0` → zero KL gradient | ✓ passing |
| F-DISTILL-KDSTEP-003 | `kd_loss` strictly increases as student diverges from teacher | ✓ passing |
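
The first two falsifiers can be illustrated with a small standalone sketch of the gradient formula. This is not the module's code: `softmax_t`, `kd_logit_gradient_sketch`, and `ce_gradient_sketch` are hypothetical names, and the function simply transcribes the formula from this description.

```rust
/// Numerically stable softmax of `logits / temperature`.
fn softmax_t(logits: &[f32], temperature: f32) -> Vec<f32> {
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits
        .iter()
        .map(|&x| ((x - max) / temperature).exp())
        .collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

/// alpha * (softmax(s) - one_hot(label))
///   + (1 - alpha) * T * (softmax(s/T) - softmax(t/T))
fn kd_logit_gradient_sketch(
    s: &[f32],
    t: &[f32],
    label: usize,
    temperature: f32,
    alpha: f32,
) -> Vec<f32> {
    let hard = softmax_t(s, 1.0);
    let q = softmax_t(s, temperature);
    let p = softmax_t(t, temperature);
    (0..s.len())
        .map(|i| {
            let one_hot = if i == label { 1.0 } else { 0.0 };
            alpha * (hard[i] - one_hot) + (1.0 - alpha) * temperature * (q[i] - p[i])
        })
        .collect()
}

/// Pure cross-entropy gradient, the reference for the alpha = 1 case.
fn ce_gradient_sketch(s: &[f32], label: usize) -> Vec<f32> {
    softmax_t(s, 1.0)
        .iter()
        .enumerate()
        .map(|(i, &p)| if i == label { p - 1.0 } else { p })
        .collect()
}
```

Calling `kd_logit_gradient_sketch` with `alpha = 1.0` should match `ce_gradient_sketch` element by element, and calling it with the same slice for student and teacher at `alpha = 0.0` should give an all-zero vector.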

Plus 6 sanity tests covering softmax unit-sum, CE gradient signs, orchestration end-to-end, empty-batch, vocab-size mismatch error path, and alpha=1 loss collapse.

All 50 aprender-train-distill lib tests pass (was 41 — 9 new).

What's next

Phase 2b — PMAT-694 (follow-up): extend CudaTransformerTrainer with forward_backward_kd_batch(batch, teacher_logits). The new method uses kd_logit_gradient (from this PR) as the backward-pass seed instead of CE-only. With Phase 2b landed, the pipeline starts producing genuinely distilled student weights instead of CE-only.
Phase 3 — 500-step E2E smoke per SPEC-DISTILL-001.

Test plan

9 new kd_step tests pass (3 falsifiers + 6 sanity)
All 50 aprender-train-distill lib tests pass
cargo check -p aprender-train-distill clean (both with and without --features cuda)
Phase 2b: real KD-aware backward through CudaTransformerTrainer

🤖 Generated with Claude Code

…L-001 Phase 1b, PMAT-693)

Adds the real teacher backend the SPEC-DISTILL-001 Phase 1b ticket scopes: CudaTrainerTeacher wraps entrenar's CudaTransformerTrainer in inference-only mode, delegates logits_for_batch to forward_logits() per batch element, and returns shape [batch, vocab_size].

Gated behind a new `cuda` feature on aprender-train-distill that propagates to entrenar/cuda. Without the feature, only FixtureTeacher (Phase 1) is available, which is sufficient for unit tests but not for real training. Real distillation runs (Phase 4) require --features cuda.

Surface
=======

    #[cfg(feature = "cuda")]
    pub struct CudaTrainerTeacher { /* wraps CudaTransformerTrainer */ }

    impl CudaTrainerTeacher {
        pub fn for_inference(
            checkpoint_dir: impl AsRef<Path>,
            model_config: TransformerConfig,
        ) -> Result<Self> { ... }
    }

    impl TeacherLogitsProvider for CudaTrainerTeacher {
        fn vocab_size(&self) -> usize { ... }
        fn logits_for_batch(&mut self, input_ids: &[Vec<u32>]) -> Result<Vec<Vec<f32>>> { ... }
    }

Defensive checks
================

- forward_logits returning None → EntrenarError::Internal with a clear "likely missing weights or CUDA init failure" message
- logits.len() != vocab_size → EntrenarError::Internal flagging TransformerConfig vs checkpoint vocab drift (the common silent failure mode for loaded-from-disk distillation runs)

Tests
=====

All 6 teacher_provider tests pass under both --features (none) and --features cuda. Compile gates verified:

    cargo check -p aprender-train-distill                  # clean
    cargo check -p aprender-train-distill --features cuda  # clean

What's next
===========

Phase 2 (PMAT-694, follow-up): wire CudaTransformerTrainer's KD-loss backward into the student path, replacing the remaining build_synthetic_logits call site for the student in pipeline.rs::train().

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…ent (SPEC-DISTILL-001 Phase 2)

Wires Phase 1's teacher provider into a per-batch KD orchestration step that produces both the combined α·CE + (1-α)·T²·KL scalar loss (for telemetry) and the KD-aware logit-space gradient (Phase 2b plug point).

What this PR adds
=================

New module `aprender-train-distill::kd_step`:

    pub fn kd_loss(
        student_logits: &[f32],
        teacher_logits: &[f32],
        label: usize,
        temperature: f32,
        alpha: f32,
    ) -> f32;

    pub fn kd_logit_gradient(
        student_logits: &[f32],
        teacher_logits: &[f32],
        label: usize,
        temperature: f32,
        alpha: f32,
    ) -> Vec<f32>;

    pub fn kd_step<F: FnMut(&[u32]) -> Vec<f32>>(
        teacher: &mut dyn TeacherLogitsProvider,
        input_ids: &[Vec<u32>],
        labels: &[usize],
        temperature: f32,
        alpha: f32,
        compute_student_logits: F,
    ) -> Result<(f32, Vec<Vec<f32>>)>;

The gradient is the Hinton et al. 2015 §2 derivation:

    ∂L/∂s = α · (softmax(s) - one_hot(label))
          + (1-α) · T · (softmax(s/T) - softmax(t/T))

(T factor, not T²: one T factor is absorbed by the softmax derivative chain rule.)

Scope: Phase 2 vs Phase 2b
==========================

Phase 2 (this PR) ships the orchestration math, all in pure Rust on the CPU. The output `Vec<Vec<f32>>` of per-batch gradients is what Phase 2b will plumb into `CudaTransformerTrainer.forward_backward_kd_batch` as the backward-pass seed (replacing the CE-only gradient currently used by forward_backward_batch).

Splitting Phase 2 into 2a/2b lets us land the orchestration layer + its tests now, separate from extending the complex GPU trainer code path.

Falsifiers pinned
=================

3 KD-step falsifiers + 6 sanity tests, all passing:

- F-DISTILL-KDSTEP-001 (alpha=1 → pure CE)
- F-DISTILL-KDSTEP-002 (student==teacher → zero KL gradient under alpha=0)
- F-DISTILL-KDSTEP-003 (loss monotone in student-teacher divergence)
- softmax unit-sum + non-negative
- CE gradient correct sign at label vs non-label positions
- kd_step orchestration end-to-end
- kd_step empty-batch sanity
- kd_step vocab-mismatch error path
- kd_loss alpha=1 collapses to pure CE

All 50 aprender-train-distill lib tests pass (was 41; 9 new).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…TILL-001 Phase 2b)

Mirrors Phase 1's TeacherLogitsProvider for the student side. Beyond vocab_size, the student has two methods: logits_for_batch (forward) and apply_kd_gradient (backward + optimizer step). FixtureStudent implements both for CPU-only unit testing; Phase 2c will add a CudaStudentProvider that wraps CudaTransformerTrainer.

What this PR adds
=================

    pub trait StudentLogitsProvider {
        fn vocab_size(&self) -> usize;
        fn logits_for_batch(&mut self, input_ids: &[Vec<u32>]) -> Result<Vec<Vec<f32>>>;
        fn apply_kd_gradient(&mut self, gradient: &[Vec<f32>]) -> Result<()>;
    }

    pub struct FixtureStudent {
        vocab_size: usize,
        logits: Vec<f32>, // current student parameters
        learning_rate: f32,
    }

FixtureStudent's apply_kd_gradient averages the gradient across batch elements (canonical SGD batch averaging) and subtracts the scaled gradient from its internal logits buffer. This isn't a real model: it's a logit-space optimization fixture that lets us validate that the KD pipeline's gradient direction is correct without needing CUDA.

Falsifiers pinned
=================

7 student_provider tests (2 falsifiers + 5 sanity tests), all passing:

- F-DISTILL-STUDENT-001: one KD step moves student logits toward the teacher's preferred token. Setup: uniform student, teacher prefers token 5, alpha=0 (pure KL signal). After one step, the student logit at index 5 must be strictly greater than before.
- F-DISTILL-STUDENT-002: 10 sequential KD steps strictly decrease per-step KD loss. With LR=0.5, loss after 10 steps < 90% of initial. Validates that the gradient direction is correct (descent, not ascent).

The 5 sanity tests cover vocab_size reporting, batch broadcast, shape validation, in-place logit update, and batch averaging math.

Architecture
============

Stacks on top of #1788 (kd_step). Pipeline integration that uses both TeacherLogitsProvider + StudentLogitsProvider + kd_step is Phase 2c.

Phase 2c (PMAT-696, follow-up): CudaStudentProvider that wraps CudaTransformerTrainer for production runs. Once it lands, end-to-end GPU distillation is unblocked.

Tests
=====

All 57 aprender-train-distill lib tests pass (was 50; 7 new).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
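
The batch-averaged update described for FixtureStudent can be sketched as follows. This is an illustrative stand-in, not the crate's type: `FixtureStudentSketch`, `kl_loss_sketch`, and `kl_gradient_sketch` are hypothetical names, the error type is a plain `String`, and only the alpha=0 (pure KL) case is covered to keep the example short.

```rust
fn softmax_t(logits: &[f32], temperature: f32) -> Vec<f32> {
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits
        .iter()
        .map(|&x| ((x - max) / temperature).exp())
        .collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

/// alpha = 0 loss only: T^2 * KL(softmax(t/T) || softmax(s/T)).
fn kl_loss_sketch(s: &[f32], t: &[f32], temperature: f32) -> f32 {
    let (q, p) = (softmax_t(s, temperature), softmax_t(t, temperature));
    let kl: f32 = p
        .iter()
        .zip(&q)
        .map(|(&pi, &qi)| if pi > 0.0 { pi * (pi / qi.max(1e-12)).ln() } else { 0.0 })
        .sum();
    temperature * temperature * kl
}

/// alpha = 0 gradient only: T * (softmax(s/T) - softmax(t/T)).
fn kl_gradient_sketch(s: &[f32], t: &[f32], temperature: f32) -> Vec<f32> {
    let (q, p) = (softmax_t(s, temperature), softmax_t(t, temperature));
    q.iter().zip(&p).map(|(&qi, &pi)| temperature * (qi - pi)).collect()
}

/// Hypothetical logit-space fixture: its logits are its only parameters.
struct FixtureStudentSketch {
    logits: Vec<f32>,
    learning_rate: f32,
}

impl FixtureStudentSketch {
    /// Average the per-example gradients, then take one SGD step.
    fn apply_kd_gradient(&mut self, gradient: &[Vec<f32>]) -> Result<(), String> {
        if gradient.is_empty() {
            return Ok(()); // assumption: an empty batch is a no-op
        }
        if gradient.iter().any(|g| g.len() != self.logits.len()) {
            return Err("gradient width does not match vocab_size".to_string());
        }
        let n = gradient.len() as f32;
        for (i, logit) in self.logits.iter_mut().enumerate() {
            let mean = gradient.iter().map(|g| g[i]).sum::<f32>() / n;
            *logit -= self.learning_rate * mean;
        }
        Ok(())
    }
}
```

Starting from all-zero logits with a teacher that prefers one token, a single call to `apply_kd_gradient` with `kl_gradient_sketch` output should raise the student's logit at that token, which is the property F-DISTILL-STUDENT-001 describes.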

…TILL-001 Phase 3-prep second half, PMAT-697) (#1797)

* feat(distill): CudaTrainerTeacher - real teacher backend (SPEC-DISTILL-001 Phase 1b, PMAT-693)

* feat(distill): KD step orchestration - combined loss + KD logit gradient (SPEC-DISTILL-001 Phase 2)

* feat(distill): StudentLogitsProvider trait + FixtureStudent (SPEC-DISTILL-001 Phase 2b)

* feat(distill): pipeline integration teacher + student + kd_step end-to-end (SPEC-DISTILL-001 Phase 2c)

Rewrites Pipeline::train() to use TeacherLogitsProvider + StudentLogitsProvider + kd_step end-to-end. Replaces the build_synthetic_logits stubs on both sides with real abstraction calls.

Pipeline gains a `student: Box<dyn StudentLogitsProvider + Send>` field and a `Pipeline::with_student()` builder mirroring `with_teacher()`. Default backends are FixtureTeacher + FixtureStudent so legacy tests behave identically. Phase 2d swaps in CudaStudentProvider.

Training loop, per step:

1. dummy_batch = [[0u32]; batch_size] (Phase 4 plugs in real tokens)
2. teacher.logits_for_batch(dummy_batch) → teacher logits
3. kd_step(teacher, dummy_batch, labels, T, α, student_logits_closure) → (scalar loss, per-batch logit gradients)
4. student.apply_kd_gradient(grads) → student updates

bracketed by initial-loss + final-loss measurements via the new kd_step_loss_for_pipeline helper.

Falsifiers
==========

F-DISTILL-PIPELINE-001 (new), end-to-end falsifier: runs Pipeline::execute() with FixtureTeacher + FixtureStudent + 3 epochs and asserts final_loss < initial_loss. Pins the entire data flow: teacher → student → kd_step → apply_kd_gradient. Any broken link either flatlines or increases the loss.

Phase 2d plug points
====================

- The dummy_batch in train() is the natural insertion point for a real dataset iterator (Phase 4 work).
- The student-logits closure in kd_step is the natural insertion point for CudaStudentProvider's forward_logits (Phase 2d).
- The apply_kd_gradient call is the natural insertion point for CudaStudentProvider's forward_backward_kd_batch path (Phase 2d).

Tests
=====

All 58 aprender-train-distill lib tests pass (was 57; 1 new F-DISTILL-PIPELINE-001 integration falsifier). The four legacy helpers (build_synthetic_logits, kd_gradient, softmax_2d, write_logits_to_weights) are marked #[allow(dead_code)] for back-compat until Phase 2d's wiring fully replaces the on-disk weights round-trip.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(distill): CudaStudentProvider + forward_backward_with_grad (SPEC-DISTILL-001 Phase 2d)

Closes the last engineering gap before Phase 3 (500-step E2E smoke). Real GPU student backend wraps CudaTransformerTrainer and implements the Phase 2b StudentLogitsProvider trait.

Two complementary pieces
========================

1. New pub method on CudaTransformerTrainer (cuda_trainer.rs:1247-1311):

    pub fn forward_backward_with_grad(
        &mut self,
        input_ids: &[u32],
        logit_gradient: &[f32],
    ) -> Option<()>

   Runs forward, uploads the caller-supplied logit gradient into the last-position slice of logits_buf (replacing what gpu_forward wrote), runs gpu_backward (which back-props from the in-place gradient through the transformer stack), and runs embed_backward. Matches the KAIZEN-052 in-place gradient convention that fused_cross_entropy_cuda uses for the CE path, except the gradient now comes from `kd_step::kd_logit_gradient` (Phase 2a).

2. New CudaStudentProvider in aprender-train-distill (cuda-gated):

    pub struct CudaStudentProvider {
        trainer: CudaTransformerTrainer,
        vocab_size: usize,
        last_input_ids: Option<Vec<u32>>,
    }

   impl StudentLogitsProvider for CudaStudentProvider:
   - logits_for_batch → trainer.forward_logits per batch element + caches last_input_ids
   - apply_kd_gradient → trainer.forward_backward_with_grad on the cached last input_ids with the last gradient

Phase 2d limitation
===================

batch_size=1 only: the trait's apply_kd_gradient doesn't take input_ids, so the provider has to cache from the most-recent logits_for_batch call. With batches >1 only the last element gets a real gradient update. Phase 2e (PMAT-698, follow-up) generalizes via a fused-step trait method that takes input_ids + gradient together so all batch elements process correctly.

Tests
=====

All 58 aprender-train-distill lib tests pass under BOTH --features (none) and --features cuda. The CudaStudentProvider itself doesn't have a unit test: exercising it needs CUDA at test time, and the F-DISTILL-CUDA-STUDENT-001 falsifier (logits parity within 1e-6 vs a standalone forward_logits call) is integration-tested in Phase 4 production runs.

What's next
===========

With Phase 2d landed, Phase 3 (500-step E2E smoke run with real CudaTrainerTeacher + CudaStudentProvider) is unblocked. Phase 4 is the actual 50K-step distillation training run for albor-370m-v2.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* chore(distill): Phase 3 smoke-run dispatch + watch scripts for gx10

SPEC-DISTILL-001 Phase 3 dispatch artifact. Reproducible 500-step smoke run targeting paiml/albor-370m-v2 (MODEL-2 distillation).

scripts/dispatch-distill-phase-3-gx10.sh
========================================

Idempotent dispatch:

1. Local preflight (env, hosts, args echo).
2. Remote git pull main + build with --features cuda.
3. Remote apr pull of teacher + student-init (no-op if cached).
4. Background dispatch of `apr distill` with KD hyperparameters from SPEC-DISTILL-001 Phase 4 plan (T=4.0, alpha=0.3, LR=1.5e-5).
5. Captures evidence manifest locally (dispatch.json).

Overridable via env: GX10_HOST, GX10_USER, GX10_REPO_PATH, TEACHER_REPO, STUDENT_INIT, STEPS (default 500), BATCH_SIZE, LR, T, ALPHA, EVIDENCE_DIR, DRY_RUN.

scripts/watch-distill-phase-3-gx10.sh
=====================================

Tails the remote training log + filters for step counts, loss markers, panic/error lines, and F-DISTILL-* falsifier verdicts.

Blackwell JIT constraint
========================

gx10 is sm_121 (GB10). Memory rule `Blackwell JIT pre-warming bug blocks training` (PMAT-587 lineage) applies: custom PTX kernels in gpu_backward may crash on JIT. The dispatch script tolerates this: the forward path runs fine (CudaTrainerTeacher inference works); backward may or may not, depending on trueno 0.4.36 status. The script logs either outcome and the F-DISTILL-SMOKE-001 verdict is read from the launch log.

The fallback compute lane is lambda-vector (RTX 4090) where backward is proven via §82 P2-A. The script accepts a different GX10_HOST env to dispatch to either.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(distill): apr distill --backend flag + Phase 2c regression fix (SPEC-DISTILL-001 Phase 3-prep, PMAT-697)

Adds the --backend selector to `apr distill` so the dispatch script can target the real CudaTrainerTeacher + CudaStudentProvider once the upstream stack lands. Two pieces:

1. CLI surface (apr-cli)
========================

    apr distill <teacher> --student <path> --backend <fixture|cuda> ...

- default 'fixture': in-memory CPU stub providers (Phase 1 trait + FixtureTeacher / FixtureStudent). Useful for CI + plumbing tests.
- 'cuda': registered but errors with a clear "Phase 3-prep follow-up" diagnostic message. The structural placeholder is in place; the actual CudaXxxProvider construction (from .apr metadata → TransformerConfig → for_inference) is the second-half work scoped under the same PMAT-697.
- unknown value: errors with the enumeration of valid options.

Threaded through dispatch.rs (model_ops_commands.rs::Distill + dispatch::Commands → distill::run signature gains `backend: &str`). 9 test call sites in distill_include_01.rs updated to pass "fixture".

2. Phase 2c regression fix (aprender-train-distill::pipeline)
=============================================================

The Phase 2c refactor moved student parameter ownership into the StudentLogitsProvider trait, but the legacy FALSIFY-APR-DISTILL-TRAIN-001 contract asserts that the output student safetensors differ from the input by at least Q4K_TOLERANCE after training. That test broke when the pipeline stopped writing back into student_weights.

Fix: after the training loop, fetch the student's current logits one more time via `logits_for_batch` and project them back into the [batch, vocab] slice of student_weights via the legacy write_logits_to_weights helper. The projection is correctness-preserving for FixtureStudent (whose logits = its full parameter state). For Phase 2d's CudaStudentProvider it's a no-op (the GPU weight tensors are the real state; Phase 4 wires the real save_checkpoint path).

Tests
=====

- All 58 aprender-train-distill lib tests pass
- All 21 commands::distill::tests pass (including FALSIFY-APR-DISTILL-TRAIN-001 which was failing pre-fix)
- `cargo check -p apr-cli --features hf-hub` clean

What's next
===========

PMAT-697 second-half: the actual cuda-backend construction. Plumbing required:

- load .apr metadata at teacher_path, build TransformerConfig
- construct CudaTrainerTeacher::for_inference + CudaStudentProvider::for_training
- pass to Pipeline::with_teacher / with_student
- thread the resulting PipelineResult through distill::run

Effort: 4-8h. Then dispatch-distill-phase-3-gx10.sh produces a real F-DISTILL-SMOKE-001 verdict.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(distill): KD step orchestration - combined loss + KD logit gradient (SPEC-DISTILL-001 Phase 2)

* feat(distill): apr distill --backend cuda real construction (SPEC-DISTILL-001 Phase 3-prep, PMAT-697 second half)

Wires CudaTrainerTeacher + CudaStudentProvider construction into `apr distill --backend cuda`. End-to-end flow from CLI to GPU:

    apr distill teacher.apr --student student.apr --output runs/x/ \
        --backend cuda --temperature 4.0 --alpha 0.3 --epochs 3
      ↓
    run_cuda_backend()
      ↓
    AprV2Reader::from_bytes(teacher.apr) → AprV2Metadata
    TransformerConfig::from_apr_metadata(...) → teacher config
    (same for student)
      ↓
    CudaTrainerTeacher::for_inference(teacher_dir, teacher_config)
    CudaStudentProvider::for_training(student_dir, student_config)
      ↓
    DistillConfig::minimal(...) + per-CLI overrides
      ↓
    Pipeline::new(&config).with_teacher(...).with_student(...).execute()
      ↓
    PipelineResult { metrics, output_path, duration_seconds }

Output: trained student safetensors + distillation_metadata.json sidecar, JSON or text formatted depending on --json.

Cargo.toml
==========

apr-cli's cuda feature now propagates to aprender-train-distill/cuda so the CudaTrainerTeacher + CudaStudentProvider types are reachable from the CLI when --features cuda,training is built.

Send-bound removal
==================

CudaTransformerTrainer holds Rc<...> internally, so CudaStudentProvider + CudaTrainerTeacher can't be `Send`. Dropped the `+ Send` requirement from Pipeline's Box<dyn ...> fields. The pipeline doesn't move providers across threads, so the constraint was unnecessary anyway.

Tests
=====

- All 58 aprender-train-distill lib tests pass under both feature configs
- All 21 commands::distill::tests pass
- `cargo check -p apr-cli --features cuda,training,hf-hub` clean
- `cargo check -p apr-cli --features hf-hub` (no cuda) clean

The --backend cuda path itself is integration-tested via dispatch-distill-phase-3-gx10.sh on real CUDA hardware. F-DISTILL-SMOKE-001 discharge requires gx10 (Blackwell) or lambda-vector (RTX 4090).

What's next
===========

With this PR landed, the dispatch script chain:

    scripts/dispatch-distill-phase-3-gx10.sh STEPS=500

runs end-to-end:

1. ssh gx10, git pull main, cargo build --features cuda
2. apr pull teacher + student-init
3. apr distill ... --backend cuda --temperature 4.0 --alpha 0.3
4. F-DISTILL-SMOKE-001 verdict (initial_loss > final_loss) in launch.log

Phase 2e (PMAT-698, follow-up): generalize CudaStudentProvider beyond batch_size=1 via a fused-step trait method.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
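
The `--backend` selection rule described in that commit list (two accepted values, and an error that enumerates the valid options for anything else) can be sketched with the standard `FromStr` trait. The `Backend` enum and the exact error wording here are hypothetical; the commit message only says the value is threaded through as `backend: &str`.

```rust
use std::str::FromStr;

/// Hypothetical typed form of the `--backend <fixture|cuda>` value.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Backend {
    Fixture,
    Cuda,
}

impl FromStr for Backend {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "fixture" => Ok(Backend::Fixture),
            "cuda" => Ok(Backend::Cuda),
            // Unknown values list the accepted options in the message.
            other => Err(format!(
                "unknown backend '{other}'; valid options: fixture, cuda"
            )),
        }
    }
}
```

Parsing once at the CLI boundary, for example with `"cuda".parse::<Backend>()`, would let downstream code match on the enum instead of comparing strings.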

… 4 RUNNING (#1851)

Captures the live state of the distillation epic as of 2026-05-20:

    Phase 1  - Teacher provider               ✅ MERGED (#1786, #1787)
    Phase 2  - Student fwd/bwd + KD           ✅ MERGED (#1788–#1797)
    Phase 3  - E2E smoke on Blackwell GB10    ✅ DISCHARGED (#1828)
    Phase 3b - seq_len=256 scale verify       ✅ DISCHARGED (#1833)
    Phase 4  - 50K training (Stage D)         🟡 RUNNING (PID 196378, gx10)
    Phase 5  - HumanEval pass@1               ⏳ ready (#1847)
    Phase 6  - Publish v2                     ⏳ ready (#1848)

Inserts a new top-of-doc status table that points at:

- The 11-PR Blackwell cascade (post-mortem in blackwell-cascade-postmortem.md)
- Stage C real-corpus dispatch result (15.61 → 6.01 over 124 steps)
- Stage D running with ETA ~22h from 2026-05-20 13:43 UTC
- Phase 5/6 turnkey scripts ready post-D

This captures institutional knowledge for the team and future sessions: the spec doc reflects what's actually shipped rather than the original plan from 2026-05-18 when the epic was still scaffolded.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

noahgift enabled auto-merge (squash) May 18, 2026 11:03

noahgift and others added 2 commits May 18, 2026 14:15

noahgift force-pushed the feat/distill-phase-2-kd-student-backward branch from c3fe54b to 828492d Compare May 18, 2026 12:17

noahgift mentioned this pull request May 18, 2026

feat(distill): StudentLogitsProvider trait + FixtureStudent (SPEC-DISTILL-001 Phase 2b) #1791

Closed

4 tasks

Merge branch 'main' into feat/distill-phase-2-kd-student-backward

389725c

noahgift mentioned this pull request May 18, 2026

chore(distill): Phase 3 smoke-run dispatch + watch scripts for gx10 (SPEC-DISTILL-001) #1795

Closed

3 tasks

noahgift added 2 commits May 18, 2026 15:42

Merge branch 'main' into feat/distill-phase-2-kd-student-backward

b756e79

Merge branch 'main' into feat/distill-phase-2-kd-student-backward

476db84

noahgift merged commit 11a0ba7 into main May 18, 2026
10 checks passed

noahgift deleted the feat/distill-phase-2-kd-student-backward branch May 18, 2026 15:22

noahgift mentioned this pull request May 20, 2026

docs(spec): SPEC-DISTILL-001 — Phases 1-3 CLOSED, Phase 4 RUNNING (2026-05-20) #1851

Merged

noahgift merged 5 commits into main from feat/distill-phase-2-kd-student-backward
