feat(distill): KD step orchestration - combined loss + KD logit gradient (SPEC-DISTILL-001 Phase 2) #1788
Merged
…L-001 Phase 1b, PMAT-693)
Adds the real teacher backend the SPEC-DISTILL-001 Phase 1b ticket
scopes: CudaTrainerTeacher wraps entrenar's CudaTransformerTrainer in
inference-only mode, delegates logits_for_batch to forward_logits()
per batch element, returns shape [batch, vocab_size].
Gated behind a new `cuda` feature on aprender-train-distill that
propagates to entrenar/cuda. Without the feature, only FixtureTeacher
(Phase 1) is available — sufficient for unit tests but not for real
training. Real distillation runs (Phase 4) require --features cuda.
Surface
=======
    #[cfg(feature = "cuda")]
    pub struct CudaTrainerTeacher { /* wraps CudaTransformerTrainer */ }
    impl CudaTrainerTeacher {
        pub fn for_inference(
            checkpoint_dir: impl AsRef<Path>,
            model_config: TransformerConfig,
        ) -> Result<Self> { ... }
    }
    impl TeacherLogitsProvider for CudaTrainerTeacher {
        fn vocab_size(&self) -> usize { ... }
        fn logits_for_batch(&mut self, input_ids: &[Vec<u32>])
            -> Result<Vec<Vec<f32>>> { ... }
    }
Defensive checks
================
- forward_logits returning None → EntrenarError::Internal with a clear
  "likely missing weights or CUDA init failure" message
- logits.len() != vocab_size → EntrenarError::Internal flagging
  TransformerConfig vs checkpoint vocab drift (the common silent failure
  mode for loaded-from-disk distillation runs)
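The two checks above can be sketched as a small standalone validator. This is a minimal sketch, not the code in this commit: `TeacherError` is a hypothetical stand-in for `EntrenarError::Internal`, and `check_teacher_logits` is an illustrative name that does not appear in the source.

```rust
/// Hypothetical stand-in for `EntrenarError::Internal`; the real error type
/// lives in entrenar and is not reproduced here.
#[derive(Debug, PartialEq)]
pub enum TeacherError {
    Internal(String),
}

/// Validates the raw output of one forward pass for a single batch element.
/// `logits` models the `Option` that forward_logits is described as returning.
pub fn check_teacher_logits(
    logits: Option<Vec<f32>>,
    vocab_size: usize,
) -> Result<Vec<f32>, TeacherError> {
    // Check 1: a missing result is surfaced as an error instead of a panic.
    let logits = logits.ok_or_else(|| {
        TeacherError::Internal(
            "forward_logits returned None: likely missing weights or CUDA init failure"
                .to_string(),
        )
    })?;
    // Check 2: a length mismatch means the config and the checkpoint disagree
    // on the vocabulary size.
    if logits.len() != vocab_size {
        return Err(TeacherError::Internal(format!(
            "got {} logits but configured vocab_size is {}: TransformerConfig vs checkpoint vocab drift",
            logits.len(),
            vocab_size
        )));
    }
    Ok(logits)
}
```

Running both checks per batch element keeps a misconfigured checkpoint from silently producing rows of the wrong width.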
Tests
=====
All 6 teacher_provider tests pass under both --features (none) and
--features cuda. Compile gates verified:
    cargo check -p aprender-train-distill                   # clean
    cargo check -p aprender-train-distill --features cuda   # clean
What's next
===========
Phase 2 (PMAT-694, follow-up): wire CudaTransformerTrainer's KD-loss
backward into the student path — replaces the remaining
build_synthetic_logits call site for the student in pipeline.rs::train().
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ent (SPEC-DISTILL-001 Phase 2)
Wires Phase 1's teacher provider into a per-batch KD orchestration step
that produces both the combined α·CE + (1-α)·T²·KL scalar loss (for
telemetry) and the KD-aware logit-space gradient (Phase 2b plug point).
What this PR adds
=================
New module `aprender-train-distill::kd_step`:
    pub fn kd_loss(
        student_logits: &[f32],
        teacher_logits: &[f32],
        label: usize,
        temperature: f32,
        alpha: f32,
    ) -> f32;
    pub fn kd_logit_gradient(
        student_logits: &[f32],
        teacher_logits: &[f32],
        label: usize,
        temperature: f32,
        alpha: f32,
    ) -> Vec<f32>;
    pub fn kd_step<F: FnMut(&[u32]) -> Vec<f32>>(
        teacher: &mut dyn TeacherLogitsProvider,
        input_ids: &[Vec<u32>],
        labels: &[usize],
        temperature: f32,
        alpha: f32,
        compute_student_logits: F,
    ) -> Result<(f32, Vec<Vec<f32>>)>;
The gradient follows the Hinton et al. 2015 §2 derivation:
    ∂L/∂s = α · (softmax(s) - one_hot(label))
          + (1-α) · T · (softmax(s/T) - softmax(t/T))
(The KL term carries a factor of T, not T²: differentiating softmax(s/T)
with respect to s contributes 1/T by the chain rule, which cancels one of
the two T factors in the loss.)
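To make the formula concrete, here is a minimal CPU sketch of the combined loss and its logit-space gradient. It is not the module's actual code: the `_sketch` names and the `softmax_t` helper are illustrative, and the KL direction (teacher relative to student) is an assumption chosen to match the gradient above.

```rust
/// Numerically stable softmax of `logits / temperature`.
fn softmax_t(logits: &[f32], temperature: f32) -> Vec<f32> {
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits
        .iter()
        .map(|&x| ((x - max) / temperature).exp())
        .collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

/// α·CE + (1-α)·T²·KL(teacher_T || student_T) for one example.
pub fn kd_loss_sketch(
    student: &[f32],
    teacher: &[f32],
    label: usize,
    temperature: f32,
    alpha: f32,
) -> f32 {
    let p = softmax_t(student, 1.0);
    let ce = -p[label].max(f32::MIN_POSITIVE).ln();
    let ps = softmax_t(student, temperature);
    let pt = softmax_t(teacher, temperature);
    let kl: f32 = pt
        .iter()
        .zip(ps.iter())
        .map(|(t, s)| {
            if *t > 0.0 {
                *t * (*t / (*s).max(f32::MIN_POSITIVE)).ln()
            } else {
                0.0
            }
        })
        .sum();
    alpha * ce + (1.0 - alpha) * temperature * temperature * kl
}

/// ∂L/∂s = α·(softmax(s) - one_hot) + (1-α)·T·(softmax(s/T) - softmax(t/T)).
pub fn kd_logit_gradient_sketch(
    student: &[f32],
    teacher: &[f32],
    label: usize,
    temperature: f32,
    alpha: f32,
) -> Vec<f32> {
    let p = softmax_t(student, 1.0);
    let ps = softmax_t(student, temperature);
    let pt = softmax_t(teacher, temperature);
    (0..student.len())
        .map(|i| {
            let one_hot = if i == label { 1.0 } else { 0.0 };
            alpha * (p[i] - one_hot) + (1.0 - alpha) * temperature * (ps[i] - pt[i])
        })
        .collect()
}
```

With alpha=1 the KL term vanishes and the gradient reduces to the plain cross-entropy gradient; with alpha=0 and identical student and teacher logits, every component is zero. These are the two properties the KDSTEP-001 and KDSTEP-002 falsifiers pin.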
Scope: Phase 2 vs Phase 2b
==========================
Phase 2 (this PR) ships the orchestration math, all in pure Rust on the
CPU. The output `Vec<Vec<f32>>` of per-batch gradients is what Phase 2b
will plumb into `CudaTransformerTrainer.forward_backward_kd_batch` as
the backward-pass seed (replacing the CE-only gradient currently used by
forward_backward_batch).
Splitting Phase 2 into 2a/2b lets us land the orchestration layer + its
tests now, separate from extending the complex GPU trainer code path.
Falsifiers pinned
=================
3 KD-step falsifiers + 6 sanity tests, all passing:
- F-DISTILL-KDSTEP-001 (alpha=1 → pure CE)
- F-DISTILL-KDSTEP-002 (student==teacher → zero KL gradient under alpha=0)
- F-DISTILL-KDSTEP-003 (loss monotone in student-teacher divergence)
- softmax unit-sum + non-negative
- CE gradient correct sign at label vs non-label positions
- kd_step orchestration end-to-end
- kd_step empty-batch sanity
- kd_step vocab-mismatch error path
- kd_loss alpha=1 collapses to pure CE
All 50 aprender-train-distill lib tests pass (was 41 — 9 new).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request on May 18, 2026
…TILL-001 Phase 2b)
Mirrors Phase 1's TeacherLogitsProvider for the student side. Besides
vocab_size, the student trait has two methods: logits_for_batch
(forward) and apply_kd_gradient (backward + optimizer step).
FixtureStudent implements both for CPU-only unit testing; Phase 2c will
add a CudaStudentProvider that wraps CudaTransformerTrainer.
What this PR adds
=================
    pub trait StudentLogitsProvider {
        fn vocab_size(&self) -> usize;
        fn logits_for_batch(&mut self, input_ids: &[Vec<u32>])
            -> Result<Vec<Vec<f32>>>;
        fn apply_kd_gradient(&mut self, gradient: &[Vec<f32>])
            -> Result<()>;
    }
    pub struct FixtureStudent {
        vocab_size: usize,
        logits: Vec<f32>,    // current student parameters
        learning_rate: f32,
    }
FixtureStudent's apply_kd_gradient averages the gradient across batch
elements (canonical SGD batch averaging) and subtracts the scaled
gradient from its internal logits buffer. This isn't a real model —
it's a logit-space optimization fixture that lets us validate the
KD pipeline's gradient direction is correct without needing CUDA.
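The batch-averaged update described above can be sketched as follows. This is an illustration under assumptions, not the fixture's actual code: `FixtureStudentSketch`, its constructor, the `logits()` accessor, the `String` error type, and the treatment of an empty batch as a no-op are all hypothetical.

```rust
/// Illustrative stand-in for FixtureStudent: the "parameters" are one logit row.
pub struct FixtureStudentSketch {
    vocab_size: usize,
    logits: Vec<f32>,
    learning_rate: f32,
}

impl FixtureStudentSketch {
    /// Starts from uniform (all-zero) logits.
    pub fn new(vocab_size: usize, learning_rate: f32) -> Self {
        Self { vocab_size, logits: vec![0.0; vocab_size], learning_rate }
    }

    pub fn logits(&self) -> &[f32] {
        &self.logits
    }

    /// Averages per-example gradients over the batch, then takes one SGD step:
    /// logits -= learning_rate * mean(gradient rows).
    pub fn apply_kd_gradient(&mut self, gradient: &[Vec<f32>]) -> Result<(), String> {
        if gradient.is_empty() {
            return Ok(()); // assumption: an empty batch leaves the logits unchanged
        }
        if let Some(bad) = gradient.iter().find(|g| g.len() != self.vocab_size) {
            return Err(format!(
                "gradient row has {} entries, expected vocab_size {}",
                bad.len(),
                self.vocab_size
            ));
        }
        // Folding 1/batch into the step size is equivalent to averaging first.
        let scale = self.learning_rate / gradient.len() as f32;
        for row in gradient {
            for (logit, g) in self.logits.iter_mut().zip(row.iter()) {
                *logit -= scale * *g;
            }
        }
        Ok(())
    }
}
```

For example, with learning rate 0.5 and gradient rows `[1, -1]` and `[3, -3]`, the mean gradient is `[2, -2]`, so logits that start at zero move to `[-1, 1]`.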
Falsifiers pinned
=================
7 student_provider tests (2 falsifiers + 5 sanity tests), all passing:
- F-DISTILL-STUDENT-001 — one KD step moves student logits toward
teacher's preferred token. Setup: uniform student, teacher prefers
token 5, alpha=0 (pure KL signal). After one step, student logit
at index 5 must be strictly greater than before.
- F-DISTILL-STUDENT-002 — 10 sequential KD steps strictly decrease
per-step KD loss. With LR=0.5, loss after 10 steps < 90% of initial.
Validates the gradient direction is correct (descent, not ascent).
Plus 5 sanity tests covering vocab_size reporting, batch broadcast,
shape validation, in-place logit update, and batch averaging math.
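The two falsifiers can be illustrated with a self-contained logit-space descent loop. This is a sketch under stated assumptions, not the test code in this commit: the function names, the 8-token vocabulary, the teacher logit of 4.0 at index 5, and T=2.0 are hypothetical choices. Only alpha=0, LR=0.5, and the 10-step count come from the description above.

```rust
/// Numerically stable softmax of `logits / temperature`.
fn softmax_t(logits: &[f32], temperature: f32) -> Vec<f32> {
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits
        .iter()
        .map(|&x| ((x - max) / temperature).exp())
        .collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

/// Pure-KL loss (alpha = 0): T² · KL(softmax(t/T) || softmax(s/T)).
fn kl_loss(student: &[f32], teacher: &[f32], temperature: f32) -> f32 {
    let ps = softmax_t(student, temperature);
    let pt = softmax_t(teacher, temperature);
    let kl: f32 = pt
        .iter()
        .zip(ps.iter())
        .map(|(t, s)| {
            if *t > 0.0 {
                *t * (*t / (*s).max(f32::MIN_POSITIVE)).ln()
            } else {
                0.0
            }
        })
        .sum();
    temperature * temperature * kl
}

/// Matching gradient (alpha = 0): T · (softmax(s/T) - softmax(t/T)).
fn kl_gradient(student: &[f32], teacher: &[f32], temperature: f32) -> Vec<f32> {
    let ps = softmax_t(student, temperature);
    let pt = softmax_t(teacher, temperature);
    ps.iter().zip(pt.iter()).map(|(s, t)| temperature * (*s - *t)).collect()
}

/// Runs `steps` SGD updates on the student logits in place and returns the
/// loss before the first step followed by the loss after each step.
pub fn run_kd_descent(
    student: &mut [f32],
    teacher: &[f32],
    temperature: f32,
    learning_rate: f32,
    steps: usize,
) -> Vec<f32> {
    let mut losses = vec![kl_loss(student, teacher, temperature)];
    for _ in 0..steps {
        let grad = kl_gradient(student, teacher, temperature);
        for (s, g) in student.iter_mut().zip(grad.iter()) {
            *s -= learning_rate * *g;
        }
        losses.push(kl_loss(student, teacher, temperature));
    }
    losses
}
```

Starting from uniform student logits with a teacher that prefers token 5, the first step raises the student logit at index 5, and the recorded losses give a direct way to check both the strict per-step decrease and the 90% threshold.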
Architecture
============
Stacks on top of #1788 (kd_step). Pipeline integration that uses both
TeacherLogitsProvider + StudentLogitsProvider + kd_step is Phase 2c.
Phase 2c (PMAT-696, follow-up): CudaStudentProvider that wraps
CudaTransformerTrainer for production runs. Once it lands, end-to-end
GPU distillation is unblocked.
Tests
=====
All 57 aprender-train-distill lib tests pass (was 50 — 7 new).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request on May 18, 2026
…TILL-001 Phase 3-prep second half, PMAT-697) (#1797)

* feat(distill): CudaTrainerTeacher - real teacher backend (SPEC-DISTILL-001 Phase 1b, PMAT-693)

* feat(distill): KD step orchestration - combined loss + KD logit gradient (SPEC-DISTILL-001 Phase 2)

* feat(distill): StudentLogitsProvider trait + FixtureStudent (SPEC-DISTILL-001 Phase 2b)

* feat(distill): pipeline integration teacher + student + kd_step end-to-end (SPEC-DISTILL-001 Phase 2c)
Rewrites Pipeline::train() to use TeacherLogitsProvider +
StudentLogitsProvider + kd_step end-to-end. Replaces the
build_synthetic_logits stubs on both sides with real abstraction calls.
Pipeline gains a `student: Box<dyn StudentLogitsProvider + Send>` field
and a `Pipeline::with_student()` builder mirroring `with_teacher()`.
Default backends are FixtureTeacher + FixtureStudent so legacy tests
behave identically. Phase 2d swaps in CudaStudentProvider.
Training loop, per step:
1. dummy_batch = [[0u32]; batch_size] (Phase 4 plugs in real tokens)
2. teacher.logits_for_batch(dummy_batch) → teacher logits
3. kd_step(teacher, dummy_batch, labels, T, α, student_logits_closure)
   → (scalar loss, per-batch logit gradients)
4. student.apply_kd_gradient(grads) → student updates
bracketed by initial-loss + final-loss measurements via the new
kd_step_loss_for_pipeline helper.
Falsifiers
==========
F-DISTILL-PIPELINE-001 (new), end-to-end falsifier: runs
Pipeline::execute() with FixtureTeacher + FixtureStudent + 3 epochs and
asserts final_loss < initial_loss. Pins the entire data flow:
teacher → student → kd_step → apply_kd_gradient. Any broken link either
flatlines or increases the loss.
Phase 2d plug points
====================
- The dummy_batch in train() is the natural insertion point for a real
  dataset iterator (Phase 4 work).
- The student-logits closure in kd_step is the natural insertion point
  for CudaStudentProvider's forward_logits (Phase 2d).
- The apply_kd_gradient call is the natural insertion point for
  CudaStudentProvider's forward_backward_kd_batch path (Phase 2d).
Tests
=====
All 58 aprender-train-distill lib tests pass (was 57; 1 new
F-DISTILL-PIPELINE-001 integration falsifier).
The four legacy helpers (build_synthetic_logits, kd_gradient,
softmax_2d, write_logits_to_weights) are marked #[allow(dead_code)] for
back-compat until Phase 2d's wiring fully replaces the on-disk weights
round-trip.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(distill): CudaStudentProvider + forward_backward_with_grad (SPEC-DISTILL-001 Phase 2d)
Closes the last engineering gap before Phase 3 (500-step E2E smoke).
Real GPU student backend wraps CudaTransformerTrainer and implements the
Phase 2b StudentLogitsProvider trait.
Two complementary pieces
========================
1. New pub method on CudaTransformerTrainer (cuda_trainer.rs:1247-1311):
       pub fn forward_backward_with_grad(
           &mut self,
           input_ids: &[u32],
           logit_gradient: &[f32],
       ) -> Option<()>
   Runs forward, uploads the caller-supplied logit gradient into the
   last-position slice of logits_buf (replacing what gpu_forward wrote),
   runs gpu_backward (which back-props from the in-place gradient
   through the transformer stack), and runs embed_backward. Matches the
   KAIZEN-052 in-place gradient convention that fused_cross_entropy_cuda
   uses for the CE path, except the gradient now comes from
   `kd_step::kd_logit_gradient` (Phase 2a).
2. New CudaStudentProvider in aprender-train-distill (cuda-gated):
       pub struct CudaStudentProvider {
           trainer: CudaTransformerTrainer,
           vocab_size: usize,
           last_input_ids: Option<Vec<u32>>,
       }
   impl StudentLogitsProvider for CudaStudentProvider:
   - logits_for_batch → trainer.forward_logits per batch element
     + caches last_input_ids
   - apply_kd_gradient → trainer.forward_backward_with_grad on the
     cached last input_ids with the last gradient
Phase 2d limitation
===================
batch_size=1 only: the trait's apply_kd_gradient doesn't take
input_ids, so the provider has to cache from the most-recent
logits_for_batch call. With batches >1 only the last element gets a real
gradient update. Phase 2e (PMAT-698, follow-up) generalizes via a
fused-step trait method that takes input_ids + gradient together so all
batch elements process correctly.
Tests
=====
All 58 aprender-train-distill lib tests pass under BOTH --features (none)
and --features cuda. The CudaStudentProvider itself doesn't have a unit
test: exercising it needs CUDA at test time, and the
F-DISTILL-CUDA-STUDENT-001 falsifier (logits parity within 1e-6 vs a
standalone forward_logits call) is integration-tested in Phase 4
production runs.
What's next
===========
With Phase 2d landed, Phase 3 (500-step E2E smoke run with real
CudaTrainerTeacher + CudaStudentProvider) is unblocked. Phase 4 is the
actual 50K-step distillation training run for albor-370m-v2.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* chore(distill): Phase 3 smoke-run dispatch + watch scripts for gx10
SPEC-DISTILL-001 Phase 3 dispatch artifact. Reproducible 500-step smoke
run targeting paiml/albor-370m-v2 (MODEL-2 distillation).
scripts/dispatch-distill-phase-3-gx10.sh
========================================
Idempotent dispatch:
1. Local preflight (env, hosts, args echo).
2. Remote git pull main + build with --features cuda.
3. Remote apr pull of teacher + student-init (no-op if cached).
4. Background dispatch of `apr distill` with KD hyperparameters from
   SPEC-DISTILL-001 Phase 4 plan (T=4.0, alpha=0.3, LR=1.5e-5).
5. Captures evidence manifest locally (dispatch.json).
Overridable via env: GX10_HOST, GX10_USER, GX10_REPO_PATH, TEACHER_REPO,
STUDENT_INIT, STEPS (default 500), BATCH_SIZE, LR, T, ALPHA,
EVIDENCE_DIR, DRY_RUN.
scripts/watch-distill-phase-3-gx10.sh
=====================================
Tails the remote training log + filters for step counts, loss markers,
panic/error lines, and F-DISTILL-* falsifier verdicts.
Blackwell JIT constraint
========================
gx10 is sm_121 (GB10). Memory rule `Blackwell JIT pre-warming bug blocks
training` (PMAT-587 lineage) applies: custom PTX kernels in gpu_backward
may crash on JIT. The dispatch script tolerates this: forward path runs
fine (CudaTrainerTeacher inference works); backward may or may not
depending on trueno 0.4.36 status. The script logs either outcome and
the F-DISTILL-SMOKE-001 verdict is read from the launch log.
The fallback compute lane is lambda-vector (RTX 4090) where backward is
proven via §82 P2-A. The script accepts a different GX10_HOST env to
dispatch to either.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(distill): apr distill --backend flag + Phase 2c regression fix (SPEC-DISTILL-001 Phase 3-prep, PMAT-697)
Adds the --backend selector to `apr distill` so the dispatch script can
target the real CudaTrainerTeacher + CudaStudentProvider once the
upstream stack lands. Two pieces:
1. CLI surface (apr-cli)
========================
    apr distill <teacher> --student <path> --backend <fixture|cuda> ...
- default 'fixture': in-memory CPU stub providers (Phase 1 trait +
  FixtureTeacher / FixtureStudent). Useful for CI + plumbing tests.
- 'cuda': registered but errors with a clear "Phase 3-prep follow-up"
  diagnostic message. The structural placeholder is in place; the actual
  CudaXxxProvider construction (from .apr metadata → TransformerConfig →
  for_inference) is the second-half work scoped under the same PMAT-697.
- unknown value: errors with the enumeration of valid options.
Threaded through dispatch.rs (model_ops_commands.rs::Distill +
dispatch::Commands → distill::run signature gains `backend: &str`).
9 test call sites in distill_include_01.rs updated to pass "fixture".
2. Phase 2c regression fix (aprender-train-distill::pipeline)
=============================================================
The Phase 2c refactor moved student parameter ownership into the
StudentLogitsProvider trait, but the legacy FALSIFY-APR-DISTILL-TRAIN-001
contract asserts that the output student safetensors differ from the
input by at least Q4K_TOLERANCE after training. That test broke when the
pipeline stopped writing back into student_weights.
Fix: after the training loop, fetch the student's current logits one
more time via `logits_for_batch` and project them back into the
[batch, vocab] slice of student_weights via the legacy
write_logits_to_weights helper. The projection is correctness-preserving
for FixtureStudent (whose logits = its full parameter state). For Phase
2d's CudaStudentProvider it's a no-op (the GPU weight tensors are the
real state; Phase 4 wires the real save_checkpoint path).
Tests
=====
- All 58 aprender-train-distill lib tests pass
- All 21 commands::distill::tests pass (including
  FALSIFY-APR-DISTILL-TRAIN-001 which was failing pre-fix)
- `cargo check -p apr-cli --features hf-hub` clean
What's next
===========
PMAT-697 second-half: the actual cuda-backend construction. Plumbing
required:
- load .apr metadata at teacher_path, build TransformerConfig
- construct CudaTrainerTeacher::for_inference +
  CudaStudentProvider::for_training
- pass to Pipeline::with_teacher / with_student
- thread the resulting PipelineResult through distill::run
Effort: 4-8h. Then dispatch-distill-phase-3-gx10.sh produces a real
F-DISTILL-SMOKE-001 verdict.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(distill): apr distill --backend cuda real construction (SPEC-DISTILL-001 Phase 3-prep, PMAT-697 second half)
Wires CudaTrainerTeacher + CudaStudentProvider construction into
`apr distill --backend cuda`. End-to-end flow from CLI to GPU:
    apr distill teacher.apr --student student.apr --output runs/x/ \
        --backend cuda --temperature 4.0 --alpha 0.3 --epochs 3
      ↓
    run_cuda_backend()
      ↓
    AprV2Reader::from_bytes(teacher.apr) → AprV2Metadata
    TransformerConfig::from_apr_metadata(...) → teacher config
    (same for student)
      ↓
    CudaTrainerTeacher::for_inference(teacher_dir, teacher_config)
    CudaStudentProvider::for_training(student_dir, student_config)
      ↓
    DistillConfig::minimal(...) + per-CLI overrides
      ↓
    Pipeline::new(&config).with_teacher(...).with_student(...).execute()
      ↓
    PipelineResult { metrics, output_path, duration_seconds }
Output: trained student safetensors + distillation_metadata.json
sidecar, JSON or text formatted depending on --json.
Cargo.toml
==========
apr-cli's cuda feature now propagates to aprender-train-distill/cuda so
the CudaTrainerTeacher + CudaStudentProvider types are reachable from
the CLI when --features cuda,training is built.
Send-bound removal
==================
CudaTransformerTrainer holds Rc<...> internally, so CudaStudentProvider
+ CudaTrainerTeacher can't be `Send`. Dropped the `+ Send` requirement
from Pipeline's Box<dyn ...> fields; the pipeline doesn't move providers
across threads, so the constraint was unnecessary anyway.
Tests
=====
- All 58 aprender-train-distill lib tests pass under both feature configs
- All 21 commands::distill::tests pass
- `cargo check -p apr-cli --features cuda,training,hf-hub` clean
- `cargo check -p apr-cli --features hf-hub` (no cuda) clean
The --backend cuda path itself is integration-tested via
dispatch-distill-phase-3-gx10.sh on real CUDA hardware.
F-DISTILL-SMOKE-001 discharge requires gx10 (Blackwell) or lambda-vector
(RTX 4090).
What's next
===========
With this PR landed, the dispatch script chain:
    scripts/dispatch-distill-phase-3-gx10.sh STEPS=500
runs end-to-end:
1. ssh gx10, git pull main, cargo build --features cuda
2. apr pull teacher + student-init
3. apr distill ... --backend cuda --temperature 4.0 --alpha 0.3
4. F-DISTILL-SMOKE-001 verdict (initial_loss > final_loss) in launch.log
Phase 2e (PMAT-698, follow-up): generalize CudaStudentProvider beyond
batch_size=1 via a fused-step trait method.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
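The `--backend <fixture|cuda>` selection rules described in the PMAT-697 commit (default 'fixture', explicit 'cuda', and an error that enumerates the valid options for anything else) can be sketched as a small parser. This is only an illustration: the commit message says the real CLI threads a `backend: &str` through `distill::run`, so the `Backend` enum, its `VALID` constant, and the error wording here are hypothetical.

```rust
use std::str::FromStr;

/// Hypothetical typed view of the --backend value.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    Fixture,
    Cuda,
}

impl Backend {
    /// Accepted spellings, reused in the error message.
    pub const VALID: [&'static str; 2] = ["fixture", "cuda"];
}

impl Default for Backend {
    /// The fixture backend is the default when no --backend is given.
    fn default() -> Self {
        Backend::Fixture
    }
}

impl FromStr for Backend {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "fixture" => Ok(Backend::Fixture),
            "cuda" => Ok(Backend::Cuda),
            // Unknown values fail with the full list of valid options.
            other => Err(format!(
                "unknown --backend '{}'; valid options: {}",
                other,
                Backend::VALID.join(", ")
            )),
        }
    }
}
```

Parsing into an enum at the CLI boundary is one possible design; passing the raw string through, as the commit describes, defers the same validation to the point where the backend is constructed.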
Splitting Phase 2 into 2a/2b lets us land the orchestration layer + its tests now, separate from extending the complex GPU trainer code path.
Falsifiers pinned
=================
3 KD-step falsifiers + 6 sanity tests, all passing:
- F-DISTILL-KDSTEP-001 (alpha=1 → pure CE)
- F-DISTILL-KDSTEP-002 (student==teacher → zero KL gradient under alpha=0)
- F-DISTILL-KDSTEP-003 (loss monotone in student-teacher divergence)
- softmax unit-sum + non-negative
- CE gradient correct sign at label vs non-label positions
- kd_step orchestration end-to-end
- kd_step empty-batch sanity
- kd_step vocab-mismatch error path
- kd_loss alpha=1 collapses to pure CE
All 50 aprender-train-distill lib tests pass (was 41; 9 new).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat(distill): apr distill --backend cuda real construction (SPEC-DISTILL-001 Phase 3-prep, PMAT-697 second half)
Wires CudaTrainerTeacher + CudaStudentProvider construction into `apr distill --backend cuda`. End-to-end flow from CLI to GPU:
    apr distill teacher.apr --student student.apr --output runs/x/ \
        --backend cuda --temperature 4.0 --alpha 0.3 --epochs 3
      ↓
    run_cuda_backend()
      ↓
    AprV2Reader::from_bytes(teacher.apr) → AprV2Metadata
    TransformerConfig::from_apr_metadata(...) → teacher config
    (same for student)
      ↓
    CudaTrainerTeacher::for_inference(teacher_dir, teacher_config)
    CudaStudentProvider::for_training(student_dir, student_config)
      ↓
    DistillConfig::minimal(...) + per-CLI overrides
      ↓
    Pipeline::new(&config).with_teacher(...).with_student(...).execute()
      ↓
    PipelineResult { metrics, output_path, duration_seconds }
Output: trained student safetensors + distillation_metadata.json sidecar, JSON or text formatted depending on --json.
Cargo.toml
==========
apr-cli's cuda feature now propagates to aprender-train-distill/cuda so the CudaTrainerTeacher + CudaStudentProvider types are reachable from the CLI when --features cuda,training is built.
Send-bound removal
==================
CudaTransformerTrainer holds Rc<...> internally, so CudaStudentProvider + CudaTrainerTeacher can't be `Send`. Dropped the `+ Send` requirement from Pipeline's Box<dyn ...> fields; the pipeline doesn't move providers across threads, so the constraint was unnecessary anyway.
Tests
=====
- All 58 aprender-train-distill lib tests pass under both feature configs
- All 21 commands::distill::tests pass
- `cargo check -p apr-cli --features cuda,training,hf-hub` clean
- `cargo check -p apr-cli --features hf-hub` (no cuda) clean
The --backend cuda path itself is integration-tested via dispatch-distill-phase-3-gx10.sh on real CUDA hardware. F-DISTILL-SMOKE-001 discharge requires gx10 (Blackwell) or lambda-vector (RTX 4090).
What's next
===========
With this PR landed, the dispatch script chain:
    scripts/dispatch-distill-phase-3-gx10.sh STEPS=500
runs end-to-end:
1. ssh gx10, git pull main, cargo build --features cuda
2. apr pull teacher + student-init
3. apr distill ... --backend cuda --temperature 4.0 --alpha 0.3
4. F-DISTILL-SMOKE-001 verdict (initial_loss > final_loss) in launch.log
Phase 2e (PMAT-698, follow-up): generalize CudaStudentProvider beyond batch_size=1 via a fused-step trait method.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
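The Send-bound removal described in the commit message above can be illustrated with a small sketch. Every name here (`ProviderSketch`, `RcBackedProvider`, `PipelineSketch`) is a hypothetical stand-in, not the PR's actual types; the only thing it aims to show is that a `Box<dyn Trait>` field without `+ Send` accepts a provider that holds an `Rc`.

```rust
use std::rc::Rc;

/// Hypothetical stand-in for a logits-provider trait (not the PR's trait).
trait ProviderSketch {
    fn vocab_size(&self) -> usize;
}

/// Holds an `Rc`, so this type is `!Send`, similar in spirit to a trainer
/// that keeps `Rc<...>` handles internally.
struct RcBackedProvider {
    weights: Rc<Vec<f32>>,
}

impl ProviderSketch for RcBackedProvider {
    fn vocab_size(&self) -> usize {
        self.weights.len()
    }
}

/// Pipeline-like holder whose field has no `+ Send` bound.
/// If the field were `Option<Box<dyn ProviderSketch + Send>>`, boxing an
/// `RcBackedProvider` into it would fail to compile, because
/// `Rc<Vec<f32>>` does not implement `Send`.
struct PipelineSketch {
    teacher: Option<Box<dyn ProviderSketch>>,
}

impl PipelineSketch {
    fn new() -> Self {
        PipelineSketch { teacher: None }
    }

    /// Builder-style setter, loosely modeled on the `with_teacher` call
    /// named in the commit message (signature assumed for illustration).
    fn with_teacher(mut self, teacher: Box<dyn ProviderSketch>) -> Self {
        self.teacher = Some(teacher);
        self
    }

    fn teacher_vocab(&self) -> Option<usize> {
        self.teacher.as_ref().map(|t| t.vocab_size())
    }
}
```

The trade-off is that a holder like this can no longer be moved to another thread; that matches the commit's stated reasoning that providers never cross threads.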
noahgift added a commit that referenced this pull request May 20, 2026
… 4 RUNNING (#1851)
Captures the live state of the distillation epic as of 2026-05-20:
    Phase 1:  Teacher provider               ✅ MERGED (#1786, #1787)
    Phase 2:  Student fwd/bwd + KD           ✅ MERGED (#1788–#1797)
    Phase 3:  E2E smoke on Blackwell GB10    ✅ DISCHARGED (#1828)
    Phase 3b: seq_len=256 scale verify       ✅ DISCHARGED (#1833)
    Phase 4:  50K training (Stage D)         🟡 RUNNING (PID 196378, gx10)
    Phase 5:  HumanEval pass@1               ⏳ ready (#1847)
    Phase 6:  Publish v2                     ⏳ ready (#1848)
Inserts a new top-of-doc status table that points at:
- The 11-PR Blackwell cascade (post-mortem in blackwell-cascade-postmortem.md)
- Stage C real-corpus dispatch result (15.61 → 6.01 over 124 steps)
- Stage D running with ETA ~22h from 2026-05-20 13:43 UTC
- Phase 5/6 turnkey scripts ready post-D
This captures institutional knowledge for the team and future sessions: the spec doc reflects what's actually shipped rather than the original plan from 2026-05-18 when the epic was still scaffolded.
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Wires Phase 1's teacher provider into a per-batch KD orchestration step that produces both:
- the combined α·CE + (1-α)·T²·KL scalar loss (for telemetry + logging)
- the KD-aware logit-space gradient (the Phase 2b plug point)
Stacks on top of #1787 (Phase 1b CudaTrainerTeacher). The Phase 2 work landed here is the pure-Rust orchestration math; Phase 2b (PMAT-694 follow-up) wires the gradient into `CudaTransformerTrainer.forward_backward_kd_batch` so the student actually learns from the KD signal.

Why split Phase 2 into 2a/2b
Phase 2 is 16-24h of engineering per SPEC-DISTILL-001. Splitting lets us land the orchestration layer + its tests (12 deliverable functions/tests) as a clean reviewable unit, separate from extending the complex CUDA trainer code. Phase 2b then becomes a focused "add forward_backward_kd_batch to CudaTransformerTrainer using this module's `kd_logit_gradient`" PR.

New module: `aprender-train-distill::kd_step`
The module exposes `kd_loss`, `kd_logit_gradient`, and `kd_step`. The gradient is the Hinton et al. 2015 §2 derivation:
    ∂L/∂s = α · (softmax(s) - one_hot(label)) + (1-α) · T · (softmax(s/T) - softmax(t/T))
(T factor, not T²: differentiating softmax(s/T) with respect to s contributes a factor 1/T, which cancels one of the two T factors.)
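As a rough illustration of the loss and gradient described above, here is a minimal CPU sketch. It is not the PR's implementation: the `_sketch` names and the `softmax_t` helper are hypothetical, and the KL direction (teacher distribution against student distribution, both at temperature T) is an assumption chosen because it reproduces the stated gradient.

```rust
/// Numerically stable softmax of `logits / temperature`.
fn softmax_t(logits: &[f32], temperature: f32) -> Vec<f32> {
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits
        .iter()
        .map(|&x| ((x - max) / temperature).exp())
        .collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

/// Sketch of alpha*CE + (1-alpha)*T^2*KL(teacher_T || student_T).
/// Panics if the slices differ in length or `label` is out of range.
fn kd_loss_sketch(
    student: &[f32],
    teacher: &[f32],
    label: usize,
    temperature: f32,
    alpha: f32,
) -> f32 {
    assert_eq!(student.len(), teacher.len());
    let p = softmax_t(student, 1.0);
    let ce = -p[label].max(f32::MIN_POSITIVE).ln();
    let q_s = softmax_t(student, temperature);
    let q_t = softmax_t(teacher, temperature);
    let mut kl = 0.0f32;
    for (&t, &s) in q_t.iter().zip(q_s.iter()) {
        if t > 0.0 {
            kl += t * (t / s.max(f32::MIN_POSITIVE)).ln();
        }
    }
    alpha * ce + (1.0 - alpha) * temperature * temperature * kl
}

/// Sketch of the logit-space gradient:
/// alpha*(softmax(s) - one_hot) + (1-alpha)*T*(softmax(s/T) - softmax(t/T)).
fn kd_logit_gradient_sketch(
    student: &[f32],
    teacher: &[f32],
    label: usize,
    temperature: f32,
    alpha: f32,
) -> Vec<f32> {
    assert_eq!(student.len(), teacher.len());
    let p = softmax_t(student, 1.0);
    let q_s = softmax_t(student, temperature);
    let q_t = softmax_t(teacher, temperature);
    (0..student.len())
        .map(|i| {
            let one_hot = if i == label { 1.0 } else { 0.0 };
            alpha * (p[i] - one_hot) + (1.0 - alpha) * temperature * (q_s[i] - q_t[i])
        })
        .collect()
}
```

A central finite difference of `kd_loss_sketch` over each student logit should agree with `kd_logit_gradient_sketch` to within floating-point tolerance, which is one way to check the T (rather than T²) factor.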
Falsifiers pinned
- F-DISTILL-KDSTEP-001: `alpha=1` collapses KD gradient to pure CE gradient
- F-DISTILL-KDSTEP-002: `student==teacher` + `alpha=0` → zero KL gradient
- F-DISTILL-KDSTEP-003: `kd_loss` strictly increases as student diverges from teacher
Plus 6 sanity tests covering softmax unit-sum, CE gradient signs, orchestration end-to-end, empty-batch, vocab-size mismatch error path, and `alpha=1` loss collapse.
All 50 aprender-train-distill lib tests pass (was 41; 9 new).
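The orchestration paths exercised by the sanity tests above (end-to-end, empty batch, vocab-size mismatch) can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's `kd_step`: `TeacherSketch` is a local stand-in for `TeacherLogitsProvider` with a `String` error, the per-sample math is passed in as a closure to keep the sketch focused on shape checks, and both the mean-loss aggregation and the `(0.0, [])` result for an empty batch are assumed behaviors.

```rust
/// Local stand-in for the teacher provider trait (error type simplified).
trait TeacherSketch {
    fn vocab_size(&self) -> usize;
    fn logits_for_batch(&mut self, input_ids: &[Vec<u32>]) -> Result<Vec<Vec<f32>>, String>;
}

/// Hypothetical teacher that returns the same logits row for every sample.
struct ConstTeacher {
    row: Vec<f32>,
}

impl TeacherSketch for ConstTeacher {
    fn vocab_size(&self) -> usize {
        self.row.len()
    }
    fn logits_for_batch(&mut self, input_ids: &[Vec<u32>]) -> Result<Vec<Vec<f32>>, String> {
        Ok(input_ids.iter().map(|_| self.row.clone()).collect())
    }
}

/// Placeholder per-sample rule (NOT the KD math): squared-error loss with
/// gradient `student - teacher`, used only to exercise the orchestration.
fn squared_error_per_sample(s: &[f32], t: &[f32], _label: usize) -> (f32, Vec<f32>) {
    let grad: Vec<f32> = s.iter().zip(t).map(|(a, b)| a - b).collect();
    let loss: f32 = grad.iter().map(|d| d * d).sum();
    (loss, grad)
}

/// Per-batch step: query the teacher once, run the student per sample,
/// validate shapes, and return (mean loss, per-sample logit gradients).
fn kd_step_sketch<S, G>(
    teacher: &mut dyn TeacherSketch,
    input_ids: &[Vec<u32>],
    labels: &[usize],
    mut compute_student_logits: S,
    per_sample: G,
) -> Result<(f32, Vec<Vec<f32>>), String>
where
    S: FnMut(&[u32]) -> Vec<f32>,
    G: Fn(&[f32], &[f32], usize) -> (f32, Vec<f32>),
{
    if input_ids.len() != labels.len() {
        return Err(format!("{} inputs vs {} labels", input_ids.len(), labels.len()));
    }
    if input_ids.is_empty() {
        return Ok((0.0, Vec::new())); // assumed empty-batch behavior
    }
    let vocab = teacher.vocab_size();
    let teacher_logits = teacher.logits_for_batch(input_ids)?;
    if teacher_logits.len() != input_ids.len() {
        return Err("teacher returned wrong batch size".to_string());
    }
    let mut total = 0.0f32;
    let mut grads = Vec::with_capacity(input_ids.len());
    for ((ids, &label), t_row) in input_ids.iter().zip(labels).zip(&teacher_logits) {
        let s_row = compute_student_logits(ids.as_slice());
        if s_row.len() != vocab || t_row.len() != vocab {
            return Err(format!("vocab mismatch: expected {}, got {}", vocab, s_row.len()));
        }
        if label >= vocab {
            return Err(format!("label {} out of range for vocab {}", label, vocab));
        }
        let (loss, grad) = per_sample(s_row.as_slice(), t_row.as_slice(), label);
        total += loss;
        grads.push(grad);
    }
    Ok((total / input_ids.len() as f32, grads))
}
```

Passing the per-sample rule as a closure is a simplification for this sketch; the signature in the PR takes `temperature` and `alpha` directly.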
What's next
Phase 2b (PMAT-694 follow-up): extend `CudaTransformerTrainer` with `forward_backward_kd_batch(batch, teacher_logits)`. The new method uses `kd_logit_gradient` (from this PR) as the backward-pass seed instead of the CE-only gradient. With Phase 2b landed, the pipeline starts producing genuinely distilled student weights instead of CE-only ones.

Test plan
- `cargo check -p aprender-train-distill` clean (both with and without `--features cuda`)

🤖 Generated with Claude Code