feat(distill): CudaStudentProvider + forward_backward_with_grad (SPEC-DISTILL-001 Phase 2d) by noahgift · Pull Request #1793 · paiml/aprender

noahgift · 2026-05-18T12:44:47Z

Summary

Closes the last engineering gap before Phase 3 (500-step E2E smoke). Real GPU student backend wraps CudaTransformerTrainer and implements the Phase 2b StudentLogitsProvider trait.

Stacks on top of #1792 (Phase 2c pipeline integration).

Two complementary pieces

1. New pub method on CudaTransformerTrainer (cuda_trainer.rs:1247-1311):

pub fn forward_backward_with_grad(
    &mut self,
    input_ids: &[u32],
    logit_gradient: &[f32],
) -> Option<()>

Runs forward, uploads the caller-supplied logit gradient into the last-position slice of logits_buf (replacing what gpu_forward wrote), runs gpu_backward (which back-props from the in-place gradient through the transformer stack), and runs embed_backward. Matches the KAIZEN-052 in-place gradient convention that fused_cross_entropy_cuda uses for the CE path — except the gradient now comes from kd_step::kd_logit_gradient (Phase 2a).
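The in-place convention described above can be pictured on the CPU. The sketch below is only an illustration: `seed_last_position` is a hypothetical helper, and the row-major `[seq_len, vocab_size]` layout of `logits_buf` is an assumption, not something this PR states. It overwrites the last-position slice with the caller's gradient, which is the state a backward pass would then start from.

```rust
/// CPU stand-in for the device-side `logits_buf`.
/// ASSUMPTION: the buffer is row-major with shape [seq_len, vocab_size].
/// Hypothetical helper for illustration; not the trainer's actual code.
pub fn seed_last_position(
    logits_buf: &mut [f32],
    seq_len: usize,
    vocab_size: usize,
    logit_gradient: &[f32],
) -> Option<()> {
    // Reject shapes that would otherwise corrupt the buffer silently.
    if seq_len == 0 || vocab_size == 0 {
        return None;
    }
    if logits_buf.len() != seq_len.checked_mul(vocab_size)?
        || logit_gradient.len() != vocab_size
    {
        return None;
    }
    // Replace the forward-pass logits at the final position with the
    // caller-supplied gradient; earlier positions are left untouched.
    let start = (seq_len - 1) * vocab_size;
    logits_buf[start..start + vocab_size].copy_from_slice(logit_gradient);
    Some(())
}
```

Returning `Option<()>` mirrors the signature shown above; a shape mismatch maps to `None` rather than a panic.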

2. New CudaStudentProvider in aprender-train-distill (cuda-gated):

pub struct CudaStudentProvider {
    trainer: CudaTransformerTrainer,
    vocab_size: usize,
    last_input_ids: Option<Vec<u32>>,  // batch_size=1 cache
}

impl StudentLogitsProvider for CudaStudentProvider {
    // logits_for_batch → trainer.forward_logits per batch element
    // apply_kd_gradient → trainer.forward_backward_with_grad on cached
    //                      last input_ids with the last gradient
}

Phase 2d limitation

batch_size=1 only — the trait's apply_kd_gradient doesn't take input_ids, so the provider caches from the most-recent logits_for_batch call. With batches >1, only the last element gets a real gradient update.
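To make the limitation concrete, here is a minimal sketch with a mock in place of the GPU trainer. `MockTrainer` and `CachingStudent` are hypothetical names, and the mock only records which sequences received a backward pass; the caching logic follows the description above (cache on forward, replay on backward).

```rust
/// Hypothetical stand-in for the GPU trainer: it records which input
/// sequences received a backward pass instead of running a model.
#[derive(Default)]
pub struct MockTrainer {
    pub updated: Vec<Vec<u32>>,
}

impl MockTrainer {
    pub fn forward_logits(&mut self, input_ids: &[u32]) -> Option<Vec<f32>> {
        // Placeholder logits over an assumed vocabulary of 4 tokens.
        Some(vec![input_ids.len() as f32; 4])
    }

    pub fn forward_backward_with_grad(
        &mut self,
        input_ids: &[u32],
        _logit_gradient: &[f32],
    ) -> Option<()> {
        self.updated.push(input_ids.to_vec());
        Some(())
    }
}

/// Sketch of the caching pattern: only the most recent input is remembered.
#[derive(Default)]
pub struct CachingStudent {
    pub trainer: MockTrainer,
    last_input_ids: Option<Vec<u32>>,
}

impl CachingStudent {
    pub fn logits_for_batch(&mut self, input_ids: &[Vec<u32>]) -> Option<Vec<Vec<f32>>> {
        let mut out = Vec::with_capacity(input_ids.len());
        for ids in input_ids {
            out.push(self.trainer.forward_logits(ids)?);
            // Each element overwrites the cache, so only the last survives.
            self.last_input_ids = Some(ids.clone());
        }
        Some(out)
    }

    pub fn apply_kd_gradient(&mut self, gradient: &[Vec<f32>]) -> Option<()> {
        let ids = self.last_input_ids.as_ref()?;
        let grad = gradient.last()?;
        self.trainer.forward_backward_with_grad(ids, grad)
    }
}
```

With a two-element batch, `trainer.updated` ends up holding only the second sequence, which is the behavior the limitation describes.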

Phase 2e (PMAT-698, follow-up) generalizes via a fused-step trait method that takes input_ids + gradient together so all batch elements process correctly.
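One possible shape for such a fused method is sketched below. Everything here is hypothetical: the trait name `FusedStudentStep`, the method names, and the `String` error type are placeholders, since Phase 2e is not part of this PR. The point is only that passing input_ids and gradients together lets every batch element get its own backward pass.

```rust
/// Hypothetical fused-step trait; names and error type are assumptions.
pub trait FusedStudentStep {
    /// Backward pass for a single sequence with its logit-space gradient.
    fn backward_one(&mut self, input_ids: &[u32], gradient: &[f32]) -> Result<(), String>;

    /// Pairs each batch element with its own gradient, so no caching is needed.
    fn apply_kd_gradient_fused(
        &mut self,
        input_ids: &[Vec<u32>],
        gradients: &[Vec<f32>],
    ) -> Result<(), String> {
        if input_ids.len() != gradients.len() {
            return Err(format!(
                "batch mismatch: {} inputs vs {} gradients",
                input_ids.len(),
                gradients.len()
            ));
        }
        for (ids, grad) in input_ids.iter().zip(gradients) {
            self.backward_one(ids, grad)?;
        }
        Ok(())
    }
}

/// Test double that records every sequence it was asked to update.
pub struct Recorder {
    pub seen: Vec<Vec<u32>>,
}

impl FusedStudentStep for Recorder {
    fn backward_one(&mut self, input_ids: &[u32], _gradient: &[f32]) -> Result<(), String> {
        self.seen.push(input_ids.to_vec());
        Ok(())
    }
}
```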

Tests

All 58 aprender-train-distill lib tests pass both with no features enabled and with --features cuda. cargo check is clean in both configurations.

The CudaStudentProvider itself doesn't have a unit test — exercising it needs CUDA at test time, and the F-DISTILL-CUDA-STUDENT-001 falsifier (logits parity within 1e-6 vs a standalone forward_logits call) is integration-tested in Phase 4 production runs.
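A parity check of this kind could be as small as the sketch below. The function name is hypothetical, and treating "within 1e-6" as an element-wise absolute tolerance is an assumption; the falsifier may define it differently.

```rust
/// Element-wise comparison of two logit vectors.
/// ASSUMPTION: the tolerance is absolute and applied per element.
/// Length mismatches and NaN entries count as failures.
pub fn logits_within_tolerance(provider: &[f32], reference: &[f32], tol: f32) -> bool {
    provider.len() == reference.len()
        && provider
            .iter()
            .zip(reference)
            // A NaN difference makes `<=` false, so NaNs fail the check.
            .all(|(p, r)| (p - r).abs() <= tol)
}
```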

What's next

With Phase 2d landed, Phase 3 (500-step E2E smoke run with real CudaTrainerTeacher + CudaStudentProvider) is unblocked. Phase 4 is the actual 50K-step distillation training run for albor-370m-v2.

Test plan

cargo check -p aprender-train-distill clean (no cuda)
cargo check -p aprender-train-distill --features cuda clean
cargo check -p aprender-train --features cuda clean (new method)
All 58 distill lib tests pass under both feature configs
Phase 3: 500-step E2E smoke on lambda-vector (real CUDA)
Phase 2e: fused-step trait method for batch_size>1 (PMAT-698)

🤖 Generated with Claude Code

…L-001 Phase 1b, PMAT-693)

Adds the real teacher backend the SPEC-DISTILL-001 Phase 1b ticket scopes: CudaTrainerTeacher wraps entrenar's CudaTransformerTrainer in inference-only mode, delegates logits_for_batch to forward_logits() per batch element, returns shape [batch, vocab_size].

Gated behind a new `cuda` feature on aprender-train-distill that propagates to entrenar/cuda. Without the feature, only FixtureTeacher (Phase 1) is available, which is sufficient for unit tests but not for real training. Real distillation runs (Phase 4) require --features cuda.

Surface
=======

#[cfg(feature = "cuda")]
pub struct CudaTrainerTeacher { /* wraps CudaTransformerTrainer */ }

impl CudaTrainerTeacher {
    pub fn for_inference(
        checkpoint_dir: impl AsRef<Path>,
        model_config: TransformerConfig,
    ) -> Result<Self> { ... }
}

impl TeacherLogitsProvider for CudaTrainerTeacher {
    fn vocab_size(&self) -> usize { ... }
    fn logits_for_batch(&mut self, input_ids: &[Vec<u32>]) -> Result<Vec<Vec<f32>>> { ... }
}

Defensive checks
================

- forward_logits returning None → EntrenarError::Internal with a clear "likely missing weights or CUDA init failure" message
- logits.len() != vocab_size → EntrenarError::Internal flagging TransformerConfig vs checkpoint vocab drift (the common silent failure mode for loaded-from-disk distillation runs)

Tests
=====

All 6 teacher_provider tests pass both with no features enabled and with --features cuda. Compile gates verified:

cargo check -p aprender-train-distill                  # clean
cargo check -p aprender-train-distill --features cuda  # clean

What's next
===========

Phase 2 (PMAT-694, follow-up): wire CudaTransformerTrainer's KD-loss backward into the student path, replacing the remaining build_synthetic_logits call site for the student in pipeline.rs::train().

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
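The two defensive checks in this commit can be sketched as a standalone validation step. `TeacherError` is a hypothetical stand-in for EntrenarError (whose real definition is not shown here), and `validate_teacher_logits` is an illustrative name, not a function from the codebase.

```rust
/// Hypothetical stand-in for EntrenarError; only the Internal variant is modeled.
#[derive(Debug, PartialEq)]
pub enum TeacherError {
    Internal(String),
}

/// Applies both checks to the result of one forward pass.
pub fn validate_teacher_logits(
    logits: Option<Vec<f32>>,
    vocab_size: usize,
) -> Result<Vec<f32>, TeacherError> {
    // Check 1: the forward pass produced nothing at all.
    let logits = logits.ok_or_else(|| {
        TeacherError::Internal(
            "forward_logits returned None: likely missing weights or CUDA init failure"
                .to_string(),
        )
    })?;
    // Check 2: the checkpoint's vocabulary disagrees with the configured one.
    if logits.len() != vocab_size {
        return Err(TeacherError::Internal(format!(
            "logits length {} != vocab_size {}: TransformerConfig vs checkpoint vocab drift",
            logits.len(),
            vocab_size
        )));
    }
    Ok(logits)
}
```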

…ent (SPEC-DISTILL-001 Phase 2)

Wires Phase 1's teacher provider into a per-batch KD orchestration step that produces both the combined α·CE + (1-α)·T²·KL scalar loss (for telemetry) and the KD-aware logit-space gradient (Phase 2b plug point).

What this PR adds
=================

New module `aprender-train-distill::kd_step`:

pub fn kd_loss(
    student_logits: &[f32],
    teacher_logits: &[f32],
    label: usize,
    temperature: f32,
    alpha: f32,
) -> f32;

pub fn kd_logit_gradient(
    student_logits: &[f32],
    teacher_logits: &[f32],
    label: usize,
    temperature: f32,
    alpha: f32,
) -> Vec<f32>;

pub fn kd_step<F: FnMut(&[u32]) -> Vec<f32>>(
    teacher: &mut dyn TeacherLogitsProvider,
    input_ids: &[Vec<u32>],
    labels: &[usize],
    temperature: f32,
    alpha: f32,
    compute_student_logits: F,
) -> Result<(f32, Vec<Vec<f32>>)>;

The gradient is the Hinton et al. 2015 §2 derivation:

∂L/∂s = α · (softmax(s) - one_hot(label)) + (1-α) · T · (softmax(s/T) - softmax(t/T))

(T factor, not T²: one T factor is absorbed by the softmax derivative chain rule.)

Scope: Phase 2 vs Phase 2b
==========================

Phase 2 (this PR) ships the orchestration math, all in pure Rust on the CPU. The output `Vec<Vec<f32>>` of per-batch gradients is what Phase 2b will plumb into `CudaTransformerTrainer.forward_backward_kd_batch` as the backward-pass seed (replacing the CE-only gradient currently used by forward_backward_batch).

Splitting Phase 2 into 2a/2b lets us land the orchestration layer + its tests now, separate from extending the complex GPU trainer code path.

Falsifiers pinned
=================

3 KD-step falsifiers + 6 sanity tests, all passing:

- F-DISTILL-KDSTEP-001 (alpha=1 → pure CE)
- F-DISTILL-KDSTEP-002 (student==teacher → zero KL gradient under alpha=0)
- F-DISTILL-KDSTEP-003 (loss monotone in student-teacher divergence)
- softmax unit-sum + non-negative
- CE gradient correct sign at label vs non-label positions
- kd_step orchestration end-to-end
- kd_step empty-batch sanity
- kd_step vocab-mismatch error path
- kd_loss alpha=1 collapses to pure CE

All 50 aprender-train-distill lib tests pass (was 41; 9 new).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
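The gradient formula in this commit message translates almost line for line into code. The sketch below is an independent illustration, not the module's implementation: `softmax_with_temperature` and `kd_logit_gradient_sketch` are hypothetical names, and panicking on a length mismatch is a simplification chosen for brevity.

```rust
/// Numerically stable softmax of `logits / temperature`.
fn softmax_with_temperature(logits: &[f32], temperature: f32) -> Vec<f32> {
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits
        .iter()
        .map(|&x| ((x - max) / temperature).exp())
        .collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// dL/ds = alpha * (softmax(s) - one_hot(label))
///       + (1 - alpha) * T * (softmax(s/T) - softmax(t/T))
/// Illustrative sketch; panics if the two logit slices differ in length.
pub fn kd_logit_gradient_sketch(
    student: &[f32],
    teacher: &[f32],
    label: usize,
    temperature: f32,
    alpha: f32,
) -> Vec<f32> {
    assert_eq!(student.len(), teacher.len(), "vocab size mismatch");
    let p = softmax_with_temperature(student, 1.0);
    let p_t = softmax_with_temperature(student, temperature);
    let q_t = softmax_with_temperature(teacher, temperature);
    (0..student.len())
        .map(|i| {
            let one_hot = if i == label { 1.0 } else { 0.0 };
            alpha * (p[i] - one_hot) + (1.0 - alpha) * temperature * (p_t[i] - q_t[i])
        })
        .collect()
}
```

With alpha=0 and identical student and teacher logits the result is all zeros, matching the intent of F-DISTILL-KDSTEP-002; with alpha=1 it reduces to the cross-entropy gradient, whose entries sum to zero.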

…TILL-001 Phase 2b)

Mirrors Phase 1's TeacherLogitsProvider for the student side. The student has two methods: logits_for_batch (forward) and apply_kd_gradient (backward + optimizer step). FixtureStudent implements both for CPU-only unit testing; Phase 2c will add a CudaStudentProvider that wraps CudaTransformerTrainer.

What this PR adds
=================

pub trait StudentLogitsProvider {
    fn vocab_size(&self) -> usize;
    fn logits_for_batch(&mut self, input_ids: &[Vec<u32>]) -> Result<Vec<Vec<f32>>>;
    fn apply_kd_gradient(&mut self, gradient: &[Vec<f32>]) -> Result<()>;
}

pub struct FixtureStudent {
    vocab_size: usize,
    logits: Vec<f32>,  // current student parameters
    learning_rate: f32,
}

FixtureStudent's apply_kd_gradient averages the gradient across batch elements (canonical SGD batch averaging) and subtracts the scaled gradient from its internal logits buffer. This isn't a real model: it's a logit-space optimization fixture that lets us validate that the KD pipeline's gradient direction is correct without needing CUDA.

Falsifiers pinned
=================

7 student_provider tests (2 falsifiers + 5 sanity tests), all passing:

- F-DISTILL-STUDENT-001: one KD step moves student logits toward the teacher's preferred token. Setup: uniform student, teacher prefers token 5, alpha=0 (pure KL signal). After one step, the student logit at index 5 must be strictly greater than before.
- F-DISTILL-STUDENT-002: 10 sequential KD steps strictly decrease per-step KD loss. With LR=0.5, loss after 10 steps < 90% of initial. Validates that the gradient direction is correct (descent, not ascent).

Plus 5 sanity tests covering vocab_size reporting, batch broadcast, shape validation, in-place logit update, and batch averaging math.

Architecture
============

Stacks on top of #1788 (kd_step). Pipeline integration that uses both TeacherLogitsProvider + StudentLogitsProvider + kd_step is Phase 2c.

Phase 2c (PMAT-696, follow-up): CudaStudentProvider that wraps CudaTransformerTrainer for production runs. Once it lands, end-to-end GPU distillation is unblocked.

Tests
=====

All 57 aprender-train-distill lib tests pass (was 50; 7 new).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
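The batch-averaging update described for FixtureStudent can be sketched as follows. `LogitFixture` is a hypothetical name, the `String` error type is a placeholder, and treating an empty batch as a no-op is an assumption rather than documented behavior.

```rust
/// Hypothetical logit-space fixture: its "parameters" are the logits themselves.
pub struct LogitFixture {
    pub logits: Vec<f32>,
    pub learning_rate: f32,
}

impl LogitFixture {
    /// Subtracts learning_rate * mean(gradient over batch) from the logits.
    pub fn apply_kd_gradient(&mut self, gradient: &[Vec<f32>]) -> Result<(), String> {
        // ASSUMPTION: an empty batch is a no-op rather than an error.
        if gradient.is_empty() {
            return Ok(());
        }
        let vocab = self.logits.len();
        if gradient.iter().any(|g| g.len() != vocab) {
            return Err(format!("gradient width does not match vocab_size {}", vocab));
        }
        // Folding 1/batch into the step size gives the batch mean.
        let scale = self.learning_rate / gradient.len() as f32;
        for g in gradient {
            for (w, gi) in self.logits.iter_mut().zip(g) {
                *w -= scale * gi;
            }
        }
        Ok(())
    }
}
```

For example, with learning_rate 0.5 and gradients [1, -1] and [3, -3], the mean is [2, -2] and zero-initialized logits move to [-1, 1].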

…o-end (SPEC-DISTILL-001 Phase 2c)

Rewrites Pipeline::train() to use TeacherLogitsProvider + StudentLogitsProvider + kd_step end-to-end. Replaces the build_synthetic_logits stubs on both sides with real abstraction calls.

Pipeline gains a `student: Box<dyn StudentLogitsProvider + Send>` field and a `Pipeline::with_student()` builder mirroring `with_teacher()`. Default backends are FixtureTeacher + FixtureStudent so legacy tests behave identically. Phase 2d swaps in CudaStudentProvider.

Training loop, per step:

1. dummy_batch = [[0u32]; batch_size] (Phase 4 plugs in real tokens)
2. teacher.logits_for_batch(dummy_batch) → teacher logits
3. kd_step(teacher, dummy_batch, labels, T, α, student_logits_closure) → (scalar loss, per-batch logit gradients)
4. student.apply_kd_gradient(grads) → student updates

bracketed by initial-loss + final-loss measurements via the new kd_step_loss_for_pipeline helper.

Falsifiers
==========

F-DISTILL-PIPELINE-001 (new), the end-to-end falsifier: runs Pipeline::execute() with FixtureTeacher + FixtureStudent + 3 epochs and asserts final_loss < initial_loss. Pins the entire data flow: teacher → student → kd_step → apply_kd_gradient. Any broken link either flatlines or increases the loss.

Phase 2d plug points
====================

- The dummy_batch in train() is the natural insertion point for a real dataset iterator (Phase 4 work).
- The student-logits closure in kd_step is the natural insertion point for CudaStudentProvider's forward_logits (Phase 2d).
- The apply_kd_gradient call is the natural insertion point for CudaStudentProvider's forward_backward_kd_batch path (Phase 2d).

Tests
=====

All 58 aprender-train-distill lib tests pass (was 57; 1 new F-DISTILL-PIPELINE-001 integration falsifier).

The four legacy helpers (build_synthetic_logits, kd_gradient, softmax_2d, write_logits_to_weights) are marked #[allow(dead_code)] for back-compat until Phase 2d's wiring fully replaces the on-disk weights round-trip.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
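The shape of the loop and of its initial-versus-final loss bracket can be shown with a toy, CPU-only version. This sketch is not Pipeline::train(): it assumes alpha=0 (pure KL), a single batch element, and a student whose parameters are its logits; `toy_train`, `kl_loss`, and `softmax_t` are hypothetical names.

```rust
/// Numerically stable softmax of `logits / t`.
fn softmax_t(logits: &[f32], t: f32) -> Vec<f32> {
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&x| ((x - max) / t).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// T^2-scaled KL(teacher || student) at temperature t (the alpha = 0 loss).
fn kl_loss(student: &[f32], teacher: &[f32], t: f32) -> f32 {
    let p = softmax_t(student, t);
    let q = softmax_t(teacher, t);
    let kl: f32 = q.iter().zip(&p).map(|(qi, pi)| qi * (qi / pi).ln()).sum();
    t * t * kl
}

/// Runs `steps` KD updates and returns (initial_loss, final_loss).
/// ASSUMPTION: the student's parameters are its logits, as in a fixture.
pub fn toy_train(
    teacher: &[f32],
    student: &mut [f32],
    t: f32,
    lr: f32,
    steps: usize,
) -> (f32, f32) {
    let initial = kl_loss(student, teacher, t);
    for _ in 0..steps {
        let p = softmax_t(student, t);
        let q = softmax_t(teacher, t);
        // Logit-space gradient of the T^2-scaled KL term: T * (p - q).
        for i in 0..student.len() {
            student[i] -= lr * t * (p[i] - q[i]);
        }
    }
    (initial, kl_loss(student, teacher, t))
}
```

A check in the spirit of F-DISTILL-PIPELINE-001 would call `toy_train` with a uniform student and assert that the second returned value is smaller than the first.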


noahgift · 2026-05-18T16:40:20Z

Subsumed by #1797 squash-merge (chain-PR leapfrog pattern per memory rule). All content landed on main at aee8716.

noahgift enabled auto-merge (squash) May 18, 2026 12:44

noahgift mentioned this pull request May 18, 2026

chore(distill): Phase 3 smoke-run dispatch + watch scripts for gx10 (SPEC-DISTILL-001) #1795

Closed

3 tasks

noahgift and others added 5 commits May 18, 2026 15:17

noahgift force-pushed the feat/distill-phase-2d-cuda-student branch from 20b7800 to 30a3bb3 Compare May 18, 2026 13:17

noahgift added 2 commits May 18, 2026 16:33

Merge branch 'main' into feat/distill-phase-2d-cuda-student

305d998

Merge branch 'main' into feat/distill-phase-2d-cuda-student

40d0edd

noahgift closed this May 18, 2026

auto-merge was automatically disabled May 18, 2026 16:40
Pull request was closed

noahgift deleted the feat/distill-phase-2d-cuda-student branch May 18, 2026 16:40

noahgift wants to merge 7 commits into main from feat/distill-phase-2d-cuda-student
