fix(distill): non-degenerate smoke batch — per-row input matches label (PMAT-698m) by noahgift · Pull Request #1823 · paiml/aprender

noahgift · 2026-05-19T20:47:00Z

Summary

Phase 3 GB10 smoke completed end-to-end after the 7-PR cascade but reported initial_loss=6.0760 → final_loss=8.3914 — loss INCREASED. The pipeline was working; this was a smoke-setup artifact.

Root cause

let dummy_batch = vec![vec![0u32]; batch_size];               // all token 0
let labels      = (0..batch_size).map(|i| i % num_classes);   // 0,1,2,...

Same input paired with N distinct labels. The student is asked to emit different targets for IDENTICAL features — impossible to learn, CE diverges. With teacher==student in the smoke, KD ~ 0 → CE dominates → final > initial.
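The two claims above can be checked numerically. The following is a minimal sketch, not the pipeline's code: `softmax`, `kl_divergence`, and `ce_floor_identical_inputs` are hypothetical helpers written for illustration. It shows that when every row has the same input, the best a model can do is predict the empirical label distribution, so mean CE cannot drop below that distribution's entropy. It also shows that KL between identical teacher and student distributions is zero. It does not by itself explain why the loss rose, only why the degenerate batch gives the student nothing learnable.

```rust
use std::collections::HashMap;

/// Numerically stable softmax over a slice of logits.
fn softmax(logits: &[f64]) -> Vec<f64> {
    let max = logits.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = logits.iter().map(|&z| (z - max).exp()).collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

/// KL(p || q) in nats; terms with p_i == 0 contribute nothing.
fn kl_divergence(p: &[f64], q: &[f64]) -> f64 {
    let mut kl = 0.0;
    for (&pi, &qi) in p.iter().zip(q.iter()) {
        if pi > 0.0 {
            kl += pi * (pi / qi).ln();
        }
    }
    kl
}

/// Lowest mean cross-entropy (nats) reachable when every row shares one input:
/// the model emits a single distribution, and the optimum is the empirical
/// label distribution, whose entropy is the floor.
fn ce_floor_identical_inputs(labels: &[usize]) -> f64 {
    let n = labels.len() as f64;
    let mut counts: HashMap<usize, usize> = HashMap::new();
    for &y in labels {
        *counts.entry(y).or_insert(0) += 1;
    }
    counts
        .values()
        .map(|&c| {
            let q = c as f64 / n;
            -q * q.ln()
        })
        .sum()
}

fn main() {
    // Illustrative sizes only; the real smoke's batch_size/num_classes are not stated here.
    let (batch_size, num_classes) = (8usize, 4usize);
    let labels: Vec<usize> = (0..batch_size).map(|i| i % num_classes).collect();
    // Four equally frequent labels give a floor of ln(4).
    println!("CE floor = {:.4} nats", ce_floor_identical_inputs(&labels));

    // teacher == student: identical logits, so the KD (KL) term vanishes.
    let logits = [0.3, -1.2, 2.0, 0.7];
    let teacher = softmax(&logits);
    let student = softmax(&logits);
    println!("KL(teacher || student) = {:.4}", kl_divergence(&teacher, &student));
}
```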

Fix

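Each row's input now matches its label: row i carries token i % num_classes and is labelled with that same token. The student only has to learn a trivial identity mapping (input token → predict same token), so CE decreases monotonically.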

dummy_batch[i] = vec![(i % num_classes) as u32]
labels[i]      = i % num_classes

For real distillation with a dataset, callers override the pipeline's batch construction entirely.
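As a runnable illustration of the fix, here is a minimal sketch of the batch construction. `smoke_batch` is a hypothetical free function written for this example; the actual change lives inside pipeline.train(), and its surrounding types are not shown in this PR text.

```rust
/// Hypothetical helper: builds the non-degenerate smoke batch.
/// Row i holds the single token `i % num_classes`, and its label is that same value.
fn smoke_batch(batch_size: usize, num_classes: usize) -> (Vec<Vec<u32>>, Vec<usize>) {
    assert!(num_classes > 0, "num_classes must be non-zero");
    let labels: Vec<usize> = (0..batch_size).map(|i| i % num_classes).collect();
    // Derive each input row from its label so the two can never disagree.
    let inputs: Vec<Vec<u32>> = labels.iter().map(|&y| vec![y as u32]).collect();
    (inputs, labels)
}

fn main() {
    // Illustrative sizes only.
    let (inputs, labels) = smoke_batch(6, 4);
    for (row, label) in inputs.iter().zip(&labels) {
        println!("input = {:?}, label = {}", row, label);
    }
}
```

Deriving the inputs from the labels, rather than computing `i % num_classes` twice, is a design choice of this sketch: it makes the "input matches label" invariant hold by construction.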

Test plan

- 58 distill lib tests pass (FALSIFY-APR-DISTILL-TRAIN-001/002 + falsify_pipeline_001_end_to_end_loss_decreases all green)
- Live gx10 dispatch: final_loss < initial_loss (F-DISTILL-SMOKE-001 positive)
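The F-DISTILL-SMOKE-001 condition is a single comparison. The sketch below is an assumption about how such a check could be written, not the contract's actual implementation; `loss_decreased` is a hypothetical name, and the finiteness guard is an addition of this example. The loss values are the ones reported on this PR page: the failing smoke run, and the later 1.5B → 0.5B run on gx10.

```rust
/// Hypothetical check for the smoke contract "final_loss < initial_loss".
/// Non-finite losses (NaN/inf) are treated as a failure rather than compared.
fn loss_decreased(initial_loss: f64, final_loss: f64) -> bool {
    initial_loss.is_finite() && final_loss.is_finite() && final_loss < initial_loss
}

fn main() {
    // Degenerate smoke batch, before this fix: loss went up.
    println!("before fix: {}", loss_decreased(6.0760, 8.3914));
    // Real distillation run reported on 2026-05-20: loss went down.
    println!("real KD run: {}", loss_decreased(7.6746, 7.2036));
}
```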

Relationship to cascade

The PMAT-698e..m cascade closed Phase 3 dispatch readiness:

| # | PR | What |
|---|----|------|
| 1-8 | various | Pipeline mechanics (workspace, format, pre-warm, macro, key align) |
| 9 | THIS PMAT-698m | Smoke contract semantics: F-DISTILL-SMOKE-001 actually meaningful now |

🤖 Generated with Claude Code

…l (PMAT-698m)

Phase 3 GB10 smoke completed end-to-end after the 7-PR cascade (PMAT-698e..k + PMAT-700-B) but reported:

    initial_loss = 6.0760
    final_loss   = 8.3914

Loss INCREASED. F-DISTILL-SMOKE-001 ("final_loss < initial_loss") falsified, but the pipeline was working; this was a smoke-setup artifact.

Root cause: pipeline.train()'s synthetic batch was:

    let dummy_batch = vec![vec![0u32]; batch_size];               // all token 0
    let labels      = (0..batch_size).map(|i| i % num_classes);   // i.e. 0,1,2,...

Same input paired with N distinct labels. The student is being asked to emit different targets for IDENTICAL features: impossible to learn, CE loss diverges. With teacher==student in the smoke (Qwen 0.5B both sides), the KD component contributes ~0 signal, so the CE divergence dominates and final > initial.

Fix: per-row input matches per-row label. Each row carries a distinct token; the label is that same token. Student learns trivial identity mapping, CE decreases monotonically, F-DISTILL-SMOKE-001 satisfiable.

    dummy_batch[i] = vec![(i % num_classes) as u32]
    labels[i]      = i % num_classes

For real distillation with a real dataset, callers override batch construction entirely; this default is for the smoke + fixture tests.

Test plan:
- [x] 58 distill lib tests pass (FALSIFY-APR-DISTILL-TRAIN-001/002 remain green; fixture-path-loss-decreases assertion is now ACTUALLY meaningful instead of accidentally passing on degenerate data)
- [ ] Live gx10 dispatch: final_loss < initial_loss (F-DISTILL-SMOKE-001 verdict POSITIVE, post-merge)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-20: real distillation 1.5B teacher → 0.5B student on Blackwell GB10 with the full PMAT-698e..n + PMAT-700-B cascade active.

    initial_loss = 7.6746
    final_loss   = 7.2036   ← LESS THAN initial
    62 steps, 122.7s, no errors

F-DISTILL-SMOKE-001 ("final_loss < initial_loss") discharged. Phase 3 of SPEC-DISTILL-001 is COMPLETE.

Evidence:
- evidence/distill-phase-3-real-kd/dispatch.json: dispatch manifest
- evidence/distill-phase-3-real-kd/launch-final-pass.txt: full training log

Run dir on gx10: /home/noah/runs/distill-smoke-20260520-070404/
Trained student checkpoint: student-trained.apr/model.safetensors

Cascade summary (all merged):
- #1804 PMAT-700-B (cuBLAS prewarm skip)
- #1808 PMAT-698e (workspace cap)
- #1809 PMAT-698f (APR magic in weights loader)
- #1810 PMAT-698g (non-LoRA backward pre-warm)
- #1813 PMAT-698h (rms_norm_gamma_reduce pre-warm)
- #1815 PMAT-698i (FWD-CACHE diagnostic logging)
- #1817 PMAT-698j (THE root cause: warm! macro key)
- #1820 PMAT-698k (cache-key alignment: rope fwd + rmsnorm eps)
- #1823 PMAT-698m (smoke setup: non-degenerate batch)
- #1824 (post-mortem doc)
- #1827 PMAT-698n (rmsnorm pre-warm at both 1e-6 + 1e-5 eps)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

noahgift enabled auto-merge (squash) May 19, 2026 20:47

noahgift mentioned this pull request May 19, 2026

docs(blackwell): cascade post-mortem — 8 PRs / 7 defects / 1 root cause #1824

Merged

noahgift merged commit 1c5746e into main May 19, 2026
11 checks passed

noahgift deleted the fix/distill-smoke-batch-non-degenerate-pmat-698m branch May 19, 2026 21:16

noahgift mentioned this pull request May 20, 2026

evidence(distill): Phase 3 F-DISTILL-SMOKE-001 PASS on gx10 GB10 #1828

Merged

noahgift merged 1 commit into main from fix/distill-smoke-batch-non-degenerate-pmat-698m
