
fix(distill): non-degenerate smoke batch — per-row input matches label (PMAT-698m) #1823

Merged: noahgift merged 1 commit into main from fix/distill-smoke-batch-non-degenerate-pmat-698m on May 19, 2026

@noahgift

Summary

Phase 3 GB10 smoke completed end-to-end after the 7-PR cascade but reported initial_loss=6.0760 → final_loss=8.3914 — loss INCREASED. The pipeline was working; this was a smoke-setup artifact.

Root cause

let dummy_batch = vec![vec![0u32]; batch_size];               // all token 0
let labels      = (0..batch_size).map(|i| i % num_classes);   // 0,1,2,...

Same input paired with N distinct labels. The student is asked to emit different targets for IDENTICAL features — impossible to learn, CE diverges. With teacher==student in the smoke, KD ~ 0 → CE dominates → final > initial.
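The argument above can be checked numerically. The following is a minimal sketch, not the pipeline's code: the function names are hypothetical, and it assumes the model is a deterministic function of its input, so identical inputs force one shared predicted distribution for every row. Under that assumption the mean CE over equally frequent distinct labels cannot fall below ln(N), and a confident prediction for any single label makes it worse. It also shows that KL(p || p) is exactly 0, which is why the KD term carries no signal when teacher == student.

```rust
/// Mean cross-entropy when every row receives the same predicted
/// distribution `q` (forced by identical inputs) but row `i` carries
/// label `labels[i]`. Hypothetical helper; `labels` must be non-empty.
fn mean_ce_shared_prediction(q: &[f64], labels: &[usize]) -> f64 {
    let total: f64 = labels.iter().map(|&y| -q[y].ln()).sum();
    total / labels.len() as f64
}

/// KL(p || q) for two discrete distributions of equal length.
/// Terms with p_i == 0 contribute 0 by convention.
fn kl_divergence(p: &[f64], q: &[f64]) -> f64 {
    p.iter()
        .zip(q)
        .map(|(&pi, &qi)| if pi > 0.0 { pi * (pi / qi).ln() } else { 0.0 })
        .sum()
}
```

For example, with 4 rows labelled 0, 1, 2, 3, a uniform `q` gives a mean CE of ln(4), while a `q` that puts 0.97 on class 0 scores strictly higher, so gradient steps that sharpen the prediction toward any one row's label raise the loss on the others.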

Fix

Per-row input matches per-row label. Trivial identity mapping (input token → predict same token). CE decreases monotonically.

dummy_batch[i] = vec![(i % num_classes) as u32]
labels[i]      = i % num_classes

For real distillation with a dataset, callers override the pipeline's batch construction entirely.
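As a rough illustration of the fix, the sketch below builds the per-row batch and checks that no input is paired with two different labels. `build_smoke_batch` and `is_learnable` are hypothetical names, and the `Vec<Vec<u32>>` / `Vec<usize>` types are assumptions inferred from the snippets above, not the pipeline's actual signatures.

```rust
use std::collections::HashMap;

/// Hypothetical helper: row `i` carries the single token `i % num_classes`,
/// and its label is that same value (identity mapping).
fn build_smoke_batch(batch_size: usize, num_classes: usize) -> (Vec<Vec<u32>>, Vec<usize>) {
    assert!(num_classes > 0, "num_classes must be non-zero");
    let labels: Vec<usize> = (0..batch_size).map(|i| i % num_classes).collect();
    let inputs: Vec<Vec<u32>> = labels.iter().map(|&y| vec![y as u32]).collect();
    (inputs, labels)
}

/// Returns true when no input sequence is paired with two different labels,
/// i.e. the batch does not ask for different targets on identical features.
fn is_learnable(inputs: &[Vec<u32>], labels: &[usize]) -> bool {
    let mut seen: HashMap<&[u32], usize> = HashMap::new();
    inputs
        .iter()
        .zip(labels)
        .all(|(x, &y)| *seen.entry(x.as_slice()).or_insert(y) == y)
}
```

Applied to the old batch (every row `vec![0u32]`, labels 0, 1, 2, ...), `is_learnable` returns false as soon as there are two rows; applied to the output of `build_smoke_batch` it returns true for any batch size, including batches larger than `num_classes`, because repeated inputs repeat the same label.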

Test plan

  • 58 distill lib tests pass (FALSIFY-APR-DISTILL-TRAIN-001/002 + falsify_pipeline_001_end_to_end_loss_decreases all green)
  • Live gx10 dispatch: final_loss < initial_loss (F-DISTILL-SMOKE-001 positive)
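The smoke contract itself is a single comparison. A minimal sketch of how F-DISTILL-SMOKE-001 could be evaluated is shown here; the `SmokeVerdict` type and `smoke_verdict` function are hypothetical, and treating a non-finite loss as a falsification is an assumption of this sketch rather than something the contract states.

```rust
/// Hypothetical verdict type for F-DISTILL-SMOKE-001.
#[derive(Debug, PartialEq)]
enum SmokeVerdict {
    Positive,
    Falsified,
}

/// Evaluates "final_loss < initial_loss".
/// Assumption: NaN or infinite losses count as Falsified.
fn smoke_verdict(initial_loss: f32, final_loss: f32) -> SmokeVerdict {
    let finite = initial_loss.is_finite() && final_loss.is_finite();
    if finite && final_loss < initial_loss {
        SmokeVerdict::Positive
    } else {
        SmokeVerdict::Falsified
    }
}
```

With the numbers reported in this PR, `smoke_verdict(6.0760, 8.3914)` is `Falsified`, which is the degenerate-batch run described in the summary.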

Relationship to cascade

The PMAT-698e..m cascade closed Phase 3 dispatch readiness:

| #   | PR             | What |
|-----|----------------|------|
| 1-8 | various        | Pipeline mechanics (workspace, format, pre-warm, macro, key align) |
| 9   | THIS PMAT-698m | Smoke contract semantics: F-DISTILL-SMOKE-001 actually meaningful now |

🤖 Generated with Claude Code

…l (PMAT-698m)

Phase 3 GB10 smoke completed end-to-end after the 7-PR cascade
(PMAT-698e..k + PMAT-700-B) but reported:

  initial_loss = 6.0760
  final_loss   = 8.3914

Loss INCREASED. F-DISTILL-SMOKE-001 ("final_loss < initial_loss")
falsified — but the pipeline was working; this was a smoke-setup artifact.

Root cause: pipeline.train()'s synthetic batch was:

  let dummy_batch = vec![vec![0u32]; batch_size];           // all token 0
  let labels      = (0..batch_size).map(|i| i % num_classes); // i.e. 0,1,2,...

Same input paired with N distinct labels. The student is being asked to
emit different targets for IDENTICAL features — impossible to learn,
CE loss diverges. With teacher==student in the smoke (Qwen 0.5B both
sides), the KD component contributes ~0 signal, so the CE divergence
dominates and final > initial.

Fix: per-row input matches per-row label. Each row carries a distinct
token; the label is that same token. Student learns trivial identity
mapping, CE decreases monotonically, F-DISTILL-SMOKE-001 satisfiable.

  dummy_batch[i] = vec![(i % num_classes) as u32]
  labels[i]      = i % num_classes

For real distillation with a real dataset, callers override batch
construction entirely; this default is for the smoke + fixture tests.

Test plan:
- [x] 58 distill lib tests pass (FALSIFY-APR-DISTILL-TRAIN-001/002
      remain green; fixture-path-loss-decreases assertion is now ACTUALLY
      meaningful instead of accidentally passing on degenerate data)
- [ ] Live gx10 dispatch: final_loss < initial_loss (F-DISTILL-SMOKE-001
      verdict POSITIVE, post-merge)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) May 19, 2026 20:47
@noahgift noahgift merged commit 1c5746e into main May 19, 2026
11 checks passed
@noahgift noahgift deleted the fix/distill-smoke-batch-non-degenerate-pmat-698m branch May 19, 2026 21:16
noahgift added a commit that referenced this pull request May 20, 2026
2026-05-20 — real distillation 1.5B teacher → 0.5B student on
Blackwell GB10 with the full PMAT-698e..n + PMAT-700-B cascade active.

  initial_loss = 7.6746
  final_loss   = 7.2036   ← LESS THAN initial
  62 steps, 122.7s, no errors

F-DISTILL-SMOKE-001 ("final_loss < initial_loss") discharged.

Phase 3 of SPEC-DISTILL-001 is COMPLETE.

Evidence:
- evidence/distill-phase-3-real-kd/dispatch.json — dispatch manifest
- evidence/distill-phase-3-real-kd/launch-final-pass.txt — full training log

Run dir on gx10: /home/noah/runs/distill-smoke-20260520-070404/
Trained student checkpoint: student-trained.apr/model.safetensors

Cascade summary (all merged):
- #1804 PMAT-700-B  (cuBLAS prewarm skip)
- #1808 PMAT-698e   (workspace cap)
- #1809 PMAT-698f   (APR magic in weights loader)
- #1810 PMAT-698g   (non-LoRA backward pre-warm)
- #1813 PMAT-698h   (rms_norm_gamma_reduce pre-warm)
- #1815 PMAT-698i   (FWD-CACHE diagnostic logging)
- #1817 PMAT-698j   (THE root cause — warm! macro key)
- #1820 PMAT-698k   (cache-key alignment: rope fwd + rmsnorm eps)
- #1823 PMAT-698m   (smoke setup: non-degenerate batch)
- #1824             (post-mortem doc)
- #1827 PMAT-698n   (rmsnorm pre-warm at both 1e-6 + 1e-5 eps)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>