docs(spec): SPEC-DISTILL-001 — distillation epic 6-phase plan (PMAT-683/684) by noahgift · Pull Request #1785 · paiml/aprender

noahgift · 2026-05-18T08:43:03Z

Summary

Opens the distillation track that picks up MODEL-2 v2 (paiml/albor-370m-v2) from where the §88 stack-existence-proof ship left off. The 370M-v1 model produces gibberish at val_perplexity≈102; the v2 path distills from MODEL-1's teacher, paiml/qwen2.5-coder-7b-apache-q4k-v1, into a student intended to reach a usable HumanEval pass@1.

Current state audit

| Component | State |
|---|---|
| `DistillationLoss` (KD = α·CE + (1-α)·T²·KL) | REAL: `aprender-train/src/distill/loss.rs` |
| `CudaTransformerTrainer` forward+backward | REAL: proven by §82 P2-A 5000-step run |
| realizar teacher inference | REAL: proven by SHIP-005 86.59% HumanEval |
| Pipeline orchestrator | STUB: `pipeline.rs:115` uses `build_synthetic_logits()` instead of `teacher.forward()` |

The gap to close is the orchestrator stub. Closing it is this epic.
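
For reference, the KD objective named in the audit table, KD = α·CE + (1-α)·T²·KL, can be sketched for a single token position as below. This is a minimal illustration and not the code in `aprender-train/src/distill/loss.rs`; the function names `log_softmax` and `kd_loss` are hypothetical. It also assumes that the cross-entropy term uses the unsoftened student distribution and that the KL direction is teacher-to-student, neither of which the PR states.

```rust
/// Numerically stable log-softmax of `logits / temperature`.
/// (Hypothetical helper, not taken from the repository.)
fn log_softmax(logits: &[f32], temperature: f32) -> Vec<f32> {
    let scaled: Vec<f32> = logits.iter().map(|&z| z / temperature).collect();
    // Subtract the max before exponentiating to avoid overflow.
    let max = scaled.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let log_sum_exp = scaled.iter().map(|&z| (z - max).exp()).sum::<f32>().ln() + max;
    scaled.iter().map(|&z| z - log_sum_exp).collect()
}

/// KD = alpha * CE(student, label) + (1 - alpha) * T^2 * KL(teacher_T || student_T)
/// for one token position.
fn kd_loss(student: &[f32], teacher: &[f32], label: usize, alpha: f32, t: f32) -> f32 {
    assert_eq!(student.len(), teacher.len());
    // Assumption: the hard-label term uses the student at temperature 1.
    let ce = -log_softmax(student, 1.0)[label];
    // The soft-target term compares both distributions softened by T.
    let log_p_teacher = log_softmax(teacher, t);
    let log_q_student = log_softmax(student, t);
    let kl: f32 = log_p_teacher
        .iter()
        .zip(&log_q_student)
        .map(|(&lp, &lq)| lp.exp() * (lp - lq))
        .sum();
    // The T^2 factor keeps soft-target gradients on the same scale as T changes.
    alpha * ce + (1.0 - alpha) * t * t * kl
}
```

When student and teacher logits are identical the KL term vanishes, so with α = 0 the loss is zero, which makes a convenient sanity check for any implementation.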

6-phase plan

| Phase | Eng | Compute | Calendar | Falsifier |
|---|---|---|---|---|
| 1: Teacher logits cache (`apr distill prepare`) | 16-24h | 4-8h | 3 days | F-DISTILL-PREP-001: cache vs realizar cos sim ≥ 0.999 |
| 2: Wire KD-loss to CudaTransformerTrainer (replace stub) | 16-24h | <1h | 2 days | F-DISTILL-KD-001: loss monotone over 100 steps |
| 3: 500-step E2E smoke | 8h | 4h | 1 day | F-DISTILL-SMOKE-001: val_loss step 500 < step 0 |
| 4: V2 training run (50K steps × 8192 tok/step = 1.6B tokens) | 4h | 30h unattended | 2-3 days | F-DISTILL-V2-001/002: val_loss < 3.0 AND pass@1 ≥ 15% |
| 5: Full HumanEval discharge (PMAT-684) | 4h | 5-8h gx10 | 1 day | F-HUMANEVAL-V2-001/002: pass@1 ≥ 15% / ≥ 25% |
| 6: Publish v2 per SPEC-HF-PUBLISH-001 | 3h | 1h | 0.5 day | three-path verification |
| Total | ~70h eng | ~45h compute | ~10 days | |
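
The first three falsifiers in the plan reduce to small numeric predicates. The sketch below shows one plausible reading of each; the function names are hypothetical, and it assumes that "monotone" means non-increasing from step to step, which the PR does not spell out.

```rust
/// Cosine similarity between two equal-length logit vectors
/// (the quantity F-DISTILL-PREP-001 bounds below by 0.999).
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}

/// F-DISTILL-KD-001-style check.
/// Assumption: "monotone" is read as "never increases step to step".
fn is_monotone_non_increasing(losses: &[f32]) -> bool {
    losses.windows(2).all(|w| w[1] <= w[0])
}

/// F-DISTILL-SMOKE-001-style check: the last recorded validation loss
/// must be strictly below the first one.
fn smoke_passes(val_losses: &[f32]) -> bool {
    match (val_losses.first(), val_losses.last()) {
        (Some(first), Some(last)) if val_losses.len() > 1 => last < first,
        _ => false,
    }
}
```

A strict step-to-step check is brittle on noisy minibatch losses, so a real test would likely run on a fixed toy batch or on a smoothed loss curve.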

Key architectural decisions

Top-K=64 cache (not full logits) — DistilBERT/Distil-Qwen precedent. ~4GB instead of ~100GB.
Cached teacher (not online) — hyperparameter sweep cost < single cache regen.
Vanilla KD (not MiniLM intermediate-layer matching) — teacher is Q4_K, intermediate activations aren't recoverable post-quantization.
Matched tokenizer (Qwen2 151,936 vocab) — strongest argument for Qwen2.5-Coder-0.5B-Instruct as init.
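
To make the first decision concrete, top-K sparsification keeps only the K largest teacher logits per token position, each paired with its token id. The sketch below is an illustration under assumptions: `top_k_logits` is a hypothetical name, and the PR does not show the repository's actual cache layout or entry encoding.

```rust
/// Keep the `k` largest logits as (token_id, logit) pairs, highest first.
/// (Hypothetical helper, not taken from the repository.)
fn top_k_logits(logits: &[f32], k: usize) -> Vec<(u32, f32)> {
    let mut indexed: Vec<(u32, f32)> = logits
        .iter()
        .enumerate()
        .map(|(i, &z)| (i as u32, z))
        .collect();
    // total_cmp gives a total order on f32, so NaN values cannot panic the sort.
    indexed.sort_by(|a, b| b.1.total_cmp(&a.1));
    indexed.truncate(k);
    indexed
}
```

With the Qwen2 vocabulary of 151,936 entries and K=64, each position stores 64 pairs instead of 151,936 values. A full sort is the simplest way to show the idea; a partial selection would be cheaper at that vocabulary size.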

Cross-references

Parent: ship-model-2-spec.md §84.5
Roadmap: PMAT-683 (P2-D distill) + PMAT-684 (P1-B HumanEval)
Companion: SPEC-HF-PUBLISH-001 (used for Phase 6)
Audit: AUDIT-Q4K-SHAPE-001 — confirms teacher's Q4_K artifact is bit-correct (no re-export needed before distillation)

Test plan

Spec authored at docs/specifications/aprender-train/distillation-epic-spec.md
Cross-links resolve (ship-model-2 §84.5, SPEC-HF-PUBLISH-001, AUDIT-Q4K-SHAPE-001, roadmap PMAT-683/684)
All 5 AC-DISTILL-* acceptance criteria pin a measurable falsifier
Phase 1 implementation kick-off in a follow-up PR

🤖 Generated with Claude Code

…83/684)

Opens the distillation track that picks up MODEL-2 v2 (paiml/albor-370m-v2) from where the §88 stack-existence-proof ship left off.

What this spec scopes
=====================

Current state audit:
- DistillationLoss is REAL (KD = α·CE + (1-α)·T²·KL)
- CudaTransformerTrainer forward+backward is REAL (proven by §82 P2-A)
- realizar teacher inference is REAL (proven by SHIP-005 86.59% HumanEval)
- Pipeline orchestrator at aprender-train-distill/src/pipeline.rs:115 is STUB: uses build_synthetic_logits() instead of teacher.forward(); never calls CudaTransformerTrainer for the student. Closing this gap is the epic.

6-phase plan
============

Phase 1 (3 days, 16-24h eng): `apr distill prepare`. realizar runs MODEL-1 teacher over the corpus, caches top-K=64 logits to disk. 100-batch test asserts cosine sim ≥ 0.999 against online realizar recompute.

Phase 2 (2 days, 16-24h eng): wire CudaTransformerTrainer to KD loss via new forward_backward_kd_batch(); replace synthetic-logits stub in pipeline.rs::train(). Unit test on toy student verifies loss monotone.

Phase 3 (1 day + 4h compute): 500-step E2E smoke on a 10K-batch slice. Falsifier F-DISTILL-SMOKE-001: val_loss at step 500 < step 0.

Phase 4 (4h dispatch + 30h unattended compute): the v2 training run. 50K steps × 8192 tok/step = 1.6B tokens. Init from Qwen2.5-Coder-0.5B (matched tokenizer + arch family). Falsifiers F-DISTILL-V2-001/002: val_loss < 3.0 AND HumanEval pass@1 ≥ 15%.

Phase 5 (5-8h gx10 compute): full 164-problem HumanEval discharge of PMAT-684. Acceptance threshold pass@1 ≥ 15%; ship-goal ≥ 25%.

Phase 6 (3h staging + 1h compute): publish v2 per SPEC-HF-PUBLISH-001. With v0.34.0+#1783 binary, companion files + model.safetensors alias are auto-emitted. Three-path verification (apr run + HF Transformers + llama-cli).

Total: ~70h eng + ~45h compute, ~10 days calendar.

Risk register
=============

- Cache size: top-K=64 sparsification → ~4GB instead of ~100GB
- KD numerical stability: Phase 2 unit test compares against PyTorch nn.KLDivLoss within 1e-4 absolute
- Teacher inference cost: Phase 1 cache amortizes one-time ~6h prep to <100ms/batch reads during training
- HumanEval miss: two-path fallback, widen corpus OR drop T from 4.0 to 2.0 (each adds ~1 week)

Architectural decisions
=======================

1. Top-K=64 cache (NOT full logits): DistilBERT/Distil-Qwen precedent
2. Cached teacher (NOT online): hyperparameter sweep cost < cache regen
3. Vanilla KD (NOT MiniLM intermediate-layer matching): teacher is Q4_K, intermediate activations aren't recoverable post-quantization
4. Matched tokenizer (Qwen2 151,936 vocab): strongest argument for Qwen2.5-Coder-0.5B-Instruct as init

5 AC-DISTILL-* criteria authored; cross-linked to SPEC-HF-PUBLISH-001 (used in Phase 6) and AUDIT-Q4K-SHAPE-001 (confirms teacher Q4_K is bit-correct, no re-export needed before distillation).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
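
The Phase 4 token budget and the Phase 5 thresholds can be checked with a few lines of arithmetic. In the sketch below the function names are hypothetical, and the pass@1 count assumes exactly one sample per HumanEval problem, which the commit message does not state. Multiplying the stated figures gives about 0.41B tokens rather than 1.6B. A 1.6B budget would correspond to roughly 32K tokens per step at 50K steps, or roughly 195K steps at 8192 tokens per step. The source does not say which figure is intended.

```rust
/// Tokens consumed by `steps` optimizer steps at `tokens_per_step` each.
fn total_tokens(steps: u64, tokens_per_step: u64) -> u64 {
    steps * tokens_per_step
}

/// Smallest number of solved problems that reaches `threshold` pass@1.
/// Assumption: one sample per problem, so pass@1 = solved / problems.
fn min_solved(problems: u32, threshold: f64) -> u32 {
    (problems as f64 * threshold).ceil() as u32
}
```

On the 164-problem set this puts the 15% acceptance threshold at 25 solved problems and the 25% ship goal at 41.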

… teacher (Refs PMAT-691)

Revision driven by storage-math sanity check + pmat work priority promotion:

1. Priority promoted to HIGH (pmat work edit on PMAT-683 + PMAT-684, plus new PMAT-691 for Phase 1 implementation).
2. Phase 1 redesigned from on-disk top-K=64 cache to online teacher logits provider. Storage math: 1.24B tokens × 64 entries × 6 bytes ≈ 476 GB, exceeds available NVMe budget. Top-K cache approach moves to Phase 1.5 as an optional in-memory ring-buffer optimization that hides teacher latency under student compute.
3. Effort totals: Phase 1 compute drops from 4-8h to <1h. Total epic eng stays ~70h but compute drops 45h → 40h.
4. New falsifier F-DISTILL-TEACHER-001: RealizarTeacher.logits_for_batch matches realizar's apr trace logits output within 1e-3 absolute error on a frozen 3-layer fixture.

Implementation: PMAT-691 work session started 2026-05-18. Phase 1 deliverables are: teacher_provider.rs module, RealizarTeacher wrapper, pipeline.rs::train() rewrite to use it, unit test against golden fixture.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
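
The storage math behind this revision, and the tolerance in the new falsifier, can be reproduced directly. The sketch below uses the commit's own figures (1.24B tokens, 64 entries, 6 bytes per entry, 1e-3 absolute error). The function names are hypothetical and decimal gigabytes are assumed.

```rust
/// Bytes needed for an on-disk top-K cache over `tokens` positions.
fn cache_bytes(tokens: u64, k: u64, bytes_per_entry: u64) -> u64 {
    tokens * k * bytes_per_entry
}

/// Convert bytes to decimal gigabytes (assumption: 1 GB = 1e9 bytes).
fn to_decimal_gb(bytes: u64) -> f64 {
    bytes as f64 / 1e9
}

/// Largest element-wise absolute difference between two logit vectors,
/// the quantity an F-DISTILL-TEACHER-001-style test would bound by 1e-3.
fn max_abs_error(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y).abs())
        .fold(0.0, f32::max)
}
```

Evaluating `cache_bytes(1_240_000_000, 64, 6)` gives 476,160,000,000 bytes, which matches the ≈ 476 GB quoted in the commit message.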

noahgift enabled auto-merge (squash) May 18, 2026 08:43

noahgift and others added 3 commits May 18, 2026 11:10

Merge branch 'main' into docs/spec-distill-001-epic-scoping

b349929

Merge branch 'main' into docs/spec-distill-001-epic-scoping

5431fe2

noahgift mentioned this pull request May 18, 2026

feat(distill): teacher logits provider abstraction (SPEC-DISTILL-001 Phase 1, PMAT-691) #1786

Merged

4 tasks

Merge branch 'main' into docs/spec-distill-001-epic-scoping

cc3c5f7

noahgift merged commit c247afb into main May 18, 2026
10 checks passed

noahgift deleted the docs/spec-distill-001-epic-scoping branch May 18, 2026 11:03

noahgift merged 5 commits into main from docs/spec-distill-001-epic-scoping
