docs(spec): SPEC-DISTILL-001 - distillation epic 6-phase plan (PMAT-683/684) #1785
Merged
…83/684)

Opens the distillation track that picks up MODEL-2 v2 (paiml/albor-370m-v2) from where the §88 stack-existence-proof ship left off.

What this spec scopes
=====================

Current state audit:
- DistillationLoss is REAL (KD = α·CE + (1-α)·T²·KL)
- CudaTransformerTrainer forward+backward is REAL (proven by §82 P2-A)
- realizar teacher inference is REAL (proven by SHIP-005 86.59% HumanEval)
- Pipeline orchestrator at aprender-train-distill/src/pipeline.rs:115 is STUB - uses build_synthetic_logits() instead of teacher.forward(); never calls CudaTransformerTrainer for the student.

Closing this gap is the epic.

6-phase plan
============

Phase 1 (3 days, 16-24h eng): `apr distill prepare` - realizar runs MODEL-1 teacher over the corpus, caches top-K=64 logits to disk. 100-batch test asserts cosine sim ≥ 0.999 against online realizar recompute.

Phase 2 (2 days, 16-24h eng): wire CudaTransformerTrainer to KD loss via new forward_backward_kd_batch(); replace synthetic-logits stub in pipeline.rs::train(). Unit test on toy student verifies loss monotone.

Phase 3 (1 day + 4h compute): 500-step E2E smoke on a 10K-batch slice. Falsifier F-DISTILL-SMOKE-001 - val_loss at step 500 < step 0.

Phase 4 (4h dispatch + 30h unattended compute): the v2 training run. 50K steps × 8192 tok/step = 1.6B tokens. Init from Qwen2.5-Coder-0.5B (matched tokenizer + arch family). Falsifiers F-DISTILL-V2-001/002 - val_loss < 3.0 AND HumanEval pass@1 ≥ 15%.

Phase 5 (5-8h gx10 compute): full 164-problem HumanEval discharge of PMAT-684. Acceptance threshold pass@1 ≥ 15%; ship-goal ≥ 25%.

Phase 6 (3h staging + 1h compute): publish v2 per SPEC-HF-PUBLISH-001. With v0.34.0+#1783 binary, companion files + model.safetensors alias are auto-emitted. Three-path verification (apr run + HF Transformers + llama-cli).

Total: ~70h eng + ~45h compute, ~10 days calendar.

Risk register
=============

- Cache size: top-K=64 sparsification → ~4GB instead of ~100GB
- KD numerical stability: Phase 2 unit test compares against PyTorch nn.KLDivLoss within 1e-4 absolute
- Teacher inference cost: Phase 1 cache amortizes one-time ~6h prep to <100ms/batch reads during training
- HumanEval miss: two-path fallback - widen corpus OR drop T from 4.0 to 2.0 (each adds ~1 week)

Architectural decisions
=======================

1. Top-K=64 cache (NOT full logits) - DistilBERT/Distil-Qwen precedent
2. Cached teacher (NOT online) - hyperparameter sweep cost < cache regen
3. Vanilla KD (NOT MiniLM intermediate-layer matching) - teacher is Q4_K, intermediate activations aren't recoverable post-quantization
4. Matched tokenizer (Qwen2 151,936 vocab) - strongest argument for Qwen2.5-Coder-0.5B-Instruct as init

5 AC-DISTILL-* criteria authored; cross-linked to SPEC-HF-PUBLISH-001 (used in Phase 6) and AUDIT-Q4K-SHAPE-001 (confirms teacher Q4_K is bit-correct, no re-export needed before distillation).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
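
The top-K=64 sparsification named in the risk register and in architectural decision 1 can be illustrated with a minimal sketch. It assumes one row of teacher logits as a plain `f32` slice; `top_k_logits` is a hypothetical helper written for this note, not a function from aprender or realizar, and a real implementation would likely avoid the full sort.

```rust
/// Keep the `k` largest logits of one vocabulary row as (token_id, logit)
/// pairs, sorted by descending logit.
/// Hypothetical helper for illustration; not part of aprender or realizar.
fn top_k_logits(logits: &[f32], k: usize) -> Vec<(u32, f32)> {
    // Pair every logit with its token id so the id survives the sort.
    let mut indexed: Vec<(u32, f32)> = logits
        .iter()
        .enumerate()
        .map(|(i, &v)| (i as u32, v))
        .collect();
    // `total_cmp` is a total order on f32, so NaN values cannot break the
    // comparison. The sort is stable: ties keep the lower token id first.
    indexed.sort_by(|a, b| b.1.total_cmp(&a.1));
    // Drop everything past the k-th entry (no-op when k >= vocab size).
    indexed.truncate(k);
    indexed
}
```

With K=64 against the 151,936-entry Qwen2 vocabulary, each token position keeps 64 (id, logit) pairs instead of the full row.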
… teacher (Refs PMAT-691)

Revision driven by storage-math sanity check + pmat work priority promotion:

1. Priority promoted to HIGH (pmat work edit on PMAT-683 + PMAT-684, plus new PMAT-691 for Phase 1 implementation).
2. Phase 1 redesigned from on-disk top-K=64 cache to online teacher logits provider. Storage math: 1.24B tokens × 64 entries × 6 bytes ≈ 476 GB, exceeds available NVMe budget. Top-K cache approach moves to Phase 1.5 as an optional in-memory ring-buffer optimization that hides teacher latency under student compute.
3. Effort totals: Phase 1 compute drops from 4-8h to <1h. Total epic eng stays ~70h but compute drops 45h → 40h.
4. New falsifier F-DISTILL-TEACHER-001 - RealizarTeacher.logits_for_batch matches realizar's apr trace logits output within 1e-3 absolute error on a frozen 3-layer fixture.

Implementation: PMAT-691 work session started 2026-05-18. Phase 1 deliverables are: teacher_provider.rs module, RealizarTeacher wrapper, pipeline.rs::train() rewrite to use it, unit test against golden fixture.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
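
The storage math in item 2 can be reproduced with a few lines of arithmetic. This is a sketch only: `cache_bytes` and `to_decimal_gb` are hypothetical names, and the split of the 6 bytes per entry (for example a 4-byte token id plus a 2-byte half-precision logit) is an assumption, since the commit message gives only the total.

```rust
/// Bytes needed to cache `top_k` entries per token at `bytes_per_entry` each.
/// Hypothetical helper for illustration only.
fn cache_bytes(tokens: u64, top_k: u64, bytes_per_entry: u64) -> u64 {
    tokens * top_k * bytes_per_entry
}

/// Convert a byte count to decimal gigabytes (1 GB = 10^9 bytes).
fn to_decimal_gb(bytes: u64) -> f64 {
    bytes as f64 / 1e9
}
```

Calling `to_decimal_gb(cache_bytes(1_240_000_000, 64, 6))` gives about 476.16, which matches the ≈ 476 GB figure quoted above.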
Summary
Opens the distillation track that picks up MODEL-2 v2 (`paiml/albor-370m-v2`) from where the §88 stack-existence-proof ship left off. The 370M-v1 model honestly produces gibberish at val_perplexity≈102; the v2 path distills from MODEL-1's `paiml/qwen2.5-coder-7b-apache-q4k-v1` teacher into a student that actually hits a usable HumanEval pass@1.
Current state audit
- `DistillationLoss` (KD = α·CE + (1-α)·T²·KL): `aprender-train/src/distill/loss.rs`
- `CudaTransformerTrainer` forward+backward
- `pipeline.rs:115` uses `build_synthetic_logits()` instead of `teacher.forward()`

The gap to close is the orchestrator stub. Closing it is this epic.
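
To make the formula KD = α·CE + (1-α)·T²·KL concrete, here is a minimal single-position sketch. It is not the `DistillationLoss` in `aprender-train/src/distill/loss.rs`; the function names, the signature, and the choice of KL(teacher ‖ student) over temperature-softened distributions (the usual convention for vanilla KD) are assumptions made for illustration.

```rust
/// Numerically stable log-softmax of `logits / temperature`.
fn log_softmax(logits: &[f32], temperature: f32) -> Vec<f32> {
    let scaled: Vec<f32> = logits.iter().map(|&z| z / temperature).collect();
    // Subtract the maximum before exponentiating to avoid overflow.
    let max = scaled.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let log_sum_exp = scaled.iter().map(|&z| (z - max).exp()).sum::<f32>().ln() + max;
    scaled.iter().map(|&z| z - log_sum_exp).collect()
}

/// KD = alpha * CE + (1 - alpha) * T^2 * KL for one token position.
/// Assumed convention: CE uses the hard label at temperature 1, and
/// KL is KL(teacher_T || student_T) on temperature-softened distributions.
fn kd_loss(student: &[f32], teacher: &[f32], label: usize, alpha: f32, temperature: f32) -> f32 {
    assert_eq!(student.len(), teacher.len(), "logit rows must share a vocabulary");
    // Hard-label cross-entropy of the student.
    let ce = -log_softmax(student, 1.0)[label];
    // Soft-target term: sum over the vocabulary of p_t * (log p_t - log p_s).
    let log_p_t = log_softmax(teacher, temperature);
    let log_p_s = log_softmax(student, temperature);
    let kl: f32 = log_p_t
        .iter()
        .zip(&log_p_s)
        .map(|(&lt, &ls)| lt.exp() * (lt - ls))
        .sum();
    alpha * ce + (1.0 - alpha) * temperature * temperature * kl
}
```

The T² factor keeps the gradient magnitude of the soft-target term comparable across temperatures. With α = 1 the loss reduces to plain cross-entropy, and with identical student and teacher logits the KL term vanishes.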
6-phase plan
apr distill prepare)
Key architectural decisions
`Qwen2.5-Coder-0.5B-Instruct` as init.
Cross-references
Test plan
`docs/specifications/aprender-train/distillation-epic-spec.md`
🤖 Generated with Claude Code