
docs(spec): SPEC-DISTILL-001 — distillation epic 6-phase plan (PMAT-683/684) #1785

Merged: noahgift merged 5 commits into main from docs/spec-distill-001-epic-scoping on May 18, 2026

@noahgift

Summary

Opens the distillation track that picks up MODEL-2 v2 (paiml/albor-370m-v2) from where the §88 stack-existence-proof ship left off. The 370M-v1 model produces gibberish at val_perplexity≈102; the v2 path distills from MODEL-1's paiml/qwen2.5-coder-7b-apache-q4k-v1 teacher into a student that is intended to reach a usable HumanEval pass@1.

Current state audit

| Component | State |
|---|---|
| DistillationLoss (KD = α·CE + (1-α)·T²·KL) | REAL: aprender-train/src/distill/loss.rs |
| CudaTransformerTrainer forward+backward | REAL: proven by §82 P2-A 5000-step run |
| realizar teacher inference | REAL: proven by SHIP-005 86.59% HumanEval |
| Pipeline orchestrator | STUB: pipeline.rs:115 uses build_synthetic_logits() instead of teacher.forward() |

The gap to close is the orchestrator stub. Closing it is this epic.
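
The KD objective in the table can be sketched in a few lines. This is a minimal pure-Python illustration, not the code in loss.rs: the function names, the default `alpha=0.5`, and the KL direction (teacher relative to student, the usual choice in knowledge distillation) are assumptions, and the default `temperature=4.0` is taken from the risk register's mention of T = 4.0.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, target, alpha=0.5, temperature=4.0):
    """KD = alpha * CE + (1 - alpha) * T^2 * KL(teacher_T || student_T)."""
    # Hard-label cross-entropy, computed at temperature 1.
    ce = -math.log(softmax(student_logits)[target])
    # Soft-label KL between the temperature-softened distributions.
    # Assumes no probability underflows to exactly zero on the student side.
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = sum(pt * (math.log(pt) - math.log(ps))
             for pt, ps in zip(p_t, p_s) if pt > 0.0)
    # T^2 keeps the soft-target gradient scale comparable across temperatures.
    return alpha * ce + (1.0 - alpha) * temperature ** 2 * kl
```

When student and teacher logits are identical the KL term vanishes and the loss reduces to `alpha * CE`, which is a convenient sanity check for any implementation.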

6-phase plan

| Phase | Eng | Compute | Calendar | Falsifier |
|---|---|---|---|---|
| 1: Teacher logits cache (apr distill prepare) | 16-24h | 4-8h | 3 days | F-DISTILL-PREP-001: cache vs realizar cos sim ≥ 0.999 |
| 2: Wire KD-loss to CudaTransformerTrainer (replace stub) | 16-24h | <1h | 2 days | F-DISTILL-KD-001: loss monotone over 100 steps |
| 3: 500-step E2E smoke | 8h | 4h | 1 day | F-DISTILL-SMOKE-001: val_loss step 500 < step 0 |
| 4: V2 training run (50K steps × 8192 tok/step = 1.6B tokens) | 4h | 30h unattended | 2-3 days | F-DISTILL-V2-001/002: val_loss < 3.0 AND pass@1 ≥ 15% |
| 5: Full HumanEval discharge (PMAT-684) | 4h | 5-8h gx10 | 1 day | F-HUMANEVAL-V2-001/002: pass@1 ≥ 15% / ≥ 25% |
| 6: Publish v2 per SPEC-HF-PUBLISH-001 | 3h | 1h | 0.5 day | three-path verification |
| Total | ~70h eng | ~45h compute | ~10 days | |
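
The Phase 4 token budget can be checked with a few lines of arithmetic. As written, 50K steps at 8192 tokens per step comes to about 0.41B tokens rather than 1.6B, so one of the three figures presumably needs adjusting. The sketch below only does the arithmetic and does not assume which figure is the intended one; the function names are illustrative.

```python
import math

def token_budget(steps, tokens_per_step):
    """Total tokens consumed by a run of `steps` optimizer steps."""
    return steps * tokens_per_step

def steps_for_budget(total_tokens, tokens_per_step):
    """Steps needed to consume at least `total_tokens`."""
    return math.ceil(total_tokens / tokens_per_step)

# Figures quoted in the Phase 4 row.
planned = token_budget(50_000, 8_192)
print(f"{planned:,} tokens (~{planned / 1e9:.2f}B)")  # 409,600,000 tokens (~0.41B)

# What a 1.6B-token budget implies when one of the other figures is held fixed.
print(steps_for_budget(1_600_000_000, 8_192))  # 195313 steps at 8192 tok/step
print(1_600_000_000 // 50_000)                 # 32000 tok/step at 50K steps
```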

Key architectural decisions

  1. Top-K=64 cache (not full logits) — DistilBERT/Distil-Qwen precedent. ~4GB instead of ~100GB.
  2. Cached teacher (not online) — hyperparameter sweep cost < single cache regen.
  3. Vanilla KD (not MiniLM intermediate-layer matching) — teacher is Q4_K, intermediate activations aren't recoverable post-quantization.
  4. Matched tokenizer (Qwen2 151,936 vocab) — strongest argument for Qwen2.5-Coder-0.5B-Instruct as init.
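
Decision 1 amounts to keeping only the 64 largest teacher logits per token position. A minimal sketch of that sparsification follows, assuming a plain list of floats per position; the function names are hypothetical and the on-disk layout is not specified here. `retained_mass` is an added diagnostic showing how much of the teacher's softmax mass the kept entries cover.

```python
import heapq
import math

def top_k_logits(logits, k=64):
    """Return the k largest logits as (token_id, logit) pairs, highest first."""
    return heapq.nlargest(k, enumerate(logits), key=lambda pair: pair[1])

def retained_mass(logits, k=64):
    """Fraction of the full softmax probability mass covered by the top-k entries."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return sum(exps[token_id] for token_id, _ in top_k_logits(logits, k)) / total

# Toy 4-token vocabulary; the real vocabulary has 151,936 entries.
toy = [0.0, 3.0, 1.0, 2.0]
print(top_k_logits(toy, k=2))  # [(1, 3.0), (3, 2.0)]
```

If `retained_mass` stays close to 1.0 on real teacher outputs, little soft-target signal is lost by discarding the tail.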

Cross-references

Test plan

  • Spec authored at docs/specifications/aprender-train/distillation-epic-spec.md
  • Cross-links resolve (ship-model-2 §84.5, SPEC-HF-PUBLISH-001, AUDIT-Q4K-SHAPE-001, roadmap PMAT-683/684)
  • All 5 AC-DISTILL-* acceptance criteria pin a measurable falsifier
  • Phase 1 implementation kick-off in a follow-up PR

🤖 Generated with Claude Code

…83/684)

Opens the distillation track that picks up MODEL-2 v2 (paiml/albor-370m-v2)
from where the §88 stack-existence-proof ship left off.

What this spec scopes
=====================

Current state audit:
- DistillationLoss is REAL (KD = α·CE + (1-α)·T²·KL)
- CudaTransformerTrainer forward+backward is REAL (proven by §82 P2-A)
- realizar teacher inference is REAL (proven by SHIP-005 86.59% HumanEval)
- Pipeline orchestrator at aprender-train-distill/src/pipeline.rs:115 is STUB
  — uses build_synthetic_logits() instead of teacher.forward(); never calls
  CudaTransformerTrainer for the student. Closing this gap is the epic.

6-phase plan
============

Phase 1 (3 days, 16-24h eng): `apr distill prepare` — realizar runs MODEL-1
  teacher over the corpus, caches top-K=64 logits to disk. 100-batch test
  asserts cosine sim ≥ 0.999 against online realizar recompute.

Phase 2 (2 days, 16-24h eng): wire CudaTransformerTrainer to KD loss via
  new forward_backward_kd_batch(); replace synthetic-logits stub in
  pipeline.rs::train(). Unit test on toy student verifies loss monotone.

Phase 3 (1 day + 4h compute): 500-step E2E smoke on a 10K-batch slice.
  Falsifier F-DISTILL-SMOKE-001 — val_loss at step 500 < step 0.

Phase 4 (4h dispatch + 30h unattended compute): the v2 training run.
  50K steps × 8192 tok/step = 1.6B tokens. Init from Qwen2.5-Coder-0.5B
  (matched tokenizer + arch family). Falsifiers F-DISTILL-V2-001/002 —
  val_loss < 3.0 AND HumanEval pass@1 ≥ 15%.

Phase 5 (5-8h gx10 compute): full 164-problem HumanEval discharge of
  PMAT-684. Acceptance threshold pass@1 ≥ 15%; ship-goal ≥ 25%.

Phase 6 (3h staging + 1h compute): publish v2 per SPEC-HF-PUBLISH-001.
  With v0.34.0+#1783 binary, companion files + model.safetensors alias
  are auto-emitted. Three-path verification (apr run + HF Transformers
  + llama-cli).

Total: ~70h eng + ~45h compute, ~10 days calendar.
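
The Phase 1 to Phase 3 falsifiers are simple numeric predicates. The sketch below shows one way to express them, assuming logits and losses arrive as plain Python lists; the function names are hypothetical, and "monotone" is read here as non-increasing, which the spec does not state explicitly.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def cache_matches(cached_batches, recomputed_batches, threshold=0.999):
    """Phase 1 style check: every cached batch must clear the threshold."""
    return all(cosine_similarity(c, r) >= threshold
               for c, r in zip(cached_batches, recomputed_batches))

def is_non_increasing(losses):
    """Phase 2 style check: the loss never rises between consecutive steps."""
    return all(later <= earlier for earlier, later in zip(losses, losses[1:]))

def smoke_passes(val_loss_start, val_loss_end):
    """Phase 3 style check: val_loss at the last step is below step 0."""
    return val_loss_end < val_loss_start
```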

Risk register
=============

- Cache size: top-K=64 sparsification → ~4GB instead of ~100GB
- KD numerical stability: Phase 2 unit test compares against PyTorch
  nn.KLDivLoss within 1e-4 absolute
- Teacher inference cost: Phase 1 cache amortizes one-time ~6h prep
  to <100ms/batch reads during training
- HumanEval miss: two-path fallback — widen corpus OR drop T from
  4.0 to 2.0 (each adds ~1 week)
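
To make the temperature fallback concrete, the sketch below shows how lowering T sharpens the teacher's soft targets. The logits are made-up illustration values, not teacher output, and the helper names are hypothetical.

```python
import math

def softened(logits, temperature):
    """Softmax of logits / temperature (numerically stable)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

teacher_logits = [6.0, 2.0, 1.0, -1.0]  # made-up values for illustration
for t in (1.0, 2.0, 4.0):
    probs = softened(teacher_logits, t)
    print(f"T={t}: top prob={max(probs):.3f}, entropy={entropy(probs):.3f}")
```

At T = 4.0 the distribution is flatter and carries more information about non-argmax tokens; at T = 2.0 it sits closer to the hard label, which is the trade-off the fallback path would be exploring.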

Architectural decisions
=======================

1. Top-K=64 cache (NOT full logits) — DistilBERT/Distil-Qwen precedent
2. Cached teacher (NOT online) — hyperparameter sweep cost < cache regen
3. Vanilla KD (NOT MiniLM intermediate-layer matching) — teacher is Q4_K,
   intermediate activations aren't recoverable post-quantization
4. Matched tokenizer (Qwen2 151,936 vocab) — strongest argument for
   Qwen2.5-Coder-0.5B-Instruct as init

5 AC-DISTILL-* criteria authored; cross-linked to SPEC-HF-PUBLISH-001
(used in Phase 6) and AUDIT-Q4K-SHAPE-001 (confirms teacher Q4_K is
bit-correct, no re-export needed before distillation).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift enabled auto-merge (squash) on May 18, 2026 08:43
noahgift and others added 3 commits May 18, 2026 11:10
… teacher (Refs PMAT-691)

Revision driven by storage-math sanity check + pmat work priority
promotion:

1. Priority promoted to HIGH (pmat work edit on PMAT-683 + PMAT-684,
   plus new PMAT-691 for Phase 1 implementation).

2. Phase 1 redesigned from on-disk top-K=64 cache to online teacher
   logits provider. Storage math: 1.24B tokens × 64 entries × 6 bytes
   ≈ 476 GB, exceeds available NVMe budget. Top-K cache approach moves
   to Phase 1.5 as an optional in-memory ring-buffer optimization that
   hides teacher latency under student compute.

3. Effort totals: Phase 1 compute drops from 4-8h to <1h. Total epic
   eng stays ~70h but compute drops 45h → 40h.

4. New falsifier F-DISTILL-TEACHER-001 — RealizarTeacher.logits_for_batch
   matches realizar's apr trace logits output within 1e-3 absolute error
   on a frozen 3-layer fixture.
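
The storage estimate in item 2 is a direct product of the three quoted figures. A short check, using decimal gigabytes and a hypothetical helper name:

```python
def topk_cache_bytes(tokens, k, bytes_per_entry):
    """On-disk size of a top-k logits cache, ignoring any container overhead."""
    return tokens * k * bytes_per_entry

size = topk_cache_bytes(tokens=1_240_000_000, k=64, bytes_per_entry=6)
print(f"{size / 1e9:.1f} GB")  # 476.2 GB
```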

Implementation: PMAT-691 work session started 2026-05-18. Phase 1
deliverables are: teacher_provider.rs module, RealizarTeacher wrapper,
pipeline.rs::train() rewrite to use it, unit test against golden fixture.
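
The golden-fixture test for F-DISTILL-TEACHER-001 reduces to a maximum absolute error comparison. A minimal sketch follows, assuming both sets of logits are flattened to plain lists and treating "within 1e-3" as inclusive; the function names are hypothetical and not part of teacher_provider.rs.

```python
def max_abs_error(actual, expected):
    """Largest element-wise absolute difference between two logit vectors."""
    if len(actual) != len(expected):
        raise ValueError("logit vectors must have the same length")
    return max(abs(a - e) for a, e in zip(actual, expected))

def within_tolerance(actual, expected, tol=1e-3):
    """True if no logit differs from the reference by more than tol."""
    return max_abs_error(actual, expected) <= tol
```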

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift merged commit c247afb into main on May 18, 2026 (10 checks passed)
noahgift deleted the docs/spec-distill-001-epic-scoping branch on May 18, 2026 11:03