From 2cba93bee66a63dff27fe4a8f8afc81e6fe431f3 Mon Sep 17 00:00:00 2001 From: Noah Gift Date: Mon, 18 May 2026 10:41:28 +0200 Subject: [PATCH 1/2] =?UTF-8?q?docs(spec):=20SPEC-DISTILL-001=20=E2=80=94?= =?UTF-8?q?=20distillation=20epic=206-phase=20plan=20(PMAT-683/684)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Opens the distillation track that picks up MODEL-2 v2 (paiml/albor-370m-v2) from where the §88 stack-existence-proof ship left off. What this spec scopes ===================== Current state audit: - DistillationLoss is REAL (KD = α·CE + (1-α)·T²·KL) - CudaTransformerTrainer forward+backward is REAL (proven by §82 P2-A) - realizar teacher inference is REAL (proven by SHIP-005 86.59% HumanEval) - Pipeline orchestrator at aprender-train-distill/src/pipeline.rs:115 is STUB — uses build_synthetic_logits() instead of teacher.forward(); never calls CudaTransformerTrainer for the student. Closing this gap is the epic. 6-phase plan ============ Phase 1 (3 days, 16-24h eng): `apr distill prepare` — realizar runs MODEL-1 teacher over the corpus, caches top-K=64 logits to disk. 100-batch test asserts cosine sim ≥ 0.999 against online realizar recompute. Phase 2 (2 days, 16-24h eng): wire CudaTransformerTrainer to KD loss via new forward_backward_kd_batch(); replace synthetic-logits stub in pipeline.rs::train(). Unit test on toy student verifies loss monotone. Phase 3 (1 day + 4h compute): 500-step E2E smoke on a 10K-batch slice. Falsifier F-DISTILL-SMOKE-001 — val_loss at step 500 < step 0. Phase 4 (4h dispatch + 30h unattended compute): the v2 training run. 50K steps × 8192 tok/step ≈ 0.41B tokens. Init from Qwen2.5-Coder-0.5B (matched tokenizer + arch family). Falsifiers F-DISTILL-V2-001/002 — val_loss < 3.0 AND HumanEval pass@1 ≥ 15%. Phase 5 (5-8h gx10 compute): full 164-problem HumanEval discharge of PMAT-684. Acceptance threshold pass@1 ≥ 15%; ship-goal ≥ 25%. 
Phase 6 (3h staging + 1h compute): publish v2 per SPEC-HF-PUBLISH-001. With v0.34.0+#1783 binary, companion files + model.safetensors alias are auto-emitted. Three-path verification (apr run + HF Transformers + llama-cli). Total: ~70h eng + ~45h compute, ~10 days calendar. Risk register ============= - Cache size: top-K=64 sparsification → ~4GB instead of ~100GB - KD numerical stability: Phase 2 unit test compares against PyTorch nn.KLDivLoss within 1e-4 absolute - Teacher inference cost: Phase 1 cache amortizes one-time ~6h prep to <100ms/batch reads during training - HumanEval miss: two-path fallback — widen corpus OR drop T from 4.0 to 2.0 (each adds ~1 week) Architectural decisions ======================= 1. Top-K=64 cache (NOT full logits) — DistilBERT/Distil-Qwen precedent 2. Cached teacher (NOT online) — hyperparameter sweep cost < cache regen 3. Vanilla KD (NOT MiniLM intermediate-layer matching) — teacher is Q4_K, intermediate activations aren't recoverable post-quantization 4. Matched tokenizer (Qwen2 151,936 vocab) — strongest argument for Qwen2.5-Coder-0.5B-Instruct as init 5 AC-DISTILL-* criteria authored; cross-linked to SPEC-HF-PUBLISH-001 (used in Phase 6) and AUDIT-Q4K-SHAPE-001 (confirms teacher Q4_K is bit-correct, no re-export needed before distillation). 
Co-Authored-By: Claude Opus 4.7 --- .../aprender-train/distillation-epic-spec.md | 176 ++++++++++++++++++ 1 file changed, 176 insertions(+) create mode 100644 docs/specifications/aprender-train/distillation-epic-spec.md diff --git a/docs/specifications/aprender-train/distillation-epic-spec.md b/docs/specifications/aprender-train/distillation-epic-spec.md new file mode 100644 index 000000000..6a6582c79 --- /dev/null +++ b/docs/specifications/aprender-train/distillation-epic-spec.md @@ -0,0 +1,176 @@ +# Specification: Distillation Epic — paiml/albor-370m-v2 + +**Document ID:** SPEC-DISTILL-001 +**Version:** 1.0.0 +**Status:** Live — opens the distillation track that picks up where MODEL-2's §88 stack-existence-proof ship left off +**Parent:** [ship-model-2-spec.md §84.5](./ship-model-2-spec.md) +**Triggers:** PMAT-683 (P2-D: True distillation from MODEL-1), PMAT-684 (P1-B: HumanEval pass@1) +**First applied:** TBD — see Phase 1 below + +## Purpose + +MODEL-2's v1 ship (`paiml/albor-370m-v1`, val_loss=4.6227, 2026-05-18) proved the pure-Rust Sovereign AI Stack runs end-to-end. The model is honestly framed as a stack-existence-proof — it produces gibberish at val_perplexity≈102 and is explicitly NOT a production code-completion model. + +The distillation epic ships **MODEL-2 v2** — `paiml/albor-370m-v2` (or v1.1.0) — a ~370M-param student distilled from the **MODEL-1 teacher** (`paiml/qwen2.5-coder-7b-apache-q4k-v1`) that actually hits a usable HumanEval pass@1. Target: pass@1 ≥ 25% (loose first-target; competitive 0.5B is ~30-40%; the upstream 7B teacher is at ~91%). + +The epic's strategic value: it converts the stack from "runs end-to-end" → "produces models worth using". That's the difference between an existence proof and a product. 
+ +## Current state (2026-05-18) + +The distillation infrastructure is **scaffolded but stubbed**: + +| Component | File | State | +|---|---|---| +| CLI: `apr distill ...` | `crates/apr-cli/src/commands/distill.rs:1119` | Stub — writes a metadata-only "pending_download" manifest | +| Pipeline orchestrator | `crates/aprender-train-distill/src/pipeline.rs:115-160` | Stub — uses **synthetic logits derived from weight tensor slices** instead of actual forward passes through teacher + student | +| KD loss | `crates/aprender-train/src/distill/loss.rs:34` (`DistillationLoss`) | **Real** — KD loss = α · CE(student, hard_target) + (1-α) · T² · KL(softmax(student/T), softmax(teacher/T)) | +| HF-pipeline KD variant | `crates/aprender-train/src/hf_pipeline/distillation/loss.rs:24` | **Real** — used for distillation under the hf_pipeline path | +| Student forward+backward | `crates/aprender-train/src/train/transformer_trainer/cuda_trainer.rs:1256` (`forward_logits`), `:2378` (`forward_backward_batch`) | **Real** — `CudaTransformerTrainer` proven via MODEL-2 §82 P2-A 5000-step run | +| Teacher inference | `realizar` (aprender-serve) — `apr run` path | **Real** — proven via MODEL-1 SHIP-005 HumanEval=86.59% | +| GGUF / SafeTensors weight loading | `aprender-core::format::converter` | **Real** — v0.34.0 layout + Q4_K paths shipped | + +**The gap**: pipeline.rs's `train()` method constructs `teacher_logits` from `build_synthetic_logits(&teacher_weights, batch_size, num_classes=32)` — a flat 32-class slice of weight bytes. It NEVER calls realizar to run the teacher, NEVER calls `CudaTransformerTrainer.forward_backward_batch` on the student. The loss math is exercised on synthetic data; the gradient updates do nothing meaningful. + +Closing this gap is the epic. + +## Phased plan + +### Phase 1 — Teacher logits service via realizar (PMAT-683 P1, ~3 days) + +**Goal**: `apr distill` invokes realizar to compute teacher logits over the training corpus and caches them to disk. 
+ +Why a separate phase: teacher forward over a 7B model is expensive (~3-5s/batch on RTX 4090 for batch=8 seq=512). Recomputing per training step is wasteful. The HF community standard (DistilBERT, MiniLM, Distil-Qwen) caches teacher logits once per corpus and replays them across training epochs. We do the same. + +**Deliverables:** +1. `apr distill prepare --teacher --dataset --out ` — new subcommand. Iterates batches, calls `realizar::Model::forward_logits(input_ids) -> Vec`, writes per-batch `.logits.bin` files (top-K + indices to reduce storage; K=64 is standard). +2. New module `aprender-train-distill/src/teacher_cache.rs` — defines on-disk format (header magic `APRLOG\0`, per-batch records `(batch_idx, seq_len, top_k_logits[K], top_k_indices[K])`). +3. Integration test: prepare 100-batch slice, verify cache file size + read-back matches realizar's online compute within float-roundoff. + +**Effort estimate**: 16-24 hours engineering + 4-8h compute (1B-token corpus prep takes ~6h on RTX 4090 for a 7B teacher at batch=8). + +**Falsifier**: F-DISTILL-PREP-001 — `apr distill prepare` produces a cache where, for any random batch, calling `realizar` online produces the same top-K within cosine sim ≥ 0.999. + +### Phase 2 — Student forward+backward wired to KD loss (PMAT-683 P2, ~2 days) + +**Goal**: Replace pipeline.rs's `build_synthetic_logits(...)` with a real `CudaTransformerTrainer.forward_backward_kd_batch(...)` call. + +**Deliverables:** +1. Extend `CudaTransformerTrainer` (`crates/aprender-train/src/train/transformer_trainer/cuda_trainer.rs`) with `forward_backward_kd_batch(batch: &LMBatch, teacher_logits: &TeacherLogitsBatch) -> f32`. The trainer already has `forward_backward_batch`; the KD variant computes the combined loss = α·CE + (1-α)·T²·KL via `DistillationLoss` instead of CE alone, then back-props. +2. 
Rewrite `aprender-train-distill/src/pipeline.rs::train()` to (a) load teacher cache, (b) iterate corpus batches alongside cached teacher logits, (c) call `forward_backward_kd_batch`, (d) optimizer step. +3. Unit test: 10-step run on a 3-layer toy student + 4-layer toy teacher, verify loss strictly decreases. + +**Effort estimate**: 16-24 hours engineering. No compute beyond the unit test. + +**Falsifier**: F-DISTILL-KD-001 — student loss after 100 steps < student loss at step 0 (sanity); F-DISTILL-KD-002 — student logits cosine sim to teacher logits increases monotonically over the first 1000 steps (modulo noise). + +### Phase 3 — End-to-end smoke (~1 day + 4h compute) + +**Goal**: Run `apr distill` for 500 steps on a 10K-batch subset of the qwen-v3 corpus, with the MODEL-1 teacher cache. Verify it doesn't crash, val_loss decreases, output passes `apr qa`. + +**Deliverables**: +1. `apr distill --teacher paiml/qwen2.5-coder-7b-apache-q4k-v1 --init Qwen/Qwen2.5-Coder-0.5B-Instruct --dataset qwen-v3-10k --steps 500 --out runs/distill-smoke/` — end-to-end command. +2. Smoke evidence in `evidence/distill-smoke-/` — launch.log, per-step metrics, final val_loss. +3. Spec amendment to ship-model-2-spec.md §85 documenting the smoke. + +**Effort estimate**: 8h engineering + 4h gx10 compute. + +**Falsifier**: F-DISTILL-SMOKE-001 — val_loss at step 500 < val_loss at step 0. + +### Phase 4 — Distillation training run for the v2 ship (~10 days compute, mostly unattended) + +**Goal**: Long-running distillation training to produce the v2 student checkpoint. 
+ +**Hyperparameters** (Chinchilla-adjacent, tuned from §82 P2-A): +- Init: `Qwen/Qwen2.5-Coder-0.5B-Instruct` (same as v1) +- Teacher: `paiml/qwen2.5-coder-7b-apache-q4k-v1` (Q4_K, our distilled teacher) +- Corpus: qwen-v3 (1.24B tokens, the §77 5g.1 corpus) — full pass for the first time +- Steps: ~50K (≈ 0.41B tokens consumed at batch=16 seq=512 = 8192 tokens/step) +- LR: 1.5e-5 → 5e-7 cosine +- T (KD temperature): 4.0 (DistilBERT default; tune to 2.0 if loss curve plateaus early) +- α (CE weight): 0.3 (per Hinton et al. 2015 — KD signal dominates when teacher is reliable) +- Wall time on RTX 4090: 50K steps × ~1.5 s/step (student forward+backward + teacher cache read) ≈ 21 hours active training. With teacher cache prep + checkpointing overhead ≈ 30 hours total. + +**Deliverables**: +1. Best checkpoint `runs/albor-370m-v2/ckpt/epoch-N.apr`. +2. Per-epoch eval against `apr eval humaneval` (sample 20 problems mid-run, full 164 at end). +3. Spec amendment §86 (re-using the deprecated §86 slot) documenting the v2 run. + +**Effort estimate**: 4h dispatch + setup + 30h compute (unattended on gx10 or lambda-labs). + +**Falsifier**: F-DISTILL-V2-001 — best val_loss < 3.0 (would unblock PMAT-684 HumanEval per the audit Rec #3); F-DISTILL-V2-002 — HumanEval pass@1 ≥ 15% (loose first-target). + +### Phase 5 — HumanEval discharge (PMAT-684, ~1 day) + +**Goal**: Run full 164-problem HumanEval on the v2 best checkpoint, verify pass@1 against the target. + +**Deliverables**: +1. `apr eval humaneval --model runs/albor-370m-v2/ckpt/epoch-N.apr --samples 164 --temperature 0.2 --top-p 0.95 --gpu cuda` — run. +2. Evidence at `evidence/distill-v2-humaneval-/`. +3. Discharge AC-SHIP2-005 in ship-model-2-spec.md. + +**Effort estimate**: 5-8h gx10 compute (sequential through 164 problems, 16 samples each for pass@1). 
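With several samples per problem, pass@1 is usually reported through the unbiased pass@k estimator, `1 - C(n-c, k) / C(n, k)`, averaged over problems. The sketch below assumes that estimator; the spec does not say how `apr eval humaneval` aggregates samples, and the function names `pass_at_k` and `mean_pass_at_k` are hypothetical.

```rust
/// Unbiased pass@k estimate for one problem: 1 - C(n - c, k) / C(n, k),
/// where n = samples drawn and c = samples that passed the tests.
/// Evaluated as a running product so no large binomials are formed.
/// Assumption: this is the standard estimator, not a confirmed `apr` detail.
fn pass_at_k(n: u32, c: u32, k: u32) -> f64 {
    assert!(k >= 1 && k <= n && c <= n);
    if n - c < k {
        // Fewer than k failing samples: every k-subset contains a pass.
        return 1.0;
    }
    let mut all_fail = 1.0_f64;
    for i in (n - c + 1)..=n {
        all_fail *= 1.0 - f64::from(k) / f64::from(i);
    }
    1.0 - all_fail
}

/// Benchmark score: mean of the per-problem estimates.
/// Each tuple is (samples drawn, samples passed) for one problem.
fn mean_pass_at_k(results: &[(u32, u32)], k: u32) -> f64 {
    let total: f64 = results.iter().map(|&(n, c)| pass_at_k(n, c, k)).sum();
    total / results.len() as f64
}
```

For k=1 the estimate reduces to c/n per problem, so 16 samples per problem tighten the variance of pass@1 without changing what it measures.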
+ +**Falsifier**: F-HUMANEVAL-V2-001 — pass@1 ≥ 15% (acceptance threshold; under is "the corpus didn't transfer; need to widen corpus per P2-C lineage"); F-HUMANEVAL-V2-002 — pass@1 ≥ 25% (loose ship-goal threshold). + +### Phase 6 — Publish v2 (~3h) + +**Goal**: Re-stamp the best checkpoint with provenance + tokenizer (per SPEC-HF-PUBLISH-001) and ship to `paiml/albor-370m-v2` (or `paiml/albor-370m-v1.1.0`). + +**Deliverables**: Follows SPEC-HF-PUBLISH-001 end-to-end — `apr stamp --tokenizer` → `apr export --quantize int4` → `apr publish`. With the v0.34.0+1783 (defect 6) binary, everything is autodetected from the staging directory. Three-path verification: `apr run` + HF Transformers + llama-cli. + +**Effort estimate**: 3h staging + 1h compute (Q4_K export + upload). + +## Total effort + timeline + +| Phase | Engineering | Compute | Calendar | +|---|---|---|---| +| 1: Teacher logits cache | 16-24h | 4-8h | 3 days | +| 2: KD-wired forward+backward | 16-24h | <1h | 2 days | +| 3: E2E smoke (500 steps) | 8h | 4h | 1 day | +| 4: V2 training run | 4h | 30h (unattended) | 2-3 days wall | +| 5: HumanEval discharge | 4h | 5-8h | 1 day | +| 6: Publish v2 | 3h | 1h | 0.5 day | +| **Total** | **~70h** | **~45h** | **~10 days** | + +This matches PMAT-683's "16-40h. Δship +10. P=25%" original estimate when scaled to the realistic full epic; PMAT-683 was authored before §82 P2-A landed the trainer infrastructure that Phase 2 reuses. + +## Risk register + +| Risk | Mitigation | +|---|---| +| Teacher logits cache is too large (~100 GB for full qwen-v3) | Top-K=64 sparsification reduces by ~25× → ~4 GB. Stream from disk per batch. | +| Student forward+backward + KD has unmeasured CUDA bugs that only surface at scale | Phase 3 smoke catches anything that surfaces in the first 500 steps. Phase 4 has per-epoch sanity (val_loss monotone + checkpoint preserved). 
| +| HumanEval pass@1 falls below 15% at end of Phase 4 → ship is blocked | Two-path fallback: (a) widen corpus per the P2-C lineage (4× corpus), retrain Phase 4 from scratch; (b) drop KD temperature from 4.0 → 2.0 to sharpen teacher distribution. Both add ~1 week each. | +| KD loss is mathematically wrong or numerically unstable (e.g., logsumexp drift, FP16 overflow on KL) | Phase 2 unit test compares against a Python reference (PyTorch nn.KLDivLoss) within 1e-4 absolute error. | +| Realizar inference of the 7B teacher mid-training is too slow (~5s/batch) — burns 50K × 5s ≈ 70h of teacher time on top of training | Phase 1 cache amortizes this once. The cache prep itself takes ~6h, but reads at <100ms/batch during training. | + +## Open architectural decisions + +1. **Top-K cache vs full logits**: top-K=64 trades fidelity for storage. KD literature (DistilBERT, Distil-Qwen) shows pass@1 deltas from K=32 to K=full are within noise for code corpora. We start with K=64; if Phase 3 smoke shows loss-curve degradation vs a comparison full-logits batch, bump K. +2. **Online teacher vs cached**: cached (decided). Online would let temperature/α be tuned mid-training without re-prep cost; cached requires re-prep per α/T change. The hyperparameter sweep cost is < single cache regen, so cache wins. +3. **Distillation algorithm variant**: vanilla KD (Hinton 2015), NOT MiniLM (which adds attention-distillation losses) or TinyBERT (intermediate-layer matching). MODEL-1 is a Q4_K teacher — its FP16 intermediate activations aren't recoverable post-quantization. Vanilla KD on logits-only is the practical match. +4. **Tokenizer**: same Qwen2 vocab (151,936 tokens) as v1 — teacher + student share vocab, no token-alignment loss. This is the strongest argument for using `Qwen2.5-Coder-0.5B-Instruct` as init: matched tokenizer + matched arch family. 
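Decision 3 settles on logits-only KD, so the whole objective fits in a few lines. The following is a minimal CPU sketch for a single position, assuming the KL term is taken as KL(teacher ‖ student) on temperature-softened distributions (the PyTorch `nn.KLDivLoss` input/target convention). It is not the `DistillationLoss` implementation; `log_softmax` and `kd_loss` are hypothetical names.

```rust
/// Numerically stable log-softmax of `logits / temperature`.
fn log_softmax(logits: &[f32], temperature: f32) -> Vec<f32> {
    let scaled: Vec<f32> = logits.iter().map(|z| z / temperature).collect();
    // Subtract the max before exponentiating to avoid overflow.
    let max = scaled.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let log_sum_exp = scaled.iter().map(|z| (z - max).exp()).sum::<f32>().ln() + max;
    scaled.iter().map(|z| z - log_sum_exp).collect()
}

/// KD loss for one position: alpha * CE + (1 - alpha) * T^2 * KL.
/// Assumption: KL is KL(teacher_T || student_T); `target` is the hard label.
fn kd_loss(student: &[f32], teacher: &[f32], target: usize, alpha: f32, t: f32) -> f32 {
    assert_eq!(student.len(), teacher.len());
    // Hard-label cross-entropy uses the unsoftened student distribution.
    let ce = -log_softmax(student, 1.0)[target];
    let log_p_student = log_softmax(student, t);
    let log_p_teacher = log_softmax(teacher, t);
    let kl: f32 = log_p_teacher
        .iter()
        .zip(&log_p_student)
        .map(|(lt, ls)| lt.exp() * (lt - ls))
        .sum();
    alpha * ce + (1.0 - alpha) * t * t * kl
}
```

The T² factor keeps the soft-target gradient on the same scale as the hard-label gradient when T changes, which matters if the temperature is later dropped from 4.0 to 2.0.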
+ +## Acceptance criteria + +| ID | Criterion | Phase | Status | +|---|---|---|---| +| AC-DISTILL-001 | `apr distill prepare` end-to-end cache write + read-back parity ≥ 0.999 cos sim | Phase 1 | planned | +| AC-DISTILL-002 | `apr distill` runs 500 steps without crash; val_loss monotone | Phase 3 | planned | +| AC-DISTILL-003 | Phase 4 best val_loss < 3.0 | Phase 4 | planned | +| AC-DISTILL-004 | Phase 5 HumanEval pass@1 ≥ 15% (acceptance); aspire to ≥ 25% (ship-goal) | Phase 5 | planned | +| AC-DISTILL-005 | Phase 6 publish at `paiml/albor-370m-v2` (or v1.1.0) per SPEC-HF-PUBLISH-001 with all 3 usage paths verified | Phase 6 | planned | + +## Cross-references + +- Parent: [ship-model-2-spec.md §84.5](./ship-model-2-spec.md) — "What's next for MODEL-2" +- Predecessor: SPEC-SHIP-MODEL-2 §35 (`apr distill` stub identification, 2026-04-28), §82 (P2-A trainer infrastructure proven, 2026-05-15) +- Companion: [SPEC-HF-PUBLISH-001](./model-hf-publish-pipeline-spec.md) — used for Phase 6 publish +- Roadmap: [docs/roadmaps/roadmap.yaml](../../roadmaps/roadmap.yaml) PMAT-683 + PMAT-684 +- Teacher: [`paiml/qwen2.5-coder-7b-apache-q4k-v1`](https://huggingface.co/paiml/qwen2.5-coder-7b-apache-q4k-v1) (MODEL-1, HumanEval=86.59%) +- Init: [`Qwen/Qwen2.5-Coder-0.5B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-Coder-0.5B-Instruct) (Apache-2.0, same arch + vocab as student) +- Audit: [audits/q4k-shape-swap-impact.md](../audits/q4k-shape-swap-impact.md) — confirms teacher's Q4_K artifact is bit-correct (no re-export needed) + +## Changelog + +- **1.0.0 (2026-05-18)** — Initial publish. Scopes the 6-phase plan opening the distillation track that picks up MODEL-2 v2 from where the §88 stack-existence-proof ship left off. 
From fd888932f2a8448bd493d1e34b33e43e9a2e891d Mon Sep 17 00:00:00 2001 From: Noah Gift Date: Mon, 18 May 2026 12:05:48 +0200 Subject: [PATCH 2/2] =?UTF-8?q?docs(spec):=20SPEC-DISTILL-001=20v1.1.0=20?= =?UTF-8?q?=E2=80=94=20priority=20HIGH;=20Phase=201=20=E2=86=92=20online?= =?UTF-8?q?=20teacher=20(Refs=20PMAT-691)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Revision driven by storage-math sanity check + pmat work priority promotion: 1. Priority promoted to HIGH (pmat work edit on PMAT-683 + PMAT-684, plus new PMAT-691 for Phase 1 implementation). 2. Phase 1 redesigned from on-disk top-K=64 cache to online teacher logits provider. Storage math: 1.24B tokens × 64 entries × 6 bytes ≈ 476 GB, exceeds available NVMe budget. Top-K cache approach moves to Phase 1.5 as an optional in-memory ring-buffer optimization that hides teacher latency under student compute. 3. Effort totals: Phase 1 compute drops from 4-8h to <1h. Total epic eng stays ~70h but compute drops 45h → 40h. 4. New falsifier F-DISTILL-TEACHER-001 — RealizarTeacher.logits_for_batch matches realizar's apr trace logits output within 1e-3 absolute error on a frozen 3-layer fixture. Implementation: PMAT-691 work session started 2026-05-18. Phase 1 deliverables are: teacher_provider.rs module, RealizarTeacher wrapper, pipeline.rs::train() rewrite to use it, unit test against golden fixture. 
Co-Authored-By: Claude Opus 4.7 --- docs/roadmaps/roadmap.yaml | 25 +++++++-- .../aprender-train/distillation-epic-spec.md | 54 ++++++++++++------- 2 files changed, 57 insertions(+), 22 deletions(-) diff --git a/docs/roadmaps/roadmap.yaml b/docs/roadmaps/roadmap.yaml index c581c6d08..4fe6deb73 100644 --- a/docs/roadmaps/roadmap.yaml +++ b/docs/roadmaps/roadmap.yaml @@ -9227,10 +9227,10 @@ roadmap: item_type: task title: 'P2-D: True distillation from MODEL-1 (apr distill)' status: planned - priority: medium + priority: high assigned_to: null created: 2026-05-16T07:15:09Z - updated: 2026-05-16T07:15:09Z + updated: 2026-05-18T10:01:40Z spec: null acceptance_criteria: - 'Ship apr distill per §35 (currently STUB). Architectural change; defer until P2-C exhausted. Multi-week scope. Effort: 16-40h. Δship +10. P=25%.' @@ -9250,10 +9250,10 @@ roadmap: item_type: task title: 'P1-B: HumanEval pass@1 on best checkpoint (DEFERRED)' status: planned - priority: medium + priority: high assigned_to: null created: 2026-05-16T07:15:26Z - updated: 2026-05-16T07:15:26Z + updated: 2026-05-18T10:01:40Z spec: null acceptance_criteria: - 'BLOCKED on val_loss < 3.0 per audit Rec #3 (was < 4.0). At current val_loss=4.71 (perplexity ~111) zero-shot reasoning is mathematically impossible. apr eval humaneval. Effort: 5-8h gx10. Δship +3 if pass>5%. P=3%. Run AFTER P2-C lands a better checkpoint.' @@ -9402,3 +9402,20 @@ roadmap: - ship-model-2 - upstream-fix notes: null +- id: PMAT-691 + github_issue: null + item_type: task + title: Distillation Phase 1 — apr distill prepare (teacher logits cache) + status: inprogress + priority: high + assigned_to: null + created: 2026-05-18T10:01:48Z + updated: 2026-05-18T10:01:53.808598526+00:00 + spec: null + acceptance_criteria: + - 'SPEC-DISTILL-001 Phase 1. New ''apr distill prepare --teacher --dataset --out '' subcommand. Iterates batches via realizar::Model::forward_logits, writes top-K=64 sparse cache (.logits.bin per batch) with header magic APRLOG. 
New module aprender-train-distill/src/teacher_cache.rs. Falsifier F-DISTILL-PREP-001: 100-batch slice, cos sim >= 0.999 vs online realizar recompute. Effort: 16-24h eng + 4-8h compute, 3 days calendar.' + phases: [] + subtasks: [] + estimated_effort: null + labels: [] + notes: null diff --git a/docs/specifications/aprender-train/distillation-epic-spec.md b/docs/specifications/aprender-train/distillation-epic-spec.md index 6a6582c79..f8fe8d47e 100644 --- a/docs/specifications/aprender-train/distillation-epic-spec.md +++ b/docs/specifications/aprender-train/distillation-epic-spec.md @@ -1,11 +1,12 @@ # Specification: Distillation Epic — paiml/albor-370m-v2 **Document ID:** SPEC-DISTILL-001 -**Version:** 1.0.0 -**Status:** Live — opens the distillation track that picks up where MODEL-2's §88 stack-existence-proof ship left off +**Version:** 1.1.0 (priority promoted to HIGH; Phase 1 design revised from cache → online teacher provider after the storage-math sanity check) +**Status:** **Live and ACTIVE** — distillation track is the highest-priority work after MODEL-2 §88 shipped +**Priority:** **HIGH** (per `pmat work edit`, 2026-05-18 — PMAT-683 + PMAT-684 both elevated from `medium` to `high`) **Parent:** [ship-model-2-spec.md §84.5](./ship-model-2-spec.md) -**Triggers:** PMAT-683 (P2-D: True distillation from MODEL-1), PMAT-684 (P1-B: HumanEval pass@1) -**First applied:** TBD — see Phase 1 below +**Triggers:** PMAT-683 (P2-D: True distillation from MODEL-1) **HIGH**, PMAT-684 (P1-B: HumanEval pass@1) **HIGH**, PMAT-691 (Phase 1 implementation kickoff) **HIGH** +**First applied:** Phase 1 work session started 2026-05-18 (PMAT-691) ## Purpose @@ -35,20 +36,35 @@ Closing this gap is the epic. 
## Phased plan -### Phase 1 — Teacher logits service via realizar (PMAT-683 P1, ~3 days) +### Phase 1 — Online teacher logits provider (PMAT-683 P1 + PMAT-691, ~2 days) -**Goal**: `apr distill` invokes realizar to compute teacher logits over the training corpus and caches them to disk. +**Goal**: `aprender-train-distill` can fetch full teacher logits for an arbitrary batch by delegating to realizar — synchronously, in the training hot path. -Why a separate phase: teacher forward over a 7B model is expensive (~3-5s/batch on RTX 4090 for batch=8 seq=512). Recomputing per training step is wasteful. The HF community standard (DistilBERT, MiniLM, Distil-Qwen) caches teacher logits once per corpus and replays them across training epochs. We do the same. +**Why online instead of on-disk cache** (revised from v1.0.0 of this spec): the original plan (top-K=64 sparse cache) does not scale to a real corpus. The math: 1.24B tokens (qwen-v3 corpus) × 64 entries/position × ~6 bytes/entry (u32 index + f16 logit) ≈ **476 GB**. That exceeds the lambda-vector NVMe budget. Lowering K further degrades KD signal-to-noise. Modern distillation pipelines (DistilBERT 2019, MiniLM 2020, Distil-Qwen 2024) all use **online teacher inference** — pay the ~2× student-step cost in exchange for zero cache footprint. We do the same. -**Deliverables:** -1. `apr distill prepare --teacher --dataset --out ` — new subcommand. Iterates batches, calls `realizar::Model::forward_logits(input_ids) -> Vec`, writes per-batch `.logits.bin` files (top-K + indices to reduce storage; K=64 is standard). -2. New module `aprender-train-distill/src/teacher_cache.rs` — defines on-disk format (header magic `APRLOG\0`, per-batch records `(batch_idx, seq_len, top_k_logits[K], top_k_indices[K])`). -3. Integration test: prepare 100-batch slice, verify cache file size + read-back matches realizar's online compute within float-roundoff. 
- -**Effort estimate**: 16-24 hours engineering + 4-8h compute (1B-token corpus prep takes ~6h on RTX 4090 for a 7B teacher at batch=8). +The cache approach is preserved as a **Phase 1.5 optional optimization**: an in-memory ring of N pre-computed batches that the producer thread fills while the student GPU is busy on the previous batch. Adds 0 disk cost, hides teacher latency under student compute. Implement after Phase 4 lands real numbers worth optimizing. -**Falsifier**: F-DISTILL-PREP-001 — `apr distill prepare` produces a cache where, for any random batch, calling `realizar` online produces the same top-K within cosine sim ≥ 0.999. +**Deliverables:** +1. New module `aprender-train-distill/src/teacher_provider.rs` — defines: + ```rust + pub trait TeacherLogitsProvider { + fn logits_for_batch(&mut self, input_ids: &[Vec]) -> Result>>; + } + pub struct RealizarTeacher { /* wraps realizar::Model */ } + impl TeacherLogitsProvider for RealizarTeacher { ... } + ``` +2. `RealizarTeacher::new(path: &Path, device: Device)` loads the teacher .apr/.gguf via realizar's standard path; subsequent `logits_for_batch` calls run forward and return the full V-dim distribution (or last-position logits depending on training objective). +3. Wired call site: `aprender-train-distill/src/pipeline.rs::train()` replaces `build_synthetic_logits(...)` with `self.teacher.logits_for_batch(...)`. +4. Unit test against a frozen golden reference: load a 3-layer toy teacher; assert `logits_for_batch([[1, 2, 3]])` returns bytes matching a recorded fixture within `1e-3` absolute. + +**Effort estimate**: 16-24 hours engineering + <1h compute. No corpus prep needed (no cache). + +**Falsifier**: F-DISTILL-TEACHER-001 — `RealizarTeacher::logits_for_batch` output matches `realizar`'s standard `apr trace --layer logits` JSON dump on the same input within 1e-3 absolute error, for a frozen 3-layer fixture model. 
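The 1e-3 absolute-error check in F-DISTILL-TEACHER-001 can be expressed as a small comparison helper. This is a sketch under the assumption that both sides are available as flat `f32` slices; `logits_match` is a hypothetical name, and parsing the `apr trace` dump is left out.

```rust
/// True when both logit vectors have the same length and every element
/// differs by at most `tol` in absolute value.
/// A NaN on either side fails the check, because `NaN <= tol` is false.
fn logits_match(actual: &[f32], expected: &[f32], tol: f32) -> bool {
    actual.len() == expected.len()
        && actual
            .iter()
            .zip(expected)
            .all(|(a, e)| (a - e).abs() <= tol)
}
```

Using `all` with `<=` instead of folding to a maximum means a NaN logit cannot slip through as a silent pass.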
+ +**Implementation notes**: +- For the v2 ship we need full V-dim logits during the student step (KL divergence is computed over the full distribution). Top-K truncation can come later as an optimization once Phase 4 numbers exist. +- The teacher is held in GPU memory between calls. On RTX 4090 (24GB) the 7B Q4_K teacher fits with ~16GB headroom — enough for a batch-8 seq-512 student to share the GPU. +- realizar's `Model::forward_logits` already returns `Vec` for the last position. For sequence-wise KD (every position), Phase 1 needs to expose an extended API `forward_logits_full` returning shape `[batch, seq_len, vocab]`. If not present, add it in `aprender-serve` as a small extension. ### Phase 2 — Student forward+backward wired to KD loss (PMAT-683 P2, ~2 days) @@ -124,21 +140,22 @@ Why a separate phase: teacher forward over a 7B model is expensive (~3-5s/batch | Phase | Engineering | Compute | Calendar | |---|---|---|---| -| 1: Teacher logits cache | 16-24h | 4-8h | 3 days | +| 1: Online teacher provider (PMAT-691) | 16-24h | <1h | 2 days | | 2: KD-wired forward+backward | 16-24h | <1h | 2 days | | 3: E2E smoke (500 steps) | 8h | 4h | 1 day | | 4: V2 training run | 4h | 30h (unattended) | 2-3 days wall | | 5: HumanEval discharge | 4h | 5-8h | 1 day | | 6: Publish v2 | 3h | 1h | 0.5 day | -| **Total** | **~70h** | **~45h** | **~10 days** | +| **Total** | **~70h** | **~40h** | **~9 days** | -This matches PMAT-683's "16-40h. Δship +10. P=25%" original estimate when scaled to the realistic full epic; PMAT-683 was authored before §82 P2-A landed the trainer infrastructure that Phase 2 reuses. +This matches PMAT-683's "16-40h. Δship +10. P=25%" original estimate when scaled to the realistic full epic; PMAT-683 was authored before §82 P2-A landed the trainer infrastructure that Phase 2 reuses. The v1.1.0 revision (online teacher provider instead of disk cache) drops Phase 1 compute from 4-8h to <1h. 
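The storage figure that motivated the v1.1.0 switch away from a disk cache is plain arithmetic and can be reproduced as below. The inputs are the spec's own numbers (1.24B tokens, K=64, 6 bytes per entry for a u32 index plus an f16 logit); the function name `topk_cache_bytes` is hypothetical.

```rust
/// Bytes needed to store a top-K sparse logits cache:
/// one (index, logit) entry per kept class, per token position.
fn topk_cache_bytes(tokens: u64, k: u64, bytes_per_entry: u64) -> u64 {
    tokens * k * bytes_per_entry
}

/// Convert bytes to decimal gigabytes (1 GB = 10^9 bytes).
fn to_gb(bytes: u64) -> f64 {
    bytes as f64 / 1e9
}
```

With the spec's inputs, `to_gb(topk_cache_bytes(1_240_000_000, 64, 6))` evaluates to 476.16, matching the ≈ 476 GB quoted for the on-disk design.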
## Risk register | Risk | Mitigation | |---|---| -| Teacher logits cache is too large (~100 GB for full qwen-v3) | Top-K=64 sparsification reduces by ~25× → ~4 GB. Stream from disk per batch. | +| Online teacher inference doubles training step time (~3-5s teacher fwd + ~1.5s student step on RTX 4090) | Acceptable for Phase 4's 50K-step run (≈ 21 hr → 42 hr active training). Phase 1.5 ring-cache reclaims it later if needed. | +| Teacher + student both on the same GPU could OOM at batch > 8 | Phase 3 smoke confirms VRAM headroom. Fallback: place teacher on a separate device or use `--device-teacher cpu` (slower but works on any host). | | Student forward+backward + KD has unmeasured CUDA bugs that only surface at scale | Phase 3 smoke catches anything that surfaces in the first 500 steps. Phase 4 has per-epoch sanity (val_loss monotone + checkpoint preserved). | | HumanEval pass@1 falls below 15% at end of Phase 4 → ship is blocked | Two-path fallback: (a) widen corpus per the P2-C lineage (4× corpus), retrain Phase 4 from scratch; (b) drop KD temperature from 4.0 → 2.0 to sharpen teacher distribution. Both add ~1 week each. | | KD loss is mathematically wrong or numerically unstable (e.g., logsumexp drift, FP16 overflow on KL) | Phase 2 unit test compares against a Python reference (PyTorch nn.KLDivLoss) within 1e-4 absolute error. | @@ -173,4 +190,5 @@ This matches PMAT-683's "16-40h. Δship +10. P=25%" original estimate when scale ## Changelog +- **1.1.0 (2026-05-18)** — Priority promoted to HIGH (PMAT-683 + PMAT-684 + new PMAT-691 elevated via `pmat work edit`). Phase 1 design revised from on-disk top-K cache to online teacher logits provider after the storage-math sanity check showed 1.24B tokens × 64 entries × 6 bytes ≈ 476 GB (exceeds available disk). The cache approach moves to Phase 1.5 as an optional ring-buffer optimization. Online-teacher decision matches DistilBERT/Distil-Qwen actual practice. Effort + compute totals updated accordingly. 
- **1.0.0 (2026-05-18)** — Initial publish. Scopes the 6-phase plan opening the distillation track that picks up MODEL-2 v2 from where the §88 stack-existence-proof ship left off.