From 2cba93bee66a63dff27fe4a8f8afc81e6fe431f3 Mon Sep 17 00:00:00 2001
From: Noah Gift <noah.gift@gmail.com>
Date: Mon, 18 May 2026 10:41:28 +0200
Subject: [PATCH 1/2] =?UTF-8?q?docs(spec):=20SPEC-DISTILL-001=20=E2=80=94?=
 =?UTF-8?q?=20distillation=20epic=206-phase=20plan=20(PMAT-683/684)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Opens the distillation track that picks up MODEL-2 v2 (paiml/albor-370m-v2)
from where the §88 stack-existence-proof ship left off.

What this spec scopes
=====================

Current state audit:
- DistillationLoss is REAL (KD = α·CE + (1-α)·T²·KL)
- CudaTransformerTrainer forward+backward is REAL (proven by §82 P2-A)
- realizar teacher inference is REAL (proven by SHIP-005 86.59% HumanEval)
- Pipeline orchestrator at aprender-train-distill/src/pipeline.rs:115 is STUB
  — uses build_synthetic_logits() instead of teacher.forward(); never calls
  CudaTransformerTrainer for the student. Closing this gap is the epic.
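
For reference, the DistillationLoss formula above can be sketched for a
single position as follows. This is a minimal illustration, not the
aprender-train implementation: the function names are hypothetical, and
the KL direction (teacher distribution as the reference) is an
assumption based on Hinton-style KD.

```rust
/// Numerically stable log-softmax of `logits / temperature`.
fn log_softmax(logits: &[f32], temperature: f32) -> Vec<f32> {
    let scaled: Vec<f32> = logits.iter().map(|x| x / temperature).collect();
    let max = scaled.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let lse = scaled.iter().map(|x| (x - max).exp()).sum::<f32>().ln() + max;
    scaled.iter().map(|x| x - lse).collect()
}

/// KD = alpha * CE(student, target) + (1 - alpha) * T^2 * KL(teacher || student),
/// with both distributions softened by temperature `t`.
/// Assumption: KL is taken with the teacher as the reference distribution.
fn kd_loss(student: &[f32], teacher: &[f32], target: usize, alpha: f32, t: f32) -> f32 {
    // Hard-label cross-entropy uses the unsoftened student distribution.
    let ce = -log_softmax(student, 1.0)[target];
    let log_p_student = log_softmax(student, t);
    let log_p_teacher = log_softmax(teacher, t);
    let kl: f32 = log_p_teacher
        .iter()
        .zip(&log_p_student)
        .map(|(lt, ls)| lt.exp() * (lt - ls))
        .sum();
    alpha * ce + (1.0 - alpha) * t * t * kl
}
```

With identical student and teacher logits the KL term vanishes and the
loss reduces to α·CE.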

6-phase plan
============

Phase 1 (3 days, 16-24h eng): `apr distill prepare` — realizar runs MODEL-1
  teacher over the corpus, caches top-K=64 logits to disk. 100-batch test
  asserts cosine sim ≥ 0.999 against online realizar recompute.

Phase 2 (2 days, 16-24h eng): wire CudaTransformerTrainer to KD loss via
  new forward_backward_kd_batch(); replace synthetic-logits stub in
  pipeline.rs::train(). Unit test on toy student verifies loss monotone.

Phase 3 (1 day + 4h compute): 500-step E2E smoke on a 10K-batch slice.
  Falsifier F-DISTILL-SMOKE-001 — val_loss at step 500 < step 0.

Phase 4 (4h dispatch + 30h unattended compute): the v2 training run.
  50K steps × 8192 tok/step ≈ 0.41B tokens. Init from Qwen2.5-Coder-0.5B
  (matched tokenizer + arch family). Falsifiers F-DISTILL-V2-001/002 —
  val_loss < 3.0 AND HumanEval pass@1 ≥ 15%.

Phase 5 (5-8h gx10 compute): full 164-problem HumanEval discharge of
  PMAT-684. Acceptance threshold pass@1 ≥ 15%; ship-goal ≥ 25%.

Phase 6 (3h staging + 1h compute): publish v2 per SPEC-HF-PUBLISH-001.
  With v0.34.0+#1783 binary, companion files + model.safetensors alias
  are auto-emitted. Three-path verification (apr run + HF Transformers
  + llama-cli).

Total: ~70h eng + ~45h compute, ~10 days calendar.
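
The Phase 1 parity check (cosine sim ≥ 0.999 between cached and
recomputed logits) could be expressed roughly as below. Function names
are hypothetical; the real test would compare realizar outputs, which
are not modeled here.

```rust
/// Cosine similarity between two equal-length logit vectors.
/// Returns None when the lengths differ or either vector has zero norm.
fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f64> {
    if a.len() != b.len() {
        return None;
    }
    let (mut dot, mut na, mut nb) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        // Accumulate in f64 to limit round-off on long vocab rows.
        let (x, y) = (x as f64, y as f64);
        dot += x * y;
        na += x * x;
        nb += y * y;
    }
    if na == 0.0 || nb == 0.0 {
        return None;
    }
    Some(dot / (na.sqrt() * nb.sqrt()))
}

/// True when cached and recomputed logits agree at the 0.999 threshold.
fn cache_matches(cached: &[f32], online: &[f32]) -> bool {
    matches!(cosine_similarity(cached, online), Some(s) if s >= 0.999)
}
```

Treating a length mismatch or a zero vector as a failed match keeps a
truncated cache record from passing silently.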

Risk register
=============

- Cache size: top-K=64 sparsification → ~4GB instead of ~100GB
- KD numerical stability: Phase 2 unit test compares against PyTorch
  nn.KLDivLoss within 1e-4 absolute
- Teacher inference cost: Phase 1 cache amortizes one-time ~6h prep
  to <100ms/batch reads during training
- HumanEval miss: two-path fallback — widen corpus OR drop T from
  4.0 to 2.0 (each adds ~1 week)

Architectural decisions
=======================

1. Top-K=64 cache (NOT full logits) — DistilBERT/Distil-Qwen precedent
2. Cached teacher (NOT online) — hyperparameter sweep cost < cache regen
3. Vanilla KD (NOT MiniLM intermediate-layer matching) — teacher is Q4_K,
   intermediate activations aren't recoverable post-quantization
4. Matched tokenizer (Qwen2 151,936 vocab) — strongest argument for
   Qwen2.5-Coder-0.5B-Instruct as init
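
Decision 1 amounts to keeping, per position, only the K largest logits
together with their token indices. A minimal sketch, assuming a plain
f32 row of vocabulary logits (the helper name top_k is hypothetical and
this is not the teacher_cache.rs on-disk format):

```rust
/// Keep the `k` largest logits of one vocabulary row, returned as
/// (token_index, logit) pairs sorted by descending logit.
/// Ties keep their original index order because the sort is stable.
fn top_k(logits: &[f32], k: usize) -> Vec<(u32, f32)> {
    let mut indexed: Vec<(u32, f32)> = logits
        .iter()
        .enumerate()
        .map(|(i, &v)| (i as u32, v))
        .collect();
    // total_cmp gives a total order on f32, so NaN cannot panic the sort.
    indexed.sort_by(|a, b| b.1.total_cmp(&a.1));
    indexed.truncate(k);
    indexed
}
```

A full sort is O(V log V) per position; for a 151,936-entry vocab a
partial selection would be cheaper, but the sort keeps the sketch short.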

5 AC-DISTILL-* criteria authored; cross-linked to SPEC-HF-PUBLISH-001
(used in Phase 6) and AUDIT-Q4K-SHAPE-001 (confirms teacher Q4_K is
bit-correct, no re-export needed before distillation).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 .../aprender-train/distillation-epic-spec.md  | 176 ++++++++++++++++++
 1 file changed, 176 insertions(+)
 create mode 100644 docs/specifications/aprender-train/distillation-epic-spec.md

diff --git a/docs/specifications/aprender-train/distillation-epic-spec.md b/docs/specifications/aprender-train/distillation-epic-spec.md
new file mode 100644
index 000000000..6a6582c79
--- /dev/null
+++ b/docs/specifications/aprender-train/distillation-epic-spec.md
@@ -0,0 +1,176 @@
+# Specification: Distillation Epic — paiml/albor-370m-v2
+
+**Document ID:** SPEC-DISTILL-001
+**Version:** 1.0.0
+**Status:** Live — opens the distillation track that picks up where MODEL-2's §88 stack-existence-proof ship left off
+**Parent:** [ship-model-2-spec.md §84.5](./ship-model-2-spec.md)
+**Triggers:** PMAT-683 (P2-D: True distillation from MODEL-1), PMAT-684 (P1-B: HumanEval pass@1)
+**First applied:** TBD — see Phase 1 below
+
+## Purpose
+
+MODEL-2's v1 ship (`paiml/albor-370m-v1`, val_loss=4.6227, 2026-05-18) proved the pure-Rust Sovereign AI Stack runs end-to-end. The model is honestly framed as a stack-existence-proof — it produces gibberish at val_perplexity≈102 and is explicitly NOT a production code-completion model.
+
+The distillation epic ships **MODEL-2 v2** — `paiml/albor-370m-v2` (or v1.1.0) — a ~370M-param student distilled from the **MODEL-1 teacher** (`paiml/qwen2.5-coder-7b-apache-q4k-v1`) that actually hits a usable HumanEval pass@1. Target: pass@1 ≥ 25% (loose first-target; competitive 0.5B is ~30-40%; the upstream 7B teacher is at ~91%).
+
+The epic's strategic value: it converts the stack from "runs end-to-end" → "produces models worth using". That's the difference between an existence proof and a product.
+
+## Current state (2026-05-18)
+
+The distillation infrastructure is **scaffolded but stubbed**:
+
+| Component | File | State |
+|---|---|---|
+| CLI: `apr distill ...` | `crates/apr-cli/src/commands/distill.rs:1119` | Stub — writes a metadata-only "pending_download" manifest |
+| Pipeline orchestrator | `crates/aprender-train-distill/src/pipeline.rs:115-160` | Stub — uses **synthetic logits derived from weight tensor slices** instead of actual forward passes through teacher + student |
+| KD loss | `crates/aprender-train/src/distill/loss.rs:34` (`DistillationLoss`) | **Real** — KD loss = α · CE(student, hard_target) + (1-α) · T² · KL(softmax(student/T), softmax(teacher/T)) |
+| HF-pipeline KD variant | `crates/aprender-train/src/hf_pipeline/distillation/loss.rs:24` | **Real** — used for distillation under the hf_pipeline path |
+| Student forward+backward | `crates/aprender-train/src/train/transformer_trainer/cuda_trainer.rs:1256` (`forward_logits`), `:2378` (`forward_backward_batch`) | **Real** — `CudaTransformerTrainer` proven via MODEL-2 §82 P2-A 5000-step run |
+| Teacher inference | `realizar` (aprender-serve) — `apr run` path | **Real** — proven via MODEL-1 SHIP-005 HumanEval=86.59% |
+| GGUF / SafeTensors weight loading | `aprender-core::format::converter` | **Real** — v0.34.0 layout + Q4_K paths shipped |
+
+**The gap**: pipeline.rs's `train()` method constructs `teacher_logits` from `build_synthetic_logits(&teacher_weights, batch_size, num_classes=32)` — a flat 32-class slice of weight bytes. It NEVER calls realizar to run the teacher, NEVER calls `CudaTransformerTrainer.forward_backward_batch` on the student. The loss math is exercised on synthetic data; the gradient updates do nothing meaningful.
+
+Closing this gap is the epic.
+
+## Phased plan
+
+### Phase 1 — Teacher logits service via realizar (PMAT-683 P1, ~3 days)
+
+**Goal**: `apr distill` invokes realizar to compute teacher logits over the training corpus and caches them to disk.
+
+Why a separate phase: teacher forward over a 7B model is expensive (~3-5s/batch on RTX 4090 for batch=8 seq=512). Recomputing per training step is wasteful. The HF community standard (DistilBERT, MiniLM, Distil-Qwen) caches teacher logits once per corpus and replays them across training epochs. We do the same.
+
+**Deliverables:**
+1. `apr distill prepare --teacher <path> --dataset <bin-shards> --out <cache-dir>` — new subcommand. Iterates batches, calls `realizar::Model::forward_logits(input_ids) -> Vec<f32>`, writes per-batch `.logits.bin` files (top-K + indices to reduce storage; K=64 is standard).
+2. New module `aprender-train-distill/src/teacher_cache.rs` — defines on-disk format (header magic `APRLOG\0`, per-batch records `(batch_idx, seq_len, top_k_logits[K], top_k_indices[K])`).
+3. Integration test: prepare 100-batch slice, verify cache file size + read-back matches realizar's online compute within float-roundoff.
+
+**Effort estimate**: 16-24 hours engineering + 4-8h compute (1B-token corpus prep takes ~6h on RTX 4090 for a 7B teacher at batch=8).
+
+**Falsifier**: F-DISTILL-PREP-001 — `apr distill prepare` produces a cache where, for any random batch, calling `realizar` online produces the same top-K within cosine sim ≥ 0.999.
+
+### Phase 2 — Student forward+backward wired to KD loss (PMAT-683 P2, ~2 days)
+
+**Goal**: Replace pipeline.rs's `build_synthetic_logits(...)` with a real `CudaTransformerTrainer.forward_backward_kd_batch(...)` call.
+
+**Deliverables:**
+1. Extend `CudaTransformerTrainer` (`crates/aprender-train/src/train/transformer_trainer/cuda_trainer.rs`) with `forward_backward_kd_batch(batch: &LMBatch, teacher_logits: &TeacherLogitsBatch) -> f32`. The trainer already has `forward_backward_batch`; the KD variant computes the combined loss = α·CE + (1-α)·T²·KL via `DistillationLoss` instead of CE alone, then back-props.
+2. Rewrite `aprender-train-distill/src/pipeline.rs::train()` to (a) load teacher cache, (b) iterate corpus batches alongside cached teacher logits, (c) call `forward_backward_kd_batch`, (d) optimizer step.
+3. Unit test: 10-step run on a 3-layer toy student + 4-layer toy teacher, verify loss strictly decreases.
+
+**Effort estimate**: 16-24 hours engineering. No compute beyond the unit test.
+
+**Falsifier**: F-DISTILL-KD-001 — student loss after 100 steps < student loss at step 0 (sanity); F-DISTILL-KD-002 — student logits cosine sim to teacher logits increases monotonically over the first 1000 steps (modulo noise).
+
+### Phase 3 — End-to-end smoke (~1 day + 4h compute)
+
+**Goal**: Run `apr distill` for 500 steps on a 10K-batch subset of the qwen-v3 corpus, with the MODEL-1 teacher cache. Verify it doesn't crash, val_loss decreases, output passes `apr qa`.
+
+**Deliverables**:
+1. `apr distill --teacher paiml/qwen2.5-coder-7b-apache-q4k-v1 --init Qwen/Qwen2.5-Coder-0.5B-Instruct --dataset qwen-v3-10k --steps 500 --out runs/distill-smoke/` — end-to-end command.
+2. Smoke evidence in `evidence/distill-smoke-<date>/` — launch.log, per-step metrics, final val_loss.
+3. Spec amendment to ship-model-2-spec.md §85 documenting the smoke.
+
+**Effort estimate**: 8h engineering + 4h gx10 compute.
+
+**Falsifier**: F-DISTILL-SMOKE-001 — val_loss at step 500 < val_loss at step 0.
+
+### Phase 4 — Distillation training run for the v2 ship (~2-3 days wall, ~30h compute, mostly unattended)
+
+**Goal**: Long-running distillation training to produce the v2 student checkpoint.
+
+**Hyperparameters** (Chinchilla-adjacent, tuned from §82 P2-A):
+- Init: `Qwen/Qwen2.5-Coder-0.5B-Instruct` (same as v1)
+- Teacher: `paiml/qwen2.5-coder-7b-apache-q4k-v1` (Q4_K, the MODEL-1 teacher)
+- Corpus: qwen-v3 (1.24B tokens, the §77 5g.1 corpus); a full pass needs ≈ 151K steps at 8192 tokens/step
+- Steps: ~50K (≈ 0.41B tokens consumed at batch=16 seq=512 = 8192 tokens/step, about one third of the corpus)
+- LR: 1.5e-5 → 5e-7 cosine
+- T (KD temperature): 4.0 (DistilBERT default; tune to 2.0 if loss curve plateaus early)
+- α (CE weight): 0.3 (per Hinton et al. 2015 — KD signal dominates when teacher is reliable)
+- Wall time on RTX 4090: 50K steps × ~1.5 s/step (student forward+backward + teacher cache read) ≈ 21 hours active training. With teacher cache prep + checkpointing overhead ≈ 30 hours total.
+
+**Deliverables**:
+1. Best checkpoint `runs/albor-370m-v2/ckpt/epoch-N.apr`.
+2. Per-epoch eval against `apr eval humaneval` (sample 20 problems mid-run, full 164 at end).
+3. Spec amendment §86 (re-using the deprecated §86 slot) documenting the v2 run.
+
+**Effort estimate**: 4h dispatch + setup + 30h compute (unattended on gx10 or lambda-labs).
+
+**Falsifier**: F-DISTILL-V2-001 — best val_loss < 3.0 (would unblock PMAT-684 HumanEval per the audit Rec #3); F-DISTILL-V2-002 — HumanEval pass@1 ≥ 15% (loose first-target).
+
+### Phase 5 — HumanEval discharge (PMAT-684, ~1 day)
+
+**Goal**: Run full 164-problem HumanEval on the v2 best checkpoint, verify pass@1 against the target.
+
+**Deliverables**:
+1. `apr eval humaneval --model runs/albor-370m-v2/ckpt/epoch-N.apr --samples 164 --temperature 0.2 --top-p 0.95 --gpu cuda` — run.
+2. Evidence at `evidence/distill-v2-humaneval-<date>/`.
+3. Discharge AC-SHIP2-005 in ship-model-2-spec.md.
+
+**Effort estimate**: 5-8h gx10 compute (sequential through 164 problems, 16 samples each for the pass@1 estimate).
+
+**Falsifier**: F-HUMANEVAL-V2-001 — pass@1 ≥ 15% (acceptance threshold; under is "the corpus didn't transfer; need to widen corpus per P2-C lineage"); F-HUMANEVAL-V2-002 — pass@1 ≥ 25% (loose ship-goal threshold).
+
+### Phase 6 — Publish v2 (~3h)
+
+**Goal**: Re-stamp the best checkpoint with provenance + tokenizer (per SPEC-HF-PUBLISH-001) and ship to `paiml/albor-370m-v2` (or `paiml/albor-370m-v1.1.0`).
+
+**Deliverables**: Follows SPEC-HF-PUBLISH-001 end-to-end: `apr stamp --tokenizer` → `apr export --quantize int4` → `apr publish`. With the v0.34.0+#1783 (defect 6) binary, everything is autodetected from the staging directory. Three-path verification: `apr run` + HF Transformers + llama-cli.
+
+**Effort estimate**: 3h staging + 1h compute (Q4_K export + upload).
+
+## Total effort + timeline
+
+| Phase | Engineering | Compute | Calendar |
+|---|---|---|---|
+| 1: Teacher logits cache | 16-24h | 4-8h | 3 days |
+| 2: KD-wired forward+backward | 16-24h | <1h | 2 days |
+| 3: E2E smoke (500 steps) | 8h | 4h | 1 day |
+| 4: V2 training run | 4h | 30h (unattended) | 2-3 days wall |
+| 5: HumanEval discharge | 4h | 5-8h | 1 day |
+| 6: Publish v2 | 3h | 1h | 0.5 day |
+| **Total** | **~70h** | **~45h** | **~10 days** |
+
+This matches PMAT-683's "16-40h. Δship +10. P=25%" original estimate when scaled to the realistic full epic; PMAT-683 was authored before §82 P2-A landed the trainer infrastructure that Phase 2 reuses.
+
+## Risk register
+
+| Risk | Mitigation |
+|---|---|
+| Teacher logits cache is too large (~100 GB for full qwen-v3) | Top-K=64 sparsification reduces by ~25× → ~4 GB. Stream from disk per batch. |
+| Student forward+backward + KD has unmeasured CUDA bugs that only surface at scale | Phase 3 smoke catches anything that surfaces in the first 500 steps. Phase 4 has per-epoch sanity (val_loss monotone + checkpoint preserved). |
+| HumanEval pass@1 falls below 15% at end of Phase 4 → ship is blocked | Two-path fallback: (a) widen corpus per the P2-C lineage (4× corpus), retrain Phase 4 from scratch; (b) drop KD temperature from 4.0 → 2.0 to sharpen teacher distribution. Both add ~1 week each. |
+| KD loss is mathematically wrong or numerically unstable (e.g., logsumexp drift, FP16 overflow on KL) | Phase 2 unit test compares against a Python reference (PyTorch nn.KLDivLoss) within 1e-4 absolute error. |
+| Realizar inference of the 7B teacher mid-training is too slow (~5s/batch) — burns 50K × 5s ≈ 70h of teacher time on top of training | Phase 1 cache amortizes this once. The cache prep itself takes ~6h, but reads at <100ms/batch during training. |
+
+## Open architectural decisions
+
+1. **Top-K cache vs full logits**: top-K=64 trades fidelity for storage. KD literature (DistilBERT, Distil-Qwen) shows pass@1 deltas from K=32 to K=full are within noise for code corpora. We start with K=64; if Phase 3 smoke shows loss-curve degradation vs a comparison full-logits batch, bump K.
+2. **Online teacher vs cached**: cached (decided). Online would let temperature/α be tuned mid-training without re-prep cost; cached requires re-prep per α/T change. The hyperparameter sweep cost is < single cache regen, so cache wins.
+3. **Distillation algorithm variant**: vanilla KD (Hinton 2015), NOT MiniLM (which adds attention-distillation losses) or TinyBERT (intermediate-layer matching). MODEL-1 is a Q4_K teacher — its FP16 intermediate activations aren't recoverable post-quantization. Vanilla KD on logits-only is the practical match.
+4. **Tokenizer**: same Qwen2 vocab (151,936 tokens) as v1 — teacher + student share vocab, no token-alignment loss. This is the strongest argument for using `Qwen2.5-Coder-0.5B-Instruct` as init: matched tokenizer + matched arch family.
+
+## Acceptance criteria
+
+| ID | Criterion | Phase | Status |
+|---|---|---|---|
+| AC-DISTILL-001 | `apr distill prepare` end-to-end cache write + read-back parity ≥ 0.999 cos sim | Phase 1 | planned |
+| AC-DISTILL-002 | `apr distill` runs 500 steps without crash; val_loss monotone | Phase 3 | planned |
+| AC-DISTILL-003 | Phase 4 best val_loss < 3.0 | Phase 4 | planned |
+| AC-DISTILL-004 | Phase 5 HumanEval pass@1 ≥ 15% (acceptance); aspire to ≥ 25% (ship-goal) | Phase 5 | planned |
+| AC-DISTILL-005 | Phase 6 publish at `paiml/albor-370m-v2` (or v1.1.0) per SPEC-HF-PUBLISH-001 with all 3 usage paths verified | Phase 6 | planned |
+
+## Cross-references
+
+- Parent: [ship-model-2-spec.md §84.5](./ship-model-2-spec.md) — "What's next for MODEL-2"
+- Predecessor: SPEC-SHIP-MODEL-2 §35 (`apr distill` stub identification, 2026-04-28), §82 (P2-A trainer infrastructure proven, 2026-05-15)
+- Companion: [SPEC-HF-PUBLISH-001](./model-hf-publish-pipeline-spec.md) — used for Phase 6 publish
+- Roadmap: [docs/roadmaps/roadmap.yaml](../../roadmaps/roadmap.yaml) PMAT-683 + PMAT-684
+- Teacher: [`paiml/qwen2.5-coder-7b-apache-q4k-v1`](https://huggingface.co/paiml/qwen2.5-coder-7b-apache-q4k-v1) (MODEL-1, HumanEval=86.59%)
+- Init: [`Qwen/Qwen2.5-Coder-0.5B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-Coder-0.5B-Instruct) (Apache-2.0, same arch + vocab as student)
+- Audit: [audits/q4k-shape-swap-impact.md](../audits/q4k-shape-swap-impact.md) — confirms teacher's Q4_K artifact is bit-correct (no re-export needed)
+
+## Changelog
+
+- **1.0.0 (2026-05-18)** — Initial publish. Scopes the 6-phase plan opening the distillation track that picks up MODEL-2 v2 from where the §88 stack-existence-proof ship left off.

From fd888932f2a8448bd493d1e34b33e43e9a2e891d Mon Sep 17 00:00:00 2001
From: Noah Gift <noah.gift@gmail.com>
Date: Mon, 18 May 2026 12:05:48 +0200
Subject: [PATCH 2/2] =?UTF-8?q?docs(spec):=20SPEC-DISTILL-001=20v1.1.0=20?=
 =?UTF-8?q?=E2=80=94=20priority=20HIGH;=20Phase=201=20=E2=86=92=20online?=
 =?UTF-8?q?=20teacher=20(Refs=20PMAT-691)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Revision driven by storage-math sanity check + pmat work priority
promotion:

1. Priority promoted to HIGH (pmat work edit on PMAT-683 + PMAT-684,
   plus new PMAT-691 for Phase 1 implementation).

2. Phase 1 redesigned from on-disk top-K=64 cache to online teacher
   logits provider. Storage math: 1.24B tokens × 64 entries × 6 bytes
   ≈ 476 GB, exceeds available NVMe budget. Top-K cache approach moves
   to Phase 1.5 as an optional in-memory ring-buffer optimization that
   hides teacher latency under student compute.

3. Effort totals: Phase 1 compute drops from 4-8h to <1h. Total epic
   eng stays ~70h but compute drops 45h → 40h.

4. New falsifier F-DISTILL-TEACHER-001 — RealizarTeacher.logits_for_batch
   matches realizar's apr trace logits output within 1e-3 absolute error
   on a frozen 3-layer fixture.
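
The storage estimate in item 2 is plain multiplication; a small sketch
makes the inputs explicit. The 6 bytes per entry assumes a u32 index
plus an f16 logit, as in the spec; the function name is hypothetical.

```rust
/// Bytes needed to cache top-K teacher logits for a whole corpus:
/// one (index, logit) entry per kept logit, K entries per token position.
fn cache_bytes(tokens: u64, k: u64, bytes_per_entry: u64) -> u64 {
    tokens * k * bytes_per_entry
}

/// Convert bytes to decimal gigabytes (1 GB = 10^9 bytes).
fn to_gb(bytes: u64) -> f64 {
    bytes as f64 / 1e9
}
```

cache_bytes(1_240_000_000, 64, 6) gives 476_160_000_000 bytes, which
to_gb reports as about 476 GB.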

Implementation: PMAT-691 work session started 2026-05-18. Phase 1
deliverables are: teacher_provider.rs module, RealizarTeacher wrapper,
pipeline.rs::train() rewrite to use it, unit test against golden fixture.
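
The golden-fixture test can be pictured with a stand-in provider. In
this sketch the trait shape follows the spec, while FixtureTeacher,
max_abs_error, and the String error type are hypothetical; nothing here
calls realizar.

```rust
/// Trait shape as given in the spec. The error type is a plain String
/// here (assumption: the spec does not say which `Result` alias it uses).
pub trait TeacherLogitsProvider {
    fn logits_for_batch(&mut self, input_ids: &[Vec<u32>])
        -> Result<Vec<Vec<f32>>, String>;
}

/// Hypothetical stand-in for RealizarTeacher: one-hot "logits" on the
/// last token of each sequence, so the output is fully deterministic.
pub struct FixtureTeacher {
    pub vocab: usize,
}

impl TeacherLogitsProvider for FixtureTeacher {
    fn logits_for_batch(&mut self, input_ids: &[Vec<u32>])
        -> Result<Vec<Vec<f32>>, String>
    {
        let vocab = self.vocab;
        input_ids
            .iter()
            .map(|seq| -> Result<Vec<f32>, String> {
                // An empty sequence has no last position to score.
                let last = *seq.last().ok_or("empty sequence")?;
                Ok((0..vocab)
                    .map(|v| if v as u32 == last { 1.0 } else { 0.0 })
                    .collect())
            })
            .collect()
    }
}

/// Largest element-wise absolute difference; None on length mismatch.
pub fn max_abs_error(got: &[f32], want: &[f32]) -> Option<f32> {
    if got.len() != want.len() {
        return None;
    }
    Some(got.iter().zip(want).map(|(g, w)| (g - w).abs()).fold(0.0, f32::max))
}
```

A test against a recorded fixture would then assert that max_abs_error
stays below 1e-3 for each returned row.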

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 docs/roadmaps/roadmap.yaml                    | 25 +++++++--
 .../aprender-train/distillation-epic-spec.md  | 54 ++++++++++++-------
 2 files changed, 57 insertions(+), 22 deletions(-)

diff --git a/docs/roadmaps/roadmap.yaml b/docs/roadmaps/roadmap.yaml
index c581c6d08..4fe6deb73 100644
--- a/docs/roadmaps/roadmap.yaml
+++ b/docs/roadmaps/roadmap.yaml
@@ -9227,10 +9227,10 @@ roadmap:
   item_type: task
   title: 'P2-D: True distillation from MODEL-1 (apr distill)'
   status: planned
-  priority: medium
+  priority: high
   assigned_to: null
   created: 2026-05-16T07:15:09Z
-  updated: 2026-05-16T07:15:09Z
+  updated: 2026-05-18T10:01:40Z
   spec: null
   acceptance_criteria:
   - 'Ship apr distill per §35 (currently STUB). Architectural change; defer until P2-C exhausted. Multi-week scope. Effort: 16-40h. Δship +10. P=25%.'
@@ -9250,10 +9250,10 @@ roadmap:
   item_type: task
   title: 'P1-B: HumanEval pass@1 on best checkpoint (DEFERRED)'
   status: planned
-  priority: medium
+  priority: high
   assigned_to: null
   created: 2026-05-16T07:15:26Z
-  updated: 2026-05-16T07:15:26Z
+  updated: 2026-05-18T10:01:40Z
   spec: null
   acceptance_criteria:
   - 'BLOCKED on val_loss < 3.0 per audit Rec #3 (was < 4.0). At current val_loss=4.71 (perplexity ~111) zero-shot reasoning is mathematically impossible. apr eval humaneval. Effort: 5-8h gx10. Δship +3 if pass>5%. P=3%. Run AFTER P2-C lands a better checkpoint.'
@@ -9402,3 +9402,20 @@ roadmap:
   - ship-model-2
   - upstream-fix
   notes: null
+- id: PMAT-691
+  github_issue: null
+  item_type: task
+  title: Distillation Phase 1 — apr distill prepare (teacher logits cache)
+  status: inprogress
+  priority: high
+  assigned_to: null
+  created: 2026-05-18T10:01:48Z
+  updated: 2026-05-18T10:01:53.808598526+00:00
+  spec: null
+  acceptance_criteria:
+  - 'SPEC-DISTILL-001 Phase 1. New ''apr distill prepare --teacher <path> --dataset <bin-shards> --out <cache-dir>'' subcommand. Iterates batches via realizar::Model::forward_logits, writes top-K=64 sparse cache (.logits.bin per batch) with header magic APRLOG. New module aprender-train-distill/src/teacher_cache.rs. Falsifier F-DISTILL-PREP-001: 100-batch slice, cos sim >= 0.999 vs online realizar recompute. Effort: 16-24h eng + 4-8h compute, 3 days calendar.'
+  phases: []
+  subtasks: []
+  estimated_effort: null
+  labels: []
+  notes: null
diff --git a/docs/specifications/aprender-train/distillation-epic-spec.md b/docs/specifications/aprender-train/distillation-epic-spec.md
index 6a6582c79..f8fe8d47e 100644
--- a/docs/specifications/aprender-train/distillation-epic-spec.md
+++ b/docs/specifications/aprender-train/distillation-epic-spec.md
@@ -1,11 +1,12 @@
 # Specification: Distillation Epic — paiml/albor-370m-v2
 
 **Document ID:** SPEC-DISTILL-001
-**Version:** 1.0.0
-**Status:** Live — opens the distillation track that picks up where MODEL-2's §88 stack-existence-proof ship left off
+**Version:** 1.1.0 (priority promoted to HIGH; Phase 1 design revised from cache → online teacher provider after the storage-math sanity check)
+**Status:** **Live and ACTIVE** — distillation track is the highest-priority work after MODEL-2 §88 shipped
+**Priority:** **HIGH** (per `pmat work edit`, 2026-05-18 — PMAT-683 + PMAT-684 both elevated from `medium` to `high`)
 **Parent:** [ship-model-2-spec.md §84.5](./ship-model-2-spec.md)
-**Triggers:** PMAT-683 (P2-D: True distillation from MODEL-1), PMAT-684 (P1-B: HumanEval pass@1)
-**First applied:** TBD — see Phase 1 below
+**Triggers:** PMAT-683 (P2-D: True distillation from MODEL-1) **HIGH**, PMAT-684 (P1-B: HumanEval pass@1) **HIGH**, PMAT-691 (Phase 1 implementation kickoff) **HIGH**
+**First applied:** Phase 1 work session started 2026-05-18 (PMAT-691)
 
 ## Purpose
 
@@ -35,20 +36,35 @@ Closing this gap is the epic.
 
 ## Phased plan
 
-### Phase 1 — Teacher logits service via realizar (PMAT-683 P1, ~3 days)
+### Phase 1 — Online teacher logits provider (PMAT-683 P1 + PMAT-691, ~2 days)
 
-**Goal**: `apr distill` invokes realizar to compute teacher logits over the training corpus and caches them to disk.
+**Goal**: `aprender-train-distill` can fetch full teacher logits for an arbitrary batch by delegating to realizar — synchronously, in the training hot path.
 
-Why a separate phase: teacher forward over a 7B model is expensive (~3-5s/batch on RTX 4090 for batch=8 seq=512). Recomputing per training step is wasteful. The HF community standard (DistilBERT, MiniLM, Distil-Qwen) caches teacher logits once per corpus and replays them across training epochs. We do the same.
+**Why online instead of on-disk cache** (revised from v1.0.0 of this spec): the original plan (top-K=64 sparse cache) does not scale to a real corpus. The math: 1.24B tokens (qwen-v3 corpus) × 64 entries/position × ~6 bytes/entry (u32 index + f16 logit) ≈ **476 GB**. That exceeds the lambda-vector NVMe budget. Lowering K further degrades KD signal-to-noise. Modern distillation pipelines (DistilBERT 2019, MiniLM 2020, Distil-Qwen 2024) all use **online teacher inference** — pay the ~2× student-step cost in exchange for zero cache footprint. We do the same.
 
-**Deliverables:**
-1. `apr distill prepare --teacher <path> --dataset <bin-shards> --out <cache-dir>` — new subcommand. Iterates batches, calls `realizar::Model::forward_logits(input_ids) -> Vec<f32>`, writes per-batch `.logits.bin` files (top-K + indices to reduce storage; K=64 is standard).
-2. New module `aprender-train-distill/src/teacher_cache.rs` — defines on-disk format (header magic `APRLOG\0`, per-batch records `(batch_idx, seq_len, top_k_logits[K], top_k_indices[K])`).
-3. Integration test: prepare 100-batch slice, verify cache file size + read-back matches realizar's online compute within float-roundoff.
-
-**Effort estimate**: 16-24 hours engineering + 4-8h compute (1B-token corpus prep takes ~6h on RTX 4090 for a 7B teacher at batch=8).
+The cache approach is preserved as a **Phase 1.5 optional optimization**: an in-memory ring of N pre-computed batches that the producer thread fills while the student GPU is busy on the previous batch. Adds 0 disk cost, hides teacher latency under student compute. Implement after Phase 4 lands real numbers worth optimizing.
 
-**Falsifier**: F-DISTILL-PREP-001 — `apr distill prepare` produces a cache where, for any random batch, calling `realizar` online produces the same top-K within cosine sim ≥ 0.999.
+**Deliverables:**
+1. New module `aprender-train-distill/src/teacher_provider.rs` — defines:
+   ```rust
+   pub trait TeacherLogitsProvider {
+       fn logits_for_batch(&mut self, input_ids: &[Vec<u32>]) -> Result<Vec<Vec<f32>>>;
+   }
+   pub struct RealizarTeacher { /* wraps realizar::Model */ }
+   impl TeacherLogitsProvider for RealizarTeacher { ... }
+   ```
+2. `RealizarTeacher::new(path: &Path, device: Device)` loads the teacher .apr/.gguf via realizar's standard path; subsequent `logits_for_batch` calls run forward and return the full V-dim distribution (or last-position logits depending on training objective).
+3. Wired call site: `aprender-train-distill/src/pipeline.rs::train()` replaces `build_synthetic_logits(...)` with `self.teacher.logits_for_batch(...)`.
+4. Unit test against a frozen golden reference: load a 3-layer toy teacher; assert `logits_for_batch(&[vec![1, 2, 3]])` returns logits matching a recorded fixture within `1e-3` absolute error.
+
+**Effort estimate**: 16-24 hours engineering + <1h compute. No corpus prep needed (no cache).
+
+**Falsifier**: F-DISTILL-TEACHER-001 — `RealizarTeacher::logits_for_batch` output matches `realizar`'s standard `apr trace --layer logits` JSON dump on the same input within 1e-3 absolute error, for a frozen 3-layer fixture model.
+
+**Implementation notes**:
+- For the v2 ship we need full V-dim logits during the student step (KL divergence is computed over the full distribution). Top-K truncation can come later as an optimization once Phase 4 numbers exist.
+- The teacher is held in GPU memory between calls. On RTX 4090 (24GB) the 7B Q4_K teacher fits with ~16GB headroom — enough for a batch-8 seq-512 student to share the GPU.
+- realizar's `Model::forward_logits` already returns `Vec<f32>` for the last position. For sequence-wise KD (every position), Phase 1 needs to expose an extended API `forward_logits_full` returning shape `[batch, seq_len, vocab]`. If not present, add it in `aprender-serve` as a small extension.
 
 ### Phase 2 — Student forward+backward wired to KD loss (PMAT-683 P2, ~2 days)
 
@@ -124,21 +140,22 @@ Why a separate phase: teacher forward over a 7B model is expensive (~3-5s/batch
 
 | Phase | Engineering | Compute | Calendar |
 |---|---|---|---|
-| 1: Teacher logits cache | 16-24h | 4-8h | 3 days |
+| 1: Online teacher provider (PMAT-691) | 16-24h | <1h | 2 days |
 | 2: KD-wired forward+backward | 16-24h | <1h | 2 days |
 | 3: E2E smoke (500 steps) | 8h | 4h | 1 day |
 | 4: V2 training run | 4h | 30h (unattended) | 2-3 days wall |
 | 5: HumanEval discharge | 4h | 5-8h | 1 day |
 | 6: Publish v2 | 3h | 1h | 0.5 day |
-| **Total** | **~70h** | **~45h** | **~10 days** |
+| **Total** | **~70h** | **~40h** | **~9 days** |
 
-This matches PMAT-683's "16-40h. Δship +10. P=25%" original estimate when scaled to the realistic full epic; PMAT-683 was authored before §82 P2-A landed the trainer infrastructure that Phase 2 reuses.
+This matches PMAT-683's "16-40h. Δship +10. P=25%" original estimate when scaled to the realistic full epic; PMAT-683 was authored before §82 P2-A landed the trainer infrastructure that Phase 2 reuses. The v1.1.0 revision (online teacher provider instead of disk cache) drops Phase 1 compute from 4-8h to <1h.
 
 ## Risk register
 
 | Risk | Mitigation |
 |---|---|
-| Teacher logits cache is too large (~100 GB for full qwen-v3) | Top-K=64 sparsification reduces by ~25× → ~4 GB. Stream from disk per batch. |
+| Online teacher inference adds a ~3-5s teacher forward to each ~1.5s student step on RTX 4090 (roughly 3-4× step time if run serially) | At 50K steps that is ≈ 63-90 hr active training versus ≈ 21 hr student-only; accepted for Phase 4. Phase 1.5 ring-cache reclaims it later if needed. |
+| Teacher + student both on the same GPU could OOM at batch > 8 | Phase 3 smoke confirms VRAM headroom. Fallback: place teacher on a separate device or use `--device-teacher cpu` (slower but works on any host). |
 | Student forward+backward + KD has unmeasured CUDA bugs that only surface at scale | Phase 3 smoke catches anything that surfaces in the first 500 steps. Phase 4 has per-epoch sanity (val_loss monotone + checkpoint preserved). |
 | HumanEval pass@1 falls below 15% at end of Phase 4 → ship is blocked | Two-path fallback: (a) widen corpus per the P2-C lineage (4× corpus), retrain Phase 4 from scratch; (b) drop KD temperature from 4.0 → 2.0 to sharpen teacher distribution. Both add ~1 week each. |
 | KD loss is mathematically wrong or numerically unstable (e.g., logsumexp drift, FP16 overflow on KL) | Phase 2 unit test compares against a Python reference (PyTorch nn.KLDivLoss) within 1e-4 absolute error. |
@@ -173,4 +190,5 @@ This matches PMAT-683's "16-40h. Δship +10. P=25%" original estimate when scale
 
 ## Changelog
 
+- **1.1.0 (2026-05-18)** — Priority promoted to HIGH (PMAT-683 + PMAT-684 + new PMAT-691 elevated via `pmat work edit`). Phase 1 design revised from on-disk top-K cache to online teacher logits provider after the storage-math sanity check showed 1.24B tokens × 64 entries × 6 bytes ≈ 476 GB (exceeds available disk). The cache approach moves to Phase 1.5 as an optional ring-buffer optimization. Online-teacher decision matches DistilBERT/Distil-Qwen actual practice. Effort + compute totals updated accordingly.
 - **1.0.0 (2026-05-18)** — Initial publish. Scopes the 6-phase plan opening the distillation track that picks up MODEL-2 v2 from where the §88 stack-existence-proof ship left off.