25 changes: 21 additions & 4 deletions docs/roadmaps/roadmap.yaml
@@ -9227,10 +9227,10 @@ roadmap:
item_type: task
title: 'P2-D: True distillation from MODEL-1 (apr distill)'
status: planned
-priority: medium
+priority: high
assigned_to: null
created: 2026-05-16T07:15:09Z
-updated: 2026-05-16T07:15:09Z
+updated: 2026-05-18T10:01:40Z
spec: null
acceptance_criteria:
- 'Ship apr distill per §35 (currently STUB). Architectural change; defer until P2-C exhausted. Multi-week scope. Effort: 16-40h. Δship +10. P=25%.'
@@ -9250,10 +9250,10 @@ roadmap:
item_type: task
title: 'P1-B: HumanEval pass@1 on best checkpoint (DEFERRED)'
status: planned
-priority: medium
+priority: high
assigned_to: null
created: 2026-05-16T07:15:26Z
-updated: 2026-05-16T07:15:26Z
+updated: 2026-05-18T10:01:40Z
spec: null
acceptance_criteria:
- 'BLOCKED on val_loss < 3.0 per audit Rec #3 (was < 4.0). At current val_loss=4.71 (perplexity ~111) zero-shot reasoning is mathematically impossible. apr eval humaneval. Effort: 5-8h gx10. Δship +3 if pass>5%. P=3%. Run AFTER P2-C lands a better checkpoint.'
@@ -9402,3 +9402,20 @@ roadmap:
- ship-model-2
- upstream-fix
notes: null
- id: PMAT-691
github_issue: null
item_type: task
title: Distillation Phase 1 — apr distill prepare (teacher logits cache)
status: inprogress
priority: high
assigned_to: null
created: 2026-05-18T10:01:48Z
updated: 2026-05-18T10:01:53.808598526+00:00
spec: null
acceptance_criteria:
- 'SPEC-DISTILL-001 Phase 1. New ''apr distill prepare --teacher <path> --dataset <bin-shards> --out <cache-dir>'' subcommand. Iterates batches via realizar::Model::forward_logits, writes top-K=64 sparse cache (.logits.bin per batch) with header magic APRLOG. New module aprender-train-distill/src/teacher_cache.rs. Falsifier F-DISTILL-PREP-001: 100-batch slice, cos sim >= 0.999 vs online realizar recompute. Effort: 16-24h eng + 4-8h compute, 3 days calendar.'
phases: []
subtasks: []
estimated_effort: null
labels: []
notes: null
194 changes: 194 additions & 0 deletions docs/specifications/aprender-train/distillation-epic-spec.md
@@ -0,0 +1,194 @@
# Specification: Distillation Epic — paiml/albor-370m-v2

**Document ID:** SPEC-DISTILL-001
**Version:** 1.1.0 (priority promoted to HIGH; Phase 1 design revised from cache → online teacher provider after the storage-math sanity check)
**Status:** **Live and ACTIVE** — distillation track is the highest-priority work after MODEL-2 §88 shipped
**Priority:** **HIGH** (per `pmat work edit`, 2026-05-18 — PMAT-683 + PMAT-684 both elevated from `medium` to `high`)
**Parent:** [ship-model-2-spec.md §84.5](./ship-model-2-spec.md)
**Triggers:** PMAT-683 (P2-D: True distillation from MODEL-1) **HIGH**, PMAT-684 (P1-B: HumanEval pass@1) **HIGH**, PMAT-691 (Phase 1 implementation kickoff) **HIGH**
**First applied:** Phase 1 work session started 2026-05-18 (PMAT-691)

## Purpose

MODEL-2's v1 ship (`paiml/albor-370m-v1`, val_loss=4.6227, 2026-05-18) proved the pure-Rust Sovereign AI Stack runs end-to-end. The model is honestly framed as a stack-existence-proof — it produces gibberish at val_perplexity≈102 and is explicitly NOT a production code-completion model.

The distillation epic ships **MODEL-2 v2**, `paiml/albor-370m-v2` (or v1.1.0): a ~370M-param student distilled from the **MODEL-1 teacher** (`paiml/qwen2.5-coder-7b-apache-q4k-v1`) that reaches a usable HumanEval pass@1. Ship-goal: pass@1 ≥ 25%, with 15% as the acceptance floor (for reference, competitive 0.5B models score ~30-40% and the upstream 7B teacher ~91%).

The epic's strategic value: it converts the stack from "runs end-to-end" → "produces models worth using". That's the difference between an existence proof and a product.

## Current state (2026-05-18)

The distillation infrastructure is **scaffolded but stubbed**:

| Component | File | State |
|---|---|---|
| CLI: `apr distill ...` | `crates/apr-cli/src/commands/distill.rs:1119` | Stub — writes a metadata-only "pending_download" manifest |
| Pipeline orchestrator | `crates/aprender-train-distill/src/pipeline.rs:115-160` | Stub — uses **synthetic logits derived from weight tensor slices** instead of actual forward passes through teacher + student |
| KD loss | `crates/aprender-train/src/distill/loss.rs:34` (`DistillationLoss`) | **Real** — KD loss = α · CE(student, hard_target) + (1-α) · T² · KL(softmax(student/T), softmax(teacher/T)) |
| HF-pipeline KD variant | `crates/aprender-train/src/hf_pipeline/distillation/loss.rs:24` | **Real** — used for distillation under the hf_pipeline path |
| Student forward+backward | `crates/aprender-train/src/train/transformer_trainer/cuda_trainer.rs:1256` (`forward_logits`), `:2378` (`forward_backward_batch`) | **Real** — `CudaTransformerTrainer` proven via MODEL-2 §82 P2-A 5000-step run |
| Teacher inference | `realizar` (aprender-serve) — `apr run` path | **Real** — proven via MODEL-1 SHIP-005 HumanEval=86.59% |
| GGUF / SafeTensors weight loading | `aprender-core::format::converter` | **Real** — v0.34.0 layout + Q4_K paths shipped |
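
The KD loss row above can be sketched for a single token position as follows. This is an illustrative reimplementation, not the code in `DistillationLoss`: the function names are hypothetical, and the KL term is taken in the conventional teacher-to-student direction, KL(teacher ‖ student), which is an assumption because the table only lists the two arguments.

```rust
/// Numerically stable log-softmax of `logits / temperature`.
fn log_softmax(logits: &[f32], temperature: f32) -> Vec<f32> {
    let scaled: Vec<f32> = logits.iter().map(|&x| x / temperature).collect();
    let max = scaled.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let log_sum = scaled.iter().map(|&x| (x - max).exp()).sum::<f32>().ln() + max;
    scaled.iter().map(|&x| x - log_sum).collect()
}

/// Combined KD loss for one position (hypothetical helper):
/// alpha * CE(student, hard_target) + (1 - alpha) * T^2 * KL(teacher_T || student_T).
pub fn kd_loss(student: &[f32], teacher: &[f32], target: usize, alpha: f32, t: f32) -> f32 {
    assert_eq!(student.len(), teacher.len(), "vocab sizes must match");
    // Hard-label cross-entropy is taken at temperature 1.
    let ce = -log_softmax(student, 1.0)[target];
    // Soft-label term compares the temperature-softened distributions.
    let log_p_student = log_softmax(student, t);
    let log_p_teacher = log_softmax(teacher, t);
    let kl: f32 = log_p_teacher
        .iter()
        .zip(&log_p_student)
        .map(|(&lt, &ls)| lt.exp() * (lt - ls))
        .sum();
    alpha * ce + (1.0 - alpha) * t * t * kl
}
```

When student and teacher logits are identical the KL term vanishes and the loss reduces to α·CE, which is a quick sanity check for any real implementation.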

**The gap**: pipeline.rs's `train()` method constructs `teacher_logits` from `build_synthetic_logits(&teacher_weights, batch_size, num_classes=32)` — a flat 32-class slice of weight bytes. It NEVER calls realizar to run the teacher, NEVER calls `CudaTransformerTrainer.forward_backward_batch` on the student. The loss math is exercised on synthetic data; the gradient updates do nothing meaningful.

Closing this gap is the epic.

## Phased plan

### Phase 1 — Online teacher logits provider (PMAT-683 P1 + PMAT-691, ~2 days)

**Goal**: `aprender-train-distill` can fetch full teacher logits for an arbitrary batch by delegating to realizar — synchronously, in the training hot path.

**Why online instead of on-disk cache** (revised from v1.0.0 of this spec): the original plan (top-K=64 sparse cache) does not scale to a real corpus. The math: 1.24B tokens (qwen-v3 corpus) × 64 entries/position × ~6 bytes/entry (u32 index + f16 logit) ≈ **476 GB**. That exceeds the lambda-vector NVMe budget. Lowering K further degrades KD signal-to-noise. Modern distillation pipelines (DistilBERT 2019, MiniLM 2020, Distil-Qwen 2024) all use **online teacher inference** — pay the ~2× student-step cost in exchange for zero cache footprint. We do the same.
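
The storage arithmetic above can be reproduced with a few lines. This is a minimal sketch; the function and constant names are hypothetical, and the 6-byte entry assumes the u32 index plus f16 logit layout stated in the paragraph.

```rust
/// Bytes per sparse cache entry: u32 token index (4) + f16 logit (2).
pub const ENTRY_BYTES: u64 = 4 + 2;

/// Total bytes for a top-K sparse logits cache over `tokens` positions.
pub fn sparse_cache_bytes(tokens: u64, k: u64, bytes_per_entry: u64) -> u64 {
    tokens * k * bytes_per_entry
}

/// Same figure in decimal gigabytes (1 GB = 1e9 bytes).
pub fn sparse_cache_gb(tokens: u64, k: u64, bytes_per_entry: u64) -> f64 {
    sparse_cache_bytes(tokens, k, bytes_per_entry) as f64 / 1e9
}
```

For the qwen-v3 corpus, `sparse_cache_gb(1_240_000_000, 64, ENTRY_BYTES)` gives 476.16, matching the ≈ 476 GB figure. Halving K to 32 still leaves about 238 GB.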

The cache approach is preserved as a **Phase 1.5 optional optimization**: an in-memory ring of N pre-computed batches that the producer thread fills while the student GPU is busy on the previous batch. Adds 0 disk cost, hides teacher latency under student compute. Implement after Phase 4 lands real numbers worth optimizing.
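
One way the Phase 1.5 ring could be structured is a bounded channel between a producer thread (teacher) and the training loop (student). The sketch below uses only the standard library; `mock_teacher_logits` and `run_with_prefetch` are hypothetical stand-ins, and a real version would call the teacher provider and the student trainer instead.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Stand-in for a teacher forward pass (hypothetical; not realizar).
fn mock_teacher_logits(batch: &[u32]) -> Vec<f32> {
    batch.iter().map(|&id| id as f32 * 0.5).collect()
}

/// Runs the teacher on a producer thread, keeping at most `ring_size` batches
/// buffered ahead of the consumer. Returns the sum of logits per consumed batch.
pub fn run_with_prefetch(batches: Vec<Vec<u32>>, ring_size: usize) -> Vec<f32> {
    let (tx, rx) = sync_channel::<(Vec<u32>, Vec<f32>)>(ring_size);
    let producer = thread::spawn(move || {
        for batch in batches {
            let logits = mock_teacher_logits(&batch);
            // `send` blocks once `ring_size` batches are waiting, bounding memory.
            if tx.send((batch, logits)).is_err() {
                break; // the consumer hung up
            }
        }
        // `tx` is dropped here, which ends the consumer's iterator.
    });
    // Consumer: stands in for the student step; here it only reduces the logits.
    let out: Vec<f32> = rx
        .iter()
        .map(|(_batch, logits)| logits.iter().sum::<f32>())
        .collect();
    producer.join().expect("producer thread panicked");
    out
}
```

The bounded channel gives back-pressure for free: the teacher can run at most `ring_size` batches ahead, so memory stays fixed however slow the student step is.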

**Deliverables:**
1. New module `aprender-train-distill/src/teacher_provider.rs` — defines:
```rust
/// Supplies teacher logits for one batch of token-id sequences.
pub trait TeacherLogitsProvider {
    fn logits_for_batch(&mut self, input_ids: &[Vec<u32>]) -> Result<Vec<Vec<f32>>>;
}

pub struct RealizarTeacher { /* wraps realizar::Model */ }

impl TeacherLogitsProvider for RealizarTeacher { ... }
```
2. `RealizarTeacher::new(path: &Path, device: Device)` loads the teacher .apr/.gguf via realizar's standard path; subsequent `logits_for_batch` calls run forward and return the full V-dim distribution (or last-position logits depending on training objective).
3. Wired call site: `aprender-train-distill/src/pipeline.rs::train()` replaces `build_synthetic_logits(...)` with `self.teacher.logits_for_batch(...)`.
4. Unit test against a frozen golden reference: load a 3-layer toy teacher; assert `logits_for_batch([[1, 2, 3]])` returns logits matching a recorded fixture within `1e-3` absolute error.
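
The golden-fixture test in deliverable 4 could take roughly the following shape. This is a sketch under stated assumptions: `ToyTeacher` is a hypothetical deterministic stand-in (not realizar), the `Result` alias uses `String` errors for brevity, and `matches_fixture` is an illustrative helper rather than an existing function.

```rust
type Result<T> = std::result::Result<T, String>;

pub trait TeacherLogitsProvider {
    fn logits_for_batch(&mut self, input_ids: &[Vec<u32>]) -> Result<Vec<Vec<f32>>>;
}

/// Hypothetical deterministic teacher: logit[v] = last_token * 0.1 + v.
pub struct ToyTeacher {
    pub vocab: usize,
}

impl TeacherLogitsProvider for ToyTeacher {
    fn logits_for_batch(&mut self, input_ids: &[Vec<u32>]) -> Result<Vec<Vec<f32>>> {
        input_ids
            .iter()
            .map(|seq| -> Result<Vec<f32>> {
                let last = *seq.last().ok_or_else(|| "empty sequence".to_string())?;
                Ok((0..self.vocab).map(|v| last as f32 * 0.1 + v as f32).collect())
            })
            .collect()
    }
}

/// True when shapes agree and every element is within `tol` of the fixture.
pub fn matches_fixture(got: &[Vec<f32>], want: &[Vec<f32>], tol: f32) -> bool {
    got.len() == want.len()
        && got.iter().zip(want).all(|(g, w)| {
            g.len() == w.len() && g.iter().zip(w).all(|(a, b)| (a - b).abs() <= tol)
        })
}
```

Comparing with an absolute tolerance rather than byte equality keeps the test stable across kernels that reorder floating-point reductions.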

**Effort estimate**: 16-24 hours engineering + <1h compute. No corpus prep needed (no cache).

**Falsifier**: F-DISTILL-TEACHER-001 — `RealizarTeacher::logits_for_batch` output matches `realizar`'s standard `apr trace --layer logits` JSON dump on the same input within 1e-3 absolute error, for a frozen 3-layer fixture model.

**Implementation notes**:
- For the v2 ship we need full V-dim logits during the student step (KL divergence is computed over the full distribution). Top-K truncation can come later as an optimization once Phase 4 numbers exist.
- The teacher is held in GPU memory between calls. On RTX 4090 (24GB) the 7B Q4_K teacher fits with ~16GB headroom — enough for a batch-8 seq-512 student to share the GPU.
- realizar's `Model::forward_logits` already returns `Vec<f32>` for the last position. For sequence-wise KD (every position), Phase 1 needs to expose an extended API `forward_logits_full` returning shape `[batch, seq_len, vocab]`. If not present, add it in `aprender-serve` as a small extension.
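
To size the `[batch, seq_len, vocab]` tensor mentioned in the last note, a rough footprint calculation helps. The sketch assumes f32 elements and a row-major flat layout; both are assumptions, as are the helper names. The batch-8, seq-512 shape comes from the VRAM note above, and the 151,936-token vocab is the Qwen2 vocabulary size this spec cites.

```rust
/// Bytes needed for a dense [batch, seq_len, vocab] logits tensor.
pub fn full_logits_bytes(batch: u64, seq_len: u64, vocab: u64, bytes_per_elem: u64) -> u64 {
    batch * seq_len * vocab * bytes_per_elem
}

/// Row-major flat index of logit (b, s, v) in a [batch, seq_len, vocab] buffer.
pub fn flat_index(b: usize, s: usize, v: usize, seq_len: usize, vocab: usize) -> usize {
    (b * seq_len + s) * vocab + v
}
```

`full_logits_bytes(8, 512, 151_936, 4)` is 2,489,319,424 bytes, about 2.49 GB per full-sequence logits tensor, which is worth counting against the shared-GPU headroom.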

### Phase 2 — Student forward+backward wired to KD loss (PMAT-683 P2, ~2 days)

**Goal**: Replace pipeline.rs's `build_synthetic_logits(...)` loss path with a real `CudaTransformerTrainer.forward_backward_kd_batch(...)` call (the method added in deliverable 1).

**Deliverables:**
1. Extend `CudaTransformerTrainer` (`crates/aprender-train/src/train/transformer_trainer/cuda_trainer.rs`) with `forward_backward_kd_batch(batch: &LMBatch, teacher_logits: &TeacherLogitsBatch) -> f32`. The trainer already has `forward_backward_batch`; the KD variant computes the combined loss = α·CE + (1-α)·T²·KL via `DistillationLoss` instead of CE alone, then back-props.
2. Rewrite `aprender-train-distill/src/pipeline.rs::train()` to (a) construct the Phase 1 teacher provider, (b) iterate corpus batches and fetch teacher logits for each via `logits_for_batch`, (c) call `forward_backward_kd_batch`, (d) optimizer step.
3. Unit test: 10-step run on a 3-layer toy student + 4-layer toy teacher, verify loss strictly decreases.

**Effort estimate**: 16-24 hours engineering. No compute beyond the unit test.

**Falsifier**: F-DISTILL-KD-001 — student loss after 100 steps < student loss at step 0 (sanity); F-DISTILL-KD-002 — student logits cosine sim to teacher logits increases monotonically over the first 1000 steps (modulo noise).
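
F-DISTILL-KD-002 needs a cosine-similarity metric and some notion of "monotone modulo noise". The sketch below shows one possible reading: the helper names are hypothetical, and the `slack` tolerance for noise is an assumption, since the spec does not define how much dip is acceptable.

```rust
/// Cosine similarity of two equal-length vectors, accumulated in f64.
/// Returns None for mismatched lengths, empty input, or a zero-norm vector.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let (mut dot, mut na, mut nb) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (x as f64, y as f64);
        dot += x * y;
        na += x * x;
        nb += y * y;
    }
    if na == 0.0 || nb == 0.0 {
        return None;
    }
    Some((dot / (na.sqrt() * nb.sqrt())) as f32)
}

/// True when no step drops by more than `slack` below the previous value.
pub fn trends_up(series: &[f32], slack: f32) -> bool {
    series.windows(2).all(|w| w[1] >= w[0] - slack)
}
```

In practice the similarity would be logged every N steps on a fixed probe batch, and `trends_up` applied to that series at step 1000.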

### Phase 3 — End-to-end smoke (~1 day + 4h compute)

**Goal**: Run `apr distill` for 500 steps on a 10K-batch subset of the qwen-v3 corpus, with the MODEL-1 teacher served online. Verify that it doesn't crash, that val_loss decreases, and that the output passes `apr qa`.

**Deliverables**:
1. `apr distill --teacher paiml/qwen2.5-coder-7b-apache-q4k-v1 --init Qwen/Qwen2.5-Coder-0.5B-Instruct --dataset qwen-v3-10k --steps 500 --out runs/distill-smoke/` — end-to-end command.
2. Smoke evidence in `evidence/distill-smoke-<date>/` — launch.log, per-step metrics, final val_loss.
3. Spec amendment to ship-model-2-spec.md §85 documenting the smoke.

**Effort estimate**: 8h engineering + 4h gx10 compute.

**Falsifier**: F-DISTILL-SMOKE-001 — val_loss at step 500 < val_loss at step 0.

### Phase 4 — Distillation training run for the v2 ship (~30h compute, mostly unattended; 2-3 days wall)

**Goal**: Long-running distillation training to produce the v2 student checkpoint.

**Hyperparameters** (Chinchilla-adjacent, tuned from §82 P2-A):
- Init: `Qwen/Qwen2.5-Coder-0.5B-Instruct` (same as v1)
- Teacher: `paiml/qwen2.5-coder-7b-apache-q4k-v1` (Q4_K, the MODEL-1 distillation teacher)
- Corpus: qwen-v3 (1.24B tokens, the §77 5g.1 corpus) — full pass for the first time
- Steps: ~50K (≈ 0.41B tokens consumed at batch=16 seq=512 = 8192 tokens/step; a full pass over the 1.24B-token corpus would need ~151K steps)
- LR: 1.5e-5 → 5e-7 cosine
- T (KD temperature): 4.0 (DistilBERT default; tune to 2.0 if loss curve plateaus early)
- α (CE weight): 0.3 (per Hinton et al. 2015 — KD signal dominates when teacher is reliable)
- Wall time on RTX 4090: 50K steps × ~1.5 s/step (student forward+backward + teacher cache read) ≈ 21 hours active training. With teacher cache prep + checkpointing overhead ≈ 30 hours total.
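
The learning-rate line above gives only the endpoints and the word "cosine". A minimal sketch of the standard cosine decay between those endpoints follows; the absence of warmup and the function name are assumptions, and the trainer's actual schedule may differ.

```rust
use std::f64::consts::PI;

/// Cosine decay from `lr_max` at step 0 to `lr_min` at `total_steps`.
/// Steps past `total_steps` stay at `lr_min`.
pub fn cosine_lr(step: u64, total_steps: u64, lr_max: f64, lr_min: f64) -> f64 {
    if total_steps == 0 {
        return lr_min;
    }
    let progress = step.min(total_steps) as f64 / total_steps as f64;
    lr_min + 0.5 * (lr_max - lr_min) * (1.0 + (PI * progress).cos())
}
```

With `lr_max = 1.5e-5`, `lr_min = 5e-7`, and 50K total steps, this yields 1.5e-5 at step 0, 7.75e-6 at step 25K, and 5e-7 at step 50K.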

**Deliverables**:
1. Best checkpoint `runs/albor-370m-v2/ckpt/epoch-N.apr`.
2. Per-epoch eval against `apr eval humaneval` (sample 20 problems mid-run, full 164 at end).
3. Spec amendment §86 (re-using the deprecated §86 slot) documenting the v2 run.

**Effort estimate**: 4h dispatch + setup + 30h compute (unattended on gx10 or lambda-labs).

**Falsifier**: F-DISTILL-V2-001 — best val_loss < 3.0 (would unblock PMAT-684 HumanEval per the audit Rec #3); F-DISTILL-V2-002 — HumanEval pass@1 ≥ 15% (loose first-target).

### Phase 5 — HumanEval discharge (PMAT-684, ~1 day)

**Goal**: Run full 164-problem HumanEval on the v2 best checkpoint, verify pass@1 against the target.

**Deliverables**:
1. `apr eval humaneval --model runs/albor-370m-v2/ckpt/epoch-N.apr --samples 164 --temperature 0.2 --top-p 0.95 --gpu cuda` — run.
2. Evidence at `evidence/distill-v2-humaneval-<date>/`.
3. Discharge AC-SHIP2-005 in ship-model-2-spec.md.

**Effort estimate**: 5-8h gx10 compute (sequential through 164 problems, 16 samples each to estimate pass@1).
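
When several samples are drawn per problem, pass@k is usually computed with the unbiased estimator 1 − C(n−c, k)/C(n, k), where n is the number of samples and c the number that pass. The sketch below implements that estimator in its numerically stable product form; whether `apr eval humaneval` computes it this way is an assumption, and the function names are hypothetical.

```rust
/// Unbiased pass@k for one problem: n samples drawn, c of them correct.
/// Equals 1 - C(n - c, k) / C(n, k), evaluated as a running product.
pub fn pass_at_k(n: u32, c: u32, k: u32) -> f64 {
    assert!(c <= n && k >= 1 && k <= n, "need c <= n and 1 <= k <= n");
    if n - c < k {
        return 1.0; // every size-k subset contains a correct sample
    }
    let mut fail_all = 1.0;
    for i in (n - c + 1)..=n {
        fail_all *= 1.0 - k as f64 / i as f64;
    }
    1.0 - fail_all
}

/// Mean pass@k over problems, each given as (n_samples, n_correct).
pub fn mean_pass_at_k(results: &[(u32, u32)], k: u32) -> f64 {
    if results.is_empty() {
        return 0.0;
    }
    results.iter().map(|&(n, c)| pass_at_k(n, c, k)).sum::<f64>() / results.len() as f64
}
```

For k = 1 the estimator reduces to c/n per problem, so 4 passing samples out of 16 score 0.25 for that problem.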

**Falsifier**: F-HUMANEVAL-V2-001 — pass@1 ≥ 15% (acceptance threshold; under is "the corpus didn't transfer; need to widen corpus per P2-C lineage"); F-HUMANEVAL-V2-002 — pass@1 ≥ 25% (loose ship-goal threshold).

### Phase 6 — Publish v2 (~3h)

**Goal**: Re-stamp the best checkpoint with provenance + tokenizer (per SPEC-HF-PUBLISH-001) and ship to `paiml/albor-370m-v2` (or `paiml/albor-370m-v1.1.0`).

**Deliverables**: Follows SPEC-HF-PUBLISH-001 end-to-end — `apr stamp --tokenizer` → `apr export --quantize int4` → `apr publish`. With the v0.34.0+1783 (defect 6) binary, everything is autodetected from the staging directory. Three-path verification: `apr run` + HF Transformers + llama-cli.

**Effort estimate**: 3h staging + 1h compute (Q4_K export + upload).

## Total effort + timeline

| Phase | Engineering | Compute | Calendar |
|---|---|---|---|
| 1: Online teacher provider (PMAT-691) | 16-24h | <1h | 2 days |
| 2: KD-wired forward+backward | 16-24h | <1h | 2 days |
| 3: E2E smoke (500 steps) | 8h | 4h | 1 day |
| 4: V2 training run | 4h | 30h (unattended) | 2-3 days wall |
| 5: HumanEval discharge | 4h | 5-8h | 1 day |
| 6: Publish v2 | 3h | 1h | 0.5 day |
| **Total** | **~70h** | **~40h** | **~9 days** |

This matches PMAT-683's "16-40h. Δship +10. P=25%" original estimate when scaled to the realistic full epic; PMAT-683 was authored before §82 P2-A landed the trainer infrastructure that Phase 2 reuses. The v1.1.0 revision (online teacher provider instead of disk cache) drops Phase 1 compute from 4-8h to <1h.

## Risk register

| Risk | Mitigation |
|---|---|
| Online teacher inference doubles training step time (~3-5s teacher fwd + ~1.5s student step on RTX 4090) | Acceptable for Phase 4's 50K-step run (≈ 21 hr → 42 hr active training). Phase 1.5 ring-cache reclaims it later if needed. |
| Teacher + student both on the same GPU could OOM at batch > 8 | Phase 3 smoke confirms VRAM headroom. Fallback: place teacher on a separate device or use `--device-teacher cpu` (slower but works on any host). |
| Student forward+backward + KD has unmeasured CUDA bugs that only surface at scale | Phase 3 smoke catches anything that surfaces in the first 500 steps. Phase 4 has per-epoch sanity (val_loss monotone + checkpoint preserved). |
| HumanEval pass@1 falls below 15% at end of Phase 4 → ship is blocked | Two-path fallback: (a) widen corpus per the P2-C lineage (4× corpus), retrain Phase 4 from scratch; (b) drop KD temperature from 4.0 → 2.0 to sharpen teacher distribution. Both add ~1 week each. |
| KD loss is mathematically wrong or numerically unstable (e.g., logsumexp drift, FP16 overflow on KL) | Phase 2 unit test compares against a Python reference (PyTorch nn.KLDivLoss) within 1e-4 absolute error. |
| Realizar inference of the 7B teacher mid-training is too slow (~5s/batch), burning 50K × 5s ≈ 70h of teacher time on top of training | Phase 1.5 in-memory ring buffer: a producer thread pre-computes teacher batches while the student GPU is busy on the previous batch, hiding teacher latency under student compute at zero disk cost. |
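
The step-time figures in the register follow from simple multiplication, sketched here so they can be re-derived when measured timings arrive. The function name is hypothetical, and the model assumes a constant per-step cost with no checkpointing overhead.

```rust
/// Active training hours for `steps` steps at `secs_per_step` seconds each.
pub fn training_hours(steps: u64, secs_per_step: f64) -> f64 {
    steps as f64 * secs_per_step / 3600.0
}
```

`training_hours(50_000, 1.5)` is about 20.8 (the ≈ 21 hr student-only figure) and `training_hours(50_000, 5.0)` is about 69.4 (the ≈ 70h teacher figure).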

## Open architectural decisions

1. **Top-K vs full logits**: full V-dim logits (decided for the v2 ship), because the KL term is computed over the full distribution. Top-K=64 trades fidelity for storage; the v1.0.0 plan assumed, citing the KD literature (DistilBERT, Distil-Qwen), that pass@1 deltas from K=32 to K=full are within noise for code corpora. Top-K truncation stays available as a later optimization once Phase 4 numbers exist.
2. **Online teacher vs cached**: online (decided in v1.1.0). A top-K=64 on-disk cache for the 1.24B-token corpus needs ≈ 476 GB, which exceeds the NVMe budget. Online inference costs roughly 2× the student step but has zero cache footprint and lets temperature/α be tuned mid-training without re-prep cost. An in-memory ring cache remains an optional Phase 1.5 optimization.
3. **Distillation algorithm variant**: vanilla KD (Hinton 2015), NOT MiniLM (which adds attention-distillation losses) or TinyBERT (intermediate-layer matching). MODEL-1 is a Q4_K teacher — its FP16 intermediate activations aren't recoverable post-quantization. Vanilla KD on logits-only is the practical match.
4. **Tokenizer**: same Qwen2 vocab (151,936 tokens) as v1 — teacher + student share vocab, no token-alignment loss. This is the strongest argument for using `Qwen2.5-Coder-0.5B-Instruct` as init: matched tokenizer + matched arch family.

## Acceptance criteria

| ID | Criterion | Phase | Status |
|---|---|---|---|
| AC-DISTILL-001 | `RealizarTeacher::logits_for_batch` matches realizar's `apr trace --layer logits` dump within 1e-3 absolute error on the frozen 3-layer fixture (F-DISTILL-TEACHER-001) | Phase 1 | planned |
| AC-DISTILL-002 | `apr distill` runs 500 steps without crash; val_loss monotone | Phase 3 | planned |
| AC-DISTILL-003 | Phase 4 best val_loss < 3.0 | Phase 4 | planned |
| AC-DISTILL-004 | Phase 5 HumanEval pass@1 ≥ 15% (acceptance); aspire to ≥ 25% (ship-goal) | Phase 5 | planned |
| AC-DISTILL-005 | Phase 6 publish at `paiml/albor-370m-v2` (or v1.1.0) per SPEC-HF-PUBLISH-001 with all 3 usage paths verified | Phase 6 | planned |

## Cross-references

- Parent: [ship-model-2-spec.md §84.5](./ship-model-2-spec.md) — "What's next for MODEL-2"
- Predecessor: SPEC-SHIP-MODEL-2 §35 (`apr distill` stub identification, 2026-04-28), §82 (P2-A trainer infrastructure proven, 2026-05-15)
- Companion: [SPEC-HF-PUBLISH-001](./model-hf-publish-pipeline-spec.md) — used for Phase 6 publish
- Roadmap: [docs/roadmaps/roadmap.yaml](../../roadmaps/roadmap.yaml) PMAT-683 + PMAT-684
- Teacher: [`paiml/qwen2.5-coder-7b-apache-q4k-v1`](https://huggingface.co/paiml/qwen2.5-coder-7b-apache-q4k-v1) (MODEL-1, HumanEval=86.59%)
- Init: [`Qwen/Qwen2.5-Coder-0.5B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-Coder-0.5B-Instruct) (Apache-2.0, same arch + vocab as student)
- Audit: [audits/q4k-shape-swap-impact.md](../audits/q4k-shape-swap-impact.md) — confirms teacher's Q4_K artifact is bit-correct (no re-export needed)

## Changelog

- **1.1.0 (2026-05-18)** — Priority promoted to HIGH (PMAT-683 + PMAT-684 + new PMAT-691 elevated via `pmat work edit`). Phase 1 design revised from on-disk top-K cache to online teacher logits provider after the storage-math sanity check showed 1.24B tokens × 64 entries × 6 bytes ≈ 476 GB (exceeds available disk). The cache approach moves to Phase 1.5 as an optional ring-buffer optimization. Online-teacher decision matches DistilBERT/Distil-Qwen actual practice. Effort + compute totals updated accordingly.
- **1.0.0 (2026-05-18)** — Initial publish. Scopes the 6-phase plan opening the distillation track that picks up MODEL-2 v2 from where the §88 stack-existence-proof ship left off.