
docs(spec): §79 — external audit + Five-Whys retrospective on MODEL-2 convergence#1695

Closed
noahgift wants to merge 1 commit into main from spec/79-audit-five-whys

Conversation

@noahgift
Contributor

Summary

Synthesizes the external audit (`docs/specifications/two-model-spec-audit.md`) into the SPEC-SHIP-TWO-001 timeline as §79. Uses the Five-Whys methodology to explain the three failure cases that kept MODEL-2 stuck at val_loss=9.75 across months of training campaigns.

What §79 captures

Audit verdict — three root causes

| # | Root cause | Mechanism |
| --- | --- | --- |
| 1 | Data starvation | 370M model on 18.1M tokens = 0.24% of Chinchilla optimum |
| 2 | False plateau hypothesis | LR budget 20k→80k → val_loss 9.7513→9.7507 (no improvement) |
| 3 | Infrastructure masking | Silent CPU fallback, exhaustion placeholder, premature early-stop |
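
For scale, the data-starvation arithmetic works out as follows under the roughly 20-tokens-per-parameter compute-optimal ratio from Chinchilla (arXiv:2203.15556):

$$
370\,\mathrm{M\ params} \times 20\ \mathrm{tokens/param} = 7.4\,\mathrm{B\ tokens}, \qquad \frac{18.1\,\mathrm{M}}{7.4\,\mathrm{B}} \approx 0.24\%
$$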

Five-Whys

| Case | What | Root cause |
| --- | --- | --- |
| A | Silent corpus exhaustion | `ShardBatchIter` lacked wrap-around logic |
| B | Premature early-stop | `HELD_OUT_BATCHES=2` inherited from smoke-test |
| C | val_loss=9.75 plateau | Chinchilla under-provisioning + missing data-sufficiency gate |
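
For context on Case A, here is a minimal sketch of what the missing wrap-around could look like. The struct shape, field names, and warning text are illustrative assumptions, not the actual `ShardBatchIter` or the PR #1073 fix:

```rust
/// Illustrative sketch only: a shard-backed batch iterator that wraps to
/// shard 0 on exhaustion instead of silently ending (Case A failure mode).
struct ShardBatchIter {
    shards: Vec<Vec<u32>>,   // tokenized shards (assumed non-empty)
    shard_idx: usize,
    offset: usize,
    batch_tokens: usize,     // tokens per emitted batch
    passes_completed: usize, // full passes over the corpus
}

impl Iterator for ShardBatchIter {
    type Item = Vec<u32>;

    fn next(&mut self) -> Option<Self::Item> {
        let mut batch = Vec::with_capacity(self.batch_tokens);
        while batch.len() < self.batch_tokens {
            let shard = &self.shards[self.shard_idx];
            let take = (self.batch_tokens - batch.len()).min(shard.len() - self.offset);
            batch.extend_from_slice(&shard[self.offset..self.offset + take]);
            self.offset += take;
            if self.offset == shard.len() {
                self.offset = 0;
                self.shard_idx += 1;
                if self.shard_idx == self.shards.len() {
                    // The pre-fix iterator ended here, and downstream code
                    // papered over it with a (1.0, 1.0) placeholder loss.
                    // Instead: wrap around and surface the exhaustion loudly.
                    self.shard_idx = 0;
                    self.passes_completed += 1;
                    eprintln!(
                        "warn: corpus exhausted, wrapping to shard 0 (pass {})",
                        self.passes_completed + 1
                    );
                }
            }
        }
        Some(batch)
    }
}
```

The warning here is also the behavior the open `--warn-on-wrap-around` follow-up would presumably make opt-in.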

Reconciliation with §78

| Audit recommendation | §78 resolution |
| --- | --- |
| Cease tuning; ingest to 2B+ tokens | SUPERSEDED — pretrained-init pivot achieved 44.9% loss reduction in 8 min |
| Isolate SHIP-007 `ffn_swigl` | RESOLVED via different path (F32 GEMV layout fix, not FFN) |
| Auto wrap-around safety check | OPEN follow-up |

Methodology lesson #26 (NEW)

Three-class root-cause categorization for ML convergence failures:

  1. Data starvation (Chinchilla-class)
  2. Optimization defects (LR/warmup/early-stop)
  3. Infrastructure masking (silent fallbacks, placeholder losses)

Diagnose which class is binding BEFORE tuning anything.

Literature citations added

Open follow-ups identified

  1. `min_corpus_tokens` gate per Chinchilla (~30 LOC; see the sketch after this list)
  2. `--warn-on-wrap-around` flag (~50 LOC)
  3. Cite arXiv refs in `contracts/training-loop-pretrain-v1.yaml`
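
A minimal sketch of follow-up 1, assuming the ~20-tokens-per-parameter Chinchilla ratio; the function and constant names are hypothetical, not existing apr code:

```rust
/// Hypothetical pre-flight data-sufficiency gate (follow-up 1).
/// The 20 tokens/param ratio follows Hoffmann et al. (arXiv:2203.15556).
const CHINCHILLA_TOKENS_PER_PARAM: u64 = 20;

fn check_min_corpus_tokens(param_count: u64, corpus_tokens: u64) -> Result<(), String> {
    let optimal = param_count * CHINCHILLA_TOKENS_PER_PARAM;
    let pct = 100.0 * corpus_tokens as f64 / optimal as f64;
    if corpus_tokens < optimal {
        return Err(format!(
            "corpus has {corpus_tokens} tokens, {pct:.2}% of the \
             Chinchilla-optimal {optimal}; refusing a data-starved run"
        ));
    }
    Ok(())
}

// The §79 numbers trip the gate: 370M params on 18.1M tokens is 0.24%
// of optimum, so check_min_corpus_tokens(370_000_000, 18_100_000) errors.
```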

Ship-% movement

This is a retrospective amendment — no ship-% change. MODEL-1 stays 100%, MODEL-2 stays 75%.

Spec v3.24.0 → v3.25.0 (depends on §78's v3.24.0 from #1689 landing first).

🤖 Generated with Claude Code

Synthesize docs/specifications/two-model-spec-audit.md into the
ship-two-models-spec.md timeline. The audit (authored externally before
§78 landed) diagnosed three compounding root causes for MODEL-2's
months-long failure to converge:

  1. Data starvation: 370M-param model on 18.1M tokens = 0.24% of
     Chinchilla-optimal. Math doomed to overfit.
  2. False plateau hypothesis: scaling LR budget 20k→80k steps on the
     same 4× corpus stayed at val_loss=9.7507. Diversity is the
     binding constraint, not steps.
  3. Infrastructure masking: silent CPU fallback, corpus-exhaustion
     (1.0, 1.0) placeholder loss, premature early-stop with
     undersized validation set.

§79 runs Five-Whys on:
  - Case A: silent corpus exhaustion (root cause: ShardBatchIter
    lacked wrap-around — PR #1073 fix)
  - Case B: premature early-stop (root cause: HELD_OUT_BATCHES=2
    inherited from smoke-test config — PR #1073 fix)
  - Case C: val_loss=9.75 plateau (root cause: Chinchilla under-
    provisioning + missing data-sufficiency gate — §49 pivot fix)

Reconciliation with audit's three engineering recommendations:
  Rec 1 (Cease tuning; ingest to 2B+ tokens) → SUPERSEDED by §78's
        §49-pivot path: Qwen-0.5B init + 1.24B tokens + 8 min GPU
        produced val_loss=5.36 (44.9% reduction). Pretrained-init
        fine-tune dominates "more data from-scratch."
  Rec 2 (Isolate SHIP-007 ffn_swigl) → RESOLVED via different
        bisection: PR-B (#1649) localized to F32 GEMV PTX, not FFN.
        Fixed in PR-E (#1651) — single-file transposed-matmul fix.
  Rec 3 (Auto wrap-around safety check) → OPEN follow-up.

Methodology lesson #26 NEW: three-class root-cause categorization for
ML convergence failures (data starvation / optimization defects /
infrastructure masking). Diagnose which class is binding before tuning
anything.

Open follow-ups from audit:
  1. Add min_corpus_tokens gate per Chinchilla (~30 LOC)
  2. Add --warn-on-wrap-around flag (~50 LOC)
  3. Cite arXiv:2203.15556 + arXiv:2107.06499 in
     contracts/training-loop-pretrain-v1.yaml

Spec v3.24.0 → v3.25.0. (Depends on §78's v3.24.0 landing first via
PR #1689; otherwise this lands at v3.24.0 and §78 bumps at merge.)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift
Contributor Author

Consolidated into #1702 (or successor) — the original went DIRTY against main due to overlapping header edits with §77/§78. Content is preserved verbatim in the consolidated PR.

@noahgift noahgift closed this May 15, 2026
auto-merge was automatically disabled May 15, 2026 13:31


@noahgift noahgift deleted the spec/79-audit-five-whys branch May 15, 2026 13:31
noahgift added a commit that referenced this pull request May 15, 2026
…0 packaging gaps (#1702)

Triple-amendment to SPEC-SHIP-TWO-001 capturing the §78 → §80 dispatch
arc that revealed a Class 3 packaging-defect wave in apr pretrain output.

Consolidates the content of PRs #1695 (§79), #1697 (§80), #1698 (§81)
which were all DIRTY against main due to overlapping spec-header edits.

§79 — External audit + Five-Whys retrospective on MODEL-2 convergence
  Synthesizes docs/specifications/two-model-spec-audit.md. Identifies
  three compounding root causes for the val_loss=9.75 plateau:
    1. Data starvation (0.24% of Chinchilla-optimal token count)
    2. False plateau hypothesis (LR-budget falsification)
    3. Infrastructure masking bugs (silent CPU fallback, exhaustion
       placeholder, premature early-stop)
  Five-Whys for Case A (silent corpus exhaustion), Case B (early stop),
  Case C (val_loss=9.75 plateau). Reconciles audit Recommendations 1-3
  vs §78's §49-pivot path.

§80 — Prioritized open-follow-up backlog
  Ranks all open SHIP-TWO-001 work by ship-% delta ÷ effort. P0 trio
  (apr qa / bench / export against epoch-004.apr) + P1 Chinchilla gate
  + P1 python validity + P1 HumanEval + P2-A long train = MODEL-2
  theoretical ceiling 92% at ~6-10h RTX 4090 compute.

§81 — P0 dispatch surfaced 3 systemic packaging-defect gaps
  Dispatching §80's P0 trio against §78's epoch-004.apr revealed:
    - P0-A apr qa     → "APR missing embedded tokenizer"
    - P0-B apr bench  → "C-03: APR model missing 'hidden_size' metadata"
    - P0-C apr export → PASSED, but llama-cli refused with
                        "unknown model architecture: 'LlamaForCausalLM'"
                        (GGUF expects lowercase "llama")
  Companion code PRs:
    - #1699 P0-F      → HF→GGUF arch case mapping in apr export
    - #1701 P0-D + P0-E → embed tokenizer + write arch metadata in
                          apr pretrain output
  AC-SHIP2-010 → DISCHARGED (315.5 tok/s on Qwen-0.5B fine-tune;
  3.15× over the 100 tok/s floor).
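
For context on P0-F, a sketch of the kind of case mapping #1699 describes; the function name and any entry beyond the Llama one are illustrative assumptions, not the actual apr export table:

```rust
/// Illustrative sketch of an HF -> GGUF architecture-name mapping (P0-F).
/// llama-cli rejected the HF class name "LlamaForCausalLM"; GGUF expects
/// the lowercase short form "llama" as the architecture.
fn hf_arch_to_gguf(hf_arch: &str) -> Option<&'static str> {
    match hf_arch {
        "LlamaForCausalLM" => Some("llama"),
        "Qwen2ForCausalLM" => Some("qwen2"), // assumed entry, not from the PR
        _ => None, // unknown arch: fail export loudly rather than emit bad GGUF
    }
}
```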

Methodology lessons added:
  #26 NEW: Three-class root-cause taxonomy for ML convergence failures
          (data starvation / optimization defects / infrastructure
          masking). Diagnose which class is binding before tuning.
  #27 NEW: Prioritize by ship-% delta ÷ effort, not alphabetical AC
          order. P0 dispatches are 0.1% the compute cost of P2-A.
  #28 NEW: Class 3 defects come in waves. Training works ≠ checkpoint
          is usable. Each lifecycle stage needs its own surfacing
          dispatch.

Ship-% movement:
  MODEL-1: 100% (unchanged)
  MODEL-2: 75% (unchanged in this PR; +2pp expected on #1701 merge)

Spec v3.24.0 → v3.27.0.

Replaces PRs #1695, #1697, #1698 (all DIRTY against main).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
