
docs(spec): §79 — external audit + Five-Whys retrospective on MODEL-2 convergence#1695

Closed
noahgift wants to merge 1 commit into main from spec/79-audit-five-whys

Conversation

@noahgift
Contributor

Summary

Synthesizes the external audit (`docs/specifications/two-model-spec-audit.md`) into the SPEC-SHIP-TWO-001 timeline as §79. Uses the Five-Whys methodology to explain the three failure cases that kept MODEL-2 stuck at val_loss=9.75 across months of training campaigns.

What §79 captures

Audit verdict — three root causes

| # | Root cause | Mechanism |
| --- | --- | --- |
| 1 | Data starvation | 370M model on 18.1M tokens = 0.24% of Chinchilla optimum |
| 2 | False plateau hypothesis | LR budget 20k→80k → val_loss 9.7513→9.7507 (no improvement) |
| 3 | Infrastructure masking | Silent CPU fallback, exhaustion placeholder, premature early-stop |
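
For scale, the data-starvation arithmetic works out as follows under the roughly 20-tokens-per-parameter compute-optimal ratio from Chinchilla (arXiv:2203.15556):

$$
370\,\mathrm{M\ params} \times 20\ \mathrm{tokens/param} = 7.4\,\mathrm{B\ tokens}, \qquad \frac{18.1\,\mathrm{M}}{7.4\,\mathrm{B}} \approx 0.24\%
$$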

Five-Whys

| Case | What | Root cause |
| --- | --- | --- |
| A | Silent corpus exhaustion | `ShardBatchIter` lacked wrap-around logic |
| B | Premature early-stop | `HELD_OUT_BATCHES=2` inherited from smoke-test |
| C | val_loss=9.75 plateau | Chinchilla under-provisioning + missing data-sufficiency gate |
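
For context on Case A, here is a minimal sketch of what the missing wrap-around could look like. The struct shape, field names, and warning text are illustrative assumptions, not the actual `ShardBatchIter` or the PR #1073 fix:

```rust
/// Illustrative sketch only: a shard-backed batch iterator that wraps to
/// shard 0 on exhaustion instead of silently ending (Case A failure mode).
struct ShardBatchIter {
    shards: Vec<Vec<u32>>,   // tokenized shards (assumed non-empty)
    shard_idx: usize,
    offset: usize,
    batch_tokens: usize,     // tokens per emitted batch
    passes_completed: usize, // full passes over the corpus
}

impl Iterator for ShardBatchIter {
    type Item = Vec<u32>;

    fn next(&mut self) -> Option<Self::Item> {
        let mut batch = Vec::with_capacity(self.batch_tokens);
        while batch.len() < self.batch_tokens {
            let shard = &self.shards[self.shard_idx];
            let take = (self.batch_tokens - batch.len()).min(shard.len() - self.offset);
            batch.extend_from_slice(&shard[self.offset..self.offset + take]);
            self.offset += take;
            if self.offset == shard.len() {
                self.offset = 0;
                self.shard_idx += 1;
                if self.shard_idx == self.shards.len() {
                    // The pre-fix iterator ended here, and downstream code
                    // papered over it with a (1.0, 1.0) placeholder loss.
                    // Instead: wrap around and surface the exhaustion loudly.
                    self.shard_idx = 0;
                    self.passes_completed += 1;
                    eprintln!(
                        "warn: corpus exhausted, wrapping to shard 0 (pass {})",
                        self.passes_completed + 1
                    );
                }
            }
        }
        Some(batch)
    }
}
```

The warning here is also the behavior the open `--warn-on-wrap-around` follow-up would presumably make opt-in.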

Reconciliation with §78

| Audit recommendation | §78 resolution |
| --- | --- |
| Cease tuning; ingest to 2B+ tokens | SUPERSEDED — pretrained-init pivot achieved 44.9% loss reduction in 8 min |
| Isolate SHIP-007 `ffn_swigl` | RESOLVED via different path (F32 GEMV layout fix, not FFN) |
| Auto wrap-around safety check | OPEN follow-up |

Methodology lesson #26 (NEW)

Three-class root-cause categorization for ML convergence failures:

  1. Data starvation (Chinchilla-class)
  2. Optimization defects (LR/warmup/early-stop)
  3. Infrastructure masking (silent fallbacks, placeholder losses)

Diagnose which class is binding BEFORE tuning anything.

Literature citations added

Open follow-ups identified

  1. `min_corpus_tokens` gate per Chinchilla (~30 LOC; see the sketch after this list)
  2. `--warn-on-wrap-around` flag (~50 LOC)
  3. Cite arXiv refs in `contracts/training-loop-pretrain-v1.yaml`
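
A minimal sketch of follow-up 1, assuming the ~20-tokens-per-parameter Chinchilla ratio; the function and constant names are hypothetical, not existing apr code:

```rust
/// Hypothetical pre-flight data-sufficiency gate (follow-up 1).
/// The 20 tokens/param ratio follows Hoffmann et al. (arXiv:2203.15556).
const CHINCHILLA_TOKENS_PER_PARAM: u64 = 20;

fn check_min_corpus_tokens(param_count: u64, corpus_tokens: u64) -> Result<(), String> {
    let optimal = param_count * CHINCHILLA_TOKENS_PER_PARAM;
    let pct = 100.0 * corpus_tokens as f64 / optimal as f64;
    if corpus_tokens < optimal {
        return Err(format!(
            "corpus has {corpus_tokens} tokens, {pct:.2}% of the \
             Chinchilla-optimal {optimal}; refusing a data-starved run"
        ));
    }
    Ok(())
}

// The §79 numbers trip the gate: 370M params on 18.1M tokens is 0.24%
// of optimum, so check_min_corpus_tokens(370_000_000, 18_100_000) errors.
```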

Ship-% movement

This is a retrospective amendment — no ship-% change. MODEL-1 stays 100%, MODEL-2 stays 75%.

Spec v3.24.0 → v3.25.0 (depends on §78's v3.24.0 from #1689 landing first).

🤖 Generated with Claude Code

Synthesize docs/specifications/two-model-spec-audit.md into the
ship-two-models-spec.md timeline. The audit (authored externally before
§78 landed) diagnosed three compounding root causes for MODEL-2's
months-long failure to converge:

  1. Data starvation: 370M-param model on 18.1M tokens = 0.24% of
     Chinchilla-optimal. Math doomed to overfit.
  2. False plateau hypothesis: scaling LR budget 20k→80k steps on the
     same 4× corpus stayed at val_loss=9.7507. Diversity is the
     binding constraint, not steps.
  3. Infrastructure masking: silent CPU fallback, corpus-exhaustion
     (1.0, 1.0) placeholder loss, premature early-stop with
     undersized validation set.

§79 runs Five-Whys on:
  - Case A: silent corpus exhaustion (root cause: ShardBatchIter
    lacked wrap-around — PR #1073 fix)
  - Case B: premature early-stop (root cause: HELD_OUT_BATCHES=2
    inherited from smoke-test config — PR #1073 fix)
  - Case C: val_loss=9.75 plateau (root cause: Chinchilla under-
    provisioning + missing data-sufficiency gate — §49 pivot fix)

Reconciliation with audit's three engineering recommendations:
  Rec 1 (Cease tuning; ingest to 2B+ tokens) → SUPERSEDED by §78's
        §49-pivot path: Qwen-0.5B init + 1.24B tokens + 8 min GPU
        produced val_loss=5.36 (44.9% reduction). Pretrained-init
        fine-tune dominates "more data from-scratch."
  Rec 2 (Isolate SHIP-007 ffn_swigl) → RESOLVED via different
        bisection: PR-B (#1649) localized to F32 GEMV PTX, not FFN.
        Fixed in PR-E (#1651) — single-file transposed-matmul fix.
  Rec 3 (Auto wrap-around safety check) → OPEN follow-up.

Methodology lesson #26 NEW: three-class root-cause categorization for
ML convergence failures (data starvation / optimization defects /
infrastructure masking). Diagnose which class is binding before tuning
anything.

Open follow-ups from audit:
  1. Add min_corpus_tokens gate per Chinchilla (~30 LOC)
  2. Add --warn-on-wrap-around flag (~50 LOC)
  3. Cite arXiv:2203.15556 + arXiv:2107.06499 in
     contracts/training-loop-pretrain-v1.yaml

Spec v3.24.0 → v3.25.0. (Depends on §78's v3.24.0 landing first via
PR #1689; otherwise this lands at v3.24.0 and §78 bumps at merge.)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift
Contributor Author

Consolidated into #1702 (or successor) — the original went DIRTY against main due to overlapping header edits with §77/§78. Content is preserved verbatim in the consolidated PR.

@noahgift noahgift closed this May 15, 2026
auto-merge was automatically disabled May 15, 2026 13:31


@noahgift noahgift deleted the spec/79-audit-five-whys branch May 15, 2026 13:31
noahgift added a commit that referenced this pull request May 15, 2026
…0 packaging gaps (#1702)

Triple-amendment to SPEC-SHIP-TWO-001 capturing the §78 → §80 dispatch
arc that revealed a Class 3 packaging-defect wave in apr pretrain output.

Consolidates the content of PRs #1695 (§79), #1697 (§80), #1698 (§81)
which were all DIRTY against main due to overlapping spec-header edits.

§79 — External audit + Five-Whys retrospective on MODEL-2 convergence
  Synthesizes docs/specifications/two-model-spec-audit.md. Identifies
  three compounding root causes for the val_loss=9.75 plateau:
    1. Data starvation (0.24% of Chinchilla-optimal token count)
    2. False plateau hypothesis (LR-budget falsification)
    3. Infrastructure masking bugs (silent CPU fallback, exhaustion
       placeholder, premature early-stop)
  Five-Whys for Case A (silent corpus exhaustion), Case B (early stop),
  Case C (val_loss=9.75 plateau). Reconciles audit Recommendations 1-3
  vs §78's §49-pivot path.

§80 — Prioritized open-follow-up backlog
  Ranks all open SHIP-TWO-001 work by ship-% delta ÷ effort. P0 trio
  (apr qa / bench / export against epoch-004.apr) + P1 Chinchilla gate
  + P1 python validity + P1 HumanEval + P2-A long train = MODEL-2
  theoretical ceiling 92% at ~6-10h RTX 4090 compute.

§81 — P0 dispatch surfaced 3 systemic packaging-defect gaps
  Dispatching §80's P0 trio against §78's epoch-004.apr revealed:
    - P0-A apr qa     → "APR missing embedded tokenizer"
    - P0-B apr bench  → "C-03: APR model missing 'hidden_size' metadata"
    - P0-C apr export → PASSED, but llama-cli refused with
                        "unknown model architecture: 'LlamaForCausalLM'"
                        (GGUF expects lowercase "llama")
  Companion code PRs:
    - #1699 P0-F      → HF→GGUF arch case mapping in apr export
    - #1701 P0-D + P0-E → embed tokenizer + write arch metadata in
                          apr pretrain output
  AC-SHIP2-010 → DISCHARGED (315.5 tok/s on Qwen-0.5B fine-tune;
  3.15× over the 100 tok/s floor).
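
For context on P0-F, a sketch of the kind of case mapping #1699 describes; the function name and any entry beyond the Llama one are illustrative assumptions, not the actual apr export table:

```rust
/// Illustrative sketch of an HF -> GGUF architecture-name mapping (P0-F).
/// llama-cli rejected the HF class name "LlamaForCausalLM"; GGUF expects
/// the lowercase short form "llama" as the architecture.
fn hf_arch_to_gguf(hf_arch: &str) -> Option<&'static str> {
    match hf_arch {
        "LlamaForCausalLM" => Some("llama"),
        "Qwen2ForCausalLM" => Some("qwen2"), // assumed entry, not from the PR
        _ => None, // unknown arch: fail export loudly rather than emit bad GGUF
    }
}
```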

Methodology lessons added:
  #26 NEW: Three-class root-cause taxonomy for ML convergence failures
          (data starvation / optimization defects / infrastructure
          masking). Diagnose which class is binding before tuning.
  #27 NEW: Prioritize by ship-% delta ÷ effort, not alphabetical AC
          order. P0 dispatches are 0.1% the compute cost of P2-A.
  #28 NEW: Class 3 defects come in waves. Training works ≠ checkpoint
          is usable. Each lifecycle stage needs its own surfacing
          dispatch.

Ship-% movement:
  MODEL-1: 100% (unchanged)
  MODEL-2: 75% (unchanged in this PR; +2pp expected on #1701 merge)

Spec v3.24.0 → v3.27.0.

Replaces PRs #1695, #1697, #1698 (all DIRTY against main).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
