
contract(apr-pretrain-arch-polymorphic-v1): v1.0.0 PROPOSED — §50.4 step 5a#1473

Merged
noahgift merged 6 commits into main from contract/apr-pretrain-arch-polymorphic-v1 on May 4, 2026

Conversation

noahgift (Contributor) commented May 4, 2026

Summary

Authors contracts/apr-pretrain-arch-polymorphic-v1.yaml v1.0.0 PROPOSED — the contract layer driving §50.4 steps 5b-5f (the architecture-polymorphic pretrain trainer that unblocks fine-tuning from a Qwen2.5-class init checkpoint).

What this contract pins

Per §50 (PR #1472), pretrain_real.rs:38-46 HARDCODES every architectural constant from Llama370MConfig. Loading Qwen2.5-Coder-0.5B-Instruct weights into this fixed shape is a category error. This contract pins 4 invariants:

  1. arch_extraction_signature — init=None preserves §24/§25 baseline byte-for-byte; init=Some extracts all 10 fields from APR header, no silent defaults
  2. qwen2_0_5b_constructor — TransformerConfig::qwen2_0_5b() returns config matching HF config.json byte-for-byte
  3. gqa_7_to_1_invariants — attention kernel handles GQA-7:1 without per-ratio specialization
  4. qwen_tokenizer_vocab_compatibility — preflight gates by EXTRACTED vocab when --init present

Falsifiers (8 total)

| ID  | What it falsifies |
|-----|-------------------|
| 001 | qwen2_0_5b() constructor exists + matches HF config |
| 002 | init=None preserves Llama370M baseline (regression-free) |
| 003 | init=Some extracts all 10 fields from APR header |
| 004 | GQA-7:1 forward-pass numerical parity vs GQA-1:1 reference |
| 005 | Qwen tokenizer (151_936) passes preflight WITH --init |
| 006 | Qwen tokenizer FAILS preflight WITHOUT --init |
| 007 | Wrong-arch APR (encoder model) fails fast |
| 008 | pv validate exits 0 |

Validation

$ pv validate contracts/apr-pretrain-arch-polymorphic-v1.yaml
0 error(s), 0 warning(s)
Contract is valid.

Composition with sibling contracts

This contract DOES NOT replace apr-pretrain-from-init-v1. The two compose:

FALSIFY-APR-PRETRAIN-INIT-005 (arch mismatch) becomes DISCHARGED when this contract's FALSIFY-007 lands.

Roadmap discharge

| §50.4 step | What it adds | This contract's role |
|------------|--------------|----------------------|
| 5a (this)  | The contract itself | DRIVES 5b-5f |
| 5b | TransformerConfig::qwen2_0_5b() | DISCHARGES FALSIFY-001 |
| 5c | build_transformer_config(init: Option<&InitArch>) | DISCHARGES FALSIFY-002, -003 |
| 5d | Polymorphic preflight | DISCHARGES FALSIFY-005, -006 |
| 5e | GQA-7:1 verification | DISCHARGES FALSIFY-004 |
| 5f | Weight load wired to extracted arch | DISCHARGES FALSIFY-007 |

Plain ship-% update

  • MODEL-1: unchanged at 91% (SHIP-007 cascade infrastructure track)
  • MODEL-2: unchanged at 57% — first ship-% movement gated on §50.4 step 5g (LIVE 500-step fine-tune producing val_loss < 9.38)

Five Whys

  1. Why a contract before the impl? §50.4 step 5a is THE first step of the re-scoped roadmap. The contract pins what 5b-5f must satisfy — without it, the impl PRs would each pick their own arbitrary semantics for "extract arch from APR".
  2. Why 8 falsifiers, not 4? Each of the 4 equations decomposes into 2 falsifiable claims (existence + correctness, positive + negative case, etc.). 8 covers every silent-failure mode the §24 retrospective showed is possible.
  3. Why also pin GQA-7:1 here, not just in gqa-kernel-v1? Existing gqa-kernel-v1 covers GQA generally; what's NEW is that the Llama370M codepath empirically only exercised 4:1 (kv=4, q=16). Qwen2.5 exercises 7:1 (kv=2, q=14). FALSIFY-004 makes this transition contract-bound.
  4. Why not just delete Llama370MConfig outright? Per §50.3 Option C analysis: that deletes the §24/§25 falsification evidence. The polymorphic builder preserves both paths.
  5. Why is FALSIFY-007 load-bearing? Without it, pointing --init at an encoder model (e.g., CodeBERT) would silently load weights into a decoder-shaped trainer, producing nonsense gradients. The error must name the architecture-family mismatch up front.

Test plan

  • pv validate contracts/apr-pretrain-arch-polymorphic-v1.yaml — 0 errors
  • Pre-commit quality gates pass
  • CI (gate, test, lint, coverage, security)

Refs

🤖 Generated with Claude Code

…tep 5a

Authors `contracts/apr-pretrain-arch-polymorphic-v1.yaml` v1.0.0
PROPOSED — the contract layer driving §50.4 steps 5b-5f (the
architecture-polymorphic pretrain trainer that unblocks fine-tuning
from a Qwen2.5-class init checkpoint).

Per §50 (PR #1472), the existing pretrain trainer
`pretrain_real.rs:38-46` HARDCODES every architectural constant from
`Llama370MConfig`. Loading Qwen2.5-Coder-0.5B-Instruct weights into
this fixed shape is a category error (vocab 50K vs 152K, hidden 1024
vs 896, GQA 4:1 vs 7:1, etc.). This contract pins the polymorphic
builder + 4 invariants:

  1. arch_extraction_signature — init=None preserves §24/§25 baseline;
     init=Some extracts all 10 fields from APR header, no silent defaults
  2. qwen2_0_5b_constructor — TransformerConfig::qwen2_0_5b() returns
     a config matching HF config.json byte-for-byte (vocab=151_936,
     hidden=896, GQA-7:1, rope_theta=1e6, use_bias=true,
     tie_word_embeddings=true)
  3. gqa_7_to_1_invariants — attention kernel handles GQA-7:1 without
     per-ratio specialization; cosine ≥ 0.9999 vs GQA-1:1 reference
  4. qwen_tokenizer_vocab_compatibility — preflight gates by EXTRACTED
     vocab (151_936 for Qwen) when --init present, falls back to
     Llama370MConfig::VOCAB_SIZE (50_257) when absent
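The "no silent defaults" half of invariant 1 can be sketched as follows. This is purely illustrative: `AprHeader` and its field names are placeholders (the real APR header format and the full 10-field set are not reproduced here); the point is that a missing field is a hard error, never a substituted default.

```rust
// Hypothetical sketch of the "no silent defaults" extraction invariant.
// `AprHeader` / `ArchFields` are illustrative stand-ins, not the real APR types.
#[derive(Debug, Clone, PartialEq)]
struct ArchFields {
    vocab_size: usize,
    hidden_size: usize,
    num_attention_heads: usize,
    num_kv_heads: usize,
}

struct AprHeader {
    vocab_size: Option<usize>,
    hidden_size: Option<usize>,
    num_attention_heads: Option<usize>,
    num_kv_heads: Option<usize>,
}

// Every field must be present in the header; absence is a named, fail-fast error.
fn extract_arch(h: &AprHeader) -> Result<ArchFields, String> {
    let req = |f: Option<usize>, name: &str| {
        f.ok_or_else(|| format!("APR header missing required arch field: {name}"))
    };
    Ok(ArchFields {
        vocab_size: req(h.vocab_size, "vocab_size")?,
        hidden_size: req(h.hidden_size, "hidden_size")?,
        num_attention_heads: req(h.num_attention_heads, "num_attention_heads")?,
        num_kv_heads: req(h.num_kv_heads, "num_kv_heads")?,
    })
}

fn main() {
    let full = AprHeader {
        vocab_size: Some(151_936),
        hidden_size: Some(896),
        num_attention_heads: Some(14),
        num_kv_heads: Some(2),
    };
    assert!(extract_arch(&full).is_ok());

    // A header missing vocab_size must fail with an error naming the field.
    let partial = AprHeader { vocab_size: None, ..full };
    let err = extract_arch(&partial).unwrap_err();
    assert!(err.contains("vocab_size"));
}
```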

8 falsifiers (FALSIFY-APR-PRETRAIN-ARCH-001..008), 6 proof obligations,
2 kani harnesses. `pv validate` exits 0 with 0 errors / 0 warnings.

This contract DOES NOT replace apr-pretrain-from-init-v1; the two
compose. apr-pretrain-from-init-v1 pins the --init flag's CLI surface
+ magic-byte validation; this contract pins the architecture
extraction algorithm that --init's weight load depends on.
FALSIFY-APR-PRETRAIN-INIT-005 (arch mismatch) becomes DISCHARGED when
this contract's FALSIFY-007 lands.

Five Whys:

  1. Why a contract before the impl? §50.4 step 5a is THE first step
     of the re-scoped roadmap. The contract pins what 5b-5f must
     satisfy — without it, the impl PRs would each pick their own
     arbitrary semantics for "extract arch from APR". Contract-first
     prevents 5-PR scope drift.

  2. Why 8 falsifiers, not 4? Each of the 4 equations decomposes into
     2 falsifiable claims: (existence + correctness) for the
     constructor, (init=None + init=Some) for the builder, (forward-
     pass + reference-comparison) for GQA-7:1, (positive + negative
     case) for the tokenizer surface. 8 covers every silent-failure
     mode the §24 retrospective showed is possible.

  3. Why also pin GQA-7:1 here, not just in gqa-kernel-v1? The
     existing gqa-kernel-v1 covers GQA generally; what's NEW is that
     the Llama370M codepath empirically only exercised 4:1 (kv=4,
     q=16). Qwen2.5 exercises 7:1 (kv=2, q=14). FALSIFY-004 makes
     this transition contract-bound rather than tribal knowledge.

  4. Why not just delete Llama370MConfig outright? Per §50.3 Option C
     analysis: that deletes the §24/§25 falsification evidence (we
     KEEP knowing from-scratch fails at val_loss=9.75 on the existing
     corpus). The polymorphic builder preserves both paths — Llama370M
     for the from-scratch baseline, Qwen2.5 (or any future init) for
     the fine-tune path.

  5. Why is FALSIFY-007 (encoder/decoder mismatch) load-bearing?
     Without it, an operator who points --init at e.g. CodeBERT (an
     encoder) would silently load weights into a decoder-shaped
     trainer, producing nonsense gradients. The error message must
     name the architecture-family mismatch, not crash later with
     cryptic shape errors during the first forward pass.

Plain ship-% update:
  - MODEL-1: unchanged at 91% (SHIP-007 cascade infrastructure track)
  - MODEL-2: unchanged at 57% — first ship-% movement gated on §50.4
    step 5g (LIVE 500-step fine-tune producing val_loss < 9.38)

Refs:
  - SPEC-SHIP-TWO-001 §50 — MODEL-2 architecture-coupling finding (#1472)
  - SPEC-SHIP-TWO-001 §50.4 — re-scoped roadmap (steps 5a-5h)
  - contracts/apr-pretrain-from-init-v1.yaml v1.1.0 PARTIAL (#1471, sibling)
  - contracts/training-loop-pretrain-v1.yaml v1.5.0 ACTIVE (parent)
  - contracts/architecture-requirements-v1.yaml (sibling)
  - contracts/gqa-kernel-v1.yaml (sibling — GQA ratio invariants)
  - feedback_no_guessing.md
  - feedback_fix_root_cause_never_route_around.md

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 4, 2026
…#1478)

Adds `falsify_apr_pretrain_arch_004_gqa_7_1_forward_pass_smoke` that
constructs a tiny GQA-7:1 transformer (kv_heads=2, query_heads=14,
hidden=112=14*8, head_dim=8 — mimicking Qwen2.5-Coder-0.5B's GQA ratio)
and verifies the forward pass:
  - runs without panic
  - returns the correct shape (seq_len * vocab_size)
  - produces all-finite logits (no NaN, no Inf)

Discharges from `apr-pretrain-arch-polymorphic-v1` (PR #1473):
  - FALSIFY-APR-PRETRAIN-ARCH-004 at SMOKE level: kernel handles GQA-7:1
    without per-ratio specialization. Full numerical-parity vs GQA-1:1
    reference (cosine ≥ 0.9999) is a FUNCTIONAL-level discharge, not
    PARTIAL_ALGORITHM_LEVEL.

Why this matters: the existing aprender-train Llama370M codepath only
empirically exercised GQA-4:1 (kv_heads=4, query_heads=16). Qwen2.5-0.5B
(the §49 fine-tune init source) uses GQA-7:1. Without this test, a
future refactor of the attention kernel could silently break the 7:1
case while keeping 4:1 working — exactly the §24 silent-failure class.

The test runs in <1ms (tiny shape: hidden=112, vocab=256, layers=1).
Drift-prevention: also asserts the GQA ratio at construction time, so a
typo in num_attention_heads or num_kv_heads is caught before the forward
pass even runs.

Test results (cargo test -p aprender-train --lib transformer::model::tests::falsify_apr_pretrain_arch_004):
    1 passed; 0 failed; 0 ignored
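The shape arithmetic behind the tiny test model can be sketched as below. This is not the real test (which runs a full forward pass and checks finiteness); it only illustrates the construction-time ratio assertion described above, using the commit's numbers.

```rust
// Illustrative GQA shape invariants; `GqaShape` is a stand-in, not a real type
// from aprender-train.
struct GqaShape {
    query_heads: usize,
    kv_heads: usize,
    head_dim: usize,
}

impl GqaShape {
    // Construction-time ratio check: a typo in either head count is caught
    // before any forward pass runs.
    fn ratio(&self) -> usize {
        assert!(self.query_heads % self.kv_heads == 0, "GQA requires q % kv == 0");
        self.query_heads / self.kv_heads
    }
    fn hidden(&self) -> usize {
        self.query_heads * self.head_dim
    }
}

fn main() {
    // Tiny 7:1 shape mimicking Qwen2.5-Coder-0.5B's ratio: 14 query heads / 2 kv heads.
    let qwen_like = GqaShape { query_heads: 14, kv_heads: 2, head_dim: 8 };
    assert_eq!(qwen_like.ratio(), 7);    // the load-bearing 7:1 ratio
    assert_eq!(qwen_like.hidden(), 112); // hidden = 14 * 8, as in the commit

    // Llama370M-like 4:1 shape the codepath already exercised empirically.
    let llama_like = GqaShape { query_heads: 16, kv_heads: 4, head_dim: 8 };
    assert_eq!(llama_like.ratio(), 4);
}
```

Note the 14/2 choice forces two query groups of seven heads each, which is why a smaller 7/1 model would not exercise the same dispatch (per Five Whys #2 below).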

Five Whys:

  1. Why a smoke test, not a numerical-parity test? PARTIAL_ALGORITHM_LEVEL
     requires only "compile + run + finite". FUNCTIONAL would require
     cosine vs reference. Smoke is the right scope for §50.4 step 5e
     — full parity is a follow-up if FALSIFY-006 (init_loss < 6.0)
     ever fails on the LIVE 500-step run.

  2. Why num_attention_heads=14 (Qwen2.5-0.5B exact) and not e.g. 7
     (smaller test model)? The Qwen2.5-0.5B-canonical 14/2=7 ratio is
     the load-bearing GQA shape. A 7/1 ratio would also be 7:1 but
     wouldn't exercise the multi-query-head-per-kv-head dispatch on
     more than one query group. 14/2 forces 2 query groups, each
     with 7 heads — the actual production shape.

  3. Why use_bias=true and tie_word_embeddings=true? Mirror the Qwen2
     scaling-law defaults verified by PR #1474 (the `qwen2_0_5b()` HF
     config check). If the test used the Llama defaults (use_bias=false,
     tie=false), it wouldn't catch a regression in the bias-add or
     embedding-tie code paths under the Qwen variant.

  4. Why include the all-finite check, not just shape? §24's
     retrospective showed silent NaN propagation through GQA can
     produce loss=NaN that the divergence guard catches LATE (multiple
     steps in). The smoke test catches it at the first forward pass,
     before any optimizer state corrupts.

  5. Why is this a SEPARATE test, not an extension of
     `test_transformer_tiny_forward`? The existing tiny() config uses
     defaults that may include GQA=1:1 (no GQA at all). A separate
     test makes the GQA-7:1 assertion auditable — `cargo test
     gqa_7_1` finds it directly, and contract drift between this
     test and FALSIFY-004 is detectable via grep.

Plain ship-% update:
  - MODEL-1: unchanged at 91% (SHIP-007 cascade infrastructure track)
  - MODEL-2: unchanged at 57% — first ship-% movement gated on §50.4
    step 5g (LIVE 500-step fine-tune producing val_loss < 9.38)

Refs:
  - SPEC-SHIP-TWO-001 §50 — MODEL-2 architecture-coupling (PR #1472, MERGED)
  - PR #1473 — apr-pretrain-arch-polymorphic-v1 contract (in flight)
  - PR #1474 — qwen2_0_5b tie_word_embeddings fix (in flight)
  - PR #1475 — build_transformer_config polymorphic dispatch (in flight)
  - PR #1476 — preflight_tokenizer_vocab_matches_target (in flight)
  - feedback_no_guessing.md

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 4, 2026
… §50.4 step 5d (#1476)

Refactors `preflight_tokenizer_vocab_matches_model(tokenizer_dir)` →
`preflight_tokenizer_vocab_matches_target(tokenizer_dir, target_vocab_size)`
so the GATE-ARCH-370M-011 preflight no longer hardcodes the Llama370M
baseline. When --init wires up (§50.4 step 5f), the caller passes the
EXTRACTED arch's vocab_size (e.g., 151_936 for Qwen2.5-0.5B); for now
(init=None), the only existing caller passes Llama370MConfig::VOCAB_SIZE
explicitly, preserving regression-free behavior on the §24/§25 from-scratch
path.

Discharges from `apr-pretrain-arch-polymorphic-v1` (PR #1473):
  - FALSIFY-APR-PRETRAIN-ARCH-005 — Qwen tokenizer (151_936) PASSES
                                    preflight when target is Qwen-shaped
  - FALSIFY-APR-PRETRAIN-ARCH-006 — Qwen tokenizer (151_936) FAILS
                                    preflight when target is Llama-shaped
                                    (the silent-pass class)

What this PR adds:

  1. Renamed function `preflight_tokenizer_vocab_matches_model` →
     `preflight_tokenizer_vocab_matches_target` with new
     `target_vocab_size: usize` parameter
  2. Updated 4 callers (1 production at line 361 + 3 existing tests) to
     pass `Llama370MConfig::VOCAB_SIZE` explicitly — same behavior, now
     visible at the call site
  3. 2 new unit tests in `commands::pretrain::tests`:
       - preflight_qwen_vocab_passes_with_qwen_target  (FALSIFY-005)
       - preflight_qwen_vocab_fails_with_llama_target  (FALSIFY-006 with
         error-message assertion that names BOTH vocab sizes)
  4. Updated docstring noting the §50 polymorphism + cross-references to
     contract `apr-pretrain-arch-polymorphic-v1`

Test results (cargo test -p apr-cli --lib commands::pretrain::tests::preflight):
    5 passed; 0 failed; 0 ignored
      - preflight_accepts_matching_vocab          (regression-free, unchanged behavior)
      - preflight_rejects_tokenizer_vocab_mismatch (regression-free)
      - preflight_rejects_missing_vocab_json       (regression-free)
      - preflight_qwen_vocab_passes_with_qwen_target (NEW — FALSIFY-005)
      - preflight_qwen_vocab_fails_with_llama_target (NEW — FALSIFY-006)
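The dispatch the two new tests pin can be sketched as follows. Assumption flagged: the real function reads vocab.json from `tokenizer_dir`; this sketch passes the tokenizer vocab size directly so it stays self-contained, and only shows the comparison and the dual-size error message.

```rust
// Sketch of the polymorphic preflight gate (simplified signature; the real
// function takes a tokenizer_dir path and reads vocab.json from it).
const LLAMA370M_VOCAB: usize = 50_257;
const QWEN25_05B_VOCAB: usize = 151_936;

fn preflight_tokenizer_vocab_matches_target(
    tokenizer_vocab: usize,
    target_vocab_size: usize,
) -> Result<(), String> {
    if tokenizer_vocab == target_vocab_size {
        Ok(())
    } else {
        // Per the FALSIFY-006 assertion, the error names BOTH vocab sizes.
        Err(format!(
            "tokenizer vocab mismatch: tokenizer has {tokenizer_vocab}, \
             pretrain target expects {target_vocab_size}"
        ))
    }
}

fn main() {
    // FALSIFY-005 (positive): Qwen tokenizer + Qwen-shaped target passes.
    assert!(
        preflight_tokenizer_vocab_matches_target(QWEN25_05B_VOCAB, QWEN25_05B_VOCAB).is_ok()
    );
    // FALSIFY-006 (negative): Qwen tokenizer + Llama-shaped target fails,
    // and the message contains both sizes.
    let err =
        preflight_tokenizer_vocab_matches_target(QWEN25_05B_VOCAB, LLAMA370M_VOCAB).unwrap_err();
    assert!(err.contains("151936") && err.contains("50257"));
}
```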

Five Whys:

  1. Why polymorphic preflight NOW rather than at step 5f? Each step
     gets its own falsifier discharge. Step 5d's invariant — "preflight
     gates by EXTRACTED vocab when --init is set, by Llama370M vocab
     when --init is absent" — is independently testable WITHOUT actually
     reading an APR file. Authoring the test now pins the algorithm
     before the I/O integration arrives.

  2. Why rename `_matches_model` → `_matches_target`? Old name implied
     "matches the model" with a fixed/canonical model. New name reflects
     the polymorphic dispatch where the target depends on call-site context.
     The rename is a one-time cost; staying with the old name would
     ossify the misleading abstraction.

  3. Why pass target as parameter rather than extract from PretrainConfig?
     Decoupling: PretrainConfig already exists and is wired through many
     callers. Adding a new field to PretrainConfig would create a parallel
     drift surface (every constructor of PretrainConfig must remember to
     set it correctly). A function parameter forces every call site to
     decide explicitly, which is exactly the contract's "no silent
     defaults" invariant.

  4. Why 2 new tests not 1? The two falsifiers (-005 and -006) are
     mutually exclusive proofs of the polymorphism:
       - 005 (positive): Qwen+Qwen target = pass
       - 006 (negative): Qwen+Llama target = fail
     Without the negative case, a regression that always returns Ok
     would silently pass the positive case. The pair pins the dispatch.

  5. Why does FALSIFY-006 assert BOTH vocab sizes appear in the error
     message? Operator-experience: when a fine-tune fails with "tokenizer
     vocab mismatch", the operator needs to see WHICH tokenizer (151_936)
     and WHICH target (50_257) — not just an abstract "they don't match"
     error. The dual-name requirement prevents lossy error messages
     during the §49 strategy pivot.

Plain ship-% update:
  - MODEL-1: unchanged at 91% (SHIP-007 cascade infrastructure track)
  - MODEL-2: unchanged at 57% — first ship-% movement gated on §50.4
    step 5g (LIVE 500-step fine-tune producing val_loss < 9.38)

Refs:
  - SPEC-SHIP-TWO-001 §50 — MODEL-2 architecture-coupling (#1472, in flight)
  - PR #1473 — apr-pretrain-arch-polymorphic-v1 contract (in flight)
  - PR #1474 — qwen2_0_5b tie_word_embeddings fix (in flight)
  - PR #1475 — build_transformer_config polymorphic dispatch (in flight)
  - feedback_no_guessing.md

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 4, 2026
… (7/8 falsifiers bound) (#1480)

Same-day continuation cycle landed 8 PRs across the §50.4 architecture-
polymorphic infrastructure track. §51 records the cascade-complete
state and pinpoints the remaining MODEL-2 ship-% gate (step 5g LIVE).

Falsifier-discharge scoreboard for `apr-pretrain-arch-polymorphic-v1`:

  | ID | What it pins                          | PR    | Status |
  |----|---------------------------------------|-------|--------|
  | 001 | qwen2_0_5b matches HF + tie fix      | #1474 | PARTIAL |
  | 002 | init=None preserves Llama370M        | #1475 | PARTIAL |
  | 003 | init=Some pass-through               | #1475 | PARTIAL |
  | 004 | GQA-7:1 forward smoke                | #1478 | MERGED  |
  | 005 | Qwen tokenizer + Qwen target = pass  | #1476 | MERGED  |
  | 006 | Qwen tokenizer + Llama target = fail | #1476 | MERGED  |
  | 007 | encoder/decoder family mismatch      | #1479 | PARTIAL |
  | 008 | pv validate                          | #1473 | PARTIAL |

7 of 8 falsifiers PARTIAL_ALGORITHM_LEVEL or MERGED.

Remaining work:
  - 5f.2 — wire APR file open + tensor materialization (~80 LOC)
           DELIBERATELY DEFERRED this cycle; doing 5f.2 now means
           rebasing onto 4 in-flight PRs as they land
  - 5g  — LIVE 500-step smoke fine-tune (operator dispatch)
          THE LOAD-BEARING TEST that moves MODEL-2 ship-%
  - 5h  — stamp + publish

Per §47-§48 lesson: "infrastructure shipped ≠ ship-% movement."
Cascade-complete state means the polymorphic foundation is in place;
ship-% movement still requires the LIVE empirical check.

Five Whys:
  1. Why a snapshot now? Multiple PRs in cascade auto-merge create
     cognitive load. A spec snapshot captures both the achievement
     (7 falsifiers bound) and the remaining gate (step 5g LIVE).
     Without it, future operators waste cycles re-deriving the state.
  2. Why focus on falsifier scoreboard rather than total LOC? Falsifier
     discharge is the actual contract obligation. 7/8 invariants pinned
     means CI now catches regressions in the polymorphic-init path.
  3. Why mention 5f.2 explicitly as deliberately deferred? Naming the
     deferral makes it not a punt. Step 5f.2 has a clear "when": after
     the 4 in-flight PRs cascade-merge, then 5f.2 lands clean.
  4. Why call out infrastructure ≠ ship-%? The §47-§48 cascade taught
     the same lesson — "11 SHIP-007 cascade PRs landed but no ship-%
     movement." Operator-facing ship-% is the LIVE check.
  5. Why is FALSIFY-006 LIVE the load-bearing claim? init_loss(step=0)
     ≤ 6.0 vs from_scratch_loss(step=0) ≥ 9.5 proves end-to-end
     correctness in one number. No other falsifier can substitute.

Plain ship-% update:
  - MODEL-1: unchanged at 91% (SHIP-007 cascade infrastructure track)
  - MODEL-2: unchanged at 57% — first ship-% movement gated on §50.4
    step 5g (LIVE 500-step fine-tune producing val_loss < 9.38)

Spec amendment cadence: §41 → §42 → §43 → §44 → §45 → §46 → §47 →
§48 → §49 → §50 → §51. Eleven amendments since 2026-05-03. Same-day
spec hygiene rather than letting the cascade-complete state remain
implicit.

Refs:
  - SPEC-SHIP-TWO-001 §50 — architecture-coupling finding (PR #1472, MERGED)
  - PR #1473 — apr-pretrain-arch-polymorphic-v1 contract (in flight)
  - PR #1474 — qwen2_0_5b tie_word_embeddings fix (in flight)
  - PR #1475 — build_transformer_config polymorphic dispatch (in flight)
  - PR #1476 — preflight_tokenizer_vocab_matches_target (MERGED)
  - PR #1478 — GQA-7:1 forward-pass smoke test (MERGED)
  - PR #1479 — validate_pretrain_init_arch_compatible (in flight)
  - feedback_no_guessing.md

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 4, 2026
… 5b + DEFECT FIX (#1474)

§50.4 step 5b authored a contract assuming `qwen2_0_5b()` did not
exist. Live source inspection during impl revealed the constructor
ALREADY EXISTS at `transformer/config.rs:156`. Reading the HF config
byte-for-byte (per `feedback_no_guessing.md`) revealed a real defect:

  HF config (Qwen2.5-Coder-0.5B-Instruct):  tie_word_embeddings: true
  Existing code (qwen2_0_5b):                tie_word_embeddings: false

Fix: 1 LOC change `false → true`. Per Qwen scaling-law convention
verified against the HF cache:

  - Qwen2.5-Coder-0.5B: tie=true   (HF cache 2026-05-04 ✓)
  - Qwen2.5-Coder-1.5B: tie=true   (HF cache 2026-05-04 ✓; inherits
                                    via `..Self::qwen2_0_5b()` spread)
  - Qwen2.5-Coder-7B:   tie=false  (HF cache 2026-05-04 ✓; explicit
                                    in qwen2_7b())

Why the defect matters: tied vs untied embeddings is a load-bearing
architectural property. With tie=false (current bug), if an operator
fine-tunes from a Qwen2.5-0.5B init checkpoint, the lm_head will be
allocated as a separate tensor that doesn't get loaded (because the
APR file only contains the embed_tokens tensor — they share weights).
The result: lm_head random-initialized and untrained, producing
silent gibberish at val time. This is exactly the §49 / §50 failure
class the contract was authored to prevent.

What this PR adds:

  1. Fix `tie_word_embeddings: false → true` in `qwen2_0_5b()` at
     `transformer/config.rs:156-174`
  2. Add docstring noting the empirical verification + HF cache path
     + Qwen scaling-law quirk
  3. Add 3 new unit tests in `transformer::config::tests`:
       - `qwen2_0_5b_matches_hf_config_2026_05_04` (FALSIFY-001 byte-
         identity verification — 11 fields)
       - `qwen2_1_5b_inherits_tie_word_embeddings_from_0_5b` (drift-
         prevention; catches future spread-split refactors)
       - `qwen2_7b_does_not_tie_embeddings` (drift-prevention; pins
         the 7B Qwen scaling-law quirk against silent flips)

Test results (cargo test -p aprender-train --lib transformer::config::tests::qwen2):
    3 passed; 0 failed; 0 ignored

Discharges FALSIFY-APR-PRETRAIN-ARCH-001 in PR #1473's contract.
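The shape of the fix and the inheritance hazard can be sketched as below. Caveat: this is a reconstruction for illustration, not the real `TransformerConfig` at `transformer/config.rs`; the 0.5B field values are the ones quoted in this PR, while the 1.5B overrides shown are assumptions used only to demonstrate the spread-inheritance pattern the drift-prevention test pins.

```rust
// Illustrative stub of the constructor family; not the real TransformerConfig.
#[derive(Debug, Clone, PartialEq)]
struct TransformerConfig {
    vocab_size: usize,
    hidden_size: usize,
    num_attention_heads: usize,
    num_kv_heads: usize,
    rope_theta: f64,
    use_bias: bool,
    tie_word_embeddings: bool,
}

impl TransformerConfig {
    fn qwen2_0_5b() -> Self {
        Self {
            vocab_size: 151_936,
            hidden_size: 896,
            num_attention_heads: 14,
            num_kv_heads: 2,           // GQA-7:1
            rope_theta: 1e6,
            use_bias: true,
            tie_word_embeddings: true, // the 1-LOC defect fix: was `false`
        }
    }

    fn qwen2_1_5b() -> Self {
        // Inherits tie_word_embeddings=true via the spread pattern; the
        // overridden values here are illustrative, not verified constants.
        Self {
            hidden_size: 1536,
            num_attention_heads: 12,
            ..Self::qwen2_0_5b()
        }
    }
}

fn main() {
    assert!(TransformerConfig::qwen2_0_5b().tie_word_embeddings);
    // A refactor that splits the spread chain would flip this silently —
    // exactly what the inheritance drift-prevention test exists to catch.
    assert!(TransformerConfig::qwen2_1_5b().tie_word_embeddings);
    assert_eq!(TransformerConfig::qwen2_0_5b().vocab_size, 151_936);
}
```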

Five Whys:

  1. Why was the constructor already there but with the wrong tie
     setting? Likely authored before the spec-§49 use case became the
     load-bearing target. The constants for `qwen2_0_5b` were correct
     for inference, but tie_word_embeddings is mostly a training-
     pipeline concern — it determines whether lm_head is a separate
     trainable parameter or shares with embed_tokens.

  2. Why didn't pmat query / cargo test catch this earlier? Existing
     tests pinned shape (hidden, layers, heads, etc.) but no test
     verified `tie_word_embeddings`. This PR adds the missing
     drift-prevention test that catches the defect class.

  3. Why fix this in the same PR as the test (not a separate fix)?
     Toyota Way: the test IS the discharge mechanism for FALSIFY-001.
     A test that passed against the (defective) status quo would be a
     liar. Fixing first + testing second guarantees the test pins
     correct behavior, not whatever happened to be in the code.

  4. Why also pin qwen2_1_5b (inheritance) and qwen2_7b (anti-spread)?
     Those are drift-prevention. The spread-inheritance pattern
     `..Self::qwen2_0_5b()` is fragile — a future refactor could
     split the inheritance chain and silently flip tie_word_embeddings
     back to false on 1.5B. Test catches that. Similarly, an over-
     enthusiastic refactor could homogenize 7B with 0.5B (incorrectly
     setting 7B's tie=true). Test catches that too.

  5. Why was §50.4 step 5b overscoped at 40 LOC? §50 was authored
     under the assumption that the constructor didn't exist. Live
     source inspection (per `feedback_no_guessing.md`) revealed the
     foundation was already there, just with one defect. This is the
     same lesson as §50 itself — read source before authoring scope.
     The contract from PR #1473 is still valid; only the LOC estimate
     in §50.4's table was wrong.

Plain ship-% update:
  - MODEL-1: unchanged at 91% (SHIP-007 cascade infrastructure track)
  - MODEL-2: unchanged at 57% — first ship-% movement gated on §50.4
    step 5g (LIVE 500-step fine-tune producing val_loss < 9.38)

Refs:
  - SPEC-SHIP-TWO-001 §49 — MODEL-2 strategy pivot (#1461)
  - SPEC-SHIP-TWO-001 §50 — architecture-coupling finding (#1472, in flight)
  - PR #1470 — apr-pretrain-from-init-v1 contract (merged)
  - PR #1471 — apr pretrain --init wire-up (merged)
  - PR #1473 — apr-pretrain-arch-polymorphic-v1 contract (in flight)
  - HF config: ~/.cache/huggingface/hub/models--Qwen--Qwen2.5-Coder-{0.5B,1.5B,7B}-Instruct/.../config.json
  - feedback_no_guessing.md — read source before forming hypothesis

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 4, 2026
… §50.4 step 5c (#1475)

Adds `pretrain_real::build_transformer_config(init: Option<&TransformerConfig>)`
which dispatches between the §24/§25 from-scratch baseline (Llama370M)
and a caller-extracted init config (e.g., Qwen2.5-Coder-0.5B).

Discharges from `apr-pretrain-arch-polymorphic-v1` (PR #1473):
  - FALSIFY-APR-PRETRAIN-ARCH-002 — init=None preserves Llama370M baseline
                                    byte-for-byte (no §24/§25 regression)
  - FALSIFY-APR-PRETRAIN-ARCH-003 — init=Some passes through caller-extracted
                                    config (no silent defaults)

Design decision: the polymorphic builder takes a `TransformerConfig`,
NOT a path to an APR file. The caller (apr-cli, in step 5d/5f) is
responsible for reading the APR file and producing the config — typically
via `TransformerConfig::from_apr_metadata()` (already at config.rs:264).
This decoupling keeps `aprender-train` free of `aprender-serve` (the APR
loader) as a build dep, preserving compile-graph hygiene.

What this PR adds:

  1. `pub fn build_transformer_config(init: Option<&TransformerConfig>) -> TransformerConfig`
     at pretrain_real.rs (~25 LOC including doc comment)

  2. 3 unit tests in pretrain_real::tests:
       - build_transformer_config_no_init_matches_llama370m       (FALSIFY-002)
       - build_transformer_config_qwen_init_matches_input         (FALSIFY-003)
       - build_transformer_config_dispatch_mutually_exclusive     (drift-prevention)

Test results (cargo test -p aprender-train --lib train::pretrain_real::tests::build_transformer_config):
    3 passed; 0 failed; 0 ignored
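The dispatch itself is small enough to sketch in full. This is a minimal stand-in (a two-field stub config and an illustrative `llama370m()` baseline constructor), assuming the behavior described above: init=None returns the baseline byte-for-byte, init=Some passes the caller's config through untouched.

```rust
// Minimal sketch of the pure dispatch; the stub config and llama370m()
// constants are placeholders for the real Llama370M baseline.
#[derive(Debug, Clone, PartialEq)]
struct TransformerConfig {
    vocab_size: usize,
    hidden_size: usize,
}

fn llama370m() -> TransformerConfig {
    TransformerConfig { vocab_size: 50_257, hidden_size: 1024 }
}

// init=None → the §24/§25 from-scratch baseline, unchanged.
// init=Some → the caller-extracted config, passed through with no silent defaults.
fn build_transformer_config(init: Option<&TransformerConfig>) -> TransformerConfig {
    match init {
        None => llama370m(),
        Some(cfg) => cfg.clone(),
    }
}

fn main() {
    // FALSIFY-002: regression-free from-scratch path.
    assert_eq!(build_transformer_config(None), llama370m());

    // FALSIFY-003: Qwen-shaped init config comes back untouched.
    let qwen = TransformerConfig { vocab_size: 151_936, hidden_size: 896 };
    assert_eq!(build_transformer_config(Some(&qwen)), qwen);
}
```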

Five Whys:

  1. Why a polymorphic builder NOW, not when pretrain_real was first
     written? §24/§25 only exercised the from-scratch path (Llama370M
     architecture). §49 added the pretrained-init use case where the
     init checkpoint can have ARBITRARY shape. Without the dispatch,
     §49.6 step 5g would fail at the first forward pass with shape
     errors when an operator points --init at a non-Llama-shaped APR.

  2. Why take TransformerConfig (not Path) as input? Decoupling: the
     APR-reading is `aprender-serve`'s job (or apr-cli's, since it's
     the one that reads --init <PATH>). `aprender-train` only needs
     to know "given a config, build the trainer". Mixing concerns
     would force aprender-train to depend on aprender-serve, which
     is a multi-thousand-line cross-crate dep just for header reading.

  3. Why are the 3 tests sufficient at PARTIAL_ALGORITHM_LEVEL?
       - FALSIFY-002 pins regression-free behavior on the from-scratch
         path (the load-bearing claim that the polymorphic dispatch
         doesn't break what works today)
       - FALSIFY-003 pins pass-through behavior (no silent defaults
         that would corrupt init weights)
       - The dispatch-mutually-exclusive test catches a future refactor
         that accidentally always returns Llama370M (a silent-fallback
         regression class, the same defect §24's --synthetic default
         introduced)

  4. Why not also wire this into drive_real / drive_real_cuda? That
     wires the actual init path. It's step 5f (weight load) work; this
     PR is just the pure-function dispatch. Keeping each PR small
     prevents the `feedback_no_guessing.md` failure mode of writing
     200 LOC at once and finding mid-PR that the foundation is wrong.

  5. Why is FALSIFY-APR-PRETRAIN-INIT-005 (arch mismatch) NOT yet
     discharged? The current dispatch is "init=Some passes through".
     There's no validation that the caller-provided config is COMPATIBLE
     with the pretrain target. That validation lives in the apr-cli
     wire (step 5d preflight + step 5f weight load) and exits non-zero
     with a clear error if architectures mismatch.

Plain ship-% update:
  - MODEL-1: unchanged at 91% (SHIP-007 cascade infrastructure track)
  - MODEL-2: unchanged at 57% — first ship-% movement gated on §50.4
    step 5g (LIVE 500-step fine-tune producing val_loss < 9.38)

Refs:
  - SPEC-SHIP-TWO-001 §50 — MODEL-2 architecture-coupling finding (#1472, in flight)
  - SPEC-SHIP-TWO-001 §50.4 — re-scoped roadmap (steps 5a-5h)
  - PR #1473 — apr-pretrain-arch-polymorphic-v1 contract (in flight)
  - PR #1474 — qwen2_0_5b tie_word_embeddings fix (in flight)
  - feedback_no_guessing.md — read source before forming hypothesis

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 4, 2026
…step 5f.1

Adds `pretrain_real::validate_pretrain_init_arch_compatible(cfg)` that
fail-fast rejects an init `TransformerConfig` whose architecture family
is incompatible with the decoder-only pretrain trainer.

Discharges from `apr-pretrain-arch-polymorphic-v1` (PR #1473):
  - FALSIFY-APR-PRETRAIN-ARCH-007 — wrong-arch APR (e.g., CodeBERT/
    RoBERTa encoder model) is FAIL-FAST not silent-truncate

Why this matters: §49 wires `--init <PATH>` to load weights from any
APR file. Without this gate, an operator who points --init at e.g.
microsoft/codebert-base.apr would silently load encoder weights into
a decoder-shaped trainer, producing nonsense gradients that the
divergence guard catches LATE (multiple epochs in). This gate catches
the family mismatch BEFORE any trainer allocation.

Step 5f decomposition: this is step 5f.1 — the arch-family gate.
Step 5f.2 (~80 LOC, follow-up) does the actual weight materialization
into optimizer state. Splitting keeps each PR small + reviewable.

What this PR adds:

  1. `pub fn validate_pretrain_init_arch_compatible(cfg: &TransformerConfig)
     -> Result<(), String>` (~30 LOC including doc comment) at
     pretrain_real.rs:35

  2. 3 unit tests in `pretrain_real::tests`:
       - validate_pretrain_init_arch_accepts_decoder       (FALSIFY-007 negative)
       - validate_pretrain_init_arch_rejects_encoder       (FALSIFY-007 positive,
                                                            load-bearing)
       - validate_pretrain_init_arch_accepts_llama370m_baseline (drift-prevention,
                                                                 catches over-rejection
                                                                 regression)

The encoder-rejection test asserts FOUR string contents in the error:
  - "FALSIFY-APR-PRETRAIN-ARCH-007" — falsifier id (auditability)
  - "Encoder"                      — names the architecture family
  - "decoder-only"                 — explains why this is wrong
  - "RobertaModel"                 — names the offending hf_architecture
Operator-experience parity: when the gate fires, the error tells the
operator exactly what they did wrong + how the trainer differs.
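The gate and its four-part error can be sketched as below — a hedged, self-contained illustration only: the field names (`arch_family`, `hf_architecture`) and the `ArchFamily` enum are stand-ins for whatever the real `TransformerConfig` carries, and the error text is merely one string that satisfies all four pinned substrings.

```rust
// Hypothetical stand-in for the real TransformerConfig — illustrative only.
#[derive(Debug)]
enum ArchFamily {
    DecoderOnly,
    EncoderOnly,
}

struct TransformerConfig {
    arch_family: ArchFamily,
    hf_architecture: String,
}

// Sketch of the 5f.1 gate: reject non-decoder families before any
// trainer allocation, with an error carrying all four pinned substrings.
fn validate_pretrain_init_arch_compatible(cfg: &TransformerConfig) -> Result<(), String> {
    match cfg.arch_family {
        ArchFamily::DecoderOnly => Ok(()),
        ArchFamily::EncoderOnly => Err(format!(
            "FALSIFY-APR-PRETRAIN-ARCH-007: Encoder architecture '{}' is \
             incompatible with the decoder-only pretrain trainer",
            cfg.hf_architecture
        )),
    }
}

fn main() {
    let bad = TransformerConfig {
        arch_family: ArchFamily::EncoderOnly,
        hf_architecture: "RobertaModel".to_string(),
    };
    let err = validate_pretrain_init_arch_compatible(&bad).unwrap_err();
    // The encoder-rejection test asserts these four substrings.
    for needle in [
        "FALSIFY-APR-PRETRAIN-ARCH-007",
        "Encoder",
        "decoder-only",
        "RobertaModel",
    ] {
        assert!(err.contains(needle), "error message lost: {needle}");
    }
    println!("all four pinned substrings present");
}
```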

Test results (cargo test -p aprender-train --lib train::pretrain_real::tests::validate_pretrain_init_arch):
    3 passed; 0 failed; 0 ignored

Five Whys:

  1. Why a separate function rather than baking the check into
     build_transformer_config? Decoupling: build_transformer_config is
     a pure pass-through dispatch; adding arch validation would conflate
     "which config?" with "is this config valid?". Two functions, two
     concerns, two test surfaces.

  2. Why focus this PR on JUST the arch-family check (step 5f.1) and
     not the full weight materialization (step 5f)? Single-piece flow.
     Step 5f's full scope (~120 LOC) splits naturally into 5f.1 (this
     PR, ~30 LOC + 3 tests) + 5f.2 (~80 LOC, the actual weight load).
     Each PR has its own falsifier discharge; CI catches regressions
     between them.

  3. Why FOUR string assertions in the encoder-rejection error? Each
     piece of the error text serves a distinct operator need:
       - falsifier id → audit (which contract did this fail?)
       - architecture family → what (encoder vs decoder)
       - "decoder-only" → why (the trainer is decoder-only)
       - hf_architecture → which model (RobertaModel/CodeBERT/...)
     Lossy error messages erode operator trust; the contract pins
     all four to prevent message rot.

  4. Why include the Llama370M baseline drift-prevention test? §24's
     retrospective showed silent over-rejection (every input rejected,
     even valid ones) is the symmetric defect to silent under-rejection
     (every input accepted, even invalid ones). The 3 tests cover both
     halves of the dispatch.

  5. Why is FALSIFY-006 (init_loss < 6.0) NOT yet discharged? That
     requires the actual weight materialization (step 5f.2) PLUS a
     LIVE training run (step 5g). Step 5f.1 is just the gate; the
     load-bearing init_loss measurement is downstream.

Plain ship-% update:
  - MODEL-1: unchanged at 91% (SHIP-007 cascade infrastructure track)
  - MODEL-2: unchanged at 57% — first ship-% movement gated on §50.4
    step 5g (LIVE 500-step fine-tune producing val_loss < 9.38)

Refs:
  - SPEC-SHIP-TWO-001 §50 — MODEL-2 architecture-coupling (PR #1472, MERGED)
  - PR #1473 — apr-pretrain-arch-polymorphic-v1 contract (in flight)
  - PR #1474 — qwen2_0_5b tie_word_embeddings fix (in flight)
  - PR #1475 — build_transformer_config polymorphic dispatch (in flight)
  - PR #1476 — preflight_tokenizer_vocab_matches_target (in flight)
  - PR #1478 — GQA-7:1 forward-pass smoke test (MERGED)
  - feedback_no_guessing.md

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift merged commit 4b48cff into main May 4, 2026
10 checks passed
@noahgift noahgift deleted the contract/apr-pretrain-arch-polymorphic-v1 branch May 4, 2026 18:47
noahgift added a commit that referenced this pull request May 4, 2026
…1481)

Adds the read-half of `apr pretrain --init` weight load: a thin
wrapper over `aprender::format::converter::load_model_tensors` that
returns a `BTreeMap<String, (Vec<f32>, Vec<usize>)>` of tensor blobs
keyed by HF naming convention.

Per `apr-pretrain-arch-polymorphic-v1` §init_load_semantics (PR #1473):
"Loader is REUSED, not reimplemented." This function does not duplicate
APR parsing — it forwards to the same machinery `apr export` and
`apr inspect` use.

Discharges from `apr-pretrain-arch-polymorphic-v1`:
  - §init_load_semantics invariant (loader reuse): satisfied
  - FALSIFY-006 (init_loss < 6.0) at READ-COMPILE-BIND level

Step 5f decomposition:
  - 5f.1 (PR #1479): encoder/decoder family validator (~30 LOC)
  - 5f.2 (this PR): APR file open + tensor read (~30 LOC + 2 tests)
  - 5f.3 (next):    populate trainer parameters from BTreeMap (~50 LOC)
  - 5g  (operator): LIVE 500-step fine-tune → DISCHARGES MODEL-2 ship-%

Step 5f.2 is intentionally narrow — it only does the READ. Population
into trainer parameter slots (5f.3) reconciles HF naming convention
(e.g., `model.embed_tokens.weight`) against the trainer's internal
parameter naming. That's a separate concern with its own falsifier.

What this PR adds:

  1. `pub fn load_init_tensors_from_apr(path) -> Result<BTreeMap<...>>`
     at pretrain_real.rs:35 (~25 LOC including doc comment)
  2. 2 unit tests in `pretrain_real::tests`:
       - load_init_tensors_missing_file_errors_with_falsifier_id
         (FALSIFY-006 fail-fast path; asserts error message contains
          falsifier id + offending path for operator-experience)
       - load_init_tensors_signature_compile_bind
         (drift-prevention: catches a future signature change that
          would break step 5f.3's BTreeMap consumer)

Test results (cargo test -p aprender-train --lib train::pretrain_real::tests::load_init_tensors):
    2 passed; 0 failed; 0 ignored

Five Whys:

  1. Why decompose step 5f.2 to JUST the read? Single-piece flow.
     Read → Validate → Populate are three distinct concerns. Step 5f.1
     did validation (#1479); 5f.2 does read; 5f.3 will do populate.
     Each PR has one falsifier discharge story.

  2. Why use load_model_tensors and not write a new parser? The contract
     pins "Loader is reused, not reimplemented." Writing a new parser
     would create a parallel format-decoder that drifts from the canonical
     one. The same lesson as the LAYOUT-001/002 hits — parallel format
     code paths produce silent format-drift bugs.

  3. Why return BTreeMap<String, (Vec<f32>, Vec<usize>)> rather than a
     trainer-parameter-shaped struct? Decoupling: the read shouldn't
     know about TransformerTrainer's internal parameter names. Step
     5f.3's job is to map HF names → trainer slots; if 5f.2 baked that
     mapping in, every change to TransformerTrainer would break the read.

  4. Why include the signature-compile-bind test? It's a compile-time
     check that drives step 5f.3's expectations. If a future refactor
     changes the return type (e.g., from BTreeMap to HashMap, or from
     Vec<usize> to Box<[usize]>), step 5f.3's consumer code stops
     compiling — caught here, not at the integration point.

  5. Why is FALSIFY-006 NOT yet at PARTIAL_ALGORITHM_LEVEL after this
     PR? Because step 5f.2 only does the read; FALSIFY-006 requires
     the LIVE init_loss < 6.0 check, which needs steps 5f.3 + 5g.
     This PR moves FALSIFY-006 from UNBOUND → READ-COMPILE-BIND, a
     sub-level of PARTIAL_ALGORITHM_LEVEL. Full PARTIAL discharge
     happens at 5f.3 when the populate step exists.
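The signature-compile-bind idea in Why #4 can be sketched as a fn-pointer coercion — a minimal illustration under the assumption of the signature described above; the function body is a trivial placeholder.

```rust
use std::collections::BTreeMap;

// Placeholder with the pinned signature — illustrative, not the real loader.
fn load_init_tensors_from_apr(
    _path: &str,
) -> Result<BTreeMap<String, (Vec<f32>, Vec<usize>)>, String> {
    Ok(BTreeMap::new())
}

fn main() {
    // If a future refactor changes the return type (BTreeMap -> HashMap,
    // Vec<usize> -> Box<[usize]>, ...), this coercion stops compiling —
    // the drift is caught here, not at step 5f.3's integration point.
    let bound: fn(&str) -> Result<BTreeMap<String, (Vec<f32>, Vec<usize>)>, String> =
        load_init_tensors_from_apr;
    assert!(bound("any.apr").is_ok());
    println!("signature bound");
}
```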

Plain ship-% update:
  - MODEL-1: unchanged at 91% (SHIP-007 cascade infrastructure track)
  - MODEL-2: unchanged at 57% — first ship-% movement gated on §50.4
    step 5g (LIVE 500-step fine-tune producing val_loss < 9.38)

Refs:
  - SPEC-SHIP-TWO-001 §50, §51 — MODEL-2 architecture-coupling +
    cascade snapshot (PR #1472, #1480 MERGED)
  - PR #1473 — apr-pretrain-arch-polymorphic-v1 contract (in flight)
  - PR #1474 — qwen2_0_5b tie_word_embeddings fix (MERGED)
  - PR #1475 — build_transformer_config polymorphic dispatch (in flight)
  - PR #1476 — preflight_tokenizer_vocab_matches_target (MERGED)
  - PR #1478 — GQA-7:1 forward-pass smoke test (MERGED)
  - PR #1479 — validate_pretrain_init_arch_compatible (in flight)
  - feedback_no_guessing.md
  - feedback_falsifier_first_cascade_pattern.md (this turn's pattern)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 5, 2026
…ION-COMPLETE; contract v1.1.0 → v1.2.0 FUNCTIONAL (#1495)

§50.4 cascade INTEGRATION-COMPLETE on main with PR #1494 merging at
2026-05-05T01:48:14Z. The `apr pretrain --init <PATH>` flow is now
end-to-end functional on CPU; the legacy "not yet wired" Err is
RETIRED; step 5g LIVE is the only remaining gate before MODEL-2 ship-%
can move from 57% → ≥58%.

Spec amendment §53:
- Updated falsifier scoreboard: 6/8 INTEGRATION (001/002/003/005/006/007
  via live CLI dispatch); 2/8 PARTIAL_ALGORITHM_LEVEL (004 forward-pass
  smoke + 008 contract validation are inherently algorithm-level).
- Step roadmap: 5a-5f.4 ✅ MERGED; 5f.5 (CUDA wireup) NOT YET STARTED;
  5g (LIVE 500-step fine-tune) operator-dispatchable on RTX 4090.
- Cascade ships statistics: 13 PRs over 2 days
  (#1471/#1472/#1473/#1474/#1475/#1476/#1478/#1479/#1481/#1482/#1483/#1486/#1494).
- MODEL-1 ship % unchanged at 91%; MODEL-2 ship % unchanged at 57%
  (gated on 5g empirical val_loss < 9.38 evidence).
- 3 CI andon classes documented as feedback memories during cascade
  (workspace-test missing-binary, trueno SIGSEGV-on-cleanup, auto-merge
  behind-state).

Contract apr-pretrain-arch-polymorphic-v1 v1.1.0 → v1.2.0 FUNCTIONAL:
- All 8 falsifiers PASS on main; 6/8 reach INTEGRATION via the
  user-facing `apr pretrain --init` flow.
- verification_summary updated: tested 7 → 8; status partial →
  functional.
- Added §52 + §53 references.
- Promotion to DISCHARGED still requires §50.4 step 5g LIVE empirical
  500-step fine-tune on canonical Qwen2.5-Coder-0.5B-Instruct.apr
  producing val_loss < 9.38.

`pv validate contracts/apr-pretrain-arch-polymorphic-v1.yaml` exits 0.

Refs: SPEC-SHIP-TWO-001 §50.4 cascade, PR #1494 merge commit 9afca16

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 5, 2026
…-005/006 test-reference drift (#1505)

Same drift class as PR #1504 caught in apr-pretrain-from-init-v1.
Test names cited in v1.1.0 changelog never matched the actual tests
PR #1476 authored. Drift survived three intervening bumps
(v1.1→v1.2→v1.3→v1.4) because each focused on adding new falsifiers,
not auditing existing bindings.

## Drift inventory

| Falsifier | v1.4.0 cited test | Exists? | Actual test |
|---|---|---|---|
| FALSIFY-005 | preflight_qwen_vocab_passes_with_qwen_init | ❌ | preflight_qwen_vocab_passes_with_qwen_target |
| FALSIFY-006 | preflight_qwen_vocab_fails_without_init | ❌ | preflight_qwen_vocab_fails_with_llama_target |

## Resolution

Update the `test:` field for FALSIFY-005 and FALSIFY-006 to reference
the actual tests authored by PR #1476. No falsifier semantics change.
No new tests added.

## Verification

  $ cargo test -p apr-cli --lib -- commands::pretrain::tests::preflight_qwen_vocab_passes_with_qwen_target
    test result: ok. 1 passed; ...
  $ cargo test -p apr-cli --lib -- commands::pretrain::tests::preflight_qwen_vocab_fails_with_llama_target
    test result: ok. 1 passed; ...
  $ pv validate contracts/apr-pretrain-arch-polymorphic-v1.yaml
    0 error(s), 0 warning(s)

## Five Whys

1. Why did the drift survive 3 bumps? Each bump (v1.2/v1.3/v1.4)
   focused on ADDING new content (CUDA-001, relaxed bound, etc.);
   none audited existing bindings.
2. Why didn't the §50.4 cascade catch this? The cascade authored
   tests; the contract was authored separately. Names diverged at
   the boundary; no cross-check landed.
3. Why is this a contract-only fix (no source change)? The tests
   exist and pass — the IMPL is correct. Only the contract's text
   reference needed correction.
4. Why bump to v1.5.0 (not v1.4.1 patch)? Same logic as PR #1504:
   the test-binding INVARIANT (every cited test exists) was broken
   in v1.4.0. v1.5.0 restores it.
5. Why is this important if the impl is correct? Per
   feedback_no_guessing.md, contracts that cite non-existent tests
   are unfalsifiable — future agents reading the contract get a
   false signal that the falsifier is bound. PV-VER-001 lint will
   catch this; better to fix it now than wait for the lint engine to
   flag it.

## Net effects

- Contract v1.4.0 → v1.5.0 FUNCTIONAL.
- 11 falsifiers, all PASS — same count, but FALSIFY-005/006 now
  reference tests that actually exist.
- MODEL-1 ship % unchanged at 91%.
- MODEL-2 ship % unchanged at 57% until 5g.3.

This is hygiene work while 5g.1 (~12hr) corpus retokenize runs.
Same defect class as PR #1504; together they close the
test-reference drift across both pretrain contracts.

Refs: SPEC-SHIP-TWO-001 §50.4 cascade (PRs #1473-#1494, #1502),
      contracts/apr-pretrain-arch-polymorphic-v1.yaml v1.5.0,
      contracts/apr-pretrain-from-init-v1.yaml v1.2.0 (PR #1504, sibling fix)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 5, 2026
…IFY-001 with real integration test (#1506)

v1.0.0 stamped a vague test reference for FALSIFY-TOK-IMPORT-HF-001:
  "cargo test -p apr-cli --test cli_commands -- --nocapture
   (or equivalent) reports import-hf as a registered tokenize subcommand"

This was not a runnable invocation — same drift class as PR #1504
+ PR #1505 caught for sibling pretrain contracts. The contract said
"or equivalent" rather than naming an actual test, leaving the
falsifier unfalsifiable.

## What ships

Test:
- `tokenize_import_hf_subcommand_registered` in
  `crates/apr-cli/tests/cli_commands.rs` runs `apr tokenize
  import-hf --help` and asserts (a) exit 0, (b) `--input`,
  `--output`, `--include-added-tokens` flags appear. Pins the
  3-surface drift triangle:
    (1) clap variant `TokenizeCommands::ImportHf`
    (2) unit tests `commands::tokenize::tests::import_hf_*`
    (3) this integration test
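The flag-pinning half of that test can be sketched as below. This is hedged: the real integration test spawns `apr tokenize import-hf --help` via `std::process::Command` and asserts on its captured stdout; here the help text is simulated so the assertion logic is self-contained, and the option descriptions are invented.

```rust
// Panic if the help text has lost any pinned operator-facing flag —
// catches flag-rename drift (e.g., --input -> --source) that a bare
// exit-0 check would miss.
fn assert_flags_present(help: &str, flags: &[&str]) {
    for flag in flags {
        assert!(help.contains(flag), "help text lost flag {flag}");
    }
}

fn main() {
    // Simulated capture of `apr tokenize import-hf --help` output.
    let simulated_help = "\
Usage: apr tokenize import-hf [OPTIONS]

Options:
      --input <PATH>          HF tokenizer.json to import
      --output <PATH>         APR tokenizer output path
      --include-added-tokens  Also import added-token entries
";
    assert_flags_present(
        simulated_help,
        &["--input", "--output", "--include-added-tokens"],
    );
    println!("all pinned flags present");
}
```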

Contract apr-cli-tokenize-import-hf-v1 v1.0.0 → v1.1.0 PARTIAL_ALGORITHM_LEVEL:
- FALSIFY-TOK-IMPORT-HF-001 `test:` updated to cite the new test.
- Status remains PARTIAL_ALGORITHM_LEVEL: 5/5 falsifiers PASS.

## Verification

  $ cargo test -p apr-cli --test cli_commands -- tokenize_import_hf_subcommand_registered
    test result: ok. 1 passed; ...
  $ pv validate contracts/apr-cli-tokenize-import-hf-v1.yaml
    0 error(s), 0 warning(s)

## Five Whys

1. Why was the v1.0.0 reference vague? Authored alongside the
   subcommand impl + unit tests; the integration test was deferred
   under the assumption that "test_no_unregistered_commands" would
   cover it. But that test only covers TOP-LEVEL subcommands, not
   sub-subcommands of `apr tokenize`.
2. Why is sub-subcommand registration not covered by
   test_no_unregistered_commands? It walks
   `apr-cli-commands-v1.yaml` which only enumerates top-level
   subcommands; sub-subcommand surfaces are out of scope.
3. Why bump to v1.1.0 (not v1.0.1)? Same logic as PR #1504/#1505:
   the test-binding INVARIANT was broken in v1.0.0; v1.1.0 restores it.
4. Why mirror the `pretrain_init_flag_registered` pattern instead
   of authoring something new? The pattern (run `apr <subcmd>
   --help`, assert exit 0 + key flags appear) is a clean drift
   guard; mirroring it preserves codebase consistency.
5. Why pin the 3 specific flags rather than just `apr tokenize
   import-hf --help` exit 0? Because flag-level drift (e.g., a
   future PR renaming `--input` to `--source`) would silently
   break operator-facing UX without breaking the help-shows-up
   binary check; pinning the exact flag names catches that class.

## Net effects

- Contract v1.0.0 → v1.1.0 PARTIAL_ALGORITHM_LEVEL.
- 1 new integration test (33 LOC).
- 5/5 falsifiers PASS, all bound to real tests.
- MODEL-1 ship % unchanged at 91%; MODEL-2 ship % unchanged at 57%.

This is hygiene work while 5g.1 (~11hr) corpus retokenize runs.
Third drift-fix PR in the same session (after PR #1504 + PR #1505)
closing the test-reference drift class across the §50.4 cascade
contracts (apr-pretrain-from-init-v1, apr-pretrain-arch-polymorphic-v1,
apr-cli-tokenize-import-hf-v1).

Refs: SPEC-SHIP-TWO-001 §50.4 cascade (PRs #1473-#1505),
      contracts/apr-cli-tokenize-import-hf-v1.yaml v1.1.0,
      feedback_cli_subcommand_three_surface_drift.md

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>