
feat(pretrain): SPEC §81 P0-D + P0-E — embed tokenizer + write arch metadata in apr pretrain output#1701

Merged

noahgift merged 7 commits into main from feat/p0-e-arch-metadata-in-pretrain on May 15, 2026

Conversation

@noahgift
Contributor

Adds hidden_size, num_layers, num_heads, num_kv_heads, intermediate_size, vocab_size, max_position_embeddings, rope_theta, and rms_norm_eps to CudaTransformerTrainer's APR checkpoint output. Closes the C-03 gate that blocked `apr bench`. End-to-end verify on the §78 checkpoint: `apr bench` → 315.5 tok/s (3.15× over the AC-SHIP2-010 100 tok/s floor). AC-SHIP2-010 / FALSIFY-SHIP-020 → DISCHARGED. MODEL-2 ship %: 75% → 77%.

…kpoints

CudaTransformerTrainer::save_apr now writes individual transformer
config keys (hidden_size, num_hidden_layers, num_attention_heads,
num_kv_heads, intermediate_size, vocab_size, max_position_embeddings,
rope_theta, rms_norm_eps) as AprWriter well-known metadata fields.

These map to AprV2Metadata typed fields via build_v2_metadata, which
realizar's gguf::config::from_apr requires (C-03 gate). Before this
fix, MODEL-2 checkpoints from `apr pretrain` failed apr bench with
"C-03: APR model missing 'hidden_size' metadata".

End-to-end verify on §78 fine-tune from Qwen-0.5B init:
  apr inspect --json | jq .metadata →
    hidden_size:    896  ✓
    num_layers:     24   ✓
    num_heads:      14   ✓
    num_kv_heads:   2    ✓
    intermediate_size: 4864 ✓
    vocab_size:     151936 ✓

  apr bench --iterations 5 --max-tokens 128 →
    tokens_per_second: 315.5   ✓ (3.15× over AC-SHIP2-010 100 tok/s floor)
    passed: true               ✓

AC-SHIP2-010 (FALSIFY-SHIP-020) PARTIAL_ALGORITHM_LEVEL → DISCHARGED
on real fine-tuned MODEL-2 checkpoint.

Implementation: replace the legacy save_model() path (which only wrote
model_name + architecture + format + version) with a direct AprWriter
call that adds the arch dim keys. Reuses io::save::infer_all_tensor_shapes
for 2D weight shape handling (now pub(crate)).
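In sketch form, the key set now emitted per checkpoint looks like this (the struct and the to_pairs helper are illustrative assumptions — AprWriter's real API may differ; only the key names and the §78 fixture values come from this commit):

```rust
// Sketch of the arch metadata that save_apr now writes per checkpoint.
struct ArchMetadata {
    hidden_size: u32,             // 896 on the §78 Qwen-0.5B fixture
    num_hidden_layers: u32,       // 24
    num_attention_heads: u32,     // 14
    num_kv_heads: u32,            // 2
    intermediate_size: u32,       // 4864
    vocab_size: u32,              // 151936
    max_position_embeddings: u32,
    rope_theta: f32,
    rms_norm_eps: f32,
}

impl ArchMetadata {
    /// Flatten into (key, value) pairs for the writer's well-known metadata fields.
    fn to_pairs(&self) -> Vec<(&'static str, String)> {
        vec![
            ("hidden_size", self.hidden_size.to_string()),
            ("num_hidden_layers", self.num_hidden_layers.to_string()),
            ("num_attention_heads", self.num_attention_heads.to_string()),
            ("num_kv_heads", self.num_kv_heads.to_string()),
            ("intermediate_size", self.intermediate_size.to_string()),
            ("vocab_size", self.vocab_size.to_string()),
            ("max_position_embeddings", self.max_position_embeddings.to_string()),
            ("rope_theta", self.rope_theta.to_string()),
            ("rms_norm_eps", self.rms_norm_eps.to_string()),
        ]
    }
}
```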

MODEL-2 ship %: 75% → 77%.

Closes §81 P0-E (one of three apr pretrain output metadata gaps;
P0-D embed-tokenizer and P0-F arch-case-mapping are separate PRs).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) May 15, 2026 13:08
…ints

Adds CudaAprCheckpointFn::with_tokenizer_dir() builder + new
CudaTransformerTrainer::save_apr_with_tokenizer() method that reads
tokenizer.json from --tokenizer dir and embeds:

  - tokenizer.vocabulary (151643 entries on §78 Qwen-0.5B fixture)
  - tokenizer.merges     (151387 entries)
  - tokenizer.bos_token_id (parsed from added_tokens for known special
    strings: <s>, <|im_start|>, <|begin_of_text|>)
  - tokenizer.eos_token_id (</s>, <|im_end|>, <|end_of_text|>,
    <|endoftext|>)

apr-cli pretrain.rs now passes --tokenizer through via
.with_tokenizer_dir(&config.tokenizer_dir).
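A minimal sketch of the embedding step, assuming serde_json is available and that tokenizer.json follows the standard Hugging Face layout (model.vocab, model.merges, added_tokens); the real save_apr_with_tokenizer wiring and APR key names may differ:

```rust
use std::{fs, path::Path};

// Sketch only: read tokenizer.json from the --tokenizer dir and pull out the
// pieces embedded by this commit (vocab size, merge count, bos/eos ids).
fn read_tokenizer_parts(
    dir: &Path,
) -> Result<(usize, usize, Option<u64>, Option<u64>), Box<dyn std::error::Error>> {
    let raw = fs::read_to_string(dir.join("tokenizer.json"))?;
    let json: serde_json::Value = serde_json::from_str(&raw)?;

    // tokenizer.vocabulary — BPE vocab map under model.vocab
    let vocab_len = json["model"]["vocab"].as_object().map_or(0, |m| m.len());
    // tokenizer.merges — BPE merge list under model.merges
    let merges_len = json["model"]["merges"].as_array().map_or(0, |a| a.len());

    // BOS / EOS ids: scan added_tokens for the known special strings
    let find_id = |names: &[&str]| -> Option<u64> {
        json["added_tokens"].as_array()?.iter().find_map(|tok| {
            let content = tok["content"].as_str()?;
            if names.contains(&content) { tok["id"].as_u64() } else { None }
        })
    };
    let bos = find_id(&["<s>", "<|im_start|>", "<|begin_of_text|>"]);
    let eos = find_id(&["</s>", "<|im_end|>", "<|end_of_text|>", "<|endoftext|>"]);

    Ok((vocab_len, merges_len, bos, eos))
}
```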

Empirical verify on §78 fine-tune from Qwen-0.5B init (100 steps):
  apr run <checkpoint>.apr "def fib(n):" →
    [PMAT-171] Loaded embedded BPE tokenizer: 151643 vocab, 151387
               merges, 3 special tokens

  apr qa <checkpoint>.apr --skip-throughput ... →
    Previously: "Validation failed: APR missing embedded tokenizer"
    Now:        Gates execute; only golden_output fails (separate issue
                — model output quality at val_loss=6.56 with 100 steps).

Closes §81 P0-D (one of three apr pretrain output metadata gaps;
P0-E arch metadata is in PR #1701, P0-F arch case mapping in PR #1699).

After this PR, apr qa runs against MODEL-2 checkpoints WITHOUT the
external --tokenizer requirement — checkpoints are now self-contained
per the AC-SHIP2-005 / .apr format spec.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift changed the title feat(pretrain): SPEC §81 P0-E — write arch metadata keys to .apr checkpoints feat(pretrain): SPEC §81 P0-D + P0-E — embed tokenizer + write arch metadata in apr pretrain output May 15, 2026
noahgift added a commit that referenced this pull request May 15, 2026
…0 packaging gaps (#1702)

Triple-amendment to SPEC-SHIP-TWO-001 capturing the §78 → §80 dispatch
arc that revealed a Class 3 packaging-defect wave in apr pretrain output.

Consolidates the content of PRs #1695 (§79), #1697 (§80), #1698 (§81)
which were all DIRTY against main due to overlapping spec-header edits.

§79 — External audit + Five-Whys retrospective on MODEL-2 convergence
  Synthesizes docs/specifications/two-model-spec-audit.md. Identifies
  three compounding root causes for the val_loss=9.75 plateau:
    1. Data starvation (0.24% of Chinchilla-optimal token count)
    2. False plateau hypothesis (LR-budget falsification)
    3. Infrastructure masking bugs (silent CPU fallback, exhaustion
       placeholder, premature early-stop)
  Five-Whys for Case A (silent corpus exhaustion), Case B (early stop),
  Case C (val_loss=9.75 plateau). Reconciles audit Recommendations 1-3
  vs §78's §49-pivot path.

§80 — Prioritized open-follow-up backlog
  Ranks all open SHIP-TWO-001 work by ship-% delta ÷ effort. P0 trio
  (apr qa / bench / export against epoch-004.apr) + P1 Chinchilla gate
  + P1 python validity + P1 HumanEval + P2-A long train = MODEL-2
  theoretical ceiling 92% at ~6-10h RTX 4090 compute.

§81 — P0 dispatch surfaced 3 systemic packaging-defect gaps
  Dispatching §80's P0 trio against §78's epoch-004.apr revealed:
    - P0-A apr qa     → "APR missing embedded tokenizer"
    - P0-B apr bench  → "C-03: APR model missing 'hidden_size' metadata"
    - P0-C apr export → PASSED, but llama-cli refused with
                        "unknown model architecture: 'LlamaForCausalLM'"
                        (GGUF expects lowercase "llama")
  Companion code PRs:
    - #1699 P0-F      → HF→GGUF arch case mapping in apr export
    - #1701 P0-D + P0-E → embed tokenizer + write arch metadata in
                          apr pretrain output
  AC-SHIP2-010 → DISCHARGED (315.5 tok/s on Qwen-0.5B fine-tune;
  3.15× over the 100 tok/s floor).

Methodology lessons added:
  #26 NEW: Three-class root-cause taxonomy for ML convergence failures
          (data starvation / optimization defects / infrastructure
          masking). Diagnose which class is binding before tuning.
  #27 NEW: Prioritize by ship-% delta ÷ effort, not alphabetical AC
          order. P0 dispatches are 0.1% the compute cost of P2-A.
  #28 NEW: Class 3 defects come in waves. Training works ≠ checkpoint
          is usable. Each lifecycle stage needs its own surfacing
          dispatch.

Ship-% movement:
  MODEL-1: 100% (unchanged)
  MODEL-2: 75% (unchanged in this PR; +2pp expected on #1701 merge)

Spec v3.24.0 → v3.27.0.

Replaces PRs #1695, #1697, #1698 (all DIRTY against main).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift merged commit 185eefc into main May 15, 2026
10 checks passed
@noahgift noahgift deleted the feat/p0-e-arch-metadata-in-pretrain branch May 15, 2026 19:02
noahgift added a commit that referenced this pull request May 16, 2026
…for llama.cpp interop (#1706)

* fix(export): SPEC §82 P0-G — pad tokenizer.ggml.tokens to vocab_size

llama.cpp's check_tensor_dims uses len(tokenizer.ggml.tokens) as the
expected first dim of token_embd.weight. Qwen2.5 models pad embed_tokens
to 151936 for TP-alignment but real tokenizer vocab is 151643 — the 293
delta causes llama-cli to refuse loading APR-exported GGUFs.

Fix: thread `<arch>.vocab_size` into both tokenizer-emission paths
(GgufTokenizer + APR-fallback) and pad with `<|pad_N|>` placeholders
from `len(tokens)` to `vocab_size`. Pass 0 to disable (back-compat for
tests that don't care about model dims).
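The padding rule itself reduces to roughly this (sketch only — the real fix threads `<arch>.vocab_size` into both emission paths, and the placeholder index scheme is an assumption):

```rust
// Pad the token list up to vocab_size with <|pad_N|> placeholders.
fn pad_tokens_to_vocab_size(tokens: &mut Vec<String>, vocab_size: usize) {
    // vocab_size == 0 disables padding (back-compat), and a smaller
    // vocab_size must never truncate real tokens.
    if vocab_size == 0 || vocab_size <= tokens.len() {
        return;
    }
    for i in tokens.len()..vocab_size {
        tokens.push(format!("<|pad_{i}|>"));
    }
}
```

On the §82 fixture this takes the token list from 151643 to 151936, matching the padded embed_tokens first dim.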

Empirically verified end-to-end on SPEC §82's P2-A epoch-020 checkpoint:

  [P0-G] Padding APR-fallback tokenizer.ggml.tokens:
    151643 + 293 placeholders = 151936

Unit tests (6 new):
- test_p0g_pad_tokens_to_vocab_size (GgufTokenizer path)
- test_p0g_no_pad_when_vocab_size_zero (back-compat)
- test_p0g_no_pad_when_vocab_size_equals_tokens
- test_p0g_no_pad_when_vocab_size_smaller (no truncation)
- test_p0g_apr_fallback_pad_tokens_to_vocab_size (APR path)
- test_p0g_apr_fallback_no_pad_when_vocab_size_zero

Discharges AC-SHIP2-010 vocab-size component (next blocker is
P0-H tensor-count mismatch — separate PR).

Methodology lesson #29: Class 3 packaging defects surface in waves.
P0-G is the 4th in 24h (D embed tokenizer, E arch dims, F arch case, G vocab pad).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(spec): SPEC §82 — P2-A 5000-step EARLY_STOP val_loss=4.7111; P0-G LIVE-discharged

§82 records the first long-training MODEL-2 dispatch since §34 (27 days,
60 amendments ago):

- §34 ceiling broken further: 9.38 → 5.36 (§78) → 4.71 (§82)
- P2-A on lambda-vector RTX 4090: 27 epochs / 2700 steps / ~40 min wall
- Best val_loss = 4.7110777 at epoch 20

P0 trio dispatched against epoch-020.apr:
- P0-A apr qa: infra PASS (only golden_output fails — expected for pretrain)
- P0-B apr bench: PASS at 325.1 tok/s with embedded BPE tokenizer + C-03
  metadata satisfied — confirms #1701 P0-D/E fixes live in production
- P0-C step 1 apr export: PASS — confirms #1699 P0-F arch case mapping live
- P0-C step 2 llama-cli: BLOCKED by NEW Class 3 defect P0-G
  (fixed in companion code commit on this branch)
- P0-G fix DISCHARGED end-to-end; surfaces P0-H tensor-count mismatch
  (out of scope for this PR)

AC-SHIP2-* movement:
- AC-SHIP2-009 → DISCHARGED (apr bench works on pretrain ckpt)
- AC-SHIP2-006 → FUNCTIONAL (apr qa infra runs end-to-end)
- AC-SHIP2-010 → vocab-component DISCHARGED via P0-G; blocked on P0-H

MODEL-1 ship %: 100% (unchanged).
MODEL-2 ship %: 77% → 79% (+1 for AC-SHIP2-009; +1 for ceiling break to 4.71).

Methodology lesson #29 NEW: Class 3 packaging defects surface in waves of 4
(not 2). Every downstream tool falsifies its own invariant in the
checkpoint-emission contract.

Evidence: evidence/section-82-p2a-results-2026-05-15/

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(spec): SPEC §82 P0-G/P0-H — add llama-cli failure logs (force-add over .gitignore)


The two llama-cli error logs document the pre-fix (P0-G vocab mismatch)
and post-P0-G-fix (P0-H tensor-count mismatch) states for the §82 evidence.
.gitignore excludes *.log, so force-add is required.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(lint): collapsible_match — flatten nested if-let into find_map pattern

CI failed clippy::collapsible_match on the nested if-let chain in
P0-G's APR-fallback padding path. Rust 2021 edition can't use let
chains, so the cleanest fix is to use find_map with a pattern guard
that returns the inner Vec<String> directly.
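Generic before/after illustration of the pattern (not the actual P0-G code; serde_json::Value and the "merges" key stand in for whatever the real path matches on):

```rust
use std::collections::HashMap;
use serde_json::Value;

// Before (simplified): the outer if-let only exists so the inner if-let can
// destructure the same bound value — the shape clippy::collapsible_match flags.
fn merges_nested(meta: &HashMap<String, Value>) -> Option<Vec<String>> {
    if let Some(value) = meta.get("merges") {
        if let Value::Array(items) = value {
            return Some(items.iter().filter_map(|m| m.as_str().map(String::from)).collect());
        }
    }
    None
}

// After: one find_map arm whose pattern returns the inner Vec<String> directly,
// with no let chains needed on the 2021 edition.
fn merges_flat(meta: &HashMap<String, Value>) -> Option<Vec<String>> {
    meta.get("merges").into_iter().find_map(|value| match value {
        Value::Array(items) => {
            Some(items.iter().filter_map(|m| m.as_str().map(String::from)).collect())
        }
        _ => None,
    })
}
```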

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>