
feat(pretrain): SPEC §81 P0-D + P0-E — embed tokenizer + write arch metadata in apr pretrain output#1701

Merged

noahgift merged 7 commits into main from feat/p0-e-arch-metadata-in-pretrain on May 15, 2026

Conversation

@noahgift
Contributor

Adds hidden_size, num_layers, num_heads, num_kv_heads, intermediate_size, vocab_size, max_position_embeddings, rope_theta, and rms_norm_eps to CudaTransformerTrainer's APR checkpoint output. Closes the C-03 gate that blocked `apr bench`. End-to-end verify on the §78 checkpoint: `apr bench` → 315.5 tok/s (3.15× over the AC-SHIP2-010 100 tok/s floor). AC-SHIP2-010 / FALSIFY-SHIP-020 → DISCHARGED. MODEL-2 ship %: 75% → 77%.

…kpoints

CudaTransformerTrainer::save_apr now writes individual transformer
config keys (hidden_size, num_hidden_layers, num_attention_heads,
num_kv_heads, intermediate_size, vocab_size, max_position_embeddings,
rope_theta, rms_norm_eps) as AprWriter well-known metadata fields.

These map to AprV2Metadata typed fields via build_v2_metadata, which
realizar's gguf::config::from_apr requires (C-03 gate). Before this
fix, MODEL-2 checkpoints from `apr pretrain` failed apr bench with
"C-03: APR model missing 'hidden_size' metadata".

End-to-end verify on §78 fine-tune from Qwen-0.5B init:
  apr inspect --json | jq .metadata →
    hidden_size:    896  ✓
    num_layers:     24   ✓
    num_heads:      14   ✓
    num_kv_heads:   2    ✓
    intermediate_size: 4864 ✓
    vocab_size:     151936 ✓

  apr bench --iterations 5 --max-tokens 128 →
    tokens_per_second: 315.5   ✓ (3.15× over AC-SHIP2-010 100 tok/s floor)
    passed: true               ✓

AC-SHIP2-010 (FALSIFY-SHIP-020) PARTIAL_ALGORITHM_LEVEL → DISCHARGED
on real fine-tuned MODEL-2 checkpoint.

Implementation: replace the legacy save_model() path (which only wrote
model_name + architecture + format + version) with a direct AprWriter
call that adds the arch dim keys. Reuses io::save::infer_all_tensor_shapes
for 2D weight shape handling (now pub(crate)).
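In sketch form, the key set now emitted per checkpoint looks like this (the struct and the to_pairs helper are illustrative assumptions — AprWriter's real API may differ; only the key names and the §78 fixture values come from this commit):

```rust
// Sketch of the arch metadata that save_apr now writes per checkpoint.
struct ArchMetadata {
    hidden_size: u32,             // 896 on the §78 Qwen-0.5B fixture
    num_hidden_layers: u32,       // 24
    num_attention_heads: u32,     // 14
    num_kv_heads: u32,            // 2
    intermediate_size: u32,       // 4864
    vocab_size: u32,              // 151936
    max_position_embeddings: u32,
    rope_theta: f32,
    rms_norm_eps: f32,
}

impl ArchMetadata {
    /// Flatten into (key, value) pairs for the writer's well-known metadata fields.
    fn to_pairs(&self) -> Vec<(&'static str, String)> {
        vec![
            ("hidden_size", self.hidden_size.to_string()),
            ("num_hidden_layers", self.num_hidden_layers.to_string()),
            ("num_attention_heads", self.num_attention_heads.to_string()),
            ("num_kv_heads", self.num_kv_heads.to_string()),
            ("intermediate_size", self.intermediate_size.to_string()),
            ("vocab_size", self.vocab_size.to_string()),
            ("max_position_embeddings", self.max_position_embeddings.to_string()),
            ("rope_theta", self.rope_theta.to_string()),
            ("rms_norm_eps", self.rms_norm_eps.to_string()),
        ]
    }
}
```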

MODEL-2 ship %: 75% → 77%.

Closes §81 P0-E (one of three apr pretrain output metadata gaps;
P0-D embed-tokenizer and P0-F arch-case-mapping are separate PRs).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) May 15, 2026 13:08
…ints

Adds CudaAprCheckpointFn::with_tokenizer_dir() builder + new
CudaTransformerTrainer::save_apr_with_tokenizer() method that reads
tokenizer.json from --tokenizer dir and embeds:

  - tokenizer.vocabulary (151643 entries on §78 Qwen-0.5B fixture)
  - tokenizer.merges     (151387 entries)
  - tokenizer.bos_token_id (parsed from added_tokens for known special
    strings: <s>, <|im_start|>, <|begin_of_text|>)
  - tokenizer.eos_token_id (</s>, <|im_end|>, <|end_of_text|>,
    <|endoftext|>)

apr-cli pretrain.rs now passes --tokenizer through via
.with_tokenizer_dir(&config.tokenizer_dir).
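A minimal sketch of the embedding step, assuming serde_json is available and that tokenizer.json follows the standard Hugging Face layout (model.vocab, model.merges, added_tokens); the real save_apr_with_tokenizer wiring and APR key names may differ:

```rust
use std::{fs, path::Path};

// Sketch only: read tokenizer.json from the --tokenizer dir and pull out the
// pieces embedded by this commit (vocab size, merge count, bos/eos ids).
fn read_tokenizer_parts(
    dir: &Path,
) -> Result<(usize, usize, Option<u64>, Option<u64>), Box<dyn std::error::Error>> {
    let raw = fs::read_to_string(dir.join("tokenizer.json"))?;
    let json: serde_json::Value = serde_json::from_str(&raw)?;

    // tokenizer.vocabulary — BPE vocab map under model.vocab
    let vocab_len = json["model"]["vocab"].as_object().map_or(0, |m| m.len());
    // tokenizer.merges — BPE merge list under model.merges
    let merges_len = json["model"]["merges"].as_array().map_or(0, |a| a.len());

    // BOS / EOS ids: scan added_tokens for the known special strings
    let find_id = |names: &[&str]| -> Option<u64> {
        json["added_tokens"].as_array()?.iter().find_map(|tok| {
            let content = tok["content"].as_str()?;
            if names.contains(&content) { tok["id"].as_u64() } else { None }
        })
    };
    let bos = find_id(&["<s>", "<|im_start|>", "<|begin_of_text|>"]);
    let eos = find_id(&["</s>", "<|im_end|>", "<|end_of_text|>", "<|endoftext|>"]);

    Ok((vocab_len, merges_len, bos, eos))
}
```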

Empirical verify on §78 fine-tune from Qwen-0.5B init (100 steps):
  apr run <checkpoint>.apr "def fib(n):" →
    [PMAT-171] Loaded embedded BPE tokenizer: 151643 vocab, 151387
               merges, 3 special tokens

  apr qa <checkpoint>.apr --skip-throughput ... →
    Previously: "Validation failed: APR missing embedded tokenizer"
    Now:        Gates execute; only golden_output fails (separate issue
                — model output quality at val_loss=6.56 with 100 steps).

Closes §81 P0-D (one of three apr pretrain output metadata gaps;
P0-E arch metadata is in PR #1701, P0-F arch case mapping in PR #1699).

After this PR, apr qa runs against MODEL-2 checkpoints WITHOUT the
external --tokenizer requirement — checkpoints are now self-contained
per the AC-SHIP2-005 / .apr format spec.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift changed the title feat(pretrain): SPEC §81 P0-E — write arch metadata keys to .apr checkpoints feat(pretrain): SPEC §81 P0-D + P0-E — embed tokenizer + write arch metadata in apr pretrain output May 15, 2026
noahgift added a commit that referenced this pull request May 15, 2026
…0 packaging gaps (#1702)

Triple-amendment to SPEC-SHIP-TWO-001 capturing the §78 → §80 dispatch
arc that revealed a Class 3 packaging-defect wave in apr pretrain output.

Consolidates the content of PRs #1695 (§79), #1697 (§80), #1698 (§81)
which were all DIRTY against main due to overlapping spec-header edits.

§79 — External audit + Five-Whys retrospective on MODEL-2 convergence
  Synthesizes docs/specifications/two-model-spec-audit.md. Identifies
  three compounding root causes for the val_loss=9.75 plateau:
    1. Data starvation (0.24% of Chinchilla-optimal token count)
    2. False plateau hypothesis (LR-budget falsification)
    3. Infrastructure masking bugs (silent CPU fallback, exhaustion
       placeholder, premature early-stop)
  Five-Whys for Case A (silent corpus exhaustion), Case B (early stop),
  Case C (val_loss=9.75 plateau). Reconciles audit Recommendations 1-3
  vs §78's §49-pivot path.

§80 — Prioritized open-follow-up backlog
  Ranks all open SHIP-TWO-001 work by ship-% delta ÷ effort. P0 trio
  (apr qa / bench / export against epoch-004.apr) + P1 Chinchilla gate
  + P1 python validity + P1 HumanEval + P2-A long train = MODEL-2
  theoretical ceiling 92% at ~6-10h RTX 4090 compute.

§81 — P0 dispatch surfaced 3 systemic packaging-defect gaps
  Dispatching §80's P0 trio against §78's epoch-004.apr revealed:
    - P0-A apr qa     → "APR missing embedded tokenizer"
    - P0-B apr bench  → "C-03: APR model missing 'hidden_size' metadata"
    - P0-C apr export → PASSED, but llama-cli refused with
                        "unknown model architecture: 'LlamaForCausalLM'"
                        (GGUF expects lowercase "llama")
  Companion code PRs:
    - #1699 P0-F      → HF→GGUF arch case mapping in apr export
    - #1701 P0-D + P0-E → embed tokenizer + write arch metadata in
                          apr pretrain output
  AC-SHIP2-010 → DISCHARGED (315.5 tok/s on Qwen-0.5B fine-tune;
  3.15× over the 100 tok/s floor).

Methodology lessons added:
  #26 NEW: Three-class root-cause taxonomy for ML convergence failures
          (data starvation / optimization defects / infrastructure
          masking). Diagnose which class is binding before tuning.
  #27 NEW: Prioritize by ship-% delta ÷ effort, not alphabetical AC
          order. P0 dispatches are 0.1% the compute cost of P2-A.
  #28 NEW: Class 3 defects come in waves. Training works ≠ checkpoint
          is usable. Each lifecycle stage needs its own surfacing
          dispatch.

Ship-% movement:
  MODEL-1: 100% (unchanged)
  MODEL-2: 75% (unchanged in this PR; +2pp expected on #1701 merge)

Spec v3.24.0 → v3.27.0.

Replaces PRs #1695, #1697, #1698 (all DIRTY against main).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift merged commit 185eefc into main May 15, 2026
10 checks passed
@noahgift noahgift deleted the feat/p0-e-arch-metadata-in-pretrain branch May 15, 2026 19:02
noahgift added a commit that referenced this pull request May 16, 2026
…for llama.cpp interop (#1706)

* fix(export): SPEC §82 P0-G — pad tokenizer.ggml.tokens to vocab_size

llama.cpp's check_tensor_dims uses len(tokenizer.ggml.tokens) as the
expected first dim of token_embd.weight. Qwen2.5 models pad embed_tokens
to 151936 for TP-alignment but real tokenizer vocab is 151643 — the 293
delta causes llama-cli to refuse loading APR-exported GGUFs.

Fix: thread `<arch>.vocab_size` into both tokenizer-emission paths
(GgufTokenizer + APR-fallback) and pad with `<|pad_N|>` placeholders
from `len(tokens)` to `vocab_size`. Pass 0 to disable (back-compat for
tests that don't care about model dims).
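The padding rule itself reduces to roughly this (sketch only — the real fix threads `<arch>.vocab_size` into both emission paths, and the placeholder index scheme is an assumption):

```rust
// Pad the token list up to vocab_size with <|pad_N|> placeholders.
fn pad_tokens_to_vocab_size(tokens: &mut Vec<String>, vocab_size: usize) {
    // vocab_size == 0 disables padding (back-compat), and a smaller
    // vocab_size must never truncate real tokens.
    if vocab_size == 0 || vocab_size <= tokens.len() {
        return;
    }
    for i in tokens.len()..vocab_size {
        tokens.push(format!("<|pad_{i}|>"));
    }
}
```

On the §82 fixture this takes the token list from 151643 to 151936, matching the padded embed_tokens first dim.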

Empirically verified end-to-end on SPEC §82's P2-A epoch-020 checkpoint:

  [P0-G] Padding APR-fallback tokenizer.ggml.tokens:
    151643 + 293 placeholders = 151936

Unit tests (6 new):
- test_p0g_pad_tokens_to_vocab_size (GgufTokenizer path)
- test_p0g_no_pad_when_vocab_size_zero (back-compat)
- test_p0g_no_pad_when_vocab_size_equals_tokens
- test_p0g_no_pad_when_vocab_size_smaller (no truncation)
- test_p0g_apr_fallback_pad_tokens_to_vocab_size (APR path)
- test_p0g_apr_fallback_no_pad_when_vocab_size_zero

Discharges AC-SHIP2-010 vocab-size component (next blocker is
P0-H tensor-count mismatch — separate PR).

Methodology lesson #29: Class 3 packaging defects surface in waves.
P0-G is the 4th in 24h (D embed tokenizer, E arch dims, F arch case, G vocab pad).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(spec): SPEC §82 — P2-A 5000-step EARLY_STOP val_loss=4.7111; P0-G LIVE-discharged

§82 records the first long-training MODEL-2 dispatch since §34 (27 days,
60 amendments ago):

- §34 ceiling broken further: 9.38 → 5.36 (§78) → 4.71 (§82)
- P2-A on lambda-vector RTX 4090: 27 epochs / 2700 steps / ~40 min wall
- Best val_loss = 4.7110777 at epoch 20

P0 trio dispatched against epoch-020.apr:
- P0-A apr qa: infra PASS (only golden_output fails — expected for pretrain)
- P0-B apr bench: PASS at 325.1 tok/s with embedded BPE tokenizer + C-03
  metadata satisfied — confirms #1701 P0-D/E fixes live in production
- P0-C step 1 apr export: PASS — confirms #1699 P0-F arch case mapping live
- P0-C step 2 llama-cli: BLOCKED by NEW Class 3 defect P0-G
  (fixed in companion code commit on this branch)
- P0-G fix DISCHARGED end-to-end; surfaces P0-H tensor-count mismatch
  (out of scope for this PR)

AC-SHIP2-* movement:
- AC-SHIP2-009 → DISCHARGED (apr bench works on pretrain ckpt)
- AC-SHIP2-006 → FUNCTIONAL (apr qa infra runs end-to-end)
- AC-SHIP2-010 → vocab-component DISCHARGED via P0-G; blocked on P0-H

MODEL-1 ship %: 100% (unchanged).
MODEL-2 ship %: 77% → 79% (+1 for AC-SHIP2-009; +1 for ceiling break to 4.71).

Methodology lesson #29 NEW: Class 3 packaging defects surface in waves of 4
(not 2). Every downstream tool falsifies its own invariant in the
checkpoint-emission contract.

Evidence: evidence/section-82-p2a-results-2026-05-15/

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(spec): SPEC §82 P0-G/P0-H — add llama-cli failure logs (force-add over .gitignore)


The two llama-cli error logs document the pre-fix (P0-G vocab mismatch)
and post-P0-G-fix (P0-H tensor-count mismatch) states for the §82 evidence.
.gitignore excludes *.log, so force-add is required.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(lint): collapsible_match — flatten nested if-let into find_map pattern

CI failed clippy::collapsible_match on the nested if-let chain in
P0-G's APR-fallback padding path. Rust 2021 edition can't use let
chains, so the cleanest fix is to use find_map with a pattern guard
that returns the inner Vec<String> directly.
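Generic before/after illustration of the pattern (not the actual P0-G code; serde_json::Value and the "merges" key stand in for whatever the real path matches on):

```rust
use std::collections::HashMap;
use serde_json::Value;

// Before (simplified): the outer if-let only exists so the inner if-let can
// destructure the same bound value — the shape clippy::collapsible_match flags.
fn merges_nested(meta: &HashMap<String, Value>) -> Option<Vec<String>> {
    if let Some(value) = meta.get("merges") {
        if let Value::Array(items) = value {
            return Some(items.iter().filter_map(|m| m.as_str().map(String::from)).collect());
        }
    }
    None
}

// After: one find_map arm whose pattern returns the inner Vec<String> directly,
// with no let chains needed on the 2021 edition.
fn merges_flat(meta: &HashMap<String, Value>) -> Option<Vec<String>> {
    meta.get("merges").into_iter().find_map(|value| match value {
        Value::Array(items) => {
            Some(items.iter().filter_map(|m| m.as_str().map(String::from)).collect())
        }
        _ => None,
    })
}
```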

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>