feat(pretrain): SPEC §81 P0-D + P0-E — embed tokenizer + write arch metadata in apr pretrain output #1701
Merged
…kpoints
CudaTransformerTrainer::save_apr now writes individual transformer
config keys (hidden_size, num_hidden_layers, num_attention_heads,
num_kv_heads, intermediate_size, vocab_size, max_position_embeddings,
rope_theta, rms_norm_eps) as AprWriter well-known metadata fields.
These map to AprV2Metadata typed fields via build_v2_metadata, which
realizar's gguf::config::from_apr requires (C-03 gate). Before this
fix, MODEL-2 checkpoints from `apr pretrain` failed apr bench with
"C-03: APR model missing 'hidden_size' metadata".
End-to-end verification on the §78 fine-tune from Qwen-0.5B init:
apr inspect --json | jq .metadata →
hidden_size: 896 ✓
num_layers: 24 ✓
num_heads: 14 ✓
num_kv_heads: 2 ✓
intermediate_size: 4864 ✓
vocab_size: 151936 ✓
apr bench --iterations 5 --max-tokens 128 →
tokens_per_second: 315.5 ✓ (3.15× over AC-SHIP2-010 100 tok/s floor)
passed: true ✓
AC-SHIP2-010 (FALSIFY-SHIP-020) PARTIAL_ALGORITHM_LEVEL → DISCHARGED
on real fine-tuned MODEL-2 checkpoint.
Implementation: replace the legacy save_model() path (which only wrote
model_name + architecture + format + version) with a direct AprWriter
call that adds the arch dim keys. Reuses io::save::infer_all_tensor_shapes
for 2D weight shape handling (now pub(crate)).
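The arch-metadata write can be sketched as below. This is a minimal sketch, assuming a simple key/value `set_metadata` API — the real `AprWriter` interface and `TransformerConfig` shape are not shown in this PR, so both types here are illustrative stand-ins; only the nine key names come from the commit message.

```rust
// Illustrative stand-in for AprWriter's well-known-metadata API.
struct AprWriter {
    metadata: Vec<(String, String)>,
}

impl AprWriter {
    fn new() -> Self {
        Self { metadata: Vec::new() }
    }
    fn set_metadata(&mut self, key: &str, value: String) {
        self.metadata.push((key.to_string(), value));
    }
}

// Hypothetical config carrying the nine architecture dimensions listed above.
struct TransformerConfig {
    hidden_size: u32,
    num_hidden_layers: u32,
    num_attention_heads: u32,
    num_kv_heads: u32,
    intermediate_size: u32,
    vocab_size: u32,
    max_position_embeddings: u32,
    rope_theta: f32,
    rms_norm_eps: f32,
}

/// Write every arch dim key the C-03 gate checks for; downstream
/// gguf::config::from_apr fails fast when hidden_size is absent.
fn write_arch_metadata(w: &mut AprWriter, cfg: &TransformerConfig) {
    w.set_metadata("hidden_size", cfg.hidden_size.to_string());
    w.set_metadata("num_hidden_layers", cfg.num_hidden_layers.to_string());
    w.set_metadata("num_attention_heads", cfg.num_attention_heads.to_string());
    w.set_metadata("num_kv_heads", cfg.num_kv_heads.to_string());
    w.set_metadata("intermediate_size", cfg.intermediate_size.to_string());
    w.set_metadata("vocab_size", cfg.vocab_size.to_string());
    w.set_metadata("max_position_embeddings", cfg.max_position_embeddings.to_string());
    w.set_metadata("rope_theta", cfg.rope_theta.to_string());
    w.set_metadata("rms_norm_eps", cfg.rms_norm_eps.to_string());
}
```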
MODEL-2 ship %: 75% → 77%.
Closes §81 P0-E (one of three apr pretrain output metadata gaps;
P0-D embed-tokenizer and P0-F arch-case-mapping are separate PRs).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ints
Adds CudaAprCheckpointFn::with_tokenizer_dir() builder + new
CudaTransformerTrainer::save_apr_with_tokenizer() method that reads
tokenizer.json from --tokenizer dir and embeds:
- tokenizer.vocabulary (151643 entries on §78 Qwen-0.5B fixture)
- tokenizer.merges (151387 entries)
- tokenizer.bos_token_id (parsed from added_tokens for known special
strings: <s>, <|im_start|>, <|begin_of_text|>)
- tokenizer.eos_token_id (</s>, <|im_end|>, <|end_of_text|>,
<|endoftext|>)
apr-cli pretrain.rs now passes --tokenizer through via
.with_tokenizer_dir(&config.tokenizer_dir).
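The special-token resolution described above can be sketched as follows. The `AddedToken` struct is a hypothetical stand-in mirroring the shape of `added_tokens` entries in a HF `tokenizer.json`; the candidate strings are the ones listed in this commit. Candidates are searched in priority order so that, e.g., `<|im_end|>` wins over a plain `<|endoftext|>` when both are present.

```rust
// Hypothetical mirror of one entry in tokenizer.json's `added_tokens` array.
struct AddedToken {
    id: u32,
    content: String,
}

// Known special-token strings checked by the embed step (from the commit text).
const BOS_CANDIDATES: &[&str] = &["<s>", "<|im_start|>", "<|begin_of_text|>"];
const EOS_CANDIDATES: &[&str] = &["</s>", "<|im_end|>", "<|end_of_text|>", "<|endoftext|>"];

/// Resolve a special-token id by scanning candidates in priority order:
/// earlier entries in `candidates` take precedence over later ones.
fn find_special_id(added: &[AddedToken], candidates: &[&str]) -> Option<u32> {
    candidates
        .iter()
        .find_map(|c| added.iter().find(|t| t.content == *c).map(|t| t.id))
}
```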
Empirical verification on the §78 fine-tune from Qwen-0.5B init (100 steps):
apr run <checkpoint>.apr "def fib(n):" →
[PMAT-171] Loaded embedded BPE tokenizer: 151643 vocab, 151387
merges, 3 special tokens
apr qa <checkpoint>.apr --skip-throughput ... →
Previously: "Validation failed: APR missing embedded tokenizer"
Now: Gates execute; only golden_output fails (separate issue
— model output quality at val_loss=6.56 with 100 steps).
Closes §81 P0-D (one of three apr pretrain output metadata gaps;
P0-E arch metadata is in PR #1701, P0-F arch case mapping in PR #1699).
After this PR, apr qa runs against MODEL-2 checkpoints WITHOUT the
external --tokenizer requirement — checkpoints are now self-contained
per the AC-SHIP2-005 / .apr format spec.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request on May 15, 2026:
…0 packaging gaps (#1702)

Triple amendment to SPEC-SHIP-TWO-001 capturing the §78 → §80 dispatch arc that revealed a Class 3 packaging-defect wave in apr pretrain output. Consolidates the content of PRs #1695 (§79), #1697 (§80), and #1698 (§81), which were all DIRTY against main due to overlapping spec-header edits.

§79 — External audit + Five-Whys retrospective on MODEL-2 convergence
Synthesizes docs/specifications/two-model-spec-audit.md. Identifies three compounding root causes for the val_loss=9.75 plateau:
1. Data starvation (0.24% of the Chinchilla-optimal token count)
2. False plateau hypothesis (LR-budget falsification)
3. Infrastructure masking bugs (silent CPU fallback, exhaustion placeholder, premature early stop)
Five-Whys for Case A (silent corpus exhaustion), Case B (early stop), and Case C (val_loss=9.75 plateau). Reconciles audit Recommendations 1-3 against §78's §49-pivot path.

§80 — Prioritized open-follow-up backlog
Ranks all open SHIP-TWO-001 work by ship-% delta ÷ effort. P0 trio (apr qa / bench / export against epoch-004.apr) + P1 Chinchilla gate + P1 Python validity + P1 HumanEval + P2-A long train = MODEL-2 theoretical ceiling of 92% at ~6-10 h of RTX 4090 compute.

§81 — P0 dispatch surfaced 3 systemic packaging-defect gaps
Dispatching §80's P0 trio against §78's epoch-004.apr revealed:
- P0-A apr qa → "APR missing embedded tokenizer"
- P0-B apr bench → "C-03: APR model missing 'hidden_size' metadata"
- P0-C apr export → PASSED, but llama-cli refused with "unknown model architecture: 'LlamaForCausalLM'" (GGUF expects lowercase "llama")

Companion code PRs:
- #1699 P0-F → HF→GGUF arch case mapping in apr export
- #1701 P0-D + P0-E → embed tokenizer + write arch metadata in apr pretrain output

AC-SHIP2-010 → DISCHARGED (315.5 tok/s on the Qwen-0.5B fine-tune; 3.15× over the 100 tok/s floor).

Methodology lessons added:
- #26 NEW: Three-class root-cause taxonomy for ML convergence failures (data starvation / optimization defects / infrastructure masking). Diagnose which class is binding before tuning.
- #27 NEW: Prioritize by ship-% delta ÷ effort, not alphabetical AC order. P0 dispatches are 0.1% the compute cost of P2-A.
- #28 NEW: Class 3 defects come in waves. Training works ≠ checkpoint is usable. Each lifecycle stage needs its own surfacing dispatch.

Ship-% movement:
- MODEL-1: 100% (unchanged)
- MODEL-2: 75% (unchanged in this PR; +2 pp expected on #1701 merge)

Spec v3.24.0 → v3.27.0. Replaces PRs #1695, #1697, #1698 (all DIRTY against main).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
This was referenced May 15, 2026
noahgift added a commit that referenced this pull request on May 16, 2026:
…for llama.cpp interop (#1706)

* fix(export): SPEC §82 P0-G — pad tokenizer.ggml.tokens to vocab_size

llama.cpp's check_tensor_dims uses len(tokenizer.ggml.tokens) as the expected first dim of token_embd.weight. Qwen2.5 models pad embed_tokens to 151936 for TP alignment, but the real tokenizer vocab is 151643 — the 293-entry delta causes llama-cli to refuse to load APR-exported GGUFs.

Fix: thread `<arch>.vocab_size` into both tokenizer-emission paths (GgufTokenizer + APR fallback) and pad with `<|pad_N|>` placeholders from `len(tokens)` to `vocab_size`. Pass 0 to disable (back-compat for tests that don't care about model dims).

Empirically verified end-to-end on SPEC §82's P2-A epoch-020 checkpoint:
[P0-G] Padding APR-fallback tokenizer.ggml.tokens: 151643 + 293 placeholders = 151936

Unit tests (6 new):
- test_p0g_pad_tokens_to_vocab_size (GgufTokenizer path)
- test_p0g_no_pad_when_vocab_size_zero (back-compat)
- test_p0g_no_pad_when_vocab_size_equals_tokens
- test_p0g_no_pad_when_vocab_size_smaller (no truncation)
- test_p0g_apr_fallback_pad_tokens_to_vocab_size (APR path)
- test_p0g_apr_fallback_no_pad_when_vocab_size_zero

Discharges the AC-SHIP2-010 vocab-size component (next blocker is the P0-H tensor-count mismatch — separate PR).

Methodology lesson #29: Class 3 packaging defects surface in waves. P0-G is the 4th in 24 h (D embed tokenizer, E arch dims, F arch case, G vocab pad).
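The P0-G padding rule can be sketched as below. This is a minimal sketch under stated assumptions: the function name and the `<|pad_N|>` numbering scheme (continuing from the original token count) are inferred from the commit text, not taken from the actual export code.

```rust
/// Pad the GGUF token list with `<|pad_N|>` placeholders up to `vocab_size`
/// so that len(tokenizer.ggml.tokens) matches token_embd.weight's first dim.
/// vocab_size == 0 disables padding (back-compat), and a vocab_size smaller
/// than the token list never truncates.
fn pad_tokens_to_vocab_size(tokens: &mut Vec<String>, vocab_size: usize) {
    if vocab_size == 0 || vocab_size <= tokens.len() {
        return; // disabled, already aligned, or would require truncation
    }
    for i in tokens.len()..vocab_size {
        // Assumed numbering: placeholder ids continue from the real vocab.
        tokens.push(format!("<|pad_{i}|>"));
    }
}
```

On the §82 Qwen2.5 checkpoint this corresponds to 151643 real tokens + 293 placeholders = 151936 entries.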
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(spec): SPEC §82 — P2-A 5000-step EARLY_STOP val_loss=4.7111; P0-G LIVE-discharged

§82 records the first long-training MODEL-2 dispatch since §34 (27 days, 60 amendments ago):
- §34 ceiling broken further: 9.38 → 5.36 (§78) → 4.71 (§82)
- P2-A on lambda-vector RTX 4090: 27 epochs / 2700 steps / ~40 min wall
- Best val_loss = 4.7110777 at epoch 20

P0 trio dispatched against epoch-020.apr:
- P0-A apr qa: infra PASS (only golden_output fails — expected for pretrain)
- P0-B apr bench: PASS at 325.1 tok/s with embedded BPE tokenizer + C-03 metadata satisfied — confirms the #1701 P0-D/E fixes are live in production
- P0-C step 1, apr export: PASS — confirms the #1699 P0-F arch case mapping is live
- P0-C step 2, llama-cli: BLOCKED by NEW Class 3 defect P0-G (fixed in the companion code commit on this branch)
- P0-G fix DISCHARGED end-to-end; surfaces the P0-H tensor-count mismatch (out of scope for this PR)

AC-SHIP2-* movement:
- AC-SHIP2-009 → DISCHARGED (apr bench works on a pretrain checkpoint)
- AC-SHIP2-006 → FUNCTIONAL (apr qa infra runs end-to-end)
- AC-SHIP2-010 → vocab component DISCHARGED via P0-G; blocked on P0-H

MODEL-1 ship %: 100% (unchanged). MODEL-2 ship %: 77% → 79% (+1 for AC-SHIP2-009; +1 for the ceiling break to 4.71).

Methodology lesson #29 NEW: Class 3 packaging defects surface in waves of 4 (not 2). Every downstream tool falsifies its own invariant in the checkpoint-emission contract.

Evidence: evidence/section-82-p2a-results-2026-05-15/

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(spec): SPEC §82 P0-G/P0-H — add llama-cli failure logs (force-add over .gitignore)

The two llama-cli error logs document the pre-fix (P0-G vocab mismatch) and post-P0-G-fix (P0-H tensor-count mismatch) states for §82 evidence. .gitignore excludes *.log, so force-add is required.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(lint): collapsible_match — flatten nested if-let into find_map pattern

CI failed clippy::collapsible_match on the nested if-let chain in P0-G's APR-fallback padding path. The Rust 2021 edition can't use let chains, so the cleanest fix is find_map with a pattern guard that returns the inner Vec<String> directly.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
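The shape of that lint fix can be illustrated as below. The `MetaValue` enum and the metadata-entry layout are hypothetical stand-ins for the actual APR-fallback types; only the flattening pattern itself is what the commit describes.

```rust
// Hypothetical metadata value type standing in for the APR-fallback entries.
enum MetaValue {
    Tokens(Vec<String>),
    Other(String),
}

/// Flattened form: one find_map with a tuple match replaces the nested
/// if-let chain that clippy::collapsible_match rejected.
fn find_tokens(entries: &[(String, MetaValue)]) -> Option<&Vec<String>> {
    // Before (triggers collapsible_match; Rust 2021 has no let chains):
    //   for (k, v) in entries {
    //       if k == "tokenizer.ggml.tokens" {
    //           if let MetaValue::Tokens(t) = v { return Some(t); }
    //       }
    //   }
    //   None
    entries.iter().find_map(|(k, v)| match (k.as_str(), v) {
        ("tokenizer.ggml.tokens", MetaValue::Tokens(t)) => Some(t),
        _ => None,
    })
}
```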
Adds hidden_size + num_layers + num_heads + num_kv_heads + intermediate_size + vocab_size + max_position_embeddings + rope_theta + rms_norm_eps to CudaTransformerTrainer's APR checkpoint output. Closes the C-03 gate that blocked apr bench. End-to-end verify on §78 checkpoint: apr bench → 315.5 tok/s (3.15× over AC-SHIP2-010 100 tok/s floor). AC-SHIP2-010 / FALSIFY-SHIP-020 → DISCHARGED. MODEL-2 ship %: 75% → 77%.