
fix(export): SPEC §82 P0-G — pad tokenizer.ggml.tokens to vocab_size for llama.cpp interop #1706

Merged
noahgift merged 12 commits into main from fix/p0g-gguf-vocab-pad
May 16, 2026

Conversation

@noahgift
Contributor

Summary

  • P0-G fix: pad GGUF tokenizer.ggml.tokens to match <arch>.vocab_size so llama.cpp's check_tensor_dims accepts the corresponding token_embd.weight first dim. Threads vocab_size through both build_tokenizer_gguf_metadata (GgufTokenizer path) and extract_apr_tokenizer_for_gguf (APR-fallback path), padding with <|pad_N|> placeholders. A minimal sketch of the padding rule follows this list.
  • SPEC §82 amendment: records the first long-training MODEL-2 dispatch since §34 — P2-A 5000-step EARLY_STOP val_loss=4.7111 at epoch 20 on lambda-vector RTX 4090. §34 ceiling broken further: 9.38 → 5.36 (§78) → 4.71 (§82). MODEL-2 ship % 77% → 79%.
  • AC-SHIP2-009 LIVE-DISCHARGED: apr bench PASSED at 325.1 tok/s with embedded BPE tokenizer + C-03 metadata on pretrain checkpoint.
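
The padding rule both emission paths now share, as a minimal sketch — the function name and logging shape are illustrative, not the actual aprender-core API:

```rust
/// Pad `tokens` with `<|pad_N|>` placeholders up to `vocab_size`.
/// `vocab_size == 0` disables padding (back-compat), and a `vocab_size`
/// no larger than the existing token list is left untouched (no truncation).
fn pad_tokens_to_vocab_size(tokens: &mut Vec<String>, vocab_size: usize) {
    if vocab_size == 0 || vocab_size <= tokens.len() {
        return;
    }
    let missing = vocab_size - tokens.len();
    eprintln!(
        "[P0-G] Padding tokenizer.ggml.tokens: {} + {} placeholders = {}",
        tokens.len(),
        missing,
        vocab_size
    );
    for n in 0..missing {
        tokens.push(format!("<|pad_{n}|>"));
    }
}
```

For the Qwen2.5 case this pads the 151643 real tokens with 293 placeholders to reach the 151936 first dim of token_embd.weight.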

Empirical verification

Before fix:

llama_model_load: error loading model: check_tensor_dims:
  tensor 'token_embd.weight' has wrong shape;
  expected   896, 151643, got   896, 151936

After fix (with new apr binary, re-exporting same checkpoint):

[P0-G] Padding APR-fallback tokenizer.ggml.tokens:
  151643 + 293 placeholders = 151936

Llama.cpp accepts the vocab metadata; next blocker (P0-H tensor count mismatch) is out of scope for this PR.

Test plan

  • 6 new unit tests cover both code paths and back-compat (vocab_size=0)
  • Existing 170 export tests stay green
  • End-to-end re-export of P2-A epoch-020.apr → llama-cli now passes vocab check
  • cargo test -p aprender-core --lib p0g_ → 6/6 PASS

Methodology

Lesson #29 NEW: Class 3 packaging defects surface in waves of 4 (not 2):

Each downstream tool falsifies its own invariant in the checkpoint-emission contract.

🤖 Generated with Claude Code

noahgift and others added 3 commits May 16, 2026 00:01
llama.cpp's check_tensor_dims uses len(tokenizer.ggml.tokens) as the
expected first dim of token_embd.weight. Qwen2.5 models pad embed_tokens
to 151936 for TP-alignment but real tokenizer vocab is 151643 — the 293
delta causes llama-cli to refuse loading APR-exported GGUFs.

Fix: thread `<arch>.vocab_size` into both tokenizer-emission paths
(GgufTokenizer + APR-fallback) and pad with `<|pad_N|>` placeholders
from `len(tokens)` to `vocab_size`. Pass 0 to disable (back-compat for
tests that don't care about model dims).

Empirically verified end-to-end on SPEC §82's P2-A epoch-020 checkpoint:

  [P0-G] Padding APR-fallback tokenizer.ggml.tokens:
    151643 + 293 placeholders = 151936

Unit tests (6 new):
- test_p0g_pad_tokens_to_vocab_size (GgufTokenizer path)
- test_p0g_no_pad_when_vocab_size_zero (back-compat)
- test_p0g_no_pad_when_vocab_size_equals_tokens
- test_p0g_no_pad_when_vocab_size_smaller (no truncation)
- test_p0g_apr_fallback_pad_tokens_to_vocab_size (APR path)
- test_p0g_apr_fallback_no_pad_when_vocab_size_zero
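
The first of these, as a hedged sketch against the illustrative `pad_tokens_to_vocab_size` helper from the PR summary (the real test in aprender-core targets the actual metadata builder):

```rust
#[test]
fn test_p0g_pad_tokens_to_vocab_size() {
    // Small numbers stand in for the real 151643 -> 151936 case.
    let mut tokens: Vec<String> = (0..5).map(|i| format!("tok_{i}")).collect();
    pad_tokens_to_vocab_size(&mut tokens, 8);
    assert_eq!(tokens.len(), 8);
    assert_eq!(tokens[5], "<|pad_0|>");
    assert_eq!(tokens[7], "<|pad_2|>");
}
```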

Discharges AC-SHIP2-010 vocab-size component (next blocker is
P0-H tensor-count mismatch — separate PR).

Methodology lesson #29: Class 3 packaging defects surface in waves.
P0-G is the 4th in 24h (D embed tokenizer, E arch dims, F arch case, G vocab pad).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…G LIVE-discharged

§82 records the first long-training MODEL-2 dispatch since §34 (27 days,
60 amendments ago):

- §34 ceiling broken further: 9.38 → 5.36 (§78) → 4.71 (§82)
- P2-A on lambda-vector RTX 4090: 27 epochs / 2700 steps / ~40 min wall
- Best val_loss = 4.7110777 at epoch 20

P0 trio dispatched against epoch-020.apr:
- P0-A apr qa: infra PASS (only golden_output fails — expected for pretrain)
- P0-B apr bench: PASS at 325.1 tok/s with embedded BPE tokenizer + C-03
  metadata satisfied — confirms #1701 P0-D/E fixes live in production
- P0-C step 1 apr export: PASS — confirms #1699 P0-F arch case mapping live
- P0-C step 2 llama-cli: BLOCKED by NEW Class 3 defect P0-G
  (fixed in companion code commit on this branch)
- P0-G fix DISCHARGED end-to-end; surfaces P0-H tensor-count mismatch
  (out of scope for this PR)

AC-SHIP2-* movement:
- AC-SHIP2-009 → DISCHARGED (apr bench works on pretrain ckpt)
- AC-SHIP2-006 → FUNCTIONAL (apr qa infra runs end-to-end)
- AC-SHIP2-010 → vocab-component DISCHARGED via P0-G; blocked on P0-H

MODEL-1 ship %: 100% (unchanged).
MODEL-2 ship %: 77% → 79% (+1 for AC-SHIP2-009; +1 for ceiling break to 4.71).

Methodology lesson #29 NEW: Class 3 packaging defects surface in waves of 4
(not 2). Every downstream tool falsifies its own invariant in the
checkpoint-emission contract.

Evidence: evidence/section-82-p2a-results-2026-05-15/

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…d over .gitignore)

The two llamacli error logs document the pre-fix (P0-G vocab mismatch)
and post-P0G-fix (P0-H tensor count mismatch) states for §82 evidence.
.gitignore excludes *.log so force-add is required.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) May 15, 2026 22:03
noahgift and others added 2 commits May 16, 2026 00:03
…ttern

CI failed on clippy::collapsible_match for the nested if-let chain in
P0-G's APR-fallback padding path. Let chains aren't available on the
Rust 2021 edition, so the cleanest fix is find_map with a pattern guard
that returns the inner Vec<String> directly.
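
Roughly the shape of the change, with an illustrative section enum standing in for the real APR metadata type:

```rust
// Illustrative stand-in for the APR-fallback metadata the exporter walks.
enum AprSection {
    Tokenizer { tokens: Vec<String> },
    Other,
}

// Instead of the nested if-let chain clippy flagged, find_map yields the
// inner Vec<String> from the matching section in a single pass.
fn fallback_tokens(sections: &[AprSection]) -> Option<&Vec<String>> {
    sections.iter().find_map(|s| match s {
        AprSection::Tokenizer { tokens } => Some(tokens),
        _ => None,
    })
}
```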

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 16, 2026
… --init model (#1709)

When `apr pretrain --init <qwen2.apr>` fine-tunes a Qwen2 model, the
trainer was hardcoded to stamp `("llama-370m-pretrain", "LlamaForCausalLM")`
regardless of what the init model actually was. Downstream `apr export
--format gguf` then routed through the llama-family GGUF mapper, which
has no mapping for Qwen2's per-layer biases (q_proj_bias, k_proj_bias,
v_proj_bias × 24 layers = 72 tensors). Those biases fell through to
passthrough names like `model.layers.0.self_attn.q_proj.bias`, got
counted in the GGUF header (291 total), but llama.cpp's llama-arch
loader silently skipped them → `done_getting_tensors: wrong number
of tensors; expected 291, got 219`.

The fix derives `name` and `architecture` from `init_arch`:
- Qwen2 init → ("qwen2-pretrain", "Qwen2ForCausalLM")
- Other init → ("<hf_model_type>-pretrain", "<hf_architecture>")
- No init → ("llama-370m-pretrain", "LlamaForCausalLM") [back-compat]

Once stamped correctly, the qwen2 GGUF family mapper handles biases via
its `q_proj_bias: "attn_q.bias"` rules and the tensor count matches.
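
A hedged sketch of the stamping rule — `InitArch` and its fields are placeholders for whatever metadata the trainer actually carries from the init model:

```rust
// Placeholder for the init-model metadata available to the trainer.
struct InitArch {
    hf_model_type: Option<String>,   // e.g. "qwen2"
    hf_architecture: Option<String>, // e.g. "Qwen2ForCausalLM"
}

/// Derive the checkpoint (name, architecture) stamp from the init model,
/// keeping the legacy llama stamp when pretraining from scratch or when
/// the init model carries no HF fields.
fn checkpoint_name_and_arch(init: Option<&InitArch>) -> (String, String) {
    const DEFAULT: (&str, &str) = ("llama-370m-pretrain", "LlamaForCausalLM");
    match init {
        Some(arch) => match (&arch.hf_model_type, &arch.hf_architecture) {
            (Some(mt), Some(a)) => (format!("{mt}-pretrain"), a.clone()),
            _ => (DEFAULT.0.into(), DEFAULT.1.into()),
        },
        None => (DEFAULT.0.into(), DEFAULT.1.into()),
    }
}
```

With a Qwen2 init model this yields ("qwen2-pretrain", "Qwen2ForCausalLM"), which routes export through the qwen2 GGUF family mapper rather than the llama one.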

Discharges §82's P0-H item and unblocks AC-SHIP2-010 (llama-cli interop)
in combination with the P0-G vocab pad fix (PR #1706).

Test plan:
- 3 new unit tests in pretrain::tests:
  - checkpoint_name_and_arch_default_when_no_init (back-compat)
  - checkpoint_name_and_arch_qwen2_init (Qwen2 stamping)
  - checkpoint_name_and_arch_init_without_hf_fields (graceful fallback)
- All 3 PASS

Methodology lesson #29 evidence: P0-G surfaced P0-H within minutes;
four Class 3 defects in 24h (P0-D, P0-E, P0-F, P0-G), with P0-H
immediately behind them, confirm the "waves of 4" pattern.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift merged commit b3ab72f into main May 16, 2026
18 of 20 checks passed
@noahgift noahgift deleted the fix/p0g-gguf-vocab-pad branch May 16, 2026 03:35