
fix(export): SPEC §82 P0-G — pad tokenizer.ggml.tokens to vocab_size for llama.cpp interop #1706

Merged
noahgift merged 12 commits into main from fix/p0g-gguf-vocab-pad
May 16, 2026

Conversation

@noahgift
Contributor

Summary

  • P0-G fix: pad GGUF tokenizer.ggml.tokens to match <arch>.vocab_size so llama.cpp's check_tensor_dims accepts the corresponding token_embd.weight first dim. Threads vocab_size through both build_tokenizer_gguf_metadata (GgufTokenizer path) and extract_apr_tokenizer_for_gguf (APR-fallback path), padding with <|pad_N|> placeholders. A minimal sketch of the padding rule follows this list.
  • SPEC §82 amendment: records the first long-training MODEL-2 dispatch since §34 — P2-A 5000-step EARLY_STOP val_loss=4.7111 at epoch 20 on lambda-vector RTX 4090. §34 ceiling broken further: 9.38 → 5.36 (§78) → 4.71 (§82). MODEL-2 ship % 77% → 79%.
  • AC-SHIP2-009 LIVE-DISCHARGED: apr bench PASSED at 325.1 tok/s with embedded BPE tokenizer + C-03 metadata on pretrain checkpoint.
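
The padding rule both emission paths now share, as a minimal sketch — the function name and logging shape are illustrative, not the actual aprender-core API:

```rust
/// Pad `tokens` with `<|pad_N|>` placeholders up to `vocab_size`.
/// `vocab_size == 0` disables padding (back-compat), and a `vocab_size`
/// no larger than the existing token list is left untouched (no truncation).
fn pad_tokens_to_vocab_size(tokens: &mut Vec<String>, vocab_size: usize) {
    if vocab_size == 0 || vocab_size <= tokens.len() {
        return;
    }
    let missing = vocab_size - tokens.len();
    eprintln!(
        "[P0-G] Padding tokenizer.ggml.tokens: {} + {} placeholders = {}",
        tokens.len(),
        missing,
        vocab_size
    );
    for n in 0..missing {
        tokens.push(format!("<|pad_{n}|>"));
    }
}
```

For the Qwen2.5 case this pads the 151643 real tokens with 293 placeholders to reach the 151936 first dim of token_embd.weight.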

Empirical verification

Before fix:

llama_model_load: error loading model: check_tensor_dims:
  tensor 'token_embd.weight' has wrong shape;
  expected   896, 151643, got   896, 151936

After fix (with new apr binary, re-exporting same checkpoint):

[P0-G] Padding APR-fallback tokenizer.ggml.tokens:
  151643 + 293 placeholders = 151936

Llama.cpp accepts the vocab metadata; next blocker (P0-H tensor count mismatch) is out of scope for this PR.

Test plan

  • 6 new unit tests cover both code paths and back-compat (vocab_size=0)
  • Existing 170 export tests stay green
  • End-to-end re-export of P2-A epoch-020.apr → llama-cli now passes vocab check
  • cargo test -p aprender-core --lib p0g_ → 6/6 PASS

Methodology

Lesson #29 NEW: Class 3 packaging defects surface in waves of 4 (not 2):

Each downstream tool falsifies its own invariant in the checkpoint-emission contract.

🤖 Generated with Claude Code

noahgift and others added 3 commits May 16, 2026 00:01
llama.cpp's check_tensor_dims uses len(tokenizer.ggml.tokens) as the
expected first dim of token_embd.weight. Qwen2.5 models pad embed_tokens
to 151936 for TP-alignment but real tokenizer vocab is 151643 — the 293
delta causes llama-cli to refuse loading APR-exported GGUFs.

Fix: thread `<arch>.vocab_size` into both tokenizer-emission paths
(GgufTokenizer + APR-fallback) and pad with `<|pad_N|>` placeholders
from `len(tokens)` to `vocab_size`. Pass 0 to disable (back-compat for
tests that don't care about model dims).

Empirically verified end-to-end on SPEC §82's P2-A epoch-020 checkpoint:

  [P0-G] Padding APR-fallback tokenizer.ggml.tokens:
    151643 + 293 placeholders = 151936

Unit tests (6 new):
- test_p0g_pad_tokens_to_vocab_size (GgufTokenizer path)
- test_p0g_no_pad_when_vocab_size_zero (back-compat)
- test_p0g_no_pad_when_vocab_size_equals_tokens
- test_p0g_no_pad_when_vocab_size_smaller (no truncation)
- test_p0g_apr_fallback_pad_tokens_to_vocab_size (APR path)
- test_p0g_apr_fallback_no_pad_when_vocab_size_zero
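
The first of these, as a hedged sketch against the illustrative `pad_tokens_to_vocab_size` helper from the PR summary (the real test in aprender-core targets the actual metadata builder):

```rust
#[test]
fn test_p0g_pad_tokens_to_vocab_size() {
    // Small numbers stand in for the real 151643 -> 151936 case.
    let mut tokens: Vec<String> = (0..5).map(|i| format!("tok_{i}")).collect();
    pad_tokens_to_vocab_size(&mut tokens, 8);
    assert_eq!(tokens.len(), 8);
    assert_eq!(tokens[5], "<|pad_0|>");
    assert_eq!(tokens[7], "<|pad_2|>");
}
```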

Discharges AC-SHIP2-010 vocab-size component (next blocker is
P0-H tensor-count mismatch — separate PR).

Methodology lesson #29: Class 3 packaging defects surface in waves.
P0-G is the 4th in 24h (D embed tokenizer, E arch dims, F arch case, G vocab pad).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…G LIVE-discharged

§82 records the first long-training MODEL-2 dispatch since §34 (27 days,
60 amendments ago):

- §34 ceiling broken further: 9.38 → 5.36 (§78) → 4.71 (§82)
- P2-A on lambda-vector RTX 4090: 27 epochs / 2700 steps / ~40 min wall
- Best val_loss = 4.7110777 at epoch 20

P0 trio dispatched against epoch-020.apr:
- P0-A apr qa: infra PASS (only golden_output fails — expected for pretrain)
- P0-B apr bench: PASS at 325.1 tok/s with embedded BPE tokenizer + C-03
  metadata satisfied — confirms #1701 P0-D/E fixes live in production
- P0-C step 1 apr export: PASS — confirms #1699 P0-F arch case mapping live
- P0-C step 2 llama-cli: BLOCKED by NEW Class 3 defect P0-G
  (fixed in companion code commit on this branch)
- P0-G fix DISCHARGED end-to-end; surfaces P0-H tensor-count mismatch
  (out of scope for this PR)

AC-SHIP2-* movement:
- AC-SHIP2-009 → DISCHARGED (apr bench works on pretrain ckpt)
- AC-SHIP2-006 → FUNCTIONAL (apr qa infra runs end-to-end)
- AC-SHIP2-010 → vocab-component DISCHARGED via P0-G; blocked on P0-H

MODEL-1 ship %: 100% (unchanged).
MODEL-2 ship %: 77% → 79% (+1 for AC-SHIP2-009; +1 for ceiling break to 4.71).

Methodology lesson #29 NEW: Class 3 packaging defects surface in waves of 4
(not 2). Every downstream tool falsifies its own invariant in the
checkpoint-emission contract.

Evidence: evidence/section-82-p2a-results-2026-05-15/

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…d over .gitignore)

The two llamacli error logs document the pre-fix (P0-G vocab mismatch)
and post-P0G-fix (P0-H tensor count mismatch) states for §82 evidence.
.gitignore excludes *.log so force-add is required.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) May 15, 2026 22:03
noahgift and others added 2 commits May 16, 2026 00:03
…ttern

CI failed on clippy::collapsible_match for the nested if-let chain in
P0-G's APR-fallback padding path. Let chains aren't available on the
Rust 2021 edition, so the cleanest fix is find_map with a pattern guard
that returns the inner Vec<String> directly.
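
Roughly the shape of the change, with an illustrative section enum standing in for the real APR metadata type:

```rust
// Illustrative stand-in for the APR-fallback metadata the exporter walks.
enum AprSection {
    Tokenizer { tokens: Vec<String> },
    Other,
}

// Instead of the nested if-let chain clippy flagged, find_map yields the
// inner Vec<String> from the matching section in a single pass.
fn fallback_tokens(sections: &[AprSection]) -> Option<&Vec<String>> {
    sections.iter().find_map(|s| match s {
        AprSection::Tokenizer { tokens } => Some(tokens),
        _ => None,
    })
}
```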

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 16, 2026
… --init model (#1709)

When `apr pretrain --init <qwen2.apr>` fine-tunes a Qwen2 model, the
trainer was hardcoded to stamp `("llama-370m-pretrain", "LlamaForCausalLM")`
regardless of what the init model actually was. Downstream `apr export
--format gguf` then routed through the llama-family GGUF mapper, which
has no mapping for Qwen2's per-layer biases (q_proj_bias, k_proj_bias,
v_proj_bias × 24 layers = 72 tensors). Those biases fell through to
passthrough names like `model.layers.0.self_attn.q_proj.bias`, got
counted in the GGUF header (291 total), but llama.cpp's llama-arch
loader silently skipped them → `done_getting_tensors: wrong number
of tensors; expected 291, got 219`.

The fix derives `name` and `architecture` from `init_arch`:
- Qwen2 init → ("qwen2-pretrain", "Qwen2ForCausalLM")
- Other init → ("<hf_model_type>-pretrain", "<hf_architecture>")
- No init → ("llama-370m-pretrain", "LlamaForCausalLM") [back-compat]

Once stamped correctly, the qwen2 GGUF family mapper handles biases via
its `q_proj_bias: "attn_q.bias"` rules and the tensor count matches.
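
A hedged sketch of the stamping rule — `InitArch` and its fields are placeholders for whatever metadata the trainer actually carries from the init model:

```rust
// Placeholder for the init-model metadata available to the trainer.
struct InitArch {
    hf_model_type: Option<String>,   // e.g. "qwen2"
    hf_architecture: Option<String>, // e.g. "Qwen2ForCausalLM"
}

/// Derive the checkpoint (name, architecture) stamp from the init model,
/// keeping the legacy llama stamp when pretraining from scratch or when
/// the init model carries no HF fields.
fn checkpoint_name_and_arch(init: Option<&InitArch>) -> (String, String) {
    const DEFAULT: (&str, &str) = ("llama-370m-pretrain", "LlamaForCausalLM");
    match init {
        Some(arch) => match (&arch.hf_model_type, &arch.hf_architecture) {
            (Some(mt), Some(a)) => (format!("{mt}-pretrain"), a.clone()),
            _ => (DEFAULT.0.into(), DEFAULT.1.into()),
        },
        None => (DEFAULT.0.into(), DEFAULT.1.into()),
    }
}
```

With a Qwen2 init model this yields ("qwen2-pretrain", "Qwen2ForCausalLM"), which routes export through the qwen2 GGUF family mapper rather than the llama one.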

Discharges §82's P0-H item and unblocks AC-SHIP2-010 (llama-cli interop)
in combination with the P0-G vocab pad fix (PR #1706).

Test plan:
- 3 new unit tests in pretrain::tests:
  - checkpoint_name_and_arch_default_when_no_init (back-compat)
  - checkpoint_name_and_arch_qwen2_init (Qwen2 stamping)
  - checkpoint_name_and_arch_init_without_hf_fields (graceful fallback)
- All 3 PASS

Methodology lesson #29 evidence: P0-G surfaced P0-H within minutes;
four Class 3 defects in 24h (P0-D, P0-E, P0-F, P0-G), with P0-H
immediately behind them, confirm the "waves of 4" pattern.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift merged commit b3ab72f into main May 16, 2026
18 of 20 checks passed
@noahgift noahgift deleted the fix/p0g-gguf-vocab-pad branch May 16, 2026 03:35