
fix(pretrain): SPEC §82 P0-H — stamp APR checkpoint architecture from --init model #1709

Merged
noahgift merged 6 commits into main from fix/p0h-arch-from-init on May 16, 2026
Conversation

@noahgift
Contributor

Summary

When `apr pretrain --init <qwen2.apr>` fine-tunes a Qwen2 model, the trainer was hardcoded to stamp `("llama-370m-pretrain", "LlamaForCausalLM")` regardless of the init model. Downstream `apr export --format gguf` then routed through the llama-family mapper, which has no mapping for Qwen2's 72 per-layer biases (q/k/v projection bias × 24 layers). Those biases fell through as passthrough names: they were counted in the GGUF header (291 tensors total) but skipped by llama.cpp's llama-arch loader, so loading failed with `expected 291, got 219` (291 − 72 = 219).
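
To make the failure mode concrete, here is a minimal, self-contained sketch of the count mismatch. The mapper function and tensor-name suffixes below are illustrative assumptions, not the actual apr-cli mapper:

```rust
// Hypothetical illustration of a llama-family name map: it knows the
// attention weights but has no rule for Qwen2's per-layer attention
// biases, so those names "pass through" unchanged.
fn map_llama_family(hf_suffix: &str) -> Option<&'static str> {
    match hf_suffix {
        "self_attn.q_proj.weight" => Some("attn_q.weight"),
        "self_attn.k_proj.weight" => Some("attn_k.weight"),
        "self_attn.v_proj.weight" => Some("attn_v.weight"),
        // ...other llama-family rules...
        // no arm for "self_attn.{q,k,v}_proj.bias" -> passthrough
        _ => None,
    }
}

fn main() {
    // 24 layers × 3 attention biases = 72 unmapped tensors, as in the failing model.
    let unmapped: usize = (0..24)
        .flat_map(|_| ["q_proj", "k_proj", "v_proj"])
        .filter(|p| map_llama_family(&format!("self_attn.{p}.bias")).is_none())
        .count();
    let header_count = 291; // every tensor, passthrough names included
    // llama.cpp's llama-arch loader only loads names it recognises:
    println!("expected {header_count}, got {}", header_count - unmapped); // expected 291, got 219
}
```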

The fix derives `name` and `architecture` from `init_arch` (see the sketch after this list):

  • Qwen2 init → ("qwen2-pretrain", "Qwen2ForCausalLM") — the qwen2 family mapper then handles the biases via its q_proj_bias: "attn_q.bias" rules
  • Other init → ("<hf_model_type>-pretrain", "<hf_architecture>")
  • No init (from-scratch) → ("llama-370m-pretrain", "LlamaForCausalLM") (back-compat)
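
A minimal sketch of the stamping rule, assuming a hypothetical `InitArch` struct that carries the HF `model_type`/`architecture` fields read from the `--init` checkpoint (the real apr-cli types and field names may differ):

```rust
/// Hypothetical carrier for the HF metadata read from the --init checkpoint;
/// the actual apr-cli struct and field names are assumptions here.
struct InitArch {
    hf_model_type: Option<String>,   // e.g. "qwen2"
    hf_architecture: Option<String>, // e.g. "Qwen2ForCausalLM"
}

/// Derive the (name, architecture) pair stamped into the APR checkpoint.
fn checkpoint_name_and_arch(init_arch: Option<&InitArch>) -> (String, String) {
    match init_arch {
        // --init given and HF fields present: stamp from the init model,
        // e.g. ("qwen2-pretrain", "Qwen2ForCausalLM").
        Some(a) if a.hf_model_type.is_some() && a.hf_architecture.is_some() => (
            format!("{}-pretrain", a.hf_model_type.as_deref().unwrap()),
            a.hf_architecture.clone().unwrap(),
        ),
        // --init given but HF fields missing, or no --init at all:
        // fall back to the historical defaults (back-compat).
        _ => ("llama-370m-pretrain".into(), "LlamaForCausalLM".into()),
    }
}
```

The fallback arm also covers an init checkpoint that lacks HF fields, which matches the graceful-fallback behaviour exercised by the new unit tests.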

Discharges §82's P0-H item. Combined with PR #1706 (P0-G vocab pad) and #1701 (P0-D/E embed tokenizer + arch metadata), this unblocks AC-SHIP2-010 (llama-cli interop) end-to-end for apr pretrain outputs from Qwen2 init.

Test plan

  • 3 new unit tests in pretrain::tests (one sketched below)
  • cargo test -p apr-cli --lib checkpoint_name_and_arch → 3/3 PASS
  • cargo clippy -p apr-cli --lib -- -D warnings is clean
  • cargo build -p apr-cli --bin apr succeeds
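
For illustration, the back-compat case might look like the following — a sketch against the hypothetical `checkpoint_name_and_arch` signature shown earlier, not the actual test body:

```rust
#[cfg(test)]
mod tests {
    use super::*;

    // Sketch of the no-init (from-scratch) case: the historical defaults
    // must still be stamped so existing pipelines keep working.
    #[test]
    fn checkpoint_name_and_arch_default_when_no_init() {
        let (name, arch) = checkpoint_name_and_arch(None);
        assert_eq!(name, "llama-370m-pretrain");
        assert_eq!(arch, "LlamaForCausalLM");
    }
}
```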

Methodology

Confirmation of the Class 3 packaging defect cascade (methodology lesson #29):

| # | Defect | PR |
|---|--------|----|
| 1 | P0-D missing embedded tokenizer | #1701 |
| 2 | P0-E missing arch metadata | #1701 |
| 3 | P0-F HF→GGUF arch case | #1699 |
| 4 | P0-G GGUF vocab pad | #1706 |
| 5 | P0-H arch from init | this PR |

5 Class 3 defects in 24h. The "waves of 4, not 2" lesson is empirically holding (perhaps "waves of 5+").

🤖 Generated with Claude Code

… --init model

When `apr pretrain --init <qwen2.apr>` fine-tunes a Qwen2 model, the
trainer was hardcoded to stamp `("llama-370m-pretrain", "LlamaForCausalLM")`
regardless of what the init model actually was. Downstream `apr export
--format gguf` then routed through the llama-family GGUF mapper, which
has no mapping for Qwen2's per-layer biases (q_proj_bias, k_proj_bias,
v_proj_bias × 24 layers = 72 tensors). Those biases fell through to
passthrough names like `model.layers.0.self_attn.q_proj.bias`, got
counted in the GGUF header (291 total), but llama.cpp's llama-arch
loader silently skipped them → `done_getting_tensors: wrong number
of tensors; expected 291, got 219`.

The fix derives `name` and `architecture` from `init_arch`:
- Qwen2 init → ("qwen2-pretrain", "Qwen2ForCausalLM")
- Other init → ("<hf_model_type>-pretrain", "<hf_architecture>")
- No init → ("llama-370m-pretrain", "LlamaForCausalLM") [back-compat]

Once stamped correctly, the qwen2 GGUF family mapper handles biases via
its `q_proj_bias: "attn_q.bias"` rules and the tensor count matches.

Discharges §82's P0-H item and unblocks AC-SHIP2-010 (llama-cli interop)
in combination with the P0-G vocab pad fix (PR #1706).

Test plan:
- 3 new unit tests in pretrain::tests:
  - checkpoint_name_and_arch_default_when_no_init (back-compat)
  - checkpoint_name_and_arch_qwen2_init (Qwen2 stamping)
  - checkpoint_name_and_arch_init_without_hf_fields (graceful fallback)
- All 3 PASS

Methodology lesson #29 evidence: P0-G surfaced P0-H within minutes;
5 Class 3 defects (P0-D, P0-E, P0-F, P0-G, P0-H) in 24h confirm the
"waves of 4" pattern, if anything exceeding it.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) May 15, 2026 22:42
@noahgift noahgift merged commit 2b26e69 into main May 16, 2026
10 checks passed
@noahgift noahgift deleted the fix/p0h-arch-from-init branch May 16, 2026 02:03