## Bug

`apr finetune --task instruct --model-size 8B` silently constructs a 2-layer toy model (64h x 2L, vocab 1000) instead of reading the actual Qwen3-8B architecture (32 heads, 36 layers, vocab 151936) from the `.apr` checkpoint.
## Reproduction

```shell
apr finetune checkpoints/qwen_qwen3-8b.apr \
  --method qlora --quantize-nf4 --rank 16 \
  --data data/instruct-corpus.jsonl \
  --task instruct --model-size 8B \
  --output checkpoints/qwen_qwen3-8b-finetuned.apr --verbose
```

Output:

```
Model: 64h x 2L (vocab 1000)   # Should be 32h x 36L (vocab 151936)
```
## Root Cause (Five Whys)

- Why is the model 2 layers? Because `TransformerConfig::tiny()` is used.
- Why is `tiny()` used? Because `"8B"` doesn't match any arm in the hardcoded `match` statement in `finetune.rs:868`.
- Why is there a hardcoded match? Because `run_instruct()` uses string pattern matching on `--model-size` instead of reading the `.apr` file's architecture metadata.
- Why doesn't it read from the `.apr` file? Because `run_instruct()` was written independently of the regular LoRA path, which already has `estimate_params_from_file()` / `resolve_model_params()`.
- Why wasn't this caught? No contract enforces that the constructed model matches the checkpoint's declared architecture. A `tensor-layout-v1` contract validates tensors at import time but is never checked at finetune time.
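The failure mode above can be sketched in miniature. This is a hypothetical reconstruction, not the actual `apr` source: a `match` on the user-supplied size string whose wildcard arm silently degrades to the toy config, which is exactly how an unrecognized `"8B"` yields a 64h x 2L model.

```rust
// Hypothetical reconstruction of the run_instruct() failure mode.
// Struct fields and match arms are illustrative, not the real apr code.

#[derive(Debug, PartialEq)]
struct TransformerConfig {
    hidden: usize,
    layers: usize,
    vocab: usize,
}

impl TransformerConfig {
    // Toy config used as the silent fallback (64h x 2L, vocab 1000).
    fn tiny() -> Self {
        TransformerConfig { hidden: 64, layers: 2, vocab: 1000 }
    }
}

// String-matching lookup table: any unrecognized size string
// falls through the wildcard arm and degrades to tiny().
fn config_from_size(size: &str) -> TransformerConfig {
    match size {
        "small" => TransformerConfig { hidden: 768, layers: 12, vocab: 32000 },
        // "8B" matches no arm, so it hits the wildcard...
        _ => TransformerConfig::tiny(),
    }
}

fn main() {
    let cfg = config_from_size("8B");
    // Silently a 2-layer toy model instead of the 36-layer Qwen3-8B.
    assert_eq!(cfg, TransformerConfig::tiny());
    println!("{cfg:?}");
}
```

The wildcard arm is what makes the bug silent: the type system is satisfied, no error path is taken, and the caller has no way to know the returned config does not describe the checkpoint.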
## Design Violation

The `.apr` file is the single source of truth for model architecture — it contains validated tensor shapes, vocab size, layer count, and head configuration (proven at import time via the `tensor-layout-v1` contract). The finetune path ignores this contract and substitutes a fragile string-matching lookup table that silently degrades to a toy model on any unrecognized size string.
## Required Fix

- `run_instruct()` must extract `TransformerConfig` from the `.apr` file's metadata, the same way the LoRA path does via `estimate_params_from_file()`.
- Remove the hardcoded `model_size` match table (or demote it to fallback-only when no `.apr` file is provided).
- Add a shape contract assertion: constructed model dimensions must match checkpoint tensor shapes before training begins.
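A minimal sketch of the shape-contract half of the fix, assuming a metadata struct read from the checkpoint header. `AprMetadata`, `config_from_checkpoint`, and `assert_shape_contract` are hypothetical names, not the actual apr API:

```rust
// Hedged sketch: derive the config from the checkpoint's own metadata
// and refuse to train if the constructed model disagrees with it.

#[derive(Debug, Clone, PartialEq)]
struct TransformerConfig {
    heads: usize,
    layers: usize,
    vocab: usize,
}

// Stand-in for the architecture metadata stored in a .apr checkpoint
// (validated at import time by the tensor-layout-v1 contract).
struct AprMetadata {
    heads: usize,
    layers: usize,
    vocab: usize,
}

// The checkpoint, not the --model-size string, is the source of truth.
fn config_from_checkpoint(meta: &AprMetadata) -> TransformerConfig {
    TransformerConfig { heads: meta.heads, layers: meta.layers, vocab: meta.vocab }
}

// Shape contract: fail loudly before training begins, never silently degrade.
fn assert_shape_contract(model: &TransformerConfig, meta: &AprMetadata) -> Result<(), String> {
    let expected = config_from_checkpoint(meta);
    if *model != expected {
        return Err(format!("model {model:?} != checkpoint {expected:?}"));
    }
    Ok(())
}

fn main() {
    let meta = AprMetadata { heads: 32, layers: 36, vocab: 151_936 };

    // A model built from the checkpoint passes the contract.
    let model = config_from_checkpoint(&meta);
    assert!(assert_shape_contract(&model, &meta).is_ok());

    // A toy model from string matching is rejected before any GPU time is spent.
    let toy = TransformerConfig { heads: 2, layers: 2, vocab: 1000 };
    assert!(assert_shape_contract(&toy, &meta).is_err());
}
```

Returning `Result` rather than panicking lets the CLI surface the mismatch as a normal error message, turning a silent garbage-model run into an immediate, actionable failure.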
## Severity

Critical — silently trains a garbage model while consuming full GPU resources. No error, no warning. The user discovers the bug only by inspecting output logs.
## Refs

- PMAT-017 (apr-leaderboard dogfooding)
- `apr finetune`: Add `--task instruct` for generative instruction-following LoRA training #371 (instruct pipeline)
- `tensor-layout-v1` contract