finetune --task instruct ignores .apr architecture metadata, falls back to hardcoded tiny model #376

@noahgift

Bug

apr finetune --task instruct --model-size 8B silently constructs a 2-layer toy model (64h x 2L, vocab 1000) instead of deriving the real Qwen3-8B architecture (32 heads, 36 layers, vocab 151936) from the .apr checkpoint's metadata.

Reproduction

apr finetune checkpoints/qwen_qwen3-8b.apr \
    --method qlora --quantize-nf4 --rank 16 \
    --data data/instruct-corpus.jsonl \
    --task instruct --model-size 8B \
    --output checkpoints/qwen_qwen3-8b-finetuned.apr --verbose

Output:

Model: 64h x 2L (vocab 1000)   # Should be 32h x 36L (vocab 151936)

Root Cause (Five Whys)

  1. Why is the model 2 layers? Because TransformerConfig::tiny() is used.
  2. Why is tiny() used? Because "8B" doesn't match any arm in the hardcoded match statement in finetune.rs:868.
  3. Why is there a hardcoded match? Because run_instruct() uses string pattern matching on --model-size instead of reading the .apr file's architecture metadata.
  4. Why doesn't it read from the .apr file? Because run_instruct() was written independently of the regular LoRA path, which already has estimate_params_from_file() / resolve_model_params().
  5. Why wasn't this caught? No contract enforces that the constructed model matches the checkpoint's declared architecture. A tensor-layout-v1 contract validates tensors at import time but is never checked at finetune time.
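A minimal sketch of the failure mode described above. The struct fields, the helper name, and the one non-tiny match arm are illustrative stand-ins, not the actual source of finetune.rs:

```rust
// Simplified sketch of the suspect logic in run_instruct() -- the arm
// list and config values here are assumptions, not the real source.
#[derive(Debug, PartialEq)]
struct TransformerConfig {
    hidden: usize,
    layers: usize,
    vocab: usize,
}

impl TransformerConfig {
    // Hardcoded toy fallback: 64h x 2L, vocab 1000.
    fn tiny() -> Self {
        TransformerConfig { hidden: 64, layers: 2, vocab: 1000 }
    }
}

fn config_from_size(model_size: &str) -> TransformerConfig {
    match model_size {
        // Hypothetical recognized arm; "8B" is not in the list, so any
        // unrecognized size string silently degrades to the toy config.
        "0.5B" => TransformerConfig { hidden: 896, layers: 24, vocab: 151936 },
        _ => TransformerConfig::tiny(),
    }
}

fn main() {
    let cfg = config_from_size("8B");
    // Reproduces the reported log line: no error, no warning, just a toy model.
    assert_eq!(cfg, TransformerConfig::tiny());
    println!("Model: {}h x {}L (vocab {})", cfg.hidden, cfg.layers, cfg.vocab);
}
```

The wildcard arm is what makes the failure silent: every future size string falls through to tiny() instead of producing an error.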

Design Violation

The .apr file is the single source of truth for model architecture — it contains validated tensor shapes, vocab size, layer count, and head configuration (proven at import time via tensor-layout-v1 contract). The finetune path ignores this contract and substitutes a fragile string-matching lookup table that silently degrades to a toy model on any unrecognized size string.

Required Fix

  1. run_instruct() must extract TransformerConfig from the .apr file's metadata — same as the LoRA path does via estimate_params_from_file()
  2. Remove the hardcoded model_size match table (or demote it to a fallback used only when no .apr file is provided)
  3. Add a shape contract assertion: constructed model dimensions must match checkpoint tensor shapes before training begins
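The required fix could look roughly like the sketch below. The AprMetadata fields and from_apr() constructor are hypothetical names standing in for whatever the .apr metadata reader actually exposes; only the Qwen3-8B figures come from the issue itself:

```rust
// Sketch of the required fix -- AprMetadata and from_apr() are
// hypothetical names, not the real .apr reader API.
struct AprMetadata {
    num_heads: usize,
    num_layers: usize,
    vocab_size: usize,
}

struct TransformerConfig {
    num_heads: usize,
    num_layers: usize,
    vocab_size: usize,
}

impl TransformerConfig {
    // Fix 1: derive the config from checkpoint metadata instead of the
    // --model-size string, mirroring what estimate_params_from_file() does.
    fn from_apr(meta: &AprMetadata) -> Self {
        TransformerConfig {
            num_heads: meta.num_heads,
            num_layers: meta.num_layers,
            vocab_size: meta.vocab_size,
        }
    }
}

// Fix 3: shape contract -- refuse to train if the constructed model
// disagrees with the checkpoint's declared architecture.
fn check_shape_contract(cfg: &TransformerConfig, meta: &AprMetadata) -> Result<(), String> {
    if (cfg.num_heads, cfg.num_layers, cfg.vocab_size)
        != (meta.num_heads, meta.num_layers, meta.vocab_size)
    {
        return Err(format!(
            "model {}h x {}L (vocab {}) does not match checkpoint {}h x {}L (vocab {})",
            cfg.num_heads, cfg.num_layers, cfg.vocab_size,
            meta.num_heads, meta.num_layers, meta.vocab_size,
        ));
    }
    Ok(())
}

fn main() {
    // Qwen3-8B figures from the issue: 32 heads, 36 layers, vocab 151936.
    let meta = AprMetadata { num_heads: 32, num_layers: 36, vocab_size: 151936 };
    let cfg = TransformerConfig::from_apr(&meta);
    assert!(check_shape_contract(&cfg, &meta).is_ok());

    // The old tiny() fallback would now fail loudly before training begins.
    let tiny = TransformerConfig { num_heads: 64, num_layers: 2, vocab_size: 1000 };
    assert!(check_shape_contract(&tiny, &meta).is_err());
}
```

With the contract check in place, the reproduction above would abort with a shape mismatch error before consuming any GPU time, rather than training a garbage model.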

Severity

Critical — silently trains a garbage model while consuming full GPU resources. No error, no warning. The user discovers the bug only by inspecting the output logs.
