Summary
MVP qualification tests show that format conversions produce significantly different outputs. The diff values are ~0.5-0.8 while the acceptable epsilon is 1e-6.
Evidence (Source of Truth)
From fresh runs on 2026-02-06 with apr-cli 0.2.12:
0.5B Model (output/mvp-0.5b/evidence.json)
- F-CONV-S-G: Conversion SafeTensors → Gguf produced different output (diff: 8.00e-1, ε: 1.00e-6)
- F-CONV-S-A: Conversion SafeTensors → Apr produced different output (diff: 7.67e-1, ε: 1.00e-6)
- F-CONV-A-S: Conversion Apr → SafeTensors produced different output (diff: 8.12e-1, ε: 1.00e-6)
- F-CONV-A-G: Conversion Apr → Gguf produced different output (diff: 8.00e-1, ε: 1.00e-6)
- F-CONV-RT-002: Round-trip conversion produced different output
- F-CONV-RT-003: Round-trip conversion produced different output
1.5B Model (output/mvp-1.5b/evidence.json)
- F-CONV-S-G: Conversion SafeTensors → Gguf produced different output (diff: 7.89e-1, ε: 1.00e-6)
- F-CONV-S-A: Conversion SafeTensors → Apr produced different output (diff: 5.44e-1, ε: 1.00e-6)
- F-CONV-A-S: Conversion Apr → SafeTensors produced different output (diff: 8.08e-1, ε: 1.00e-6)
- F-CONV-A-G: Conversion Apr → Gguf produced different output (diff: 8.07e-1, ε: 1.00e-6)
Expected Behavior
Converted models should produce outputs that differ by less than epsilon (1e-6) from the original.
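For reference, the tolerance check can be sketched as a max-absolute-difference comparison against epsilon. This is an illustrative sketch, not apr-qa's actual implementation; `max_abs_diff` is a hypothetical helper:

```rust
/// Hypothetical sketch of the pass/fail criterion: compare two output
/// vectors (e.g. logits) element-wise and take the largest absolute
/// difference, then require it to stay below epsilon.
fn max_abs_diff(original: &[f32], converted: &[f32]) -> f32 {
    original
        .iter()
        .zip(converted)
        .map(|(a, b)| (a - b).abs())
        .fold(0.0_f32, f32::max)
}

fn main() {
    let eps = 1e-6_f32;
    let original = [0.10_f32, 0.20, 0.30];
    let converted = [0.10_f32, 0.95, 0.30]; // diverges like the failing runs
    let diff = max_abs_diff(&original, &converted);
    // A diff of 7.50e-1 is in the same range as the observed failures.
    println!("diff: {:.2e}, pass: {}", diff, diff < eps);
}
```

A passing conversion would drive `diff` below 1e-6; the observed runs are roughly six orders of magnitude away from that.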
Actual Behavior
Outputs differ by 0.5-0.8 in absolute terms, roughly six orders of magnitude above the 1e-6 tolerance, which is catastrophically bad for model quality.
Impact
- P0 CRITICAL: Converted models may produce garbage or wrong outputs
- Golden Rule Test violation: "Converted models MUST produce the same output as the original"
- Blocks certification for all formats except ground truth SafeTensors
Test Command
cargo run --release --bin apr-qa -- run playbooks/models/qwen2.5-coder-0.5b-mvp.playbook.yaml --output evidence.json
Root Cause Investigation Needed
- Are weights being transposed incorrectly during conversion?
- Is quantization causing significant precision loss?
- Are tokenizer differences causing output divergence?
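The transposition hypothesis is easy to demonstrate in isolation. The sketch below (illustrative only, not apr-cli's conversion code) shows how the same row-major weight bytes, silently read column-major, shift a matvec output by amounts on the same order as the observed diffs:

```rust
/// Row-major matrix-vector product: y[r] = sum_c w[r*cols + c] * x[c].
fn matvec(w: &[f32], rows: usize, cols: usize, x: &[f32]) -> Vec<f32> {
    (0..rows)
        .map(|r| (0..cols).map(|c| w[r * cols + c] * x[c]).sum())
        .collect()
}

fn main() {
    // 2x2 weight, row-major: [[1.0, 2.0], [3.0, 4.0]]
    let w = [1.0_f32, 2.0, 3.0, 4.0];
    // The same values laid out as if the converter wrote the transpose.
    let wt = [1.0_f32, 3.0, 2.0, 4.0];
    let x = [0.5_f32, 0.25];
    let y = matvec(&w, 2, 2, &x);
    let yt = matvec(&wt, 2, 2, &x);
    let diff = y
        .iter()
        .zip(&yt)
        .map(|(a, b)| (a - b).abs())
        .fold(0.0_f32, f32::max);
    println!("max diff: {diff}"); // prints "max diff: 0.5"
}
```

A mistake of this kind corrupts every matmul in the forward pass, which would explain diffs clustering in a similar 0.5-0.8 band across models, whereas quantization loss alone would usually be far smaller.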
Links
- Evidence:
apr-model-qa-playbook/output/mvp-0.5b/evidence.json
- Evidence:
apr-model-qa-playbook/output/mvp-1.5b/evidence.json
- Playbooks:
apr-model-qa-playbook/playbooks/models/qwen2.5-coder-*.playbook.yaml