Summary
MVP qualification tests show that format conversions produce significantly different outputs. The diff values are ~0.5-0.8 while the acceptable epsilon is 1e-6.
Evidence (Source of Truth)
From fresh runs on 2026-02-06 with apr-cli 0.2.12:
0.5B Model (output/mvp-0.5b/evidence.json)
- F-CONV-S-G: Conversion SafeTensors → Gguf produced different output (diff: 8.00e-1, ε: 1.00e-6)
- F-CONV-S-A: Conversion SafeTensors → Apr produced different output (diff: 7.67e-1, ε: 1.00e-6)
- F-CONV-A-S: Conversion Apr → SafeTensors produced different output (diff: 8.12e-1, ε: 1.00e-6)
- F-CONV-A-G: Conversion Apr → Gguf produced different output (diff: 8.00e-1, ε: 1.00e-6)
- F-CONV-RT-002: Round-trip conversion produced different output
- F-CONV-RT-003: Round-trip conversion produced different output
1.5B Model (output/mvp-1.5b/evidence.json)
- F-CONV-S-G: Conversion SafeTensors → Gguf produced different output (diff: 7.89e-1, ε: 1.00e-6)
- F-CONV-S-A: Conversion SafeTensors → Apr produced different output (diff: 5.44e-1, ε: 1.00e-6)
- F-CONV-A-S: Conversion Apr → SafeTensors produced different output (diff: 8.08e-1, ε: 1.00e-6)
- F-CONV-A-G: Conversion Apr → Gguf produced different output (diff: 8.07e-1, ε: 1.00e-6)
Expected Behavior
Converted models should produce outputs that differ by less than epsilon (1e-6) from the original.
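For reference, the tolerance check can be sketched as a max-absolute-difference comparison against epsilon. This is an illustrative sketch, not apr-qa's actual implementation; `max_abs_diff` is a hypothetical helper:

```rust
/// Hypothetical sketch of the pass/fail criterion: compare two output
/// vectors (e.g. logits) element-wise and take the largest absolute
/// difference, then require it to stay below epsilon.
fn max_abs_diff(original: &[f32], converted: &[f32]) -> f32 {
    original
        .iter()
        .zip(converted)
        .map(|(a, b)| (a - b).abs())
        .fold(0.0_f32, f32::max)
}

fn main() {
    let eps = 1e-6_f32;
    let original = [0.10_f32, 0.20, 0.30];
    let converted = [0.10_f32, 0.95, 0.30]; // diverges like the failing runs
    let diff = max_abs_diff(&original, &converted);
    // A diff of 7.50e-1 is in the same range as the observed failures.
    println!("diff: {:.2e}, pass: {}", diff, diff < eps);
}
```

A passing conversion would drive `diff` below 1e-6; the observed runs are roughly six orders of magnitude away from that.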
Actual Behavior
Outputs differ by 0.5-0.8 in absolute terms, roughly six orders of magnitude above the 1e-6 tolerance, which is catastrophically bad for model quality.
Impact
- P0 CRITICAL: Converted models may produce garbage or wrong outputs
- Golden Rule Test violation: "Converted models MUST produce the same output as the original"
- Blocks certification for all formats except ground truth SafeTensors
Test Command
cargo run --release --bin apr-qa -- run playbooks/models/qwen2.5-coder-0.5b-mvp.playbook.yaml --output evidence.json
Root Cause Investigation Needed
- Are weights being transposed incorrectly during conversion?
- Is quantization causing significant precision loss?
- Are tokenizer differences causing output divergence?
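The transposition hypothesis is easy to demonstrate in isolation. The sketch below (illustrative only, not apr-cli's conversion code) shows how the same row-major weight bytes, silently read column-major, shift a matvec output by amounts on the same order as the observed diffs:

```rust
/// Row-major matrix-vector product: y[r] = sum_c w[r*cols + c] * x[c].
fn matvec(w: &[f32], rows: usize, cols: usize, x: &[f32]) -> Vec<f32> {
    (0..rows)
        .map(|r| (0..cols).map(|c| w[r * cols + c] * x[c]).sum())
        .collect()
}

fn main() {
    // 2x2 weight, row-major: [[1.0, 2.0], [3.0, 4.0]]
    let w = [1.0_f32, 2.0, 3.0, 4.0];
    // The same values laid out as if the converter wrote the transpose.
    let wt = [1.0_f32, 3.0, 2.0, 4.0];
    let x = [0.5_f32, 0.25];
    let y = matvec(&w, 2, 2, &x);
    let yt = matvec(&wt, 2, 2, &x);
    let diff = y
        .iter()
        .zip(&yt)
        .map(|(a, b)| (a - b).abs())
        .fold(0.0_f32, f32::max);
    println!("max diff: {diff}"); // prints "max diff: 0.5"
}
```

A mistake of this kind corrupts every matmul in the forward pass, which would explain diffs clustering in a similar 0.5-0.8 band across models, whereas quantization loss alone would usually be far smaller.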
Links
- Evidence:
apr-model-qa-playbook/output/mvp-0.5b/evidence.json
- Evidence:
apr-model-qa-playbook/output/mvp-1.5b/evidence.json
- Playbooks:
apr-model-qa-playbook/playbooks/models/qwen2.5-coder-*.playbook.yaml