Conversion output differences exceed acceptable tolerance #215

@noahgift

Description

Summary

MVP qualification tests show that converted models produce outputs that differ significantly from the original. Reported diff values are ~0.5-0.8 against an acceptable epsilon of 1e-6.

Evidence (Source of Truth)

From fresh runs on 2026-02-06 with apr-cli 0.2.12:

0.5B Model (output/mvp-0.5b/evidence.json)

F-CONV-S-G: Conversion SafeTensors → Gguf produced different output (diff: 8.00e-1, ε: 1.00e-6)
F-CONV-S-A: Conversion SafeTensors → Apr produced different output (diff: 7.67e-1, ε: 1.00e-6)
F-CONV-A-S: Conversion Apr → SafeTensors produced different output (diff: 8.12e-1, ε: 1.00e-6)
F-CONV-A-G: Conversion Apr → Gguf produced different output (diff: 8.00e-1, ε: 1.00e-6)
F-CONV-RT-002: Round-trip conversion produced different output
F-CONV-RT-003: Round-trip conversion produced different output

1.5B Model (output/mvp-1.5b/evidence.json)

F-CONV-S-G: Conversion SafeTensors → Gguf produced different output (diff: 7.89e-1, ε: 1.00e-6)
F-CONV-S-A: Conversion SafeTensors → Apr produced different output (diff: 5.44e-1, ε: 1.00e-6)
F-CONV-A-S: Conversion Apr → SafeTensors produced different output (diff: 8.08e-1, ε: 1.00e-6)
F-CONV-A-G: Conversion Apr → Gguf produced different output (diff: 8.07e-1, ε: 1.00e-6)

Expected Behavior

Converted models should produce outputs that differ from the original by less than epsilon (1e-6).

Actual Behavior

Outputs differ by 0.5-0.8 (50-80% different), more than five orders of magnitude above the 1e-6 tolerance and catastrophically bad for model quality.
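
For context, the check these tests perform can be illustrated with the minimal sketch below. This is an assumption-laden illustration, not apr-qa's actual code: it treats the reported "diff" as an element-wise maximum absolute difference over output logits (the playbook's real metric may differ), and the argument names are placeholders.

    import numpy as np

    def outputs_match(original_logits: np.ndarray,
                      converted_logits: np.ndarray,
                      eps: float = 1e-6) -> bool:
        """Pass only if the converted model's outputs stay within eps of the original's."""
        # Assumption: the "diff" in the evidence above is a max absolute element-wise difference.
        diff = float(np.max(np.abs(original_logits - converted_logits)))
        return diff <= eps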

Impact

  • P0 CRITICAL: Converted models may produce garbage or wrong outputs
  • Golden Rule Test violation: "Converted models MUST produce the same output as the original"
  • Blocks certification for all formats except ground truth SafeTensors

Test Command

cargo run --release --bin apr-qa -- run playbooks/models/qwen2.5-coder-0.5b-mvp.playbook.yaml --output evidence.json

Root Cause Investigation Needed

  1. Are weights being transposed incorrectly during conversion? (A weight-level comparison is sketched below.)
  2. Is quantization causing significant precision loss?
  3. Are tokenizer differences causing output divergence?
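
A quick way to probe hypotheses 1 and 2 is to compare the original SafeTensors weights against the weights recovered from an Apr → SafeTensors round trip, tensor by tensor. The sketch below is illustrative only: it assumes PyTorch plus the safetensors Python package, and the file paths are placeholders, not paths produced by any existing tooling.

    import torch
    from safetensors.torch import load_file

    # Hypothetical paths: the original checkpoint and the SafeTensors file produced
    # by an Apr -> SafeTensors round trip (adjust to the actual artifact locations).
    orig = load_file("Qwen2.5-Coder-0.5B/model.safetensors")
    back = load_file("output/mvp-0.5b/roundtrip.safetensors")

    for name, a in orig.items():
        b = back.get(name)
        if b is None:
            print(f"MISSING  {name}")
            continue
        if tuple(a.shape) != tuple(b.shape):
            # A transposed weight (hypothesis 1) shows up as a swapped shape here.
            print(f"SHAPE    {name}: {tuple(a.shape)} vs {tuple(b.shape)}")
            continue
        diff = (a.to(torch.float64) - b.to(torch.float64)).abs().max().item()
        if diff > 1e-6:
            # Small, uniform diffs on every tensor suggest precision loss (hypothesis 2);
            # large diffs on a few tensors suggest a layout bug (hypothesis 1).
            print(f"VALUE    {name}: max abs diff {diff:.3e}")

If the weights round-trip bit-exactly, the divergence is more likely downstream, e.g. tokenizer handling (hypothesis 3) or the inference path for the converted format.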

Links

  • Evidence: apr-model-qa-playbook/output/mvp-0.5b/evidence.json
  • Evidence: apr-model-qa-playbook/output/mvp-1.5b/evidence.json
  • Playbooks: apr-model-qa-playbook/playbooks/models/qwen2.5-coder-*.playbook.yaml
