feat: support tensor level comparison for model evaluation #804

@zhenchaoni

Description

feat(eval): add --mode compare for ONNX vs HF tensor-similarity evaluation

Problem

winml eval today reports task-level metrics (top-1, mIoU, BLEU, ...) on a labeled dataset. That tells us whether an optimized model still works, but not how much the optimize / quantize / compile passes perturbed the raw output tensors relative to the source HF PyTorch model.

When investigating a build-pipeline regression today, we have to:

  • Stand up the HF reference by hand.
  • Feed both models the same inputs by hand.
  • Compute SQNR / PSNR / cosine / MSE / max-abs-diff on the outputs by hand.
  • Re-do all of the above per output tensor, per sample, per config we want to sweep.

There is no first-class, label-free, dataset-free signal in winml eval for "did this quantization config drift the tensors?" — which is the question we ask most often when picking quantization configs or chasing accuracy regressions.

Proposal

Add a new evaluation mode to winml eval:

winml eval --mode compare -m <onnx_or_hf_id> --task <task> --samples 100

--mode compare should:

  1. Load the ONNX candidate (existing WinMLPreTrainedModel path).
  2. Load the matching HF PyTorch reference on CPU/fp32, resolved generically from the task (no per-task mapping).
  3. Draw inputs from RandomDataset over the candidate's ONNX I/O spec — no labeled dataset, no HF pipeline, no preprocessor in the loop, so any divergence reflects the build pipeline only.
  4. Run both backends on identical inputs per sample.
  5. Report per-output-tensor parity metrics: SQNR (dB), PSNR (dB), cosine similarity, MSE, max absolute diff, aggregated as mean / std / min / max across samples.
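For reference, the per-sample math in step 5 could look roughly like the sketch below. It is not the eval_tensors implementation: the function name, the returned key names, and the choice of PSNR peak (maximum absolute value of the reference tensor) are assumptions made for illustration, and the real library's conventions should win wherever they differ.

```python
import math

import numpy as np


def _ratio_db(numerator: float, denominator: float) -> float:
    """10 * log10(numerator / denominator), with explicit zero handling."""
    if denominator == 0.0:
        return math.inf  # no error at all: treat as perfect parity
    if numerator == 0.0:
        return -math.inf  # all-zero reference but non-zero error
    return 10.0 * math.log10(numerator / denominator)


def tensor_parity_metrics(reference, candidate) -> dict:
    """Hypothetical helper: parity metrics for one output tensor of one sample."""
    ref = np.asarray(reference, dtype=np.float64).ravel()
    cand = np.asarray(candidate, dtype=np.float64).ravel()
    if ref.shape != cand.shape:
        raise ValueError(f"size mismatch: {ref.size} vs {cand.size} elements")

    err = ref - cand
    mse = float(np.mean(err**2))
    signal_power = float(np.mean(ref**2))
    peak = float(np.max(np.abs(ref)))  # assumption: peak = max |reference|

    norm_product = float(np.linalg.norm(ref) * np.linalg.norm(cand))
    # Cosine similarity is undefined when either tensor is all zeros.
    cosine = float(ref @ cand) / norm_product if norm_product > 0.0 else math.nan

    return {
        "sqnr_db": _ratio_db(signal_power, mse),
        "psnr_db": _ratio_db(peak**2, mse),
        "cosine": cosine,
        "mse": mse,
        "max_abs_diff": float(np.max(np.abs(err))),
    }
```

Both tensors are upcast to float64 before comparison so that the metric arithmetic itself does not add rounding noise on top of the fp16 or quantized candidate.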

Default mode stays onnx (today's behavior); compare is fully opt-in.

Acceptance criteria

  • winml eval --mode compare -m microsoft/resnet-50 --task image-classification --precision fp16 --samples 100 runs end-to-end and writes a metrics block with the 5 metrics × 4 stats (= 20 keys) per output tensor.
  • Per-sample math matches the team-wide eval_tensors reference library bit-for-bit on the same .npy pair.
  • Architecture-agnostic: no model-specific names, layer patterns, or hardcoded outputs in metric or evaluator code.
  • Composite models (e.g. BLIP) fail fast with a clear message instructing the user to run compare per sub-component.
  • ONNX/HF output-name divergence is handled gracefully (compute on the intersection, warn on drift) rather than failing.
  • Output integrates with the existing eval report renderer — no custom renderer required for this issue.
  • Unit tests cover metric numerics + evaluator dispatch + composite guard; one e2e test exercises the full CLI on a small real model.
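As a rough illustration of the first and fifth criteria, the sketch below pairs outputs on the intersection of names (warning on any drift) and collapses per-sample metrics into the 5 metrics × 4 stats layout. The function names, the `<metric>_<stat>` key format, and the use of population standard deviation are assumptions, not the schema the existing renderer expects.

```python
import warnings

import numpy as np

# The four aggregation stats named in the acceptance criteria.
STATS = {"mean": np.mean, "std": np.std, "min": np.min, "max": np.max}


def pair_outputs(reference_outputs: dict, candidate_outputs: dict) -> list:
    """Hypothetical helper: names present in both backends, in candidate order."""
    shared = [name for name in candidate_outputs if name in reference_outputs]
    unmatched = sorted(set(reference_outputs) ^ set(candidate_outputs))
    if unmatched:
        # Warn instead of failing so the shared outputs are still compared.
        warnings.warn(f"output names differ between backends, skipping: {unmatched}")
    return shared


def aggregate(per_sample: list) -> dict:
    """Hypothetical helper: collapse per-sample metric dicts for one output tensor."""
    if not per_sample:
        raise ValueError("need at least one sample to aggregate")
    summary = {}
    for metric in per_sample[0]:
        values = np.array([sample[metric] for sample in per_sample], dtype=np.float64)
        for stat_name, stat_fn in STATS.items():
            # Key format is an assumption, e.g. "mse_mean" or "sqnr_db_max".
            summary[f"{metric}_{stat_name}"] = float(stat_fn(values))
    return summary
```

With five metrics per sample, `aggregate` returns 20 keys per output tensor. Non-finite per-sample values (for example an infinite SQNR on a bit-identical sample) would propagate into the aggregates, so the real evaluator would need a policy for them.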

Out of scope (separate issues)

  • --mode hf (run the HF pipeline on a labeled dataset as the reference for task-level metric parity).
  • A dedicated compare-mode renderer / richer per-output presentation.
  • Multi-component (composite) compare flow.

Notes

Per-sample metric math is fixed and well-known; the design space is mostly around the evaluator wiring (model loading, input generation, output pairing) and the structure of the metrics output that the eval renderer consumes.
