feat(eval): add --mode compare for ONNX vs HF tensor-similarity evaluation
Problem
winml eval today reports task-level metrics (top-1, mIoU, BLEU, ...) on a labeled dataset. That tells us whether an optimized model still works, but not how much the optimize / quantize / compile passes perturbed the raw output tensors relative to the source HF PyTorch model.
When investigating a build-pipeline regression today we have to:
- Stand up the HF reference by hand.
- Feed both models the same inputs by hand.
- Compute SQNR / PSNR / cosine / MSE / max-abs-diff on the outputs by hand.
- Re-do all of the above per output tensor, per sample, per config we want to sweep.
There is no first-class, label-free, dataset-free signal in winml eval for "did this quantization config drift the tensors?" — which is the question we ask most often when picking quantization configs or chasing accuracy regressions.
Proposal
Add a new evaluation mode to winml eval:
winml eval --mode compare -m <onnx_or_hf_id> --task <task> --samples 100
--mode compare should:
- Load the ONNX candidate (existing WinMLPreTrainedModel path).
- Load the matching HF PyTorch reference on CPU/fp32, resolved generically from the task (no per-task mapping).
- Draw inputs from RandomDataset over the candidate's ONNX I/O spec — no labeled dataset, no HF pipeline, no preprocessor in the loop, so any divergence reflects the build pipeline only.
- Run both backends on identical inputs per sample.
- Report per-output-tensor parity metrics: SQNR (dB), PSNR (dB), cosine similarity, MSE, max absolute diff, aggregated as mean / std / min / max across samples.
Default mode stays onnx (today's behavior); compare is fully opt-in.
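For orientation, the five per-sample metrics listed above can be sketched in NumPy as follows. This is a minimal sketch and not the eval_tensors implementation. The function name tensor_parity, the dictionary key names, the use of the reference tensor's peak absolute value for PSNR, and the treatment of zero error as infinite dB are all assumptions made for illustration. Where eval_tensors uses different conventions, those take precedence.

```python
import numpy as np


def _to_db(numerator: float, denominator: float) -> float:
    """Return 10*log10(numerator/denominator), with +/-inf at the zero edges."""
    if denominator == 0.0:
        return float("inf")
    if numerator == 0.0:
        return float("-inf")
    return float(10.0 * np.log10(numerator / denominator))


def tensor_parity(reference: np.ndarray, candidate: np.ndarray) -> dict:
    """Per-sample parity metrics between one reference and one candidate tensor.

    Hypothetical helper: names and edge-case conventions are assumptions.
    """
    ref = np.asarray(reference, dtype=np.float64).ravel()
    cand = np.asarray(candidate, dtype=np.float64).ravel()
    if ref.shape != cand.shape:
        raise ValueError(f"element count mismatch: {ref.size} vs {cand.size}")

    err = ref - cand
    mse = float(np.mean(err ** 2))
    signal_power = float(np.mean(ref ** 2))
    peak = float(np.max(np.abs(ref)))  # assumption: peak is taken from the reference
    norm_product = float(np.linalg.norm(ref) * np.linalg.norm(cand))

    return {
        # Signal-to-quantization-noise ratio: reference power over error power.
        "sqnr_db": _to_db(signal_power, mse),
        # Peak signal-to-noise ratio: squared reference peak over error power.
        "psnr_db": _to_db(peak ** 2, mse),
        # Cosine similarity; undefined (NaN) if either tensor is all zeros.
        "cosine": float(np.dot(ref, cand) / norm_product) if norm_product else float("nan"),
        "mse": mse,
        "max_abs_diff": float(np.max(np.abs(err))),
    }
```

Both tensors are promoted to float64 before the arithmetic so that an fp16 candidate does not lose precision inside the metric itself. Whether that matches the reference library is something to confirm against eval_tensors.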
Acceptance criteria
- winml eval --mode compare -m microsoft/resnet-50 --task image-classification --precision fp16 --samples 100 runs end-to-end and writes a metrics block with the 5 metrics × 4 stats (= 20 keys) per output tensor.
- Per-sample math matches the team-wide eval_tensors reference library bit-for-bit on the same .npy pair.
- Architecture-agnostic: no model-specific names, layer patterns, or hardcoded outputs in metric or evaluator code.
- Composite models (e.g. BLIP) fail fast with a clear message instructing the user to run compare per sub-component.
- ONNX/HF output-name divergence is handled gracefully (compute on the intersection, warn on drift) rather than failing.
- Output integrates with the existing eval report renderer — no custom renderer required for this issue.
- Unit tests cover metric numerics + evaluator dispatch + composite guard; one e2e test exercises the full CLI on a small real model.
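Two of the criteria above, pairing outputs on the name intersection and producing 5 metrics × 4 stats per output tensor, can be illustrated with a short sketch. Everything named here is hypothetical: pair_outputs, aggregate, the "metric/stat" key format, and the use of population standard deviation are assumptions, not the shape the eval renderer actually expects.

```python
import warnings

import numpy as np

# The four aggregation statistics, applied across samples.
STATS = {"mean": np.mean, "std": np.std, "min": np.min, "max": np.max}


def pair_outputs(ref_outputs: dict, cand_outputs: dict) -> dict:
    """Pair tensors by output name on the intersection of both sides.

    Names present on only one side are skipped with a warning rather than
    raising, so drift in output naming does not abort the run.
    """
    shared = sorted(set(ref_outputs) & set(cand_outputs))
    unmatched = sorted(set(ref_outputs) ^ set(cand_outputs))
    if unmatched:
        warnings.warn(f"output names differ between backends; skipping {unmatched}")
    return {name: (ref_outputs[name], cand_outputs[name]) for name in shared}


def aggregate(per_sample: list) -> dict:
    """Collapse per-sample metric dicts for one output tensor into summary stats.

    per_sample holds one {metric_name: value} dict per sample; the result has
    len(metrics) * len(STATS) keys, e.g. 5 * 4 = 20.
    """
    if not per_sample:
        raise ValueError("need at least one sample to aggregate")
    summary = {}
    for metric in per_sample[0]:
        values = np.array([sample[metric] for sample in per_sample], dtype=np.float64)
        for stat_name, stat_fn in STATS.items():
            summary[f"{metric}/{stat_name}"] = float(stat_fn(values))
    return summary
```

One open design point this sketch does not settle: if any sample has zero error, an SQNR or PSNR defined as infinite would turn the mean infinite and the standard deviation into NaN, so the evaluator would need an explicit policy for such samples.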
Out of scope (separate issues)
- --mode hf (run the HF pipeline on a labeled dataset as the reference for task-level metric parity).
- A dedicated compare-mode renderer / richer per-output presentation.
- Multi-component (composite) compare flow.
Notes
Per-sample metric math is fixed and well-known; the design space is mostly around the evaluator wiring (model loading, input generation, output pairing) and the output shape that the eval renderer consumes.