feat: support tensor level comparison for model evaluation

---

## feat(eval): add `--mode compare` for ONNX vs HF tensor-similarity evaluation

### Problem

`winml eval` today reports task-level metrics (top-1, mIoU, BLEU, ...) on a labeled dataset. That tells us *whether* an optimized model still works, but not *how much* the optimize / quantize / compile passes perturbed the raw output tensors relative to the source HF PyTorch model.

When investigating a build-pipeline regression, we currently have to:

- Stand up the HF reference by hand.
- Feed both models the same inputs by hand.
- Compute SQNR / PSNR / cosine / MSE / max-abs-diff on the outputs by hand.
- Repeat all of the above per output tensor, per sample, and per config we want to sweep.

There is no first-class, label-free, dataset-free signal in `winml eval` that answers "did this quantization config drift the tensors?", which is the question we ask most often when picking quantization configs or chasing accuracy regressions.

### Proposal

Add a new evaluation mode to `winml eval`:

```bash
winml eval --mode compare -m <onnx_or_hf_id> --task <task> --samples 100
```

`--mode compare` should:

1. Load the ONNX candidate (existing `WinMLPreTrainedModel` path).
2. Load the matching HF PyTorch reference on CPU/fp32, resolved generically from the task (no per-task mapping).
3. Draw inputs from `RandomDataset` over the candidate's ONNX I/O spec — no labeled dataset, no HF `pipeline`, no preprocessor in the loop, so any divergence reflects the build pipeline only.
4. Run both backends on identical inputs per sample.
5. Report per-output-tensor parity metrics: **SQNR (dB), PSNR (dB), cosine similarity, MSE, max absolute diff**, aggregated as **mean / std / min / max** across samples.
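
The per-sample metrics in step 5 could be sketched as follows. This is a minimal, dependency-free illustration over flattened tensors using common textbook definitions; it is not the `eval_tensors` implementation, which may differ (for example in the PSNR peak value or in how zero-error cases are reported). The function name, the metric key names, and the choice of `max(|ref|)` as the PSNR peak are all assumptions.

```python
import math
from typing import Dict, Sequence


def _ratio_db(numerator: float, denominator: float) -> float:
    """Return 10 * log10(numerator / denominator) with explicit zero handling."""
    if denominator == 0.0:
        return math.inf   # assumption: no error energy is reported as +inf dB
    if numerator == 0.0:
        return -math.inf  # assumption: all-zero reference with non-zero error
    return 10.0 * math.log10(numerator / denominator)


def tensor_parity_metrics(ref: Sequence[float], cand: Sequence[float]) -> Dict[str, float]:
    """Compare a flattened candidate tensor against a flattened reference tensor."""
    if len(ref) != len(cand) or len(ref) == 0:
        raise ValueError("tensors must be non-empty and have the same element count")

    diffs = [c - r for r, c in zip(ref, cand)]
    signal_energy = sum(r * r for r in ref)
    noise_energy = sum(d * d for d in diffs)
    mse = noise_energy / len(ref)

    # Assumption: the PSNR peak is the largest reference magnitude.
    peak = max(abs(r) for r in ref)

    dot = sum(r * c for r, c in zip(ref, cand))
    norm_product = math.sqrt(signal_energy) * math.sqrt(sum(c * c for c in cand))

    return {
        "sqnr_db": _ratio_db(signal_energy, noise_energy),
        "psnr_db": _ratio_db(peak * peak, mse),
        # Cosine similarity is undefined when either tensor is all zeros.
        "cosine": dot / norm_product if norm_product > 0.0 else math.nan,
        "mse": mse,
        "max_abs_diff": max(abs(d) for d in diffs),
    }
```

A real evaluator would presumably operate on arrays rather than Python lists; the arithmetic is the same either way.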

The default mode stays `onnx` (today's behavior); `compare` is fully opt-in.

### Acceptance criteria

- `winml eval --mode compare -m microsoft/resnet-50 --task image-classification --precision fp16 --samples 100` runs end-to-end and writes a `metrics` block with the 5 metrics × 4 stats (= 20 keys) per output tensor.
- Per-sample math matches the team-wide `eval_tensors` reference library bit-for-bit on the same `.npy` pair.
- Architecture-agnostic: no model-specific names, layer patterns, or hardcoded outputs in metric or evaluator code.
- Composite models (e.g. BLIP) fail fast with a clear message instructing the user to run `compare` per sub-component.
- ONNX/HF output-name divergence is handled gracefully (compute on the intersection, warn on drift) rather than failing.
- Output integrates with the existing eval report renderer — no custom renderer required for this issue.
- Unit tests cover metric numerics + evaluator dispatch + composite guard; one e2e test exercises the full CLI on a small real model.
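
To make two of the criteria above concrete, here is a rough sketch of the output-name intersection and of the 5 metrics × 4 stats aggregation. The helper names, the `<metric>_<stat>` key layout, and the use of population standard deviation are assumptions for illustration, not the evaluator's actual interface. Non-finite per-sample values (for example an infinite SQNR on a bit-exact sample) are not handled here.

```python
import statistics
import warnings
from typing import Dict, List, Mapping, Sequence

# Hypothetical metric keys; the real report may name them differently.
METRICS = ("sqnr_db", "psnr_db", "cosine", "mse", "max_abs_diff")


def shared_output_names(onnx_names: Sequence[str], hf_names: Sequence[str]) -> List[str]:
    """Return names present in both backends (ONNX order), warning on drift."""
    hf_set = set(hf_names)
    shared = [name for name in onnx_names if name in hf_set]
    unmatched = sorted(set(onnx_names) ^ hf_set)
    if unmatched:
        warnings.warn(f"ONNX/HF output names differ; skipping {unmatched}")
    if not shared:
        raise ValueError("ONNX and HF outputs have no names in common")
    return shared


def aggregate(per_sample: Sequence[Mapping[str, float]]) -> Dict[str, float]:
    """Collapse per-sample metrics for one output tensor into 20 summary keys."""
    summary: Dict[str, float] = {}
    for metric in METRICS:
        values = [sample[metric] for sample in per_sample]
        summary[f"{metric}_mean"] = statistics.fmean(values)
        # Assumption: population std; a sample std would use statistics.stdev.
        summary[f"{metric}_std"] = statistics.pstdev(values)
        summary[f"{metric}_min"] = min(values)
        summary[f"{metric}_max"] = max(values)
    return summary
```

Calling `aggregate` once per name returned by `shared_output_names` would yield one 20-key block per output tensor.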

### Out of scope (separate issues)

- `--mode hf` (run the HF `pipeline` on a labeled dataset as the reference for task-level metric parity).
- A dedicated compare-mode renderer / richer per-output presentation.
- Multi-component (composite) compare flow.

### Notes

Per-sample metric math is fixed and well-known; the design space is mostly around the evaluator wiring (model loading, input generation, output pairing) and the output shape that the eval renderer consumes.

feat: support tensor level comparison for model evaluation #804
