Implement tensor similarity evaluator by zhenchaoni · Pull Request #805 · microsoft/winml-cli

zhenchaoni · 2026-06-03T05:30:46Z

Fixes #804

Implement tensor similarity evaluator

Summary

Adds a new compare mode to winml eval that compares an ONNX candidate against its HF PyTorch reference on identical random inputs and reports per-output tensor-parity metrics (SQNR, PSNR, cosine similarity, MSE, max absolute diff). This isolates divergence introduced by the build pipeline (optimize / quantize / compile) from data- or pipeline-related differences — there is no labeled dataset, no HF pipeline, and no preprocessor in the loop.

Motivation

Task-level metrics (top-1, mIoU, BLEU, ...) tell us whether an optimized model still works, but not how much the optimize/quantize/compile passes perturbed the raw tensors. Tensor-similarity gives a fast, label-free, dataset-free signal for build-pipeline regressions and for picking quantization configs.

Usage

winml eval --mode compare -m microsoft/resnet-50 --task image-classification --precision fp16 --samples 100

What's new

winml eval --mode {onnx,compare} — new Click option on winml eval. onnx (default) is the existing dataset-driven flow; compare activates the new evaluator.
TensorSimilarityEvaluator (tensor_similarity_evaluator.py) — loads the HF reference on CPU/fp32 via resolve_task_and_model_class, draws inputs from RandomDataset over the candidate's ONNX I/O spec, runs both backends per sample, and aggregates per-output metrics.
TensorSimilarityMetric (tensor_similarity.py) — stateful update / compute / reset metric mirroring MeanIoUMetric. Per-sample math is bit-equivalent to the team-wide eval_tensors reference library on the same .npy pair.
Dispatch — evaluate.py registers "compare-tensor" and get_evaluator_class routes to it when config.mode == "compare"; compare mode bypasses default-dataset resolution and the dataset section of print_config.
Config — WinMLEvaluationConfig.mode: str = "onnx"; to_dict only emits mode when non-default.
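As a rough illustration of the per-sample math, the five metrics could be computed along the following lines. This is a sketch under assumptions, not the PR's code: the function name tensor_parity, the dict keys, the choice of max |ref| as the PSNR peak, and the handling of zero-valued edge cases are all hypothetical, and nothing here is claimed to be bit-equivalent to eval_tensors.

```python
import math

import numpy as np


def tensor_parity(ref, test):
    """Per-sample parity metrics between a reference and a candidate tensor.

    Hypothetical helper: names and edge-case conventions are assumptions.
    """
    ref = np.asarray(ref, dtype=np.float64).ravel()
    test = np.asarray(test, dtype=np.float64).ravel()
    if ref.shape != test.shape:
        raise ValueError(f"shape mismatch: {ref.shape} vs {test.shape}")
    if ref.size == 0:
        raise ValueError("cannot compare empty tensors")

    diff = test - ref
    mse = float(np.mean(diff ** 2))             # noise power
    max_abs_diff = float(np.max(np.abs(diff)))
    signal_power = float(np.mean(ref ** 2))
    peak = float(np.max(np.abs(ref)))           # assumed PSNR peak convention

    def ratio_db(num, den):
        # 10 * log10(num / den), with infinities for the degenerate cases.
        if den == 0.0:
            return math.inf
        if num == 0.0:
            return -math.inf
        return 10.0 * math.log10(num / den)

    norm_ref = float(np.linalg.norm(ref))
    norm_test = float(np.linalg.norm(test))
    if norm_ref == 0.0 and norm_test == 0.0:
        cosine = 1.0                            # two all-zero tensors agree
    elif norm_ref == 0.0 or norm_test == 0.0:
        cosine = 0.0
    else:
        cosine = float(np.dot(ref, test) / (norm_ref * norm_test))

    return {
        "sqnr": ratio_db(signal_power, mse),    # dB
        "psnr": ratio_db(peak ** 2, mse),       # dB
        "cosine_similarity": cosine,
        "mse": mse,
        "max_abs_diff": max_abs_diff,
    }
```

Computing in float64 after flattening keeps the result independent of the tensor's layout and avoids accumulating fp16 rounding error inside the metric itself.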

Output shape

compute() returns a display-ready flat dict so the existing generic eval report renders without a custom renderer:

{
    f"{metric}_{stat}": {output_name: float},  # 5 metrics × 4 stats = 20 keys
    ...
}

Stats are mean / std / min / max. The renderer prints one row per {metric}_{stat} with output_name=value cells joined across outputs.
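The flat shape described above could be produced by an accumulator like the one below. It is a minimal sketch with assumed details: the class name ParityAccumulator, the metric key spellings other than cosine_similarity, the use of population standard deviation, and the RuntimeError on empty state are illustrative choices, not taken from the PR.

```python
from collections import defaultdict
from statistics import fmean, pstdev

# Hypothetical metric keys; only "cosine_similarity" appears in the PR text.
METRICS = ("sqnr", "psnr", "cosine_similarity", "mse", "max_abs_diff")


class ParityAccumulator:
    """Stateful update / compute / reset aggregator (illustrative only)."""

    def __init__(self):
        self.reset()

    def reset(self):
        # metric name -> output name -> list of per-sample values
        self._values = defaultdict(lambda: defaultdict(list))

    def update(self, output_name, sample_metrics):
        for metric in METRICS:
            self._values[metric][output_name].append(float(sample_metrics[metric]))

    def compute(self):
        if not self._values:
            raise RuntimeError("compute() called before any update()")
        result = {}
        for metric in METRICS:
            per_output = self._values[metric]
            result[f"{metric}_mean"] = {n: fmean(v) for n, v in per_output.items()}
            result[f"{metric}_std"] = {n: pstdev(v) for n, v in per_output.items()}
            result[f"{metric}_min"] = {n: min(v) for n, v in per_output.items()}
            result[f"{metric}_max"] = {n: max(v) for n, v in per_output.items()}
        return result
```

Because every value is a plain {output_name: float} mapping, a generic table renderer can print one row per key without knowing anything about tensor parity.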

Notable design choices

Output-name overlap, not strict equality. ONNX and HF output sets can differ (HF often exposes auxiliary tensors). We compute on the intersection and warn on divergence rather than failing.
Composite-model guard. Multi-component models (e.g. BLIP) raise a TypeError with guidance to run compare per sub-component — there is no canonical "one HF reference" for the composite.
int dtype normalization. Narrow int tensors are upcast to int64 before inference so HF embeddings accept them; WinMLSession down-casts to the ORT graph's declared dtype on its side. The same input dict feeds both backends.
Architecture-agnostic. No model-specific names, layer patterns, or hardcoded outputs anywhere in metric or evaluator code.
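Two of the choices above, output-name overlap and int dtype normalization, can be sketched in a few lines. The helper names shared_output_names and upcast_int_inputs are hypothetical, and raising when the two output sets share nothing is an assumption about what a reasonable implementation might do rather than a description of the PR's behavior.

```python
import logging

import numpy as np

logger = logging.getLogger(__name__)


def shared_output_names(onnx_names, hf_names):
    """Return names present in both backends, in ONNX order; warn on divergence."""
    hf_set = set(hf_names)
    onnx_set = set(onnx_names)
    shared = [name for name in onnx_names if name in hf_set]
    if not shared:
        # Assumed behavior: nothing to compare is treated as an error.
        raise ValueError("ONNX and HF outputs have no names in common")
    only_onnx = sorted(onnx_set - hf_set)
    only_hf = sorted(hf_set - onnx_set)
    if only_onnx or only_hf:
        logger.warning(
            "output sets differ; ONNX-only=%s, HF-only=%s", only_onnx, only_hf
        )
    return shared


def upcast_int_inputs(inputs):
    """Upcast integer arrays to int64; leave float and bool arrays untouched."""
    return {
        name: arr.astype(np.int64) if np.issubdtype(arr.dtype, np.integer) else arr
        for name, arr in inputs.items()
    }
```

For example, an HF model that also exposes hidden_states would be compared on logits alone, with a warning naming the extra tensor.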

Tests

test_tensor_similarity_metric.py — 10 unit tests for the metric (numerics, identity, stat shape, reset, empty-state error).
test_tensor_similarity_evaluator.py — 4 unit tests (composite-model guard, output-name overlap, dispatch).
test_eval.py — get_evaluator_class updated to take WinMLEvaluationConfig; "compare-tensor" exempted from the task-schema set check.
tests/e2e/test_eval_e2e.py::test_compare_mode_image_classification — full CLI run on microsoft/resnet-50 fp16: asserts the 20-key flat shape, per-output cosine bounds [-1, 1], and (QNN host only) cosine_similarity_mean >= 0.95.

53 unit tests pass; ruff clean.

Out of scope (follow-ups)

--mode hf (run the HF pipeline on a labeled dataset as the reference for task-level metrics) — placeholder removed from this PR; will land as a separate change.
A custom renderer for compare mode (currently uses the generic table).

xieofxie · 2026-06-03T06:06:46Z

    dataset_script: str | None,
    trust_remote_code: bool,
    show_schema: bool,
+    mode: str,


use a type instead of str
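One possible way to act on this suggestion is a string-valued enum, sketched below. EvalMode and EvalConfigSketch are hypothetical names used for illustration; the real WinMLEvaluationConfig has other fields that are not shown here, and the PR may choose a different type (for example typing.Literal).

```python
from dataclasses import dataclass
from enum import Enum


class EvalMode(str, Enum):
    """Hypothetical closed set of eval modes, replacing a bare str."""

    ONNX = "onnx"
    COMPARE = "compare"


@dataclass
class EvalConfigSketch:
    """Stand-in for the config's mode field only (illustrative)."""

    mode: EvalMode = EvalMode.ONNX

    def __post_init__(self):
        # Accept either an EvalMode or its string value; reject anything else.
        self.mode = EvalMode(self.mode)

    def to_dict(self):
        out = {}
        if self.mode is not EvalMode.ONNX:
            # Emit mode only when it differs from the default.
            out["mode"] = self.mode.value
        return out
```

With this shape, an unsupported value such as "hf" fails at construction time with a ValueError instead of propagating as an arbitrary string.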

vortex-captain · 2026-06-03T06:21:21Z

+        return 1.0
+    if norm_ref == 0.0 or norm_test == 0.0:
+        return 0.0
+    return float(np.dot(ref, test) / (norm_ref * norm_test))


Just curious: cosine similarity seems to be a robust measure when the mean/max absolute diff can be quite large due to numerical errors in both models being compared. On the other hand, the norms of the two vectors are lost in the similarity computation. Would it be possible to compare norms jointly with cosine similarity?
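To make the question concrete, here is a small sketch that reports a norm ratio next to cosine similarity. The function cosine_and_norm_ratio is hypothetical and not part of the PR; the both-zero branch returning 1.0 is an assumption inferred from the truncated diff above.

```python
import math

import numpy as np


def cosine_and_norm_ratio(ref, test):
    """Return (cosine similarity, ||test|| / ||ref||) for two tensors."""
    ref = np.asarray(ref, dtype=np.float64).ravel()
    test = np.asarray(test, dtype=np.float64).ravel()
    norm_ref = float(np.linalg.norm(ref))
    norm_test = float(np.linalg.norm(test))

    if norm_ref == 0.0 and norm_test == 0.0:
        return 1.0, 1.0                      # assumed: two zero tensors match
    if norm_ref == 0.0 or norm_test == 0.0:
        cosine = 0.0
    else:
        cosine = float(np.dot(ref, test) / (norm_ref * norm_test))

    # Scale information that cosine similarity discards.
    ratio = norm_test / norm_ref if norm_ref > 0.0 else math.inf
    return cosine, ratio
```

A candidate that outputs exactly twice the reference gets a cosine of 1.0 but a norm ratio of 2.0, so the pair separates direction errors from scale errors. SQNR and MSE in the existing metric set are also sensitive to such scale differences.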

Implement tensor similarity evaluator

1edda7a

zhenchaoni requested a review from a team as a code owner June 3, 2026 05:30

xieofxie reviewed Jun 3, 2026

vortex-captain reviewed Jun 3, 2026

vortex-captain approved these changes Jun 3, 2026

zhenchaoni wants to merge 1 commit into main from private/zhenni/tensor_eval

3 participants
