Implement tensor similarity evaluator #805
Open
zhenchaoni wants to merge 1 commit into
xieofxie
reviewed
Jun 3, 2026
    dataset_script: str | None,
    trust_remote_code: bool,
    show_schema: bool,
    mode: str,
        return 1.0
    if norm_ref == 0.0 or norm_test == 0.0:
        return 0.0
    return float(np.dot(ref, test) / (norm_ref * norm_test))
Contributor
Just curious: cosine similarity seems to be a robust measure when the mean/max of the absolute diff can be quite large due to numerical errors in both of the models being compared. On the other hand, the norms of the two vectors are lost in the similarity computation. Would it be possible to compare norms jointly with cosine similarity?
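One way to picture the question above is to report a norm ratio next to the cosine value, so that a pure scaling error (same direction, different magnitude) becomes visible. The following is only a sketch of that idea, not code from this PR: the function name `cosine_and_norm_ratio` is hypothetical, it uses plain Python lists instead of NumPy to stay self-contained, and the both-norms-zero branch is an assumption, since the condition guarding `return 1.0` is not visible in the diff hunk.

```python
import math
from typing import Sequence


def cosine_and_norm_ratio(
    ref: Sequence[float], test: Sequence[float]
) -> tuple[float, float]:
    """Return (cosine similarity, ||test|| / ||ref||) for two flat vectors."""
    norm_ref = math.sqrt(sum(r * r for r in ref))
    norm_test = math.sqrt(sum(t * t for t in test))

    # Assumption: two all-zero vectors are treated as identical.
    if norm_ref == 0.0 and norm_test == 0.0:
        return 1.0, 1.0
    # Direction is undefined when exactly one vector is all zeros.
    if norm_ref == 0.0 or norm_test == 0.0:
        ratio = 0.0 if norm_test == 0.0 else math.inf
        return 0.0, ratio

    dot = sum(r * t for r, t in zip(ref, test))
    return dot / (norm_ref * norm_test), norm_test / norm_ref


# A candidate that is exactly 2x the reference: the cosine is ~1.0,
# but the norm ratio of ~2.0 exposes the magnitude error.
cos, ratio = cosine_and_norm_ratio([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

A ratio near 1.0 together with a cosine near 1.0 would indicate agreement in both direction and magnitude. Whether this adds information beyond SQNR or MSE, which already depend on magnitude, is a separate judgment call.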
vortex-captain
approved these changes
Jun 3, 2026
Fixes #804
Implement tensor similarity evaluator
Summary
Adds a new compare mode to winml eval that compares an ONNX candidate against its HF PyTorch reference on identical random inputs and reports per-output tensor-parity metrics (SQNR, PSNR, cosine similarity, MSE, max absolute diff). This isolates divergence introduced by the build pipeline (optimize / quantize / compile) from data- or pipeline-related differences: there is no labeled dataset, no HF pipeline, and no preprocessor in the loop.
Motivation
Task-level metrics (top-1, mIoU, BLEU, ...) tell us whether an optimized model still works, but not how much the optimize/quantize/compile passes perturbed the raw tensors. Tensor-similarity gives a fast, label-free, dataset-free signal for build-pipeline regressions and for picking quantization configs.
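For readers unfamiliar with the five metrics named in the Summary, here is a minimal per-sample sketch using common textbook definitions. It is not the PR's implementation and makes no claim to match the eval_tensors reference library: the function name `tensor_parity_metrics`, the dictionary keys other than cosine_similarity, the choice of max |ref| as the PSNR peak, the zero-norm handling, and the clamping of the cosine value are all assumptions. Plain Python lists are used instead of NumPy to keep the example self-contained.

```python
import math
from typing import Sequence


def tensor_parity_metrics(
    ref: Sequence[float], test: Sequence[float]
) -> dict[str, float]:
    """Compare a flattened reference tensor against a flattened candidate."""
    if len(ref) != len(test) or not ref:
        raise ValueError("ref and test must be non-empty and the same length")

    diffs = [r - t for r, t in zip(ref, test)]
    noise_power = sum(d * d for d in diffs)   # sum of squared errors
    signal_power = sum(r * r for r in ref)    # sum of squared reference values
    mse = noise_power / len(ref)
    peak = max(abs(r) for r in ref)           # assumed PSNR peak: max |ref|

    norm_ref = math.sqrt(signal_power)
    norm_test = math.sqrt(sum(t * t for t in test))
    if norm_ref == 0.0 and norm_test == 0.0:
        cosine = 1.0                          # assumption: both zero means identical
    elif norm_ref == 0.0 or norm_test == 0.0:
        cosine = 0.0
    else:
        dot = sum(r * t for r, t in zip(ref, test))
        # Clamp to absorb floating-point overshoot just outside [-1, 1].
        cosine = max(-1.0, min(1.0, dot / (norm_ref * norm_test)))

    if noise_power == 0.0:
        sqnr = psnr = math.inf                # exact match: no noise at all
    else:
        sqnr = 10.0 * math.log10(signal_power / noise_power) if signal_power > 0 else -math.inf
        psnr = 10.0 * math.log10(peak * peak / mse) if peak > 0 else -math.inf

    return {
        "sqnr": sqnr,                         # dB, higher is better
        "psnr": psnr,                         # dB, higher is better
        "cosine_similarity": cosine,
        "mse": mse,
        "max_abs_diff": max(abs(d) for d in diffs),
    }


sample = tensor_parity_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.5])
```

SQNR and PSNR are both log-scaled ratios of signal to error, so they remain informative across outputs with very different magnitudes, while MSE and max absolute diff are in the tensor's own units.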
Usage
winml eval --mode compare -m microsoft/resnet-50 --task image-classification --precision fp16 --samples 100
What's new
winml eval --mode {onnx,compare}: new Click option on winml eval. onnx (default) is the existing dataset-driven flow; compare activates the new evaluator.
TensorSimilarityEvaluator (tensor_similarity_evaluator.py): loads the HF reference on CPU/fp32 via resolve_task_and_model_class, draws inputs from RandomDataset over the candidate's ONNX I/O spec, runs both backends per sample, and aggregates per-output metrics.
TensorSimilarityMetric (tensor_similarity.py): stateful update/compute/reset metric mirroring MeanIoUMetric. Per-sample math is bit-equivalent to the team-wide eval_tensors reference library on the same .npy pair.
evaluate.py registers "compare-tensor", and get_evaluator_class routes to it when config.mode == "compare"; compare mode bypasses default-dataset resolution and the dataset section of print_config.
WinMLEvaluationConfig.mode: str = "onnx"; to_dict only emits mode when non-default.
Output shape
compute() returns a display-ready flat dict so the existing generic eval report renders without a custom renderer:
    {
        f"{metric}_{stat}": {output_name: float},  # 5 metrics × 4 stats = 20 keys
        ...
    }
Stats are mean / std / min / max. The renderer prints one row per {metric}_{stat} with output_name=value cells joined across outputs.
Notable design choices
TypeError with guidance to run compare per sub-component: there is no canonical "one HF reference" for the composite.
WinMLSession down-casts to the ORT graph's declared dtype on its side. The same input dict feeds both backends.
Tests
get_evaluator_class updated to take WinMLEvaluationConfig; "compare-tensor" exempted from the task-schema set check.
tests/e2e/test_eval_e2e.py::test_compare_mode_image_classification: full CLI run on microsoft/resnet-50 fp16; asserts the 20-key flat shape, per-output cosine bounds [-1, 1], and (QNN host only) cosine_similarity_mean >= 0.95.
53 unit tests pass; ruff clean.
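The shape checks described for the e2e test can be illustrated on a toy accumulator that follows the same update/compute/reset pattern and emits the flat {metric}_{stat} layout. This is a hedged sketch, not TensorSimilarityMetric itself: the class name `FlatStatsAccumulator`, the metric key names other than cosine_similarity, and the use of population standard deviation are assumptions made for illustration.

```python
import statistics
from collections import defaultdict

# Hypothetical key names; only "cosine_similarity" appears in the PR text.
METRICS = ("sqnr", "psnr", "cosine_similarity", "mse", "max_abs_diff")
REDUCERS = {
    "mean": statistics.fmean,
    "std": statistics.pstdev,  # assumption: population std, not sample std
    "min": min,
    "max": max,
}


class FlatStatsAccumulator:
    """Collects per-sample metrics and reduces them per output tensor."""

    def __init__(self) -> None:
        self.reset()

    def reset(self) -> None:
        # metric name -> output name -> list of per-sample values
        self._values = defaultdict(lambda: defaultdict(list))

    def update(self, output_name: str, sample_metrics: dict[str, float]) -> None:
        for metric in METRICS:
            self._values[metric][output_name].append(sample_metrics[metric])

    def compute(self) -> dict[str, dict[str, float]]:
        # 5 metrics x 4 stats = 20 top-level keys, each mapping output -> value.
        return {
            f"{metric}_{stat}": {
                name: float(fn(vals)) for name, vals in self._values[metric].items()
            }
            for metric in METRICS
            for stat, fn in REDUCERS.items()
        }


acc = FlatStatsAccumulator()
acc.update("logits", {"sqnr": 40.0, "psnr": 45.0, "cosine_similarity": 0.999,
                      "mse": 1e-4, "max_abs_diff": 0.01})
acc.update("logits", {"sqnr": 42.0, "psnr": 47.0, "cosine_similarity": 0.997,
                      "mse": 2e-4, "max_abs_diff": 0.02})
report = acc.compute()

# The same kind of checks the e2e test is described as making.
assert len(report) == 20
assert all(-1.0 <= v <= 1.0 for v in report["cosine_similarity_mean"].values())
```

Because every value in the result is already keyed by output name, a generic renderer only needs to iterate over the top-level keys and join the name=value pairs in each row.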
Out of scope (follow-ups)
--mode hf (run the HF pipeline on a labeled dataset as the reference for task-level metrics): placeholder removed from this PR; will land as a separate change.