Implement tensor similarity evaluator #805

Open
zhenchaoni wants to merge 1 commit into main from private/zhenni/tensor_eval

@zhenchaoni
Member

Fixes #804

Implement tensor similarity evaluator

Summary

Adds a new compare mode to winml eval that compares an ONNX candidate against its HF PyTorch reference on identical random inputs and reports per-output tensor-parity metrics (SQNR, PSNR, cosine similarity, MSE, max absolute diff). This isolates divergence introduced by the build pipeline (optimize / quantize / compile) from data- or pipeline-related differences — there is no labeled dataset, no HF pipeline, and no preprocessor in the loop.
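
As a rough illustration of the five per-output metrics named above, here is a minimal NumPy sketch. The function name `tensor_parity`, the dictionary keys other than `cosine_similarity`, the choice of the reference's largest magnitude as the PSNR peak, and the zero-norm conventions are all assumptions for illustration; the PR's `TensorSimilarityMetric` may define them differently.

```python
import math

import numpy as np


def _db(num: float, den: float) -> float:
    """Return 10*log10(num/den), handling zero power explicitly."""
    if den == 0.0:
        return math.inf  # no error at all: report infinite dB
    if num == 0.0:
        return -math.inf  # zero signal but non-zero error
    return 10.0 * math.log10(num / den)


def tensor_parity(ref, test) -> dict[str, float]:
    """Compare one reference/candidate output pair (hypothetical helper)."""
    ref = np.asarray(ref, dtype=np.float64).ravel()
    test = np.asarray(test, dtype=np.float64).ravel()
    if ref.shape != test.shape:
        raise ValueError(f"size mismatch: {ref.shape} vs {test.shape}")

    err = test - ref
    mse = float(np.mean(err**2))
    signal_power = float(np.mean(ref**2))
    peak = float(np.max(np.abs(ref)))  # assumption: peak taken from the reference

    norm_ref = float(np.linalg.norm(ref))
    norm_test = float(np.linalg.norm(test))
    if norm_ref == 0.0 and norm_test == 0.0:
        cosine = 1.0  # assumption: two all-zero tensors count as identical
    elif norm_ref == 0.0 or norm_test == 0.0:
        cosine = 0.0
    else:
        cosine = float(np.dot(ref, test) / (norm_ref * norm_test))

    return {
        "sqnr": _db(signal_power, mse),
        "psnr": _db(peak**2, mse),
        "cosine_similarity": cosine,
        "mse": mse,
        "max_abs_diff": float(np.max(np.abs(err))),
    }
```

Calling `tensor_parity(ref, candidate)` once per output per sample yields the per-sample values that the evaluator would then aggregate.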

Motivation

Task-level metrics (top-1, mIoU, BLEU, ...) tell us whether an optimized model still works, but not how much the optimize/quantize/compile passes perturbed the raw tensors. Tensor-similarity gives a fast, label-free, dataset-free signal for build-pipeline regressions and for picking quantization configs.

Usage

winml eval --mode compare -m microsoft/resnet-50 --task image-classification --precision fp16 --samples 100

What's new

  • winml eval --mode {onnx,compare} — new Click option on winml eval. onnx (default) is the existing dataset-driven flow; compare activates the new evaluator.
  • TensorSimilarityEvaluator (tensor_similarity_evaluator.py) — loads the HF reference on CPU/fp32 via resolve_task_and_model_class, draws inputs from RandomDataset over the candidate's ONNX I/O spec, runs both backends per sample, and aggregates per-output metrics.
  • TensorSimilarityMetric (tensor_similarity.py) — stateful update / compute / reset metric mirroring MeanIoUMetric. Per-sample math is bit-equivalent to the team-wide eval_tensors reference library on the same .npy pair.
  • Dispatch: evaluate.py registers "compare-tensor" and get_evaluator_class routes to it when config.mode == "compare"; compare mode bypasses default-dataset resolution and the dataset section of print_config.
  • Config: WinMLEvaluationConfig gains mode: str = "onnx"; to_dict only emits mode when non-default.
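
The config and dispatch behaviour described in the last two bullets could look roughly like the sketch below. It uses only the standard library and models the mode as a `str`-backed `Enum`. `EvalMode`, `EvalConfigSketch`, and `evaluator_key` are hypothetical names, and falling back to the task name as the registry key in `onnx` mode is an assumption; only the `"compare-tensor"` key, the `"onnx"` default, and the "emit mode only when non-default" rule come from the description above.

```python
from dataclasses import dataclass
from enum import Enum


class EvalMode(str, Enum):
    """Allowed values for --mode (hypothetical type, not the PR's)."""

    ONNX = "onnx"
    COMPARE = "compare"


@dataclass
class EvalConfigSketch:
    """Stand-in for WinMLEvaluationConfig with only the fields needed here."""

    task: str
    mode: EvalMode = EvalMode.ONNX

    def to_dict(self) -> dict[str, str]:
        data = {"task": self.task}
        # Emit mode only when it differs from the default, so existing
        # serialized configs stay unchanged.
        if self.mode is not EvalMode.ONNX:
            data["mode"] = self.mode.value
        return data


def evaluator_key(config: EvalConfigSketch) -> str:
    """Pick the registry key: compare mode overrides the task-based lookup."""
    if config.mode is EvalMode.COMPARE:
        return "compare-tensor"
    return config.task  # assumption: task-keyed registry in onnx mode
```

Because `EvalMode` subclasses `str`, `EvalMode("compare")` parses a raw CLI string and rejects unknown values with a `ValueError`.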

Output shape

compute() returns a display-ready flat dict so the existing generic eval report renders without a custom renderer:

{
    f"{metric}_{stat}": {output_name: float},  # 5 metrics × 4 stats = 20 keys
    ...
}

Stats are mean / std / min / max. The renderer prints one row per {metric}_{stat} with output_name=value cells joined across outputs.
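
To make the flat `{metric}_{stat}` layout concrete, here is a small standard-library sketch of a stateful update / compute / reset aggregator. `RunningTensorStats` is a hypothetical stand-in, not the PR's `TensorSimilarityMetric`; the use of population standard deviation and `RuntimeError` for the empty-state case are assumptions.

```python
import statistics
from collections import defaultdict


class RunningTensorStats:
    """Collect per-sample metric values and summarize them per output."""

    def __init__(self) -> None:
        # metric name -> output name -> list of per-sample values
        self._values: dict[str, dict[str, list[float]]] = defaultdict(
            lambda: defaultdict(list)
        )

    def update(self, output_name: str, sample_metrics: dict[str, float]) -> None:
        for metric, value in sample_metrics.items():
            self._values[metric][output_name].append(float(value))

    def compute(self) -> dict[str, dict[str, float]]:
        if not self._values:
            raise RuntimeError("compute() called before any update()")
        stats = {
            "mean": statistics.fmean,
            "std": statistics.pstdev,  # assumption: population std
            "min": min,
            "max": max,
        }
        return {
            f"{metric}_{stat}": {
                name: float(fn(vals)) for name, vals in per_output.items()
            }
            for metric, per_output in self._values.items()
            for stat, fn in stats.items()
        }

    def reset(self) -> None:
        self._values.clear()
```

With five metrics fed into `update`, `compute()` returns the 20 keys described above, each mapping output names to a single float.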

Notable design choices

  • Output-name overlap, not strict equality. ONNX and HF output sets can differ (HF often exposes auxiliary tensors). We compute on the intersection and warn on divergence rather than failing.
  • Composite-model guard. Multi-component models (e.g. BLIP) raise a TypeError with guidance to run compare per sub-component — there is no canonical "one HF reference" for the composite.
  • int dtype normalization. Narrow int tensors are upcast to int64 before inference so HF embeddings accept them; WinMLSession down-casts to the ORT graph's declared dtype on its side. The same input dict feeds both backends.
  • Architecture-agnostic. No model-specific names, layer patterns, or hardcoded outputs anywhere in metric or evaluator code.
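
The output-name overlap and int dtype normalization choices above can be sketched as two small helpers. Both function names are hypothetical, raising `ValueError` when nothing overlaps is an assumption, and treating "narrow" as any integer dtype under 8 bytes is an illustrative guess rather than the PR's actual rule.

```python
import warnings

import numpy as np


def shared_output_names(onnx_names: list[str], hf_names: list[str]) -> list[str]:
    """Return names present in both backends, warning when the sets differ."""
    hf = set(hf_names)
    shared = [name for name in onnx_names if name in hf]
    if not shared:
        # Assumption: an empty intersection is treated as an error.
        raise ValueError("ONNX and HF outputs have no names in common")
    unmatched = set(onnx_names) ^ hf
    if unmatched:
        warnings.warn(
            f"output sets differ; not compared: {sorted(unmatched)}",
            stacklevel=2,
        )
    return shared


def upcast_narrow_ints(inputs: dict[str, np.ndarray]) -> dict[str, np.ndarray]:
    """Upcast integer tensors narrower than 64 bits to int64; leave others as-is."""
    return {
        name: arr.astype(np.int64)
        if np.issubdtype(arr.dtype, np.integer) and arr.dtype.itemsize < 8
        else arr
        for name, arr in inputs.items()
    }
```

The same normalized input dict would then be passed to both backends, so any remaining difference comes from the models and not from the inputs.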

Tests

  • test_tensor_similarity_metric.py — 10 unit tests for the metric (numerics, identity, stat shape, reset, empty-state error).
  • test_tensor_similarity_evaluator.py — 4 unit tests (composite-model guard, output-name overlap, dispatch).
  • test_eval.py — get_evaluator_class updated to take WinMLEvaluationConfig; "compare-tensor" exempted from the task-schema set check.
  • tests/e2e/test_eval_e2e.py::test_compare_mode_image_classification — full CLI run on microsoft/resnet-50 fp16: asserts the 20-key flat shape, per-output cosine bounds [-1, 1], and (QNN host only) cosine_similarity_mean >= 0.95.

53 unit tests pass; ruff clean.

Out of scope (follow-ups)

  • --mode hf (run the HF pipeline on a labeled dataset as the reference for task-level metrics) — placeholder removed from this PR; will land as a separate change.
  • A custom renderer for compare mode (currently uses the generic table).

@zhenchaoni zhenchaoni requested a review from a team as a code owner June 3, 2026 05:30
dataset_script: str | None,
trust_remote_code: bool,
show_schema: bool,
mode: str,
Contributor

use a type instead of str

        return 1.0
    if norm_ref == 0.0 or norm_test == 0.0:
        return 0.0
    return float(np.dot(ref, test) / (norm_ref * norm_test))
Contributor

@vortex-captain vortex-captain Jun 3, 2026

Just curious: cosine similarity seems to be a robust measure when the mean/max of the absolute diff can be quite large due to numerical errors in both of the models being compared. On the other hand, the norms of the two vectors are lost in the similarity computation. Would it be possible to compare norms jointly with cosine similarity?
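
For illustration only, one way to look at direction and scale together is to report a norm ratio and a relative L2 error next to the cosine. This is a sketch of the idea raised in the question, not something the PR implements; `direction_and_scale` and its keys are hypothetical names.

```python
import math

import numpy as np


def direction_and_scale(ref, test) -> dict[str, float]:
    """Report cosine (direction) alongside two scale-sensitive measures."""
    ref = np.asarray(ref, dtype=np.float64).ravel()
    test = np.asarray(test, dtype=np.float64).ravel()
    norm_ref = float(np.linalg.norm(ref))
    norm_test = float(np.linalg.norm(test))
    if norm_ref == 0.0:
        # Ratios against a zero reference are undefined in this sketch.
        return {"cosine": math.nan, "norm_ratio": math.nan, "rel_l2_error": math.nan}
    cosine = (
        float(np.dot(ref, test) / (norm_ref * norm_test)) if norm_test > 0.0 else 0.0
    )
    return {
        "cosine": cosine,
        "norm_ratio": norm_test / norm_ref,  # 1.0 means equal magnitude
        "rel_l2_error": float(np.linalg.norm(test - ref)) / norm_ref,
    }
```

A candidate that is exactly twice the reference gives a cosine of about 1.0 but a norm ratio of 2.0 and a relative L2 error of 1.0, which is the case cosine alone cannot flag.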

Development

Successfully merging this pull request may close these issues.

feat: support tensor level comparison for model evaluation

3 participants