feat(eval): SA eval pipeline with per-stage perf, quantize step, and workflow HTML report by DingmaomaoBJTU · Pull Request #599 · microsoft/winml-cli

DingmaomaoBJTU · 2026-05-12T08:16:37Z

Summary

3-stage perf benchmarking: wmk perf runs after each of export, graph-optimize (normalize), SA-optimize, and QDQ-quantize so every stage's latency is captured and compared
Stage 6 — QDQ quantize: adds wmk quantize as a new eval stage after SA-optimize; output is quantized.onnx + quantized_perf.json
Workflow-first HTML report: table columns reorganized to match pipeline order — Export → Normalize → Pre SA → Flags → Optimized → Post SA → Quantize → Delta; each perf column shows gain% vs the previous stage (chain normalization)
--report-only flag: regenerates the HTML from existing per-model sa_eval_result.json files without re-running any eval stages
wmk quantize --model-name: new CLI option so task-aware calibration can load the correct HuggingFace tokenizer/processor
SA without rule data: downgraded from fatal error to warning when no EP results are found (missing parquet files)

New CLI flags

# Full eval with quantize
uv run python scripts/e2e_eval/run_sa_eval.py --output-dir sa_eval_results/2026-05-12 --use-cache --perf-iterations 10

# Skip quantize
uv run python scripts/e2e_eval/run_sa_eval.py ... --no-quantize

# Regenerate report only (no eval re-run)
uv run python scripts/e2e_eval/run_sa_eval.py --output-dir sa_eval_results/2026-05-12 --report-only
uv run python scripts/e2e_eval/sa_report.py sa_eval_results/2026-05-12

…rst HTML report - run_sa_eval.py: run wmk perf after export, graph-optimize, SA-optimize, and QDQ-quantize (stage 6); add --no-perf, --perf-iterations, --perf-warmup, --no-quantize, --quantize-precision, --quantize-samples, --report-only flags; fix output_dir to resolve to absolute path; downgrade empty SA classification from fatal error to warning - sa_comparison.py: warn instead of silently returning empty results when SA produces no EP results (missing parquet rule data) - sa_report.py: reorganize table columns in workflow order (Export → Normalize → Pre SA → Flags → Optimized → Post SA → Quantize → Delta); chain-normalize perf gain% against previous stage; add __main__ CLI entrypoint for report-only refresh - quantize.py: add --model-name CLI option so task-aware calibration can load the correct HuggingFace tokenizer/processor

…r each model

xieofxie · 2026-05-12T08:23:55Z

    help="Task for calibration dataset selection (e.g., 'image-classification').",
 )
+@click.option(
+    "--model-name",


I suggest use model-id? although conflicts with the parameter inside function, it have benefits

align with help

eval.py already uses it

Maybe we should update function to align cases like this should use model-id

xieofxie · 2026-05-12T08:24:18Z



 @click.command()
 @click.option(


could use cli_utils

+
+if __name__ == "__main__":
+    import argparse
+    import json


timenick · 2026-05-13T06:42:26Z

    help="Task for calibration dataset selection (e.g., 'image-classification').",
 )
+@click.option(
+    "--model-name",


The new flag on winml quantize is named --model-name, but the rest of the CLI surface uses --model-id for the HuggingFace model identifier (e.g. winml eval --model-id google/vit-base-patch16-224). The semantic role is identical — load tokenizer/processor for an HF model id — so the names should match. CLI flags are public API; this is hard to change after release.

Suggest renaming to --model-id for consistency with winml eval.

🤖 Generated with Claude Code

timenick · 2026-05-13T06:42:26Z

+        "--output",
+        str(output_json),
+    ]
+    ep_arg = _EP_TO_PERF_ARG.get(ep, ep.lower() if ep else None) if ep else None


ep_arg = _EP_TO_PERF_ARG.get(ep, ep.lower() if ep else None) if ep else None

Two issues:

The outer if ep else None makes the inner if ep else None dead code — it can't be reached when ep is falsy.

The ep.lower() fallback is wrong: if ep is an unmapped EP name (typo, or a new EP added to ORT but not _EP_TO_PERF_ARG), it passes "someexecutionprovider" to winml perf --ep, which will error out with a confusing message. Cleaner to fail-fast on unknown EPs:

if ep is None: ep_arg = None elif ep in _EP_TO_PERF_ARG: ep_arg = _EP_TO_PERF_ARG[ep] else: raise ValueError(f"Unknown EP {ep!r}; add to _EP_TO_PERF_ARG.")

Loud failure beats silently shipping a garbage --ep value to the perf subprocess.

🤖 Generated with Claude Code

timenick · 2026-05-13T06:42:26Z



+# Map full EP names to the short form accepted by `wmk perf --ep`
+_EP_TO_PERF_ARG: dict[str, str] = {


The mapping hardcodes ORT-symbol → short-form pairs. The winml perf --ep command has its own list of accepted short-form values. If anyone adds an EP (or renames a short form) in one place, the other side silently breaks. Possible mitigations:

Import the canonical mapping from the source where it's defined (if there is one), or

Add a smoke test in tests/cli/ that verifies every value in _EP_TO_PERF_ARG.values() is a valid winml perf --ep choice.

Per the project's EP canonical names: ORT symbol names look correct (QNNExecutionProvider, DmlExecutionProvider, NvTensorRTRTXExecutionProvider). No naming-convention violations.

🤖 Generated with Claude Code

timenick · 2026-05-13T06:42:26Z

    return result.returncode, (result.stderr or "").strip()[-500:]


+# Map full EP names to the short form accepted by `wmk perf --ep`


Comments and docstrings throughout this file refer to a wmk CLI that doesn't exist in this project — the actual subprocess invocations all use python -m winml.modelkit.cli, the entry point in pyproject.toml is winml, and there are zero other wmk references in the repo.

Affected locations in this PR:

L13 module docstring: Stage 1: wmk export + Python optimize_onnx (default)

L85-86 function name + docstring: def run_wmk_export(...) / """Run wmk export via subprocess..."""

L111 comment: Map full EP names to the short form accepted by wmk perf --ep

L133 docstring: Run wmk perf on onnx_path

L235 caller of run_wmk_export

L388 docstring: Run wmk compile ...

L474 docstring: Runs wmk quantize...

Suggest a search-and-replace wmk → winml for comments and docstrings, and renaming the helper run_wmk_export → run_winml_export (the new run_winml_perf already follows the right convention). Anyone copying a command out of a docstring should be able to actually run it.

🤖 Generated with Claude Code

- Default sort by perf gain descending (Unlocked models float to top) - Add perf gain summary cards: Avg Perf Gain, Faster Models, Unlocked count - Reorder summary cards to show perf gain metrics first - Unlocked badge: compact purple pill style '⚡ Unlocked · Xms' - Hide models without quantize perf from main table - Add footer showing quantized vs total complete model counts - Rename report title to 'WinML CLI Component Analysis Report' - Remove Regressed summary card

Replace 6 manual wmk stages with winml config + winml build: - Stage 1: winml config → build_config.json (export/quant/compile settings) - Stage 2: winml build → export.onnx, optimized.onnx, quantized.onnx, compiled.onnx, winml_build_config.json - Stage 3: SA pre-check on export.onnx (via ONNXStaticAnalyzer Python API) - Stage 4: SA post-check on optimized.onnx - Stage 5: EPContext diff on compiled.onnx (produced by build) Read SA optimization flags from winml_build_config.json['optim'] instead of computing them via SA API. Result schema is backward-compatible with sa_report.py (perf.graph_optimized=None, perf.sa_optimized=optimized.onnx). Add --no-compile flag; remove unused run_wmk_export helper.

zhenchaoni · 2026-05-13T08:30:33Z

    help="Task for calibration dataset selection (e.g., 'image-classification').",
 )
+@click.option(
+    "--model-name",


In the quantize command e2e test, I also need to add this option to perform e2e testing. Are you OK to undo this part in your PR and my PR will cover this.

#608

DingmaomaoBJTU requested a review from a team as a code owner May 12, 2026 08:16

feat(eval): add --cleanup flag to delete intermediate ONNX files afte…

0b899f3

…r each model

xieofxie reviewed May 12, 2026

View reviewed changes

Comment thread src/winml/modelkit/commands/quantize.py

@click.command()

@click.option(

Copy link
Copy Markdown

Contributor

xieofxie May 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could use cli_utils

github-advanced-security AI found potential problems May 12, 2026

View reviewed changes

Comment thread scripts/e2e_eval/sa_report.py

if __name__ == "__main__":

import argparse

import json

timenick reviewed May 13, 2026

View reviewed changes

DingmaomaoBJTU added 2 commits May 13, 2026 16:15

zhenchaoni reviewed May 13, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(eval): SA eval pipeline with per-stage perf, quantize step, and workflow HTML report#599

feat(eval): SA eval pipeline with per-stage perf, quantize step, and workflow HTML report#599
DingmaomaoBJTU wants to merge 4 commits into
mainfrom
qiowu/add_sa_eval

DingmaomaoBJTU commented May 12, 2026

Uh oh!

xieofxie May 12, 2026

Uh oh!

xieofxie May 12, 2026

Uh oh!

timenick May 13, 2026

Uh oh!

timenick May 13, 2026

Uh oh!

timenick May 13, 2026

Uh oh!

timenick May 13, 2026

Uh oh!

zhenchaoni May 13, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants



		# Map full EP names to the short form accepted by `wmk perf --ep`
		_EP_TO_PERF_ARG: dict[str, str] = {

		return result.returncode, (result.stderr or "").strip()[-500:]


		# Map full EP names to the short form accepted by `wmk perf --ep`

Conversation

DingmaomaoBJTU commented May 12, 2026

Summary

New CLI flags

Uh oh!

xieofxie May 12, 2026

Choose a reason for hiding this comment

Uh oh!

xieofxie May 12, 2026

Choose a reason for hiding this comment

Uh oh!

timenick May 13, 2026

Choose a reason for hiding this comment

Uh oh!

timenick May 13, 2026

Choose a reason for hiding this comment

Uh oh!

timenick May 13, 2026

Choose a reason for hiding this comment

Uh oh!

timenick May 13, 2026

Choose a reason for hiding this comment

Uh oh!

zhenchaoni May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

zhenchaoni May 13, 2026 •

edited

Loading