feat(eval): SA eval pipeline with per-stage perf, quantize step, and workflow HTML report #599
DingmaomaoBJTU wants to merge 4 commits into
…rst HTML report
- run_sa_eval.py: run wmk perf after export, graph-optimize, SA-optimize, and QDQ-quantize (stage 6); add --no-perf, --perf-iterations, --perf-warmup, --no-quantize, --quantize-precision, --quantize-samples, --report-only flags; fix output_dir to resolve to absolute path; downgrade empty SA classification from fatal error to warning
- sa_comparison.py: warn instead of silently returning empty results when SA produces no EP results (missing parquet rule data)
- sa_report.py: reorganize table columns in workflow order (Export → Normalize → Pre SA → Flags → Optimized → Post SA → Quantize → Delta); chain-normalize perf gain% against previous stage; add __main__ CLI entrypoint for report-only refresh
- quantize.py: add --model-name CLI option so task-aware calibration can load the correct HuggingFace tokenizer/processor
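The "chain-normalize perf gain% against previous stage" item for sa_report.py could be sketched roughly as follows. This is an illustration only, not the PR's code: the function name `chain_gains`, the stage keys in `STAGES`, and the gain formula (previous latency minus current latency, divided by previous latency) are all assumptions.

```python
from __future__ import annotations

from typing import Optional

# Hypothetical stage keys, loosely following the workflow order in the commit message.
STAGES = ["export", "normalize", "sa_optimized", "quantized"]


def chain_gains(latency_ms: dict[str, Optional[float]]) -> dict[str, Optional[float]]:
    """Return the perf gain % of each stage relative to the previous measured stage."""
    gains: dict[str, Optional[float]] = {}
    prev: Optional[float] = None
    for stage in STAGES:
        cur = latency_ms.get(stage)
        if cur is None or prev is None or prev <= 0:
            # No baseline yet, or this stage has no perf measurement.
            gains[stage] = None
        else:
            # Assumed definition: positive means faster than the previous stage.
            gains[stage] = (prev - cur) / prev * 100.0
        if cur is not None:
            # Later stages compare against the most recent measured stage.
            prev = cur
    return gains
```

In this sketch a stage without a measurement is skipped rather than treated as zero, so the next measured stage is compared against the last stage that actually has a latency.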
        help="Task for calibration dataset selection (e.g., 'image-classification').",
    )
    @click.option(
        "--model-name",
I suggest using `--model-id` instead. Although it conflicts with the parameter name inside the function, it has benefits:
- it aligns with the help text
- eval.py already uses it
Maybe we should also update the function so that cases like this consistently use model-id.
    @click.command()
    @click.option(

    if __name__ == "__main__":
        import argparse
        import json
        help="Task for calibration dataset selection (e.g., 'image-classification').",
    )
    @click.option(
        "--model-name",
The new flag on winml quantize is named --model-name, but the rest of the CLI surface uses --model-id for the HuggingFace model identifier (e.g. winml eval --model-id google/vit-base-patch16-224). The semantic role is identical — load tokenizer/processor for an HF model id — so the names should match. CLI flags are public API; this is hard to change after release.
Suggest renaming to --model-id for consistency with winml eval.
🤖 Generated with Claude Code
        "--output",
        str(output_json),
    ]
    ep_arg = _EP_TO_PERF_ARG.get(ep, ep.lower() if ep else None) if ep else None
`ep_arg = _EP_TO_PERF_ARG.get(ep, ep.lower() if ep else None) if ep else None`

Two issues:
- The outer `if ep else None` makes the inner `if ep else None` dead code: it can't be reached when `ep` is falsy.
- The `ep.lower()` fallback is wrong: if `ep` is an unmapped EP name (a typo, or a new EP added to ORT but not to `_EP_TO_PERF_ARG`), it passes `"someexecutionprovider"` to `winml perf --ep`, which will error out with a confusing message.

Cleaner to fail fast on unknown EPs:

    if ep is None:
        ep_arg = None
    elif ep in _EP_TO_PERF_ARG:
        ep_arg = _EP_TO_PERF_ARG[ep]
    else:
        raise ValueError(f"Unknown EP {ep!r}; add to _EP_TO_PERF_ARG.")

Loud failure beats silently shipping a garbage `--ep` value to the perf subprocess.
🤖 Generated with Claude Code
    # Map full EP names to the short form accepted by `wmk perf --ep`
    _EP_TO_PERF_ARG: dict[str, str] = {
The mapping hardcodes ORT-symbol → short-form pairs. The `winml perf --ep` command has its own list of accepted short-form values. If anyone adds an EP (or renames a short form) in one place, the other side silently breaks. Possible mitigations:
- Import the canonical mapping from the source where it's defined (if there is one), or
- Add a smoke test in `tests/cli/` that verifies every value in `_EP_TO_PERF_ARG.values()` is a valid `winml perf --ep` choice.

Per the project's EP canonical names: ORT symbol names look correct (`QNNExecutionProvider`, `DmlExecutionProvider`, `NvTensorRTRTXExecutionProvider`). No naming-convention violations.
🤖 Generated with Claude Code
    return result.returncode, (result.stderr or "").strip()[-500:]

    # Map full EP names to the short form accepted by `wmk perf --ep`
Comments and docstrings throughout this file refer to a `wmk` CLI that doesn't exist in this project: the actual subprocess invocations all use `python -m winml.modelkit.cli`, the entry point in `pyproject.toml` is `winml`, and there are zero other `wmk` references in the repo.
Affected locations in this PR:
- L13 module docstring: `Stage 1: wmk export + Python optimize_onnx (default)`
- L85-86 function name + docstring: `def run_wmk_export(...)` / `"""Run wmk export via subprocess..."""`
- L111 comment: `Map full EP names to the short form accepted by wmk perf --ep`
- L133 docstring: `Run wmk perf on onnx_path`
- L235 caller of `run_wmk_export`
- L388 docstring: `Run wmk compile ...`
- L474 docstring: `Runs wmk quantize...`

Suggest a search-and-replace `wmk` → `winml` for comments and docstrings, and renaming the helper `run_wmk_export` → `run_winml_export` (the new `run_winml_perf` already follows the right convention). Anyone copying a command out of a docstring should be able to actually run it.
🤖 Generated with Claude Code
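A small scan along these lines could help confirm that the rename is complete. It is only a sketch: the function name `find_stale_cli_refs` is hypothetical, and the pattern deliberately treats underscores as separators so that identifiers such as `run_wmk_export` are reported as well.

```python
from __future__ import annotations

import re

# Match "wmk" when it is not embedded in a longer alphanumeric word.
# Underscores count as separators, so snake_case identifiers are caught too.
STALE_CLI = re.compile(r"(?<![A-Za-z0-9])wmk(?![A-Za-z0-9])")


def find_stale_cli_refs(source: str) -> list[tuple[int, str]]:
    """Return (line number, stripped line) for every line that mentions wmk."""
    return [
        (number, line.strip())
        for number, line in enumerate(source.splitlines(), start=1)
        if STALE_CLI.search(line)
    ]
```

Run over the text of each changed file, an empty result would indicate that no comment, docstring, or helper name still refers to the old CLI name.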
- Default sort by perf gain descending (Unlocked models float to top)
- Add perf gain summary cards: Avg Perf Gain, Faster Models, Unlocked count
- Reorder summary cards to show perf gain metrics first
- Unlocked badge: compact purple pill style '⚡ Unlocked · Xms'
- Hide models without quantize perf from main table
- Add footer showing quantized vs total complete model counts
- Rename report title to 'WinML CLI Component Analysis Report'
- Remove Regressed summary card
Replace 6 manual wmk stages with winml config + winml build:
- Stage 1: winml config → build_config.json (export/quant/compile settings)
- Stage 2: winml build → export.onnx, optimized.onnx, quantized.onnx,
compiled.onnx, winml_build_config.json
- Stage 3: SA pre-check on export.onnx (via ONNXStaticAnalyzer Python API)
- Stage 4: SA post-check on optimized.onnx
- Stage 5: EPContext diff on compiled.onnx (produced by build)
Read SA optimization flags from winml_build_config.json['optim'] instead
of computing them via SA API. Result schema is backward-compatible with
sa_report.py (perf.graph_optimized=None, perf.sa_optimized=optimized.onnx).
Add --no-compile flag; remove unused run_wmk_export helper.
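Reading the SA optimization flags back from the build output, as this commit describes, might look roughly like the sketch below. The helper name `load_optim_flags` is hypothetical, and the code assumes that `winml_build_config.json` holds a JSON object whose `optim` entry is itself an object; the actual schema is not shown here.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Any


def load_optim_flags(build_dir: Path) -> dict[str, Any]:
    """Return the 'optim' section of winml_build_config.json, or {} if unavailable."""
    config_path = build_dir / "winml_build_config.json"
    if not config_path.is_file():
        # The build did not run or did not produce a config.
        return {}
    with config_path.open(encoding="utf-8") as handle:
        config = json.load(handle)
    # Assumed layout: a top-level object with an "optim" object inside it.
    optim = config.get("optim") if isinstance(config, dict) else None
    return optim if isinstance(optim, dict) else {}
```

Returning an empty dict for a missing file or section lets a report treat "no flags recorded" the same way in both cases, instead of failing the whole run.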
        help="Task for calibration dataset selection (e.g., 'image-classification').",
    )
    @click.option(
        "--model-name",
I also need to add this option for the quantize command's e2e test. Are you OK with reverting this part in your PR? My PR will cover it.
Summary
- `wmk perf` runs after each of export, graph-optimize (normalize), SA-optimize, and QDQ-quantize so every stage's latency is captured and compared
- `wmk quantize` as a new eval stage after SA-optimize; output is `quantized.onnx` + `quantized_perf.json`
- `--report-only` flag: regenerates the HTML from existing per-model `sa_eval_result.json` files without re-running any eval stages
- `wmk quantize --model-name`: new CLI option so task-aware calibration can load the correct HuggingFace tokenizer/processor

New CLI flags