Skip to content

feat(eval): SA eval pipeline with per-stage perf, quantize step, and workflow HTML report#599

Open
DingmaomaoBJTU wants to merge 4 commits into
mainfrom
qiowu/add_sa_eval
Open

feat(eval): SA eval pipeline with per-stage perf, quantize step, and workflow HTML report#599
DingmaomaoBJTU wants to merge 4 commits into
mainfrom
qiowu/add_sa_eval

Conversation

@DingmaomaoBJTU
Copy link
Copy Markdown
Collaborator

Summary

  • 3-stage perf benchmarking: wmk perf runs after each of export, graph-optimize (normalize), SA-optimize, and QDQ-quantize so every stage's latency is captured and compared
  • Stage 6 — QDQ quantize: adds wmk quantize as a new eval stage after SA-optimize; output is quantized.onnx + quantized_perf.json
  • Workflow-first HTML report: table columns reorganized to match pipeline order — Export → Normalize → Pre SA → Flags → Optimized → Post SA → Quantize → Delta; each perf column shows gain% vs the previous stage (chain normalization)
  • --report-only flag: regenerates the HTML from existing per-model sa_eval_result.json files without re-running any eval stages
  • wmk quantize --model-name: new CLI option so task-aware calibration can load the correct HuggingFace tokenizer/processor
  • SA without rule data: downgraded from fatal error to warning when no EP results are found (missing parquet files)

New CLI flags

# Full eval with quantize
uv run python scripts/e2e_eval/run_sa_eval.py --output-dir sa_eval_results/2026-05-12 --use-cache --perf-iterations 10

# Skip quantize
uv run python scripts/e2e_eval/run_sa_eval.py ... --no-quantize

# Regenerate report only (no eval re-run)
uv run python scripts/e2e_eval/run_sa_eval.py --output-dir sa_eval_results/2026-05-12 --report-only
uv run python scripts/e2e_eval/sa_report.py sa_eval_results/2026-05-12

…rst HTML report

- run_sa_eval.py: run wmk perf after export, graph-optimize, SA-optimize,
  and QDQ-quantize (stage 6); add --no-perf, --perf-iterations, --perf-warmup,
  --no-quantize, --quantize-precision, --quantize-samples, --report-only flags;
  fix output_dir to resolve to absolute path; downgrade empty SA classification
  from fatal error to warning
- sa_comparison.py: warn instead of silently returning empty results when SA
  produces no EP results (missing parquet rule data)
- sa_report.py: reorganize table columns in workflow order
  (Export → Normalize → Pre SA → Flags → Optimized → Post SA → Quantize → Delta);
  chain-normalize perf gain% against previous stage; add __main__ CLI entrypoint
  for report-only refresh
- quantize.py: add --model-name CLI option so task-aware calibration can load
  the correct HuggingFace tokenizer/processor
@DingmaomaoBJTU DingmaomaoBJTU requested a review from a team as a code owner May 12, 2026 08:16
help="Task for calibration dataset selection (e.g., 'image-classification').",
)
@click.option(
"--model-name",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest use model-id? although conflicts with the parameter inside function, it have benefits

  • align with help
  • eval.py already uses it

Maybe we should update function to align cases like this should use model-id



@click.command()
@click.option(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could use cli_utils


if __name__ == "__main__":
import argparse
import json
help="Task for calibration dataset selection (e.g., 'image-classification').",
)
@click.option(
"--model-name",
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new flag on winml quantize is named --model-name, but the rest of the CLI surface uses --model-id for the HuggingFace model identifier (e.g. winml eval --model-id google/vit-base-patch16-224). The semantic role is identical — load tokenizer/processor for an HF model id — so the names should match. CLI flags are public API; this is hard to change after release.

Suggest renaming to --model-id for consistency with winml eval.

🤖 Generated with Claude Code

"--output",
str(output_json),
]
ep_arg = _EP_TO_PERF_ARG.get(ep, ep.lower() if ep else None) if ep else None
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ep_arg = _EP_TO_PERF_ARG.get(ep, ep.lower() if ep else None) if ep else None

Two issues:

  • The outer if ep else None makes the inner if ep else None dead code — it can't be reached when ep is falsy.
  • The ep.lower() fallback is wrong: if ep is an unmapped EP name (typo, or a new EP added to ORT but not _EP_TO_PERF_ARG), it passes "someexecutionprovider" to winml perf --ep, which will error out with a confusing message. Cleaner to fail-fast on unknown EPs:
if ep is None:
    ep_arg = None
elif ep in _EP_TO_PERF_ARG:
    ep_arg = _EP_TO_PERF_ARG[ep]
else:
    raise ValueError(f"Unknown EP {ep!r}; add to _EP_TO_PERF_ARG.")

Loud failure beats silently shipping a garbage --ep value to the perf subprocess.

🤖 Generated with Claude Code



# Map full EP names to the short form accepted by `wmk perf --ep`
_EP_TO_PERF_ARG: dict[str, str] = {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The mapping hardcodes ORT-symbol → short-form pairs. The winml perf --ep command has its own list of accepted short-form values. If anyone adds an EP (or renames a short form) in one place, the other side silently breaks. Possible mitigations:

  • Import the canonical mapping from the source where it's defined (if there is one), or
  • Add a smoke test in tests/cli/ that verifies every value in _EP_TO_PERF_ARG.values() is a valid winml perf --ep choice.

Per the project's EP canonical names: ORT symbol names look correct (QNNExecutionProvider, DmlExecutionProvider, NvTensorRTRTXExecutionProvider). No naming-convention violations.

🤖 Generated with Claude Code

return result.returncode, (result.stderr or "").strip()[-500:]


# Map full EP names to the short form accepted by `wmk perf --ep`
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments and docstrings throughout this file refer to a wmk CLI that doesn't exist in this project — the actual subprocess invocations all use python -m winml.modelkit.cli, the entry point in pyproject.toml is winml, and there are zero other wmk references in the repo.

Affected locations in this PR:

  • L13 module docstring: Stage 1: wmk export + Python optimize_onnx (default)
  • L85-86 function name + docstring: def run_wmk_export(...) / """Run wmk export via subprocess..."""
  • L111 comment: Map full EP names to the short form accepted by wmk perf --ep
  • L133 docstring: Run wmk perf on onnx_path
  • L235 caller of run_wmk_export
  • L388 docstring: Run wmk compile ...
  • L474 docstring: Runs wmk quantize...

Suggest a search-and-replace wmkwinml for comments and docstrings, and renaming the helper run_wmk_exportrun_winml_export (the new run_winml_perf already follows the right convention). Anyone copying a command out of a docstring should be able to actually run it.

🤖 Generated with Claude Code

- Default sort by perf gain descending (Unlocked models float to top)
- Add perf gain summary cards: Avg Perf Gain, Faster Models, Unlocked count
- Reorder summary cards to show perf gain metrics first
- Unlocked badge: compact purple pill style '⚡ Unlocked · Xms'
- Hide models without quantize perf from main table
- Add footer showing quantized vs total complete model counts
- Rename report title to 'WinML CLI Component Analysis Report'
- Remove Regressed summary card
Replace 6 manual wmk stages with winml config + winml build:
- Stage 1: winml config → build_config.json (export/quant/compile settings)
- Stage 2: winml build → export.onnx, optimized.onnx, quantized.onnx,
                         compiled.onnx, winml_build_config.json
- Stage 3: SA pre-check on export.onnx (via ONNXStaticAnalyzer Python API)
- Stage 4: SA post-check on optimized.onnx
- Stage 5: EPContext diff on compiled.onnx (produced by build)

Read SA optimization flags from winml_build_config.json['optim'] instead
of computing them via SA API. Result schema is backward-compatible with
sa_report.py (perf.graph_optimized=None, perf.sa_optimized=optimized.onnx).

Add --no-compile flag; remove unused run_wmk_export helper.
help="Task for calibration dataset selection (e.g., 'image-classification').",
)
@click.option(
"--model-name",
Copy link
Copy Markdown
Member

@zhenchaoni zhenchaoni May 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the quantize command e2e test, I also need to add this option to perform e2e testing. Are you OK to undo this part in your PR and my PR will cover this.

#608

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants