Summary
winml perf currently dispatches by argument shape and runs two completely different benchmark pipelines for what users perceive as the same operation:
- winml perf -m hf/model → PerfBenchmark.run() (commands/perf.py:280), which calls WinMLAutoModel.from_pretrained(...). This runs the full AOT pipeline (export → optimize → quantize → compile), then benchmarks the compiled artifact.
- winml perf -m path/to/model.onnx → _run_onnx_benchmark(...) (commands/perf.py:952), which calls WinMLSession(onnx_path=...) directly. This bypasses the build pipeline entirely: it is effectively a raw ORT JIT load with whatever EP the device resolves to.
The branch is at commands/perf.py:1297 (is_onnx = model_path.suffix.lower() == ".onnx").
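To make the routing concrete, here is a minimal standalone sketch of the suffix check quoted above. The helper name pick_pipeline and its string return values are hypothetical and only illustrate the behavior; they are not taken from commands/perf.py.

```python
from pathlib import Path


def pick_pipeline(model_arg: str) -> str:
    """Hypothetical helper mirroring the suffix-based branch quoted above."""
    model_path = Path(model_arg)
    # Same expression as the dispatcher: only the file suffix is inspected.
    is_onnx = model_path.suffix.lower() == ".onnx"
    return "onnx" if is_onnx else "hf"


if __name__ == "__main__":
    for arg in ("microsoft/resnet-50", "path/to/model.onnx", "path/to/MODEL.ONNX"):
        print(arg, "->", pick_pipeline(arg))
```

Note that the decision depends only on the argument's suffix, so nothing about the file's contents or provenance influences which pipeline runs.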
Why this matters
- Apples-to-oranges numbers. If a user runs winml perf -m microsoft/resnet-50 and then winml perf -m <the_output.onnx>, the two latency numbers don't agree, even though the file was just produced by the first command. The first goes through quantize+compile, so the benchmarked artifact may not even be <the_output.onnx>; it's the compiled context model. The second loads the raw file as-is.
- Flag semantics silently change. Several CLI flags are HF-path only:
  - --no-quantize, --rebuild, --ignore-cache, --precision: silently no-op on the ONNX path.
  - --shape-config: ONNX path prints a warning and discards it (commands/perf.py:1301).
  The user has no way to ask "benchmark this ONNX after quantize+compile".
- Duplicated benchmark logic. The perf loop, hardware-monitor setup, model-info printing, and stats collection exist in both PerfBenchmark.run() and _run_onnx_benchmark() (with helpers _run_simple_loop / _run_monitored_loop). Any future change to the timing/stats code has to be made in two places.
- Dead code. PerfBenchmark._load_model at commands/perf.py:322-353 has a branch for is_onnx and model_path.exists() calling WinMLAutoModel.from_onnx(...). This branch is unreachable from the CLI because the outer dispatcher at commands/perf.py:1297 already routes .onnx files away.
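For the duplicated timing/stats logic noted above, one possible shape is a single loop that does not care how the model was loaded. The following is a rough sketch under assumptions, not the existing code: run_benchmark, its parameters, and the returned keys are hypothetical, and hardware monitoring is left out.

```python
import statistics
import time
from typing import Callable, Dict


def run_benchmark(infer: Callable[[], object], warmup: int = 3, iterations: int = 20) -> Dict[str, float]:
    """Time a zero-argument inference callable and return latency stats in ms.

    Hypothetical helper: the caller wraps whichever model object it loaded
    (HF-built or raw ONNX) in `infer`, so the loop itself has one implementation.
    """
    if iterations < 1:
        raise ValueError("iterations must be at least 1")

    # Warm-up runs are executed but not measured.
    for _ in range(warmup):
        infer()

    latencies_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        infer()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)

    return {
        "iterations": float(iterations),
        "mean_ms": statistics.fmean(latencies_ms),
        "median_ms": statistics.median(latencies_ms),
        "min_ms": min(latencies_ms),
        "max_ms": max(latencies_ms),
    }


if __name__ == "__main__":
    # Stand-in workload; a real caller would pass something like a session's run call.
    print(run_benchmark(lambda: sum(range(10_000)), warmup=2, iterations=5))
```

With a loop of this kind, both entry points would only differ in how they construct the callable, so a change to timing or stats would land in one place.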
Suggested direction
Unify the two paths around WinMLAutoModel, which already exposes both from_pretrained (HF) and from_onnx (ONNX) constructors. The CLI dispatcher decides which constructor to call; everything downstream (benchmark loop, monitor, stats, report) shares one implementation. Then:
- winml perf -m model.onnx (raw, no transforms) → today's default for .onnx paths, achieved via WinMLBuildConfig(export=None, optim=None, quant=None, compile=...) or equivalent.
- winml perf -m model.onnx --quantize ... --compile ... → opt into transforms on a pre-exported ONNX, producing meaningful numbers comparable to the HF flow.
- --shape-config / --precision / --no-quantize etc. mean the same thing on both paths (or error clearly when not applicable).
This also lets us delete _run_onnx_benchmark and the dead from_onnx branch in _load_model.
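As an illustration of the direction above, the dispatcher could reduce to choosing a constructor and a list of stages, with everything else shared. This is a sketch under stated assumptions: BuildPlan and plan_build are hypothetical names, the defaults encode one reading of the proposal (raw for .onnx, full pipeline for HF), and the code does not call the real WinMLAutoModel or WinMLBuildConfig.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional, Tuple


@dataclass(frozen=True)
class BuildPlan:
    """Hypothetical description of what the unified path would do."""
    constructor: str          # "from_onnx" or "from_pretrained"
    stages: Tuple[str, ...]   # pipeline stages to run before benchmarking


def plan_build(
    model_arg: str,
    quantize: Optional[bool] = None,
    compile_model: Optional[bool] = None,
) -> BuildPlan:
    """Pick a constructor and stages; None means 'use the default for this path'."""
    is_onnx = Path(model_arg).suffix.lower() == ".onnx"

    if is_onnx:
        # Assumed default: benchmark the file as-is unless transforms are requested.
        stages = []
        if quantize:
            stages.append("quantize")
        if compile_model:
            stages.append("compile")
        return BuildPlan("from_onnx", tuple(stages))

    # Assumed default for HF ids: the full pipeline, with opt-outs honored.
    stages = ["export", "optimize"]
    if quantize is not False:
        stages.append("quantize")
    if compile_model is not False:
        stages.append("compile")
    return BuildPlan("from_pretrained", tuple(stages))


if __name__ == "__main__":
    print(plan_build("model.onnx"))
    print(plan_build("model.onnx", quantize=True, compile_model=True))
    print(plan_build("microsoft/resnet-50", quantize=False))
```

The point of the sketch is that the same flag has the same meaning on both paths; only the default differs, and that difference is explicit in one function.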