update eval scripts: add ONNX size tracking and output sanitization #755
DingmaomaoBJTU wants to merge 8 commits into
DingmaomaoBJTU
left a comment
Nice additions - ONNX size tracking and output sanitization are useful for keeping eval_result.json lean. A few observations below.
DingmaomaoBJTU
left a comment
Useful additions - ONNX size tracking and output sanitization will make eval results much cleaner. A few suggestions below.
- Add _compute_onnx_size() to measure combined ONNX + .data file sizes
- Add _sanitize_output() to strip CLI chrome (Rich tables, banners) from eval_result.json, keeping only error-relevant content
- Pass onnx_size_bytes and sanitize_fn through to build_eval_result()
- Minor formatting fixes in reporter.py
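For readers who have not opened the diff, the size measurement described above could look roughly like the following. This is a minimal sketch, not the PR's code: the function name `compute_onnx_size` and the assumption that external weights sit next to the model as `<name>.onnx.data` are illustrative.

```python
from pathlib import Path


def compute_onnx_size(model_path: Path) -> int:
    """Return the size in bytes of the model plus its companion .data file, if any."""
    total = model_path.stat().st_size
    # Assumption: external weights live beside the model as "<name>.onnx.data".
    companion = model_path.with_name(model_path.name + ".data")
    if companion.is_file():
        total += companion.stat().st_size
    return total
```

The result is a plain integer, so it can be passed straight through as `onnx_size_bytes`.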
Capture hardware details (devices, EPs, backends) by running `winml sys --format json` and embedding the result under the `winml_sys` key in environment.json.
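A rough sketch of how that capture might be wired up, assuming the command prints a single JSON document on stdout. The helper names `capture_sys_info` and `add_sys_info`, the 30 second timeout, and the choice to return `None` on failure are all assumptions for illustration; only the command line and the `winml_sys` key come from the commit message.

```python
import json
import logging
import subprocess

logger = logging.getLogger(__name__)


def capture_sys_info(cmd=("winml", "sys", "--format", "json"), timeout=30.0):
    """Run the command and return its parsed JSON output, or None on any failure."""
    try:
        proc = subprocess.run(
            list(cmd), capture_output=True, text=True, timeout=timeout, check=True
        )
        return json.loads(proc.stdout)
    except (OSError, subprocess.SubprocessError, json.JSONDecodeError) as exc:
        # Missing binary, non-zero exit, timeout, or unparsable output.
        logger.debug("could not capture system info: %s", exc)
        return None


def add_sys_info(environment: dict, sys_info) -> dict:
    """Return a copy of the environment dict with sys_info under "winml_sys"."""
    merged = dict(environment)
    if sys_info is not None:
        merged["winml_sys"] = sys_info
    return merged
```

Keeping the failure path non-fatal means an eval run on a machine without the CLI still produces environment.json, just without the extra key.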
The sanitize_fn strips perf metric lines (latency, throughput, etc.) from stdout/stderr. Store the original output in raw_stdout/raw_stderr fields so downstream consumers can still access the full perf data.
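One way to picture the resulting record, as a sketch only: the `raw_stdout` and `raw_stderr` field names come from the commit message, while the helper name `build_output_fields` and the `stdout`/`stderr` keys for the sanitized text are assumptions.

```python
from typing import Callable, Optional


def build_output_fields(
    stdout: str,
    stderr: str,
    sanitize_fn: Optional[Callable[[str], str]] = None,
) -> dict:
    """Return output fields, keeping the unsanitized text alongside the cleaned text."""
    fields = {"stdout": stdout, "stderr": stderr}
    if sanitize_fn is not None:
        # Preserve the originals so perf data stripped by the sanitizer is not lost.
        fields["raw_stdout"] = stdout
        fields["raw_stderr"] = stderr
        fields["stdout"] = sanitize_fn(stdout)
        fields["stderr"] = sanitize_fn(stderr)
    return fields
```

The trade-off is size: storing both copies roughly doubles the output payload, which works against the goal of a lean eval_result.json.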
52fa381 to a7b8a0c
- Fix displaced docstring in generate_html_report (was after import)
- Anchor _sanitize_output patterns to line start to avoid stripping legitimate error messages containing pattern substrings
- Use idiomatic pathlib for .data companion file construction
a7b8a0c to b3e0cf9
    # Lines that carry no diagnostic value in eval_result.json.
    # Matching is case-insensitive, applied per-line.
    _NOISE_PATTERNS = (
        "benchmarking onnx",
This is a little strange. Is there any way we could add a parameter to the eval command to just drop them?
    )

    # Box-drawing characters used by Rich tables.
    _BOX_CHARS = frozenset("─│┌┐└┘├┤┬┴┼")
_BOX_CHARS only covers Unicode's LIGHT box-drawing style (─│┌┐└┘├┤┬┴┼), but winml perf uses Rich's default Table(), which renders with box.HEAVY_HEAD. I rendered a default Rich table locally and 3 of the 5 lines start with chars not in this set:
| Row | First char | In _BOX_CHARS? |
|---|---|---|
| top border `┏━━━━━┳━━━━━┓` | `┏` (U+250F) | ❌ |
| header row `┃ Avg ┃ P90 ┃` | `┃` (U+2503) | ❌ |
| head separator `┡━━━━━╇━━━━━┩` | `┡` (U+2521) | ❌ |
| data row `│ 12.5 │ 15.2 │` | `│` | ✅ |
| bottom border `└─────┴─────┘` | `└` | ✅ |
Net result: eval_result.json keeps the top border + header text + head separator while stripping data rows and the bottom border — arguably uglier than no sanitization at all (orphaned half-table chrome).
Cheap fix — use a Unicode block range check instead of a hand-curated set:
    if 0x2500 <= ord(stripped[0]) <= 0x257F:  # Unicode "Box Drawing" block
        continue

That covers all four Rich styles (light, heavy, double, rounded) in one rule and won't silently drift the next time someone changes the table style.
🤖 Generated with GitHub Copilot CLI
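To make the range check concrete, here is a small standalone sketch run against the five table lines from the comment above. The helper name `starts_with_box_char` is hypothetical; the range U+2500 to U+257F is the Unicode "Box Drawing" block.

```python
def starts_with_box_char(line: str) -> bool:
    """True when the first non-blank character is in the Unicode Box Drawing block."""
    stripped = line.strip()
    return bool(stripped) and 0x2500 <= ord(stripped[0]) <= 0x257F


# The five lines of the heavy-head table quoted in the review comment.
table = [
    "┏━━━━━┳━━━━━┓",
    "┃ Avg ┃ P90 ┃",
    "┡━━━━━╇━━━━━┩",
    "│ 12.5 │ 15.2 │",
    "└─────┴─────┘",
]

# Every table line is classified as chrome; an ordinary error line is not.
all_table_lines_match = all(starts_with_box_char(line) for line in table)
error_line_matches = starts_with_box_char("RuntimeError: x")
```

With this predicate the whole table is dropped together, so no orphaned header or border lines are left behind.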
    low = stripped.lower()
    if any(low.startswith(pat) for pat in _NOISE_PATTERNS):
        continue
    kept.append(stripped)
Appending stripped (post-.strip()) discards leading indentation, which destroys the structure of multi-line errors — the very content the docstring promises to preserve ("All classifier patterns are error-related and survive this filter"). For example, a traceback:
      File "foo.py", line 5, in bar
        raise RuntimeError("x")

becomes a visually-flat:

    File "foo.py", line 5, in bar
    raise RuntimeError("x")

which is noticeably harder to read.
Suggested fix — keep stripped only for the box/noise classifier checks, then append the original line (lightly trimmed):
    if not stripped:
        continue
    if 0x2500 <= ord(stripped[0]) <= 0x257F:
        continue
    low = stripped.lower()
    if any(low.startswith(pat) for pat in _NOISE_PATTERNS):
        continue
    kept.append(line.rstrip())

🤖 Generated with GitHub Copilot CLI
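Put together as a complete function, the suggested loop might look like the sketch below. The function name `sanitize_output`, its signature, and the trimmed pattern tuple are assumptions for illustration; the loop body follows the suggestion above.

```python
# Subset of the patterns shown in the diff, kept short for the example.
_NOISE_PATTERNS = ("benchmarking onnx", "latency (ms)", "throughput:", "results saved to")


def sanitize_output(text: str) -> str:
    """Drop blank lines, table chrome and noise lines; keep indentation of the rest."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if 0x2500 <= ord(stripped[0]) <= 0x257F:  # Unicode "Box Drawing" block
            continue
        low = stripped.lower()
        if any(low.startswith(pat) for pat in _NOISE_PATTERNS):
            continue
        # Classify on the stripped text, but keep the original leading whitespace.
        kept.append(line.rstrip())
    return "\n".join(kept)


sample = (
    "Benchmarking ONNX model\n"
    "Traceback (most recent call last):\n"
    '  File "foo.py", line 5, in bar\n'
    '    raise RuntimeError("x")\n'
    "│ 12.5 │ 15.2 │\n"
)
cleaned = sanitize_output(sample)
```

Here `cleaned` keeps the three traceback lines with their two- and four-space indentation, while the banner and the table row are removed.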
        "latency (ms)",
        "throughput:",
        "results saved to",
        "inputs:",
A couple of patterns in `_NOISE_PATTERNS` can swallow legitimate diagnostic content with the current `low.startswith(pat)` matching:
- `"inputs:"` / `"outputs:"`: these silently strip lines like `Inputs: expected (1, 224, 224, 3), got (1, 3, 224, 224)` (a real shape-mismatch error), which is exactly the kind of "error-relevant" content a sanitizer is supposed to keep. Same concern for `"device:"` if a downstream error ever reads something like `Device: <name> is not available, falling back to CPU`.
- `"samples/sec"`: dead pattern under `startswith` semantics. `Throughput: 80 samples/sec` is already covered by `"throughput:"` above; no `winml perf` line literally starts with `samples/sec`.

Tightening options (cheapest first):
- Just drop `inputs:` / `outputs:` / `samples/sec`. The remaining patterns are unambiguous CLI chrome.
- Switch to exact-prefix-with-boundary: only strip when the line is `pat` followed by space or end, e.g. `low == pat or low.startswith(pat + " ")`, so error lines that happen to start with `Inputs:` but carry non-trivial content survive.
🤖 Generated with GitHub Copilot CLI
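The two matching rules discussed above can be compared side by side with a short sketch. The helper names `matches_prefix` and `matches_with_boundary` are hypothetical, and the sample lines are the ones quoted in the comment.

```python
def matches_prefix(line: str, patterns) -> bool:
    """Current rule: strip when the lowered line starts with any pattern."""
    low = line.strip().lower()
    return any(low.startswith(pat) for pat in patterns)


def matches_with_boundary(line: str, patterns) -> bool:
    """Boundary rule: strip only when the pattern is the whole line or is followed by a space."""
    low = line.strip().lower()
    return any(low == pat or low.startswith(pat + " ") for pat in patterns)


error_line = "Inputs: expected (1, 224, 224, 3), got (1, 3, 224, 224)"
perf_line = "Throughput: 80 samples/sec"

# "inputs:" swallows the shape-mismatch error under the current rule.
error_stripped_today = matches_prefix(error_line, ("inputs:",))
# "samples/sec" never fires, because no line starts with it.
dead_pattern_fires = matches_prefix(perf_line, ("samples/sec",))
# Taken literally, the boundary rule still matches: "inputs:" is followed by a space.
error_stripped_with_boundary = matches_with_boundary(error_line, ("inputs:",))
```

One thing the sketch surfaces: with the boundary rule read literally, the `Inputs: expected ...` line is still stripped, so simply dropping `inputs:` / `outputs:` looks like the safer of the two options unless the boundary rule is tightened further.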
- Improve _compute_onnx_size to parse ONNX proto for all external data files instead of only checking the conventional .data suffix
- Add debug logging when winml sys times out or fails (replaces bare pass)
- Add --raw-output flag to skip output sanitization in eval_result.json
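The opt-out flag from the last item could be wired in along these lines. This is a sketch under assumptions: only the `--raw-output` flag name comes from the commit message, while the parser setup, the `prog` name, and the helper `pick_sanitizer` are illustrative.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the opt-out flag; prog name is a placeholder."""
    parser = argparse.ArgumentParser(prog="eval")
    parser.add_argument(
        "--raw-output",
        action="store_true",
        help="skip output sanitization in eval_result.json",
    )
    return parser


def pick_sanitizer(args: argparse.Namespace, sanitize_fn):
    """Return None when --raw-output is set so callers skip sanitization."""
    return None if args.raw_output else sanitize_fn
```

Passing `None` as the `sanitize_fn` keeps the decision in one place, so the result-building code does not need to know about the flag.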
Replaces per-line linear scan with a pre-compiled regex for better performance on large outputs.
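As an illustration of the idea, a single compiled pattern can cover both the box-drawing check and the noise prefixes. This is a sketch, not the committed code: the names `_NOISE_RE` and `is_noise` and the shortened pattern tuple are assumptions.

```python
import re

# Subset of the patterns shown in the diff, kept short for the example.
_NOISE_PATTERNS = ("benchmarking onnx", "latency (ms)", "throughput:", "results saved to")

# One alternation, anchored at the start of the line after optional whitespace:
# either a character from the Unicode Box Drawing block or one of the noise prefixes.
_NOISE_RE = re.compile(
    r"^\s*(?:[\u2500-\u257F]|"
    + "|".join(re.escape(pat) for pat in _NOISE_PATTERNS)
    + r")",
    re.IGNORECASE,
)


def is_noise(line: str) -> bool:
    """True when the line starts with table chrome or a known noise prefix."""
    return _NOISE_RE.match(line) is not None
```

Because the pattern is anchored at the line start, an error message that merely contains `throughput:` somewhere in the middle is not treated as noise.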
Summary
- Add `_compute_onnx_size()` to measure combined ONNX + `.data` file sizes and include `onnx_size_bytes` in eval results
- Add `_sanitize_output()` to strip CLI chrome (Rich tables, device/IO banners) from `eval_result.json`, keeping only error-relevant content
- Minor formatting fixes in `reporter.py`