feat: Add E2E eval pipeline for QNN NPU models by KayMKM · Pull Request #242 · microsoft/winml-cli

KayMKM · 2026-04-03T04:04:30Z

Add E2E eval pipeline for QNN NPU models

This pipeline automates end-to-end model evaluation on the self-hosted NPU agent, from model discovery through report generation and artifact publishing.

Pipeline definition: https://dev.azure.com/microsoft/windows.ai.toolkit/_build?definitionId=190174&_a=summary

Pipeline overview

The pipeline is manually triggered (trigger: none) with two parameters:

evalDate — target date for the eval run (defaults to today)
continueRun — skip models that already have results, enabling incremental/resumable runs

Jobs

1. Prepare — Sets up the Python environment (uv + Python 3.10), installs dependencies from Azure Artifacts, computes the eval output directory (c:/eval_results/{date}), generates the model list, and builds an ADO matrix for parallel-safe sequential execution.

2. EvalModel — Runs each model through run_eval.py one at a time (maxParallel: 1) using the matrix from Prepare. Individual model failures are logged as warnings but do not fail the pipeline, so remaining models continue to be evaluated.

3. Report — Runs unconditionally (condition: always()) after eval completes. Generates the evaluation report (JSON, text, markdown, HTML) via generate_report.py, then publishes the entire results directory as a downloadable pipeline artifact (EvalReport).
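The Prepare and EvalModel behavior described above can be sketched in Python as follows. This is a minimal illustration, not the pipeline's YAML or the actual run_eval.py code: the names eval_output_dir and run_all are hypothetical, and the ISO format used for the default date is an assumption.

```python
import datetime
from pathlib import PureWindowsPath
from typing import Callable, Dict, Iterable, Optional


def eval_output_dir(eval_date: Optional[str] = None) -> PureWindowsPath:
    """Return c:/eval_results/{date}; the date defaults to today.

    Assumption: dates are written in ISO format (YYYY-MM-DD).
    """
    date = eval_date or datetime.date.today().isoformat()
    return PureWindowsPath("c:/eval_results") / date


def run_all(models: Iterable[str], evaluate: Callable[[str], None]) -> Dict[str, str]:
    """Evaluate models one at a time and record the outcome of each.

    A failing model is logged as a warning and recorded as "failed";
    the loop keeps going so the remaining models are still evaluated.
    """
    outcomes: Dict[str, str] = {}
    for model in models:
        try:
            evaluate(model)
            outcomes[model] = "ok"
        except Exception as exc:
            print(f"WARNING: {model} failed: {exc}")
            outcomes[model] = "failed"
    return outcomes
```

Catching the exception per model is what keeps one crash from ending the run; the recorded outcomes are then available to whatever step produces the report.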

Key design decisions

Self-hosted agent (NPU-QNN) — required for NPU device access
No checkout in EvalModel/Report — reuses the venv and source from Prepare since all jobs run on the same agent
Incremental runs — --continue flag skips already-evaluated models, allowing the pipeline to be re-triggered to pick up where it left off
Non-blocking model failures — a single model crash doesn't block the rest of the eval
Artifact publishing — eval results are published via PublishPipelineArtifact@1 so anyone who triggered the pipeline can download them from the run summary
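The incremental-run decision can be illustrated with a short Python sketch. The on-disk layout is an assumption made for the example: each finished model is assumed to leave a result.json file in its own subdirectory of the results directory. The function names finished_models and pending_models are hypothetical and do not come from run_eval.py.

```python
from pathlib import Path
from typing import Iterable, List, Set


def finished_models(results_dir: Path) -> Set[str]:
    """Return the names of models that already have a result file.

    Assumption: a finished model leaves <results_dir>/<model>/result.json.
    """
    if not results_dir.is_dir():
        return set()
    return {p.name for p in results_dir.iterdir() if (p / "result.json").is_file()}


def pending_models(models: Iterable[str], done: Set[str], continue_run: bool) -> List[str]:
    """Return the models still to evaluate, preserving the input order.

    When continue_run is False every model is evaluated again;
    when it is True, models that already have results are skipped.
    """
    if not continue_run:
        return list(models)
    return [m for m in models if m not in done]
```

Re-triggering the pipeline with continueRun enabled would then amount to calling pending_models with the set returned by finished_models for the same eval date.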


DingmaomaoBJTU · 2026-04-03T05:40:49Z

Code review

Found 4 issues:

_clear_disk_caches() is called per-model in the evaluation loop, nuking the entire HuggingFace and WML cache after every model. The replaced _clean_model_hf_cache(entry.hf_id) was intentionally scoped to the current model's cache only. With --clean-cache, every subsequent model must re-download from scratch, dramatically increasing network usage and runtime.

https://github.com/microsoft/ModelKit/blob/5f7d875dbe77fc7da8f9c07e9feecc6f2eafabc1/scripts/e2e_eval/run_eval.py#L1269-L1273
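For illustration, a cleanup scoped to a single model might look like the sketch below. It is not the repository's _clean_model_hf_cache implementation; the function names are hypothetical, and the only fact relied on is the Hugging Face hub cache convention of storing each model under a directory named models--{org}--{name}.

```python
import shutil
from pathlib import Path


def model_cache_dir(cache_root: Path, hf_id: str) -> Path:
    """Map an id such as 'org/name' to '<cache_root>/models--org--name'."""
    return cache_root / ("models--" + hf_id.replace("/", "--"))


def clean_model_cache(cache_root: Path, hf_id: str) -> bool:
    """Delete only the given model's cache directory.

    Returns True if a directory was removed, False if there was nothing to delete.
    Other models' cache directories under cache_root are left untouched.
    """
    target = model_cache_dir(cache_root, hf_id)
    if target.is_dir():
        shutil.rmtree(target)
        return True
    return False
```

With this scoping, a model that shares the agent with later models does not force them to download their weights again.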

Model evaluation failures are now non-blocking (exit 0), and the report generation step does not check pass rates or exit non-zero if models fail. A regression where a previously-passing model now fails will be silently swallowed — the pipeline reports success regardless.

https://github.com/microsoft/ModelKit/blob/5f7d875dbe77fc7da8f9c07e9feecc6f2eafabc1/.pipelines/Modelkit%20E2E%20Test.yml#L153-L159
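If regression gating were wanted, one option would be for the report step to compare the current results against a list of models that passed previously. The sketch below is hypothetical: the baseline set, the results mapping, and both function names are assumptions for illustration and are not part of generate_report.py.

```python
from typing import Dict, List, Set


def find_regressions(results: Dict[str, bool], previously_passing: Set[str]) -> List[str]:
    """Return models that passed before but fail, or are missing, in this run."""
    return sorted(m for m in previously_passing if not results.get(m, False))


def report_exit_code(results: Dict[str, bool], previously_passing: Set[str]) -> int:
    """Return 1 when at least one previously passing model regressed, else 0."""
    return 1 if find_regressions(results, previously_passing) else 0
```

Models that have never passed do not affect the exit code here, so new or known-bad models can stay non-blocking while regressions still fail the run.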

Duplicate copyright header — the file now has two identical license blocks (lines 1–4 and lines 6–9). Already flagged in review.

https://github.com/microsoft/ModelKit/blob/5f7d875dbe77fc7da8f9c07e9feecc6f2eafabc1/scripts/e2e_eval/run_eval.py#L1-L10

Matrix key generation uses (hf_id + '_' + task) -replace '[^A-Za-z0-9]', '_' with no collision detection. Two models whose IDs and tasks differ only in special characters (e.g. foo/bar-baz vs foo/bar.baz with the same task) produce the same slug, and the second entry silently overwrites the first in the hashtable — dropping that model from evaluation with no warning.

https://github.com/microsoft/ModelKit/blob/5f7d875dbe77fc7da8f9c07e9feecc6f2eafabc1/.pipelines/Modelkit%20E2E%20Test.yml#L88-L94
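The collision can be reproduced, and guarded against, with a small Python translation of the slug rule. The pipeline itself builds the key in PowerShell; the names matrix_key and build_matrix are hypothetical, the task name text-classification is just an example value, and raising an error is only one possible guard, not necessarily the fix that landed.

```python
import re
from typing import Dict, Iterable, Tuple


def matrix_key(hf_id: str, task: str) -> str:
    """Replace every character outside A-Z, a-z, 0-9 with an underscore."""
    return re.sub(r"[^A-Za-z0-9]", "_", f"{hf_id}_{task}")


def build_matrix(entries: Iterable[Tuple[str, str]]) -> Dict[str, Tuple[str, str]]:
    """Build a key -> (hf_id, task) mapping, refusing to overwrite silently."""
    matrix: Dict[str, Tuple[str, str]] = {}
    for hf_id, task in entries:
        key = matrix_key(hf_id, task)
        if key in matrix and matrix[key] != (hf_id, task):
            raise ValueError(
                f"matrix key {key!r} is shared by {matrix[key]} and {(hf_id, task)}"
            )
        matrix[key] = (hf_id, task)
    return matrix
```

Both foo/bar-baz and foo/bar.baz map to the same key for a given task, so build_matrix raises instead of dropping the first entry. Appending a numeric suffix or a short hash of the original ID would be an alternative that keeps both models in the matrix.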

🤖 Generated with Claude Code

If this code review was useful, please react with 👍. Otherwise, react with 👎.

KayMKM · 2026-04-03T08:24:21Z

Issues 1 and 2 are by design; issues 3 and 4 are fixed.

KayMKM added 15 commits March 31, 2026 15:57

change error to warning

b6f8a94

refine

82cf2ea

Merge remote-tracking branch 'origin/main' into yuesu/fix_e2e_test_failure

4b6351d

add date variable

6f1963b

add --continue for list model

2f32a7d

update clean

8967738

update

088d82f

remove retry

c578b22

fix

1a11e28

clean .onnx.data file in temp

a85954f

Merge branch 'yuesu/fix_temp_clean' into yuesu/fix_e2e_test_failure

b73b584

refine job name

74656dd

change number to task

73baa30

add publish artifact

6d9d6c9

Merge remote-tracking branch 'origin/main' into yuesu/fix_e2e_test_failure

a246163

KayMKM requested a review from a team as a code owner April 3, 2026 04:04

DingmaomaoBJTU reviewed Apr 3, 2026

Comment thread scripts/e2e_eval/run_eval.py Outdated

github-advanced-security AI found potential problems Apr 3, 2026

Comment thread scripts/e2e_eval/run_eval.py Fixed

remove

5f7d875

update

973caf7

DingmaomaoBJTU approved these changes Apr 3, 2026

KayMKM added 2 commits April 3, 2026 16:30

Merge branch 'main' into yuesu/fix_e2e_test_failure

4a7c3d2

Merge branch 'main' into yuesu/fix_e2e_test_failure

145b236

KayMKM enabled auto-merge (squash) April 8, 2026 03:49

KayMKM merged commit 6116e2c into main Apr 8, 2026
8 checks passed

KayMKM deleted the yuesu/fix_e2e_test_failure branch April 8, 2026 03:52

KayMKM merged 19 commits into main from yuesu/fix_e2e_test_failure

3 participants
