feat: Add E2E eval pipeline for QNN NPU models #242

Merged
KayMKM merged 19 commits into main from yuesu/fix_e2e_test_failure on Apr 8, 2026

Conversation

@KayMKM
Contributor

@KayMKM KayMKM commented Apr 3, 2026

Add E2E eval pipeline for QNN NPU models

This pipeline automates end-to-end model evaluation on the self-hosted NPU agent, from model discovery through report generation and artifact publishing.

Pipeline definition: https://dev.azure.com/microsoft/windows.ai.toolkit/_build?definitionId=190174&_a=summary

Pipeline overview

The pipeline is manually triggered (trigger: none) with two parameters:

  • evalDate — target date for the eval run (defaults to today)
  • continueRun — skip models that already have results, enabling incremental/resumable runs

Jobs

1. Prepare — Sets up the Python environment (uv + Python 3.10), installs dependencies from Azure Artifacts, computes the eval output directory (c:/eval_results/{date}), generates the model list, and builds an ADO matrix for parallel-safe sequential execution.

2. EvalModel — Runs each model through run_eval.py one at a time (maxParallel: 1) using the matrix from Prepare. Individual model failures are logged as warnings but do not fail the pipeline, so remaining models continue to be evaluated.

3. Report — Runs unconditionally (condition: always()) after eval completes. Generates the evaluation report (JSON, text, markdown, HTML) via generate_report.py, then publishes the entire results directory as a downloadable pipeline artifact (EvalReport).

Key design decisions

  • Self-hosted agent (NPU-QNN) — required for NPU device access
  • No checkout in EvalModel/Report — reuses the venv and source from Prepare since all jobs run on the same agent
  • Incremental runs: the --continue flag skips already-evaluated models, so the pipeline can be re-triggered to pick up where it left off
  • Non-blocking model failures — a single model crash doesn't block the rest of the eval
  • Artifact publishing — eval results are published via PublishPipelineArtifact@1 so anyone who triggered the pipeline can download them from the run summary
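
To make the incremental-run and non-blocking-failure decisions concrete, here is a minimal Python sketch. It is not taken from run_eval.py: the function name `run_all`, the `evaluate` callback, and the one-JSON-file-per-model result layout are all assumptions made for illustration, and the real pipeline spreads this work across matrix jobs rather than a single loop.

```python
import json
from pathlib import Path
from typing import Callable, Iterable


def run_all(models: Iterable[str], out_dir: Path,
            evaluate: Callable[[str], dict], continue_run: bool = False) -> dict:
    """Evaluate models one at a time; skip finished ones and survive failures."""
    out_dir.mkdir(parents=True, exist_ok=True)
    summary = {"ran": [], "skipped": [], "failed": []}
    for model in models:
        # Assumed layout: one JSON result file per model in the output directory.
        result_file = out_dir / (model.replace("/", "__") + ".json")
        if continue_run and result_file.exists():
            summary["skipped"].append(model)  # already evaluated in an earlier run
            continue
        try:
            result = evaluate(model)
        except Exception as exc:
            # Log a warning and move on so one crash does not block the rest.
            print(f"WARNING: {model} failed: {exc}")
            summary["failed"].append(model)
            continue
        result_file.write_text(json.dumps(result))
        summary["ran"].append(model)
    return summary
```

Because a failed model leaves no result file in this sketch, re-triggering with `continue_run=True` retries the failures and skips only the models that completed.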

@KayMKM KayMKM requested a review from a team as a code owner April 3, 2026 04:04
Comment thread scripts/e2e_eval/run_eval.py Outdated
Comment thread scripts/e2e_eval/run_eval.py Fixed
@DingmaomaoBJTU
Collaborator

Code review

Found 4 issues:

  1. _clear_disk_caches() is called per-model in the evaluation loop, nuking the entire HuggingFace and WML cache after every model. The replaced _clean_model_hf_cache(entry.hf_id) was intentionally scoped to the current model's cache only. With --clean-cache, every subsequent model must re-download from scratch, dramatically increasing network usage and runtime.

https://github.com/microsoft/ModelKit/blob/5f7d875dbe77fc7da8f9c07e9feecc6f2eafabc1/scripts/e2e_eval/run_eval.py#L1269-L1273
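
For illustration, a cleanup scoped to one model could look roughly like the sketch below. The helper name `clean_model_cache` is hypothetical and this is not the body of `_clean_model_hf_cache`. It relies only on the Hugging Face hub cache convention of storing a model repo `org/name` in a folder named `models--org--name`, and it does not touch the WML cache.

```python
import shutil
from pathlib import Path


def clean_model_cache(hub_dir: Path, hf_id: str) -> bool:
    """Delete only the cache folder of one model; return True if it existed."""
    # Hub cache convention: repo "org/name" is stored under "models--org--name".
    target = hub_dir / ("models--" + hf_id.replace("/", "--"))
    if target.is_dir():
        shutil.rmtree(target)
        return True
    return False
```

Other models' folders under `hub_dir` are left in place, so later models in the same run would not need to download again.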

  2. Model evaluation failures are now non-blocking (exit 0), and the report generation step does not check pass rates or exit non-zero if models fail. A regression where a previously-passing model now fails will be silently swallowed: the pipeline reports success regardless.

https://github.com/microsoft/ModelKit/blob/5f7d875dbe77fc7da8f9c07e9feecc6f2eafabc1/.pipelines/Modelkit%20E2E%20Test.yml#L153-L159
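
If a gate on regressions were wanted, one possible shape is sketched below. Everything here is an assumption for illustration: the function name `gate_exit_code`, the `results` mapping of model name to pass/fail, and the `baseline` set of previously passing models are not part of generate_report.py.

```python
def gate_exit_code(results: dict[str, bool], baseline: set[str]) -> int:
    """Return 1 if any previously passing model now fails or is missing, else 0."""
    # A model absent from the results is treated as a failure.
    regressions = sorted(m for m in baseline if not results.get(m, False))
    for model in regressions:
        print(f"REGRESSION: {model}")
    return 1 if regressions else 0
```

A report step could pass this value to `sys.exit` so that new failures in models outside the baseline stay non-blocking while regressions fail the run.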

  3. Duplicate copyright header: the file now has two identical license blocks (lines 1–4 and lines 6–9). Already flagged in review.

https://github.com/microsoft/ModelKit/blob/5f7d875dbe77fc7da8f9c07e9feecc6f2eafabc1/scripts/e2e_eval/run_eval.py#L1-L10

  4. Matrix key generation uses (hf_id + '_' + task) -replace '[^A-Za-z0-9]', '_' with no collision detection. Two models whose IDs and tasks differ only in special characters (e.g. foo/bar-baz vs foo/bar.baz with the same task) produce the same slug, and the second entry silently overwrites the first in the hashtable, dropping that model from evaluation with no warning.

https://github.com/microsoft/ModelKit/blob/5f7d875dbe77fc7da8f9c07e9feecc6f2eafabc1/.pipelines/Modelkit%20E2E%20Test.yml#L88-L94
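
The collision can be reproduced and avoided in a few lines. The sketch below is a Python rendering of the same slug rule, not the pipeline's PowerShell and not necessarily the fix that was merged. The function name `build_matrix` and the numeric-suffix strategy are assumptions.

```python
import re


def build_matrix(entries: list[tuple[str, str]]) -> dict[str, dict[str, str]]:
    """Map each (hf_id, task) pair to a unique alphanumeric matrix key."""
    matrix: dict[str, dict[str, str]] = {}
    for hf_id, task in entries:
        # Same rule as the review quotes: non-alphanumerics become underscores.
        base = re.sub(r"[^A-Za-z0-9]", "_", f"{hf_id}_{task}")
        key, n = base, 2
        # On collision, append a counter instead of overwriting the earlier entry.
        while key in matrix:
            key = f"{base}_{n}"
            n += 1
        matrix[key] = {"hf_id": hf_id, "task": task}
    return matrix
```

With `foo/bar-baz` and `foo/bar.baz` on the same task, both entries survive under distinct keys. An alternative would be to raise an error on collision so the model list gets fixed at the source.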

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

@KayMKM
Contributor Author

KayMKM commented Apr 3, 2026

Issues 1 & 2 are by design; 3 & 4 are fixed.

@KayMKM KayMKM enabled auto-merge (squash) April 8, 2026 03:49
@KayMKM KayMKM merged commit 6116e2c into main Apr 8, 2026
8 checks passed
@KayMKM KayMKM deleted the yuesu/fix_e2e_test_failure branch April 8, 2026 03:52