ci: split scheduled pipelines into weekly Eval Report and daily E2E Test #756

Open
KayMKM wants to merge 21 commits into main from yuesu/refactor_e2e_pipeline

Contributor

@KayMKM KayMKM commented May 26, 2026

Summary

Refactor the previous Modelkit E2E Test pipeline (which actually runs the full model registry and produces a markdown report) into two distinct pipelines with different cadences and scopes.

Renamed (no behaviour change)

  • Modelkit E2E Test.yml → Modelkit Eval Report.yml
  • templates/e2e-eval-jobs.yml → templates/eval-report-jobs.yml
  • Stage displayNames: E2E Eval — {QNN, OV, AMD} → Eval Report — …
  • Continues to run weekly on Friday 08:00 (UTC+8) against the full model registry with sharding, --list-json, --continue, --retry-failed, and report generation.

New: Modelkit E2E Test.yml (daily scheduled)

  • Schedule: 0 16 * * * UTC = 00:00 UTC+8 every day, staggered 8 h away from the weekly Eval Report cron.

  • Three parallel stages (QNN / OV / AMD), each running on its dedicated self-hosted agent.

  • Two phases per stage, both gated by queue-time parameters so a one-off run can be trimmed easily:

    1. winml perf phase — runs winml perf once per (model × EP/device pair) against an inline models parameter. Default list covers one small representative model per supported task (P0 first, P1/P2 filling the remainder).
    2. pytest e2e phase — runs a configurable list of tests/e2e/test_<name>_e2e.py suites (default: all 11). Tests use require_ep() to self-skip when the target EP is absent, so the same list is safe to run on all three agents.
  • Each winml perf step uses condition: always() so every combination runs and the stage fails on any non-zero exit. No matrix sharding, no report generation.

  • Reuses the eval-report setup helpers (parquet copy, uv venv, PipAuthenticate, pip install -e .[dev]).
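The schedule arithmetic in the first bullet can be checked with a short sketch. It assumes a fixed UTC+8 offset with no daylight saving and uses an arbitrary sample week; the variable names are illustrative and do not come from the pipeline YAML.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC+8 offset (assumption: no daylight saving applies).
UTC8 = timezone(timedelta(hours=8))

# Daily cron '0 16 * * *' fires at 16:00 UTC; Thursday 2026-05-28 is a sample date.
daily_utc = datetime(2026, 5, 28, 16, 0, tzinfo=timezone.utc)
daily_local = daily_utc.astimezone(UTC8)

# Weekly Eval Report run: Friday 08:00 UTC+8, as stated in the summary.
weekly_local = datetime(2026, 5, 29, 8, 0, tzinfo=UTC8)

# Gap between the daily run and the weekly run on the same local Friday.
stagger = weekly_local - daily_local

print(daily_local.isoformat())  # local wall-clock time of the daily run
print(stagger)                  # distance to the weekly run
```

The 16:00 UTC trigger lands on local midnight of the following calendar day, which is why the Thursday UTC trigger lines up with the Friday weekly run.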

Why not PR-gating?

E2E runs on self-hosted hardware are too long and too flaky (driver / firmware variance) to gate every PR. The daily cadence surfaces regressions within ~24 h without blocking developer throughput. Per-PR validation continues to rely on the existing unit / integration suites.

Portal actions (not YAML-controllable)

  • Repoint the existing pipeline definition to Modelkit Eval Report.yml.
  • Create a new pipeline definition for Modelkit E2E Test.yml.
  • Do not add the new pipeline as a required branch-policy check on main — it is informational only.

Files

  • .pipelines/Modelkit Eval Report.yml (renamed)
  • .pipelines/templates/eval-report-jobs.yml (renamed)
  • .pipelines/Modelkit E2E Test.yml (new, daily)
  • .pipelines/templates/e2e-test-jobs.yml (new)

Refactor the previous 'Modelkit E2E Test' pipeline (which actually runs the full model registry and produces reports) into two pipelines with distinct purposes:

Renamed (no behavior change):
- 'Modelkit E2E Test.yml' -> 'Modelkit Eval Report.yml'
- 'templates/e2e-eval-jobs.yml' -> 'templates/eval-report-jobs.yml'
- Stage displayNames: 'E2E Eval -- {QNN,OV,AMD}' -> 'Eval Report -- ...'

New (PR-gating e2e test):
- 'Modelkit E2E Test.yml': pr trigger on main with drafts:false; three parallel stages (QNN/OV/AMD) running an inline 'models' parameter (prototype: facebook/convnext-tiny-224) across every EP/device pair on each agent.
- 'templates/e2e-test-jobs.yml': single job per agent; reuses the eval-report env setup (parquet copy, uv venv, PipAuthenticate, install -e .[dev]); one 'winml perf' step per (model x pair) with condition: always() so all combinations run and the job fails on any non-zero exit. No matrix sharding, --list-json, --continue, --retry-failed, or report generation.
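
The "run every combination, fail at the end" behaviour described for the `condition: always()` steps can be sketched in plain Python. This is only an illustration of the control flow: `run_step` is a hypothetical callable standing in for one `winml perf` step, and the model and pair names are placeholders, not entries from the pipeline.

```python
from __future__ import annotations

from itertools import product
from typing import Callable


def run_all(
    models: list[str],
    pairs: list[str],
    run_step: Callable[[str, str], int],
) -> list[tuple[str, str, int]]:
    """Run every (model, pair) combination and collect non-zero exit codes."""
    failures = []
    for model, pair in product(models, pairs):
        code = run_step(model, pair)  # keep going even after a failure
        if code != 0:
            failures.append((model, pair, code))
    return failures


def overall_exit_code(failures: list[tuple[str, str, int]]) -> int:
    """The job fails if any single combination failed."""
    return 1 if failures else 0


calls = []


def fake_step(model: str, pair: str) -> int:
    # Hypothetical runner: pretend everything on "ep-b/npu" fails.
    calls.append((model, pair))
    return 1 if pair == "ep-b/npu" else 0


failures = run_all(["model-x", "model-y"], ["ep-a/cpu", "ep-b/npu"], fake_step)
exit_code = overall_exit_code(failures)
```

All four combinations are attempted even though the second one fails, which mirrors what `condition: always()` is meant to achieve across separate pipeline steps.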

Portal actions still required (not YAML-controllable):
- Repoint existing pipeline definition to 'Modelkit Eval Report.yml'.
- Create new pipeline definition for the new 'Modelkit E2E Test.yml'.
- Enable 'Automatically cancel existing validation builds for previous iterations of a pull request' on the new pipeline.
- During the prototype phase, do NOT add the new pipeline as a required branch-policy check on main -- failures show red on PR but do not block merge.
@KayMKM KayMKM marked this pull request as ready for review May 26, 2026 09:39
@KayMKM KayMKM requested a review from a team as a code owner May 26, 2026 09:39
@KayMKM KayMKM marked this pull request as draft May 26, 2026 09:40
KayMKM added 4 commits May 27, 2026 14:48
Convert the new "Modelkit E2E Test" pipeline from a PR gate to a daily
scheduled run, and broaden its scope from winml perf only to winml perf
plus a configurable list of pytest e2e suites.

Pipeline (.pipelines/Modelkit E2E Test.yml):
- Drop the `pr:` trigger.
- Add daily schedule cron '0 16 * * *' (00:00 Beijing time, branch
  main, always: true), staggered 8h from the weekly Eval Report cron.
- Add `runEval` (boolean, default true) so the winml perf phase can be
  toggled off from the queue UI.
- Add `pytestTargets` (object, default = all 11 e2e files: analyze,
  inspect, build, compile, config, export, optimize, quantize, sys,
  perf, eval). Edit at queue time to do a minimal run; empty list
  skips the pytest phase.
- Add `pytestTimeout` (number, default 1000) forwarded to pytest
  --timeout.
- All 3 stages (QNN/OV/AMD) forward the new params into the template.

Template (.pipelines/templates/e2e-test-jobs.yml):
- Bump `timeoutInMinutes` 60 -> 360 to accommodate both phases.
- Wrap the existing per-(model x pair) winml perf loop in
  `${{ if eq(parameters.runEval, true) }}`.
- Replace per-model failure log prefix "E2E test" with "Eval" to
  disambiguate from pytest e2e steps.
- Add a `${{ each target in parameters.pytestTargets }}` loop that
  runs `uv run --no-sync python -m pytest tests/e2e/test_<name>_e2e.py
  -m e2e --timeout=<pytestTimeout> --junitxml=...` with
  `condition: always()`. Tests use `require_ep()` to self-skip on
  irrelevant EPs, so it is safe to run all of them on every agent.
- Append a `PublishTestResults@2` task (`condition: always()`, JUnit
  format, `mergeTestResults: true`, `failTaskOnFailedTests: false`)
  so junit XMLs surface in the ADO Tests tab without becoming a
  second source of failure on top of the pytest step itself.
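
How the queue-time parameters shape a run can be sketched as a small planning function. This is a rough model of the template expansion, not the template itself: the step labels and the `junit-<name>.xml` file name are assumptions (the commit message elides the real `--junitxml` value), while the pytest command line follows the one quoted above.

```python
from __future__ import annotations


def pytest_command(name: str, timeout: int, junit_path: str) -> list[str]:
    """Build the pytest invocation for one e2e suite."""
    return [
        "uv", "run", "--no-sync", "python", "-m", "pytest",
        f"tests/e2e/test_{name}_e2e.py",
        "-m", "e2e",
        f"--timeout={timeout}",
        f"--junitxml={junit_path}",  # path is a placeholder, see lead-in
    ]


def plan_steps(
    run_eval: bool,
    models: list[str],
    pairs: list[str],
    pytest_targets: list[str],
    pytest_timeout: int = 1000,
) -> list[str]:
    """List the steps one stage would run for the given parameters."""
    steps = []
    if run_eval:  # mirrors the `if eq(parameters.runEval, true)` wrapper
        steps += [f"winml perf: {m} on {p}" for m in models for p in pairs]
    for name in pytest_targets:  # an empty list yields no pytest steps
        cmd = pytest_command(name, pytest_timeout, f"junit-{name}.xml")
        steps.append(" ".join(cmd))
    return steps


full_run = plan_steps(True, ["model-x"], ["ep-a/cpu", "ep-b/npu"], ["perf"])
pytest_only = plan_steps(False, ["model-x"], ["ep-a/cpu"], ["config", "sys"])
nothing = plan_steps(False, ["model-x"], ["ep-a/cpu"], [])
```

Setting `runEval` to false and trimming `pytestTargets` gives the minimal one-off run the description mentions; with both switched off the stage has nothing left to execute beyond setup.
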
Replace the single facebook/convnext-tiny-224 seed with 13 curated
(model, task) entries covering the 13 tasks in hub_models.json.

Selection rules:
- optimum_supported == true (run_eval needs ORT export)
- P0 priority preferred; P1/P2 used to fill tasks where no P0 exists
- Within a task, prefer smaller / canonical / well-downloaded models
- Avoid niche or personal fine-tunes

Final 13 rows (image-classification keeps the original convnext-tiny seed):
  image-classification           facebook/convnext-tiny-224
  feature-extraction             openai/clip-vit-base-patch32
  zero-shot-classification       openai/clip-vit-base-patch32
  zero-shot-image-classification openai/clip-vit-base-patch32
  object-detection               facebook/detr-resnet-50
  fill-mask                      google-bert/bert-base-multilingual-cased
  masked-lm                      google-bert/bert-base-multilingual-cased
  depth-estimation               Intel/dpt-hybrid-midas
  image-feature-extraction       facebook/dinov2-small
  question-answering             deepset/roberta-base-squad2
  sentence-similarity            sentence-transformers/all-MiniLM-L6-v2
  text-classification            cross-encoder/ms-marco-MiniLM-L4-v2
  token-classification           dslim/bert-base-NER
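
The selection rules can be approximated by a small filter, shown here as a sketch only. Apart from `optimum_supported`, every field name (`task`, `model`, `priority`, `size_mb`, `downloads`) is an assumption about the shape of hub_models.json, the sample entries are placeholders, and the "avoid niche or personal fine-tunes" rule is a manual judgment that the code does not attempt to encode.

```python
from __future__ import annotations

PRIORITY_RANK = {"P0": 0, "P1": 1, "P2": 2}


def _rank(entry: dict) -> tuple:
    # Lower is better: priority first, then smaller size, then more downloads.
    return (PRIORITY_RANK[entry["priority"]], entry["size_mb"], -entry["downloads"])


def pick_one_per_task(entries: list[dict]) -> dict[str, str]:
    """Pick the best-ranked exportable model for each task."""
    chosen: dict[str, dict] = {}
    for entry in entries:
        if not entry["optimum_supported"]:
            continue  # rule 1: the model must be exportable
        task = entry["task"]
        if task not in chosen or _rank(entry) < _rank(chosen[task]):
            chosen[task] = entry
    return {task: entry["model"] for task, entry in chosen.items()}


# Placeholder entries, not real registry rows.
entries = [
    {"task": "task-a", "model": "org/big-p0", "priority": "P0",
     "size_mb": 400, "downloads": 10, "optimum_supported": True},
    {"task": "task-a", "model": "org/small-p0", "priority": "P0",
     "size_mb": 100, "downloads": 5, "optimum_supported": True},
    {"task": "task-a", "model": "org/tiny-unsupported", "priority": "P0",
     "size_mb": 10, "downloads": 99, "optimum_supported": False},
    {"task": "task-b", "model": "org/only-p1", "priority": "P1",
     "size_mb": 50, "downloads": 1, "optimum_supported": True},
]
picked = pick_one_per_task(entries)
```

The tie-break order (size before downloads) is one possible reading of "smaller / canonical / well-downloaded"; the actual list above was curated by hand.
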
Collaborator

@DingmaomaoBJTU DingmaomaoBJTU left a comment

Nice refactoring — splitting scheduled eval-report from a fast PR-gating e2e test is a clear win. Well-documented YAML. A few suggestions below.

Comment thread .pipelines/templates/e2e-test-jobs.yml
Comment thread .pipelines/Modelkit E2E Test.yml
Comment thread .pipelines/templates/e2e-test-jobs.yml
@KayMKM KayMKM changed the title ci: split eval-report pipeline from PR-gating e2e test ci: split scheduled pipelines into weekly Eval Report and daily E2E Test Jun 1, 2026
KayMKM added 7 commits June 1, 2026 14:54
The autouse fixture in test_config_e2e.py patched winml.modelkit.sysinfo.resolve_device (re-export), but resolve_check_device_ep in sysinfo/device.py calls its module-local resolve_device, so the mock was a no-op on that path. EPs not installed on the host (qnn, vitisai, migraphx, nv_tensorrt_rtx) hit the real availability check and failed.

Patch winml.modelkit.sysinfo.device.resolve_device as well, and make the mock EP-aware by returning the EP's supported devices from EP_SUPPORTED_DEVICES.
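
The re-export pitfall can be reproduced in isolation. The sketch below builds two stand-in modules rather than importing winml: the function bodies, the `EP_SUPPORTED_DEVICES` contents, and the single-argument signature are all assumptions made for illustration.

```python
import types
from unittest import mock

# Stand-in for sysinfo/device.py; bodies are illustrative, not the real code.
device = types.ModuleType("device")
exec(
    "def resolve_device(ep):\n"
    "    raise RuntimeError(f'{ep} is not installed on this host')\n"
    "\n"
    "def resolve_check_device_ep(ep):\n"
    "    # Looks up resolve_device in this module's globals at call time.\n"
    "    return resolve_device(ep)\n",
    device.__dict__,
)

# Stand-in for the sysinfo package, which re-exports the same function object.
sysinfo = types.ModuleType("sysinfo")
sysinfo.resolve_device = device.resolve_device

# Hypothetical mapping; the real contents are not shown in this PR.
EP_SUPPORTED_DEVICES = {"qnn": ["npu"], "vitisai": ["npu"]}


def fake_resolve_device(ep):
    """EP-aware mock: return the devices the EP is declared to support."""
    return EP_SUPPORTED_DEVICES[ep]


def call_with_patch(target_module):
    """Patch resolve_device on one module, then call through device.py."""
    with mock.patch.object(
        target_module, "resolve_device", side_effect=fake_resolve_device
    ):
        try:
            return device.resolve_check_device_ep("qnn")
        except RuntimeError:
            return "real check ran"


patched_reexport = call_with_patch(sysinfo)  # mock is bypassed
patched_local = call_with_patch(device)      # mock is used
```

Patching the re-exported name only rebinds an attribute on the package; the caller in the device module still resolves the name through its own globals, so the patch has to target the module where the lookup happens.
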
- name: models
  displayName: 'Models to test (inline list, used by runEval)'
  type: object
  default:
Contributor Author


@DingmaomaoBJTU could you help review the model list? I picked one model per task and tried to choose P0 models.

@KayMKM KayMKM marked this pull request as ready for review June 2, 2026 07:51
Comment thread .pipelines/templates/e2e-test-jobs.yml
Comment thread .pipelines/templates/e2e-test-jobs.yml
Comment thread .pipelines/Modelkit E2E Test.yml
Comment thread .pipelines/templates/e2e-test-jobs.yml
Comment thread tests/e2e/test_config_e2e.py Outdated
KayMKM added 2 commits June 2, 2026 17:07
The autouse fixture in test_config_e2e.py patched winml.modelkit.sysinfo.resolve_device (re-export), but resolve_check_device_ep in sysinfo/device.py calls its module-local resolve_device, so the mock was a no-op on that path. EPs not installed on the host (qnn, vitisai, migraphx, nv_tensorrt_rtx) hit the real availability check and failed.

Patch winml.modelkit.sysinfo.device.resolve_device as well, and make the mock EP-aware by returning the EP's supported devices from EP_SUPPORTED_DEVICES.
KayMKM added 4 commits June 3, 2026 10:32
…tion

TestPerfHuggingFace.test_benchmark_ep_gpu and test_benchmark_ep_npu now pass require_utilization=False, matching the existing exemption used by test_benchmark_gpu_monitor, test_benchmark_npu_monitor, and test_benchmark_ep_device_{gpu,npu}.

PDH GPU/NPU engine counters are not bumped reliably by every EP for short runs (e.g. OpenVINO on Intel iGPU routes compute via its own path, bypassing DXGI command queues that PDH samples). The structural checks (section present, device_kind, adapter_luid) still run; only the strict mean_pct > 0 check is dropped.
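
The split between structural checks and the strict utilization check might look roughly like the helper below. It is a hypothetical reconstruction: only the `require_utilization` flag and the field names `device_kind`, `adapter_luid`, and `mean_pct` come from the commit message, while the function name and the dict-based section are assumptions.

```python
from __future__ import annotations

from typing import Optional


def check_utilization_section(
    section: Optional[dict], require_utilization: bool = True
) -> list[str]:
    """Return a list of problems found in one utilization section."""
    if section is None:
        return ["section missing"]
    problems = []
    # Structural checks always run.
    for field in ("device_kind", "adapter_luid"):
        if field not in section:
            problems.append(f"missing {field}")
    # Strict check: only enforced when the caller asks for it.
    if require_utilization and not section.get("mean_pct", 0) > 0:
        problems.append("mean_pct must be > 0")
    return problems


# Sample section where the counters never moved during a short run.
idle_section = {"device_kind": "gpu", "adapter_luid": "0x1", "mean_pct": 0.0}

strict = check_utilization_section(idle_section)
relaxed = check_utilization_section(idle_section, require_utilization=False)
```

With `require_utilization=False` a run whose counters stayed at zero still passes as long as the section is well-formed, which matches the exemption the commit describes.
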
Comment out cross-encoder/nli-deberta-v3-small (zero-shot-classification) and facebook/detr-resnet-50 (object-detection): neither passes on every supported EP yet, so they are out of scope for this round of the Modelkit E2E Test pipeline.

Also comment out the entire P1/P2 block (depth-estimation, image-feature-extraction, question-answering, sentence-similarity, text-classification, token-classification) — not in scope for this round; re-enable as coverage expands.
3 participants