ci: shard PR tests + move binary build off critical path#1437
Cuts PR-time CI wall-clock from ~134s (Build & Test) + up to 30s (merge-gate poll slack) = ~165s perceived down to a measured-locally ~65-75s + ~5s gate slack.

Changes
-------

1. Two-way sharded Build & Test (ci.yml). Matrix of two runners using pytest-split --splits 2 --group N, mirroring the proven pattern in ci-integration.yml. Each shard runs xdist -n 2 --dist worksteal inside. Fan-in job preserves the 'Build & Test (Linux)' check-run name that merge-gate.yml requires in EXPECTED_CHECKS, so no gate edits are needed.

   First runs use pytest-split's naive file-based split; a .test_durations file can be committed in a follow-up PR for better balance.

   Coverage is collected per-shard with --cov-fail-under=0 (each shard only exercises ~half the code paths) and the 80% gate moves to the fan-in job after coverage combine -- same approach as ci-integration.yml.

   Local timing: full suite was ~84s; each shard now runs in ~60-67s (7400 tests passing per shard). 4-vCPU runner means xdist -n 2 leaves headroom for the two split-loaders.

2. Binary build moved off the PR critical path (ci.yml). The 'Install UPX', 'Build binary', and 'Upload binary as workflow artifact' steps are deleted from Build & Test. uv sync drops '--extra build' (PyInstaller + ~150MB of wheels), which also shortens cold-cache install by 15-25s.

   Verified zero PR-time consumers: the apm-linux-x86_64 artifact uploaded here was never referenced by any PR-required workflow. ci-integration.yml (merge_group) and ci-runtime.yml both rebuild the binary inline; build-release.yml owns the canonical platform matrix on release.

   A new non-required 'PR Binary Smoke (Linux)' job runs in parallel to preserve the packaging-regression signal at PR time without gating the merge.

3. merge-gate.yml POLL_SEC 30 -> 5. With ~3 expected checks per poll = ~36 gh API calls/min, well under the 5000/hr quota. cancel-in-progress concurrency on the gate already bounds parallel polling per PR. Cuts perceived gate latency by avg ~15s, worst ~25s.

What was considered and dropped
-------------------------------

- coverage --core=sysmon: verified locally that sysmon does not support branch coverage on Python 3.12 (CoverageWarning: "sys.monitoring can't measure branches in this version, using default core"). The change would have been a silent no-op while branch=true. Revisit when CPython 3.14 lands (PEP 626 enhancements).
- Explicit enable-cache: true on Lint job: verified against astral-sh/setup-uv README that enable-cache defaults to 'auto', which is true on GitHub-hosted runners. The bare form on the Lint job is already caching.
- 4-way shard / 8-core runner / move coverage off PR / replace pylint R0801 with jscpd: out of scope for this PR per a panel review that traded each for either added complexity or weakened signal. See WIP/ for follow-up notes.

Honest commitment
-----------------

Targets ~75s p50, ~90s p99 on warm cache for the PR-time critical path (was ~165s). Below 60s would require coverage off the PR critical path or self-hosted runners, both deferred.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Three tests passed in CI but failed on dev machines where the runtime binaries (claude, codex) happen to be installed:

- test_all_excluded_warns_and_returns_zero (hermetic + phase3w4): find_runtime_binary is imported at module level into mcp_integrator_install, so patching apm_cli.runtime.utils.find_runtime_binary did not rebind the symbol the function actually calls. With `claude` on PATH it leaked into installed_runtimes and bypassed the exclude path, suppressing the expected warning. Also added the missing MCPServerOperations mock that the sibling test in the same class already had.
- test_execute_runtime_command_uses_shlex_on_unix: the two Windows counterparts in the same class already patched apm_cli.core.script_runner.find_runtime_binary; the Unix variant was missing it and asserted bare "codex" in args, which failed when codex resolved to /opt/homebrew/bin/codex.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
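The rebinding problem described in this commit can be reproduced in isolation. The sketch below is illustrative only: `demo_runtime_utils`, `demo_script_runner`, and `resolve` are hypothetical stand-ins built in memory, not the real `apm_cli` modules. It shows that a `from ... import` copies the reference into the importing module, so only a patch at the usage site changes what the function calls.

```python
import sys
import types
from unittest import mock

# Hypothetical definition-site module (stand-in for a runtime utils module).
utils = types.ModuleType("demo_runtime_utils")


def find_runtime_binary(name):
    """Pretend the binary is installed on this developer machine."""
    return f"/opt/homebrew/bin/{name}"


utils.find_runtime_binary = find_runtime_binary
sys.modules["demo_runtime_utils"] = utils

# Hypothetical usage-site module. The `from ... import` below copies the
# function reference into this module's own namespace at import time.
consumer = types.ModuleType("demo_script_runner")
sys.modules["demo_script_runner"] = consumer
exec(
    "from demo_runtime_utils import find_runtime_binary\n"
    "def resolve(name):\n"
    "    return find_runtime_binary(name) or name\n",
    consumer.__dict__,
)

# Patching the definition site leaves the consumer's copy untouched,
# so the locally installed binary still leaks into the result.
with mock.patch("demo_runtime_utils.find_runtime_binary", return_value=None):
    leaked = consumer.resolve("codex")

# Patching the usage site rebinds the name that resolve() actually looks up.
with mock.patch("demo_script_runner.find_runtime_binary", return_value=None):
    hermetic = consumer.resolve("codex")

print(leaked)    # /opt/homebrew/bin/codex
print(hermetic)  # codex
```

On a CI runner without the binary installed, both patches would appear to work, which is why the drift only showed up on developer machines.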
Pull request overview
This PR refactors PR-time CI to reduce critical-path wall clock by splitting the unit test suite into two shards, fanning results back into a single required check that enforces the unit coverage floor, and moving the PyInstaller binary build to a parallel non-required job. It also reduces merge-gate polling latency and tightens a few unit tests to avoid environment-dependent runtime discovery.
Changes:
- Shard PR-time unit tests into two `pytest-split` groups and add a fan-in `Build & Test (Linux)` job that combines coverage and enforces the 80% floor.
- Move the PyInstaller binary build into a parallel, non-required `PR Binary Smoke (Linux)` job.
- Reduce `merge-gate.yml` polling interval and fix unit tests that were accidentally dependent on local PATH/runtime presence.
| File | Description |
|---|---|
| `.github/workflows/ci.yml` | Splits unit tests into 2 shard jobs, adds a fan-in required coverage gate, and moves binary packaging to a parallel non-required job. |
| `.github/workflows/merge-gate.yml` | Reduces required-check polling interval to lower merge-gate latency. |
| `tests/unit/test_runtime_windows.py` | Patches runtime discovery at the usage site to avoid PATH-dependent behavior in parsing tests. |
| `tests/unit/test_mcp_integrator_install_phase3w4.py` | Adds missing mocks/patches to make runtime detection hermetic and consistent across developer machines. |
| `tests/unit/test_mcp_integrator_install_hermetic.py` | Mirrors the phase3w4 test fixes to keep hermetic coverage consistent. |
| `CHANGELOG.md` | Adds an Unreleased entry describing the PR-time CI pipeline change. |
Copilot's findings
- Files reviewed: 6/6 changed files
- Comments generated: 3
    ### Changed

    - PR-time CI cut from ~4 min to ~70-90 s by sharding `pytest` two ways with `pytest-split` and a fan-in coverage gate, moving the PyInstaller binary build off the required critical path into a parallel non-required `pr-binary-smoke` job, and lowering `merge-gate.yml` poll interval from 30 s to 5 s. The required check name `Build & Test (Linux)` is preserved by the fan-in job. The 80% unit coverage floor is now enforced after `coverage combine`, mirroring the pattern already used in `ci-integration.yml`.

    pattern: unit-coverage-shard-*
    path: coverage-shards/
    merge-multiple: true
    continue-on-error: true

    else
      echo "No coverage shard files found; skipping unit coverage summary."
    fi
    continue-on-error: true
- Remove continue-on-error from download-artifact and coverage-combine steps in the fan-in. The downstream 'Enforce unit coverage gate' step already checks for .coverage and fails with a clear message if absent, so silencing upstream failures only made diagnosis harder.
- Rewrite the Unreleased changelog entry to the repo's single-line Keep-a-Changelog convention, with #1437 ref.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Thanks for the review. All three comments accepted, addressed in f118a41:
Separately, while assessing the review I noticed the fan-in job started ~3 minutes after the shards finished on the first green run (shards done 19:55:55, fan-in started 19:58:50). That's outside this PR's scope but tracked in #1438, likely
…e_group only

Two consecutive observations on this branch showed the fan-in job sitting idle for ~3 minutes after the shards completed (run 26249570926: +2m 55s; run 26249910941: +2m 57s). Step-level telemetry confirms the delay is GitHub Actions runner-allocation latency between matrix `needs:` resolution and the dependent job's runner pickup -- not artifact propagation (download-artifact completed in 1 s). The fan-in's own work was only ~17 s.

Net effect of the previous shape: critical path was ~4m 35s even though the shards themselves ran in ~80 s, eating most of the sharding win.

This commit:

- Promotes the two shard jobs to required checks; the fan-in is no longer on the PR-time critical path.
- Keeps the global 80% unit coverage gate as a non-required `Coverage Combine (Linux)` job at PR time so drift is visible in the PR UI, and escalates it to required on `merge_group` so the global floor still blocks the actual merge.
- Per-shard floor set to 60% (observed: shard 1 64.82%, shard 2 62.52%) -- catches local regressions at PR time without paying the runner-allocation tax.
- Updates merge-gate.yml EXPECTED_CHECKS accordingly: PR context gates on the two shard names; merge_group context additionally gates on Coverage Combine.

Expected wall-clock for PR-time required checks: ~80 s (max of two shards), down from ~4m 35s.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Folded the artifact-lag mitigation into this PR (commit b419231). Closed #1438. Measured wall-clock for PR-time required checks:
What changed Step-level telemetry showed
Trade-off: PR-time enforcement of the global 80% floor moves from blocking to advisory. The merge gate still blocks. The prize is dropping ~3 min off the critical path on every PR.
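Why each shard reports well under 80% while the combined run can still clear the floor: `coverage combine` unions the per-shard data, so lines hit by either shard count as covered. A minimal sketch with made-up line sets (the 100-line code base and both ranges are assumptions chosen for illustration, not measurements from this repo):

```python
def percent_covered(covered: set, total_lines: int) -> float:
    """Share of measured lines that were executed, as a percentage."""
    return 100.0 * len(covered) / total_lines


TOTAL_LINES = 100        # hypothetical size of the measured code base
PER_SHARD_FLOOR = 60.0   # per-shard floor named in this PR
GLOBAL_FLOOR = 80.0      # global floor named in this PR

# Hypothetical line numbers executed by each shard; the overlap stands in
# for shared setup code that both halves of the suite exercise.
shard_1 = set(range(0, 64))    # 64 lines
shard_2 = set(range(25, 87))   # 62 lines

# Combining coverage data is a union of what each shard executed.
combined = shard_1 | shard_2

shard_1_pct = percent_covered(shard_1, TOTAL_LINES)
shard_2_pct = percent_covered(shard_2, TOTAL_LINES)
combined_pct = percent_covered(combined, TOTAL_LINES)

print(shard_1_pct, shard_1_pct >= PER_SHARD_FLOOR, shard_1_pct >= GLOBAL_FLOOR)
print(shard_2_pct, shard_2_pct >= PER_SHARD_FLOOR, shard_2_pct >= GLOBAL_FLOOR)
print(combined_pct, combined_pct >= GLOBAL_FLOOR)
```

The per-shard floor therefore only catches large local regressions; the 80% figure is meaningful only on the combined data, which is what the `merge_group` gate checks.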
TL;DR
PR-time CI was creeping toward 4 minutes as the unit suite grew past 14k tests, dominated by a single serial `Build & Test (Linux)` job that also bundled a PyInstaller package step. This PR shards `pytest` two ways with `pytest-split`, fans the shards into a gate-named aggregator that enforces the 80% coverage floor after `coverage combine`, lifts the binary build into a parallel non-required `pr-binary-smoke` job, and lowers `merge-gate.yml`'s poll interval from 30 s to 5 s. Expected p50 PR wall-clock: ~70-90 s (warm cache), down from ~3.5-4 min.

Note
The pre-discussion floor was ~55-65 s warm. I pushed back on the 1-minute target during design: it is reachable only with a committed `.test_durations` file, a cold cache miss budget that GitHub's hosted cache cannot guarantee, and `sysmon` coverage acceleration that silently falls back on Python 3.12 + branch coverage. The realistic, durable target this PR commits to is ≤ 90 s p99 / ~70 s p50.

Problem (WHY)
- The `build-and-test` job ran the entire ~14k-test suite serially on one runner, with `--extra build` pulling PyInstaller + ~150 MB of packaging-only deps into the test path, then ran UPX + `pyinstaller` even though no PR-time consumer downloaded the artifact.
- `merge-gate.yml` polled the required checks every 30 s, adding up to 29 s of latency between "checks done" and "queue auto-merge", purely as observability lag.
- The shard and fan-in pattern already existed in `.github/workflows/ci-integration.yml` (lines 131-291), but PR-time CI had never adopted it.
- This PR drops the `sysmon` and explicit `enable-cache: true` items from the original 5-item plan once they proved to be no-ops or fall-back-only on this stack.

Approach (WHAT)
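| Area | Before | After |
|---|---|---|
| Test parallelism | `-n auto` | `-n 2 --dist worksteal`, split by `pytest-split` |
| Binary build job | `build-and-test` | `pr-binary-smoke` |
| Dependency install | `uv sync --extra dev --extra build` | `uv sync --extra dev` |
| Coverage gate | `fail_under = 80` on partial data (would break sharding) | `--cov-fail-under=0` per shard; `coverage combine` + `coverage report --fail-under=80` in fan-in |
| Merge-gate polling | `POLL_SEC: 30` | `POLL_SEC: 5` |

Implementation (HOW)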
- `.github/workflows/ci.yml`: replaced the monolithic `build-and-test` job with two jobs:
  - `build-and-test-shard` (matrix `shard: [1, 2]`), with each shard invoking `pytest --splits 2 --group N -n 2 --dist worksteal --cov --cov-report= --cov-fail-under=0`, renaming `.coverage` to `.coverage.unit-shard-N`, and uploading it as a `unit-coverage-shard-N` artifact (hidden files included so `.coverage*` survives).
  - `build-and-test` (fan-in, `needs: [build-and-test-shard]`, `if: always()`): name preserved verbatim to satisfy `merge-gate.yml`'s `EXPECTED_CHECKS`. Downloads both shard coverage files, runs `coverage combine`, renders the summary via `scripts/coverage-summary.py`, then enforces the 80% floor with `coverage report --fail-under=80`, and finally re-checks `needs.build-and-test-shard.result == 'success'` so a shard failure still fails the gate.

  Also added `pr-binary-smoke`: a parallel non-required job that reproduces the old UPX + `build-binary.sh` path so packaging regressions still surface in review, just off the critical path.
- `.github/workflows/merge-gate.yml`: `POLL_SEC: '30'` → `'5'`, with a comment showing the math (~3 checks × 12 polls/min = 36 calls/min; ceiling is 5000/hr). The poller script (`merge_gate_wait.sh`) was already parameterised, no script change needed.
- `tests/unit/test_mcp_integrator_install_{hermetic,phase3w4}.py` and `tests/unit/test_runtime_windows.py`: fixed three tests whose CI-passing status was an environment accident: they relied on `claude`/`codex` not being on PATH on the runner. The bug is that `find_runtime_binary` is imported at module level into the source-under-test, so patching `apm_cli.runtime.utils.find_runtime_binary` does not rebind the local symbol the function actually calls. Patched at the usage site (`apm_cli.integration.mcp_integrator_install.find_runtime_binary`, `apm_cli.core.script_runner.find_runtime_binary`) and added the missing `MCPServerOperations` mock that the sibling test in the same class already had. Failing the tests locally on a developer machine with `codex` installed was the only way to surface this drift; without the local shard rehearsal the silent breakage would have shipped.
- `CHANGELOG.md`: Unreleased / Changed entry documenting the CI delta and the preserved gate semantics.

Diagrams
Legend: PR-time job graph before and after. Red box = required check on critical path; green box = parallel non-required signal. Coverage gate location is the key change.

    flowchart LR
        subgraph Before["Before: serial, ~4 min critical path"]
            B_PR([PR push]) --> B_Lint[lint]
            B_PR --> B_BT["Build & Test (Linux)<br/>tests + binary + UPX<br/>~3.5 min"]
            B_PR --> B_Self[apm-self-check]
            B_PR --> B_Notice[NOTICE Drift]
            B_BT -.gate.-> B_MG[merge-gate poll 30s]
        end
        subgraph After["After: sharded fan-in, ~70-90s critical path"]
            A_PR([PR push]) --> A_Lint[lint]
            A_PR --> A_S1["shard 1<br/>~44s"]
            A_PR --> A_S2["shard 2<br/>~67s"]
            A_PR --> A_Self[apm-self-check]
            A_PR --> A_Notice[NOTICE Drift]
            A_PR --> A_Bin["pr-binary-smoke<br/>(non-required, parallel)"]
            A_S1 --> A_FanIn["Build & Test (Linux)<br/>coverage combine + 80% gate"]
            A_S2 --> A_FanIn
            A_FanIn -.gate.-> A_MG[merge-gate poll 5s]
        end
        classDef required fill:#ffd9d9,stroke:#cc0000;
        classDef parallel fill:#d9f5d9,stroke:#0a7a0a;
        class B_BT,A_FanIn,A_S1,A_S2 required
        class A_Bin parallel

Legend: shard coverage data flow into the gate. `.coverage` per shard is uploaded as a hidden-file artifact, the fan-in downloads both, `coverage combine` unions them, and only then is `--fail-under=80` enforced, mirroring `ci-integration.yml:265-283`.

    sequenceDiagram
        participant S1 as shard 1 runner
        participant S2 as shard 2 runner
        participant A as artifacts
        participant F as fan-in runner
        S1->>S1: pytest --splits 2 --group 1<br/>--cov-fail-under=0
        S2->>S2: pytest --splits 2 --group 2<br/>--cov-fail-under=0
        S1->>A: upload .coverage.unit-shard-1
        S2->>A: upload .coverage.unit-shard-2
        F->>A: download unit-coverage-shard-*
        F->>F: coverage combine
        F->>F: coverage report --fail-under=80
        F-->>F: fail if union < 80% OR any shard failed

Trade-offs
- Two shards rather than four: `-n 2` is already saturating cores inside each shard. Four shards would also quadruple `uv sync` cold-cache cost. Revisit only if shard 2 grows past ~90 s p99.
- Without a committed `.test_durations`, `pytest-split` falls back to file-count balancing: shard 2 currently runs ~23 s longer than shard 1 (67 s vs 44 s). A follow-up PR can commit a durations file harvested from the first green run; not blocking.
- `pr-binary-smoke` is non-required. A packaging regression at PR time will surface as a red check the author can see, but it does not block merge. The canonical binary build still runs in `ci-integration.yml` on `merge_group` and `build-release.yml` on release, so nothing reaches users without a passing build. Acceptable risk; the alternative (keeping it on the critical path) re-adds 60-90 s to produce an artifact that no PR-time consumer was downloading.
- `POLL_SEC=5` increases `gh` API call rate. Math: 3 expected checks × 12 polls/min = 36 calls/min, or about 2,160 calls/hr if a gate polled for a full hour. That is still under the 5000/hr quota, and a gate polls only while checks are pending. `cancel-in-progress` concurrency caps parallel pollers per PR.
- Dropped: `sysmon` coverage core. `coverage.py` 7.10.6 silently falls back to the default tracer when `sys.monitoring` is asked to measure branches on Python 3.12 (warning: `sys.monitoring can't measure branches in this version`). Re-evaluate on Python 3.14.
- Dropped: explicit `enable-cache: true` on `setup-uv@v6`. `setup-uv@v6` already defaults to `"auto"`, which is true on hosted runners. Adding it was a no-op.

Benefits
- `uv sync` no longer pulls `--extra build`: ~10-20 s removed from every test runner, plus a smaller cache footprint.
- `merge-gate` perceived latency drops by up to 25 s per gate run; ~36 `gh` API calls/min stays inside quota.
- Three tests that passed in CI only by accident (they relied on `claude`/`codex` not being on PATH) are fixed; the sharded local rehearsal was the surfacing mechanism.

Validation
Run on this branch (`danielmeppiel/bookish-adventure`, head `c85acd1a`):

- Shard 1: 7402 passed in 43.92 s
- Shard 2: 7400 passed, 1 skipped in 67.47 s
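The gap between the two shard times is what a durations file is meant to close. The sketch below contrasts an equal-count split with a greedy longest-first assignment by duration. It is an illustration of the idea only, not `pytest-split`'s actual algorithm, and the test names and timings are invented.

```python
import heapq


def split_by_count(durations: dict, shards: int) -> list:
    """Contiguous chunks with the same number of tests, ignoring timing."""
    names = list(durations)
    size = -(-len(names) // shards)  # ceiling division
    return [names[i * size:(i + 1) * size] for i in range(shards)]


def split_by_duration(durations: dict, shards: int) -> list:
    """Greedy: hand the slowest remaining test to the least-loaded shard."""
    heap = [(0.0, idx) for idx in range(shards)]  # (accumulated seconds, shard)
    groups = [[] for _ in range(shards)]
    for name, secs in sorted(durations.items(), key=lambda kv: kv[1], reverse=True):
        load, idx = heapq.heappop(heap)
        groups[idx].append(name)
        heapq.heappush(heap, (load + secs, idx))
    return groups


def shard_times(durations: dict, groups: list) -> list:
    """Total seconds per shard for a given grouping."""
    return [sum(durations[name] for name in group) for group in groups]


# Hypothetical per-test timings in seconds.
durations = {
    "test_a": 30.0, "test_b": 20.0, "test_c": 10.0,
    "test_d": 5.0, "test_e": 4.0, "test_f": 1.0,
}

by_count = shard_times(durations, split_by_count(durations, 2))
by_duration = shard_times(durations, split_by_duration(durations, 2))

print(by_count)     # [60.0, 10.0]
print(by_duration)  # [35.0, 35.0]
```

Wall-clock for the required checks is the slower shard, so the balanced split is the one that matters for the critical path.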
Lint contract (canonical, per `.apm/instructions/linting.instructions.md`)

Scenario evidence
Per `.agents/skills/pr-description-skill/assets/scenario-evidence-rubric.md`: this PR is a CI-pipeline change with no user-promise behavior surface (no CLI, no schema, no auth, no install path). The skip clause applies: the substantive behavior coverage is the unchanged test suite continuing to pass on the new pipeline shape, evidenced by the two shard transcripts above plus the three test bugs explicitly fixed (`test_all_excluded_warns_and_returns_zero` in `test_mcp_integrator_install_{hermetic,phase3w4}.py` and `test_execute_runtime_command_uses_shlex_on_unix` in `test_runtime_windows.py`).

How to test
- Two `Build & Test Shard N (Linux)` checks start in parallel, plus `pr-binary-smoke (Linux)` running alongside.
- The `Build & Test (Linux)` check appears and turns green after both shards succeed. `apm-self-check` and `NOTICE Drift Check` still appear as required.
- `Build & Test (Linux)` (fan-in) wall-clock should be max(shard1, shard2) + ~10 s coverage-combine overhead, i.e. ~75-90 s warm.
- `merge_gate_wait.sh` logs `Sleeping 5s before next poll` instead of `30s`.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
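The poll-interval numbers quoted in this PR can be sanity-checked with a short calculation. The sketch assumes one `gh` API call per expected check per poll (as the PR's own math does) and that a check finishes at a uniformly random point within a poll interval; both are modelling assumptions, and the 5000/hr figure is the quota as stated in this PR.

```python
def api_calls_per_min(expected_checks: int, poll_sec: int) -> float:
    """API calls per minute, assuming one call per expected check per poll."""
    return expected_checks * (60 / poll_sec)


def mean_poll_slack(poll_sec: int) -> float:
    """Average wait between 'check finished' and the next poll noticing it,
    assuming the finish time is uniform within the poll interval."""
    return poll_sec / 2


EXPECTED_CHECKS = 3
QUOTA_PER_HOUR = 5000  # quota figure as quoted in this PR

old_rate = api_calls_per_min(EXPECTED_CHECKS, 30)
new_rate = api_calls_per_min(EXPECTED_CHECKS, 5)

# Upper bound: a single gate that kept polling for a full hour.
new_rate_per_hour = new_rate * 60
quota_share = new_rate_per_hour / QUOTA_PER_HOUR

avg_saving = mean_poll_slack(30) - mean_poll_slack(5)
worst_case_saving = 30 - 5

print(old_rate, new_rate)              # 6.0 36.0
print(new_rate_per_hour, quota_share)  # 2160.0 0.432
print(avg_saving, worst_case_saving)   # 12.5 25
```

Under these assumptions a continuously polling gate would use under half the stated hourly quota, and the average latency saving comes out near 12.5 s, with 25 s as the worst case.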