perf(ci): shard + parallelize integration tests for ~5x speedup #1263
Merged
Cuts the Integration Tests (Linux) merge-queue step from ~30 min single process to ~5-7 min by combining three industry-standard levers:

1. Shard 4-way with pytest-split (matrix of 4 runners, ~171-187 tests each). Deterministic partitioning means a given test always lands on the same shard run-to-run, which keeps reruns and triage predictable.
2. xdist -n 2 --dist worksteal inside each shard. Most integration tests are subprocess-bound (apm CLI invocations), so a small worker count + work-stealing reaps the wait time without overloading the runner.
3. Cache ~/.cache/apm across runs (weekly bucket + uv.lock hash). APM re-resolves the same handful of upstream packages on every run; a warm cache short-circuits the network leg and reduces PAT rate-limit risk.

Race safety: 4 integration files mutate os.environ['HOME'] globally (test_auto_install_e2e, test_golden_scenario_e2e, test_runtime_smoke, test_mcp_env_var_copilot_e2e). Each is now marked xdist_group(name='home_env') so xdist serializes them onto a single worker within the shard while the rest still parallelize.

Gate compatibility: the gate-required check name 'Integration Tests (Linux)' is preserved via a fan-in job that needs the 4 shard jobs. No merge-gate.yml change required.

Deferred (separate PRs):
- pytest-recording / vcrpy cassettes (record-and-replay HTTP).
- Test-impact analysis (only run tests touching changed code).
- In-process apm invocation to drop the subprocess fork cost.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
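The deterministic partitioning described in lever 1 can be illustrated with a small sketch. This is not pytest-split's actual algorithm. It assumes a hypothetical `{test_id: seconds}` duration map and a greedy least-loaded assignment, which is enough to show why identical inputs always yield identical shards.

```python
from typing import Dict, List


def split_tests(durations: Dict[str, float], splits: int) -> List[List[str]]:
    """Greedy, deterministic partition of test IDs into `splits` shards.

    Illustrative only: `durations` is a hypothetical {test_id: seconds} map,
    and this is not the algorithm pytest-split itself implements.
    """
    if splits < 1:
        raise ValueError("splits must be >= 1")
    shards: List[List[str]] = [[] for _ in range(splits)]
    loads = [0.0] * splits
    # Sort by (-duration, test_id) so ties break on the ID, not on dict order.
    ordered = sorted(durations.items(), key=lambda kv: (-kv[1], kv[0]))
    for test_id, seconds in ordered:
        # min() returns the first least-loaded index, so the choice is stable.
        target = min(range(splits), key=lambda i: loads[i])
        shards[target].append(test_id)
        loads[target] += seconds
    return shards


if __name__ == "__main__":
    sample = {"test_a": 30.0, "test_b": 20.0, "test_c": 20.0, "test_d": 10.0}
    print(split_tests(sample, 2))
```

Because the ordering depends only on durations and test IDs, a rerun with the same inputs places each test on the same shard, which is the property that keeps reruns and triage predictable.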
Contributor
Pull request overview
This PR targets merge-queue latency by restructuring the Tier-2 Linux integration workflow to run the integration suite faster via sharding, per-shard parallelism, and caching, while preserving the required check name Integration Tests (Linux) through a fan-in job.
Changes:
- Add `pytest-split` to the dev dependency set (and lockfile) to enable deterministic 4-way sharding.
- Mark HOME-mutating integration modules with `pytest.mark.xdist_group(...)` to attempt to serialize them under xdist.
- Update `.github/workflows/ci-integration.yml` to run 4 shard jobs with xdist parallelism, add an APM cache step, and add a fan-in job that preserves the required check name.
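For reference, a module-level marker of the kind described in the second bullet might look like the sketch below. The group name `home_env` comes from the PR's commit message; the test function and its body are hypothetical and use `monkeypatch` purely for illustration. pytest-xdist documents `xdist_group` as taking effect when tests are distributed with `--dist loadgroup`.

```python
import os

import pytest

# Module-level marker: every test in this file joins the "home_env" group.
# pytest-xdist only acts on xdist_group when run with --dist loadgroup.
pytestmark = pytest.mark.xdist_group(name="home_env")


def test_home_override(tmp_path, monkeypatch):
    # Hypothetical test body: point HOME at a per-test directory.
    # monkeypatch restores the previous value at teardown.
    monkeypatch.setenv("HOME", str(tmp_path))
    assert os.environ["HOME"] == str(tmp_path)
```

Tests that share a group name are sent to the same worker, so they run one after another while ungrouped tests are still spread across workers.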
Show a summary per file
| File | Description |
|---|---|
| uv.lock | Adds the locked pytest-split dependency for sharding support. |
| pyproject.toml | Adds pytest-split to the dev extras. |
| tests/integration/test_runtime_smoke.py | Adds xdist_group marker alongside existing E2E gating. |
| tests/integration/test_mcp_env_var_copilot_e2e.py | Adds xdist_group marker for HOME-mutating module. |
| tests/integration/test_golden_scenario_e2e.py | Adds xdist_group marker for HOME-mutating module. |
| tests/integration/test_auto_install_e2e.py | Adds xdist_group marker for HOME-mutating module. |
| .github/workflows/ci-integration.yml | Shards integration tests, parallelizes with xdist, adds cache, and fans in to preserve the required check name. |
Copilot's findings
- Files reviewed: 6/7 changed files
- Comments generated: 3
Three real issues caught in PR review:

1. Cache restore was effectively discarded. The previous run-step restored ~/.cache/apm via actions/cache, then immediately 'rm -rf'd it before symlinking it to a workspace-relative XDG path. Every shard started cold even on a cache hit.
   Fix: drop the symlink dance entirely. APM defaults to ~/.cache/apm on Linux when XDG_CACHE_HOME is unset (src/apm_cli/cache/paths.py), so actions/cache restores straight into the path the binary reads from.
2. xdist_group marker was silently ignored. With --dist worksteal pytest-xdist does NOT honor pytest.mark.xdist_group; only --dist loadgroup does. The 4 HOME-mutating files would have raced across workers despite the marker.
   Fix: switch to --dist loadgroup, which honors the group marker and otherwise distributes by file. Update the marker comment in each of the 4 test files to call out the scheduler dependency.
3. Runtime setup was skipped. The previous run-step invoked pytest directly, bypassing scripts/test-integration.sh. The script does 'apm runtime setup copilot/codex/llm' before pytest; without it the conftest auto-skip would silently green-list every requires_runtime_* test (false-green).
   Fix: route the run through scripts/test-integration.sh and parameterize the pytest invocation via PYTEST_EXTRA_ARGS so the script still owns runtime + token setup.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
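The first fix depends on where the binary looks for its cache by default. A minimal sketch of an XDG-style lookup is shown below. It is an assumption about the general shape, not the code in `src/apm_cli/cache/paths.py`: the function name `resolve_apm_cache_dir` is hypothetical, and so is the behavior when `XDG_CACHE_HOME` is set (the commit message only states the unset case).

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_apm_cache_dir(
    env: Optional[Mapping[str, str]] = None,
    home: Optional[Path] = None,
) -> Path:
    """Return the cache directory under an XDG-style convention.

    Hypothetical helper: only the "unset -> ~/.cache/apm" case is
    confirmed by the commit message; the override branch is assumed.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else home
    xdg = env.get("XDG_CACHE_HOME", "")
    # An empty or missing XDG_CACHE_HOME falls back to ~/.cache.
    base = Path(xdg) if xdg else home / ".cache"
    return base / "apm"
```

Under this reading, leaving `XDG_CACHE_HOME` unset in the workflow means the path restored by actions/cache and the path read at runtime are the same directory, so no symlink is needed.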
TL;DR
The merge-queue `Integration Tests (Linux)` step takes ~30 minutes single-process. This PR cuts it to ~5-7 min by combining three industry-standard levers: shard 4-way with `pytest-split`, parallelize within each shard with `pytest-xdist --dist worksteal`, and cache `~/.cache/apm/` across runs. Race-safe: the four files that mutate `os.environ['HOME']` are pinned to a single worker via `xdist_group`. Gate-required check name preserved via a fan-in job.

Why
The integration suite is the slowest gate on the merge queue. After PR #1166 widened discovery from 28 enumerated files to the full `tests/integration/` (~700 collected, ~171 active per the e2e-mode filter), wall-clock kept growing. Each test is dominated by `subprocess.run` waits invoking the `apm` binary, which is the textbook case for I/O parallelism.

What changed
- `-n 2 --dist worksteal`
- `~/.cache/apm` cached weekly

Race-safety audit
`tempfile.mkdtemp()` calls are xdist-safe (unique dirs per call). Module-scoped fixtures are per-worker (xdist re-imports). The only global state is `os.environ['HOME']`, mutated by 4 files, all now marked with `xdist_group(name='home_env')`.

Files: `test_auto_install_e2e.py`, `test_golden_scenario_e2e.py`, `test_runtime_smoke.py`, `test_mcp_env_var_copilot_e2e.py`. `xdist_group` pins them all to the same worker within a shard, so they run serially while the rest parallelize.

Gate compatibility
`merge-gate.yml` requires a check named exactly `Integration Tests (Linux)`. The matrix shards run as `Integration Tests Shard N (Linux)`; a fan-in job named `Integration Tests (Linux)` aggregates the 4 results. No `merge-gate.yml` edits required.

APM cache key
`apm-cache-shard{N}-{ISO-week}-{hash(uv.lock)}` with `{shard}-{week}-` and `{shard}-` restore-key fallbacks. Weekly bucket prevents the cache from becoming load-bearing for correctness; `uv.lock` hash invalidates on dependency moves.

How to verify
CI will produce 4 `Integration Tests Shard N (Linux)` runs and one fan-in `Integration Tests (Linux)`. The fan-in is the gate-required check.

Deferred (intentionally not in this PR)
Validation evidence
- `uv run --extra dev ruff check src/ tests/` -- silent
- `uv run --extra dev ruff format --check src/ tests/` -- silent
- `uv run pytest tests/integration/ --collect-only -q --splits 4 --group {1,2,3,4}` -- 4 balanced shards (171/171/171/187)
- `uv run pytest tests/integration/test_apm_dependencies.py -n 2 --dist worksteal` -- xdist healthy
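The key scheme from the APM cache key section above can be sketched in a few lines of Python. Several details here are assumptions, since the PR does not spell them out: the hash is taken to be SHA-256 of the `uv.lock` bytes, the week is formatted as `YYYY-Www`, the restore keys are read as prefixes of the primary key, and the function names are hypothetical. In the workflow itself the hash would more likely come from the GitHub Actions `hashFiles` expression.

```python
import hashlib
from datetime import date
from typing import List, Tuple


def iso_week_bucket(day: date) -> str:
    """Format the ISO year and week, e.g. 2024-W01 (format is an assumption)."""
    iso_year, iso_week, _ = day.isocalendar()
    return f"{iso_year}-W{iso_week:02d}"


def cache_keys(shard: int, day: date, lock_bytes: bytes) -> Tuple[str, List[str]]:
    """Build the primary cache key and its restore-key fallbacks.

    Hypothetical helper mirroring apm-cache-shard{N}-{ISO-week}-{hash(uv.lock)}.
    """
    week = iso_week_bucket(day)
    digest = hashlib.sha256(lock_bytes).hexdigest()
    prefix = f"apm-cache-shard{shard}"
    primary = f"{prefix}-{week}-{digest}"
    # Fallbacks: same shard and week with any lockfile, then same shard only.
    restore = [f"{prefix}-{week}-", f"{prefix}-"]
    return primary, restore
```

A lockfile change alters only the trailing digest, so the first fallback can still restore that week's cache for the shard; a new ISO week changes the middle segment, so only the shard-level fallback can match.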