Skip to content

ci: shard PR tests + move binary build off critical path#1437

Merged
danielmeppiel merged 5 commits into
mainfrom
danielmeppiel/bookish-adventure
May 21, 2026
Merged

ci: shard PR tests + move binary build off critical path#1437
danielmeppiel merged 5 commits into
mainfrom
danielmeppiel/bookish-adventure

Conversation

@danielmeppiel
Copy link
Copy Markdown
Collaborator

TL;DR

PR-time CI was creeping toward 4 minutes as the unit suite grew past 14k tests, dominated by a single serial Build & Test (Linux) job that also bundled a PyInstaller package step. This PR shards pytest two ways with pytest-split, fans the shards into a gate-named aggregator that enforces the 80% coverage floor after coverage combine, lifts the binary build into a parallel non-required pr-binary-smoke job, and lowers merge-gate.yml's poll interval from 30 s to 5 s. Expected p50 PR wall-clock: ~70-90 s (warm cache), down from ~3.5-4 min.

Note

The pre-discussion floor was ~55-65 s warm. I pushed back on the 1-minute target during design: it is reachable only with a committed .test_durations file, a cold cache miss budget that GitHub's hosted cache cannot guarantee, and sysmon coverage acceleration that silently falls back on Python 3.12 + branch coverage. The realistic, durable target this PR commits to is ≤ 90 s p99 / ~70 s p50.

Problem (WHY)

Approach (WHAT)

Lever Before After Wall-clock delta (warm)
Unit test parallelism 1 runner × -n auto 2 runners × -n 2 --dist worksteal, split by pytest-split -45 to -60 s
Binary build placement inline in required build-and-test parallel non-required pr-binary-smoke -60 to -90 s removed from critical path
Test-path deps uv sync --extra dev --extra build uv sync --extra dev ~10-20 s shaved off install
Coverage gate per-job fail_under = 80 on partial data (would break sharding) --cov-fail-under=0 per shard; coverage combine + coverage report --fail-under=80 in fan-in net zero; gate semantics preserved
Merge-gate poll POLL_SEC: 30 POLL_SEC: 5 up to -25 s perceived

Implementation (HOW)

  • .github/workflows/ci.yml — replaced the monolithic build-and-test job with two jobs:
    1. build-and-test-shard (matrix shard: [1, 2]), each invoking pytest --splits 2 --group N -n 2 --dist worksteal --cov --cov-report= --cov-fail-under=0, renaming .coverage to .coverage.unit-shard-N, and uploading it as a unit-coverage-shard-N artifact (hidden files included so .coverage* survives).
    2. build-and-test (fan-in, needs: [build-and-test-shard], if: always()) — name preserved verbatim to satisfy merge-gate.yml's EXPECTED_CHECKS. Downloads both shard coverage files, runs coverage combine, renders the summary via scripts/coverage-summary.py, then enforces the 80% floor with coverage report --fail-under=80, and finally re-checks needs.build-and-test-shard.result == 'success' so a shard failure still fails the gate.
      Also added pr-binary-smoke: a parallel non-required job that reproduces the old UPX + build-binary.sh path so packaging regressions still surface in review, just off the critical path.
  • .github/workflows/merge-gate.ymlPOLL_SEC: '30''5', with a comment showing the math (~3 checks × 12 polls/min = 36 calls/min; ceiling is 5000/hr). The poller script (merge_gate_wait.sh) was already parameterised, no script change needed.
  • tests/unit/test_mcp_integrator_install_{hermetic,phase3w4}.py and tests/unit/test_runtime_windows.py — fixed three tests whose CI-passing status was an environment accident: they relied on claude / codex not being on PATH on the runner. The bug is that find_runtime_binary is module-level imported into the source-under-test, so patching apm_cli.runtime.utils.find_runtime_binary does not rebind the local symbol the function actually calls. Patched at the usage site (apm_cli.integration.mcp_integrator_install.find_runtime_binary, apm_cli.core.script_runner.find_runtime_binary) and added the missing MCPServerOperations mock that the sibling test in the same class already had. Failing the tests locally on a developer machine with codex installed was the only way to surface this drift; without the local shard rehearsal the silent breakage would have shipped.
  • CHANGELOG.md — Unreleased / Changed entry documenting the CI delta and the preserved gate semantics.

Diagrams

Legend: PR-time job graph before and after. Red box = required check on critical path; green box = parallel non-required signal. Coverage gate location is the key change.

flowchart LR
    subgraph Before["Before: serial, ~4 min critical path"]
        B_PR([PR push]) --> B_Lint[lint]
        B_PR --> B_BT["Build &amp; Test (Linux)<br/>tests + binary + UPX<br/>~3.5 min"]
        B_PR --> B_Self[apm-self-check]
        B_PR --> B_Notice[NOTICE Drift]
        B_BT -.gate.-> B_MG[merge-gate poll 30s]
    end

    subgraph After["After: sharded fan-in, ~70-90s critical path"]
        A_PR([PR push]) --> A_Lint[lint]
        A_PR --> A_S1["shard 1<br/>~44s"]
        A_PR --> A_S2["shard 2<br/>~67s"]
        A_PR --> A_Self[apm-self-check]
        A_PR --> A_Notice[NOTICE Drift]
        A_PR --> A_Bin["pr-binary-smoke<br/>(non-required, parallel)"]
        A_S1 --> A_FanIn["Build &amp; Test (Linux)<br/>coverage combine + 80% gate"]
        A_S2 --> A_FanIn
        A_FanIn -.gate.-> A_MG[merge-gate poll 5s]
    end

    classDef required fill:#ffd9d9,stroke:#cc0000;
    classDef parallel fill:#d9f5d9,stroke:#0a7a0a;
    class B_BT,A_FanIn,A_S1,A_S2 required
    class A_Bin parallel
Loading

Legend: shard coverage data flow into the gate. .coverage per shard is uploaded as a hidden-file artifact, the fan-in downloads both, coverage combine unions them, and only then is --fail-under=80 enforced — mirroring ci-integration.yml:265-283.

sequenceDiagram
    participant S1 as shard 1 runner
    participant S2 as shard 2 runner
    participant A as artifacts
    participant F as fan-in runner
    S1->>S1: pytest --splits 2 --group 1<br/>--cov-fail-under=0
    S2->>S2: pytest --splits 2 --group 2<br/>--cov-fail-under=0
    S1->>A: upload .coverage.unit-shard-1
    S2->>A: upload .coverage.unit-shard-2
    F->>A: download unit-coverage-shard-*
    F->>F: coverage combine
    F->>F: coverage report --fail-under=80
    F-->>F: fail if union < 80% OR any shard failed
Loading

Trade-offs

  • Two shards, not four. Diminishing returns past two on a 4-vCPU runner once -n 2 is already saturating cores inside each shard. Four shards would also quadruple uv sync cold-cache cost. Revisit only if shard 2 grows past ~90 s p99.
  • Naive file-based split on first run. Without a committed .test_durations, pytest-split falls back to file-count balancing — shard 2 currently runs ~23 s longer than shard 1 (67 s vs 44 s). A follow-up PR can commit a durations file harvested from the first green run; not blocking.
  • pr-binary-smoke is non-required. A packaging regression at PR time will surface as a red check the author can see, but it does not block merge. The canonical binary build still runs in ci-integration.yml on merge_group and build-release.yml on release, so nothing reaches users without a passing build. Acceptable risk; the alternative (keeping it on the critical path) re-adds 60-90 s for a signal that no PR-time consumer was downloading the artifact for.
  • POLL_SEC=5 increases gh API call rate. Math: 3 expected checks × 12 polls/min = 36 calls/min, two orders of magnitude under the 5000/hr quota. cancel-in-progress concurrency caps parallel pollers per PR.
  • Rejected: sysmon coverage core. coverage.py 7.10.6 silently falls back to the default tracer when sys.monitoring is asked to measure branches on Python 3.12 (warning: sys.monitoring can't measure branches in this version). Re-evaluate on Python 3.14.
  • Rejected: explicit enable-cache: true on setup-uv@v6. setup-uv@v6 already defaults to "auto", which is true on hosted runners. Adding it was a no-op.

Benefits

  1. PR-time wall-clock: ~70-90 s p99 / ~70 s p50 (warm cache), down from observed ~3.5-4 min. Local shard timings: 44 s (shard 1) + 67 s (shard 2), running on a single 4-vCPU developer machine.
  2. Test-path uv sync no longer pulls --extra build: ~10-20 s removed from every test runner, plus a smaller cache footprint.
  3. merge-gate perceived latency drops by up to 25 s per gate run; ~36 gh API calls/min stays comfortably inside quota.
  4. The 80% unit coverage gate semantics are preserved exactly: enforcement just moves from per-job to post-combine, which is the only correct location once sharding is in play.
  5. Three pre-existing test bugs (silent environment dependency on claude / codex not being on PATH) are fixed; the sharded local rehearsal was the surfacing mechanism.

Validation

Run on this branch (danielmeppiel/bookish-adventure, head c85acd1a):

Shard 1 — 7402 passed in 43.92 s
$ uv run --extra dev pytest tests/unit tests/test_console.py \
    --splits 2 --group 1 -n 2 --dist worksteal \
    --cov --cov-report= --cov-fail-under=0 -q
........................................................................ [ 99%]
..........................................................               [100%]
7402 passed in 43.92s
Shard 2 — 7400 passed, 1 skipped in 67.47 s
$ uv run --extra dev pytest tests/unit tests/test_console.py \
    --splits 2 --group 2 -n 2 --dist worksteal \
    --cov --cov-report= --cov-fail-under=0 -q
7400 passed, 1 skipped, 19 warnings, 38 subtests passed in 67.47s (0:01:07)
Lint contract (canonical, per .apm/instructions/linting.instructions.md)
$ uv run --extra dev ruff check src/ tests/ && \
  uv run --extra dev ruff format --check src/ tests/
All checks passed!
1037 files already formatted

$ uv run --extra dev python -m pylint --disable=all --enable=R0801 \
    --min-similarity-lines=10 --fail-on=R0801 src/apm_cli/
Your code has been rated at 10.00/10 (previous run: 10.00/10, +0.00)

$ bash scripts/lint-auth-signals.sh
[+] auth-signal lint clean

Scenario evidence

Per .agents/skills/pr-description-skill/assets/scenario-evidence-rubric.md: this PR is a CI-pipeline change with no user-promise behavior surface (no CLI, no schema, no auth, no install path). The skip clause applies — the substantive behavior coverage is the unchanged test suite continuing to pass on the new pipeline shape, evidenced by the two shard transcripts above plus the three test bugs explicitly fixed (test_all_excluded_warns_and_returns_zero in test_mcp_integrator_install_{hermetic,phase3w4}.py and test_execute_runtime_command_uses_shlex_on_unix in test_runtime_windows.py).

How to test

  • Push or rebase a no-op commit on top of this branch; observe two Build & Test Shard N (Linux) checks start in parallel, plus pr-binary-smoke (Linux) running alongside.
  • Confirm the fan-in Build & Test (Linux) check appears and turns green after both shards succeed.
  • Confirm apm-self-check and NOTICE Drift Check still appear as required.
  • Time the gate: Build & Test (Linux) (fan-in) wall-clock should be max(shard1, shard2) + ~10 s coverage-combine overhead, i.e. ~75-90 s warm.
  • On the auto-merge side, confirm merge_gate_wait.sh logs Sleeping 5s before next poll instead of 30s.

Co-authored-by: Copilot 223556219+Copilot@users.noreply.github.com

danielmeppiel and others added 3 commits May 21, 2026 21:34
Cuts PR-time CI wall-clock from ~134s (Build & Test) + up to 30s
(merge-gate poll slack) = ~165s perceived down to a measured-locally
~65-75s + ~5s gate slack.

Changes
-------

1. Two-way sharded Build & Test (ci.yml). Matrix of two runners using
   pytest-split --splits 2 --group N, mirroring the proven pattern in
   ci-integration.yml. Each shard runs xdist -n 2 --dist worksteal
   inside. Fan-in job preserves the 'Build & Test (Linux)' check-run
   name that merge-gate.yml requires in EXPECTED_CHECKS, so no gate
   edits are needed. First runs use pytest-split's naive file-based
   split; a .test_durations file can be committed in a follow-up PR
   for better balance.

   Coverage is collected per-shard with --cov-fail-under=0 (each
   shard only exercises ~half the code paths) and the 80% gate moves
   to the fan-in job after coverage combine -- same approach as
   ci-integration.yml.

   Local timing: full suite was ~84s; each shard now runs in ~60-67s
   (7400 tests passing per shard). 4-vCPU runner means xdist -n 2
   leaves headroom for the two split-loaders.

2. Binary build moved off the PR critical path (ci.yml). The
   'Install UPX', 'Build binary', and 'Upload binary as workflow
   artifact' steps are deleted from Build & Test. uv sync drops
   '--extra build' (PyInstaller + ~150MB of wheels), which also
   shortens cold-cache install by 15-25s.

   Verified zero PR-time consumers: the apm-linux-x86_64 artifact
   uploaded here was never referenced by any PR-required workflow.
   ci-integration.yml (merge_group) and ci-runtime.yml both rebuild
   the binary inline; build-release.yml owns the canonical platform
   matrix on release.

   A new non-required 'PR Binary Smoke (Linux)' job runs in parallel
   to preserve the packaging-regression signal at PR time without
   gating the merge.

3. merge-gate.yml POLL_SEC 30 -> 5. With ~3 expected checks per
   poll = ~36 gh API calls/min, well under the 5000/hr quota.
   cancel-in-progress concurrency on the gate already bounds parallel
   polling per PR. Cuts perceived gate latency by avg ~15s, worst ~25s.

What was considered and dropped
-------------------------------

- coverage --core=sysmon: verified locally that sysmon does not
  support branch coverage on Python 3.12 (CoverageWarning:
  'sys.monitoring can\'t measure branches in this version, using
  default core'). The change would have been a silent no-op while
  branch=true. Revisit when CPython 3.14 lands (PEP 626 enhancements).

- Explicit enable-cache: true on Lint job: verified against
  astral-sh/setup-uv README that enable-cache defaults to 'auto',
  which is true on GitHub-hosted runners. The bare form on the Lint
  job is already caching.

- 4-way shard / 8-core runner / move coverage off PR / replace
  pylint R0801 with jscpd: out of scope for this PR per a panel
  review that traded each for either added complexity or weakened
  signal. See WIP/ for follow-up notes.

Honest commitment
-----------------

Targets ~75s p50, ~90s p99 on warm cache for the PR-time critical
path (was ~165s). Below 60s would require coverage off the PR
critical path or self-hosted runners, both deferred.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Three tests passed in CI but failed on dev machines where the runtime
binaries (claude, codex) happen to be installed:

- test_all_excluded_warns_and_returns_zero (hermetic + phase3w4):
  find_runtime_binary is module-level imported into mcp_integrator_install
  at import time, so patching apm_cli.runtime.utils.find_runtime_binary
  did not rebind the symbol the function actually calls. With `claude`
  on PATH it leaked into installed_runtimes and bypassed the exclude
  path, suppressing the expected warning. Also added the missing
  MCPServerOperations mock that the sibling test in the same class
  already had.

- test_execute_runtime_command_uses_shlex_on_unix: the two Windows
  counterparts in the same class already patched
  apm_cli.core.script_runner.find_runtime_binary; the Unix variant
  was missing it and asserted bare "codex" in args, which failed
  when codex resolved to /opt/homebrew/bin/codex.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 21, 2026 19:54
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors PR-time CI to reduce critical-path wall clock by splitting the unit test suite into two shards, fanning results back into a single required check that enforces the unit coverage floor, and moving the PyInstaller binary build to a parallel non-required job. It also reduces merge-gate polling latency and tightens a few unit tests to avoid environment-dependent runtime discovery.

Changes:

  • Shard PR-time unit tests into two pytest-split groups and add a fan-in Build & Test (Linux) job that combines coverage and enforces the 80% floor.
  • Move the PyInstaller binary build into a parallel, non-required PR Binary Smoke (Linux) job.
  • Reduce merge-gate.yml polling interval and fix unit tests that were accidentally dependent on local PATH/runtime presence.
Show a summary per file
File Description
.github/workflows/ci.yml Splits unit tests into 2 shard jobs, adds a fan-in required coverage gate, and moves binary packaging to a parallel non-required job.
.github/workflows/merge-gate.yml Reduces required-check polling interval to lower merge-gate latency.
tests/unit/test_runtime_windows.py Patches runtime discovery at the usage site to avoid PATH-dependent behavior in parsing tests.
tests/unit/test_mcp_integrator_install_phase3w4.py Adds missing mocks/patches to make runtime detection hermetic and consistent across developer machines.
tests/unit/test_mcp_integrator_install_hermetic.py Mirrors the phase3w4 test fixes to keep hermetic coverage consistent.
CHANGELOG.md Adds an Unreleased entry describing the PR-time CI pipeline change.

Copilot's findings

  • Files reviewed: 6/6 changed files
  • Comments generated: 3

Comment thread CHANGELOG.md Outdated

### Changed

- PR-time CI cut from ~4 min to ~70-90 s by sharding `pytest` two ways with `pytest-split` and a fan-in coverage gate, moving the PyInstaller binary build off the required critical path into a parallel non-required `pr-binary-smoke` job, and lowering `merge-gate.yml` poll interval from 30 s to 5 s. The required check name `Build & Test (Linux)` is preserved by the fan-in job. The 80% unit coverage floor is now enforced after `coverage combine`, mirroring the pattern already used in `ci-integration.yml`.
Comment thread .github/workflows/ci.yml Outdated
pattern: unit-coverage-shard-*
path: coverage-shards/
merge-multiple: true
continue-on-error: true
Comment thread .github/workflows/ci.yml Outdated
else
echo "No coverage shard files found; skipping unit coverage summary."
fi
continue-on-error: true
- Remove continue-on-error from download-artifact and coverage-combine
  steps in the fan-in. The downstream 'Enforce unit coverage gate' step
  already checks for .coverage and fails with a clear message if absent,
  so silencing upstream failures only made diagnosis harder.
- Rewrite the Unreleased changelog entry to the repo's single-line
  Keep-a-Changelog convention, with #1437 ref.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@danielmeppiel
Copy link
Copy Markdown
Collaborator Author

Thanks for the review. All three comments accepted, addressed in f118a41:

  • CHANGELOG: rewritten to the single-line Keep-a-Changelog convention with (#1437) suffix. Detailed rationale stays in the PR body where it belongs.
  • ci.yml:183 (download-artifact continue-on-error): removed. If the artifact pattern matches zero files (e.g. both shards failed before producing coverage), actions/download-artifact@v4 exits 0 silently, so we do not need continue-on-error to handle that case. Any real download failure now surfaces immediately.
  • ci.yml:195 (Combine and summarise coverage continue-on-error): removed. The inline if find ... grep -q . guard already protects against the no-artifacts case; coverage combine / coverage json failures will now fail the required check loudly with the actual coverage error, which is what we want.

Separately, while assessing the review I noticed the fan-in job started ~3 minutes after the shards finished on the first green run (shards done 19:55:55, fan-in started 19:58:50). That's outside this PR's scope but tracked in #1438 — likely upload-artifact@v4 propagation lag.

…e_group only

Two consecutive observations on this branch showed the fan-in job
sitting idle for ~3 minutes after the shards completed (run
26249570926: +2m 55s; run 26249910941: +2m 57s). Step-level telemetry
confirms the delay is GitHub Actions runner-allocation latency
between matrix `needs:` resolution and the dependent job's runner
pickup -- not artifact propagation (download-artifact completed in
1 s). The fan-in's own work was only ~17 s.

Net effect of the previous shape: critical path was ~4m 35s even
though the shards themselves ran in ~80 s, eating most of the
sharding win.

This commit:

- Promotes the two shard jobs to required checks; the fan-in is no
  longer on the PR-time critical path.
- Keeps the global 80% unit coverage gate as a non-required
  `Coverage Combine (Linux)` job at PR time so drift is visible in
  the PR UI, and escalates it to required on `merge_group` so the
  global floor still blocks the actual merge.
- Per-shard floor set to 60% (observed: shard 1 64.82%, shard 2
  62.52%) -- catches local regressions at PR time without paying
  the runner-allocation tax.
- Updates merge-gate.yml EXPECTED_CHECKS accordingly: PR context
  gates on the two shard names; merge_group context additionally
  gates on Coverage Combine.

Expected wall-clock for PR-time required checks: ~80 s (max of two
shards), down from ~4m 35s.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@danielmeppiel
Copy link
Copy Markdown
Collaborator Author

Folded the artifact-lag mitigation into this PR (commit b419231). Closed #1438.

Measured wall-clock for PR-time required checks:

Run SHA Required-check critical path
Initial fan-in shape 0060123c ~4 m 35 s
With review fixes f118a416 ~4 m 35 s
After mitigation b4192315 ~83 s

What changed

Step-level telemetry showed actions/download-artifact@v4 itself completing in 1 second; the ~3 min gap was GitHub Actions runner-allocation latency between matrix needs: resolution and the dependent job's runner pickup. The mitigation eliminates the dependency from the critical path:

  • The two Build & Test Shard N (Linux) jobs are now the PR-time required checks (per merge-gate.yml EXPECTED_CHECKS).
  • The fan-in is renamed Coverage Combine (Linux) and remains non-required at PR time. It still combines shard coverage and enforces the 80% global floor; a regression turns the check red in the PR UI.
  • On merge_group, Coverage Combine (Linux) IS in EXPECTED_CHECKS, so the global 80% floor blocks the actual merge — merge-time safety is preserved.
  • Per-shard floor is 60% at PR time (observed: shard 1 64.82%, shard 2 62.52%) to catch local regressions without paying the runner-allocation tax.

Trade-off: PR-time enforcement of the global 80% floor moves from blocking to advisory. The merge gate still blocks. The prize is dropping ~3 min off the critical path on every PR.

@danielmeppiel danielmeppiel merged commit 5c1130f into main May 21, 2026
12 checks passed
@danielmeppiel danielmeppiel deleted the danielmeppiel/bookish-adventure branch May 21, 2026 20:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants