Skip to content

[ci] Parallelize unit-test as a per-module matrix + flaky-tests bucket#3239

Closed
snalli wants to merge 4 commits intomasterfrom
snalli/parallel-unit-test
Closed

[ci] Parallelize unit-test as a per-module matrix + flaky-tests bucket#3239
snalli wants to merge 4 commits intomasterfrom
snalli/parallel-unit-test

Conversation

@snalli
Copy link
Copy Markdown
Contributor

@snalli snalli commented May 1, 2026

Purpose

Replaces the single `unit-test` job (which runs every non-store module's `:test` task on one runner, ~22-30 min) with a 20-way matrix strategy — one runner per module, all in parallel. Wall-clock becomes max(per-module time) instead of sum.

Plus a `flaky-tests` bucket that runs known-flaky / `@Ignore`'d test classes with `continue-on-error: true` for visibility without blocking CI.

Draft — proposing for evaluation against the deflake PR (#3235); not for immediate merge.

What changed

`unit-test` job becomes:

```yaml
unit-test:
runs-on: ubuntu-latest
strategy:
fail-fast: false
matrix:
include:
- { module: ambry-account, args: ":ambry-account:test" }
- { module: ambry-clustermap, args: ":ambry-clustermap:test" }
... (18 more) ...
- { module: flaky-tests, allow-fail: true,
args: ":ambry-file-transfer:test :ambry-vcr:test --tests StoreFileCopyHandlerTest ..." }
continue-on-error: ${{ matrix.allow-fail == true }}
```

20 module legs + 1 flaky bucket = 21 runners per push.

Expected impact (assuming the deflake PR #3235 is in)

Today (single runner) Per-module matrix
Total wall-clock ~22-30 min ~22 min (bound by ambry-clustermap)
Runners used 1 21
Setup overhead 30-60s once 30-60s × 21
Cost (runner-minutes) ~30 ~150-200

The biggest single-module is still `ambry-clustermap` (~22 min). Until that's also parallelized internally (per-fork ZK-port isolation), the matrix only gets us to clustermap's runtime. Other modules finish in 2-5 min and run effectively for free alongside.

Trade-offs

  • Higher runner-minute spend. ~5-7× more runner-minutes per push. Whether that matters depends on the runner budget — LinkedIn-hosted runners may be fine; GitHub-hosted limits could bite.
  • Per-module Codecov flags so coverage from different runners doesn't clobber each other. Codecov merges by flag.
  • Per-module Gradle cache key (`jdk11-${module}`) so concurrent jobs don't fight over the same cache lock.
  • No `codeCoverageReport` aggregation in matrix legs — that task aggregates across all subprojects in one Gradle invocation, which doesn't fit the matrix shape. Codecov's flag-merging covers most use cases; a follow-up could add a separate aggregator job downstream of the matrix if a unified report is needed.

Excluded from matrix

  • `ambry-store` — keeps its existing dedicated job below.
  • `ambry-api`, `ambry-test-utils`, `ambry-all`, `ambry-benchmarks`, `log4j-test-config` — no meaningful test source sets.
  • `ambry-server` — tests are integration only (already in `int-test` job).

flaky-tests bucket

Currently lists `StoreFileCopyHandlerTest`, `StoreFileCopyHandlerIntegTest`, and `CloudBlobStoreTest` — the three classes `@Ignore`'d on the deflake PR for being staged-but-off in production. As we identify more flakes, add to the `--tests` filter on this leg. Failures here surface in PR checks as a yellow ⚠️ instead of red ❌ — visible but non-blocking.

Testing Done

  • Pending CI run on this branch. Expect: 20 module legs all green (or close), flaky-tests leg likely all-skipped (since the @ignore'd tests are still ignored).

snalli and others added 4 commits May 1, 2026 14:38
Splits the single unit-test job (which ran `build` against every
non-store module on one runner, ~22-30 min) into a 20-way matrix —
one runner per module. Wall-clock is now bound by max(per-module test
time) instead of sum, which on this codebase is dominated by
ambry-clustermap (~22 min). Other modules finish in 2-5 min and run
concurrently with clustermap, so they're effectively free.

Trade-offs:
- 20 runner-jobs per push instead of 1. Higher runner-minute spend.
- Per-module Codecov flags so coverage uploads from different runners
  don't clobber each other.
- Per-module Gradle cache key (jdk11-${module}) so concurrent jobs
  don't fight over the same cache lock.

Excluded:
- ambry-store keeps its existing dedicated job below.
- ambry-api, ambry-test-utils, ambry-all, ambry-benchmarks,
  log4j-test-config — these don't have meaningful test source sets.
- ambry-server tests are integration only (already in int-test job).

Code-coverage aggregation across matrix legs is left to Codecov's
default flag-merging — no `codeCoverageReport` task is run per leg
since that aggregates across all subprojects in one Gradle invocation.
A follow-up could run a separate aggregator job after the matrix
completes if a unified coverage view is needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a "flaky-tests" matrix leg that runs known-flaky / @ignore'd test
classes (StoreFileCopyHandlerTest, StoreFileCopyHandlerIntegTest,
CloudBlobStoreTest) with continue-on-error: true. Provides visibility
without blocking CI. New flaky tests can be added to this leg's
--tests filter as they're flagged.

Switched matrix to `include:` syntax so per-leg gradle args (and the
allow-fail flag) can be customized.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reverting the visibility-bucket idea. @ignore'd tests are already
skipped by JUnit; running them in a separate bucket adds noise
without any signal. Keep the matrix to the 20 productive modules
only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The two main test classes in this module (StoreFileCopyHandlerTest +
IntegTest) are already @ignore'd; the remaining four (Disk-aware
thread pool, replication scheduler, FileCopyThread, FileCopyUtils)
aren't load-bearing right now. Skip the whole module to save a runner.

Re-add this leg if file-transfer becomes operational.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
snalli added a commit to snalli/ambry that referenced this pull request May 1, 2026
inconsistentReplicaCapacityTest still hits a 5-min Helix
waitForInitNotification timeout intermittently even with the
@after per-cluster-name cleanup landed earlier in this PR. Root
cause unknown; @ignore for now and chase down on a separate debug
branch with shorter waits.

Bring in the per-module unit-test matrix from the parallel-unit-test
draft (linkedin#3239) so unit-test runs as 19 parallel runners (one per
module) instead of one sequential run. Wall-clock drops from ~30 min
to ~ambry-clustermap's runtime (~22 min). Concurrency block preserved
so PR pushes still supersede in-flight runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@snalli
Copy link
Copy Markdown
Contributor Author

snalli commented May 1, 2026

Superseded — matrix workflow merged into main PR #3235 (commit a791ac5). Closing.

@snalli snalli closed this May 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant