Skip to content

CI: cap mvn job timeout at 1h and mark slow-test profiles optional#145

Open
laxman-ch wants to merge 1 commit into
branch-3.6from
vchekka/ci-relax-flaky-test-checks
Open

CI: cap mvn job timeout at 1h and mark slow-test profiles optional#145
laxman-ch wants to merge 1 commit into
branch-3.6from
vchekka/ci-relax-flaky-test-checks

Conversation

@laxman-ch
Copy link
Copy Markdown
Collaborator

What this changes

.github/workflows/ci.yaml — two related tweaks:

Change From To
timeout-minutes on the mvn job 360 (6 hours) 60 (1 hour)
continue-on-error on the mvn job (unset) ${{ matrix.profile.optional }} driven by a new per-profile optional flag

Per-profile optional flag:

Profile optional
full-build-jdk8 (apache-rat + spotbugs + checkstyle) false (still strict)
full-build-jdk11 (apache-rat + spotbugs + checkstyle) false (still strict)
full-build-java-tests (the slow surefire suite) true
full-build-cppunit-tests true

Why

The two slow test profiles routinely take far longer than expected on the self-hosted runners — recent PRs (#140, #143, #144) have seen them sit in IN_PROGRESS for 80–100+ minutes before the 360-minute timeout kicks in or a manual cancel happens. On healthy hardware the same suite runs in well under an hour: a recent local run of the same mvn -Pfull-build verify surefire suite finished in 28 minutes on Apple Silicon at forkcount=4 (3,184/3,187 tests pass), so the 6-hour ceiling is buying nothing but blocked PRs.

The matrix-level optional flag plumbs continue-on-error per-profile so the lint jobs (which DO catch real regressions cheaply) stay strict, while the historically-flaky long-running test profiles stop failing the workflow run.

Important: this only addresses half the problem

continue-on-error: true makes the overall workflow run report success even if an optional profile fails — but each profile still surfaces as its own status check in the rollup. If these four check names (or just the two *-tests ones) are listed individually in the branch-protection required-status-checks rule on branch-3.6, an actual failure on an "optional" profile will still block merge.

Action for repo admin (separate from this PR): Settings -> Branches -> branch-3.6 protection rule -> remove mvn (full-build-java-tests, ...) and mvn (full-build-cppunit-tests, ...) from the Required Status Checks list. That's the second half. After both this PR merges AND the branch-protection rule is updated, PR merges will no longer be blocked on these flaky/slow tests.

Diagnostic context

Companion diagnostic PR #144 was opened in parallel — a one-character README change targeting branch-3.6 to observe whether the multi-hour hang on these test jobs is specific to the PR #140 diff or pre-existing infrastructure flake. If the hang reproduces on #144 (zero code touch), the runner / flaky-test theory is conclusively proven.

🤖 Generated with Claude Code

Two related changes to make the CI workflow less of a bottleneck on
PR merges:

1. timeout-minutes: 360 -> 60. On healthy runners the matrix completes
   well under an hour (a recent local run of the same surefire suite
   on Apple Silicon at forkcount=4 finished in 28 minutes). 360 minutes
   lets wedged jobs camp on runner capacity for 6 hours before getting
   killed, which has been blocking every PR-#140-ish change for hours
   at a stretch.

2. continue-on-error: matrix.profile.optional. Adds a per-profile
   `optional` flag. The two test profiles (full-build-java-tests and
   full-build-cppunit-tests) are marked optional so a failure there
   does NOT fail the workflow run. The lint profiles (jdk8 / jdk11
   apache-rat / spotbugs / checkstyle) remain strict.

The continue-on-error change only affects the WORKFLOW-RUN conclusion.
If the four matrix profile names are also listed individually as
required status checks under the branch-protection rule for branch-3.6,
repo admin still needs to remove them from Settings -> Branches for
"optional" to translate into merges not being blocked. This PR addresses
the workflow side of the problem; the branch-protection side is a
separate one-time admin action.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant