Skip to content

fix(triggers): dispatch respond-to-ci on mixed-state SHA from check-suite-success#1241

Merged
zbigniewsobiecki merged 1 commit intodevfrom
fix/check-suite-mixed-state-respond-to-ci
Apr 30, 2026
Merged

fix(triggers): dispatch respond-to-ci on mixed-state SHA from check-suite-success#1241
zbigniewsobiecki merged 1 commit intodevfrom
fix/check-suite-mixed-state-respond-to-ci

Conversation

@zbigniewsobiecki
Copy link
Copy Markdown
Member

What

When a check_suite completes with conclusion=success but another suite on the same SHA failed earlier, dispatch respond-to-ci (instead of review) so the CI failure actually gets fixed.

Why — the live incident on ucho/PR#176 (2026-04-30)

GitHub fires check_suite.completed once per workflow. On ucho PR #176 the timeline was:

  • 11:16:31ZE2B Template Rebuild workflow finishes in ~38s with conclusion=failure. The webhook hits check-suite-failure which correctly skips: Not all checks complete yet (6/7 still running).
  • 11:18:00Z — the main CI workflow finishes with conclusion=success. The webhook hits check-suite-success which dispatches review.
  • The review job is enqueued, the worker calls pollWaitForChecks, sees allPassing=false (E2B is failing), and silently skips the agent.
  • No later event with conclusion=failure ever fires — GitHub already delivered that one. So check-suite-failure never re-evaluates.

Net: respond-to-ci is permanently lost on that SHA, the user has to manually nudge or push a fix commit.

Root cause

check-suite-failure is event-keyed on conclusion=failure, but the failure event arrives before aggregate state is "all complete." It correctly defers, but the only event that would have re-fired it is gone.

The system needed any check_suite completion to be able to make the right dispatch decision based on aggregate state — not just the polarity of the event that triggered the matcher.

The fix

check-suite-success already needed to know aggregate state to decide whether to set waitForChecks: true. Pull that aggregate query forward and fork on it:

Aggregate state on the SHA Dispatch
All complete + all passing review (current behavior)
All complete + any failed respond-to-ci (new — closes the gap)
Some still in-progress review with waitForChecks: true

The fork sits after the author + base gates and before the "already-reviewed at HEAD" check — because a SHA that's both already-reviewed and CI-failing still needs the CI fixed (review approval doesn't make tests pass).

Single-sourced dispatch

Both handlers now converge on dispatchRespondToCi in the new src/triggers/github/respond-to-ci-dispatch.ts:

  • Owns fixAttempts map, MAX_ATTEMPTS = 3, attempt-limit gate, warning-comment side effect, and the dispatch result envelope.
  • CheckSuiteFailureTrigger keeps its early gateTriggerEnabled (so disabled projects don't burn GitHub API calls), then delegates the post-aggregate dispatch to the helper.
  • CheckSuiteSuccessTrigger calls the helper inside the new fork — gated by the helper's own gateTriggerEnabled so a project with respond-to-ci disabled won't dispatch.

This is the same shape the spec-017 PM-ack consolidation enforces (see CLAUDE.md "PM-ack dispatch coverage invariant"): one helper, two call sites, no parallel-path drift.

Tests

5 new TDD cases on the success handler:

  • ✅ dispatches respond-to-ci when aggregate has any failure on the SHA
  • ✅ dispatches respond-to-ci on timed_out conclusion
  • ✅ dispatches review when aggregate is all-complete and all-passing
  • ✅ dispatches review with waitForChecks when not all checks complete yet
  • ✅ dispatches respond-to-ci even when SHA was already reviewed (CI failure still needs fixing)

Adjusted 3 pre-existing assertions for the new flow:

  • Tests with early-gate skip (PR not authored by implementer persona, no personaIdentities) keep getCheckSuiteStatus.not.toHaveBeenCalled() — gates short-circuit before the aggregate query.
  • Tests with mid-flow skip (already reviewed at HEAD) drop that assertion — the aggregate query now happens before that check.
  • The happy-path test now asserts getCheckSuiteStatus IS called.

Test plan

  • npx vitest run --project unit-triggers — 968/968 pass
  • npx vitest run --project unit-api — 1526/1526 pass (router/webhook coverage)
  • npx vitest run --project unit-core — 5174/5174 pass (conformance)
  • npm run typecheck — clean
  • npm run lint — no new warnings on changed files
  • CI green on this PR

🤖 Generated with Claude Code

…uite-success

GitHub fires `check_suite.completed` once per workflow. When workflow A's
suite fails fast (e.g. the E2B template-rebuild on ucho/PR#176 — 38s) and
workflow B's suite is still running, the failure handler correctly defers
with "not all complete yet". When workflow B's suite eventually completes
with `conclusion=success`, only the success handler fires — and it
unconditionally dispatches `review`. That review is silently skipped at
worker time (`pollWaitForChecks` sees `allPassing=false`), and no later
event with `conclusion=failure` ever wakes the failure handler back up.
Net: `respond-to-ci` is permanently lost for that SHA.

Fix: in the success handler, after the author + base gates, query the
aggregate `getCheckSuiteStatus` and fork. When `allComplete && anyFailed`,
dispatch `respond-to-ci` (gated by its own trigger config + attempt limit)
instead of `review`. Otherwise, current behavior — dispatch review with
`waitForChecks: true` so the worker polls if checks are still in progress.

Single-source the dispatch envelope: extract `dispatchRespondToCi` plus
`fixAttempts` / `MAX_ATTEMPTS` / `resetFixAttempts` into a new shared
module `respond-to-ci-dispatch.ts`. Both handlers converge on it. The
failure handler keeps its early `gateTriggerEnabled` so disabled projects
don't burn GitHub API calls; the helper re-checks the same gate so the
success-handler fork is also guarded.

Tests: 5 new TDD cases on the success handler — failure conclusion,
timed_out conclusion, in-progress checks (defers to worker), all-passing
(review), and the "already-reviewed at HEAD with CI failure" case (CI
failure trumps prior approval). Adjusted the existing pre-existing
"getCheckSuiteStatus not called" assertions to match the new flow:
removed where the call now happens (post-base-gate), kept where the gate
short-circuits before it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 30, 2026

Codecov Report

❌ Patch coverage is 98.94737% with 1 line in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/triggers/github/respond-to-ci-dispatch.ts 98.46% 1 Missing ⚠️

📢 Thoughts on this report? Let us know!

@zbigniewsobiecki zbigniewsobiecki merged commit 5cfbbbc into dev Apr 30, 2026
9 checks passed
@zbigniewsobiecki zbigniewsobiecki deleted the fix/check-suite-mixed-state-respond-to-ci branch April 30, 2026 12:03
zbigniewsobiecki added a commit that referenced this pull request May 2, 2026
ucho/PR #231 had its CI fully green at the latest attempt but cascade
dispatched respond-to-ci anyway, with triggerType=check-failure. Wasteful
agent run; respond-to-ci is the agent that fixes failing CI, so running it
on a green PR confuses the loop semantics.

Root cause: each CI workflow on the PR's head_sha ran TWICE on the same
SHA (push-then-rerun, or two pushes that resolved to the same SHA). The
first attempt of `Rebuild ucho-cli template` FAILED at 19:44:40; the
second attempt SUCCEEDED at 19:45:28. GitHub's `listWorkflowRunsForRepo`
returns BOTH workflow_runs. cascade's `getCheckSuiteStatus` iterated
both and concatenated their jobs — including the stale FAILURE record.

`check-suite-success.handle` (the fork from PR #1241/#1243) computes
`anyFailed = checkRuns.some(cr => cr.conclusion === 'failure' || ...)`.
With the stale failure in the list, `anyFailed=true` even though the PR
is green at the latest attempt. The handler mistakenly forks to
`dispatchRespondToCi(...)`. The triggerType=check-failure in the run
record confirms it came through this fork.

GitHub's `listJobsForWorkflowRun` accepts `filter='latest'` (default)
which dedupes job ATTEMPTS within a single workflow_run (the
"Re-run failed jobs" case). It does NOT dedupe across multiple
workflow_runs of the same workflow on the same SHA — which is what bit us.

Fix: dedupe `workflowRuns` by `workflow_id` BEFORE fetching jobs. GitHub
returns runs sorted by `created_at` desc, so the first occurrence per
workflow_id is the latest. Three-line addition in `getCheckSuiteStatus`
at the GitHub-client layer — every caller (`check-suite-success`,
`check-suite-failure`, anywhere else that asks "current state of CI?")
benefits without changing.

Why at the client layer:
- Source of truth match: GitHub's PR UI uses the latest attempt's status;
  cascade should match that.
- Centralization: the same dedup bug would otherwise surface separately
  per caller. Closing it once at the boundary eliminates the class.

Edge cases handled:
- Same `workflow_id` across different events (push vs pull_request) — keeps
  the most recent regardless of event.
- Different `workflow_id`s (e.g. CI + CodeQL) — both kept; new test pins
  this so over-aggressive dedup can't ship.
- No workflow runs — Map yields empty list, downstream code already
  handles `checkRuns.length === 0`.

Tests:
- New `dedupes workflow runs by workflow_id, keeping only the latest re-run`
  test directly pins the PR #231 incident: 2 workflow_runs same workflow_id,
  first failed second succeeded — assert allPassing=true and only 1
  check_run returned. Also asserts `listJobsForWorkflowRun` is NOT called
  for the older run (saves API quota and proves the dedup actually skipped
  the stale jobs).
- New `keeps separate workflow_ids distinct (CI vs CodeQL on same SHA)`
  guard against over-aggressive dedup.
- Updated `mockWorkflowRuns` helper to default `workflow_id` to the run id
  so existing tests (which don't care about workflow_id) keep semantically
  matching.
- Updated two pagination tests to include `workflow_id` so the dedup
  doesn't collapse multiple-runs-with-undefined-workflow_id to one.

Out of scope (follow-up):
- Audit `getFailedWorkflowRunJobs` (used by respond-to-ci agent) — different
  semantics (it WANTS to show what failed even if subsequently succeeded).
- Eventually-consistent GitHub API — the workflow-runs-list endpoint may
  take a few seconds to reflect a new attempt. Orthogonal to dedup.

Verification: vitest 7687/7687, typecheck clean, lint clean.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant