fix(triggers): dispatch respond-to-ci on mixed-state SHA from check-suite-success by zbigniewsobiecki · Pull Request #1241 · mongrel-intelligence/cascade

zbigniewsobiecki · 2026-04-30T11:58:22Z

What

When a check_suite completes with conclusion=success but another suite on the same SHA failed earlier, dispatch respond-to-ci (instead of review) so the CI failure actually gets fixed.

Why — the live incident on ucho/PR#176 (2026-04-30)

GitHub fires check_suite.completed once per workflow. On ucho PR #176 the timeline was:

11:16:31Z — E2B Template Rebuild workflow finishes in ~38s with conclusion=failure. The webhook hits check-suite-failure which correctly skips: Not all checks complete yet (6/7 still running).
11:18:00Z — the main CI workflow finishes with conclusion=success. The webhook hits check-suite-success which dispatches review.
The review job is enqueued, the worker calls pollWaitForChecks, sees allPassing=false (E2B is failing), and silently skips the agent.
No later event with conclusion=failure ever fires — GitHub already delivered that one. So check-suite-failure never re-evaluates.

Net: respond-to-ci is permanently lost on that SHA, the user has to manually nudge or push a fix commit.

Root cause

check-suite-failure is event-keyed on conclusion=failure, but the failure event arrives before aggregate state is "all complete." It correctly defers, but the only event that would have re-fired it is gone.

The system needed any check_suite completion to be able to make the right dispatch decision based on aggregate state — not just the polarity of the event that triggered the matcher.

The fix

check-suite-success already needed to know aggregate state to decide whether to set waitForChecks: true. Pull that aggregate query forward and fork on it:

Aggregate state on the SHA	Dispatch
All complete + all passing	`review` (current behavior)
All complete + any failed	`respond-to-ci` (new — closes the gap)
Some still in-progress	`review` with `waitForChecks: true`

The fork sits after the author + base gates and before the "already-reviewed at HEAD" check — because a SHA that's both already-reviewed and CI-failing still needs the CI fixed (review approval doesn't make tests pass).

Single-sourced dispatch

Both handlers now converge on dispatchRespondToCi in the new src/triggers/github/respond-to-ci-dispatch.ts:

Owns fixAttempts map, MAX_ATTEMPTS = 3, attempt-limit gate, warning-comment side effect, and the dispatch result envelope.
CheckSuiteFailureTrigger keeps its early gateTriggerEnabled (so disabled projects don't burn GitHub API calls), then delegates the post-aggregate dispatch to the helper.
CheckSuiteSuccessTrigger calls the helper inside the new fork — gated by the helper's own gateTriggerEnabled so a project with respond-to-ci disabled won't dispatch.

This is the same shape the spec-017 PM-ack consolidation enforces (see CLAUDE.md "PM-ack dispatch coverage invariant"): one helper, two call sites, no parallel-path drift.

Tests

5 new TDD cases on the success handler:

✅ dispatches respond-to-ci when aggregate has any failure on the SHA
✅ dispatches respond-to-ci on timed_out conclusion
✅ dispatches review when aggregate is all-complete and all-passing
✅ dispatches review with waitForChecks when not all checks complete yet
✅ dispatches respond-to-ci even when SHA was already reviewed (CI failure still needs fixing)

Adjusted 3 pre-existing assertions for the new flow:

Tests with early-gate skip (PR not authored by implementer persona, no personaIdentities) keep getCheckSuiteStatus.not.toHaveBeenCalled() — gates short-circuit before the aggregate query.
Tests with mid-flow skip (already reviewed at HEAD) drop that assertion — the aggregate query now happens before that check.
The happy-path test now asserts getCheckSuiteStatus IS called.

Test plan

npx vitest run --project unit-triggers — 968/968 pass
npx vitest run --project unit-api — 1526/1526 pass (router/webhook coverage)
npx vitest run --project unit-core — 5174/5174 pass (conformance)
npm run typecheck — clean
npm run lint — no new warnings on changed files
CI green on this PR

🤖 Generated with Claude Code

…uite-success GitHub fires `check_suite.completed` once per workflow. When workflow A's suite fails fast (e.g. the E2B template-rebuild on ucho/PR#176 — 38s) and workflow B's suite is still running, the failure handler correctly defers with "not all complete yet". When workflow B's suite eventually completes with `conclusion=success`, only the success handler fires — and it unconditionally dispatches `review`. That review is silently skipped at worker time (`pollWaitForChecks` sees `allPassing=false`), and no later event with `conclusion=failure` ever wakes the failure handler back up. Net: `respond-to-ci` is permanently lost for that SHA. Fix: in the success handler, after the author + base gates, query the aggregate `getCheckSuiteStatus` and fork. When `allComplete && anyFailed`, dispatch `respond-to-ci` (gated by its own trigger config + attempt limit) instead of `review`. Otherwise, current behavior — dispatch review with `waitForChecks: true` so the worker polls if checks are still in progress. Single-source the dispatch envelope: extract `dispatchRespondToCi` plus `fixAttempts` / `MAX_ATTEMPTS` / `resetFixAttempts` into a new shared module `respond-to-ci-dispatch.ts`. Both handlers converge on it. The failure handler keeps its early `gateTriggerEnabled` so disabled projects don't burn GitHub API calls; the helper re-checks the same gate so the success-handler fork is also guarded. Tests: 5 new TDD cases on the success handler — failure conclusion, timed_out conclusion, in-progress checks (defers to worker), all-passing (review), and the "already-reviewed at HEAD with CI failure" case (CI failure trumps prior approval). Adjusted the existing pre-existing "getCheckSuiteStatus not called" assertions to match the new flow: removed where the call now happens (post-base-gate), kept where the gate short-circuits before it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

codecov · 2026-04-30T12:03:02Z

Codecov Report

❌ Patch coverage is 98.94737% with 1 line in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
src/triggers/github/respond-to-ci-dispatch.ts	98.46%	1 Missing ⚠️

📢 Thoughts on this report? Let us know!

ucho/PR #231 had its CI fully green at the latest attempt but cascade dispatched respond-to-ci anyway, with triggerType=check-failure. Wasteful agent run; respond-to-ci is the agent that fixes failing CI, so running it on a green PR confuses the loop semantics. Root cause: each CI workflow on the PR's head_sha ran TWICE on the same SHA (push-then-rerun, or two pushes that resolved to the same SHA). The first attempt of `Rebuild ucho-cli template` FAILED at 19:44:40; the second attempt SUCCEEDED at 19:45:28. GitHub's `listWorkflowRunsForRepo` returns BOTH workflow_runs. cascade's `getCheckSuiteStatus` iterated both and concatenated their jobs — including the stale FAILURE record. `check-suite-success.handle` (the fork from PR #1241/#1243) computes `anyFailed = checkRuns.some(cr => cr.conclusion === 'failure' || ...)`. With the stale failure in the list, `anyFailed=true` even though the PR is green at the latest attempt. The handler mistakenly forks to `dispatchRespondToCi(...)`. The triggerType=check-failure in the run record confirms it came through this fork. GitHub's `listJobsForWorkflowRun` accepts `filter='latest'` (default) which dedupes job ATTEMPTS within a single workflow_run (the "Re-run failed jobs" case). It does NOT dedupe across multiple workflow_runs of the same workflow on the same SHA — which is what bit us. Fix: dedupe `workflowRuns` by `workflow_id` BEFORE fetching jobs. GitHub returns runs sorted by `created_at` desc, so the first occurrence per workflow_id is the latest. Three-line addition in `getCheckSuiteStatus` at the GitHub-client layer — every caller (`check-suite-success`, `check-suite-failure`, anywhere else that asks "current state of CI?") benefits without changing. Why at the client layer: - Source of truth match: GitHub's PR UI uses the latest attempt's status; cascade should match that. - Centralization: the same dedup bug would otherwise surface separately per caller. Closing it once at the boundary eliminates the class. Edge cases handled: - Same `workflow_id` across different events (push vs pull_request) — keeps the most recent regardless of event. - Different `workflow_id`s (e.g. CI + CodeQL) — both kept; new test pins this so over-aggressive dedup can't ship. - No workflow runs — Map yields empty list, downstream code already handles `checkRuns.length === 0`. Tests: - New `dedupes workflow runs by workflow_id, keeping only the latest re-run` test directly pins the PR #231 incident: 2 workflow_runs same workflow_id, first failed second succeeded — assert allPassing=true and only 1 check_run returned. Also asserts `listJobsForWorkflowRun` is NOT called for the older run (saves API quota and proves the dedup actually skipped the stale jobs). - New `keeps separate workflow_ids distinct (CI vs CodeQL on same SHA)` guard against over-aggressive dedup. - Updated `mockWorkflowRuns` helper to default `workflow_id` to the run id so existing tests (which don't care about workflow_id) keep semantically matching. - Updated two pagination tests to include `workflow_id` so the dedup doesn't collapse multiple-runs-with-undefined-workflow_id to one. Out of scope (follow-up): - Audit `getFailedWorkflowRunJobs` (used by respond-to-ci agent) — different semantics (it WANTS to show what failed even if subsequently succeeded). - Eventually-consistent GitHub API — the workflow-runs-list endpoint may take a few seconds to reflect a new attempt. Orthogonal to dedup. Verification: vitest 7687/7687, typecheck clean, lint clean. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

zbigniewsobiecki merged commit 5cfbbbc into dev Apr 30, 2026
9 checks passed

zbigniewsobiecki deleted the fix/check-suite-mixed-state-respond-to-ci branch April 30, 2026 12:03

This was referenced Apr 30, 2026

Promote dev to main: fix(triggers) respond-to-ci on mixed-state SHA #1242

Merged

fix(github): dedupe re-run workflows in getCheckSuiteStatus #1254

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(triggers): dispatch respond-to-ci on mixed-state SHA from check-suite-success#1241

fix(triggers): dispatch respond-to-ci on mixed-state SHA from check-suite-success#1241
zbigniewsobiecki merged 1 commit intodevfrom
fix/check-suite-mixed-state-respond-to-ci

zbigniewsobiecki commented Apr 30, 2026

Uh oh!

codecov Bot commented Apr 30, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

zbigniewsobiecki commented Apr 30, 2026

What

Why — the live incident on ucho/PR#176 (2026-04-30)

Root cause

The fix

Single-sourced dispatch

Tests

Test plan

Uh oh!

codecov Bot commented Apr 30, 2026

Codecov Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant