Skip to content

fix(triggers): update initial PR comment on GitHub agent failure#194

Merged
zbigniewsobiecki merged 1 commit intodevfrom
fix/github-agent-failure-comment
Feb 13, 2026
Merged

fix(triggers): update initial PR comment on GitHub agent failure#194
zbigniewsobiecki merged 1 commit intodevfrom
fix/github-agent-failure-comment

Conversation

@zbigniewsobiecki
Copy link
Copy Markdown
Member

Summary

  • When GitHub-bound agents (review, respond-to-ci, respond-to-review) fail, the initial PR comment ("🤖 Working on fixing CI failures...") is now updated with a clear error message instead of being left stale
  • Handles both agent failure (agentResult.success === false) and uncaught exceptions in the catch block
  • Mirrors the error reporting pattern already used for Trello-bound agents

Test plan

  • npm run lint passes
  • npm run typecheck passes
  • npm test — all 555 tests pass
  • Manual: trigger respond-to-ci on a PR with a known failure, verify initial comment gets updated

🤖 Generated with Claude Code

Previously, GitHub-bound agents (review, respond-to-ci, respond-to-review)
posted an initial "working on it" comment but never updated it on failure,
leaving a stale optimistic message with no indication of what happened.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@zbigniewsobiecki zbigniewsobiecki merged commit ae36f73 into dev Feb 13, 2026
3 checks passed
@zbigniewsobiecki zbigniewsobiecki deleted the fix/github-agent-failure-comment branch February 13, 2026 17:56
zbigniewsobiecki added a commit that referenced this pull request May 1, 2026
…s gap (#1248)

ucho/PR #194 (2026-05-01, CI fully green, single SHA) had two review runs
dispatched 29 s apart — both `triggerType=ci-success`, same engine/model/
workItemId, both completed normally, both burned LLM tokens and posted
reviews. Confirmed verbatim in Loki: identical `reviewDispatchKey`
(`zbigniewsobiecki/ucho:194:9ed484df...`) appeared in claim logs from TWO
processes — one from `agent-execution.ts:279` post-completion-hook running
in the IMPL worker container, one from `check-suite-success.ts:240` running
in the cascade-router. Both `claimReviewDispatch(key, ...)` returned `true`
because the dedup `recentlyDispatched` Map at `review-dispatch-dedup.ts:5`
was module-scoped, in-memory, per-process state. Two processes = two
independent Maps = no dedup. Pure waste.

PR #1246 explicitly called this out as out-of-scope ("With the architectural
shift above, the in-memory dedup is sufficient. Skip."). That was wrong:
the architectural shift in #1246 closes the worker-bail-out path but does
nothing about the post-completion-hook path running in a different process.
The bug class is broader than ucho/PR #194 — anywhere two processes might
independently call `claimReviewDispatch` for the same (project, PR, SHA) is
exposed: post-completion-hook ↔ check-suite-success (live now), review-
requested ↔ post-completion-hook (waiting to bite), and any future
horizontally-scaled router replicas.

Architectural fix — move the dedup state to Redis so it's shared across all
processes. Atomic primitive: `SET key value NX EX <ttl>` returns `'OK'`
exactly once per key within the TTL, regardless of how many processes race
it. No application-level locking, no race window.

`src/triggers/github/review-dispatch-dedup.ts` — rewritten:
- Drops the in-memory `Map<string, number>` and `cleanupExpiredEntries`.
- Lazy IORedis singleton (mirrors `src/queue/cancel.ts:28-37`).
- `claimReviewDispatch` / `releaseReviewDispatch` are now async.
- Keys namespaced under `cascade:review-dedup:`.
- Fail-closed on Redis errors: `claim` returns `false` and Sentry-captures
  under tag `review_dedup_redis_down`. Better to skip a legit dispatch than
  duplicate. Mirrors spec-017 fail-closed pipeline-capacity-gate posture.
- 5-min TTL preserved from #1246. The pre-Redis "30-min TTL clears stale
  entries on router restart" CLAUDE.md note is now obsolete.
- Test-only `__resetForTests()` flushes the namespace.

Callers — minimal mechanical await additions:
- `check-suite-success.ts:240` — `await claimReviewDispatch(...)`.
- `check-suite-success.ts:274` — `onBlocked: () => { void releaseReviewDispatch(...) }`
  (callback signature is sync; fire-and-forget the release).
- `review-requested.ts:101-102` — both calls become `await`.
- `review-requested.ts:131` — same fire-and-forget shape.
- `agent-execution.ts:279` (post-completion-hook) — `await`.

Tests:
- `tests/unit/triggers/github/review-dispatch-dedup.test.ts` rewritten
  around a vi.mock('ioredis', ...) factory whose closure captures a single
  in-memory store. Every `new Redis(...)` instance shares that backend, so
  the cross-process invariant is trivially testable: instantiate two
  IORedis clients and verify the second `SET NX EX` for the same key
  returns `null`. **This is the regression pin for ucho/PR #194.**
- New "fails closed when Redis errors" test covers the Sentry-capture
  path under tag `review_dedup_redis_down`.
- `check-suite-success.test.ts` and `review-requested.test.ts`: replaced
  the per-test `recentlyDispatched.clear()` with `vi.mock` of the dedup
  module. Tests that assert the dedup-skip path now use
  `mockClaimReviewDispatch.mockResolvedValueOnce(true).mockResolvedValueOnce(false)`
  to simulate the SET NX EX race.

Verification:
- npx vitest run --project unit-triggers --project unit-api --project unit-core
  → 7678/7678 pass.
- npm run typecheck + npm run lint clean.
- Direct-instance test demonstrates two IORedis clients against the shared
  backend correctly reject the second claim — the production scenario
  pinned in code.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant