feat(groomer): serialize runs behind a DB lock#517
Conversation
/api/groomer/run had no run-level lock: candidate selection (selector.ts) and per-issue lease acquisition (run.ts) are not atomic, and the "other lease" check only looks at different agent names — so two concurrent groomer runs could select and groom the same issue (duplicate LLM call + label writes). Add acquireGroomerLock/releaseGroomerLock (src/lib/groomer/groomer-lock.ts), mirroring sync-lock.ts: a first-writer-wins single-row lock with 30-min stale reclaim, reusing the generic sync_lock table with id="groomer" and a random token (its syncRunId column is a plain nullable string, no FK) — so no migration. runHostedGroomer acquires the lock up front, bails (returns null, like an empty tick) on conflict, and releases in finally. This makes the groomer safe to run on a schedule / >1 replica (prereq for #504). Closes #503
There was a problem hiding this comment.
AI Automated Review
Full PR review.
Analysis engine: MiniMax-M2.7@https://litellm.jory.dev/v1 (anthropic) — escalated (fast_low_confidence)
Recommendation: Approve
This PR correctly implements a DB-backed lock for the hosted groomer, solving the race condition described in issue PR 503 where concurrent groomer runs could double-groom the same issue.
Change-by-Change Findings
src/lib/groomer/groomer-lock.ts (new file, 58 lines)
acquireGroomerLock(): Checks for existing lock (stale if >30 min), deletes stale lock, then does a transaction with double-check + create. Throws/rethrows on unexpected errors, returns{ locked: true, token }or{ locked: false }.releaseGroomerLock(token): ConditionaldeleteManybyidANDsyncRunIdtoken — guards against releasing another run's lock.- Lock ID: Uses
LOCK_ID = "groomer", distinct from sync's"global"— rows don't interfere. - Token: Random UUID via
crypto.randomUUID(), stored insyncRunIdcolumn (already nullable per schema).
src/lib/groomer/run.ts (modified)
- Imports
acquireGroomerLock/releaseGroomerLock. runHostedGroomernow acquires the lock upfront, bails withnullon conflict, and releases infinally.- Original logic refactored into
executeGroomerRun(called only when lock is held). GroomerDepsinterface extended with lock dependencies for testability.
src/lib/groomer/run.test.ts (modified)
- Added mocks for
acquireGroomerLock/releaseGroomerLock. - "bails without selecting when the groomer lock is held": Verifies lock conflict returns
nullwithout callingselectGroomingCandidateorreleaseGroomerLock. - "releases the groomer lock after a run completes": Verifies
acquireGroomerLockcalled once andreleaseGroomerLockcalled with correct token.
Sources
- PR body and diff confirm 189 tests passing,
tscclean,eslintclean. - CI check results: Tests ✅, Lint ✅, Build ✅, Typecheck ✅, Docker Build ✅.
- Tool harness:
src/lib/sync-lock.tsread successfully; confirmed the pattern this PR mirrors. - Repository history: confirms this commit (
8a3aed2) adds the groomer lock.
Standards Compliance
| Convention | Status | Evidence |
|---|---|---|
Error handling (error instanceof Error) |
✅ | groomer-lock.ts line ~44: if (err instanceof Error && err.message === "already_locked") |
| Prisma schema reuse (no unnecessary migrations) | ✅ | Reuses existing sync_lock table with id="groomer"; no schema change needed |
| Lock semantics match existing pattern | ✅ | First-writer-wins, 30-min stale reclaim, double-check in transaction, conditional release by token — mirrors sync-lock.ts |
| Lock is always released (no leak) | ✅ | finally block in runHostedGroomer guarantees release |
Linked Issue Fit
Issue PR 503 requested: a DB lock for groomer analogous to sync-lock.ts, preventing concurrent runs from double-grooming, protecting manual UI trigger racing scheduler, and being safe on >1 replica.
This PR satisfies all requirements:
- ✅ DB lock created (
groomer-lock.ts), same pattern assync-lock.ts - ✅ Race condition eliminated: lock acquired before candidate selection
- ✅ Manual trigger + scheduler safe from race
- ✅ Multi-replica safe (DB lock works across replicas)
- ✅ No migration needed (reuses
sync_locktable)
Issue PR 504 is referenced in the PR body as a follow-up (internalizing groomer on >1 replica), but PR 503 is the prerequisite that this PR completes.
Evidence Provider Findings
No evidence providers configured for this review.
Tool Harness Findings
The tool harness attempted to verify sync-lock.ts and the groomer directory via GitHub API — both returned 404. However, git_status_short and read_file for src/lib/sync-lock.ts succeeded. The read confirmed:
sync-lock.tsuses the samesyncLocktable- Same
MAX_AGE_MS = 30 * 60 * 1000constant - Same double-check-in-transaction pattern
- Token-based conditional release
The groomer-lock implementation correctly mirrors this established pattern.
Unknowns / Needs Verification
None. The PR is fully verifiable from the corpus:
- Lock semantics verified against
sync-lock.ts(read directly) - Tests confirm bail-on-conflict and release-on-completion paths
- CI confirms all checks pass
- No schema migration was needed (confirms
syncRunIdis already nullable)
Summary
acquireGroomerLock/releaseGroomerLockinsrc/lib/groomer/groomer-lock.ts).runHostedGroomertakes it up front, bails (returnsnull, like an empty tick) on conflict, and releases infinally.Why
/api/groomer/runhad no global lock — candidate selection (selector.ts) and per-issue lease acquisition (run.ts) aren't atomic, and the "other lease" check only compares different agent names. So two concurrent groomer runs could select + groom the same issue (duplicate LLM call + label writes). This makes the groomer safe on a schedule / >1 replica — the prerequisite for internalizing it in #504.Design
sync-lock.ts: first-writer-wins single row, 30-min stale reclaim, double-check inside a transaction, conditional release by token.sync_locktable withid="groomer"and a random UUID token (itssyncRunIdcolumn is a plain nullable string, no FK). Sync usesid="global"; the rows don't interfere.Verification
vitest run src/lib/groomer/→ 189 passing (2 new: bails-without-selecting-when-locked, releases-lock-after-run).tsc→ 0;eslint→ clean.Closes #503 · part of #498