Why
CTO question while dogfooding the brainstormer epic: "how come maintenance didn't detect the issue?"
The current maintenance.py daemon is an LLM-driven backlog GROOMER (closes dupes, retitles, adds loop:ready). It does NOT scan for stuck loop:ready issues that have repeatedly hit worker_iterations_exhausted.
Result: even after the iteration loop bailed on a broken branch, the issue stayed loop:ready (until #128 fixed the label drop). And even with that fix, transient bugs that don't trip escalate_to_human cleanly will still leave issues stuck.
What
A stuck_sweep.py health-check that runs every tick (or every N ticks) and:
- Reads the last 100 events from loop-runner-events.jsonl
- Counts worker_iterations_exhausted events per issue
- Any issue with ≥ 2 exhausted attempts AND still loop:ready → demote to loop:needs-human + comment with the last failure state
- Any issue with a PR in a pushed_no_pr / committed_not_pushed state persisting > 3 ticks → escalate similarly
- Emit a typed stuck_sweep_demoted event
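A minimal sketch of the read-and-count steps above, for discussion only. The event schema is an assumption: this issue does not specify it, so the `type` and `issue` field names, the `count_exhausted` helper, and the `window` parameter are all hypothetical.

```python
import json
from collections import Counter, deque
from pathlib import Path


def count_exhausted(events_file: Path, window: int = 100) -> Counter:
    """Count worker_iterations_exhausted events per issue in the last `window` lines."""
    with events_file.open(encoding="utf-8") as fh:
        # A bounded deque keeps only the final `window` lines of the file.
        tail = deque(fh, maxlen=window)

    counts: Counter = Counter()
    for line in tail:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip a partially written or corrupt line
        # Assumed schema: {"type": "<event name>", "issue": <issue number>, ...}
        if (
            isinstance(event, dict)
            and event.get("type") == "worker_iterations_exhausted"
            and "issue" in event
        ):
            counts[event["issue"]] += 1
    return counts
```

Skipping unparseable lines is a design choice in this sketch: the runner may be appending to the file while the sweep reads it, so a truncated last line should not abort the sweep.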
Acceptance
- src/forge_loop/stuck_sweep.py (new) with sweep(events_file, gh_client) -> SweepReport
- Wired into tick loop: runs after the iteration loop on every tick, before dispatching the next batch
- Configurable via settings.maintenance.stuck_threshold_attempts (default 2)
- Tests: fixtures of events.jsonl with mixed stuck/healthy issues; assert correct demotions
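One possible shape for the new setting, as a sketch. How settings.py actually models its sections is not shown in this issue, so the `MaintenanceSettings` dataclass and its validation are assumptions; only the field name and the default of 2 come from the acceptance criteria.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MaintenanceSettings:
    # Hypothetical container for the `maintenance` settings section.
    # Number of exhausted attempts after which a loop:ready issue is demoted.
    stuck_threshold_attempts: int = 2

    def __post_init__(self) -> None:
        # A threshold below 1 would demote issues that never failed.
        if self.stuck_threshold_attempts < 1:
            raise ValueError("stuck_threshold_attempts must be >= 1")
```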
Test matrix
- Issue with 2 exhausted events → demoted
- Issue with 1 exhausted + 1 success after → NOT demoted (it recovered)
- Issue with 0 exhausted events → NOT demoted
- Demotion fires the typed event + correct labels
- GhClient failure during demotion → caught, logged, doesn't crash sweep
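The matrix above could be satisfied by logic along these lines. This is a sketch under stated assumptions, not the intended implementation: it takes already-parsed events rather than the events_file path from the acceptance signature, the `worker_succeeded` event name is invented to model "success after", and the `GhClient` methods (`get_labels`, `demote`) plus the `SweepReport` fields are hypothetical.

```python
import logging
from dataclasses import dataclass, field
from typing import Iterable, Protocol

log = logging.getLogger("stuck_sweep")


class GhClient(Protocol):
    # Hypothetical interface; the real gh_client methods may differ.
    def get_labels(self, issue: int) -> set[str]: ...
    def demote(self, issue: int, comment: str) -> None: ...


@dataclass
class SweepReport:
    demoted: list[int] = field(default_factory=list)
    failed: list[int] = field(default_factory=list)
    emitted: list[dict] = field(default_factory=list)


def sweep_events(events: Iterable[dict], gh_client: GhClient, threshold: int = 2) -> SweepReport:
    streak: dict[int, int] = {}        # exhausted attempts since the last success
    last_failure: dict[int, dict] = {}
    for ev in events:
        issue = ev.get("issue")
        if issue is None:
            continue
        if ev.get("type") == "worker_iterations_exhausted":
            streak[issue] = streak.get(issue, 0) + 1
            last_failure[issue] = ev
        elif ev.get("type") == "worker_succeeded":  # assumed success event name
            streak[issue] = 0  # the issue recovered, so earlier failures no longer count

    report = SweepReport()
    for issue, attempts in streak.items():
        if attempts < threshold:
            continue
        try:
            if "loop:ready" not in gh_client.get_labels(issue):
                continue  # already demoted or closed elsewhere
            gh_client.demote(
                issue,
                f"stuck_sweep: {attempts} exhausted attempts; last failure: {last_failure[issue]}",
            )
        except Exception:
            # One failing GitHub call must not abort the rest of the sweep.
            log.exception("stuck_sweep: demotion failed for issue %s", issue)
            report.failed.append(issue)
            continue
        report.demoted.append(issue)
        report.emitted.append(
            {"type": "stuck_sweep_demoted", "issue": issue, "attempts": attempts}
        )
    return report
```

Resetting the streak on success is what keeps the "1 exhausted + 1 success after" row from being demoted. An exhausted event that follows a success starts a new streak at 1.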
File pointers
- src/forge_loop/stuck_sweep.py (new)
- src/forge_loop/events.py: register StuckSweepDemotedEvent
- src/forge_loop/settings.py: maintenance.stuck_threshold_attempts
- src/forge_loop/runner/tick.py: call sweep per tick
- tests/test_stuck_sweep.py (new)