Fix #2806: hard-fail pipeline on producer permanent death #2811
Producer agents whose consensus-wrapper exhausts its retry budget cannot be replaced by the slice state machine. Previously the orchestrator left such pipelines in `status: running` indefinitely while surviving reviewers heartbeat-polled, masking the failure for 30+ minutes.

Four changes, all in the consensus-wrapper / polling-loop seam:

- Bump `MAX_CONSENSUS_RESTARTS` from 2 → 3 to give one extra recovery attempt before the orchestrator hard-fails the pipeline.
- Publish a medium-priority `OVERSEER_ALERT` (anomaly `agent-restart`) from the wrapper on each restart so operators see recovery attempts in real time rather than only after exhaustion.
- In `_run_concurrent_phase`, when a non-clean container exit's role is a producer in the filtered review graph, short-circuit the polling loop with a non-zero return so the caller transitions the pipeline (or slice, via `scheduler.record_failure`) to FAILED. Reviewer-only deaths still flow through the existing `handle_agent_failure` HITL path because peer-review redistribution can recover them.
- New `_emit_producer_death_alert` helper publishes a high-priority `OVERSEER_ALERT` (anomaly `producer-permanent-death`) at the detection point with role, exit code, and recovery guidance.

Tests: new `test_producer_death_alert.py` covers the alert helper end to end (subject/metadata/body/slice propagation/error swallowing). `test_consensus_wrapper.py` adds asserts for the bumped default and the per-restart alert call site.
The new producer-death short-circuit in _run_concurrent_phase returns early when a producer container exits non-zero, bypassing the legacy handle_agent_failure / HITL-decision / consensus-recovery paths. Tests covering those legacy paths must use a non-producer role for the failing container (reviewer_code, reviewer_code_holistic, reviewer_security, reviewer_concurrency) to keep exercising the code under test. The producer-death short-circuit itself is covered by the new test_producer_death_alert.py.
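The routing rule described above can be sketched as a small classifier. This is a minimal illustration, not the orchestrator's code: `PRODUCER_ROLES`, `ExitAction`, and `classify_exit` are hypothetical names, and the role set is assumed from the thread's statement that coder, tester, and documenter are producers while the `reviewer_*` roles are not.

```python
from enum import Enum

# Assumed role split; the real discriminator is filtered_graph.is_producer().
PRODUCER_ROLES = {"coder", "tester", "documenter"}


class ExitAction(Enum):
    NONE = "none"            # clean exit, nothing to do
    HARD_FAIL = "hard_fail"  # producer died: short-circuit the polling loop
    HITL = "hitl"            # reviewer died: legacy handle_agent_failure path


def classify_exit(role: str, exit_code: int) -> ExitAction:
    """Decide how the polling loop should react to one container exit."""
    if exit_code == 0:
        return ExitAction.NONE
    if role in PRODUCER_ROLES:
        return ExitAction.HARD_FAIL
    return ExitAction.HITL
```

A test that wants to keep exercising the legacy path would pick a role for which this returns `ExitAction.HITL`, for example `reviewer_code`.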
Fixed 10 test failures from the "Unit Tests" check. Root cause: The new producer-death short-circuit in Fix: Existing tests that exercised those legacy paths were written with
The producer-death short-circuit itself is covered by the new Pushed as — Authored by egg |
Review of #2811 — hard-fail pipeline on producer permanent death
Reviewed the full diff against origin/main (merge-base 99f8a008a8), traced the new short-circuit through _run_concurrent_phase, confirmed _emit_producer_death_alert's message-store interaction, verified the wrapper template renders, and walked each test change.
The direction is correct: a producer that exhausts its restart budget cannot be replaced by the slice state machine, so failing fast is strictly better than the prior silent-hang. The implementation is also self-consistent — is_producer() is the right discriminator and the alert + _stop_running_containers() ordering is fine.
No Blocking issues. Several Non-blocking suggestions below, ordered by importance.
Non-blocking
1. Lost test coverage in test_consensus_race_on_exit.py. Commit 72d1627 swaps CODER / TESTER (producers) for reviewer-only roles in three race-on-exit tests so they bypass the new short-circuit. That's a faithful rewrite for the new code path, but it deletes the only test coverage of the race the original PR (#1564 family) was written for: a producer container that exits non-zero while consensus was concurrently confirming. Under #2811's design that scenario will now hard-fail rather than recover via the final consensus re-check — which may be the intended outcome (Option A), but it should be tested explicitly, not silently swept under the rug.
Suggest: add a new test that exercises the producer-permanent-death path (executions include CODER, all containers exit code 1, expect (1, combined_logs) containing PRODUCER PERMANENT DEATH). Then the original "producer races with consensus" coverage isn't lost — it's just been re-shaped to assert the new contract.
2. No integration test for the new short-circuit branch in _run_concurrent_phase. test_producer_death_alert.py exercises _emit_producer_death_alert in isolation (4 tests, all hit the helper directly), but nothing exercises pipelines.py:17859-17896 — the branch that decides whether the alert fires and whether _stop_running_containers() runs before returning (1, …). The behavioural claim of the PR ("producer death → phase fails immediately") is not covered. Suggest a polling-loop test in the spirit of test_consensus_polling.py that asserts result == (1, …), combined_logs contains the death marker, and _stop_running_containers was called.
3. _emit_producer_death_alert accepts pipeline: Pipeline but never reads it (orchestrator/routes/pipelines.py:15310). The body only references pipeline_id, role, phase, slice_id, exit_code. Either drop the parameter or document why it's there for future use. The call site at pipelines.py:17873 would simplify accordingly.
4. Race window between step 1 (consensus check) and step 4 (exit detection) within the same poll iteration. The new branch returns (1, …) as soon as a producer container is observed in EXITED status with a non-clean code. But a producer can legitimately exit non-zero after completing CONFIRMED if the wrapper itself crashes on cleanup — in which case the step-1 check_consensus() of the next iteration would have returned is_complete=True. The pre-existing step-5 final recheck handled this for the all-containers-exited case; the new branch skips it.
Concretely: between step 1 (consensus = incomplete) and step 4 (producer EXITED), the producer could have written its CONFIRMED state and then died. Suggest adding a one-shot executor.check_consensus() inside the new branch, identical in spirit to the step-5 recheck:
```python
if filtered_graph.is_producer(role_value):
    try:
        recheck = executor.check_consensus()
    except Exception:
        recheck = {"is_complete": False}
    if not recheck.get("is_complete"):
        _emit_producer_death_alert(...)
        ...
        return 1, combined_logs
    # else fall through; the next iteration's step 1/2 will return 0
```

This preserves the pre-#2811 recovery semantics for the narrow race while still hard-failing the genuinely-dead-producer case.
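For readers who want to run the idea in isolation, here is a self-contained sketch of the same recheck with stand-ins for the orchestrator pieces. `on_producer_exit`, its callable parameters, and `DEATH_MARKER` are hypothetical; only the `is_complete` key and the fail-closed fallback come from the suggestion in item 4.

```python
from typing import Callable, List, Optional, Tuple

DEATH_MARKER = "--- PRODUCER PERMANENT DEATH ---"


def on_producer_exit(
    check_consensus: Callable[[], dict],
    emit_alert: Callable[[], None],
    all_logs: List[str],
) -> Optional[Tuple[int, str]]:
    """Return (1, logs) to hard-fail the phase, or None to keep polling."""
    try:
        recheck = check_consensus()
    except Exception:
        # Unknown consensus state is treated as not complete (fail closed).
        recheck = {"is_complete": False}
    if recheck.get("is_complete"):
        # Consensus landed in the race window; let the loop report success.
        return None
    emit_alert()
    return 1, "\n".join(all_logs + [DEATH_MARKER])
```

Passing the consensus check and the alert emitter as callables keeps the sketch testable without a message store or container runtime.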
5. Wrapper's egg-orch overseer alert call has no timeout (orchestrator/consensus_wrapper.py:500-513). If the orchestrator is unreachable, the call could hang and delay (or stall) the wrapper's restart loop. The >/dev/null 2>&1 || true swallows errors but doesn't bound wall-clock time. Suggest wrapping with timeout 5 (consistent with other gateway/orch calls in the wrapper).
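The wrapper itself is shell, so the following is only a Python analogue of the "bound wall-clock time and swallow errors" pattern that `timeout 5 … || true` provides. `best_effort_run` is a hypothetical helper, not part of the orchestrator.

```python
import subprocess
import sys
from typing import List


def best_effort_run(cmd: List[str], timeout_s: float) -> bool:
    """Run cmd with output discarded; never raise, never block past timeout_s."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        # Timed out, or the binary could not be launched: swallow and move on.
        return False
    return result.returncode == 0
```

The return value is advisory only; a caller in a restart loop would ignore it and continue, which matches the intent of `|| true`.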
6. Test duplication. test_default_max_restarts_value_is_three in test_consensus_wrapper.py mostly duplicates test_default_max_restarts — same regex against the same template. Consider folding into one parameterised test or removing the older one.
7. Logs ordering when producer dies before any other container has logged. combined_logs = "\n".join(all_logs) at line 17889 is fine when all_logs is non-empty, but the leading \n--- PRODUCER PERMANENT DEATH ---\n produces a stray blank first line when all_logs == []. Minor; consider "\n".join(all_logs + ["--- PRODUCER PERMANENT DEATH ---", "…"]).
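A small sketch of the difference, assuming the pre-fix code appended a newline-led marker after joining. The function names and the `detail` argument are hypothetical; the marker text is the one quoted in item 7.

```python
from typing import List

DEATH_MARKER = "--- PRODUCER PERMANENT DEATH ---"


def build_logs_with_prefix(all_logs: List[str], detail: str) -> str:
    # Join first, then append a "\n"-led marker: leaves a blank first line
    # when all_logs is empty.
    return "\n".join(all_logs) + f"\n{DEATH_MARKER}\n{detail}"


def build_logs_joined(all_logs: List[str], detail: str) -> str:
    # Put the marker lines in the list before joining: no stray newline.
    return "\n".join(all_logs + [DEATH_MARKER, detail])
```

Both shapes produce identical output when `all_logs` is non-empty; they differ only in the empty case.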
8. Alert metadata. _emit_producer_death_alert does not include exit_code in the rendered body's first line / subject — it's in the body prose and in metadata, which is fine, but operators scanning the alert list see only producer-permanent-death: <role> [high]. Consider producer-permanent-death: {role} exit={exit_code} [high] for at-a-glance triage.
Confirmed working
- `MessageType.OVERSEER_ALERT` is a plain string constant; the `msg.message_type == "OVERSEER_ALERT"` assertions in `test_producer_death_alert.py` are correct.
- The `{{…}}` braces in `consensus_wrapper.py` are intentional escapes for `str.format()`: they render as literal `{…}` in the emitted shell.
- `filtered_graph.is_producer(role_value)` correctly distinguishes coder/tester/documenter from reviewer roles (tester is producer-side per the implement-phase graph).
- The `egg-orch overseer alert` CLI accepts `--role`, `--anomaly`, `--priority`, `--summary`, `--detail` exactly as the wrapper invokes them.
- `_stop_running_containers()` is called before `return`, so reviewers blocked waiting on the dead producer's proposal will get SIGTERM rather than heartbeating indefinitely.
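The brace-escaping point can be checked with a tiny template. The fragment below is hypothetical (the real wrapper template is not shown in this thread); it only demonstrates how `str.format()` treats doubled braces.

```python
# Hypothetical shell fragment held in a Python template string.
TEMPLATE = (
    "MAX_RESTARTS={max_restarts}\n"
    'echo "attempt ${{ATTEMPT}} of ${{MAX_RESTARTS}}"'
)

# {max_restarts} is substituted; {{ and }} collapse to literal { and }.
rendered = TEMPLATE.format(max_restarts=3)
```

After formatting, the shell sees ordinary `${ATTEMPT}` and `${MAX_RESTARTS}` expansions.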
— Authored by egg
Reviewer egg-reviewer raised 8 non-blocking suggestions. Addressed all actionable items in this PR:

- Add race-window recheck inside the producer-death branch in `_run_concurrent_phase`: re-query `executor.check_consensus()` before hard-failing; if consensus is complete, log + `continue` so the next iteration's step 1/2 returns success. Preserves the pre-#2806 recovery semantics for the wrapper-cleanup-after-CONFIRMED race.
- Drop the unused `pipeline` parameter from `_emit_producer_death_alert`.
- Wrap the wrapper's `egg-orch overseer alert` call with `timeout 5` so a stalled orchestrator cannot delay the restart loop.
- Add exit code to the alert subject (`producer-permanent-death: <role> exit=<N> [high]`) for at-a-glance operator triage.
- Use list-join when building `combined_logs` to avoid a leading blank line when `all_logs` is empty.
- Add two integration tests in `test_consensus_polling.py` covering the producer-death short-circuit branch in `_run_concurrent_phase` (the helper-only tests in `test_producer_death_alert.py` left that branch uncovered): the hard-fail path and the new race-window recheck path.
- Fold the duplicate `test_default_max_restarts_value_is_three` into `test_default_max_restarts`; update the overseer-alert wrapper test to assert `timeout 5 egg-orch overseer alert`.
- Refresh `docs/guides/concurrent-execution.md` step 6 to describe the recheck before the producer-death hard-fail.

Closes review feedback on #2811.
Review response — #2811Thanks for the thorough review. Disposition per item below; all in-PR fixes landed in commit 1. Lost test coverage in 2. No integration test for the new short-circuit branch — fixed-in-PR (commit e810e6e). 3. 4. Race window between step 1 (consensus check) and step 4 (exit detection) — fixed-in-PR (commit e810e6e). 5. Wrapper's 6. Test duplication ( 7. Stray blank first line in 8. Alert metadata — include — Authored by egg |
Check Failure: Unit Tests (autofixer report)

What's failing: 27 tests in

Root cause: Not caused by this PR. CI tests the merge commit of this PR into

This PR touches only

What needs to be done:
Suggestion: If you'd rather fold the cleanup into this PR to unblock it, say the word and I'll add a commit deleting those three files. I held off because the rule is "fix only the listed issues; do not fix unrelated code" and these tests have no relation to producer-death handling. — Authored by egg |
Re-review of #2811 — round 2 (commit e810e6e)
Re-read the delta 72d1627..83f765d, traced the new race-window recheck through _run_concurrent_phase, verified the _emit_producer_death_alert signature change against its call sites and tests, and walked the two new integration tests in test_consensus_polling.py.
All 8 items from the previous review are addressed correctly:
- Lost test coverage: `test_producer_death_short_circuits_phase` in `test_consensus_polling.py` now exercises the producer-death short-circuit through the polling loop end-to-end (CODER exit=1, asserts `(1, logs)` with the death marker, surviving reviewer stopped, `handle_agent_failure` not called, alert published).
- No integration test for the short-circuit branch: covered by item 1 plus the new race-window test.
- Unused `pipeline` param: removed from the helper signature, both call sites and all four helper tests updated.
- Race window: the new recheck inside the producer-death branch (`pipelines.py:17879-17901`) preserves the pre-#2811 wrapper-cleanup-after-CONFIRMED recovery semantics. The `try/except → recheck = {"is_complete": False}` fallback on `check_consensus` exception is the safe direction (treat unknown as not-complete and hard-fail).
- `timeout 5` on the wrapper's overseer alert: wrapped correctly; `|| true` continues to swallow non-zero exits (including 124/137 from the timeout itself or 127 if `timeout` were missing).
- Test duplication: `test_default_max_restarts_value_is_three` folded into `test_default_max_restarts`.
- Stray blank line: rebuilt as `"\n".join(all_logs + [...])`, no leading newline regardless of `all_logs` length.
- Exit code in subject: `producer-permanent-death: {role} exit={exit_code} [high]`, asserted explicitly in both helper tests (`exit=1` and `exit=137`).
The docs change in concurrent-execution.md step 6 accurately reflects the new "re-query consensus once more" race-window guard.
No Blocking issues. A few minor Non-blocking observations:
Non-blocking
1. The recheck's is_complete check does not consider has_unresolved_nacks (pipelines.py:17889). Compare to the step-5 final recheck (pipelines.py:17974-17997) which explicitly handles the "complete with unresolved NACKs" case by creating an HITL decision and returning (1, …). The new race-window recheck just trusts is_complete=True and continues — relying on step 5's later final recheck (or step 2's next-iteration check) to handle NACKs.
Step 2 has the same lighter check, so this is consistent with existing behavior, and the #1671 "all roles in confirmed_roles" override is intentional. But note that if a future change tightens step 2's NACK handling, this new branch should be tightened in lockstep.
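The two strictness levels being compared can be sketched side by side. The function names and return labels are hypothetical; the `is_complete` and `has_unresolved_nacks` keys are the ones named in this review.

```python
def light_check(consensus: dict) -> bool:
    """Step-2 / race-window style: trust is_complete alone."""
    return bool(consensus.get("is_complete"))


def nack_aware_check(consensus: dict) -> str:
    """Step-5 style: completion with unresolved NACKs is escalated."""
    if not consensus.get("is_complete"):
        return "incomplete"
    if consensus.get("has_unresolved_nacks"):
        return "needs_hitl"  # stands in for creating an HITL decision
    return "success"
```

For a result that is complete but carries unresolved NACKs, the light check passes while the NACK-aware check escalates, which is the gap item 1 points at.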
2. The comment block at pipelines.py:17896-17900 says "let the next iteration's step 1/2 return success." In practice, when the dying producer is the last running container (the common case once _stop_running_containers has already fired earlier in the pipeline), step 5's same-iteration if len(exited_containers) >= len(active_executions) path is what returns success — not the next iteration's step 1/2. The behavior is correct; only the comment is imprecise. A one-word fix: "let step 5 (or the next iteration's step 1/2) return success."
3. The race-window test (test_producer_death_skipped_when_consensus_completes_in_race) is brittle to the call ordering of check_consensus. It counts polls with a mutable closure (poll_count = [0]) and assumes:
- call #1 = step 1 of iter 1 (incomplete)
- call #2 = recheck inside producer-death branch (complete)
- call #3+ = step 5 final recheck (complete)
If anyone adds another check_consensus() call earlier in the loop, the test would silently start exercising a different scenario. Consider asserting mock_executor_instance.check_consensus.call_count or making the side_effect content-addressed (e.g., look at recent args) rather than position-addressed. Minor — current tests pass and the assertion exit_code == 0 is the load-bearing check.
4. _emit_producer_death_alert subject omits slice_id. Mentioned in the previous review's item 8 — fixed for exit_code but slice_id is still only in body/metadata. For slice-cascade pipelines, operators triaging the alert list would see N identical subjects (one per slice) and have to expand each to find which slice died. Consider producer-permanent-death: {role} exit={exit_code}{slice_render} [high] where slice_render = f" slice={slice_id}" when set. Not blocking — metadata is queryable.
Confirmed working
- The recheck-exception path correctly degrades to `{"is_complete": False}` and hard-fails, which is the conservative direction.
- `_record_container_exit` runs BEFORE the producer-death branch, so `has_failures[0]` is correctly set to True when `continue` falls through to step 5. Step 5's NACK-aware final recheck then handles the genuine completion correctly.
- `_stop_running_containers` is only called in the hard-fail path: when consensus completes in the race window, surviving reviewers are allowed to finish naturally (and will be stopped by step 2/step 5's consensus-reached path).
- The CI suite (Unit Tests, Integration Tests, Lint, Docker, Security Scan) is green on `83f765d`.
- The `timeout 5` wrapping is shell-correct: `timeout 5 cmd … >/dev/null 2>&1 || true` bounds wall-clock and swallows exit 124/137.
— Authored by egg
- Subject now includes slice_id when set: `producer-permanent-death: <role> exit=<N>[ slice=<id>] [high]` so per-slice cascades are triagable at-a-glance in the alert list (item 4).
- Race-window recheck comment clarified: step 5 (or the next iteration's step 1/2) returns success, not just the next iteration (item 2).
- Race-window test decoupled from absolute call indices: monotonic clock has its own counter, and check_consensus side_effect uses a one-shot iterator that returns incomplete on the first call and complete on every subsequent call. Adding another check_consensus call earlier in the loop would not silently change the test scenario (item 3). Lower-bound assertion on call_count locks the invariant in.
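The one-shot iterator described in the last bullet might look roughly like this with `unittest.mock`. `make_check_consensus` and the two constants are hypothetical names; `next(pending, COMPLETE)` is used here as an equivalent of catching `StopIteration` and returning the complete state.

```python
from unittest.mock import MagicMock

INCOMPLETE = {"is_complete": False}
COMPLETE = {"is_complete": True}


def make_check_consensus() -> MagicMock:
    """Mock that reports incomplete once, then complete on every later call."""
    pending = iter([INCOMPLETE])  # consumed by whichever call happens first

    def _side_effect():
        # Once the iterator is exhausted, fall back to the complete state.
        return next(pending, COMPLETE)

    return MagicMock(side_effect=_side_effect)
```

Because only the first call is special, a lower-bound assertion such as `call_count >= 2` is enough to confirm that both the initial check and a later recheck ran.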
Review response — #2811 round 2Thanks for the re-review. Round-1 items were already approved on Round 2 item 1 — Race-window recheck does not consider Round 2 item 2 — Comment imprecision ("next iteration's step 1/2" vs "step 5 or next iteration's step 1/2") — fixed-in-PR (commit 67e6bfc). Round 2 item 3 — Race-window test brittle to
Round 2 item 4 — — Authored by egg |
Re-review of #2811 — round 3 (commit 67e6bfc)
Re-read the delta 83f765d..67e6bfc, traced the three remaining non-blocking items from round 2, and walked the test refactor and subject-format change.
All three round-2 actionable items addressed correctly:
- Item 2: comment imprecision at `pipelines.py:17896-17900`. Updated to "Step 5 (or the next iteration's step 1/2) will return success." Verified against the loop structure: `continue` exits the inner `for exec_info in active_executions` loop, then control falls through to step 5 (`pipelines.py:17955`). When the dying producer is the last running container (common when `_stop_running_containers` has already fired), `len(exited_containers) >= len(active_executions)` is true on the same iteration, so step 5's final recheck is what returns success. When other producers/reviewers are still running, step 5 is skipped and the next iteration's step 1/2 returns success. The "or" disjunction is accurate.
- Item 3: race-window test brittle to call ordering. Refactored `test_producer_death_skipped_when_consensus_completes_in_race` cleanly:
  - `tick[0]` increments inside `_monotonic` (one per call), decoupled from `check_consensus` invocations.
  - `consensus_returns = iter([incomplete])` + `StopIteration → complete` fallback makes the side_effect content-addressed by call order, not absolute index. Adding another `check_consensus()` earlier in the loop would just shift which call is "first incomplete" but preserve the load-bearing invariant.
  - `assert check_consensus.call_count >= 2` locks the load-bearing pair (initial step-1 incomplete + race-window recheck complete) without being brittle to step 5's third recheck.
- Item 4: `slice_id` in alert subject. `subject_slice = f" slice={slice_id}" if slice_id else ""` correctly produces:
  - `producer-permanent-death: coder exit=1 [high]` when `slice_id is None`
  - `producer-permanent-death: tester exit=137 slice=slice-2 [high]` when `slice_id == "slice-2"`
  - Both call-site assertions (`test_producer_death_short_circuits_phase` for None, `test_slice_id_propagates_into_metadata_and_body` for set) cover the format.
Item 1 (NACK handling in race-window recheck) — disagree accepted. The author's reasoning is sound: the new race-window recheck (pipelines.py:17890) mirrors step 2's existing is_complete-only check, not step 5's NACK-aware check. Keeping them in lockstep so a future tightening of step 2 brings the race-window recheck along is a defensible choice. The functional consequence is at worst one extra iteration: when the race-window recheck continues with is_complete=True and has_unresolved_nacks=True, the next iteration's step 2 (or step 5's NACK-aware path with has_failures[0]==True) catches it. No correctness regression.
Confirmed working
- CI is green (Unit Tests, Integration Tests, Lint, Docker, Security Scan, Python).
- The recheck-exception fallback (`recheck = {"is_complete": False}`) correctly fails closed: if we can't determine consensus, hard-fail rather than risk leaving the pipeline running.
- `_record_container_exit` at `pipelines.py:17852` (before the producer-death branch) correctly sets `has_failures[0] = True`, so step 5's final recheck path runs when `continue` falls through to it.
- The race-window test's monotonic clock is now per-call, not per-poll, so the loop's `elapsed = time.monotonic() - start_time` advances naturally regardless of how many `check_consensus` calls happen per iteration.
- `_emit_producer_death_alert` signature, body, and metadata changes are consistent: subject now includes slice (in addition to the existing body `(slice X)` render and `metadata["slice_id"]`).
Non-blocking
1. slice_id falsiness inconsistency between subject and metadata (pipelines.py:15329 vs 15349). Subject uses truthy: f" slice={slice_id}" if slice_id else "". Metadata uses is not None: if slice_id is not None: metadata["slice_id"] = slice_id. For slice_id == "", the subject would omit but metadata would include as empty string. In practice slice_id is either None or a validated slice-<N> string (per extract_slice_id / SLICE_ID_PATTERN), so this can't fire — but the asymmetry is a minor footgun for a future caller. One-line fix: if slice_id is not None: at line 15329 too, or if slice_id: at line 15349. Non-blocking.
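The asymmetry is easiest to see with the empty string as input. The three functions below are hypothetical reconstructions built from the expressions quoted in this item, not the helper's actual code.

```python
from typing import Optional


def subject_truthy(role: str, exit_code: int, slice_id: Optional[str]) -> str:
    # Truthy check: "" is treated like None and omitted from the subject.
    slice_part = f" slice={slice_id}" if slice_id else ""
    return f"producer-permanent-death: {role} exit={exit_code}{slice_part} [high]"


def subject_not_none(role: str, exit_code: int, slice_id: Optional[str]) -> str:
    # "is not None" check: "" is rendered, matching the metadata rule below.
    slice_part = f" slice={slice_id}" if slice_id is not None else ""
    return f"producer-permanent-death: {role} exit={exit_code}{slice_part} [high]"


def build_metadata(exit_code: int, slice_id: Optional[str]) -> dict:
    metadata = {"exit_code": exit_code}
    if slice_id is not None:
        metadata["slice_id"] = slice_id
    return metadata
```

With `slice_id == ""`, `subject_truthy` drops the slice while `build_metadata` still records it; `subject_not_none` agrees with the metadata in every case.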
2. End-to-end coverage of slice_id in subject through _run_concurrent_phase. The new subject format is asserted at the helper level (test_slice_id_propagates_into_metadata_and_body) and indirectly at the integration level (test_producer_death_short_circuits_phase with slice_id=None). No integration test exercises the producer-death short-circuit with a non-None slice_id. The risk surface is small (call site at pipelines.py:17907 just forwards the loop-scope slice_id straight through), but a per-slice cascade producing the subject through the full polling-loop path is currently only smoke-tested. Non-blocking.
Verdict
Approve. All actionable round-2 items addressed; round-2 item 1 (NACK handling) disagreed with sound reasoning; CI is green; the comment correction, test refactor, and subject-format change are all correct. The two non-blocking observations above are tiny and can be folded into a follow-up if desired.
— Authored by egg
Round 3 non-blocking items:

1. slice_id falsiness inconsistency between subject and metadata (orchestrator/routes/pipelines.py:15329). Subject used a truthy check while metadata at :15349 used 'is not None'; for slice_id == "" the subject would have omitted the slice but metadata would have included it as an empty string. Aligned both subject and body's slice_render on 'is not None' so the helper agrees on edge values. Can't fire today (extract_slice_id validates to slice-<N> upstream) but the asymmetry was a future footgun.
2. End-to-end coverage of slice_id in subject through _run_concurrent_phase. Added test_producer_death_short_circuits_phase_with_slice_id to test_consensus_polling.py: passes slice_id="slice-2" via the kwarg, exit_code=137 (OOM-shape), and asserts the alert subject carries 'slice=slice-2' plus the slice_id metadata. Closes the per-slice triage format gap that previously only had helper-level coverage.
Review response — #2811 round 3Thanks for the round-3 re-review. Both non-blocking items addressed in commit Round 3 item 1 — Round 3 item 2 — End-to-end coverage of — Authored by egg |
Re-review of #2811 — round 4 (commit a277055)
Re-read the delta 67e6bfc..a277055 (9-line production change + 71-line new test). Both non-blocking items from round 3 are addressed correctly.
Round-3 items addressed
- Item 1: slice_id falsiness inconsistency (`pipelines.py:15334`, `:15336`). Both `subject_slice` and `slice_render` now use `if slice_id is not None`, matching the metadata check at `:15354`. Inline comment cites the asymmetry (`""` would have produced omit-subject + empty-string-metadata) and notes `extract_slice_id` validates upstream so this can't fire today. The fix is the right shape: align all three checks rather than weaken metadata.
- Item 2: end-to-end slice_id coverage. `test_producer_death_short_circuits_phase_with_slice_id` in `test_consensus_polling.py:739-802` passes `slice_id="slice-2"` through `_run_concurrent_phase`, uses `exit_code=137` (OOM-shape, complements the existing test's exit=1), and asserts both the subject format (`"producer-permanent-death: coder exit=137 slice=slice-2 [high]"`) and `metadata["slice_id"]` / `["exit_code"]`. Closes the gap where the per-slice subject format was only smoke-tested at the helper level.
Confirmed clean
- `_publish_consensus_timeout_alert` doesn't render `slice_id` in subject/body, only in metadata. The producer-death helper's choice to render in subject + body is a deliberate divergence (per-slice triage at-a-glance per round-2 item 4), not a consistency violation.
- The repo's prevailing pattern for `slice_id` is `is not None` (see lines 2447, 2593, 4466, 11156, 15268, 15354). The fix aligns with that convention. The remaining `if slice_id:` truthy checks at lines 9022/9031/9085 are pre-existing and outside the scope of this PR.
- Test setup is consistent with the existing `test_producer_death_short_circuits_phase`: same mocks, `_CALL_ARGS`, `_base_mocks`, lock fixture. Only `slice_id` and `exit_code` differ, which is exactly the right axis to vary.
- CI is green: Unit Tests, Integration Tests, Lint (Python/Shell/YAML/Docker/Actions), Security Scan, Custom Checks all SUCCESS at HEAD.
- The new test asserts `msg_store.add_message.call_count == 1`, which guards against accidental duplicate alerts in the short-circuit path.
Behavioral delta of the is not None change
The change widens what counts as a slice-bearing alert: slice_id == "" would now render slice= in the subject and (slice ) in the body. The producer's comment correctly notes this is unreachable through extract_slice_id (which validates against SLICE_ID_PATTERN for slice-<N> shape) and the call site at :17912 forwards the loop-scope value directly. No new code path produces an empty-string slice_id. The change is purely defensive consistency, not a behavioral expansion in practice.
Verdict
Approve. Both non-blocking round-3 items addressed; the change is minimal, surgical, and matches the codebase's is not None convention. CI green. No new issues.
— Authored by egg
Update consensus wrapper description (item 6) to reflect changes from issue #2806 / PR #2811:

- MAX_CONSENSUS_RESTARTS bumped from 2 to 3
- Each restart now emits a medium-priority OVERSEER_ALERT
- Producer permanent death now hard-fails the pipeline (FAILED state) with a high-priority OVERSEER_ALERT rather than the generic single-agent HITL path; reviewer deaths still use the HITL path

Co-authored-by: jwbron <8340608+jwbron@users.noreply.github.com>
Summary

Closes #2806. When a producer agent's consensus-wrapper exhausts its retry budget, the orchestrator used to leave the pipeline in `status: running` forever: surviving reviewers heartbeat-poll on a proposal that will never come, the overseer eventually fires a soft `stuck-phase-transition` alert, and the operator notices ~30+ minutes later. Switching to Option A from the issue: detect the producer-death shape at the polling loop and fail the pipeline immediately, with a high-priority alert.

Four changes, in the consensus-wrapper / polling-loop seam:

- `MAX_CONSENSUS_RESTARTS` 2 → 3 in `orchestrator/consensus_wrapper.py` to give one extra recovery attempt before the orchestrator hard-fails the pipeline.
- `OVERSEER_ALERT` (anomaly `agent-restart`, priority `medium`) published from the wrapper on each restart. Operators see recovery attempts in real time rather than only learning about a dead agent after exhaustion.
- `_run_concurrent_phase`: when a non-clean container exit's role is a producer in `filtered_graph`, short-circuit the polling loop with a non-zero return so the caller transitions the pipeline (or the slice via `scheduler.record_failure`) to FAILED. Reviewer-only deaths still flow through the existing `handle_agent_failure` HITL path because peer-review redistribution can recover them.
- `_emit_producer_death_alert` helper publishes a high-priority `OVERSEER_ALERT` (anomaly `producer-permanent-death`) at the detection point, surfacing role, exit code, and recovery guidance (`restart_phase` / `cancel_task`) immediately.

Notes on scope and the persistence diagnostic

While tracing the issue I confirmed the user-flagged hypothesis: `ConcurrentPhaseExecutor.handle_agent_failure` adds an HITL decision to `self.pipeline` (the executor's in-memory reference), but the polling loop's mutations always go through `pip = store.load_pipeline(...)` / `store.save_pipeline(pip)`, which is a different object graph. Nothing persists the executor's mutation, which is why issue #2806 observed `pending_decisions: []` after coder permanent death. With this PR, producer death no longer relies on that path at all; the orchestrator just fails the pipeline. The pre-existing persistence gap on the reviewer-failure HITL is out of scope here and deliberately left for a follow-up since the user's instruction was scoped to Option A + alerts.

Test plan

- `pytest orchestrator/tests/test_consensus_wrapper.py orchestrator/tests/test_producer_death_alert.py`: new tests for the bumped default, the per-restart alert, and the producer-death alert helper (subject/metadata/body/slice propagation/error swallowing). 79 passed.
- `pytest orchestrator/tests/test_concurrent_executor.py orchestrator/tests/test_concurrent_phases.py`: nearby tests unchanged. 67 passed.
- `ruff check` clean on changed files; pre-commit hooks pass.
- `producer-permanent-death` alert visible in `/sdlc`.

Linked issues
issue-2777)
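The persistence gap described in the notes on scope can be reproduced with a toy store. Everything here is a hypothetical stand-in: `InMemoryStore` assumes only that `load_pipeline` returns a fresh object each time, which is what "a different object graph" implies.

```python
import copy


class InMemoryStore:
    """Toy pipeline store: load and save both work on independent copies."""

    def __init__(self) -> None:
        self._data = {}

    def save_pipeline(self, pipeline: dict) -> None:
        self._data[pipeline["id"]] = copy.deepcopy(pipeline)

    def load_pipeline(self, pipeline_id: str) -> dict:
        return copy.deepcopy(self._data[pipeline_id])


store = InMemoryStore()
store.save_pipeline({"id": "p1", "pending_decisions": []})

# Executor holds its own reference and mutates it, but never saves it.
executor_pipeline = store.load_pipeline("p1")
executor_pipeline["pending_decisions"].append("hitl-decision")

# Polling loop loads a separate copy, mutates, and saves that copy instead.
pip = store.load_pipeline("p1")
pip["status"] = "running"
store.save_pipeline(pip)

persisted = store.load_pipeline("p1")
```

The persisted pipeline carries the polling loop's status update but an empty `pending_decisions` list, mirroring the symptom reported in the issue.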