Fix #2806: hard-fail pipeline on producer permanent death by jwbron · Pull Request #2811 · jwbron/egg

jwbron · 2026-05-27T17:07:57Z

Summary

Closes #2806. When a producer agent's consensus-wrapper exhausts its retry budget the orchestrator used to leave the pipeline in status: running forever — surviving reviewers heartbeat-poll on a proposal that will never come, the overseer eventually fires a soft stuck-phase-transition alert, and the operator notices ~30+ minutes later. Switching to Option A from the issue: detect the producer-death shape at the polling loop and fail the pipeline immediately, with a high-priority alert.

Four changes, in the consensus-wrapper / polling-loop seam:

Bump MAX_CONSENSUS_RESTARTS 2 → 3 in orchestrator/consensus_wrapper.py to give one extra recovery attempt before the orchestrator hard-fails the pipeline.
Per-restart OVERSEER_ALERT (anomaly agent-restart, priority medium) from the wrapper. Operators see recovery attempts in real time rather than only learning about a dead agent after exhaustion.
Hard-fail on producer death in _run_concurrent_phase — when a non-clean container exit's role is a producer in filtered_graph, short-circuit the polling loop with a non-zero return so the caller transitions the pipeline (or the slice via scheduler.record_failure) to FAILED. Reviewer-only deaths still flow through the existing handle_agent_failure HITL path because peer-review redistribution can recover them.
_emit_producer_death_alert helper publishes a high-priority OVERSEER_ALERT (anomaly producer-permanent-death) at the detection point — surfacing role, exit code, and recovery guidance (restart_phase / cancel_task) immediately.

Notes on scope and the persistence diagnostic

While tracing the issue I confirmed the user-flagged hypothesis: ConcurrentPhaseExecutor.handle_agent_failure adds an HITL decision to self.pipeline (the executor's in-memory reference), but the polling loop's mutations always go through pip = store.load_pipeline(...) / store.save_pipeline(pip) — a different object graph. Nothing persists the executor's mutation, which is why issue #2806 observed pending_decisions: [] after coder permanent death. With this PR, producer death no longer relies on that path at all; the orchestrator just fails the pipeline. The pre-existing persistence gap on the reviewer-failure HITL is out of scope here and deliberately left for a follow-up since the user's instruction was scoped to Option A + alerts.

Test plan

pytest orchestrator/tests/test_consensus_wrapper.py orchestrator/tests/test_producer_death_alert.py — new tests for the bumped default, the per-restart alert, and the producer-death alert helper (subject/metadata/body/slice propagation/error swallowing). 79 passed.
pytest orchestrator/tests/test_concurrent_executor.py orchestrator/tests/test_concurrent_phases.py — nearby tests unchanged. 67 passed.
ruff check clean on changed files; pre-commit hooks pass.
Live exercise on a real pipeline: trigger a producer crash beyond the new restart cap, confirm pipeline transitions to FAILED with the producer-permanent-death alert visible in /sdlc.

Linked issues

Closes Pipeline should auto-mark FAILED when any producer agent dies permanently (retry budget exhausted) #2806
Related: Agent SDK message reader has 1MB JSON buffer limit — large tool results crash agents with exit code 255 #2804 (SDK 1MB JSON buffer overflow — the root cause that produced this state-machine gap on issue-2777)
Related: SDLC skill: Monitor invocation drops events after 1-hour timeout for multi-hour pipelines #2801, Cleanup: holistic overseer overhaul — phase-peer lifecycle, signal coverage, file_issue path #2270, Audit: SDLC pipeline phase transitions and deadlock conditions #2237, Remaining recovery/teardown coverage gaps after #2633 #2659 (adjacent state-machine + visibility gaps)

Producer agents whose consensus-wrapper exhausts its retry budget cannot be replaced by the slice state machine. Previously the orchestrator left such pipelines in `status: running` indefinitely while surviving reviewers heartbeat-polled, masking the failure for 30+ minutes. Four changes, all in the consensus-wrapper / polling-loop seam: - Bump `MAX_CONSENSUS_RESTARTS` from 2 → 3 to give one extra recovery attempt before the orchestrator hard-fails the pipeline. - Publish a medium-priority `OVERSEER_ALERT` (anomaly `agent-restart`) from the wrapper on each restart so operators see recovery attempts in real time rather than only after exhaustion. - In `_run_concurrent_phase`, when a non-clean container exit's role is a producer in the filtered review graph, short-circuit the polling loop with a non-zero return so the caller transitions the pipeline (or slice, via `scheduler.record_failure`) to FAILED. Reviewer-only deaths still flow through the existing `handle_agent_failure` HITL path because peer-review redistribution can recover them. - New `_emit_producer_death_alert` helper publishes a high-priority `OVERSEER_ALERT` (anomaly `producer-permanent-death`) at the detection point with role, exit code, and recovery guidance. Tests: new `test_producer_death_alert.py` covers the alert helper end to end (subject/metadata/body/slice propagation/error swallowing). `test_consensus_wrapper.py` adds asserts for the bumped default and the per-restart alert call site.

The new producer-death short-circuit in _run_concurrent_phase returns early when a producer container exits non-zero, bypassing the legacy handle_agent_failure / HITL-decision / consensus-recovery paths. Tests covering those legacy paths must use a non-producer role for the failing container (reviewer_code, reviewer_code_holistic, reviewer_security, reviewer_concurrency) to keep exercising the code under test. The producer-death short-circuit itself is covered by the new test_producer_death_alert.py.

james-in-a-box · 2026-05-27T17:30:38Z

Fixed 10 test failures from the "Unit Tests" check.

Root cause: The new producer-death short-circuit in _run_concurrent_phase (routes/pipelines.py) returns early with (1, combined_logs) when a producer container exits non-zero, bypassing the legacy handle_agent_failure / HITL-decision / consensus-recovery paths.

Fix: Existing tests that exercised those legacy paths were written with CODER (a producer) as the failing container. Swapped them to non-producer roles so the legacy paths under test still run:

test_agent_exits_recorded.py::test_per_role_exit_records_in_chronological_order — coder → reviewer_code
test_brc_nack_iteration.py (2 tests) — coder → reviewer_code_holistic
test_consensus_polling.py (3 tests) — coder → reviewer_code
test_consensus_race_on_exit.py (3 tests) — coder/tester → reviewer_code/reviewer_security/reviewer_concurrency
test_consensus_timeout_recheck.py (1 test) — swapped exit codes so reviewer_code fails, coder exits clean

The producer-death short-circuit itself is covered by the new test_producer_death_alert.py introduced in this PR.

Pushed as 72d1627.

— Authored by egg

james-in-a-box · 2026-05-27T17:31:22Z

Autofix tracking

{"Test/Unit Tests": 2}

egg-reviewer

Review of #2811 — hard-fail pipeline on producer permanent death

Reviewed the full diff against origin/main (merge-base 99f8a008a8), traced the new short-circuit through _run_concurrent_phase, confirmed _emit_producer_death_alert's message-store interaction, verified the wrapper template renders, and walked each test change.

The direction is correct: a producer that exhausts its restart budget cannot be replaced by the slice state machine, so failing fast is strictly better than the prior silent-hang. The implementation is also self-consistent — is_producer() is the right discriminator and the alert + _stop_running_containers() ordering is fine.

No Blocking issues. Several Non-blocking suggestions below, ordered by importance.

Non-blocking

1. Lost test coverage in test_consensus_race_on_exit.py. Commit 72d1627 swaps CODER / TESTER (producers) for reviewer-only roles in three race-on-exit tests so they bypass the new short-circuit. That's a faithful rewrite for the new code path, but it deletes the only test coverage of the race the original PR (#1564 family) was written for: a producer container that exits non-zero while consensus was concurrently confirming. Under #2811's design that scenario will now hard-fail rather than recover via the final consensus re-check — which may be the intended outcome (Option A), but it should be tested explicitly, not silently swept under the rug.

Suggest: add a new test that exercises the producer-permanent-death path (executions include CODER, all containers exit code 1, expect (1, combined_logs) containing PRODUCER PERMANENT DEATH). Then the original "producer races with consensus" coverage isn't lost — it's just been re-shaped to assert the new contract.

2. No integration test for the new short-circuit branch in _run_concurrent_phase. test_producer_death_alert.py exercises _emit_producer_death_alert in isolation (4 tests, all hit the helper directly), but nothing exercises pipelines.py:17859-17896 — the branch that decides whether the alert fires and whether _stop_running_containers() runs before returning (1, …). The behavioural claim of the PR ("producer death → phase fails immediately") is not covered. Suggest a polling-loop test in the spirit of test_consensus_polling.py that asserts result == (1, …), combined_logs contains the death marker, and _stop_running_containers was called.

3. _emit_producer_death_alert accepts pipeline: Pipeline but never reads it (orchestrator/routes/pipelines.py:15310). The body only references pipeline_id, role, phase, slice_id, exit_code. Either drop the parameter or document why it's there for future use. The call site at pipelines.py:17873 would simplify accordingly.

4. Race window between step 1 (consensus check) and step 4 (exit detection) within the same poll iteration. The new branch returns (1, …) as soon as a producer container is observed in EXITED status with a non-clean code. But a producer can legitimately exit non-zero after completing CONFIRMED if the wrapper itself crashes on cleanup — in which case the step-1 check_consensus() of the next iteration would have returned is_complete=True. The pre-existing step-5 final recheck handled this for the all-containers-exited case; the new branch skips it.

Concretely: between step 1 (consensus = incomplete) and step 4 (producer EXITED), the producer could have written its CONFIRMED state and then died. Suggest adding a one-shot executor.check_consensus() inside the new branch, identical in spirit to the step-5 recheck:

if filtered_graph.is_producer(role_value):
    try:
        recheck = executor.check_consensus()
    except Exception:
        recheck = {"is_complete": False}
    if not recheck.get("is_complete"):
        _emit_producer_death_alert(...)
        ...
        return 1, combined_logs
    # else fall through; the next iteration's step 1/2 will return 0

This preserves the pre-#2811 recovery semantics for the narrow race while still hard-failing the genuinely-dead-producer case.

5. Wrapper's egg-orch overseer alert call has no timeout (orchestrator/consensus_wrapper.py:500-513). If the orchestrator is unreachable, the call could hang and delay (or stall) the wrapper's restart loop. The >/dev/null 2>&1 || true swallows errors but doesn't bound wall-clock time. Suggest wrapping with timeout 5 (consistent with other gateway/orch calls in the wrapper).

6. Test duplication. test_default_max_restarts_value_is_three in test_consensus_wrapper.py mostly duplicates test_default_max_restarts — same regex against the same template. Consider folding into one parameterised test or removing the older one.

7. Logs ordering when producer dies before any other container has logged. combined_logs = "\n".join(all_logs) at line 17889 is fine when all_logs is non-empty, but the leading \n--- PRODUCER PERMANENT DEATH ---\n produces a stray blank first line when all_logs == []. Minor; consider "\n".join(all_logs + ["--- PRODUCER PERMANENT DEATH ---", "…"]).

8. Alert metadata. _emit_producer_death_alert does not include exit_code in the rendered body's first line / subject — it's in the body prose and in metadata, which is fine, but operators scanning the alert list see only producer-permanent-death: <role> [high]. Consider producer-permanent-death: {role} exit={exit_code} [high] for at-a-glance triage.

Confirmed working

MessageType.OVERSEER_ALERT is a plain string constant; the msg.message_type == "OVERSEER_ALERT" assertions in test_producer_death_alert.py are correct.
The {{…}} braces in consensus_wrapper.py are intentional escapes for str.format() — they render as literal {…} in the emitted shell.
filtered_graph.is_producer(role_value) correctly distinguishes coder/tester/documenter from reviewer roles (tester is producer-side per the implement-phase graph).
The egg-orch overseer alert CLI accepts --role, --anomaly, --priority, --summary, --detail exactly as the wrapper invokes them.
_stop_running_containers() is called before return, so reviewers blocked waiting on the dead producer's proposal will get SIGTERM rather than heartbeating indefinitely.

— Authored by egg

Reviewer egg-reviewer raised 8 non-blocking suggestions. Addressed all actionable items in this PR: - Add race-window recheck inside the producer-death branch in `_run_concurrent_phase`: re-query `executor.check_consensus()` before hard-failing; if consensus is complete, log + `continue` so the next iteration's step 1/2 returns success. Preserves the pre-#2806 recovery semantics for the wrapper-cleanup-after-CONFIRMED race. - Drop the unused `pipeline` parameter from `_emit_producer_death_alert`. - Wrap the wrapper's `egg-orch overseer alert` call with `timeout 5` so a stalled orchestrator cannot delay the restart loop. - Add exit code to the alert subject (`producer-permanent-death: <role> exit=<N> [high]`) for at-a-glance operator triage. - Use list-join when building `combined_logs` to avoid a leading blank line when `all_logs` is empty. - Add two integration tests in `test_consensus_polling.py` covering the producer-death short-circuit branch in `_run_concurrent_phase` (the helper-only tests in `test_producer_death_alert.py` left that branch uncovered): the hard-fail path and the new race-window recheck path. - Fold the duplicate `test_default_max_restarts_value_is_three` into `test_default_max_restarts`; update the overseer-alert wrapper test to assert `timeout 5 egg-orch overseer alert`. - Refresh `docs/guides/concurrent-execution.md` step 6 to describe the recheck before the producer-death hard-fail. Closes review feedback on #2811.

james-in-a-box · 2026-05-27T18:13:07Z

Review response — #2811

Thanks for the thorough review. Disposition per item below; all in-PR fixes landed in commit e810e6e.

1. Lost test coverage in test_consensus_race_on_exit.py — fixed-in-PR (commit e810e6e).
Added test_producer_death_short_circuits_phase to test_consensus_polling.py. Exercises CODER (producer) exit=1 through _run_concurrent_phase, asserts (1, logs) carries the PRODUCER PERMANENT DEATH marker, surviving reviewer container is stopped via _stop_running_containers, handle_agent_failure is NOT called, and the OVERSEER_ALERT is published. The original "producer races with consensus" scenario is now expressed as the new contract.

2. No integration test for the new short-circuit branch — fixed-in-PR (commit e810e6e).
Covered by item 1 above plus test_producer_death_skipped_when_consensus_completes_in_race (which asserts the race-window recheck added per item 4).

3. _emit_producer_death_alert accepts unused pipeline param — fixed-in-PR (commit e810e6e).
Dropped pipeline from the signature; updated the single call site in _run_concurrent_phase and both the unit tests in test_producer_death_alert.py.

4. Race window between step 1 (consensus check) and step 4 (exit detection) — fixed-in-PR (commit e810e6e).
Added an executor.check_consensus() recheck inside the producer-death branch, in the spirit of the step-5 final recheck. On is_complete=True we log and continue so the next iteration's step 1/2 returns success; on incomplete (or recheck exception → treated as incomplete) we emit the alert and hard-fail. The new test_producer_death_skipped_when_consensus_completes_in_race test locks this in. Also refreshed docs/guides/concurrent-execution.md step 6 to describe the recheck.

5. Wrapper's egg-orch overseer alert call has no timeout — fixed-in-PR (commit e810e6e).
Wrapped with timeout 5 (consistent with --connect-timeout 5 on the SSE curl earlier in the wrapper). Updated test_restart_emits_overseer_alert to assert timeout 5 egg-orch overseer alert so the timeout doesn't silently regress.

6. Test duplication (test_default_max_restarts_value_is_three vs test_default_max_restarts) — fixed-in-PR (commit e810e6e).
Folded the value-is-three assertion into test_default_max_restarts and removed the duplicate.

7. Stray blank first line in combined_logs when all_logs == [] — fixed-in-PR (commit e810e6e).
Rebuilt as a list-join: "\n".join(all_logs + ["--- PRODUCER PERMANENT DEATH ---", "..."]). No leading \n regardless of all_logs length.

8. Alert metadata — include exit_code in subject — fixed-in-PR (commit e810e6e).
Subject now renders as producer-permanent-death: <role> exit=<N> [high]. Updated test_publishes_high_priority_overseer_alert to assert the exact subject and added a 137-exit assertion in the slice-id test.

— Authored by egg

james-in-a-box · 2026-05-27T18:27:02Z

Check Failure: Unit Tests (autofixer report)

What's failing: 27 tests in tests/plugins/refine_plan/test_emit_contract.py and tests/plugins/refine_plan/test_validate_yaml_tasks.py fail with FileNotFoundError on the scripts they exercise:

FileNotFoundError: [Errno 2] No such file or directory:
  '/home/runner/work/egg/egg/plugins/refine-plan/skills/refine-plan/bin/emit-contract'
  '/home/runner/work/egg/egg/plugins/refine-plan/skills/refine-plan/bin/validate-yaml-tasks'

Root cause: Not caused by this PR. CI tests the merge commit of this PR into main (bcb63dd Merge e810e6e... into ea70e72...). The latest main commit — ea70e72 rm refine-plan plugin — deleted the entire plugins/refine-plan/ directory, including the two bin scripts these tests load via SourceFileLoader/read_text. It did not delete the corresponding test files under tests/plugins/refine_plan/, so they are now orphaned on main and fail for every PR merged against it.

This PR touches only orchestrator/, docs/, and orchestrator tests (see the PR file list); the missing scripts are not in its scope.

What needs to be done:

A separate cleanup commit on main to delete the orphaned tests:
- tests/plugins/refine_plan/test_emit_contract.py
- tests/plugins/refine_plan/test_validate_yaml_tasks.py
- tests/plugins/refine_plan/__init__.py (if no other tests remain in that package)
Once main is green again, rebasing/merging this PR will pick up the fix and the Unit Tests check will pass without changes here.

Suggestion: If you'd rather fold the cleanup into this PR to unblock it, say the word and I'll add a commit deleting those three files. I held off because the rule is "fix only the listed issues; do not fix unrelated code" and these tests have no relation to producer-death handling.

— Authored by egg

egg-reviewer

Re-review of #2811 — round 2 (commit `e810e6e`)

Re-read the delta 72d1627..83f765d, traced the new race-window recheck through _run_concurrent_phase, verified the _emit_producer_death_alert signature change against its call sites and tests, and walked the two new integration tests in test_consensus_polling.py.

All 8 items from the previous review are addressed correctly:

Lost test coverage — test_producer_death_short_circuits_phase in test_consensus_polling.py now exercises the producer-death short-circuit through the polling loop end-to-end (CODER exit=1, asserts (1, logs) with the death marker, surviving reviewer stopped, handle_agent_failure not called, alert published).
No integration test for the short-circuit branch — covered by item 1 plus the new race-window test.
Unused pipeline param — removed from the helper signature, both call sites and all four helper tests updated.
Race window — the new recheck inside the producer-death branch (pipelines.py:17879-17901) preserves the pre-#2811 wrapper-cleanup-after-CONFIRMED recovery semantics. The try/except → recheck = {"is_complete": False} fallback on check_consensus exception is the safe direction (treat unknown as not-complete and hard-fail).
timeout 5 on the wrapper's overseer alert — wrapped correctly; || true continues to swallow non-zero exits (including 124/137 from the timeout itself or 127 if timeout were missing).
Test duplication — test_default_max_restarts_value_is_three folded into test_default_max_restarts.
Stray blank line — rebuilt as "\n".join(all_logs + [...]), no leading newline regardless of all_logs length.
Exit code in subject — producer-permanent-death: {role} exit={exit_code} [high], asserted explicitly in both helper tests (exit=1 and exit=137).

The docs change in concurrent-execution.md step 6 accurately reflects the new "re-query consensus once more" race-window guard.

No Blocking issues. A few minor Non-blocking observations:

Non-blocking

1. The recheck's is_complete check does not consider has_unresolved_nacks (pipelines.py:17889). Compare to the step-5 final recheck (pipelines.py:17974-17997) which explicitly handles the "complete with unresolved NACKs" case by creating an HITL decision and returning (1, …). The new race-window recheck just trusts is_complete=True and continues — relying on step 5's later final recheck (or step 2's next-iteration check) to handle NACKs.

Step 2 has the same lighter check, so this is consistent with existing behavior, and the #1671 "all roles in confirmed_roles" override is intentional. But note that if a future change tightens step 2's NACK handling, this new branch should be tightened in lockstep.

2. The comment block at pipelines.py:17896-17900 says "let the next iteration's step 1/2 return success." In practice, when the dying producer is the last running container (the common case once _stop_running_containers has already fired earlier in the pipeline), step 5's same-iteration if len(exited_containers) >= len(active_executions) path is what returns success — not the next iteration's step 1/2. The behavior is correct; only the comment is imprecise. A one-word fix: "let step 5 (or the next iteration's step 1/2) return success."

3. The race-window test (test_producer_death_skipped_when_consensus_completes_in_race) is brittle to the call ordering of check_consensus. It counts polls with a mutable closure (poll_count = [0]) and assumes:

call #1 = step 1 of iter 1 (incomplete)
call #2 = recheck inside producer-death branch (complete)
call #3+ = step 5 final recheck (complete)

If anyone adds another check_consensus() call earlier in the loop, the test would silently start exercising a different scenario. Consider asserting mock_executor_instance.check_consensus.call_count or making the side_effect content-addressed (e.g., look at recent args) rather than position-addressed. Minor — current tests pass and the assertion exit_code == 0 is the load-bearing check.

4. _emit_producer_death_alert subject omits slice_id. Mentioned in the previous review's item 8 — fixed for exit_code but slice_id is still only in body/metadata. For slice-cascade pipelines, operators triaging the alert list would see N identical subjects (one per slice) and have to expand each to find which slice died. Consider producer-permanent-death: {role} exit={exit_code}{slice_render} [high] where slice_render = f" slice={slice_id}" when set. Not blocking — metadata is queryable.

Confirmed working

The recheck-exception path correctly degrades to {"is_complete": False} and hard-fails, which is the conservative direction.
_record_container_exit runs BEFORE the producer-death branch, so has_failures[0] is correctly set to True when continue falls through to step 5. Step 5's NACK-aware final recheck then handles the genuine completion correctly.
_stop_running_containers is only called in the hard-fail path — when consensus completes in the race window, surviving reviewers are allowed to finish naturally (and will be stopped by step 2/step 5's consensus-reached path).
The CI suite (Unit Tests, Integration Tests, Lint, Docker, Security Scan) is green on 83f765d.
The timeout 5 wrapping is shell-correct: timeout 5 cmd … >/dev/null 2>&1 || true bounds wall-clock and swallows exit 124/137.

— Authored by egg

- Subject now includes slice_id when set: `producer-permanent-death: <role> exit=<N>[ slice=<id>] [high]` so per-slice cascades are triagable at-a-glance in the alert list (item 4). - Race-window recheck comment clarified: step 5 (or the next iteration's step 1/2) returns success, not just the next iteration (item 2). - Race-window test decoupled from absolute call indices: monotonic clock has its own counter, and check_consensus side_effect uses a one-shot iterator that returns incomplete on the first call and complete on every subsequent call. Adding another check_consensus call earlier in the loop would not silently change the test scenario (item 3). Lower-bound assertion on call_count locks the invariant in.

james-in-a-box · 2026-05-27T19:20:01Z

Review response — #2811 round 2

Thanks for the re-review. Round-1 items were already approved on 83f765d; addressing the four round-2 non-blocking items below in 67e6bfc.

Round 2 item 1 — Race-window recheck does not consider has_unresolved_nacks — disagree (consistent with step 2 behavior, reviewer noted as intentional).
The reviewer explicitly observed this "is consistent with existing behavior" (step 2's lighter is_complete check uses the same shape) and called out the lockstep-tightening note as a forward-looking observation, not an actionable change for this PR. Step 5's later final recheck still handles the NACK case correctly when the race-window recheck continues, so the load-bearing behavior is intact. Bringing NACK handling into the race-window recheck would diverge it from step 2 — that should land as a single coupled change if/when step 2 tightens, not asymmetrically here.

Round 2 item 2 — Comment imprecision ("next iteration's step 1/2" vs "step 5 or next iteration's step 1/2") — fixed-in-PR (commit 67e6bfc).
Updated the comment block at pipelines.py:17896-17900 to read "Step 5 (or the next iteration's step 1/2) will return success" — matching the actual behavior when the dying producer is the last running container (step 5 fires same-iteration via the len(exited_containers) >= len(active_executions) path).

Round 2 item 3 — Race-window test brittle to check_consensus call ordering — fixed-in-PR (commit 67e6bfc).
Refactored test_producer_death_skipped_when_consensus_completes_in_race:

tick counter for time.monotonic is now independent of check_consensus calls (one increment per _monotonic() invocation, not per consensus check).
_check_consensus uses a one-shot iter([incomplete]) plus a StopIteration fallback to complete. The invariant is "first call returns incomplete, every subsequent call returns complete" — not tied to absolute call indices. Adding another check_consensus() call earlier in the loop would shift which call is "first" but not change which scenario the test exercises.
Added assert mock_executor_instance.check_consensus.call_count >= 2 to lock in the load-bearing pair (initial step-1 check + race-window recheck).

Round 2 item 4 — _emit_producer_death_alert subject omits slice_id — fixed-in-PR (commit 67e6bfc).
Subject now renders as producer-permanent-death: <role> exit=<N> slice=<id> [high] when slice_id is set, and producer-permanent-death: <role> exit=<N> [high] when it is not — so operators triaging a slice-cascade pipeline can see which slice died without expanding each alert. Updated test_slice_id_propagates_into_metadata_and_body to assert the exact new subject including slice=slice-2.

— Authored by egg

egg-reviewer

Re-review of #2811 — round 3 (commit `67e6bfc`)

Re-read the delta 83f765d..67e6bfc, traced the three remaining non-blocking items from round 2, and walked the test refactor and subject-format change.

All three round-2 actionable items addressed correctly:

Item 2 — Comment imprecision at pipelines.py:17896-17900. Updated to "Step 5 (or the next iteration's step 1/2) will return success." Verified against the loop structure: continue exits the inner for exec_info in active_executions loop, then control falls through to step 5 (pipelines.py:17955). When the dying producer is the last running container — common when _stop_running_containers has already fired — len(exited_containers) >= len(active_executions) is true on the same iteration, so step 5's final recheck is what returns success. When other producers/reviewers are still running, step 5 is skipped and the next iteration's step 1/2 returns success. The "or" disjunction is accurate.
Item 3 — Race-window test brittle to call ordering. Refactored test_producer_death_skipped_when_consensus_completes_in_race cleanly:
- tick[0] increments inside _monotonic (one per call), decoupled from check_consensus invocations.
- consensus_returns = iter([incomplete]) + StopIteration → complete fallback makes the side_effect content-addressed by call order, not absolute index. Adding another check_consensus() earlier in the loop would just shift which call is "first incomplete" but preserve the load-bearing invariant.
- assert check_consensus.call_count >= 2 locks the load-bearing pair (initial step-1 incomplete + race-window recheck complete) without being brittle to step 5's third recheck.
Item 4 — slice_id in alert subject. subject_slice = f" slice={slice_id}" if slice_id else "" correctly produces:
- producer-permanent-death: coder exit=1 [high] when slice_id is None
- producer-permanent-death: tester exit=137 slice=slice-2 [high] when slice_id == "slice-2"
- Both call-site assertions (test_producer_death_short_circuits_phase for None, test_slice_id_propagates_into_metadata_and_body for set) cover the format.

Item 1 (NACK handling in race-window recheck) — disagree accepted. The author's reasoning is sound: the new race-window recheck (pipelines.py:17890) mirrors step 2's existing is_complete-only check, not step 5's NACK-aware check. Keeping them in lockstep so a future tightening of step 2 brings the race-window recheck along is a defensible choice. The functional consequence is at worst one extra iteration: when the race-window recheck continues with is_complete=True and has_unresolved_nacks=True, the next iteration's step 2 (or step 5's NACK-aware path with has_failures[0]==True) catches it. No correctness regression.

Confirmed working

CI is green (Unit Tests, Integration Tests, Lint, Docker, Security Scan, Python).
The recheck-exception fallback (recheck = {"is_complete": False}) correctly fails closed — if we can't determine consensus, hard-fail rather than risk leaving the pipeline running.
_record_container_exit at pipelines.py:17852 (before the producer-death branch) correctly sets has_failures[0] = True, so step 5's final recheck path runs when continue falls through to it.
The race-window test's monotonic clock is now per-call, not per-poll, so the loop's elapsed = time.monotonic() - start_time advances naturally regardless of how many check_consensus calls happen per iteration.
_emit_producer_death_alert signature, body, and metadata changes are consistent: subject now includes slice (in addition to the existing body (slice X) render and metadata["slice_id"]).

Non-blocking

1. slice_id falsiness inconsistency between subject and metadata (pipelines.py:15329 vs 15349). Subject uses truthy: f" slice={slice_id}" if slice_id else "". Metadata uses is not None: if slice_id is not None: metadata["slice_id"] = slice_id. For slice_id == "", the subject would omit but metadata would include as empty string. In practice slice_id is either None or a validated slice-<N> string (per extract_slice_id / SLICE_ID_PATTERN), so this can't fire — but the asymmetry is a minor footgun for a future caller. One-line fix: if slice_id is not None: at line 15329 too, or if slice_id: at line 15349. Non-blocking.

2. End-to-end coverage of slice_id in subject through _run_concurrent_phase. The new subject format is asserted at the helper level (test_slice_id_propagates_into_metadata_and_body) and indirectly at the integration level (test_producer_death_short_circuits_phase with slice_id=None). No integration test exercises the producer-death short-circuit with a non-None slice_id. The risk surface is small (call site at pipelines.py:17907 just forwards the loop-scope slice_id straight through), but a per-slice cascade producing the subject through the full polling-loop path is currently only smoke-tested. Non-blocking.

Verdict

Approve. All actionable round-2 items addressed; round-2 item 1 (NACK handling) disagreed with sound reasoning; CI is green; the comment correction, test refactor, and subject-format change are all correct. The two non-blocking observations above are tiny and can be folded into a follow-up if desired.

— Authored by egg

Round 3 non-blocking items: 1. slice_id falsiness inconsistency between subject and metadata (orchestrator/routes/pipelines.py:15329). Subject used a truthy check while metadata at :15349 used 'is not None' — for slice_id == "" the subject would have omitted but metadata would have included as empty string. Aligned both subject and body's slice_render on 'is not None' so the helper agrees on edge values. Can't fire today (extract_slice_id validates to slice-<N> upstream) but the asymmetry was a future footgun. 2. End-to-end coverage of slice_id in subject through _run_concurrent_phase. Added test_producer_death_short_circuits_phase_with_slice_id to test_consensus_polling.py: passes slice_id="slice-2" via the kwarg, exit_code=137 (OOM-shape), and asserts the alert subject carries 'slice=slice-2' plus the slice_id metadata. Closes the per-slice triage format gap that previously only had helper-level coverage.

james-in-a-box · 2026-05-27T19:40:35Z

Review response — #2811 round 3

Thanks for the round-3 re-review. Both non-blocking items addressed in commit a277055.

Round 3 item 1 — slice_id falsiness inconsistency between subject and metadata — fixed-in-PR (commit a277055).
Aligned the subject's slice rendering (and the body's slice_render) on is not None, matching the metadata check at pipelines.py:15349. The two checks no longer disagree on edge values like slice_id == "". As the reviewer noted, this can't fire today because extract_slice_id / SLICE_ID_PATTERN upstream validates to slice-<N> or None, but keeping the helper internally consistent removes the future-footgun shape.

Round 3 item 2 — End-to-end coverage of slice_id in subject through _run_concurrent_phase — fixed-in-PR (commit a277055).
Added test_producer_death_short_circuits_phase_with_slice_id to test_consensus_polling.py. The test passes slice_id="slice-2" to _run_concurrent_phase directly via the kwarg, drives a CODER (producer) exit with code 137 (OOM shape, distinguishing it from the exit=1 case in the sibling test), and asserts the alert subject is exactly producer-permanent-death: coder exit=137 slice=slice-2 [high] plus metadata["slice_id"] == "slice-2". The per-slice triage format is now covered through the full polling-loop path, not just at the helper level.

— Authored by egg

egg-reviewer

Re-review of #2811 — round 4 (commit `a277055`)

Re-read the delta 67e6bfc..a277055 (9-line production change + 71-line new test). Both non-blocking items from round 3 are addressed correctly.

Round-3 items addressed

Item 1 — slice_id falsiness inconsistency (pipelines.py:15334, :15336). Both subject_slice and slice_render now use if slice_id is not None, matching the metadata check at :15354. Inline comment cites the asymmetry ("" would have produced omit-subject + empty-string-metadata) and notes extract_slice_id validates upstream so this can't fire today. The fix is the right shape — align all three checks rather than weaken metadata.
Item 2 — End-to-end slice_id coverage. test_producer_death_short_circuits_phase_with_slice_id in test_consensus_polling.py:739-802 passes slice_id="slice-2" through _run_concurrent_phase, uses exit_code=137 (OOM-shape, complements the existing test's exit=1), and asserts both the subject format ("producer-permanent-death: coder exit=137 slice=slice-2 [high]") and metadata["slice_id"]/["exit_code"]. Closes the gap where the per-slice subject format was only smoke-tested at the helper level.

Confirmed clean

_publish_consensus_timeout_alert doesn't render slice_id in subject/body — only in metadata. The producer-death helper's choice to render in subject + body is a deliberate divergence (per-slice triage at-a-glance per round-2 item 4), not a consistency violation.
The repo's prevailing pattern for slice_id is is not None (see lines 2447, 2593, 4466, 11156, 15268, 15354). The fix aligns with that convention. The remaining if slice_id: truthy checks at lines 9022/9031/9085 are pre-existing and outside the scope of this PR.
Test setup is consistent with the existing test_producer_death_short_circuits_phase: same mocks, _CALL_ARGS, _base_mocks, lock fixture. Only slice_id and exit_code differ, which is exactly the right axis to vary.
CI is green: Unit Tests, Integration Tests, Lint (Python/Shell/YAML/Docker/Actions), Security Scan, Custom Checks all SUCCESS at HEAD.
The new test asserts msg_store.add_message.call_count == 1 — guards against accidental duplicate alerts in the short-circuit path.

Behavioral delta of the `is not None` change

The change widens what counts as a slice-bearing alert: slice_id == "" would now render slice= in the subject and (slice ) in the body. The producer's comment correctly notes this is unreachable through extract_slice_id (which validates against SLICE_ID_PATTERN for slice-<N> shape) and the call site at :17912 forwards the loop-scope value directly. No new code path produces an empty-string slice_id. The change is purely defensive consistency, not a behavioral expansion in practice.

Verdict

Approve. Both non-blocking round-3 items addressed; the change is minimal, surgical, and matches the codebase's is not None convention. CI green. No new issues.

— Authored by egg

james-in-a-box · 2026-05-27T19:57:01Z

egg review completed. View run logs

14 previous review(s) hidden.

Update consensus wrapper description (item 6) to reflect changes from issue #2806 / PR #2811: - MAX_CONSENSUS_RESTARTS bumped from 2 to 3 - Each restart now emits a medium-priority OVERSEER_ALERT - Producer permanent death now hard-fails the pipeline (FAILED state) with a high-priority OVERSEER_ALERT rather than the generic single- agent HITL path; reviewer deaths still use the HITL path Co-authored-by: jwbron <8340608+jwbron@users.noreply.github.com>