Skip to content

Fix #2806: hard-fail pipeline on producer permanent death#2811

Merged
jwbron merged 7 commits into
mainfrom
egg/issue-2806
May 27, 2026
Merged

Fix #2806: hard-fail pipeline on producer permanent death#2811
jwbron merged 7 commits into
mainfrom
egg/issue-2806

Conversation

@jwbron
Copy link
Copy Markdown
Owner

@jwbron jwbron commented May 27, 2026

Summary

Closes #2806. When a producer agent's consensus-wrapper exhausts its retry budget the orchestrator used to leave the pipeline in status: running forever — surviving reviewers heartbeat-poll on a proposal that will never come, the overseer eventually fires a soft stuck-phase-transition alert, and the operator notices ~30+ minutes later. Switching to Option A from the issue: detect the producer-death shape at the polling loop and fail the pipeline immediately, with a high-priority alert.

Four changes, in the consensus-wrapper / polling-loop seam:

  • Bump MAX_CONSENSUS_RESTARTS 2 → 3 in orchestrator/consensus_wrapper.py to give one extra recovery attempt before the orchestrator hard-fails the pipeline.
  • Per-restart OVERSEER_ALERT (anomaly agent-restart, priority medium) from the wrapper. Operators see recovery attempts in real time rather than only learning about a dead agent after exhaustion.
  • Hard-fail on producer death in _run_concurrent_phase — when a non-clean container exit's role is a producer in filtered_graph, short-circuit the polling loop with a non-zero return so the caller transitions the pipeline (or the slice via scheduler.record_failure) to FAILED. Reviewer-only deaths still flow through the existing handle_agent_failure HITL path because peer-review redistribution can recover them.
  • _emit_producer_death_alert helper publishes a high-priority OVERSEER_ALERT (anomaly producer-permanent-death) at the detection point — surfacing role, exit code, and recovery guidance (restart_phase / cancel_task) immediately.

Notes on scope and the persistence diagnostic

While tracing the issue I confirmed the user-flagged hypothesis: ConcurrentPhaseExecutor.handle_agent_failure adds an HITL decision to self.pipeline (the executor's in-memory reference), but the polling loop's mutations always go through pip = store.load_pipeline(...) / store.save_pipeline(pip) — a different object graph. Nothing persists the executor's mutation, which is why issue #2806 observed pending_decisions: [] after coder permanent death. With this PR, producer death no longer relies on that path at all; the orchestrator just fails the pipeline. The pre-existing persistence gap on the reviewer-failure HITL is out of scope here and deliberately left for a follow-up since the user's instruction was scoped to Option A + alerts.

Test plan

  • pytest orchestrator/tests/test_consensus_wrapper.py orchestrator/tests/test_producer_death_alert.py — new tests for the bumped default, the per-restart alert, and the producer-death alert helper (subject/metadata/body/slice propagation/error swallowing). 79 passed.
  • pytest orchestrator/tests/test_concurrent_executor.py orchestrator/tests/test_concurrent_phases.py — nearby tests unchanged. 67 passed.
  • ruff check clean on changed files; pre-commit hooks pass.
  • Live exercise on a real pipeline: trigger a producer crash beyond the new restart cap, confirm pipeline transitions to FAILED with the producer-permanent-death alert visible in /sdlc.

Linked issues

Producer agents whose consensus-wrapper exhausts its retry budget cannot
be replaced by the slice state machine. Previously the orchestrator left
such pipelines in `status: running` indefinitely while surviving
reviewers heartbeat-polled, masking the failure for 30+ minutes.

Four changes, all in the consensus-wrapper / polling-loop seam:

- Bump `MAX_CONSENSUS_RESTARTS` from 2 → 3 to give one extra recovery
  attempt before the orchestrator hard-fails the pipeline.
- Publish a medium-priority `OVERSEER_ALERT` (anomaly `agent-restart`)
  from the wrapper on each restart so operators see recovery attempts
  in real time rather than only after exhaustion.
- In `_run_concurrent_phase`, when a non-clean container exit's role is
  a producer in the filtered review graph, short-circuit the polling
  loop with a non-zero return so the caller transitions the pipeline
  (or slice, via `scheduler.record_failure`) to FAILED. Reviewer-only
  deaths still flow through the existing `handle_agent_failure` HITL
  path because peer-review redistribution can recover them.
- New `_emit_producer_death_alert` helper publishes a high-priority
  `OVERSEER_ALERT` (anomaly `producer-permanent-death`) at the
  detection point with role, exit code, and recovery guidance.

Tests: new `test_producer_death_alert.py` covers the alert helper end
to end (subject/metadata/body/slice propagation/error swallowing).
`test_consensus_wrapper.py` adds asserts for the bumped default and
the per-restart alert call site.
@james-in-a-box

This comment has been minimized.

The new producer-death short-circuit in _run_concurrent_phase returns
early when a producer container exits non-zero, bypassing the legacy
handle_agent_failure / HITL-decision / consensus-recovery paths.

Tests covering those legacy paths must use a non-producer role for the
failing container (reviewer_code, reviewer_code_holistic, reviewer_security,
reviewer_concurrency) to keep exercising the code under test. The
producer-death short-circuit itself is covered by the new
test_producer_death_alert.py.
@james-in-a-box
Copy link
Copy Markdown
Contributor

Fixed 10 test failures from the "Unit Tests" check.

Root cause: The new producer-death short-circuit in _run_concurrent_phase (routes/pipelines.py) returns early with (1, combined_logs) when a producer container exits non-zero, bypassing the legacy handle_agent_failure / HITL-decision / consensus-recovery paths.

Fix: Existing tests that exercised those legacy paths were written with CODER (a producer) as the failing container. Swapped them to non-producer roles so the legacy paths under test still run:

  • test_agent_exits_recorded.py::test_per_role_exit_records_in_chronological_order — coder → reviewer_code
  • test_brc_nack_iteration.py (2 tests) — coder → reviewer_code_holistic
  • test_consensus_polling.py (3 tests) — coder → reviewer_code
  • test_consensus_race_on_exit.py (3 tests) — coder/tester → reviewer_code/reviewer_security/reviewer_concurrency
  • test_consensus_timeout_recheck.py (1 test) — swapped exit codes so reviewer_code fails, coder exits clean

The producer-death short-circuit itself is covered by the new test_producer_death_alert.py introduced in this PR.

Pushed as 72d1627.

— Authored by egg

@james-in-a-box
Copy link
Copy Markdown
Contributor

james-in-a-box Bot commented May 27, 2026

Autofix tracking
{"Test/Unit Tests": 2}

@james-in-a-box

This comment has been minimized.

Copy link
Copy Markdown
Contributor

@egg-reviewer egg-reviewer Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review of #2811 — hard-fail pipeline on producer permanent death

Reviewed the full diff against origin/main (merge-base 99f8a008a8), traced the new short-circuit through _run_concurrent_phase, confirmed _emit_producer_death_alert's message-store interaction, verified the wrapper template renders, and walked each test change.

The direction is correct: a producer that exhausts its restart budget cannot be replaced by the slice state machine, so failing fast is strictly better than the prior silent-hang. The implementation is also self-consistent — is_producer() is the right discriminator and the alert + _stop_running_containers() ordering is fine.

No Blocking issues. Several Non-blocking suggestions below, ordered by importance.


Non-blocking

1. Lost test coverage in test_consensus_race_on_exit.py. Commit 72d1627 swaps CODER / TESTER (producers) for reviewer-only roles in three race-on-exit tests so they bypass the new short-circuit. That's a faithful rewrite for the new code path, but it deletes the only test coverage of the race the original PR (#1564 family) was written for: a producer container that exits non-zero while consensus was concurrently confirming. Under #2811's design that scenario will now hard-fail rather than recover via the final consensus re-check — which may be the intended outcome (Option A), but it should be tested explicitly, not silently swept under the rug.

Suggest: add a new test that exercises the producer-permanent-death path (executions include CODER, all containers exit code 1, expect (1, combined_logs) containing PRODUCER PERMANENT DEATH). Then the original "producer races with consensus" coverage isn't lost — it's just been re-shaped to assert the new contract.

2. No integration test for the new short-circuit branch in _run_concurrent_phase. test_producer_death_alert.py exercises _emit_producer_death_alert in isolation (4 tests, all hit the helper directly), but nothing exercises pipelines.py:17859-17896 — the branch that decides whether the alert fires and whether _stop_running_containers() runs before returning (1, …). The behavioural claim of the PR ("producer death → phase fails immediately") is not covered. Suggest a polling-loop test in the spirit of test_consensus_polling.py that asserts result == (1, …), combined_logs contains the death marker, and _stop_running_containers was called.

3. _emit_producer_death_alert accepts pipeline: Pipeline but never reads it (orchestrator/routes/pipelines.py:15310). The body only references pipeline_id, role, phase, slice_id, exit_code. Either drop the parameter or document why it's there for future use. The call site at pipelines.py:17873 would simplify accordingly.

4. Race window between step 1 (consensus check) and step 4 (exit detection) within the same poll iteration. The new branch returns (1, …) as soon as a producer container is observed in EXITED status with a non-clean code. But a producer can legitimately exit non-zero after completing CONFIRMED if the wrapper itself crashes on cleanup — in which case the step-1 check_consensus() of the next iteration would have returned is_complete=True. The pre-existing step-5 final recheck handled this for the all-containers-exited case; the new branch skips it.

Concretely: between step 1 (consensus = incomplete) and step 4 (producer EXITED), the producer could have written its CONFIRMED state and then died. Suggest adding a one-shot executor.check_consensus() inside the new branch, identical in spirit to the step-5 recheck:

if filtered_graph.is_producer(role_value):
    try:
        recheck = executor.check_consensus()
    except Exception:
        recheck = {"is_complete": False}
    if not recheck.get("is_complete"):
        _emit_producer_death_alert(...)
        ...
        return 1, combined_logs
    # else fall through; the next iteration's step 1/2 will return 0

This preserves the pre-#2811 recovery semantics for the narrow race while still hard-failing the genuinely-dead-producer case.

5. Wrapper's egg-orch overseer alert call has no timeout (orchestrator/consensus_wrapper.py:500-513). If the orchestrator is unreachable, the call could hang and delay (or stall) the wrapper's restart loop. The >/dev/null 2>&1 || true swallows errors but doesn't bound wall-clock time. Suggest wrapping with timeout 5 (consistent with other gateway/orch calls in the wrapper).

6. Test duplication. test_default_max_restarts_value_is_three in test_consensus_wrapper.py mostly duplicates test_default_max_restarts — same regex against the same template. Consider folding into one parameterised test or removing the older one.

7. Logs ordering when producer dies before any other container has logged. combined_logs = "\n".join(all_logs) at line 17889 is fine when all_logs is non-empty, but the leading \n--- PRODUCER PERMANENT DEATH ---\n produces a stray blank first line when all_logs == []. Minor; consider "\n".join(all_logs + ["--- PRODUCER PERMANENT DEATH ---", "…"]).

8. Alert metadata. _emit_producer_death_alert does not include exit_code in the rendered body's first line / subject — it's in the body prose and in metadata, which is fine, but operators scanning the alert list see only producer-permanent-death: <role> [high]. Consider producer-permanent-death: {role} exit={exit_code} [high] for at-a-glance triage.


Confirmed working

  • MessageType.OVERSEER_ALERT is a plain string constant; the msg.message_type == "OVERSEER_ALERT" assertions in test_producer_death_alert.py are correct.
  • The {{…}} braces in consensus_wrapper.py are intentional escapes for str.format() — they render as literal {…} in the emitted shell.
  • filtered_graph.is_producer(role_value) correctly distinguishes coder/tester/documenter from reviewer roles (tester is producer-side per the implement-phase graph).
  • The egg-orch overseer alert CLI accepts --role, --anomaly, --priority, --summary, --detail exactly as the wrapper invokes them.
  • _stop_running_containers() is called before return, so reviewers blocked waiting on the dead producer's proposal will get SIGTERM rather than heartbeating indefinitely.

— Authored by egg

@james-in-a-box

This comment has been minimized.

@james-in-a-box

This comment has been minimized.

Reviewer egg-reviewer raised 8 non-blocking suggestions. Addressed all
actionable items in this PR:

- Add race-window recheck inside the producer-death branch in
  `_run_concurrent_phase`: re-query `executor.check_consensus()` before
  hard-failing; if consensus is complete, log + `continue` so the next
  iteration's step 1/2 returns success. Preserves the pre-#2806 recovery
  semantics for the wrapper-cleanup-after-CONFIRMED race.
- Drop the unused `pipeline` parameter from `_emit_producer_death_alert`.
- Wrap the wrapper's `egg-orch overseer alert` call with `timeout 5` so
  a stalled orchestrator cannot delay the restart loop.
- Add exit code to the alert subject (`producer-permanent-death: <role>
  exit=<N> [high]`) for at-a-glance operator triage.
- Use list-join when building `combined_logs` to avoid a leading blank
  line when `all_logs` is empty.
- Add two integration tests in `test_consensus_polling.py` covering the
  producer-death short-circuit branch in `_run_concurrent_phase` (the
  helper-only tests in `test_producer_death_alert.py` left that branch
  uncovered): the hard-fail path and the new race-window recheck path.
- Fold the duplicate `test_default_max_restarts_value_is_three` into
  `test_default_max_restarts`; update the overseer-alert wrapper test to
  assert `timeout 5 egg-orch overseer alert`.
- Refresh `docs/guides/concurrent-execution.md` step 6 to describe the
  recheck before the producer-death hard-fail.

Closes review feedback on #2811.
@james-in-a-box
Copy link
Copy Markdown
Contributor

Review response — #2811

Thanks for the thorough review. Disposition per item below; all in-PR fixes landed in commit e810e6e.

1. Lost test coverage in test_consensus_race_on_exit.py — fixed-in-PR (commit e810e6e).
Added test_producer_death_short_circuits_phase to test_consensus_polling.py. Exercises CODER (producer) exit=1 through _run_concurrent_phase, asserts (1, logs) carries the PRODUCER PERMANENT DEATH marker, surviving reviewer container is stopped via _stop_running_containers, handle_agent_failure is NOT called, and the OVERSEER_ALERT is published. The original "producer races with consensus" scenario is now expressed as the new contract.

2. No integration test for the new short-circuit branch — fixed-in-PR (commit e810e6e).
Covered by item 1 above plus test_producer_death_skipped_when_consensus_completes_in_race (which asserts the race-window recheck added per item 4).

3. _emit_producer_death_alert accepts unused pipeline param — fixed-in-PR (commit e810e6e).
Dropped pipeline from the signature; updated the single call site in _run_concurrent_phase and both the unit tests in test_producer_death_alert.py.

4. Race window between step 1 (consensus check) and step 4 (exit detection) — fixed-in-PR (commit e810e6e).
Added an executor.check_consensus() recheck inside the producer-death branch, in the spirit of the step-5 final recheck. On is_complete=True we log and continue so the next iteration's step 1/2 returns success; on incomplete (or recheck exception → treated as incomplete) we emit the alert and hard-fail. The new test_producer_death_skipped_when_consensus_completes_in_race test locks this in. Also refreshed docs/guides/concurrent-execution.md step 6 to describe the recheck.

5. Wrapper's egg-orch overseer alert call has no timeout — fixed-in-PR (commit e810e6e).
Wrapped with timeout 5 (consistent with --connect-timeout 5 on the SSE curl earlier in the wrapper). Updated test_restart_emits_overseer_alert to assert timeout 5 egg-orch overseer alert so the timeout doesn't silently regress.

6. Test duplication (test_default_max_restarts_value_is_three vs test_default_max_restarts) — fixed-in-PR (commit e810e6e).
Folded the value-is-three assertion into test_default_max_restarts and removed the duplicate.

7. Stray blank first line in combined_logs when all_logs == [] — fixed-in-PR (commit e810e6e).
Rebuilt as a list-join: "\n".join(all_logs + ["--- PRODUCER PERMANENT DEATH ---", "..."]). No leading \n regardless of all_logs length.

8. Alert metadata — include exit_code in subject — fixed-in-PR (commit e810e6e).
Subject now renders as producer-permanent-death: <role> exit=<N> [high]. Updated test_publishes_high_priority_overseer_alert to assert the exact subject and added a 137-exit assertion in the slice-id test.

— Authored by egg

@james-in-a-box

This comment has been minimized.

@james-in-a-box

This comment has been minimized.

@james-in-a-box
Copy link
Copy Markdown
Contributor

Check Failure: Unit Tests (autofixer report)

What's failing: 27 tests in tests/plugins/refine_plan/test_emit_contract.py and tests/plugins/refine_plan/test_validate_yaml_tasks.py fail with FileNotFoundError on the scripts they exercise:

FileNotFoundError: [Errno 2] No such file or directory:
  '/home/runner/work/egg/egg/plugins/refine-plan/skills/refine-plan/bin/emit-contract'
  '/home/runner/work/egg/egg/plugins/refine-plan/skills/refine-plan/bin/validate-yaml-tasks'

Root cause: Not caused by this PR. CI tests the merge commit of this PR into main (bcb63dd Merge e810e6e... into ea70e72...). The latest main commit — ea70e72 rm refine-plan plugin — deleted the entire plugins/refine-plan/ directory, including the two bin scripts these tests load via SourceFileLoader/read_text. It did not delete the corresponding test files under tests/plugins/refine_plan/, so they are now orphaned on main and fail for every PR merged against it.

This PR touches only orchestrator/, docs/, and orchestrator tests (see the PR file list); the missing scripts are not in its scope.

What needs to be done:

  • A separate cleanup commit on main to delete the orphaned tests:
    • tests/plugins/refine_plan/test_emit_contract.py
    • tests/plugins/refine_plan/test_validate_yaml_tasks.py
    • tests/plugins/refine_plan/__init__.py (if no other tests remain in that package)
  • Once main is green again, rebasing/merging this PR will pick up the fix and the Unit Tests check will pass without changes here.

Suggestion: If you'd rather fold the cleanup into this PR to unblock it, say the word and I'll add a commit deleting those three files. I held off because the rule is "fix only the listed issues; do not fix unrelated code" and these tests have no relation to producer-death handling.

— Authored by egg

@james-in-a-box

This comment has been minimized.

@james-in-a-box

This comment has been minimized.

Copy link
Copy Markdown
Contributor

@egg-reviewer egg-reviewer Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review of #2811 — round 2 (commit e810e6e)

Re-read the delta 72d1627..83f765d, traced the new race-window recheck through _run_concurrent_phase, verified the _emit_producer_death_alert signature change against its call sites and tests, and walked the two new integration tests in test_consensus_polling.py.

All 8 items from the previous review are addressed correctly:

  1. Lost test coveragetest_producer_death_short_circuits_phase in test_consensus_polling.py now exercises the producer-death short-circuit through the polling loop end-to-end (CODER exit=1, asserts (1, logs) with the death marker, surviving reviewer stopped, handle_agent_failure not called, alert published).
  2. No integration test for the short-circuit branch — covered by item 1 plus the new race-window test.
  3. Unused pipeline param — removed from the helper signature, both call sites and all four helper tests updated.
  4. Race window — the new recheck inside the producer-death branch (pipelines.py:17879-17901) preserves the pre-#2811 wrapper-cleanup-after-CONFIRMED recovery semantics. The try/except → recheck = {"is_complete": False} fallback on check_consensus exception is the safe direction (treat unknown as not-complete and hard-fail).
  5. timeout 5 on the wrapper's overseer alert — wrapped correctly; || true continues to swallow non-zero exits (including 124/137 from the timeout itself or 127 if timeout were missing).
  6. Test duplicationtest_default_max_restarts_value_is_three folded into test_default_max_restarts.
  7. Stray blank line — rebuilt as "\n".join(all_logs + [...]), no leading newline regardless of all_logs length.
  8. Exit code in subjectproducer-permanent-death: {role} exit={exit_code} [high], asserted explicitly in both helper tests (exit=1 and exit=137).

The docs change in concurrent-execution.md step 6 accurately reflects the new "re-query consensus once more" race-window guard.

No Blocking issues. A few minor Non-blocking observations:


Non-blocking

1. The recheck's is_complete check does not consider has_unresolved_nacks (pipelines.py:17889). Compare to the step-5 final recheck (pipelines.py:17974-17997) which explicitly handles the "complete with unresolved NACKs" case by creating an HITL decision and returning (1, …). The new race-window recheck just trusts is_complete=True and continues — relying on step 5's later final recheck (or step 2's next-iteration check) to handle NACKs.

Step 2 has the same lighter check, so this is consistent with existing behavior, and the #1671 "all roles in confirmed_roles" override is intentional. But note that if a future change tightens step 2's NACK handling, this new branch should be tightened in lockstep.

2. The comment block at pipelines.py:17896-17900 says "let the next iteration's step 1/2 return success." In practice, when the dying producer is the last running container (the common case once _stop_running_containers has already fired earlier in the pipeline), step 5's same-iteration if len(exited_containers) >= len(active_executions) path is what returns success — not the next iteration's step 1/2. The behavior is correct; only the comment is imprecise. A one-word fix: "let step 5 (or the next iteration's step 1/2) return success."

3. The race-window test (test_producer_death_skipped_when_consensus_completes_in_race) is brittle to the call ordering of check_consensus. It counts polls with a mutable closure (poll_count = [0]) and assumes:

  • call #1 = step 1 of iter 1 (incomplete)
  • call #2 = recheck inside producer-death branch (complete)
  • call #3+ = step 5 final recheck (complete)

If anyone adds another check_consensus() call earlier in the loop, the test would silently start exercising a different scenario. Consider asserting mock_executor_instance.check_consensus.call_count or making the side_effect content-addressed (e.g., look at recent args) rather than position-addressed. Minor — current tests pass and the assertion exit_code == 0 is the load-bearing check.

4. _emit_producer_death_alert subject omits slice_id. Mentioned in the previous review's item 8 — fixed for exit_code but slice_id is still only in body/metadata. For slice-cascade pipelines, operators triaging the alert list would see N identical subjects (one per slice) and have to expand each to find which slice died. Consider producer-permanent-death: {role} exit={exit_code}{slice_render} [high] where slice_render = f" slice={slice_id}" when set. Not blocking — metadata is queryable.


Confirmed working

  • The recheck-exception path correctly degrades to {"is_complete": False} and hard-fails, which is the conservative direction.
  • _record_container_exit runs BEFORE the producer-death branch, so has_failures[0] is correctly set to True when continue falls through to step 5. Step 5's NACK-aware final recheck then handles the genuine completion correctly.
  • _stop_running_containers is only called in the hard-fail path — when consensus completes in the race window, surviving reviewers are allowed to finish naturally (and will be stopped by step 2/step 5's consensus-reached path).
  • The CI suite (Unit Tests, Integration Tests, Lint, Docker, Security Scan) is green on 83f765d.
  • The timeout 5 wrapping is shell-correct: timeout 5 cmd … >/dev/null 2>&1 || true bounds wall-clock and swallows exit 124/137.

— Authored by egg

@james-in-a-box

This comment has been minimized.

@james-in-a-box

This comment has been minimized.

- Subject now includes slice_id when set: `producer-permanent-death:
  <role> exit=<N>[ slice=<id>] [high]` so per-slice cascades are
  triagable at-a-glance in the alert list (item 4).
- Race-window recheck comment clarified: step 5 (or the next iteration's
  step 1/2) returns success, not just the next iteration (item 2).
- Race-window test decoupled from absolute call indices: monotonic
  clock has its own counter, and check_consensus side_effect uses a
  one-shot iterator that returns incomplete on the first call and
  complete on every subsequent call. Adding another check_consensus
  call earlier in the loop would not silently change the test scenario
  (item 3). Lower-bound assertion on call_count locks the invariant in.
@james-in-a-box
Copy link
Copy Markdown
Contributor

Review response — #2811 round 2

Thanks for the re-review. Round-1 items were already approved on 83f765d; addressing the four round-2 non-blocking items below in 67e6bfc.

Round 2 item 1 — Race-window recheck does not consider has_unresolved_nacks — disagree (consistent with step 2 behavior, reviewer noted as intentional).
The reviewer explicitly observed this "is consistent with existing behavior" (step 2's lighter is_complete check uses the same shape) and called out the lockstep-tightening note as a forward-looking observation, not an actionable change for this PR. Step 5's later final recheck still handles the NACK case correctly when the race-window recheck continues, so the load-bearing behavior is intact. Bringing NACK handling into the race-window recheck would diverge it from step 2 — that should land as a single coupled change if/when step 2 tightens, not asymmetrically here.

Round 2 item 2 — Comment imprecision ("next iteration's step 1/2" vs "step 5 or next iteration's step 1/2") — fixed-in-PR (commit 67e6bfc).
Updated the comment block at pipelines.py:17896-17900 to read "Step 5 (or the next iteration's step 1/2) will return success" — matching the actual behavior when the dying producer is the last running container (step 5 fires same-iteration via the len(exited_containers) >= len(active_executions) path).

Round 2 item 3 — Race-window test brittle to check_consensus call ordering — fixed-in-PR (commit 67e6bfc).
Refactored test_producer_death_skipped_when_consensus_completes_in_race:

  • tick counter for time.monotonic is now independent of check_consensus calls (one increment per _monotonic() invocation, not per consensus check).
  • _check_consensus uses a one-shot iter([incomplete]) plus a StopIteration fallback to complete. The invariant is "first call returns incomplete, every subsequent call returns complete" — not tied to absolute call indices. Adding another check_consensus() call earlier in the loop would shift which call is "first" but not change which scenario the test exercises.
  • Added assert mock_executor_instance.check_consensus.call_count >= 2 to lock in the load-bearing pair (initial step-1 check + race-window recheck).

Round 2 item 4 — _emit_producer_death_alert subject omits slice_id — fixed-in-PR (commit 67e6bfc).
Subject now renders as producer-permanent-death: <role> exit=<N> slice=<id> [high] when slice_id is set, and producer-permanent-death: <role> exit=<N> [high] when it is not — so operators triaging a slice-cascade pipeline can see which slice died without expanding each alert. Updated test_slice_id_propagates_into_metadata_and_body to assert the exact new subject including slice=slice-2.

— Authored by egg

@james-in-a-box

This comment has been minimized.

Copy link
Copy Markdown
Contributor

@egg-reviewer egg-reviewer Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review of #2811 — round 3 (commit 67e6bfc)

Re-read the delta 83f765d..67e6bfc, traced the three remaining non-blocking items from round 2, and walked the test refactor and subject-format change.

All three round-2 actionable items addressed correctly:

  1. Item 2 — Comment imprecision at pipelines.py:17896-17900. Updated to "Step 5 (or the next iteration's step 1/2) will return success." Verified against the loop structure: continue exits the inner for exec_info in active_executions loop, then control falls through to step 5 (pipelines.py:17955). When the dying producer is the last running container — common when _stop_running_containers has already fired — len(exited_containers) >= len(active_executions) is true on the same iteration, so step 5's final recheck is what returns success. When other producers/reviewers are still running, step 5 is skipped and the next iteration's step 1/2 returns success. The "or" disjunction is accurate.

  2. Item 3 — Race-window test brittle to call ordering. Refactored test_producer_death_skipped_when_consensus_completes_in_race cleanly:

    • tick[0] increments inside _monotonic (one per call), decoupled from check_consensus invocations.
    • consensus_returns = iter([incomplete]) + StopIteration → complete fallback makes the side_effect content-addressed by call order, not absolute index. Adding another check_consensus() earlier in the loop would just shift which call is "first incomplete" but preserve the load-bearing invariant.
    • assert check_consensus.call_count >= 2 locks the load-bearing pair (initial step-1 incomplete + race-window recheck complete) without being brittle to step 5's third recheck.
  3. Item 4 — slice_id in alert subject. subject_slice = f" slice={slice_id}" if slice_id else "" correctly produces:

    • producer-permanent-death: coder exit=1 [high] when slice_id is None
    • producer-permanent-death: tester exit=137 slice=slice-2 [high] when slice_id == "slice-2"
    • Both call-site assertions (test_producer_death_short_circuits_phase for None, test_slice_id_propagates_into_metadata_and_body for set) cover the format.

Item 1 (NACK handling in race-window recheck) — disagree accepted. The author's reasoning is sound: the new race-window recheck (pipelines.py:17890) mirrors step 2's existing is_complete-only check, not step 5's NACK-aware check. Keeping them in lockstep so a future tightening of step 2 brings the race-window recheck along is a defensible choice. The functional consequence is at worst one extra iteration: when the race-window recheck continues with is_complete=True and has_unresolved_nacks=True, the next iteration's step 2 (or step 5's NACK-aware path with has_failures[0]==True) catches it. No correctness regression.

Confirmed working

  • CI is green (Unit Tests, Integration Tests, Lint, Docker, Security Scan, Python).
  • The recheck-exception fallback (recheck = {"is_complete": False}) correctly fails closed — if we can't determine consensus, hard-fail rather than risk leaving the pipeline running.
  • _record_container_exit at pipelines.py:17852 (before the producer-death branch) correctly sets has_failures[0] = True, so step 5's final recheck path runs when continue falls through to it.
  • The race-window test's monotonic clock is now per-call, not per-poll, so the loop's elapsed = time.monotonic() - start_time advances naturally regardless of how many check_consensus calls happen per iteration.
  • _emit_producer_death_alert signature, body, and metadata changes are consistent: subject now includes slice (in addition to the existing body (slice X) render and metadata["slice_id"]).

Non-blocking

1. slice_id falsiness inconsistency between subject and metadata (pipelines.py:15329 vs 15349). Subject uses truthy: f" slice={slice_id}" if slice_id else "". Metadata uses is not None: if slice_id is not None: metadata["slice_id"] = slice_id. For slice_id == "", the subject would omit but metadata would include as empty string. In practice slice_id is either None or a validated slice-<N> string (per extract_slice_id / SLICE_ID_PATTERN), so this can't fire — but the asymmetry is a minor footgun for a future caller. One-line fix: if slice_id is not None: at line 15329 too, or if slice_id: at line 15349. Non-blocking.

2. End-to-end coverage of slice_id in subject through _run_concurrent_phase. The new subject format is asserted at the helper level (test_slice_id_propagates_into_metadata_and_body) and indirectly at the integration level (test_producer_death_short_circuits_phase with slice_id=None). No integration test exercises the producer-death short-circuit with a non-None slice_id. The risk surface is small (call site at pipelines.py:17907 just forwards the loop-scope slice_id straight through), but a per-slice cascade producing the subject through the full polling-loop path is currently only smoke-tested. Non-blocking.


Verdict

Approve. All actionable round-2 items addressed; round-2 item 1 (NACK handling) disagreed with sound reasoning; CI is green; the comment correction, test refactor, and subject-format change are all correct. The two non-blocking observations above are tiny and can be folded into a follow-up if desired.

— Authored by egg

@james-in-a-box

This comment has been minimized.

@james-in-a-box

This comment has been minimized.

Round 3 non-blocking items:

1. slice_id falsiness inconsistency between subject and metadata
   (orchestrator/routes/pipelines.py:15329). Subject used a truthy
   check while metadata at :15349 used 'is not None' — for slice_id
   == "" the subject would have omitted but metadata would have
   included as empty string. Aligned both subject and body's
   slice_render on 'is not None' so the helper agrees on edge values.
   Can't fire today (extract_slice_id validates to slice-<N> upstream)
   but the asymmetry was a future footgun.

2. End-to-end coverage of slice_id in subject through
   _run_concurrent_phase. Added
   test_producer_death_short_circuits_phase_with_slice_id to
   test_consensus_polling.py: passes slice_id="slice-2" via the
   kwarg, exit_code=137 (OOM-shape), and asserts the alert subject
   carries 'slice=slice-2' plus the slice_id metadata. Closes the
   per-slice triage format gap that previously only had helper-level
   coverage.
@james-in-a-box
Copy link
Copy Markdown
Contributor

Review response — #2811 round 3

Thanks for the round-3 re-review. Both non-blocking items addressed in commit a277055.

Round 3 item 1 — slice_id falsiness inconsistency between subject and metadata — fixed-in-PR (commit a277055).
Aligned the subject's slice rendering (and the body's slice_render) on is not None, matching the metadata check at pipelines.py:15349. The two checks no longer disagree on edge values like slice_id == "". As the reviewer noted, this can't fire today because extract_slice_id / SLICE_ID_PATTERN upstream validates to slice-<N> or None, but keeping the helper internally consistent removes the future-footgun shape.

Round 3 item 2 — End-to-end coverage of slice_id in subject through _run_concurrent_phase — fixed-in-PR (commit a277055).
Added test_producer_death_short_circuits_phase_with_slice_id to test_consensus_polling.py. The test passes slice_id="slice-2" to _run_concurrent_phase directly via the kwarg, drives a CODER (producer) exit with code 137 (OOM shape, distinguishing it from the exit=1 case in the sibling test), and asserts the alert subject is exactly producer-permanent-death: coder exit=137 slice=slice-2 [high] plus metadata["slice_id"] == "slice-2". The per-slice triage format is now covered through the full polling-loop path, not just at the helper level.

— Authored by egg

@james-in-a-box

This comment has been minimized.

Copy link
Copy Markdown
Contributor

@egg-reviewer egg-reviewer Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review of #2811 — round 4 (commit a277055)

Re-read the delta 67e6bfc..a277055 (9-line production change + 71-line new test). Both non-blocking items from round 3 are addressed correctly.

Round-3 items addressed

  1. Item 1 — slice_id falsiness inconsistency (pipelines.py:15334, :15336). Both subject_slice and slice_render now use if slice_id is not None, matching the metadata check at :15354. Inline comment cites the asymmetry ("" would have produced omit-subject + empty-string-metadata) and notes extract_slice_id validates upstream so this can't fire today. The fix is the right shape — align all three checks rather than weaken metadata.

  2. Item 2 — End-to-end slice_id coverage. test_producer_death_short_circuits_phase_with_slice_id in test_consensus_polling.py:739-802 passes slice_id="slice-2" through _run_concurrent_phase, uses exit_code=137 (OOM-shape, complements the existing test's exit=1), and asserts both the subject format ("producer-permanent-death: coder exit=137 slice=slice-2 [high]") and metadata["slice_id"]/["exit_code"]. Closes the gap where the per-slice subject format was only smoke-tested at the helper level.

Confirmed clean

  • _publish_consensus_timeout_alert doesn't render slice_id in subject/body — only in metadata. The producer-death helper's choice to render in subject + body is a deliberate divergence (per-slice triage at-a-glance per round-2 item 4), not a consistency violation.
  • The repo's prevailing pattern for slice_id is is not None (see lines 2447, 2593, 4466, 11156, 15268, 15354). The fix aligns with that convention. The remaining if slice_id: truthy checks at lines 9022/9031/9085 are pre-existing and outside the scope of this PR.
  • Test setup is consistent with the existing test_producer_death_short_circuits_phase: same mocks, _CALL_ARGS, _base_mocks, lock fixture. Only slice_id and exit_code differ, which is exactly the right axis to vary.
  • CI is green: Unit Tests, Integration Tests, Lint (Python/Shell/YAML/Docker/Actions), Security Scan, Custom Checks all SUCCESS at HEAD.
  • The new test asserts msg_store.add_message.call_count == 1 — guards against accidental duplicate alerts in the short-circuit path.

Behavioral delta of the is not None change

The change widens what counts as a slice-bearing alert: slice_id == "" would now render slice= in the subject and (slice ) in the body. The producer's comment correctly notes this is unreachable through extract_slice_id (which validates against SLICE_ID_PATTERN for slice-<N> shape) and the call site at :17912 forwards the loop-scope value directly. No new code path produces an empty-string slice_id. The change is purely defensive consistency, not a behavioral expansion in practice.


Verdict

Approve. Both non-blocking round-3 items addressed; the change is minimal, surgical, and matches the codebase's is not None convention. CI green. No new issues.

— Authored by egg

@james-in-a-box
Copy link
Copy Markdown
Contributor

egg review completed. View run logs

14 previous review(s) hidden.

@jwbron jwbron merged commit 352378e into main May 27, 2026
23 checks passed
jwbron added a commit that referenced this pull request May 27, 2026
Update consensus wrapper description (item 6) to reflect changes from
issue #2806 / PR #2811:
- MAX_CONSENSUS_RESTARTS bumped from 2 to 3
- Each restart now emits a medium-priority OVERSEER_ALERT
- Producer permanent death now hard-fails the pipeline (FAILED state)
  with a high-priority OVERSEER_ALERT rather than the generic single-
  agent HITL path; reviewer deaths still use the HITL path

Co-authored-by: jwbron <8340608+jwbron@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Pipeline should auto-mark FAILED when any producer agent dies permanently (retry budget exhausted)

1 participant