Bring conductor resume to flag parity with conductor run #158

Merged
jrob5756 merged 2 commits into main from resume-run-parity
May 6, 2026

jrob5756 (Collaborator) commented May 6, 2026

Problem

conductor resume was missing several flags that exist on conductor run, making the recovery story confusing and broken in important cases. The most painful gap: a workflow started with --web or --web-bg could not be resumed with a dashboard, so users lost visibility exactly when something had just gone wrong.

What's added to resume

| Flag | Purpose |
| --- | --- |
| --provider / -p | Runtime provider override |
| --metadata / -m | CLI metadata merged on top of YAML metadata |
| --web | Start a real-time web dashboard |
| --web-port | Dashboard port (0 = auto-select) |
| --web-bg | Fork a detached process running resume + dashboard, print URL, exit |

--web and --web-bg are mutually exclusive (matching run).
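
A minimal sketch of the mutual-exclusion rule, for illustration only. The helper name `validate_web_flags` and the error text are hypothetical; the actual check in the CLI may be structured differently.

```python
def validate_web_flags(web: bool, web_bg: bool) -> None:
    """Reject --web combined with --web-bg (hypothetical helper, not the real CLI code)."""
    if web and web_bg:
        # Both flags start a dashboard; one runs in the foreground, the other
        # detaches, so asking for both at once is ambiguous.
        raise ValueError("--web and --web-bg are mutually exclusive")


# Either flag alone (or neither) is accepted.
validate_web_flags(web=True, web_bg=False)
validate_web_flags(web=False, web_bg=True)
validate_web_flags(web=False, web_bg=False)
```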

Intentionally not mirrored

| Flag | Why |
| --- | --- |
| --input / -i | Inputs are restored from the checkpoint context |
| --workspace-instructions, --instructions | instructions_preamble is persisted in the checkpoint |
| --dry-run | Incompatible with executing from a saved point |

Implementation notes

  • resume_workflow_async() now wires up the same WorkflowEventEmitter, EventLogSubscriber, ConsoleEventSubscriber, WebDashboard lifecycle, and RunContext as run_workflow_async().
  • Stop-signal handling refactored into a shared _execute_with_stop_signal helper used by both _run_with_stop_signal and the new _resume_with_stop_signal.
  • New launch_background_resume() in bg_runner.py forks a detached conductor resume subprocess and writes a PID file so conductor stop can find it.
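
The background launch in the last bullet could look roughly like the sketch below. This is not the code in bg_runner.py: `build_resume_command` and `launch_detached` are hypothetical names, and the flag order, the `conductor` executable being on PATH, and the child being started with `--web` are assumptions. Only `--from`, `--web-port`, `--provider`, `--metadata`, and `--no-interactive` are taken from this PR.

```python
from __future__ import annotations

import subprocess
from pathlib import Path


def build_resume_command(
    workflow_path: str | None = None,
    checkpoint_path: str | None = None,
    web_port: int = 0,
    provider: str | None = None,
    metadata: dict[str, str] | None = None,
) -> list[str]:
    """Build the argv for a detached `conductor resume` child (illustrative)."""
    if workflow_path is None and checkpoint_path is None:
        raise ValueError("either workflow_path or checkpoint_path is required")
    cmd = ["conductor", "resume"]
    if workflow_path is not None:
        cmd.append(workflow_path)
    if checkpoint_path is not None:
        cmd += ["--from", checkpoint_path]
    # Assumption: the child runs the dashboard in its own foreground.
    cmd += ["--web", "--web-port", str(web_port)]
    if provider is not None:
        cmd += ["--provider", provider]
    for key, value in (metadata or {}).items():
        cmd += ["--metadata", f"{key}={value}"]
    # A detached child has no terminal to prompt on.
    cmd.append("--no-interactive")
    return cmd


def launch_detached(cmd: list[str], pid_file: Path) -> int:
    """Start cmd detached from the caller and record its PID (illustrative)."""
    proc = subprocess.Popen(
        cmd,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # on POSIX, detach from the parent's session
    )
    try:
        pid_file.write_text(str(proc.pid))
    except OSError:
        # Without a PID file the child could not be found later, so stop it.
        proc.terminate()
        raise
    return proc.pid
```

Keeping command construction separate from process spawning makes the argv testable without starting a subprocess.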

Dashboard behavior on resume

Documented in the docstring: the dashboard only shows events from the resumed agent forward. Events from agents that completed before the checkpoint were emitted in the original process and are not replayed.
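
A toy model of why earlier events are missing, assuming a simple in-process publish/subscribe emitter with no replay buffer. The `Emitter` class and the event fields are hypothetical and are not the real WorkflowEventEmitter API.

```python
from typing import Callable

Event = dict[str, str]


class Emitter:
    """Toy in-process emitter: subscribers only see events emitted after they attach."""

    def __init__(self) -> None:
        self._subscribers: list[Callable[[Event], None]] = []

    def subscribe(self, callback: Callable[[Event], None]) -> None:
        self._subscribers.append(callback)

    def emit(self, event: Event) -> None:
        for callback in list(self._subscribers):
            callback(event)


# Original process: an agent completes, then the process dies.
original = Emitter()
original.emit({"agent": "planner", "type": "agent_completed"})

# Resumed process: a fresh emitter and a fresh dashboard subscriber.
resumed = Emitter()
dashboard_events: list[Event] = []
resumed.subscribe(dashboard_events.append)
resumed.emit({"agent": "writer", "type": "agent_started"})

# The dashboard holds only the event emitted after the resume.
print(dashboard_events)
```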

Future-proofing parity

Added a new Run / Resume Parity subsection to AGENTS.md (mirroring the existing Provider Parity style) listing the parity rule, the flags that must stay aligned, and the flags intentionally skipped — so future contributors keep them in sync.

Tests

10 new cases in tests/test_cli/test_resume_command.py:

  • --provider / -m flags pass through to resume_workflow_async
  • Malformed --metadata is rejected
  • --web + --web-port flags pass through
  • --web + --web-bg mutex error
  • --web-bg dispatches to launch_background_resume with workflow path or --from checkpoint
  • Direct unit tests of launch_background_resume command construction (subcommand, --from, port, provider, metadata) and ValueError when neither workflow_path nor checkpoint_path is given
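
The metadata cases above can be pictured with the following sketch. It assumes `--metadata` takes `KEY=VALUE` items; `parse_metadata` and `merge_metadata` are hypothetical helpers and the real parsing code may differ.

```python
def parse_metadata(items: list[str]) -> dict[str, str]:
    """Parse KEY=VALUE strings, splitting on the first '=' only (illustrative)."""
    parsed: dict[str, str] = {}
    for item in items:
        key, sep, value = item.partition("=")
        if not sep or not key:
            # No '=' at all, or an empty key: treat as malformed.
            raise ValueError(f"invalid metadata {item!r}; expected KEY=VALUE")
        parsed[key] = value
    return parsed


def merge_metadata(yaml_metadata: dict[str, str], cli_metadata: dict[str, str]) -> dict[str, str]:
    """CLI values win over YAML values with the same key."""
    return {**yaml_metadata, **cli_metadata}


cli = parse_metadata(["env=prod", "query=a=b"])
merged = merge_metadata({"env": "dev", "team": "core"}, cli)
print(merged)
```

Splitting on the first `=` only is what lets a value that itself contains `=` pass through intact.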

Verification

  • uv run ruff check src tests — clean
  • uv run ruff format --check src tests — clean
  • Full test suite: 2382 passed, 9 skipped (no regressions)

Example

# Start a workflow with the dashboard
conductor run workflow.yaml --web-bg

# It crashes — now resume it WITH the dashboard
conductor resume workflow.yaml --web-bg

jrob5756 and others added 2 commits May 6, 2026 08:51
Adds the run-only flags that are meaningful during resumed execution to
the resume command, fixing the broken UX where a workflow started with
`--web` or `--web-bg` could not be resumed with a dashboard.

New flags on resume:
- --provider / -p     runtime provider override
- --metadata / -m     CLI metadata merged on top of YAML metadata
- --web               start a real-time web dashboard
- --web-port          dashboard port (0 = auto-select)
- --web-bg            fork a detached process running resume + dashboard

Intentionally not mirrored:
- --input             restored from checkpoint context
- --workspace-instructions / --instructions
                      instructions_preamble persisted in checkpoint
- --dry-run           incompatible with executing from a saved point

Implementation:
- resume_workflow_async() now wires up the same WorkflowEventEmitter,
  EventLogSubscriber, ConsoleEventSubscriber, WebDashboard lifecycle,
  and RunContext as run_workflow_async().
- Stop-signal handling refactored into shared _execute_with_stop_signal
  used by both _run_with_stop_signal and the new _resume_with_stop_signal.
- New launch_background_resume() in bg_runner.py forks a detached
  `conductor resume` subprocess with the dashboard and writes a PID
  file so `conductor stop` can find it.
- AGENTS.md gains a Run / Resume Parity subsection (mirroring the
  Provider Parity style) so future flag additions stay aligned.

Notes the dashboard caveat in the docstring: on resume, only events
from the resumed agent forward are shown. Events from agents that
completed before the checkpoint were emitted in the original process
and are not replayed.

Tests: 10 new cases covering provider/metadata pass-through,
--web flag handling, --web/--web-bg mutex, --web-bg dispatch to
launch_background_resume, malformed metadata rejection, and direct
unit tests of launch_background_resume command construction.

Verification: full suite (2382 passed / 9 skipped), lint clean,
format clean.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…d wiring tests

bg_runner.py:
- Extract _terminate_child() and _finalize_background_launch() helpers
  shared by launch_background and launch_background_resume.
- On dashboard-startup timeout, terminate the still-running child before
  raising so it does not orphan holding the port with no PID file.
- Wrap write_pid_file in try/except, terminating the child on failure
  so we never leave a discoverable-only-by-pkill background process.
- Replace strippable assert with explicit guard.
- Update module docstring to mention both run and resume.
- Document that --no-interactive is always appended.

run.py:
- _execute_with_stop_signal: cancel pending tasks then drain via
  asyncio.gather(return_exceptions=True). The previous
  contextlib.suppress(CancelledError) only swallowed CancelledError, so
  a stored non-CancelledError on the losing task (e.g. dashboard.stop
  raised) aborted the cleanup loop and leaked the other pending task.
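
A minimal sketch of the cancel-then-drain pattern described for run.py, assuming the helper races several awaitables and returns the first result. `run_until_first` is a hypothetical name and is not the body of `_execute_with_stop_signal`.

```python
import asyncio
from typing import Any, Awaitable


async def run_until_first(*aws: Awaitable[Any]) -> Any:
    """Return the result of the first awaitable to finish; cancel and drain the rest."""
    tasks = [asyncio.ensure_future(aw) for aw in aws]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)

    # Request cancellation of every losing task first...
    for task in pending:
        task.cancel()
    # ...then wait for all of them. With return_exceptions=True, gather collects
    # CancelledError and any other exception a losing task raises instead of
    # propagating it, so one failing task cannot interrupt cleanup of the others.
    await asyncio.gather(*pending, return_exceptions=True)

    # Pick the winner deterministically in argument order; .result() re-raises
    # if the winning task itself failed.
    winner = next(task for task in tasks if task in done)
    return winner.result()
```

By contrast, awaiting each cancelled task inside `contextlib.suppress(asyncio.CancelledError)` lets any other exception escape, which is the leak the commit message describes.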

tests/test_cli/test_resume_command.py: 15 new cases covering
- launch_background_resume failure paths and detachment kwargs
- _execute_with_stop_signal direct semantics (no-dashboard, engine-wins,
  stop-wins, losing-task-with-exception regression)
- resume_workflow_async wiring without mocking the function itself:
  dashboard OSError non-fatal, provider_override mutates config,
  metadata merges into config, RunContext populated with bg_mode and
  run_id/log_file, --metadata value containing = survives parse

2397 passed, 9 skipped — no regressions.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
jrob5756 force-pushed the resume-run-parity branch from d4ed224 to 0181004 May 6, 2026 16:32
jrob5756 merged commit 38c42b4 into main May 6, 2026
7 checks passed
jrob5756 deleted the resume-run-parity branch May 6, 2026 16:35
jrob5756 mentioned this pull request May 6, 2026
jrob5756 added a commit that referenced this pull request May 6, 2026
- conductor resume flag parity with run (#158)
- reasoning effort displayed in dashboard (#160)
- iteration_limit_reached/resolved events for dashboard (#162)
- registry latest now means default branch HEAD, not newest tag (#157)
- forbid extra fields on Agent/Parallel/ForEach/Workflow schemas (#159)
- pretty-print tool args/results in dashboard events (#161)
- capture uv stdout+stderr on Windows install failure (#156)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>