fix(resume): replay original event log into dashboard on --web (#167) by jrob5756 · Pull Request #205 · microsoft/conductor

jrob5756 · 2026-05-18T18:34:58Z

Summary

Fixes #167. conductor resume … --web / --web-bg previously opened
an empty dashboard because the original run's events lived in a
different process and were never carried forward across the checkpoint
boundary. Now the dashboard's history is fully seeded before the
server starts accepting clients, so prior agents stay visible alongside
the live resumed run.

Changes

Checkpoint schema (engine/checkpoint.py, engine/workflow.py)

- Added run_id: str = "" and event_log_path: str = "" to
  CheckpointData (defaults preserve backward compatibility for older
  checkpoints; no version bump needed).
- WorkflowEngine._save_checkpoint_on_failure forwards
  self._run_context.run_id and self._run_context.log_file.
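
To illustrate why defaulted fields avoid a version bump, here is a minimal sketch. `CheckpointSketch`, `load_checkpoint_sketch`, and the `workflow_path` field are hypothetical stand-ins, not the real `CheckpointData` or `load_checkpoint`; only `run_id` and `event_log_path` come from this PR.

```python
from dataclasses import dataclass, fields
from typing import Any


@dataclass
class CheckpointSketch:
    """Hypothetical stand-in for CheckpointData; only the two new fields are from the PR."""

    workflow_path: str  # assumed pre-existing field, for illustration only
    run_id: str = ""  # new field; stays empty for older checkpoints
    event_log_path: str = ""  # new field; stays empty for older checkpoints


def load_checkpoint_sketch(raw: dict[str, Any]) -> CheckpointSketch:
    """Build a checkpoint from a dict, ignoring unknown keys and defaulting missing ones."""
    known = {f.name for f in fields(CheckpointSketch)}
    return CheckpointSketch(**{k: v for k, v in raw.items() if k in known})


# A checkpoint written before this PR has neither new key.
old = load_checkpoint_sketch({"workflow_path": "wf.yaml"})
# A checkpoint written after this PR carries both.
new = load_checkpoint_sketch(
    {"workflow_path": "wf.yaml", "run_id": "abc123", "event_log_path": "/tmp/run.jsonl"}
)
```

An empty `event_log_path` is then the signal that resume should take the synthetic-replay fallback.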

Dashboard replay (web/server.py)

- WebDashboard.prepend_workflow_started(data): inserts a fresh
  workflow_started (built from the current YAML via
  engine.build_workflow_started_data()) at position 0 of
  _event_history so historical events apply to the correct topology.
- WebDashboard.replay_events_from_jsonl(path): line-by-line replay
  of the original JSONL log, tolerating malformed/partial lines via
  the existing web/replay.py::_load_events parser. Skips root-level
  workflow_started / workflow_completed / workflow_failed /
  checkpoint_saved events (detected by absent data.subworkflow_path),
  preserving subworkflow lifecycle events so per-context wfDepth
  stays balanced.
- WebDashboard.replay_synthetic_from_context(ctx, config, ts):
  fallback that synthesises type-aware *_started / *_completed
  pairs from WorkflowContext.execution_history. Parallel /
  for-each shapes include success_count / failure_count /
  elapsed so the frontend doesn't mis-render groups as failed.
- All three methods write directly to _event_history (no
  _queue.put_nowait); they emit logger.warning if invoked after
  dashboard.start().
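
The JSONL filtering rule can be sketched roughly as follows. This is not the code in web/server.py or web/replay.py: `filter_replay_lines` and `ROOT_SKIP_TYPES` are hypothetical names, and the event shape (a `type` string plus a `data` dict) is an assumption made for illustration.

```python
import json
from typing import Any

# Assumption: mirrors the root-level event types this PR says are skipped on replay.
ROOT_SKIP_TYPES = {
    "workflow_started",
    "workflow_completed",
    "workflow_failed",
    "checkpoint_saved",
}


def filter_replay_lines(lines: list[str]) -> list[dict[str, Any]]:
    """Parse JSONL lines, dropping malformed lines and root-level lifecycle events."""
    kept: list[dict[str, Any]] = []
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate malformed or partially written lines
        if not isinstance(event, dict):
            continue
        data = event.get("data")
        if not isinstance(data, dict):
            data = {}
        is_root = "subworkflow_path" not in data
        if event.get("type") in ROOT_SKIP_TYPES and is_root:
            continue  # the dashboard receives a fresh root workflow_started instead
        kept.append(event)
    return kept


sample_log = [
    '{"type": "workflow_started", "data": {}}',
    '{"type": "agent_started", "data": {"agent_name": "a"}}',
    '{"type": "workflow_started", "data": {"subworkflow_path": "sub.yaml"}}',
    '{"type": "agent_comp',  # truncated final line from a crashed run
    '{"type": "workflow_failed", "data": {}}',
]
kept_types = [event["type"] for event in filter_replay_lines(sample_log)]
```

In this sample only the agent event and the subworkflow start survive; the root start, the root failure, and the truncated line are dropped.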

Engine (engine/workflow.py)

- Extracted WorkflowEngine.build_workflow_started_data() from the
  inline _execute_loop emit so both the engine and the CLI can build
  the same event shape.
- WorkflowEngine.suppress_workflow_started_emit() sets a flag that
  _execute_loop checks; suppresses the engine's own emit on resume
  so the dashboard sees exactly one root start (avoids wfDepth
  double-counting).

Event log continuity (engine/event_log.py)

EventLogSubscriber.__init__ accepts optional existing_path and
existing_run_id kwargs; when both are provided and the file is a
writable regular file, opens it in append mode and reuses the
run_id. Falls back to a fresh log on OSError so resume never fails
due to a locked / readonly log file.
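
A rough sketch of the reuse-or-fall-back decision, under stated assumptions: `open_event_log` is a hypothetical free function rather than the real `EventLogSubscriber.__init__`, and the fresh-file naming scheme is invented for the example.

```python
import os
import tempfile
import uuid
from pathlib import Path
from typing import IO


def open_event_log(
    log_dir: Path, existing_path: str = "", existing_run_id: str = ""
) -> tuple[IO[str], str, bool]:
    """Return (handle, run_id, reused); reuse the old log only when both values are usable."""
    if existing_path and existing_run_id:
        candidate = Path(existing_path)
        try:
            if candidate.is_file() and os.access(candidate, os.W_OK):
                return candidate.open("a", encoding="utf-8"), existing_run_id, True
        except OSError:
            pass  # locked or read-only log: fall through to a fresh file
    run_id = uuid.uuid4().hex
    fresh = log_dir / f"{run_id}.jsonl"  # hypothetical naming scheme
    return fresh.open("w", encoding="utf-8"), run_id, False


with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    original = tmp_dir / "original.jsonl"
    original.write_text('{"type": "agent_started"}\n', encoding="utf-8")

    # Resume with a valid path and run_id: append to the same file.
    handle, run_id, reused = open_event_log(tmp_dir, str(original), "run-1")
    handle.write('{"type": "agent_completed"}\n')
    handle.close()
    appended_lines = original.read_text(encoding="utf-8").splitlines()

    # Resume after the log was deleted: fall back to a fresh log and run_id.
    handle2, fresh_id, reused2 = open_event_log(tmp_dir, str(tmp_dir / "missing.jsonl"), "run-1")
    handle2.close()
```

Keeping the fallback inside the constructor path means a missing or unwritable log degrades to a new file instead of aborting the resume.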

CLI wiring (cli/run.py)

resume_workflow_async reordering: load config and checkpoint → open
registry → build EventLogSubscriber (append-mode when available) →
build engine with the resolved run_id / log_file →
dashboard.prepend_workflow_started(...) →
dashboard.replay_events_from_jsonl(...) (or synthetic fallback) →
engine.suppress_workflow_started_emit() →
dashboard.start() + print URL → subscribe console + event log →
engine.resume().
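
The point of the ordering above is that history is seeded before any client can connect. A toy model, assuming a hypothetical `DashboardSketch` class that stands in for `WebDashboard` and models only the history list and a started flag:

```python
import logging
from typing import Any

logger = logging.getLogger("dashboard_sketch")


class DashboardSketch:
    """Hypothetical stand-in for WebDashboard; only history seeding is modelled."""

    def __init__(self) -> None:
        self._event_history: list[dict[str, Any]] = []
        self._started = False

    def _warn_if_started(self, method: str) -> None:
        if self._started:
            logger.warning("%s called after start(); clients may see partial history", method)

    def prepend_workflow_started(self, data: dict[str, Any]) -> None:
        self._warn_if_started("prepend_workflow_started")
        self._event_history.insert(0, {"type": "workflow_started", "data": data})

    def replay_events(self, events: list[dict[str, Any]]) -> None:
        self._warn_if_started("replay_events")
        self._event_history.extend(events)

    def start(self) -> None:
        self._started = True  # the real server would begin accepting clients here

    def api_state(self) -> list[dict[str, Any]]:
        return list(self._event_history)


dashboard = DashboardSketch()
dashboard.prepend_workflow_started({"name": "demo"})
dashboard.replay_events([{"type": "agent_started", "data": {"agent_name": "planner"}}])
dashboard.start()
first_state = dashboard.api_state()  # what the first client would see
```

Because both seeding calls run before `start()`, the first state read already contains the topology event followed by the historical events.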

Docs

- Updated AGENTS.md "Run / Resume Parity" note.
- Added Unreleased CHANGELOG entry.

Why it's correct

Two rubber-duck reviews caught blind spots that informed the final
design:

- Single root workflow_started: The frontend distinguishes root
  vs subworkflow workflow_started by wfDepth (not by
  data.subworkflow_path). Without suppression, the resumed engine
  would emit a second root workflow_started, which the frontend
  would treat as a child workflow and route the resumed run's
  events into a phantom child context. Now exactly one root event
  exists.
- Topology before historical events: Frontend
  agent_started / parallel_agent_completed etc. assume topology
  (groups, routes) has already been initialised by workflow_started.
  Replay now emits the topology event first, so historical events
  land on the right nodes.
- No mid-replay races: Replay populates _event_history
  completely before dashboard.start(). The very first
  GET /api/state and the first WebSocket connect both see the
  full history.
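
The wfDepth argument can be made concrete with a toy model. This is not the frontend code, which this PR does not show; it only assumes that a depth counter is incremented on each workflow_started and that depth 0 means root. `classify_starts` is a hypothetical helper.

```python
def classify_starts(event_types: list[str]) -> list[str]:
    """Label each workflow_started as 'root' or 'child' using a running depth counter."""
    depth = 0
    labels: list[str] = []
    for event_type in event_types:
        if event_type == "workflow_started":
            labels.append("root" if depth == 0 else "child")
            depth += 1
        elif event_type in ("workflow_completed", "workflow_failed"):
            depth = max(0, depth - 1)
    return labels


# Prepended start plus the resumed engine's own start: the second is misread as a child.
without_suppression = classify_starts(["workflow_started", "agent_started", "workflow_started"])
# Prepended start only, engine emit suppressed: exactly one root.
with_suppression = classify_starts(["workflow_started", "agent_started"])
```

Under this model the unsuppressed sequence yields one root and one phantom child, which is the misrouting described in the first point.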

Tests

2669 passing (15 skipped), up from baseline 2646.
New / updated tests:
- test_engine/test_checkpoint.py — run_id + event_log_path
  round-trip; back-compat defaults to "".
- test_engine/test_event_log.py — append-mode reuse of
  existing_path / existing_run_id; missing-path + missing-run_id
  fallbacks.
- test_engine/test_resume.py — engine forwards RunContext values
  into checkpoint; build_workflow_started_data() shape;
  suppress_workflow_started_emit() skips engine emit on resume.
- test_web/test_server.py — replay populates /api/state; skips
  root lifecycle events; preserves subworkflow lifecycle events;
  does not enqueue on _queue; handles missing/corrupt/partial
  files; synthetic replay emits per node type with correct fields;
  prepend_workflow_started inserts at position 0.
- test_cli/test_resume_command.py — resume_workflow_async seeds
  the dashboard with JSONL replay when the log exists, falls back to
  synthetic events when it doesn't.

Out of scope (deferred)

Dashboard banner visually marking the resume boundary in the UI.

How to verify locally

make check    # lint + typecheck
make test     # full suite (2669 passed, 15 skipped on this branch)

Reproducer from the issue:

uv run conductor run examples/some-workflow.yaml --web --input ...
# Trigger a failure to save a checkpoint, then:
uv run conductor resume examples/some-workflow.yaml --web
# Open the dashboard — prior agents are now visible.

Closes #167.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- CRITICAL: `dashboard.start()` failure now detaches the dashboard from the engine + DialogHandler via the new `engine.clear_web_dashboard()`, so the resumed run does not block forever on a never-arriving WebSocket gate input. The CLI's local `dashboard = None` alone was not enough: the engine captured the dashboard reference at construction time. Unique to `resume_workflow_async` because its new ordering constructs the engine before `dashboard.start()`; `run_workflow_async` starts the dashboard first so the bug doesn't apply there.
- HIGH: `replay_synthetic_from_context` for-each branch no longer treats an empty `outputs` list as missing. Uses the authoritative `count` field stored by `WorkflowEngine._execute_for_each_group`, falling back to `len(outputs)` only when `count` is absent. A naive `output.get("outputs") or ...` would have used the wrapper dict's key count (3) as the item count for an empty for-each.
- Refactor: extracted `_synth_parallel`, `_synth_for_each`, `_synth_agent_or_script` helpers from `replay_synthetic_from_context` for readability. Normalises `output` to a dict once per node so the `isinstance` repetition collapses.
- Tidy: moved `datetime` import to the file header; inlined the checkpoint-timestamp parse so the pre-init `cp_ts = None` is gone.
- Docs: fixed two incorrect comments, namely the "WebSocket replay loop" mention (history is served via `/api/state`, not the WS loop) and the stale rationale on `_REPLAY_ROOT_SKIP_TYPES` (now correctly explains the prepend + suppress flow). Filled in the previously empty `CheckpointData` attribute docstrings for `instructions_preamble`, `run_id`, and `event_log_path`.
- New tests: backward-compat `load_checkpoint` for pre-PR checkpoints missing the new fields; for-each synthetic-replay empty-outputs/explicit-count/missing-count cases; `clear_web_dashboard` propagation to DialogHandler.

Full suite: 2674 passing (15 skipped), up from 2669 after this commit.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
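
The for-each count pitfall from the HIGH item can be reproduced in a few lines. The wrapper dict below is an assumed shape: `outputs` and `count` are named in the commit message, while the `errors` key is a guess used only to reach three keys. `naive_item_count` and `item_count` are hypothetical helpers, not the real `_synth_for_each`.

```python
from typing import Any


def naive_item_count(output: dict[str, Any]) -> int:
    # Bug: an empty list is falsy, so `or` falls through to the wrapper dict itself.
    return len(output.get("outputs") or output)


def item_count(output: dict[str, Any]) -> int:
    """Prefer the stored count; fall back to len(outputs) only when count is absent."""
    count = output.get("count")
    if isinstance(count, int):
        return count
    outputs = output.get("outputs")
    return len(outputs) if isinstance(outputs, (list, dict)) else 0


empty_for_each = {"outputs": [], "errors": {}, "count": 0}  # "errors" key is assumed
legacy_for_each = {"outputs": ["a", "b"]}  # no stored count

naive_result = naive_item_count(empty_for_each)  # counts the wrapper's keys
fixed_result = item_count(empty_for_each)
legacy_result = item_count(legacy_for_each)
```

The naive version reports three items for an empty for-each, whereas the count-first version reports zero and still handles the output without a stored count.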

`conductor resume … --web`/`--web-bg` previously opened an empty dashboard because the original run's events were never carried forward across the process boundary. Now:

- Checkpoints persist `run_id` and `event_log_path`. The engine forwards these from `RunContext` in `_save_checkpoint_on_failure`.
- `WebDashboard` gains `prepend_workflow_started`, `replay_events_from_jsonl`, and `replay_synthetic_from_context`. Replay populates `_event_history` BEFORE `dashboard.start()` so the first `GET /api/state` and the first WebSocket client both see a fully populated, topology-correct timeline.
- `resume_workflow_async` synthesises a fresh `workflow_started` event from the *current* YAML (via the new `engine.build_workflow_started_data()` helper) and prepends it to the replay so historical agent / parallel / for-each events apply to the correct topology. The resumed engine's own `workflow_started` emit is suppressed (`engine.suppress_workflow_started_emit()`) so the dashboard sees exactly one root start, preventing `wfDepth` double-counting in the frontend.
- Root-level `workflow_completed` / `workflow_failed` / `checkpoint_saved` events from the original JSONL are filtered on replay; subworkflow lifecycle events are preserved so per-context `wfDepth` stays balanced.
- `EventLogSubscriber` accepts `existing_path` / `existing_run_id` and opens the original log in append mode, so a multi-resume session produces one continuous JSONL file and `run_id` stays stable for log-correlation tools.
- Older checkpoints (no `event_log_path`) and runs whose log file was deleted fall back to synthesised `*_started` / `*_completed` pairs built from `WorkflowContext.execution_history`, with type-aware shapes (`script_*`, `parallel_*`, `for_each_*`) including `failure_count` / `success_count` so groups don't render as failed.

Adds 23 new tests across `test_checkpoint.py`, `test_event_log.py`, `test_resume.py`, `test_server.py`, and `test_resume_command.py` (2666 → 2669 passing after the post-rubber-duck adjustments). Updates `AGENTS.md` Run/Resume Parity note and adds an Unreleased CHANGELOG entry.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>


jrob5756 and others added 2 commits May 18, 2026 17:20

jrob5756 force-pushed the fix/167-resume-web-empty-dashboard branch from 46fae08 to 6aa4571 Compare May 18, 2026 21:21

jrob5756 merged commit 5fa2e14 into main May 18, 2026
9 checks passed

jrob5756 deleted the fix/167-resume-web-empty-dashboard branch May 18, 2026 21:25
