fix(resume): replay original event log into dashboard on --web (#167) #205

Merged: jrob5756 merged 2 commits into main from fix/167-resume-web-empty-dashboard on May 18, 2026
@jrob5756
Collaborator

Summary

Fixes #167. conductor resume … --web / --web-bg previously opened
an empty dashboard because the original run's events lived in a
different process and were never carried forward across the checkpoint
boundary. Now the dashboard's history is fully seeded before the
server starts accepting clients, so prior agents stay visible alongside
the live resumed run.

Changes

Checkpoint schema (engine/checkpoint.py, engine/workflow.py)

  • Added run_id: str = "" and event_log_path: str = "" to
    CheckpointData (defaults preserve backward-compat for older
    checkpoints, no version bump needed).
  • WorkflowEngine._save_checkpoint_on_failure forwards
    self._run_context.run_id and self._run_context.log_file.
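
The backward-compatibility claim can be illustrated with a minimal sketch. The `workflow_path` and `current_agent` fields and the `load_checkpoint` signature here are assumptions for illustration, not the real schema; only `run_id` and `event_log_path` and their `""` defaults come from this PR.

```python
from dataclasses import dataclass, fields
from typing import Any


@dataclass
class CheckpointData:
    # Hypothetical subset of fields; the real class carries more.
    workflow_path: str
    current_agent: str
    # New fields default to "" so checkpoints written before this change still load.
    run_id: str = ""
    event_log_path: str = ""


def load_checkpoint(raw: dict[str, Any]) -> CheckpointData:
    """Build CheckpointData from a parsed JSON dict, ignoring unknown keys."""
    known = {f.name for f in fields(CheckpointData)}
    return CheckpointData(**{k: v for k, v in raw.items() if k in known})


# A checkpoint written before the new fields existed.
old = load_checkpoint({"workflow_path": "wf.yaml", "current_agent": "planner"})
# A checkpoint written after this change.
new = load_checkpoint({
    "workflow_path": "wf.yaml",
    "current_agent": "planner",
    "run_id": "abc123",
    "event_log_path": "/tmp/run.jsonl",
})
```

An empty `event_log_path` is what lets the resume path tell "no log recorded" apart from "log recorded" without a schema version check.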

Dashboard replay (web/server.py)

  • WebDashboard.prepend_workflow_started(data) — inserts a fresh
    workflow_started (built from the current YAML via
    engine.build_workflow_started_data()) at position 0 of
    _event_history so historical events apply to the correct topology.
  • WebDashboard.replay_events_from_jsonl(path) — line-by-line replay
    of the original JSONL log, tolerating malformed/partial lines via
    the existing web/replay.py::_load_events parser. Skips root-level
    workflow_started / workflow_completed / workflow_failed /
    checkpoint_saved events (detected by absent data.subworkflow_path),
    preserving subworkflow lifecycle events so per-context wfDepth
    stays balanced.
  • WebDashboard.replay_synthetic_from_context(ctx, config, ts): a
    fallback that synthesises type-aware *_started / *_completed
    pairs from WorkflowContext.execution_history. Parallel /
    for-each shapes include success_count / failure_count /
    elapsed so the frontend doesn't mis-render groups as failed.
  • All three methods append directly to _event_history (no
    _queue.put_nowait); they emit logger.warning if invoked after
    dashboard.start().
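
A rough sketch of the JSONL filtering described above, under stated assumptions: the event shape (`type` plus `data` keys) and the helper name `load_replayable_events` are illustrative, and the real implementation delegates parsing to `web/replay.py::_load_events` rather than parsing inline.

```python
import json
import tempfile
from pathlib import Path
from typing import Any

# Root lifecycle event types dropped on replay (from the description above).
ROOT_SKIP_TYPES = {
    "workflow_started", "workflow_completed", "workflow_failed", "checkpoint_saved",
}


def load_replayable_events(path: Path) -> list[dict[str, Any]]:
    """Read a JSONL log, skipping malformed lines and root lifecycle events."""
    events: list[dict[str, Any]] = []
    try:
        text = path.read_text(encoding="utf-8")
    except OSError:
        return events  # missing or unreadable log: the caller can fall back
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # partial or corrupt line, e.g. a truncated final write
        if not isinstance(event, dict):
            continue
        data = event.get("data") or {}
        # Root events are the ones without data.subworkflow_path.
        is_root = not (isinstance(data, dict) and data.get("subworkflow_path"))
        if event.get("type") in ROOT_SKIP_TYPES and is_root:
            continue
        events.append(event)
    return events


with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "run.jsonl"
    log.write_text("\n".join([
        '{"type": "workflow_started", "data": {}}',
        '{"type": "agent_started", "data": {"agent": "planner"}}',
        '{"type": "workflow_started", "data": {"subworkflow_path": "child.yaml"}}',
        '{"type": "agent_comp',
        '{"type": "checkpoint_saved", "data": {}}',
    ]), encoding="utf-8")
    replayed = [e["type"] for e in load_replayable_events(log)]
```

Here `replayed` keeps the agent event and the subworkflow start, while the root start, the truncated line, and the root checkpoint event are dropped.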

Engine (engine/workflow.py)

  • Extracted WorkflowEngine.build_workflow_started_data() from the
    inline _execute_loop emit so both the engine and the CLI can build
    the same event shape.
  • WorkflowEngine.suppress_workflow_started_emit() sets a flag that
    _execute_loop checks; suppresses the engine's own emit on resume
    so the dashboard sees exactly one root start (avoids wfDepth
    double-counting).

Event log continuity (engine/event_log.py)

  • EventLogSubscriber.__init__ accepts optional existing_path and
    existing_run_id kwargs; when both are provided and the file is a
    writable regular file, opens it in append mode and reuses the
    run_id. Falls back to a fresh log on OSError so resume never fails
    due to a locked / readonly log file.
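
The append-or-fallback behaviour might look roughly like the following. This is a standalone sketch, not the `EventLogSubscriber` code: `open_event_log`, `fresh_dir`, and the uuid-based file naming are assumptions for illustration.

```python
import os
import tempfile
import uuid
from pathlib import Path
from typing import IO


def open_event_log(
    fresh_dir: Path,
    existing_path: str = "",
    existing_run_id: str = "",
) -> tuple[str, Path, IO[str]]:
    """Return (run_id, path, handle), reusing the original log when possible."""
    if existing_path and existing_run_id:
        candidate = Path(existing_path)
        try:
            if candidate.is_file() and os.access(candidate, os.W_OK):
                return existing_run_id, candidate, candidate.open("a", encoding="utf-8")
        except OSError:
            pass  # locked or read-only: fall through to a fresh log
    run_id = uuid.uuid4().hex
    path = fresh_dir / f"{run_id}.jsonl"
    return run_id, path, path.open("w", encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "original.jsonl"
    original.write_text('{"type": "agent_started"}\n', encoding="utf-8")

    # Both kwargs provided and the file is writable: append and keep the run_id.
    run_id, path, handle = open_event_log(Path(tmp), str(original), "run-1")
    handle.write('{"type": "agent_completed"}\n')
    handle.close()
    line_count = len(original.read_text(encoding="utf-8").splitlines())
    reused = (run_id, path == original, line_count)

    # The original log is gone: a fresh log with a new run_id is used instead.
    fresh_id, fresh_path, fresh_handle = open_event_log(
        Path(tmp), str(Path(tmp) / "gone.jsonl"), "run-1"
    )
    fresh_handle.close()
    fell_back = fresh_id != "run-1" and fresh_path != original
```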

CLI wiring (cli/run.py)

  • resume_workflow_async reordering: load config and checkpoint → open
    registry → build EventLogSubscriber (append-mode when available) →
    build engine with the resolved run_id / log_file →
    dashboard.prepend_workflow_started(...) →
    dashboard.replay_events_from_jsonl(...) (or synthetic fallback) →
    engine.suppress_workflow_started_emit() →
    dashboard.start() + print URL → subscribe console + event log →
    engine.resume().
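
The seed-before-start ordering can be sketched with a stand-in dashboard. `MiniDashboard`, `replay_events`, and `state` are hypothetical names for illustration; the real `WebDashboard` also runs a server and a queue, which are omitted here.

```python
import logging
from typing import Any

logger = logging.getLogger("resume_sketch")


class MiniDashboard:
    """Hypothetical stand-in for WebDashboard; holds history only, no server."""

    def __init__(self) -> None:
        self._event_history: list[dict[str, Any]] = []
        self._started = False

    def _warn_if_started(self, method: str) -> None:
        if self._started:
            logger.warning("%s called after start(); clients may miss events", method)

    def prepend_workflow_started(self, data: dict[str, Any]) -> None:
        self._warn_if_started("prepend_workflow_started")
        self._event_history.insert(0, {"type": "workflow_started", "data": data})

    def replay_events(self, events: list[dict[str, Any]]) -> None:
        self._warn_if_started("replay_events")
        self._event_history.extend(events)

    def start(self) -> None:
        self._started = True

    def state(self) -> list[dict[str, Any]]:
        """What a first GET /api/state would return in this simplified model."""
        return list(self._event_history)


dashboard = MiniDashboard()
# Topology event first, then historical events, then start.
dashboard.prepend_workflow_started({"agents": ["planner", "writer"]})
dashboard.replay_events([
    {"type": "agent_started", "data": {"agent": "planner"}},
    {"type": "agent_completed", "data": {"agent": "planner"}},
])
dashboard.start()
first_snapshot = [e["type"] for e in dashboard.state()]
```

Because history is written directly rather than enqueued, the first snapshot after `start()` already contains the full seeded timeline.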

Docs

  • Updated AGENTS.md "Run / Resume Parity" note.
  • Added Unreleased CHANGELOG entry.

Why it's correct

Two rubber-duck reviews caught blind spots that informed the final
design:

  1. Single root workflow_started — The frontend distinguishes root
    vs subworkflow workflow_started by wfDepth (not by
    data.subworkflow_path). Without suppression, the resumed engine
    would emit a second root workflow_started, which the frontend
    would treat as a child workflow, routing the resumed run's
    events into a phantom child context. Now exactly one root event
    exists.

  2. Topology before historical events — Frontend
    agent_started / parallel_agent_completed etc. assume topology
    (groups, routes) has already been initialised by workflow_started.
    Replay now emits the topology event first, so historical events
    land on the right nodes.

  3. No mid-replay races — Replay populates _event_history
    completely before dashboard.start(). The very first
    GET /api/state and the first WebSocket connect both see the
    full history.
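
To make point 1 concrete, here is a deliberately simplified model of depth-based classification. It is an assumption about how a `wfDepth`-style counter behaves, written in Python for illustration; it is not the frontend's actual code.

```python
from typing import Any

LIFECYCLE_END = {"workflow_completed", "workflow_failed"}


def classify_starts(events: list[dict[str, Any]]) -> tuple[int, int]:
    """Return (root_starts, child_starts) using a depth counter only."""
    depth = 0
    roots = children = 0
    for event in events:
        kind = event["type"]
        if kind == "workflow_started":
            if depth == 0:
                roots += 1
            else:
                children += 1  # any start seen while depth > 0 looks like a subworkflow
            depth += 1
        elif kind in LIFECYCLE_END:
            depth = max(depth - 1, 0)
    return roots, children


# Replayed history: one prepended root start plus a historical agent event.
history = [{"type": "workflow_started"}, {"type": "agent_started"}]
# Without suppression the resumed engine adds a second root start.
resumed_with_emit = history + [{"type": "workflow_started"}, {"type": "agent_started"}]
# With suppression only the live agent events follow.
resumed_suppressed = history + [{"type": "agent_started"}]
```

In this model the unsuppressed stream yields one root and one phantom child, while the suppressed stream yields a single root and no children.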

Tests

  • 2669 passing (15 skipped), up from baseline 2646.
  • New / updated tests:
    • test_engine/test_checkpoint.py: run_id + event_log_path
      round-trip; back-compat defaults to "".
    • test_engine/test_event_log.py — append-mode reuse of
      existing_path / existing_run_id; missing-path + missing-run_id
      fallbacks.
    • test_engine/test_resume.py — engine forwards RunContext values
      into checkpoint; build_workflow_started_data() shape;
      suppress_workflow_started_emit() skips engine emit on resume.
    • test_web/test_server.py — replay populates /api/state; skips
      root lifecycle events; preserves subworkflow lifecycle events;
      does not enqueue on _queue; handles missing/corrupt/partial
      files; synthetic replay emits per node type with correct fields;
      prepend_workflow_started inserts at position 0.
    • test_cli/test_resume_command.py: resume_workflow_async seeds
      the dashboard with JSONL replay when the log exists, falls back to
      synthetic events when it doesn't.

Out of scope (deferred)

  • Dashboard banner visually marking the resume boundary in the UI.

How to verify locally

make check    # lint + typecheck
make test     # full suite (2669 passed, 15 skipped on this branch)

Reproducer from the issue:

uv run conductor run examples/some-workflow.yaml --web --input ...
# Trigger a failure to save a checkpoint, then:
uv run conductor resume examples/some-workflow.yaml --web
# Open the dashboard — prior agents are now visible.

Closes #167.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

jrob5756 added a commit that referenced this pull request May 18, 2026
- CRITICAL: `dashboard.start()` failure now detaches the dashboard from
  the engine + DialogHandler via the new `engine.clear_web_dashboard()`,
  so the resumed run does not block forever on a never-arriving
  WebSocket gate input. The CLI's local `dashboard = None` alone was
  not enough — the engine captured the dashboard reference at
  construction time. Unique to `resume_workflow_async` because its new
  ordering constructs the engine before `dashboard.start()`;
  `run_workflow_async` starts the dashboard first so the bug doesn't
  apply there.
- HIGH: `replay_synthetic_from_context` for-each branch no longer
  treats an empty `outputs` list as missing. Uses the authoritative
  `count` field stored by `WorkflowEngine._execute_for_each_group`,
  falling back to `len(outputs)` only when `count` is absent. A naive
  `output.get("outputs") or ...` would have used the wrapper dict's
  key count (3) as the item count for an empty for-each.
- Refactor: extracted `_synth_parallel`, `_synth_for_each`,
  `_synth_agent_or_script` helpers from `replay_synthetic_from_context`
  for readability. Normalises `output` to a dict once per node so the
  `isinstance` repetition collapses.
- Tidy: moved `datetime` import to the file header; inlined the
  checkpoint-timestamp parse so the pre-init `cp_ts = None` is gone.
- Docs: fixed two incorrect comments — the "WebSocket replay loop"
  mention (history is served via `/api/state`, not the WS loop) and
  the stale rationale on `_REPLAY_ROOT_SKIP_TYPES` (now correctly
  explains the prepend + suppress flow). Filled in the previously
  empty `CheckpointData` attribute docstrings for `instructions_preamble`,
  `run_id`, and `event_log_path`.
- New tests: backward-compat `load_checkpoint` for pre-PR checkpoints
  missing the new fields; for-each synthetic-replay
  empty-outputs/explicit-count/missing-count cases;
  `clear_web_dashboard` propagation to DialogHandler.
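
The for-each pitfall from the HIGH item can be reproduced in isolation. The wrapper dict below is an assumed shape: only `outputs`, `count`, and the three-key total come from the commit message, and the `errors` key is a placeholder.

```python
from typing import Any


def naive_item_count(output: dict[str, Any]) -> int:
    # Bug: an empty list is falsy, so `or` falls through to the wrapper dict.
    return len(output.get("outputs") or output)


def item_count(output: dict[str, Any]) -> int:
    """Prefer the stored count; use len(outputs) only when count is absent."""
    count = output.get("count")
    if isinstance(count, int):
        return count
    outputs = output.get("outputs")
    return len(outputs) if outputs is not None else 0


# A for-each group that ran over zero items.
empty_for_each = {"outputs": [], "errors": {}, "count": 0}
# An older record that never stored a count.
legacy = {"outputs": ["a", "b"], "errors": {}}
```

With these inputs the naive version reports 3 items for the empty group (the wrapper's key count), while `item_count` reports 0 and still handles the legacy record.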

Full suite after this commit: 2674 passing (15 skipped), up from 2669.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
jrob5756 and others added 2 commits May 18, 2026 17:20
`conductor resume … --web`/`--web-bg` previously opened an empty
dashboard because the original run's events were never carried forward
across the process boundary. Now:

- Checkpoints persist `run_id` and `event_log_path`. The engine forwards
  these from `RunContext` in `_save_checkpoint_on_failure`.
- `WebDashboard` gains `prepend_workflow_started`,
  `replay_events_from_jsonl`, and `replay_synthetic_from_context`.
  Replay populates `_event_history` BEFORE `dashboard.start()` so the
  first `GET /api/state` and the first WebSocket client both see a
  fully populated, topology-correct timeline.
- `resume_workflow_async` synthesises a fresh `workflow_started` event
  from the *current* YAML (via the new
  `engine.build_workflow_started_data()` helper) and prepends it to the
  replay so historical agent / parallel / for-each events apply to the
  correct topology. The resumed engine's own `workflow_started` emit is
  suppressed (`engine.suppress_workflow_started_emit()`) so the
  dashboard sees exactly one root start — preventing `wfDepth`
  double-counting in the frontend.
- Root-level `workflow_completed` / `workflow_failed` /
  `checkpoint_saved` events from the original JSONL are filtered on
  replay; subworkflow lifecycle events are preserved so per-context
  `wfDepth` stays balanced.
- `EventLogSubscriber` accepts `existing_path` / `existing_run_id` and
  opens the original log in append mode, so a multi-resume session
  produces one continuous JSONL file and `run_id` stays stable for
  log-correlation tools.
- Older checkpoints (no `event_log_path`) and runs whose log file was
  deleted fall back to synthesised `*_started` / `*_completed` pairs
  built from `WorkflowContext.execution_history`, with type-aware
  shapes (`script_*`, `parallel_*`, `for_each_*`) including
  `failure_count` / `success_count` so groups don't render as failed.

Adds 23 new tests across `test_checkpoint.py`, `test_event_log.py`,
`test_resume.py`, `test_server.py`, and `test_resume_command.py`
(2666 → 2669 passing after the post-rubber-duck adjustments).
Updates `AGENTS.md` Run/Resume Parity note and adds an Unreleased
CHANGELOG entry.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
jrob5756 force-pushed the fix/167-resume-web-empty-dashboard branch from 46fae08 to 6aa4571 on May 18, 2026 21:21
jrob5756 merged commit 5fa2e14 into main on May 18, 2026
9 checks passed
jrob5756 deleted the fix/167-resume-web-empty-dashboard branch on May 18, 2026 21:25
Development

Successfully merging this pull request may close these issues.

Resume with --web shows empty dashboard: prior agent context is not visible after checkpoint resume