feat: breadcrumb navigation, subworkflow isolation, and stop reliability#113
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #113 +/- ##
=======================================
Coverage ? 85.59%
=======================================
Files ? 53
Lines ? 7629
Branches ? 0
=======================================
Hits ? 6530
Misses ? 1099
Partials ? 0 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Follow-up: extending this PR with the recursive-gate-navigation fixHeads up — I'm pushing an additional commit to this branch that fixes two related issues in this same surface area, plus the engine plumbing they need: Symptoms being fixed
Root cause The store today resolves "which sub-workflow context owns this event" by reading Approach Move the source of truth for sub-workflow identity to the engine:
Relationship to #110 ( The "concurrency" half of this fix is mostly latent until #110 lands — sequential sub-workflows can't have concurrent siblings under one parent. But the frontend handlers and the engine infra are both needed for #110's for_each-of-workflow path to render correctly in the dashboard (otherwise concurrent for_each iterations of a sub-workflow stack as descendants of one another instead of laying out as siblings of the parent group). I'm pushing a no-op-until-this-lands companion commit to #110 that adds the for_each-side Order of merge between #110 and #113 doesn't matter — both PRs are independently safe. |
For each iteration of a for_each whose member is a sub-workflow, emit subworkflow_started / _completed / _failed with three new fields: - parent_path: snapshot of the parent engine's dashboard context path (empty list when the dashboard infra from microsoft#113 is absent) - slot_key: a unique "<group>[<key>]" identity per iteration so concurrent iterations don't stack under one shared dashboard context - iteration: 1-based ordinal (matches the existing item_key) Conditionally thread `_dashboard_context_path` into the child engine via getattr so this code is forward-compatible with PR microsoft#113 landing in either order. When microsoft#113 has not landed, the kwarg is omitted and behavior is unchanged. When microsoft#113 has landed, the child engine receives [*parent_path, slot_key] and auto-stamps subworkflow_path on every event it emits. Wraps `_execute_subworkflow_with_inputs` in try/except so a failing iteration emits subworkflow_failed before for_each_item_failed; the original exception propagates to keep existing error handling intact. Adds a regression test that runs three for_each iterations and asserts each gets a distinct slot_key (batch[0], batch[1], batch[2]) on both started and completed events. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Pushed Status:
|
34de287 to
5d4bb86
Compare
|
Resolved the conflicts by rebasing Conflict resolution:
Rebased head is |
* feat(composition): allow sub-workflows in for_each groups (#102) Remove validator restriction blocking type='workflow' in for_each groups. Wire execute_single_item() to call _execute_subworkflow_with_inputs() for workflow agents, rendering input_mapping with loop variables in scope. - Validator: Remove workflow rejection in for_each validation - Engine: Add workflow branch in execute_single_item(), new helper _execute_subworkflow_with_inputs() for pre-built inputs - Tests: Update test_workflow_in_for_each to validate (not reject) - Experimental workflows: test-for-each-workflow parent/child pair Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * style: ruff format workflow.py Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Emit parent_path/slot_key/iteration on for_each-of-workflow lifecycle For each iteration of a for_each whose member is a sub-workflow, emit subworkflow_started / _completed / _failed with three new fields: - parent_path: snapshot of the parent engine's dashboard context path (empty list when the dashboard infra from #113 is absent) - slot_key: a unique "<group>[<key>]" identity per iteration so concurrent iterations don't stack under one shared dashboard context - iteration: 1-based ordinal (matches the existing item_key) Conditionally thread `_dashboard_context_path` into the child engine via getattr so this code is forward-compatible with PR #113 landing in either order. When #113 has not landed, the kwarg is omitted and behavior is unchanged. When #113 has landed, the child engine receives [*parent_path, slot_key] and auto-stamps subworkflow_path on every event it emits. Wraps `_execute_subworkflow_with_inputs` in try/except so a failing iteration emits subworkflow_failed before for_each_item_failed; the original exception propagates to keep existing error handling intact. Adds a regression test that runs three for_each iterations and asserts each gets a distinct slot_key (batch[0], batch[1], batch[2]) on both started and completed events. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --------- Co-authored-by: Daniel Green <dangreen@microsoft.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
3fec4d9 to
9c1abed
Compare
jrob5756
left a comment
There was a problem hiding this comment.
Very cool change. Made a few comments. Then we can merge.
| state.agentsTotal = agentNames.size; | ||
| } else { | ||
| // Child workflow — populate the active child context | ||
| const ctx = resolveContext(state.subworkflowContexts, state.activeContextPath); |
There was a problem hiding this comment.
Concurrent for-each-of-workflow context corruption.
This handler routes via the global state.activeContextPath instead of the engine-stamped data.subworkflow_path. The PR went to lengths to add subworkflow_path auto-stamping in _emit (workflow.py:430-434) and uses it in workflow_completed (line 1293) and workflow_failed (line 1331) — workflow_started is the missing one.
Reproduction with concurrent for_each-of-workflow:
- Iteration A emits
subworkflow_started→activeContextPath=[0] - Iteration B emits
subworkflow_started→activeContextPath=[1] - Iteration A's child engine emits
workflow_started(carryingsubworkflow_path=[A_slot]) → resolved toactiveContextPath=[1]→ A's graph/agents populate B's context.
Fix: resolve via subworkflow_path first, falling back to activeContextPath:
const ctx = (data.subworkflow_path !== undefined
? resolveSlotPath(state.subworkflowContexts, data.subworkflow_path as string[])?.ctx
: null)
?? resolveContext(state.subworkflowContexts, state.activeContextPath);There was a problem hiding this comment.
Fixed in d27109d. Adopted your suggested resolution pattern (subworkflow_path with activeContextPath fallback) and added subworkflow_path?: string[] to WorkflowStartedData so the resolver can read it without a cast.
Also added a regression test in test_subworkflow.py::TestSubWorkflowDashboardPath::test_concurrent_for_each_subworkflow_emits_distinct_slot_keys that exercises max_concurrent=3 (the existing for_each test only covered the sequential max_concurrent=1 case).
| // engine-supplied parent_path so concurrent completions land correctly. | ||
| let parentIndexPath: number[]; | ||
| if (data.parent_path !== undefined) { | ||
| const resolved = resolveSlotPath(state.subworkflowContexts, data.parent_path); |
There was a problem hiding this comment.
Silent fallback to root + activeContextPath corruption on slot-resolution miss.
resolved?.indexPath ?? [] silently falls back to the root context ([]), and state.activeContextPath = parentIndexPath later in the handler then unconditionally overwrites the active path. If resolveSlotPath returns null (e.g., event arrives before its sibling subworkflow_started, or any path inconsistency), the code:
- Targets
state.nodes(root) and stampsstatus: 'completed'on a same-named root agent that has nothing to do with the subworkflow. - Resets
activeContextPathto[], breaking subsequent sibling routing.
Compare to subworkflow_started at line 1364 which correctly does if (!resolved) return;. Mirror that here:
if (data.parent_path !== undefined) {
const resolved = resolveSlotPath(state.subworkflowContexts, data.parent_path);
if (!resolved) return;
parentIndexPath = resolved.indexPath;
} else {
parentIndexPath = state.activeContextPath;
}The identical bug exists in subworkflow_failed at line 1458 — please fix both.
There was a problem hiding this comment.
Fixed in 37d856d. Mirrored the if (!resolved) return; guard from subworkflow_started (line 1364) in both subworkflow_completed (line 1419) and subworkflow_failed (line 1458). On a slot-resolution miss, both handlers now return early instead of silently falling back to root and clobbering activeContextPath.
| # unwinds the child engine back to the parent, stopping the workflow. | ||
| if self._web_dashboard is not None: | ||
| if self._subworkflow_depth > 0: | ||
| raise InterruptError(agent_name=current_agent_name) |
There was a problem hiding this comment.
Headline feature is untested.
This new if self._subworkflow_depth > 0: raise InterruptError(...) branch — and the matching new stop_task logic in _handle_web_pause (lines 1077-1110) — have zero test coverage. Every test in TestCheckInterrupt (tests/test_engine/test_workflow_interrupt.py:572-658) constructs an engine with _subworkflow_depth=0, so this branch is dead code under tests. You could delete lines 969-970 and all 45 tests in the suite would still pass.
Since "Stop button reliability during subworkflows" is a headline feature of this PR, please add at minimum:
test_check_interrupt_raises_in_subworkflow— engine with_subworkflow_depth=1and_web_dashboardset; assertInterruptErroris raised.test_handle_web_pause_stop_event_in_subworkflow— pause a child engine, set_interrupt_event, assertInterruptError.test_nested_subworkflow_path_accumulates— depth ≥ 2; assertsubworkflow_pathchains correctly across multiple levels.test_subworkflow_failed_event_carries_parent_path_and_slot_key— make the child raise; assert the failed event payload (currently only the success path is asserted attests/test_engine/test_subworkflow.py:1014-1015).- A
for_each-of-workflow test asserting distinctslot_keyper iteration — the existing test only covers the trivial sequential case (slot_key == agent.name).
There was a problem hiding this comment.
Fixed in fef6b36. Added the test cases you specified (and one bonus):
TestCheckInterruptSubworkflow::test_check_interrupt_raises_in_subworkflow- engine with_subworkflow_depth=1and_web_dashboardset; assertsInterruptErroris raised andinterrupt_eventis cleared.TestCheckInterruptSubworkflow::test_check_interrupt_consumed_silently_at_root- regression guard for the root-depth silent-consume branch.TestHandleWebPauseSubworkflow::test_handle_web_pause_stop_event_in_subworkflow- pauses a child engine, sets_interrupt_eventmid-pause, assertsInterruptError.TestHandleWebPauseSubworkflow::test_handle_web_pause_root_ignores_interrupt_event- documents the intentional root-vs-subworkflow asymmetry as an executable spec.TestSubWorkflowDashboardPath::test_subworkflow_failed_event_carries_parent_path_and_slot_key- child raises, asserts the failed event payload (agent_name,parent_path,slot_key,error_type,message).TestSubWorkflowDashboardPath::test_nested_subworkflow_path_accumulates- depth 2 (parent -> mid -> leaf), assertssubworkflow_pathchains correctly across both levels and that the root engine still emitsworkflow_completedwithout a stamp.TestSubWorkflowDashboardPath::test_concurrent_for_each_subworkflow_emits_distinct_slot_keys- your point Idle recovery: cumulative counter fails long-running agents — add per-workflow config and reset-on-progress #5. The existingtest_for_each_subworkflow_emits_distinct_slot_keysalready covers slot-key uniqueness for for_each-of-workflow, but it ran withmax_concurrent=1so iterations were sequential. The new test usesmax_concurrent=3so iterations actually overlap, proving uniqueness is not an artifact of serial execution.
Engine suite: 53 passed (was 46), one pre-existing Windows path failure in test_event_log.py is unrelated.
| # the user to first Resume then wait for the next between-agent check. | ||
| stop_task = None | ||
| if self._subworkflow_depth > 0 and self._interrupt_event is not None: | ||
| self._interrupt_event.clear() |
There was a problem hiding this comment.
Pause+Stop UX diverges between root and subworkflow.
Pre-clearing interrupt_event here means a quick double-click of Stop (where the second click lands before partial output reaches _handle_web_pause) silently discards the second signal. Behavior also differs from root: at root, a second Stop while paused is ignored until Resume (only Kill works); here, the second Stop is honored — but only if it arrives after the pause is established.
At minimum, add a code comment explaining the intentional asymmetry. Ideally, document the divergence in the dashboard UX docs (or align behavior between root and subworkflow).
There was a problem hiding this comment.
Fixed in 1c3d199 with an inline comment in _handle_web_pause covering all three points:
- Why root depth deliberately omits the
interrupt_eventsubscription (pause exits only on Resume or Kill). - Why subworkflow depth subscribes to it (so Stop unwinds the child engine without requiring Resume first).
- The tiny
clear()/create_task()race window where a Stop click can be silently discarded, and the trade-off vs. carrying a stale Stop signal across pause cycles.
Also added TestHandleWebPauseSubworkflow::test_handle_web_pause_root_ignores_interrupt_event (in fef6b36) as an executable spec for the asymmetry, so the next person who tries to "fix" it gets a failing test.
Aligning behavior between root and subworkflow Stop semantics feels out of scope for this PR - happy to file a follow-up if you want them unified one way or the other.
The workflow_started handler resolved the owning context via
state.activeContextPath, which is unsafe under concurrent
for_each-of-workflow:
1. Iteration A emits subworkflow_started -> activeContextPath=[0]
2. Iteration B emits subworkflow_started -> activeContextPath=[1]
3. Iteration A's child engine emits workflow_started (carrying
subworkflow_path=[A_slot]) -> resolved to activeContextPath=[1]
-> A's graph/agents populate B's context.
Mirror the resolution pattern already used by workflow_completed and
workflow_failed: prefer subworkflow_path from the event payload, fall
back to activeContextPath only when the field is absent (legacy
engines).
Also adds the optional subworkflow_path field to the
WorkflowStartedData TypeScript type so the resolver can read it
without a cast.
Reported-by: jrob5756 (PR microsoft#113 review)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Both handlers used
esolved?.indexPath ?? [] as a fallback when
resolveSlotPath returned null. That silently routed the event to the
root context, with two distinct corruption modes:
1. argetNodes defaulted to state.nodes (root), so a same-named
root agent had its status overwritten to 'completed' or 'failed'
by an unrelated subworkflow event.
2. state.activeContextPath = parentIndexPath then unconditionally
reset the active path to [], breaking subsequent sibling routing.
Mirror the guard that subworkflow_started already uses (line 1365):
when resolveSlotPath returns null, return early. A null result means
the event arrived before its sibling subworkflow_started or the path
is inconsistent — neither warrants polluting root state.
Reported-by: jrob5756 (PR microsoft#113 review)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…yloads
Adds coverage for code paths that previously had none, all introduced
by the breadcrumb-navigation PR:
TestCheckInterruptSubworkflow
- test_check_interrupt_raises_in_subworkflow
Sub-engine in web mode propagates the interrupt as
InterruptError so the child unwinds back to the parent.
- test_check_interrupt_consumed_silently_at_root
Regression guard for the root-depth web-mode silent-consume
branch (the partial-output handler does the real pausing).
TestHandleWebPauseSubworkflow
- test_handle_web_pause_stop_event_in_subworkflow
Pause-then-Stop in a sub-workflow exits via interrupt_event
without requiring Resume first.
- test_handle_web_pause_root_ignores_interrupt_event
Documents the intentional root-vs-subworkflow asymmetry: at
root, only Resume or Kill exit a pause.
TestSubWorkflowDashboardPath (additions)
- test_subworkflow_failed_event_carries_parent_path_and_slot_key
The exception branch of _execute_subworkflow emits
subworkflow_failed with parent_path and slot_key. Previously
only the success-path emit was asserted.
- test_nested_subworkflow_path_accumulates
At depth >= 2 (parent -> mid -> leaf), each engine emits its
own workflow_completed and the auto-stamped subworkflow_path
chains correctly across nesting levels.
- test_concurrent_for_each_subworkflow_emits_distinct_slot_keys
Existing for_each test ran with max_concurrent=1; this variant
uses max_concurrent=3 so iterations actually overlap, proving
slot_key uniqueness is not an artifact of serial execution.
Reported-by: jrob5756 (PR microsoft#113 review)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds an inline comment in _handle_web_pause explaining:
1. Why root depth deliberately omits the interrupt_event subscription
(pause exits only on Resume or Kill).
2. Why subworkflow depth subscribes to it (so Stop unwinds the child
engine without requiring Resume first).
3. The tiny clear()/create_task() race window where a Stop click can
be silently discarded, and the trade-off vs. carrying a stale Stop
signal across pause cycles.
No behavior change. Documents an intentional design choice flagged in
PR microsoft#113 review.
Reported-by: jrob5756 (PR microsoft#113 review)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…yloads
Adds coverage for code paths that previously had none, all introduced
by the breadcrumb-navigation PR:
TestCheckInterruptSubworkflow
- test_check_interrupt_raises_in_subworkflow
Sub-engine in web mode propagates the interrupt as
InterruptError so the child unwinds back to the parent.
- test_check_interrupt_consumed_silently_at_root
Regression guard for the root-depth web-mode silent-consume
branch (the partial-output handler does the real pausing).
TestHandleWebPauseSubworkflow
- test_handle_web_pause_stop_event_in_subworkflow
Pause-then-Stop in a sub-workflow exits via interrupt_event
without requiring Resume first.
- test_handle_web_pause_root_ignores_interrupt_event
Documents the intentional root-vs-subworkflow asymmetry: at
root, only Resume or Kill exit a pause.
TestSubWorkflowDashboardPath (additions)
- test_subworkflow_failed_event_carries_parent_path_and_slot_key
The exception branch of _execute_subworkflow emits
subworkflow_failed with parent_path and slot_key. Previously
only the success-path emit was asserted.
- test_nested_subworkflow_path_accumulates
At depth >= 2 (parent -> mid -> leaf), each engine emits its
own workflow_completed and the auto-stamped subworkflow_path
chains correctly across nesting levels.
- test_concurrent_for_each_subworkflow_emits_distinct_slot_keys
Existing for_each test ran with max_concurrent=1; this variant
uses max_concurrent=3 so iterations actually overlap, proving
slot_key uniqueness is not an artifact of serial execution.
Reported-by: jrob5756 (PR microsoft#113 review)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds an inline comment in _handle_web_pause explaining:
1. Why root depth deliberately omits the interrupt_event subscription
(pause exits only on Resume or Kill).
2. Why subworkflow depth subscribes to it (so Stop unwinds the child
engine without requiring Resume first).
3. The tiny clear()/create_task() race window where a Stop click can
be silently discarded, and the trade-off vs. carrying a stale Stop
signal across pause cycles.
No behavior change. Documents an intentional design choice flagged in
PR microsoft#113 review.
Reported-by: jrob5756 (PR microsoft#113 review)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
4e07de4 to
6fb6e93
Compare
Both handlers used
esolved?.indexPath ?? [] as a fallback when
resolveSlotPath returned null. That silently routed the event to the
root context, with two distinct corruption modes:
1. argetNodes defaulted to state.nodes (root), so a same-named
root agent had its status overwritten to 'completed' or 'failed'
by an unrelated subworkflow event.
2. state.activeContextPath = parentIndexPath then unconditionally
reset the active path to [], breaking subsequent sibling routing.
Mirror the guard that subworkflow_started already uses (line 1365):
when resolveSlotPath returns null, return early. A null result means
the event arrived before its sibling subworkflow_started or the path
is inconsistent — neither warrants polluting root state.
Reported-by: jrob5756 (PR microsoft#113 review)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…yloads
Adds coverage for code paths that previously had none, all introduced
by the breadcrumb-navigation PR:
TestCheckInterruptSubworkflow
- test_check_interrupt_raises_in_subworkflow
Sub-engine in web mode propagates the interrupt as
InterruptError so the child unwinds back to the parent.
- test_check_interrupt_consumed_silently_at_root
Regression guard for the root-depth web-mode silent-consume
branch (the partial-output handler does the real pausing).
TestHandleWebPauseSubworkflow
- test_handle_web_pause_stop_event_in_subworkflow
Pause-then-Stop in a sub-workflow exits via interrupt_event
without requiring Resume first.
- test_handle_web_pause_root_ignores_interrupt_event
Documents the intentional root-vs-subworkflow asymmetry: at
root, only Resume or Kill exit a pause.
TestSubWorkflowDashboardPath (additions)
- test_subworkflow_failed_event_carries_parent_path_and_slot_key
The exception branch of _execute_subworkflow emits
subworkflow_failed with parent_path and slot_key. Previously
only the success-path emit was asserted.
- test_nested_subworkflow_path_accumulates
At depth >= 2 (parent -> mid -> leaf), each engine emits its
own workflow_completed and the auto-stamped subworkflow_path
chains correctly across nesting levels.
- test_concurrent_for_each_subworkflow_emits_distinct_slot_keys
Existing for_each test ran with max_concurrent=1; this variant
uses max_concurrent=3 so iterations actually overlap, proving
slot_key uniqueness is not an artifact of serial execution.
Reported-by: jrob5756 (PR microsoft#113 review)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds an inline comment in _handle_web_pause explaining:
1. Why root depth deliberately omits the interrupt_event subscription
(pause exits only on Resume or Kill).
2. Why subworkflow depth subscribes to it (so Stop unwinds the child
engine without requiring Resume first).
3. The tiny clear()/create_task() race window where a Stop click can
be silently discarded, and the trade-off vs. carrying a stale Stop
signal across pause cycles.
No behavior change. Documents an intentional design choice flagged in
PR microsoft#113 review.
Reported-by: jrob5756 (PR microsoft#113 review)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
@jrob5756 good catch — confirmed and fixed. what happened: i did the review-fix work on a local branch from another machine, then a force-push of unrelated work ( remediation (no force-push):
cherry-picked on top of current origin and fast-forwarded — no force-push, your in-flight reviews stay intact. re finding #1 (workflow_started subworkflow_path routing): PR head is now |
Adds full subworkflow awareness to the per-run web dashboard: State pollution fix: - Added wf_depth counter to workflow store — only depth-0 workflow_started initializes root context. Inner workflow events are routed to isolated SubworkflowContext objects. - Each subworkflow invocation gets its own nodes/routes/agents maps, keyed by (parentAgent, iteration). Repeated runs of the same subworkflow no longer share state. Subworkflow event handling: - Added TypeScript types for subworkflow_started, subworkflow_completed, subworkflow_failed events (mirrors engine emit). - Event handlers create/update child contexts and track the active context path for routing subsequent events. Breadcrumb navigation: - New BreadcrumbBar component shows the context stack above the graph (e.g., Root > twig-sdlc-planning > plan-issue). - Click any breadcrumb to navigate to that context level. - Double-click a workflow agent node in the graph to dive into its subworkflow context. - Graph rebuilds automatically when context changes. Context stack architecture: - SubworkflowContext[] tree structure mirrors workflow nesting. - activeContextPath tracks where live events are routed. - viewContextPath tracks what the user is viewing (independent). - getViewedContext() returns the correct nodes/routes for rendering. - All event handlers use activeTarget() helper to route to the correct context's nodes/groupProgress. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Phase 4-5 of breadcrumb navigation feature: WorkflowNode component: - New node type for type:'workflow' agents with dashed border and Layers icon (visually distinct from regular agent nodes). - Shows child workflow name, elapsed time, and a chevron indicator when a SubworkflowContext exists. - Double-click to navigate into the subworkflow graph. SubworkflowDetail panel: - New detail component shown when a workflow agent is selected. - Lists all subworkflow runs for that agent with status, agent count, and cost summary. - Click any run to navigate into its context. Context-aware rendering: - All graph node components (AgentNode, ScriptNode, GateNode, GroupNode, AnimatedEdge) now read from getViewedContext().nodes instead of root state.nodes — ensures correct status display when viewing child contexts. - DetailPanel reads from viewed context for node lookup. - GroupDetail reads groupProgress from viewed context. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
getViewedContext() creates a new object on every call, causing infinite re-render loops (React error #185) when used inside Zustand selectors. New hooks in use-viewed-context.ts use useMemo with stable state references: - useViewedNodes() — nodes map for current context - useViewedGroupProgress() — group progress for current context - useViewedHighlightedEdges() — edge highlights - useViewedSubworkflowContexts() — child contexts - useViewedGraphData() — full graph data for WorkflowGraph All graph components and detail panels updated to use these hooks. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When a subworkflow agent is running, the shared interrupt_event was being silently consumed by _check_interrupt() in web mode (line 857: 'if self._web_dashboard is not None: return None'). This meant Stop would pause the current agent but then resume and continue — the interrupt never propagated to the parent engine. Two fixes: 1. _check_interrupt(): when _subworkflow_depth > 0, raise InterruptError instead of silently consuming the interrupt. This unwinds the child engine back to the parent's _execute_subworkflow try/except, stopping the workflow. 2. _handle_web_pause(): in subworkflows, also watch interrupt_event alongside resume/kill/disconnect events. A second Stop click while an agent is paused now raises InterruptError immediately, without requiring Resume first. Root-level (depth 0) behavior is unchanged — Stop still pauses the current agent with Resume/Kill options in the dashboard. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
On Windows with Python 3.14+, the proactor event loop's accept callback can fire after Server.close() sets _sockets = None during shutdown, causing an AssertionError in base_events.py:_attach that crashes the workflow process. Fix: - Add _guarded_serve() wrapper that catches AssertionError when the uvicorn server is in shutdown state (should_exit = True) - Install a custom event-loop exception handler during server lifetime that suppresses the same race when it surfaces through callbacks - _is_proactor_shutdown_race() validates: AssertionError type, server shutdown state, and asyncio-originating traceback frames - Restore original exception handler in stop() The guard is narrowly scoped: only AssertionError during server shutdown is suppressed. All other exceptions delegate to the original handler. Tests: 9 new tests covering the race detection, exception handler delegation, guarded serve behavior, and edge cases. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…yers When navigating between subworkflow layers via breadcrumbs or double-click, old React Flow edges from the previous layer persisted as floating links disconnected from any visible nodes. Two fixes: - WorkflowGraph: explicitly clear nodes and edges when switching to an empty context (subworkflow data not yet populated) - graph-layout: filter edges against the actual node ID set to prevent orphan edges from routes referencing non-existent nodes Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… nodes
Parse ?agent={name} and ?subworkflow={name} query params on initial load
to auto-select and center the matching node in the workflow graph. This
enables the meta-dashboard (conductor-dashboard) to generate clickable
breadcrumb links that open the conductor UI focused on a specific node.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Update useDeepLink hook to: - Parse slash-separated subworkflow paths (e.g., ?subworkflow=planning/design) for navigating multiple levels deep into nested subworkflows - Support combined ?subworkflow=X&agent=Y to select an agent within a subworkflow context - Remove dependency on subworkflowContexts selector (array mutation doesn't trigger re-renders); rely on late-joiner replay instead Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Rewrote useDeepLink to fix timing issues that prevented navigation: - Use zustand.subscribe() instead of useEffect + selector reactivity. The old approach relied on subworkflowContexts selector changes, but the store mutates the array in-place during processEvent, so the selector never detected changes. - Resolve the full subworkflow path in one shot via index walking instead of calling navigateIntoSubworkflow() in a loop. - Set viewContextPath directly via setState instead of relying on action functions that might see stale state. - Add error banner when deep-link target is invalid: shows the error message and a link back to the root dashboard. - Validate agent exists in the target context's agent list. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When viewing a sub-workflow graph, the start and end nodes are now replaced with distinct ingress/egress nodes that: - Use a dashed border with rounded-xl style to visually distinguish them from regular start/end nodes - Display 'From <parent agent>' on the ingress node - Display 'Return to <parent agent>' on the egress node - Navigate back to the parent workflow on double-click Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The breadcrumb navigation feature could not reach gates inside nested sub-workflows: the dashboard's view path did not follow the active context as a parent agent spawned a child workflow, so a human_gate inside that child was unreachable from the breadcrumb tree. The root cause: the store inferred sub-workflow parentage from the shared activeContextPath, then mutated it on every subworkflow_started. That conflated the user's view cursor with the engine's execution cursor and made events impossible to route correctly under any concurrency. Engine: each WorkflowEngine now carries _dashboard_context_path, the slot-key path identifying its position in the recursive sub-workflow tree (root = []). _execute_subworkflow accepts a slot_key and threads [*parent_path, slot_key] into the spawned child engine. _emit auto- stamps subworkflow_path on every event a sub-engine produces. Sequential subworkflow_started/_completed/_failed include parent_path and slot_key. Frontend: subworkflow_started/completed/failed and workflow_completed/ failed handlers resolve the owning context strictly from engine- supplied parent_path / subworkflow_path via a new resolveSlotPath helper, instead of mutating a single shared activeContextPath. activeTarget consults subworkflow_path on event data when present so per-iteration agent_message/tool/turn events land in the right per- iteration context. viewContextPath sticks to the live edge when the user has not navigated away, so newly-spawned gates inside sub- workflows are reachable without manual breadcrumb clicks. Handlers fall back to legacy behavior when these new fields are absent. SubworkflowDetail navigates by ctx.slotKey (stable across iteration reorders) instead of (agent_name, iteration). Breadcrumbs render slot keys so concurrent iterations are distinguishable. Adds TestSubWorkflowDashboardPath covering parent_path/slot_key emission for sequential sub-workflows and the auto-stamped subworkflow_path on the child workflow_completed event. This is a self-contained slice of a broader fix that also addresses concurrent for_each-of-workflow stacking; the matching for_each-side emit changes are staged in microsoft#110 and become functional once both PRs land. Order of merge does not matter — handlers degrade gracefully when events lack the new fields. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add flexible segment matching for subworkflow deep-links: 1. Exact slotKey match (e.g. plan_child[item-0]) 2. Positional index (e.g. plan_child#0, 0-based) 3. Bare agent name (when unambiguous) Ambiguous bare names (multiple for_each iterations) now produce an actionable error listing valid alternatives instead of silently failing. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…arrows Edges whose source or target was a child node inside a parallel group rendered at incorrect absolute positions (React Flow positions children relative to the parent group). This caused orphaned arrowheads in the top-left corner of the graph. Fix: remap any edge endpoint that references a group child to the parent group node instead, so dagre can properly route the edge and React Flow draws it at the correct position. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add FitViewOnContextSwitch component that triggers fitView when navigating between sub-workflow contexts, preventing stale viewport positioning. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When ?agent= is provided without ?subworkflow=, explicitly pin viewContextPath to root ([]) before the agent lookup runs. This prevents the sticky-follow mechanism from advancing the view into a stale subworkflow/for_each iteration during WS replay, which caused 'Agent not found' errors for root-level agents. Also handles ?subworkflow=&agent=foo (empty subworkflow string) correctly — empty string is falsy so falls through to the agent-only reset path. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Under concurrent for_each iterations, the global activeContextPath is advanced by each subworkflow_started event. When workflow_started for an earlier iteration's child engine arrived, it was resolved against the (now-stale) activeContextPath, so its agents/routes were written to the wrong sibling ctx — or got overwritten by a later sibling. That manifested as phantom routes (and missing routes) when the user navigated into a deeply-nested for_each iteration: routes from one iteration would surface in another, leaving edges drawn between agents that don't appear in the iteration's view. workflow_completed and workflow_failed already used the engine- supplied subworkflow_path slot key (commit f77b20b); workflow_started was the one handler missed. This change makes it consistent with its sibling handlers. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Both handlers used
esolved?.indexPath ?? [] as a fallback when
resolveSlotPath returned null. That silently routed the event to the
root context, with two distinct corruption modes:
1. argetNodes defaulted to state.nodes (root), so a same-named
root agent had its status overwritten to 'completed' or 'failed'
by an unrelated subworkflow event.
2. state.activeContextPath = parentIndexPath then unconditionally
reset the active path to [], breaking subsequent sibling routing.
Mirror the guard that subworkflow_started already uses (line 1365):
when resolveSlotPath returns null, return early. A null result means
the event arrived before its sibling subworkflow_started or the path
is inconsistent — neither warrants polluting root state.
Reported-by: jrob5756 (PR microsoft#113 review)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…yloads
Adds coverage for code paths that previously had none, all introduced
by the breadcrumb-navigation PR:
TestCheckInterruptSubworkflow
- test_check_interrupt_raises_in_subworkflow
Sub-engine in web mode propagates the interrupt as
InterruptError so the child unwinds back to the parent.
- test_check_interrupt_consumed_silently_at_root
Regression guard for the root-depth web-mode silent-consume
branch (the partial-output handler does the real pausing).
TestHandleWebPauseSubworkflow
- test_handle_web_pause_stop_event_in_subworkflow
Pause-then-Stop in a sub-workflow exits via interrupt_event
without requiring Resume first.
- test_handle_web_pause_root_ignores_interrupt_event
Documents the intentional root-vs-subworkflow asymmetry: at
root, only Resume or Kill exit a pause.
TestSubWorkflowDashboardPath (additions)
- test_subworkflow_failed_event_carries_parent_path_and_slot_key
The exception branch of _execute_subworkflow emits
subworkflow_failed with parent_path and slot_key. Previously
only the success-path emit was asserted.
- test_nested_subworkflow_path_accumulates
At depth >= 2 (parent -> mid -> leaf), each engine emits its
own workflow_completed and the auto-stamped subworkflow_path
chains correctly across nesting levels.
- test_concurrent_for_each_subworkflow_emits_distinct_slot_keys
Existing for_each test ran with max_concurrent=1; this variant
uses max_concurrent=3 so iterations actually overlap, proving
slot_key uniqueness is not an artifact of serial execution.
Reported-by: jrob5756 (PR microsoft#113 review)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds an inline comment in _handle_web_pause explaining:
1. Why root depth deliberately omits the interrupt_event subscription
(pause exits only on Resume or Kill).
2. Why subworkflow depth subscribes to it (so Stop unwinds the child
engine without requiring Resume first).
3. The tiny clear()/create_task() race window where a Stop click can
be silently discarded, and the trade-off vs. carrying a stale Stop
signal across pause cycles.
No behavior change. Documents an intentional design choice flagged in
PR microsoft#113 review.
Reported-by: jrob5756 (PR microsoft#113 review)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Rebuilds the production bundle so it incorporates the subworkflow_completed/failed slot-resolution fix (f374a19) that was cherry-picked into this branch. Without this rebuild the committed `static/` bundle would not reflect the source change and the dashboard would still exhibit the bug at runtime. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The previous fix (593f4e1) reset viewContextPath to root for agent-only deep-links to stop sticky-follow from stranding the user inside a stale for_each iteration during replay. That solved the false-negative case (agent at root) but introduced a false-negative for nested agents: if the agent only exists inside a sub-workflow / for_each iteration, the resolver would now error with 'Agent X not found in root workflow.' That broke external integrations (e.g. notification feeds) that surface ?agent=X without knowing the parent slot chain. Resolution algorithm now: - subworkflow=foo&agent=bar: navigate to foo, look for bar there; if bar isn't at foo but exists elsewhere, list discovered locations in the error so the next click is obvious. - agent=bar (no subworkflow): try root first; otherwise walk every sub-workflow context. On exactly one match: navigate there. On many matches (e.g. bar ran in every for_each iteration): pick running > deepest > newest, mirroring the engine's live-event routing precedence. - Zero matches anywhere: deterministic error, view pinned to root so sticky-follow still doesn't strand. Also reworks the wait condition: instead of firing on the first state change after agents.length > 0 (which races against WS replay of nested subworkflow_started events), the resolver debounces 200ms of state quiescence and only applies once the target is resolvable or the workflow has reached a terminal state. A 5s hard cap keeps live-workflow deep-links from hanging when the target never appears. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
bc0c747 to
b2eb97f
Compare
jrob5756
left a comment
There was a problem hiding this comment.
All critical and important fixes verified at b2eb97f:
- ✅ C1: workflow_started routes via subworkflow_path with activeContextPath fallback
- ✅ I1: bail-on-resolution-miss guards added in both subworkflow_completed and subworkflow_failed
- ✅ I2: 9 new tests in TestCheckInterruptSubworkflow / TestHandleWebPauseSubworkflow / TestSubWorkflowDashboardPath cover the depth>0 interrupt branch, stop-while-paused, distinct slot_keys per for_each iteration, nested path accumulation, and failed event payload — all passing
- ✅ S1: comprehensive root-vs-subworkflow asymmetry comment in _handle_web_pause
Carrying I3 (server.py over-broad AssertionError suppression on Windows shutdown) and S2/S3 (comment/docstring nits) into a follow-up issue rather than blocking on them. Nice work, ship it.
Bundles four follow-ups for PR microsoft#113: * feat(web): "Dive In" affordance on for-each iteration rows so users can navigate into a specific iteration's subworkflow context. The detail panel for a for_each group now exposes a Layers button per iteration that resolves the slot key (`<group>[<key>]`) against subworkflowContexts and calls navigateIntoSubworkflow. Resolves the reported case where pg_execution_group[0] could not be linked into. * fix(web/server): tighten _is_proactor_shutdown_race to require the deepest traceback frame to originate from asyncio internals (microsoft#145 I3). Previously the gate fell back to permissive True when the traceback was missing or empty, which could swallow unrelated AssertionErrors raised by user/workflow code during shutdown. * docs(engine): explain why _check_interrupt clears the interrupt event before raising — single-listener reset semantics, no race, callers rely on the post-raise event being clear (microsoft#145 S2). * docs(web): note on resolveSlotPath that newest-wins makes older for-each iterations unreachable by stale slot paths (microsoft#145 S3). Tests: - Updated TestProactorShutdownRace to assert the tightened contract: asyncio-framed traceback -> True, non-asyncio frame -> False, no traceback -> False. All 11 cases pass. - 137 web/engine tests pass; ruff check + format clean. Refs: microsoft#113, microsoft#145 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…nits (#145) (#146) * fix(web): dedupe duplicate routes in graph layout When user YAML defines a conditional route plus a catch-all to the same target (a common pattern, e.g. 'when: ready==false' then bare 'to: gate' fallback), the engine evaluates them in order with first-match-wins semantics — they represent ONE visual transition, not parallel edges. Previously graph-layout.ts emitted an Edge per route entry, so duplicate (from, to) pairs produced overlapping/diverging edges that dagre laid out as phantom strands stretching off-canvas. Visible at the impl subworkflow view for routes like preflight_lite -> preflight_lite_gate (when + bare) and load_work_tree -> \ (when + bare). Dedupe by (from, to) — first route wins, matching engine semantics. Highlighted-edge state is already keyed by (from, to) so this is consistent with the rest of the rendering pipeline. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix: for-each dive-in nav + PR #113 review nits (#145) Bundles four follow-ups for PR #113: * feat(web): "Dive In" affordance on for-each iteration rows so users can navigate into a specific iteration's subworkflow context. The detail panel for a for_each group now exposes a Layers button per iteration that resolves the slot key (`<group>[<key>]`) against subworkflowContexts and calls navigateIntoSubworkflow. Resolves the reported case where pg_execution_group[0] could not be linked into. * fix(web/server): tighten _is_proactor_shutdown_race to require the deepest traceback frame to originate from asyncio internals (#145 I3). Previously the gate fell back to permissive True when the traceback was missing or empty, which could swallow unrelated AssertionErrors raised by user/workflow code during shutdown. * docs(engine): explain why _check_interrupt clears the interrupt event before raising — single-listener reset semantics, no race, callers rely on the post-raise event being clear (#145 S2). * docs(web): note on resolveSlotPath that newest-wins makes older for-each iterations unreachable by stale slot paths (#145 S3). Tests: - Updated TestProactorShutdownRace to assert the tightened contract: asyncio-framed traceback -> True, non-asyncio frame -> False, no traceback -> False. All 11 cases pass. - 137 web/engine tests pass; ruff check + format clean. Refs: #113, #145 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix: address PR #146 review comments 1. _guarded_serve now reuses _is_proactor_shutdown_race gate instead of permissive should_exit-only check. Both exception handler sites now share the same strict contract: suppress only when the deepest traceback frame originates from asyncio internals. 2. Edge dedupe clears the 'when' label when collapsing routes with different conditions to the same target, avoiding a misleading single-condition label on a multi-condition edge. 3. Added test_guarded_serve_reraises_non_asyncio_assertion_during_shutdown to verify the tightened contract. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --------- Co-authored-by: Daniel Green <dangreen@microsoft.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…, #121-#123, #125, #129, #130, #131, #139, #141-#144, #146) CHANGELOG: add 6 newer PRs (#119, #121, #122, #123, #125, #113, #130, #131, #141, #146) to [Unreleased] alongside the previously documented batch. docs/workflow-syntax.md: - Add metadata + instructions fields to the workflow configuration block. - Add input_mapping and max_depth to Sub-Workflow Steps; correct stale claims that circular references are rejected and that workflow steps cannot be used in for_each groups. - Add 'Sub-workflows in for_each groups' subsection with example. - Add JSON stdout auto-parsing note + example to Script Steps output section. - Add type-appropriate zero values table to Workflow Inputs. - Add 'Workflow Metadata Variables' subsection covering workflow.dir, workflow.file, workflow.name. - Update on_start hook context list to include the new workflow.dir/file vars. docs/cli-reference.md: - Document --metadata/-m, --workspace-instructions, and --instructions flags on conductor run. - Add 'Metadata and Instructions' examples block. - Update conductor validate to describe the new template-reference error/warning checks added in #125. docs/providers/claude.md, docs/providers/comparison.md: - Replace stale 'All models support a 200K token context window' / '200K (all models)' claims with notes that the dashboard now sources context_window_max from each provider's SDK at runtime (#144). README.md: - Refresh the Features list to mention sub-workflow composition, dialog mode, workspace instructions, breadcrumb navigation, and the enhanced validate behavior. - Add --metadata, --workspace-instructions, --instructions to the conductor run options table. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* docs: changelog + doc updates for unreleased PRs (#100, #109-#111, #119, #121-#123, #125, #129, #130, #131, #139, #141-#144, #146) CHANGELOG: add 6 newer PRs (#119, #121, #122, #123, #125, #113, #130, #131, #141, #146) to [Unreleased] alongside the previously documented batch. docs/workflow-syntax.md: - Add metadata + instructions fields to the workflow configuration block. - Add input_mapping and max_depth to Sub-Workflow Steps; correct stale claims that circular references are rejected and that workflow steps cannot be used in for_each groups. - Add 'Sub-workflows in for_each groups' subsection with example. - Add JSON stdout auto-parsing note + example to Script Steps output section. - Add type-appropriate zero values table to Workflow Inputs. - Add 'Workflow Metadata Variables' subsection covering workflow.dir, workflow.file, workflow.name. - Update on_start hook context list to include the new workflow.dir/file vars. docs/cli-reference.md: - Document --metadata/-m, --workspace-instructions, and --instructions flags on conductor run. - Add 'Metadata and Instructions' examples block. - Update conductor validate to describe the new template-reference error/warning checks added in #125. docs/providers/claude.md, docs/providers/comparison.md: - Replace stale 'All models support a 200K token context window' / '200K (all models)' claims with notes that the dashboard now sources context_window_max from each provider's SDK at runtime (#144). README.md: - Refresh the Features list to mention sub-workflow composition, dialog mode, workspace instructions, breadcrumb navigation, and the enhanced validate behavior. - Add --metadata, --workspace-instructions, --instructions to the conductor run options table. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * chore: bump version to 0.1.11 and changelog #148 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Summary
Adds full subworkflow awareness to the per-run web dashboard with breadcrumb navigation, fixes node status pollution from repeated subworkflow runs, and fixes the Stop button not working during subworkflows.
Changes
State Pollution Fix
wfDepthcounter to the Zustand workflow store — only depth-0workflow_startedinitializes root contextSubworkflowContextwith its own nodes/routes/agents/groupProgress mapsSubworkflow Event Handling
subworkflow_started,subworkflow_completed,subworkflow_failedactiveTarget()helperBreadcrumb Navigation
BreadcrumbBarcomponent shows the context stack above the graphgetViewedContext()+useViewed*()hooks provide stable, memoized context dataSubworkflow Node Visual
WorkflowNodeReact Flow node type with dashed border and Layers iconSubworkflowDetailpanel lists subworkflow runs with status/cost, click to navigateStop Button Reliability (Engine Fix)
_check_interrupt(): in subworkflows (depth > 0), raisesInterruptErrorinstead of silently consuming the interrupt event — unwinds the child engine back to the parent_handle_web_pause(): in subworkflows, also watchesinterrupt_eventso Stop-while-paused works without requiring Resume firstContext-Aware Rendering
useViewed*()hooks instead of root stateTesting