RFC: first-class on_error routing (brainstorm)#227
Conversation
Adds an engineering brief proposing that conductor teach the routes table about errors. Nodes raise a typed error envelope; outgoing routes match on success or error; four route actions (to / retry / halt / propagate) cover the full control-flow surface; sub-workflow errors propagate by default; provider transport exhaustion surfaces as a routable kind. Submitted by polyphony, a downstream workflow-heavy consumer (15 workflows, 296 nodes) where this primitive would let ~40 hand-rolled error-recovery human_gate nodes collapse to one or two route lines each. This is a brainstorm-tier doc — open questions called out at the top of the brief, recommendations marked, no schema changes yet. Polyphony tracking: AB#3257 (parent epic: AB#3253). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #227 +/- ##
=======================================
Coverage ? 88.26%
=======================================
Files ? 60
Lines ? 9667
Branches ? 0
=======================================
Hits ? 8533
Misses ? 1134
Partials ? 0 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Reframes D1 (Typed error envelope) around a language-neutral contract: conductor sets CONDUCTOR_ERROR_OUT; the script writes JSON to that path and exits 0. Three lines in pwsh / bash / python / node / dotnet — no bespoke per-engine commands required. Optional shipped helpers in helpers/error/ exist as sugar, not as the contract. Direct prior art cited: GitHub Actions' GITHUB_OUTPUT env-var-file migration (away from stdout sentinels). Surveyed alternatives (Azure DevOps, RFC 7807, JSON-RPC, POSIX sysexits, systemd sd_notify, compiler JSON diagnostics, SARIF) and rejected each with rationale. Adds: - D1.1 The script-side contract (engine-agnostic, with prior art) - D1.2 Side-by-side envelope examples in pwsh, bash, python, node, dotnet - D1.3 Optional shipped helpers per engine (table) - D1.4-5 How agents / workflows raise the envelope - D1.6 Where do kinds come from? (intentional classification table) - D1.7 Optional 'raises: [kind]' declaration on nodes for self-doc + lint - D1.8 Wrapping third-party CLIs (bash + pwsh worked examples) Adds an explicit non-goal: conductor does not infer kinds from exit codes, stderr patterns, or any other runtime signal. Updates touch points to call out the temp-file mechanics on Windows (tempfile.mkstemp + delete=False; no read/write overlap since conductor reads only after subprocess exit) and to add the optional 'raises:' schema field. Phase 1 acceptance criteria expanded: cross-engine tests (pwsh-on-Windows, bash-on-Linux, python on both); 'raises:' lint and runtime undeclared-kind check. Removes pwsh-specific Write-ConductorError as the headline ergonomic surface. It still ships, but as one helper among five, not as the canonical entrypoint. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Pushed a substantive update to D1 (the script-side error contract). Top-level shape is unchanged; the change is about engine reach. What changed and why A reviewer (Daniel) pushed back: conductor's script executor is engine-agnostic ( The fix is to make the contract language-neutral, with prior art:
Also in this push
Net effect: the contract is now provably the simplest thing that could work for any script engine, the prior art is named, and the ergonomic helpers are positioned correctly (optional, not load-bearing). Commit: 02eace8 |
RFC adding an engineering brief that proposes teaching conductor's
routes:table about errors.Read order: the TL;DR + thesis + top decisions + critical open questions at the top of the doc are the only parts that need eyes. Everything below the
# Deep divesfold is the design at full depth — skim or skip.What it proposes
A node raises a typed error envelope; outgoing routes match on either success or error; four route actions (
to/retry/halt/propagate) cover the full control-flow surface; sub-workflow errors propagate by default; provider transport exhaustion surfaces as a routable kind.``yaml
routes:
retry: { max: 5, backoff: exponential, initial_seconds: 2 }
to: drift_recovery_gate # post-exhaustion fallback
halt: { message: "model down: {{ error.message }}" }
propagate: true # uncaught -> parent workflow
``
Why
Polyphony has ~40
human_gatenodes across its 15 workflows whose sole purpose is to absorb script-level failures ({success: false}lands here, human picks retry / skip / abort). They exist because conductor has no other primitive that can catch a node-level failure. With this brief shipped, most of those collapse to one or two route lines on the failing node.What's in this PR
Just the brainstorm doc —
docs/projects/error-routing/on-error-routing.brainstorm.md. No schema changes, no code.What I'd like back
Not a yes/no on the whole thing. Specifically:
Submitted from polyphony's architecture/rationalization track (AB#3257, parent epic AB#3253).