Skip to content

RFC: first-class on_error routing (brainstorm)#227

Draft
PolyphonyRequiem wants to merge 2 commits into
microsoft:mainfrom
PolyphonyRequiem:proposal/on-error-routing
Draft

RFC: first-class on_error routing (brainstorm)#227
PolyphonyRequiem wants to merge 2 commits into
microsoft:mainfrom
PolyphonyRequiem:proposal/on-error-routing

Conversation

@PolyphonyRequiem
Copy link
Copy Markdown
Member

RFC adding an engineering brief that proposes teaching conductor's routes: table about errors.

Read order: the TL;DR + thesis + top decisions + critical open questions at the top of the doc are the only parts that need eyes. Everything below the # Deep dives fold is the design at full depth — skim or skip.

What it proposes

A node raises a typed error envelope; outgoing routes match on either success or error; four route actions (to / retry / halt / propagate) cover the full control-flow surface; sub-workflow errors propagate by default; provider transport exhaustion surfaces as a routable kind.

``yaml
routes:

  • to: next_node # success path (unchanged)
  • on_error: external.git.drift
    retry: { max: 5, backoff: exponential, initial_seconds: 2 }
  • on_error: external.git.drift
    to: drift_recovery_gate # post-exhaustion fallback
  • on_error: provider.exhausted
    halt: { message: "model down: {{ error.message }}" }
  • on_error: "*"
    propagate: true # uncaught -> parent workflow
    ``

Why

Polyphony has ~40 human_gate nodes across its 15 workflows whose sole purpose is to absorb script-level failures ({success: false} lands here, human picks retry / skip / abort). They exist because conductor has no other primitive that can catch a node-level failure. With this brief shipped, most of those collapse to one or two route lines on the failing node.

What's in this PR

Just the brainstorm doc — docs/projects/error-routing/on-error-routing.brainstorm.md. No schema changes, no code.

What I'd like back

Not a yes/no on the whole thing. Specifically:

  1. A read on the three critical open questions flagged at the top of the brief (post-retry-exhaustion semantics, route-actions vs. smaller alternative, reserved-kind namespace).
  2. A read on the thesis: is "routes table as universal control flow" the right north star for conductor, or am I projecting?
  3. If yes-in-principle, your preference on phasing.

Submitted from polyphony's architecture/rationalization track (AB#3257, parent epic AB#3253).

Adds an engineering brief proposing that conductor teach the routes table
about errors. Nodes raise a typed error envelope; outgoing routes match on
success or error; four route actions (to / retry / halt / propagate) cover
the full control-flow surface; sub-workflow errors propagate by default;
provider transport exhaustion surfaces as a routable kind.

Submitted by polyphony, a downstream workflow-heavy consumer (15 workflows,
296 nodes) where this primitive would let ~40 hand-rolled error-recovery
human_gate nodes collapse to one or two route lines each.

This is a brainstorm-tier doc — open questions called out at the top of the
brief, recommendations marked, no schema changes yet.

Polyphony tracking: AB#3257 (parent epic: AB#3253).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 21, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (main@efa520f). Learn more about missing BASE report.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #227   +/-   ##
=======================================
  Coverage        ?   88.26%           
=======================================
  Files           ?       60           
  Lines           ?     9667           
  Branches        ?        0           
=======================================
  Hits            ?     8533           
  Misses          ?     1134           
  Partials        ?        0           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Reframes D1 (Typed error envelope) around a language-neutral contract:
conductor sets CONDUCTOR_ERROR_OUT; the script writes JSON to that path
and exits 0. Three lines in pwsh / bash / python / node / dotnet — no
bespoke per-engine commands required. Optional shipped helpers in
helpers/error/ exist as sugar, not as the contract.

Direct prior art cited: GitHub Actions' GITHUB_OUTPUT env-var-file
migration (away from stdout sentinels). Surveyed alternatives (Azure
DevOps, RFC 7807, JSON-RPC, POSIX sysexits, systemd sd_notify, compiler
JSON diagnostics, SARIF) and rejected each with rationale.

Adds:
- D1.1 The script-side contract (engine-agnostic, with prior art)
- D1.2 Side-by-side envelope examples in pwsh, bash, python, node, dotnet
- D1.3 Optional shipped helpers per engine (table)
- D1.4-5 How agents / workflows raise the envelope
- D1.6 Where do kinds come from? (intentional classification table)
- D1.7 Optional 'raises: [kind]' declaration on nodes for self-doc + lint
- D1.8 Wrapping third-party CLIs (bash + pwsh worked examples)

Adds an explicit non-goal: conductor does not infer kinds from exit
codes, stderr patterns, or any other runtime signal.

Updates touch points to call out the temp-file mechanics on Windows
(tempfile.mkstemp + delete=False; no read/write overlap since conductor
reads only after subprocess exit) and to add the optional 'raises:'
schema field.

Phase 1 acceptance criteria expanded: cross-engine tests
(pwsh-on-Windows, bash-on-Linux, python on both); 'raises:' lint and
runtime undeclared-kind check.

Removes pwsh-specific Write-ConductorError as the headline ergonomic
surface. It still ships, but as one helper among five, not as the
canonical entrypoint.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PolyphonyRequiem
Copy link
Copy Markdown
Member Author

Pushed a substantive update to D1 (the script-side error contract). Top-level shape is unchanged; the change is about engine reach.

What changed and why

A reviewer (Daniel) pushed back: conductor's script executor is engine-agnostic (asyncio.create_subprocess_exec, no language assumptions), and the original draft leaned on a pwsh Write-ConductorError cmdlet as the headline ergonomic surface. That was an accident of the polyphony codebase being pwsh-heavy, not a design choice — and it pushed the cost of adoption onto every other engine.

The fix is to make the contract language-neutral, with prior art:

  • Contract: conductor sets \ to a temp-file path; the script writes the envelope JSON to that path and exits 0. Three lines in pwsh, bash, python, node, or dotnet (side-by-side examples in D1.2).
  • Prior art: GitHub Actions explicitly migrated from stdout sentinels (::set-output::) to env-var-files (\, \, \) for these exact reasons (stdout pollution, parser fragility, security). We're walking the same path. Cross-platform behavior is proven at scale.
  • Helpers ship as sugar, not as the contract. helpers/error/ directory with one file per common engine, all optional. The contract is what you can implement in 60 seconds from the spec.

Also in this push

  • D1.6 'Where do kinds come from?' — an authorship table making it explicit that conductor never infers kinds from exit codes or stderr patterns. Without intentional authorship at the failure site, all you can know is internal.script_error.
  • D1.7 optional raises: [kind] declaration on nodes — opt-in self-doc that powers a lint pass (validator can flag on_error: x.z against a node that declares raises: [x.y]) and a runtime undeclared-kind check (kinds raised but not declared become internal.undeclared_kind). Strictly optional; existing nodes need no change.
  • D1.8 wrapping third-party CLIs — concrete bash and pwsh examples of translating git fetch / npm install exit codes into typed envelopes at the workflow author's boundary.
  • New non-goal: no inference of kinds from runtime signals.
  • Touch points updated: tempfile.mkstemp(delete=False) for the path so Windows file-locking doesn't bite; conductor reads only after subprocess exit so no read/write overlap.
  • Phase 1 acceptance criteria expanded: cross-engine tests (pwsh-on-Windows, bash-on-Linux, python on both); raises: lint and runtime checks.

Net effect: the contract is now provably the simplest thing that could work for any script engine, the prior art is named, and the ergonomic helpers are positioned correctly (optional, not load-bearing).

Commit: 02eace8

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants