Skip to content

Host-Owned Backpressure, Budget-Bound Retry, Tasks as Graph Nodes

Latest

Choose a tag to compare

@bw19 bw19 released this 26 Jun 01:37

v1.42.0 settles the reliability model of the workflow engine and tightens the composition model around it. The orchestration engine (dwarf, bumped to v0.7.1) no longer classifies the errors a task returns: there is no adaptive backpressure valve, no per-task circuit breaker, and no rate-limit/unavailable disposition. Backpressure now belongs to the layer that holds the scarce resource — the task arms its own retry, the LLM provider declares its own throttle, and the Foreman owns exactly one last-resort retry for a missing microservice. Retry is bound by a wall-clock horizon rather than an attempt count, and that horizon defaults to the step's own time budget, which is now overridable per flow via FlowOptions.TimeBudget. Alongside the reliability work, the release removes the last way to run a bare task as a standalone unit: a task is only ever a node in a graph; the independently-invocable, engine-backed unit is the workflow. flow.Subtask, Engine.CreateTask, and the foreman's CreateTask endpoint are gone, and the generated Subflow client is renamed Subgraph and now carries only workflow methods. The upgrade skill performs the rename and the migration mechanically.

Highlights

  • The engine never reads status codes. Any error a task returns is terminal for that attempt — it routes to onError/onTimeout or fails the step. The adaptive backpressure valve, circuit breaker, and error dispositions are deleted; cross-replica gossip drops from four operations to two.
  • Backpressure is the resource-owner's job. A task that wants to ride out a transient failure arms flow.Retry itself; an LLM provider attaches a retryAfter to a genuine throttle. The engine consumes only the backoff shape.
  • Retry is wall-clock-bound, not attempt-bound. flow.Retry(initialDelay, delayMultiplier, maxIntervalDelay, giveUpAfter) gives up the moment the next wait would overshoot giveUpAfter of elapsed time since the step was first created (giveUpAfter <= 0 is unlimited).
  • Per-flow time budget. FlowOptions.TimeBudget overrides the engine's default per-task budget for a flow; subgraph descendants inherit it. The crash-recovery lease is sized from the same budget so they cannot diverge.
  • Provider-side LLM rate-limit handling. Each LLM provider classifies its own errors and expresses retryability through a single retryAfter attribute, closing the poison-request loop at the contract level and keeping the upstream status code authentic.
  • A task is only a node in a graph. flow.Subtask, Engine.CreateTask, the foreman CreateTask endpoint, and the synthesized single-task graph are removed. The generated Subflow client is renamed Subgraph (NewSubgraph) and carries only workflow methods.
  • Recover failed flows. Recover restarts every failed step of a failed flow in one idempotent pass, guarded by a status CAS so concurrent recoveries can't double-rewind a shared spawn step.
  • Fire-and-forget flows. FlowOptions.DeleteOnCompletion deletes a flow the moment it completes successfully; a completed disposable flow returns a uniform 404 from Await/Snapshot/Run.
  • A new missing-microservice metric and a Reliability dashboard row. microbus_foreman_timeout_requests (labeled task_url, outcome) tracks the 404 ack-timeout retry; the workflow dashboard leads with 404 volume.

Reliability and Backpressure

The Engine Stops Classifying Errors

The orchestration engine treats any error a task returns as terminal for that attempt. It no longer inspects the HTTP status or error text, and there is no longer a rate-limit or unavailability signal it wraps an error with. The per-task valve, the circuit breaker, and the ErrRateLimited/ErrUnavailable dispositions are deleted from the engine; the rate-cut, saturated, and breaker-state metrics are gone, and cross-replica gossip drops from four operations (enqueue, status change, valve sync, breaker trip) to two (enqueue, status change).

The rationale is one of vantage. The engine's only handle is the task URL, but the resources that actually get scarce are finer: an LLM provider's per-account token quota is shared across many task URLs and invisible to the engine, and a downstream a task calls is a downstream of the task. The engine cannot be the resource-accurate controller for either, so backpressure moves to the layer that holds the resource identity. The full model is laid out in Reliability and Recovery.

Budget-Bound, Wall-Clock Retry

flow.Retry is now bound by a wall-clock horizon instead of an attempt count:

flow.Retry(initialDelay, delayMultiplier, maxIntervalDelay, giveUpAfter)

giveUpAfter is measured against the step's creation time, so it includes execution time and survives parks; giveUpAfter <= 0 means unlimited. The give-up decision is made client-side in Retry — the bool tracks return nil versus return err — and only the backoff shape crosses to the engine. Retry short-circuits before a wait that would overshoot the horizon, so the last attempt never sleeps past its own deadline. Flow.StepCreatedAt() and Flow.Attempt() are new accessors: the former anchors the horizon, the latter lets a caller still bound by count at the call site.

The One Engine-Level Retry: Missing Microservices

The single case a task cannot own is its own hosting microservice being momentarily absent — the task body never runs. On a 404 ack-timeout (and only that, never a 404 from a task that actually ran), the Foreman arms flow.Retry on the carrier and re-probes, with the step's time budget as the horizon. A brief absence — a rolling deploy, a scale-to-zero, a restart — heals transparently; a 404 that persists past the budget becomes an ordinary step failure. This is the only engine-level retry in the system.

Provider-Side LLM Rate-Limit Handling

Each LLM provider now classifies its own errors and expresses retryability to ChatLoop through a single error attribute, retryAfter: present means retryable (and is the wait), absent means permanent. ChatLoop holds no policy and never inspects the status code. An unclassified error has no retryAfter, so the default is don't-retry — which closes the poison-request loop at the contract level — and the upstream status code stays authentic rather than remapped. The four providers (claudellm, chatgptllm, geminillm, litellm) each report errors uniformly (message + status + retryAfter + raw headers + truncated body) and keep an in-memory per-model gate that preempts further calls to a throttled model until its window clears. The short-term retry horizon is the task's own time budget; long-term retry is the caller's — Chat returns the messages accumulated before an error so a caller can resume rather than restart, and llmapi.RetryAfter(err) is the typed retry signal.

Flow Lifecycle

Recover Failed Flows

Recover(flowKey, stateOverrides) restarts every failed step of a failed flow in one pass, re-running the unhandled failures and leaving completed siblings untouched. It is idempotent — if a step fails again, a second Recover picks it up — and each per-step rewind is guarded by a status CAS on failed, so two concurrent Recover calls (or a Recover racing a RestartFrom) can no longer both rewind a shared spawn step and double-decrement its cohort counters. It joins RestartFrom (rewind from a chosen step) and Restart (re-run the whole flow) as the recovery verbs. The operator loop — List failed flows, read the Error, fix the cause, Recover — is documented in Reliability and Recovery.

Per-Flow Time Budget

FlowOptions.TimeBudget overrides the engine's default per-task time budget for a single flow, bounding every ExecuteTask dispatch's context deadline. It is frozen at Create, immutable for the flow's life, and inherited by subgraph descendants (it also bounds the subgraph LoadGraph). It is a per-task ceiling, not a flow-wide deadline; a task's own sub.TimeBudget endpoint declaration still applies, and the effective deadline is the smaller of the two. The crash-recovery lease is sized from the same time_budget_ms the claim CAS references, so lease and budget can never diverge. The engine imposes no ceiling — bounding the budget is the flow creator's responsibility.

Delete on Completion

FlowOptions.DeleteOnCompletion deletes a flow the instant it completes successfully — for fire-and-forget jobs whose output is not retained. A completed disposable flow is gone everywhere: Snapshot, Await, and Run all return a uniform 404 regardless of timing (in time, late, or repeated), and that 404 is the completion signal. Failed and cancelled disposable flows are retained and return their real outcome; a caller that wants the outcome of a successful disposable flow uses NotifyOnStop.

Composition: A Task Is Only a Node in a Graph

A task has no standalone, engine-backed call form. A task is only ever a node in a graph; the independently-invocable unit is the workflow. That leaves exactly two ways to compose: graph composition (task B is a node in the same graph, sharing the parent's full state) and a subgraph of a declared workflow (flow.Subgraph, with state isolated to the explicit in/out). To make a single task callable on its own, you declare a one-node workflow around it; the apparent duplication is the point, because it makes the author declare intent rather than auto-promoting an internal step into a callable unit. For shared logic, share a Go helper.

Concretely, this release removes every mechanism that synthesized a one-node graph around a task URL:

  • flow.Subtask and the engine's synthesized single-task graph are gone; flow.Subgraph always loads a declared graph.
  • Engine.CreateTask (dwarf) and the foreman CreateTask endpoint (foremanapi.CreateTask) are removed.
  • genservice no longer emits the per-task Subflow.<Task> methods. The generated client is renamed SubflowSubgraph (NewSubgraph), carries only workflow/subgraph methods, and is emitted only for services that declare a workflow.

Other Changes

  • Graph.AddTransitionFanOut(from, to...) wires an unconditional edge from one source to each named destination — the static-fan-out convenience that pairs with AddTransitionChain (linear) over plain AddTransition. It creates only the outgoing edges; a rejoining node still needs SetFanIn. Distinct from the dynamic AddTransitionForEach.
  • Upgrade skills defer regeneration and verification to the orchestrator. A numbered upgrade-vX-Y-Z skill now applies only source edits (via genupgrade plus shell edits) and never invokes a generator or go vet/go test. The upgrade-microbus orchestrator owns a single regeneration + go mod tidy && go vet ./... && go test ./... pass after every numbered skill. This fixes multi-version upgrade chains that previously broke by naming a since-retired generator or verifying a partially migrated tree.

Breaking Changes

The upgrade skill handles each of these mechanically.

  • flow.Subtask, Engine.CreateTask, and the generated Subflow.<Task> methods are removed. A NewSubflow(flow).SomeTask(...) call site no longer compiles. Author a one-node workflow around the task and invoke it as a subgraph instead.
  • The generated Subflow client is renamed Subgraph. NewSubflow(flow) becomes NewSubgraph(flow), and the client is emitted only for services that declare a workflow. The workflow/subgraph methods themselves are unchanged.
  • The foreman CreateTask endpoint is removed. Replace any foremanapi.CreateTask(...) caller with a declared workflow.
  • flow.Retry's signature changed to Retry(initialDelay, delayMultiplier, maxIntervalDelay, giveUpAfter) — the bound is a wall-clock horizon, not a max-attempt count. Unlimited is giveUpAfter <= 0.
  • The engine's error dispositions are gone. workflow.ErrRateLimited/ErrUnavailable and IsRateLimited/IsUnavailable are removed; an error returned to the engine is terminal. Per-task rate-limit and circuit-breaker configs and their metrics no longer exist.

Migration

From inside a Microbus project, ask Claude Code to upgrade Microbus:

Get the latest version of Microbus.

The upgrade bumps go.mod to v1.42.0 (which requires dwarf v0.7.1), refreshes .claude/rules/ and .claude/skills/, and runs the versioned upgrade-v1-42-0 routine, which rewrites NewSubflow to NewSubgraph, flags any NewSubflow(flow).SomeTask(...) or foremanapi.CreateTask(...) call site for migration to a one-node workflow, and updates flow.Retry call sites to the wall-clock signature. The orchestrator then regenerates every microservice with genservice, runs go mod tidy && go vet ./... && go test ./..., and you review the diff. service.go handler bodies are not touched beyond the mechanical rewrites.

Documentation