v1.42.0 settles the reliability model of the workflow engine and tightens the composition model around it. The orchestration engine (dwarf, bumped to v0.7.1) no longer classifies the errors a task returns: there is no adaptive backpressure valve, no per-task circuit breaker, and no rate-limit/unavailable disposition. Backpressure now belongs to the layer that holds the scarce resource — the task arms its own retry, the LLM provider declares its own throttle, and the Foreman owns exactly one last-resort retry for a missing microservice. Retry is bound by a wall-clock horizon rather than an attempt count, and that horizon defaults to the step's own time budget, which is now overridable per flow via FlowOptions.TimeBudget. Alongside the reliability work, the release removes the last way to run a bare task as a standalone unit: a task is only ever a node in a graph; the independently-invocable, engine-backed unit is the workflow. flow.Subtask, Engine.CreateTask, and the foreman's CreateTask endpoint are gone, and the generated Subflow client is renamed Subgraph and now carries only workflow methods. The upgrade skill performs the rename and the migration mechanically.
Highlights
- The engine never reads status codes. Any error a task returns is terminal for that attempt — it routes to
onError/onTimeoutor fails the step. The adaptive backpressure valve, circuit breaker, and error dispositions are deleted; cross-replica gossip drops from four operations to two. - Backpressure is the resource-owner's job. A task that wants to ride out a transient failure arms
flow.Retryitself; an LLM provider attaches aretryAfterto a genuine throttle. The engine consumes only the backoff shape. - Retry is wall-clock-bound, not attempt-bound.
flow.Retry(initialDelay, delayMultiplier, maxIntervalDelay, giveUpAfter)gives up the moment the next wait would overshootgiveUpAfterof elapsed time since the step was first created (giveUpAfter <= 0is unlimited). - Per-flow time budget.
FlowOptions.TimeBudgetoverrides the engine's default per-task budget for a flow; subgraph descendants inherit it. The crash-recovery lease is sized from the same budget so they cannot diverge. - Provider-side LLM rate-limit handling. Each LLM provider classifies its own errors and expresses retryability through a single
retryAfterattribute, closing the poison-request loop at the contract level and keeping the upstream status code authentic. - A task is only a node in a graph.
flow.Subtask,Engine.CreateTask, the foremanCreateTaskendpoint, and the synthesized single-task graph are removed. The generatedSubflowclient is renamedSubgraph(NewSubgraph) and carries only workflow methods. - Recover failed flows.
Recoverrestarts every failed step of a failed flow in one idempotent pass, guarded by a status CAS so concurrent recoveries can't double-rewind a shared spawn step. - Fire-and-forget flows.
FlowOptions.DeleteOnCompletiondeletes a flow the moment it completes successfully; a completed disposable flow returns a uniform404fromAwait/Snapshot/Run. - A new missing-microservice metric and a Reliability dashboard row.
microbus_foreman_timeout_requests(labeledtask_url,outcome) tracks the404ack-timeout retry; the workflow dashboard leads with404volume.
Reliability and Backpressure
The Engine Stops Classifying Errors
The orchestration engine treats any error a task returns as terminal for that attempt. It no longer inspects the HTTP status or error text, and there is no longer a rate-limit or unavailability signal it wraps an error with. The per-task valve, the circuit breaker, and the ErrRateLimited/ErrUnavailable dispositions are deleted from the engine; the rate-cut, saturated, and breaker-state metrics are gone, and cross-replica gossip drops from four operations (enqueue, status change, valve sync, breaker trip) to two (enqueue, status change).
The rationale is one of vantage. The engine's only handle is the task URL, but the resources that actually get scarce are finer: an LLM provider's per-account token quota is shared across many task URLs and invisible to the engine, and a downstream a task calls is a downstream of the task. The engine cannot be the resource-accurate controller for either, so backpressure moves to the layer that holds the resource identity. The full model is laid out in Reliability and Recovery.
Budget-Bound, Wall-Clock Retry
flow.Retry is now bound by a wall-clock horizon instead of an attempt count:
flow.Retry(initialDelay, delayMultiplier, maxIntervalDelay, giveUpAfter)giveUpAfter is measured against the step's creation time, so it includes execution time and survives parks; giveUpAfter <= 0 means unlimited. The give-up decision is made client-side in Retry — the bool tracks return nil versus return err — and only the backoff shape crosses to the engine. Retry short-circuits before a wait that would overshoot the horizon, so the last attempt never sleeps past its own deadline. Flow.StepCreatedAt() and Flow.Attempt() are new accessors: the former anchors the horizon, the latter lets a caller still bound by count at the call site.
The One Engine-Level Retry: Missing Microservices
The single case a task cannot own is its own hosting microservice being momentarily absent — the task body never runs. On a 404 ack-timeout (and only that, never a 404 from a task that actually ran), the Foreman arms flow.Retry on the carrier and re-probes, with the step's time budget as the horizon. A brief absence — a rolling deploy, a scale-to-zero, a restart — heals transparently; a 404 that persists past the budget becomes an ordinary step failure. This is the only engine-level retry in the system.
Provider-Side LLM Rate-Limit Handling
Each LLM provider now classifies its own errors and expresses retryability to ChatLoop through a single error attribute, retryAfter: present means retryable (and is the wait), absent means permanent. ChatLoop holds no policy and never inspects the status code. An unclassified error has no retryAfter, so the default is don't-retry — which closes the poison-request loop at the contract level — and the upstream status code stays authentic rather than remapped. The four providers (claudellm, chatgptllm, geminillm, litellm) each report errors uniformly (message + status + retryAfter + raw headers + truncated body) and keep an in-memory per-model gate that preempts further calls to a throttled model until its window clears. The short-term retry horizon is the task's own time budget; long-term retry is the caller's — Chat returns the messages accumulated before an error so a caller can resume rather than restart, and llmapi.RetryAfter(err) is the typed retry signal.
Flow Lifecycle
Recover Failed Flows
Recover(flowKey, stateOverrides) restarts every failed step of a failed flow in one pass, re-running the unhandled failures and leaving completed siblings untouched. It is idempotent — if a step fails again, a second Recover picks it up — and each per-step rewind is guarded by a status CAS on failed, so two concurrent Recover calls (or a Recover racing a RestartFrom) can no longer both rewind a shared spawn step and double-decrement its cohort counters. It joins RestartFrom (rewind from a chosen step) and Restart (re-run the whole flow) as the recovery verbs. The operator loop — List failed flows, read the Error, fix the cause, Recover — is documented in Reliability and Recovery.
Per-Flow Time Budget
FlowOptions.TimeBudget overrides the engine's default per-task time budget for a single flow, bounding every ExecuteTask dispatch's context deadline. It is frozen at Create, immutable for the flow's life, and inherited by subgraph descendants (it also bounds the subgraph LoadGraph). It is a per-task ceiling, not a flow-wide deadline; a task's own sub.TimeBudget endpoint declaration still applies, and the effective deadline is the smaller of the two. The crash-recovery lease is sized from the same time_budget_ms the claim CAS references, so lease and budget can never diverge. The engine imposes no ceiling — bounding the budget is the flow creator's responsibility.
Delete on Completion
FlowOptions.DeleteOnCompletion deletes a flow the instant it completes successfully — for fire-and-forget jobs whose output is not retained. A completed disposable flow is gone everywhere: Snapshot, Await, and Run all return a uniform 404 regardless of timing (in time, late, or repeated), and that 404 is the completion signal. Failed and cancelled disposable flows are retained and return their real outcome; a caller that wants the outcome of a successful disposable flow uses NotifyOnStop.
Composition: A Task Is Only a Node in a Graph
A task has no standalone, engine-backed call form. A task is only ever a node in a graph; the independently-invocable unit is the workflow. That leaves exactly two ways to compose: graph composition (task B is a node in the same graph, sharing the parent's full state) and a subgraph of a declared workflow (flow.Subgraph, with state isolated to the explicit in/out). To make a single task callable on its own, you declare a one-node workflow around it; the apparent duplication is the point, because it makes the author declare intent rather than auto-promoting an internal step into a callable unit. For shared logic, share a Go helper.
Concretely, this release removes every mechanism that synthesized a one-node graph around a task URL:
flow.Subtaskand the engine's synthesized single-task graph are gone;flow.Subgraphalways loads a declared graph.Engine.CreateTask(dwarf) and the foremanCreateTaskendpoint (foremanapi.CreateTask) are removed.genserviceno longer emits the per-taskSubflow.<Task>methods. The generated client is renamedSubflow→Subgraph(NewSubgraph), carries only workflow/subgraph methods, and is emitted only for services that declare a workflow.
Other Changes
Graph.AddTransitionFanOut(from, to...)wires an unconditional edge from one source to each named destination — the static-fan-out convenience that pairs withAddTransitionChain(linear) over plainAddTransition. It creates only the outgoing edges; a rejoining node still needsSetFanIn. Distinct from the dynamicAddTransitionForEach.- Upgrade skills defer regeneration and verification to the orchestrator. A numbered
upgrade-vX-Y-Zskill now applies only source edits (viagenupgradeplus shell edits) and never invokes a generator orgo vet/go test. Theupgrade-microbusorchestrator owns a single regeneration +go mod tidy && go vet ./... && go test ./...pass after every numbered skill. This fixes multi-version upgrade chains that previously broke by naming a since-retired generator or verifying a partially migrated tree.
Breaking Changes
The upgrade skill handles each of these mechanically.
flow.Subtask,Engine.CreateTask, and the generatedSubflow.<Task>methods are removed. ANewSubflow(flow).SomeTask(...)call site no longer compiles. Author a one-node workflow around the task and invoke it as a subgraph instead.- The generated
Subflowclient is renamedSubgraph.NewSubflow(flow)becomesNewSubgraph(flow), and the client is emitted only for services that declare a workflow. The workflow/subgraph methods themselves are unchanged. - The foreman
CreateTaskendpoint is removed. Replace anyforemanapi.CreateTask(...)caller with a declared workflow. flow.Retry's signature changed toRetry(initialDelay, delayMultiplier, maxIntervalDelay, giveUpAfter)— the bound is a wall-clock horizon, not a max-attempt count. Unlimited isgiveUpAfter <= 0.- The engine's error dispositions are gone.
workflow.ErrRateLimited/ErrUnavailableandIsRateLimited/IsUnavailableare removed; an error returned to the engine is terminal. Per-task rate-limit and circuit-breaker configs and their metrics no longer exist.
Migration
From inside a Microbus project, ask Claude Code to upgrade Microbus:
Get the latest version of Microbus.
The upgrade bumps go.mod to v1.42.0 (which requires dwarf v0.7.1), refreshes .claude/rules/ and .claude/skills/, and runs the versioned upgrade-v1-42-0 routine, which rewrites NewSubflow to NewSubgraph, flags any NewSubflow(flow).SomeTask(...) or foremanapi.CreateTask(...) call site for migration to a one-node workflow, and updates flow.Retry call sites to the wall-clock signature. The orchestrator then regenerates every microservice with genservice, runs go mod tidy && go vet ./... && go test ./..., and you review the diff. service.go handler bodies are not touched beyond the mechanical rewrites.
Documentation
- Updated: State — the two composition forms and the typed
NewSubgraphclient, replacing the removedSubtasksibling. - Updated: Yield and Re-Enter and Building Agentic Workflows — a subgraph invokes a whole workflow, never a bare task.
- Updated: Reliability and Recovery — host-owned backpressure, wall-clock retry, the ack-timeout last resort, and the
List→Recoveroperator loop. - Updated: Priority and Fairness —
FlowOptions.TimeBudget. - Updated: Uniform Code Structure — the
Subgraphclient stub, emitted only for services that declare a workflow.