Umbrella Orchestrator Scaling

Umbrella: Orchestrator Scaling (#101)

Issue: noetl/ai-meta#101
Status: In progress. Blocks a and b landed in noetl/server v3.9.0; block-b stage 2 (references-in-state + frame pruning) and the GKE-tier benchmark are still ahead.
Primary repo: noetl/server (+ noetl/worker)

See also: Design note — CQRS event log + batched projection (the write-path scaling direction).

⤴ Reframed 2026-06-19 by #115 — RFC: Decoupled Context + Event Chain. #115 names the root cause this umbrella's incremental-state + results-by-reference work bounded but did not remove: orchestrator context grows unbounded because every step passes the entire accumulated context to the next (17.4MB drive-state / 1.32MB command observed). It inverts #101's "resolve over-budget refs back into a rebuilt WorkflowState" to "never put data in state, never scan noetl.event to build state": the schema carries references only, state is reconstructed by walking a one-level event chain (no event-table scan), and the walk-and-cache moves off-server onto a system-pool system/state_builder. #101's remaining consume side (worker selective render-time ref-resolution) becomes #115 Phase 1 — the immediate unblock for the 3 stuck #114 fixtures + the off-server-drive prod cutover. The incremental OrchStateCache here is superseded by #115's immutable-chain cache.

Update 2026-06-15 — block b landed (server v3.9.0)

noetl/server#197 → v3.9.0 (1760c19): projection-snapshot bounded rebuild (memory flat — 167KB snapshot at 200k events, was OOM at ~19k), throttled consistency COUNT (O(events) per-trigger COUNT off the hot path), a background reconcile poller that force-advances every active execution every 8s so a missed non-triggering straggler can't permanently deadlock the cursor, results-by-reference resolution, and the GET /api/executions/{id} memory-bomb fix (was loading all events).

Validated: kind 10×1000 (flat memory, 0 OOMs across 200k events); GKE db-g1-small + PgBouncer 10×200 (cleared the prior fetch_vital_signs deadlock, poller observed advancing a stuck execution, 0 fails / 0 restarts, Cloud SQL ~15 backends). Connection topology: server keeps its own small bounded noetl pool; workers multiplex demo_noetl through a separate PgBouncer pool.

Goal

Make the Rust orchestrator hold up under high cursor concurrency and large tool results. Two coupled problems surfaced while validating test_pft_flow_v2 at scale on kind; they were fixed together.

The two fixes

1. Incremental orchestrator state

trigger_orchestrator reloaded and replayed the whole noetl.event log on every completion — O(n) per trigger, O(n²) over a run. Under high cursor concurrency the simultaneous full loads spiked memory and OOM-killed the server (1Gi limit).

A per-execution OrchStateCache (src/state.rs) now holds the reconstructed WorkflowState and advances it by applying only events newer than last_event_id, behind a per-execution lock — one execution's completion triggers serialise; different executions never contend. A count mismatch (a late straggler event) falls back to a full rebuild for correctness; terminal executions are evicted to free memory. Cursor frame tracking moved into StepInfo (CursorFrame) so it survives incremental application instead of being reconstructed by a full scan each trigger.
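
The incremental path described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in src/state.rs: the `Event` and `WorkflowState` fields, the `advance` signature, and the `load_full_log` closure are all hypothetical stand-ins. It shows the three behaviours named in the text: apply only events newer than `last_event_id` under a per-execution lock, fall back to a full rebuild on a count mismatch, and evict terminal executions.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical, simplified event: only the fields this sketch needs.
#[derive(Clone, Debug)]
pub struct Event {
    pub event_id: u64,
    pub step: String,
}

/// Hypothetical stand-in for the reconstructed WorkflowState.
#[derive(Default, Debug)]
pub struct WorkflowState {
    pub last_event_id: u64,
    pub applied: usize,
    pub steps_seen: Vec<String>,
}

impl WorkflowState {
    pub fn apply(&mut self, ev: &Event) {
        self.steps_seen.push(ev.step.clone());
        self.last_event_id = ev.event_id;
        self.applied += 1;
    }

    /// Full replay of an ordered log (the expensive O(n) path).
    pub fn rebuild(log: &[Event]) -> Self {
        let mut state = WorkflowState::default();
        for ev in log {
            state.apply(ev);
        }
        state
    }
}

/// One lock per execution: triggers for the same execution serialise,
/// different executions only share the short map lookup.
#[derive(Default)]
pub struct OrchStateCache {
    entries: Mutex<HashMap<u64, Arc<Mutex<WorkflowState>>>>,
}

impl OrchStateCache {
    /// Applies `newer` incrementally. `total_count` is the number of events
    /// the log holds for this execution; on a mismatch (a late straggler)
    /// the state is rebuilt from `load_full_log`. Returns true on a rebuild.
    pub fn advance<F>(&self, execution_id: u64, newer: &[Event], total_count: usize, load_full_log: F) -> bool
    where
        F: FnOnce() -> Vec<Event>,
    {
        let entry = {
            let mut map = self.entries.lock().unwrap();
            let arc = Arc::clone(map.entry(execution_id).or_default());
            arc
        };
        let mut state = entry.lock().unwrap();
        let last = state.last_event_id;
        for ev in newer.iter().filter(|e| e.event_id > last) {
            state.apply(ev);
        }
        if state.applied == total_count {
            return false;
        }
        *state = WorkflowState::rebuild(&load_full_log());
        true
    }

    /// Number of events applied so far, or None if the execution is not cached.
    pub fn applied(&self, execution_id: u64) -> Option<usize> {
        let map = self.entries.lock().unwrap();
        map.get(&execution_id).map(|e| e.lock().unwrap().applied)
    }

    /// Drop a terminal execution to free memory.
    pub fn evict(&self, execution_id: u64) {
        self.entries.lock().unwrap().remove(&execution_id);
    }
}
```

In this sketch the straggler case looks like this: events 1, 2 and 4 are applied, event 3 commits late, and the next trigger sees only event 5 as "newer" while the log count says 5. The applied count (4) disagrees, so the state is rebuilt from the full log.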

2. Results-by-reference resolution

The worker stages tool results over its inline budget in the durable result store and emits a {data:{_ref}} placeholder + a sibling reference block, nested under the call.done envelope at result.context.result.reference. The orchestrator never resolved those when reading events to build state — so a cursor claim that returned its rows by reference looked like it returned zero rows → 0-row frames → the loop re-claimed the same pending rows forever (observed: 20,844 events, work-queue stuck at pending=5). Any result over the default 100 KB budget would have hit this in production.

hydrate_result_references (src/handlers/events.rs) resolves the reference (both nested and top-level envelope shapes) from noetl.result_store and splices the data in place before from_events / evaluate_state see the events, so extract_user_data / the cursor drive / build_context read it like an inline result.
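
As a rough illustration of the splice-in-place step, the sketch below walks a JSON-like tree and replaces any `{data:{_ref}}` placeholder with the stored payload. It is an assumption-laden sketch, not the code in src/handlers/events.rs: the `Value` enum, the `hydrate` and `get_path` helpers, and the in-memory `HashMap` standing in for noetl.result_store are hypothetical, and the real envelope carries more fields than shown. Recursing over the whole tree is one simple way to cover both the nested and the top-level envelope shapes.

```rust
use std::collections::{BTreeMap, HashMap};

/// Minimal JSON-like value, so the sketch needs no external crates.
#[derive(Clone, Debug, PartialEq)]
pub enum Value {
    Str(String),
    List(Vec<Value>),
    Map(BTreeMap<String, Value>),
}

/// Convenience constructor for map values (hypothetical helper).
pub fn obj(pairs: Vec<(&str, Value)>) -> Value {
    Value::Map(pairs.into_iter().map(|(k, v)| (k.to_string(), v)).collect())
}

/// Replaces every `data: {_ref: <id>}` placeholder whose id is present in
/// `store` with the stored payload. Returns how many references were resolved.
pub fn hydrate(value: &mut Value, store: &HashMap<String, Value>) -> usize {
    match value {
        Value::Map(map) => {
            let mut resolved = 0;
            // Read the reference id first so the shared borrow ends before mutation.
            let ref_id = match map.get("data") {
                Some(Value::Map(data)) => match data.get("_ref") {
                    Some(Value::Str(id)) => Some(id.clone()),
                    _ => None,
                },
                _ => None,
            };
            if let Some(id) = ref_id {
                if let Some(payload) = store.get(&id) {
                    map.insert("data".to_string(), payload.clone());
                    resolved += 1;
                }
            }
            for child in map.values_mut() {
                resolved += hydrate(child, store);
            }
            resolved
        }
        Value::List(items) => items.iter_mut().map(|v| hydrate(v, store)).sum::<usize>(),
        Value::Str(_) => 0,
    }
}

/// Follows a key path through nested maps.
pub fn get_path<'a>(value: &'a Value, path: &[&str]) -> Option<&'a Value> {
    let mut cur = value;
    for key in path {
        match cur {
            Value::Map(m) => cur = m.get(*key)?,
            _ => return None,
        }
    }
    Some(cur)
}
```

After hydration, a reader such as the cursor drive sees the rows under `data` exactly as if the worker had returned them inline, which is what stops the zero-row frame and the re-claim loop.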

The worker side (worker#89) makes the inline budget env-configurable (NOETL_EVENT_RESULT_CONTEXT_MAX_BYTES, default 100 KB) so ops can tune it and tests can force the reference path.
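
A minimal sketch of how such an env-configurable budget might be read is shown below. Only the variable name NOETL_EVENT_RESULT_CONTEXT_MAX_BYTES and the 100 KB default come from this page; the function names, the choice of 100 * 1024 bytes for "100 KB", and the fallback to the default on an unparsable or zero value are assumptions for illustration.

```rust
use std::env;

/// Assumed byte value for the documented "100 KB" default.
pub const DEFAULT_RESULT_CONTEXT_MAX_BYTES: usize = 100 * 1024;

/// Parses a raw setting; missing, unparsable, or zero values fall back to
/// the default (the fallback policy is an assumption of this sketch).
pub fn parse_budget(raw: Option<&str>) -> usize {
    raw.and_then(|s| s.trim().parse::<usize>().ok())
        .filter(|&n| n > 0)
        .unwrap_or(DEFAULT_RESULT_CONTEXT_MAX_BYTES)
}

/// Reads the budget from the process environment.
pub fn budget_from_env() -> usize {
    let raw = env::var("NOETL_EVENT_RESULT_CONTEXT_MAX_BYTES").ok();
    parse_budget(raw.as_deref())
}

/// A result goes by reference only when it is strictly over the budget.
pub fn must_stage_by_reference(serialized_len: usize, budget: usize) -> bool {
    serialized_len > budget
}
```

Setting the variable to a small value such as 256, as the kind validation on this page did, forces nearly every result down the reference path.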

Note: the table actually written is noetl.result_store (columns result_id, name, scope, source_step, data, …). The richer noetl.result_ref schema (extracted, preview, store tiers) exists in DDL but is currently unused/empty — relevant to the follow-up below.

Recent activity

Date	What	Pointers
2026-06-19	🐞 #114 — oversized `command.issued` offload shipped (server v3.29.5); refs_in_state consume side (#101) is the remaining off-server-drive cutover blocker. Under the publish-only gate the off-server drive (`refs_in_state=false`) embeds the full resolved upstream context into the next step's command, so its `command.issued` event reached ~1.32MB > NATS `max_payload` (1MB) → publish ack-timeout → `step.enter` persisted, command never issued, wedge. Fix (server#242 → v3.29.5): a context over `NOETL_COMMAND_CONTEXT_MAX_BYTES` (512KB) is offloaded to `noetl.result_store` with a `{__context_ref__}` marker; `get_command`/`claim_command` resolve it before the worker sees the command (metrics `context_offloaded`/`context_ref_resolved`). `apply_event` reads only command.issued meta → rebuild/sole-writer/idempotency hold; within-budget commands unchanged. Rig phase 8 (e2e#64). Kind gate-ON: off-server rig PASS — new `test_oversize_command_context` COMPLETED, max command.issued ctx 585B, offload+resolve fired, 0 `__orchestrate__` event rows, lag 0; all command.issued <1MB; 6 of #113's 9 fixtures COMPLETE. Chose ref-on-oversize over `refs_in_state=true` (candidate #1): a kind experiment proved the latter fixes the state-bloat (`lease_expiry` COMPLETES, drive-state 29KB) but breaks the bulk-consuming fixtures (`storage_tiers`/`output_select` fail at the bulk step — worker render-time ref-resolution not impl), so the default stays false. Remaining 3 fixtures + the off-server-drive cutover (#107/#111) now hinge on the refs_in_state consume side (#101) — `__orchestrate__` drive-state bloat (17.4MB for `storage_tiers`) + the `_ref`/bulk-resolve gap. No prod default flipped.	server#242 · e2e#64 · #114 · #101
2026-06-19	🚀 GKE pre-flip PREP — prod images pushed, GMP monitoring LIVE, roll-forward manifests staged. NO traffic flip, NO `PUBLISH_ONLY`. Read-only verification reconciled the prep brief against the live prod cluster: prod already runs the full Rust stack (the #49 cutover is done; `noetl` Service already selects `app=noetl-server-rust`; live images are pre-#103 `batch-dispatch-v1`/`cursor-100`), both flip secrets already exist (`NOETL_ENCRYPTION_KEY` + `noetl-internal-api-token`), and prod monitoring is Google Managed Prometheus, not VictoriaMetrics. Built + pushed the post-#103 images to the prod AR (server v3.29.3 `@sha256:6d2de32…` + worker v5.35.0, linux/amd64). Applied + verified GMP monitoring (`ci/manifests/noetl/gmp/`): `PodMonitoring` for the worker pools + server `/metrics` (the noetl namespace had none — GMP ignores `prometheus.io/scrape` annotations, so app metrics weren't scraped at all) + materializer-lag `Rules` (identical PromQL/thresholds to the kind VMRule); proven live via the Managed Prometheus query API (`up{namespace="noetl"}`=4 series, worker+server metrics flowing). Staged the roll-forward manifests (server→v3.29.3 `PUBLISH_ONLY=false`; system-pool→v5.35.0 `MATERIALIZER_ENABLED=false`) in a PR — not applied (they roll live workloads). Runbook gained a "Production (GKE)" section (GMP not VM, secrets present, image-roll prerequisite, exact operator sequence) + a GMP managedAlertmanager pager stub. Operator-gated remainder (surfaced, not done): roll the live images, enable the materializer shadow, wire the pager, then flip — one revert away. No prod default changed.	ops PR · #103 · #49
2026-06-19	🛡️ Materializer-lag GUARDRAIL shipped — the pre-flip observability gate for `PUBLISH_ONLY`. The server was FLIP-READY; the one remaining operator gate was a materializer-lag metric + alert. Worker #116 → v5.35.0 extends the JetStream lag poller to track the `noetl_events`/`noetl_materializer` consumer on an independent task → `noetl_worker_nats_consumer_pending{consumer="noetl_materializer"}` (+`_ack_pending`) climbs even when the materializer loop is stalled/dead (it can't report its own lag). `consumer_lag_for(stream,consumer)` reuses the same JetStream connection + the existing gauge (no new metric). Ops #195+#196: `VMRule` (backlog warning>200/critical>2000/growing + stall-under-gate [guarded on backlog>0] + project-errors + absent-under-gate), worker `/metrics` `VMServiceScrape` (was unscraped), VMAlert enabled, Grafana dashboard, flip runbook `noetl-cqrs-publish-only-flip.md` (pre-flip green-baseline check + one-command revert). Kind-proven full cycle on the VM stack: green baseline (backlog 0, published==projected==acked) → induced lag (materializer fault-injected, events publishing under the gate) → gauge climbs 0→684 via the independent poller, alerts fire (backlog warning+critical + stall) → recover → drains→0 idempotently (0 dup/loss), alerts clear. Flip-readiness now includes the monitoring gate. Default-off.	worker#116 · ops#195 · ops#196 · #103
2026-06-19	🎯 CQRS server cutover COMPLETE — FLIP-READY. The 2 ExecutionService cancel/finalize sites now route through the `emit_event` chokepoint (server#240 → v3.29.3, the third and final flip blocker). `cancel` (`playbook_cancelled`) + `finalize` (`playbook_completed`/`playbook_failed`) were the last synchronous server `noetl.event` writers under the gate. `ExecutionService` now carries `AppState`; `resolve_catalog_id` falls back `noetl.event`→`noetl.command` (mirrors #236, since `noetl.event` is empty under the gate for a fresh exec); `require_state` guards the pool-less test shim. Kind-proven both modes: gate-off cancel/finalize INSERT synchronously, byte-identical columns (error preserved), published delta +0, natural completion still COMPLETED; gate-on `noetl_event_ingest_published_total{playbook_cancelled}=1`/`{playbook_failed}=1` (PUBLISHED, not inserted), materializer sole writer, both execs reach correct terminal state, rows==distinct, 0 `catalog_id=0`, no loss/dup. Dual-mode e2e rig `kind_validate_cancel_finalize_gate.sh` (e2e#62). No remaining synchronous server `noetl.event` writers under the gate. All three flip blockers closed → flipping `PUBLISH_ONLY` on is a staged operator decision. Default-off; no prod default changed.	server#240 · e2e#62 · #103
2026-06-19	CQRS materializer ack-after-materialize durability RESOLVED + fault-tested (the last correctness item before the `PUBLISH_ONLY` flip). Deferred (ack-after-processing) ack in the `subscription` `SourceClient`/NATS source — `AckMode::Defer` surfaces the `$JS.ACK` reply subject as a durable handle, `SourceClient::ack(ids, Ack/Nack/Term)` disposes it later (tools#71). An in-process worker materializer consume-loop (worker#115, `NOETL_MATERIALIZER_ENABLED`, default off) drains `noetl_events` with deferred ack, POSTs `events/project`, and acks only on 2xx — on failure the batch redelivers (no loss). Chosen over playbook deferred-ack (step model can't hold an ack handle across pods/steps). System-pool wiring ops#194. Kind fault-injection (gate-on, sole writer): happy drained==projected==acked zero-dup; fault before ack → redeliver → materialize, loss=0, idempotent. Default-off; pointer bumps staged on the tools→worker crate-publish cascade.	tools#71 · worker#115 · ops#194 · #103
2026-06-18	CQRS 2d-3 sole-writer cutover implemented, default-off. `emit_event` chokepoint + `NOETL_EVENT_INGEST_PUBLISH_ONLY` gate (server#235): 13 producer sites PUBLISH to `noetl_events` instead of INSERT under the gate → materializer is the sole writer; trigger relocates to `events_project` (read-your-writes); `system/` drainers exempt (write synchronously, else deadlock). Kind: a gate-on exec writes 0 `noetl.event` rows (server no longer writes the log); gate-off byte-identical (25/25 regression + off-server e2e PASS). End-to-end single-exec completion needs a clean-cluster soak (shared kind saturated by accumulated test execs). 557 tests + clippy green. Operator-gated; no prod default flipped.	server#235 · #103
2026-06-18	CQRS 2d shadow validated on kind; materializer proven sole-writer-capable; 2d-3 cutover designed + staged. The `system/event_materializer` + `system/projector` system playbooks land (ops#192, built on the in-flight `feat/cqrs-2d1-materializer-playbook` branch). Live on kind (tailer on): materializer reproduces the log byte-identically + idempotently — `events/project` of 25 real off-server-orchestrate rows → `{projected:0,duplicates:25}` (zero double-writes), a full playbook cycle → `{projected:0,duplicates:20}`; off-server orchestrate e2e green with the tailer on. Fixed a stall (`batch:` 500→25; drains over the 100 KiB inline budget stage to a `_ref`, breaking `{{ drain_events.count }}`). 2d-3 cutover (sole writer) mapped to a server-wide ~18-site event-write chokepoint refactor + default-off `NOETL_EVENT_INGEST_PUBLISH_ONLY` gate + orchestrator-trigger relocation (read-your-writes) — designed, operator-gated, not flipped.	ops#192 · #103
2026-06-16	CQRS 2d-2 (server) DONE + live-validated. Shared `normalize_event_to_row` (extracted from `handle_event_inner`) + `POST /api/internal/events/materialize` (normalizes native producer events → idempotent batch-insert `noetl.event`, byte-identical to the synchronous path) — the foundation for the 2d-3 cutover. Plus a tailer fix: pre-skip events over NATS max_payload instead of wedging the cursor (caught live — orchestrator `command.issued` events are 5.4MB, the cursor fan-out's rendered context; flagged for 2d-3 + as a write-path cost). PFT 217 events green; 669 tests.	server#206 · #103
2026-06-16	CQRS step 2, phase 2b-2 (playbook) PR open: the `system/projector`. Bounded drain of `noetl_events` (`tool: subscription`, ack on_success) → aggregate distinct execution_ids → POST `/api/internal/projection/advance`; never touches `noetl.` directly, routed to the system worker pool, driven by a CronJob that loops bounded drains ~55s/run (KEDA-on-pending-count noted as the production keep-up path). Server side extended to ensure the durable `noetl_projector` pull consumer on the stream. The full CQRS read path now exists end-to-end (gated off): worker/server → `noetl.event` → tailer → `noetl_events` → projector playbook → `/projection/advance` → `projection_snapshot` → orchestrator reads it.	ops#189 · server#204 · #103
2026-06-16	CQRS step 2 — phase 2b-1 (server) PR open: projector owns `projection_snapshot`. Chosen path (vs a shadow `noetl.projection`): the `system/projector` owns the snapshot the orchestrator reads. `POST /api/internal/projection/advance` recomputes + saves each execution's snapshot via the block-b bounded-rebuild machinery (`rebuild_state` + `orch_snapshot::save`, no dispatch, idempotent); `NOETL_PROJECTOR_OWNS_SNAPSHOT` (default off) makes the orchestrator stop self-writing the snapshot and only read it. Default off preserves block-b/the OOM fix exactly — ownership transfer is a deliberate reversible flag, not a code-coupled atomic change. Stacked on the 2a tailer #202. 669 tests + clippy green.	server#204 · server#203 · #103
2026-06-16	CQRS step 2 — phase 2a (producer) PR open (reworked from a DB trigger after review). A background tailer reads committed `noetl.event` rows by a persisted cursor (`noetl.stream_cursor`) and batch-publishes them onto a `noetl_events` JetStream stream (dedup by `event_id`) for the `system/projector` playbook (2b) to fold. Not a trigger (a trigger welds the producer to Postgres internals + doesn't survive a storage-type change) and not an in-process channel fed at the 17 insert sites (couples every emit path, loses in-flight on crash); the tailer reads committed rows so a restart re-scans an overlap the stream dedup collapses. Env-gated `NOETL_EVENT_STREAM_ENABLED`, default off, no-op without NATS. At the 2d cutover the worker publishes to the same stream and the tailer is deleted. 669 tests + clippy green; metrics `noetl_event_stream_published_total` + `_cursor`.	server#202 · server#200 · #103
2026-06-16	(superseded) First 2a attempt used a `noetl.event` → `noetl.outbox` DB trigger (server#201); closed after review — trigger couples the producer to Postgres internals + per-event granularity is wrong. Replaced by the tailer above.	server#201 (closed)
2026-06-15	Both PRs opened + kind-validated green. PFT `test_pft_flow_v2` (1 facility × 5 patients, worker budget=256 forcing every result by reference) COMPLETED — 34 references / 9 steps stored+resolved, 292 events (vs 20,844 runaway pre-fix), all 25 work-queue rows `done`. server 666 tests pass, worker 19 pass, clippy clean.	server#197 · worker#89
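
The oversized-context offload from the 2026-06-19 #114 entry in the table above can be sketched as below. This is a hypothetical illustration, not the server#242 code: the `CommandContext` enum stands in for the `{__context_ref__}` marker, the in-memory `ResultStore` stands in for `noetl.result_store`, and 512 * 1024 is an assumed byte value for the documented 512KB limit. The point it shows is that within-budget contexts pass through unchanged, while oversized ones travel as a small reference and are resolved before the worker sees the command.

```rust
use std::collections::HashMap;

/// Assumed byte value for the documented 512KB default of
/// NOETL_COMMAND_CONTEXT_MAX_BYTES.
pub const COMMAND_CONTEXT_MAX_BYTES: usize = 512 * 1024;

/// What travels inside the command event.
#[derive(Clone, Debug, PartialEq)]
pub enum CommandContext {
    /// Within budget: carried inline, unchanged.
    Inline(String),
    /// Over budget: only a reference id travels (stands in for the marker).
    Ref(String),
}

/// In-memory stand-in for the durable result store.
#[derive(Default)]
pub struct ResultStore {
    rows: HashMap<String, String>,
    next_id: u64,
}

impl ResultStore {
    pub fn put(&mut self, data: String) -> String {
        self.next_id += 1;
        let id = format!("ctx-{}", self.next_id);
        self.rows.insert(id.clone(), data);
        id
    }

    pub fn get(&self, id: &str) -> Option<&String> {
        self.rows.get(id)
    }
}

/// Offloads a serialized context only when it exceeds `max_bytes`.
pub fn offload_if_oversized(ctx: String, max_bytes: usize, store: &mut ResultStore) -> CommandContext {
    if ctx.len() > max_bytes {
        CommandContext::Ref(store.put(ctx))
    } else {
        CommandContext::Inline(ctx)
    }
}

/// Resolves the context at claim time so the worker always gets full data.
pub fn resolve(ctx: &CommandContext, store: &ResultStore) -> Option<String> {
    match ctx {
        CommandContext::Inline(s) => Some(s.clone()),
        CommandContext::Ref(id) => store.get(id).cloned(),
    }
}
```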

Next concrete steps

✅ CQRS PUBLISH_ONLY flip-readiness: DONE, incl. the monitoring gate. The server is FLIP-READY. All three flip blockers are closed: (a) ack-after-materialize durability (deferred ack + worker materializer loop, done + fault-tested), (b) off-server-drive × gate reconciliation (#104, server v3.29.2 cold_rebuild, proven), (c) the 2 ExecutionService cancel/finalize sites (server#240 → v3.29.3, kind-proven both modes). No remaining synchronous server noetl.event writers under the gate.
✅ Materializer-lag guardrail SHIPPED (worker #116 v5.35.0 lag gauge on an independent poller + ops #195/#196 VMRule + worker scrape + VMAlert + dashboard + flip runbook noetl-cqrs-publish-only-flip.md); kind-proven induce→fire→recover→clear.
✅ GKE pre-flip PREP landed (2026-06-19, staged): prod verified already-Rust (the #49 cutover is done; live images pre-#103), both flip secrets present, prod monitoring = Google Managed Prometheus (not VM). Pushed the post-#103 images to the prod AR (server v3.29.3 @sha256:6d2de32… + worker v5.35.0); applied + verified GMP monitoring (PodMonitoring + materializer-lag Rules, translated from the kind VMRule); staged the roll-forward manifests (not applied).
The remaining steps are all operator-gated, in order: (1) roll the live system pool → v5.35.0 then the live server → v3.29.3; (2) enable the materializer shadow (NOETL_MATERIALIZER_ENABLED=true, gate still off), which is the green-baseline check; (3) wire the GMP managedAlertmanager pager (templated stub in the runbook); (4) flip NOETL_EVENT_INGEST_PUBLISH_ONLY=true, one revert away. No prod default changed; the gate stays default-off until that staged operator flip.
🐞 Off-server-drive cutover (#107/#111) — oversized-event class CLEARED (#114, server v3.29.5); now blocked on the refs_in_state consume side (#101). The #114 oversized-command.issued-event offload shipped (server#242 → v3.29.5): every command.issued event is < 1MB and 6 of #113's 9 large-context fixtures COMPLETE under the gate. The remaining 3 (test_output_select, test_storage_tiers, kind_playbook_lease_expiry) progress past the oversized-event wedge but hit DEEPER refs_in_state=false issues — the __orchestrate__ drive-state bloats (the drive command's WorkflowState reached 17.4MB for storage_tiers; ~1MB + a non-convergent loop for lease_expiry) and the _ref/bulk-resolve lazy-load gap (output_select). A kind experiment confirmed refs_in_state=true fixes the bloat (lease_expiry completes) but breaks the bulk-consuming fixtures because the worker render-time ref-resolution (the consume side) is not implemented. So the concrete next work is exactly #101's consume side: resolve noetl:// refs in render_context at the worker's tool-dispatch render time (+ cursor-claim handling) so over-budget results stay as refs in state. That unblocks the last 3 fixtures and the off-server-drive prod cutover. The #114 offload is an orthogonal safety cap that stays regardless.
Merge server#197 + worker#89; bump ai-meta pointers; update server wiki (results-by-reference resolution behavior) at pointer-bump time.
Follow-up (separate work, user-steered): extend the contract so event.result and command.context carry references only — never inline data — with an extracted predicate-fields block on the reference for round-trip-free when:/set: evaluation (the result_ref.extracted column was designed for this). Open design questions: predicate eval via extracted fields vs. full resolve; field selection via explicit output_select: vs. auto-extract. Part of the user's projection + reference + step-container / heterogeneous-runtime model.

Related

Umbrella: Cursor / Claim Loop Mode — the cursor loop these fixes scale.
agents/rules/execution-model.md — "workers hydrate inputs from the shared cache" is the boundary results-by-reference resolution implements.
noetl/server deployment-specification · noetl/worker deployment-specification (NOETL_EVENT_RESULT_CONTEXT_MAX_BYTES).

Design note — CQRS event log + batched projection

Captured 2026-06-15 after the GKE db-g1-small benchmark + the session-vs-transaction pool-mode test. The write path — not compute — is the scaling wall.

Problem

At the small tier the PFT runs ~1.5 items/s, and the cost is round-trip count, not CPU. Per patient: several throttled API calls + a data store + ~5 noetl.event writes (lifecycle), all synchronous round-trips through PgBouncer to a 1-vCPU Cloud SQL. The noetl.event log is the write-hot bottleneck, and beyond a point batching can't fix it: event volume (per-patient, per-API-call) is intrinsic to the workload when per-call granularity is needed for replay/audit/error-handling (the whole point of test_pft_flow_v2). A small synchronous Postgres is the wrong tool for high-volume append.

Two write paths (keep them separate)

Playbook data (collected patient info → demo_noetl). Batchable: hold a frame's results in the shared cache / GCS, write as one multi-row INSERT.
noetl.event log (platform event sourcing). The bottleneck — needs a log-optimized store, not bigger batches.

Target architecture — CQRS

Split the write model (append-only event log) from the read model (state/projections), each in the store it's good at:

Write side → NATS JetStream (already in the stack; no Kafka unless we outgrow it). Workers publish events here — durable, replicated, built for high-throughput sequential append. JetStream-ack is the commit point.
Projector → consumes JetStream in batches, folds into noetl.projection / noetl.projection_snapshot (indexed, queryable) and archives cold events (Postgres archive table or GCS) for replay/audit at a relaxed cadence.
Read side → the orchestrator reads the projection, never the raw event stream. v3.9.0's snapshot-cached WorkflowState (read bounded state, not full replay) is already the read-model half of this.

State + projections are not lost — they're derived by the projector.
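
The projector's batch fold can be sketched as below. This is a simplified illustration under assumptions, not the `system/projector` playbook or the server's projection code: the `Event` fields, the in-memory `Projector`, and the per-execution list of event kinds standing in for a projection row are hypothetical. The `projected` / `duplicates` counters mirror the `{projected, duplicates}` report quoted in the activity table, and deduplicating by `event_id` is what makes re-folding a redelivered batch harmless.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical, simplified event as it would arrive from the stream.
#[derive(Clone, Debug)]
pub struct Event {
    pub event_id: u64,
    pub execution_id: u64,
    pub kind: String,
}

/// Outcome of folding one batch.
#[derive(Default, Debug, PartialEq)]
pub struct FoldReport {
    pub projected: usize,
    pub duplicates: usize,
}

/// In-memory stand-in for the read model the orchestrator queries.
#[derive(Default)]
pub struct Projector {
    seen: HashSet<u64>,
    projection: HashMap<u64, Vec<String>>,
}

impl Projector {
    /// Folds a batch into the projection, skipping event ids already folded.
    pub fn fold_batch(&mut self, batch: &[Event]) -> FoldReport {
        let mut report = FoldReport::default();
        for ev in batch {
            if self.seen.insert(ev.event_id) {
                self.projection
                    .entry(ev.execution_id)
                    .or_default()
                    .push(ev.kind.clone());
                report.projected += 1;
            } else {
                report.duplicates += 1;
            }
        }
        report
    }

    /// Read side: the orchestrator reads the projection, never the stream.
    pub fn read(&self, execution_id: u64) -> &[String] {
        self.projection
            .get(&execution_id)
            .map(|kinds| kinds.as_slice())
            .unwrap_or_default()
    }
}
```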

Durability contract (answers "don't lose the step-instance cache on crash")

The truth never lives in the worker. A worker-local buffer is fine as staging, but the durable commit is:

Worker batch-publishes its events to JetStream; acks the command only after JetStream confirms. Crash after publish → events survive in the stream.
Crash before publish → the command was never acked → the command queue re-delivers it → the body re-runs, safely, because apply is idempotent (the cursor's cursor_issued/cursor_completed id-sets deduplicate repeated events).

So a step-instance runtime error costs a re-execution, never data.
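
The contract above can be walked through with a small simulation. Everything here is a hypothetical sketch: `Stream` is an in-memory stand-in for JetStream, `run_command` models one delivery of a command, and `CursorState` keeps only the two id-sets named in the text. It shows that a crash before publish leaves nothing behind and no ack, and that a repeated delivery adds duplicate events to the stream without changing the folded state.

```rust
use std::collections::BTreeSet;

/// In-memory stand-in for the durable stream: (item_id, kind) pairs.
#[derive(Default)]
pub struct Stream {
    pub events: Vec<(u64, String)>,
}

/// Only the two id-sets the text names; real state carries far more.
#[derive(Default)]
pub struct CursorState {
    pub cursor_issued: BTreeSet<u64>,
    pub cursor_completed: BTreeSet<u64>,
}

impl CursorState {
    /// Idempotent apply: returns true only if the event changed the state.
    pub fn apply(&mut self, item_id: u64, kind: &str) -> bool {
        match kind {
            "issued" => self.cursor_issued.insert(item_id),
            "completed" => self.cursor_completed.insert(item_id),
            _ => false,
        }
    }
}

/// One delivery of a command. Returns true when the command may be acked,
/// which happens only after the publish succeeded.
pub fn run_command(item_id: u64, stream: &mut Stream, crash_before_publish: bool) -> bool {
    // The body runs and stages its events in a worker-local buffer.
    let buffered = vec![
        (item_id, "issued".to_string()),
        (item_id, "completed".to_string()),
    ];
    if crash_before_publish {
        // Buffer is lost with the worker; no ack, so the queue re-delivers.
        return false;
    }
    stream.events.extend(buffered);
    true
}

/// Folds the whole stream into the state; returns how many events changed it.
pub fn fold_stream(stream: &Stream, state: &mut CursorState) -> usize {
    let mut changed = 0;
    for (item_id, kind) in &stream.events {
        if state.apply(*item_id, kind) {
            changed += 1;
        }
    }
    changed
}
```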

Trade-offs

Eventual consistency — the projection lags the log by the projector's batch interval; the orchestrator must tolerate read-your-writes lag (or the projector commits the slice it just folded before evaluate).
Operational — a projector to run + JetStream retention / cold-event archive policy. More moving parts, but the established CQRS shape, and NoETL already has the substrate (noetl.outbox, noetl.projection, system/projector).

Roadmap — sequence, don't big-bang

Step 1 — cheap wins (no re-architecture). Frame-batch the cursor body (one bulk data write + one control-plane event-set per frame, not per patient); batch the server's noetl.event INSERTs (group-commit / POST /api/events/batch); a bigger Cloud SQL tier. Targets ~5–10× on the small tier. Scoped as the next concrete work.
Step 2 — CQRS event-log split (#103, in progress). Events → JetStream (write log) → batch projector → projection tables (read model) → orchestrator reads projection. The natural extension of block-b stage 2; where "the event log is the bottleneck" stops being true. Refined shape (per data-access-boundary.md): the projector is a system/* catalog playbook on the system worker pool, not bespoke Rust — it consumes the noetl_events JetStream stream in batches and folds the read model via the server internal API (/events/project, already shipped). Phased: 2a producer — a background event-log → JetStream tailer (not a DB trigger; storage-agnostic queue, batched), PR server#202, default off · 2b system/projector playbook (JetStream batch consumer → projection transaction) · 2c orchestrator reads projection · 2d cutover (worker publishes to the same stream, drop the synchronous noetl.event INSERT + the tailer).
