Umbrella Rust Server Port

Umbrella — Rust Server Port (PRIMARY)

ai-task: noetl/ai-meta#49 · Opened: 2026-06-02 · Last update: 2026-06-02 · Priority: PRIMARY (interlocked with Umbrella: System Pool Design — #46) · Target crate: noetl/server (currently v2.0.1, early skeleton) · Source: noetl/noetl/server/ Python FastAPI

Goal

Port the Python noetl-server (FastAPI / uvicorn, ~15-20k LoC, 87 route decorators) to the existing Rust noetl/server crate. Full HTTP API parity so the gateway + workers + CLI don't notice a swap. Cutover via strangler-fig at the ingress layer.

Visual — interlock with #46

   ┌──────────────────────┐          ┌──────────────────────────┐
   │  noetl/ai-meta #49   │          │   noetl/ai-meta #46      │
   │  Rust server port    │          │   System pool playbooks  │
   │  (THIS UMBRELLA)     │          │                          │
   │                      │          │                          │
   │  Phase A: reads      │          │                          │
   │  Phase B: writes     │          │                          │
   │  Phase C: internal   ├─────────►│  Phase 2:                │
   │   endpoints          │ unblocks │   system/outbox_publisher│
   │                      │          │   system/projector       │
   │  Phase D: engine     │          │                          │
   │  Phase E: SSE etc    │          │  Phase 1.b: deployment   │
   │  Phase F: shards     │          │                          │
   └──────────┬───────────┘          └──────────┬───────────────┘
              │                                 │
              │ produces                        │ consumes
              ▼                                 ▼
   ┌──────────────────────────────────────────────────────────────┐
   │ HTTP API surface (server is the data gatekeeper)             │
   │                                                              │
   │  POST /api/internal/outbox/claim                             │
   │  POST /api/internal/outbox/mark-published                    │
   │  POST /api/internal/outbox/mark-failed                       │
   │  GET  /api/internal/outbox/pending-count                     │
   │  POST /api/internal/events/project                           │
   └──────────────────────────────────────────────────────────────┘

Phase C lands on both Python (noetl/noetl) and Rust (noetl/server) in parallel — system playbooks call HTTP, not DB, so they don't care which server is responding. This is the single PR that unblocks both umbrellas' next phases.

Why now

Three architectural decisions in the same 2026-06-02 session converge to make this a top priority:

System worker pool requires /api/internal/* endpoints (per data-access-boundary rule) — workers don't touch noetl.* direct; they call the server. Those endpoints don't exist yet in Python OR Rust.
Sharding readiness — the platform's path to multi-region / multi-tenant scale runs through a sharded server. Re-engineering the Python FastAPI server for sharding is comparable cost to a Rust port that does sharding correctly from day one.
Python footprint reduction — after the publisher + projector retire via #46 system playbooks, the FastAPI server is the largest remaining Python service. Porting it closes the loop on the runtime hot path.

Constraints (from latest architectural decisions)

Rule	Implication for this port
Data access boundary	Rust server is the only thing that talks to `noetl.` directly; new `/api/internal/` endpoints land here for the system pool
Execution model	Server stays the gatekeeper for data + the orchestrator of state machines; doesn't move to playbooks
Observability	Every endpoint ships with span + metric + `execution_id` correlation in the same change set
Deployment validation	Kind-first per endpoint port; production cutover via ingress flip on prod-shaped env
API contract preserved	Rust request/response shapes are byte-identical to Python's during migration; no drift; no "new and improved" during port
Sharding-first	Every endpoint that touches per-execution state derives `execution_id`; routing layer built in from day one
Strangler-fig cutover	Endpoint-by-endpoint flip via ingress, never big-bang

Phases

Phase A — Read endpoints (no orchestrator dependency)

Routes that just read DB state. Lowest risk; biggest test of the read path. Many already scaffolded in repos/server/src/handlers/.

Acceptance: every read endpoint returns byte-identical JSON to the Python version against the same DB state. Diff harness in kind validation.

Phase B — Worker write boundary

Endpoints the Rust worker uses to emit results. Must be solid.

POST /api/events (worker's put_result) — already wired; verify under load
POST /api/catalog/register — already wired
POST /api/credentials (encrypted-at-rest write)
POST /api/keychain
POST /api/runtime/heartbeat
POST /api/runtime/register

Acceptance: Rust worker pointed at Rust server completes a full playbook execution against kind with event log identical to Python-server-pointed run.

Phase C — Internal endpoints for system pool (UNBLOCKS #46 Phase 2)

NEW endpoints — Python doesn't have them today. Lands on BOTH Python AND Rust so the system pool can deploy against either during migration.

Python side ✅ landed + kind-validated 2026-06-02 via #659 (v4.10.0) + #660 (v4.10.1 with kind-validation fixes).

Rust side ✅ landed + kind-validated 2026-06-02 via #12 (v2.1.0) + #13 (v2.1.1 with schema fix) + #14 (axum 0.8 route-syntax fix — without this v2.1.1 panics in Router::route() at startup before binding the HTTP listener) + noetl/ops#147 (kind deployment manifest).

All 11 assertions of automation/development/validate-internal-api.sh pass against the Rust server (same harness that validated Python; identical pass rate).

Endpoint	Python	Rust
`POST /api/internal/outbox/claim?limit=N`	✅ kind-validated	✅ kind-validated
`POST /api/internal/outbox/mark-published`	✅ kind-validated	✅ kind-validated
`POST /api/internal/outbox/mark-failed`	✅ kind-validated (backoff confirmed)	✅ kind-validated (1s backoff for attempts=1)
`GET /api/internal/outbox/pending-count` (KEDA scaler source)	✅ kind-validated	✅ kind-validated
`POST /api/internal/events/project`	✅ kind-validated (fresh + idempotent)	✅ kind-validated (fresh + idempotent)
ServiceAccount bearer-token auth gate	✅ kind-validated (403 + 503 paths)	✅ kind-validated (403 on no-auth / wrong-token / Basic-scheme)
Span (`tracing::instrument`) + `execution_id` per endpoint	⚠️ partial	✅ tracing spans (Prometheus metrics deferred)

Three real-world bugs found + fixed during kind validation (see Sessions Log 2026-06-02 (late evening)):

Python router prefix double-prefix — /api/internal → /internal (Python-only).
Python dict-row tuple subscript in pending-count (Python-only).
noetl.event schema mismatch — timestamp column missing, NOT NULL columns absent, partitioned table doesn't support ON CONFLICT (event_id) (both Python + Rust).

Validation harness: noetl/ops automation/development/validate-internal-api.sh (PR awaiting merge).

Acceptance: system worker pool on kind runs system/outbox_publisher end-to-end against either the Python or Rust server's internal endpoints. Python side validated; full pipeline lands when #46 Phase 1.b deploys the system pool.

Phase D — Orchestrator engine port (the big lift)

Python's catalog → command-generation → state-machine logic. ~5-8k LoC Python in repos/noetl/noetl/server/. Rust skeleton at repos/server/src/engine/ (~1,967 LoC).

POST /api/execute — full port: persists initial command, publishes NATS notification, snowflake-generated event_id (noetl/server#27, #28, #29)
State-machine orchestrator wired into event ingest — trigger_orchestrator loads events, calls WorkflowOrchestrator::evaluate, persists generated events + commands, emits terminal playbook.completed / playbook.failed (noetl/server#31, Phase D R2)
persist_engine_command extracted as shared helper for /api/execute and orchestrator paths
noetl-executor crate already feeds WorkflowOrchestrator (Phase D R1 survey closed the gap)

Acceptance: full execution lifecycle handled by Rust server, replayable against the same event log as Python. Kind validation tests/fixtures/r2_two_step 2-step linear playbook runs end-to-end, terminates at playbook.completed (Phase D R2, ai-meta@TBD). Phase D materially complete — remaining work is conditional/iterator/parallel control-flow coverage (future rounds), not the core engine wiring.

Phase E — SSE + remaining endpoints

GET /api/executions/{id}/events/stream — SSE for the gateway (axum has SSE support built-in)
Remaining ~20-30 Python @router routes triaged; port the ones with callers; drop the ones without

Phase F — Sharding design + cutover

shard_id = hash(execution_id) % N
StatefulSet deployment (replaces today's Deployment)
Inter-shard coordination (catalog + credentials shared; executions sharded)
Gateway/load-balancer extension to route by execution_id header
Migration path: single-replica StatefulSet → scale to N → cutover
Helm chart values: server.replicas → server.shards
Production cutover — flip ingress; Python server retires

Sharding sequence — request flow with N shards

Client          Gateway          Server-Shard-0    Server-Shard-1     Postgres
  │                │                   │                 │                │
  │ POST /api/execute                  │                 │                │
  │ (no execution_id yet)              │                 │                │
  ├───────────────►│                   │                 │                │
  │                │ pick any shard    │                 │                │
  │                │ (load-balance)    │                 │                │
  │                ├──────────────────►│                 │                │
  │                │                   │ generate eid    │                │
  │                │                   │ (snowflake)     │                │
  │                │                   │ INSERT execution│                │
  │                │                   ├─────────────────┼───────────────►│
  │                │ ◄─────────────────┤                 │                │
  │ ◄──────────────┤ {execution_id: 12345}               │                │
  │                │                                     │                │
  │ GET /api/executions/12345                            │                │
  │ X-Execution-ID: 12345                                │                │
  ├───────────────►│                                     │                │
  │                │ shard = 12345 % 2 = 1               │                │
  │                ├─────────────────────────────────────►│                │
  │                │                                     │ SELECT state    │
  │                │                                     ├────────────────►│
  │                │ ◄───────────────────────────────────┤                │
  │ ◄──────────────┤ {status: RUNNING, ...}              │                │
  │                                                                       │
  │  Phase C internal endpoints (system pool calls)                       │
  │                                                                       │
  │  POST /api/internal/outbox/claim                                      │
  │  X-Execution-ID: not required (claim is shard-aware via worker pod)   │
  ├───────────────►│                                                      │
  │                │  outbox claim fans out across all shards via         │
  │                │  per-shard `outbox-publisher-<shard>` subscription   │
  │                │  (each system pool worker pod owns a shard slice    │
  │                │   like the projector does today)                     │
  ▼                ▼                                                      ▼

Sharding rules:

Resource	Strategy
`noetl.execution` / `noetl.event` / `noetl.command` / `noetl.outbox`	Per-execution_id sharding (write to owning shard)
`noetl.catalog` / `noetl.credential` / `noetl.keychain`	Shared (read from any shard, write to designated leader shard)
`noetl.runtime` (worker heartbeats)	Per-pool sharding (worker_id hash)

Migration to sharded mode is N=1 first (no functional change), then scale to N=3 with executionID % N routing.

Out of scope

System worker pool runtime + system playbooks → Umbrella: System Pool Design (#46). This umbrella provides the endpoints #46 needs; #46 builds the deployment + playbooks.
Rust worker tool-kind gaps → Umbrella: Rust Worker Parity Gaps (#47 + #48). Orthogonal.
Container tool kind callback → Umbrella: Container Tool Callback (#43). Orthogonal.
DSL parser / noetl-tools / noetl-executor — separate Rust crates; not part of the server port.

First three sub-issues (next session opens these)

Per the issue-tracking convention, file these against noetl/server when work begins:

Phase A read-endpoint parity audit + diff harness — surfaces drift between Rust and Python responses for already-wired endpoints.
Phase C internal endpoints (LANDS FIRST — even before Phase A finishes) — /api/internal/outbox/* + /api/internal/events/project on BOTH Python and Rust. Unblocks #46 Phase 2.
Event envelope crate (EE-4 from TaskList #51) — noetl-events shared crate that worker + executor + server depend on.

Recent activity

Date	Event
2026-06-02	Umbrella filed during the architecture-pivot session. Priority PRIMARY alongside #46.
2026-06-02	Cross-linked from #46 — the two umbrellas interlock (#49 provides API endpoints #46's playbooks consume).
2026-06-02	No code work started. Phase 1 plan + first three sub-issues defined; ready to pick up.
2026-06-02	Phase A read-endpoint parity harness landed (noetl/server#18 + #20). Phase C internal endpoints wired (#17, #19, #25). v2.2.0 shipped.
2026-06-03 (morning)	Phase B write-boundary parity complete — Prometheus surface + all 6 write endpoints instrumented (noetl/server#21, #23). Rust worker → Rust server e2e validated; 60k/60k load smoke at 920 req/s, p99 164ms. v2.4.0 shipped.
2026-06-03 (afternoon)	`/api/execute` full port (noetl/server#27) + `args:null` fix (#28) + result-envelope shape compliance (#29 → noetl/server#30). Phase D R1 survey closed; template resolution already wired via existing `TemplateRenderer.render_value`.
2026-06-03 (evening)	Phase E SSE port (`GET /api/executions/{id}/events/stream`) shipped. Phase D R2 orchestrator wired into event ingest (noetl/server#31, v2.5.0). Kind validation: 2-step linear playbook (`tests/fixtures/r2_two_step`) runs end-to-end on Rust server, event log terminates at `playbook.completed`.
2026-06-04 (evening)	Phase D R3a — `step.when` enable guard wired through orchestrator (noetl/server#32, v2.6.0). Skip-chain walks inline through skipped steps' `next.arcs` until landing on a guard-passing step or `end`. Companion fix in `state.build_context` exposes workload under `workload.*` namespace. Two new unit tests + kind validation on `tests/fixtures/r3a_conditional` (both `skip_middle=true` and `=false` terminate at `playbook.completed`).
2026-06-04 (late evening)	Phase D R3b — `step.loop` iterator fan-out + state aggregation (noetl/server#33, v2.7.0). Orchestrator detects `step.loop`, evaluates `loop.in_expr`, emits ONE `step.enter` (with `iterations_expected` in context) + N iteration commands via `build_iteration_command` with per-iteration `command_id` shape `<exec>:<step>:<event>:i<index>`. `StepInfo` tracks `iterations_expected` + dedup `iteration_command_ids` HashSet + `iteration_results`; step flips to `Completed` only after all N distinct command_ids land. 5 new unit tests (28/28 engine). Kind validation: orchestrator path validated (fan-out + NATS publish + worker claim); full e2e completion gated on R3b-2 (worker-side Python iter-variable injection) + dual-worker NATS subject race.
2026-06-05 (early morning)	Phase D R3c — parallel branches via `next.arcs` mode:inclusive (noetl/server#34, v2.8.0). Phase D R3 closes. Multi-target transitions already dispatched correctly via existing evaluator path; this PR fixes the orchestrator's `if next_step_name == "end"` early-return that falsely completed workflows when ONE parallel branch hit `end` while siblings were still in flight. New `reached_end` flag + `reached_end_quiescent` clause at end of `process_in_progress`: completion fires only when reached_end AND no new commands queued AND no other branches running. 3 new orchestrator unit tests (31/31 engine). Kind validation: orchestrator triggered exactly 3 times as designed; final `playbook.completed` blocked by dual-worker NATS subject race (same gap as R3b — not an R3c issue).
2026-06-05 (midday)	noetl/ai-meta#53 closes — Phase D R3 e2e drained for R3a/R3c/R2. Three PRs landed: noetl/server#35 (v2.8.1) makes `/api/worker/pool/register` accept `component_type` alias the Rust worker actually sends; noetl/worker#41 (v5.11.0) makes the worker honor `server_url` from each NATS command notification (new `ControlPlaneClient::with_server_url` + `CommandExecutor::execute_with_server_url`); noetl/server#36 (v2.8.2) adds the R3a skip-chain re-entry guard surfaced once the multi-trigger path worked. End-to-end kind validation with both Rust + Python worker pools running: R2 (2-step linear), R3a (both skip_middle branches), R3c (parallel) all reach `playbook.completed` cleanly without dual-worker workarounds. R3b iterators still gated on R3b-2 (worker-side Python iter-variable injection).
2026-06-05 (late afternoon)	Phase D R3 e2e fully closed (noetl/server#37, v2.8.3) — R3b iterators reach `playbook.completed` after all N iterations. Two fixes: (1) `build_iteration_command` injects iteration vars into `tool.config.args` (worker Python sees them as globals via `globals().update(args)` — fixes the long-standing `NameError: name 'item' is not defined`); (2) `state.rs` step.enter handler now reads `iterations_expected` from `event.result.context` (the canonical storage shape after `trigger_orchestrator` wraps via the `{status, context}` envelope), so the iteration aggregation actually fires and `command.completed` no longer falls into the "plain step" branch that prematurely marked the step Completed. New reproducer unit test; 32/32 engine tests pass. All four fixtures end-to-end on kind: R2 (linear), R3a (both branches), R3b (3 iterations), R3c (parallel) → `playbook.completed`.
2026-06-07	Phase F R3b sequence complete (drift-guard end-to-end). Three rounds shipped in sequence: R3b-1 (noetl/server#47, v2.12.0) adds `GET /api/runtime/shard-info` returning the server's `shard_for()` result; R3b-2 (noetl/gateway#26, v3.2.0) adds the gateway twin `GET /sharding/preview` (local compute, NOT proxy); R3b-3 (noetl/ops#158) ships `automation/development/validate-shard-drift-guard.sh` posting to both endpoints across 30 `(execution_id, shard_count)` pairs and asserting `shard_index` agreement. Catches drift modes the per-side unit tests can't see — twox-hash crate version split, `SHARD_HASH_SEED` divergence, i64→bytes endianness flip, hash crate algo change with non-major bump. Server side sanity-validated against noetl-server-rust v2.12.0 in kind (30/30 pairs return expected varying `shard_index` distribution); full cross-side execution deferred to operator since gateway v3.2.0 not yet in local kind. Phase F status: R1 + R1.5 + R2 + R3a + R3a-2 + R3b all ✅; next is R4 (DB sharding) or R5 (cutover).
2026-06-07	Phase F R4 kicked off — decision: per-shard physical Postgres, NOT Citus. Tracked under sub-issue noetl/server#48 with 5-round decomposition (R4-1 pool layer, R4-2 AppState wiring, R4-3 per-execution handler cutover, R4-4 cluster-wide list fan-out, R4-5 kind validation with N=2 shards). R4-1 shipped (noetl/server#49, v2.13.0): `ShardConnection` DSN type + `ShardingConfig` env loader + `DbPoolMap` N+1 pool container with `pool_for(execution_id)` / `cluster()` / `all_shards()` accessors. Single-pool fallback when `NOETL_SHARDS` empty — degenerate one-shard map whose cluster handle IS the only shard pool; behaviour bit-identical to today. 12 + 2 new tests pin `shard_for` routing against R3b drift-guard pairs so any future change to `pool_for` keeps the wire contract honest. No handler changes; R4 ships dormant until operator sets `NOETL_SHARDS`.

Phase D status (closed)

Round	Code	Kind E2E
R1 (template resolution)	✅ shipped in Phase B	✅
R2 (orchestrator wired into event ingest, #31)	✅ v2.5.0	✅
R3a (`step.when` enable guard, #32 + #36)	✅ v2.6.0 + v2.8.2	✅
R3b (`step.loop` iterator, #33 + #37)	✅ v2.7.0 + v2.8.3	✅
R3c (parallel branches, #34)	✅ v2.8.0	✅

Next concrete steps

EE-4 fully shipped ✅ — all three rounds merged + crates.io publish landed:

Round	PR	Outcome
1	noetl/cli#49	`noetl-events` workspace crate extracted from `noetl-executor::events`; executor 1-line re-exports.
2	noetl/cli#50	Crates.io publish prep + actual publish: `noetl-events 0.1.0`, `noetl-executor 0.4.0`, `noetl 4.9.0` all on crates.io after manual workflow dispatch on cli@d6e2432.
3	noetl/server#38	Server takes `noetl-events = "0.1"` direct dep, adds `From<ExecutorEvent>` + `TryFrom<&EventRequest>` impls, 4 wire-compat tests pinning the shared-subset round-trip. Server v2.9.0.

Design call: server's EventRequest legitimately retains its 5 server-only fields (result_kind, result_uri, event_ids, actionable, informative) and String wire format for execution_id / event_id (JSON-number precision for browser clients) — the two types stay distinct but the SHARED SUBSET is now anchored to the canonical noetl_events::ExecutorEvent.

Forward:

Phase F — sharding — design + cutover. Survey landed 2026-06-05 (below). R1 (design doc + endpoint inventory codification) shipped via noetl/server#40 → durable design doc on noetl/server wiki — sharding-design. R1.5 (app-side snowflake ID migration) is the next concrete sub-issue — load-bearing prerequisite per observability.md Principle 3; must land before R4 (DB sharding).
Remaining Python-only endpoints — scope a sweep through the Phase A parity-harness output to identify the long-tail of routes the Rust server doesn't yet cover (none currently block production traffic; the system pool + workers already operate on the Rust surface for all observed call patterns).

Phase F — sharding survey (2026-06-05)

Read-only investigation of the Rust server (repos/server/src/) to scope the sharding boundary. The architecture is ready: every per-execution operation is already keyed by execution_id, NATS subjects are already execution-aware (noetl.commands.{system|shared}.<execution_id> derived in publish_command_notification at execute.rs:535), and the orchestrator is stateless re: execution (loads events fresh from the DB on every trigger). Sharding is an orchestration problem, not a redesign.

Natural shard boundary

The execution. All state belonging to a single execution_id (events, commands, variables, outbox rows) lives on one shard. Cluster-wide state (catalog, credentials, keychain, runtime registration) is read-only from the execution's perspective and replicates across shards or lives behind a shared read-only path.

Endpoint inventory (40+ routes)

Per-execution (shardable by execution_id):

Endpoint	execution_id source
`POST /api/execute`	body (server generates snowflake)
`POST /api/events`	body (string-wire i64)
`POST /api/events/batch`	body
`GET /api/commands/{event_id}`	path → DB lookup
`POST /api/commands/{event_id}/claim`	path → DB lookup
`GET /api/executions/{execution_id}`	path
`GET /api/executions/{execution_id}/status`	path
`POST /api/executions/{execution_id}/cancel`	path
`GET /api/executions/{execution_id}/cancellation-check`	path
`POST /api/executions/{execution_id}/finalize`	path
`GET /api/executions/{execution_id}/events/stream` (SSE)	path
`GET / POST / DELETE /api/vars/{execution_id}`	path
`POST /api/internal/outbox/claim`	per-execution batch
`POST /api/internal/outbox/mark-published` / `mark-failed`	body
`POST /api/internal/events/project`	body

Cluster-wide (NOT shardable):

Endpoint	Reason
`GET /api/executions` (list)	cluster-wide query
`GET/POST /api/catalog/*`	global playbook catalog
`POST/GET /api/credentials/*`	global credential store
`POST/GET /api/keychain/{catalog_id}/*`	keyed by `catalog_id`, not `execution_id`
`POST /api/worker/pool/{register,deregister,heartbeat}`	worker pools are global
`GET /api/worker/pools`	cluster-wide read
`GET /api/internal/outbox/pending-count`	cluster-wide count (KEDA scaler trigger)

DB table inventory

Shardable by execution_id:

noetl.event — append-only event log
noetl.command — command queue
noetl.execution — execution records
noetl.outbox — transactional outbox
noetl.variables — ephemeral step-scoped cache

Cluster-wide:

noetl.catalog — playbook catalog
noetl.credential — credential store
noetl.keychain — keychain entries
noetl.runtime — worker pool registration/heartbeat

Three open design questions

Gateway-aware vs server-aware shard routing. Either the ingress (gateway / load balancer) inspects execution_id and routes to the owning shard, or any server replica accepts any request and proxies internally. Trade-off: gateway-aware is one network hop and centralizes the routing logic; server-aware is more resilient but adds inter-shard mTLS + an N-hop worst case. Lean recommendation: gateway-aware for Phase F (simpler); server-aware as a Phase G optimization if gateway becomes a bottleneck.
Modulo on full snowflake vs timestamp / machine_id. The snowflake i64 layout is [timestamp: 41][machine_id: 10][sequence: 12]. hash(execution_id) % N (full 64-bit) distributes evenly; timestamp % N creates time-based hotspots (all playbooks started in the same second hit one shard); machine_id % N clusters by generating machine. Decision: full i64 modulo. Cheap and even.
Catalog / credential / keychain sync strategy. These must be consistent across shards. Options: (a) shared single-master Postgres (simple; write bottleneck on heavy catalog churn); (b) multi-master with conflict resolution (risky); (c) read-only replicas per shard from a single master (highest ops cost, cleanest separation). Lean recommendation: (a) single-master for Phase F; revisit (c) as Phase G if replication lag becomes the limiting factor.

Proposed Phase F decomposition (5 rounds, Phase D shape)

Round	Scope	Output
R1	Sharding design doc + endpoint inventory codification	Sub-issue with the design call (gateway-aware, full-i64 modulo, single-master cluster-wide tables) and the 40+ endpoint inventory tagged shardable / metashard. ADR added to noetl.dev/docs.
R2	Server-side `shard_id()` helper + (optional) `route_to_shard()` proxy	New `repos/server/src/state.rs` helpers; HTTP proxy fallback for mis-routed requests (gives a safety net while gateway routing rolls out).
R3	Gateway-side dispatch + load balancer config	Wire ingress (Kong / nginx / Cloud LB) to extract `execution_id` and route by consistent hashing. Replicate the shard assignment in LB config or a small plugin.
R4	DB sharding + per-shard schema migration	Partition `event` / `command` / `execution` / `outbox` / `variables` by `execution_id` via Citus or per-shard separate schemas. Replicate `catalog` / `credential` / `keychain` / `runtime` (or keep on shared read-only path). Schema migration dry-run on kind first.
R5	Cutover + validation	Spin up N=2 shard replicas in kind, run the Phase D e2e harness across shards, confirm execution affinity (all events for one execution hit one shard), then prod canary + traffic split.

Load-bearing prerequisite — app-side snowflake IDs

Per agents/rules/observability.md Principle 3, execution_id should be generated in the application using a Rust snowflake library, not via the DB-side noetl.snowflake_id() function. Reasons listed in the rule: spans need the id at span-creation time; retries are idempotent only with a stable id across attempts; cross-component publish doesn't have to wait for the INSERT round trip; tests need deterministic ids; multi-cluster sharding can't agree on a single DB-side sequence.

Currently the Rust server still calls the DB function on every event / command create. This wants to flip before R4 (DB sharding) — once tables are partitioned, the DB-side sequence either becomes per-partition (clusters within a shard) or has to be globally synchronized across shards (defeats the point). Likely an R1.5 / R2 prerequisite — track it explicitly in the R1 design doc.

Out-of-scope for Phase F

Multi-region — that's Phase G or later.
Per-tenant sharding — orthogonal concern; Phase F is plain execution_id modulo.
Dynamic re-sharding (adding shards on the fly) — Phase F lands a fixed N; resizing is later.

Umbrella: System Pool Design (#46) — interlocks with this umbrella.
Umbrella: Container Tool Callback (#43) — orthogonal.
Umbrella: Rust Worker Parity Gaps (#47, #48) — orthogonal.
ADR: System Worker Pool and WASM Plug-in Surface — covers the data access boundary that shapes Phase C.
noetl-server wiki: Runtime shape — implementation-level companion.
Data Access Boundary — the rule that shapes Phase C.

NoETL Dashboard

Home — overview
Repo Map
Releases
Sessions Log

Active Umbrellas

Domain-Specific SLM Platform (#139) — RFC (design); travel#63 is the reference impl
Secrets Wallet (#61) — SECURITY (design)
Rust Server Port (#49) — PRIMARY
Decoupled Context + Event Chain (#115) — RFC (design), reframes #101
Orchestrator Scaling (#101) — reframed by #115; consume side = #115 Phase 1
Event WAL + Derivable Storage (#104) — Round 01 (locator) PR open
WASM Plug-in Compilation (#105) — system-pool plug-in hot-reload (ADR Phase 4)
System Pool Design (#46) — PRIMARY
Regression Baseline Migration (#98) — e2e
Subscription / Listener Tool (#90) — RFC
Container Tool Callback (#43)
Rust Worker Parity Gaps (#47 · #48)
Event Envelope Reconciliation (#51 in TaskList)

Closed Umbrellas

Cursor Loop Mode (#100) — server v3.8.0 + tools v3.10.1, 2026-06-15
Transfer Tool Credentials (#99) — tools v3.10.0 + worker v5.22.0, 2026-06-14
Explicit Input Binding (#77) — v3.0.0 shipped 2026-06-09
Rust Worker Migration (#30)
Python Services → Rust (#45)

Conventions

Per-repo wikis

noetl/noetl wiki — app + DSL
noetl/server wiki — Rust control plane
noetl/worker wiki — Rust pull worker
noetl/tools wiki — tool registry crate
noetl/cli wiki — CLI + local mode
noetl/gateway wiki — gatekeeper
noetl/ops wiki — Helm + manifests
noetl/travel wiki — domain SPA reference
Docs site — engineer-facing architecture

Uh oh!

Umbrella Rust Server Port

Umbrella — Rust Server Port (PRIMARY)

Goal

Visual — interlock with #46

Why now

Constraints (from latest architectural decisions)

Phases

Phase A — Read endpoints (no orchestrator dependency)

Phase B — Worker write boundary

Phase C — Internal endpoints for system pool (UNBLOCKS #46 Phase 2)

Phase D — Orchestrator engine port (the big lift)

Phase E — SSE + remaining endpoints

Phase F — Sharding design + cutover

Sharding sequence — request flow with N shards

Out of scope

First three sub-issues (next session opens these)

Recent activity

Phase D status (closed)

Next concrete steps

Phase F — sharding survey (2026-06-05)

Natural shard boundary

Endpoint inventory (40+ routes)

DB table inventory

Three open design questions

Proposed Phase F decomposition (5 rounds, Phase D shape)

Load-bearing prerequisite — app-side snowflake IDs

Out-of-scope for Phase F

Related

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

NoETL Dashboard

Active Umbrellas

Closed Umbrellas

Conventions

Per-repo wikis

Clone this wiki locally