Umbrella Rust Server Port

Umbrella — Rust Server Port (PRIMARY)

ai-task: noetl/ai-meta#49 · Opened: 2026-06-02 · Last update: 2026-06-02 · Priority: PRIMARY (interlocked with Umbrella: System Pool Design — #46) · Target crate: noetl/server (currently v2.0.1, early skeleton) · Source: noetl/noetl/server/ Python FastAPI

Goal

Port the Python noetl-server (FastAPI / uvicorn, ~15-20k LoC, 87 route decorators) to the existing Rust noetl/server crate. Full HTTP API parity so the gateway + workers + CLI don't notice a swap. Cutover via strangler-fig at the ingress layer.

Visual — interlock with #46

   ┌──────────────────────┐          ┌──────────────────────────┐
   │  noetl/ai-meta #49   │          │   noetl/ai-meta #46      │
   │  Rust server port    │          │   System pool playbooks  │
   │  (THIS UMBRELLA)     │          │                          │
   │                      │          │                          │
   │  Phase A: reads      │          │                          │
   │  Phase B: writes     │          │                          │
   │  Phase C: internal   ├─────────►│  Phase 2:                │
   │   endpoints          │ unblocks │   system/outbox_publisher│
   │                      │          │   system/projector       │
   │  Phase D: engine     │          │                          │
   │  Phase E: SSE etc    │          │  Phase 1.b: deployment   │
   │  Phase F: shards     │          │                          │
   └──────────┬───────────┘          └──────────┬───────────────┘
              │                                 │
              │ produces                        │ consumes
              ▼                                 ▼
   ┌──────────────────────────────────────────────────────────────┐
   │ HTTP API surface (server is the data gatekeeper)             │
   │                                                              │
   │  POST /api/internal/outbox/claim                             │
   │  POST /api/internal/outbox/mark-published                    │
   │  POST /api/internal/outbox/mark-failed                       │
   │  GET  /api/internal/outbox/pending-count                     │
   │  POST /api/internal/events/project                           │
   └──────────────────────────────────────────────────────────────┘

Phase C lands on both Python (noetl/noetl) and Rust (noetl/server) in parallel — system playbooks call HTTP, not DB, so they don't care which server is responding. This is the single PR that unblocks both umbrellas' next phases.

Why now

Three architectural decisions in the same 2026-06-02 session converge to make this a top priority:

System worker pool requires /api/internal/* endpoints (per data-access-boundary rule) — workers don't touch noetl.* direct; they call the server. Those endpoints don't exist yet in Python OR Rust.
Sharding readiness — the platform's path to multi-region / multi-tenant scale runs through a sharded server. Re-engineering the Python FastAPI server for sharding is comparable cost to a Rust port that does sharding correctly from day one.
Python footprint reduction — after the publisher + projector retire via #46 system playbooks, the FastAPI server is the largest remaining Python service. Porting it closes the loop on the runtime hot path.

Constraints (from latest architectural decisions)

Rule	Implication for this port
Data access boundary	Rust server is the only thing that talks to `noetl.` directly; new `/api/internal/` endpoints land here for the system pool
Execution model	Server stays the gatekeeper for data + the orchestrator of state machines; doesn't move to playbooks
Observability	Every endpoint ships with span + metric + `execution_id` correlation in the same change set
Deployment validation	Kind-first per endpoint port; production cutover via ingress flip on prod-shaped env
API contract preserved	Rust request/response shapes are byte-identical to Python's during migration; no drift; no "new and improved" during port
Sharding-first	Every endpoint that touches per-execution state derives `execution_id`; routing layer built in from day one
Strangler-fig cutover	Endpoint-by-endpoint flip via ingress, never big-bang

Phases

Phase A — Read endpoints (no orchestrator dependency)

Routes that just read DB state. Lowest risk; biggest test of the read path. Many already scaffolded in repos/server/src/handlers/.

Acceptance: every read endpoint returns byte-identical JSON to the Python version against the same DB state. Diff harness in kind validation.

Phase B — Worker write boundary

Endpoints the Rust worker uses to emit results. Must be solid.

POST /api/events (worker's put_result) — already wired; verify under load
POST /api/catalog/register — already wired
POST /api/credentials (encrypted-at-rest write)
POST /api/keychain
POST /api/runtime/heartbeat
POST /api/runtime/register

Acceptance: Rust worker pointed at Rust server completes a full playbook execution against kind with event log identical to Python-server-pointed run.

Phase C — Internal endpoints for system pool (UNBLOCKS #46 Phase 2)

NEW endpoints — Python doesn't have them today. Lands on BOTH Python AND Rust so the system pool can deploy against either during migration.

Python side ✅ landed + kind-validated 2026-06-02 via #659 (v4.10.0) + #660 (v4.10.1 with kind-validation fixes).

Rust side ✅ landed + kind-validated 2026-06-02 via #12 (v2.1.0) + #13 (v2.1.1 with schema fix) + #14 (axum 0.8 route-syntax fix — without this v2.1.1 panics in Router::route() at startup before binding the HTTP listener) + noetl/ops#147 (kind deployment manifest).

All 11 assertions of automation/development/validate-internal-api.sh pass against the Rust server (same harness that validated Python; identical pass rate).

Endpoint	Python	Rust
`POST /api/internal/outbox/claim?limit=N`	✅ kind-validated	✅ kind-validated
`POST /api/internal/outbox/mark-published`	✅ kind-validated	✅ kind-validated
`POST /api/internal/outbox/mark-failed`	✅ kind-validated (backoff confirmed)	✅ kind-validated (1s backoff for attempts=1)
`GET /api/internal/outbox/pending-count` (KEDA scaler source)	✅ kind-validated	✅ kind-validated
`POST /api/internal/events/project`	✅ kind-validated (fresh + idempotent)	✅ kind-validated (fresh + idempotent)
ServiceAccount bearer-token auth gate	✅ kind-validated (403 + 503 paths)	✅ kind-validated (403 on no-auth / wrong-token / Basic-scheme)
Span (`tracing::instrument`) + `execution_id` per endpoint	⚠️ partial	✅ tracing spans (Prometheus metrics deferred)

Three real-world bugs found + fixed during kind validation (see Sessions Log 2026-06-02 (late evening)):

Python router prefix double-prefix — /api/internal → /internal (Python-only).
Python dict-row tuple subscript in pending-count (Python-only).
noetl.event schema mismatch — timestamp column missing, NOT NULL columns absent, partitioned table doesn't support ON CONFLICT (event_id) (both Python + Rust).

Validation harness: noetl/ops automation/development/validate-internal-api.sh (PR awaiting merge).

Acceptance: system worker pool on kind runs system/outbox_publisher end-to-end against either the Python or Rust server's internal endpoints. Python side validated; full pipeline lands when #46 Phase 1.b deploys the system pool.

Phase D — Orchestrator engine port (the big lift)

Python's catalog → command-generation → state-machine logic. ~5-8k LoC Python in repos/noetl/noetl/server/. Rust skeleton at repos/server/src/engine/ (~1,967 LoC).

POST /api/execute — full port: persists initial command, publishes NATS notification, snowflake-generated event_id (noetl/server#27, #28, #29)
State-machine orchestrator wired into event ingest — trigger_orchestrator loads events, calls WorkflowOrchestrator::evaluate, persists generated events + commands, emits terminal playbook.completed / playbook.failed (noetl/server#31, Phase D R2)
persist_engine_command extracted as shared helper for /api/execute and orchestrator paths
noetl-executor crate already feeds WorkflowOrchestrator (Phase D R1 survey closed the gap)

Acceptance: full execution lifecycle handled by Rust server, replayable against the same event log as Python. Kind validation tests/fixtures/r2_two_step 2-step linear playbook runs end-to-end, terminates at playbook.completed (Phase D R2, ai-meta@TBD). Phase D materially complete — remaining work is conditional/iterator/parallel control-flow coverage (future rounds), not the core engine wiring.

Phase E — SSE + remaining endpoints

GET /api/executions/{id}/events/stream — SSE for the gateway (axum has SSE support built-in)
Remaining ~20-30 Python @router routes triaged; port the ones with callers; drop the ones without

Phase F — Sharding design + cutover

shard_id = hash(execution_id) % N
StatefulSet deployment (replaces today's Deployment)
Inter-shard coordination (catalog + credentials shared; executions sharded)
Gateway/load-balancer extension to route by execution_id header
Migration path: single-replica StatefulSet → scale to N → cutover
Helm chart values: server.replicas → server.shards
Production cutover — flip ingress; Python server retires

Sharding sequence — request flow with N shards

Client          Gateway          Server-Shard-0    Server-Shard-1     Postgres
  │                │                   │                 │                │
  │ POST /api/execute                  │                 │                │
  │ (no execution_id yet)              │                 │                │
  ├───────────────►│                   │                 │                │
  │                │ pick any shard    │                 │                │
  │                │ (load-balance)    │                 │                │
  │                ├──────────────────►│                 │                │
  │                │                   │ generate eid    │                │
  │                │                   │ (snowflake)     │                │
  │                │                   │ INSERT execution│                │
  │                │                   ├─────────────────┼───────────────►│
  │                │ ◄─────────────────┤                 │                │
  │ ◄──────────────┤ {execution_id: 12345}               │                │
  │                │                                     │                │
  │ GET /api/executions/12345                            │                │
  │ X-Execution-ID: 12345                                │                │
  ├───────────────►│                                     │                │
  │                │ shard = 12345 % 2 = 1               │                │
  │                ├─────────────────────────────────────►│                │
  │                │                                     │ SELECT state    │
  │                │                                     ├────────────────►│
  │                │ ◄───────────────────────────────────┤                │
  │ ◄──────────────┤ {status: RUNNING, ...}              │                │
  │                                                                       │
  │  Phase C internal endpoints (system pool calls)                       │
  │                                                                       │
  │  POST /api/internal/outbox/claim                                      │
  │  X-Execution-ID: not required (claim is shard-aware via worker pod)   │
  ├───────────────►│                                                      │
  │                │  outbox claim fans out across all shards via         │
  │                │  per-shard `outbox-publisher-<shard>` subscription   │
  │                │  (each system pool worker pod owns a shard slice    │
  │                │   like the projector does today)                     │
  ▼                ▼                                                      ▼

Sharding rules:

Resource	Strategy
`noetl.execution` / `noetl.event` / `noetl.command` / `noetl.outbox`	Per-execution_id sharding (write to owning shard)
`noetl.catalog` / `noetl.credential` / `noetl.keychain`	Shared (read from any shard, write to designated leader shard)
`noetl.runtime` (worker heartbeats)	Per-pool sharding (worker_id hash)

Migration to sharded mode is N=1 first (no functional change), then scale to N=3 with executionID % N routing.

Out of scope

System worker pool runtime + system playbooks → Umbrella: System Pool Design (#46). This umbrella provides the endpoints #46 needs; #46 builds the deployment + playbooks.
Rust worker tool-kind gaps → Umbrella: Rust Worker Parity Gaps (#47 + #48). Orthogonal.
Container tool kind callback → Umbrella: Container Tool Callback (#43). Orthogonal.
DSL parser / noetl-tools / noetl-executor — separate Rust crates; not part of the server port.

First three sub-issues (next session opens these)

Per the issue-tracking convention, file these against noetl/server when work begins:

Phase A read-endpoint parity audit + diff harness — surfaces drift between Rust and Python responses for already-wired endpoints.
Phase C internal endpoints (LANDS FIRST — even before Phase A finishes) — /api/internal/outbox/* + /api/internal/events/project on BOTH Python and Rust. Unblocks #46 Phase 2.
Event envelope crate (EE-4 from TaskList #51) — noetl-events shared crate that worker + executor + server depend on.

Recent activity

Date	Event
2026-06-02	Umbrella filed during the architecture-pivot session. Priority PRIMARY alongside #46.
2026-06-02	Cross-linked from #46 — the two umbrellas interlock (#49 provides API endpoints #46's playbooks consume).
2026-06-02	No code work started. Phase 1 plan + first three sub-issues defined; ready to pick up.
2026-06-02	Phase A read-endpoint parity harness landed (noetl/server#18 + #20). Phase C internal endpoints wired (#17, #19, #25). v2.2.0 shipped.
2026-06-03 (morning)	Phase B write-boundary parity complete — Prometheus surface + all 6 write endpoints instrumented (noetl/server#21, #23). Rust worker → Rust server e2e validated; 60k/60k load smoke at 920 req/s, p99 164ms. v2.4.0 shipped.
2026-06-03 (afternoon)	`/api/execute` full port (noetl/server#27) + `args:null` fix (#28) + result-envelope shape compliance (#29 → noetl/server#30). Phase D R1 survey closed; template resolution already wired via existing `TemplateRenderer.render_value`.
2026-06-03 (evening)	Phase E SSE port (`GET /api/executions/{id}/events/stream`) shipped. Phase D R2 orchestrator wired into event ingest (noetl/server#31, v2.5.0). Kind validation: 2-step linear playbook (`tests/fixtures/r2_two_step`) runs end-to-end on Rust server, event log terminates at `playbook.completed`.
2026-06-04 (evening)	Phase D R3a — `step.when` enable guard wired through orchestrator (noetl/server#32, v2.6.0). Skip-chain walks inline through skipped steps' `next.arcs` until landing on a guard-passing step or `end`. Companion fix in `state.build_context` exposes workload under `workload.*` namespace. Two new unit tests + kind validation on `tests/fixtures/r3a_conditional` (both `skip_middle=true` and `=false` terminate at `playbook.completed`).
2026-06-04 (late evening)	Phase D R3b — `step.loop` iterator fan-out + state aggregation (noetl/server#33, v2.7.0). Orchestrator detects `step.loop`, evaluates `loop.in_expr`, emits ONE `step.enter` (with `iterations_expected` in context) + N iteration commands via `build_iteration_command` with per-iteration `command_id` shape `<exec>:<step>:<event>:i<index>`. `StepInfo` tracks `iterations_expected` + dedup `iteration_command_ids` HashSet + `iteration_results`; step flips to `Completed` only after all N distinct command_ids land. 5 new unit tests (28/28 engine). Kind validation: orchestrator path validated (fan-out + NATS publish + worker claim); full e2e completion gated on R3b-2 (worker-side Python iter-variable injection) + dual-worker NATS subject race.
2026-06-05 (early morning)	Phase D R3c — parallel branches via `next.arcs` mode:inclusive (noetl/server#34, v2.8.0). Phase D R3 closes. Multi-target transitions already dispatched correctly via existing evaluator path; this PR fixes the orchestrator's `if next_step_name == "end"` early-return that falsely completed workflows when ONE parallel branch hit `end` while siblings were still in flight. New `reached_end` flag + `reached_end_quiescent` clause at end of `process_in_progress`: completion fires only when reached_end AND no new commands queued AND no other branches running. 3 new orchestrator unit tests (31/31 engine). Kind validation: orchestrator triggered exactly 3 times as designed; final `playbook.completed` blocked by dual-worker NATS subject race (same gap as R3b — not an R3c issue).
2026-06-05 (midday)	noetl/ai-meta#53 closes — Phase D R3 e2e drained for R3a/R3c/R2. Three PRs landed: noetl/server#35 (v2.8.1) makes `/api/worker/pool/register` accept `component_type` alias the Rust worker actually sends; noetl/worker#41 (v5.11.0) makes the worker honor `server_url` from each NATS command notification (new `ControlPlaneClient::with_server_url` + `CommandExecutor::execute_with_server_url`); noetl/server#36 (v2.8.2) adds the R3a skip-chain re-entry guard surfaced once the multi-trigger path worked. End-to-end kind validation with both Rust + Python worker pools running: R2 (2-step linear), R3a (both skip_middle branches), R3c (parallel) all reach `playbook.completed` cleanly without dual-worker workarounds. R3b iterators still gated on R3b-2 (worker-side Python iter-variable injection).
2026-06-05 (late afternoon)	Phase D R3 e2e fully closed (noetl/server#37, v2.8.3) — R3b iterators reach `playbook.completed` after all N iterations. Two fixes: (1) `build_iteration_command` injects iteration vars into `tool.config.args` (worker Python sees them as globals via `globals().update(args)` — fixes the long-standing `NameError: name 'item' is not defined`); (2) `state.rs` step.enter handler now reads `iterations_expected` from `event.result.context` (the canonical storage shape after `trigger_orchestrator` wraps via the `{status, context}` envelope), so the iteration aggregation actually fires and `command.completed` no longer falls into the "plain step" branch that prematurely marked the step Completed. New reproducer unit test; 32/32 engine tests pass. All four fixtures end-to-end on kind: R2 (linear), R3a (both branches), R3b (3 iterations), R3c (parallel) → `playbook.completed`.
2026-06-04	Phase F R3b sequence complete (drift-guard end-to-end). Three rounds shipped in sequence: R3b-1 (noetl/server#47, v2.12.0) adds `GET /api/runtime/shard-info` returning the server's `shard_for()` result; R3b-2 (noetl/gateway#26, v3.2.0) adds the gateway twin `GET /sharding/preview` (local compute, NOT proxy); R3b-3 (noetl/ops#158) ships `automation/development/validate-shard-drift-guard.sh` posting to both endpoints across 30 `(execution_id, shard_count)` pairs and asserting `shard_index` agreement. Catches drift modes the per-side unit tests can't see — twox-hash crate version split, `SHARD_HASH_SEED` divergence, i64→bytes endianness flip, hash crate algo change with non-major bump. Server side sanity-validated against noetl-server-rust v2.12.0 in kind (30/30 pairs return expected varying `shard_index` distribution); full cross-side execution deferred to operator since gateway v3.2.0 not yet in local kind. Phase F status: R1 + R1.5 + R2 + R3a + R3a-2 + R3b all ✅; next is R4 (DB sharding) or R5 (cutover).
2026-06-04	Phase F R4 kicked off — decision: per-shard physical Postgres, NOT Citus. Tracked under sub-issue noetl/server#48 with 5-round decomposition (R4-1 pool layer, R4-2 AppState wiring, R4-3 per-execution handler cutover, R4-4 cluster-wide list fan-out, R4-5 kind validation with N=2 shards). R4-1 shipped (noetl/server#49, v2.13.0): `ShardConnection` DSN type + `ShardingConfig` env loader + `DbPoolMap` N+1 pool container with `pool_for(execution_id)` / `cluster()` / `all_shards()` accessors. Single-pool fallback when `NOETL_SHARDS` empty — degenerate one-shard map whose cluster handle IS the only shard pool; behaviour bit-identical to today. 12 + 2 new tests pin `shard_for` routing against R3b drift-guard pairs so any future change to `pool_for` keeps the wire contract honest. No handler changes; R4 ships dormant until operator sets `NOETL_SHARDS`.
2026-06-04	Phase F R4-2 shipped (noetl/server#50, v2.14.0). `DbPoolMap` wired into `AppState` as a new `pools` field alongside the existing `db: DbPool`. Handler call-sites untouched — R4-3 does the disciplined per-execution cutover (~12 files) so each PR diff stays reviewable rather than landing a ~70-call-site swap in one go. New `AppState::new_legacy` shim for test / example callers without a `ShardingConfig`; new sync `DbPoolMap::from_single_pool` constructor lets the shim skip the async path. `main.rs` loads `ShardingConfig::from_env()`, builds the map (sharded when `NOETL_SHARDS` set, fallback otherwise), logs shard topology at startup. Doc example updated to `new_legacy`. 2 new tokio pool tests pin the single-pool path against the i64-extreme inputs the R3b drift-guard exercises.
2026-06-04	Phase F R4-3a shipped (noetl/server#51, v2.15.0). First slice of R4-3 (per-execution handler cutover). 9 sites in `events.rs` migrated from `state.db` to `state.pools.pool_for(execution_id)`; 1 cluster-wide read (catalog content in `trigger_orchestrator`) moves to `state.pools.cluster()`. 2 event_id-keyed sites (`get_command`, `claim_command`) deferred to R4-4 with inline `TODO(R4-4)` markers — execution_id isn't in scope at request time so neither shard selector nor tx begin can pick the per-execution pool up front; R4-4 picks between a path-param redesign and a cross-shard probe helper. Mid-round design pivot: user raised whether shard endpoints should be DB-resident keychain-backed records (multi-cloud + per-shard rotation). Decision: deferred to Phase G; full design + 5-round Phase G decomposition recorded in server wiki sharding-design § Phase G. R4 continues with env-var DSN through R4-5.
2026-06-04	Phase F R4-3b shipped (noetl/server#52, v2.16.0). Second slice of R4-3 (per-execution handler cutover). `execute.rs` cutover with a clean 3+3 split: 3 cluster-wide `noetl.catalog` reads (`resolve_catalog` × 2 + `get_playbook_yaml`) move to `state.pools.cluster()`; 3 per-execution writes (`playbook_started` event INSERT, `command.issued` event INSERT, `insert_command_row` command INSERT) move to `state.pools.pool_for(execution_id)`. `state.db` no longer appears in execute.rs. No new tests — mechanical pool-handle swap. 246/247 full suite passes (one pre-existing parser failure carried over from R3b-1, unrelated). R4-3c next (health.rs, 3 sites all cluster-wide); then R4-4 (cluster-wide list fan-out + event_id-keyed deferrals) → R4-5 (kind validation N=2).
2026-06-04	Phase F R4-3c shipped — R4-3 complete (noetl/server#53, v2.17.0). Final slice of R4-3. `health.rs` cutover: 3 cluster-wide sites (`db_health_check` for `GET /api/health`, `pool.size()` + `pool.num_idle()` for `GET /api/pool/status`) move to `state.pools.cluster()`. Inline comments at both handlers point at the future per-shard health surface (`/api/health/shards` + per-shard `/metrics` gauge labels) without blocking the basic readiness signal. R4-3 cutover complete across events.rs + execute.rs + health.rs. Only 2 `state.db` sites remain in the entire codebase — both in events.rs (`get_command` + `claim_command`), both event_id-keyed, both explicitly `TODO(R4-4)`-tagged. R4-4 closes those out + adds the cluster-wide `GET /api/executions` list fan-out.
2026-06-04	Phase F R4-4 shipped (noetl/server#54, v2.18.0). Cross-shard fan-out + event_id resolver. Two new `DbPoolMap` helpers: `for_each_shard` (sequential per-shard fan-out; outputs in shard-index order) and `find_first` (probes every shard, returns first `Some`). Sequential await — no `futures` crate dep, no `tokio::spawn` 'static bounds; parallelism is a Phase G concern. Closes the 2 event_id-keyed TODO(R4-4) sites from R4-3a (`get_command` + `claim_command`); `claim_command` resolves event_id → execution_id via `find_first` then opens the tx on `pool_for(execution_id)` to keep shard-locality on the claim tx. `state.db` is now absent from the entire src/ tree — R4-3 + R4-4 complete the handler-side migration to `DbPoolMap`. 4 new tokio pool tests; 250/251 full suite passes (pre-existing parser failure unrelated). Full `ExecutionService` refactor (DbPoolMap ownership + per-execution method migration + `GET /api/executions` list fan-out with the catalog join split into per-shard events aggregation + cluster-master catalog lookup) is the natural next slice (R4-4b) — deliberately deferred to keep this PR diff reviewable.
2026-06-04	Phase F R4-4b shipped — in-server migration complete (noetl/server#55, v2.19.0). `ExecutionService::new(pools: DbPoolMap, snowflake)` (was `(db: DbPool, ...)`). 9 per-execution sites across `get`/`get_status`/`cancel`/`is_cancelled`/`finalize` → `pool_for(execution_id)`; catalog path lookup → `pools.cluster()`. `list()` rewritten as fan-out: per-shard `execution_stats` CTE (no catalog JOIN) with `(limit + offset)` over-fetch, merge sorted by `started_at DESC`, cluster-master `SELECT path FROM noetl.catalog WHERE catalog_id = ANY($1)`, stitch paths, post-merge path-LIKE filter (case-insensitive), `skip(offset).take(limit)`. New `ExecutionService::new_legacy(db, snowflake)` shim wraps a single pool for test/example callers without a pool map. main.rs wires `state.pools.clone()` into the constructor. Path filter quirk documented inline (post-merge filter can reduce effective row count below limit). cargo build clean; 250/251 full suite passes. R4-3 + R4-4 + R4-4b together complete the in-server migration to `DbPoolMap`. R4-5 (kind validation N=2 shards in noetl/ops) is the final R4 round.
2026-06-04	Rust-only e2e complete — orchestrator template context + noetl-tools chain + legacy cleanup (noetl/server#67 (v2.19.4) + noetl/tools#17 (v2.17.1) + noetl/tools#18 (v2.18.0) + noetl/worker#42). `control_flow_workbook` runs FULLY end-to-end on the Rust-only stack 🎉 — first NoETL playbook exercising the complete control-flow surface (workbook → python with result capture → conditional next.arcs → parallel branches → terminal) without Python. Five fixes shipped across three repos: PythonTool result capture; TaskSequenceTool runtime; worker noetl-tools dep bump; orchestrator template context (expose step data at top level + capture call.done so command.completed's bare envelope doesn't clobber the rich payload). Kind cluster legacy cleanup: deleted noetl-server, noetl-worker, noetl-outbox-publisher deployments + noetl-projector statefulset + noetl/noetl-ext/noetl-projector/noetl-worker-metrics services + associated configmaps + noetl-worker SA. Patched noetl-worker-system-pool to point at noetl-server-rust before the legacy `noetl` service deletion. Cluster is now Rust-only: noetl-server-rust + noetl-worker-rust + noetl-worker-system-pool. Closes noetl/ai-meta#60, noetl/tools#15, noetl/tools#16, noetl/server#66.
2026-06-04	Rust-only e2e expansion — pipeline + failure + workbook shipped together (noetl/server#61, #63, #65; v2.19.3). Three orchestrator + parser gaps surfaced by the same e2e probe sequence on the Rust-only stack right after the workload + input alias fix. (Gap C — pipeline flat shape) `ToolDefinition::Pipeline` accepts both YAML shapes via an untagged `PipelineItem` enum (flat `name`-as-field vs nested label-as-key); custom Serialize normalises on the wire so the worker's `task_sequence` consumer is unchanged. Flat form is the v10 dominant (446 vs 37 occurrences in e2e fixtures). Unblocked iterator_save_test, start_with_action, end_with_action from 400 parse rejection. (Gap B — failure termination) Four interlocking gaps fixed: handler trigger gate extended to `command.failed`; `trigger_orchestrator` derives trigger_event_type from the actual event; `process_in_progress` short-circuits on `command.failed` using durable `StepState::Failed` signal; `state.rs::apply_event` extracts failure error from `result.context.error` fallback so `step.error` gets populated. Parallel-branch deferral preserved. Validated: `control_flow_workbook` now emits `playbook.failed
2026-06-04	Rust e2e canonical v10 playbook compat — workload + input alias (noetl/server#59, v2.19.2). Two interlocking YAML-decode gaps surfaced by hello_world e2e on the Rust-only stack right after the EE-5 fix unblocked event emission. (1) `ToolSpec.args` accepts `input` as a serde alias — the canonical v10 playbook YAML writes `tool.input: { ... }` (446 occurrences across e2e fixtures vs. 37 for `args:`), but Rust silently dropped the unknown field at decode and any script referencing the workload by name raised `NameError`. (2) `emit_playbook_started_event` merges `playbook.workload` defaults with `request.payload` overrides — without this, downstream steps' `build_context()` returned `{}` because `ExecutionState::handle_event` hydrates `state.workload` from an event that carried only the request payload (the `generate_initial_commands` path had its own merge for the start step's command context — divergent sites, silent drift). Three-state diagnostic on hello_world: no fix → `NameError`; input alias only → `HELLO_WORLD: None`; both fixes → `HELLO_WORLD: Hello World` ✅. Validated end-to-end in local kind with Python deployments scaled to 0. 3 new unit tests pin both YAML shapes. Closes noetl/ai-meta#56 + noetl/server#58. Unblocks all v10-shape playbooks on Rust e2e.
2026-06-04	Phase F R5 follow-up — EE-5 lax decode of integer execution_id / event_id shipped (noetl/server#57, v2.19.1). Rust-only e2e validation surfaced a year-old contract drift hidden by Pydantic v2's lax int→str coercion on the Python stack: worker emits `noetl-events::ExecutorEvent.execution_id: i64` over `.json(&event)` (JSON integer), server's `EventRequest.execution_id: String` strict-decoded and 422'd with `invalid type: integer, expected a string`. Three worker-→-server inbound id fields gained custom serde adapters (`deserialize_string_or_i64` + `Option<String>` variant) so both the worker's `i64` shape and the documented browser-facing `String` shape decode cleanly: `EventRequest.execution_id`, `EventRequest.event_id`, `BatchEventRequest.execution_id`. Outbound encoding stays `String`. 6 new unit tests pin the dual-shape contract; bogus shapes (arrays / objects) still 422. Validated end-to-end in local kind on the Rust-only stack (Python deployments scaled to 0): `noetl exec tests/fixtures/playbooks/hello_world` runs both steps to `playbook.completed` — the same scenario that previously stalled after 3 retry attempts. Closes noetl/ai-meta#55 + noetl/server#56; unblocks Phase F R5 Rust-only e2e tier. Wiki update lands in lockstep on noetl/server wiki event-envelope § EE-5.
2026-06-04	Phase F R4-5 shipped — R4 complete (noetl/ops#160). Kind validation script `automation/development/validate-shard-routing-n2.sh` (~358 lines). Creates 3 databases on existing postgres pod (idempotent), patches noetl-server-rust with `NOETL_SHARDS` + `NOETL_CLUSTER_DSN`, probes R3b-1 → asserts `shard_count=2`, spawns N executions (default 10), asserts each landed on the predicted shard per `shard_for(execution_id, 2)` by querying each per-shard DB directly, re-runs R3b-3 drift-guard against the sharded server, trap-EXIT reverts the deployment patch. Three databases on one pod (not three pods): exercises the in-server routing code path without process-level isolation overhead; per-pod isolation is the Phase G concern. R4 cutover end-to-end shipped across noetl/server v2.13.0 → v2.19.0 + noetl/ops kind validation. Closes noetl/server#48. Next + only remaining Phase F round: R5 (production cutover, separate ops decision).
2026-06-05	Phase F R5 Tier 4 — v10 control-flow runs end-to-end on Rust-only (noetl/server#69 v2.19.5, 6 commits, closes server#68; noetl/worker#44 v5.11.1, closes worker#43; noetl/tools v2.18.1). R5's e2e-playbook tier re-probe found + fixed 7 distinct bugs across the Rust stack: catalog `get_next_version` INT4-vs-i16 SQL decode; `insert_catalog_entry` `RETURNING id`→`catalog_id`; `ToolSpec` Option fields serialized `null` breaking PythonConfig decode (`skip_serializing_if`); orchestrator `end`-step-with-action dispatch + task_sequence label-keyed result flatten + intra-pass dispatch dedup; minijinja `UndefinedBehavior::Chainable` so `{{ ctx.x \| default(step.y) }}` resolves when `ctx` is undefined; drop the `step != "end"` orchestrator trigger gate so end's `command.completed` emits `playbook.completed`; worker preserves array `tool_config` for task_sequence (every v10 `tool: [...]` step was dropping its config to `{}`). Four v10 fixtures reach `playbook.completed` on the Rust-only kind stack (Python at 0): `start_with_action` (conditional arcs + set-blocks, `from_start` data threaded), `end_with_action` (terminal step's cleanup action ran, `ctx.temp_table` threaded), `loop_test` (iterator fan-out, 41 events), `control_flow_workbook` (workbook→eval→parallel→terminal, 37 events). `actions_test` correct-fails on a missing `TEST_SECRET` env (failure-termination fired cleanly). Pointers bumped: server `52043a8`, worker `912678a`, tools `ff5b6e6`. R5 stays open — this is the v10 control-flow slice of Tier 4; remaining: sharded-N=2 mode + 60-fixture regression rig + production cutover.
2026-06-05	Phase F R5 Tier 4 — keychain-credential path validated (noetl/server#71 v2.19.6, closes server#70; noetl/worker#46 v5.11.2, closes worker#45). Registered the `pg_k8s` postgres credential and probed the DB-backed fixtures — surfaced a 3-bug chain in the keychain subsystem. (1) Credential store bound the AES-GCM ciphertext as `Vec<u8>` (BYTEA) to the TEXT `data_encrypted` column → `POST /api/credentials` 500'd, blocking every keychain credential. Fix: base64-armor on write + decode on read (server#71). (2) `resolve_auth_alias` read only the `auth:` key, but v10 playbooks carry the alias under `credential:` → no connection fields injected → tool fell back to a default unreachable connection. Fix: accept `auth` OR `credential`, strip both (worker#46). Proven end-to-end: `iterator_save_test`'s `create_table` (top-level `credential: pg_k8s`) registers → decrypts → field-maps (`db_host`→`host`, …) → connects → runs a real `CREATE TABLE` against `demo_noetl`. (3, filed) Nested task_sequence sub-task credentials don't resolve — `process_items`'s `save_item` is a postgres task inside a pipeline; task_sequence dispatches sub-tasks through noetl-tools (no ControlPlaneClient), bypassing worker-side resolution (worker#47; recommended fix = worker pre-resolves nested aliases). Pointers bumped: server `7a845da`, worker `4144ea0`. `postgres_test`'s `query_catalog` has no `credential:` (fixture-design gap, not a code bug).
2026-06-05	Phase F R5 Tier 4 — nested-pipeline credentials resolved + template-timing finding (noetl/worker#48 v5.11.3, closes worker#47). Bug #3 from the credential chain above: split `resolve_auth_alias` into a dispatcher + single-tool helper; for a `task_sequence` envelope the worker now walks the `tool_config` pipeline array and pre-resolves each sub-task's `credential:`/`auth:` alias before dispatch (task_sequence dispatches sub-tasks through noetl-tools, which has no ControlPlaneClient). Validated: `iterator_save_test`'s `process_items.save_item` (postgres-in-a-pipeline-in-an-iterator) now connects to `demo_noetl`. Credential-path chain complete + validated: store TEXT/base64 (server#71) → `credential:`-key alias (worker#46) → nested resolution (worker#48). Last `iterator_save_test` blocker — filed noetl/server#72: the server pre-renders the entire task_sequence pipeline config at command-build time, so `{{ _prev.* }}` (task_sequence runtime values) render against an undefined `_prev` and — since the v2.19.5 `UndefinedBehavior::Chainable` change — silently become empty strings → malformed SQL (`VALUES ('exec:', 'exec', '', )`). Evidence read from `noetl.event.context`. Recommended fix: defer task_sequence sub-task rendering to the task_sequence tool. Also flagged: postgres tool swallows the real SQL error behind a generic "db error" (noetl-tools observability gap). Pointer bumped: worker `fbefd57`.
2026-06-05	Phase F R5 Tier 4 — `iterator_save_test` GREEN; full v10 + credential + iterator-pipeline surface validated (noetl/server#73 v2.19.7, closes server#72). Added `render_value_deferring(value, ctx, deferred_roots)`: renders every `{{ ... }}` block EXCEPT those referencing a deferred root (masked with a null-delimited placeholder, restored verbatim). The Pipeline branch defers `_prev`/`_results`, so `{{ pg_auth }}` / `{{ execution_id }}` / `{{ item }}` resolve at command-build (worker keychain-alias resolution still sees a resolved alias) while `{{ _prev.* }}` survives to task_sequence, which renders it per-iteration with the real previous-task output. 3 new template unit tests (18/18 template::jinja pass). Validated — the data is the proof: `iterator_save_test` reaches `playbook.completed` and writes 3 rows to the real `demo_noetl` DB (`item1/100`, `item2/200`, `item3/300`); 35-event chain `start → create_table → process_items (×3: transform → save_item) → end → playbook.completed`. The complete credential + iterator + pipeline chain is now merged + validated end-to-end on Rust-only: store TEXT/base64 (server#71) → `credential:`-key alias (worker#46) → nested resolution (worker#48) → `_prev`/`_results` defer-render (server#73). This exercises the deepest v10 surface NoETL has — an iterator whose body is a multi-tool pipeline that chains data between tools, resolves a keychain credential on a nested sub-task, and writes to an external DB. Pointer bumped: server `24a1966`. Remaining R5: sharded-N=2 mode, 60-fixture regression rig, production cutover. Still filed (noetl-tools): postgres tool "db error" observability gap.
2026-06-05	Phase F R5 Tier 4 — postgres-tool observability (real SQLSTATE errors) (noetl/tools#22 v2.18.2, closes tools#21; noetl/worker#49 dep bump). Closed the last follow-up from the credential/iterator saga. `tokio_postgres::Error`'s Display renders just `db error` for server-side failures — the real SQLSTATE + message lives in the attached `DbError`. `format_pg_error(context, &e)` surfaces `severity: message (SQLSTATE code)` + DETAIL/HINT, with a `std::error::Error::source` chain fallback for connection/type errors; wired into the Query + Execute sites. Worker bumped to noetl-tools 2.18.2 (Cargo.lock-only; `^2.18` already covered it; `chore(deps)` non-releasing so worker stays semver v5.11.3 on `64db698`). Validated on the Rust-only kind stack: a bad query (`SELECT * FROM <nonexistent>`) reports `Database error: Query failed: ERROR: relation "..." does not exist (SQLSTATE 42P01)` in the `call.error` event (was: `Query failed: db error`). Diagnosing failing-SQL playbook steps no longer requires reading `noetl.event.context`. Pointers bumped: tools `e636845`, worker `64db698`.
2026-06-05	Phase F R5 Tier 4 — regression rig: canonical v10 SQL + http config shapes (noetl/tools#24 v2.18.3 closes tools#23; noetl/tools#25 test; noetl/tools#26 v2.18.4; noetl/worker#50 + #51). Started the fixture regression sweep (~30 self-contained fixtures) on the Rust-only kind stack; fixed three config-shape classes the canonical v10 fixtures exercise. (1) postgres `command:` alias + multi-statement SQL (v2.18.3): canonical v10 postgres steps use `command:` (incl. `v10_canonical_example.yaml`) — config required `query`; both postgres (extended protocol) and duckdb (`prepare()`) rejected `CREATE …; INSERT …; SELECT …`. Added the alias + a quote-aware splitter (leading statements batched, final keeps the typed path; param-free only). (2) task_sequence→duckdb regression test (tools#25) after a rollout claim-race briefly masqueraded as a fix bug. (3) duckdb `command:` alias + http coercion (v2.18.4): duckdb gains the alias; http `params`/`headers`/`form` were `HashMap<String,String>` but templated values render to integers/null (`invalid type: integer/null, expected a string`) → lenient coercion (numbers/bools→string, null dropped). Newly GREEN: `duckdb_test`, `json_serialization_save`, `duckdb_retry_query`, `pagination/{offset,cursor,max_iterations,pipeline}`, `retry_simple_config`. Cluster recovery first (server latched into `NATS not configured` after a podman restart — fixed by a server rollout). Server-side follow-up: `loop_with_pagination` renders `{{ execution_id }}` empty inside a multi-statement postgres `command` → `= ;` syntax error (render-scoping at command-build). Pointers bumped: tools `3e1df11` (v2.18.4), worker `4759bd1`.
2026-06-05	Phase F R5 Tier 4 — e2e sweep + orchestrator-strand fix (noetl/server#95 v2.27.2, closes server#94; fixture noetl/e2e#28). Re-ran the sweep against a fresh v2.27.1 kind build of the Tier-1 self-contained set: 16 PASS, 3 correct-FAIL (`should_error_tool_is_required` negative test, `v10_canonical_example` fake URL, `spike/` experimental). Found a real orchestrator-strand bug: `vars_test/test_vars_template_access` hung in `RUNNING` forever after `set_variables`. Root cause — `trigger_orchestrator` caught a deterministic `evaluate()` error (an invalid `{{ ctx. }}` Jinja expression in a downstream step's `code` body, rendered by `engine::commands::build_tool_command`) in a WARN-only arm and emitted no terminal event. Fix: emit a terminal `playbook.failed` (FAILED, error surfaced, parented on the trigger) on an evaluate `Err`; transient/infra errors before evaluate stay retryable. Same stall class as the #58 `command.failed` fix. Unit test locks the precondition (invalid template body → `evaluate` returns Err); lib suite 307/0. Kind-validated: a bad-template playbook → FAILED (was RUNNING-forever); `template_access` (fixture fixed), `hello_world`, `test_transient`, `loop_test` → COMPLETED. Server wiki: event-envelope terminal-events section. 3 deferred findings filed: #62 (`/api/executions` list status query O(all events), 7–8 s + list-vs-detail status drift), #63 (python `script.source.code` shape, ~18 fixtures), #64 (`artifact` tool kind). Pointers bumped: server `5808486` (v2.27.2), e2e `f3a49b2`. Remaining R5: clear #62/#63/#64, external-infra fixtures, sharded-N=2 mode, production cutover.

| 2026-06-07 | Phase D R5 R7 — cross-server parity harness; Replay engine port complete (noetl/server#157 v2.57.0; closes server#148). Final slice of Phase D R5. All seven rounds shipped today (v2.51.0 → v2.57.0); the Replay engine port (Python's ~1236-LoC noetl/server/api/replay/service.py) is now ported to Rust with structural-parity unit-test coverage + cross-server parity rig. Harness shape: tests/parity_harness/events.json (13 synthetic events exercising all six replay projections + payload refs) + tests/parity_harness/expected.json (Python's structured fold output, pre-recorded) + tests/parity_harness/regenerate_expected.py (standalone Python 3.10+ script — verbatim extract of noetl/server/api/replay/service.py::fold_replay_state + helpers, no noetl-package imports to dodge the transitive-dep chain that blocked direct noetl-package import — nats, snowflake-connector-python, env-var validators, …). tests/parity_harness.rs 8-test integration suite: top-level counts, execution status + last_node_name, execution.payload_refs, all six projection maps with per-key status / counters / payload refs / attributes. All 8 pass. Parity contract is structural, not byte-for-byte hex on checksum.value / projection_checksums[*].value — Python and Rust hash different digest inputs (Python feeds normalize_replayed_*_projection flat-row output into SHA-256; Rust hashes the typed state directly per R4's design); both deliver determinism + replay validation but produce different hex. The typed Checksum { type, value } shape (R4) keeps the wire stable across future algorithm additions. No kind-val required — test-only PR with no runtime changes (release-please rolling-MINOR bump from the feat(replay): commit prefix). Sub-issue server#148 auto-closed at 17:53:40Z via the PR body's Closes keyword. Pointer bumped: server 395f8cf. | | 2026-06-07 | Phase D R5 R6 — payload resolver (noetl/server#156 v2.56.0; Refs server#148). Sixth slice of Phase D R5 — every event's result.reference JSON gets parsed into a typed PayloadSummary and appended to the relevant projection's payload-refs list. Mirrors Python's _payload_ref / _payload_summary / per-projection payload_refs population in noetl/server/api/replay/service.py. PayloadSummary struct = {sha256, schema_digest, row_count, media_type, ref} — all Option<…> + skip_serializing_if; the ref field is serde(rename = "ref") over Rust reference_uri (Rust keyword-shy). PayloadRefEntry struct = {event_id, reference, summary} per Python's dict shape. ReplayEventRow.result: Option<serde_json::Value> — noetl.event.result jsonb column added to all three load_events SQL queries. ReplayExecutionState.payload_refs: Vec<PayloadRefEntry> appended in event_id order from every event with result.reference. ReplayFrameState.output_ref + output_ref_summary populated on frame.committed / frame.failed; summary is Some(default) (all-None fields) when the terminal event has no reference (mirrors Python's _payload_summary(None)). ReplayBusinessObjectState.payload_refs + last_payload_ref — every event touching the BO with a result.reference appended; last_payload_ref points at the most recent. extract_payload_ref(event) mirrors Python's _payload_ref — reads event.result.reference, returns None when result absent, no reference key, or reference null. payload_summary(reference) mirrors Python's _payload_summary three-tier fallback per field: reference.<field> → reference.rows_ref.meta.<field> → reference.rows_ref.ipc.<field>; sha256 additionally falls back to reference.digest; ref falls back to reference.uri. 15 new unit tests covering all fallback paths + per-projection population. Server lib 564/0/0 (was 549/0/0). Kind-validated: built localhost/noetl-server-rust:v2.56.0, loaded into kind, rolled deployment. Re-probe of fanout_reduce execution 322023958058635264 returns identical shape as v2.55.0 (no result.reference events in this fixture — back-compat preserved). A second execution (640422512395813188) confirms the live resolver path — execution.payload_refs populates with 3 entries, each carrying a real SHA-256 hex digest (d0de6b8de78fd04b2e752a96ebef12df4a9b32e92565b3f6e55860ae12762133) extracted by the resolver; row_count + ref are null because those source events' result.reference JSON didn't carry those fields (fallback chain returned None as documented). Only R7 (cross-server parity harness against Python) remains in the umbrella. Pointer bumped: server a8a054a. | | 2026-06-07 | Phase D R5 R5 — snapshot seed + base_state + upcaster digest (noetl/server#155 v2.55.0; Refs server#148). Fifth slice of Phase D R5 — the replay fold can now start from a prior fold's output and continue from there rather than always re-folding from event 1. Mirrors Python's base_state + snapshot_seed + upcaster_registry_digest parameters on fold_replay_state in noetl/server/api/replay/service.py. ReplaySnapshotSeed struct mirrors Python's frozen dataclass (aggregate_id + aggregate_type + version: i64 + checksum: Checksum + state: ReplayState + meta: Map); ReplaySnapshotInfo is the output-side subset (omits the seed's full state because it already went into base_state); ReplayFoldOptions struct (Default impl) carries the three optional inputs. New ReplayState fields: upcaster_registry_digest: Option<String> + replay_snapshot: Option<ReplaySnapshotInfo> — both skip_serializing_if = "Option::is_none" so default-options folds produce the exact same JSON as R1–R4 (wire-shape back-compat preserved). fold_replay_state_with_options new entry point; existing 5-arg fold_replay_state becomes a thin back-compat shim passing ReplayFoldOptions::default(). Continuation semantics: base_state strips its checksum + projection_checksums (they recompute at the end); counters (event_count, last_event_id, …) continue from where base left off; caller's tenant_id / organization_id / execution_id override whatever base recorded; caller's upcaster_registry_digest wins over base's, but None from caller preserves base's value. Snapshot-storage backend deferred — the HTTP handler stays unchanged for R5; wiring up a snapshot store + load_snapshot_seed shape + deciding when to seed is a downstream sub-issue against server#148. R5 lands the data-contract round; storage + HTTP opt-in are the next slice. 8 new unit tests covering option propagation, snapshot info surfacing, counter continuation, checksum stripping + recomputation, tenant/org override, upcaster digest precedence rules. Server lib 549/0/0 (was 541/0/0). Kind-validated: built localhost/noetl-server-rust:v2.55.0, loaded into kind, rolled deployment. Re-probe of the prior fanout_reduce execution 322023958058635264 returns identical JSON shape as v2.54.0 (event_count=25, status=COMPLETED, commands_n=4, all six projection_checksums entries populated); new replay_snapshot + upcaster_registry_digest keys absent from JSON output (Option::is_none skip working). Snapshot-seeded behaviour covered by unit-test layer (no snapshot store in kind yet). Pointer bumped: server 3dc6b66. | | 2026-06-07 | Phase D R5 R4 — typed Checksum + projection_checksums (noetl/server#154 v2.54.0; Refs server#148). Fourth slice of Phase D R5 — every replay fold now produces a typed Checksum over the full state + a 6-entry projection_checksums map covering every per-projection slot. Per user direction (this session): the hash function is the type of the checksum, not a sibling field — ChecksumType enum (initial variant Sha256) gates future types (BLAKE3, SHA-512, …) without a wire-format break. ChecksumType enum serializes lowercase snake_case "sha256" matching Python's state["checksum_algorithm"] wire form; Checksum struct = { type: ChecksumType, value: String /* hex */ }; ReplayState.checksum: Option<Checksum> (skip_serializing_if when None) + ReplayState.projection_checksums: BTreeMap<String, Checksum> (six entries on every fold: execution, stage, frame, command, business_object, loop). stable_json_bytes helper encodes values as deterministic JSON — sorted keys recursively + compact separators (matches Python's json.dumps(sort_keys=True, separators=(",", ":")) byte form). compute_checksums runs once at the end of fold_replay_state — per-projection SHA-256 over each typed sub-state, then top-level SHA-256 over the full state with projection_checksums populated and checksum field still None (skip_serializing_if handles the self-reference cleanly — digest doesn't depend on itself). Design decision: the digest input is the typed Rust state directly, NOT through Python's normalize_replayed_<projection>_projection flat-row layer. Reasons: typed BTreeMap ordering + stable_json_bytes sorted-key recursion deliver the same determinism guarantee; the Rust state IS the source of truth for the server's view; cross-Python byte-for-byte parity isn't an R4 requirement — that's R7's "cross-server parity harness" round (additive work, doesn't touch the R4 wire shape). 9 new unit tests covering type-serialization shape, hex output format, deterministic re-runs, projection isolation, top-level self-non-dependence. Server lib 541/0/0 (was 532/0/0). Kind-validated: built localhost/noetl-server-rust:v2.54.0, loaded into kind, rolled deployment. Re-probe of the prior fanout_reduce execution 322023958058635264 returns checksum: {type: "sha256", value: "41265876487f32350fc60c5039358456ded76598b99e7a0833ac4a17ceaae426"} and projection_checksums with all six entries; sample command entry hex 58d8220005758b7f18e27d9042b3ef5fa8ca86471c9d2ea33a869fd0db31231b. Same event log → same hex (verified by fold_checksum_deterministic_across_runs); every projection's hex differs from every other (distinct identity per sub-state input). Pointer bumped: server adec21c. | | 2026-06-07 | Phase D R5 R3 — loops + business_objects projections (noetl/server#153 v2.53.0; Refs server#148). Third slice of Phase D R5 — replay fold now populates the last two per-projection maps from the event stream. Two new typed state structs (ReplayLoopState with loop_id + step_name + total + done + failed + completed + last_event_id; ReplayBusinessObjectState with object_key + object_type + object_id + status + version + event_count + per-lifecycle event_ids + attributes) replace R2's serde_json::Map placeholders. ReplayState.{loops,business_objects} flip to BTreeMap<String, Replay{Loop,BusinessObject}State> for deterministic key ordering. Two new ID extractors mirror Python's _loop_id / _business_object_identity resolution order — loop id reads meta.loop_id / meta.loop_event_id / meta.__loop_epoch_id; business-object identity reads meta.business_object.{object_type|type} / {object_id|id} first, then flat meta.business_object_{type,id} / meta.object_{type,id}, then parses aggregate_type=business_object + aggregate_id=business_object/<type>/<id> as fallback. business_object_status helper mirrors Python's _business_object_status (explicit non-empty status wins; else suffix-derives ACTIVE / DELETED). Two populate functions with full event-shape coverage — loop counters bump on command.{completed,failed} + loop.shard.{done,failed} + loop.{done,fanin.completed}; business-object attributes REPLACE from meta.business_object.state and PATCH from meta.business_object.{patch|attributes}; status transitions through the suffix-derived path (DELETED also sets deleted_event_id); version recomputes from meta.business_object.version || meta.business_object_version || event_count. payload_refs deferred to R6 (the payload-resolver round). 13 new unit tests; server lib 532/0/0 (was 518/0/0). Cleanup commit (server@a235b60) dropped the "canonical" qualifier from R3 doc comments per writing-style banned-word rule + user direction. Kind-validated: built localhost/noetl-server-rust:v2.53.0, loaded into kind, rolled deployment. Re-probe of the prior fanout_reduce execution 322023958058635264 returns loops + business_objects maps empty (the v10 control-flow shape doesn't emit loop.* events or business-object metadata — expected). Fold correctness for those projections verified through the unit-test layer (fold_populates_loop_with_counters_and_completion, fold_populates_business_object_through_lifecycle). Wire-contract shift to typed BTreeMap is the load-bearing change for R4's projection_checksums hash input. R4 design refinement (captured this session, per user direction): R4 ships the checksum + projection_checksums fields with a TYPED shape — ChecksumType enum (initial variant Sha256; future could add Blake3, Sha512, ...) + Checksum { type, value } struct. ReplayState.checksum is Option<Checksum>; ReplayState.projection_checksums is BTreeMap<String, Checksum>. Reasons: the hash function is the type of the checksum, not a sibling field; future types slot in via the enum without a wire-format break. Full design on server#148 comment. Pointer bumped: server 3174c75. | | 2026-06-07 | Phase D R5 R2 — stages + frames + commands projections (noetl/server#152 v2.52.0; Refs server#148). Second slice of Phase D R5 — replay fold now populates stages + frames + commands projections from the event stream. Mirrors Python's state["stages"] / state["frames"] / state["commands"] per-projection dicts. ReplayEventRow extended with stage_id / frame_id / command_id / worker_id / aggregate_type / aggregate_id / meta columns (all #[sqlx(default)]); three new typed state structs (ReplayStageState / ReplayFrameState / ReplayCommandState); ReplayState.{stages,frames,commands} flip from serde_json::Map to BTreeMap<String, Replay{Stage,Frame,Command}State> for deterministic key ordering (matters when R4 lands typed Checksum + projection_checksums); three new ID extractors mirror Python's resolution order (top-level column → aggregate_type+aggregate_id → meta.<key>); three new populate functions with full status transitions (stage opened → OPEN / closed → CLOSED; frame dispatched → CLAIMED / started → RUNNING / committed → COMPLETED / failed → FAILED / abandoned → ABANDONED; command full lifecycle). 10 new unit tests; server lib 518/0/0 (was 508/0/0). Kind-validated: re-probe of prior fanout_reduce execution 322023958058635264 returns commands map populated with 4 entries (one per dispatched command: start, normalize_customer, enrich_customer, reduce_customer) carrying worker_id + issued_event_id + last_event_id; stages + frames stay empty because the v10 control-flow shape doesn't emit stage.* / frame.* events (expected). Pointer bumped: server 266b4c7. | | 2026-06-07 | Phase D R5 R1 — Replay endpoint scaffold + execution projection (noetl/server#149; tracks server#148). Opens Phase D Round 5 — the Replay engine port (Python's noetl/server/api/replay/service.py ~1236 LoC → Rust). Sub-issue server#148 carries the 7-round decomposition (R1 scaffold + execution / R2 stages+frames+commands / R3 loops+business_objects / R4 typed Checksum + projection_checksums / R5 snapshot seeds / R6 payload resolver / R7 cross-server parity harness). This PR ships R1: new GET /api/replay/state route mirroring Python's endpoint.py byte-for-byte (query params + defaults + projection enum names + mutually-exclusive cutoffs returning 400); new services::replay module with ReplayService + ReplayCutoff + ReplayProjection + ReplayState + pure deterministic fold_replay_state; minimal execution projection fold using the same terminal-event short-circuit pattern Phase D R4 landed in the orchestrator + status endpoint (playbook.completed → COMPLETED, step.enter flips UNKNOWN → RUNNING + tracks last_node_name). Other map fields (stages/frames/commands/business_objects/loops) stay empty in R1 — populated in R2/R3. 8 new service tests + 1 handler test; server lib 508/0/0 (was 499/0/0). | | 2026-06-07 | Phase D R4 follow-up — status endpoint short-circuits on terminal events (noetl/server#147 v2.50.1; closes server#146). Read-side bug surfaced during the Phase D R4 fanout_reduce kind-val: GET /api/executions/{id}/status continued to return {"status":"RUNNING","completed_steps":0} for 90s+ after playbook.completed landed in the event log. Two compounding causes — (1) inline SQL step-stats heuristic had no terminal-event lookup; (2) completed_steps filter looked for status='COMPLETED' only, missing the realistic 'success' lowercase emitted by command.completed events. Fix: new terminal query short-circuits on playbook.completed/playbook.failed (mirrors the list endpoint's existing bool_or(playbook.completed) semantics); widened filter to accept status IN ('COMPLETED', 'completed', 'success'). 6 new unit tests; server lib 499/0/0 (was 493/0/0). Kind-validated: rebuilt + loaded localhost/noetl-server-rust:v2.50.1, re-query of prior execution flipped {"status":"RUNNING","completed_steps":0} → {"status":"COMPLETED","completed_steps":4} on the same DB data; fresh fanout_reduce execution reached COMPLETED in ~600ms wall. Phase D R4 read-side now matches the orchestrator's decision the moment a terminal event lands. Pointer bumped: server d26abf8. | | 2026-06-07 | Phase D R4 — fanout_reduce kind-val GREEN on Rust-only stack. 🎉 Built noetl-server-rust:v2.50.0 from server@499b079 via podman + Dockerfile, loaded into kind, rolled the deployment, ran the fanout_reduce_phase6 fixture from e2e@5da36ea. Direct DB query against noetl.event confirms all three barrier assertions pass: playbook.completed event exists @ 14:25:57.254 + exactly 1 step.enter for reduce_customer + reduce_customer.command.completed AFTER both branches' command.completed (r=14:25:57.248 > a=14:25:56.954 ∧ > b=14:25:57.042). Server log confirms Step 'reduce_customer' already dispatched in this pass, skipping + Orchestrator marked execution as terminal terminal_event=playbook.completed. Phase D R4 closes at the orchestrator + e2e level. Surfaced separately: GET /api/executions/{id}/status returns RUNNING after playbook.completed lands (read-side bug, doesn't affect orchestrator correctness) — filed as noetl/server#146. | | 2026-06-07 | Phase D R4 slice 3 — fanout_reduce kind-val rig (noetl/e2e#32; closes e2e#31). Durable fixture + kind-val script for the canonical fan-in shape (start → branch_a/branch_b → reduce_customer → end). Fixture is a copy of the Python reference at repos/noetl/tests/fixtures/playbooks/fanout_reduce/fanout_reduce_phase6.yaml with a header comment documenting the orchestrator contract being exercised. Script (scripts/kind_validate_fanout_reduce.sh) preflights kubectl + noetl + curl + python3 + server reachable, registers + executes, waits up to $NOETL_FANOUT_TIMEOUT_SECS (default 180s), then asserts three things on the event log: (1) final execution status COMPLETED, (2) exactly one step.enter for reduce_customer (barrier prevented double-dispatch), (3) reduce_customer.command.completed arrives AFTER both branches' command.completed events. Dumps server + worker logs on failure for triage. Modeled after kind_validate_container_callback.sh; same lint discipline (bash -n + yaml.safe_load + python snippet dry-runs all green). Phase D R4 slices 1-3 all shipped at the artefact level. Next housekeeping: build a fresh noetl-server image carrying v2.50.0 + load into kind + run the rig to capture green-state evidence. Pointer bumped: e2e 5da36ea. | | 2026-06-07 | Phase D R4 slice 2 — apply_event handles step.skipped (noetl/server#145 v2.50.0; closes server#144). Closes the gap exposed by slice 1's #[ignore] test. New "step.skipped" | "step_skipped" arm in state::WorkflowState::apply_event records the step into state.steps with StepState::Skipped and stamps entered_at + completed_at to the event timestamp. is_step_done already treated Skipped as terminal at state.rs:540 — the missing piece was the apply_event mapping. Without this, the fan-in barrier kept deferring a multi-upstream reduce step forever if any upstream was guard-skipped. Slice 1's ignored test (test_reduce_step_treats_skipped_upstream_as_done) flipped to active + 2 new state tests; server lib 493 passed / 0 failed / 0 ignored (was 490/0/+1). Pointer bumped: server 499b079. | | 2026-06-07 | Phase D R4 first slice — fan-in / reduce barrier (noetl/server#143 v2.49.0; closes server#142). Orchestrator now gates dispatch of any step with multiple incoming arcs until ALL upstream steps reach a terminal state (Completed | Failed | Skipped). Pre-PR the orchestrator fired a reduce step on the FIRST completing upstream — never seeing the others' results — which broke the canonical fanout_reduce_phase6 shape (start → branch_a,branch_b → reduce → end). New module-private build_incoming_arcs(steps) helper mirrors the Python planner's incoming map across all four NextSpec variants (Single/List/Router/Targets). process_in_progress calls it once per pass; dispatch loop adds a barrier-defer check between same-pass dedup and dispatch. Single-upstream targets unaffected. command.failed already short-circuits via the dedicated path at the top of the function, so reaching the dispatch loop with any upstream Failed is structurally impossible. Three new active tests + one #[ignore] documenting an apply_event follow-up gap (no case for step.skipped, so a guard-false upstream stays Pending rather than Skipped; the barrier code already handles Skipped on the reconstructed-state side via is_step_done). Server lib 490 passed / 0 failed / 1 ignored. Context-merging slice + kind-val with the fanout_reduce_phase6 fixture are separate follow-ups — playbook templates can already read {{ <upstream>.<field> }} individually; auto-merging upstream step.result into the reduce step's Jinja scope is a planner-level concern. Pointer bumped: server be37e5c. | | 2026-06-08 | noetl-server v2.64.0 — step-level set: + ctx shims; e2e validation 25/27 PASS (noetl/server#168 OPEN, v2.64.0 on branch fix/ctx-shim-orchestrator-eval; Refs noetl/ai-meta#49). Two orchestrator fixes: (1) with_ctx_shims() adds ctx.*/workload.* namespace entries to Jinja evaluation context at all 7 orchestrator call sites (commit b05f978); (2) set_vars field added to Step struct so step-level set: YAML is parsed + applied after each completed step (commit 48cb008) — root cause of pagination_basic failing at 19 events (now progresses to 45+). Full 38-playbook e2e sweep on Rust-only kind stack: 25/27 testable PASS (92.6%). 1 server bug remaining (test_storage_tiers artifact _ref). Performance: heavy_loop_aggregation 526 events in 31.89s (~16.5 events/s sustained). Performance report posted on #49. PR awaiting merge. | | 2026-06-08 | Sequential-mode iterator dispatch — closes noetl/ai-meta#76 (noetl/server#166 v2.62.0). First Claude-direct Rust PR under agents/rules/handoff-routing.md. LoopMode enum (Sequential default / Parallel) in playbook/types.rs; LoopSpec.mode parsed from loop.spec.mode YAML. StepInfo.iterations_dispatched tracks command.issued count for the sequential dispatch guard. Sequential pattern: dispatch iteration 0 at fan-out; on each command.completed, check iterations_dispatched == iterations_completed() and dispatch next. Default is Sequential, so existing playbooks without explicit spec.mode get sequential behavior; existing parallel-mode tests updated to set mode: Parallel explicitly. 3 new tests; lib pass; clippy clean. Kind-val GREEN: test/loop COMPLETED 5/5 iterations + iterator_save_test COMPLETED 4 steps. Pointer bumped: server 2430bc2. | | 2026-06-09 | E2E sweep cleanup — noetl-tools v3.1.0 + noetl-server v3.0.1 (noetl/tools#47 + noetl/server#171; Refs #49). The prior-session e2e-sweep fixes shipped with diagnostic tracing::debug! scaffolding; this session stripped the scaffolding, kept the production deltas, opened + merged both PRs. tools v3.1.0: YAML boolean when: true in policy rules checks as_bool() before string-template fallthrough (Value::Bool::as_str() returns None); \|tojson fallback when a single {{ expr }} template renders a complex object to Python-style repr; UndefinedBehavior::Chainable on the tools engine; two misplaced tests moved inside mod tests {}. server v3.0.1: result-store PUT body limit raised to 64 MB (DefaultBodyLimit — was 413-ing 15 MB+ payloads); render_pipeline_config stashes set/args/spec/command before Tera rendering; iter namespace map in build_iteration_command; cmd_render_ctx uses command.context override. All 7 e2e sweep playbooks PASS on the Rust-only kind stack (Python at 0): hello_world, loop_test, actions_test, test_args_passing, iterator_save_test, test_storage_tiers, test_http_jsonplaceholder. Pointers bumped: tools d294a6c, server 33789b0. PRs cited Refs #69 by mistake (stale — #69 is the closed worker output_select bug); corrected on #49. Deferred: worker noetl-tools crates.io revert blocked on v3.1.0 publish (release commit [skip ci]; crates.io at v3.0.0). Phase F status: only R5 (production cutover) remains — a separate ops decision. |

Phase D status (closed)

Round	Code	Kind E2E
R1 (template resolution)	✅ shipped in Phase B	✅
R2 (orchestrator wired into event ingest, #31)	✅ v2.5.0	✅
R3a (`step.when` enable guard, #32 + #36)	✅ v2.6.0 + v2.8.2	✅
R3b (`step.loop` iterator, #33 + #37)	✅ v2.7.0 + v2.8.3	✅
R3c (parallel branches, #34)	✅ v2.8.0	✅
R4 slice 1 (fan-in / reduce barrier, #143)	✅ v2.49.0	✅ GREEN — fanout_reduce_phase6 on Rust-only stack, 14:25:57.254
R4 slice 2 (apply_event step.skipped, #145)	✅ v2.50.0	✅ GREEN — fanout_reduce_phase6 on Rust-only stack, 14:25:57.254
R4 slice 3 (fanout_reduce kind-val rig, e2e#32)	✅	✅ GREEN — three barrier assertions verified against `noetl.event`

Phase D R5 — Replay engine port (Python `noetl/server/api/replay/service.py` ~1236 LoC → Rust)

Sub-issue: noetl/server#148 carries the 7-round decomposition.

Round	Topic	Status
R1	Endpoint scaffold + `execution` projection (server#149)	✅ v2.51.0 — kind-val GREEN
R2	`stages` + `frames` + `commands` projections (server#152)	✅ v2.52.0 — kind-val GREEN
R3	`loops` + `business_objects` projections (server#153)	✅ v2.53.0 — kind-val GREEN
R4	typed `Checksum` (`ChecksumType` enum + value) + `projection_checksums` (server#154)	✅ v2.54.0 — kind-val GREEN
R5	snapshot seed + `base_state` + upcaster digest (server#155)	✅ v2.55.0 — kind-val GREEN
R6	payload resolver (server#156)	✅ v2.56.0 — kind-val GREEN
R7	cross-server parity harness against Python (server#157)	✅ v2.57.0 — 8-test rig, all pass; closes server#148

Next concrete steps

EE-4 fully shipped ✅ — all three rounds merged + crates.io publish landed:

Round	PR	Outcome
1	noetl/cli#49	`noetl-events` workspace crate extracted from `noetl-executor::events`; executor 1-line re-exports.
2	noetl/cli#50	Crates.io publish prep + actual publish: `noetl-events 0.1.0`, `noetl-executor 0.4.0`, `noetl 4.9.0` all on crates.io after manual workflow dispatch on cli@d6e2432.
3	noetl/server#38	Server takes `noetl-events = "0.1"` direct dep, adds `From<ExecutorEvent>` + `TryFrom<&EventRequest>` impls, 4 wire-compat tests pinning the shared-subset round-trip. Server v2.9.0.

Design call: server's EventRequest legitimately retains its 5 server-only fields (result_kind, result_uri, event_ids, actionable, informative) and String wire format for execution_id / event_id (JSON-number precision for browser clients) — the two types stay distinct but the SHARED SUBSET is now anchored to the canonical noetl_events::ExecutorEvent.

Forward:

Phase F — sharding — R4 complete (shipped 2026-06-04 across noetl/server v2.13.0 → v2.19.0 + noetl/ops kind validation; noetl/server#48 closed). Design doc + endpoint inventory live on noetl/server wiki — sharding-design. In-server DbPoolMap routing is end-to-end kind-validated; ships dormant unless the operator sets NOETL_SHARDS. Only remaining Phase F round is R5 (production cutover — separate ops decision after the operator runs validate-shard-routing-n2.sh against their live kind cluster and confirms acceptance criteria; tracked separately whenever that rollout window opens).
Phase G (deferred) — keychain-backed shard endpoints + result_ref.credential_alias. Design + 5-round decomposition recorded in noetl/server wiki sharding-design § Phase G. Triggers when multi-cloud or per-shard credential rotation becomes a real production requirement.
Remaining Python-only endpoints — scope a sweep through the Phase A parity-harness output to identify the long-tail of routes the Rust server doesn't yet cover (none currently block production traffic; the system pool + workers already operate on the Rust surface for all observed call patterns).
R5 fixture regression rig — e2e sweep cleanup landed (2026-06-09). The 38-playbook e2e sweep (noetl-server v2.64.0 + noetl-worker v5.15.0, Python at 0) shipped two orchestrator fixes — ctx/workload namespace shims + step-level set: mutation support — which merged via #168 (v2.63.0). The remaining sweep deltas (YAML when: true boolean, |tojson object-template fallback, 64 MB result-store body limit, pipeline command/spec stash) were stripped of diagnostic logging and merged 2026-06-09 as noetl-tools v3.1.0 (#47) + noetl-server v3.0.1 (#171). All 7 sweep playbooks now PASS on the Rust-only kind stack: hello_world, loop_test, actions_test, test_args_passing, iterator_save_test, test_storage_tiers, test_http_jsonplaceholder. ~~1 remaining server-side bug: test_storage_tiers artifact _ref null.~~ Resolved via the v3.0.1 result-store body-limit fix (the over-budget durable PUT was 413-ing, leaving the worker on the degraded shm-only branch). Remaining non-PASS fixtures are all external-dependency / fixture-design issues (wrong service names, missing credentials, external HTTP servers, OpenAI API key). Deferred: worker noetl-tools crates.io revert blocked on v3.1.0 publish ([skip ci] release commit; crates.io at v3.0.0). Open server-side finding: loop_with_pagination renders {{ execution_id }} empty inside a multi-statement postgres command: at command-build time. Resolved 2026-06-08 on noetl-server v2.57.1. Porting gap noted: the artifact tool kind. Resolved 2026-06-07 via tools v2.20.0.

Phase F — sharding survey (2026-06-05)

Read-only investigation of the Rust server (repos/server/src/) to scope the sharding boundary. The architecture is ready: every per-execution operation is already keyed by execution_id, NATS subjects are already execution-aware (noetl.commands.{system|shared}.<execution_id> derived in publish_command_notification at execute.rs:535), and the orchestrator is stateless re: execution (loads events fresh from the DB on every trigger). Sharding is an orchestration problem, not a redesign.

Natural shard boundary

The execution. All state belonging to a single execution_id (events, commands, variables, outbox rows) lives on one shard. Cluster-wide state (catalog, credentials, keychain, runtime registration) is read-only from the execution's perspective and replicates across shards or lives behind a shared read-only path.

Endpoint inventory (40+ routes)

Per-execution (shardable by execution_id):

Endpoint	execution_id source
`POST /api/execute`	body (server generates snowflake)
`POST /api/events`	body (string-wire i64)
`POST /api/events/batch`	body
`GET /api/commands/{event_id}`	path → DB lookup
`POST /api/commands/{event_id}/claim`	path → DB lookup
`GET /api/executions/{execution_id}`	path
`GET /api/executions/{execution_id}/status`	path
`POST /api/executions/{execution_id}/cancel`	path
`GET /api/executions/{execution_id}/cancellation-check`	path
`POST /api/executions/{execution_id}/finalize`	path
`GET /api/executions/{execution_id}/events/stream` (SSE)	path
`GET / POST / DELETE /api/vars/{execution_id}`	path
`POST /api/internal/outbox/claim`	per-execution batch
`POST /api/internal/outbox/mark-published` / `mark-failed`	body
`POST /api/internal/events/project`	body

Cluster-wide (NOT shardable):

Endpoint	Reason
`GET /api/executions` (list)	cluster-wide query
`GET/POST /api/catalog/*`	global playbook catalog
`POST/GET /api/credentials/*`	global credential store
`POST/GET /api/keychain/{catalog_id}/*`	keyed by `catalog_id`, not `execution_id`
`POST /api/worker/pool/{register,deregister,heartbeat}`	worker pools are global
`GET /api/worker/pools`	cluster-wide read
`GET /api/internal/outbox/pending-count`	cluster-wide count (KEDA scaler trigger)

DB table inventory

Shardable by execution_id:

noetl.event — append-only event log
noetl.command — command queue
noetl.execution — execution records
noetl.outbox — transactional outbox
noetl.variables — ephemeral step-scoped cache

Cluster-wide:

noetl.catalog — playbook catalog
noetl.credential — credential store
noetl.keychain — keychain entries
noetl.runtime — worker pool registration/heartbeat

Three open design questions

Gateway-aware vs server-aware shard routing. Either the ingress (gateway / load balancer) inspects execution_id and routes to the owning shard, or any server replica accepts any request and proxies internally. Trade-off: gateway-aware is one network hop and centralizes the routing logic; server-aware is more resilient but adds inter-shard mTLS + an N-hop worst case. Lean recommendation: gateway-aware for Phase F (simpler); server-aware as a Phase G optimization if gateway becomes a bottleneck.
Modulo on full snowflake vs timestamp / machine_id. The snowflake i64 layout is [timestamp: 41][machine_id: 10][sequence: 12]. hash(execution_id) % N (full 64-bit) distributes evenly; timestamp % N creates time-based hotspots (all playbooks started in the same second hit one shard); machine_id % N clusters by generating machine. Decision: full i64 modulo. Cheap and even.
Catalog / credential / keychain sync strategy. These must be consistent across shards. Options: (a) shared single-master Postgres (simple; write bottleneck on heavy catalog churn); (b) multi-master with conflict resolution (risky); (c) read-only replicas per shard from a single master (highest ops cost, cleanest separation). Lean recommendation: (a) single-master for Phase F; revisit (c) as Phase G if replication lag becomes the limiting factor.

Proposed Phase F decomposition (5 rounds, Phase D shape)

Round	Scope	Output
R1	Sharding design doc + endpoint inventory codification	Sub-issue with the design call (gateway-aware, full-i64 modulo, single-master cluster-wide tables) and the 40+ endpoint inventory tagged shardable / metashard. ADR added to noetl.dev/docs.
R2	Server-side `shard_id()` helper + (optional) `route_to_shard()` proxy	New `repos/server/src/state.rs` helpers; HTTP proxy fallback for mis-routed requests (gives a safety net while gateway routing rolls out).
R3	Gateway-side dispatch + load balancer config	Wire ingress (Kong / nginx / Cloud LB) to extract `execution_id` and route by consistent hashing. Replicate the shard assignment in LB config or a small plugin.
R4	DB sharding + per-shard schema migration	Partition `event` / `command` / `execution` / `outbox` / `variables` by `execution_id` via Citus or per-shard separate schemas. Replicate `catalog` / `credential` / `keychain` / `runtime` (or keep on shared read-only path). Schema migration dry-run on kind first.
R5	Cutover + validation	Spin up N=2 shard replicas in kind, run the Phase D e2e harness across shards, confirm execution affinity (all events for one execution hit one shard), then prod canary + traffic split.

Load-bearing prerequisite — app-side snowflake IDs

Per agents/rules/observability.md Principle 3, execution_id should be generated in the application using a Rust snowflake library, not via the DB-side noetl.snowflake_id() function. Reasons listed in the rule: spans need the id at span-creation time; retries are idempotent only with a stable id across attempts; cross-component publish doesn't have to wait for the INSERT round trip; tests need deterministic ids; multi-cluster sharding can't agree on a single DB-side sequence.

Currently the Rust server still calls the DB function on every event / command create. This wants to flip before R4 (DB sharding) — once tables are partitioned, the DB-side sequence either becomes per-partition (clusters within a shard) or has to be globally synchronized across shards (defeats the point). Likely an R1.5 / R2 prerequisite — track it explicitly in the R1 design doc.

Out-of-scope for Phase F

Multi-region — that's Phase G or later.
Per-tenant sharding — orthogonal concern; Phase F is plain execution_id modulo.
Dynamic re-sharding (adding shards on the fly) — Phase F lands a fixed N; resizing is later.

Umbrella: System Pool Design (#46) — interlocks with this umbrella.
Umbrella: Container Tool Callback (#43) — orthogonal.
Umbrella: Rust Worker Parity Gaps (#47, #48) — orthogonal.
ADR: System Worker Pool and WASM Plug-in Surface — covers the data access boundary that shapes Phase C.
noetl-server wiki: Runtime shape — implementation-level companion.
Data Access Boundary — the rule that shapes Phase C.

NoETL Dashboard

Home — overview
Repo Map
Releases
Sessions Log

Active Umbrellas

Secrets Wallet (#61) — SECURITY (design)
Rust Server Port (#49) — PRIMARY
Decoupled Context + Event Chain (#115) — RFC (design), reframes #101
Orchestrator Scaling (#101) — reframed by #115; consume side = #115 Phase 1
Event WAL + Derivable Storage (#104) — Round 01 (locator) PR open
WASM Plug-in Compilation (#105) — system-pool plug-in hot-reload (ADR Phase 4)
System Pool Design (#46) — PRIMARY
Regression Baseline Migration (#98) — e2e
Subscription / Listener Tool (#90) — RFC
Container Tool Callback (#43)
Rust Worker Parity Gaps (#47 · #48)
Event Envelope Reconciliation (#51 in TaskList)

Closed Umbrellas

Cursor Loop Mode (#100) — server v3.8.0 + tools v3.10.1, 2026-06-15
Transfer Tool Credentials (#99) — tools v3.10.0 + worker v5.22.0, 2026-06-14
Explicit Input Binding (#77) — v3.0.0 shipped 2026-06-09
Rust Worker Migration (#30)
Python Services → Rust (#45)

Conventions

Per-repo wikis

noetl/noetl wiki — app + DSL
noetl/server wiki — Rust control plane
noetl/worker wiki — Rust pull worker
noetl/tools wiki — tool registry crate
noetl/cli wiki — CLI + local mode
noetl/gateway wiki — gatekeeper
noetl/ops wiki — Helm + manifests
noetl/travel wiki — domain SPA reference
Docs site — engineer-facing architecture

Umbrella Rust Server Port

Umbrella — Rust Server Port (PRIMARY)

Goal

Visual — interlock with #46

Why now

Constraints (from latest architectural decisions)

Phases

Phase A — Read endpoints (no orchestrator dependency)

Phase B — Worker write boundary

Phase C — Internal endpoints for system pool (UNBLOCKS #46 Phase 2)

Phase D — Orchestrator engine port (the big lift)

Phase E — SSE + remaining endpoints

Phase F — Sharding design + cutover

Sharding sequence — request flow with N shards

Out of scope

First three sub-issues (next session opens these)

Recent activity

Phase D status (closed)

Phase D R5 — Replay engine port (Python noetl/server/api/replay/service.py ~1236 LoC → Rust)

Next concrete steps

Phase F — sharding survey (2026-06-05)

Natural shard boundary

Endpoint inventory (40+ routes)

DB table inventory

Three open design questions

Proposed Phase F decomposition (5 rounds, Phase D shape)

Load-bearing prerequisite — app-side snowflake IDs

Out-of-scope for Phase F

Related

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

NoETL Dashboard

Active Umbrellas

Closed Umbrellas

Conventions

Per-repo wikis

Clone this wiki locally

Phase D R5 — Replay engine port (Python `noetl/server/api/replay/service.py` ~1236 LoC → Rust)