Kadyapam edited this page Jun 21, 2026 · 266 revisions

NoETL Ecosystem Dashboard

Last refreshed: 2026-06-20 (Claude session — ✅ PROD CQRS rollout RECORDED (ops#200 → ops@d6633f6, ai-meta 08c73e5) + LIVE-PROD e2e validation of the gate-ON cutover. Ran the Rust regression + specialized playbooks against LIVE PROD (gke …noetl-cluster, server v3.39.1 / worker v5.40.2, PUBLISH_ONLY=true + STATE_BUILDER=offserver, materializer sole writer): 28/30 executions PASS (24/26 distinct fixtures + 4 composition-spawned children) — python/args/vars/loops/control-flow/output-select/large-result/actions/fan-out-reduce/duckdb/http/save-to-postgres/sub-playbook composition; every gate-ON execution COMPLETED with sole-writer (rows==distinct, 0 catalog_id=0, 0 __orchestrate__ event rows), clean chain (roots=1/terminals=1/dangling=0/walk==total), never-scan (worker state_builder_event_scans Δ0 / cumulative 0), materializer lag 0. The 2 FAILs (postgres_test pg_noetl_k8s / postgres_jsonb pg_local) are a prod-env credential-unreachable difference, NOT a cutover bug — both still produced a clean playbook.failed terminal + single-root chain (failure path is gate-ON-correct). No platform bug → no issue filed. Prod left healthy (health ok, lag 0, 0 pod restarts ~45 min). Footprint (uncleanable): tenant prefix prod-e2e-20260620-1946 — 26 catalog / 30 execs / 947 noetl.event rows. Refs #103/#107/#111. — prior: 🚀 PROD CQRS CUTOVER EXECUTED + gate-ON validated — server v3.39.1 / worker v5.40.2 rolled to prod GKE (gke_noetl-demo-19700101, ns noetl) and NOETL_EVENT_INGEST_PUBLISH_ONLY + NOETL_STATE_BUILDER=offserver flipped LIVE. Materializer is the sole noetl.event writer (event_ingest_published_total == materializer_acked_total); the off-server drive builds state from the noetl_events WAL with zero event-scans; 5 tenant validation execs all COMPLETED, chains roots=1/terminals=1/dangling=0; backlog 0, 0 restarts. 
A one-time owner-applied prev_event_id migration unblocked the write path — the runtime noetl role isn't the table owner, so the columns were provisioned as postgres via the live pgbouncer-embedded owner credential (no secret rotated/printed). GSM pg_noetl_k8s is stale → operator should rotate/realign it. The event-chain DDL skipped WARN persists by design (ownership checked before the IF-NOT-EXISTS skip). ops #200 (digest bumps + executed-rollout record); ai-meta pointer bump follows merge; one-command revert on standby. Closes the last #103 operator gap; puts #107/#115/#111 off-server into prod. — Earlier same day: ✅ #118 + #119 SHIPPED + gate-ON kind-validated + CLOSED — single- AND multi-replica off-server are now blemish-free (single-root incl. finalize, zero fallback) AND restart-robust (the WAL index rehydrates). noetl-server v3.39.1 (server#253, c5f8cb2) + noetl-worker v5.40.2 (worker#123, 48b0bde) + e2e (e2e#73, fe97d92). #118 — root cause corrected: the terminal event does go through ChainHeads.link_batch — the defect is a duplicate finalize. Under offserver+PUBLISH_ONLY single-replica, the first drive emits the terminal (chain-linked) + evicts the chain head; a straggler drive then rebuilds from the materializer-lagged WAL (state not terminal yet), drives again, and emits a SECOND playbook.completed that links to the now-None head → NULL prev_event_id orphan (2 roots) + a benign state-build event-scan. Fix = a bounded process-local FinalizedGuard (exactly-one-terminal-per-execution) suppressing the duplicate at emit_events before the chain linker (a suppressed duplicate never advances/consumes the head); gate-off byte-identical (a duplicate never occurs on the synchronous in-process drive); metric noetl_terminal_dedup_total{suppressed}. Rig gains a HARD terminals==1 per-exec assertion. 
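The #118 fix above can be pictured with a small sketch. It is written in Python purely for illustration (the shipped guard lives in the Rust server), and the class shape, the capacity value, and the terminal event-type names are assumptions, not the actual FinalizedGuard API. It shows the idea only: remember which executions already emitted a terminal, and drop a second terminal before it reaches the chain linker so the head is never consumed twice.

```python
from collections import OrderedDict

# Assumed terminal event-type names, for illustration only.
TERMINAL_TYPES = {"playbook.completed", "playbook.failed", "playbook_cancelled"}


class FinalizedGuard:
    """Bounded, process-local 'exactly one terminal per execution' filter (sketch)."""

    def __init__(self, capacity=10_000):
        self._seen = OrderedDict()   # execution_id -> True, oldest first
        self._capacity = capacity    # hypothetical bound on remembered executions
        self.suppressed = 0          # stand-in for a dedup counter metric

    def admit(self, execution_id, event_type):
        """Return True if the event may proceed to the chain linker."""
        if event_type not in TERMINAL_TYPES:
            return True              # non-terminal events are never filtered
        if execution_id in self._seen:
            self.suppressed += 1     # duplicate finalize: drop before linking
            return False
        self._seen[execution_id] = True
        if len(self._seen) > self._capacity:
            self._seen.popitem(last=False)   # evict the oldest entry
        return True


guard = FinalizedGuard(capacity=2)
first = guard.admit(101, "playbook.completed")    # first terminal passes
second = guard.admit(101, "playbook.completed")   # straggler drive is suppressed
```

Because a suppressed duplicate returns before any linking happens, it cannot produce the NULL-prev orphan root described above. The bound trades memory for a small window in which a very late straggler could slip through after eviction.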
#119 (the blocker that hid the #118 symptom — off-server WAL-drain index stall on worker restart): the authoritative drain used a durable noetl_state_builder consumer whose cursor persists across restarts while the in-memory WalEventIndex rebuilds empty → the cursor outran the fresh index → build_spine_to(expected_head) permanently Incomplete → off-server execs looped offserver_retry and never completed (so the #118 fork could not even be reached). Fix (worker-only, inside NOETL_STATE_BUILDER=offserver; PROD runs the in-server drive so untouched): the drain now defaults to an ephemeral DeliverPolicy::All consumer that rebuilds the full index from the retained noetl_events WAL on every boot (no persisted cursor to outrun; also correct for >1 worker pod — each holds the complete event set, not a load-balanced subset); instant revert NOETL_STATE_BUILDER_DURABLE=1; proof = one-shot index rehydrated… log + new noetl_worker_state_builder_indexed_executions gauge; never reintroduces a noetl.event scan. Gate-ON kind-validated (server 118-finalize v3.39.1 + worker 119-rehydrate v5.40.2, offserver+publish_only+audit_only+plugin_drive): restart-rehydration proven — a forced mid-flight delete --force of the system-pool pod → the new pod logged index rehydrated … indexed_executions=17 wal_events=200 (pre-fix this was 0 → the stall); single-replica 6/6 stress iterations / ~126 execs every chain roots=1(incl. terminal)/dangling=0/walk==rows/terminals=1/orch_events=0, zero build-scan + zero hot-path-scan; multi-replica (2-replica affinity StatefulSet) 21 execs COMPLETE, roots=1/terminals=1 all, forwarded_ok +202, kv_remote_hit +12, zero scans; 224 worker + 597 server lib tests + clippy green; baseline restored; PROD/defaults untouched. — prior: ✅ #117 SHIPPED — off-server from_events spine ordered by prev_event_id chain + walked from the real tip (expected_head), the high-concurrency fan-out reduce wedge. 
Under concurrent fan-out two branch completions arrive at the owner id-inverted, so emit_events stamps a higher-id event as the predecessor of a lower-id one; the worker tracked the head as max(event_id) but ChainHeads.link_batch advances the watermark to the last-arrived event (the real tip), so a max-id walk MISSED the inverted tip and the fan-in reduce never fired. worker-only, inside NOETL_STATE_BUILDER=offserver (PROD untouched); byte-identical to the old sort for monotonic chains; NOETL_OFFSERVER_SPINE_ORDER=event_id reverts. noetl-worker v5.40.1 (worker#122, baeae78) + e2e (#72, cdf1768). 2-replica affinity gate-ON stress: 6/6 iterations, 108/108 execs COMPLETE, 15 execs with a real prev_event_id > event_id inversion all fired reduce_customer + completed; never-scan + sole-writer + roots=1 hold. Single-replica 7/8 (the 1 fail a separate pre-existing terminal-finalize race, non-wedging). — prior: ✅ #116 PROGRAM-SCALE STEP 2 SHIPPED + multi-replica gate-ON validated → ai-meta pointers bumped: execution-affinity single-owner WRITE ORDERING; the off-server stack is now multi-replica chain-COHERENT. Step 1 (KV data coherence, v3.38.0) was necessary-not-sufficient — the command.issued prev-read (handlers::execute) and head CAS-advance (emit_events) are two non-atomic steps, so concurrent cross-replica emits forked the chain. Affinity closes it by routing every trigger (POST /api/events, which also fires the drive) to the single replica that sharding::ShardConfig::owns(execution_id) owns (stable XxHash64); a non-owner forwards (reverse-proxy POST, one-hop loop guard, degrade-to-local). On the owner the single-process drive lock + in-memory ChainHeads make the read→advance atomic, no distributed lock; KV is the genesis/handoff vehicle (owner resolves LOCAL → kv_remote_hit→0 by design). Chose forwarding over a per-drive lease — solves the fork AND the double-drive with one mechanism, reusing src/sharding.rs. 
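To make the #117 inversion concrete, here is a minimal Python sketch (illustrative only; the real spine builder is in the Rust worker). The event dictionaries and the `build_spine` helper are hypothetical. It contrasts walking `prev_event_id` from the real tip with walking from `max(event_id)` when a higher-id event was stamped as the predecessor of a lower-id one.

```python
def build_spine(events, expected_head):
    """Walk prev_event_id pointers from expected_head back to the root.

    Returns the events in root-to-head order, or None when a pointer
    leads to an event that is not indexed yet (an incomplete spine).
    """
    by_id = {e["event_id"]: e for e in events}
    spine, seen = [], set()
    cursor = expected_head
    while cursor is not None:
        event = by_id.get(cursor)
        if event is None or cursor in seen:
            return None              # missing link (or a cycle): not buildable yet
        seen.add(cursor)
        spine.append(event)
        cursor = event["prev_event_id"]
    spine.reverse()
    return spine


# Event 4 arrived before event 3, so 3 is the real tip and points back at 4.
events = [
    {"event_id": 1, "prev_event_id": None},
    {"event_id": 2, "prev_event_id": 1},
    {"event_id": 4, "prev_event_id": 2},
    {"event_id": 3, "prev_event_id": 4},
]

from_tip = [e["event_id"] for e in build_spine(events, expected_head=3)]
from_max = [e["event_id"] for e in build_spine(events, expected_head=4)]
```

Walking from the tip sees all four events, while walking from the maximum id stops short of event 3. In the scenario above, that missing event is the branch completion the fan-in reduce was waiting for.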
noetl-server v3.39.0 (server#252, 5e00d0a): src/affinity.rs (ExecutionAffinity + shard_index_from_hostname); flags NOETL_EXECUTION_AFFINITY / NOETL_PEER_URL_TEMPLATE / NOETL_SHARD_INDEX_FROM_HOSTNAME (all default off, prod unchanged); handle_event forwards + reconcile poller skips non-owned; metric noetl_execution_affinity_total{outcome} (forwarded_ok = the proof). e2e (e2e#71, 66b6e1b): 2-replica StatefulSet topology (manifests/replica-affinity/, distinct shard per pod via hostname ordinal + headless DNS) + deploy_replica_affinity_topology.sh + rig flipped HARD (forwarded_ok proof; kv_remote_hit informational under affinity). Multi-replica gate-ON kind-validated (2-replica StatefulSet; server offserver+audit_only+nats_kv+affinity+publish_only; worker offserver): NOETL_COHERENCE_DRIVE_AFFINITY=shipped PASS — linear/loop/fanout COMPLETE; every chain roots=1/dangling=0/walk==total (NO fork — exactly what forked without affinity); forwarded_ok +9; state_build_event_scans +0 + hotpath scan +0 (never-scan across replicas); sole-writer rows==distinct, __orchestrate__ event=0; degraded +0; single-replica unchanged. 595 server tests + clippy green; baseline restored. Known follow-up #117 (separate, pre-existing): the off-server from_events spine orders by event_id; under a chain-order≠id-order inversion (affinity's forwarding makes it likelier under high-concurrency fan-out) the fan-in reduce wedged on 1/9 execs in a 9-way concurrent run (chain stayed clean — NOT a fork; fix = order the spine by prev_event_id walk). Prod multi-replica verdict: write-ordering is COMPLETE (no fork) — prod can horizontally scale the off-server stack for linear/loop reliably; high-concurrency FAN-OUT needs #117 first. All affinity flags default off; PROD GKE untouched. 
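The affinity routing decision can be sketched roughly as follows. This is an illustrative Python model, not the Rust code in src/affinity.rs: the hash is a stdlib stand-in for the stable XxHash64 the server uses, and the function names, the `{index}` placeholder, and the example peer URL are assumptions.

```python
import hashlib
import re


def owner_shard(execution_id, shard_count):
    """Map an execution to one shard with a stable hash (stand-in for XxHash64)."""
    digest = hashlib.blake2b(str(execution_id).encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % shard_count


def shard_index_from_hostname(hostname):
    """Derive the shard index from a StatefulSet-style ordinal suffix, e.g. 'pod-1'."""
    match = re.search(r"-(\d+)$", hostname)
    if match is None:
        raise ValueError(f"no ordinal suffix in hostname: {hostname!r}")
    return int(match.group(1))


def route(execution_id, local_index, shard_count, already_forwarded, peer_url_template):
    """Decide whether to handle a trigger locally or forward it to the owner.

    already_forwarded models the one-hop loop guard: a request that was
    forwarded once is always handled locally rather than bounced again.
    """
    owner = owner_shard(execution_id, shard_count)
    if owner == local_index or already_forwarded:
        return ("local", None)
    return ("forward", peer_url_template.format(index=owner))


# Hypothetical template; the real NOETL_PEER_URL_TEMPLATE format is not shown here.
template = "http://noetl-server-{index}.noetl-server:8082"
local = shard_index_from_hostname("noetl-server-1")
decision = route(42, local, 2, already_forwarded=False, peer_url_template=template)
```

Since every trigger for one execution resolves to the same owner, the read-then-advance of the chain head happens inside a single process, which is why no distributed lock is needed in this design.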
— prior: ✅ #115 PROGRAM-SCALE STEP 1 SHIPPED — multi-replica coherence DATA LAYER (NATS-KV ChainHeads+ExecDescriptor, NOETL_REPLICA_COHERENCE=nats_kv, default local); server v3.38.0 8f39a79 + e2e e222877; single-replica bit-for-bit parity, 2-replica cross-replica resolves proven — but necessary-not-sufficient (chain forked without affinity → step 2 above). — prior: ✅ #115 PHASE 5 SHIPPED + gate-ON validated: atomic-working-item context (tenet 6) — the drive hands a worker only its minimal declared slice. NOETL_ATOMIC_ITEM_CONTEXT (default false, prod unchanged). #77 dependency resolved: Explicit Input Binding is CLOSED (BREAKING v3.0.0) — it shipped the declaration surface (input:/args: + set:input:); Phase 5 adds the missing extractor + drive narrowing (the server previously attached the full accumulated context to every command regardless of input:). noetl-server v3.37.0 (server#250, a96ade8): orchestrate-core::input_binding: analyze(step) statically extracts the base-context keys a step references (minijinja undeclared_variables; ctx.XX, bare step-name→key, injected roots→none), conservative (unbounded ref → full context); CommandBuilder/WorkflowOrchestrator::with_atomic_item_context narrow the persisted worker-bound context for plain non-loop steps (server-side render still runs against the full context); plugin OrchestrateInput/OrchestrateStateInput carry a #[serde(default)] flag; metric noetl_atomic_item_context_total{outcome}. noetl-worker v5.40.0 (worker#121, 2484d17): forwards the flag onto the off-server from_events drive input. e2e (e2e#69, 79505fa) atomic_item_context.yaml + kind_validate_atomic_item_context.sh. 
Gate-ON kind-validated (server p5-atomic, flag on, STATE_BUILDER=server + PLUGIN_DRIVE=true + PUBLISH_ONLY=true): flag-ON consumer render_context = [producer_a] ONLY (producer_b + start + steps + workload + execution_id + catalog_id + path all dropped), execution COMPLETED, narrowed metric +1; flag-OFF full 8-key context, COMPLETED (back-compat); offserver regression COMPLETED / __orchestrate__ event=0 / dispatched+applied advance / system-pool isolation / lag-0; 7 input_binding + 132 orchestrate-core + 584 server + 10 worker tests + clippy green; baseline restored. Realizes tenet 6; remainder = program-scale (per-shard WAL, multi-replica descriptor coherence). — prior: ✅ #115 PHASE 6 SHIPPED + gate-ON LITERAL-ZERO validated → ai-meta pointers bumped: the hot-path noetl.event read class is RETIRED; the table is AUDIT-ONLY. NOETL_EVENT_READ_PATH=event_scan|audit_only (default event_scan, prod unchanged). Phase 4 cleared the drive scan under offserver; Phase 6 retires the remaining lifecycle readers (the WHERE execution_id replay class outside the drive). noetl-server v3.36.0 (server#249, b71ca1d): under audit_only, get_catalog_id (per-ingest) + inherit_parent_trace + the subscription dedup-audit + container-callback catalog/existence reads serve from the in-memory execute-time ExecDescriptor; a cold descriptor (post-terminal straggler after eviction / restart) resolves catalog_id from noetl.command (synchronous queue) — never a noetl.event scan. Proof metric noetl_event_hotpath_reads_total{site,outcome}. ops (ops#199, e5b0737) pins event_scan on the prod server manifest (operator-gated flip). e2e (e2e#67+#68, 0ab3c0a) kind_validate_event_read_path_phase6.sh. 
Gate-ON kind-validated: hot-path scan Δ0 (served_descriptor +96 + served_command +3), drive state_build_total Δ0 + event_scans Δ0 ⇒ ZERO noetl.event scans anywhere on the hot path, end-to-end; linear/loop/fan-out/output_select COMPLETE; sole-writer + lag-0; audit still works (direct SELECT + status COMPLETED + replay event_count=25); gate rig PASS (no regression); 585 server tests + clippy green; baseline restored; default event_scan, prod unchanged. The RFC's never-scan end state (tenet 3) is reached under the flag; remainder = Phase 5 (atomic-item, needs #77) + program-scale (per-shard WAL, multi-replica descriptor coherence). — prior: ✅ #115 PHASE 4 REMAINDER SHIPPED — the off-server drive edge is now STATELESS. Under NOETL_STATE_BUILDER=offserver the server performs ZERO state rebuild + ZERO noetl.event reads on the drive path. noetl-server v3.35.0 (server#248, 6e30fc3) — a per-execution ExecDescriptor (catalog_id + routing seeded at playbook_started; terminal stamped at the emit_events chokepoint) lets trigger_orchestrator_inner/apply_worker_orchestration route + apply WITHOUT building state; expected_head from the in-memory ChainHeads; cold descriptor (restart) falls through to the server-built path (re-seeds) so chain_walk + event_scan stay fallbacks. noetl-worker v5.39.0 (worker#120, 8e1f651) — resolves trigger_event_type off the WAL from trigger_event_id; a stateless command with an incomplete WAL after the bounded retry is a benign __offserver_retry__ no-op the reconcile poller re-drives (never a partial state). e2e (e2e#66, f4bb342) — the rig now asserts noetl_state_build_total Δ0 + dispatched_offserver_stateless/applied_stateless advance. Gate-ON kind-validated: state_build_total Δ0 + event_scans Δ0 across linear(13)/loop(62)/fan-out(25)/output_select(31), all COMPLETE, offserver==server parity, sole-writer 25==25, lag-0, materializer dup 0; 583 server + 218 worker tests + clippy green; baseline restored; default server, prod unchanged. 
Completes #107 step 2 server-side (state rebuild + event reads removed from the drive path under the flag); Phase 5 (atomic-item context, needs #77) + Phase 6 (retire the event read path) remain. — prior: ✅ #115 PHASE 4 DRIVE CUTOVER SHIPPED + gate-ON parity-validated → ai-meta pointers bumped. The orchestrator drive now constructs its WorkflowState off the server on the system worker pool, from the noetl_events WAL spine (wasm run/from_events), under NOETL_STATE_BUILDER=offserver (default server, prod unchanged): noetl-worker v5.38.0 (worker#119, bef13e5) — shared WAL index + authoritative durable noetl_state_builder consumer + dispatch_wasm builds from the spine with a staleness guard (expected_head); noetl-server v3.34.0 (server#247, f0922bd) — marks the offserver command + carries expected_head; ops (ops#198, b1da9f1) the system-pool knob; e2e (e2e#65, b38b6dd) the committed gate-ON parity rig. Validated gate-ON: offserver==server fingerprint (fan-in fires once), worker served +3 / event_scans 0 / wal_events +25, server state_build_event_scans 0, cache cold+1/incr+2, sole-writer 25==25 / lag-0; linear + loop legs COMPLETED off-server; baseline restored. Phase-4 remainder (remove residual server chain-walk bookkeeping → fully zero server reads) + Phases 5–6 remain. — prior: ✅ #115 PHASE 4 (off-server state builder) KERNEL + FLAG SHIPPED + shadow kind-validated. The pool-side off-server builder reconstructs orchestrator WorkflowState from the noetl_events WAL (not the materialized noetl.event table): a per-execution chain index walks the prev_event_id spine head→root and caches the built spine keyed by the immutable chain head, advancing only the new tail on the next trigger. Shipped: noetl-worker v5.37.0 (worker#118, fef961c) — the state_builder kernel + a live WAL shadow loop (NOETL_STATE_BUILDER_SHADOW, default off) + metrics; noetl-server v3.33.0 (server#246, 3e6006d) — the NOETL_STATE_BUILDER=offserver|server flag scaffold (default server, prod unchanged). 
Validated gate-ON on live kind (PUBLISH_ONLY + off-server drive + materializer sole-writer): the shadow replayed the WAL and chain-walked spines whose indexed==spine sizes exactly match the Phase-3 topologies (linear 13, loop 62, fan-out 25, output_select 31, storage_tiers 55) — parity by construction; WAL-read proven (wal_events_total=993) with ZERO noetl.event scans (event_scans_total=0); cache proven (cold_rebuild=28 on replay/restart + incremental=21 live tail-advance, e.g. a fresh fan-out indexed live as Incremental(5), indexed==spine==25 == DB event_rows 25); a fresh fan-out COMPLETED gate-ON with event_rows==distinct / 0 __orchestrate__ event rows / materializer pending=0 / project_errors=0; 8 worker unit tests + 2 server config tests + clippy green. Baseline restored. PROD GKE untouched; no gate/mode/builder default changed. Phase 5 (atomic-item context, needs #77) / Phase 6 (retire event read path) + the off-server drive cutover (offserver wiring) remain. — prior: ✅ #115 PHASE 3 MERGED → server v3.32.0. server#245 self-merged → server v3.32.0 (8338417) — the chain-walk state builder. Behind NOETL_STATE_BUILD_MODE=chain_walk|event_scan (default event_scan, prod unchanged), the drive rebuilds WorkflowState by walking prev_event_id head→root (head from the in-memory ChainHeads watermark, then (execution_id,event_id) PK lookups — never a WHERE execution_id scan) and feeding the same from_events (orchestrate-core unchanged; parity by construction). Conservative fallback to event-scan on cold-head / lag / non-genesis. Validated gate-ON across 5 topologies (linear/loop/fan-out/output_select/storage_tiers): parity 41/41 MATCH 0-mismatch (tx-isolated, normalized), NO-SCAN proven (event_scans_total=0 across 40 drive builds, 1064 PK hops, 0 fallbacks), all fixtures COMPLETE in chain_walk mode, sole-writer + lag-0 + gate rig PASS, 577 lib tests + clippy green. 
Phase 4 (off-server state builder + WAL cache) builds on this — now in progress (move the chain-walk state construction OFF the server onto the system worker pool, reading the WAL/NATS stream, pool-side cache keyed by the immutable chain head + incremental tail-advance). — prior: ✅ #115 PHASE 2 MERGED — one-level prev_event_id event chain (server#244 → server v3.31.0 f5bd4a8 + noetl#667 → noetl ecd16a2; ai-meta pointer bump afdb365). — prior: ✅ #115 PHASE 2 IMPLEMENTED + kind-validated — one-level prev_event_id event chain. Each noetl.event carries prev_event_id (the immediately-previous event in causal order) + each noetl.command the issuing-event link, so per-execution events form a walkable singly-linked list followable pointer-by-pointer without scanning noetl.event (additive; no reader consumes it yet — that's Phase 3). The emit chokepoint emit_events stamps the event link from a per-execution chain-head watermark (ChainHeads) — one path covers drive events + command.issued + worker-lifecycle (via handle_event) on both the gate-off INSERT and the gate-on publish (to_stream_json), the materializer persisting it; the command link = the real step.enter/unblocking completion so cursor-fan-out bodies share their branch origin (§4.4). Server-only (no orchestrate-core change — the chokepoint subsumes the "core carries prev" design). Chain-correctness proven gate-ON (PUBLISH_ONLY + off-server drive + materializer sole-writer) across 6 executions — linear 13/13, loop 62/62, fan-out 25/25 (real shared branch origin), sub-playbook 46/46, + Phase-1 output_select 31/31 & storage_tiers 55/55 bounded: each has 1 root, 0 dangling / 0 duplicate event prev, 1 head, pointer-walk == full sequence (no gaps), real-step command dangling=0; kind_validate_orchestrate_gate.sh PASS (sole-writer 25==25, 0 dup cycles, catalog0=0, lag 0); 573 server lib tests + clippy green. PRs server#244 + noetl#667 open, awaiting merge → pointer bump; PROD untouched, no gate default changed. 
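The chain-correctness figures quoted throughout this log (roots, dangling, heads, walk versus total) can be computed from `(event_id, prev_event_id)` pairs alone. The Python sketch below is an illustration of those invariants, not the validation rig itself; the `chain_report` helper and its field names are hypothetical.

```python
def chain_report(events):
    """Summarise chain health from (event_id, prev_event_id) pairs."""
    ids = {event_id for event_id, _ in events}
    prevs = [prev for _, prev in events if prev is not None]
    referenced = set(prevs)

    roots = sum(1 for _, prev in events if prev is None)      # events with no predecessor
    dangling = sum(1 for prev in prevs if prev not in ids)    # pointers to unknown events
    duplicate_prev = len(prevs) - len(referenced)             # two events sharing one prev
    heads = [event_id for event_id in ids if event_id not in referenced]

    # Pointer-walk from the single head; a clean chain visits every row.
    walk = 0
    if len(heads) == 1:
        prev_of = dict(events)
        cursor, seen = heads[0], set()
        while cursor is not None and cursor in prev_of and cursor not in seen:
            seen.add(cursor)
            walk += 1
            cursor = prev_of[cursor]

    return {"total": len(events), "roots": roots, "dangling": dangling,
            "duplicate_prev": duplicate_prev, "heads": len(heads), "walk": walk}


clean = chain_report([(1, None), (2, 1), (3, 2)])
# A second event with a NULL prev (like the duplicate finalize in #118) adds a root.
orphaned = chain_report([(1, None), (2, 1), (3, 2), (4, None)])
```

A healthy execution reports one root, one head, zero dangling pointers, and a walk equal to the total row count. Any fork or orphan shows up as a second root or head, which is also why the walk is only attempted when exactly one head exists.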
— prior: ✅ #115 PHASE 1 SHIPPED — references-in-state consume side; closed #113 + #114 (all 9 stalled fixtures green gate-ON). Worker resolve_context_references made selective (worker#117 → v5.36.0): resolve a noetl:// ref only when this command's tool input binds the step's bulk (a path the bounded extracted summary can't satisfy — whole-object bind, .data over a summarised rowset, array element past [0], _truncated node); predicate / scalar / _ref access reads off the summary with no store round-trip, and an upstream result a step doesn't consume stays a reference (foreign bulk never inflates the render). Server hydrate_result_references surfaces _ref/_store/_uri on the kept summary (so {{ step._ref }} lazy-load + {{ step._ref is defined }}/{{ step._store }} predicates resolve without bulk) and refs_in_state default flipped to true (server#243 → v3.30.0) — references stay out of state + commands by default (the #114 kind experiment proved the flip needs this consume side first; it landed in the same change set). Validated kind gate-ON (PUBLISH_ONLY + off-server drive + materializer sole-writer): all 9 of #113's stalls reach playbook.completed — output_select / storage_tiers / lease_expiry / pipeline_heavy_payload / save_edge_cases / large_result_extraction / http_to_postgres_{direct,simple,bulk_python}; max next-command context across the 9 = 412KB (was 1.32MB), storage_tiers' 17.4MB drive-state gone (36KB), lease_expiry's 201 spinning orch cmds → 16; 0 __orchestrate__ event rows every run (sole-writer intact), materializer lag = 0; offserver rig PASS (decode_error +0) + 6-fixture fast-path regression sample (fan-out/loops/conditionals) green. #101's consume side done; off-server-drive prod cutover (#107/#111) unblocked on the payload/state-size axis. PROD GKE untouched (pre-#108 in-server drive); the refs_in_state flip is a code default — prod runtime gates unchanged. ai-meta → worker 0a66b41 + server 3014f6f. 
— prior: 📐 RFC FILED (#115) — Decouple result data from context: reference-only schema + one-level event chain + worker-side state builder. A design deliverable (no feature code) directed by the platform owner, reframing the off-server-drive / event-sourcing model around six tenets: the root cause is unbounded context growth (every step passes the entire accumulated context forward — build_context folds all prior step results, orchestrator.rs clones it per arc, build_command hands the whole map to each command; evidence: 17.4MB drive-state for storage_tiers, 1,324,800B next-command context for output_select), fixed by (2) reference-only noetl.* schema (data lives in the result/object store by URN; rows carry {ref, extracted} only), (3) a never-scan-noetl.event invariant on the state/drive read path (beyond #103's sole-writer — removes event-table reads), (4) a one-level event chain (command→prev-event→prev-event, walked pointer-by-pointer, not a full replay), (5) an off-server system/state_builder WASM playbook that walks the chain + caches state on the system pool, and (6) an atomic-working-item context contract (a tool worker receives only its minimal item slice, not the whole playbook). Full RFC on the wiki (Umbrella: Decoupled Context + Event Chain); 6-phase plan with Phase 1 = #101's consume side (worker selective render-time ref-resolution — the immediate unblock for the 3 stuck #114 fixtures + the off-server-drive prod cutover). Reframes #101 (subsumed; its incremental-state cache + resolve-into-state is inverted to never-data-in-state + never-scan), subsumes the storage tier of #104, is the state-construction design for #107 steps 2–4, unblocks #111, builds on #103's write path (the sole-writer gate is unaffected — we did NOT change the kind gate state), and leaves #114 as a safety cap. Filed: issue #115 + board 3 (Todo) + this wiki page; #101/#107/#111 cross-linked. — prior: 🐞 #114 oversized-command.issued offload shipped (server v3.29.5). 
Under the publish-only gate a command.issued event carrying the full upstream context (~1.32MB) exceeded NATS max_payload (1MB) → publish never acked → wedge. Fix: when a command context exceeds NOETL_COMMAND_CONTEXT_MAX_BYTES (default 512KB) persist_engine_command(s) offload it to noetl.result_store with a tiny {__context_ref__} marker; get_command/claim_command resolve it before the worker sees it (metrics context_offloaded/context_ref_resolved). server#242→v3.29.5 + rig phase 8 e2e#64. Kind gate-ON: rig PASS (new test_oversize_command_context COMPLETED, max command.issued ctx 585B, offload+resolve fired, 0 __orchestrate__ event rows, lag 0); every command.issued event <1MB across all fixtures; 6 of #113's 9 large-context fixtures now COMPLETE. Chose ref-on-oversize over refs_in_state=true (candidate #1): a kind experiment proved refs_in_state=true fixes the state-bloat (lease_expiry completes) but breaks bulk-consuming fixtures (storage_tiers/output_select fail at the bulk step) because the worker render-time ref-resolution isn't implemented — so the default stays false. The remaining 3 fixtures + the off-server-drive cutover (#107/#111) are now blocked on the refs_in_state consume side (#101): the __orchestrate__ drive-state bloat (17.4MB for storage_tiers) + the _ref/bulk-resolve gap. No prod default flipped; prod is pre-#108 in-server drive (unaffected). — prior: 🐞 #113 off-server-drive payload-size + cancel fix shipped (server v3.29.4). Fixed the worker-driven drive stall when an __orchestrate__ result exceeds the 100KB inline budget: the worker offloads it to the durable result store with only a reference.ref (no inline output_b64), and apply_worker_orchestration now resolves+decodes that ref (server#241, metric ref_resolved) instead of dropping the drive decision → non-convergent re-loop; plus cancel now stops the drive (match underscore playbook_cancelled + ExecutionState::is_terminal terminal guard evicts the orch-cache, no restart). 
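The #114 ref-on-oversize offload described above follows a common pattern: measure the encoded context, and when it exceeds the limit, store it out of band and pass a small marker instead. The Python sketch below illustrates that pattern under assumptions: a plain dict stands in for noetl.result_store, the key format is invented, and the helper names are hypothetical rather than the real persist_engine_command or claim_command functions.

```python
import json
import uuid

COMMAND_CONTEXT_MAX_BYTES = 512 * 1024   # mirrors the 512KB default quoted above


def persist_command_context(context, store, max_bytes=COMMAND_CONTEXT_MAX_BYTES):
    """Return the context inline, or a tiny marker if it is too large to publish."""
    encoded = json.dumps(context).encode("utf-8")
    if len(encoded) <= max_bytes:
        return context                           # small enough: stays inline
    key = f"ctx-{uuid.uuid4().hex}"              # hypothetical store key format
    store[key] = encoded                         # offload the bulk out of band
    return {"__context_ref__": key}


def resolve_command_context(stored, store):
    """Inverse step, run before the worker sees the command."""
    if isinstance(stored, dict) and set(stored) == {"__context_ref__"}:
        return json.loads(store[stored["__context_ref__"]])
    return stored


result_store = {}                                # stand-in for the durable store
big_context = {"rows": "x" * 2000}
marker = persist_command_context(big_context, result_store, max_bytes=1024)
restored = resolve_command_context(marker, result_store)
```

Measuring the encoded size rather than the in-memory object keeps the check aligned with what is actually published, which is the quantity the broker's payload limit applies to.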
Companion convergence rig e2e#63. Kind gate-ON proven (785KB drive result → ref_resolved→COMPLETED, 0 decode WARNs; cancel froze a drive-loop instantly; materializer sole-writer lag 0). 5/9 #113 large-context fixtures COMPLETE; the other 4 hit a DISTINCT oversized-command.issued (full upstream context embedded → >1MB NATS payload) stall → #114 (#113 stays open until all 9 close). No prod default flipped; prod is pre-#108 in-server drive (unaffected). — prior: 🚀 #103 GKE pre-flip PREP landed (staged, NO traffic flip, NO PUBLISH_ONLY). Verified live that prod already runs the full Rust stack (the #49 Python→Rust cutover is done; the noetl Service selector is already app=noetl-server-rust; live images are pre-#103 batch-dispatch-v1/cursor-100), both flip secrets already exist (NOETL_ENCRYPTION_KEY + noetl-internal-api-token), and prod monitoring is Google Managed Prometheus, not VictoriaMetrics. Pushed the post-#103 images to the prod Artifact Registry (server v3.29.3 + worker v5.35.0, amd64); applied + verified GMP monitoring (PodMonitoring for worker+server /metrics — the noetl namespace had none, so app metrics weren't scraped at all — + materializer-lag Rules, translated from the kind VMRule; up{namespace="noetl"}=4 live series); staged the roll-forward manifests (server→v3.29.3 gate-off, system-pool→v5.35.0 materializer-off) in a PR not applied (they roll live workloads). Operator-gated remainder (surfaced, not done): roll the live images, enable the materializer shadow, wire the GMP managedAlertmanager pager, then flip. No prod default changed. — prior: 🛡️ #103 materializer-lag GUARDRAIL shipped — the pre-flip observability gate is now in place. The server was already FLIP-READY; the remaining operator gate was a materializer-lag metric + alert so the staged PUBLISH_ONLY flip is safe and one revert away. 
Shipped (default-off): worker #116 → v5.35.0 extends the JetStream lag poller to track the noetl_events/noetl_materializer consumer on an independent task (so a stalled/dead materializer loop — which can't report its own lag — still surfaces as a climbing noetl_worker_nats_consumer_pending{consumer="noetl_materializer"} gauge); ops #195+#196 add the VMRule (backlog warning/critical/growing + stall-under-gate + project-errors + absent-under-gate), a worker /metrics VMServiceScrape (was unscraped), VMAlert enabled, a Grafana dashboard, and the flip runbook noetl-cqrs-publish-only-flip.md (pre-flip green-baseline check + one-command revert). Kind-proven the full cycle: green baseline (backlog 0, published==projected==acked) → induced lag (materializer fault-injected while events publish under the gate) → backlog gauge climbs 0→684 via the independent poller, alerts fire (backlog warning+critical + stall) → recover (fault removed) → materializer drains backlog→0 idempotently (0 dup, 0 loss), alerts clear. Flip-readiness now includes the monitoring gate; PUBLISH_ONLY stays default-off, no prod default changed. — prior: 🎯 #103 server cutover COMPLETE — the server is FLIP-READY. The last of the three flip blockers is closed: the two ExecutionService terminal writers (POST /cancel → playbook_cancelled, POST /finalize → playbook_completed/playbook_failed) now route through the emit_event chokepoint, so they honour NOETL_EVENT_INGEST_PUBLISH_ONLY like the other 13 producers instead of writing noetl.event synchronously under the gate. Shipped: server #240 → v3.29.3 (b6e5d31, default-off) + dual-mode e2e rig kind_validate_cancel_finalize_gate.sh (e2e#62). 
Kind-proven both modes: gate-OFF cancel/finalize INSERT synchronously, byte-identical columns (error preserved), published delta +0, natural completion still COMPLETED; gate-ON noetl_event_ingest_published_total{playbook_cancelled}=1/{playbook_failed}=1 (PUBLISHED, not inserted), materializer (system pool) is the sole writer, both executions reach the correct terminal state, rows==distinct, 0 catalog_id=0, no loss/dup. No remaining synchronous server noetl.event writers on the producer path under the gate — the server is a complete non-writer of the event log when PUBLISH_ONLY is on. Flipping PUBLISH_ONLY on is now a staged operator decision (behind a materializer-lag alert, one revert away); no prod default changed. — prior: #104 off-server-drive × gate reconciliation PROVEN (server v3.29.2, cold_rebuild); #103 ack-after-materialize durability resolved (tools 3.13.0 + worker 5.34.0 + ops#194).) Refresh cadence: every session that lands meaningful cross-repo work (per agents/rules/wiki-maintenance.md Rule 0a)

Standing direction (2026-06-04). Per memory entry, Python tiers are deprioritized. Forward Rust-only e2e work is tracked under #54 (Phase F R5). Python pieces stay deployable for backwards-compat on GKE but are NOT a target for new feature work.

Single pane of glass for the NoETL platform. Every active umbrella, every submodule, every release lands here so a single page shows what's in flight, what shipped, what's next.

Convention. This wiki is the cross-repo dashboard. Per-repo wikis (e.g. noetl/server wiki, noetl/ops wiki) document that repo's surface; this wiki documents the system of repos. See Wiki convention for the split.

Active umbrellas

Umbrellas open in the ai-task queue: the Rust server parity-port umbrella (#49), the orchestrator-scaling work (#101), the new event-WAL + derivable-storage model (#104), the Rust regression baseline migration (#98), the Python-era deploy legacy cleanup (#97), and the postgres timestamptz NaiveDateTime bug (#95). #100 (cursor/claim loop mode) closed 2026-06-15 — server v3.8.0 + tools v3.10.1; test_pft_flow_v2 all_passed:true on kind (see Recently closed). #99 (transfer-tool credential aliases) closed 2026-06-14 via tools#65 + worker#87 + e2e#58 (see Recently closed). The subscription / listener tool RFC (#90) closed 2026-06-12 with all 7 phases shipped + live-proven (see Recently closed); refinement follow-ups #91#94

# Opened Last update Umbrella Status Wiki page
#107 2026-06-17 2026-06-17 Program: Distributed Multitenant OS — Server Dissolution → Global Grid The strategic roof over #101–#105. Blueprint names NoETL a distributed multitenant OS (server→stateless edge; NATS WAL + object store the only durable state; processing = event-driven system playbooks on a sharded grid; foundation for quantum-cloud-hybrid). 5-step path: CQRS cutover (step 1, shadow green) → orchestrator-as-plug-in (#108, done 2026-06-18 — drive core fully wasm-resident (#109 closed); the system/orchestrate plug-in compiles to a 0-import .wasm (server#224), runs identically to native in wasmtime (server#225), is seeded into the registry on boot + servable (server#226), and drives the real workload identically live — shadow over the 10×1000 PFT, 529 evals 0 mismatch (server#227); worker-driven cutover — the drive runs OFF-SERVER on the pool, kind-validated: entry/run_state dispatch (worker#113) + apply_orchestration_result (server#228) + the flag-gated scheduler/apply/state-guard (server#229, NOETL_ORCHESTRATE_PLUGIN_DRIVE); simple_python drove start→end→COMPLETED through the worker round-trip. #108 CLOSED 2026-06-18 — (c) the default-flip shipped (server#233, v3.28.0, drive default ON) after a clean scale soak; orchestrator-as-plug-in is done; the in-server shadow + wasmtime server dep retired in #110 / server#234) → per-shard WAL → drop Postgres → cross-shard federation. docs blueprint
#115 2026-06-19 2026-06-20 RFC: Decouple result data from context — reference-only schema + one-level event chain + worker-side state builder PROGRAM-SCALE STEP 2 SHIPPED + multi-replica validated 2026-06-20 — execution-affinity write ordering (#116). Step 1 (KV coherence, server v3.38.0) was necessary-not-sufficient; affinity routes every trigger for an execution to the replica that ShardConfig::owns it (non-owner forwards) so the read→advance is atomic per execution → 2+ replicas produce one unforked chain. server#252 → server v3.39.0 5e00d0a (src/affinity.rs; NOETL_EXECUTION_AFFINITY/NOETL_PEER_URL_TEMPLATE/NOETL_SHARD_INDEX_FROM_HOSTNAME, all default off) + e2e#71 66b6e1b (2-replica StatefulSet topology + rig HARD gate). Multi-replica gate-ON kind PASS: linear/loop/fanout COMPLETE, every chain roots=1/dangling=0/walk==total, forwarded_ok proof, never-scan + sole-writer across replicas; single-replica unchanged; 595 server tests + clippy green; baseline restored. Remaining #117: off-server from_events spine ordered by event_id wedges fan-in under a chain-order≠id-order inversion (affinity + high-concurrency fanout) — fix = order spine by prev_event_id walk; linear/loop already reliable. PROD GKE untouched; all affinity flags default off. Phase 1 SHIPPED 2026-06-19 — references-in-state consume side. Worker resolve_context_references made selective (resolve a noetl:// ref only when this command's tool input binds the step's bulk — a path the bounded extracted summary can't satisfy; predicate / scalar / _ref access reads off the summary; worker#117 v5.36.0). Server hydrate_result_references surfaces _ref/_store/_uri on the kept summary + refs_in_state default flipped to true (server#243 v3.30.0). Kind gate-ON: all 9 #113 stalls COMPLETE (max command ctx 412KB, 0 __orchestrate__ event rows, materializer lag 0); the 3 targets bounded (output_select 1.32MB→10KB, storage_tiers 17.4MB→36KB, lease_expiry 201 spinning orch cmds→16). Closed #113 + #114. 
Off-server-drive prod cutover (#107/#111) unblocked on the size axis; PROD GKE untouched. Phase 2 MERGED 2026-06-19 — one-level prev_event_id event chain (server#244 → server v3.31.0 f5bd4a8 + noetl#667 → noetl ecd16a2; ai-meta pointer afdb365): each noetl.event/noetl.command carries the chain link, stamped at the emit chokepoint from a per-execution chain-head watermark, covering both gate paths + the materializer; chain-correctness proven walkable / 1-root / no-gap / no-scan across 6 gate-ON executions (incl. a real fan-out branch origin), sole-writer + Phase-1 bounded sizes intact, 573 tests + clippy green. Post-merge verified on live kind (both prev_event_id columns present, server image reflects merged code, gate-ON baseline live). Phase 3 MERGED 2026-06-19 — chain-walk state builder (server-side, flagged) (server#245 → server v3.32.0 8338417; ai-meta pointer bumped): the drive rebuilds WorkflowState by walking prev_event_id head→root (in-memory ChainHeads head + (execution_id,event_id) PK lookups, no WHERE execution_id scan) → same from_events (orchestrate-core unchanged; parity by construction), behind NOETL_STATE_BUILD_MODE=chain_walk|event_scan (default event_scan, prod unchanged), event-scan kept as the fallback (cold-head / lag / non-genesis). Gate-ON validated: parity 41/41 MATCH 0-mismatch (tx-isolated), NO-SCAN proven (event_scans_total=0, 40 builds / 1064 PK hops / 0 fallbacks), all topologies COMPLETE, sole-writer + lag-0 + gate rig PASS, 577 tests + clippy green. Self-merged (no classifier block). 
**Phase 4 KERNEL + FLAG SHIPPED + shadow kind-validated** ([worker#118](https://github.com/noetl/worker/pull/118) → worker **v5.37.0** fef961c + [server#246](https://github.com/noetl/server/pull/246) → server **v3.33.0** 3e6006d): the pool-side state_builder reconstructs WorkflowState from the **noetl_events WAL** — a per-execution chain index walks prev_event_id head→root, caching the spine keyed by the immutable chain head + incremental tail-advance + cold-rebuild on miss/restart. A live WAL **shadow** loop (NOETL_STATE_BUILDER_SHADOW, default off) + the NOETL_STATE_BUILDER=offserver
#49 2026-06-02 2026-06-14 Rust server FastAPI parity port — primary server v3.6.0 (system worker pool + cleanup/purge endpoint, server#193); prod is Rust-only (server-rust + worker-rust + system pool). Remaining entangled refactor tracked in #97. Umbrella: Rust Server Port
#101 2026-06-15 2026-06-16 Orchestrator scaling: incremental state + results-by-reference resolution In progress — PRs open (server#197 incremental OrchStateCache + hydrate_result_references; worker#89 NOETL_EVENT_RESULT_CONTEXT_MAX_BYTES). References-in-state flag-on stall fixed + re-validated 2026-06-16 (worker feat/refs-in-state-extracted: build_extracted now a bounded navigable structural summary so output.data.rows[0].<field> resolves off the reference; PFT advanced 13→253 events, 31 refs active). Awaiting merge + pointer bumps. REFRAMED 2026-06-19 by #115 — the RFC inverts #101's "resolve refs into a rebuilt WorkflowState" to "never put data in state, never scan events to build state". Consume side DONE 2026-06-19 (#115 Phase 1 — worker#117 v5.36.0 + server#243 v3.30.0): worker selective render-time ref-resolution + _ref/_store surfacing + refs_in_state default true; references now stay out of state/commands by default. The incremental OrchStateCache is superseded by the immutable-chain cache in later #115 phases. Umbrella: Orchestrator Scaling
#103 2026-06-15 2026-06-19 Step 2 — CQRS event log: events to JetStream + batch projector (write-path scaler) GKE pre-flip PREP landed (2026-06-19) — prod images pushed, GMP monitoring live, manifests staged; NO traffic flip / NO PUBLISH_ONLY. Prod verified already-Rust (pre-#103 images), secrets present, monitoring = GMP (not VM); operator-gated remainder = roll images → materializer shadow → pager → flip. Server cutover COMPLETE — FLIP-READY (default-off). All three flip blockers closed: (1) ack-after-materialize durability (deferred ack + worker materializer loop), (2) off-server-drive × gate reconciliation (#104, server v3.29.2 cold_rebuild), (3) the 2 ExecutionService cancel/finalize sites now route through the emit_event chokepoint (server#240 → v3.29.3, e2e rig e2e#62). No remaining synchronous server noetl.event writers under the gate — kind-proven both modes (gate-off byte-identical INSERT; gate-on cancel/finalize PUBLISHED, materializer sole writer, terminal state reached, 0 loss/dup). Flipping PUBLISH_ONLY on is a staged operator decision (materializer-lag alert, one revert away); no prod default changed. Materializer-lag guardrail SHIPPED 2026-06-19 (worker #116 v5.35.0 lag gauge on an independent poller + ops #195/#196 VMRule + worker scrape + VMAlert + dashboard + flip runbook) — kind-proven induce→fire→recover→clear; flip-readiness now includes the observability gate. Umbrella: Orchestrator Scaling
#102 2026-06-15 2026-06-16 Step 1 orchestrator throughput: batch event-log writes + frame-batch cursor body In progress — Part A landed (server#198 → v3.10.0), validated on GKE (1×50 PFT COMPLETED, batch event-log path). Part B (frame-batch cursor body) ahead. Umbrella: Orchestrator Scaling
#104 2026-06-16 2026-06-19 Event WAL + derivable result storage: NATS-as-WAL, logical-URI naming, Feather tier In progress — off-server-drive × gate reconciliation PROVEN (2026-06-19): gate-ON (PUBLISH_ONLY=true) with the off-server drive (PLUGIN_DRIVE=true) + materializer sole writer is now green on kind (the combo #103 left unproven) — fresh exec + cursor fan-out → COMPLETED, server wrote 0 noetl.event rows (all PUBLISHED), materializer materialized all exactly once (rows==distinct ids, 0 dup), read-your-writes held via the relocated trigger. Server #238 → v3.29.2 rebuilds WorkflowState from the durable log on cold-cache apply (crash-recovery, kind-proven cold_rebuild); committed e2e rig kind_validate_orchestrate_gate.sh (e2e#61). This unblocked the #103 flip (the 2 cancel/finalize sites are now also done — server v3.29.3 #240; #103 is FLIP-READY). Prior: naming foundation noetl_tools::locator (tools#68/#70, v3.12.0) + worker URI stamp (worker#99); blueprint (docs#180). Umbrella: Event WAL Storage
#98 2026-06-14 2026-06-14 Grow the Rust regression baseline: migrate Python-era e2e fixtures Snowflake key-pair JWT validated — the last external-tool gap is closed. tools v3.9.2 (tools#62/#63/#64); create_sf_database + setup_sf_table COMPLETED via JWT on kind. Transfer step deferred to #99. Core green: 64 fixtures.
#97 2026-06-14 2026-06-14 Retire remaining Python-era deploy legacy (manifests, kind automation, helm chart) Open — Todo. Python manifests, kind redeploy automation that hardcodes Python deployment names, stale helm release rev 185.
#111 2026-06-18 2026-06-19 E2E: worker-driven orchestrate topology coverage + server-API-only gap tracking In progress — three committed kind rigs now: kind_validate_orchestrate_offserver.sh (e2e#59) asserts the off-server topology (gate-off), kind_validate_orchestrate_gate.sh (e2e#61) asserts it composes with the PUBLISH_ONLY gate, and kind_validate_cancel_finalize_gate.sh (e2e#62, 2026-06-19) dual-mode-asserts the ExecutionService cancel/finalize writers honour the gate (gate-off byte-identical INSERT; gate-on PUBLISHED, materializer sole writer, terminal state, 0 loss/dup) — the rig that closed the last #103 flip blocker. Off-server rig live-green (COMPLETED, __orchestrate__ in noetl.event = 0, dispatched=applied, shadow metric absent). Durable home for the server-API-only gap (server still sole-writer + rebuilds state — moves under #103/#104) + two operator decisions: (A) retire in-process drive fallback (gated on prod adopting a post-#108 image; prod still pre-#108), (B) reap accumulating __orchestrate__ PENDING delivery rows in noetl.command.
#95 2026-06-14 2026-06-14 noetl-tools postgres pg_value_to_json returns null for timestamptz / NaiveDateTime columns Open — Todo. Bug: timestamptz columns serialize to JSON null instead of the ISO-8601 string.
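
The chain checks that recur in the status cells above (roots=1, terminals=1, dangling=0, walk==total), together with the #117 fix of ordering the spine by the prev_event_id walk rather than by event_id, can be sketched as follows. This is a minimal illustration under stated assumptions, not the platform's implementation: `Event`, `ChainReport`, `ev`, `spine_from_head`, and `check_chain` are hypothetical names, and a real noetl.event row carries many more columns than the three modelled here.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical, simplified event row: only the fields the chain checks need.
#[derive(Debug)]
struct Event {
    event_id: u64,
    prev_event_id: Option<u64>, // None marks a chain root
    terminal: bool,             // e.g. a playbook.completed / playbook.failed event
}

/// Hypothetical summary of the per-execution chain assertions.
#[derive(Debug, PartialEq)]
struct ChainReport {
    roots: usize,
    terminals: usize,
    dangling: usize,
    walk: usize,
    total: usize,
}

/// Small constructor to keep the examples short.
fn ev(event_id: u64, prev_event_id: Option<u64>, terminal: bool) -> Event {
    Event { event_id, prev_event_id, terminal }
}

/// Walk head -> root by following prev_event_id, then return the spine in
/// root -> head order. The order comes from the links, not from event_id.
fn spine_from_head(events: &[Event], head: u64) -> Vec<u64> {
    let by_id: HashMap<u64, &Event> = events.iter().map(|e| (e.event_id, e)).collect();
    let mut spine = Vec::new();
    let mut cursor = Some(head);
    while let Some(id) = cursor {
        // Guard against cycles: never take more hops than there are events.
        if spine.len() >= events.len() {
            break;
        }
        match by_id.get(&id) {
            Some(e) => {
                spine.push(id);
                cursor = e.prev_event_id;
            }
            None => break, // dangling link: the walk stops short
        }
    }
    spine.reverse();
    spine
}

/// Count roots, terminals, and dangling links, and compare walk length to total.
fn check_chain(events: &[Event], head: u64) -> ChainReport {
    let ids: HashSet<u64> = events.iter().map(|e| e.event_id).collect();
    ChainReport {
        roots: events.iter().filter(|e| e.prev_event_id.is_none()).count(),
        terminals: events.iter().filter(|e| e.terminal).count(),
        dangling: events
            .iter()
            .filter(|e| e.prev_event_id.map_or(false, |p| !ids.contains(&p)))
            .count(),
        walk: spine_from_head(events, head).len(),
        total: events.len(),
    }
}

fn main() {
    // Ids 2 and 3 are inverted relative to chain order (chain: 1 -> 3 -> 2 -> 4).
    let events = vec![
        ev(1, None, false),
        ev(3, Some(1), false),
        ev(2, Some(3), false),
        ev(4, Some(2), true),
    ];
    println!("{:?}", spine_from_head(&events, 4));
    println!("{:?}", check_chain(&events, 4));
}
```

In this example, sorting by event_id would yield 1, 2, 3, 4, while the chain walk yields 1, 3, 2, 4, which is the kind of chain-order≠id-order inversion #117 describes. Appending a second terminal with a NULL prev_event_id (the #118 symptom) would raise both roots and terminals to 2 while the walk from the original head stays at 4 of 5.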

Recently closed (last 7 days)

# Closed Title
#119 2026-06-20 Off-server WAL state-builder drain stranded executions after a worker restart — FIXED: rebuild the in-memory index from the retained noetl_events WAL on every boot. The authoritative drain used a durable noetl_state_builder consumer whose cursor persists across restarts while the in-memory WalEventIndex rebuilds empty → the cursor outran the fresh index → build_spine_to(expected_head) permanently Incomplete → off-server execs looped offserver_retry and never completed (this is what hid the #118 symptom). Fix (worker-only, inside NOETL_STATE_BUILDER=offserver; PROD runs the in-server drive so untouched): the drain defaults to an ephemeral DeliverPolicy::All consumer rebuilding the full index from the retained WAL on every boot (no persisted cursor to outrun; also correct for >1 worker pod); instant revert NOETL_STATE_BUILDER_DURABLE=1; proof = index rehydrated… log + new noetl_worker_state_builder_indexed_executions gauge; never reintroduces a noetl.event scan. noetl-worker v5.40.2 (worker#123, 48b0bde). Gate-ON kind: forced mid-flight pod delete --force → new pod index rehydrated … indexed_executions=17 wal_events=200; single-replica 6/6 stress + multi-replica 21 execs COMPLETE, zero scans.
#118 2026-06-20 Single-replica off-server terminal-finalize chain fork (a duplicate playbook.completed orphaned the chain as a NULL-prev_event_id 2nd root) — FIXED: a bounded FinalizedGuard (exactly-one-terminal-per-execution) suppresses the duplicate at emit_events before the chain linker. A straggler drive on a materializer-lagged single replica re-drove off the lagged WAL (state not terminal yet) and emitted a 2nd terminal that linked to the now-evicted head → 2 roots + a benign event-scan. The guard makes the first terminal win; gate-off byte-identical (a duplicate never occurs on the synchronous in-process drive); metric noetl_terminal_dedup_total{suppressed}; rig gains a HARD terminals==1 assertion. Absent under multi-replica execution-affinity (#116 serializes finalize to the owner). noetl-server v3.39.1 (server#253, c5f8cb2) + e2e (e2e#73, fe97d92). Gate-ON kind (unblocked by #119): single-replica 6/6 stress iterations / ~126 execs every chain roots=1(incl. terminal)/terminals=1/zero-scan; multi-replica 21 execs clean.
#117 2026-06-20 Off-server from_events spine ordered by event_id broke fan-in under a chain-order≠id-order inversion (high-concurrency fan-out reduce wedge) — FIXED: order the spine by the prev_event_id chain walk + walk from the real tip (expected_head). worker v5.40.1 (worker#122, baeae78) + e2e (#72, cdf1768). 2-replica affinity gate-ON stress 6/6 iterations / 108 execs COMPLETE; 15 real id-inversions all fired the fan-in reduce. The residual single-replica terminal-finalize chain-linking race is now FIXED as #118.
#113 2026-06-19 Worker-driven orchestrate drive stalled when the drive result / accumulated context crossed the inline budget — fixed at the source by #115 Phase 1 (references-in-state consume side). All 9 of 9 stalled core fixtures now reach playbook.completed gate-ON (PUBLISH_ONLY + off-server drive + materializer sole-writer), server v3.30.0 + worker v5.36.0: output_select / storage_tiers / lease_expiry / pipeline_heavy_payload / save_edge_cases / large_result_extraction / http_to_postgres_{direct,simple,bulk_python}. Max next-command context across the 9 = 412KB (save_edge_cases); 0 __orchestrate__ event rows on every run; materializer lag = 0. The #113 decode fix (server#241 v3.29.4) + #114 offload cap (server#242 v3.29.5) stay as safety nets; the growth that pushed past budget is gone (worker selective resolve keeps foreign bulk out of the render; refs_in_state-true keeps the drive state + command.issued bounded). Landed via worker#117 + server#243.
#114 2026-06-19 Off-server drive oversized command.issued (full upstream context embedded) — resolved by #115 Phase 1 (refs_in_state default true). With the worker selective consume side in place, refs_in_state defaults true so the next command's render_context carries {reference, extracted} (small) instead of the full upstream payload — command.issued no longer approaches NATS max_payload. The #114 offload safety cap (maybe_offload_command_context, 512KB) stays as belt-and-suspenders. Verified gate-ON: all 4 of this issue's fixtures COMPLETE; max next-command context 412KB across the full 9. Landed via server#243 + worker#117.
#112 2026-06-18 Worker /dev/shm SIGBUS — k8s default 64 MiB tmpfs vs the 256 MiB Arrow IPC cache budget. Every worker process (Rust noetl-worker + legacy Python worker) allocates an Arrow IPC shared-memory cache at init (NOETL_IPC_CACHE_BUDGET_BYTES, default 256 MB) backed by POSIX shm on /dev/shm; the k8s container-runtime default /dev/shm is a 64 MiB tmpfs, so under shm-heavy load the cache writes past 64 MiB, the store page-faults against the full tmpfs, and the worker dies with SIGBUS (exit 135) and crash-loops. Surfaced during #103 CQRS kind validation on the system pool; a transient live fix was reverted, leaving the committed manifests latent. Fix (ops#193) gives every worker deployment a memory-backed /dev/shm (emptyDir medium: Memory, sizeLimit: 320Mi > budget), pins NOETL_IPC_CACHE_BUDGET_BYTES=268435456 next to the sizeLimit so the two can't drift, and raises the memory limit to 768Mi (tmpfs is charged to the pod cgroup). Applied to all 7 worker manifests (system / shared-rust / subscription / subscription-runtime / Python-cpu + 2 prod variants). Kind-validated on the system pool: reproduced SIGBUS (exit 135) on the 64 MiB tmpfs → after fix /dev/shm is 320 MiB and a full 256 MiB write completes (exit 0, peak 256M/320M) with the pod healthy (restarts=0, no OOM); cluster restored to baseline. ai-meta → ops f4df4c1 + wiki worker deployment-specification.
#110 2026-06-18 Retired the in-server orchestrate shadow + the wasmtime server dependency. The separable server-slimming follow-up to #108 (closed): the in-server shadow was the slice-4 cutover-confidence harness (ran the system/orchestrate plug-in inside the server via an embedded wasmtime host, diffed 529/0 against the in-process drive). With the worker-driven drive default-on + proven, the live drive uses the worker's wasmtime host — never the server's — so the shadow was dead weight. server#234 removed orchestrate_shadow.rs, the orchestrate-shadow cargo feature + the optional wasmtime dep (the cranelift/wasmtime tree — ~1000 Cargo.lock lines — fell out; cargo tree -i wasmtime now matches nothing), the trigger_orchestrator_inner shadow hook, the main.rs boot loader, the NOETL_ORCHESTRATE_PLUGIN_SHADOW config field, and the noetl_orchestrate_shadow_total metric. Kept run_state + NOETL_ORCHESTRATE_PLUGIN_DRIVE (default true). refactor: = no version bump (stays v3.28.0). Kind smoke on a 4-page cursor loop, both drive modes: COMPLETED with __orchestrate__ rows in noetl.event = 0 (worker-driven: 10 drive cmds on the system pool, dispatched=applied=10, event_suppressed=30); shadow metric gone from /metrics. ai-meta → server f3043c9 + wiki deployment-specification. See Umbrella-System-Pool-Design.
#108 2026-06-18 🎯 The orchestrator drive runs OFF-SERVER on the system pool, and it's now the DEFAULT. Step 2 of the dissolution (#107) — the server's brain moved onto the worker pool as the system/orchestrate WASM plug-in. Arc: slices 1–3 (off-server drive, server#229) → 4a (cursor+fan-out validated) → 4b + follow-up a (the __orchestrate__ meta-command touches noetl.event zero times, server#230/#231) → (b) system-pool isolation (server#232 + worker#114 + ops#191) → (c) the deliberate default-flip (server#233, v3.28.0): NOETL_ORCHESTRATE_PLUGIN_DRIVE defaults true. Gated on a scale soak (kind): a 694-drive cursor+fan-out run COMPLETED with __orchestrate__ rows in noetl.event = 0 and all 694 drives claimed on the system pool (shared pool got only the real steps); the default-on path (no env var) reproduced the identical shape (361 drives, isolated, 0 burst); 15/15 regression fixtures green; revert (=false → in-process drive) verified. ai-meta → server 80cc0e6 + worker 437b0be. See Umbrella-System-Pool-Design.
#109 2026-06-17 Event-ABI round — orchestrator + evaluate moved into the wasm core (#108 final slices). Slice 3 (server#223) relocated orchestrator/evaluate to noetl-orchestrate-core; the whole drive (renderer, playbook, commands, evaluator, state, orchestrator switch) now compiles native + wasm32-unknown-unknown. evaluate reads the pure core::event::Event; db::Event converts at the trigger_orchestrator boundary. 122 core + 565 server tests green, 0 WASI imports, kind PFT 10×1000 clean (0 errors, 0 restarts). Plug-in round (data-plane ABI + command_emit + scheduler + shadow-diff) tracked under #108.
#106 2026-06-17 EventEnvelope rejects timezone-less timestamps — blocked the CQRS materializer. The tailer publishes to_jsonb(noetl.event row) and created_at is timestamp WITHOUT time zone, so the timestamp had no offset; EventEnvelope.timestamp: Option<DateTime<Utc>> rejected it with the misleading premature end of input. Fixed (server#217) with a flexible deserialize_with (RFC3339 + tz-less→UTC). Validated live: materializer projects {projected:0, duplicates:20} = byte-identical. The #103 shadow gate is green.
#105 2026-06-17 Plug-in compilation & hot-reload: WASM-compiled system playbooks + managed library. The WASM plug-in runtime is complete + fully live-proven on kind: dispatch (tool_kind: "wasm" → wasmtime host → digest from registry → run), capability flush (object_put → noetl.object_store), real data flow (input → args, worker#110), and hot-replace (worker#112) — republishing the same path@version swaps the running pool's behavior with 0 restarts. Landed across worker#93/#95/#97/#99/#101/#103/#105/#107/#108/#110/#112 + server#210/#212/#214 + tools#68 + docs#181. Remaining (executor: author sugar; compiled materialiser port) deferred to #104/#103. See Umbrella-WASM-Plugin-Compilation.
#100 2026-06-15 Cursor/claim loop mode: loop.spec.mode: cursor in the Rust orchestrator. noetl-server v3.8.0 (server#196) cursor loop engine + output namespace; noetl-tools v3.10.1 (tools#66) postgres -- comment splitter fix; worker#88 dep bump. test_pft_flow_v2 all_passed:true, 5/5 per data type on kind against the throttling/error-injecting paginated-api. See Umbrella-Cursor-Loop-Mode.
#99 2026-06-14 Transfer tool: Snowflake↔Postgres both directions, with credential-alias resolution. Both transfer arms implemented in noetl-tools v3.10.0 (tools#65); worker v5.22.0 (worker#87) pre-resolves source.auth/target.auth aliases; SF→PG coerces string cells via $n::text::<udt> + reformats Snowflake-epoch timestamps to RFC3339; PG→SF generates SQL-escaped INSERTs. e2e fixture migrated (e2e#58). Kind-validated bidirectionally against the live sf_test account: every step COMPLETED, real types preserved.
#90 2026-06-12 Subscription / listener tool (RFC) — all 7 phases shipped + live-proven. Bounded-drain tool (Mode A) → kind: Subscription continuous runtime (Mode B) + header-directive engine → gateway push-ingress (Mode C) + auth-gated directive trust → store-and-forward spool + circuit breaker → out-of-cluster Cloud Run + gcs spool → CLI local noetl subscribe → Phase 7 scale hardening. Final phase: server v3.5.0 (server#189) POST /api/execute/batch (N→N, partial-failure contained) + opt-in exactly-once dedup window (noetl.subscription_dedup, bounded-by-age, race-safe, default off); worker v5.19.0 (worker#79) batch dispatch + dedup opt-in + per-subscription rate limits (token-bucket fetch-side backpressure → source keeps backlog, no loss); ops (ops#176) + e2e (e2e#48); no tools change. Live on kind: batch 12→12 COMPLETED on the subscription pool + per-message traceparent; dedup duplicate→1 execution + subscription.message.deduplicated; rate-limit engaged + 10/10 → executions (no loss). Refinement follow-ups tracked: #91–#94 + tools#57. ai-meta → server 7b217d8 + worker 7531f4a + ops 6db69b9 + e2e 203593b.
#89 2026-06-11 JSON null re-injected via {{ step }} serialized as JS undefined — fixed in the server template renderer — the cursor pagination fixture's terminal page returns next_cursor: null; re-injecting the whole {{ fetch_page }} envelope into the next step's input rendered that field as the bare token undefined, producing invalid JSON. render_to_value then failed serde_json::from_str and returned the entire envelope as a raw string, so the consuming Python step got response as a str and crashed ('str' object has no attribute 'get'). Root cause was the server, not the worker (the issue's hypothesis): src/template/jinja.rs::json_value_to_minijinja maps JSON null → Value::UNDEFINED and minijinja's map repr emits undefined; the server's renderer was a divergent copy missing the | tojson retry the noetl-tools engine already had. Fix (server#177, v3.0.6) adds that retry — a lone {{ expr }} rendering container-shaped-but-invalid JSON re-renders with | tojson; minijinja_to_json maps undefined/none → JSON null. 5 new regression tests; 619 lib + 8 parity green. Kind-validated: cursor walks all 4 pages, terminal next_cursor: null handled, validate_results collects 35 (first_id=1, last_id=35, success) — matching offset. ai-meta → server 8e17fbe.
#88 2026-06-10 e2e offset/cursor pagination fixtures read response.body.* not response.data.* — the Rust http tool nests the parsed JSON payload under body ({{ fetch_page }} → {body, headers, status_code}), so the fixtures' response.get('data', {}) resolved to {}, has_more/next_cursor defaulted falsy, and the loop exited after page 1 even with the post-#85 loop machinery correct. Fixed both check_pagination steps to response.get('body', {}) via e2e#40. Kind-validated against the live paginated-api test-server (Rust server/worker :dev): offset walks 0→10→20→30, has_more T/T/T/F, users 10/10/10/5, validate_results success 35 (first_id=1, last_id=35), playbook.completed COMPLETED. cursor path-fix correct + walks all 4 pages (Mg==Mw==NA==null, 35 events fetched) but final collection blocked by a distinct worker bug → filed #89 (terminal next_cursor: null serialized as JS undefined). Other pagination fixtures (retry/max_iterations/pipeline*/loop_with_pagination) share the same envelope-key assumption over /api/v1/assessments|flaky ({data, paging}) — flagged, left for follow-up. ai-meta → e2e 72a7525.
#85 2026-06-10 Workflow-arc loops now advance across iterations + terminate cleanly — built on the dispatch-guard re-entry layer, two coupled orchestrator fixes via server#176 (v3.0.5). (1) Durable event-sourced loop-ctx propagation: step-level set: ctx.* loop variables were recomputed per pass and reverted to the workload default (loop thrashed 0,0,1,0,1,2,…); root cause was start's initializer set re-firing every pass in random HashMap order against check_pagination's advancing set. Fix persists each completion's rendered set: as a ctx.updated event (latest-wins fold + build_context overlay), emitted once per completion keyed by the stable completion event_id (not Utc::now()-fallback completed_at). (2) Loop-exit hang: the exit branch was marked step.skipped on a loop-body-completion pass (recency-based branch-point detector missed it), turning it terminal so the exit dispatch was suppressed; fixed with a structural loop-branch-point test. 614 lib tests (6 new; 2 verified to fail without their guard); clippy-clean. Kind-validated: counter loop advances 0→1→2→3 + terminates; real-http offset pagination advances 4 pages collecting 35. (Separate finding, filed follow-up: the e2e offset/cursor fixtures read response.data.users but the Rust http tool nests it at response.body.users.) ai-meta → server e519fdc.
#87 2026-06-10 Multi-tool step: a later sub-tool can now reference an earlier sibling's output — in a tool: [list] step, each sub-tool's result was stored for the aggregated output but never injected into the running context, so {{ <label>.<field> }} rendered empty (masked in quoted positions; a syntax error at or near "," in unquoted numeric SQL positions, e.g. save_edge_cases test_large_payload). Fixed via tools#48 (v3.1.1): inject each sub-tool's result under its label (with a synthetic .data self-ref) so later siblings resolve it. Worker adopts 3.1.1 via worker#69. Kind-validated: save_edge_cases test_large_payload → record_count = 100 (no syntax error), save_delegation_test clean. ai-meta → tools 76f942a + worker b97f642.
#83 2026-06-10 Orchestrator fan-in barrier deadlocked workflow loops: build_incoming_arcs counted a loop back-edge (check_pagination → fetch_page) as an upstream so the barrier deferred the loop head forever. Fixed via server#175 (v3.0.4): exclude back-edges via a new forward_reachable helper (genuine fan-in unaffected). Kind-validated; fanout_reduce green. ai-meta → server 480ba72.
#84 2026-06-10 Orchestrator never populated event.name → loop.done arc gates always skipped. The gate when: {{ event.name == "loop.done" }} (10+ fixtures) never matched, so in-step loop: steps hung after completion. Fixed via server#175 (v3.0.4): inject event.name = "loop.done" into a completed loop step's next-arc context. Kind-validated (test_pagination_basic completes). ai-meta → server 480ba72.
#86 2026-06-10 e2e fixtures: duckdb tool field is command/query, not commands. commands: (plural) failed pre-dispatch (missing field 'query' / malformed tool config). Renamed across 4 storage/gcs fixtures via e2e#39; save_all_storage_types green. ai-meta → e2e b0a5c85.
#78 2026-06-10 noetl-worker: pre-dispatch errors now emit terminal call.error instead of hanging — credential-alias resolution + tool-config deserialization failures used to ?-propagate out of execute_with_server_url; the dispatch loop only logged them, so the execution sat at command.started forever. Fixed via worker#68 (v5.15.1): typed CredentialResolutionError (terminal AliasNotFound/Invalid vs retryable Transient) + CredentialHttpError carrying the HTTP status so classify_fetch_error decides retryability by code (terminal: 404/400/401/403/500; retryable: 408/429/502/503/504 + transport), and handle_predispatch_failure emits call.error + command.failed. Diagnosis correction: the live pg_noetl_k8s repro is an HTTP 500 "Decryption failed: aead::Error", not a 404 (the worker has no /api/keychain/ call). Kind-val GREEN: test/postgres now → call.error/command.failed/playbook.failed (no hang); hello_world still completes. ai-meta → worker 99e2c66.
#80 2026-06-10 container_callback chain green end to end — fixing the watcher's missing curl (retired bitnami/kubectl → alpine/k8s:1.30.3, ops#168) surfaced two more layered bugs. Server: container-callback call.done insert targeted a non-existent attempt column → HTTP 500; fixed to the deployed noetl.event schema (server#173, v3.0.3). OOM path: the watcher only read Job-level conditions so failed_oom could never fire — added pod-level OOMKilled → failed_oom / ImagePullBackOff → failed_image_pull classification + RFC3339 completed_at (bare now was returning HTTP 422) (ops#168); and the e2e fixture's calloc-lazy bytes(40MiB) never dirtied pages so it never OOM'd (e2e#38). kind_validate_container_callback.sh both probes GREEN — happy_path → succeeded, oom → failed_oom. ai-meta → ops cacc513 + server 5d2cf58 + e2e 6aaf06e.
#79 2026-06-10 e2e kind-val runner scripts updated to current noetl CLI surface — both scripts/kind_validate_*.sh aborted on unrecognized subcommand 'playbook'; the validation logic + event taxonomy were intact, only the invocation layer drifted. Fixed via e2e#37: register playbook / exec <catalog-path> --runtime distributed --json / status --json / event log via noetl query (.result, order by event_id); fail-fast CLI-surface guard. fanout_reduce PASS start-to-finish on kind (server-rust v3.0.1); container_callback drives cleanly and stops at the watcher curl gap (#80). ai-meta → e2e a3594b3; wiki Kind-Val Runners.
#82 2026-06-10 GUI: credential View/Edit recovered for pre-wallet (legacy-encrypted) records — Secrets Wallet (#61) made credential storage forward-only, so pre-wallet records 500 on decrypt and the GUI View/Edit flow dead-ended on a generic toast. Fixed via gui#36: View explains the cause + points to Edit; Edit reopens with the list-row metadata + a warning banner so re-saving re-seals the record under the current wallet. ai-meta → gui 8cacc9e (v1.11.1).
#81 2026-06-10 Container tool unusable: ToolSpec.command (String) vs ContainerConfig.command (Vec) type contradiction — landed via server#172 (v3.0.2). ToolSpec.command Option<String> → Option<serde_json::Value> so the container tool's array command decodes server-side + passes through to the worker's Vec<String> (scalar stays a JSON string for shell/db tools); ToolCall::from_spec forwards verbatim. 2 regression tests; clippy clean. Kind-val GREEN: server accepts command: ["/bin/sh","-c"], worker creates the K8s Job, Job reaches Complete 1/1. Chain counter-bump validation stays gated on #79/#43.
#77 2026-06-09 Explicit input:/set: forward-only data binding — BREAKING v3.0.0 across noetl-tools + noetl-server. All 5 PRs merged: tools#45 (v3.0.0) + server#169 (v3.0.0) + e2e#35 (13 fixtures migrated) + cli#57 (v4.10.0, executor 0.5.0) + worker#66 (dep bump). Kind-val GREEN.
#76 2026-06-08 Sequential-mode iterator dispatch — LoopMode enum (Sequential default / Parallel), StepInfo.iterations_dispatched guard. Landed via noetl/server#166 (v2.62.0). First Claude-direct Rust PR under agents/rules/handoff-routing.md. Kind-val GREEN: test/loop COMPLETED 5/5 + iterator_save_test COMPLETED 4 steps.
#70 2026-06-08 noetl-server missing PUT /api/result/<execution_id> endpoint — landed via noetl/server#160 (v2.58.0). Durable result-store endpoints (PUT + GET /api/result/<eid>/<step>/<ref>) ported to the Rust server. Kind-val GREEN: output_select_test reached playbook.completed with test_result: "PASSED".
#69 2026-06-08 noetl-worker: over-budget call.done returned reference-only envelope, missing inline _ref for downstream {{ step._ref }} templates. Landed via noetl/worker#63 (v5.15.0): build_call_done_result's durable-success branch now embeds context: { data: { _ref: <noetl://...> } } alongside the existing reference block. Kind re-val pending noetl/ai-meta#70 (server-side PUT /api/result/<eid> endpoint missing — falls back to degraded shm-only path where there's no noetl:// URI to embed).
#68 2026-06-08 noetl-tools: ArtifactTool config required input but server pipeline emitted args (the post-#56 normalized name) — landed via noetl/tools#40 (v2.23.1) + worker dep bump via noetl/worker#62. One-line #[serde(alias = "args")] on ArtifactConfig.input accepts both shapes. Re-val surfaced a downstream _ref/output_select gap → filed as noetl/ai-meta#69.
#67 2026-06-08 Rust orchestrator hangs on mode: exclusive routing — untaken sibling never emits step.skipped, R4 fan-in barrier deadlocks. Landed via noetl/server#159 (v2.57.2). Three-part fix: (1) evaluator::evaluate_next_transitions surfaces unmatched siblings as not_matched_with_target; (2) orchestrator::process_in_progress two-pass emit-skipped-then-dispatch (HashMap-order-independent); (3) +2 unit tests (lib 568/0/0). Kind-val GREEN — comprehensive_test.yaml reached playbook.completed in ~4s (was hanging forever pre-fix).
#66 2026-06-07 Rust orchestrator: cross-step {{ step.data }} template resolves to None — landed via noetl/server#158 (v2.57.1). WorkflowState::build_context injects a self-referencing .data key on extracted user_data, guarded so the task_sequence flatten path's existing .data stays intact. 2 new unit tests + 566/0/0 lib. Surfaced by #65 kind-val; concrete repro kind execution 322087210360770560.
#65 2026-06-07 noetl-tools: external python script loaders (file/gcs/http source types) + legacy main() function convention — landed via tools#38 (v2.22.0) + tools#39 (v2.23.0); kind-val GREEN on the live worker (execution 322087210360770560 reached playbook.completed in ~6s; loaded main(name, count) returned the expected payload). noetl/worker#61 OPEN+mergeable to bump the worker pin. Surfaced finding: noetl/ai-meta#66.
#43 2026-06-07 R-3 Phase C-2: container tool kind — design callback pattern for K8s Job dispatch. All four Rust rounds shipped (1 k8s-watcher ops@8892043 / 2 callback endpoint server v2.48.0 / 3 Tool::Container tools v2.21.0 / 5 kind-val rig e2e@17de21d). Round 4 (Python parity) parked per Rust-only direction. Worker-side pending_callback adoption is a coordinated follow-up.
#64 2026-06-07 noetl-tools: artifact tool kind missing from Rust registry — landed via tools#35 v2.20.0 (thin ArtifactTool adapter translating the Python-era YAML shape into a ResultFetchTool call)
#61 2026-06-07 Secrets Wallet (Rust): envelope encryption + KMS/secret-provider plugins + sealed worker delivery + distributed multi-region resolution — all named phases + 6d.X cloud dynamic providers shipped; umbrella feature-complete
#54 2026-06-06 Phase F R5 — regression + e2e validation under sharded topology — closed at the umbrella level (Tier 1 + Tier 3 + Tier 4 e2e all GREEN on the Rust-only stack; subsequent regression findings filed as their own ai-task issues — #62/#63/#64/#65/#66 all now also closed)
#62 2026-06-05 noetl-server: /api/executions list query candidate-first rewrite + status-drift fix (server#99 v2.28.1) — 6.5 s → 0.015 s (~430×)
#63 2026-06-05 noetl-tools: python tool accepts nested script.source.code (inline) — tools#33 v2.19.3 + worker#54 adoption, test_script_loading kind-validated; external loaders split to #65
#60 2026-06-04 Rust orchestrator template context doesn't expose step data for next.arcs / step.when
#59 2026-06-04 Rust orchestrator doesn't resolve tool.kind:workbook references to inline actions
#58 2026-06-04 Rust orchestrator doesn't emit playbook.failed on command.failed — executions stall
#57 2026-06-04 Rust server rejects flat (name-as-field) pipeline shape in v10 playbook YAML
#56 2026-06-04 Canonical v10 playbook workload + input alias unrecognized on Rust stack
#55 2026-06-04 Rust server EventEmitRequest.execution_id wire-type drift blocks worker traffic
#53 2026-06-04 Rust worker → Rust server e2e compatibility (Phase D R3b/R3c terminal completion gap)
#52 2026-06-03 noetl-tools: add js_consume operation to nats tool kind
#51 2026-06-03 Fix system/outbox_publisher.yaml auth block to use AuthResolver pattern
#50 2026-06-03 Phase 2.a — system/outbox_publisher playbook + routing + auth wiring
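
The sequential-mode dispatch guard from the #76 row above can be pictured with a small sketch. This is an illustrative Python model, not the Rust code in noetl/server: `LoopMode`, `StepInfo` and `iterations_dispatched` are names taken from the row, while the `total` field, the `iterations_completed` counter as a plain field, and the `next_to_dispatch` helper are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class LoopMode(Enum):
    SEQUENTIAL = "sequential"  # the default, per the #76 row
    PARALLEL = "parallel"


@dataclass
class StepInfo:
    total: int                      # hypothetical: iteration count of the loop
    iterations_dispatched: int = 0  # how many command.issued went out
    iterations_completed: int = 0   # how many command.completed came back


def next_to_dispatch(step: StepInfo, mode: LoopMode = LoopMode.SEQUENTIAL) -> List[int]:
    """Return the iteration indexes that may be dispatched now and record them."""
    if mode is LoopMode.PARALLEL:
        # Parallel: everything not yet dispatched goes out at once.
        ready = list(range(step.iterations_dispatched, step.total))
    elif (step.iterations_dispatched == step.iterations_completed
          and step.iterations_dispatched < step.total):
        # Sequential: one more iteration only when nothing is in flight.
        ready = [step.iterations_dispatched]
    else:
        ready = []
    step.iterations_dispatched += len(ready)
    return ready


step = StepInfo(total=3)
first = next_to_dispatch(step)    # fan-out: only iteration 0 goes out
blocked = next_to_dispatch(step)  # iteration 0 still in flight, nothing new
step.iterations_completed += 1    # a command.completed arrives
second = next_to_dispatch(step)   # guard opens for iteration 1
everything = next_to_dispatch(StepInfo(total=3), LoopMode.PARALLEL)
```

In this model the guard is the equality check between the two counters: a repeated orchestration pass while an iteration is in flight dispatches nothing, which is the behaviour the row attributes to the `iterations_dispatched` guard.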

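The `mode: exclusive` fix from the #67 row above (surface unmatched siblings, then emit `step.skipped` before dispatching) can be sketched the same way. This is a hedged Python illustration, not the Rust evaluator or orchestrator: the arc representation, the first-match rule, the function names, and the step names `fast_path` / `slow_path` / `merge` are assumptions; only the event names `step.skipped` and `command.issued` come from the dashboard.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical arc shape: (target step name, `when` predicate over the context).
Arc = Tuple[str, Callable[[dict], bool]]


def evaluate_exclusive(arcs: List[Arc], ctx: dict) -> Tuple[List[str], List[str]]:
    """Return (matched, unmatched) targets; assumes the first true arc wins."""
    matched: List[str] = []
    unmatched: List[str] = []
    for target, when in arcs:
        if not matched and when(ctx):
            matched.append(target)
        else:
            # Keep evaluating instead of breaking, so untaken siblings are reported.
            unmatched.append(target)
    return matched, unmatched


def plan_events(arcs: List[Arc], ctx: dict) -> List[Tuple[str, str]]:
    """Two passes: skip every untaken target first, then dispatch the taken one."""
    matched, unmatched = evaluate_exclusive(arcs, ctx)
    events = [("step.skipped", t) for t in unmatched if t not in matched]
    events += [("command.issued", t) for t in matched]
    return events


def fan_in_ready(upstreams: List[str], states: Dict[str, str]) -> bool:
    """A merge step may run once every upstream is completed or skipped."""
    return all(states.get(u) in {"completed", "skipped"} for u in upstreams)


arcs: List[Arc] = [
    ("fast_path", lambda c: c["n"] < 10),
    ("slow_path", lambda c: True),
]
events = plan_events(arcs, {"n": 3})

merge_upstreams = ["fast_path", "slow_path"]
with_skip = fan_in_ready(merge_upstreams, {"fast_path": "completed", "slow_path": "skipped"})
without_skip = fan_in_ready(merge_upstreams, {"fast_path": "completed"})
```

Without the skipped marker the merge step's barrier never sees `slow_path` settle, which is the deadlock the row describes; with it, the barrier opens as soon as the taken branch completes.
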
Ecosystem map

See the Repo Map page for the full submodule inventory. Quick view of the production repos and their current versions (2026-06-20):

Repo Role Version Recent
noetl/server Rust control plane v3.39.0 ✅ #116 program-scale step 2 — execution-affinity single-owner WRITE ORDERING (multi-replica gate-ON validated) (server#252, v3.39.0 5e00d0a): closes the chain-fork race step 1 left open (the command.issued prev-read + head CAS-advance are two non-atomic steps → concurrent cross-replica emits forked the chain). Affinity routes every trigger for an execution (POST /api/events, which fires the drive) to the single replica that sharding::ShardConfig::owns(execution_id) owns (stable XxHash64); non-owner forwards a reverse-proxy POST (one-hop loop guard, degrade-to-local). On the owner the single-process drive lock + in-memory ChainHeads make the read→advance atomic, no distributed lock; KV is the genesis/handoff vehicle (owner resolves LOCAL → kv_remote_hit→0). New src/affinity.rs; NOETL_EXECUTION_AFFINITY/NOETL_PEER_URL_TEMPLATE/NOETL_SHARD_INDEX_FROM_HOSTNAME (all default off, prod unchanged); metric noetl_execution_affinity_total{outcome} (forwarded_ok = proof). Multi-replica gate-ON kind (2-replica StatefulSet, offserver+audit_only+nats_kv+affinity+publish_only): rig PASS — chains roots=1/dangling=0/walk==total (NO fork), forwarded_ok +9, never-scan + sole-writer across replicas, executions COMPLETE; single-replica unchanged; 595 tests + clippy green. Follow-up #117: off-server from_events spine ordered by event_id wedges fan-in under a chain-order≠id-order inversion (high-concurrency fanout). Prior: ✅ #115 program-scale step 1 — multi-replica coherence DATA LAYER (NATS-KV-backed ChainHeads + ExecDescriptor); execution-affinity STAGED (server#251, v3.38.0 8f39a79): NOETL_REPLICA_COHERENCE=nats_kv (default local, prod unchanged) backs the off-server drive's watermark + descriptor with JetStream KV buckets — head advance = CAS (one chain under concurrent emits), descriptor = CAS merge; in-process maps = write-through cache / degraded fallback. 
src/coherence.rs; ChainHeads/ExecDescriptors async; metric noetl_replica_coherence_total{structure,op,outcome} (proof kv_remote_hit). Kind: single-replica nats_kv bit-for-bit parity with local (all topologies COMPLETE, clean chains, scans +0); 2-replica proved cross-replica resolves work. Necessary-not-sufficient: 2+ replicas still fork the chain (non-atomic issuing_event-read vs head-advance) → execution-affinity STAGED as step 2 (substrate in src/sharding.rs). Prior: ✅ #115 Phase 5 — atomic-working-item context (tenet 6): the drive hands a worker only its minimal declared slice (server#250, v3.37.0 a96ade8): new orchestrate-core::input_binding (analyze/project_context over minijinja undeclared_variables, conservative bail-to-full) + CommandBuilder::with_atomic_item_context; NOETL_ATOMIC_ITEM_CONTEXT (default false, prod unchanged); metric noetl_atomic_item_context_total{outcome}. Builds on #77 (Explicit Input Binding, CLOSED). Gate-ON kind-validated: flag-ON consumer ctx narrowed to the one declared key, COMPLETED; flag-OFF full ctx (back-compat). Prior: ✅ #115 Phase 6 — retire the hot-path noetl.event read class; the table is AUDIT-ONLY (server#249, v3.36.0 b71ca1d): NOETL_EVENT_READ_PATH=event_scan|audit_only (default event_scan, prod unchanged). Under audit_only, the remaining lifecycle readers of noetl.event (the WHERE execution_id replay class outside the drive — get_catalog_id per-ingest, inherit_parent_trace, subscription dedup-audit + container-callback catalog/existence) serve from the in-memory execute-time ExecDescriptor; a cold descriptor (post-terminal straggler after eviction / restart) resolves catalog_id from noetl.command (the synchronous queue) — never a noetl.event scan. New proof metric noetl_event_hotpath_reads_total{site,outcome}. 
Gate-ON kind-validated: hot-path scan Δ0 (served_descriptor +96 + served_command +3), drive state_build_total Δ0 + event_scans Δ0 ⇒ ZERO noetl.event scans anywhere on the hot path, end-to-end; linear/loop/fan-out/output_select COMPLETE; sole-writer + lag-0; audit still works (direct SELECT + status + replay event_count=25); 585 tests + clippy green; default event_scan, prod unchanged. The RFC never-scan end state (tenet 3) is reached under the flag. Prior: ✅ #115 Phase 4 REMAINDER — stateless off-server drive edge (zero state rebuild + zero noetl.event reads) (server#248, v3.35.0 6e30fc3): removes the residual server-side chain-walk bookkeeping on the drive path. Under NOETL_STATE_BUILDER=offserver a warm execute-time ExecDescriptor (catalog_id + routing seeded at playbook_started; terminal stamped at the emit_events chokepoint) lets trigger_orchestrator_inner route the system/orchestrate command WITHOUT building WorkflowStateexpected_head from the in-memory ChainHeads, trigger_event_id passed so the worker resolves the trigger type off its WAL, NO server-built state (__stateless__). apply_worker_orchestration sources catalog_id+routing from the descriptor (skips cold-rebuild) + evicts on terminal worker-built state. Cold descriptor (restart) falls through to the server-built path (re-seeds) → chain_walk + event_scan stay fallbacks. Proof: noetl_orchestrate_drive_total{dispatched_offserver_stateless,applied_stateless} advance while noetl_state_build_total stays flat. Gate-ON kind-validated: state_build_total Δ0, event_scans Δ0, dispatched_offserver_stateless +3 / applied_stateless +3, linear(13)/loop(62)/fan-out(25)/output_select(31) COMPLETE, offserver==server parity, sole-writer 25==25, lag-0; 583 tests + clippy green; default server, prod unchanged. Completes #107 step 2 server-side. 
Prior: ✅ #115 Phase 4 DRIVE CUTOVER — mark the off-server drive command + carry expected_head (server#247, v3.34.0 f0922bd): under NOETL_STATE_BUILDER=offserver (default server, prod unchanged) trigger_orchestrator_inner marks the system/orchestrate command __offserver_build__ + carries execution_id + expected_head (the staleness watermark) so the worker self-sources the drive state from the WAL; the server-built state rides along as the worker's incomplete-chain fallback. Gate-ON parity rig PASS (offserver==server, fan-in exactly-once, server scans 0). 580 tests + clippy green; no prod default changed. Prior: ✅ #115 Phase 4 — NOETL_STATE_BUILDER=offserver|server flag scaffold (default server) (server#246, v3.33.0 3e6006d): the server-side flag for the off-server state-builder drive cutover. server (default, prod unchanged) builds WorkflowState in-process; offserver routes the drive to obtain state from the pool-side off-server builder (worker state_builder, reading the noetl_events WAL). Flag only — the offserver drive-cutover wiring is staged (pool-side builder landed in worker v5.37.0). 2 config tests; no prod default changed. Prior: ✅ #115 Phase 3 MERGED — chain-walk state builder (flagged, default-off) (server#245, v3.32.0 8338417): behind NOETL_STATE_BUILD_MODE=chain_walk the drive reconstructs WorkflowState by following the one-level prev_event_id chain head→root (head from the in-memory ChainHeads watermark, then (execution_id,event_id) PK point-lookups — never a WHERE execution_id scan of noetl.event) and feeds the same from_events (orchestrate-core unchanged → parity by construction). Falls back to event_scan (default) on cold-head / materializer-lag / non-genesis so correctness is never sacrificed; NOETL_STATE_BUILD_PARITY_CHECK shadow-builds both ways in one REPEATABLE READ snapshot and asserts equal. 
New metrics noetl_state_build_total{mode,outcome} + noetl_state_build_event_scans_total (no-scan proof, 0 under chain_walk) + noetl_state_build_chain_hops + noetl_state_build_parity_total{result}. Gate-ON kind-validated: parity 41/41 MATCH, event_scans_total=0 / 1064 PK hops / 0 fallbacks, all topologies COMPLETE, sole-writer + lag-0, 577 lib tests + clippy green. Self-merged; prod default unchanged. Phase 4 (off-server builder + WAL cache) builds on this. Prior: ✅ #115 Phase 2 MERGED — one-level prev_event_id event chain (server#244, v3.31.0): each noetl.event carries prev_event_id (the immediately-previous event in causal order) + each noetl.command the issuing-event link, stamped at the emit chokepoint emit_events from a per-execution chain-head watermark (ChainHeads) — covering drive events + command.issued + worker-lifecycle on both gate paths, the materializer persisting it. Additive (nothing reads the link yet — that is Phase 3, in progress). Chain-correctness kind-proven walkable/1-root/no-gap/no-scan across 6 gate-ON topologies; 573 lib tests + clippy green. ai-meta pointer afdb365. Prior: ✅ #115 Phase 1 — surface _ref/_store on kept refs + refs_in_state default true (server#243, v3.30.0): hydrate_result_references keep_refs branch merges ref/store/uri onto the bounded extracted summary it surfaces as context.data (so {{ step._ref }} lazy-load + {{ step._ref is defined }}/{{ step._store }} predicates resolve off the summary without bulk), and refs_in_state now defaults true — references stay out of state + commands (drive state + command.issued bounded; the worker selective consume side landed in worker#117). Closed #113 + #114; kind gate-ON all 9 stalls COMPLETE. 
Prior: 🐞 #114 — offload oversized command context (server#242, v3.29.5): a command.issued context over NOETL_COMMAND_CONTEXT_MAX_BYTES (512KB) is offloaded to noetl.result_store with a {__context_ref__} marker; get_command/claim_command resolve it before the worker sees the command (metrics context_offloaded/context_ref_resolved) — so the published event stays under NATS max_payload and never wedges the publish-only gate. Kind gate-ON: rig PASS, every command.issued event <1MB, 6 of #113's 9 fixtures COMPLETE; chose ref-on-oversize over refs_in_state=true (the latter breaks bulk-consuming fixtures — consume side not impl). Remaining 3 fixtures + the cutover gated on the refs_in_state consume side (#101). Prior: 🐞 #113 — off-server drive: recover offloaded drive result + stop drive on cancel (server#241, v3.29.4): apply_worker_orchestration resolves+decodes an OFFLOADED __orchestrate__ drive result (over the 100KB inline budget → durable reference.ref, no inline output_b64) instead of dropping it → non-convergent re-loop (new noetl_orchestrate_drive_total{stage=ref_resolved}); cancel now stops the drive (match underscore playbook_cancelled + ExecutionState::is_terminal terminal guard evicts the orch-cache, no restart). Kind gate-ON proven (785KB result → ref_resolved→COMPLETED, 0 decode WARNs; cancel froze a drive-loop instantly; sole-writer lag 0); rig e2e#63. 5/9 #113 fixtures COMPLETE; 4 hit a distinct oversized-command.issued stall → #114. No prod default flipped. Prior: 🎯 #103 — CQRS write-path cutover COMPLETE, FLIP-READY (server#240, v3.29.3): the 2 ExecutionService cancel/finalize writers route through the emit_event chokepoint — the last synchronous server noetl.event writer under the gate is closed, so with NOETL_EVENT_INGEST_PUBLISH_ONLY=on the server writes zero event rows (materializer is the sole writer). Kind-proven both modes (gate-off byte-identical INSERT; gate-on PUBLISHED + materializer sole writer + terminal state + 0 loss/dup). 
All three flip blockers closed → flipping the gate on is a staged operator decision. Default-off; no prod default changed. Prior: #104 — off-server-drive × gate crash-recovery (v3.29.2, server#238): cold-cache apply rebuilds WorkflowState from the durable log. Prior: #108 (c) — worker-driven orchestrator drive now DEFAULT ON (server#233, v3.28.0): NOETL_ORCHESTRATE_PLUGIN_DRIVE defaults true after the scale soak proved zero noetl.event burst + full system-pool isolation; in-process drive kept as the =false revert. Prior: #101 — bounded-memory orchestrator + stall-proof reconcile (v3.9.0, server#197): projection-snapshot bounded rebuild (flat memory — 167KB snapshot at 200k events, was OOM at ~19k); throttled O(events) consistency COUNT off the hot path; background reconcile poller (force-advances every active execution every 8s → no permanent deadlock under DB backpressure); results-by-reference resolution; GET /api/executions/{id} memory-bomb fix. Validated kind 10×1000 (flat memory) + GKE db-g1-small/PgBouncer 10×200 (poller broke a stall, 0 fails/restarts, Cloud SQL ~15 backends). Prior: #90 Phase 7 — POST /api/execute/batch + opt-in exactly-once dedup window (v3.5.0, server#189, closes server#188): batch endpoint creates N executions in one round-trip with partial-failure containment, reusing the single-execute path so per-message routing/trace/dedup are intact; the opt-in dedup window (noetl.subscription_dedup, bounded-by-age, race-safe INSERT … ON CONFLICT, scoped by subscription, subscription.message.deduplicated audit, default off — RFC §10 OQ1); validation of the new dispatch.batch_dispatch/batch_max/dedup/limits spec blocks; noetl_execute_outcomes_total + noetl_execute_batch_size. Live on kind: batch 12→12 COMPLETED; dedup duplicate→1 execution; direct-curl within/outside-window + dedup-off all green. ai-meta → server 7b217d8. Prior (v3.4.2): #90 Phase 5 gcs/s3 spool credential optional (ADC). 
Prior: #90 Phase 4 — spool config validation + subscription lifecycle-status fix (v3.4.1, server#184+server#185): validate the spec.spool block at registration; lifecycle-status reconstruction now matches only the six lifecycle event types so spool/circuit events (which share the subscription's execution_id) can't 500 subscription_get/activate. Prior: #90 Phase 3 — push-ingress config endpoint + push catalog validation (v3.3.0, server#182): mode: push requires an ingress.verify block (hmac_sha256 | bearer | pubsub_oidc; none rejected) + new GET /api/internal/ingress/{listener} (service-account-gated) resolving the verify-secret alias through the Wallet + idempotent subscription registration — the gateway's DB-free config source. Live-validated on kind. Prior: #90 Phase 2 — kind: Subscription type + lifecycle + pool routing + W3C trace (v3.2.0, server#180): first-class kind: Subscription catalog type (source/mode/dispatch validation, no step-DAG) + event-sourced lifecycle endpoints /api/subscriptions (register→activate→pause/resume→drain→deactivate, idempotent register, GET list/get) + execution_pool override on /api/execute routing the whole run to noetl.commands.<pool>.<eid> (persisted in playbook_started meta, orchestrator reads back) + W3C trace into meta.trace + command notification + child inheritance. Startup-seeds the subscription resource kind; decodes noetl.event.created_at as TIMESTAMP. Live E2E green. Prior (v3.1.0): subscription ToolKind. Prior: Round-trip JSON null in whole-object {{ step }} references (v3.0.6, server#177; closes noetl/ai-meta#89): a single-expression {{ step }} reference to a prior result envelope carrying a null field rendered that field as the JS token undefined via minijinja's map repr (json_value_to_minijinja maps JSON null → Value::UNDEFINED); render_to_value then failed serde_json::from_str and returned the whole envelope as a raw string, so the consuming python/rhai step received an unparseable str and crashed. 
Fix adds a | tojson retry to render_to_value (mirrors the noetl-tools TemplateEngine::render_value the server copy had diverged from): a lone {{ expr }} whose plain render is container-shaped-but-invalid JSON re-renders with | tojson, and minijinja_to_json maps undefined/none → JSON null so the field round-trips as null. 5 new regression tests (null in nested + top-level objects, null array element, explicit | tojson no-double-pipe, scalars unchanged); 619 lib + 8 parity green; clippy clean. Kind-validated against the live paginated-api test-server: cursor fixture walks all 4 pages, terminal next_cursor: null handled, validate_results collects 35 (first_id=1, last_id=35, success) — matching offset; pre-fix the 4th check_pagination was command.completed error. ai-meta → server 8e17fbe. Prior: Workflow-arc loops advance across iterations + terminate (v3.0.5, server#176; closes noetl/ai-meta#85): two coupled fixes atop the dispatch-guard re-entry layer. (1) Durable event-sourced loop-ctx propagation — step-level set: ctx.* loop variables were recomputed per pass and reverted to the workload default (loop thrashed 0,0,1,0,1,2,…); root cause was start's initializer set re-firing every pass in random HashMap order against the loop's advancing set. Fix persists each completion's rendered set: as a ctx.updated event (latest-wins fold in WorkflowState.ctx + build_context overlay), emitted once per completion keyed by the stable completion event_id (the Utc::now()-fallback completed_at is unstable across reconstructions). (2) Loop-exit hang — the exit branch was marked step.skipped on a loop-body-completion pass (recency-based branch-point detector missed it), turning it terminal so the is_step_done guard suppressed the exit dispatch; fixed with a structural loop-branch-point test (any step with a back-edge arc). 614 lib tests (+6; 2 verified to fail without their guard); clippy-clean. 
Kind-validated: counter loop 0→1→2→3 + terminates; real-http offset pagination 4 pages collecting 35. ai-meta → server e519fdc. (Separate finding: the e2e offset/cursor fixtures read response.data.users but the Rust http tool nests it at response.body.users — filed as a follow-up.) Prior: Unblock workflow loops + loop.done-gated transitions (v3.0.4, server#175; closes noetl/ai-meta#83 + #84): the fan-in/reduce barrier counted a loop back-edge (check_pagination → fetch_page) as an upstream and deferred the loop head forever — fix excludes back-edges via a new forward_reachable helper (genuine fan-in unaffected); and event.name was never populated for arc evaluation so when: {{ event.name == "loop.done" }} never matched (10+ fixtures hung after an in-step loop:) — fix injects event.name = "loop.done" into a completed loop step's next-arc context. Found in a full e2e regression re-sweep (19→27/36 on kind); landed with e2e fixture fix e2e#39 (duckdb commands → command, closes #86). Follow-ups #85/#87 filed open. 26 orchestrator tests +2 new; kind-validated. Prior: container-callback insert matches the deployed event schema (v3.0.3, server#173; tracks noetl/ai-meta#43): the container-callback handler emitted its resume call.done via a stale query targeting attempt + id columns that don't exist on the deployed noetl.event (PK (execution_id, event_id)) — every watcher POST 500'd with column "attempt" of relation "event" does not exist, blocking the #43 chain. Replaced with an inline INSERT matching the working handlers::events column set; terminal outcome rides in a chk_event_result_shape-conforming result envelope. Kind-val GREEN: kind_validate_container_callback.sh both probes pass (happy_path → succeeded, oom → failed_oom). 
Prior: container-tool command type contradiction fixed (v3.0.2, server#172; closes noetl/ai-meta#81): ToolSpec.command Option<String> → Option<serde_json::Value> so the container tool kind's K8s-Job-style array command decodes server-side (was a 400 data did not match any variant of untagged enum ToolDefinition) and passes through unchanged to the worker's ContainerConfig.command: Option<Vec<String>>; a scalar stays a JSON string for the shell/db consumers; ToolCall::from_spec forwards the value verbatim instead of wrapping in Value::String. 2 new regression tests (playbook::types 18/18); clippy clean. Kind-val GREEN end-to-end — server accepts command: ["/bin/sh","-c"], worker dispatches the container tool, K8s Job reaches Complete 1/1 (pre-fix kubectl get jobs empty). Prior: e2e-sweep cleanup (v3.0.1, server#171; tracks noetl/ai-meta#49): 64 MB result-store PUT body limit (DefaultBodyLimit — was rejecting 15 MB+ payloads with HTTP 413); render_pipeline_config stashes set/args/spec/command blocks before Tera rendering; iter namespace map in build_iteration_command; cmd_render_ctx uses command.context override; stripped diagnostic tracing::debug! blocks. All 7 e2e sweep playbooks PASS on Rust-only kind stack. Prior: Sequential-mode iterator dispatch (v2.62.0, server#166; closes noetl/ai-meta#76): LoopMode enum (Sequential default / Parallel); LoopSpec.mode parsed from loop.spec.mode YAML; StepInfo.iterations_dispatched tracks command.issued count for the sequential dispatch guard; sequential pattern dispatches iteration 0 at fan-out then on each command.completed dispatches next if iterations_dispatched == iterations_completed(). Default is Sequential — existing playbooks without explicit spec.mode get sequential behavior. 3 new tests; lib pass; clippy clean. Kind-val GREEN: test/loop COMPLETED 5/5 + iterator_save_test COMPLETED 4 steps. First Claude-direct Rust PR under agents/rules/handoff-routing.md. 
Prior: Durable result-store endpoints (v2.58.0, server#160; closes noetl/ai-meta#70): PUT + GET /api/result/<eid>/<step>/<ref>. Kind-val GREEN: output_select_test reached playbook.completed. Prior: Rust orchestrator exclusive-routing fix — step.skipped for untaken siblings (v2.57.2, server#159; closes noetl/ai-meta#67): under mode: exclusive only one arc fires; pre-fix, the static planner declared the untaken sibling's target as an upstream of any downstream merge step, then the R4 fan-in barrier waited for it forever. Three-part fix: (1) evaluator::evaluate_next_transitions stops break-ing on exclusive-mode match — surfaces unmatched siblings as not_matched_with_target results; (2) orchestrator::process_in_progress two-pass refactor — emit step.skipped for ALL unmatched arc targets first, then dispatch matched (HashMap-order-independent); (3) +2 unit tests + new defensive Jinja regression guard (server lib 568/0/0). Kind-val GREEN: e2e/fixtures/playbooks/comprehensive_test.yaml reaches playbook.completed in ~4s (was hanging forever). Single-commit patch on top of v2.57.1. Prior: Rust orchestrator step.data template accessor fix (v2.57.1, server#158; closes noetl/ai-meta#66): WorkflowState::build_context now injects a self-referencing .data key on the extracted user_data so {{ step.field }} (existing flat path) AND {{ step.data.field }} (new wrapped path) both resolve. Guarded by !map.contains_key("data") to preserve task_sequence flatten back-compat (a labeled sub-task's data field stays addressable as both <step>.<label>.data.x AND <step>.data.x). 2 new unit tests; cargo test lib 566/0/0; release build clean. Single-commit patch on top of v2.57.0. Prior: Phase D R5 R7 — cross-server parity harness; Replay engine port complete (v2.57.0, server#157; closes server#148; tracks noetl/ai-meta#49 Phase D R5). 
Final slice — hermetic parity rig: events.json (13 synthetic events) + expected.json (Python's pre-recorded fold output) + regenerate_expected.py (standalone Python script — verbatim extract of service.py fold + helpers, no noetl-package imports). tests/parity_harness.rs 8-test integration suite asserts structural parity field-by-field across all six projections + payload refs. Parity is structural not byte-for-byte hex (Python and Rust hash different digest inputs per R4's design). All 8 tests pass; lib 564/0/0; release build clean. No kind-val needed — test-only PR. All 7 Phase D R5 rounds shipped today (v2.51.0 → v2.57.0); the Replay engine port is complete. Phase D R5 R6 — payload resolver (v2.56.0, server#156; Refs server#148; tracks noetl/ai-meta#49 Phase D R5): every event's result.reference JSON gets parsed into a typed PayloadSummary and appended to the relevant projection's payload-refs list. PayloadSummary + PayloadRefEntry types mirror Python's dict shapes; extract_payload_ref + payload_summary mirror Python's helpers with three-tier fallback (reference.<field> → rows_ref.meta.<field> → rows_ref.ipc.<field>). ReplayExecutionState.payload_refs + ReplayFrameState.output_ref/output_ref_summary + ReplayBusinessObjectState.payload_refs/last_payload_ref all populated. 15 new unit tests; lib 564/0/0. Kind-validated against live execution showing 3 populated payload_refs with real SHA-256 digests. Only R7 remains. Phase D R5 R5 — snapshot seed + base_state + upcaster digest (v2.55.0, server#155; Refs server#148; tracks noetl/ai-meta#49 Phase D R5): the replay fold can now start from a prior fold's output and continue from there. ReplaySnapshotSeed mirrors Python's frozen dataclass; ReplaySnapshotInfo is the output subset; ReplayFoldOptions carries the optional inputs; new fold_replay_state_with_options entry point + 5-arg fold_replay_state as a back-compat shim. 
R5-introduced ReplayState fields (replay_snapshot, upcaster_registry_digest) both skip_serializing_if None — default folds produce R1–R4-identical JSON. Snapshot-storage backend deferred to a downstream sub-issue. 8 new unit tests; lib 549/0/0. Kind-validated wire-shape back-compat. Phase D R5 R4 — typed Checksum + projection_checksums (v2.54.0, server#154; Refs server#148; tracks noetl/ai-meta#49 Phase D R5): every replay fold now produces a typed Checksum over the full state + a 6-entry projection_checksums map. ChecksumType enum (initial variant Sha256) + Checksum { type, value } struct + stable_json_bytes deterministic JSON encoder + compute_checksums per-projection + top-level digest run at the end of fold_replay_state. Design decision: digest input is the typed Rust state directly, not Python's flat-row normalize layer; Python byte-for-byte parity deferred to R7. 9 new unit tests; lib 541/0/0. Kind-validated: top-level checksum 41265876487f...ae426; six projection_checksums entries all populated. Phase D R5 R3 — loops + business_objects projections (v2.53.0, server#153; Refs server#148; tracks noetl/ai-meta#49 Phase D R5): third slice of Phase D R5 — replay fold populates the last two per-projection maps. Two new typed state structs (ReplayLoopState, ReplayBusinessObjectState) replace R2's serde_json::Map placeholders; ReplayState.{loops,business_objects} flip to BTreeMap for deterministic key ordering. Two new ID extractors mirror Python's _loop_id / _business_object_identity resolution order; business_object_status helper mirrors Python's _business_object_status suffix-derived ACTIVE/DELETED transitions; two new populate functions with full event-shape coverage (loop counters bump on command.{completed,failed} + loop.shard.{done,failed} + loop.{done,fanin.completed}; business-object attributes REPLACE/PATCH from meta). payload_refs deferred to R6. 13 new unit tests; lib 532/0/0. 
Kind-validated: re-probe returns loops + business_objects empty (expected — fixture doesn't emit those events). R4 design captured (per user direction): typed ChecksumType enum + Checksum { type, value } struct; future checksum types slot in via enum. Phase D R5 R2 — stages + frames + commands projections (v2.52.0, server#152; Refs server#148; tracks noetl/ai-meta#49 Phase D R5): second slice of Phase D R5 — replay fold populates stages + frames + commands projections. Mirrors Python's state["stages"] / state["frames"] / state["commands"] per-projection dicts. ReplayEventRow extended with stage_id / frame_id / command_id / worker_id / aggregate_type / aggregate_id / meta columns (all #[sqlx(default)]); three new typed state structs (ReplayStageState / ReplayFrameState / ReplayCommandState); ReplayState.{stages,frames,commands} flip from serde_json::Map to BTreeMap for deterministic ordering; three new ID extractors mirror Python's resolution order; three new populate functions with full status transitions (stage opened/closed; frame dispatched/started/committed/failed/abandoned; command full lifecycle). 10 new unit tests; lib 518/0/0. Kind-validated: re-probe of prior fanout_reduce execution returns commands map populated with 4 entries carrying worker_id + issued_event_id + last_event_id. Phase D R5 R1 — Replay endpoint scaffold + execution projection (v2.51.0, server#149; tracks server#148): opens Phase D Round 5 (Python's noetl/server/api/replay/service.py ~1236 LoC → Rust). Sub-issue server#148 documents the 7-round decomposition (R1 scaffold + execution / R2 stages+frames+commands ✅ / R3 loops+business_objects / R4 typed Checksum + projection_checksums / R5 snapshot seeds / R6 payload resolver / R7 cross-server parity harness). 
R1 ships new GET /api/replay/state route mirroring Python's endpoint.py byte-for-byte (query params + defaults + projection enum + mutually-exclusive cutoffs returning 400); new services::replay module with ReplayService + ReplayCutoff + ReplayProjection + ReplayState + ReplayExecutionState + pure deterministic fold_replay_state; minimal execution projection fold reuses the Phase D R4 terminal-event short-circuit. Phase D R4 follow-up: status endpoint short-circuits on terminal events (v2.50.1, server#147; closes server#146): ExecutionService::get_status now (1) looks up playbook.completed / playbook.failed FIRST and returns COMPLETED / FAILED directly, and (2) accepts 'success' lowercase in the completed_steps counter. Kind-validated: prior execution flipped RUNNING → COMPLETED on the same DB data. Phase D R4 slice 2 — apply_event handles step.skipped (v2.50.0, server#145; closes server#144): new step.skipped arm in state::WorkflowState::apply_event records the step with StepState::Skipped — fan-in barrier no longer defers forever when an upstream's when guard evaluates false. Container Tool Callback umbrella #43 Round 2 — POST /api/internal/container-callback/{execution_id}/{step} (v2.48.0, server#141; closes server#140; tracks noetl/ai-meta#43): external K8s watcher (Round 1, ops#166) POSTs Job terminal-state events here; handler validates path params, checks staleness via a single indexed SELECT on noetl.event, and emits a call.done event with the structured terminal state on match (or bumps noetl_container_callback_stale_total + returns 202 if no events exist for the execution). Six TerminalState variants survive in meta.terminal_state so playbooks can branch on the specific failure reason. Two new counters; 7 new unit tests; lib 487/0. Round 2 lands first per the umbrella's recommended ordering — smallest blast radius; unblocks Round 1 (watcher Deployment) + Round 3 (tools#36 Tool::Container). 
Secrets Wallet #61 cloud-specific dynamic-secret providers shipped — umbrella feature-complete (v2.45.0 server#137 + v2.46.0 server#139 + v2.47.0 server#138): 6d.1 AWS STS AssumeRoleWithWebIdentity (EKS IRSA path; no SigV4 — token IS the credential); 6d.3 Azure AAD client-credentials (off-cluster non-IMDS path; sovereign-cloud overrides via env); 6d.2 GCP iamcredentials.generateAccessToken (workload-identity impersonation of a target SA). All three return SecretValue.expires_at populated — Phase 6d's cache_decision clamps cache TTL accordingly; Phase 7c.3 background refresh re-resolves inside the window. 39 new unit tests across the three providers; lib 470/0. Secrets Wallet umbrella is feature-complete: envelope encryption + KMS + 5 static-secret providers (GCP-SM, K8s, Vault, AWS-SM, Azure-KV) + 3 dynamic-secret providers (AWS-STS, GCP-IAM, Azure-AAD) + residency policy + cross-region broker + KEK rotation + audit + auto-renewal with stampede collapse. Phase 7c.3 — background-refresh wire-up + stampede collapse (v2.44.0, server#136): cache-hit path now spawns a background tokio::spawn to re-resolve via the provider + update the cache via KeychainService::set when the row is inside the refresh window. Cached value returns IMMEDIATELY to the caller — worker fetches stay on the fast path. Stampede collapse via new src/services/keychain_refresh.rs RefreshInflight (Arc<tokio::sync::Mutex<HashSet<(i64, String)>>>); concurrent refreshes for the same (catalog_id, alias) collapse to one provider call. Refactor: extracted resolve_via_provider from try_resolve_keychain so cache-miss inline + background refresh share identical code. Phase 7c series wire-complete (7c + 7c.2 + 7c.3). 
Phase 7a.2 / 7b.2 / 7c.2 — operator-facing rotation + audit storage + cache-refresh primitive (v2.42.0 server#127 + v2.43.0 server#129 + server#131): 7a.2 wraps the Phase-7a rewrap_storage_string primitive with POST /api/internal/wallet/rotate-kek (batched cursor scan of noetl.credential + noetl.keychain, RotateSummary { processed, rewrapped, skipped, failed, last_id } for progress checkpointing) and GET /api/internal/wallet/key-status (per-version row counts). 7b.2 adds the noetl.secret_audit table (CREATE TABLE IF NOT EXISTS at startup; server-owned), DbAuditSink impl, and GET /api/internal/secret-audit?credential=&execution_id=&from=&to=&limit= (bounded; ORDER BY occurred_at DESC; hard cap 10_000). 7c.2 adds KeychainService::should_refresh(catalog_id, keychain_name, execution_id, scope_type, now) — cache-side companion of the Phase-7c decision primitive; reads the row's expires_at, asks secrets::dynamic::should_refresh_default (honours KEYCHAIN_CACHE_REFRESH_WINDOW_SECS), bumps noetl_secret_refresh_total{outcome="triggered"} on true. Backward compatible. Next: 7c.3 stampede-collapse mutex + background re-resolve; 6d.1/.2/.3 cloud-specific dynamic providers. Phase 7c — token auto-renewal primitives (closes Phase 7) (v2.41.0, server#125): final named round of the Secrets Wallet umbrella. secrets::dynamic::should_refresh(expires_at, refresh_window, now) decision primitive (true iff expires_at set + still valid + within refresh window); KEYCHAIN_CACHE_REFRESH_WINDOW_SECS env (default 60). noetl_secret_refresh_total{outcome} counter (triggered|succeeded|failed|stampede_collapsed; failed alert-worthy) + noetl_secret_refresh_duration_seconds histogram (50ms–5s buckets). 5 new unit tests; lib 427/0. Lib-only. All named phases (1–7) of the Secrets Wallet umbrella are complete. Remaining queue is discrete follow-up sub-issues only. Phase 7b primitives — secret-resolution audit service (v2.40.0, server#123): durable audit trail of every credential resolution. 
AuditEvent struct (NEVER contains the secret value); bounded Operation + Outcome enums; AuditSink trait + NoopAuditSink default + SecretAuditService wrapper with record_async (fire-and-forget, never blocks resolver) + record_strict (await, used when compliance mandates the row exist before the value releases) + record (dispatches by strict-mode). NOETL_SECRET_AUDIT_REQUIRED env (default false; 1/true/TRUE/yes/YES enable strict). noetl_secret_audit_writes_total{operation, outcome, status} counter (failed_strict alert-worthy). 8 new unit tests; lib 422/0. Lib-only. Phase 7a — KEK rotation primitives (v2.39.0, server#121): starts Phase 7. KeyManager::current_key_version() trait accessor; EnvelopeCipher::rewrap_storage_string primitive (parse → if same version Skipped; else unwrap with historical version → re-wrap with current → Rewrapped { old_key_version, new_key_version, new_storage_string }). Plaintext payload NEVER reconstructed — pure DEK re-wrap, AES-GCM ciphertext bytes stay byte-identical. noetl_wallet_rotate_total{table, status} counter (skipped|rewrapped|failed_unwrap|failed_wrap|parse_error). 4 new unit tests; lib 414/0. Lib-only. Phase 6e — cross-region broker (closes Phase 6) (v2.38.0, server#119): BrokerRegistry (region → broker_url from NOETL_SECRET_BROKER_REGISTRY env; empty default = pre-6e fail-closed); POST /api/internal/cross-region/resolve peer endpoint validates expected_entry_region == server_region() (defensive against stale peer registries), resolves locally, seals via Phase-5a primitives to the requesting worker's pubkey directly; get_sealed handler falls back to broker on AppError::ResidencyViolation; KeychainDef.no_broker_fallback per-credential opt-out; AppError::CrossRegionUnreachable → HTTP 502. Two new metrics: noetl_secret_broker_call_total{broker_region, outcome} + noetl_secret_broker_call_duration_seconds{broker_region} histogram (50ms–5s). 10 new unit tests; lib 410/0. 
Both residency shapes operational: hard isolation (strict + no broker = HTTP 403) + soft federation (strict + broker registered = transparent cross-region routing). Phase 6 closes. Phase 6d primitives — dynamic-secret support + cache honors issuer TTL (v2.37.0, server#117): SecretValue.expires_at: Option<DateTime<Utc>> field; src/secrets/dynamic.rs cache_decision() honors min(default_ttl, expires_at - now - safety_margin) and returns SkipCacheAlreadyExpired when the deadline is already past or inside the operator's safety margin; KEYCHAIN_CACHE_DYNAMIC_SAFETY_MARGIN_SECS env (default 60); resolve_keychain_entry_with_meta returns the bundle's earliest expires_at; CredentialService::resolve_via_provider consumes the helper. Two new metrics: noetl_secret_dynamic_ttl_seconds histogram (1m/5m/15m/1h/4h/12h buckets) + noetl_secret_cache_skip_total{reason} counter. 7 new unit tests; lib 398/0. Backward compatible (providers without expires_at keep the 600 s default). Phase 6c — residency-policy gate (v2.36.0, server#115): KeychainDef.residency enum (none|advisory|strict, default none) + allowed_regions allowlist; resolver runs the gate at the top of resolve_keychain_entry BEFORE any provider call so strict-mode mismatches short-circuit with AppError::ResidencyViolation (HTTP 403, clear "credential X is region-locked to Y; this server is in Z" message that NEVER includes the value). noetl_secret_residency_check_total{policy, decision} counter — strict + violation_blocked is alert-worthy, advisory + violation_allowed is the migration-window signal. 8 new unit tests; lib 391/0. Phase 6b — ProviderRegistry + per-(provider, region) metrics (v2.35.0, server#113): server-side cache of (provider_id, region) → Arc<dyn SecretProvider> so the resolver doesn't rebuild from env on every cache-miss; RwLock + double-checked locking on the build path so concurrent get_or_build for the same key only builds once. Optional TTL via NOETL_SECRET_PROVIDER_TTL_SECONDS. 
New noetl_secret_provider_build_total{provider,region,status="cache_hit|ok|error"} counter + noetl_secret_resolve_duration_seconds{provider,region} histogram (5 ms – 5 s buckets). 7 new unit tests; lib 383/0. Phase 6a — region tag on keychain entries + per-region routing (v2.34.0, server#111): starts Phase 6 (residency-aware distributed resolution). KeychainDef.region optional field (no schema migration — lives in existing JSON blob); SecretRef.region provider-agnostic; AWS provider consumes it as the regional endpoint with explicit precedence (<region>: ref prefix > field > legacy project overload > AWS_REGION env). New NOETL_SERVER_REGION env + server_region() / effective_region() fallback helpers. noetl_secret_resolve_total{provider,region,status} counter per observability.md Principle 1. 5 new unit tests; lib 376/0. Lib-only — backward compatible. Phase 5b — wire format + sealing endpoint (v2.33.0, server#107): new GET /api/credentials/{id}/sealed?worker_id=<name> returns a SealedEnvelope (X25519-sealed credential JSON) addressed to the named worker; workers opt in by including worker_public_key in their register payload's runtime JSON blob — no schema migration; 400 BadRequest when the worker_pool row exists but didn't register a key; noetl_credentials_sealed_total{status} counter + credential.seal span per observability.md. Kind-validated end-to-end (Python cryptography + HKDF + ChaCha20-Poly1305 opens the envelope → recovers the bearer token + scope round-trip). Phase 5a — sealed payload crypto primitives (v2.32.0, server#107): src/crypto/sealed.rs X25519 ECDH + HKDF-SHA256 + ChaCha20-Poly1305 sealed-box (nonce derived from the shared secret, AAD pins alg+v for clean alg-mismatch rejection); 12 unit tests (round-trip, tamper, alg/version-mismatch, JSON wire stability); lib 369/0. Defense-in-depth on top of Phase-4 mTLS — cleartext never enters the response body. Lib-only; 5b adds the runtime-registry worker pubkey + sealing endpoint, 5c the worker side. 
Providers 3.x — AWS Secrets Manager + Azure Key Vault (v2.31.0, server#105): two new backends behind the one SecretProvider trait completing the 5-provider matrix. AWS SM uses hand-rolled AWS Signature Version 4 signing (no aws-sdk dep tree; signing key verified by a unit test against AWS's published reference vector); ref shape [<region>:]<secret-id>[#<json-key>] with JSON-key extraction for multi-field secrets; creds from env (the IRSA-injected triple). Azure KV uses IMDS Managed Identity (AKS/VMs) with TTL-cached bearer; ref shape [<vault>/]<secret-name>[#<version>]; sovereign clouds via NOETL_AZURE_KEYVAULT_DNS_SUFFIX. 21 new unit tests; lib 357/0; cloud-only backends (kind-val at unit-test layer like GCP). Phase 4a — opt-in TLS/mTLS listener (v2.30.0, server#103): the worker↔server credential channel (GET /api/credentials/<alias>) was plain HTTP; opt-in TLS via NOETL_TLS_CERT+NOETL_TLS_KEY (+NOETL_TLS_CLIENT_CA ⇒ mTLS), ring rustls + axum-server bind_rustls, kind-validated (200 w/ client cert, rejected w/o, plain HTTP refused). Providers 3.x — HashiCorp Vault provider (v2.29.0, server#101): a provider: vault keychain alias resolves from a Vault KV v2 secret (X-Vault-Token; ref [<mount>/]<path>#<key>), kind-validated end-to-end against an in-cluster Vault — second backend validatable on kind after K8s. /api/executions list perf + status fix (v2.28.1, server#99, #62): candidate-first rewrite (start-event index, not a 3.2M-row seq scan) — 6.5 s → 0.015 s (~430×), identical list; bool_or status-drift fix (was all-RUNNING). Secrets Wallet #61 providers 3.x — Kubernetes Secrets provider (v2.28.0, server#97): a provider: k8s keychain alias resolves from an in-cluster Secret via the API server + ServiceAccount token + cluster CA — the first secret backend kind-validated end-to-end with a real value (GCP needs GKE). 
Orchestrator-strand fix (v2.27.2, server#95): a deterministic evaluate failure (an invalid template in a step code body, an unknown step in a next arc, malformed routing) now emits a terminal playbook.failed instead of stranding the run in RUNNING forever — surfaced by the #54 e2e sweep (closed server#94). Parser fix (v2.27.1, server#93): NextSpec untagged-variant order — the list form next: [{step: x}] was deserialized into a struct Router positionally, silently dropping its arcs (and defeating unknown-step validation); sequence-shaped variants now precede the struct. Secrets Wallet Phase 3c — keychain cache (v2.27.0, server#91): execution-scoped, envelope-encrypted, TTL'd cache so an auth: "{{ alias }}" lookup isn't re-fetched from the secret manager per step; + fixed the keychain storage layer (queries never matched the table — also repairs the /api/keychain endpoints). Phase 3 (resolution) complete — R3b (v2.26.0) resolves a provider: gcp keychain alias from GCP Secret Manager on a credential miss; built on R3a/R2/R1 (v2.23.0–v2.25.0). Phases 1–2: Cloud KMS for the KEK (v2.22.0); envelope encryption (v2.21.0)
noetl/worker Rust NATS pull worker v5.40.0 ✅ #115 Phase 5 — forward the atomic-item-context flag onto the off-server from_events drive input (worker#121, v5.40.0 2484d17): so the off-server drive narrows each worker-bound command context to its minimal declared slice (the wasm reuses orchestrate-core's build_command). Default false → full-context dispatch unchanged. Prior: ✅ #115 Phase 4 REMAINDER — stateless off-server drive (resolve trigger type off the WAL + no-op on incomplete chain) (worker#120, v5.39.0 8e1f651): absorbs the server's stateless edge. ExecutionChain::event_type_of + build_offserver_input(trigger_event_id) resolve trigger_event_type off the pool WAL index when the server omits it (defaults command.completed). resolve_offserver_orchestrate_input returns an `OffserverDispatch{Wasm
noetl/tools Shared tool registry crate v3.13.0 #103 — deferred (ack-after-processing) ack (v3.13.0, tools#71): AckMode::Defer in the subscription SourceClient surfaces a durable per-message ack handle (NATS $JS.ACK reply subject) instead of acking inline; SourceClient::ack(ack_ids, AckDisposition) = Ack/Nack/Term (NATS + Pub/Sub); tool operation: ack|nack|term. Opt-in — existing callers unchanged. The capability the worker materializer (v5.34.0) drives for ack-after-materialize. Prior: #90 Phase 4 — store-and-forward spool engine + per-downstream circuit breaker (v3.4.0, tools#54): noetl_tools::spool — circuit breaker (trip/half-open/close, NATS-KV-serializable, per-downstream OQ2), SpoolItem (SHA-256 + noetl://spool ref + recv_seq-ordered keys), nats_object/local_disk backends, ordered-replay engine (global/per_key/none + idempotency + dead-letter + retention/GC). 44 unit tests + real-NATS integration test. Prior: #90 Phase 2 — header-directive engine + public build_source factory (v3.3.0, tools#52): source/directives.rs DirectiveSpec/DispatchPlan turn allowlisted message headers into dispatch instructions (redirect dispatch.playbook, dispatch.execution_pool, priority→pool, idempotency_key, content_type/schema_hint, W3C trace), untrusted by default (allowlist + value-allowlist enforced at parse; multi-value last-wins; applied[] audit). Public build_source(cfg, ctx) so the worker continuous runtime constructs the same SourceClient. 12 new tests. Prior (v3.2.0): bounded-drain subscription tool + SourceClient (Phase 1). Prior: Multi-tool sibling references (v3.1.1, tools#48; closes noetl/ai-meta#87): in a tool: [list] step, TaskSequenceTool stored each sub-tool's result for the aggregated output but never injected it into the running context, so a later sub-tool's {{ <label>.<field> }} rendered empty — masked in quoted positions, a syntax error at or near "," in unquoted numeric SQL (save_edge_cases test_large_payload). 
Fix injects each sub-tool's result under its label (with a synthetic .data self-ref matching build_context) so later siblings + a later python sub-tool's stdin variables resolve it. 2 new unit tests; lib 300/0. Kind-validated: save_edge_cases test_large_payload → record_count = 100, save_delegation_test clean. Worker adopts via worker#69 (b97f642). Prior: e2e-sweep cleanup (v3.1.0, tools#47; tracks noetl/ai-meta#49): YAML boolean when: true in policy rules now checks as_bool() before the string-template fallthrough (Value::Bool as_str() returns None); `
noetl/cli Rust CLI + local-mode runner v4.11.0 noetl subscribe — local-mode subscription listener (RFC #90 Phase 6) (cli#60, ai-meta → 2fb3fb0): standalone listener + FileEventSink JSONL + local_disk spool; cli-only (noetl-tools v3.5.0 source+spool reused unchanged). Prior: --include-data flag doc fix (cli#58)
noetl/gateway Gatekeeper — auth + SSE + push-ingress v3.3.0 #90 Phase 3 — push-ingress (Mode C) + auth-gated directive trust (gateway#28): POST /ingress/{listener} verifies HMAC / bearer / Pub-Sub-OIDC → only-then directives → one POST /api/execute per delivery on the dedicated pool (verify-and-forward, no DB on the ingress path); verify_then_plan makes the auth gate a testable invariant; first /metrics surface. Live E2E green (HMAC 12/12 + bearer 12/12). Prior (v3.2.0): Phase F R3b-2 shard-info twin endpoint.
noetl/noetl Python control plane (legacy; retained for back-compat) v4.12.1 (ecd16a2) ✅ #115 Phase 2 DDL (noetl#667, ecd16a2): canonical schema_ddl.sql prev_event_id columns on noetl.event + noetl.command + idx_event_prev_event_id for fresh installs (the Rust server also ensures them idempotently at startup). Otherwise deprioritized per Rust-only direction; pytest debt at noetl/noetl#663 parked
noetl/ops Helm + manifests (untagged) k8s-watcher durable image + pod-level OOM classification (ops@cacc513, ops#168; closes noetl/ai-meta#80, tracks noetl/ai-meta#43): retired the dead bitnami/kubectl:1.30.3 (removed from Docker Hub; cluster was on the bitnamilegacy stopgap) for alpine/k8s:1.30.3 (kubectl + jq + curl baked in) — the prior runtime install never put curl on PATH so callback POSTs returned HTTP 000. classify_pod_failure() now reads the backing Pod's status (RBAC already grants pod reads) to emit failed_oom (OOMKilled) / failed_image_pull (ImagePullBackOff); build_body's completed_at fallback uses RFC3339 `now
noetl/docs Docusaurus site (untagged) ADR Implementation-status block
noetl/travel Reference SPA (domain-fork example) (untagged) Production
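
The Secrets Wallet entries in the table above describe two time-based decisions: the Phase 6d cache_decision rule, min(default_ttl, expires_at - now - safety_margin), and the Phase 7c should_refresh rule (true iff expires_at is set, still valid, and within the refresh window). The following is a minimal sketch of both rules, not the server's code. It assumes Unix-second integers in place of DateTime<Utc>; the CacheFor variant name and the exact boundary handling are illustrative assumptions.

```rust
/// Outcome of the cache-TTL decision. `SkipCacheAlreadyExpired` is named in
/// the table; `CacheFor` is a hypothetical name used for this sketch.
#[derive(Debug, PartialEq)]
pub enum CacheDecision {
    /// Cache the value for this many seconds.
    CacheFor(u64),
    /// The issuer deadline is already past or inside the safety margin.
    SkipCacheAlreadyExpired,
}

/// TTL = min(default_ttl, expires_at - now - safety_margin).
/// All times are Unix seconds (an assumption made to stay stdlib-only).
pub fn cache_decision(
    expires_at: Option<u64>,
    now: u64,
    default_ttl: u64,
    safety_margin: u64,
) -> CacheDecision {
    match expires_at {
        // No issuer deadline (static secret): keep the default TTL.
        None => CacheDecision::CacheFor(default_ttl),
        Some(deadline) => {
            // Seconds of validity left after reserving the safety margin.
            let usable = deadline
                .saturating_sub(now)
                .saturating_sub(safety_margin);
            if usable == 0 {
                CacheDecision::SkipCacheAlreadyExpired
            } else {
                CacheDecision::CacheFor(usable.min(default_ttl))
            }
        }
    }
}

/// True iff `expires_at` is set, the value is still valid, and the
/// remaining lifetime is within `refresh_window` seconds.
pub fn should_refresh(expires_at: Option<u64>, refresh_window: u64, now: u64) -> bool {
    match expires_at {
        Some(deadline) => deadline > now && deadline - now <= refresh_window,
        None => false,
    }
}
```

With the documented defaults (600 s TTL, 60 s safety margin, 60 s refresh window), a token that expires in 300 s would be cached for 240 s under this sketch, and a background refresh would be triggered once it has 60 s or less left.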

Architecture at a glance

Today — kind cluster is fully Rust

The 2026-06-04 session retired all Python deployments, along with their services and configmaps, from the kind cluster (per the user directive "delete all legacy stuff"). Local validation now runs against the Rust topology by default:

              ┌──────────────────┐
              │  Gateway         │  noetl/gateway  v3.2.0
              │  (Rust)          │  auth · SSE · subscriptions · shard routing
              └────────┬─────────┘
                       │ HTTPS
              ┌────────▼──────────────────────────────┐
              │  noetl-server-rust  (v2.19.7)         │
              │  catalog · execute · events ·         │
              │  /api/internal · DbPoolMap sharding   │
              │  orchestrator engine · SSE            │
              │  workbook resolution · pipeline parse │
              └────────┬──────────────────────────────┘
                       │
              ┌────────▼─────────┐
              │   NATS JetStream │  NOETL_COMMANDS stream
              │   + Postgres     │  noetl.event + noetl.command
              └────────┬─────────┘
                       │
        ┌──────────────┴──────────────┐
        ▼                             ▼
  ┌──────────────────┐         ┌──────────────────┐
  │ noetl-worker-    │         │ worker-system-   │
  │ rust v5.11.3     │         │ pool             │
  │ (shared pool)    │         │ (Rust, runs      │
  │                  │         │  system          │
  │ noetl-tools      │         │  playbooks)      │
  │ v2.18.1:         │         │                  │
  │ python · shell · │         │ Consumer:        │
  │ http · postgres  │         │  ..._pool_system │
  │ duckdb · rhai ·  │         │ Filter:          │
  │ task_sequence ·  │         │  noetl.commands. │
  │ playbook · noop  │         │  system.>        │
  └──────────────────┘         └──────────────────┘
   Consumer:                     [Outbox publisher +
    ..._pool_shared               projector migrated
   Filter:                        to system playbooks
    noetl.commands.               via Phase 2.a]
    shared.>

Retired in the 2026-06-04 session: noetl-server (Python deploy), noetl-worker (Python deploy), noetl-outbox-publisher (Python deploy), noetl-projector (Python statefulset), the noetl/noetl-ext/noetl-projector/noetl-worker-metrics services, 4 legacy configmaps, and the noetl-worker ServiceAccount. The kind cluster is now the regression-test topology for Rust-only e2e.
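
The two worker pools in the diagram are separated by their consumer filters, noetl.commands.shared.> and noetl.commands.system.>. As a sketch of how such filters partition subjects, the function below implements the standard NATS wildcard rules (`*` matches exactly one token, `>` matches one or more trailing tokens). It is a stdlib-only illustration, not code from noetl-worker; the pool_for helper and the example subjects are hypothetical.

```rust
/// Returns true when `subject` matches a NATS-style `filter`.
/// Simplification: this does not validate that `>` is the last token.
pub fn subject_matches(filter: &str, subject: &str) -> bool {
    let mut f = filter.split('.');
    let mut s = subject.split('.');
    loop {
        match (f.next(), s.next()) {
            // `>` matches one or more remaining tokens.
            (Some(">"), Some(_)) => return true,
            // `*` matches exactly one token.
            (Some("*"), Some(_)) => continue,
            // Literal tokens must be equal.
            (Some(ft), Some(st)) if ft == st => continue,
            // Both exhausted at the same time: full match.
            (None, None) => return true,
            // Length mismatch or differing literal token.
            _ => return false,
        }
    }
}

/// Hypothetical helper: which pool's filter (from the diagram) accepts
/// this command subject?
pub fn pool_for(subject: &str) -> Option<&'static str> {
    const FILTERS: [(&str, &str); 2] = [
        ("shared", "noetl.commands.shared.>"),
        ("system", "noetl.commands.system.>"),
    ];
    FILTERS
        .iter()
        .find(|(_, filter)| subject_matches(filter, subject))
        .map(|(pool, _)| *pool)
}
```

Because `>` requires at least one trailing token, a bare noetl.commands.shared subject would match neither filter in this sketch.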

[Status legend: ✅ = shipped + kind-validated]

v10 playbook compatibility on Rust (closed loop)

Six interlocking gaps fixed in one iteration brought control_flow_workbook end-to-end on the Rust-only stack — exercising the complete v10 control-flow surface:

playbook YAML  ──►  noetl-server-rust orchestrator                 noetl-worker-rust
                                                                   + noetl-tools v2.18
─────────────────  ──────────────────────────────────────────────  ──────────────────
workload: {...}   ► #56 workload + input alias decode               PythonTool wrapper
tool.kind: python ► (existing dispatch)                             globals().update(args)
tool.kind:                                                          ► #17 capture
  workbook        ► #59 parser substitutes inline action            `result = {...}`
                                                                    global → data
tool: [{...}]     ► #57 ToolDefinition::Pipeline accepts both       ► #18 TaskSequenceTool
  (pipeline)        flat (name-as-field) + nested (label-as-key)     runtime
                                                                    
{{ step.field }}  ► #60 build_context exposes step data at top
  in next.arcs      level (not just steps.<name>); apply_event
                    captures call.done before command.completed
                    overwrites
                    
command.failed    ► #58 trigger_orchestrator on command.failed
                    + dedicated short-circuit in process_in_progress
                    emits playbook.failed terminal
                    
worker → server   ► #55 EventEmitRequest accepts i64 wire shape
  event emission    (was rejecting integer execution_id)
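
Gap #60 above is what lets an arc guard such as {{ eval_flag.is_hot == true }} resolve: step data is exposed at the top level of the render context as well as under steps.<name>. A minimal sketch of that idea follows. It assumes a flattened, dotted-path context with string values; the server's build_context works on JSON values and its real signature is not shown in this page, so the shape here is hypothetical.

```rust
use std::collections::BTreeMap;

/// Result fields of one step, e.g. {"is_hot": "true"}.
/// Strings stand in for JSON values to keep the sketch stdlib-only.
pub type StepData = BTreeMap<String, String>;

/// Builds a lookup context keyed by dotted path. Each step field is
/// reachable both as `steps.<name>.<field>` and as `<name>.<field>`.
pub fn build_context(steps: &BTreeMap<String, StepData>) -> BTreeMap<String, String> {
    let mut ctx = BTreeMap::new();
    for (step, data) in steps {
        for (field, value) in data {
            // Namespaced form, available before the fix as well.
            ctx.insert(format!("steps.{step}.{field}"), value.clone());
            // Top-level form, which `{{ step.field }}` in next.arcs needs.
            ctx.insert(format!("{step}.{field}"), value.clone());
        }
    }
    ctx
}
```

A step literally named steps would collide with the namespaced form in this flattened layout; the sketch ignores that case.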

Validated end-to-end:

noetl exec tests/fixtures/playbooks/control_flow_workbook
→ playbook_started
→ start (python)
→ eval_flag (workbook→python; is_hot=true captured via marker)
→ hot_path (next.arc when="{{ eval_flag.is_hot == true }}" matched)
→ parallel hot_task_a + hot_task_b
→ playbook.completed ✅

Long-term Python trajectory

Per the Rust-only direction, Python pieces stay only as:

  1. Container payloads — runtime stays Rust; user code that wants Python ships in a container dispatched by the container tool kind (#43, in design).
  2. Back-compat GKE deployments — existing Python pods on the production cluster aren't removed (yet) since GKE traffic still uses them; no new feature work goes there.

The kind cluster is the canary for the Rust-only topology. Once the operator has run the validation rigs end-to-end against their live cluster, the R5 cutover decision will move the production topology to match.

Sessions log

Chronological notes on what each session accomplished — see Sessions Log.

Most recent (top of log):

  • 2026-06-20✅ #116 program-scale step 2 SHIPPED + multi-replica gate-ON validated — execution-affinity single-owner WRITE ORDERING. server#252 v3.39.0 + e2e#71. Step 1 (KV coherence) was necessary-not-sufficient — the command.issued prev-read + head CAS-advance are two non-atomic steps, so concurrent cross-replica emits forked the chain. Affinity routes every trigger (POST /api/events, which fires the drive) to the single replica that ShardConfig::owns(execution_id) owns; a non-owner forwards a reverse-proxy POST (one-hop loop guard, degrade-to-local). On the owner the single-process drive lock + in-memory ChainHeads make the read→advance atomic, no distributed lock; KV is the genesis/handoff vehicle (owner resolves LOCAL → kv_remote_hit→0). server#252 (5e00d0a, v3.39.0): src/affinity.rs, flags NOETL_EXECUTION_AFFINITY/NOETL_PEER_URL_TEMPLATE/NOETL_SHARD_INDEX_FROM_HOSTNAME (all default off), metric noetl_execution_affinity_total{outcome}. e2e#71 (66b6e1b): 2-replica StatefulSet topology + rig HARD gate. Multi-replica gate-ON kind PASS: linear/loop/fanout COMPLETE, every chain roots=1/dangling=0/walk==total (NO fork), forwarded_ok +9, never-scan (scans+0) + sole-writer across replicas; single-replica unchanged; 595 server tests + clippy green; baseline restored. Follow-up #117: off-server from_events spine ordered by event_id wedges fan-in under a chain-order≠id-order inversion (affinity + high-concurrency fanout) — fix = order spine by prev_event_id walk; linear/loop already reliable. Prod multi-replica verdict: write-ordering COMPLETE (no fork) — prod can horizontally scale the off-server stack for linear/loop; high-concurrency fan-out needs #117 first. All affinity flags default off; PROD GKE untouched.
  • 2026-06-20✅ #115 program-scale step 1 SHIPPED — multi-replica coherence DATA LAYER (NATS-KV-backed ChainHeads + ExecDescriptor); execution-affinity STAGED. server#251 v3.38.0 + e2e#70. NOETL_REPLICA_COHERENCE=nats_kv (default local, prod unchanged) backs the off-server drive's watermark + descriptor with JetStream KV buckets so 2+ replicas resolve the same value — head advance = CAS (one chain), descriptor = CAS merge; in-process maps = write-through cache / degraded fallback (local → bit-identical). server#251 (8f39a79): src/coherence.rs (CoherenceKv + lazy buckets), ChainHeads/ExecDescriptors async, ExecDescriptor serde, metric noetl_replica_coherence_total{structure,op,outcome} (proof kv_remote_hit). e2e#70 (e222877): kind_validate_replica_coherence.sh + un-staled the #113/#114 offload asserts behind NOETL_RIG_EXPECT_OFFLOAD (default false; under refs_in_state=true the offload paths legitimately stay flat). Kind-validated: single-replica nats_kv is bit-for-bit parity with local (linear/loop/fan-out ×2 all COMPLETE, roots=1/dangling=0/walk==total, state_build_event_scans +0, hotpath scan +0, sole-writer intact); 2-replica proved cross-replica resolves work (kv_remote_hit advanced for head + descriptor, no kv_unavailable). Necessary but NOT sufficient: on 2+ replicas concurrent cross-replica emits still fork the chain (the issuing_event head-read vs head-advance is non-atomic across replicas → observed forked chains + a cross-execution prev), so executions don't reliably COMPLETE on 2+ replicas yet — the remaining piece is execution-affinity (one replica owns an execution's drive + chain write; substrate present in src/sharding.rs shard_for/owns), STAGED as program-scale step 2. 588 server tests + clippy green; baseline restored. ai-meta pointers → server 8f39a79 (v3.38.0) + e2e e222877. PROD GKE untouched; default local; no gate/mode/builder default changed. 
Off-server architecture is multi-replica-COHERENT (data) but not yet multi-replica-COMPLETE (write-ordering) — prod cutover stays single-replica until affinity lands.
  • 2026-06-20✅ #115 Phase 5 SHIPPED + gate-ON validated — atomic-working-item context (tenet 6): the drive hands a worker only its minimal declared slice; NOETL_ATOMIC_ITEM_CONTEXT (default off). #77 (Explicit Input Binding) resolved. server#250 v3.37.0 + worker#121 v5.40.0 + e2e#69.
  • 2026-06-20✅ #115 Phase 6 SHIPPED + gate-ON literal-zero validated — hot-path noetl.event read class RETIRED; the table is AUDIT-ONLY (server v3.36.0). NOETL_EVENT_READ_PATH=event_scan|audit_only (default event_scan, prod unchanged) retires the remaining lifecycle readers of noetl.event (the WHERE execution_id replay class outside the drive). server#249 (b71ca1d): under audit_only get_catalog_id (per-ingest) + inherit_parent_trace + subscription dedup-audit + container-callback catalog/existence serve from the in-memory execute-time ExecDescriptor; a cold descriptor (post-terminal straggler after eviction / restart) resolves catalog_id from noetl.command (synchronous queue) — never a noetl.event scan. Proof metric noetl_event_hotpath_reads_total{site,outcome}. ops#199 (e5b0737) pins event_scan on the prod server manifest (operator-gated flip). e2e#67+#68 (0ab3c0a) kind_validate_event_read_path_phase6.sh. Gate-ON kind-validated (PUBLISH_ONLY + offserver + materializer + audit_only): hot-path scan Δ0 (served_descriptor +96 + served_command +3), drive state_build_total Δ0 + event_scans Δ0 ⇒ ZERO noetl.event scans anywhere on the hot path, end-to-end; linear/loop/fan-out/output_select COMPLETE; sole-writer + lag-0; audit still works (direct SELECT + status COMPLETED + replay event_count=25); committed gate rig PASS with audit_only on (no regression); 585 server tests + clippy green; baseline restored. Completes the RFC never-scan end state (tenet 3) under the flag. ai-meta pointers → server b71ca1d (v3.36.0) + ops e5b0737 + e2e 0ab3c0a. PROD GKE untouched; default event_scan; no gate/mode/builder default changed. Remainder = Phase 5 (atomic-item, needs #77) + program-scale (per-shard WAL, multi-replica descriptor coherence).
  • 2026-06-19✅ #115 Phase 4 KERNEL + FLAG SHIPPED + shadow kind-validated — off-server state builder (worker v5.37.0 + server v3.33.0); drive cutover staged. The pool-side state_builder (worker#118, fef961c) reconstructs WorkflowState from the noetl_events WAL — a per-execution chain index walks prev_event_id head→root, caches the spine keyed by the immutable chain head, advances only the new tail. A live WAL shadow loop (NOETL_STATE_BUILDER_SHADOW, default off) + the server NOETL_STATE_BUILDER=offserver|server flag (server#246, 3e6006d, default server). Gate-ON kind-validated on live cluster: shadow replayed the WAL → chain-walked spines whose indexed==spine sizes match Phase-3 topologies (linear 13, loop 62, fan-out 25, output_select 31, storage_tiers 55) = parity by construction; WAL-read wal_events_total=993 with event_scans_total=0; cache cold_rebuild=28 (replay/restart) + incremental=21 (live tail-advance — fresh fan-out indexed Incremental(5), indexed==spine==25==DB event_rows 25); fresh fan-out COMPLETED gate-ON, event_rows==distinct, 0 __orchestrate__ event rows, materializer pending=0/project_errors=0; 8 worker unit tests + 2 server config tests + clippy green; baseline restored. ai-meta pointers → worker fef961c + server 3e6006d. PROD untouched; no default changed. The offserver drive cutover (drive consumes the builder's state) + Phase 5 (atomic-item context, #77) / Phase 6 (retire event read path) remain.
  • 2026-06-19✅ #115 Phase 3 MERGED — chain-walk state builder (server v3.32.0); Phase 4 (off-server state builder + WAL cache) started. server#245 self-merged (no classifier block) → server v3.32.0 (8338417); ai-meta repos/server pointer bumped. Behind NOETL_STATE_BUILD_MODE=chain_walk (default event_scan, prod unchanged) the drive reconstructs WorkflowState by walking prev_event_id head→root (in-memory ChainHeads head + (execution_id,event_id) PK lookups — never a WHERE execution_id scan) → same from_events (orchestrate-core unchanged; parity by construction); falls back to event-scan on cold-head / lag / non-genesis. Gate-ON kind-validated (prior session): parity 41/41 MATCH, event_scans_total=0 / 1064 PK hops / 0 fallbacks, all topologies COMPLETE, sole-writer + lag-0 + gate rig PASS, 577 tests + clippy green. Phase 4 now in progress — move the chain-walk state construction OFF the server onto the system worker pool reading the WAL/NATS stream, pool-side cache keyed by the immutable chain head + incremental tail-advance (server chain-walk + event-scan remain fallbacks). PROD GKE untouched; no gate default changed.
  • 2026-06-19✅ #115 Phase 2 implemented + kind-validated — one-level prev_event_id event chain (server#244 + noetl#667 merged). Each noetl.event carries prev_event_id (the immediately-previous event in causal order) + each noetl.command the issuing-event link, so per-execution events form a walkable singly-linked list followable pointer-by-pointer without scanning noetl.event (additive; no reader yet — Phase 3). The emit chokepoint emit_events stamps the event link from a per-execution chain-head watermark (ChainHeads) — one path covers drive events + command.issued + worker-lifecycle on both the gate-off INSERT and the gate-on publish, the materializer persisting it; the command link = the real step.enter/unblocking completion so cursor-fan-out bodies share their branch origin (§4.4). Server-only (no orchestrate-core change). Chain-correctness proven gate-ON across 6 executions (linear 13/13, loop 62/62, fan-out 25/25 with a real shared branch origin, sub-playbook 46/46, + Phase-1 output_select 31/31 & storage_tiers 55/55 bounded): each has 1 root, 0 dangling / 0 duplicate event prev, 1 head, pointer-walk == full sequence (no gaps), real-step command dangling=0; kind_validate_orchestrate_gate.sh PASS (sole-writer 25==25, 0 dup cycles, catalog0=0, lag 0); 573 lib tests + clippy green. PROD untouched; no gate default changed. Awaiting merge → ai-meta pointer bump; Phase 3 (chain-walk state builder) next.
  • 2026-06-19 🐞 #114 oversized-command.issued offload shipped (server v3.29.5); refs_in_state consume side (#101) is the remaining off-server-drive cutover blocker. With #113's decode fix in place, 4 large-context fixtures wedged at a distinct second stall: under refs_in_state=false the off-server drive embeds the full resolved upstream context into the next command, so its command.issued event (~1.32MB) exceeded NATS max_payload → publish ack-timeout → wedge. Fix: a command context over NOETL_COMMAND_CONTEXT_MAX_BYTES (512KB) is offloaded to noetl.result_store with a {__context_ref__} marker; get_command/claim_command resolve it before the worker sees it (metrics context_offloaded/context_ref_resolved). Server #242 → v3.29.5 + rig phase 8 e2e#64. Kind gate-ON: off-server rig PASS (new test_oversize_command_context COMPLETED, max command.issued ctx 585B, offload+resolve fired, 0 __orchestrate__ event rows, materializer lag 0); every command.issued event <1MB across all fixtures; 6 of #113's 9 now COMPLETE. Chose ref-on-oversize over refs_in_state=true (candidate #1): a kind experiment proved refs_in_state=true fixes the state-bloat (kind_playbook_lease_expiry completes — drive-state 29KB vs ~1MB loop) but breaks the bulk-consuming fixtures (test_storage_tiers/test_output_select fail at the bulk step) because the worker render-time ref-resolution isn't implemented — so the default stays false. The remaining 3 fixtures + the cutover (#107/#111) now hinge on the refs_in_state consume side (#101): __orchestrate__ drive-state bloat (17.4MB for storage_tiers) + the _ref/bulk-resolve gap. ai-meta → server 385d21f (v3.29.5) + e2e 9919392. No prod default flipped; prod is pre-#108 in-server drive (unaffected).
  • 2026-06-19 🐞 #113 off-server drive — recover offloaded drive result + stop drive on cancel (server v3.29.4); #114 opened. Fixed the worker-driven drive stall when an __orchestrate__ result exceeds the 100KB inline budget (worker offloads it with only a reference.ref → server now resolves+decodes it via result_store.resolve, metric ref_resolved, instead of dropping → non-convergent re-loop) + the cancel non-stop facet (match underscore playbook_cancelled + ExecutionState::is_terminal terminal guard, no restart). Server #241 → v3.29.4 + rig e2e#63. Kind gate-ON proven (785KB result → ref_resolved→COMPLETED, 0 decode WARNs; cancel froze a drive-loop instantly; sole-writer lag 0). 5/9 #113 large-context fixtures COMPLETE; the other 4 hit a distinct oversized-command.issued (full upstream context embedded → >1MB NATS payload) stall → #114 (#113 stays open until all 9 close). ai-meta → server 1e844c1 + e2e 12b27e9. No prod default flipped.
  • 2026-06-19🚀 #103 GKE pre-flip PREP — prod images pushed, GMP monitoring live, manifests staged; NO traffic flip / NO PUBLISH_ONLY. Verified prod already-Rust (the #49 cutover is done; pre-#103 live images), both flip secrets present, monitoring = Google Managed Prometheus (not VM). Pushed server v3.29.3 + worker v5.35.0 to the prod AR (amd64); applied + verified GMP PodMonitoring (worker+server /metrics — the noetl ns had none) + materializer-lag Rules (up{namespace="noetl"}=4 live); staged the roll-forward manifests (not applied — they roll live workloads); runbook gained a "Production (GKE)" section + GMP managedAlertmanager pager stub. Operator-gated: roll images → materializer shadow → pager → flip. ai-meta e5b6d6c → ops 9edd9c4 (ops PR #197). No prod default changed.
  • 2026-06-19 🛡️ #103 materializer-lag GUARDRAIL shipped — the pre-flip observability gate. The server was FLIP-READY; the remaining gate was a materializer-lag metric + alert. Worker #116 → v5.35.0 extends the JetStream lag poller to track the noetl_events/noetl_materializer consumer on an independent task → noetl_worker_nats_consumer_pending{consumer="noetl_materializer"} climbs even when the materializer loop is dead. Ops #195+#196: VMRule (backlog warning>200/critical>2000/growing + stall-under-gate + project-errors + absent-under-gate, stall guarded on backlog>0), worker /metrics VMServiceScrape (was unscraped), VMAlert enabled, Grafana dashboard, flip runbook noetl-cqrs-publish-only-flip.md. Kind-proven full cycle on the VM stack: green baseline (backlog 0) → induced lag (materializer fault-injected, events publishing under the gate) → gauge 0→684, alerts fire (backlog warning+critical + stall) → recover → drains→0 idempotently (0 dup/loss), alerts clear. ai-meta → worker b910341 (v5.35.0) + ops 2fcfa59 + worker-wiki 0030f30. PUBLISH_ONLY stays default-off.
  • 2026-06-19 🎯 #103 server cutover COMPLETE — FLIP-READY. The 2 ExecutionService cancel/finalize sites now route through the emit_event chokepoint (server #240 → v3.29.3 + e2e rig #62); kind-proven both modes; no remaining synchronous server event writers under the gate. Flipping PUBLISH_ONLY on is now a staged operator decision.
  • 2026-06-19 🎯 #104 off-server-drive × gate reconciliation PROVEN — the last real blocker before the PUBLISH_ONLY flip is operator-safe. The combination #103 left unproven (gate-on was only ever validated with the in-process drive) is green on kind: gate-ON (PUBLISH_ONLY=true) with the off-server drive (PLUGIN_DRIVE=true) + materializer sole writer → fresh exec + cursor fan-out → COMPLETED; server wrote 0 noetl.event rows (all 25 PUBLISHED — event_ingest_published_total=25), materializer materialized all 25 exactly once (25 rows == 25 distinct ids, 0 catalog_id=0, 0 dup cycles), drive dispatched=applied, read-your-writes held (the relocated trigger fires post-materialize → server rebuilds state from committed log before bounding the off-server drive input). Server #238 → v3.29.2 (76d29bb): cold-cache apply now rebuilds WorkflowState from the durable log (the #104 WAL-rebuild principle) instead of dropping the in-flight result on a server restart mid-drive — kind crash-recovery proof: hard-kill mid-drive → cold_rebuild metric+log fires → that exec COMPLETES with full event integrity. Committed e2e rig kind_validate_orchestrate_gate.sh (e2e#61). Regression green: gate-off + in-process (prod default) and gate-off + off-server. ai-meta → server 76d29bb + e2e 61f7a5c. Remaining before a safe flip: only the 2 ExecutionService cancel/finalize sites. No prod default changed.
  • 2026-06-18🎯 #108 (c) — the worker-driven orchestrator drive is now the DEFAULT; #108 CLOSED. Flipped NOETL_ORCHESTRATE_PLUGIN_DRIVE to default true (server#233, v3.28.0 → server@80cc0e6). Gated on a scale soak on kind (images built from the released tips, server v3.27.0 / worker v5.33.0): a single 694-drive cursor+fan-out run (test_pft_flow_v2 3×40) COMPLETED with __orchestrate__ rows in noetl.event = 0 (event_suppressed +2082) and all 694 drives claimed on the system pool (shared pool got only the 671 real-step commands), 0 errors; 5× concurrent self-contained cursor = 5/5 COMPLETED. Then deployed the flipped image with no env var — the default-on path reproduced the identical shape (361 drives, system-isolated, 0 burst); 15/15 regression fixtures green; the revert (=false) verified to fall back to the in-process drive (system delta 0). In-process trigger_orchestrator_inner kept as the fallback. ai-meta → server 80cc0e6 + worker 437b0be + server-wiki 0210012.
  • 2026-06-18orchestrate drive isolated on the SYSTEM pool via pool affinity (#108 follow-up b, kind-validated). Server stamps execution_pool on the command notification (server#232 → server@846166b); worker declines (ACK+skip) notifications not for its pool segment (worker#114 → worker@e2162b7), so the drive runs on the dedicated system pool even under JetStream consumer-filter drift. No worker HTTP pending-poll — the NATS consumer is the only claim vector. Validated: __orchestrate__ claimed+executed on the system pool (3), zero on the default pool, simple_python COMPLETED; 553+196 tests green. Only (c) the deliberate default-flip remains.
  • 2026-06-18the orchestrate meta-command touches noetl.event ZERO times (#108 follow-up a, kind-validated). dispatch_orchestrate_command stops writing command.issued to noetl.event; the command lives only in noetl.command, and claim_command/get_command fall back to it on a miss (noetl.event stays authoritative for normal commands) (server#231 → server@9438f3b). So __orchestrate__ writes 0 of its former 5 rows per drive — the directive that system playbooks keep only their own state is met. Validated: cursor+fan-out COMPLETED via the noetl.command fallback, 0 event rows, 20 real steps normal, 0 errors. Remaining: NATS affinity (ops) + default-flip.
  • 2026-06-17system playbook events no longer burst Postgres + system-pool routing (#108 slice 4b, kind-validated). The __orchestrate__ meta-command is infrastructure, not a workflow step, so the server now skips persisting its lifecycle events to noetl.event (handle_event_inner + claim_command) (server#230 → server@6aef3a6). At scale they'd burst noetl.event/Postgres for no benefit. Validated: __orchestrate__ now writes only the lone command.issued (1 of 5 — 80% fewer rows); cursor+fan-out flow still COMPLETED. Drive routes to the system segment (true isolation pending a NATS-affinity ops fix; resilient via the pending-poll meanwhile). Follow-ups: eliminate the last command.issued (claim from noetl.command) + NATS affinity.
  • 2026-06-17🎯 the orchestrator drive runs OFF-SERVER on the worker pool (#108 slice 3, kind-validated). With NOETL_ORCHESTRATE_PLUGIN_DRIVE=on the server issues system/orchestrate (entry: run_state, args = the bounded WorkflowState) to the worker pool instead of evaluating in-process; the worker runs the drive, the server applies the result on the command's call.done (server#229 → server@465cdbb v3.23.0). Kind: test/simple_python drove start→end→COMPLETED through the round-trip (dispatched=2, applied=2, 0 decode_error, __orchestrate__ didn't leak as a step, playbook.completed). Default off, in-process fallback. Bug caught+fixed: output_b64 rides call.done not command.completed. Next: shadow→flip at scale, make drive the default, route to the system pool.
  • 2026-06-17worker-driven cutover slice 2: apply_orchestration_result extracted + slice 3 designed (#108). The post-evaluate emission (events → commands → terminal) is extracted verbatim from trigger_orchestrator_inner into a reusable fn (server#228 → server@586aeae) so the worker-driven drive applies a worker-computed result identically. Behavior-preserving (553 tests green, clippy clean). Slice 3 (dispatch) designed + grounded: apply_event would phantom-step a meta-command, so the design uses a reserved __orchestrate__ step ignored in state, a flag-gated scheduler, apply-on-callback, and loop-prevention. Lands behind NOETL_ORCHESTRATE_PLUGIN_DRIVE (default off), kind-validated before the flip.
  • 2026-06-17worker-driven cutover slice 1: configurable wasm guest entry (#108). The worker can now dispatch a named plug-in export (worker#113 → worker@04420d0): tool: {kind: wasm, plugin: {path, version, entry}} names the export (default run); the worker-driven orchestrator will use entry: run_state. invoke_bytes_with_entry + run_by_ref_entry/run_and_apply_by_ref_entry (originals delegate with run); test proves run→0xAA vs run_state→0xBB + missing-export error. Purely additive, no live change. Server scheduler+apply (the hot-path round-trip, default-off flag) next.
  • 2026-06-17orchestrate plug-in drives the real workload identically, live (#108 slice 4). The orchestrator runs the plug-in alongside the in-process drive on every evaluation + diffs commands (server#227 → server@bd652ab) — a process-global wasmtime host (feature orchestrate-shadow) loaded from noetl.plugin_module at boot, gated NOETL_ORCHESTRATE_PLUGIN_SHADOW; in-process result authoritative. Kind-validated over the live 10×1000 PFT: noetl_orchestrate_shadow_total{result="match"} 529, ZERO mismatch/error, workers stable. Plug-in gains a state-input path (run_state); both build configs green (default no wasmtime). Slices 1-4 prove orchestrator-as-plug-in end to end; next is the worker-driven cutover.
  • 2026-06-17system/orchestrate@1 registered + servable in a deployed server (#108 slice 3). The server bakes the orchestrate wasm into its image and seeds built-in system plug-ins into noetl.plugin_module on boot (server#226 → server@b21b589, kind-validated). New src/system_plugins.rs (pure dir-scan + sha256, unit-tested) + a wasmbuilder Docker stage + NOETL_SYSTEM_PLUGIN_DIR; in-process upsert (not the token-gated HTTP surface); digest-keyed hot-reload. Validated: GET /api/internal/plugins/system/orchestrate?version=1 → 200 application/wasm 1559093 bytes, ETag=digest, stale→409, baked sha256 == served digest. Next: kernel scheduler dispatches the now-registered plug-in.
  • 2026-06-17system/orchestrate plug-in runs identically to native in wasmtime (#108 slice 2). A wasmtime shadow-diff (server#225 → server@ccec104) loads the built .wasm through a harness mirroring the worker host's invoke_bytes ABI byte-for-byte and asserts the wasm output equals the native drive over auth0 multi-arc when: routing (minijinja in wasm) + cold-start. Finding: command-set identity (parsed Value eq), not raw bytes — the context map serializes in insertion order (serde_json preserve_order ← upstream HashMap iteration, differs wasm32 vs host arch); the scheduler deserializes to Vec<Command>, so the command set is the bar. 2 unit + shadow-diff green; plug-in excluded; test-only. Next: catalog register/serve → kernel scheduler.
  • 2026-06-17system/orchestrate WASM plug-in exists — drive core runs as a 0-import module (#108 slice 1). New standalone plugins/orchestrate/ crate (server#224 → server@10a629b) wraps the drive behind the worker plug-in ABI (input = JSON event-slice + playbook; output = JSON OrchestrationResult; data-plane = memory/alloc/run) and compiles to wasm32-unknown-unknown — the first non-trivial compiled system playbook. Feasibility risk retired: the .wasm has 0 imports (no WASI, no host render) — the whole drive incl. minijinja runs in-guest. Native parity test reproduces native evaluate byte-for-byte; 551 server tests green, crate excluded from the workspace. Next: worker-host shadow-diff → catalog register/serve → kernel scheduler (NOETL_ORCHESTRATE_PLUGIN, default off).
  • 2026-06-17Orchestrator drive core fully wasm-resident — Event-ABI round #109 CLOSED. Slice 3 (server#223) moved orchestrator/evaluate from src/engine/ into noetl-orchestrate-core. All 6 drive modules (renderer, playbook model, commands, evaluator, state, orchestrator switch) now compile native + wasm32-unknown-unknown — the system/orchestrate plug-in seed (#108). evaluate reads the pure core::event::Event; server converts db::Event at the trigger_orchestrator boundary (slice-1 From). 122 core + 565 server tests green, 0 WASI imports on wasm32, clippy clean; cargo-chef image (v3.20.0) kind-deployed, PFT 10×1000 — full command lifecycle, 0 errors, 0 restarts. ai-meta → server bfd3f77 (internal refactor, stays v3.20.0).
  • 2026-06-14Transfer tool: Snowflake↔Postgres both directions — #99 CLOSED. Both transfer arms implemented with full credential-alias resolution. tools v3.10.0 (tools#65) + worker v5.22.0 (worker#87) + e2e#58. SF→PG: $n::text::<udt> coercion + RFC3339 timestamp reformat; PG→SF: generated INSERTs. Full bidirectional data_transfer/snowflake_postgres fixture COMPLETED on kind against live sf_test account. tools → 4127b4b · worker → 6d97e7c · e2e → 94aa7f1.
  • 2026-06-14 Snowflake key-pair JWT validated end-to-end — #98 last external-tool gap closed; transfer step → #99. noetl-tools v3.9.0 / v3.9.1 / v3.9.2 (tools#62/#63/#64) — key-pair JWT auth (bypasses MFA) + User-Agent fix (code 391903) + SQL-API context-in-body + multi-statement split (codes 391911 + 000008). Worker bumped to v3.9.2 (worker#83–#86) + e2e#57 fixture cleanup. create_sf_database (CREATE DATABASE) + setup_sf_table (CREATE TABLE + INSERT) both COMPLETED via JWT on kind against the live sf_test account (NDCFGPC-MI21697). Transfer step fails (inline creds, no key-pair fields) — filed #99. ai-meta pointers: tools a216ab2 · worker 9d6b127 · e2e e191231.
  • 2026-06-12 #90 Phase 7 shipped — scale hardening; #90 CLOSED (all 7 phases complete), live proof green. Final phase: server v3.5.0 (server#189) POST /api/execute/batch (N→N, partial-failure contained) + opt-in exactly-once dedup window (noetl.subscription_dedup, bounded-by-age, race-safe, default off); worker v5.19.0 (worker#79) batch dispatch + dedup opt-in + per-subscription rate limits (deterministic token-bucket RateGovernor, fetch-side backpressure → source keeps backlog, no loss, subscription.rate_limited event); ops (ops#176) + e2e (e2e#48); no tools change → no crate cascade. Live on kind: batch 12→12 COMPLETED on the subscription pool + per-message traceparent; dedup duplicate→1 execution + subscription.message.deduplicated; rate-limit engaged + 10/10 → executions (no loss); direct-curl within/outside-window + dedup-off + batch partial-failure all green. ai-meta → server 7b217d8 + worker 7531f4a + ops 6db69b9 + e2e 203593b. #90 closed; follow-ups tracked: #91–#94 + tools#57.
  • 2026-06-12#90 Phase 6 shipped — CLI local noetl subscribe + FileEventSink + local_disk spool (live local proof green). Added noetl subscribe <spec.yaml> (cli v4.11.0, cli#60, closes cli#59): a kind: Subscription listener run standalone in local mode — no k8s, no NATS-dispatch server for the listening itself — reusing the same noetl_tools source clients + directive engine + spool engine the in-cluster worker uses, emitting the same ExecutorEvent envelope to a local FileEventSink (one event/line JSONL → replayable trail). Local dispatch (RFC §5.3): in-process via PlaybookRunner (pure-local default) or POST /api/execute. local_disk spool (§8.6): circuit-breaker + buffer + ordered replay + idempotency + dead-letter against a local dir, circuit state in a local file. New src/subscribe/{mod,spec,sink,dispatch,runtime,spool}.rs + examples/subscribe/. cli-only — no tools change / crate cascade (the source+spool surface ships in noetl-tools v3.5.0; bumps the lock 3.0.0 → 3.5.0 via the executor's "3"). Tests: 12 subscribe + full bin suite (53) green, incl. a deterministic outage→spool→ordered-replay→idempotency proof on the real engine. Live (in-cluster NATS on kind): 5 msgs → received=5 dispatched=5 failed=0 (19-event JSONL trail); local_disk spool outage → 6 message.spooled (0 dispatched, no loss) → recovery → 6 message.replayed in order → drained to 0. Finding: the NATS source ignores URL-embedded user:pass (async-nats ConnectOptions) — specs use explicit user/password. ai-meta → cli 2fb3fb0 (v4.11.0); wiki cli subscribe. #90 stays open for Phase 7 (scale hardening, volume-gated).
  • 2026-06-11 #90 Pub/Sub + Kafka brought to live-E2E parity with NATS (validation gap closed). Stood up the two remaining subscription brokers in kind — Pub/Sub emulator (gcloud SDK image) + single-broker KRaft apache/kafka:3.9.1 — under noetl/ops (ops#170), and added bounded-drain fixtures + kind-validate runners under noetl/e2e (e2e#41). Both backends passed the same live bar as NATS: publish/produce 5 → bounded drain count=5 acked=true → execution COMPLETED, with the call.done/command.completed/playbook.completed event trail. No adapter code change needed — the pure-Rust kafka crate talks to Kafka 3.9 KRaft and the Pub/Sub REST backend works against the emulator as-is. The one fix: the <step>.output.<field> accessor never resolved (both when: arcs skipped → drain stalled); corrected to <step>.<field> in the fixtures + the latent ops subscription_drain.yaml example. Validated on server v3.1.0 + worker v5.15.2 + tools v3.2.0; cluster left on that clean released stack. ai-meta → ops 568a4ac + e2e 8d21e7a. #90 stays open (Phases 2–7 design-only).
  • 2026-06-11 #89 shipped — JSON null round-trips through {{ step }} (server fix, v3.0.6). #89 (null → undefined serialization): CLOSED. The #88 cursor fixture walked all 4 pages but its 4th check_pagination crashed: the terminal page's next_cursor: null, re-injected via the whole {{ fetch_page }} envelope, rendered as the JS token undefined (invalid JSON), so the consuming Python step received response as a str. Traced the corrupt command.issued args.response to the renderer that builds next-step inputs — the server orchestrator (src/template/jinja.rs::render_to_value), not the worker the issue blamed. json_value_to_minijinja maps JSON null → Value::UNDEFINED; minijinja's map repr emits undefined; render_to_value failed from_str and fell through to a raw string. The noetl-tools engine already had a | tojson retry for exactly this; the server's copy had diverged without it. Fix (server#177, v3.0.6) ports the retry. 5 new tests; 619 lib + 8 parity green; clippy clean. Kind-validated end to end on the live test-server (baseline 4th check_pagination error → fixed success; cursor collects 35, matching offset). ai-meta → server 8e17fbe. Standing direction honored — Claude wrote the Rust directly, no Codex.
  • 2026-06-10 #88 shipped — pagination fixtures read response.body.*; #89 filed. #88 — offset/cursor pagination fixture path — CLOSED. The Rust http tool nests the parsed JSON payload under body ({{ fetch_page }} → {body, headers, status_code}); the fixtures read response.get('data', {}), which resolved to {}, so has_more/next_cursor defaulted falsy and the loop exited after page 1 despite the correct post-#85 machinery. Confirmed the shape against a live http-tool result, then switched both check_pagination steps to response.get('body', {}) (e2e#40). Kind-validated: offset walks 0→10→20→30, users 10/10/10/5, validate_results success 35, playbook.completed COMPLETED; cursor path-fixed + walks all 4 pages (Mg==→Mw==→NA==→null, 35 events fetched) but the terminal page surfaced a distinct worker bug → #89 (worker serializes next_cursor: null as JS undefined when re-injecting {{ fetch_page }}, so the consuming Python step gets an unparseable str). Other pagination fixtures (retry/max_iterations/pipeline*/loop_with_pagination) share the same envelope-key assumption over /api/v1/assessments|flaky ({data, paging}) — flagged, left for follow-up. ai-meta → e2e 72a7525.
  • 2026-06-10 #87 shipped, #85 deferred (e2e sweep follow-ups #85/#87). #87 — multi-tool sibling references — CLOSED. task_sequence (the tool: [list] pipeline runtime) stored each sub-tool's result for the aggregated output but never injected it into the running context, so a later sub-tool's {{ <label>.<field> }} rendered empty — masked in quoted positions, a syntax error at or near "," in unquoted numeric SQL (save_edge_cases test_large_payload). Fix (tools#48, v3.1.1) injects each sub-tool's result under its label (synthetic .data self-ref); worker adopts via worker#69. Kind-validated on a worker built from the fix: save_edge_cases test_large_payload → record_count = 100 (no syntax error), save_delegation_test clean. ai-meta → tools 76f942a + tools-wiki 4962f8b + worker b97f642. #85 — workflow-arc loop re-entry — DEFERRED (kept open). Implemented the dispatch-guard layer (draft server#176): a back-edge detector (cycle + recency) re-enters a completed loop head, so the loop no longer hangs (608 lib tests + 5 new pass). But kind validation surfaced a second blocker — set: ctx.X loop variables are recomputed per orchestrator pass and revert to the workload default when the producing step is re-dispatched (a minimal counter-loop thrashes 0,0,1,0,1,2,…). Full multi-page pagination needs durable event-sourced ctx propagation across iterations — larger than is safe to land well-tested in one session; held as a draft, not merged. Standing direction honored: Claude wrote all Rust directly (no Codex).
  • 2026-06-10 #80 closed — container_callback chain green end to end. Fixing the watcher's missing curl (the literal #80 goal) surfaced two more layered bugs beneath it. Watcher image (ops#168): the manifest used the retired bitnami/kubectl:1.30.3 (removed from Docker Hub; the live cluster was patched to the bitnamilegacy archive) with a runtime apt/apk install step that never put curl on PATH → callback POST returned HTTP 000. Switched to alpine/k8s:1.30.3 (kubectl + jq + curl baked in), dropped the install hack. Server insert (server#173, v3.0.3): once curl worked the POST reached the server and 500'd — the container-callback handler inserted call.done via a stale query targeting an attempt column that doesn't exist on the deployed noetl.event; fixed to the working handlers::events column set. OOM path: the watcher only read Job-level conditions so failed_oom could never fire — added pod-level OOMKilled → failed_oom classification (ops#168); the completed_at fallback for failed Jobs used bare jq now (numeric epoch → HTTP 422), fixed to RFC3339 now | todate; and the e2e fixture's bytes(40MiB) was calloc-lazy (mapped to the zero page, never faulted in) so the container exited 0 — switched to a written-into bytearray that dirties pages and reliably OOM-kills (e2e#38). Verified the kind cluster actually enforces memory limits (120 MiB in a 32Mi pod → OOMKilled exit 137). Rebuilt the server image + reloaded into kind; kind_validate_container_callback.sh both probes GREEN — happy_path → succeeded (delta 1), oom → failed_oom (delta 1). This is the last blocker on the #43 container-callback chain. ai-meta → ops cacc513 + server 5d2cf58 (v3.0.3) + e2e 6aaf06e.
  • 2026-06-10#79 closed — e2e kind-val runners back on the current noetl CLI surface. Both scripts/kind_validate_*.sh runners aborted immediately on error: unrecognized subcommand 'playbook' — they targeted the retired noetl playbook register/execute + noetl execution status/events verbs. The validation logic and the event taxonomy (step.enter / command.completed / node_name / the fan-in barrier) were intact; only the invocation layer had drifted. Fix (e2e#37): noetl register playbook --file, noetl exec <catalog-path> --runtime distributed --json (exec by metadata.path, not the bare name), noetl status <id> --json, and the event log over noetl query (no events verb today — rows wrap under .result, order by event_id since noetl.event has no timestamp column). Added a fail-fast CLI-surface guard to each runner. Validated on kind (server-rust v3.0.1 + worker-rust, :8082): fanout_reduce PASS start-to-finish with no manual workaround; container_callback drives register→exec→COMPLETED cleanly and stops at the metric-delta assertion because the deployed noetl-k8s-watcher image lacks curl (watcher.sh: curl: not found → HTTP 000) — a cluster-side watcher gap tracked on #80. Version-skew note: PATH binary is noetl 2.17.0, repos/cli submodule is v4.10.0; the targeted surface is identical across both, so the runners work on either (the binary lags the submodule by a major line — worth refreshing for parity, not required here). Pointer: e2e → a3594b3; e2e wiki: new Kind-Val Runners page.
  • 2026-06-10#82 closed — GUI credential View/Edit recovered for pre-wallet records. The Secrets Wallet (#61) moved credential storage to forward-only envelope encryption; pre-wallet records now 500 on GET /api/credentials/{id}?include_data=true (Decryption failed: aead::Error), so the GUI View/Edit flow dead-ended on a generic toast (response shape unchanged). Fix (gui#36): View surfaces the real reason + points to Edit; Edit still opens with the list-row metadata (name/type/description/tags) + a warning banner and an empty-but-required data field, so re-entering the secret and saving re-seals the record under the current wallet — recovering it. Validated live against kind + the dev:kind UI on :3001. Also landed e2e#36 (duplicate workload probe-flag keys removed from tooling_non_blocking) and gui#35 (dev:kind convenience script). Pointers: gui → 8cacc9e (v1.11.1), e2e → 4a9ffbc.
  • 2026-06-10#81 closed — noetl-server v3.0.2 fixes the container-tool command type contradiction. ToolSpec.command was Option<String> (scalar) but the container tool kind writes a K8s-Job-style array — an array failed the server's ToolDefinition untagged-enum match (400), a scalar was rejected by the worker's ContainerConfig.command: Option<Vec<String>>. Typed command as Option<serde_json::Value> (same as args); ToolCall::from_spec forwards it verbatim. 2 regression tests; clippy clean (server#172, v3.0.2). Kind-val GREEN end-to-end: server accepts the array command, worker creates the K8s Job, Job reaches Complete 1/1. Server pointer bumped (ai-meta → server bd36672). Chain counter-bump validation stays gated on #79 (runner CLI) / #43.
  • 2026-06-09E2E sweep cleanup — noetl-tools v3.1.0 + noetl-server v3.0.1. Stripped the diagnostic tracing::debug! scaffolding added during the e2e triage, kept the production fixes: YAML when: true boolean + |tojson object-template fallback (tools#47), 64 MB result-store body limit + pipeline command/spec stash (server#171). Pointers bumped (ai-meta@316048c tools, @6590bd6 server); tracks #49. All 7 sweep playbooks PASS on Rust-only kind. Worker crates.io dep-revert deferred — v3.1.0 not yet on crates.io ([skip ci] release commit).
  • 2026-06-08 noetl-tools v2.24.2 clippy cleanup + noetl/server#22 closed. Cleared the clippy -D warnings CI gate on noetl-tools (15 warnings across 7 files; all mechanical lint fixes). Closed stale noetl/server#22 (Phase D orchestrator engine port — complete). noetl/server PR #167 (same clippy shape) opened, awaiting merge.
  • 2026-06-05Rust-only regression rig — canonical v10 SQL + http config shapes. Swept ~30 self-contained e2e fixtures against the Rust-only kind stack and fixed three config-shape classes in noetl-tools: postgres command: alias + multi-statement SQL (tools#24, v2.18.3), a task_sequence→duckdb regression test (tools#25), and the duckdb command: alias + http params/headers/form non-string coercion (tools#26, v2.18.4). Worker adopted both (worker#50, worker#51). Newly GREEN: duckdb_test, json_serialization_save, duckdb_retry_query, pagination/{offset,cursor,max_iterations,pipeline}, retry_simple_config. Recovered the cluster first (server had latched into NATS not configured after a podman restart). Server-side follow-up noted: loop_with_pagination renders {{ execution_id }} empty in a multi-statement postgres command.
  • 2026-06-05 postgres-tool observability — real SQLSTATE errors. noetl-tools 2.18.2 (tools#21) + worker dep bump (worker#49): the postgres tool surfaces the real SQLSTATE + message instead of the opaque db error. Validated end-to-end — a bad query reports ERROR: relation "..." does not exist (SQLSTATE 42P01) in the call.error event. Closes the last follow-up from the credential/iterator saga.
  • 2026-06-05 iterator_save_test GREEN — full v10 + credential + iterator-pipeline surface validated. server#73 (v2.19.7) defers task_sequence _prev/_results refs at command-build so nested-pipeline templates render at runtime. iterator_save_test reaches playbook.completed and writes 3 rows to the real demo_noetl DB — the deepest v10 path (iterator → pipeline → _prev chaining → nested credential → postgres write). Closes the credential + iterator + pipeline chain (server#71, worker#46, worker#48, server#73).
  • 2026-06-05 Nested-pipeline credentials + template-timing finding. worker#48 (v5.11.3) — the worker now pre-resolves keychain aliases on task_sequence SUB-tasks; iterator_save_test's nested save_item postgres step connects to demo_noetl. Closes the credential-path chain (store → alias-key → nested resolution, all validated). Last iterator_save_test blocker found + filed: server#72 — the server pre-renders task_sequence {{ _prev.* }} refs (runtime-only) to empty → malformed SQL (a symptom the v2.19.5 Chainable change surfaced).
  • 2026-06-05 Keychain-credential path validated on Rust-only. Continuing R5 Tier 4, registered the pg_k8s postgres credential and probed the DB-backed fixtures. Surfaced + fixed a 3-bug chain in the keychain subsystem: credential store bound AES-GCM Vec<u8> to a TEXT column (server#71, v2.19.6); alias resolution read only the auth: key not v10's credential: (worker#46, v5.11.2). Proven: iterator_save_test's create_table connects + runs DDL against the real demo_noetl DB. Third bug — nested-pipeline credentials (task_sequence sub-tasks bypass worker resolution) — filed as worker#47 for a follow-up round. Session details
  • 2026-06-05 v10 control-flow runs end-to-end on Rust-only. Phase F R5 Tier 4 re-probe found + fixed 7 more bugs across the Rust stack (server v2.19.5 server#69 6 commits, worker v5.11.1 worker#44, tools v2.18.1). Four v10 fixtures now reach playbook.completed: start_with_action, end_with_action, loop_test, control_flow_workbook; actions_test correct-fails on a missing TEST_SECRET env. Root-cause chain: catalog SQL type drift → ToolSpec null-serialization → worker array-config drop → orchestrator end-step skip + task_sequence label-wrap → minijinja Lenient-vs-Chainable undefined → end-step trigger gate. Also: rust-analyzer workspace setup + rule (ai-meta@38287b7). Session details
  • 2026-06-04 (late evening) Rust-only e2e complete + legacy cleanup. Six interlocking server gaps closed in one iteration (#55–#60) + two noetl-tools fixes (#15, #16); worker dep bump; kind cluster legacy Python deployments retired. control_flow_workbook runs fully end-to-end on the Rust-only stack. Standing direction pinned: Rust-only focus, ignore Python tasks. Session details
  • 2026-06-04 (afternoon) Pipeline + failure termination + workbook resolution. Three server PRs landed together as v2.19.3 (#61, #63, #65).
  • 2026-06-04 (morning) EE-5 lax decode + workload + input alias. v2.19.1 + v2.19.2 — unblocked Rust worker → Rust server emission + canonical v10 playbook compatibility.
  • 2026-06-04 (early morning) Phase F R4-5 + R4 complete. N=2 shard kind validation script + ExecutionService refactor.
  • 2026-06-03 Phase F R4 series — DbPoolMap N+1 pool layer, AppState wiring, per-execution handler cutover, cluster-wide list fan-out.
  • 2026-06-02 (afternoon) Architecture pivot: rest of migration moves to system playbooks. Closed #30, #45; promoted #46.
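
The multi-statement postgres command: support noted in the 2026-06-05 regression-rig entry depends on splitting a script into individual statements, and the release log records that an early splitter shredded plpgsql $$ … $$ blocks. The sketch below shows one way a dollar-quote-aware split could work, in std-only Rust. It is an illustration under stated assumptions, not the noetl-tools code: split_statements, opening_dollar_tag, and flush are hypothetical names, and SQL comments and double-quoted identifiers are deliberately not handled.

```rust
/// Return the dollar-quote opener (`$$` or `$tag$`) at the start of `s`, if any.
/// Hypothetical helper; `$1`-style parameters are rejected because a tag
/// cannot start with a digit.
fn opening_dollar_tag(s: &str) -> Option<String> {
    let body = s.strip_prefix('$')?;
    let end = body.find('$')?;
    let name = &body[..end];
    let ident_like = name.chars().all(|c| c.is_alphanumeric() || c == '_')
        && !name.starts_with(|c: char| c.is_ascii_digit());
    if ident_like {
        Some(format!("${}$", name))
    } else {
        None
    }
}

/// Push the trimmed statement (if non-empty) and reset the buffer.
fn flush(current: &mut String, out: &mut Vec<String>) {
    let stmt = current.trim();
    if !stmt.is_empty() {
        out.push(stmt.to_string());
    }
    current.clear();
}

/// Split on `;` only when outside single-quoted strings and dollar-quoted bodies.
fn split_statements(sql: &str) -> Vec<String> {
    let mut out = Vec::new();
    let mut current = String::new();
    let mut in_single = false; // inside '...'
    let mut dollar_tag: Option<String> = None; // Some(tag) while inside a dollar-quoted body
    let mut i = 0;

    while i < sql.len() {
        let rest = &sql[i..];
        let ch = match rest.chars().next() {
            Some(c) => c,
            None => break,
        };
        if let Some(tag) = dollar_tag.take() {
            if rest.starts_with(tag.as_str()) {
                // Matching closer: copy it and leave the dollar-quoted body.
                current.push_str(&tag);
                i += tag.len();
            } else {
                current.push(ch);
                i += ch.len_utf8();
                dollar_tag = Some(tag); // still inside the body
            }
            continue;
        }
        if in_single {
            current.push(ch);
            in_single = ch != '\''; // a doubled '' simply re-enters on the next quote
            i += ch.len_utf8();
            continue;
        }
        match ch {
            '\'' => {
                in_single = true;
                current.push(ch);
                i += 1;
            }
            '$' => match opening_dollar_tag(rest) {
                Some(tag) => {
                    current.push_str(&tag);
                    i += tag.len();
                    dollar_tag = Some(tag);
                }
                None => {
                    current.push(ch);
                    i += 1;
                }
            },
            ';' => {
                flush(&mut current, &mut out);
                i += 1;
            }
            _ => {
                current.push(ch);
                i += ch.len_utf8();
            }
        }
    }
    flush(&mut current, &mut out);
    out
}

fn main() {
    let script = "CREATE TABLE t (id int); \
                  CREATE FUNCTION f() RETURNS int AS $$ BEGIN RETURN 1; END; $$ LANGUAGE plpgsql; \
                  INSERT INTO t VALUES (1)";
    for (n, stmt) in split_statements(script).iter().enumerate() {
        println!("{}: {}", n + 1, stmt);
    }
}
```

A naive split on every `;` would cut the function body above into fragments; tracking the open tag keeps it as one statement.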

Releases

See Releases for the per-repo release log with links to GitHub Releases pages.

Recent (2026-06):

  • 2026-06-20 noetl/server v3.39.0 — ✅ #116 program-scale step 2: execution-affinity single-owner WRITE ORDERING (multi-replica gate-ON validated) — closes the chain-fork race step 1 left open. Affinity routes every trigger for an execution to the replica that ShardConfig::owns it (non-owner forwards a reverse-proxy POST); owner's single-process drive lock + in-memory ChainHeads make the read→advance atomic, no distributed lock; KV = genesis/handoff vehicle (kv_remote_hit→0). src/affinity.rs; NOETL_EXECUTION_AFFINITY/NOETL_PEER_URL_TEMPLATE/NOETL_SHARD_INDEX_FROM_HOSTNAME (all default off); metric noetl_execution_affinity_total{outcome} (server#252 5e00d0a). 2-replica gate-ON kind PASS: chains roots=1/dangling=0/walk==total (NO fork), forwarded_ok +9, never-scan + sole-writer across replicas. Follow-up #117 (off-server from_events spine event_id-order wedges fan-in under inversion).
  • 2026-06-20 noetl/server v3.38.0 — ✅ #115 program-scale step 1: multi-replica coherence DATA LAYER — NOETL_REPLICA_COHERENCE=nats_kv (default local, prod unchanged) backs ChainHeads + ExecDescriptor with JetStream KV buckets (head CAS + descriptor CAS merge); src/coherence.rs; metric noetl_replica_coherence_total{structure,op,outcome} (server#251 8f39a79). Kind: single-replica parity with local; 2-replica cross-replica resolves proven. Necessary-not-sufficient → execution-affinity STAGED (2+ replicas still fork the chain).
  • 2026-06-20 noetl/server v3.37.0 — ✅ #115 Phase 5 atomic-working-item context (tenet 6): input_binding + NOETL_ATOMIC_ITEM_CONTEXT (default off) (server#250 a96ade8)
  • 2026-06-20 noetl/worker v5.40.0 — ✅ #115 Phase 5 forward the atomic-item-context flag onto the off-server from_events drive (worker#121 2484d17)
  • 2026-06-20 noetl/server v3.36.0 — ✅ #115 Phase 6 retire the hot-path noetl.event read class; the table is AUDIT-ONLY (server#249, b71ca1d): NOETL_EVENT_READ_PATH=event_scan|audit_only (default event_scan, prod unchanged). Under audit_only the remaining lifecycle readers (get_catalog_id, inherit_parent_trace, dedup-audit + container-callback catalog/existence) serve from the in-memory ExecDescriptor; cold → noetl.command (synchronous queue) — never a noetl.event scan. Proof metric noetl_event_hotpath_reads_total{site,outcome}. Gate-ON kind-validated: hot-path scan Δ0 + drive state_build_total/event_scans Δ0 ⇒ ZERO noetl.event scans on the hot path, end-to-end; audit/replay still work; 585 tests + clippy green. RFC never-scan end state (tenet 3) reached under the flag.
  • 2026-06-19 noetl/worker v5.37.0 — ✅ #115 Phase 4 off-server state-builder kernel + WAL shadow loop (worker#118, fef961c): pool-side per-execution chain index sourced from the noetl_events WAL; chain_walk() head→root spine in event_id order (parity by construction); cache keyed by the immutable chain head (CacheHit / Incremental tail-advance / ColdRebuild). Live WAL shadow loop (NOETL_STATE_BUILDER_SHADOW, default off) + metrics. Gate-ON kind-validated (993 WAL events, 0 noetl.event scans, 28 cold + 21 incremental); 8 unit tests + clippy green; default off. Drive cutover staged.
  • 2026-06-19 noetl/server v3.33.0 — ✅ #115 Phase 4 NOETL_STATE_BUILDER=offserver|server flag scaffold (server#246, 3e6006d, default server): the server-side flag for the off-server state-builder drive cutover (staged). 2 config tests; no prod default changed.
  • 2026-06-19 noetl/server v3.32.0 — ✅ #115 Phase 3 chain-walk state builder (flagged, default-off) (server#245, ai-meta pointer bumped): behind NOETL_STATE_BUILD_MODE=chain_walk the drive reconstructs WorkflowState by following the one-level prev_event_id chain head→root (in-memory ChainHeads head + (execution_id,event_id) PK lookups — never a WHERE execution_id scan) → same from_events (parity by construction); event-scan kept as the default + fallback (cold-head / lag / non-genesis). NOETL_STATE_BUILD_PARITY_CHECK shadow-builds both ways in one REPEATABLE READ snapshot. New metrics noetl_state_build_total{mode,outcome} / _event_scans_total (no-scan proof) / _chain_hops / _parity_total{result}. Gate-ON kind-validated: parity 41/41 MATCH, scans=0 / 1064 hops / 0 fallbacks, all topologies COMPLETE, sole-writer + lag-0, 577 tests + clippy green. No prod default changed.
  • 2026-06-19 noetl/server v3.31.0 — ✅ #115 Phase 2 one-level prev_event_id event chain (server#244, ai-meta afdb365): each noetl.event/noetl.command carries the chain link, stamped at the emit chokepoint from a per-execution chain-head watermark (ChainHeads), covering both gate paths + the materializer. Additive; nothing reads it yet (Phase 3). Kind-proven walkable/1-root/no-gap/no-scan across 6 gate-ON topologies; 573 tests + clippy green. Companion DDL noetl/noetl ecd16a2 (noetl#667). No prod default changed.
  • 2026-06-19 noetl/server v3.30.0 — ✅ #115 Phase 1 surface _ref/_store on kept refs + refs_in_state default true (server#243): consume-side accessors for {{ step._ref }} lazy-load + storage-tier predicates; references stay out of state/commands by default (worker selective-resolve landed in worker#117 v5.36.0). Closed #113 + #114; kind gate-ON all 9 stalls COMPLETE.
  • 2026-06-19 noetl/server v3.29.4 — 🐞 #113 off-server drive: recover offloaded drive result + stop drive on cancel (server#241, rig e2e#63). apply_worker_orchestration resolves+decodes an offloaded __orchestrate__ result (over the 100KB inline budget → durable reference.ref) instead of dropping it → non-convergent re-loop (metric ref_resolved); cancel now matches underscore playbook_cancelled + a terminal guard evicts the orch-cache (no restart). Kind gate-ON proven; 5/9 #113 fixtures COMPLETE, the other 4 hit a distinct oversized-command.issued stall → #114. No prod default flipped.
  • 2026-06-19 noetl/server v3.29.5 — 🐞 #114 offload oversized command context: a command.issued context over NOETL_COMMAND_CONTEXT_MAX_BYTES (512KB) is offloaded to noetl.result_store with a {__context_ref__} marker, resolved in get_command/claim_command (server#242, rig e2e#64) — published events stay under NATS max_payload, no publish-wall wedge. Kind gate-ON: rig PASS, all command.issued <1MB, 6 of #113's 9 fixtures COMPLETE; remaining 3 + cutover gated on the refs_in_state consume side (#101). No prod default changed.
  • 2026-06-19 noetl/server v3.29.4 — 🐞 #113 recover offloaded drive result + stop drive on cancel: apply_worker_orchestration resolves+decodes an over-budget __orchestrate__ drive result via result_store.resolve (metric ref_resolved) instead of dropping it → non-convergent re-loop; cancel matches underscore playbook_cancelled + a terminal guard evicts the orch-cache (server#241, rig e2e#63). No prod default changed.
  • 2026-06-19 noetl/server v3.29.3 — 🎯 #103 cutover COMPLETE, FLIP-READY: the 2 ExecutionService cancel/finalize writers route through the emit_event chokepoint (server#240, e2e rig e2e#62) — the last synchronous server noetl.event writer under the gate is closed. Kind-proven both modes (gate-off byte-identical INSERT; gate-on PUBLISHED + materializer sole writer + terminal state + 0 loss/dup). All three flip blockers closed → PUBLISH_ONLY flip is a staged operator decision. Default-off; no prod default changed.
  • 2026-06-19 noetl/server v3.29.2 — off-server-drive × gate crash-recovery: cold-cache apply rebuilds WorkflowState from the durable log instead of dropping the in-flight drive result (server#238, refs #104/#103). Unblocks the PUBLISH_ONLY flip (off-server drive × gate now kind-proven). Confined to the cold branch; no prod default changed.
  • 2026-06-19 noetl/tools v3.13.0 + noetl/worker v5.34.0 — #103 ack-after-materialize durability: deferred ack-after-processing capability (tools#71: AckMode::Defer + $JS.ACK durable handles + ack/nack/term) + in-process CQRS materializer consume-loop (worker#115: drain→project→ack-only-on-success, redeliver on failure) + system-pool wiring (ops#194). Kind fault-injection: gate-on sole-writer loss=0 across a mid-drain failure. Default-off.
  • 2026-06-18 noetl/server v3.28.0 — worker-driven orchestrator drive now default ON (NOETL_ORCHESTRATE_PLUGIN_DRIVE defaults true) (server#233, closes #108; scale-soak-gated, revert = =false).
  • 2026-06-18 noetl/worker v5.33.0 — pool-affinity decline (drive isolated on the system pool) (worker#114, refs #108 (b)).
  • 2026-06-14 noetl/worker v5.22.0 — transfer endpoint credential-alias resolution, both Snowflake↔Postgres directions (worker#87, closes #99).
  • 2026-06-14 noetl/tools v3.10.0 — Snowflake↔Postgres transfer arms + flatten credential config (tools#65, closes #99).
  • 2026-06-14 noetl/tools v3.9.2 — Snowflake SQL-API context in request body + multi-statement split (tools#64, refs #98).
  • 2026-06-14 noetl/tools v3.9.1 — set User-Agent on the Snowflake HTTP client (tools#63).
  • 2026-06-14 noetl/tools v3.9.0 — Snowflake key-pair JWT authentication (tools#62; kind-validated on live sf_test account).
  • 2026-06-12 noetl/server v3.5.0 POST /api/execute/batch + opt-in exactly-once dedup window (server#189, RFC #90 Phase 7 — scale hardening).
  • 2026-06-12 noetl/worker v5.19.0 — batch dispatch + dedup opt-in + per-subscription rate limits (worker#79, RFC #90 Phase 7 — scale hardening, closes #90).
  • 2026-06-12 noetl/cli v4.11.0 noetl subscribe, local-mode subscription listener (cli#60, closes cli#59, RFC #90 Phase 6). Standalone kind: Subscription listener + FileEventSink JSONL trail + local_disk store-and-forward spool; cli-only (reuses noetl-tools v3.5.0 source+spool). ai-meta → cli 2fb3fb0.
  • 2026-06-11 noetl/server v3.0.6 — round-trip JSON null in whole-object {{ step }} references (server#177, closes noetl/ai-meta#89). A null field in a {{ step }} envelope rendered as the JS token undefined (invalid JSON), so the consuming step received an unparseable str; render_to_value now retries with | tojson (undefined/none → JSON null) — the server renderer had diverged from the noetl-tools engine that already did this. Kind-validated: cursor pagination collects all 35 events through the terminal next_cursor: null page. ai-meta pointer → server 8e17fbe.
  • 2026-06-10 noetl/tools v3.1.1 — multi-tool sibling references (tools#48, closes noetl/ai-meta#87). TaskSequenceTool now injects each sub-tool's result under its label so a later sub-tool resolves {{ <label>.<field> }} (was rendering empty — a syntax error at or near "," in unquoted numeric SQL positions). Worker adopts via worker#69. Kind-validated (save_edge_cases test_large_payload: record_count = 100). ai-meta pointer → tools 76f942a + worker b97f642.
  • 2026-06-10 noetl/server v3.0.3 — container-callback insert matches the deployed noetl.event schema (server#173, tracks noetl/ai-meta#43). The handler's call.done insert targeted a non-existent attempt column → HTTP 500 on every watcher callback; replaced with an inline INSERT matching the working ingestion path. Unblocked the container-callback chain (kind-val GREEN both probes). ai-meta pointer → 5d2cf58.
  • 2026-06-10 noetl/gui v1.11.0 + v1.11.1 — credential View/Edit recovery for pre-wallet records (gui#36, closes noetl/ai-meta#82) + dev:kind convenience script (gui#35). ai-meta pointer → 8cacc9e.
  • 2026-06-10 noetl/server v3.0.2 — container-tool command type contradiction fix (server#172, closes noetl/ai-meta#81). ToolSpec.command Option<String> → Option<serde_json::Value>: the container tool's array command now decodes server-side + passes through to the worker's Vec<String>; scalars stay JSON strings for shell/db tools. Kind-val GREEN (K8s Job reaches Complete 1/1).
  • 2026-06-08 noetl/tools v2.24.2 — clippy cleanup: 15 warnings resolved across 7 files (tools#44, closes tools#42). Mechanical lint fixes, zero behavioral changes.
  • 2026-06-05 noetl/tools v2.18.4 — duckdb command: alias (parity with postgres) + http params/headers/form non-string coercion (tools#26); worker adopts it (worker#51). Unblocks the pagination + http + duckdb-command fixtures.
  • 2026-06-05 noetl/tools v2.18.3 — postgres command: alias + multi-statement SQL on postgres + duckdb (tools#24, closes tools#23); worker adopts it (worker#50). duckdb_test + json_serialization_save GREEN.
  • 2026-06-05 noetl/tools v2.18.2 — postgres tool surfaces the real SQLSTATE + message instead of db error (tools#21); worker bumped to it (worker#49).
  • 2026-06-05 noetl/server v2.22.0 Secrets Wallet Phase 2: GCP Cloud KMS KeyManager (Cloud KMS :encrypt/:decrypt + Workload Identity); runtime NOETL_KMS_PROVIDER (local/gcp-kms); KEK can leave the process (server#81, tracks #61). Kind-validated on local.
  • 2026-06-05 noetl/server v2.21.0 Secrets Wallet Phase 1c/1d: credentials + keychain store envelope-encrypted (per-record DEK wrapped by the KEK); self-describing {"v":1,…} blob, forward-only (server#79, tracks #61). Kind-validated end-to-end.
  • 2026-06-05 noetl/server v2.20.0 Secrets Wallet Phase 1b: envelope-encryption core — KeyManager/LocalDevKms/EnvelopeCipher (server#77).
  • 2026-06-05 noetl/server v2.19.8 Secrets Wallet Phase 1a: remove the all-zeros default encryption key, fail closed (server#75, tracks #61). Kind-validated.
  • 2026-06-05 noetl/tools v2.18.5 — dollar-quote-aware statement splitter; the 2.18.3 splitter shredded plpgsql $$ … $$ blocks (tools#27).
  • 2026-06-05 noetl/server v2.19.7 — defer task_sequence _prev/_results refs at command-build (server#73); nested-pipeline templates render at runtime → iterator_save_test GREEN.
  • 2026-06-05 noetl/worker v5.11.3 — resolve keychain aliases on task_sequence sub-tasks (worker#48); nested postgres-in-pipeline steps connect.
  • 2026-06-05 noetl/server v2.19.6 — credential store base64-armors the AES-GCM blob for the TEXT data_encrypted column (server#71); keychain creds register + round-trip.
  • 2026-06-05 noetl/worker v5.11.2 — resolves keychain alias under the v10 credential: key (worker#46).
  • 2026-06-05 noetl/server v2.19.5 — v10 control-flow end-to-end (server#69, 6 commits): catalog INT4 + catalog_id alias, ToolSpec skip-null, orchestrator end-step-with-action + task_sequence flatten + intra-pass dedup, template Chainable-undefined, end-step trigger gate.
  • 2026-06-05 noetl/worker v5.11.1 — preserve array tool_config for task_sequence (worker#44).
  • 2026-06-05 noetl/tools v2.18.1 — task_sequence parse_tasks accepts worker-envelope shape.
  • 2026-06-04 noetl/server v2.19.4 — orchestrator template context: step data at top level + call.done capture.
  • 2026-06-04 noetl/tools v2.18.0 — TaskSequenceTool.
  • 2026-06-04 noetl/tools v2.17.1 — PythonTool result-global capture.
  • 2026-06-04 noetl/server v2.19.3 — three fixes shipped together: pipeline flat shape decode (#61), failure termination (#63), workbook resolution (#65).
  • 2026-06-04 noetl/server v2.19.2 — v10 workload + input alias (#59).
  • 2026-06-04 noetl/server v2.19.1 — EE-5 lax decode for integer execution_id (#57).
  • 2026-06-04 noetl/server v2.19.0 — Phase F R4-4b (ExecutionService refactor + cluster-wide list fan-out).
  • 2026-06-04 noetl/server v2.13.0 → v2.19.0 — Phase F R4 series (DbPoolMap N+1 pool layer through R4-5 kind validation).
  • 2026-06-04 noetl/gateway v3.2.0 — Phase F R3b-2 shard-info twin endpoint.
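
Several entries above report chain health as roots=1/dangling=0/walk==total for the one-level prev_event_id chain. The sketch below shows how such a check could be computed over an in-memory set of events, in std-only Rust. The definitions are assumptions made for illustration (a root is an event with no prev_event_id, a dangling event points at a predecessor missing from the set, and walk counts events reached from the head), and Event, ChainReport, and check_chain are hypothetical names, not the rig's actual code.

```rust
use std::collections::{HashMap, HashSet};

/// Minimal stand-in for one event row; the field names follow the notes,
/// the struct itself is hypothetical.
struct Event {
    event_id: u64,
    prev_event_id: Option<u64>,
}

#[derive(Debug, PartialEq)]
struct ChainReport {
    roots: usize,    // events with no predecessor
    dangling: usize, // events whose predecessor is not in the set
    walk: usize,     // events reached walking head -> root
    total: usize,    // all events in the set
}

fn check_chain(events: &[Event], head: u64) -> ChainReport {
    // Index by event_id so every hop is a keyed lookup, never a scan.
    let by_id: HashMap<u64, Option<u64>> =
        events.iter().map(|e| (e.event_id, e.prev_event_id)).collect();

    let roots = events.iter().filter(|e| e.prev_event_id.is_none()).count();
    let dangling = events
        .iter()
        .filter(|e| matches!(e.prev_event_id, Some(p) if !by_id.contains_key(&p)))
        .count();

    // Follow prev_event_id one hop at a time from the head.
    let mut seen = HashSet::new();
    let mut cursor = Some(head);
    while let Some(id) = cursor {
        // Stop on a missing event or on a repeat (cycle guard).
        let Some(prev) = by_id.get(&id) else { break };
        if !seen.insert(id) {
            break;
        }
        cursor = *prev;
    }

    ChainReport { roots, dangling, walk: seen.len(), total: events.len() }
}

fn main() {
    // A clean single-root chain: 1 <- 2 <- 3, head = 3.
    let clean = vec![
        Event { event_id: 1, prev_event_id: None },
        Event { event_id: 2, prev_event_id: Some(1) },
        Event { event_id: 3, prev_event_id: Some(2) },
    ];
    let report = check_chain(&clean, 3);
    println!("{:?}", report);
    assert!(report.roots == 1 && report.dangling == 0 && report.walk == report.total);
}
```

Under these definitions a forked chain shows up as roots greater than 1 with walk smaller than total, which is the shape of the two-root orphan case described in the cutover notes.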

Conventions

How agents (Claude / Codex / Cursor) operate across this ecosystem — pointers into the rule files in agents/rules/:

How to use this dashboard

  • Just landed in this codebase? Read Repo Map, then Execution Model, then the umbrella for whatever you're working on.
  • Picking up an in-flight task? Find the matching umbrella page above; it has the full state of the work + the next concrete step.
  • Need to file new work? Follow the issue tracking convention — open the ai-task issue on noetl/ai-meta, then add the umbrella to the table above and create the corresponding wiki page.
  • Maintenance pass? Refresh this Home + the Sessions Log + the Releases page + the matching Umbrella-*.md page when you bump a submodule pointer. All four pages drift together — see Rule 0a's checklist.
