Home
Last refreshed: 2026-06-20 (Claude session — ✅ PROD CQRS rollout RECORDED (ops#200 → ops@d6633f6, ai-meta 08c73e5) + LIVE-PROD e2e validation of the gate-ON cutover. Ran the Rust regression + specialized playbooks against LIVE PROD (gke …noetl-cluster, server v3.39.1 / worker v5.40.2, PUBLISH_ONLY=true + STATE_BUILDER=offserver, materializer sole writer): 28/30 executions PASS (24/26 distinct fixtures + 4 composition-spawned children) — python/args/vars/loops/control-flow/output-select/large-result/actions/fan-out-reduce/duckdb/http/save-to-postgres/sub-playbook composition; every gate-ON execution COMPLETED with sole-writer (rows==distinct, 0 catalog_id=0, 0 __orchestrate__ event rows), clean chain (roots=1/terminals=1/dangling=0/walk==total), never-scan (worker state_builder_event_scans Δ0 / cumulative 0), materializer lag 0. The 2 FAILs (postgres_test pg_noetl_k8s / postgres_jsonb pg_local) are a prod-env credential-unreachable difference, NOT a cutover bug — both still produced a clean playbook.failed terminal + single-root chain (failure path is gate-ON-correct). No platform bug → no issue filed. Prod left healthy (health ok, lag 0, 0 pod restarts ~45 min). Footprint (uncleanable): tenant prefix prod-e2e-20260620-1946 — 26 catalog / 30 execs / 947 noetl.event rows. Refs #103/#107/#111. — prior: 🚀 PROD CQRS CUTOVER EXECUTED + gate-ON validated — server v3.39.1 / worker v5.40.2 rolled to prod GKE (gke_noetl-demo-19700101, ns noetl) and NOETL_EVENT_INGEST_PUBLISH_ONLY + NOETL_STATE_BUILDER=offserver flipped LIVE. Materializer is the sole noetl.event writer (event_ingest_published_total == materializer_acked_total); the off-server drive builds state from the noetl_events WAL with zero event-scans; 5 tenant validation execs all COMPLETED, chains roots=1/terminals=1/dangling=0; backlog 0, 0 restarts. 
A one-time owner-applied prev_event_id migration unblocked the write path — the runtime noetl role isn't the table owner, so the columns were provisioned as postgres via the live pgbouncer-embedded owner credential (no secret rotated/printed). GSM pg_noetl_k8s is stale → operator should rotate/realign it. The event-chain DDL skipped WARN persists by design (ownership checked before the IF-NOT-EXISTS skip). ops #200 (digest bumps + executed-rollout record); ai-meta pointer bump follows merge; one-command revert on standby. Closes the last #103 operator gap; puts #107/#115/#111 off-server into prod. — Earlier same day: ✅ #118 + #119 SHIPPED + gate-ON kind-validated + CLOSED — single- AND multi-replica off-server are now blemish-free (single-root incl. finalize, zero fallback) AND restart-robust (the WAL index rehydrates). noetl-server v3.39.1 (server#253, c5f8cb2) + noetl-worker v5.40.2 (worker#123, 48b0bde) + e2e (e2e#73, fe97d92). #118 — root cause corrected: the terminal event does go through ChainHeads.link_batch — the defect is a duplicate finalize. Under offserver+PUBLISH_ONLY single-replica, the first drive emits the terminal (chain-linked) + evicts the chain head; a straggler drive then rebuilds from the materializer-lagged WAL (state not terminal yet), drives again, and emits a SECOND playbook.completed that links to the now-None head → NULL prev_event_id orphan (2 roots) + a benign state-build event-scan. Fix = a bounded process-local FinalizedGuard (exactly-one-terminal-per-execution) suppressing the duplicate at emit_events before the chain linker (a suppressed duplicate never advances/consumes the head); gate-off byte-identical (a duplicate never occurs on the synchronous in-process drive); metric noetl_terminal_dedup_total{suppressed}. Rig gains a HARD terminals==1 per-exec assertion. 
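The #118 fix described above (suppress a duplicate terminal before the chain linker, so it never consumes the head) can be illustrated with a minimal Python sketch. This is not the server's Rust code: the class shape, the `capacity` bound, the `admit` method, and the exact set of terminal event types are assumptions made for illustration, and head eviction after the terminal is not modeled.

```python
from collections import OrderedDict

# Terminal event types mentioned on this page; the server's real set is an assumption.
TERMINAL_TYPES = {"playbook.completed", "playbook.failed", "playbook_cancelled"}


class FinalizedGuard:
    """Bounded, process-local record of executions that already emitted a terminal."""

    def __init__(self, capacity=1024):
        self.capacity = capacity
        self._seen = OrderedDict()  # execution_id -> first terminal event type
        self.suppressed = 0  # stand-in for noetl_terminal_dedup_total{suppressed}

    def admit(self, execution_id, event_type):
        """Return True if the event may proceed to the chain linker."""
        if event_type not in TERMINAL_TYPES:
            return True
        if execution_id in self._seen:
            self.suppressed += 1
            return False
        self._seen[execution_id] = event_type
        if len(self._seen) > self.capacity:
            self._seen.popitem(last=False)  # evict the oldest entry to stay bounded
        return True


def emit_events(guard, chain_heads, events):
    """Filter duplicate terminals first, then link survivors to the per-execution head."""
    linked = []
    for ev in events:
        if not guard.admit(ev["execution_id"], ev["event_type"]):
            continue  # a suppressed duplicate never reads or advances the head
        prev = chain_heads.get(ev["execution_id"])
        linked.append({**ev, "prev_event_id": prev})
        chain_heads[ev["execution_id"]] = ev["event_id"]
    return linked
```

Because the guard runs ahead of the linker, a straggler drive's second `playbook.completed` is dropped without producing a NULL `prev_event_id` orphan, which is the two-roots symptom the text describes.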
#119 (the blocker that hid the #118 symptom — off-server WAL-drain index stall on worker restart): the authoritative drain used a durable noetl_state_builder consumer whose cursor persists across restarts while the in-memory WalEventIndex rebuilds empty → the cursor outran the fresh index → build_spine_to(expected_head) permanently Incomplete → off-server execs looped offserver_retry and never completed (so the #118 fork could not even be reached). Fix (worker-only, inside NOETL_STATE_BUILDER=offserver; PROD runs the in-server drive so untouched): the drain now defaults to an ephemeral DeliverPolicy::All consumer that rebuilds the full index from the retained noetl_events WAL on every boot (no persisted cursor to outrun; also correct for >1 worker pod — each holds the complete event set, not a load-balanced subset); instant revert NOETL_STATE_BUILDER_DURABLE=1; proof = one-shot index rehydrated… log + new noetl_worker_state_builder_indexed_executions gauge; never reintroduces a noetl.event scan. Gate-ON kind-validated (server 118-finalize v3.39.1 + worker 119-rehydrate v5.40.2, offserver+publish_only+audit_only+plugin_drive): restart-rehydration proven — a forced mid-flight delete --force of the system-pool pod → the new pod logged index rehydrated … indexed_executions=17 wal_events=200 (pre-fix this was 0 → the stall); single-replica 6/6 stress iterations / ~126 execs every chain roots=1(incl. terminal)/dangling=0/walk==rows/terminals=1/orch_events=0, zero build-scan + zero hot-path-scan; multi-replica (2-replica affinity StatefulSet) 21 execs COMPLETE, roots=1/terminals=1 all, forwarded_ok +202, kv_remote_hit +12, zero scans; 224 worker + 597 server lib tests + clippy green; baseline restored; PROD/defaults untouched. — prior: ✅ #117 SHIPPED — off-server from_events spine ordered by prev_event_id chain + walked from the real tip (expected_head), the high-concurrency fan-out reduce wedge. 
Under concurrent fan-out two branch completions arrive at the owner id-inverted, so emit_events stamps a higher-id event as the predecessor of a lower-id one; the worker tracked the head as max(event_id) but ChainHeads.link_batch advances the watermark to the last-arrived event (the real tip), so a max-id walk MISSED the inverted tip and the fan-in reduce never fired. worker-only, inside NOETL_STATE_BUILDER=offserver (PROD untouched); byte-identical to the old sort for monotonic chains; NOETL_OFFSERVER_SPINE_ORDER=event_id reverts. noetl-worker v5.40.1 (worker#122, baeae78) + e2e (#72, cdf1768). 2-replica affinity gate-ON stress: 6/6 iterations, 108/108 execs COMPLETE, 15 execs with a real prev_event_id > event_id inversion all fired reduce_customer + completed; never-scan + sole-writer + roots=1 hold. Single-replica 7/8 (the 1 fail a separate pre-existing terminal-finalize race, non-wedging). — prior: ✅ #116 PROGRAM-SCALE STEP 2 SHIPPED + multi-replica gate-ON validated → ai-meta pointers bumped: execution-affinity single-owner WRITE ORDERING; the off-server stack is now multi-replica chain-COHERENT. Step 1 (KV data coherence, v3.38.0) was necessary-not-sufficient — the command.issued prev-read (handlers::execute) and head CAS-advance (emit_events) are two non-atomic steps, so concurrent cross-replica emits forked the chain. Affinity closes it by routing every trigger (POST /api/events, which also fires the drive) to the single replica that sharding::ShardConfig::owns(execution_id) owns (stable XxHash64); a non-owner forwards (reverse-proxy POST, one-hop loop guard, degrade-to-local). On the owner the single-process drive lock + in-memory ChainHeads make the read→advance atomic, no distributed lock; KV is the genesis/handoff vehicle (owner resolves LOCAL → kv_remote_hit→0 by design). Chose forwarding over a per-drive lease — solves the fork AND the double-drive with one mechanism, reusing src/sharding.rs. 
noetl-server v3.39.0 (server#252, 5e00d0a): src/affinity.rs (ExecutionAffinity + shard_index_from_hostname); flags NOETL_EXECUTION_AFFINITY / NOETL_PEER_URL_TEMPLATE / NOETL_SHARD_INDEX_FROM_HOSTNAME (all default off, prod unchanged); handle_event forwards + reconcile poller skips non-owned; metric noetl_execution_affinity_total{outcome} (forwarded_ok = the proof). e2e (e2e#71, 66b6e1b): 2-replica StatefulSet topology (manifests/replica-affinity/, distinct shard per pod via hostname ordinal + headless DNS) + deploy_replica_affinity_topology.sh + rig flipped HARD (forwarded_ok proof; kv_remote_hit informational under affinity). Multi-replica gate-ON kind-validated (2-replica StatefulSet; server offserver+audit_only+nats_kv+affinity+publish_only; worker offserver): NOETL_COHERENCE_DRIVE_AFFINITY=shipped PASS — linear/loop/fanout COMPLETE; every chain roots=1/dangling=0/walk==total (NO fork — exactly what forked without affinity); forwarded_ok +9; state_build_event_scans +0 + hotpath scan +0 (never-scan across replicas); sole-writer rows==distinct, __orchestrate__ event=0; degraded +0; single-replica unchanged. 595 server tests + clippy green; baseline restored. Known follow-up #117 (separate, pre-existing): the off-server from_events spine orders by event_id; under a chain-order≠id-order inversion (affinity's forwarding makes it likelier under high-concurrency fan-out) the fan-in reduce wedged on 1/9 execs in a 9-way concurrent run (chain stayed clean — NOT a fork; fix = order the spine by prev_event_id walk). Prod multi-replica verdict: write-ordering is COMPLETE (no fork) — prod can horizontally scale the off-server stack for linear/loop reliably; high-concurrency FAN-OUT needs #117 first. All affinity flags default off; PROD GKE untouched. 
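The execution-affinity routing described above (a stable hash picks the single owning replica, a non-owner forwards once, and a forwarded request is never forwarded again) might look roughly like the following Python sketch. Several parts are assumptions: `blake2b` stands in for the XxHash64 the server uses because it ships with the standard library, and the `route` function, its outcome labels, and the `{ordinal}` placeholder in the peer URL template are hypothetical.

```python
import hashlib
import re


def stable_hash64(execution_id):
    # Stand-in for the stable XxHash64 named in the text: blake2b truncated to 64 bits.
    digest = hashlib.blake2b(str(execution_id).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def owner_shard(execution_id, shard_count):
    """Map an execution to exactly one shard; every replica computes the same answer."""
    return stable_hash64(execution_id) % shard_count


def shard_index_from_hostname(hostname):
    """Parse a StatefulSet ordinal from the pod hostname, e.g. 'noetl-server-1' -> 1."""
    match = re.search(r"-(\d+)$", hostname)
    if match is None:
        raise ValueError(f"no ordinal suffix in hostname {hostname!r}")
    return int(match.group(1))


def route(execution_id, local_shard, shard_count, already_forwarded, peer_url_template):
    """Decide whether to handle a trigger locally or forward it to the owner."""
    owner = owner_shard(execution_id, shard_count)
    if owner == local_shard:
        return ("local", None)
    if already_forwarded:
        # One-hop loop guard: a request that was already forwarded is handled here.
        return ("loop_guard_local", None)
    return ("forward", peer_url_template.format(ordinal=owner))
```

With every trigger for an execution landing on one process, the read of the chain head and its advance happen under that process's own lock, which is why the text needs no distributed lock.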
— prior: ✅ #115 PROGRAM-SCALE STEP 1 SHIPPED — multi-replica coherence DATA LAYER (NATS-KV ChainHeads+ExecDescriptor, NOETL_REPLICA_COHERENCE=nats_kv, default local); server v3.38.0 8f39a79 + e2e e222877; single-replica bit-for-bit parity, 2-replica cross-replica resolves proven — but necessary-not-sufficient (chain forked without affinity → step 2 above). — prior: ✅ #115 PHASE 5 SHIPPED + gate-ON validated: atomic-working-item context (tenet 6) — the drive hands a worker only its minimal declared slice. NOETL_ATOMIC_ITEM_CONTEXT (default false, prod unchanged). #77 dependency resolved: Explicit Input Binding is CLOSED (BREAKING v3.0.0) — it shipped the declaration surface (input:/args: + set:→input:); Phase 5 adds the missing extractor + drive narrowing (the server previously attached the full accumulated context to every command regardless of input:). noetl-server v3.37.0 (server#250, a96ade8): orchestrate-core::input_binding — analyze(step) statically extracts the base-context keys a step references (minijinja undeclared_variables; ctx.X→X, bare step-name→key, injected roots→none), conservative (unbounded ref → full context); CommandBuilder/WorkflowOrchestrator::with_atomic_item_context narrow the persisted worker-bound context for plain non-loop steps (server-side render still runs against the full context); plugin OrchestrateInput/OrchestrateStateInput carry a #[serde(default)] flag; metric noetl_atomic_item_context_total{outcome}. noetl-worker v5.40.0 (worker#121, 2484d17): forwards the flag onto the off-server from_events drive input. e2e (e2e#69, 79505fa) atomic_item_context.yaml + kind_validate_atomic_item_context.sh. 
Gate-ON kind-validated (server p5-atomic, flag on, STATE_BUILDER=server + PLUGIN_DRIVE=true + PUBLISH_ONLY=true): flag-ON consumer render_context = [producer_a] ONLY (producer_b + start + steps + workload + execution_id + catalog_id + path all dropped), execution COMPLETED, narrowed metric +1; flag-OFF full 8-key context, COMPLETED (back-compat); offserver regression COMPLETED / __orchestrate__ event=0 / dispatched+applied advance / system-pool isolation / lag-0; 7 input_binding + 132 orchestrate-core + 584 server + 10 worker tests + clippy green; baseline restored. Realizes tenet 6; remainder = program-scale (per-shard WAL, multi-replica descriptor coherence). — prior: ✅ #115 PHASE 6 SHIPPED + gate-ON LITERAL-ZERO validated → ai-meta pointers bumped: the hot-path noetl.event read class is RETIRED; the table is AUDIT-ONLY. NOETL_EVENT_READ_PATH=event_scan\|audit_only (default event_scan, prod unchanged). Phase 4 cleared the drive scan under offserver; Phase 6 retires the remaining lifecycle readers (the WHERE execution_id replay class outside the drive). noetl-server v3.36.0 (server#249, b71ca1d): under audit_only, get_catalog_id (per-ingest) + inherit_parent_trace + the subscription dedup-audit + container-callback catalog/existence reads serve from the in-memory execute-time ExecDescriptor; a cold descriptor (post-terminal straggler after eviction / restart) resolves catalog_id from noetl.command (synchronous queue) — never a noetl.event scan. Proof metric noetl_event_hotpath_reads_total{site,outcome}. ops (ops#199, e5b0737) pins event_scan on the prod server manifest (operator-gated flip). e2e (e2e#67+#68, 0ab3c0a) kind_validate_event_read_path_phase6.sh. 
Gate-ON kind-validated: hot-path scan Δ0 (served_descriptor +96 + served_command +3), drive state_build_total Δ0 + event_scans Δ0 ⇒ ZERO noetl.event scans anywhere on the hot path, end-to-end; linear/loop/fan-out/output_select COMPLETE; sole-writer + lag-0; audit still works (direct SELECT + status COMPLETED + replay event_count=25); gate rig PASS (no regression); 585 server tests + clippy green; baseline restored; default event_scan, prod unchanged. The RFC's never-scan end state (tenet 3) is reached under the flag; remainder = Phase 5 (atomic-item, needs #77) + program-scale (per-shard WAL, multi-replica descriptor coherence). — prior: ✅ #115 PHASE 4 REMAINDER SHIPPED — the off-server drive edge is now STATELESS. Under NOETL_STATE_BUILDER=offserver the server performs ZERO state rebuild + ZERO noetl.event reads on the drive path. noetl-server v3.35.0 (server#248, 6e30fc3) — a per-execution ExecDescriptor (catalog_id + routing seeded at playbook_started; terminal stamped at the emit_events chokepoint) lets trigger_orchestrator_inner/apply_worker_orchestration route + apply WITHOUT building state; expected_head from the in-memory ChainHeads; cold descriptor (restart) falls through to the server-built path (re-seeds) so chain_walk + event_scan stay fallbacks. noetl-worker v5.39.0 (worker#120, 8e1f651) — resolves trigger_event_type off the WAL from trigger_event_id; a stateless command with an incomplete WAL after the bounded retry is a benign __offserver_retry__ no-op the reconcile poller re-drives (never a partial state). e2e (e2e#66, f4bb342) — the rig now asserts noetl_state_build_total Δ0 + dispatched_offserver_stateless/applied_stateless advance. Gate-ON kind-validated: state_build_total Δ0 + event_scans Δ0 across linear(13)/loop(62)/fan-out(25)/output_select(31), all COMPLETE, offserver==server parity, sole-writer 25==25, lag-0, materializer dup 0; 583 server + 218 worker tests + clippy green; baseline restored; default server, prod unchanged. 
Completes #107 step 2 server-side (state rebuild + event reads removed from the drive path under the flag); Phase 5 (atomic-item context, needs #77) + Phase 6 (retire the event read path) remain. — prior: ✅ #115 PHASE 4 DRIVE CUTOVER SHIPPED + gate-ON parity-validated → ai-meta pointers bumped. The orchestrator drive now constructs its WorkflowState off the server on the system worker pool, from the noetl_events WAL spine (wasm run/from_events), under NOETL_STATE_BUILDER=offserver (default server, prod unchanged): noetl-worker v5.38.0 (worker#119, bef13e5) — shared WAL index + authoritative durable noetl_state_builder consumer + dispatch_wasm builds from the spine with a staleness guard (expected_head); noetl-server v3.34.0 (server#247, f0922bd) — marks the offserver command + carries expected_head; ops (ops#198, b1da9f1) the system-pool knob; e2e (e2e#65, b38b6dd) the committed gate-ON parity rig. Validated gate-ON: offserver==server fingerprint (fan-in fires once), worker served +3 / event_scans 0 / wal_events +25, server state_build_event_scans 0, cache cold+1/incr+2, sole-writer 25==25 / lag-0; linear + loop legs COMPLETED off-server; baseline restored. Phase-4 remainder (remove residual server chain-walk bookkeeping → fully zero server reads) + Phases 5–6 remain. — prior: ✅ #115 PHASE 4 (off-server state builder) KERNEL + FLAG SHIPPED + shadow kind-validated. The pool-side off-server builder reconstructs orchestrator WorkflowState from the noetl_events WAL (not the materialized noetl.event table): a per-execution chain index walks the prev_event_id spine head→root and caches the built spine keyed by the immutable chain head, advancing only the new tail on the next trigger. Shipped: noetl-worker v5.37.0 (worker#118, fef961c) — the state_builder kernel + a live WAL shadow loop (NOETL_STATE_BUILDER_SHADOW, default off) + metrics; noetl-server v3.33.0 (server#246, 3e6006d) — the NOETL_STATE_BUILDER=offserver|server flag scaffold (default server, prod unchanged). 
Validated gate-ON on live kind (PUBLISH_ONLY + off-server drive + materializer sole-writer): the shadow replayed the WAL and chain-walked spines whose indexed==spine sizes exactly match the Phase-3 topologies (linear 13, loop 62, fan-out 25, output_select 31, storage_tiers 55) — parity by construction; WAL-read proven (wal_events_total=993) with ZERO noetl.event scans (event_scans_total=0); cache proven (cold_rebuild=28 on replay/restart + incremental=21 live tail-advance, e.g. a fresh fan-out indexed live as Incremental(5), indexed==spine==25 == DB event_rows 25); a fresh fan-out COMPLETED gate-ON with event_rows==distinct / 0 __orchestrate__ event rows / materializer pending=0 / project_errors=0; 8 worker unit tests + 2 server config tests + clippy green. Baseline restored. PROD GKE untouched; no gate/mode/builder default changed. Phase 5 (atomic-item context, needs #77) / Phase 6 (retire event read path) + the off-server drive cutover (offserver wiring) remain. — prior: ✅ #115 PHASE 3 MERGED → server v3.32.0. server#245 self-merged → server v3.32.0 (8338417) — the chain-walk state builder. Behind NOETL_STATE_BUILD_MODE=chain_walk|event_scan (default event_scan, prod unchanged), the drive rebuilds WorkflowState by walking prev_event_id head→root (head from the in-memory ChainHeads watermark, then (execution_id,event_id) PK lookups — never a WHERE execution_id scan) and feeding the same from_events (orchestrate-core unchanged; parity by construction). Conservative fallback to event-scan on cold-head / lag / non-genesis. Validated gate-ON across 5 topologies (linear/loop/fan-out/output_select/storage_tiers): parity 41/41 MATCH 0-mismatch (tx-isolated, normalized), NO-SCAN proven (event_scans_total=0 across 40 drive builds, 1064 PK hops, 0 fallbacks), all fixtures COMPLETE in chain_walk mode, sole-writer + lag-0 + gate rig PASS, 577 lib tests + clippy green. 
Phase 4 (off-server state builder + WAL cache) builds on this — now in progress (move the chain-walk state construction OFF the server onto the system worker pool, reading the WAL/NATS stream, pool-side cache keyed by the immutable chain head + incremental tail-advance). — prior: ✅ #115 PHASE 2 MERGED — one-level prev_event_id event chain (server#244 → server v3.31.0 f5bd4a8 + noetl#667 → noetl ecd16a2; ai-meta pointer bump afdb365). — prior: ✅ #115 PHASE 2 IMPLEMENTED + kind-validated — one-level prev_event_id event chain. Each noetl.event carries prev_event_id (the immediately-previous event in causal order) + each noetl.command the issuing-event link, so per-execution events form a walkable singly-linked list followable pointer-by-pointer without scanning noetl.event (additive; no reader consumes it yet — that's Phase 3). The emit chokepoint emit_events stamps the event link from a per-execution chain-head watermark (ChainHeads) — one path covers drive events + command.issued + worker-lifecycle (via handle_event) on both the gate-off INSERT and the gate-on publish (to_stream_json), the materializer persisting it; the command link = the real step.enter/unblocking completion so cursor-fan-out bodies share their branch origin (§4.4). Server-only (no orchestrate-core change — the chokepoint subsumes the "core carries prev" design). Chain-correctness proven gate-ON (PUBLISH_ONLY + off-server drive + materializer sole-writer) across 6 executions — linear 13/13, loop 62/62, fan-out 25/25 (real shared branch origin), sub-playbook 46/46, + Phase-1 output_select 31/31 & storage_tiers 55/55 bounded: each has 1 root, 0 dangling / 0 duplicate event prev, 1 head, pointer-walk == full sequence (no gaps), real-step command dangling=0; kind_validate_orchestrate_gate.sh PASS (sole-writer 25==25, 0 dup cycles, catalog0=0, lag 0); 573 server lib tests + clippy green. PRs server#244 + noetl#667 open, awaiting merge → pointer bump; PROD untouched, no gate default changed. 
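The chain checks quoted throughout this page (1 root, 0 dangling, 1 head, pointer-walk equal to the full sequence) can be made concrete with a small Python sketch over an in-memory map of `event_id -> prev_event_id`. The function names and the dictionary representation are illustrative assumptions, not the validation rig's code.

```python
def walk_chain(events, head_id):
    """Follow prev_event_id pointers from the head back to the root.

    `events` maps event_id -> prev_event_id (None for the root).
    Returns the event ids in causal order, root first.
    """
    order = []
    seen = set()
    cursor = head_id
    while cursor is not None:
        if cursor in seen:
            raise ValueError(f"cycle at event {cursor}")
        if cursor not in events:
            raise KeyError(f"dangling pointer to event {cursor}")
        seen.add(cursor)
        order.append(cursor)
        cursor = events[cursor]
    order.reverse()
    return order


def chain_report(events, head_id):
    """Summarise the per-execution invariants for one chain."""
    prevs = [prev for prev in events.values() if prev is not None]
    roots = sum(1 for prev in events.values() if prev is None)
    dangling = sum(1 for prev in prevs if prev not in events)
    heads = sum(1 for event_id in events if event_id not in set(prevs))
    walk = walk_chain(events, head_id) if dangling == 0 else []
    return {
        "roots": roots,
        "dangling": dangling,
        "heads": heads,
        "walk_equals_total": len(walk) == len(events),
    }


# An id-inverted chain: event 4 was linked after event 5, so the real tip is 4.
inverted = {1: None, 2: 1, 5: 2, 4: 5}
```

Walking `inverted` from the real tip (4) visits all four events, while walking from `max(event_id)` (5) stops one event short. That gap is the inversion case reported under #117 earlier on this page.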
— prior: ✅ #115 PHASE 1 SHIPPED — references-in-state consume side; closed #113 + #114 (all 9 stalled fixtures green gate-ON). Worker resolve_context_references made selective (worker#117 → v5.36.0): resolve a noetl:// ref only when this command's tool input binds the step's bulk (a path the bounded extracted summary can't satisfy — whole-object bind, .data over a summarised rowset, array element past [0], _truncated node); predicate / scalar / _ref access reads off the summary with no store round-trip, and an upstream result a step doesn't consume stays a reference (foreign bulk never inflates the render). Server hydrate_result_references surfaces _ref/_store/_uri on the kept summary (so {{ step._ref }} lazy-load + {{ step._ref is defined }}/{{ step._store }} predicates resolve without bulk) and refs_in_state default flipped to true (server#243 → v3.30.0) — references stay out of state + commands by default (the #114 kind experiment proved the flip needs this consume side first; it landed in the same change set). Validated kind gate-ON (PUBLISH_ONLY + off-server drive + materializer sole-writer): all 9 of #113's stalls reach playbook.completed — output_select / storage_tiers / lease_expiry / pipeline_heavy_payload / save_edge_cases / large_result_extraction / http_to_postgres_{direct,simple,bulk_python}; max next-command context across the 9 = 412KB (was 1.32MB), storage_tiers' 17.4MB drive-state gone (36KB), lease_expiry's 201 spinning orch cmds → 16; 0 __orchestrate__ event rows every run (sole-writer intact), materializer lag = 0; offserver rig PASS (decode_error +0) + 6-fixture fast-path regression sample (fan-out/loops/conditionals) green. #101's consume side done; off-server-drive prod cutover (#107/#111) unblocked on the payload/state-size axis. PROD GKE untouched (pre-#108 in-server drive); the refs_in_state flip is a code default — prod runtime gates unchanged. ai-meta → worker 0a66b41 + server 3014f6f. 
— prior: 📐 RFC FILED (#115) — Decouple result data from context: reference-only schema + one-level event chain + worker-side state builder. A design deliverable (no feature code) directed by the platform owner, reframing the off-server-drive / event-sourcing model around six tenets: the root cause is unbounded context growth (every step passes the entire accumulated context forward — build_context folds all prior step results, orchestrator.rs clones it per arc, build_command hands the whole map to each command; evidence: 17.4MB drive-state for storage_tiers, 1,324,800B next-command context for output_select), fixed by (2) reference-only noetl.* schema (data lives in the result/object store by URN; rows carry {ref, extracted} only), (3) a never-scan-noetl.event invariant on the state/drive read path (beyond #103's sole-writer — removes event-table reads), (4) a one-level event chain (command→prev-event→prev-event, walked pointer-by-pointer, not a full replay), (5) an off-server system/state_builder WASM playbook that walks the chain + caches state on the system pool, and (6) an atomic-working-item context contract (a tool worker receives only its minimal item slice, not the whole playbook). Full RFC on the wiki (Umbrella: Decoupled Context + Event Chain); 6-phase plan with Phase 1 = #101's consume side (worker selective render-time ref-resolution — the immediate unblock for the 3 stuck #114 fixtures + the off-server-drive prod cutover). Reframes #101 (subsumed; its incremental-state cache + resolve-into-state is inverted to never-data-in-state + never-scan), subsumes the storage tier of #104, is the state-construction design for #107 steps 2–4, unblocks #111, builds on #103's write path (the sole-writer gate is unaffected — we did NOT change the kind gate state), and leaves #114 as a safety cap. Filed: issue #115 + board 3 (Todo) + this wiki page; #101/#107/#111 cross-linked. — prior: 🐞 #114 oversized-command.issued offload shipped (server v3.29.5). 
Under the publish-only gate a command.issued event carrying the full upstream context (~1.32MB) exceeded NATS max_payload (1MB) → publish never acked → wedge. Fix: when a command context exceeds NOETL_COMMAND_CONTEXT_MAX_BYTES (default 512KB) persist_engine_command(s) offload it to noetl.result_store with a tiny {__context_ref__} marker; get_command/claim_command resolve it before the worker sees it (metrics context_offloaded/context_ref_resolved). server#242→v3.29.5 + rig phase 8 e2e#64. Kind gate-ON: rig PASS (new test_oversize_command_context COMPLETED, max command.issued ctx 585B, offload+resolve fired, 0 __orchestrate__ event rows, lag 0); every command.issued event <1MB across all fixtures; 6 of #113's 9 large-context fixtures now COMPLETE. Chose ref-on-oversize over refs_in_state=true (candidate #1): a kind experiment proved refs_in_state=true fixes the state-bloat (lease_expiry completes) but breaks bulk-consuming fixtures (storage_tiers/output_select fail at the bulk step) because the worker render-time ref-resolution isn't implemented — so the default stays false. The remaining 3 fixtures + the off-server-drive cutover (#107/#111) are now blocked on the refs_in_state consume side (#101) — __orchestrate__ drive-state bloat (17.4MB for storage_tiers) + the _ref/bulk-resolve gap. No prod default flipped; prod is pre-#108 in-server drive (unaffected). — prior: 🐞 #113 off-server-drive payload-size + cancel fix shipped (server v3.29.4). Fixed the worker-driven drive stall when an __orchestrate__ result exceeds the 100KB inline budget: the worker offloads it to the durable result store with only a reference.ref (no inline output_b64), and apply_worker_orchestration now resolves+decodes that ref (server#241, metric ref_resolved) instead of dropping the drive decision → non-convergent re-loop; plus cancel now stops the drive (match underscore playbook_cancelled + ExecutionState::is_terminal terminal guard evicts the orch-cache, no restart). 
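The #114 offload described above (replace an oversized command context with a tiny `{__context_ref__}` marker, then resolve it before the worker sees the command) can be sketched in Python as follows. The dictionary-backed `result_store`, the key format, and the function signatures are assumptions for illustration; only the 512KB default and the marker name come from the text.

```python
import json
import uuid

MAX_CONTEXT_BYTES = 512 * 1024  # mirrors the stated NOETL_COMMAND_CONTEXT_MAX_BYTES default


def persist_command(command, result_store, max_bytes=MAX_CONTEXT_BYTES):
    """Offload the context to the store when its encoded size exceeds the budget."""
    encoded = json.dumps(command["context"]).encode("utf-8")
    if len(encoded) <= max_bytes:
        return command  # small contexts stay inline
    ref = f"ctx-{uuid.uuid4().hex}"  # hypothetical key format
    result_store[ref] = encoded
    return {**command, "context": {"__context_ref__": ref}}


def claim_command(command, result_store):
    """Resolve an offloaded context so the worker always receives the full map."""
    context = command["context"]
    if isinstance(context, dict) and set(context) == {"__context_ref__"}:
        context = json.loads(result_store[context["__context_ref__"]].decode("utf-8"))
    return {**command, "context": context}
```

The published `command.issued` payload then carries only the marker, so it stays well under the 1MB NATS limit regardless of how large the upstream context grew.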
Companion convergence rig e2e#63. Kind gate-ON proven (785KB drive result → ref_resolved→COMPLETED, 0 decode WARNs; cancel froze a drive-loop instantly; materializer sole-writer lag 0). 5/9 #113 large-context fixtures COMPLETE; the other 4 hit a DISTINCT oversized-command.issued (full upstream context embedded → >1MB NATS payload) stall → #114 (#113 stays open until all 9 close). No prod default flipped; prod is pre-#108 in-server drive (unaffected). — prior: 🚀 #103 GKE pre-flip PREP landed (staged, NO traffic flip, NO PUBLISH_ONLY). Verified live that prod already runs the full Rust stack (the #49 Python→Rust cutover is done; the noetl Service selector is already app=noetl-server-rust; live images are pre-#103 batch-dispatch-v1/cursor-100), both flip secrets already exist (NOETL_ENCRYPTION_KEY + noetl-internal-api-token), and prod monitoring is Google Managed Prometheus, not VictoriaMetrics. Pushed the post-#103 images to the prod Artifact Registry (server v3.29.3 + worker v5.35.0, amd64); applied + verified GMP monitoring (PodMonitoring for worker+server /metrics — the noetl namespace had none, so app metrics weren't scraped at all — + materializer-lag Rules, translated from the kind VMRule; up{namespace="noetl"}=4 live series); staged the roll-forward manifests (server→v3.29.3 gate-off, system-pool→v5.35.0 materializer-off) in a PR not applied (they roll live workloads). Operator-gated remainder (surfaced, not done): roll the live images, enable the materializer shadow, wire the GMP managedAlertmanager pager, then flip. No prod default changed. — prior: 🛡️ #103 materializer-lag GUARDRAIL shipped — the pre-flip observability gate is now in place. The server was already FLIP-READY; the remaining operator gate was a materializer-lag metric + alert so the staged PUBLISH_ONLY flip is safe and one revert away. 
Shipped (default-off): worker #116 → v5.35.0 extends the JetStream lag poller to track the noetl_events/noetl_materializer consumer on an independent task (so a stalled/dead materializer loop — which can't report its own lag — still surfaces as a climbing noetl_worker_nats_consumer_pending{consumer="noetl_materializer"} gauge); ops #195+#196 add the VMRule (backlog warning/critical/growing + stall-under-gate + project-errors + absent-under-gate), a worker /metrics VMServiceScrape (was unscraped), VMAlert enabled, a Grafana dashboard, and the flip runbook noetl-cqrs-publish-only-flip.md (pre-flip green-baseline check + one-command revert). Kind-proven the full cycle: green baseline (backlog 0, published==projected==acked) → induced lag (materializer fault-injected while events publish under the gate) → backlog gauge climbs 0→684 via the independent poller, alerts fire (backlog warning+critical + stall) → recover (fault removed) → materializer drains backlog→0 idempotently (0 dup, 0 loss), alerts clear. Flip-readiness now includes the monitoring gate; PUBLISH_ONLY stays default-off, no prod default changed. — prior: 🎯 #103 server cutover COMPLETE — the server is FLIP-READY. The last of the three flip blockers is closed: the two ExecutionService terminal writers (POST /cancel → playbook_cancelled, POST /finalize → playbook_completed/playbook_failed) now route through the emit_event chokepoint, so they honour NOETL_EVENT_INGEST_PUBLISH_ONLY like the other 13 producers instead of writing noetl.event synchronously under the gate. Shipped: server #240 → v3.29.3 (b6e5d31, default-off) + dual-mode e2e rig kind_validate_cancel_finalize_gate.sh (e2e#62). 
Kind-proven both modes: gate-OFF cancel/finalize INSERT synchronously, byte-identical columns (error preserved), published delta +0, natural completion still COMPLETED; gate-ON noetl_event_ingest_published_total{playbook_cancelled}=1/{playbook_failed}=1 (PUBLISHED, not inserted), materializer (system pool) is the sole writer, both executions reach the correct terminal state, rows==distinct, 0 catalog_id=0, no loss/dup. No remaining synchronous server noetl.event writers on the producer path under the gate — the server is a complete non-writer of the event log when PUBLISH_ONLY is on. Flipping PUBLISH_ONLY on is now a staged operator decision (behind a materializer-lag alert, one revert away); no prod default changed. — prior: #104 off-server-drive × gate reconciliation PROVEN (server v3.29.2, cold_rebuild); #103 ack-after-materialize durability resolved (tools 3.13.0 + worker 5.34.0 + ops#194).)
Refresh cadence: every session that lands meaningful cross-repo work (per agents/rules/wiki-maintenance.md Rule 0a)
Standing direction (2026-06-04). Per memory entry, Python tiers are deprioritized. Forward Rust-only e2e work is tracked under #54 (Phase F R5). Python pieces stay deployable for backwards-compat on GKE but are NOT a target for new feature work.
Single pane of glass for the NoETL platform. Every active umbrella, every submodule, every release lands here so a single page shows what's in flight, what shipped, what's next.
Convention. This wiki is the cross-repo dashboard. Per-repo wikis (e.g. noetl/server wiki, noetl/ops wiki) document that repo's surface; this wiki documents the system of repos. See Wiki convention for the split.
Umbrellas open in the ai-task queue: the Rust server parity-port umbrella (#49), the orchestrator-scaling work (#101), the new event-WAL + derivable-storage model (#104), the Rust regression baseline migration (#98), the Python-era deploy legacy cleanup (#97), and the postgres timestamptz NaiveDateTime bug (#95). #100 (cursor/claim loop mode) closed 2026-06-15 with server v3.8.0 + tools v3.10.1; test_pft_flow_v2 all_passed:true on kind (see Recently closed). #99 (transfer-tool credential aliases) closed 2026-06-14 via tools#65 + worker#87 + e2e#58 (see Recently closed). The subscription / listener tool RFC (#90) closed 2026-06-12 with all 7 phases shipped + live-proven (see Recently closed); refinement follow-ups #91–#94 and tools#57 are tracked separately.
| # | Opened | Last update | Umbrella | Status | Wiki page |
|---|---|---|---|---|---|
| #107 | 2026-06-17 | 2026-06-17 | Program: Distributed Multitenant OS — Server Dissolution → Global Grid | The strategic roof over #101–#105. Blueprint names NoETL a distributed multitenant OS (server→stateless edge; NATS WAL + object store the only durable state; processing = event-driven system playbooks on a sharded grid; foundation for quantum-cloud-hybrid). 5-step path: CQRS cutover (step 1, shadow green) → orchestrator-as-plug-in (#108, done 2026-06-18 — drive core fully wasm-resident (#109 closed); the system/orchestrate plug-in compiles to a 0-import .wasm (server#224), runs identically to native in wasmtime (server#225), is seeded into the registry on boot + servable (server#226), and drives the real workload identically live — shadow over the 10×1000 PFT, 529 evals 0 mismatch (server#227); worker-driven cutover — the drive runs OFF-SERVER on the pool, kind-validated — entry/run_state dispatch (worker#113) + apply_orchestration_result (server#228) + the flag-gated scheduler/apply/state-guard (server#229, NOETL_ORCHESTRATE_PLUGIN_DRIVE); simple_python drove start→end→COMPLETED through the worker round-trip. #108 CLOSED 2026-06-18 — (c) the default-flip shipped (server#233, v3.28.0, drive default ON) after a clean scale soak; orchestrator-as-plug-in is done; the in-server shadow + wasmtime server dep retired in #110 / server#234) → per-shard WAL → drop Postgres → cross-shard federation. | docs blueprint |
| #115 | 2026-06-19 | 2026-06-20 | RFC: Decouple result data from context — reference-only schema + one-level event chain + worker-side state builder | PROGRAM-SCALE STEP 2 SHIPPED + multi-replica validated 2026-06-20 — execution-affinity write ordering (#116). Step 1 (KV coherence, server v3.38.0) was necessary-not-sufficient; affinity routes every trigger for an execution to the replica that ShardConfig::owns it (non-owner forwards) so the read→advance is atomic per execution → 2+ replicas produce one unforked chain. server#252 → server v3.39.0 5e00d0a (src/affinity.rs; NOETL_EXECUTION_AFFINITY/NOETL_PEER_URL_TEMPLATE/NOETL_SHARD_INDEX_FROM_HOSTNAME, all default off) + e2e#71 66b6e1b (2-replica StatefulSet topology + rig HARD gate). Multi-replica gate-ON kind PASS: linear/loop/fanout COMPLETE, every chain roots=1/dangling=0/walk==total, forwarded_ok proof, never-scan + sole-writer across replicas; single-replica unchanged; 595 server tests + clippy green; baseline restored. Remaining #117: off-server from_events spine ordered by event_id wedges fan-in under a chain-order≠id-order inversion (affinity + high-concurrency fanout) — fix = order spine by prev_event_id walk; linear/loop already reliable. PROD GKE untouched; all affinity flags default off. Phase 1 SHIPPED 2026-06-19 — references-in-state consume side. Worker resolve_context_references made selective (resolve a noetl:// ref only when this command's tool input binds the step's bulk — a path the bounded extracted summary can't satisfy; predicate / scalar / _ref access reads off the summary; worker#117 v5.36.0). Server hydrate_result_references surfaces _ref/_store/_uri on the kept summary + refs_in_state default flipped to true (server#243 v3.30.0). Kind gate-ON: all 9 #113 stalls COMPLETE (max command ctx 412KB, 0 __orchestrate__ event rows, materializer lag 0); the 3 targets bounded (output_select 1.32MB→10KB, storage_tiers 17.4MB→36KB, lease_expiry 201 spinning orch cmds→16). Closed #113 + #114. Off-server-drive prod cutover (#107/#111) unblocked on the size axis; PROD GKE untouched. Phase 2 MERGED 2026-06-19 — one-level prev_event_id event chain (server#244 → server v3.31.0 f5bd4a8 + noetl#667 → noetl ecd16a2; ai-meta pointer afdb365): each noetl.event/noetl.command carries the chain link, stamped at the emit chokepoint from a per-execution chain-head watermark, covering both gate paths + the materializer; chain-correctness proven walkable / 1-root / no-gap / no-scan across 6 gate-ON executions (incl. a real fan-out branch origin), sole-writer + Phase-1 bounded sizes intact, 573 tests + clippy green. Post-merge verified on live kind (both prev_event_id columns present, server image reflects merged code, gate-ON baseline live). Phase 3 MERGED 2026-06-19 — chain-walk state builder (server-side, flagged) (server#245 → server v3.32.0 8338417; ai-meta pointer bumped): the drive rebuilds WorkflowState by walking prev_event_id head→root (in-memory ChainHeads head + (execution_id,event_id) PK lookups, no WHERE execution_id scan) → same from_events (orchestrate-core unchanged; parity by construction), behind NOETL_STATE_BUILD_MODE=chain_walk\|event_scan (default event_scan, prod unchanged), event-scan kept as the fallback (cold-head / lag / non-genesis). Gate-ON validated: parity 41/41 MATCH 0-mismatch (tx-isolated), NO-SCAN proven (event_scans_total=0, 40 builds / 1064 PK hops / 0 fallbacks), all topologies COMPLETE, sole-writer + lag-0 + gate rig PASS, 577 tests + clippy green. Self-merged (no classifier block). Phase 4 KERNEL + FLAG SHIPPED + shadow kind-validated ([worker#118](https://github.com/noetl/worker/pull/118) → worker v5.37.0 fef961c + [server#246](https://github.com/noetl/server/pull/246) → server v3.33.0 3e6006d): the pool-side state_builder reconstructs WorkflowState from the noetl_events WAL — a per-execution chain index walks prev_event_id head→root, caching the spine keyed by the immutable chain head + incremental tail-advance + cold-rebuild on miss/restart. A live WAL shadow loop (NOETL_STATE_BUILDER_SHADOW, default off) + the NOETL_STATE_BUILDER=offserver |
| #49 | 2026-06-02 | 2026-06-14 | Rust server FastAPI parity port — primary | server v3.6.0 (system worker pool + cleanup/purge endpoint, server#193); prod is Rust-only (server-rust + worker-rust + system pool). Remaining entangled refactor tracked in #97. | Umbrella: Rust Server Port |
| #101 | 2026-06-15 | 2026-06-16 | Orchestrator scaling: incremental state + results-by-reference resolution | In progress — PRs open (server#197 incremental OrchStateCache + hydrate_result_references; worker#89 NOETL_EVENT_RESULT_CONTEXT_MAX_BYTES). References-in-state flag-on stall fixed + re-validated 2026-06-16 (worker feat/refs-in-state-extracted — build_extracted now a bounded navigable structural summary so output.data.rows[0].<field> resolves off the reference; PFT advanced 13→253 events, 31 refs active). Awaiting merge + pointer bumps. REFRAMED 2026-06-19 by #115 — the RFC inverts #101's "resolve refs into a rebuilt WorkflowState" to "never put data in state, never scan events to build state". Consume side DONE 2026-06-19 (#115 Phase 1 — worker#117 v5.36.0 + server#243 v3.30.0): worker selective render-time ref-resolution + _ref/_store surfacing + refs_in_state default true; references now stay out of state/commands by default. The incremental OrchStateCache is superseded by the immutable-chain cache in later #115 phases. | Umbrella: Orchestrator Scaling |
| #103 | 2026-06-15 | 2026-06-19 | Step 2 — CQRS event log: events to JetStream + batch projector (write-path scaler) | GKE pre-flip PREP landed (2026-06-19) — prod images pushed, GMP monitoring live, manifests staged; NO traffic flip / NO PUBLISH_ONLY. Prod verified already-Rust (pre-#103 images), secrets present, monitoring = GMP (not VM); operator-gated remainder = roll images → materializer shadow → pager → flip. Server cutover COMPLETE — FLIP-READY (default-off). All three flip blockers closed: (1) ack-after-materialize durability (deferred ack + worker materializer loop), (2) off-server-drive × gate reconciliation (#104, server v3.29.2 cold_rebuild), (3) the 2 ExecutionService cancel/finalize sites now route through the emit_event chokepoint (server#240 → v3.29.3, e2e rig e2e#62). No remaining synchronous server noetl.event writers under the gate — kind-proven both modes (gate-off byte-identical INSERT; gate-on cancel/finalize PUBLISHED, materializer sole writer, terminal state reached, 0 loss/dup). Flipping PUBLISH_ONLY on is a staged operator decision (materializer-lag alert, one revert away); no prod default changed. Materializer-lag guardrail SHIPPED 2026-06-19 (worker #116 v5.35.0 lag gauge on an independent poller + ops #195/#196 VMRule + worker scrape + VMAlert + dashboard + flip runbook) — kind-proven induce→fire→recover→clear; flip-readiness now includes the observability gate. | Umbrella: Orchestrator Scaling |
| #102 | 2026-06-15 | 2026-06-16 | Step 1 orchestrator throughput: batch event-log writes + frame-batch cursor body | In progress — Part A landed (server#198 → v3.10.0), validated on GKE (1×50 PFT COMPLETED, batch event-log path). Part B (frame-batch cursor body) ahead. | Umbrella: Orchestrator Scaling |
| #104 | 2026-06-16 | 2026-06-19 | Event WAL + derivable result storage: NATS-as-WAL, logical-URI naming, Feather tier | In progress — off-server-drive × gate reconciliation PROVEN (2026-06-19): gate-ON (PUBLISH_ONLY=true) with the off-server drive (PLUGIN_DRIVE=true) + materializer sole writer is now green on kind (the combo #103 left unproven) — fresh exec + cursor fan-out → COMPLETED, server wrote 0 noetl.event rows (all PUBLISHED), materializer materialized all exactly once (rows==distinct ids, 0 dup), read-your-writes held via the relocated trigger. Server #238 → v3.29.2 rebuilds WorkflowState from the durable log on cold-cache apply (crash-recovery, kind-proven cold_rebuild); committed e2e rig kind_validate_orchestrate_gate.sh (e2e#61). This unblocked the #103 flip (the 2 cancel/finalize sites are now also done — server v3.29.3 #240; #103 is FLIP-READY). Prior: naming foundation noetl_tools::locator (tools#68/#70, v3.12.0) + worker URI stamp (worker#99); blueprint (docs#180). | Umbrella: Event WAL Storage |
| #98 | 2026-06-14 | 2026-06-14 | Grow the Rust regression baseline: migrate Python-era e2e fixtures | Snowflake key-pair JWT validated — the last external-tool gap is closed. tools v3.9.2 (tools#62/#63/#64); create_sf_database + setup_sf_table COMPLETED via JWT on kind. Transfer step deferred to #99. Core green: 64 fixtures. | — |
| #97 | 2026-06-14 | 2026-06-14 | Retire remaining Python-era deploy legacy (manifests, kind automation, helm chart) | Open — Todo. Python manifests, kind redeploy automation that hardcodes Python deployment names, stale helm release rev 185. | — |
| #111 | 2026-06-18 | 2026-06-19 | E2E: worker-driven orchestrate topology coverage + server-API-only gap tracking | In progress — three committed kind rigs now: kind_validate_orchestrate_offserver.sh (e2e#59) asserts the off-server topology (gate-off), kind_validate_orchestrate_gate.sh (e2e#61) asserts it composes with the PUBLISH_ONLY gate, and kind_validate_cancel_finalize_gate.sh (e2e#62, 2026-06-19) dual-mode-asserts the ExecutionService cancel/finalize writers honour the gate (gate-off byte-identical INSERT; gate-on PUBLISHED, materializer sole writer, terminal state, 0 loss/dup) — the rig that closed the last #103 flip blocker. Off-server rig live-green (COMPLETED, __orchestrate__ in noetl.event = 0, dispatched=applied, shadow metric absent). Durable home for the server-API-only gap (server still sole-writer + rebuilds state — moves under #103/#104) + two operator decisions: (A) retire in-process drive fallback (gated on prod adopting a post-#108 image; prod still pre-#108), (B) reap accumulating __orchestrate__ PENDING delivery rows in noetl.command. | — |
| #95 | 2026-06-14 | 2026-06-14 | noetl-tools postgres pg_value_to_json returns null for timestamptz / NaiveDateTime columns | Open — Todo. Bug: timestamptz columns serialize to JSON null instead of the ISO-8601 string. | — |
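
The chain checks quoted in the status cells above (roots=1, dangling=0, walk==total) can be illustrated with a minimal Rust sketch. It assumes a linear per-execution chain in which each event carries an optional prev_event_id; the ChainEvent and ChainReport types and the check_chain function are hypothetical names introduced for illustration, not the rig's or the server's actual code.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical, simplified view of one event row: only the two
/// columns the chain check needs.
#[derive(Debug, Clone, Copy)]
pub struct ChainEvent {
    pub event_id: i64,
    pub prev_event_id: Option<i64>,
}

/// Counters mirroring the assertions named in the text.
#[derive(Debug, PartialEq, Eq)]
pub struct ChainReport {
    pub total: usize,    // all events of the execution
    pub roots: usize,    // events with no prev_event_id
    pub dangling: usize, // events whose prev_event_id matches no event
    pub walk_len: usize, // events reached walking head -> root
}

impl ChainReport {
    /// Clean means: one root, no dangling links, and the walk covers everything.
    pub fn is_clean(&self) -> bool {
        self.roots == 1 && self.dangling == 0 && self.walk_len == self.total
    }
}

/// Walks prev_event_id links from `head` back to the root using keyed
/// lookups only (no ordering by event_id is assumed).
pub fn check_chain(events: &[ChainEvent], head: i64) -> ChainReport {
    let by_id: HashMap<i64, Option<i64>> = events
        .iter()
        .map(|e| (e.event_id, e.prev_event_id))
        .collect();

    let roots = events.iter().filter(|e| e.prev_event_id.is_none()).count();
    let dangling = events
        .iter()
        .filter(|e| matches!(e.prev_event_id, Some(p) if !by_id.contains_key(&p)))
        .count();

    let mut seen: HashSet<i64> = HashSet::new();
    let mut cursor = Some(head);
    while let Some(id) = cursor {
        // Stop on a missing link or on a cycle.
        let Some(prev) = by_id.get(&id) else { break };
        if !seen.insert(id) {
            break;
        }
        cursor = *prev;
    }

    ChainReport { total: events.len(), roots, dangling, walk_len: seen.len() }
}
```

Because the walk follows links rather than sorting by event_id, it is unaffected by the chain-order≠id-order inversion described under #117; an orphaned second root such as the one in #118 shows up as roots=2 with walk_len below total.
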
| # | Closed | Title |
|---|---|---|
| #119 | 2026-06-20 | Off-server WAL state-builder drain stranded executions after a worker restart — FIXED: rebuild the in-memory index from the retained noetl_events WAL on every boot. The authoritative drain used a durable noetl_state_builder consumer whose cursor persists across restarts while the in-memory WalEventIndex rebuilds empty → the cursor outran the fresh index → build_spine_to(expected_head) permanently Incomplete → off-server execs looped offserver_retry and never completed (this is what hid the #118 symptom). Fix (worker-only, inside NOETL_STATE_BUILDER=offserver; PROD runs the in-server drive so untouched): the drain defaults to an ephemeral DeliverPolicy::All consumer rebuilding the full index from the retained WAL on every boot (no persisted cursor to outrun; also correct for >1 worker pod); instant revert NOETL_STATE_BUILDER_DURABLE=1; proof = index rehydrated… log + new noetl_worker_state_builder_indexed_executions gauge; never reintroduces a noetl.event scan. noetl-worker v5.40.2 (worker#123, 48b0bde). Gate-ON kind: forced mid-flight pod delete --force → new pod index rehydrated … indexed_executions=17 wal_events=200; single-replica 6/6 stress + multi-replica 21 execs COMPLETE, zero scans. |
| #118 | 2026-06-20 | Single-replica off-server terminal-finalize chain fork (a duplicate playbook.completed orphaned the chain as a NULL-prev_event_id 2nd root) — FIXED: a bounded FinalizedGuard (exactly-one-terminal-per-execution) suppresses the duplicate at emit_events before the chain linker. A straggler drive on a materializer-lagged single replica re-drove off the lagged WAL (state not terminal yet) and emitted a 2nd terminal that linked to the now-evicted head → 2 roots + a benign event-scan. The guard makes the first terminal win; gate-off byte-identical (a duplicate never occurs on the synchronous in-process drive); metric noetl_terminal_dedup_total{suppressed}; rig gains a HARD terminals==1 assertion. Absent under multi-replica execution-affinity (#116 serializes finalize to the owner). noetl-server v3.39.1 (server#253, c5f8cb2) + e2e (e2e#73, fe97d92). Gate-ON kind (unblocked by #119): single-replica 6/6 stress iterations / ~126 execs every chain roots=1(incl. terminal)/terminals=1/zero-scan; multi-replica 21 execs clean. |
| #117 | 2026-06-20 | Off-server from_events spine ordered by event_id broke fan-in under a chain-order≠id-order inversion (high-concurrency fan-out reduce wedge) — FIXED: order the spine by the prev_event_id chain walk + walk from the real tip (expected_head). worker v5.40.1 (worker#122, baeae78) + e2e (#72, cdf1768). 2-replica affinity gate-ON stress 6/6 iterations / 108 execs COMPLETE; 15 real id-inversions all fired the fan-in reduce. The residual single-replica terminal-finalize chain-linking race is now FIXED as #118. |
| #113 | 2026-06-19 | Worker-driven orchestrate drive stalled when the drive result / accumulated context crossed the inline budget — fixed at the source by #115 Phase 1 (references-in-state consume side). All 9 of 9 stalled core fixtures now reach playbook.completed gate-ON (PUBLISH_ONLY + off-server drive + materializer sole-writer), server v3.30.0 + worker v5.36.0: output_select / storage_tiers / lease_expiry / pipeline_heavy_payload / save_edge_cases / large_result_extraction / http_to_postgres_{direct,simple,bulk_python}. Max next-command context across the 9 = 412KB (save_edge_cases); 0 __orchestrate__ event rows on every run; materializer lag = 0. The #113 decode fix (server#241 v3.29.4) + #114 offload cap (server#242 v3.29.5) stay as safety nets; the growth that pushed past budget is gone (worker selective resolve keeps foreign bulk out of the render; refs_in_state-true keeps the drive state + command.issued bounded). Landed via worker#117 + server#243. |
| #114 | 2026-06-19 | Off-server drive oversized command.issued (full upstream context embedded) — resolved by #115 Phase 1 (refs_in_state default true). With the worker selective consume side in place, refs_in_state defaults true so the next command's render_context carries {reference, extracted} (small) instead of the full upstream payload — command.issued no longer approaches NATS max_payload. The #114 offload safety cap (maybe_offload_command_context, 512KB) stays as belt-and-suspenders. Verified gate-ON: all 4 of this issue's fixtures COMPLETE; max next-command context 412KB across the full 9. Landed via server#243 + worker#117. |
| #112 | 2026-06-18 | Worker /dev/shm SIGBUS — k8s default 64 MiB tmpfs vs the 256 MiB Arrow IPC cache budget. Every worker process (Rust noetl-worker + legacy Python worker) allocates an Arrow IPC shared-memory cache at init (NOETL_IPC_CACHE_BUDGET_BYTES, default 256 MB) backed by POSIX shm on /dev/shm; the k8s container-runtime default /dev/shm is a 64 MiB tmpfs, so under shm-heavy load the cache writes past 64 MiB, the store page-faults against the full tmpfs, and the worker dies with SIGBUS (exit 135) and crash-loops. Surfaced during #103 CQRS kind validation on the system pool; a transient live fix was reverted, leaving the committed manifests latent. Fix (ops#193) gives every worker deployment a memory-backed /dev/shm (emptyDir medium: Memory, sizeLimit: 320Mi > budget), pins NOETL_IPC_CACHE_BUDGET_BYTES=268435456 next to the sizeLimit so the two can't drift, and raises the memory limit to 768Mi (tmpfs is charged to the pod cgroup). Applied to all 7 worker manifests (system / shared-rust / subscription / subscription-runtime / Python-cpu + 2 prod variants). Kind-validated on the system pool: reproduced SIGBUS (exit 135) on the 64 MiB tmpfs → after fix /dev/shm is 320 MiB and a full 256 MiB write completes (exit 0, peak 256M/320M) with the pod healthy (restarts=0, no OOM); cluster restored to baseline. ai-meta → ops f4df4c1 + wiki worker deployment-specification. |
| #110 | 2026-06-18 | Retired the in-server orchestrate shadow + the wasmtime server dependency. The separable server-slimming follow-up to #108 (closed): the in-server shadow was the slice-4 cutover-confidence harness (ran the system/orchestrate plug-in inside the server via an embedded wasmtime host, diffed 529/0 against the in-process drive). With the worker-driven drive default-on + proven, the live drive uses the worker's wasmtime host — never the server's — so the shadow was dead weight. server#234 removed orchestrate_shadow.rs, the orchestrate-shadow cargo feature + the optional wasmtime dep (the cranelift/wasmtime tree — ~1000 Cargo.lock lines — fell out; cargo tree -i wasmtime now matches nothing), the trigger_orchestrator_inner shadow hook, the main.rs boot loader, the NOETL_ORCHESTRATE_PLUGIN_SHADOW config field, and the noetl_orchestrate_shadow_total metric. Kept run_state + NOETL_ORCHESTRATE_PLUGIN_DRIVE (default true). refactor: = no version bump (stays v3.28.0). Kind smoke on a 4-page cursor loop, both drive modes: COMPLETED with __orchestrate__ rows in noetl.event = 0 (worker-driven: 10 drive cmds on the system pool, dispatched=applied=10, event_suppressed=30); shadow metric gone from /metrics. ai-meta → server f3043c9 + wiki deployment-specification. See Umbrella-System-Pool-Design. |
| #108 | 2026-06-18 | 🎯 The orchestrator drive runs OFF-SERVER on the system pool, and it's now the DEFAULT. Step 2 of the dissolution (#107) — the server's brain moved onto the worker pool as the system/orchestrate WASM plug-in. Arc: slices 1–3 (off-server drive, server#229) → 4a (cursor+fan-out validated) → 4b + follow-up a (the __orchestrate__ meta-command touches noetl.event zero times, server#230/#231) → (b) system-pool isolation (server#232 + worker#114 + ops#191) → (c) the deliberate default-flip (server#233, v3.28.0): NOETL_ORCHESTRATE_PLUGIN_DRIVE defaults true. Gated on a scale soak (kind): a 694-drive cursor+fan-out run COMPLETED with __orchestrate__ rows in noetl.event = 0 and all 694 drives claimed on the system pool (shared pool got only the real steps); the default-on path (no env var) reproduced the identical shape (361 drives, isolated, 0 burst); 15/15 regression fixtures green; revert (=false → in-process drive) verified. ai-meta → server 80cc0e6 + worker 437b0be. See Umbrella-System-Pool-Design. |
| #109 | 2026-06-17 | Event-ABI round — orchestrator + evaluate moved into the wasm core (#108 final slices). Slice 3 (server#223) relocated orchestrator/evaluate to noetl-orchestrate-core; the whole drive (renderer, playbook, commands, evaluator, state, orchestrator switch) now compiles native + wasm32-unknown-unknown. evaluate reads the pure core::event::Event; db::Event converts at the trigger_orchestrator boundary. 122 core + 565 server tests green, 0 WASI imports, kind PFT 10×1000 clean (0 errors, 0 restarts). Plug-in round (data-plane ABI + command_emit + scheduler + shadow-diff) tracked under #108. |
| #106 | 2026-06-17 | EventEnvelope rejects timezone-less timestamps — blocked the CQRS materializer. The tailer publishes to_jsonb(noetl.event row) and created_at is timestamp WITHOUT time zone, so the timestamp had no offset; EventEnvelope.timestamp: Option<DateTime<Utc>> rejected it with the misleading premature end of input. Fixed (server#217) with a flexible deserialize_with (RFC3339 + tz-less→UTC). Validated live: materializer projects {projected:0, duplicates:20} = byte-identical. The #103 shadow gate is green. |
| #105 | 2026-06-17 | Plug-in compilation & hot-reload: WASM-compiled system playbooks + managed library. The WASM plug-in runtime is complete + fully live-proven on kind: dispatch (tool_kind: "wasm" → wasmtime host → digest from registry → run), capability flush (object_put → noetl.object_store), real data flow (input→args, worker#110), and hot-replace (worker#112) — republishing the same path@version swaps the running pool's behavior with 0 restarts. Landed across worker#93/#95/#97/#99/#101/#103/#105/#107/#108/#110/#112 + server#210/#212/#214 + tools#68 + docs#181. Remaining (executor: author sugar; compiled materialiser port) deferred to #104/#103. See Umbrella-WASM-Plugin-Compilation. |
| #100 | 2026-06-15 | Cursor/claim loop mode: loop.spec.mode: cursor in the Rust orchestrator. noetl-server v3.8.0 (server#196) cursor loop engine + output namespace; noetl-tools v3.10.1 (tools#66) postgres -- comment splitter fix; worker#88 dep bump. test_pft_flow_v2 all_passed:true, 5/5 per data type on kind against the throttling/error-injecting paginated-api. See Umbrella-Cursor-Loop-Mode. |
| #99 | 2026-06-14 | Transfer tool: Snowflake↔Postgres both directions, with credential-alias resolution. Both transfer arms implemented in noetl-tools v3.10.0 (tools#65); worker v5.22.0 (worker#87) pre-resolves source.auth/target.auth aliases; SF→PG coerces string cells via $n::text::<udt> + reformats Snowflake-epoch timestamps to RFC3339; PG→SF generates SQL-escaped INSERTs. e2e fixture migrated (e2e#58). Kind-validated bidirectionally against the live sf_test account: every step COMPLETED, real types preserved. |
| #90 | 2026-06-12 | Subscription / listener tool (RFC) — all 7 phases shipped + live-proven. Bounded-drain tool (Mode A) → kind: Subscription continuous runtime (Mode B) + header-directive engine → gateway push-ingress (Mode C) + auth-gated directive trust → store-and-forward spool + circuit breaker → out-of-cluster Cloud Run + gcs spool → CLI local noetl subscribe → Phase 7 scale hardening. Final phase: server v3.5.0 (server#189) POST /api/execute/batch (N→N, partial-failure contained) + opt-in exactly-once dedup window (noetl.subscription_dedup, bounded-by-age, race-safe, default off); worker v5.19.0 (worker#79) batch dispatch + dedup opt-in + per-subscription rate limits (token-bucket fetch-side backpressure → source keeps backlog, no loss); ops (ops#176) + e2e (e2e#48); no tools change. Live on kind: batch 12→12 COMPLETED on the subscription pool + per-message traceparent; dedup duplicate→1 execution + subscription.message.deduplicated; rate-limit engaged + 10/10 → executions (no loss). Refinement follow-ups tracked: #91–#94 + tools#57. ai-meta → server 7b217d8 + worker 7531f4a + ops 6db69b9 + e2e 203593b. |
| #89 | 2026-06-11 | JSON null re-injected via {{ step }} serialized as JS undefined — fixed in the server template renderer — the cursor pagination fixture's terminal page returns next_cursor: null; re-injecting the whole {{ fetch_page }} envelope into the next step's input rendered that field as the bare token undefined, producing invalid JSON. render_to_value then failed serde_json::from_str and returned the entire envelope as a raw string, so the consuming Python step got response as a str and crashed ('str' object has no attribute 'get'). Root cause was the server, not the worker (the issue's hypothesis): src/template/jinja.rs::json_value_to_minijinja maps JSON null→Value::UNDEFINED and minijinja's map repr emits undefined; the server's renderer was a divergent copy missing the \| tojson retry the noetl-tools engine already had. Fix (server#177, v3.0.6) adds that retry — a lone {{ expr }} rendering container-shaped-but-invalid JSON re-renders with \| tojson; minijinja_to_json maps undefined/none → JSON null. 5 new regression tests; 619 lib + 8 parity green. Kind-validated: cursor walks all 4 pages, terminal next_cursor: null handled, validate_results collects 35 (first_id=1, last_id=35, success) — matching offset. ai-meta → server 8e17fbe. |
| #88 | 2026-06-10 | e2e offset/cursor pagination fixtures read response.body.* not response.data.* — the Rust http tool nests the parsed JSON payload under body ({{ fetch_page }} → {body, headers, status_code}), so the fixtures' response.get('data', {}) resolved to {}, has_more/next_cursor defaulted falsy, and the loop exited after page 1 even with the post-#85 loop machinery correct. Fixed both check_pagination steps to response.get('body', {}) via e2e#40. Kind-validated against the live paginated-api test-server (Rust server/worker :dev): offset walks 0→10→20→30, has_more T/T/T/F, users 10/10/10/5, validate_results success 35 (first_id=1, last_id=35), playbook.completed COMPLETED. cursor path-fix correct + walks all 4 pages (Mg==→Mw==→NA==→null, 35 events fetched) but final collection blocked by a distinct worker bug → filed #89 (terminal next_cursor: null serialized as JS undefined). Other pagination fixtures (retry/max_iterations/pipeline*/loop_with_pagination) share the same envelope-key assumption over /api/v1/assessments\|flaky ({data, paging}) — flagged, left for follow-up. ai-meta → e2e 72a7525. |
| #85 | 2026-06-10 | Workflow-arc loops now advance across iterations + terminate cleanly — built on the dispatch-guard re-entry layer, two coupled orchestrator fixes via server#176 (v3.0.5). (1) Durable event-sourced loop-ctx propagation: step-level set: ctx.* loop variables were recomputed per pass and reverted to the workload default (loop thrashed 0,0,1,0,1,2,…); root cause was start's initializer set re-firing every pass in random HashMap order against check_pagination's advancing set. Fix persists each completion's rendered set: as a ctx.updated event (latest-wins fold + build_context overlay), emitted once per completion keyed by the stable completion event_id (not Utc::now()-fallback completed_at). (2) Loop-exit hang: the exit branch was marked step.skipped on a loop-body-completion pass (recency-based branch-point detector missed it), turning it terminal so the exit dispatch was suppressed; fixed with a structural loop-branch-point test. 614 lib tests (6 new; 2 verified to fail without their guard); clippy-clean. Kind-validated: counter loop advances 0→1→2→3 + terminates; real-http offset pagination advances 4 pages collecting 35. (Separate finding, filed follow-up: the e2e offset/cursor fixtures read response.data.users but the Rust http tool nests it at response.body.users.) ai-meta → server e519fdc. |
| #87 | 2026-06-10 | Multi-tool step: a later sub-tool can now reference an earlier sibling's output — in a tool: [list] step, each sub-tool's result was stored for the aggregated output but never injected into the running context, so {{ <label>.<field> }} rendered empty (masked in quoted positions; a syntax error at or near "," in unquoted numeric SQL positions, e.g. save_edge_cases test_large_payload). Fixed via tools#48 (v3.1.1): inject each sub-tool's result under its label (with a synthetic .data self-ref) so later siblings resolve it. Worker adopts 3.1.1 via worker#69. Kind-validated: save_edge_cases test_large_payload → record_count = 100 (no syntax error), save_delegation_test clean. ai-meta → tools 76f942a + worker b97f642. |
| #83 | 2026-06-10 | Orchestrator fan-in barrier deadlocked workflow loops — build_incoming_arcs counted a loop back-edge (check_pagination → fetch_page) as an upstream so the barrier deferred the loop head forever. Fixed via server#175 (v3.0.4): exclude back-edges via a new forward_reachable helper (genuine fan-in unaffected). Kind-validated; fanout_reduce green. ai-meta → server 480ba72. |
| #84 | 2026-06-10 | Orchestrator never populated event.name → loop.done arc gates always skipped — when: {{ event.name == "loop.done" }} (10+ fixtures) never matched, so in-step loop: steps hung after completion. Fixed via server#175 (v3.0.4): inject event.name = "loop.done" into a completed loop step's next-arc context. Kind-validated (test_pagination_basic completes). ai-meta → server 480ba72. |
| #86 | 2026-06-10 | e2e fixtures: duckdb tool field is command/query, not commands — commands: (plural) failed pre-dispatch (missing field 'query' / malformed tool config). Renamed across 4 storage/gcs fixtures via e2e#39; save_all_storage_types green. ai-meta → e2e b0a5c85. |
| #78 | 2026-06-10 | noetl-worker: pre-dispatch errors now emit terminal call.error instead of hanging — credential-alias resolution + tool-config deserialization failures used to ?-propagate out of execute_with_server_url; the dispatch loop only logged them, so the execution sat at command.started forever. Fixed via worker#68 (v5.15.1): typed CredentialResolutionError (terminal AliasNotFound/Invalid vs retryable Transient) + CredentialHttpError carrying the HTTP status so classify_fetch_error decides retryability by code (terminal: 404/400/401/403/500; retryable: 408/429/502/503/504 + transport), and handle_predispatch_failure emits call.error + command.failed. Diagnosis correction: the live pg_noetl_k8s repro is an HTTP 500 "Decryption failed: aead::Error", not a 404 (the worker has no /api/keychain/ call). Kind-val GREEN: test/postgres now → call.error/command.failed/playbook.failed (no hang); hello_world still completes. ai-meta → worker 99e2c66. |
| #80 | 2026-06-10 | container_callback chain green end to end — fixing the watcher's missing curl (retired bitnami/kubectl → alpine/k8s:1.30.3, ops#168) surfaced two more layered bugs. Server: container-callback call.done insert targeted a non-existent attempt column → HTTP 500; fixed to the deployed noetl.event schema (server#173, v3.0.3). OOM path: the watcher only read Job-level conditions so failed_oom could never fire — added pod-level OOMKilled→failed_oom / ImagePullBackOff→failed_image_pull classification + RFC3339 completed_at (bare now was returning HTTP 422) (ops#168); and the e2e fixture's calloc-lazy bytes(40MiB) never dirtied pages so it never OOM'd (e2e#38). kind_validate_container_callback.sh both probes GREEN — happy_path → succeeded, oom → failed_oom. ai-meta → ops cacc513 + server 5d2cf58 + e2e 6aaf06e. |
| #79 | 2026-06-10 | e2e kind-val runner scripts updated to current noetl CLI surface — both scripts/kind_validate_*.sh aborted on unrecognized subcommand 'playbook'; the validation logic + event taxonomy were intact, only the invocation layer drifted. Fixed via e2e#37: register playbook / exec <catalog-path> --runtime distributed --json / status --json / event log via noetl query (.result, order by event_id); fail-fast CLI-surface guard. fanout_reduce PASS start-to-finish on kind (server-rust v3.0.1); container_callback drives cleanly and stops at the watcher curl gap (#80). ai-meta → e2e a3594b3; wiki Kind-Val Runners. |
| #82 | 2026-06-10 | GUI: credential View/Edit recovered for pre-wallet (legacy-encrypted) records — Secrets Wallet (#61) made credential storage forward-only, so pre-wallet records 500 on decrypt and the GUI View/Edit flow dead-ended on a generic toast. Fixed via gui#36: View explains the cause + points to Edit; Edit reopens with the list-row metadata + a warning banner so re-saving re-seals the record under the current wallet. ai-meta → gui 8cacc9e (v1.11.1). |
| #81 | 2026-06-10 | Container tool unusable: ToolSpec.command (String) vs ContainerConfig.command (Vec) type contradiction — landed via server#172 (v3.0.2). ToolSpec.command Option<String> → Option<serde_json::Value> so the container tool's array command decodes server-side + passes through to the worker's Vec<String> (scalar stays a JSON string for shell/db tools); ToolCall::from_spec forwards verbatim. 2 regression tests; clippy clean. Kind-val GREEN: server accepts command: ["/bin/sh","-c"], worker creates the K8s Job, Job reaches Complete 1/1. Chain counter-bump validation stays gated on #79/#43. |
| #77 | 2026-06-09 | Explicit input:/set: forward-only data binding — BREAKING v3.0.0 across noetl-tools + noetl-server. All 5 PRs merged: tools#45 (v3.0.0) + server#169 (v3.0.0) + e2e#35 (13 fixtures migrated) + cli#57 (v4.10.0, executor 0.5.0) + worker#66 (dep bump). Kind-val GREEN. |
| #76 | 2026-06-08 | Sequential-mode iterator dispatch — LoopMode enum (Sequential default / Parallel), StepInfo.iterations_dispatched guard. Landed via noetl/server#166 (v2.62.0). First Claude-direct Rust PR under agents/rules/handoff-routing.md. Kind-val GREEN: test/loop COMPLETED 5/5 + iterator_save_test COMPLETED 4 steps. |
| #70 | 2026-06-08 | noetl-server missing PUT /api/result/<execution_id> endpoint — landed via noetl/server#160 (v2.58.0). Durable result-store endpoints (PUT + GET /api/result/<eid>/<step>/<ref>) ported to the Rust server. Kind-val GREEN: output_select_test reached playbook.completed with test_result: "PASSED". |
| #69 | 2026-06-08 | noetl-worker: over-budget call.done returned reference-only envelope, missing inline _ref for downstream {{ step._ref }} templates. Landed via noetl/worker#63 (v5.15.0): build_call_done_result's durable-success branch now embeds context: { data: { _ref: <noetl://...> } } alongside the existing reference block. Kind re-val pending noetl/ai-meta#70 (server-side PUT /api/result/<eid> endpoint missing — falls back to degraded shm-only path where there's no noetl:// URI to embed). |
| #68 | 2026-06-08 | noetl-tools: ArtifactTool config required input but server pipeline emitted args (the post-#56 normalized name) — landed via noetl/tools#40 (v2.23.1) + worker dep bump via noetl/worker#62. One-line #[serde(alias = "args")] on ArtifactConfig.input accepts both shapes. Re-val surfaced a downstream _ref/output_select gap → filed as noetl/ai-meta#69. |
| #67 | 2026-06-08 | Rust orchestrator hangs on mode: exclusive routing — untaken sibling never emits step.skipped, R4 fan-in barrier deadlocks. Landed via noetl/server#159 (v2.57.2). Three-part fix: (1) evaluator::evaluate_next_transitions surfaces unmatched siblings as not_matched_with_target; (2) orchestrator::process_in_progress two-pass emit-skipped-then-dispatch (HashMap-order-independent); (3) +2 unit tests (lib 568/0/0). Kind-val GREEN — comprehensive_test.yaml reached playbook.completed in ~4s (was hanging forever pre-fix). |
| #66 | 2026-06-07 | Rust orchestrator: cross-step {{ step.data }} template resolves to None — landed via noetl/server#158 (v2.57.1). WorkflowState::build_context injects a self-referencing .data key on extracted user_data, guarded so the task_sequence flatten path's existing .data stays intact. 2 new unit tests + 566/0/0 lib. Surfaced by #65 kind-val; concrete repro kind execution 322087210360770560. |
| #65 | 2026-06-07 | noetl-tools: external python script loaders (file/gcs/http source types) + legacy main() function convention — landed via tools#38 (v2.22.0) + tools#39 (v2.23.0); kind-val GREEN on the live worker (execution 322087210360770560 reached playbook.completed in ~6s; loaded main(name, count) returned the expected payload). noetl/worker#61 OPEN+mergeable to bump the worker pin. Surfaced finding: noetl/ai-meta#66. |
| #43 | 2026-06-07 | R-3 Phase C-2: container tool kind — design callback pattern for K8s Job dispatch. All four Rust rounds shipped (1 k8s-watcher ops@8892043 / 2 callback endpoint server v2.48.0 / 3 Tool::Container tools v2.21.0 / 5 kind-val rig e2e@17de21d). Round 4 (Python parity) parked per Rust-only direction. Worker-side pending_callback adoption is a coordinated follow-up. |
| #64 | 2026-06-07 | noetl-tools: artifact tool kind missing from Rust registry — landed via tools#35 v2.20.0 (thin ArtifactTool adapter translating the Python-era YAML shape into a ResultFetchTool call) |
| #61 | 2026-06-07 | Secrets Wallet (Rust): envelope encryption + KMS/secret-provider plugins + sealed worker delivery + distributed multi-region resolution — all named phases + 6d.X cloud dynamic providers shipped; umbrella feature-complete |
| #54 | 2026-06-06 | Phase F R5 — regression + e2e validation under sharded topology — closed at the umbrella level (Tier 1 + Tier 3 + Tier 4 e2e all GREEN on the Rust-only stack; subsequent regression findings filed as their own ai-task issues — #62/#63/#64/#65/#66 all now also closed) |
| #62 | 2026-06-05 | noetl-server: /api/executions list query candidate-first rewrite + status-drift fix (server#99 v2.28.1) — 6.5 s → 0.015 s (~430×) |
| #63 | 2026-06-05 | noetl-tools: python tool accepts nested script.source.code (inline) — tools#33 v2.19.3 + worker#54 adoption, test_script_loading kind-validated; external loaders split to #65 |
| #60 | 2026-06-04 | Rust orchestrator template context doesn't expose step data for next.arcs / step.when |
| #59 | 2026-06-04 | Rust orchestrator doesn't resolve tool.kind:workbook references to inline actions |
| #58 | 2026-06-04 | Rust orchestrator doesn't emit playbook.failed on command.failed — executions stall |
| #57 | 2026-06-04 | Rust server rejects flat (name-as-field) pipeline shape in v10 playbook YAML |
| #56 | 2026-06-04 | Canonical v10 playbook workload + input alias unrecognized on Rust stack |
| #55 | 2026-06-04 | Rust server EventEmitRequest.execution_id wire-type drift blocks worker traffic |
| #53 | 2026-06-04 | Rust worker → Rust server e2e compatibility (Phase D R3b/R3c terminal completion gap) |
| #52 | 2026-06-03 | noetl-tools: add js_consume operation to nats tool kind |
| #51 | 2026-06-03 | Fix system/outbox_publisher.yaml auth block to use AuthResolver pattern |
| #50 | 2026-06-03 | Phase 2.a — system/outbox_publisher playbook + routing + auth wiring |
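
The retryability split described in the worker#68 entry above (terminal: 404/400/401/403/500; retryable: 408/429/502/503/504 plus transport failures) can be sketched in Rust as follows. This is a minimal illustration, not the worker's actual code: the `FetchError` and `Retryability` types and the `classify` function are hypothetical names, and treating unlisted status codes as terminal is an assumption the changelog does not state.

```rust
/// Hypothetical outcome of classifying a credential-fetch failure.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Retryability {
    /// Emit a terminal error for the command; do not retry.
    Terminal,
    /// Safe to retry the fetch.
    Retryable,
}

/// Hypothetical error shape: an HTTP response status or a transport failure.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum FetchError {
    Http(u16),
    Transport(String),
}

/// Decide retryability by status code, using the lists from the changelog entry.
pub fn classify(err: &FetchError) -> Retryability {
    match err {
        // Transport failures (no HTTP response at all) are retryable.
        FetchError::Transport(_) => Retryability::Retryable,
        // Timeouts, rate limits, and gateway errors are retryable.
        FetchError::Http(408 | 429 | 502 | 503 | 504) => Retryability::Retryable,
        // Client errors and a plain 500 are terminal.
        FetchError::Http(400 | 401 | 403 | 404 | 500) => Retryability::Terminal,
        // Assumption: any status not listed in the changelog fails fast
        // rather than risking an execution that never terminates.
        FetchError::Http(_) => Retryability::Terminal,
    }
}
```

Under this sketch, the HTTP 500 "Decryption failed" repro mentioned in that entry classifies as `Terminal`, which matches the reported outcome of a failed execution instead of a hang, while a 503 or a dropped connection would be retried.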
See the Repo Map page for the full submodule inventory. Quick view of the production repos and their current versions (2026-06-20):
| Repo | Role | Version | Recent |
|---|---|---|---|
| noetl/server | Rust control plane | v3.39.0 |
✅ #116 program-scale step 2 — execution-affinity single-owner WRITE ORDERING (multi-replica gate-ON validated) (server#252, v3.39.0 5e00d0a): closes the chain-fork race step 1 left open (the command.issued prev-read + head CAS-advance are two non-atomic steps → concurrent cross-replica emits forked the chain). Affinity routes every trigger for an execution (POST /api/events, which fires the drive) to the single replica that sharding::ShardConfig::owns(execution_id) owns (stable XxHash64); non-owner forwards a reverse-proxy POST (one-hop loop guard, degrade-to-local). On the owner the single-process drive lock + in-memory ChainHeads make the read→advance atomic, no distributed lock; KV is the genesis/handoff vehicle (owner resolves LOCAL → kv_remote_hit→0). New src/affinity.rs; NOETL_EXECUTION_AFFINITY/NOETL_PEER_URL_TEMPLATE/NOETL_SHARD_INDEX_FROM_HOSTNAME (all default off, prod unchanged); metric noetl_execution_affinity_total{outcome} (forwarded_ok = proof). Multi-replica gate-ON kind (2-replica StatefulSet, offserver+audit_only+nats_kv+affinity+publish_only): rig PASS — chains roots=1/dangling=0/walk==total (NO fork), forwarded_ok +9, never-scan + sole-writer across replicas, executions COMPLETE; single-replica unchanged; 595 tests + clippy green. Follow-up #117: off-server from_events spine ordered by event_id wedges fan-in under a chain-order≠id-order inversion (high-concurrency fanout). Prior: ✅ #115 program-scale step 1 — multi-replica coherence DATA LAYER (NATS-KV-backed ChainHeads + ExecDescriptor); execution-affinity STAGED (server#251, v3.38.0 8f39a79): NOETL_REPLICA_COHERENCE=nats_kv (default local, prod unchanged) backs the off-server drive's watermark + descriptor with JetStream KV buckets — head advance = CAS (one chain under concurrent emits), descriptor = CAS merge; in-process maps = write-through cache / degraded fallback. 
src/coherence.rs; ChainHeads/ExecDescriptors async; metric noetl_replica_coherence_total{structure,op,outcome} (proof kv_remote_hit). Kind: single-replica nats_kv bit-for-bit parity with local (all topologies COMPLETE, clean chains, scans +0); 2-replica proved cross-replica resolves work. Necessary-not-sufficient: 2+ replicas still fork the chain (non-atomic issuing_event-read vs head-advance) → execution-affinity STAGED as step 2 (substrate in src/sharding.rs). Prior: ✅ #115 Phase 5 — atomic-working-item context (tenet 6): the drive hands a worker only its minimal declared slice (server#250, v3.37.0 a96ade8): new orchestrate-core::input_binding (analyze/project_context over minijinja undeclared_variables, conservative bail-to-full) + CommandBuilder::with_atomic_item_context; NOETL_ATOMIC_ITEM_CONTEXT (default false, prod unchanged); metric noetl_atomic_item_context_total{outcome}. Builds on #77 (Explicit Input Binding, CLOSED). Gate-ON kind-validated: flag-ON consumer ctx narrowed to the one declared key, COMPLETED; flag-OFF full ctx (back-compat). Prior: ✅ #115 Phase 6 — retire the hot-path noetl.event read class; the table is AUDIT-ONLY (server#249, v3.36.0 b71ca1d): NOETL_EVENT_READ_PATH=event_scan|audit_only (default event_scan, prod unchanged). Under audit_only, the remaining lifecycle readers of noetl.event (the WHERE execution_id replay class outside the drive — get_catalog_id per-ingest, inherit_parent_trace, subscription dedup-audit + container-callback catalog/existence) serve from the in-memory execute-time ExecDescriptor; a cold descriptor (post-terminal straggler after eviction / restart) resolves catalog_id from noetl.command (the synchronous queue) — never a noetl.event scan. New proof metric noetl_event_hotpath_reads_total{site,outcome}. 
Gate-ON kind-validated: hot-path scan Δ0 (served_descriptor +96 + served_command +3), drive state_build_total Δ0 + event_scans Δ0 ⇒ ZERO noetl.event scans anywhere on the hot path, end-to-end; linear/loop/fan-out/output_select COMPLETE; sole-writer + lag-0; audit still works (direct SELECT + status + replay event_count=25); 585 tests + clippy green; default event_scan, prod unchanged. The RFC never-scan end state (tenet 3) is reached under the flag. Prior: ✅ #115 Phase 4 REMAINDER — stateless off-server drive edge (zero state rebuild + zero noetl.event reads) (server#248, v3.35.0 6e30fc3): removes the residual server-side chain-walk bookkeeping on the drive path. Under NOETL_STATE_BUILDER=offserver a warm execute-time ExecDescriptor (catalog_id + routing seeded at playbook_started; terminal stamped at the emit_events chokepoint) lets trigger_orchestrator_inner route the system/orchestrate command WITHOUT building WorkflowState — expected_head from the in-memory ChainHeads, trigger_event_id passed so the worker resolves the trigger type off its WAL, NO server-built state (__stateless__). apply_worker_orchestration sources catalog_id+routing from the descriptor (skips cold-rebuild) + evicts on terminal worker-built state. Cold descriptor (restart) falls through to the server-built path (re-seeds) → chain_walk + event_scan stay fallbacks. Proof: noetl_orchestrate_drive_total{dispatched_offserver_stateless,applied_stateless} advance while noetl_state_build_total stays flat. Gate-ON kind-validated: state_build_total Δ0, event_scans Δ0, dispatched_offserver_stateless +3 / applied_stateless +3, linear(13)/loop(62)/fan-out(25)/output_select(31) COMPLETE, offserver==server parity, sole-writer 25==25, lag-0; 583 tests + clippy green; default server, prod unchanged. Completes #107 step 2 server-side. 
Prior: ✅ #115 Phase 4 DRIVE CUTOVER — mark the off-server drive command + carry expected_head (server#247, v3.34.0 f0922bd): under NOETL_STATE_BUILDER=offserver (default server, prod unchanged) trigger_orchestrator_inner marks the system/orchestrate command __offserver_build__ + carries execution_id + expected_head (the staleness watermark) so the worker self-sources the drive state from the WAL; the server-built state rides along as the worker's incomplete-chain fallback. Gate-ON parity rig PASS (offserver==server, fan-in exactly-once, server scans 0). 580 tests + clippy green; no prod default changed. Prior: ✅ #115 Phase 4 — NOETL_STATE_BUILDER=offserver|server flag scaffold (default server) (server#246, v3.33.0 3e6006d): the server-side flag for the off-server state-builder drive cutover. server (default, prod unchanged) builds WorkflowState in-process; offserver routes the drive to obtain state from the pool-side off-server builder (worker state_builder, reading the noetl_events WAL). Flag only — the offserver drive-cutover wiring is staged (pool-side builder landed in worker v5.37.0). 2 config tests; no prod default changed. Prior: ✅ #115 Phase 3 MERGED — chain-walk state builder (flagged, default-off) (server#245, v3.32.0 8338417): behind NOETL_STATE_BUILD_MODE=chain_walk the drive reconstructs WorkflowState by following the one-level prev_event_id chain head→root (head from the in-memory ChainHeads watermark, then (execution_id,event_id) PK point-lookups — never a WHERE execution_id scan of noetl.event) and feeds the same from_events (orchestrate-core unchanged → parity by construction). Falls back to event_scan (default) on cold-head / materializer-lag / non-genesis so correctness is never sacrificed; NOETL_STATE_BUILD_PARITY_CHECK shadow-builds both ways in one REPEATABLE READ snapshot and asserts equal. 
New metrics noetl_state_build_total{mode,outcome} + noetl_state_build_event_scans_total (no-scan proof, 0 under chain_walk) + noetl_state_build_chain_hops + noetl_state_build_parity_total{result}. Gate-ON kind-validated: parity 41/41 MATCH, event_scans_total=0 / 1064 PK hops / 0 fallbacks, all topologies COMPLETE, sole-writer + lag-0, 577 lib tests + clippy green. Self-merged; prod default unchanged. Phase 4 (off-server builder + WAL cache) builds on this. Prior: ✅ #115 Phase 2 MERGED — one-level prev_event_id event chain (server#244, v3.31.0): each noetl.event carries prev_event_id (the immediately-previous event in causal order) + each noetl.command the issuing-event link, stamped at the emit chokepoint emit_events from a per-execution chain-head watermark (ChainHeads) — covering drive events + command.issued + worker-lifecycle on both gate paths, the materializer persisting it. Additive (nothing reads the link yet — that is Phase 3, in progress). Chain-correctness kind-proven walkable/1-root/no-gap/no-scan across 6 gate-ON topologies; 573 lib tests + clippy green. ai-meta pointer afdb365. Prior: ✅ #115 Phase 1 — surface _ref/_store on kept refs + refs_in_state default true (server#243, v3.30.0): hydrate_result_references keep_refs branch merges ref/store/uri onto the bounded extracted summary it surfaces as context.data (so {{ step._ref }} lazy-load + {{ step._ref is defined }}/{{ step._store }} predicates resolve off the summary without bulk), and refs_in_state now defaults true — references stay out of state + commands (drive state + command.issued bounded; the worker selective consume side landed in worker#117). Closed #113 + #114; kind gate-ON all 9 stalls COMPLETE. 
Prior: 🐞 #114 — offload oversized command context (server#242, v3.29.5): a command.issued context over NOETL_COMMAND_CONTEXT_MAX_BYTES (512KB) is offloaded to noetl.result_store with a {__context_ref__} marker; get_command/claim_command resolve it before the worker sees the command (metrics context_offloaded/context_ref_resolved) — so the published event stays under NATS max_payload and never wedges the publish-only gate. Kind gate-ON: rig PASS, every command.issued event <1MB, 6 of #113's 9 fixtures COMPLETE; chose ref-on-oversize over refs_in_state=true (the latter breaks bulk-consuming fixtures — consume side not impl). Remaining 3 fixtures + the cutover gated on the refs_in_state consume side (#101). Prior: 🐞 #113 — off-server drive: recover offloaded drive result + stop drive on cancel (server#241, v3.29.4): apply_worker_orchestration resolves+decodes an OFFLOADED __orchestrate__ drive result (over the 100KB inline budget → durable reference.ref, no inline output_b64) instead of dropping it → non-convergent re-loop (new noetl_orchestrate_drive_total{stage=ref_resolved}); cancel now stops the drive (match underscore playbook_cancelled + ExecutionState::is_terminal terminal guard evicts the orch-cache, no restart). Kind gate-ON proven (785KB result → ref_resolved→COMPLETED, 0 decode WARNs; cancel froze a drive-loop instantly; sole-writer lag 0); rig e2e#63. 5/9 #113 fixtures COMPLETE; 4 hit a distinct oversized-command.issued stall → #114. No prod default flipped. Prior: 🎯 #103 — CQRS write-path cutover COMPLETE, FLIP-READY (server#240, v3.29.3): the 2 ExecutionService cancel/finalize writers route through the emit_event chokepoint — the last synchronous server noetl.event writer under the gate is closed, so with NOETL_EVENT_INGEST_PUBLISH_ONLY=on the server writes zero event rows (materializer is the sole writer). Kind-proven both modes (gate-off byte-identical INSERT; gate-on PUBLISHED + materializer sole writer + terminal state + 0 loss/dup). 
All three flip blockers closed → flipping the gate on is a staged operator decision. Default-off; no prod default changed. Prior: #104 — off-server-drive × gate crash-recovery (v3.29.2, server#238): cold-cache apply rebuilds WorkflowState from the durable log. Prior: #108 (c) — worker-driven orchestrator drive now DEFAULT ON (server#233, v3.28.0): NOETL_ORCHESTRATE_PLUGIN_DRIVE defaults true after the scale soak proved zero noetl.event burst + full system-pool isolation; in-process drive kept as the =false revert. Prior: #101 — bounded-memory orchestrator + stall-proof reconcile (v3.9.0, server#197): projection-snapshot bounded rebuild (flat memory — 167KB snapshot at 200k events, was OOM at ~19k); throttled O(events) consistency COUNT off the hot path; background reconcile poller (force-advances every active execution every 8s → no permanent deadlock under DB backpressure); results-by-reference resolution; GET /api/executions/{id} memory-bomb fix. Validated kind 10×1000 (flat memory) + GKE db-g1-small/PgBouncer 10×200 (poller broke a stall, 0 fails/restarts, Cloud SQL ~15 backends). Prior: #90 Phase 7 — POST /api/execute/batch + opt-in exactly-once dedup window (v3.5.0, server#189, closes server#188): batch endpoint creates N executions in one round-trip with partial-failure containment, reusing the single-execute path so per-message routing/trace/dedup are intact; the opt-in dedup window (noetl.subscription_dedup, bounded-by-age, race-safe INSERT … ON CONFLICT, scoped by subscription, subscription.message.deduplicated audit, default off — RFC §10 OQ1); validation of the new dispatch.batch_dispatch/batch_max/dedup/limits spec blocks; noetl_execute_outcomes_total + noetl_execute_batch_size. Live on kind: batch 12→12 COMPLETED; dedup duplicate→1 execution; direct-curl within/outside-window + dedup-off all green. ai-meta → server 7b217d8. Prior (v3.4.2): #90 Phase 5 gcs/s3 spool credential optional (ADC). 
Prior: #90 Phase 4 — spool config validation + subscription lifecycle-status fix (v3.4.1, server#184+server#185): validate the spec.spool block at registration; lifecycle-status reconstruction now matches only the six lifecycle event types so spool/circuit events (which share the subscription's execution_id) can't 500 subscription_get/activate. Prior: #90 Phase 3 — push-ingress config endpoint + push catalog validation (v3.3.0, server#182): mode: push requires an ingress.verify block (hmac_sha256 | bearer | pubsub_oidc; none rejected) + new GET /api/internal/ingress/{listener} (service-account-gated) resolving the verify-secret alias through the Wallet + idempotent subscription registration — the gateway's DB-free config source. Live-validated on kind. Prior: #90 Phase 2 — kind: Subscription type + lifecycle + pool routing + W3C trace (v3.2.0, server#180): first-class kind: Subscription catalog type (source/mode/dispatch validation, no step-DAG) + event-sourced lifecycle endpoints /api/subscriptions (register→activate→pause/resume→drain→deactivate, idempotent register, GET list/get) + execution_pool override on /api/execute routing the whole run to noetl.commands.<pool>.<eid> (persisted in playbook_started meta, orchestrator reads back) + W3C trace into meta.trace + command notification + child inheritance. Startup-seeds the subscription resource kind; decodes noetl.event.created_at as TIMESTAMP. Live E2E green. Prior (v3.1.0): subscription ToolKind. Prior: Round-trip JSON null in whole-object {{ step }} references (v3.0.6, server#177; closes noetl/ai-meta#89): a single-expression {{ step }} reference to a prior result envelope carrying a null field rendered that field as the JS token undefined via minijinja's map repr (json_value_to_minijinja maps JSON null→Value::UNDEFINED); render_to_value then failed serde_json::from_str and returned the whole envelope as a raw string, so the consuming python/rhai step received an unparseable str and crashed. 
Fix adds a | tojson retry to render_to_value (mirrors the noetl-tools TemplateEngine::render_value the server copy had diverged from): a lone {{ expr }} whose plain render is container-shaped-but-invalid JSON re-renders with | tojson, and minijinja_to_json maps undefined/none → JSON null so the field round-trips as null. 5 new regression tests (null in nested + top-level objects, null array element, explicit | tojson no-double-pipe, scalars unchanged); 619 lib + 8 parity green; clippy clean. Kind-validated against the live paginated-api test-server: cursor fixture walks all 4 pages, terminal next_cursor: null handled, validate_results collects 35 (first_id=1, last_id=35, success) — matching offset; pre-fix the 4th check_pagination was command.completed error. ai-meta → server 8e17fbe. Prior: Workflow-arc loops advance across iterations + terminate (v3.0.5, server#176; closes noetl/ai-meta#85): two coupled fixes atop the dispatch-guard re-entry layer. (1) Durable event-sourced loop-ctx propagation — step-level set: ctx.* loop variables were recomputed per pass and reverted to the workload default (loop thrashed 0,0,1,0,1,2,…); root cause was start's initializer set re-firing every pass in random HashMap order against the loop's advancing set. Fix persists each completion's rendered set: as a ctx.updated event (latest-wins fold in WorkflowState.ctx + build_context overlay), emitted once per completion keyed by the stable completion event_id (the Utc::now()-fallback completed_at is unstable across reconstructions). (2) Loop-exit hang — the exit branch was marked step.skipped on a loop-body-completion pass (recency-based branch-point detector missed it), turning it terminal so the is_step_done guard suppressed the exit dispatch; fixed with a structural loop-branch-point test (any step with a back-edge arc). 614 lib tests (+6; 2 verified to fail without their guard); clippy-clean. 
Kind-validated: counter loop 0→1→2→3 + terminates; real-http offset pagination 4 pages collecting 35. ai-meta → server e519fdc. (Separate finding: the e2e offset/cursor fixtures read response.data.users but the Rust http tool nests it at response.body.users — filed as a follow-up.) Prior: Unblock workflow loops + loop.done-gated transitions (v3.0.4, server#175; closes noetl/ai-meta#83 + #84): the fan-in/reduce barrier counted a loop back-edge (check_pagination → fetch_page) as an upstream and deferred the loop head forever — fix excludes back-edges via a new forward_reachable helper (genuine fan-in unaffected); and event.name was never populated for arc evaluation so when: {{ event.name == "loop.done" }} never matched (10+ fixtures hung after an in-step loop:) — fix injects event.name = "loop.done" into a completed loop step's next-arc context. Found in a full e2e regression re-sweep (19→27/36 on kind); landed with e2e fixture fix e2e#39 (duckdb commands→command, closes #86). Follow-ups #85/#87 filed open. 26 orchestrator tests +2 new; kind-validated. Prior: container-callback insert matches the deployed event schema (v3.0.3, server#173; tracks noetl/ai-meta#43): the container-callback handler emitted its resume call.done via a stale query targeting attempt + id columns that don't exist on the deployed noetl.event (PK (execution_id, event_id)) — every watcher POST 500'd with column "attempt" of relation "event" does not exist, blocking the #43 chain. Replaced with an inline INSERT matching the working handlers::events column set; terminal outcome rides in a chk_event_result_shape-conforming result envelope. Kind-val GREEN: kind_validate_container_callback.sh both probes pass (happy_path → succeeded, oom → failed_oom). 
Prior: container-tool command type contradiction fixed (v3.0.2, server#172; closes noetl/ai-meta#81): ToolSpec.command Option<String> → Option<serde_json::Value> so the container tool kind's K8s-Job-style array command decodes server-side (was a 400 data did not match any variant of untagged enum ToolDefinition) and passes through unchanged to the worker's ContainerConfig.command: Option<Vec<String>>; a scalar stays a JSON string for the shell/db consumers; ToolCall::from_spec forwards the value verbatim instead of wrapping in Value::String. 2 new regression tests (playbook::types 18/18); clippy clean. Kind-val GREEN end-to-end — server accepts command: ["/bin/sh","-c"], worker dispatches the container tool, K8s Job reaches Complete 1/1 (pre-fix kubectl get jobs empty). Prior: e2e-sweep cleanup (v3.0.1, server#171; tracks noetl/ai-meta#49): 64 MB result-store PUT body limit (DefaultBodyLimit — was rejecting 15 MB+ payloads with HTTP 413); render_pipeline_config stashes set/args/spec/command blocks before Tera rendering; iter namespace map in build_iteration_command; cmd_render_ctx uses command.context override; stripped diagnostic tracing::debug! blocks. All 7 e2e sweep playbooks PASS on Rust-only kind stack. Prior: Sequential-mode iterator dispatch (v2.62.0, server#166; closes noetl/ai-meta#76): LoopMode enum (Sequential default / Parallel); LoopSpec.mode parsed from loop.spec.mode YAML; StepInfo.iterations_dispatched tracks command.issued count for the sequential dispatch guard; sequential pattern dispatches iteration 0 at fan-out then on each command.completed dispatches next if iterations_dispatched == iterations_completed(). Default is Sequential — existing playbooks without explicit spec.mode get sequential behavior. 3 new tests; lib pass; clippy clean. Kind-val GREEN: test/loop COMPLETED 5/5 + iterator_save_test COMPLETED 4 steps. First Claude-direct Rust PR under agents/rules/handoff-routing.md. 
Prior: Durable result-store endpoints (v2.58.0, server#160; closes noetl/ai-meta#70): PUT + GET /api/result/<eid>/<step>/<ref>. Kind-val GREEN: output_select_test reached playbook.completed. Prior: Rust orchestrator exclusive-routing fix — step.skipped for untaken siblings (v2.57.2, server#159; closes noetl/ai-meta#67): under mode: exclusive only one arc fires; pre-fix, the static planner declared the untaken sibling's target as an upstream of any downstream merge step, then the R4 fan-in barrier waited for it forever. Three-part fix: (1) evaluator::evaluate_next_transitions stops break-ing on exclusive-mode match — surfaces unmatched siblings as not_matched_with_target results; (2) orchestrator::process_in_progress two-pass refactor — emit step.skipped for ALL unmatched arc targets first, then dispatch matched (HashMap-order-independent); (3) +2 unit tests + new defensive Jinja regression guard (server lib 568/0/0). Kind-val GREEN: e2e/fixtures/playbooks/comprehensive_test.yaml reaches playbook.completed in ~4s (was hanging forever). Single-commit patch on top of v2.57.1. Prior: Rust orchestrator step.data template accessor fix (v2.57.1, server#158; closes noetl/ai-meta#66): WorkflowState::build_context now injects a self-referencing .data key on the extracted user_data so {{ step.field }} (existing flat path) AND {{ step.data.field }} (new wrapped path) both resolve. Guarded by !map.contains_key("data") to preserve task_sequence flatten back-compat (a labeled sub-task's data field stays addressable as both <step>.<label>.data.x AND <step>.data.x). 2 new unit tests; cargo test lib 566/0/0; release build clean. Single-commit patch on top of v2.57.0. Prior: Phase D R5 R7 — cross-server parity harness; Replay engine port complete (v2.57.0, server#157; closes server#148; tracks noetl/ai-meta#49 Phase D R5). 
Final slice — hermetic parity rig: events.json (13 synthetic events) + expected.json (Python's pre-recorded fold output) + regenerate_expected.py (standalone Python script — verbatim extract of service.py fold + helpers, no noetl-package imports). tests/parity_harness.rs 8-test integration suite asserts structural parity field-by-field across all six projections + payload refs. Parity is structural not byte-for-byte hex (Python and Rust hash different digest inputs per R4's design). All 8 tests pass; lib 564/0/0; release build clean. No kind-val needed — test-only PR. All 7 Phase D R5 rounds shipped today (v2.51.0 → v2.57.0); the Replay engine port is complete. Phase D R5 R6 — payload resolver (v2.56.0, server#156; Refs server#148; tracks noetl/ai-meta#49 Phase D R5): every event's result.reference JSON gets parsed into a typed PayloadSummary and appended to the relevant projection's payload-refs list. PayloadSummary + PayloadRefEntry types mirror Python's dict shapes; extract_payload_ref + payload_summary mirror Python's helpers with three-tier fallback (reference.<field> → rows_ref.meta.<field> → rows_ref.ipc.<field>). ReplayExecutionState.payload_refs + ReplayFrameState.output_ref/output_ref_summary + ReplayBusinessObjectState.payload_refs/last_payload_ref all populated. 15 new unit tests; lib 564/0/0. Kind-validated against live execution showing 3 populated payload_refs with real SHA-256 digests. Only R7 remains. Phase D R5 R5 — snapshot seed + base_state + upcaster digest (v2.55.0, server#155; Refs server#148; tracks noetl/ai-meta#49 Phase D R5): the replay fold can now start from a prior fold's output and continue from there. ReplaySnapshotSeed mirrors Python's frozen dataclass; ReplaySnapshotInfo is the output subset; ReplayFoldOptions carries the optional inputs; new fold_replay_state_with_options entry point + 5-arg fold_replay_state as a back-compat shim. 
R5-introduced ReplayState fields (replay_snapshot, upcaster_registry_digest) both skip_serializing_if None — default folds produce R1–R4-identical JSON. Snapshot-storage backend deferred to a downstream sub-issue. 8 new unit tests; lib 549/0/0. Kind-validated wire-shape back-compat. Phase D R5 R4 — typed Checksum + projection_checksums (v2.54.0, server#154; Refs server#148; tracks noetl/ai-meta#49 Phase D R5): every replay fold now produces a typed Checksum over the full state + a 6-entry projection_checksums map. ChecksumType enum (initial variant Sha256) + Checksum { type, value } struct + stable_json_bytes deterministic JSON encoder + compute_checksums per-projection + top-level digest run at the end of fold_replay_state. Design decision: digest input is the typed Rust state directly, not Python's flat-row normalize layer; Python byte-for-byte parity deferred to R7. 9 new unit tests; lib 541/0/0. Kind-validated: top-level checksum 41265876487f...ae426; six projection_checksums entries all populated. Phase D R5 R3 — loops + business_objects projections (v2.53.0, server#153; Refs server#148; tracks noetl/ai-meta#49 Phase D R5): third slice of Phase D R5 — replay fold populates the last two per-projection maps. Two new typed state structs (ReplayLoopState, ReplayBusinessObjectState) replace R2's serde_json::Map placeholders; ReplayState.{loops,business_objects} flip to BTreeMap for deterministic key ordering. Two new ID extractors mirror Python's _loop_id / _business_object_identity resolution order; business_object_status helper mirrors Python's _business_object_status suffix-derived ACTIVE/DELETED transitions; two new populate functions with full event-shape coverage (loop counters bump on command.{completed,failed} + loop.shard.{done,failed} + loop.{done,fanin.completed}; business-object attributes REPLACE/PATCH from meta). payload_refs deferred to R6. 13 new unit tests; lib 532/0/0. 
Kind-validated: re-probe returns loops + business_objects empty (expected — fixture doesn't emit those events). R4 design captured (per user direction): typed ChecksumType enum + Checksum { type, value } struct; future checksum types slot in via enum. Phase D R5 R2 — stages + frames + commands projections (v2.52.0, server#152; Refs server#148; tracks noetl/ai-meta#49 Phase D R5): second slice of Phase D R5 — replay fold populates stages + frames + commands projections. Mirrors Python's state["stages"] / state["frames"] / state["commands"] per-projection dicts. ReplayEventRow extended with stage_id / frame_id / command_id / worker_id / aggregate_type / aggregate_id / meta columns (all #[sqlx(default)]); three new typed state structs (ReplayStageState / ReplayFrameState / ReplayCommandState); ReplayState.{stages,frames,commands} flip from serde_json::Map to BTreeMap for deterministic ordering; three new ID extractors mirror Python's resolution order; three new populate functions with full status transitions (stage opened/closed; frame dispatched/started/committed/failed/abandoned; command full lifecycle). 10 new unit tests; lib 518/0/0. Kind-validated: re-probe of prior fanout_reduce execution returns commands map populated with 4 entries carrying worker_id + issued_event_id + last_event_id. Phase D R5 R1 — Replay endpoint scaffold + execution projection (v2.51.0, server#149; tracks server#148): opens Phase D Round 5 (Python's noetl/server/api/replay/service.py ~1236 LoC → Rust). Sub-issue server#148 documents the 7-round decomposition (R1 scaffold + execution / R2 stages+frames+commands ✅ / R3 loops+business_objects / R4 typed Checksum + projection_checksums / R5 snapshot seeds / R6 payload resolver / R7 cross-server parity harness). 
R1 ships new GET /api/replay/state route mirroring Python's endpoint.py byte-for-byte (query params + defaults + projection enum + mutually-exclusive cutoffs returning 400); new services::replay module with ReplayService + ReplayCutoff + ReplayProjection + ReplayState + ReplayExecutionState + pure deterministic fold_replay_state; minimal execution projection fold reuses the Phase D R4 terminal-event short-circuit. Phase D R4 follow-up: status endpoint short-circuits on terminal events (v2.50.1, server#147; closes server#146): ExecutionService::get_status now (1) looks up playbook.completed / playbook.failed FIRST and returns COMPLETED / FAILED directly, and (2) accepts 'success' lowercase in the completed_steps counter. Kind-validated: prior execution flipped RUNNING→COMPLETED on the same DB data. Phase D R4 slice 2 — apply_event handles step.skipped (v2.50.0, server#145; closes server#144): new step.skipped arm in state::WorkflowState::apply_event records the step with StepState::Skipped — fan-in barrier no longer defers forever when an upstream's when guard evaluates false. Container Tool Callback umbrella #43 Round 2 — POST /api/internal/container-callback/{execution_id}/{step} (v2.48.0, server#141; closes server#140; tracks noetl/ai-meta#43): external K8s watcher (Round 1, ops#166) POSTs Job terminal-state events here; handler validates path params, checks staleness via a single indexed SELECT on noetl.event, and emits a call.done event with the structured terminal state on match (or bumps noetl_container_callback_stale_total + returns 202 if no events exist for the execution). Six TerminalState variants survive in meta.terminal_state so playbooks can branch on the specific failure reason. Two new counters; 7 new unit tests; lib 487/0. Round 2 lands first per the umbrella's recommended ordering — smallest blast radius; unblocks Round 1 (watcher Deployment) + Round 3 (tools#36 — Tool::Container). 
Secrets Wallet #61 cloud-specific dynamic-secret providers shipped — umbrella feature-complete (v2.45.0 server#137 + v2.46.0 server#139 + v2.47.0 server#138): 6d.1 AWS STS AssumeRoleWithWebIdentity (EKS IRSA path; no SigV4 — token IS the credential); 6d.3 Azure AAD client-credentials (off-cluster non-IMDS path; sovereign-cloud overrides via env); 6d.2 GCP iamcredentials.generateAccessToken (workload-identity impersonation of a target SA). All three return SecretValue.expires_at populated — Phase 6d's cache_decision clamps cache TTL accordingly; Phase 7c.3 background refresh re-resolves inside the window. 39 new unit tests across the three providers; lib 470/0. Secrets Wallet umbrella is feature-complete: envelope encryption + KMS + 5 static-secret providers (GCP-SM, K8s, Vault, AWS-SM, Azure-KV) + 3 dynamic-secret providers (AWS-STS, GCP-IAM, Azure-AAD) + residency policy + cross-region broker + KEK rotation + audit + auto-renewal with stampede collapse. Phase 7c.3 — background-refresh wire-up + stampede collapse (v2.44.0, server#136): cache-hit path now spawns a background tokio::spawn to re-resolve via the provider + update the cache via KeychainService::set when the row is inside the refresh window. Cached value returns IMMEDIATELY to the caller — worker fetches stay on the fast path. Stampede collapse via new src/services/keychain_refresh.rs RefreshInflight (Arc<tokio::sync::Mutex<HashSet<(i64, String)>>>); concurrent refreshes for the same (catalog_id, alias) collapse to one provider call. Refactor: extracted resolve_via_provider from try_resolve_keychain so cache-miss inline + background refresh share identical code. Phase 7c series wire-complete (7c + 7c.2 + 7c.3). 
Phase 7a.2 / 7b.2 / 7c.2 — operator-facing rotation + audit storage + cache-refresh primitive (v2.42.0 server#127 + v2.43.0 server#129 + server#131): 7a.2 wraps the Phase-7a rewrap_storage_string primitive with POST /api/internal/wallet/rotate-kek (batched cursor scan of noetl.credential + noetl.keychain, RotateSummary { processed, rewrapped, skipped, failed, last_id } for progress checkpointing) and GET /api/internal/wallet/key-status (per-version row counts). 7b.2 adds the noetl.secret_audit table (CREATE TABLE IF NOT EXISTS at startup; server-owned), DbAuditSink impl, and GET /api/internal/secret-audit?credential=&execution_id=&from=&to=&limit= (bounded; ORDER BY occurred_at DESC; hard cap 10_000). 7c.2 adds KeychainService::should_refresh(catalog_id, keychain_name, execution_id, scope_type, now) — cache-side companion of the Phase-7c decision primitive; reads the row's expires_at, asks secrets::dynamic::should_refresh_default (honours KEYCHAIN_CACHE_REFRESH_WINDOW_SECS), bumps noetl_secret_refresh_total{outcome="triggered"} on true. Backward compatible. Next: 7c.3 stampede-collapse mutex + background re-resolve; 6d.1/.2/.3 cloud-specific dynamic providers. Phase 7c — token auto-renewal primitives (closes Phase 7) (v2.41.0, server#125): final named round of the Secrets Wallet umbrella. secrets::dynamic::should_refresh(expires_at, refresh_window, now) decision primitive (true iff expires_at set + still valid + within refresh window); KEYCHAIN_CACHE_REFRESH_WINDOW_SECS env (default 60). noetl_secret_refresh_total{outcome} counter (triggered|succeeded|failed|stampede_collapsed; failed alert-worthy) + noetl_secret_refresh_duration_seconds histogram (50ms–5s buckets). 5 new unit tests; lib 427/0. Lib-only. All named phases (1–7) of the Secrets Wallet umbrella are complete. Remaining queue is discrete follow-up sub-issues only. Phase 7b primitives — secret-resolution audit service (v2.40.0, server#123): durable audit trail of every credential resolution. 
AuditEvent struct (NEVER contains the secret value); bounded Operation + Outcome enums; AuditSink trait + NoopAuditSink default + SecretAuditService wrapper with record_async (fire-and-forget, never blocks resolver) + record_strict (await, used when compliance mandates the row exist before the value releases) + record (dispatches by strict-mode). NOETL_SECRET_AUDIT_REQUIRED env (default false; 1/true/TRUE/yes/YES enable strict). noetl_secret_audit_writes_total{operation, outcome, status} counter (failed_strict alert-worthy). 8 new unit tests; lib 422/0. Lib-only. Phase 7a — KEK rotation primitives (v2.39.0, server#121): starts Phase 7. KeyManager::current_key_version() trait accessor; EnvelopeCipher::rewrap_storage_string primitive (parse → if same version Skipped; else unwrap with historical version → re-wrap with current → Rewrapped { old_key_version, new_key_version, new_storage_string }). Plaintext payload NEVER reconstructed — pure DEK re-wrap, AES-GCM ciphertext bytes stay byte-identical. noetl_wallet_rotate_total{table, status} counter (skipped|rewrapped|failed_unwrap|failed_wrap|parse_error). 4 new unit tests; lib 414/0. Lib-only. Phase 6e — cross-region broker (closes Phase 6) (v2.38.0, server#119): BrokerRegistry (region → broker_url from NOETL_SECRET_BROKER_REGISTRY env; empty default = pre-6e fail-closed); POST /api/internal/cross-region/resolve peer endpoint validates expected_entry_region == server_region() (defensive against stale peer registries), resolves locally, seals via Phase-5a primitives to the requesting worker's pubkey directly; get_sealed handler falls back to broker on AppError::ResidencyViolation; KeychainDef.no_broker_fallback per-credential opt-out; AppError::CrossRegionUnreachable → HTTP 502. Two new metrics: noetl_secret_broker_call_total{broker_region, outcome} + noetl_secret_broker_call_duration_seconds{broker_region} histogram (50ms–5s). 10 new unit tests; lib 410/0. 
Both residency shapes operational: hard isolation (strict + no broker = HTTP 403) + soft federation (strict + broker registered = transparent cross-region routing). Phase 6 closes. Phase 6d primitives — dynamic-secret support + cache honors issuer TTL (v2.37.0, server#117): SecretValue.expires_at: Option<DateTime<Utc>> field; src/secrets/dynamic.rs cache_decision() honors min(default_ttl, expires_at - now - safety_margin) and returns SkipCacheAlreadyExpired when the deadline is already past or inside the operator's safety margin; KEYCHAIN_CACHE_DYNAMIC_SAFETY_MARGIN_SECS env (default 60); resolve_keychain_entry_with_meta returns the bundle's earliest expires_at; CredentialService::resolve_via_provider consumes the helper. Two new metrics: noetl_secret_dynamic_ttl_seconds histogram (1m/5m/15m/1h/4h/12h buckets) + noetl_secret_cache_skip_total{reason} counter. 7 new unit tests; lib 398/0. Backward compatible (providers without expires_at keep the 600 s default). Phase 6c — residency-policy gate (v2.36.0, server#115): KeychainDef.residency enum (none|advisory|strict, default none) + allowed_regions allowlist; resolver runs the gate at the top of resolve_keychain_entry BEFORE any provider call so strict-mode mismatches short-circuit with AppError::ResidencyViolation (HTTP 403, clear "credential X is region-locked to Y; this server is in Z" message that NEVER includes the value). noetl_secret_residency_check_total{policy, decision} counter — strict + violation_blocked is alert-worthy, advisory + violation_allowed is the migration-window signal. 8 new unit tests; lib 391/0. Phase 6b — ProviderRegistry + per-(provider, region) metrics (v2.35.0, server#113): server-side cache of (provider_id, region) → Arc<dyn SecretProvider> so the resolver doesn't rebuild from env on every cache-miss; RwLock + double-checked locking on the build path so concurrent get_or_build for the same key only builds once. Optional TTL via NOETL_SECRET_PROVIDER_TTL_SECONDS. 
New noetl_secret_provider_build_total{provider,region,status="cache_hit|ok|error"} counter + noetl_secret_resolve_duration_seconds{provider,region} histogram (5 ms – 5 s buckets). 7 new unit tests; lib 383/0. Phase 6a — region tag on keychain entries + per-region routing (v2.34.0, server#111): starts Phase 6 (residency-aware distributed resolution). KeychainDef.region optional field (no schema migration — lives in existing JSON blob); SecretRef.region provider-agnostic; AWS provider consumes it as the regional endpoint with explicit precedence (<region>: ref prefix > field > legacy project overload > AWS_REGION env). New NOETL_SERVER_REGION env + server_region() / effective_region() fallback helpers. noetl_secret_resolve_total{provider,region,status} counter per observability.md Principle 1. 5 new unit tests; lib 376/0. Lib-only — backward compatible. Phase 5b — wire format + sealing endpoint (v2.33.0, server#107): new GET /api/credentials/{id}/sealed?worker_id=<name> returns a SealedEnvelope (X25519-sealed credential JSON) addressed to the named worker; workers opt in by including worker_public_key in their register payload's runtime JSON blob — no schema migration; 400 BadRequest when the worker_pool row exists but didn't register a key; noetl_credentials_sealed_total{status} counter + credential.seal span per observability.md. Kind-validated end-to-end (Python cryptography + HKDF + ChaCha20-Poly1305 opens the envelope → recovers the bearer token + scope round-trip). Phase 5a — sealed payload crypto primitives (v2.32.0, server#107): src/crypto/sealed.rs X25519 ECDH + HKDF-SHA256 + ChaCha20-Poly1305 sealed-box (nonce derived from the shared secret, AAD pins alg+v for clean alg-mismatch rejection); 12 unit tests (round-trip, tamper, alg/version-mismatch, JSON wire stability); lib 369/0. Defense-in-depth on top of Phase-4 mTLS — cleartext never enters the response body. Lib-only; 5b adds the runtime-registry worker pubkey + sealing endpoint, 5c the worker side. 
Providers 3.x — AWS Secrets Manager + Azure Key Vault (v2.31.0, server#105): two new backends behind the one SecretProvider trait completing the 5-provider matrix. AWS SM uses hand-rolled AWS Signature Version 4 signing (no aws-sdk dep tree; signing key verified by a unit test against AWS's published reference vector); ref shape [<region>:]<secret-id>[#<json-key>] with JSON-key extraction for multi-field secrets; creds from env (the IRSA-injected triple). Azure KV uses IMDS Managed Identity (AKS/VMs) with TTL-cached bearer; ref shape [<vault>/]<secret-name>[#<version>]; sovereign clouds via NOETL_AZURE_KEYVAULT_DNS_SUFFIX. 21 new unit tests; lib 357/0; cloud-only backends (kind-val at unit-test layer like GCP). Phase 4a — opt-in TLS/mTLS listener (v2.30.0, server#103): the worker↔server credential channel (GET /api/credentials/<alias>) was plain HTTP; opt-in TLS via NOETL_TLS_CERT+NOETL_TLS_KEY (+NOETL_TLS_CLIENT_CA ⇒ mTLS), ring rustls + axum-server bind_rustls, kind-validated (200 w/ client cert, rejected w/o, plain HTTP refused). Providers 3.x — HashiCorp Vault provider (v2.29.0, server#101): a provider: vault keychain alias resolves from a Vault KV v2 secret (X-Vault-Token; ref [<mount>/]<path>#<key>), kind-validated end-to-end against an in-cluster Vault — second backend validatable on kind after K8s. /api/executions list perf + status fix (v2.28.1, server#99, #62): candidate-first rewrite (start-event index, not a 3.2M-row seq scan) — 6.5 s → 0.015 s (~430×), identical list; bool_or status-drift fix (was all-RUNNING). Secrets Wallet #61 providers 3.x — Kubernetes Secrets provider (v2.28.0, server#97): a provider: k8s keychain alias resolves from an in-cluster Secret via the API server + ServiceAccount token + cluster CA — the first secret backend kind-validated end-to-end with a real value (GCP needs GKE). 
Orchestrator-strand fix (v2.27.2, server#95): a deterministic evaluate failure (an invalid template in a step code body, an unknown step in a next arc, malformed routing) now emits a terminal playbook.failed instead of stranding the run in RUNNING forever — surfaced by the #54 e2e sweep (closed server#94). Parser fix (v2.27.1, server#93): NextSpec untagged-variant order — the list form next: [{step: x}] was deserialized into a struct Router positionally, silently dropping its arcs (and defeating unknown-step validation); sequence-shaped variants now precede the struct. Secrets Wallet Phase 3c — keychain cache (v2.27.0, server#91): execution-scoped, envelope-encrypted, TTL'd cache so an auth: "{{ alias }}" lookup isn't re-fetched from the secret manager per step; + fixed the keychain storage layer (queries never matched the table — also repairs the /api/keychain endpoints). Phase 3 (resolution) complete — R3b (v2.26.0) resolves a provider: gcp keychain alias from GCP Secret Manager on a credential miss; built on R3a/R2/R1 (v2.23.0–v2.25.0). Phases 1–2: Cloud KMS for the KEK (v2.22.0); envelope encryption (v2.21.0) |
noetl/worker |
Rust NATS pull worker | v5.40.0 |
✅ #115 Phase 5 — forward the atomic-item-context flag onto the off-server from_events drive input (worker#121, v5.40.0 2484d17): so the off-server drive narrows each worker-bound command context to its minimal declared slice (the wasm reuses orchestrate-core's build_command). Default false → full-context dispatch unchanged. Prior: ✅ #115 Phase 4 REMAINDER — stateless off-server drive (resolve trigger type off the WAL + no-op on incomplete chain) (worker#120, v5.39.0 8e1f651): absorbs the server's stateless edge. ExecutionChain::event_type_of + build_offserver_input(trigger_event_id) resolve trigger_event_type off the pool WAL index when the server omits it (defaults command.completed). resolve_offserver_orchestrate_input returns an `OffserverDispatch{Wasm |
noetl/tools |
Shared tool registry crate | v3.13.0 |
#103 — deferred (ack-after-processing) ack (v3.13.0, tools#71): AckMode::Defer in the subscription SourceClient surfaces a durable per-message ack handle (NATS $JS.ACK reply subject) instead of acking inline; SourceClient::ack(ack_ids, AckDisposition) = Ack/Nack/Term (NATS + Pub/Sub); tool operation: ack|nack|term. Opt-in — existing callers unchanged. The capability the worker materializer (v5.34.0) drives for ack-after-materialize. Prior: #90 Phase 4 — store-and-forward spool engine + per-downstream circuit breaker (v3.4.0, tools#54): noetl_tools::spool — circuit breaker (trip/half-open/close, NATS-KV-serializable, per-downstream OQ2), SpoolItem (SHA-256 + noetl://spool ref + recv_seq-ordered keys), nats_object/local_disk backends, ordered-replay engine (global/per_key/none + idempotency + dead-letter + retention/GC). 44 unit tests + real-NATS integration test. Prior: #90 Phase 2 — header-directive engine + public build_source factory (v3.3.0, tools#52): source/directives.rs — DirectiveSpec/DispatchPlan turn allowlisted message headers into dispatch instructions (redirect dispatch.playbook, dispatch.execution_pool, priority→pool, idempotency_key, content_type/schema_hint, W3C trace), untrusted by default (allowlist + value-allowlist enforced at parse; multi-value last-wins; applied[] audit). Public build_source(cfg, ctx) so the worker continuous runtime constructs the same SourceClient. 12 new tests. Prior (v3.2.0): bounded-drain subscription tool + SourceClient (Phase 1). Prior: Multi-tool sibling references (v3.1.1, tools#48; closes noetl/ai-meta#87): in a tool: [list] step, TaskSequenceTool stored each sub-tool's result for the aggregated output but never injected it into the running context, so a later sub-tool's {{ <label>.<field> }} rendered empty — masked in quoted positions, a syntax error at or near "," in unquoted numeric SQL (save_edge_cases test_large_payload). 
Fix injects each sub-tool's result under its label (with a synthetic .data self-ref matching build_context) so later siblings + a later python sub-tool's stdin variables resolve it. 2 new unit tests; lib 300/0. Kind-validated: save_edge_cases test_large_payload → record_count = 100, save_delegation_test clean. Worker adopts via worker#69 (b97f642). Prior: e2e-sweep cleanup (v3.1.0, tools#47; tracks noetl/ai-meta#49): YAML boolean when: true in policy rules now checks as_bool() before the string-template fallthrough (Value::Bool as_str() returns None); ` |
noetl/cli |
Rust CLI + local-mode runner | v4.11.0 |
noetl subscribe — local-mode subscription listener (RFC #90 Phase 6) (cli#60, ai-meta → 2fb3fb0): standalone listener + FileEventSink JSONL + local_disk spool; cli-only (noetl-tools v3.5.0 source+spool reused unchanged). Prior: --include-data flag doc fix (cli#58) |
noetl/gateway |
Gatekeeper — auth + SSE + push-ingress | v3.3.0 |
#90 Phase 3 — push-ingress (Mode C) + auth-gated directive trust (gateway#28): POST /ingress/{listener} verifies HMAC / bearer / Pub-Sub-OIDC → only-then directives → one POST /api/execute per delivery on the dedicated pool (verify-and-forward, no DB on the ingress path); verify_then_plan makes the auth gate a testable invariant; first /metrics surface. Live E2E green (HMAC 12/12 + bearer 12/12). Prior (v3.2.0): Phase F R3b-2 shard-info twin endpoint. |
noetl/noetl |
Python control plane (legacy; retained for back-compat) | v4.12.1 (ecd16a2) |
✅ #115 Phase 2 DDL (noetl#667, ecd16a2): canonical schema_ddl.sql prev_event_id columns on noetl.event + noetl.command + idx_event_prev_event_id for fresh installs (the Rust server also ensures them idempotently at startup). Otherwise deprioritized per Rust-only direction; pytest debt at noetl/noetl#663 parked |
noetl/ops |
Helm + manifests | (untagged) |
k8s-watcher durable image + pod-level OOM classification (ops@cacc513, ops#168; closes noetl/ai-meta#80, tracks noetl/ai-meta#43): retired the dead bitnami/kubectl:1.30.3 (removed from Docker Hub; cluster was on the bitnamilegacy stopgap) for alpine/k8s:1.30.3 (kubectl + jq + curl baked in) — the prior runtime install never put curl on PATH so callback POSTs returned HTTP 000. classify_pod_failure() now reads the backing Pod's status (RBAC already grants pod reads) to emit failed_oom (OOMKilled) / failed_image_pull (ImagePullBackOff); build_body's completed_at fallback uses RFC3339 `now |
noetl/docs |
Docusaurus site | (untagged) | ADR Implementation-status block |
noetl/travel |
Reference SPA (domain-fork example) | (untagged) | Production |
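The noetl/server notes in the table above describe two Secrets Wallet TTL primitives in prose: the Phase 7c refresh decision ("true iff expires_at set + still valid + within refresh window") and the Phase 6d cache decision (`min(default_ttl, expires_at - now - safety_margin)`, skipping the cache when the deadline is already past or inside the safety margin). The following is a minimal sketch of those two rules only, not the server's code. It assumes plain epoch seconds in place of `DateTime<Utc>`, an inclusive window boundary, and a hypothetical `CacheDecision::CacheFor` variant; only `should_refresh`, `cache_decision`, and `SkipCacheAlreadyExpired` are names taken from the notes.

```rust
/// Sketch of the refresh decision: true iff `expires_at` is set, the value
/// is still valid, and the remaining lifetime is inside the refresh window.
/// Times are epoch seconds (an assumption; the notes use DateTime<Utc>).
fn should_refresh(expires_at: Option<u64>, refresh_window_secs: u64, now: u64) -> bool {
    match expires_at {
        None => false,
        // `exp > now` is checked first, so the subtraction cannot underflow.
        Some(exp) => exp > now && exp - now <= refresh_window_secs,
    }
}

#[derive(Debug, PartialEq)]
enum CacheDecision {
    /// Hypothetical variant name: cache the value for this many seconds.
    CacheFor(u64),
    /// Named in the notes: the deadline is past or inside the safety margin.
    SkipCacheAlreadyExpired,
}

/// Sketch of the cache decision: clamp the cache TTL to the issuer's
/// deadline minus an operator safety margin.
fn cache_decision(
    expires_at: Option<u64>,
    default_ttl_secs: u64,
    safety_margin_secs: u64,
    now: u64,
) -> CacheDecision {
    match expires_at {
        // Providers without an expiry keep the default TTL.
        None => CacheDecision::CacheFor(default_ttl_secs),
        Some(exp) => {
            let deadline = now.saturating_add(safety_margin_secs);
            if exp <= deadline {
                CacheDecision::SkipCacheAlreadyExpired
            } else {
                CacheDecision::CacheFor(default_ttl_secs.min(exp - deadline))
            }
        }
    }
}
```

With the defaults quoted in the notes (600 s default TTL, 60 s safety margin), a secret expiring 300 s from now would be cached for 240 s under this sketch, and a secret expiring 50 s from now would bypass the cache.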
The 2026-06-04 session retired all Python deployments + their services + configmaps from the kind cluster (per the user directive "delete all legacy stuff"). Local validation runs against the Rust topology by default:
                   ┌──────────────────┐
                   │     Gateway      │   noetl/gateway v3.2.0
                   │      (Rust)      │   auth · SSE · subscriptions · shard routing
                   └────────┬─────────┘
                            │ HTTPS
                   ┌────────▼──────────────────────────────┐
                   │  noetl-server-rust (v2.19.7)          │
                   │  catalog · execute · events ·         │
                   │  /api/internal · DbPoolMap sharding   │
                   │  orchestrator engine · SSE            │
                   │  workbook resolution · pipeline parse │
                   └────────┬──────────────────────────────┘
                            │
                   ┌────────▼─────────┐
                   │  NATS JetStream  │   NOETL_COMMANDS stream
                   │  + Postgres      │   noetl.event + noetl.command
                   └────────┬─────────┘
                            │
             ┌──────────────┴──────────────┐
             ▼                             ▼
    ┌──────────────────┐          ┌──────────────────┐
    │ noetl-worker-    │          │ worker-system-   │
    │ rust v5.11.3     │          │ pool             │
    │ (shared pool)    │          │ (Rust, runs      │
    │                  │          │ system           │
    │ noetl-tools      │          │ playbooks)       │
    │ v2.18.1:         │          │                  │
    │ python · shell · │          │ Consumer:        │
    │ http · postgres  │          │ ..._pool_system  │
    │ duckdb · rhai ·  │          │ Filter:          │
    │ task_sequence ·  │          │ noetl.commands.  │
    │ playbook · noop  │          │ system.>         │
    └──────────────────┘          └──────────────────┘
    Consumer:                     [Outbox publisher +
    ..._pool_shared               projector migrated
    Filter:                       to system playbooks
    noetl.commands.               via Phase 2.a]
    shared.>
Retired this session (2026-06-04): noetl-server (Python
deploy), noetl-worker (Python deploy), noetl-outbox-publisher
(Python deploy), noetl-projector (Python statefulset), and
the noetl/noetl-ext/noetl-projector/noetl-worker-metrics
services + 4 legacy configmaps + noetl-worker SA. The kind
cluster is the regression-test topology for Rust-only e2e.
[Status legend: ✅ = shipped + kind-validated]
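The two worker pools in the diagram are separated by their JetStream consumer filters, `noetl.commands.shared.>` and `noetl.commands.system.>`. As a rough illustration of how such a filter selects subjects, here is a small sketch of NATS-style subject matching, where `*` matches exactly one token and `>` matches one or more trailing tokens. The function name `subject_matches` and the example subjects are hypothetical; in a real deployment the NATS server does this matching, not application code.

```rust
/// Returns true when `subject` matches a NATS-style `filter`.
/// `*` matches exactly one token; `>` matches one or more trailing tokens
/// and is only valid as the last token of the filter.
fn subject_matches(filter: &str, subject: &str) -> bool {
    let mut filter_tokens = filter.split('.');
    let mut subject_tokens = subject.split('.');
    loop {
        match (filter_tokens.next(), subject_tokens.next()) {
            // `>` consumes the rest of the subject, which must be non-empty.
            (Some(">"), Some(_)) => return filter_tokens.next().is_none(),
            // `*` consumes exactly one token.
            (Some("*"), Some(_)) => continue,
            // Literal tokens must be equal.
            (Some(f), Some(s)) if f == s => continue,
            // Both exhausted together: full match.
            (None, None) => return true,
            // Length mismatch or differing literal tokens.
            _ => return false,
        }
    }
}
```

Under this sketch a subject such as `noetl.commands.shared.exec.1` matches the shared-pool filter but not the system-pool filter, which is the property that keeps the two consumers from receiving each other's commands.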
Six interlocking gaps fixed in one iteration brought
control_flow_workbook end-to-end on the Rust-only stack —
exercising the complete v10 control-flow surface:
    playbook YAML   ──► noetl-server-rust orchestrator                    noetl-worker-rust
                                                                          + noetl-tools v2.18
    ─────────────────   ──────────────────────────────────────────────    ──────────────────
    workload: {...}   ► #56 workload + input alias decode                 PythonTool wrapper
    tool.kind: python ► (existing dispatch)                               globals().update(args)
    tool.kind:                                                          ► #17 capture
      workbook        ► #59 parser substitutes inline action              `result = {...}`
                                                                          global → data
    tool: [{...}]     ► #57 ToolDefinition::Pipeline accepts both       ► #18 TaskSequenceTool
      (pipeline)        flat (name-as-field) + nested (label-as-key)      runtime
    {{ step.field }}  ► #60 build_context exposes step data at top
      in next.arcs      level (not just steps.<name>); apply_event
                        captures call.done before command.completed
                        overwrites
    command.failed    ► #58 trigger_orchestrator on command.failed
                        + dedicated short-circuit in process_in_progress
                        emits playbook.failed terminal
    worker → server   ► #55 EventEmitRequest accepts i64 wire shape
      event emission    (was rejecting integer execution_id)
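Gap #60 in the table above says that `build_context` exposes step data at the top level as well as under `steps.<name>`, which is what lets an arc condition such as `{{ eval_flag.is_hot == true }}` resolve. The sketch below illustrates that idea only. It assumes a flat string-keyed context and string values (hypothetical simplifications), and it is not the orchestrator's actual `build_context`.

```rust
use std::collections::BTreeMap;

/// Hypothetical simplified step payload: field name -> rendered value.
type StepData = BTreeMap<String, String>;

/// Sketch: expose every step field under both `steps.<name>.<field>` and
/// the top-level `<name>.<field>` path.
fn build_context(steps: &BTreeMap<String, StepData>) -> BTreeMap<String, String> {
    let mut ctx = BTreeMap::new();
    for (name, data) in steps {
        for (field, value) in data {
            ctx.insert(format!("steps.{name}.{field}"), value.clone());
            ctx.insert(format!("{name}.{field}"), value.clone());
        }
    }
    ctx
}

/// Sketch of an equality-only arc guard: does `path` resolve to `expected`?
fn arc_matches(ctx: &BTreeMap<String, String>, path: &str, expected: &str) -> bool {
    ctx.get(path).map(String::as_str) == Some(expected)
}
```

With only the `steps.<name>` form present, a guard written against `eval_flag.is_hot` would find nothing and the arc would never match; the second `insert` is the part that corresponds to the fix described for #60.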
Validated end-to-end:
    noetl exec tests/fixtures/playbooks/control_flow_workbook
      → playbook_started
      → start (python)
      → eval_flag (workbook→python; is_hot=true captured via marker)
      → hot_path (next.arc when="{{ eval_flag.is_hot == true }}" matched)
      → parallel hot_task_a + hot_task_b
      → playbook.completed ✅
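The trace above ends in a terminal `playbook.completed` event. The Phase D R4 follow-up note in the server row describes the status endpoint looking up `playbook.completed` / `playbook.failed` first and returning COMPLETED / FAILED directly. A minimal sketch of that terminal-first rule follows; the `ExecutionStatus` enum and `status_from_events` function are hypothetical names, and the real `ExecutionService::get_status` queries the database rather than scanning a slice.

```rust
#[derive(Debug, PartialEq)]
enum ExecutionStatus {
    Running,
    Completed,
    Failed,
}

/// Sketch: derive the status from event types, checking terminal events
/// before anything else. The first terminal event found wins (an
/// assumption; the notes do not say how two terminals are ordered).
fn status_from_events(event_types: &[&str]) -> ExecutionStatus {
    for event_type in event_types {
        match *event_type {
            "playbook.completed" => return ExecutionStatus::Completed,
            "playbook.failed" => return ExecutionStatus::Failed,
            _ => {}
        }
    }
    // No terminal event yet: the execution is still in flight.
    ExecutionStatus::Running
}
```

The point of checking terminals first is that step-level counters can lag or disagree, while a terminal event is an unambiguous statement that the run is over.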
Per the Rust-only direction, Python pieces stay only as:
- Container payloads — runtime stays Rust; user code that wants Python ships in a container dispatched by the container tool kind (#43, in design).
- Back-compat GKE deployments — existing Python pods on the production cluster aren't removed (yet) since GKE traffic still uses them; no new feature work goes there.
The kind cluster is the canary for the Rust-only topology. When the operator runs the validation rigs end-to-end against their live cluster, the R5 cutover decision will move the production topology to match.
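Among the invariants the validation rigs assert, per the session notes in the log, is that each execution's `prev_event_id` chain has one root, one head, no dangling pointers, and a pointer walk equal to the full event count. The sketch below shows one way such a check could be written over in-memory `(event_id, prev_event_id)` pairs. The `ChainReport` struct and `check_chain` function are hypothetical; the actual rigs run as shell scripts against the database.

```rust
use std::collections::{HashMap, HashSet};

#[derive(Debug, PartialEq)]
struct ChainReport {
    roots: usize,    // events with no prev_event_id
    heads: usize,    // events that no other event points at
    dangling: usize, // prev_event_id values that reference a missing event
    walk: usize,     // events reached by walking head -> root
    total: usize,    // all events in the execution
}

/// Sketch: check the chain invariants for one execution's events.
fn check_chain(events: &[(i64, Option<i64>)]) -> ChainReport {
    let prev_of: HashMap<i64, Option<i64>> = events.iter().copied().collect();
    let mut roots = 0;
    let mut dangling = 0;
    let mut pointed_at: HashSet<i64> = HashSet::new();
    for &(_, prev) in events {
        match prev {
            None => roots += 1,
            Some(p) => {
                if !prev_of.contains_key(&p) {
                    dangling += 1;
                }
                pointed_at.insert(p);
            }
        }
    }
    let heads: Vec<i64> = events
        .iter()
        .map(|&(id, _)| id)
        .filter(|id| !pointed_at.contains(id))
        .collect();
    // Walk pointer-by-pointer only when the head is unambiguous.
    let mut walk = 0;
    if let [head] = heads.as_slice() {
        let mut seen = HashSet::new();
        let mut cursor = Some(*head);
        while let Some(id) = cursor {
            // Stop on a missing event or a cycle.
            if !prev_of.contains_key(&id) || !seen.insert(id) {
                break;
            }
            walk += 1;
            cursor = prev_of[&id];
        }
    }
    ChainReport { roots, heads: heads.len(), dangling, walk, total: events.len() }
}
```

A healthy execution reports `roots == 1`, `heads == 1`, `dangling == 0`, and `walk == total`; a forked chain of the kind described in the multi-replica notes shows up as more than one head.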
Chronological notes on what each session accomplished — see Sessions Log.
Most recent (top of log):
-
2026-06-20: ✅ #116 program-scale step 2 SHIPPED + multi-replica gate-ON validated: execution-affinity single-owner WRITE ORDERING. server#252 v3.39.0 + e2e#71. Step 1 (KV coherence) was necessary-not-sufficient: the `command.issued` prev-read + head CAS-advance are two non-atomic steps, so concurrent cross-replica emits forked the chain. Affinity routes every trigger (POST /api/events, which fires the drive) to the single replica that `ShardConfig::owns(execution_id)` owns; a non-owner forwards a reverse-proxy POST (one-hop loop guard, degrade-to-local). On the owner the single-process drive lock + in-memory `ChainHeads` make the read→advance atomic, no distributed lock; KV is the genesis/handoff vehicle (owner resolves LOCAL → `kv_remote_hit` → 0). server#252 (5e00d0a, v3.39.0): `src/affinity.rs`, flags `NOETL_EXECUTION_AFFINITY` / `NOETL_PEER_URL_TEMPLATE` / `NOETL_SHARD_INDEX_FROM_HOSTNAME` (all default off), metric `noetl_execution_affinity_total{outcome}`. e2e#71 (66b6e1b): 2-replica StatefulSet topology + rig HARD gate. Multi-replica gate-ON kind PASS: linear/loop/fanout COMPLETE, every chain roots=1/dangling=0/walk==total (NO fork), `forwarded_ok` +9, never-scan (`scans` +0) + sole-writer across replicas; single-replica unchanged; 595 server tests + clippy green; baseline restored. Follow-up #117: off-server `from_events` spine ordered by `event_id` wedges fan-in under a chain-order≠id-order inversion (affinity + high-concurrency fanout); fix = order spine by `prev_event_id` walk; linear/loop already reliable. Prod multi-replica verdict: write-ordering COMPLETE (no fork), so prod can horizontally scale the off-server stack for linear/loop; high-concurrency fan-out needs #117 first. All affinity flags default off; PROD GKE untouched.
-
2026-06-20: ✅ #115 program-scale step 1 SHIPPED: multi-replica coherence DATA LAYER (NATS-KV-backed `ChainHeads` + `ExecDescriptor`); execution-affinity STAGED. server#251 v3.38.0 + e2e#70. `NOETL_REPLICA_COHERENCE=nats_kv` (default `local`, prod unchanged) backs the off-server drive's watermark + descriptor with JetStream KV buckets so 2+ replicas resolve the same value: head advance = CAS (one chain), descriptor = CAS merge; in-process maps = write-through cache / degraded fallback (`local` → bit-identical). server#251 (8f39a79): `src/coherence.rs` (`CoherenceKv` + lazy buckets), `ChainHeads` / `ExecDescriptors` async, `ExecDescriptor` serde, metric `noetl_replica_coherence_total{structure,op,outcome}` (proof `kv_remote_hit`). e2e#70 (e222877): `kind_validate_replica_coherence.sh` + un-staled the #113/#114 offload asserts behind `NOETL_RIG_EXPECT_OFFLOAD` (default false; under `refs_in_state=true` the offload paths legitimately stay flat). Kind-validated: single-replica `nats_kv` is bit-for-bit parity with `local` (linear/loop/fan-out ×2 all COMPLETE, roots=1/dangling=0/walk==total, `state_build_event_scans` +0, hotpath scan +0, sole-writer intact); 2-replica proved cross-replica resolves work (`kv_remote_hit` advanced for head + descriptor, no `kv_unavailable`). Necessary but NOT sufficient: on 2+ replicas concurrent cross-replica emits still fork the chain (the `issuing_event` head-read vs head-advance is non-atomic across replicas → observed forked chains + a cross-execution `prev`), so executions don't reliably COMPLETE on 2+ replicas yet. The remaining piece is execution-affinity (one replica owns an execution's drive + chain write; substrate present in `src/sharding.rs` `shard_for` / `owns`), STAGED as program-scale step 2. 588 server tests + clippy green; baseline restored. ai-meta pointers → server `8f39a79` (v3.38.0) + e2e `e222877`. PROD GKE untouched; default `local`; no gate/mode/builder default changed. Off-server architecture is multi-replica-COHERENT (data) but not yet multi-replica-COMPLETE (write-ordering); prod cutover stays single-replica until affinity lands.
-
2026-06-20: ✅ #115 Phase 5 SHIPPED + gate-ON validated: atomic-working-item context (tenet 6). The drive hands a worker only its minimal declared slice; `NOETL_ATOMIC_ITEM_CONTEXT` (default off). #77 (Explicit Input Binding) resolved. server#250 v3.37.0 + worker#121 v5.40.0 + e2e#69.
-
2026-06-20: ✅ #115 Phase 6 SHIPPED + gate-ON literal-zero validated: hot-path `noetl.event` read class RETIRED; the table is AUDIT-ONLY (server v3.36.0). `NOETL_EVENT_READ_PATH=event_scan|audit_only` (default `event_scan`, prod unchanged) retires the remaining lifecycle readers of `noetl.event` (the `WHERE execution_id` replay class outside the drive). server#249 (b71ca1d): under `audit_only`, `get_catalog_id` (per-ingest) + `inherit_parent_trace` + subscription dedup-audit + container-callback catalog/existence serve from the in-memory execute-time `ExecDescriptor`; a cold descriptor (post-terminal straggler after eviction / restart) resolves catalog_id from `noetl.command` (synchronous queue), never a `noetl.event` scan. Proof metric `noetl_event_hotpath_reads_total{site,outcome}`. ops#199 (e5b0737) pins `event_scan` on the prod server manifest (operator-gated flip). e2e#67+#68 (0ab3c0a) `kind_validate_event_read_path_phase6.sh`. Gate-ON kind-validated (PUBLISH_ONLY + offserver + materializer + audit_only): hot-path `scan` Δ0 (served_descriptor +96 + served_command +3), drive `state_build_total` Δ0 + `event_scans` Δ0 ⇒ ZERO `noetl.event` scans anywhere on the hot path, end-to-end; linear/loop/fan-out/output_select COMPLETE; sole-writer + lag-0; audit still works (direct SELECT + status COMPLETED + replay `event_count=25`); committed gate rig PASS with audit_only on (no regression); 585 server tests + clippy green; baseline restored. Completes the RFC never-scan end state (tenet 3) under the flag. ai-meta pointers → server `b71ca1d` (v3.36.0) + ops `e5b0737` + e2e `0ab3c0a`. PROD GKE untouched; default `event_scan`; no gate/mode/builder default changed. Remainder = Phase 5 (atomic-item, needs #77) + program-scale (per-shard WAL, multi-replica descriptor coherence).
-
2026-06-19: ✅ #115 Phase 4 KERNEL + FLAG SHIPPED + shadow kind-validated: off-server state builder (worker v5.37.0 + server v3.33.0); drive cutover staged. The pool-side `state_builder` (worker#118, `fef961c`) reconstructs `WorkflowState` from the `noetl_events` WAL: a per-execution chain index walks `prev_event_id` head→root, caches the spine keyed by the immutable chain head, advances only the new tail. A live WAL shadow loop (`NOETL_STATE_BUILDER_SHADOW`, default off) + the server `NOETL_STATE_BUILDER=offserver|server` flag (server#246, `3e6006d`, default `server`). Gate-ON kind-validated on live cluster: shadow replayed the WAL → chain-walked spines whose `indexed==spine` sizes match Phase-3 topologies (linear 13, loop 62, fan-out 25, output_select 31, storage_tiers 55) = parity by construction; WAL-read `wal_events_total`=993 with `event_scans_total`=0; cache `cold_rebuild`=28 (replay/restart) + `incremental`=21 (live tail-advance: fresh fan-out indexed `Incremental(5)`, `indexed==spine==25==` DB event_rows 25); fresh fan-out COMPLETED gate-ON, `event_rows==distinct`, 0 `__orchestrate__` event rows, materializer `pending=0` / `project_errors=0`; 8 worker unit tests + 2 server config tests + clippy green; baseline restored. ai-meta pointers → worker `fef961c` + server `3e6006d`. PROD untouched; no default changed. The `offserver` drive cutover (drive consumes the builder's state) + Phase 5 (atomic-item context, #77) / Phase 6 (retire event read path) remain.
-
2026-06-19 — ✅ #115 Phase 3 MERGED — chain-walk state builder (server v3.32.0); Phase 4 (off-server state builder + WAL cache) started. server#245 self-merged (no classifier block) → server v3.32.0 (`8338417`); ai-meta `repos/server` pointer bumped. Behind `NOETL_STATE_BUILD_MODE=chain_walk` (default `event_scan`, prod unchanged) the drive reconstructs `WorkflowState` by walking `prev_event_id` head→root (in-memory `ChainHeads` head + `(execution_id,event_id)` PK lookups — never a `WHERE execution_id` scan) → same `from_events` (orchestrate-core unchanged; parity by construction); falls back to event-scan on cold-head / lag / non-genesis. Gate-ON kind-validated (prior session): parity 41/41 MATCH, `event_scans_total=0` / 1064 PK hops / 0 fallbacks, all topologies COMPLETE, sole-writer + lag-0 + gate rig PASS, 577 tests + clippy green. Phase 4 now in progress — move the chain-walk state construction OFF the server onto the system worker pool reading the WAL/NATS stream, pool-side cache keyed by the immutable chain head + incremental tail-advance (server chain-walk + event-scan remain fallbacks). PROD GKE untouched; no gate default changed. -
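As a rough illustration of the head→root walk with an event-scan fallback, here is a small Python sketch. It is not the server code: `build_event_list`, `fetch_by_pk`, and `scan_execution` are hypothetical names, the in-memory head map stands in for `ChainHeads`, and only the cold-head and missing-row (lag) fallbacks are modelled; the non-genesis case is omitted.

```python
def build_event_list(execution_id, heads, fetch_by_pk, scan_execution):
    """Return (events root-first, path), path being "chain_walk" or "event_scan".

    heads:          dict execution_id -> head event id (assumed in-memory map)
    fetch_by_pk:    assumed callback (execution_id, event_id) -> event dict or None
    scan_execution: assumed callback doing the full per-execution scan
    """
    head_id = heads.get(execution_id)
    if head_id is None:
        # Cold head: nothing to start the walk from.
        return scan_execution(execution_id), "event_scan"

    walked, seen, cursor = [], set(), head_id
    while cursor is not None:
        event = fetch_by_pk(execution_id, cursor)
        if event is None or cursor in seen:
            # Missing row (for example not yet materialized) or a corrupt loop.
            return scan_execution(execution_id), "event_scan"
        seen.add(cursor)
        walked.append(event)
        cursor = event["prev_event_id"]

    walked.reverse()   # the walk runs head -> root; state replay wants root first
    return walked, "chain_walk"
```

Each hop is a single primary-key lookup, so the cost of the walk is proportional to the chain length rather than to the size of the event table.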
2026-06-19 — ✅ #115 Phase 2 implemented + kind-validated — one-level `prev_event_id` event chain (server#244 + noetl#667 merged). Each `noetl.event` carries `prev_event_id` (the immediately-previous event in causal order) + each `noetl.command` the issuing-event link, so per-execution events form a walkable singly-linked list followable pointer-by-pointer without scanning `noetl.event` (additive; no reader yet — Phase 3). The emit chokepoint `emit_events` stamps the event link from a per-execution chain-head watermark (`ChainHeads`) — one path covers drive events + `command.issued` + worker-lifecycle on both the gate-off INSERT and the gate-on publish, the materializer persisting it; the command link = the real `step.enter`/unblocking completion so cursor-fan-out bodies share their branch origin (§4.4). Server-only (no `orchestrate-core` change). Chain-correctness proven gate-ON across 6 executions (linear 13/13, loop 62/62, fan-out 25/25 with a real shared branch origin, sub-playbook 46/46, + Phase-1 `output_select` 31/31 & `storage_tiers` 55/55 bounded): each has 1 root, 0 dangling / 0 duplicate event prev, 1 head, pointer-walk == full sequence (no gaps), real-step command dangling=0; `kind_validate_orchestrate_gate.sh` PASS (sole-writer 25==25, 0 dup cycles, catalog0=0, lag 0); 573 lib tests + clippy green. PROD untouched; no gate default changed. Awaiting merge → ai-meta pointer bump; Phase 3 (chain-walk state builder) next. -
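To make the chain-correctness criteria concrete (1 root, 1 head, 0 dangling, 0 duplicate prev, pointer-walk == full sequence), here is a small Python model. It is only a sketch under stated assumptions: the real `ChainHeads` lives in the Rust server, and `link` and `chain_report` are hypothetical names used for illustration.

```python
from collections import Counter


class ChainHeads:
    """Per-execution watermark of the last emitted event id (illustrative model)."""

    def __init__(self):
        self._heads = {}

    def link(self, execution_id, event):
        # Stamp the event with the current head, then advance the head to it.
        event["prev_event_id"] = self._heads.get(execution_id)
        self._heads[execution_id] = event["event_id"]
        return event


def chain_report(events):
    """Summarise one execution's chain: roots, heads, dangling, duplicates, walk."""
    by_id = {e["event_id"]: e for e in events}
    prevs = [e["prev_event_id"] for e in events if e["prev_event_id"] is not None]
    prev_set = set(prevs)
    heads = [eid for eid in by_id if eid not in prev_set]

    # Walk head -> root by pointer only; a complete chain visits every event.
    walked, seen = 0, set()
    cursor = heads[0] if len(heads) == 1 else None
    while cursor in by_id and cursor not in seen:
        seen.add(cursor)
        walked += 1
        cursor = by_id[cursor]["prev_event_id"]

    return {
        "roots": sum(1 for e in events if e["prev_event_id"] is None),
        "heads": len(heads),
        "dangling": sum(1 for p in prevs if p not in by_id),
        "duplicate_prev": sum(1 for n in Counter(prevs).values() if n > 1),
        "walk": walked,
        "total": len(events),
    }


heads = ChainHeads()
events = [heads.link("exec-1", {"event_id": i}) for i in (101, 102, 103)]
report = chain_report(events)
```

A healthy chain reports `walk == total` with exactly one root and one head; an event stamped with a NULL `prev_event_id` after the first shows up as a second root.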
2026-06-19 — 🐞 #114 oversized-`command.issued` offload shipped (server v3.29.5); refs_in_state consume side (#101) is the remaining off-server-drive cutover blocker. With #113's decode fix in place, 4 large-context fixtures wedged at a distinct second stall: under `refs_in_state=false` the off-server drive embeds the full resolved upstream context into the next command, so its `command.issued` event (~1.32MB) exceeded NATS `max_payload` → publish ack-timeout → wedge. Fix: a command context over `NOETL_COMMAND_CONTEXT_MAX_BYTES` (512KB) is offloaded to `noetl.result_store` with a `{__context_ref__}` marker; `get_command`/`claim_command` resolve it before the worker sees it (metrics `context_offloaded`/`context_ref_resolved`). Server #242 → v3.29.5 + rig phase 8 e2e#64. Kind gate-ON: off-server rig PASS (new `test_oversize_command_context` COMPLETED, max command.issued ctx 585B, offload+resolve fired, 0 `__orchestrate__` event rows, materializer lag 0); every command.issued event <1MB across all fixtures; 6 of #113's 9 now COMPLETE. Chose ref-on-oversize over `refs_in_state=true` (candidate #1): a kind experiment proved `refs_in_state=true` fixes the state-bloat (`kind_playbook_lease_expiry` completes — drive-state 29KB vs ~1MB loop) but breaks the bulk-consuming fixtures (`test_storage_tiers`/`test_output_select` fail at the bulk step) because the worker render-time ref-resolution isn't implemented — so the default stays false. The remaining 3 fixtures + the cutover (#107/#111) now hinge on the refs_in_state consume side (#101) — `__orchestrate__` drive-state bloat (17.4MB for `storage_tiers`) + the `_ref`/bulk-resolve gap. ai-meta → server `385d21f` (v3.29.5) + e2e `9919392`. No prod default flipped; prod is pre-#108 in-server drive (unaffected). -
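The ref-on-oversize idea can be sketched in a few lines of Python. This is an assumption-laden illustration, not the server's code: the dict `store` stands in for `noetl.result_store`, the ref format `ctx/<hex>` and the function names are invented, and the marker is modelled as a single-key object because the entry only names the `__context_ref__` key.

```python
import json
import uuid

COMMAND_CONTEXT_MAX_BYTES = 512 * 1024   # the 512KB limit quoted in the entry
REF_KEY = "__context_ref__"


def offload_if_oversize(context, store, max_bytes=COMMAND_CONTEXT_MAX_BYTES):
    """Return the context unchanged, or a small marker if it is over the limit."""
    encoded = json.dumps(context).encode("utf-8")
    if len(encoded) <= max_bytes:
        return context
    ref = f"ctx/{uuid.uuid4().hex}"       # hypothetical ref format
    store[ref] = encoded                  # `store` stands in for the result store
    return {REF_KEY: ref}


def resolve_context(context, store):
    """Swap a marker back for the stored context; pass anything else through."""
    if isinstance(context, dict) and set(context) == {REF_KEY}:
        return json.loads(store[context[REF_KEY]].decode("utf-8"))
    return context


store = {}
big = {"rows": "x" * 600_000}
issued = offload_if_oversize(big, store)       # what the published event carries
delivered = resolve_context(issued, store)     # what the claiming side hands over
```

The published payload stays a few dozen bytes regardless of the context size, which is what keeps it under the broker's payload limit, while the resolving side sees the original context.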
2026-06-19 — 🐞 #113 off-server drive — recover offloaded drive result + stop drive on cancel (server v3.29.4); #114 opened. Fixed the worker-driven drive stall when an
__orchestrate__result exceeds the 100KB inline budget (worker offloads it with only areference.ref→ server now resolves+decodes it viaresult_store.resolve, metricref_resolved, instead of dropping → non-convergent re-loop) + the cancel non-stop facet (match underscoreplaybook_cancelled+ExecutionState::is_terminalterminal guard, no restart). Server #241 → v3.29.4 + rig e2e#63. Kind gate-ON proven (785KB result →ref_resolved→COMPLETED, 0 decode WARNs; cancel froze a drive-loop instantly; sole-writer lag 0). 5/9 #113 large-context fixtures COMPLETE; the other 4 hit a distinct oversized-command.issued(full upstream context embedded → >1MB NATS payload) stall → #114 (#113 stays open until all 9 close). ai-meta → server1e844c1+ e2e12b27e9. No prod default flipped. -
2026-06-19 — 🚀 #103 GKE pre-flip PREP — prod images pushed, GMP monitoring live, manifests staged; NO traffic flip / NO
PUBLISH_ONLY. Verified prod already-Rust (the #49 cutover is done; pre-#103 live images), both flip secrets present, monitoring = Google Managed Prometheus (not VM). Pushed server v3.29.3 + worker v5.35.0 to the prod AR (amd64); applied + verified GMPPodMonitoring(worker+server/metrics— the noetl ns had none) + materializer-lagRules(up{namespace="noetl"}=4 live); staged the roll-forward manifests (not applied — they roll live workloads); runbook gained a "Production (GKE)" section + GMP managedAlertmanager pager stub. Operator-gated: roll images → materializer shadow → pager → flip. ai-metae5b6d6c→ ops9edd9c4(ops PR #197). No prod default changed. -
2026-06-19 — 🛡️ #103 materializer-lag GUARDRAIL shipped — the pre-flip observability gate. The server was FLIP-READY; the remaining gate was a materializer-lag metric + alert. Worker #116 → v5.35.0 extends the JetStream lag poller to track the `noetl_events`/`noetl_materializer` consumer on an independent task → `noetl_worker_nats_consumer_pending{consumer="noetl_materializer"}` climbs even when the materializer loop is dead. Ops #195+#196: `VMRule` (backlog warning>200/critical>2000/growing + stall-under-gate + project-errors + absent-under-gate, stall guarded on backlog>0), worker `/metrics` `VMServiceScrape` (was unscraped), VMAlert enabled, Grafana dashboard, flip runbook `noetl-cqrs-publish-only-flip.md`. Kind-proven full cycle on the VM stack: green baseline (backlog 0) → induced lag (materializer fault-injected, events publishing under the gate) → gauge 0→684, alerts fire (backlog warning+critical + stall) → recover → drains→0 idempotently (0 dup/loss), alerts clear. ai-meta → worker `b910341` (v5.35.0) + ops `2fcfa59` + worker-wiki `0030f30`. `PUBLISH_ONLY` stays default-off. -
2026-06-19 — 🎯 #103 server cutover COMPLETE — FLIP-READY. The 2 ExecutionService cancel/finalize sites now route through the `emit_event` chokepoint (server #240 → v3.29.3 + e2e rig #62); kind-proven both modes; no remaining synchronous server event writers under the gate. Flipping `PUBLISH_ONLY` on is now a staged operator decision. -
2026-06-19 — 🎯 #104 off-server-drive × gate reconciliation PROVEN — the last real blocker before the
PUBLISH_ONLYflip is operator-safe. The combination #103 left unproven (gate-on was only ever validated with the in-process drive) is green on kind: gate-ON (PUBLISH_ONLY=true) with the off-server drive (PLUGIN_DRIVE=true) + materializer sole writer → fresh exec + cursor fan-out → COMPLETED; server wrote 0noetl.eventrows (all 25 PUBLISHED —event_ingest_published_total=25), materializer materialized all 25 exactly once (25 rows == 25 distinct ids, 0catalog_id=0, 0 dup cycles), drivedispatched=applied, read-your-writes held (the relocated trigger fires post-materialize → server rebuilds state from committed log before bounding the off-server drive input). Server #238 → v3.29.2 (76d29bb): cold-cache apply now rebuildsWorkflowStatefrom the durable log (the #104 WAL-rebuild principle) instead of dropping the in-flight result on a server restart mid-drive — kind crash-recovery proof: hard-kill mid-drive →cold_rebuildmetric+log fires → that exec COMPLETES with full event integrity. Committed e2e rigkind_validate_orchestrate_gate.sh(e2e#61). Regression green: gate-off + in-process (prod default) and gate-off + off-server. ai-meta → server76d29bb+ e2e61f7a5c. Remaining before a safe flip: only the 2 ExecutionService cancel/finalize sites. No prod default changed. -
2026-06-18 — 🎯 #108 (c) — the worker-driven orchestrator drive is now the DEFAULT; #108 CLOSED.
Flipped
NOETL_ORCHESTRATE_PLUGIN_DRIVEto default true (server#233, v3.28.0 → server@80cc0e6). Gated on a scale soak on kind (images built from the released tips, server v3.27.0 / worker v5.33.0): a single 694-drive cursor+fan-out run (test_pft_flow_v23×40) COMPLETED with__orchestrate__rows innoetl.event= 0 (event_suppressed +2082) and all 694 drives claimed on the system pool (shared pool got only the 671 real-step commands), 0 errors; 5× concurrent self-contained cursor = 5/5 COMPLETED. Then deployed the flipped image with no env var — the default-on path reproduced the identical shape (361 drives, system-isolated, 0 burst); 15/15 regression fixtures green; the revert (=false) verified to fall back to the in-process drive (system delta 0). In-processtrigger_orchestrator_innerkept as the fallback. ai-meta → server80cc0e6+ worker437b0be+ server-wiki0210012. -
2026-06-18 — orchestrate drive isolated on the SYSTEM pool via pool affinity (#108 follow-up b, kind-validated).
Server stamps `execution_pool` on the command notification (server#232 → server@846166b); worker declines (ACK+skip) notifications not for its pool segment (worker#114 → worker@e2162b7), so the drive runs on the dedicated system pool even under JetStream consumer-filter drift. No worker HTTP pending-poll — the NATS consumer is the only claim vector. Validated: `__orchestrate__` claimed+executed on the system pool (3), zero on the default pool, simple_python COMPLETED; 553+196 tests green. Only (c) the deliberate default-flip remains. -
2026-06-18 — the orchestrate meta-command touches `noetl.event` ZERO times (#108 follow-up a, kind-validated). `dispatch_orchestrate_command` stops writing `command.issued` to `noetl.event`; the command lives only in `noetl.command`, and `claim_command`/`get_command` fall back to it on a miss (`noetl.event` stays authoritative for normal commands) (server#231 → server@9438f3b). So `__orchestrate__` writes 0 of its former 5 rows per drive — the directive that system playbooks keep only their own state is met. Validated: cursor+fan-out COMPLETED via the `noetl.command` fallback, 0 event rows, 20 real steps normal, 0 errors. Remaining: NATS affinity (ops) + default-flip. -
2026-06-17 — system playbook events no longer burst Postgres + system-pool routing (#108 slice 4b, kind-validated).
The
__orchestrate__meta-command is infrastructure, not a workflow step, so the server now skips persisting its lifecycle events tonoetl.event(handle_event_inner+claim_command) (server#230 → server@6aef3a6). At scale they'd burstnoetl.event/Postgres for no benefit. Validated:__orchestrate__now writes only the lonecommand.issued(1 of 5 — 80% fewer rows); cursor+fan-out flow still COMPLETED. Drive routes to thesystemsegment (true isolation pending a NATS-affinity ops fix; resilient via the pending-poll meanwhile). Follow-ups: eliminate the lastcommand.issued(claim fromnoetl.command) + NATS affinity. -
2026-06-17 — 🎯 the orchestrator drive runs OFF-SERVER on the worker pool (#108 slice 3, kind-validated).
With
NOETL_ORCHESTRATE_PLUGIN_DRIVE=onthe server issuessystem/orchestrate(entry: run_state, args = the boundedWorkflowState) to the worker pool instead of evaluating in-process; the worker runs the drive, the server applies the result on the command'scall.done(server#229 → server@465cdbb v3.23.0). Kind:test/simple_pythondrove start→end→COMPLETED through the round-trip (dispatched=2, applied=2, 0 decode_error,__orchestrate__didn't leak as a step,playbook.completed). Default off, in-process fallback. Bug caught+fixed:output_b64ridescall.donenotcommand.completed. Next: shadow→flip at scale, make drive the default, route to the system pool. -
2026-06-17 — worker-driven cutover slice 2:
apply_orchestration_resultextracted + slice 3 designed (#108). The post-evaluate emission (events → commands → terminal) is extracted verbatim fromtrigger_orchestrator_innerinto a reusable fn (server#228 → server@586aeae) so the worker-driven drive applies a worker-computed result identically. Behavior-preserving (553 tests green, clippy clean). Slice 3 (dispatch) designed + grounded:apply_eventwould phantom-step a meta-command, so the design uses a reserved__orchestrate__step ignored in state, a flag-gated scheduler, apply-on-callback, and loop-prevention. Lands behindNOETL_ORCHESTRATE_PLUGIN_DRIVE(default off), kind-validated before the flip. -
2026-06-17 — worker-driven cutover slice 1: configurable wasm guest entry (#108).
The worker can now dispatch a named plug-in export (worker#113 → worker@04420d0):
tool: {kind: wasm, plugin: {path, version, entry}}names the export (defaultrun); the worker-driven orchestrator will useentry: run_state.invoke_bytes_with_entry+run_by_ref_entry/run_and_apply_by_ref_entry(originals delegate withrun); test proves run→0xAA vs run_state→0xBB + missing-export error. Purely additive, no live change. Server scheduler+apply (the hot-path round-trip, default-off flag) next. -
2026-06-17 — orchestrate plug-in drives the real workload identically, live (#108 slice 4).
The orchestrator runs the plug-in alongside the in-process drive on every evaluation + diffs commands (server#227 → server@bd652ab) — a process-global wasmtime host (feature
orchestrate-shadow) loaded fromnoetl.plugin_moduleat boot, gatedNOETL_ORCHESTRATE_PLUGIN_SHADOW; in-process result authoritative. Kind-validated over the live 10×1000 PFT:noetl_orchestrate_shadow_total{result="match"} 529, ZERO mismatch/error, workers stable. Plug-in gains a state-input path (run_state); both build configs green (default nowasmtime). Slices 1-4 prove orchestrator-as-plug-in end to end; next is the worker-driven cutover. -
2026-06-17 —
system/orchestrate@1registered + servable in a deployed server (#108 slice 3). The server bakes the orchestrate wasm into its image and seeds built-in system plug-ins intonoetl.plugin_moduleon boot (server#226 → server@b21b589, kind-validated). Newsrc/system_plugins.rs(pure dir-scan + sha256, unit-tested) + awasmbuilderDocker stage +NOETL_SYSTEM_PLUGIN_DIR; in-process upsert (not the token-gated HTTP surface); digest-keyed hot-reload. Validated:GET /api/internal/plugins/system/orchestrate?version=1→ 200application/wasm1559093 bytes, ETag=digest, stale→409, baked sha256 == served digest. Next: kernel scheduler dispatches the now-registered plug-in. -
2026-06-17 —
system/orchestrateplug-in runs identically to native in wasmtime (#108 slice 2). A wasmtime shadow-diff (server#225 → server@ccec104) loads the built.wasmthrough a harness mirroring the worker host'sinvoke_bytesABI byte-for-byte and asserts the wasm output equals the native drive over auth0 multi-arcwhen:routing (minijinja in wasm) + cold-start. Finding: command-set identity (parsedValueeq), not raw bytes — thecontextmap serializes in insertion order (serde_jsonpreserve_order← upstream HashMap iteration, differs wasm32 vs host arch); the scheduler deserializes toVec<Command>, so the command set is the bar. 2 unit + shadow-diff green; plug-inexcluded; test-only. Next: catalog register/serve → kernel scheduler. -
2026-06-17 —
system/orchestrateWASM plug-in exists — drive core runs as a 0-import module (#108 slice 1). New standaloneplugins/orchestrate/crate (server#224 → server@10a629b) wraps the drive behind the worker plug-in ABI (input = JSON event-slice + playbook; output = JSONOrchestrationResult; data-plane =memory/alloc/run) and compiles towasm32-unknown-unknown— the first non-trivial compiled system playbook. Feasibility risk retired: the.wasmhas 0 imports (no WASI, no hostrender) — the whole drive incl. minijinja runs in-guest. Native parity test reproduces nativeevaluatebyte-for-byte; 551 server tests green, crateexcluded from the workspace. Next: worker-host shadow-diff → catalog register/serve → kernel scheduler (NOETL_ORCHESTRATE_PLUGIN, default off). -
2026-06-17 — Orchestrator drive core fully wasm-resident — Event-ABI round #109 CLOSED.
Slice 3 (server#223) moved
orchestrator/evaluatefromsrc/engine/intonoetl-orchestrate-core. All 6 drive modules (renderer, playbook model, commands, evaluator, state, orchestrator switch) now compile native +wasm32-unknown-unknown— thesystem/orchestrateplug-in seed (#108).evaluatereads the purecore::event::Event; server convertsdb::Eventat thetrigger_orchestratorboundary (slice-1From). 122 core + 565 server tests green, 0 WASI imports on wasm32, clippy clean; cargo-chef image (v3.20.0) kind-deployed, PFT 10×1000 — full command lifecycle, 0 errors, 0 restarts. ai-meta → serverbfd3f77(internal refactor, stays v3.20.0). -
2026-06-14 — Transfer tool: Snowflake↔Postgres both directions — #99 CLOSED.
Both transfer arms implemented with full credential-alias resolution. tools v3.10.0 (tools#65) + worker v5.22.0 (worker#87) + e2e#58. SF→PG:
$n::text::<udt>coercion + RFC3339 timestamp reformat; PG→SF: generated INSERTs. Full bidirectionaldata_transfer/snowflake_postgresfixture COMPLETED on kind against livesf_testaccount. tools → 4127b4b · worker → 6d97e7c · e2e → 94aa7f1. -
2026-06-14 — Snowflake key-pair JWT validated end-to-end — #98 last external-tool gap closed; transfer step → #99.
noetl-tools v3.9.0 / v3.9.1 / v3.9.2 (tools#62/#63/#64) — key-pair JWT auth (bypasses MFA) + User-Agent fix (code
391903) + SQL-API context-in-body + multi-statement split (codes391911+000008). Worker bumped to v3.9.2 (worker#83–#86) + e2e#57 fixture cleanup.create_sf_database(CREATE DATABASE) +setup_sf_table(CREATE TABLE + INSERT) both COMPLETED via JWT on kind against the livesf_testaccount (NDCFGPC-MI21697). Transfer step fails (inline creds, no key-pair fields) — filed #99. ai-meta pointers: tools a216ab2 · worker 9d6b127 · e2e e191231. -
2026-06-12 — #90 Phase 7 shipped — scale hardening; #90 CLOSED (all 7 phases complete), live proof green.
Final phase: server v3.5.0 (server#189) `POST /api/execute/batch` (N→N, partial-failure contained) + opt-in exactly-once dedup window (`noetl.subscription_dedup`, bounded-by-age, race-safe, default off); worker v5.19.0 (worker#79) batch dispatch + dedup opt-in + per-subscription rate limits (deterministic token-bucket `RateGovernor`, fetch-side backpressure → source keeps backlog, no loss, `subscription.rate_limited` event); ops (ops#176) + e2e (e2e#48); no tools change → no crate cascade. Live on kind: batch 12→12 COMPLETED on the subscription pool + per-message traceparent; dedup duplicate→1 execution + `subscription.message.deduplicated`; rate-limit engaged + 10/10 → executions (no loss); direct-curl within/outside-window + dedup-off + batch partial-failure all green. ai-meta → server `7b217d8` + worker `7531f4a` + ops `6db69b9` + e2e `203593b`. #90 closed; follow-ups tracked: #91–#94 + tools#57. -
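For readers unfamiliar with the pattern, a deterministic token bucket with fetch-side backpressure looks roughly like the Python sketch below. Only the name `RateGovernor` comes from the entry; its constructor, the `try_acquire` method, and the `drain` helper are assumptions made for illustration. Time is passed in by the caller, which is what makes the behaviour reproducible in tests.

```python
class RateGovernor:
    """Token bucket driven by caller-supplied timestamps (illustrative sketch)."""

    def __init__(self, rate_per_sec, burst, now=0.0):
        self.rate = float(rate_per_sec)
        self.burst = float(burst)
        self.tokens = float(burst)    # start with a full bucket
        self.last = now

    def try_acquire(self, now, n=1):
        # Refill for the elapsed time, capped at the burst size.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False


def drain(backlog, governor, now):
    """Fetch only what the governor allows; the rest stays queued at the source."""
    taken = []
    while backlog and governor.try_acquire(now):
        taken.append(backlog.pop(0))
    return taken


governor = RateGovernor(rate_per_sec=2, burst=3)
backlog = list(range(10))
first = drain(backlog, governor, now=0.0)    # limited by the burst
second = drain(backlog, governor, now=1.0)   # limited by one second of refill
```

Messages that are not fetched are never dropped; they remain in `backlog`, which mirrors the "source keeps backlog, no loss" behaviour described above.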
2026-06-12 — #90 Phase 6 shipped — CLI local
noetl subscribe+FileEventSink+local_diskspool (live local proof green). Addednoetl subscribe <spec.yaml>(cli v4.11.0, cli#60, closes cli#59): akind: Subscriptionlistener run standalone in local mode — no k8s, no NATS-dispatch server for the listening itself — reusing the samenoetl_toolssource clients + directive engine + spool engine the in-cluster worker uses, emitting the sameExecutorEventenvelope to a localFileEventSink(one event/line JSONL → replayable trail). Local dispatch (RFC §5.3): in-process viaPlaybookRunner(pure-local default) orPOST /api/execute.local_diskspool (§8.6): circuit-breaker + buffer + ordered replay + idempotency + dead-letter against a local dir, circuit state in a local file. Newsrc/subscribe/{mod,spec,sink,dispatch,runtime,spool}.rs+examples/subscribe/.cli-only — no tools change / crate cascade (the source+spool surface ships innoetl-toolsv3.5.0; bumps the lock 3.0.0 → 3.5.0 via the executor's"3"). Tests: 12 subscribe + full bin suite (53) green, incl. a deterministic outage→spool→ordered-replay→idempotency proof on the real engine. Live (in-cluster NATS on kind): 5 msgs →received=5 dispatched=5 failed=0(19-event JSONL trail); local_disk spool outage → 6message.spooled(0 dispatched, no loss) → recovery → 6message.replayedin order → drained to 0. Finding: the NATS source ignores URL-embeddeduser:pass(async-natsConnectOptions) — specs use explicituser/password. ai-meta → cli2fb3fb0(v4.11.0); wiki clisubscribe. #90 stays open for Phase 7 (scale hardening, volume-gated). -
2026-06-11 — #90 Pub/Sub + Kafka brought to live-E2E parity with NATS (validation gap closed).
Stood up the two remaining subscription brokers in kind — Pub/Sub emulator (gcloud SDK image) + single-broker KRaft
apache/kafka:3.9.1— undernoetl/ops(ops#170), and added bounded-drain fixtures + kind-validate runners undernoetl/e2e(e2e#41). Both backends passed the same live bar as NATS: publish/produce 5 → bounded drain count=5 acked=true → execution COMPLETED →call.done/command.completed/playbook.completedevent trail. No adapter code change needed — the pure-Rustkafkacrate talks to Kafka 3.9 KRaft and the Pub/Sub REST backend works against the emulator as-is. The one fix: the<step>.output.<field>accessor never resolved (bothwhen:arcs skipped → drain stalled); corrected to<step>.<field>in the fixtures + the latent opssubscription_drain.yamlexample. Validated on server v3.1.0 + worker v5.15.2 + tools v3.2.0; cluster left on that clean released stack. ai-meta → ops568a4ac+ e2e8d21e7a. #90 stays open (Phases 2–7 design-only). -
2026-06-11 — #89 shipped — JSON `null` round-trips through `{{ step }}` (server fix, v3.0.6). #89 — `null`→`undefined` serialization — CLOSED. The #88 cursor fixture walked all 4 pages but its 4th `check_pagination` crashed: the terminal page's `next_cursor: null`, re-injected via the whole `{{ fetch_page }}` envelope, rendered as the JS token `undefined` (invalid JSON), so the consuming Python step received `response` as a `str`. Traced the corrupt `command.issued` `args.response` to the renderer that builds next-step inputs — the server orchestrator (`src/template/jinja.rs::render_to_value`), not the worker the issue blamed. `json_value_to_minijinja` maps JSON `null`→`Value::UNDEFINED`; minijinja's map repr emits `undefined`; `render_to_value` failed `from_str` and fell through to a raw string. The noetl-tools engine already had a `| tojson` retry for exactly this; the server's copy had diverged without it. Fix (server#177, v3.0.6) ports the retry. 5 new tests; 619 lib + 8 parity green; clippy clean. Kind-validated end to end on the live test-server (baseline 4th `check_pagination` `error` → fixed `success`; cursor collects 35, matching offset). ai-meta → server `8e17fbe`. Standing direction honored — Claude wrote the Rust directly, no Codex. -
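The failure mode and the retry can be mimicked in plain Python. This is a simplified model, not the server's Rust: `repr_render` is a hypothetical stand-in for the template engine's default map rendering (where a null prints as `undefined`), and the Python `render_to_value` below only borrows the name of the real function to show the control flow.

```python
import json


def repr_render(value):
    """Stand-in for an engine repr in which null prints as `undefined`."""
    if value is None:
        return "undefined"
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, dict):
        items = (f"{json.dumps(k)}: {repr_render(v)}" for k, v in value.items())
        return "{" + ", ".join(items) + "}"
    if isinstance(value, list):
        return "[" + ", ".join(repr_render(v) for v in value) + "]"
    return json.dumps(value)


def render_to_value(value, retry_tojson):
    """Render, parse as JSON, and optionally retry through a JSON serializer."""
    text = repr_render(value)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        if retry_tojson:
            # The retry: serialize as real JSON so null survives the round trip.
            return json.loads(json.dumps(value))
        return text   # raw-string fall-through, the pre-fix behaviour


page = {"has_more": False, "next_cursor": None}
before = render_to_value(page, retry_tojson=False)   # a str containing "undefined"
after = render_to_value(page, retry_tojson=True)     # the original dict
```

Without the retry the consumer receives a string it cannot parse; with it, the terminal page's `next_cursor` arrives as a proper null.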
2026-06-10 — #88 shipped — pagination fixtures read `response.body.*`; #89 filed. #88 — offset/cursor pagination fixture path — CLOSED. The Rust `http` tool nests the parsed JSON payload under `body` (`{{ fetch_page }}` → `{body, headers, status_code}`); the fixtures read `response.get('data', {})`, which resolved to `{}`, so `has_more`/`next_cursor` defaulted falsy and the loop exited after page 1 despite the correct post-#85 machinery. Confirmed the shape against a live http-tool result, then switched both `check_pagination` steps to `response.get('body', {})` (e2e#40). Kind-validated: offset walks `0→10→20→30`, users 10/10/10/5, `validate_results` success 35, `playbook.completed COMPLETED`; cursor path-fixed + walks all 4 pages (Mg==→Mw==→NA==→null, 35 events fetched) but the terminal page surfaced a distinct worker bug → #89 (worker serializes `next_cursor: null` as JS `undefined` when re-injecting `{{ fetch_page }}`, so the consuming Python step gets an unparseable str). Other pagination fixtures (retry/max_iterations/pipeline*/loop_with_pagination) share the same envelope-key assumption over `/api/v1/assessments|flaky` (`{data, paging}`) — flagged, left for follow-up. ai-meta → e2e `72a7525`. -
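The envelope mismatch is easy to reproduce in isolation. The snippet below is a sketch, not the fixture itself: `read_page` is a hypothetical helper, and placing `has_more` / `next_cursor` directly under `body` is an assumption about the payload layout.

```python
def read_page(response, key):
    """Pull the pagination fields out of the envelope under the given key."""
    payload = response.get(key, {})
    return {
        "has_more": bool(payload.get("has_more", False)),
        "next_cursor": payload.get("next_cursor"),
    }


# Shape of the http tool result as described above: {body, headers, status_code}.
envelope = {
    "status_code": 200,
    "headers": {},
    "body": {"has_more": True, "next_cursor": "Mg=="},
}

old = read_page(envelope, "data")   # key absent -> {} -> everything falsy
new = read_page(envelope, "body")   # reads the nested payload
```

With the old key the loop condition is false after the first page, which matches the "exited after page 1" symptom; with `body` the cursor is picked up and the walk continues.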
2026-06-10 — #87 shipped, #85 deferred (e2e sweep follow-ups #85/#87).
#87 — multi-tool sibling references — CLOSED. `task_sequence` (the `tool: [list]` pipeline runtime) stored each sub-tool's result for the aggregated output but never injected it into the running context, so a later sub-tool's `{{ <label>.<field> }}` rendered empty — masked in quoted positions, a `syntax error at or near ","` in unquoted numeric SQL (`save_edge_cases` `test_large_payload`). Fix (tools#48, v3.1.1) injects each sub-tool's result under its label (synthetic `.data` self-ref); worker adopts via worker#69. Kind-validated on a worker built from the fix: `save_edge_cases` `test_large_payload` → `record_count = 100` (no syntax error), `save_delegation_test` clean. ai-meta → tools `76f942a` + tools-wiki `4962f8b` + worker `b97f642`. #85 — workflow-arc loop re-entry — DEFERRED (kept open). Implemented the dispatch-guard layer (draft server#176): a back-edge detector (cycle + recency) re-enters a completed loop head, so the loop no longer hangs (608 lib tests + 5 new pass). But kind validation surfaced a second blocker — `set: ctx.X` loop variables are recomputed per orchestrator pass and revert to the workload default when the producing step is re-dispatched (a minimal counter-loop thrashes `0,0,1,0,1,2,…`). Full multi-page pagination needs durable event-sourced ctx propagation across iterations — larger than is safe to land well-tested in one session; held as a draft, not merged. Standing direction honored: Claude wrote all Rust directly (no Codex). -
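The #87 symptom (an empty value landing in an unquoted SQL position) and its fix can be modelled with a toy pipeline. Everything here is illustrative: `run_task_sequence`, the `fetch` / `save` sub-tools, and the SQL text are hypothetical, and Python string formatting stands in for template rendering.

```python
def run_task_sequence(tasks, inject_results):
    """Run labelled sub-tools in order, optionally exposing results to later ones."""
    context, results = {}, {}
    for label, tool in tasks:
        output = tool(context)
        results[label] = output          # always kept for the aggregated output
        if inject_results:
            context[label] = output      # the fix: later sub-tools can read it
    return results


def fetch(_context):
    return {"record_count": 100}


def save(context):
    # Mimics `{{ fetch.record_count }}` rendering empty when the label is absent.
    count = context.get("fetch", {}).get("record_count", "")
    return {"sql": f"INSERT INTO t (n, note) VALUES ({count}, 'large')"}


pipeline = [("fetch", fetch), ("save", save)]
broken = run_task_sequence(pipeline, inject_results=False)["save"]["sql"]
fixed = run_task_sequence(pipeline, inject_results=True)["save"]["sql"]
```

In the broken run the statement contains `VALUES (, 'large')`, the kind of text a SQL parser rejects at the comma; a quoted position would have hidden the problem by yielding an empty string instead.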
2026-06-10 — #80 closed — container_callback chain green end to end.
Fixing the watcher's missing `curl` (the literal #80 goal) surfaced two more layered bugs beneath it. Watcher image (ops#168): the manifest used the retired `bitnami/kubectl:1.30.3` (removed from Docker Hub; the live cluster was patched to the `bitnamilegacy` archive) with a runtime `apt/apk install` step that never put `curl` on PATH → callback POST returned HTTP 000. Switched to `alpine/k8s:1.30.3` (kubectl + jq + curl baked in), dropped the install hack. Server insert (server#173, v3.0.3): once curl worked the POST reached the server and 500'd — the container-callback handler inserted `call.done` via a stale query targeting an `attempt` column that doesn't exist on the deployed `noetl.event`; fixed to the working `handlers::events` column set. OOM path: the watcher only read Job-level conditions so `failed_oom` could never fire — added pod-level `OOMKilled`→`failed_oom` classification (ops#168); the `completed_at` fallback for failed Jobs used bare jq `now` (numeric epoch → HTTP 422), fixed to RFC3339 `now | todate`; and the e2e fixture's `bytes(40MiB)` was calloc-lazy (mapped to the zero page, never faulted in) so the container exited 0 — switched to a written-into `bytearray` that dirties pages and reliably OOM-kills (e2e#38). Verified the kind cluster actually enforces memory limits (120 MiB in a 32Mi pod → `OOMKilled` exit 137). Rebuilt the server image + reloaded into kind; `kind_validate_container_callback.sh` both probes GREEN — happy_path → succeeded (delta 1), oom → failed_oom (delta 1). This is the last blocker on the #43 container-callback chain. ai-meta → ops `cacc513` + server `5d2cf58` (v3.0.3) + e2e `6aaf06e`. -
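The "written-into `bytearray`" trick amounts to touching every page so the kernel has to back it with real memory. A minimal sketch follows; `allocate_and_touch` is a hypothetical name, the 4096-byte page size is an assumption, and the demo size is kept tiny so the snippet is safe to run anywhere.

```python
def allocate_and_touch(size_bytes, page_size=4096):
    """Allocate a buffer and write one byte per page so each page is dirtied."""
    buf = bytearray(size_bytes)
    for offset in range(0, size_bytes, page_size):
        buf[offset] = 1    # a non-zero write forces the page to be resident
    return buf


# Small demo size; a fixture would use a size well above the pod's memory limit.
buf = allocate_and_touch(10_000)
touched_pages = sum(buf)   # one byte set per page
```

Holding a reference to the returned buffer matters as much as the writes: if it is garbage-collected immediately, the resident set drops again before the limit is hit.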
2026-06-10 — #79 closed — e2e kind-val runners back on the current
noetlCLI surface. Bothscripts/kind_validate_*.shrunners aborted immediately onerror: unrecognized subcommand 'playbook'— they targeted the retirednoetl playbook register/execute+noetl execution status/eventsverbs. The validation logic and the event taxonomy (step.enter/command.completed/node_name/ the fan-in barrier) were intact; only the invocation layer had drifted. Fix (e2e#37):noetl register playbook --file,noetl exec <catalog-path> --runtime distributed --json(exec bymetadata.path, not the bare name),noetl status <id> --json, and the event log overnoetl query(noeventsverb today — rows wrap under.result, order byevent_idsincenoetl.eventhas no timestamp column). Added a fail-fast CLI-surface guard to each runner. Validated on kind (server-rust v3.0.1 + worker-rust, :8082): fanout_reduce PASS start-to-finish with no manual workaround; container_callback drives register→exec→COMPLETED cleanly and stops at the metric-delta assertion because the deployednoetl-k8s-watcherimage lackscurl(watcher.sh: curl: not found→ HTTP 000) — a cluster-side watcher gap tracked on #80. Version-skew note: PATH binary isnoetl 2.17.0,repos/clisubmodule is v4.10.0; the targeted surface is identical across both, so the runners work on either (the binary lags the submodule by a major line — worth refreshing for parity, not required here). Pointer: e2e →a3594b3; e2e wiki: new Kind-Val Runners page. -
2026-06-10 — #82 closed — GUI credential View/Edit recovered for pre-wallet records.
The Secrets Wallet (#61) moved credential storage to
forward-only envelope encryption; pre-wallet records now 500 on
GET /api/credentials/{id}?include_data=true(Decryption failed: aead::Error), so the GUI View/Edit flow dead-ended on a generic toast (response shape unchanged). Fix (gui#36): View surfaces the real reason + points to Edit; Edit still opens with the list-row metadata (name/type/description/tags) + a warning banner and an empty-but-required data field, so re-entering the secret and saving re-seals the record under the current wallet — recovering it. Validated live against kind + thedev:kindUI on :3001. Also landed e2e#36 (duplicateworkloadprobe-flag keys removed fromtooling_non_blocking) and gui#35 (dev:kindconvenience script). Pointers: gui →8cacc9e(v1.11.1), e2e →4a9ffbc. -
2026-06-10 — #81 closed — noetl-server v3.0.2 fixes the container-tool `command` type contradiction. `ToolSpec.command` was `Option<String>` (scalar) but the `container` tool kind writes a K8s-Job-style array — an array failed the server's `ToolDefinition` untagged-enum match (400), a scalar was rejected by the worker's `ContainerConfig.command: Option<Vec<String>>`. Typed `command` as `Option<serde_json::Value>` (same as `args`); `ToolCall::from_spec` forwards it verbatim. 2 regression tests; clippy clean (server#172, v3.0.2). Kind-val GREEN end-to-end: server accepts the array `command`, worker creates the K8s Job, Job reaches `Complete 1/1`. Server pointer bumped (ai-meta → server `bd36672`). Chain counter-bump validation stays gated on #79 (runner CLI) / #43. -
2026-06-09 — E2E sweep cleanup — noetl-tools v3.1.0 + noetl-server v3.0.1.
Stripped the diagnostic `tracing::debug!` scaffolding added during the e2e triage, kept the production fixes: YAML `when: true` boolean + `|tojson` object-template fallback (tools#47), 64 MB result-store body limit + pipeline command/spec stash (server#171). Pointers bumped (ai-meta@316048c tools, @6590bd6 server); tracks #49. All 7 sweep playbooks PASS on Rust-only kind. Worker crates.io dep-revert deferred — v3.1.0 not yet on crates.io (`[skip ci]` release commit). -
2026-06-08 — noetl-tools v2.24.2 clippy cleanup + noetl/server#22 closed.
Cleared the clippy `-D warnings` CI gate on noetl-tools (15 warnings across 7 files; all mechanical lint fixes). Closed stale noetl/server#22 (Phase D orchestrator engine port — complete). noetl/server PR #167 (same clippy shape) opened, awaiting merge. -
2026-06-05 — Rust-only regression rig — canonical v10 SQL + http config shapes.
Swept ~30 self-contained e2e fixtures against the Rust-only kind stack
and fixed three config-shape classes in noetl-tools: postgres
command:alias + multi-statement SQL (tools#24, v2.18.3), a task_sequence→duckdb regression test (tools#25), and the duckdbcommand:alias + httpparams/headers/formnon-string coercion (tools#26, v2.18.4). Worker adopted both (worker#50, worker#51). Newly GREEN:duckdb_test,json_serialization_save,duckdb_retry_query,pagination/{offset,cursor,max_iterations,pipeline},retry_simple_config. Recovered the cluster first (server had latched intoNATS not configuredafter a podman restart). Server-side follow-up noted:loop_with_paginationrenders{{ execution_id }}empty in a multi-statement postgrescommand. -
2026-06-05 — postgres-tool observability — real SQLSTATE errors.
noetl-tools 2.18.2 (tools#21) + worker dep bump (worker#49): the
postgres tool surfaces the real SQLSTATE + message instead of the
opaque
db error. Validated end-to-end — a bad query reportsERROR: relation "..." does not exist (SQLSTATE 42P01)in thecall.errorevent. Closes the last follow-up from the credential/iterator saga. -
2026-06-05 —
iterator_save_testGREEN — full v10 + credential + iterator-pipeline surface validated. server#73 (v2.19.7) defers task_sequence_prev/_resultsrefs at command-build so nested-pipeline templates render at runtime.iterator_save_testreachesplaybook.completedand writes 3 rows to the realdemo_noetlDB — the deepest v10 path (iterator → pipeline →_prevchaining → nested credential → postgres write). Closes the credential + iterator + pipeline chain (server#71, worker#46, worker#48, server#73). -
2026-06-05 — Nested-pipeline credentials + template-timing finding.
worker#48 (v5.11.3) — the worker now pre-resolves keychain aliases
on task_sequence SUB-tasks;
iterator_save_test's nestedsave_itempostgres step connects todemo_noetl. Closes the credential-path chain (store → alias-key → nested resolution, all validated). Lastiterator_save_testblocker found + filed: server#72 — the server pre-renders task_sequence{{ _prev.* }}refs (runtime-only) to empty → malformed SQL (a symptom the v2.19.5 Chainable change surfaced). -
2026-06-05 — Keychain-credential path validated on Rust-only.
Continuing R5 Tier 4, registered the
pg_k8spostgres credential and probed the DB-backed fixtures. Surfaced + fixed a 3-bug chain in the keychain subsystem: credential store bound AES-GCMVec<u8>to a TEXT column (server#71, v2.19.6); alias resolution read only theauth:key not v10'scredential:(worker#46, v5.11.2). Proven:iterator_save_test'screate_tableconnects + runs DDL against the realdemo_noetlDB. Third bug — nested-pipeline credentials (task_sequence sub-tasks bypass worker resolution) — filed as worker#47 for a follow-up round. Session details -
2026-06-05 — v10 control-flow runs end-to-end on Rust-only.
Phase F R5 Tier 4 re-probe found + fixed 7 more bugs across the
Rust stack (server v2.19.5 server#69 6 commits, worker v5.11.1
worker#44, tools v2.18.1). Four v10 fixtures now reach
playbook.completed—start_with_action,end_with_action,loop_test,control_flow_workbook;actions_testcorrect-fails on a missingTEST_SECRETenv. Root-cause chain: catalog SQL type drift → ToolSpec null-serialization → worker array-config drop → orchestrator end-step skip + task_sequence label-wrap → minijinjaLenient-vs-Chainableundefined → end-step trigger gate. Also: rust-analyzer workspace setup + rule (ai-meta@38287b7). Session details -
2026-06-04 (late evening) — Rust-only e2e complete + legacy cleanup.
Six interlocking server gaps closed in one iteration (#55–#60)
- two noetl-tools fixes (#15, #16); worker dep bump; kind cluster legacy Python deployments retired. control_flow_workbook runs fully end-to-end on the Rust-only stack. Standing direction pinned: Rust-only focus, ignore Python tasks. Session details
- 2026-06-04 (afternoon) — Pipeline + failure termination + workbook resolution. Three server PRs landed together as v2.19.3 (#61, #63, #65).
- 2026-06-04 (morning) — EE-5 lax decode + workload + input alias. v2.19.1 + v2.19.2 — unblocked Rust worker → Rust server emission + canonical v10 playbook compatibility.
- 2026-06-04 (early morning) — Phase F R4-5 + R4 complete. N=2 shard kind validation script + ExecutionService refactor.
- 2026-06-03 — Phase F R4 series — DbPoolMap N+1 pool layer, AppState wiring, per-execution handler cutover, cluster-wide list fan-out.
- 2026-06-02 (afternoon) — Architecture pivot: rest of migration moves to system playbooks. Closed #30, #45; promoted #46.
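The multi-statement SQL support in the 2026-06-05 entries above (tools#24) rests on splitting one `command` string into individual statements. The following is a minimal sketch of a quote-aware splitter, not the noetl-tools implementation: `split_statements`, `dollar_tag_at`, `find_from`, and `flush` are hypothetical names chosen for illustration. It assumes statements are separated by `;` and that a semicolon inside a single-quoted literal or a dollar-quoted block (`$$ … $$` or `$tag$ … $tag$`) must not split. SQL comments and identifiers containing `$` are deliberately not handled.

```rust
/// Return the dollar-quote opener (`$$` or `$tag$`) starting at `start`, if any.
/// Positional parameters such as `$1` are not treated as openers.
fn dollar_tag_at(chars: &[char], start: usize) -> Option<Vec<char>> {
    if chars.get(start) != Some(&'$') {
        return None;
    }
    let mut end = start + 1;
    while let Some(&c) = chars.get(end) {
        if c == '$' {
            return Some(chars[start..=end].to_vec());
        }
        let starts_with_digit = end == start + 1 && c.is_ascii_digit();
        if starts_with_digit || !(c.is_alphanumeric() || c == '_') {
            return None;
        }
        end += 1;
    }
    None
}

/// Find the first occurrence of `needle` in `chars` at or after `from`.
fn find_from(chars: &[char], from: usize, needle: &[char]) -> Option<usize> {
    if needle.is_empty() || from > chars.len() {
        return None;
    }
    chars[from..]
        .windows(needle.len())
        .position(|w| w == needle)
        .map(|p| p + from)
}

/// Push the trimmed statement (if non-empty) and reset the buffer.
fn flush(statements: &mut Vec<String>, current: &mut String) {
    let trimmed = current.trim();
    if !trimmed.is_empty() {
        statements.push(trimmed.to_string());
    }
    current.clear();
}

/// Split on `;` outside single-quoted literals and dollar-quoted blocks.
fn split_statements(sql: &str) -> Vec<String> {
    let chars: Vec<char> = sql.chars().collect();
    let mut statements = Vec::new();
    let mut current = String::new();
    let mut i = 0;
    while i < chars.len() {
        let c = chars[i];
        let end = if c == '\'' {
            // Copy the whole literal; an unterminated literal runs to the end.
            match find_from(&chars, i + 1, &['\'']) {
                Some(p) => p + 1,
                None => chars.len(),
            }
        } else if let Some(tag) = dollar_tag_at(&chars, i) {
            // Copy the whole dollar-quoted block, including both tags.
            match find_from(&chars, i + tag.len(), &tag) {
                Some(p) => p + tag.len(),
                None => chars.len(),
            }
        } else if c == ';' {
            flush(&mut statements, &mut current);
            i += 1;
            continue;
        } else {
            i + 1
        };
        current.extend(chars[i..end].iter());
        i = end;
    }
    flush(&mut statements, &mut current);
    statements
}

fn main() {
    let script = "CREATE TABLE t (id int); \
                  CREATE FUNCTION f() RETURNS int AS $$ BEGIN RETURN 1; END; $$ LANGUAGE plpgsql; \
                  INSERT INTO t VALUES (1)";
    for (n, stmt) in split_statements(script).iter().enumerate() {
        println!("{}: {}", n + 1, stmt);
    }
}
```

In this sketch the function body containing `RETURN 1; END;` stays inside a single statement, which is the behavior the dollar-quote-aware splitter in tools#27 is described as restoring.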
See Releases for the per-repo release log with links to GitHub Releases pages.
Recent (2026-06):
- 2026-06-20 — `noetl/server` v3.39.0 — ✅ #116 program-scale step 2: execution-affinity single-owner WRITE ORDERING (multi-replica gate-ON validated) — closes the chain-fork race step 1 left open. Affinity routes every trigger for an execution to the replica that `ShardConfig::owns` it (non-owner forwards a reverse-proxy POST); owner's single-process drive lock + in-memory `ChainHeads` make the read→advance atomic, no distributed lock; KV = genesis/handoff vehicle (`kv_remote_hit`→0). `src/affinity.rs`; `NOETL_EXECUTION_AFFINITY` / `NOETL_PEER_URL_TEMPLATE` / `NOETL_SHARD_INDEX_FROM_HOSTNAME` (all default off); metric `noetl_execution_affinity_total{outcome}` (server#252, `5e00d0a`). 2-replica gate-ON kind PASS: chains roots=1/dangling=0/walk==total (NO fork), `forwarded_ok` +9, never-scan + sole-writer across replicas. Follow-up #117 (off-server `from_events` spine `event_id`-order wedges fan-in under inversion).
- 2026-06-20 — `noetl/server` v3.38.0 — ✅ #115 program-scale step 1: multi-replica coherence DATA LAYER — `NOETL_REPLICA_COHERENCE=nats_kv` (default `local`, prod unchanged) backs `ChainHeads` + `ExecDescriptor` with JetStream KV buckets (head CAS + descriptor CAS merge); `src/coherence.rs`; metric `noetl_replica_coherence_total{structure,op,outcome}` (server#251, `8f39a79`). Kind: single-replica parity with `local`; 2-replica cross-replica resolves proven. Necessary-not-sufficient → execution-affinity STAGED (2+ replicas still fork the chain).
- 2026-06-20 — `noetl/server` v3.37.0 — ✅ #115 Phase 5 atomic-working-item context (tenet 6): `input_binding` + `NOETL_ATOMIC_ITEM_CONTEXT` (default off) (server#250, `a96ade8`).
- 2026-06-20 — `noetl/worker` v5.40.0 — ✅ #115 Phase 5 forward the atomic-item-context flag onto the off-server `from_events` drive (worker#121, `2484d17`).
- 2026-06-20 — `noetl/server` v3.36.0 — ✅ #115 Phase 6 retire the hot-path `noetl.event` read class; the table is AUDIT-ONLY (server#249, `b71ca1d`): `NOETL_EVENT_READ_PATH=event_scan|audit_only` (default `event_scan`, prod unchanged). Under `audit_only` the remaining lifecycle readers (`get_catalog_id`, `inherit_parent_trace`, dedup-audit + container-callback catalog/existence) serve from the in-memory `ExecDescriptor`; cold → `noetl.command` (synchronous queue) — never a `noetl.event` scan. Proof metric `noetl_event_hotpath_reads_total{site,outcome}`. Gate-ON kind-validated: hot-path scan Δ0 + drive `state_build_total`/`event_scans` Δ0 ⇒ ZERO `noetl.event` scans on the hot path, end-to-end; audit/replay still work; 585 tests + clippy green. RFC never-scan end state (tenet 3) reached under the flag.
- 2026-06-19 — `noetl/worker` v5.37.0 — ✅ #115 Phase 4 off-server state-builder kernel + WAL shadow loop (worker#118, `fef961c`): pool-side per-execution chain index sourced from the `noetl_events` WAL; `chain_walk()` head→root spine in `event_id` order (parity by construction); cache keyed by the immutable chain head (CacheHit / Incremental tail-advance / ColdRebuild). Live WAL shadow loop (`NOETL_STATE_BUILDER_SHADOW`, default off) + metrics. Gate-ON kind-validated (993 WAL events, 0 noetl.event scans, 28 cold + 21 incremental); 8 unit tests + clippy green; default off. Drive cutover staged.
- 2026-06-19 — `noetl/server` v3.33.0 — ✅ #115 Phase 4 `NOETL_STATE_BUILDER=offserver|server` flag scaffold (server#246, `3e6006d`, default `server`): the server-side flag for the off-server state-builder drive cutover (staged). 2 config tests; no prod default changed.
- 2026-06-19 — `noetl/server` v3.32.0 — ✅ #115 Phase 3 chain-walk state builder (flagged, default-off) (server#245, ai-meta pointer bumped): behind `NOETL_STATE_BUILD_MODE=chain_walk` the drive reconstructs `WorkflowState` by following the one-level `prev_event_id` chain head→root (in-memory `ChainHeads` head + `(execution_id,event_id)` PK lookups — never a `WHERE execution_id` scan) → same `from_events` (parity by construction); event-scan kept as the default + fallback (cold-head / lag / non-genesis). `NOETL_STATE_BUILD_PARITY_CHECK` shadow-builds both ways in one `REPEATABLE READ` snapshot. New metrics `noetl_state_build_total{mode,outcome}` / `_event_scans_total` (no-scan proof) / `_chain_hops` / `_parity_total{result}`. Gate-ON kind-validated: parity 41/41 MATCH, scans=0 / 1064 hops / 0 fallbacks, all topologies COMPLETE, sole-writer + lag-0, 577 tests + clippy green. No prod default changed.
- 2026-06-19 — `noetl/server` v3.31.0 — ✅ #115 Phase 2 one-level `prev_event_id` event chain (server#244, ai-meta `afdb365`): each `noetl.event` / `noetl.command` carries the chain link, stamped at the emit chokepoint from a per-execution chain-head watermark (`ChainHeads`), covering both gate paths + the materializer. Additive; nothing reads it yet (Phase 3). Kind-proven walkable/1-root/no-gap/no-scan across 6 gate-ON topologies; 573 tests + clippy green. Companion DDL `noetl/noetl` `ecd16a2` (noetl#667). No prod default changed.
- 2026-06-19 — `noetl/server` v3.30.0 — ✅ #115 Phase 1 surface `_ref`/`_store` on kept refs + `refs_in_state` default true (server#243): consume-side accessors for `{{ step._ref }}` lazy-load + storage-tier predicates; references stay out of state/commands by default (worker selective-resolve landed in worker#117 v5.36.0). Closed #113 + #114; kind gate-ON all 9 stalls COMPLETE.
- 2026-06-19 — `noetl/server` v3.29.4 — 🐞 #113 off-server drive: recover offloaded drive result + stop drive on cancel (server#241, rig e2e#63). `apply_worker_orchestration` resolves+decodes an offloaded `__orchestrate__` result (over the 100KB inline budget → durable `reference.ref`) instead of dropping it → non-convergent re-loop (metric `ref_resolved`); cancel now matches underscore `playbook_cancelled` + a terminal guard evicts the orch-cache (no restart). Kind gate-ON proven; 5/9 #113 fixtures COMPLETE, the other 4 hit a distinct oversized-`command.issued` stall → #114. No prod default flipped.
- 2026-06-19 — `noetl/server` v3.29.5 — 🐞 #114 offload oversized command context: a `command.issued` context over `NOETL_COMMAND_CONTEXT_MAX_BYTES` (512KB) is offloaded to `noetl.result_store` with a `{__context_ref__}` marker, resolved in `get_command`/`claim_command` (server#242, rig e2e#64) — published events stay under NATS `max_payload`, no publish-wall wedge. Kind gate-ON: rig PASS, all command.issued <1MB, 6 of #113's 9 fixtures COMPLETE; remaining 3 + cutover gated on the refs_in_state consume side (#101). No prod default changed.
- 2026-06-19 — `noetl/server` v3.29.4 — 🐞 #113 recover offloaded drive result + stop drive on cancel: `apply_worker_orchestration` resolves+decodes an over-budget `__orchestrate__` drive result via `result_store.resolve` (metric `ref_resolved`) instead of dropping it → non-convergent re-loop; cancel matches underscore `playbook_cancelled` + a terminal guard evicts the orch-cache (server#241, rig e2e#63). No prod default changed.
- 2026-06-19 — `noetl/server` v3.29.3 — 🎯 #103 cutover COMPLETE, FLIP-READY: the 2 `ExecutionService` cancel/finalize writers route through the `emit_event` chokepoint (server#240, e2e rig e2e#62) — the last synchronous server `noetl.event` writer under the gate is closed. Kind-proven both modes (gate-off byte-identical INSERT; gate-on PUBLISHED + materializer sole writer + terminal state + 0 loss/dup). All three flip blockers closed → `PUBLISH_ONLY` flip is a staged operator decision. Default-off; no prod default changed.
- 2026-06-19 — `noetl/server` v3.29.2 — off-server-drive × gate crash-recovery: cold-cache apply rebuilds `WorkflowState` from the durable log instead of dropping the in-flight drive result (server#238, refs #104/#103). Unblocks the `PUBLISH_ONLY` flip (off-server drive × gate now kind-proven). Confined to the cold branch; no prod default changed.
- 2026-06-19 — `noetl/tools` v3.13.0 + `noetl/worker` v5.34.0 — #103 ack-after-materialize durability: deferred ack-after-processing capability (tools#71: `AckMode::Defer` + `$JS.ACK` durable handles + `ack`/`nack`/`term`) + in-process CQRS materializer consume-loop (worker#115: drain→project→ack-only-on-success, redeliver on failure) + system-pool wiring (ops#194). Kind fault-injection: gate-on sole-writer loss=0 across a mid-drain failure. Default-off.
- 2026-06-18 — `noetl/server` v3.28.0 — worker-driven orchestrator drive now default ON (`NOETL_ORCHESTRATE_PLUGIN_DRIVE` defaults true) (server#233, closes #108; scale-soak-gated, revert == `false`).
- 2026-06-18 — `noetl/worker` v5.33.0 — pool-affinity decline (drive isolated on the system pool) (worker#114, refs #108 (b)).
- 2026-06-14 — `noetl/worker` v5.22.0 — transfer endpoint credential-alias resolution, both Snowflake↔Postgres directions (worker#87, closes #99).
- 2026-06-14 — `noetl/tools` v3.10.0 — Snowflake↔Postgres transfer arms + flatten credential config (tools#65, closes #99).
- 2026-06-14 — `noetl/tools` v3.9.2 — Snowflake SQL-API context in request body + multi-statement split (tools#64, refs #98).
- 2026-06-14 — `noetl/tools` v3.9.1 — set User-Agent on the Snowflake HTTP client (tools#63).
- 2026-06-14 — `noetl/tools` v3.9.0 — Snowflake key-pair JWT authentication (tools#62; kind-validated on live `sf_test` account).
- 2026-06-12 — `noetl/server` v3.5.0 — `POST /api/execute/batch` + opt-in exactly-once dedup window (server#189, RFC #90 Phase 7 — scale hardening).
- 2026-06-12 — `noetl/worker` v5.19.0 — batch dispatch + dedup opt-in + per-subscription rate limits (worker#79, RFC #90 Phase 7 — scale hardening, closes #90).
- 2026-06-12 — `noetl/cli` v4.11.0 — `noetl subscribe`, local-mode subscription listener (cli#60, closes cli#59, RFC #90 Phase 6). Standalone `kind: Subscription` listener + `FileEventSink` JSONL trail + `local_disk` store-and-forward spool; `cli`-only (reuses `noetl-tools` v3.5.0 source+spool). ai-meta → cli `2fb3fb0`.
- 2026-06-11 — `noetl/server` v3.0.6 — round-trip JSON `null` in whole-object `{{ step }}` references (server#177, closes noetl/ai-meta#89). A `null` field in a `{{ step }}` envelope rendered as the JS token `undefined` (invalid JSON), so the consuming step received an unparseable `str`; `render_to_value` now retries with `| tojson` (undefined/none → JSON `null`) — the server renderer had diverged from the noetl-tools engine that already did this. Kind-validated: cursor pagination collects all 35 events through the terminal `next_cursor: null` page. ai-meta pointer → server `8e17fbe`.
- 2026-06-10 — `noetl/tools` v3.1.1 — multi-tool sibling references (tools#48, closes noetl/ai-meta#87). `TaskSequenceTool` now injects each sub-tool's result under its label so a later sub-tool resolves `{{ <label>.<field> }}` (was rendering empty — a `syntax error at or near ","` in unquoted numeric SQL positions). Worker adopts via worker#69. Kind-validated (`save_edge_cases` `test_large_payload` → `record_count = 100`). ai-meta pointer → tools `76f942a` + worker `b97f642`.
- 2026-06-10 — `noetl/server` v3.0.3 — container-callback insert matches the deployed `noetl.event` schema (server#173, tracks noetl/ai-meta#43). The handler's `call.done` insert targeted a non-existent `attempt` column → HTTP 500 on every watcher callback; replaced with an inline INSERT matching the working ingestion path. Unblocked the container-callback chain (kind-val GREEN both probes). ai-meta pointer → `5d2cf58`.
- 2026-06-10 — `noetl/gui` v1.11.0 + v1.11.1 — credential View/Edit recovery for pre-wallet records (gui#36, closes noetl/ai-meta#82) + `dev:kind` convenience script (gui#35). ai-meta pointer → `8cacc9e`.
- 2026-06-10 — `noetl/server` v3.0.2 — container-tool `command` type contradiction fix (server#172, closes noetl/ai-meta#81). `ToolSpec.command` `Option<String>` → `Option<serde_json::Value>`: the container tool's array `command` now decodes server-side + passes through to the worker's `Vec<String>`; scalars stay JSON strings for shell/db tools. Kind-val GREEN (K8s Job reaches Complete 1/1).
- 2026-06-08 — `noetl/tools` v2.24.2 — clippy cleanup: 15 warnings resolved across 7 files (tools#44, closes tools#42). Mechanical lint fixes, zero behavioral changes.
- 2026-06-05 — `noetl/tools` v2.18.4 — duckdb `command:` alias (parity with postgres) + http `params`/`headers`/`form` non-string coercion (tools#26); worker adopts it (worker#51). Unblocks the pagination + http + duckdb-`command` fixtures.
- 2026-06-05 — `noetl/tools` v2.18.3 — postgres `command:` alias + multi-statement SQL on postgres + duckdb (tools#24, closes tools#23); worker adopts it (worker#50). `duckdb_test` + `json_serialization_save` GREEN.
- 2026-06-05 — `noetl/tools` v2.18.2 — postgres tool surfaces the real SQLSTATE + message instead of `db error` (tools#21); worker bumped to it (worker#49).
- 2026-06-05 — `noetl/server` v2.22.0 — Secrets Wallet Phase 2: GCP Cloud KMS `KeyManager` (Cloud KMS `:encrypt`/`:decrypt` + Workload Identity); runtime `NOETL_KMS_PROVIDER` (`local`/`gcp-kms`); KEK can leave the process (server#81, tracks #61). Kind-validated on local.
- 2026-06-05 — `noetl/server` v2.21.0 — Secrets Wallet Phase 1c/1d: credentials + keychain store envelope-encrypted (per-record DEK wrapped by the KEK); self-describing `{"v":1,…}` blob, forward-only (server#79, tracks #61). Kind-validated end-to-end.
- 2026-06-05 — `noetl/server` v2.20.0 — Secrets Wallet Phase 1b: envelope-encryption core — `KeyManager` / `LocalDevKms` / `EnvelopeCipher` (server#77).
- 2026-06-05 — `noetl/server` v2.19.8 — Secrets Wallet Phase 1a: remove the all-zeros default encryption key, fail closed (server#75, tracks #61). Kind-validated.
- 2026-06-05 — `noetl/tools` v2.18.5 — dollar-quote-aware statement splitter; the 2.18.3 splitter shredded plpgsql `$$ … $$` blocks (tools#27).
- 2026-06-05 — `noetl/server` v2.19.7 — defer task_sequence `_prev`/`_results` refs at command-build (server#73); nested-pipeline templates render at runtime → `iterator_save_test` GREEN.
- 2026-06-05 — `noetl/worker` v5.11.3 — resolve keychain aliases on task_sequence sub-tasks (worker#48); nested postgres-in-pipeline steps connect.
- 2026-06-05 — `noetl/server` v2.19.6 — credential store base64-armors the AES-GCM blob for the TEXT `data_encrypted` column (server#71); keychain creds register + round-trip.
- 2026-06-05 — `noetl/worker` v5.11.2 — resolves keychain alias under the v10 `credential:` key (worker#46).
- 2026-06-05 — `noetl/server` v2.19.5 — v10 control-flow end-to-end (server#69, 6 commits): catalog INT4 + catalog_id alias, ToolSpec skip-null, orchestrator end-step-with-action + task_sequence flatten + intra-pass dedup, template Chainable-undefined, end-step trigger gate.
- 2026-06-05 — `noetl/worker` v5.11.1 — preserve array `tool_config` for task_sequence (worker#44).
- 2026-06-05 — `noetl/tools` v2.18.1 — task_sequence `parse_tasks` accepts worker-envelope shape.
- 2026-06-04 — `noetl/server` v2.19.4 — orchestrator template context: step data at top level + call.done capture.
- 2026-06-04 — `noetl/tools` v2.18.0 — TaskSequenceTool.
- 2026-06-04 — `noetl/tools` v2.17.1 — PythonTool result-global capture.
- 2026-06-04 — `noetl/server` v2.19.3 — three fixes shipped together: pipeline flat shape decode (#61), failure termination (#63), workbook resolution (#65).
- 2026-06-04 — `noetl/server` v2.19.2 — v10 workload + input alias (#59).
- 2026-06-04 — `noetl/server` v2.19.1 — EE-5 lax decode for integer execution_id (#57).
- 2026-06-04 — `noetl/server` v2.19.0 — Phase F R4-4b (ExecutionService refactor + cluster-wide list fan-out).
- 2026-06-04 — `noetl/server` v2.13.0 → v2.19.0 — Phase F R4 series (DbPoolMap N+1 pool layer through R4-5 kind validation).
- 2026-06-04 — `noetl/gateway` v3.2.0 — Phase F R3b-2 shard-info twin endpoint.
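The chain checks quoted in the entries above (roots=1, dangling=0, walk==total) can be illustrated with a small head→root walk over `prev_event_id` links. This is a sketch under stated assumptions, not the server or worker code: `Event`, `ChainReport`, and `check_chain` are hypothetical names, a `HashMap` stands in for the `(execution_id,event_id)` primary-key lookups, and the meanings given to "root" (an event with no `prev_event_id`) and "dangling" (a link to an event that is not present) are one reading of those counters.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical event shape: only the two ids the walk needs.
#[derive(Debug, Clone)]
struct Event {
    event_id: u64,
    prev_event_id: Option<u64>,
}

/// Counters loosely modeled on the roots / dangling / walk==total checks.
#[derive(Debug, PartialEq)]
struct ChainReport {
    roots: usize,
    dangling: usize,
    walk: usize,
    total: usize,
}

/// Walk from `head` back to the root by following `prev_event_id`,
/// using keyed lookups only (no scan over the events in the walk itself).
fn check_chain(events: &[Event], head: u64) -> ChainReport {
    // Stand-in for primary-key lookups by event id (an assumption of this sketch).
    let by_id: HashMap<u64, &Event> = events.iter().map(|e| (e.event_id, e)).collect();

    // A root has no predecessor; a healthy chain has exactly one.
    let roots = events.iter().filter(|e| e.prev_event_id.is_none()).count();

    // A dangling link points at an event that does not exist.
    let dangling = events
        .iter()
        .filter(|e| matches!(e.prev_event_id, Some(p) if !by_id.contains_key(&p)))
        .count();

    let mut walk = 0;
    let mut seen = HashSet::new();
    let mut cursor = Some(head);
    while let Some(id) = cursor {
        if !seen.insert(id) {
            break; // cycle guard: never visit an event twice
        }
        match by_id.get(&id) {
            Some(event) => {
                walk += 1;
                cursor = event.prev_event_id;
            }
            None => break, // missing event: the walk stops short
        }
    }

    ChainReport { roots, dangling, walk, total: events.len() }
}

fn main() {
    let mut events = vec![
        Event { event_id: 1, prev_event_id: None },
        Event { event_id: 2, prev_event_id: Some(1) },
        Event { event_id: 3, prev_event_id: Some(2) },
    ];
    println!("linear chain: {:?}", check_chain(&events, 3));

    // An extra event with no predecessor adds a second root,
    // so the walk from the head no longer covers every event.
    events.push(Event { event_id: 4, prev_event_id: None });
    println!("with orphan:  {:?}", check_chain(&events, 3));
}
```

A linear chain reports one root and a walk that covers every event. Adding an event with a NULL `prev_event_id` produces a second root and a walk shorter than the total, which is the shape the release notes describe as a fork or orphan.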
How agents (Claude / Codex / Cursor) operate across this ecosystem — pointers into the rule files in `agents/rules/`:
- Issue tracking — when + how to open an ai-task issue (rule).
- Wiki convention — ai-meta wiki vs per-repo wikis (rule). Rule 0a enumerates the four wiki pages that drift together; spot-check Home against the open-issue list at session start.
- Handoffs — file-based cross-agent dispatch (rule).
- Deployment validation — kind-first before GKE (rule).
- Execution model — the foundational shape (rule).
- Data access boundary — NoETL data via server API only; workers never touch `noetl.*` direct (rule).
- Observability — metrics + traceability + snowflake IDs (rule).
- Just landed in this codebase? Read Repo Map, then Execution Model, then the umbrella for whatever you're working on.
- Picking up an in-flight task? Find the matching umbrella page above; it has the full state of the work + the next concrete step.
- Need to file new work? Follow the issue tracking convention — open the ai-task issue on `noetl/ai-meta`, then add the umbrella to the table above and create the corresponding wiki page.
- Maintenance pass? Refresh this Home + the Sessions Log + the Releases page + the matching `Umbrella-*.md` page when you bump a submodule pointer. All four pages drift together — see Rule 0a's checklist.
- Home — overview
- Repo Map
- Releases
- Sessions Log
- Secrets Wallet (#61) — SECURITY (design)
- Rust Server Port (#49) — PRIMARY
- Decoupled Context + Event Chain (#115) — RFC (design), reframes #101
- Orchestrator Scaling (#101) — reframed by #115; consume side = #115 Phase 1
- Event WAL + Derivable Storage (#104) — Round 01 (locator) PR open
- WASM Plug-in Compilation (#105) — system-pool plug-in hot-reload (ADR Phase 4)
- System Pool Design (#46) — PRIMARY
- Regression Baseline Migration (#98) — e2e
- Subscription / Listener Tool (#90) — RFC
- Container Tool Callback (#43)
- Rust Worker Parity Gaps (#47 · #48)
- Event Envelope Reconciliation (#51 in TaskList)
- Cursor Loop Mode (#100) — server v3.8.0 + tools v3.10.1, 2026-06-15
- Transfer Tool Credentials (#99) — tools v3.10.0 + worker v5.22.0, 2026-06-14
- Explicit Input Binding (#77) — v3.0.0 shipped 2026-06-09
- Rust Worker Migration (#30)
- Python Services → Rust (#45)
- Issue Tracking
- Wiki Convention
- Handoffs
- Deployment Validation
- Execution Model
- Data Access Boundary
- Observability
- noetl/noetl wiki — app + DSL
- noetl/server wiki — Rust control plane
- noetl/worker wiki — Rust pull worker
- noetl/tools wiki — tool registry crate
- noetl/cli wiki — CLI + local mode
- noetl/gateway wiki — gatekeeper
- noetl/ops wiki — Helm + manifests
- noetl/travel wiki — domain SPA reference
- Docs site — engineer-facing architecture