Kadyapam edited this page Jun 19, 2026 · 277 revisions

NoETL Ecosystem Dashboard

Last refreshed: 2026-06-19 (Claude session — 🐞 #113 off-server-drive payload-size + cancel fix shipped (server v3.29.4). Fixed the worker-driven drive stall when an __orchestrate__ result exceeds the 100KB inline budget: the worker offloads it to the durable result store with only a reference.ref (no inline output_b64), and apply_worker_orchestration now resolves+decodes that ref (server#241, metric ref_resolved) instead of dropping the drive decision → non-convergent re-loop; plus cancel now stops the drive (match underscore playbook_cancelled + ExecutionState::is_terminal terminal guard evicts the orch-cache, no restart). Companion convergence rig e2e#63. Kind gate-ON proven (785KB drive result → ref_resolved→COMPLETED, 0 decode WARNs; cancel froze a drive-loop instantly; materializer sole-writer lag 0). 5/9 #113 large-context fixtures COMPLETE; the other 4 hit a DISTINCT oversized-command.issued (full upstream context embedded → >1MB NATS payload) stall → #114 (#113 stays open until all 9 close). No prod default flipped; prod is pre-#108 in-server drive (unaffected). — prior: 🚀 #103 GKE pre-flip PREP landed (staged, NO traffic flip, NO PUBLISH_ONLY). Verified live that prod already runs the full Rust stack (the #49 Python→Rust cutover is done; the noetl Service selector is already app=noetl-server-rust; live images are pre-#103 batch-dispatch-v1/cursor-100), both flip secrets already exist (NOETL_ENCRYPTION_KEY + noetl-internal-api-token), and prod monitoring is Google Managed Prometheus, not VictoriaMetrics. 
Pushed the post-#103 images to the prod Artifact Registry (server v3.29.3 + worker v5.35.0, amd64); applied + verified GMP monitoring (PodMonitoring for worker+server /metrics, since the noetl namespace had none and app metrics weren't scraped at all, plus materializer-lag Rules translated from the kind VMRule; up{namespace="noetl"}=4 live series); staged the roll-forward manifests (server→v3.29.3 gate-off, system-pool→v5.35.0 materializer-off) in a PR that is not applied (they roll live workloads). Operator-gated remainder (surfaced, not done): roll the live images, enable the materializer shadow, wire the GMP managedAlertmanager pager, then flip. No prod default changed. Prior: 🛡️ #103 materializer-lag GUARDRAIL shipped; the pre-flip observability gate is now in place. The server was already FLIP-READY; the remaining operator gate was a materializer-lag metric + alert so the staged PUBLISH_ONLY flip is safe and one revert away. Shipped (default-off): worker #116 → v5.35.0 extends the JetStream lag poller to track the noetl_events/noetl_materializer consumer on an independent task (so a stalled/dead materializer loop, which can't report its own lag, still surfaces as a climbing noetl_worker_nats_consumer_pending{consumer="noetl_materializer"} gauge); ops #195+#196 add the VMRule (backlog warning/critical/growing + stall-under-gate + project-errors + absent-under-gate), a worker /metrics VMServiceScrape (was unscraped), VMAlert enabled, a Grafana dashboard, and the flip runbook noetl-cqrs-publish-only-flip.md (pre-flip green-baseline check + one-command revert). Kind-proven the full cycle: green baseline (backlog 0, published==projected==acked) → induced lag (materializer fault-injected while events publish under the gate) → backlog gauge climbs 0→684 via the independent poller, alerts fire (backlog warning+critical + stall) → recover (fault removed) → materializer drains backlog→0 idempotently (0 dup, 0 loss), alerts clear. 
Flip-readiness now includes the monitoring gate; PUBLISH_ONLY stays default-off, no prod default changed. Prior: 🎯 #103 server cutover COMPLETE; the server is FLIP-READY. The last of the three flip blockers is closed: the two ExecutionService terminal writers (POST /cancel → playbook_cancelled, POST /finalize → playbook_completed/playbook_failed) now route through the emit_event chokepoint, so they honour NOETL_EVENT_INGEST_PUBLISH_ONLY like the other 13 producers instead of writing noetl.event synchronously under the gate. Shipped: server #240 → v3.29.3 (b6e5d31, default-off) + dual-mode e2e rig kind_validate_cancel_finalize_gate.sh (e2e#62). Kind-proven both modes: gate-OFF cancel/finalize INSERT synchronously, byte-identical columns (error preserved), published delta +0, natural completion still COMPLETED; gate-ON noetl_event_ingest_published_total{playbook_cancelled}=1/{playbook_failed}=1 (PUBLISHED, not inserted), materializer (system pool) is the sole writer, both executions reach the correct terminal state, rows==distinct, 0 catalog_id=0, no loss/dup. No remaining synchronous server noetl.event writers on the producer path under the gate: the server is a complete non-writer of the event log when PUBLISH_ONLY is on. Flipping PUBLISH_ONLY on is now a staged operator decision (behind a materializer-lag alert, one revert away); no prod default changed. Prior: #104 off-server-drive × gate reconciliation PROVEN (server v3.29.2, cold_rebuild); #103 ack-after-materialize durability resolved (tools 3.13.0 + worker 5.34.0 + ops#194).) Refresh cadence: every session that lands meaningful cross-repo work (per agents/rules/wiki-maintenance.md Rule 0a).
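The pre-flip green-baseline condition mentioned above (backlog 0, published == projected == acked) can be sketched as a small check. This is only an illustration: the class name, the function name, and the sample counter values are hypothetical, and the runbook's actual check is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MaterializerCounters:
    """Hypothetical snapshot of the counters the guardrail compares."""
    published: int  # events published under the gate
    projected: int  # events the materializer projected
    acked: int      # events acked after materialization
    backlog: int    # pending messages on the materializer consumer


def is_green_baseline(c: MaterializerCounters) -> bool:
    """Green means no backlog and all three counters agree."""
    return c.backlog == 0 and c.published == c.projected == c.acked


# Illustrative values only: a drained pipeline and a stalled materializer.
green = MaterializerCounters(published=684, projected=684, acked=684, backlog=0)
lagging = MaterializerCounters(published=684, projected=0, acked=0, backlog=684)
```

A check of this shape would pass for `green` and fail for `lagging`, which corresponds to the induced-lag phase of the kind-proven cycle.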

Standing direction (2026-06-04). Per memory entry, Python tiers are deprioritized. Forward Rust-only e2e work is tracked under #54 (Phase F R5). Python pieces stay deployable for backwards-compat on GKE but are NOT a target for new feature work.

Single pane of glass for the NoETL platform. Every active umbrella, every submodule, every release lands here so a single page shows what's in flight, what shipped, what's next.

Convention. This wiki is the cross-repo dashboard. Per-repo wikis (e.g. noetl/server wiki, noetl/ops wiki) document that repo's surface; this wiki documents the system of repos. See Wiki convention for the split.

Active umbrellas

Umbrellas open in the ai-task queue: the Rust server parity-port umbrella (#49), the orchestrator-scaling work (#101), the new event-WAL + derivable-storage model (#104), the Rust regression baseline migration (#98), the Python-era deploy legacy cleanup (#97), and the postgres timestamptz NaiveDateTime bug (#95). #100 (cursor/claim loop mode) closed 2026-06-15 with server v3.8.0 + tools v3.10.1; test_pft_flow_v2 all_passed:true on kind (see Recently closed). #99 (transfer-tool credential aliases) closed 2026-06-14 via tools#65 + worker#87 + e2e#58 (see Recently closed). The subscription / listener tool RFC (#90) closed 2026-06-12 with all 7 phases shipped + live-proven (see Recently closed); refinement follow-ups are tracked as #91–#94.

# Opened Last update Umbrella Status Wiki page
#107 2026-06-17 2026-06-17 Program: Distributed Multitenant OS — Server Dissolution → Global Grid The strategic roof over #101–#105. Blueprint names NoETL a distributed multitenant OS (server→stateless edge; NATS WAL + object store the only durable state; processing = event-driven system playbooks on a sharded grid; foundation for quantum-cloud-hybrid). 5-step path: CQRS cutover (step 1, shadow green) → orchestrator-as-plug-in (#108, done 2026-06-18: drive core fully wasm-resident (#109 closed); the system/orchestrate plug-in compiles to a 0-import .wasm (server#224), runs identically to native in wasmtime (server#225), is seeded into the registry on boot + servable (server#226), and drives the real workload identically live (shadow over the 10×1000 PFT, 529 evals 0 mismatch, server#227); worker-driven cutover: the drive runs OFF-SERVER on the pool, kind-validated, with entry/run_state dispatch (worker#113) + apply_orchestration_result (server#228) + the flag-gated scheduler/apply/state-guard (server#229, NOETL_ORCHESTRATE_PLUGIN_DRIVE); simple_python drove start→end→COMPLETED through the worker round-trip. #108 CLOSED 2026-06-18: (c) the default-flip shipped (server#233, v3.28.0, drive default ON) after a clean scale soak; orchestrator-as-plug-in is done; the in-server shadow + wasmtime server dep retired in #110 / server#234) → per-shard WAL → drop Postgres → cross-shard federation. docs blueprint
#49 2026-06-02 2026-06-14 Rust server FastAPI parity port — primary server v3.6.0 (system worker pool + cleanup/purge endpoint, server#193); prod is Rust-only (server-rust + worker-rust + system pool). Remaining entangled refactor tracked in #97. Umbrella: Rust Server Port
#101 2026-06-15 2026-06-16 Orchestrator scaling: incremental state + results-by-reference resolution In progress: PRs open (server#197 incremental OrchStateCache + hydrate_result_references; worker#89 NOETL_EVENT_RESULT_CONTEXT_MAX_BYTES). References-in-state flag-on stall fixed + re-validated 2026-06-16 (worker feat/refs-in-state-extracted: build_extracted is now a bounded, navigable structural summary, so output.data.rows[0].<field> resolves off the reference; PFT advanced 13→253 events, 31 refs active). Awaiting merge + pointer bumps. Umbrella: Orchestrator Scaling
#103 2026-06-15 2026-06-19 Step 2 — CQRS event log: events to JetStream + batch projector (write-path scaler) GKE pre-flip PREP landed (2026-06-19): prod images pushed, GMP monitoring live, manifests staged; NO traffic flip / NO PUBLISH_ONLY. Prod verified already-Rust (pre-#103 images), secrets present, monitoring = GMP (not VM); operator-gated remainder = roll images → materializer shadow → pager → flip. Server cutover COMPLETE: FLIP-READY (default-off). All three flip blockers closed: (1) ack-after-materialize durability (deferred ack + worker materializer loop), (2) off-server-drive × gate reconciliation (#104, server v3.29.2 cold_rebuild), (3) the 2 ExecutionService cancel/finalize sites now route through the emit_event chokepoint (server#240 → v3.29.3, e2e rig e2e#62). No remaining synchronous server noetl.event writers under the gate; kind-proven in both modes (gate-off byte-identical INSERT; gate-on cancel/finalize PUBLISHED, materializer sole writer, terminal state reached, 0 loss/dup). Flipping PUBLISH_ONLY on is a staged operator decision (materializer-lag alert, one revert away); no prod default changed. Materializer-lag guardrail SHIPPED 2026-06-19 (worker #116 v5.35.0 lag gauge on an independent poller + ops #195/#196 VMRule + worker scrape + VMAlert + dashboard + flip runbook), kind-proven induce→fire→recover→clear; flip-readiness now includes the observability gate. Umbrella: Orchestrator Scaling
#102 2026-06-15 2026-06-16 Step 1 orchestrator throughput: batch event-log writes + frame-batch cursor body In progress — Part A landed (server#198 → v3.10.0), validated on GKE (1×50 PFT COMPLETED, batch event-log path). Part B (frame-batch cursor body) ahead. Umbrella: Orchestrator Scaling
#104 2026-06-16 2026-06-19 Event WAL + derivable result storage: NATS-as-WAL, logical-URI naming, Feather tier In progress: off-server-drive × gate reconciliation PROVEN (2026-06-19). Gate-ON (PUBLISH_ONLY=true) with the off-server drive (PLUGIN_DRIVE=true) + materializer sole writer is now green on kind (the combo #103 left unproven): fresh exec + cursor fan-out → COMPLETED, server wrote 0 noetl.event rows (all PUBLISHED), materializer materialized all exactly once (rows==distinct ids, 0 dup), read-your-writes held via the relocated trigger. Server #238 → v3.29.2 rebuilds WorkflowState from the durable log on cold-cache apply (crash-recovery, kind-proven cold_rebuild); committed e2e rig kind_validate_orchestrate_gate.sh (e2e#61). This unblocked the #103 flip (the 2 cancel/finalize sites are now also done in server v3.29.3 #240; #103 is FLIP-READY). Prior: naming foundation noetl_tools::locator (tools#68/#70, v3.12.0) + worker URI stamp (worker#99); blueprint (docs#180). Umbrella: Event WAL Storage
#98 2026-06-14 2026-06-14 Grow the Rust regression baseline: migrate Python-era e2e fixtures Snowflake key-pair JWT validated — the last external-tool gap is closed. tools v3.9.2 (tools#62/#63/#64); create_sf_database + setup_sf_table COMPLETED via JWT on kind. Transfer step deferred to #99. Core green: 64 fixtures.
#97 2026-06-14 2026-06-14 Retire remaining Python-era deploy legacy (manifests, kind automation, helm chart) Open — Todo. Python manifests, kind redeploy automation that hardcodes Python deployment names, stale helm release rev 185.
#111 2026-06-18 2026-06-19 E2E: worker-driven orchestrate topology coverage + server-API-only gap tracking In progress — three committed kind rigs now: kind_validate_orchestrate_offserver.sh (e2e#59) asserts the off-server topology (gate-off), kind_validate_orchestrate_gate.sh (e2e#61) asserts it composes with the PUBLISH_ONLY gate, and kind_validate_cancel_finalize_gate.sh (e2e#62, 2026-06-19) dual-mode-asserts the ExecutionService cancel/finalize writers honour the gate (gate-off byte-identical INSERT; gate-on PUBLISHED, materializer sole writer, terminal state, 0 loss/dup) — the rig that closed the last #103 flip blocker. Off-server rig live-green (COMPLETED, __orchestrate__ in noetl.event = 0, dispatched=applied, shadow metric absent). Durable home for the server-API-only gap (server still sole-writer + rebuilds state — moves under #103/#104) + two operator decisions: (A) retire in-process drive fallback (gated on prod adopting a post-#108 image; prod still pre-#108), (B) reap accumulating __orchestrate__ PENDING delivery rows in noetl.command.
#113 2026-06-19 2026-06-19 Worker-driven orchestrate drive stalls when OrchestrationResult exceeds the 100KB inline budget In progress: decode-drop loop + cancel-non-stop FIXED (server#241 → v3.29.4; rig e2e#63). When the off-server __orchestrate__ drive result exceeds the 100KB inline budget, the worker offloads it (durable reference.ref, no inline output_b64); apply_worker_orchestration now resolves+decodes the ref (metric ref_resolved) instead of dropping it, which previously caused a non-convergent re-loop. Cancel now matches underscore playbook_cancelled, and a terminal guard evicts the orch-cache (no restart). Kind gate-ON proven (785KB result → ref_resolved→COMPLETED, 0 decode WARNs; cancel froze a loop instantly; sole-writer lag 0). 5 of 9 large-context fixtures COMPLETE; the other 4 hit a DISTINCT oversized-command.issued stall → #114, so the issue stays open until all 9 close. Gates the off-server-drive prod cutover (#107/#111).
#114 2026-06-19 2026-06-19 Off-server drive: oversized command.issued event (full upstream context embedded) exceeds NATS 1MB max_payload → wedge Open — Todo. Surfaced fixing #113: with refs_in_state=false the off-server drive embeds the full resolved upstream context into the next step's command, so its command.issued event (~1.32MB, ctx[start]=1.06MB for test_output_select) exceeds NATS max_payload=1MB → publish ack-timeout → step.enter persisted, command never issued, wedge. Blocks the remaining 4 of #113's 9 fixtures (test_output_select, test_storage_tiers, pagination/pipeline/test_pipeline_heavy_payload, kind_playbook_lease_expiry). Candidate fixes: refs-in-state for the drive / don't embed full context in the event / offload over-budget command context. Gates the off-server-drive prod cutover (#107/#111) with #113.
#95 2026-06-14 2026-06-14 noetl-tools postgres pg_value_to_json returns null for timestamptz / NaiveDateTime columns Open — Todo. Bug: timestamptz columns serialize to JSON null instead of the ISO-8601 string.
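
The inline-versus-reference behaviour described for #113 can be sketched as follows. This is a minimal illustration under stated assumptions, not the worker or server code: the envelope keys output_b64 and reference.ref and the 100KB budget come from the entry above, while build_envelope, resolve_envelope, the in-memory store, and the noetl:// URI shape used here are hypothetical.

```python
import base64
import json

INLINE_BUDGET_BYTES = 100 * 1024  # the 100KB inline budget named in #113


def build_envelope(result: dict, store: dict, key: str) -> dict:
    """Inline small results; offload large ones and return only a reference."""
    raw = json.dumps(result).encode("utf-8")
    if len(raw) <= INLINE_BUDGET_BYTES:
        return {"output_b64": base64.b64encode(raw).decode("ascii")}
    ref = f"noetl://results/{key}"  # hypothetical URI shape
    store[ref] = raw                # stand-in for the durable result store
    return {"reference": {"ref": ref}}


def resolve_envelope(envelope: dict, store: dict) -> dict:
    """Decode an inline payload, or resolve the reference before decoding."""
    if "output_b64" in envelope:
        raw = base64.b64decode(envelope["output_b64"])
    else:
        raw = store[envelope["reference"]["ref"]]
    return json.loads(raw)


store = {}
small = {"next": "end"}
large = {"rows": ["x" * 1024] * 800}  # serializes well past the inline budget

small_env = build_envelope(small, store, "exec-1/step-a")
large_env = build_envelope(large, store, "exec-1/step-b")
```

The point of the sketch is the second branch of resolve_envelope: a consumer that only handles output_b64 would drop the over-budget result, which is the failure shape the #113 fix addresses. The same size-check idea is one of the candidate fixes listed for #114 (offloading over-budget command context).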

Recently closed (last 7 days)

# Closed Title
#112 2026-06-18 Worker /dev/shm SIGBUS — k8s default 64 MiB tmpfs vs the 256 MiB Arrow IPC cache budget. Every worker process (Rust noetl-worker + legacy Python worker) allocates an Arrow IPC shared-memory cache at init (NOETL_IPC_CACHE_BUDGET_BYTES, default 256 MB) backed by POSIX shm on /dev/shm; the k8s container-runtime default /dev/shm is a 64 MiB tmpfs, so under shm-heavy load the cache writes past 64 MiB, the store page-faults against the full tmpfs, and the worker dies with SIGBUS (exit 135) and crash-loops. Surfaced during #103 CQRS kind validation on the system pool; a transient live fix was reverted, leaving the committed manifests latent. Fix (ops#193) gives every worker deployment a memory-backed /dev/shm (emptyDir medium: Memory, sizeLimit: 320Mi > budget), pins NOETL_IPC_CACHE_BUDGET_BYTES=268435456 next to the sizeLimit so the two can't drift, and raises the memory limit to 768Mi (tmpfs is charged to the pod cgroup). Applied to all 7 worker manifests (system / shared-rust / subscription / subscription-runtime / Python-cpu + 2 prod variants). Kind-validated on the system pool: reproduced SIGBUS (exit 135) on the 64 MiB tmpfs → after fix /dev/shm is 320 MiB and a full 256 MiB write completes (exit 0, peak 256M/320M) with the pod healthy (restarts=0, no OOM); cluster restored to baseline. ai-meta → ops f4df4c1 + wiki worker deployment-specification.
#110 2026-06-18 Retired the in-server orchestrate shadow + the wasmtime server dependency. The separable server-slimming follow-up to #108 (closed): the in-server shadow was the slice-4 cutover-confidence harness (ran the system/orchestrate plug-in inside the server via an embedded wasmtime host, diffed 529/0 against the in-process drive). With the worker-driven drive default-on + proven, the live drive uses the worker's wasmtime host — never the server's — so the shadow was dead weight. server#234 removed orchestrate_shadow.rs, the orchestrate-shadow cargo feature + the optional wasmtime dep (the cranelift/wasmtime tree — ~1000 Cargo.lock lines — fell out; cargo tree -i wasmtime now matches nothing), the trigger_orchestrator_inner shadow hook, the main.rs boot loader, the NOETL_ORCHESTRATE_PLUGIN_SHADOW config field, and the noetl_orchestrate_shadow_total metric. Kept run_state + NOETL_ORCHESTRATE_PLUGIN_DRIVE (default true). refactor: = no version bump (stays v3.28.0). Kind smoke on a 4-page cursor loop, both drive modes: COMPLETED with __orchestrate__ rows in noetl.event = 0 (worker-driven: 10 drive cmds on the system pool, dispatched=applied=10, event_suppressed=30); shadow metric gone from /metrics. ai-meta → server f3043c9 + wiki deployment-specification. See Umbrella-System-Pool-Design.
#108 2026-06-18 🎯 The orchestrator drive runs OFF-SERVER on the system pool, and it's now the DEFAULT. Step 2 of the dissolution (#107) — the server's brain moved onto the worker pool as the system/orchestrate WASM plug-in. Arc: slices 1–3 (off-server drive, server#229) → 4a (cursor+fan-out validated) → 4b + follow-up a (the __orchestrate__ meta-command touches noetl.event zero times, server#230/#231) → (b) system-pool isolation (server#232 + worker#114 + ops#191) → (c) the deliberate default-flip (server#233, v3.28.0): NOETL_ORCHESTRATE_PLUGIN_DRIVE defaults true. Gated on a scale soak (kind): a 694-drive cursor+fan-out run COMPLETED with __orchestrate__ rows in noetl.event = 0 and all 694 drives claimed on the system pool (shared pool got only the real steps); the default-on path (no env var) reproduced the identical shape (361 drives, isolated, 0 burst); 15/15 regression fixtures green; revert (=false → in-process drive) verified. ai-meta → server 80cc0e6 + worker 437b0be. See Umbrella-System-Pool-Design.
#109 2026-06-17 Event-ABI round — orchestrator + evaluate moved into the wasm core (#108 final slices). Slice 3 (server#223) relocated orchestrator/evaluate to noetl-orchestrate-core; the whole drive (renderer, playbook, commands, evaluator, state, orchestrator switch) now compiles native + wasm32-unknown-unknown. evaluate reads the pure core::event::Event; db::Event converts at the trigger_orchestrator boundary. 122 core + 565 server tests green, 0 WASI imports, kind PFT 10×1000 clean (0 errors, 0 restarts). Plug-in round (data-plane ABI + command_emit + scheduler + shadow-diff) tracked under #108.
#106 2026-06-17 EventEnvelope rejects timezone-less timestamps — blocked the CQRS materializer. The tailer publishes to_jsonb(noetl.event row) and created_at is timestamp WITHOUT time zone, so the timestamp had no offset; EventEnvelope.timestamp: Option<DateTime<Utc>> rejected it with the misleading premature end of input. Fixed (server#217) with a flexible deserialize_with (RFC3339 + tz-less→UTC). Validated live: materializer projects {projected:0, duplicates:20} = byte-identical. The #103 shadow gate is green.
#105 2026-06-17 Plug-in compilation & hot-reload: WASM-compiled system playbooks + managed library. The WASM plug-in runtime is complete + fully live-proven on kind: dispatch (tool_kind: "wasm" → wasmtime host → digest from registry → run), capability flush (object_put → noetl.object_store), real data flow (input → args, worker#110), and hot-replace (worker#112), where republishing the same path@version swaps the running pool's behavior with 0 restarts. Landed across worker#93/#95/#97/#99/#101/#103/#105/#107/#108/#110/#112 + server#210/#212/#214 + tools#68 + docs#181. Remaining (executor: author sugar; compiled materialiser port) deferred to #104/#103. See Umbrella-WASM-Plugin-Compilation.
#100 2026-06-15 Cursor/claim loop mode: loop.spec.mode: cursor in the Rust orchestrator. noetl-server v3.8.0 (server#196) cursor loop engine + output namespace; noetl-tools v3.10.1 (tools#66) postgres -- comment splitter fix; worker#88 dep bump. test_pft_flow_v2 all_passed:true, 5/5 per data type on kind against the throttling/error-injecting paginated-api. See Umbrella-Cursor-Loop-Mode.
#99 2026-06-14 Transfer tool: Snowflake↔Postgres both directions, with credential-alias resolution. Both transfer arms implemented in noetl-tools v3.10.0 (tools#65); worker v5.22.0 (worker#87) pre-resolves source.auth/target.auth aliases; SF→PG coerces string cells via $n::text::<udt> + reformats Snowflake-epoch timestamps to RFC3339; PG→SF generates SQL-escaped INSERTs. e2e fixture migrated (e2e#58). Kind-validated bidirectionally against the live sf_test account: every step COMPLETED, real types preserved.
#90 2026-06-12 Subscription / listener tool (RFC): all 7 phases shipped + live-proven. Bounded-drain tool (Mode A) → kind: Subscription continuous runtime (Mode B) + header-directive engine → gateway push-ingress (Mode C) + auth-gated directive trust → store-and-forward spool + circuit breaker → out-of-cluster Cloud Run + gcs spool → CLI local noetl subscribe → Phase 7 scale hardening. Final phase: server v3.5.0 (server#189) POST /api/execute/batch (N→N, partial-failure contained) + opt-in exactly-once dedup window (noetl.subscription_dedup, bounded-by-age, race-safe, default off); worker v5.19.0 (worker#79) batch dispatch + dedup opt-in + per-subscription rate limits (token-bucket fetch-side backpressure → source keeps backlog, no loss); ops (ops#176) + e2e (e2e#48); no tools change. Live on kind: batch 12→12 COMPLETED on the subscription pool + per-message traceparent; dedup duplicate→1 execution + subscription.message.deduplicated; rate-limit engaged + 10/10 → executions (no loss). Refinement follow-ups tracked: #91–#94 + tools#57. ai-meta → server 7b217d8 + worker 7531f4a + ops 6db69b9 + e2e 203593b.
#89 2026-06-11 JSON null re-injected via {{ step }} serialized as JS undefined; fixed in the server template renderer. The cursor pagination fixture's terminal page returns next_cursor: null; re-injecting the whole {{ fetch_page }} envelope into the next step's input rendered that field as the bare token undefined, producing invalid JSON. render_to_value then failed serde_json::from_str and returned the entire envelope as a raw string, so the consuming Python step got response as a str and crashed ('str' object has no attribute 'get'). Root cause was the server, not the worker (as the issue hypothesized): src/template/jinja.rs::json_value_to_minijinja maps JSON null → Value::UNDEFINED and minijinja's map repr emits undefined; the server's renderer was a divergent copy missing the | tojson retry the noetl-tools engine already had. Fix (server#177, v3.0.6) adds that retry: a lone {{ expr }} rendering container-shaped-but-invalid JSON re-renders with | tojson; minijinja_to_json maps undefined/none → JSON null. 5 new regression tests; 619 lib + 8 parity green. Kind-validated: cursor walks all 4 pages, terminal next_cursor: null handled, validate_results collects 35 (first_id=1, last_id=35, success), matching offset. ai-meta → server 8e17fbe.
#88 2026-06-10 e2e offset/cursor pagination fixtures now read response.body.*, not response.data.*. The Rust http tool nests the parsed JSON payload under body ({{ fetch_page }} → {body, headers, status_code}), so the fixtures' response.get('data', {}) resolved to {}, has_more/next_cursor defaulted falsy, and the loop exited after page 1 even with the post-#85 loop machinery correct. Fixed both check_pagination steps to response.get('body', {}) via e2e#40. Kind-validated against the live paginated-api test-server (Rust server/worker :dev): offset walks 0→10→20→30, has_more T/T/T/F, users 10/10/10/5, validate_results success 35 (first_id=1, last_id=35), playbook.completed COMPLETED. cursor path-fix correct + walks all 4 pages (Mg== → Mw== → NA== → null, 35 events fetched) but final collection blocked by a distinct worker bug → filed #89 (terminal next_cursor: null serialized as JS undefined). Other pagination fixtures (retry/max_iterations/pipeline*/loop_with_pagination) share the same envelope-key assumption over /api/v1/assessments|flaky ({data, paging}); flagged, left for follow-up. ai-meta → e2e 72a7525.
#85 2026-06-10 Workflow-arc loops now advance across iterations + terminate cleanly — built on the dispatch-guard re-entry layer, two coupled orchestrator fixes via server#176 (v3.0.5). (1) Durable event-sourced loop-ctx propagation: step-level set: ctx.* loop variables were recomputed per pass and reverted to the workload default (loop thrashed 0,0,1,0,1,2,…); root cause was start's initializer set re-firing every pass in random HashMap order against check_pagination's advancing set. Fix persists each completion's rendered set: as a ctx.updated event (latest-wins fold + build_context overlay), emitted once per completion keyed by the stable completion event_id (not Utc::now()-fallback completed_at). (2) Loop-exit hang: the exit branch was marked step.skipped on a loop-body-completion pass (recency-based branch-point detector missed it), turning it terminal so the exit dispatch was suppressed; fixed with a structural loop-branch-point test. 614 lib tests (6 new; 2 verified to fail without their guard); clippy-clean. Kind-validated: counter loop advances 0→1→2→3 + terminates; real-http offset pagination advances 4 pages collecting 35. (Separate finding, filed follow-up: the e2e offset/cursor fixtures read response.data.users but the Rust http tool nests it at response.body.users.) ai-meta → server e519fdc.
#87 2026-06-10 Multi-tool step: a later sub-tool can now reference an earlier sibling's output. In a tool: [list] step, each sub-tool's result was stored for the aggregated output but never injected into the running context, so {{ <label>.<field> }} rendered empty (masked in quoted positions; a syntax error at or near "," in unquoted numeric SQL positions, e.g. save_edge_cases test_large_payload). Fixed via tools#48 (v3.1.1): inject each sub-tool's result under its label (with a synthetic .data self-ref) so later siblings resolve it. Worker adopts 3.1.1 via worker#69. Kind-validated: save_edge_cases test_large_payload → record_count = 100 (no syntax error), save_delegation_test clean. ai-meta → tools 76f942a + worker b97f642.
#83 2026-06-10 Orchestrator fan-in barrier deadlocked workflow loops. build_incoming_arcs counted a loop back-edge (check_pagination → fetch_page) as an upstream, so the barrier deferred the loop head forever. Fixed via server#175 (v3.0.4): exclude back-edges via a new forward_reachable helper (genuine fan-in unaffected). Kind-validated; fanout_reduce green. ai-meta → server 480ba72.
#84 2026-06-10 Orchestrator never populated event.name → loop.done arc gates always skipped. when: {{ event.name == "loop.done" }} (10+ fixtures) never matched, so in-step loop: steps hung after completion. Fixed via server#175 (v3.0.4): inject event.name = "loop.done" into a completed loop step's next-arc context. Kind-validated (test_pagination_basic completes). ai-meta → server 480ba72.
#86 2026-06-10 e2e fixtures: duckdb tool field is command/query, not commands. commands: (plural) failed pre-dispatch (missing field 'query' / malformed tool config). Renamed across 4 storage/gcs fixtures via e2e#39; save_all_storage_types green. ai-meta → e2e b0a5c85.
#78 2026-06-10 noetl-worker: pre-dispatch errors now emit terminal call.error instead of hanging — credential-alias resolution + tool-config deserialization failures used to ?-propagate out of execute_with_server_url; the dispatch loop only logged them, so the execution sat at command.started forever. Fixed via worker#68 (v5.15.1): typed CredentialResolutionError (terminal AliasNotFound/Invalid vs retryable Transient) + CredentialHttpError carrying the HTTP status so classify_fetch_error decides retryability by code (terminal: 404/400/401/403/500; retryable: 408/429/502/503/504 + transport), and handle_predispatch_failure emits call.error + command.failed. Diagnosis correction: the live pg_noetl_k8s repro is an HTTP 500 "Decryption failed: aead::Error", not a 404 (the worker has no /api/keychain/ call). Kind-val GREEN: test/postgres now → call.error/command.failed/playbook.failed (no hang); hello_world still completes. ai-meta → worker 99e2c66.
#80 2026-06-10 container_callback chain green end to end. Fixing the watcher's missing curl (retired bitnami/kubectl → alpine/k8s:1.30.3, ops#168) surfaced two more layered bugs. Server: container-callback call.done insert targeted a non-existent attempt column → HTTP 500; fixed to the deployed noetl.event schema (server#173, v3.0.3). OOM path: the watcher only read Job-level conditions so failed_oom could never fire; added pod-level OOMKilled → failed_oom / ImagePullBackOff → failed_image_pull classification + RFC3339 completed_at (bare now was returning HTTP 422) (ops#168); and the e2e fixture's calloc-lazy bytes(40MiB) never dirtied pages so it never OOM'd (e2e#38). kind_validate_container_callback.sh both probes GREEN: happy_path → succeeded, oom → failed_oom. ai-meta → ops cacc513 + server 5d2cf58 + e2e 6aaf06e.
#79 2026-06-10 e2e kind-val runner scripts updated to current noetl CLI surface — both scripts/kind_validate_*.sh aborted on unrecognized subcommand 'playbook'; the validation logic + event taxonomy were intact, only the invocation layer drifted. Fixed via e2e#37: register playbook / exec <catalog-path> --runtime distributed --json / status --json / event log via noetl query (.result, order by event_id); fail-fast CLI-surface guard. fanout_reduce PASS start-to-finish on kind (server-rust v3.0.1); container_callback drives cleanly and stops at the watcher curl gap (#80). ai-meta → e2e a3594b3; wiki Kind-Val Runners.
#82 2026-06-10 GUI: credential View/Edit recovered for pre-wallet (legacy-encrypted) records — Secrets Wallet (#61) made credential storage forward-only, so pre-wallet records 500 on decrypt and the GUI View/Edit flow dead-ended on a generic toast. Fixed via gui#36: View explains the cause + points to Edit; Edit reopens with the list-row metadata + a warning banner so re-saving re-seals the record under the current wallet. ai-meta → gui 8cacc9e (v1.11.1).
#81 2026-06-10 Container tool unusable: ToolSpec.command (String) vs ContainerConfig.command (Vec) type contradiction. Landed via server#172 (v3.0.2). ToolSpec.command: Option<String> → Option<serde_json::Value>, so the container tool's array command decodes server-side + passes through to the worker's Vec<String> (scalar stays a JSON string for shell/db tools); ToolCall::from_spec forwards verbatim. 2 regression tests; clippy clean. Kind-val GREEN: server accepts command: ["/bin/sh","-c"], worker creates the K8s Job, Job reaches Complete 1/1. Chain counter-bump validation stays gated on #79/#43.
#77 2026-06-09 Explicit input:/set: forward-only data binding — BREAKING v3.0.0 across noetl-tools + noetl-server. All 5 PRs merged: tools#45 (v3.0.0) + server#169 (v3.0.0) + e2e#35 (13 fixtures migrated) + cli#57 (v4.10.0, executor 0.5.0) + worker#66 (dep bump). Kind-val GREEN.
#76 2026-06-08 Sequential-mode iterator dispatch — LoopMode enum (Sequential default / Parallel), StepInfo.iterations_dispatched guard. Landed via noetl/server#166 (v2.62.0). First Claude-direct Rust PR under agents/rules/handoff-routing.md. Kind-val GREEN: test/loop COMPLETED 5/5 + iterator_save_test COMPLETED 4 steps.
#70 2026-06-08 noetl-server missing PUT /api/result/<execution_id> endpoint — landed via noetl/server#160 (v2.58.0). Durable result-store endpoints (PUT + GET /api/result/<eid>/<step>/<ref>) ported to the Rust server. Kind-val GREEN: output_select_test reached playbook.completed with test_result: "PASSED".
#69 2026-06-08 noetl-worker: over-budget call.done returned reference-only envelope, missing inline _ref for downstream {{ step._ref }} templates. Landed via noetl/worker#63 (v5.15.0): build_call_done_result's durable-success branch now embeds context: { data: { _ref: <noetl://...> } } alongside the existing reference block. Kind re-val pending noetl/ai-meta#70 (server-side PUT /api/result/<eid> endpoint missing — falls back to degraded shm-only path where there's no noetl:// URI to embed).
#68 2026-06-08 noetl-tools: ArtifactTool config required input but server pipeline emitted args (the post-#56 normalized name) — landed via noetl/tools#40 (v2.23.1) + worker dep bump via noetl/worker#62. One-line #[serde(alias = "args")] on ArtifactConfig.input accepts both shapes. Re-val surfaced a downstream _ref/output_select gap → filed as noetl/ai-meta#69.
#67 2026-06-08 Rust orchestrator hangs on mode: exclusive routing — untaken sibling never emits step.skipped, R4 fan-in barrier deadlocks. Landed via noetl/server#159 (v2.57.2). Three-part fix: (1) evaluator::evaluate_next_transitions surfaces unmatched siblings as not_matched_with_target; (2) orchestrator::process_in_progress two-pass emit-skipped-then-dispatch (HashMap-order-independent); (3) +2 unit tests (lib 568/0/0). Kind-val GREEN — comprehensive_test.yaml reached playbook.completed in ~4s (was hanging forever pre-fix).
#66 2026-06-07 Rust orchestrator: cross-step {{ step.data }} template resolves to None — landed via noetl/server#158 (v2.57.1). WorkflowState::build_context injects a self-referencing .data key on extracted user_data, guarded so the task_sequence flatten path's existing .data stays intact. 2 new unit tests + 566/0/0 lib. Surfaced by #65 kind-val; concrete repro kind execution 322087210360770560.
#65 2026-06-07 noetl-tools: external python script loaders (file/gcs/http source types) + legacy main() function convention — landed via tools#38 (v2.22.0) + tools#39 (v2.23.0); kind-val GREEN on the live worker (execution 322087210360770560 reached playbook.completed in ~6s; loaded main(name, count) returned the expected payload). noetl/worker#61 OPEN+mergeable to bump the worker pin. Surfaced finding: noetl/ai-meta#66.
#43 2026-06-07 R-3 Phase C-2: container tool kind — design callback pattern for K8s Job dispatch. All four Rust rounds shipped (1 k8s-watcher ops@8892043 / 2 callback endpoint server v2.48.0 / 3 Tool::Container tools v2.21.0 / 5 kind-val rig e2e@17de21d). Round 4 (Python parity) parked per Rust-only direction. Worker-side pending_callback adoption is a coordinated follow-up.
#64 2026-06-07 noetl-tools: artifact tool kind missing from Rust registry — landed via tools#35 v2.20.0 (thin ArtifactTool adapter translating the Python-era YAML shape into a ResultFetchTool call)
#61 2026-06-07 Secrets Wallet (Rust): envelope encryption + KMS/secret-provider plugins + sealed worker delivery + distributed multi-region resolution — all named phases + 6d.X cloud dynamic providers shipped; umbrella feature-complete
#54 2026-06-06 Phase F R5 — regression + e2e validation under sharded topology — closed at the umbrella level (Tier 1 + Tier 3 + Tier 4 e2e all GREEN on the Rust-only stack; subsequent regression findings filed as their own ai-task issues — #62/#63/#64/#65/#66 all now also closed)
#62 2026-06-05 noetl-server: /api/executions list query candidate-first rewrite + status-drift fix (server#99 v2.28.1) — 6.5 s → 0.015 s (~430×)
#63 2026-06-05 noetl-tools: python tool accepts nested script.source.code (inline) — tools#33 v2.19.3 + worker#54 adoption, test_script_loading kind-validated; external loaders split to #65
#60 2026-06-04 Rust orchestrator template context doesn't expose step data for next.arcs / step.when
#59 2026-06-04 Rust orchestrator doesn't resolve tool.kind:workbook references to inline actions
#58 2026-06-04 Rust orchestrator doesn't emit playbook.failed on command.failed — executions stall
#57 2026-06-04 Rust server rejects flat (name-as-field) pipeline shape in v10 playbook YAML
#56 2026-06-04 Canonical v10 playbook workload + input alias unrecognized on Rust stack
#55 2026-06-04 Rust server EventEmitRequest.execution_id wire-type drift blocks worker traffic
#53 2026-06-04 Rust worker → Rust server e2e compatibility (Phase D R3b/R3c terminal completion gap)
#52 2026-06-03 noetl-tools: add js_consume operation to nats tool kind
#51 2026-06-03 Fix system/outbox_publisher.yaml auth block to use AuthResolver pattern
#50 2026-06-03 Phase 2.a — system/outbox_publisher playbook + routing + auth wiring
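
The sequential-mode dispatch guard from #76 above can be pictured with a small std-only sketch. It is not the server's code: this `StepInfo` is a simplified, hypothetical stand-in, `iterations_completed` is modelled as a plain field rather than the method the entry mentions, and `fan_out` / `on_completed` are illustrative names. It only shows the rule stated in the entry: dispatch iteration 0 at fan-out, then dispatch the next iteration on each completion once the dispatched and completed counts agree.

```rust
/// Loop dispatch mode; `Sequential` is the default, as described for #76.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum LoopMode {
    #[default]
    Sequential,
    Parallel,
}

/// Hypothetical, simplified stand-in for the server's `StepInfo`.
#[derive(Debug, Default)]
pub struct StepInfo {
    pub total_iterations: usize,
    /// Number of `command.issued` events seen for this step.
    pub iterations_dispatched: usize,
    /// Number of `command.completed` events seen for this step.
    pub iterations_completed: usize,
}

impl StepInfo {
    /// Fan-out: returns the iteration indices to dispatch immediately.
    /// Sequential mode issues only iteration 0; parallel issues all of them.
    pub fn fan_out(&mut self, mode: LoopMode) -> Vec<usize> {
        let n = match mode {
            LoopMode::Sequential => self.total_iterations.min(1),
            LoopMode::Parallel => self.total_iterations,
        };
        self.iterations_dispatched = n;
        (0..n).collect()
    }

    /// Called on each `command.completed`: returns the next iteration index
    /// to dispatch, if the sequential guard allows one.
    pub fn on_completed(&mut self, mode: LoopMode) -> Option<usize> {
        self.iterations_completed += 1;
        // Guard: everything issued so far has finished, and work remains.
        let caught_up = self.iterations_dispatched == self.iterations_completed;
        let remaining = self.iterations_dispatched < self.total_iterations;
        if mode == LoopMode::Sequential && caught_up && remaining {
            let next = self.iterations_dispatched;
            self.iterations_dispatched += 1;
            Some(next)
        } else {
            None
        }
    }
}
```

With three iterations in sequential mode, this sketch issues index 0 at fan-out and indices 1 and 2 on the first two completions; the third completion dispatches nothing because no iterations remain.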

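The two-pass "emit skipped, then dispatch" ordering from the #67 exclusive-routing fix above can be sketched the same way. `ArcResult`, `Action`, and `plan_exclusive` are hypothetical names for illustration, not the server's types, and the sketch assumes that under `mode: exclusive` only the first matched arc is taken and every other arc target is marked skipped.

```rust
/// Hypothetical result of evaluating one outgoing arc of a step.
pub struct ArcResult {
    pub target: String,
    pub matched: bool,
}

/// What the orchestrator would do for an arc target in this sketch.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Action {
    /// Emit `step.skipped` so downstream fan-in barriers stop waiting.
    Skip(String),
    /// Dispatch the taken target.
    Dispatch(String),
}

/// Plans actions for a step routed in exclusive mode (sketch only).
pub fn plan_exclusive(results: &[ArcResult]) -> Vec<Action> {
    // Assumption: exclusive mode takes the first matched arc in declared order.
    let taken = results.iter().position(|r| r.matched);
    let mut actions = Vec::with_capacity(results.len());

    // Pass 1: mark every untaken target as skipped before any dispatch,
    // so the outcome does not depend on the order arcs are visited in.
    for (i, r) in results.iter().enumerate() {
        if Some(i) != taken {
            actions.push(Action::Skip(r.target.clone()));
        }
    }

    // Pass 2: dispatch the single taken target, if any arc matched.
    if let Some(i) = taken {
        actions.push(Action::Dispatch(results[i].target.clone()));
    }
    actions
}
```

Emitting all skips first means a downstream merge step never sees a dispatched sibling while an untaken one is still unresolved, which is the deadlock the entry describes.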
Ecosystem map

See the Repo Map page for the full submodule inventory. Quick view of the production repos and their current versions (2026-06-19):

Repo Role Version Recent
noetl/server Rust control plane v3.29.4 🐞 #113 — off-server drive: recover offloaded drive result + stop drive on cancel (server#241, v3.29.4): apply_worker_orchestration resolves+decodes an OFFLOADED __orchestrate__ drive result (over the 100KB inline budget → durable reference.ref, no inline output_b64) instead of dropping it → non-convergent re-loop (new noetl_orchestrate_drive_total{stage=ref_resolved}); cancel now stops the drive (match underscore playbook_cancelled + ExecutionState::is_terminal terminal guard evicts the orch-cache, no restart). Kind gate-ON proven (785KB result → ref_resolved→COMPLETED, 0 decode WARNs; cancel froze a drive-loop instantly; sole-writer lag 0); rig e2e#63. 5/9 #113 fixtures COMPLETE; 4 hit a distinct oversized-command.issued stall → #114. No prod default flipped. Prior: 🎯 #103 — CQRS write-path cutover COMPLETE, FLIP-READY (server#240, v3.29.3): the 2 ExecutionService cancel/finalize writers route through the emit_event chokepoint — the last synchronous server noetl.event writer under the gate is closed, so with NOETL_EVENT_INGEST_PUBLISH_ONLY=on the server writes zero event rows (materializer is the sole writer). Kind-proven both modes (gate-off byte-identical INSERT; gate-on PUBLISHED + materializer sole writer + terminal state + 0 loss/dup). All three flip blockers closed → flipping the gate on is a staged operator decision. Default-off; no prod default changed. Prior: #104 — off-server-drive × gate crash-recovery (v3.29.2, server#238): cold-cache apply rebuilds WorkflowState from the durable log. Prior: #108 (c) — worker-driven orchestrator drive now DEFAULT ON (server#233, v3.28.0): NOETL_ORCHESTRATE_PLUGIN_DRIVE defaults true after the scale soak proved zero noetl.event burst + full system-pool isolation; in-process drive kept as the =false revert. 
Prior: #101 — bounded-memory orchestrator + stall-proof reconcile (v3.9.0, server#197): projection-snapshot bounded rebuild (flat memory — 167KB snapshot at 200k events, was OOM at ~19k); throttled O(events) consistency COUNT off the hot path; background reconcile poller (force-advances every active execution every 8s → no permanent deadlock under DB backpressure); results-by-reference resolution; GET /api/executions/{id} memory-bomb fix. Validated kind 10×1000 (flat memory) + GKE db-g1-small/PgBouncer 10×200 (poller broke a stall, 0 fails/restarts, Cloud SQL ~15 backends). Prior: #90 Phase 7 — POST /api/execute/batch + opt-in exactly-once dedup window (v3.5.0, server#189, closes server#188): batch endpoint creates N executions in one round-trip with partial-failure containment, reusing the single-execute path so per-message routing/trace/dedup are intact; the opt-in dedup window (noetl.subscription_dedup, bounded-by-age, race-safe INSERT … ON CONFLICT, scoped by subscription, subscription.message.deduplicated audit, default off — RFC §10 OQ1); validation of the new dispatch.batch_dispatch/batch_max/dedup/limits spec blocks; noetl_execute_outcomes_total + noetl_execute_batch_size. Live on kind: batch 12→12 COMPLETED; dedup duplicate→1 execution; direct-curl within/outside-window + dedup-off all green. ai-meta → server 7b217d8. Prior (v3.4.2): #90 Phase 5 gcs/s3 spool credential optional (ADC). Prior: #90 Phase 4 — spool config validation + subscription lifecycle-status fix (v3.4.1, server#184+server#185): validate the spec.spool block at registration; lifecycle-status reconstruction now matches only the six lifecycle event types so spool/circuit events (which share the subscription's execution_id) can't 500 subscription_get/activate. 
Prior: #90 Phase 3 — push-ingress config endpoint + push catalog validation (v3.3.0, server#182): mode: push requires an ingress.verify block (hmac_sha256 | bearer | pubsub_oidc; none rejected) + new GET /api/internal/ingress/{listener} (service-account-gated) resolving the verify-secret alias through the Wallet + idempotent subscription registration — the gateway's DB-free config source. Live-validated on kind. Prior: #90 Phase 2 — kind: Subscription type + lifecycle + pool routing + W3C trace (v3.2.0, server#180): first-class kind: Subscription catalog type (source/mode/dispatch validation, no step-DAG) + event-sourced lifecycle endpoints /api/subscriptions (register→activate→pause/resume→drain→deactivate, idempotent register, GET list/get) + execution_pool override on /api/execute routing the whole run to noetl.commands.<pool>.<eid> (persisted in playbook_started meta, orchestrator reads back) + W3C trace into meta.trace + command notification + child inheritance. Startup-seeds the subscription resource kind; decodes noetl.event.created_at as TIMESTAMP. Live E2E green. Prior (v3.1.0): subscription ToolKind. Prior: Round-trip JSON null in whole-object {{ step }} references (v3.0.6, server#177; closes noetl/ai-meta#89): a single-expression {{ step }} reference to a prior result envelope carrying a null field rendered that field as the JS token undefined via minijinja's map repr (json_value_to_minijinja maps JSON null → Value::UNDEFINED); render_to_value then failed serde_json::from_str and returned the whole envelope as a raw string, so the consuming python/rhai step received an unparseable str and crashed. Fix adds a | tojson retry to render_to_value (mirrors the noetl-tools TemplateEngine::render_value the server copy had diverged from): a lone {{ expr }} whose plain render is container-shaped-but-invalid JSON re-renders with | tojson, and minijinja_to_json maps undefined/none → JSON null so the field round-trips as null.
5 new regression tests (null in nested + top-level objects, null array element, explicit | tojson no-double-pipe, scalars unchanged); 619 lib + 8 parity green; clippy clean. Kind-validated against the live paginated-api test-server: cursor fixture walks all 4 pages, terminal next_cursor: null handled, validate_results collects 35 (first_id=1, last_id=35, success) — matching offset; pre-fix the 4th check_pagination was command.completed error. ai-meta → server 8e17fbe. Prior: Workflow-arc loops advance across iterations + terminate (v3.0.5, server#176; closes noetl/ai-meta#85): two coupled fixes atop the dispatch-guard re-entry layer. (1) Durable event-sourced loop-ctx propagation — step-level set: ctx.* loop variables were recomputed per pass and reverted to the workload default (loop thrashed 0,0,1,0,1,2,…); root cause was start's initializer set re-firing every pass in random HashMap order against the loop's advancing set. Fix persists each completion's rendered set: as a ctx.updated event (latest-wins fold in WorkflowState.ctx + build_context overlay), emitted once per completion keyed by the stable completion event_id (the Utc::now()-fallback completed_at is unstable across reconstructions). (2) Loop-exit hang — the exit branch was marked step.skipped on a loop-body-completion pass (recency-based branch-point detector missed it), turning it terminal so the is_step_done guard suppressed the exit dispatch; fixed with a structural loop-branch-point test (any step with a back-edge arc). 614 lib tests (+6; 2 verified to fail without their guard); clippy-clean. Kind-validated: counter loop 0→1→2→3 + terminates; real-http offset pagination 4 pages collecting 35. ai-meta → server e519fdc. (Separate finding: the e2e offset/cursor fixtures read response.data.users but the Rust http tool nests it at response.body.users — filed as a follow-up.) 
Prior: Unblock workflow loops + loop.done-gated transitions (v3.0.4, server#175; closes noetl/ai-meta#83 + #84): the fan-in/reduce barrier counted a loop back-edge (check_pagination → fetch_page) as an upstream and deferred the loop head forever — fix excludes back-edges via a new forward_reachable helper (genuine fan-in unaffected); and event.name was never populated for arc evaluation so when: {{ event.name == "loop.done" }} never matched (10+ fixtures hung after an in-step loop:) — fix injects event.name = "loop.done" into a completed loop step's next-arc context. Found in a full e2e regression re-sweep (19→27/36 on kind); landed with e2e fixture fix e2e#39 (duckdb commands → command, closes #86). Follow-ups #85/#87 filed open. 26 orchestrator tests +2 new; kind-validated. Prior: container-callback insert matches the deployed event schema (v3.0.3, server#173; tracks noetl/ai-meta#43): the container-callback handler emitted its resume call.done via a stale query targeting attempt + id columns that don't exist on the deployed noetl.event (PK (execution_id, event_id)) — every watcher POST 500'd with column "attempt" of relation "event" does not exist, blocking the #43 chain. Replaced with an inline INSERT matching the working handlers::events column set; terminal outcome rides in a chk_event_result_shape-conforming result envelope. Kind-val GREEN: kind_validate_container_callback.sh both probes pass (happy_path → succeeded, oom → failed_oom).
Prior: container-tool command type contradiction fixed (v3.0.2, server#172; closes noetl/ai-meta#81): ToolSpec.command Option<String> → Option<serde_json::Value> so the container tool kind's K8s-Job-style array command decodes server-side (was a 400 data did not match any variant of untagged enum ToolDefinition) and passes through unchanged to the worker's ContainerConfig.command: Option<Vec<String>>; a scalar stays a JSON string for the shell/db consumers; ToolCall::from_spec forwards the value verbatim instead of wrapping in Value::String. 2 new regression tests (playbook::types 18/18); clippy clean. Kind-val GREEN end-to-end — server accepts command: ["/bin/sh","-c"], worker dispatches the container tool, K8s Job reaches Complete 1/1 (pre-fix kubectl get jobs empty). Prior: e2e-sweep cleanup (v3.0.1, server#171; tracks noetl/ai-meta#49): 64 MB result-store PUT body limit (DefaultBodyLimit — was rejecting 15 MB+ payloads with HTTP 413); render_pipeline_config stashes set/args/spec/command blocks before Tera rendering; iter namespace map in build_iteration_command; cmd_render_ctx uses command.context override; stripped diagnostic tracing::debug! blocks. All 7 e2e sweep playbooks PASS on Rust-only kind stack. Prior: Sequential-mode iterator dispatch (v2.62.0, server#166; closes noetl/ai-meta#76): LoopMode enum (Sequential default / Parallel); LoopSpec.mode parsed from loop.spec.mode YAML; StepInfo.iterations_dispatched tracks command.issued count for the sequential dispatch guard; sequential pattern dispatches iteration 0 at fan-out then on each command.completed dispatches next if iterations_dispatched == iterations_completed(). Default is Sequential — existing playbooks without explicit spec.mode get sequential behavior. 3 new tests; lib pass; clippy clean. Kind-val GREEN: test/loop COMPLETED 5/5 + iterator_save_test COMPLETED 4 steps. First Claude-direct Rust PR under agents/rules/handoff-routing.md.
Prior: Durable result-store endpoints (v2.58.0, server#160; closes noetl/ai-meta#70): PUT + GET /api/result/<eid>/<step>/<ref>. Kind-val GREEN: output_select_test reached playbook.completed. Prior: Rust orchestrator exclusive-routing fix — step.skipped for untaken siblings (v2.57.2, server#159; closes noetl/ai-meta#67): under mode: exclusive only one arc fires; pre-fix, the static planner declared the untaken sibling's target as an upstream of any downstream merge step, then the R4 fan-in barrier waited for it forever. Three-part fix: (1) evaluator::evaluate_next_transitions stops break-ing on exclusive-mode match — surfaces unmatched siblings as not_matched_with_target results; (2) orchestrator::process_in_progress two-pass refactor — emit step.skipped for ALL unmatched arc targets first, then dispatch matched (HashMap-order-independent); (3) +2 unit tests + new defensive Jinja regression guard (server lib 568/0/0). Kind-val GREEN: e2e/fixtures/playbooks/comprehensive_test.yaml reaches playbook.completed in ~4s (was hanging forever). Single-commit patch on top of v2.57.1. Prior: Rust orchestrator step.data template accessor fix (v2.57.1, server#158; closes noetl/ai-meta#66): WorkflowState::build_context now injects a self-referencing .data key on the extracted user_data so {{ step.field }} (existing flat path) AND {{ step.data.field }} (new wrapped path) both resolve. Guarded by !map.contains_key("data") to preserve task_sequence flatten back-compat (a labeled sub-task's data field stays addressable as both <step>.<label>.data.x AND <step>.data.x). 2 new unit tests; cargo test lib 566/0/0; release build clean. Single-commit patch on top of v2.57.0. Prior: Phase D R5 R7 — cross-server parity harness; Replay engine port complete (v2.57.0, server#157; closes server#148; tracks noetl/ai-meta#49 Phase D R5). 
Final slice — hermetic parity rig: events.json (13 synthetic events) + expected.json (Python's pre-recorded fold output) + regenerate_expected.py (standalone Python script — verbatim extract of service.py fold + helpers, no noetl-package imports). tests/parity_harness.rs 8-test integration suite asserts structural parity field-by-field across all six projections + payload refs. Parity is structural not byte-for-byte hex (Python and Rust hash different digest inputs per R4's design). All 8 tests pass; lib 564/0/0; release build clean. No kind-val needed — test-only PR. All 7 Phase D R5 rounds shipped today (v2.51.0 → v2.57.0); the Replay engine port is complete. Phase D R5 R6 — payload resolver (v2.56.0, server#156; Refs server#148; tracks noetl/ai-meta#49 Phase D R5): every event's result.reference JSON gets parsed into a typed PayloadSummary and appended to the relevant projection's payload-refs list. PayloadSummary + PayloadRefEntry types mirror Python's dict shapes; extract_payload_ref + payload_summary mirror Python's helpers with three-tier fallback (reference.<field> → rows_ref.meta.<field> → rows_ref.ipc.<field>). ReplayExecutionState.payload_refs + ReplayFrameState.output_ref/output_ref_summary + ReplayBusinessObjectState.payload_refs/last_payload_ref all populated. 15 new unit tests; lib 564/0/0. Kind-validated against live execution showing 3 populated payload_refs with real SHA-256 digests. Only R7 remains. Phase D R5 R5 — snapshot seed + base_state + upcaster digest (v2.55.0, server#155; Refs server#148; tracks noetl/ai-meta#49 Phase D R5): the replay fold can now start from a prior fold's output and continue from there. ReplaySnapshotSeed mirrors Python's frozen dataclass; ReplaySnapshotInfo is the output subset; ReplayFoldOptions carries the optional inputs; new fold_replay_state_with_options entry point + 5-arg fold_replay_state as a back-compat shim.
R5-introduced ReplayState fields (replay_snapshot, upcaster_registry_digest) both skip_serializing_if None — default folds produce R1–R4-identical JSON. Snapshot-storage backend deferred to a downstream sub-issue. 8 new unit tests; lib 549/0/0. Kind-validated wire-shape back-compat. Phase D R5 R4 — typed Checksum + projection_checksums (v2.54.0, server#154; Refs server#148; tracks noetl/ai-meta#49 Phase D R5): every replay fold now produces a typed Checksum over the full state + a 6-entry projection_checksums map. ChecksumType enum (initial variant Sha256) + Checksum { type, value } struct + stable_json_bytes deterministic JSON encoder + compute_checksums per-projection + top-level digest run at the end of fold_replay_state. Design decision: digest input is the typed Rust state directly, not Python's flat-row normalize layer; Python byte-for-byte parity deferred to R7. 9 new unit tests; lib 541/0/0. Kind-validated: top-level checksum 41265876487f...ae426; six projection_checksums entries all populated. Phase D R5 R3 — loops + business_objects projections (v2.53.0, server#153; Refs server#148; tracks noetl/ai-meta#49 Phase D R5): third slice of Phase D R5 — replay fold populates the last two per-projection maps. Two new typed state structs (ReplayLoopState, ReplayBusinessObjectState) replace R2's serde_json::Map placeholders; ReplayState.{loops,business_objects} flip to BTreeMap for deterministic key ordering. Two new ID extractors mirror Python's _loop_id / _business_object_identity resolution order; business_object_status helper mirrors Python's _business_object_status suffix-derived ACTIVE/DELETED transitions; two new populate functions with full event-shape coverage (loop counters bump on command.{completed,failed} + loop.shard.{done,failed} + loop.{done,fanin.completed}; business-object attributes REPLACE/PATCH from meta). payload_refs deferred to R6. 13 new unit tests; lib 532/0/0. 
Kind-validated: re-probe returns loops + business_objects empty (expected — fixture doesn't emit those events). R4 design captured (per user direction): typed ChecksumType enum + Checksum { type, value } struct; future checksum types slot in via enum. Phase D R5 R2 — stages + frames + commands projections (v2.52.0, server#152; Refs server#148; tracks noetl/ai-meta#49 Phase D R5): second slice of Phase D R5 — replay fold populates stages + frames + commands projections. Mirrors Python's state["stages"] / state["frames"] / state["commands"] per-projection dicts. ReplayEventRow extended with stage_id / frame_id / command_id / worker_id / aggregate_type / aggregate_id / meta columns (all #[sqlx(default)]); three new typed state structs (ReplayStageState / ReplayFrameState / ReplayCommandState); ReplayState.{stages,frames,commands} flip from serde_json::Map to BTreeMap for deterministic ordering; three new ID extractors mirror Python's resolution order; three new populate functions with full status transitions (stage opened/closed; frame dispatched/started/committed/failed/abandoned; command full lifecycle). 10 new unit tests; lib 518/0/0. Kind-validated: re-probe of prior fanout_reduce execution returns commands map populated with 4 entries carrying worker_id + issued_event_id + last_event_id. Phase D R5 R1 — Replay endpoint scaffold + execution projection (v2.51.0, server#149; tracks server#148): opens Phase D Round 5 (Python's noetl/server/api/replay/service.py ~1236 LoC → Rust). Sub-issue server#148 documents the 7-round decomposition (R1 scaffold + execution / R2 stages+frames+commands ✅ / R3 loops+business_objects / R4 typed Checksum + projection_checksums / R5 snapshot seeds / R6 payload resolver / R7 cross-server parity harness). 
R1 ships new GET /api/replay/state route mirroring Python's endpoint.py byte-for-byte (query params + defaults + projection enum + mutually-exclusive cutoffs returning 400); new services::replay module with ReplayService + ReplayCutoff + ReplayProjection + ReplayState + ReplayExecutionState + pure deterministic fold_replay_state; minimal execution projection fold reuses the Phase D R4 terminal-event short-circuit. Phase D R4 follow-up: status endpoint short-circuits on terminal events (v2.50.1, server#147; closes server#146): ExecutionService::get_status now (1) looks up playbook.completed / playbook.failed FIRST and returns COMPLETED / FAILED directly, and (2) accepts 'success' lowercase in the completed_steps counter. Kind-validated: prior execution flipped RUNNING → COMPLETED on the same DB data. Phase D R4 slice 2 — apply_event handles step.skipped (v2.50.0, server#145; closes server#144): new step.skipped arm in state::WorkflowState::apply_event records the step with StepState::Skipped — fan-in barrier no longer defers forever when an upstream's when guard evaluates false. Container Tool Callback umbrella #43 Round 2 — POST /api/internal/container-callback/{execution_id}/{step} (v2.48.0, server#141; closes server#140; tracks noetl/ai-meta#43): external K8s watcher (Round 1, ops#166) POSTs Job terminal-state events here; handler validates path params, checks staleness via a single indexed SELECT on noetl.event, and emits a call.done event with the structured terminal state on match (or bumps noetl_container_callback_stale_total + returns 202 if no events exist for the execution). Six TerminalState variants survive in meta.terminal_state so playbooks can branch on the specific failure reason. Two new counters; 7 new unit tests; lib 487/0. Round 2 lands first per the umbrella's recommended ordering — smallest blast radius; unblocks Round 1 (watcher Deployment) + Round 3 (tools#36 Tool::Container).
Secrets Wallet #61 cloud-specific dynamic-secret providers shipped — umbrella feature-complete (v2.45.0 server#137 + v2.46.0 server#139 + v2.47.0 server#138): 6d.1 AWS STS AssumeRoleWithWebIdentity (EKS IRSA path; no SigV4 — token IS the credential); 6d.3 Azure AAD client-credentials (off-cluster non-IMDS path; sovereign-cloud overrides via env); 6d.2 GCP iamcredentials.generateAccessToken (workload-identity impersonation of a target SA). All three return SecretValue.expires_at populated — Phase 6d's cache_decision clamps cache TTL accordingly; Phase 7c.3 background refresh re-resolves inside the window. 39 new unit tests across the three providers; lib 470/0. Secrets Wallet umbrella is feature-complete: envelope encryption + KMS + 5 static-secret providers (GCP-SM, K8s, Vault, AWS-SM, Azure-KV) + 3 dynamic-secret providers (AWS-STS, GCP-IAM, Azure-AAD) + residency policy + cross-region broker + KEK rotation + audit + auto-renewal with stampede collapse. Phase 7c.3 — background-refresh wire-up + stampede collapse (v2.44.0, server#136): cache-hit path now spawns a background tokio::spawn to re-resolve via the provider + update the cache via KeychainService::set when the row is inside the refresh window. Cached value returns IMMEDIATELY to the caller — worker fetches stay on the fast path. Stampede collapse via new src/services/keychain_refresh.rs RefreshInflight (Arc<tokio::sync::Mutex<HashSet<(i64, String)>>>); concurrent refreshes for the same (catalog_id, alias) collapse to one provider call. Refactor: extracted resolve_via_provider from try_resolve_keychain so cache-miss inline + background refresh share identical code. Phase 7c series wire-complete (7c + 7c.2 + 7c.3). 
Phase 7a.2 / 7b.2 / 7c.2 — operator-facing rotation + audit storage + cache-refresh primitive (v2.42.0 server#127 + v2.43.0 server#129 + server#131): 7a.2 wraps the Phase-7a rewrap_storage_string primitive with POST /api/internal/wallet/rotate-kek (batched cursor scan of noetl.credential + noetl.keychain, RotateSummary { processed, rewrapped, skipped, failed, last_id } for progress checkpointing) and GET /api/internal/wallet/key-status (per-version row counts). 7b.2 adds the noetl.secret_audit table (CREATE TABLE IF NOT EXISTS at startup; server-owned), DbAuditSink impl, and GET /api/internal/secret-audit?credential=&execution_id=&from=&to=&limit= (bounded; ORDER BY occurred_at DESC; hard cap 10_000). 7c.2 adds KeychainService::should_refresh(catalog_id, keychain_name, execution_id, scope_type, now) — cache-side companion of the Phase-7c decision primitive; reads the row's expires_at, asks secrets::dynamic::should_refresh_default (honours KEYCHAIN_CACHE_REFRESH_WINDOW_SECS), bumps noetl_secret_refresh_total{outcome="triggered"} on true. Backward compatible. Next: 7c.3 stampede-collapse mutex + background re-resolve; 6d.1/.2/.3 cloud-specific dynamic providers. Phase 7c — token auto-renewal primitives (closes Phase 7) (v2.41.0, server#125): final named round of the Secrets Wallet umbrella. secrets::dynamic::should_refresh(expires_at, refresh_window, now) decision primitive (true iff expires_at set + still valid + within refresh window); KEYCHAIN_CACHE_REFRESH_WINDOW_SECS env (default 60). noetl_secret_refresh_total{outcome} counter (triggered|succeeded|failed|stampede_collapsed; failed alert-worthy) + noetl_secret_refresh_duration_seconds histogram (50ms–5s buckets). 5 new unit tests; lib 427/0. Lib-only. All named phases (1–7) of the Secrets Wallet umbrella are complete. Remaining queue is discrete follow-up sub-issues only. Phase 7b primitives — secret-resolution audit service (v2.40.0, server#123): durable audit trail of every credential resolution. 
AuditEvent struct (NEVER contains the secret value); bounded Operation + Outcome enums; AuditSink trait + NoopAuditSink default + SecretAuditService wrapper with record_async (fire-and-forget, never blocks resolver) + record_strict (await, used when compliance mandates the row exist before the value releases) + record (dispatches by strict-mode). NOETL_SECRET_AUDIT_REQUIRED env (default false; 1/true/TRUE/yes/YES enable strict). noetl_secret_audit_writes_total{operation, outcome, status} counter (failed_strict alert-worthy). 8 new unit tests; lib 422/0. Lib-only. Phase 7a — KEK rotation primitives (v2.39.0, server#121): starts Phase 7. KeyManager::current_key_version() trait accessor; EnvelopeCipher::rewrap_storage_string primitive (parse → if same version Skipped; else unwrap with historical version → re-wrap with current → Rewrapped { old_key_version, new_key_version, new_storage_string }). Plaintext payload NEVER reconstructed — pure DEK re-wrap, AES-GCM ciphertext bytes stay byte-identical. noetl_wallet_rotate_total{table, status} counter (skipped|rewrapped|failed_unwrap|failed_wrap|parse_error). 4 new unit tests; lib 414/0. Lib-only. Phase 6e — cross-region broker (closes Phase 6) (v2.38.0, server#119): BrokerRegistry (region → broker_url from NOETL_SECRET_BROKER_REGISTRY env; empty default = pre-6e fail-closed); POST /api/internal/cross-region/resolve peer endpoint validates expected_entry_region == server_region() (defensive against stale peer registries), resolves locally, seals via Phase-5a primitives to the requesting worker's pubkey directly; get_sealed handler falls back to broker on AppError::ResidencyViolation; KeychainDef.no_broker_fallback per-credential opt-out; AppError::CrossRegionUnreachable → HTTP 502. Two new metrics: noetl_secret_broker_call_total{broker_region, outcome} + noetl_secret_broker_call_duration_seconds{broker_region} histogram (50ms–5s). 10 new unit tests; lib 410/0. 
Both residency shapes operational: hard isolation (strict + no broker = HTTP 403) + soft federation (strict + broker registered = transparent cross-region routing). Phase 6 closes. Phase 6d primitives — dynamic-secret support + cache honors issuer TTL (v2.37.0, server#117): SecretValue.expires_at: Option<DateTime<Utc>> field; src/secrets/dynamic.rs cache_decision() honors min(default_ttl, expires_at - now - safety_margin) and returns SkipCacheAlreadyExpired when the deadline is already past or inside the operator's safety margin; KEYCHAIN_CACHE_DYNAMIC_SAFETY_MARGIN_SECS env (default 60); resolve_keychain_entry_with_meta returns the bundle's earliest expires_at; CredentialService::resolve_via_provider consumes the helper. Two new metrics: noetl_secret_dynamic_ttl_seconds histogram (1m/5m/15m/1h/4h/12h buckets) + noetl_secret_cache_skip_total{reason} counter. 7 new unit tests; lib 398/0. Backward compatible (providers without expires_at keep the 600 s default). Phase 6c — residency-policy gate (v2.36.0, server#115): KeychainDef.residency enum (none|advisory|strict, default none) + allowed_regions allowlist; resolver runs the gate at the top of resolve_keychain_entry BEFORE any provider call so strict-mode mismatches short-circuit with AppError::ResidencyViolation (HTTP 403, clear "credential X is region-locked to Y; this server is in Z" message that NEVER includes the value). noetl_secret_residency_check_total{policy, decision} counter — strict + violation_blocked is alert-worthy, advisory + violation_allowed is the migration-window signal. 8 new unit tests; lib 391/0. Phase 6b — ProviderRegistry + per-(provider, region) metrics (v2.35.0, server#113): server-side cache of (provider_id, region) → Arc<dyn SecretProvider> so the resolver doesn't rebuild from env on every cache-miss; RwLock + double-checked locking on the build path so concurrent get_or_build for the same key only builds once. Optional TTL via NOETL_SECRET_PROVIDER_TTL_SECONDS. 
New noetl_secret_provider_build_total{provider,region,status="cache_hit|ok|error"} counter + noetl_secret_resolve_duration_seconds{provider,region} histogram (5 ms – 5 s buckets). 7 new unit tests; lib 383/0. Phase 6a — region tag on keychain entries + per-region routing (v2.34.0, server#111): starts Phase 6 (residency-aware distributed resolution). KeychainDef.region optional field (no schema migration — lives in existing JSON blob); SecretRef.region provider-agnostic; AWS provider consumes it as the regional endpoint with explicit precedence (<region>: ref prefix > field > legacy project overload > AWS_REGION env). New NOETL_SERVER_REGION env + server_region() / effective_region() fallback helpers. noetl_secret_resolve_total{provider,region,status} counter per observability.md Principle 1. 5 new unit tests; lib 376/0. Lib-only — backward compatible. Phase 5b — wire format + sealing endpoint (v2.33.0, server#107): new GET /api/credentials/{id}/sealed?worker_id=<name> returns a SealedEnvelope (X25519-sealed credential JSON) addressed to the named worker; workers opt in by including worker_public_key in their register payload's runtime JSON blob — no schema migration; 400 BadRequest when the worker_pool row exists but didn't register a key; noetl_credentials_sealed_total{status} counter + credential.seal span per observability.md. Kind-validated end-to-end (Python cryptography + HKDF + ChaCha20-Poly1305 opens the envelope → recovers the bearer token + scope round-trip). Phase 5a — sealed payload crypto primitives (v2.32.0, server#107): src/crypto/sealed.rs X25519 ECDH + HKDF-SHA256 + ChaCha20-Poly1305 sealed-box (nonce derived from the shared secret, AAD pins alg+v for clean alg-mismatch rejection); 12 unit tests (round-trip, tamper, alg/version-mismatch, JSON wire stability); lib 369/0. Defense-in-depth on top of Phase-4 mTLS — cleartext never enters the response body. Lib-only; 5b adds the runtime-registry worker pubkey + sealing endpoint, 5c the worker side. 
Providers 3.x — AWS Secrets Manager + Azure Key Vault (v2.31.0, server#105): two new backends behind the one SecretProvider trait completing the 5-provider matrix. AWS SM uses hand-rolled AWS Signature Version 4 signing (no aws-sdk dep tree; signing key verified by a unit test against AWS's published reference vector); ref shape [<region>:]<secret-id>[#<json-key>] with JSON-key extraction for multi-field secrets; creds from env (the IRSA-injected triple). Azure KV uses IMDS Managed Identity (AKS/VMs) with TTL-cached bearer; ref shape [<vault>/]<secret-name>[#<version>]; sovereign clouds via NOETL_AZURE_KEYVAULT_DNS_SUFFIX. 21 new unit tests; lib 357/0; cloud-only backends (kind-val at unit-test layer like GCP). Phase 4a — opt-in TLS/mTLS listener (v2.30.0, server#103): the worker↔server credential channel (GET /api/credentials/<alias>) was plain HTTP; opt-in TLS via NOETL_TLS_CERT+NOETL_TLS_KEY (+NOETL_TLS_CLIENT_CA ⇒ mTLS), ring rustls + axum-server bind_rustls, kind-validated (200 w/ client cert, rejected w/o, plain HTTP refused). Providers 3.x — HashiCorp Vault provider (v2.29.0, server#101): a provider: vault keychain alias resolves from a Vault KV v2 secret (X-Vault-Token; ref [<mount>/]<path>#<key>), kind-validated end-to-end against an in-cluster Vault — second backend validatable on kind after K8s. /api/executions list perf + status fix (v2.28.1, server#99, #62): candidate-first rewrite (start-event index, not a 3.2M-row seq scan) — 6.5 s → 0.015 s (~430×), identical list; bool_or status-drift fix (was all-RUNNING). Secrets Wallet #61 providers 3.x — Kubernetes Secrets provider (v2.28.0, server#97): a provider: k8s keychain alias resolves from an in-cluster Secret via the API server + ServiceAccount token + cluster CA — the first secret backend kind-validated end-to-end with a real value (GCP needs GKE). 
Orchestrator-strand fix (v2.27.2, server#95): a deterministic evaluate failure (an invalid template in a step code body, an unknown step in a next arc, malformed routing) now emits a terminal playbook.failed instead of stranding the run in RUNNING forever — surfaced by the #54 e2e sweep (closed server#94). Parser fix (v2.27.1, server#93): NextSpec untagged-variant order — the list form next: [{step: x}] was deserialized into a struct Router positionally, silently dropping its arcs (and defeating unknown-step validation); sequence-shaped variants now precede the struct. Secrets Wallet Phase 3c — keychain cache (v2.27.0, server#91): execution-scoped, envelope-encrypted, TTL'd cache so an auth: "{{ alias }}" lookup isn't re-fetched from the secret manager per step; + fixed the keychain storage layer (queries never matched the table — also repairs the /api/keychain endpoints). Phase 3 (resolution) complete — R3b (v2.26.0) resolves a provider: gcp keychain alias from GCP Secret Manager on a credential miss; built on R3a/R2/R1 (v2.23.0–v2.25.0). Phases 1–2: Cloud KMS for the KEK (v2.22.0); envelope encryption (v2.21.0)
noetl/worker Rust NATS pull worker v5.34.0 #103 — in-process CQRS event materializer (ack-after-materialize) (v5.34.0, worker#115): opt-in (NOETL_MATERIALIZER_ENABLED, default off, system pool) consume-loop drains noetl_events with deferred ack (noetl-tools 3.13), POSTs events/project, acks each batch only on 2xx — failure → un-acked → redeliver, no loss. Closes the ack-after-materialize durability gap before the PUBLISH_ONLY flip; kind fault-injection: gate-on sole-writer loss=0 across a mid-drain failure. Prior: #108 (b) — pool-affinity decline (v5.33.0, worker#114): a worker declines (ACK+skip) command notifications whose execution_pool differs from its own segment, so the orchestrate drive lands on the dedicated system pool even if a JetStream consumer's filter drifted broad. Prior: #90 Phase 7 — batch dispatch + dedup opt-in + per-subscription rate limits (v5.19.0, worker#79, closes worker#78): when dispatch.batch_dispatch the runtime drains a backlog → execute_batch in chunks of batch_max (each item its own playbook/pool/trace/dedup); opt-in dedup stamps the block (idempotency_key→message_id); per-subscription rate limits via a new deterministic token-bucket RateGovernor (src/ratelimit.rs) enforced on the fetch side — over the cap the runtime stops fetching (source keeps the backlog, no loss) + a subscription.rate_limited event; new batch/rate-limit counters. Live on kind: batch 12→12 + per-message traceparent; rate-limit engaged + 10/10 → executions (no loss). ai-meta → worker 7531f4a. Prior (v5.18.0): #90 Phase 5 gcs spool wiring + bearer + $PORT bind. Prior: #90 Phase 4 — store-and-forward spool wired into the subscription run-loop (v5.17.0, worker#75): probe→circuit→spool-or-dispatch→ack, NATS-KV circuit persistence (survives restart mid-outage), drain-on-recovery, 6 spool/circuit events + noetl_subscription_spool_bytes gauge. buffer_and_ack/hybrid loss-safe. 
Prior: #90 Phase 2 — continuous subscription runtime (Mode B) (v5.16.0, worker#73): new WORKER_MODE=subscription run-mode that loads a kind: Subscription spec, builds the Phase-1 SourceClient via the tools build_source factory, register→activate→(pause/resume)→drain→deactivate (SIGTERM-driven on K8s), and loops poll()→one POST /api/execute per message on the dedicated subscription pool segment, applying header directives (redirect/pool/idempotency/content + W3C trace) + emitting directives_applied audit events. Observability triad (spans + noetl_subscription_* counters). Adopts noetl-tools 3.3.0. Prior: Adopts noetl-tools 3.1.1 (multi-tool sibling references, worker#69, b97f642; tracks noetl/ai-meta#87): Cargo.lock bump to deliver the task_sequence sibling-context fix — ^3 already covered it, no worker code change. chore(deps) rides the next worker release (production image lag normal); the dev kind cluster was validated against a worker built from the 3.1.1 code. Prior: call.done embeds inline context.data._ref on over-budget durable-success branch (v5.15.0, worker#63; closes noetl/ai-meta#69): when a step's tool result exceeds INLINE_CONTEXT_MAX_BYTES (64 KB) and the durable result_store PUT succeeds, build_call_done_result now emits {status, context: { data: { _ref: <noetl://...> } }, reference: {...}}. Downstream {{ step._ref }} resolves to the URI string; consuming artifact / result_fetch tools dispatch the URI-based fetch for the full data. 4 existing tests updated to assert the new inline shape; lib 126/0/0. Single-commit MINOR bump. Note: kind re-val of the consuming fixtures (test_output_select.yaml, test_storage_tiers.yaml) is blocked on noetl/ai-meta#70 — Rust server is missing PUT /api/result/<eid> so the worker lands on the degraded shm-only branch (no noetl:// URI to embed). Prior: noetl-tools 2.23.0 → 2.23.1 lockfile bump (worker#62; tracks noetl/ai-meta#68): adopts the artifact-tool args: alias fix shipped in noetl-tools v2.23.1. 
Lockfile-only — Cargo.toml already had noetl-tools = "2.23" which accepts any 2.23.x. 126/0 lib tests. Local image localhost/noetl-worker-rust:v5.15.1-tools231 built + loaded into kind for the #68 kind re-val. Prior: noetl-tools 2.21 → 2.23 dependency bump (worker#61, branch chore/bump-noetl-tools-2.23; tracks noetl/ai-meta#65 Round 2): adopts the python external-script loaders (file/gcs/http) + legacy main() function convention shipped via noetl-tools v2.22.0 + v2.23.0. Worker lib 126/0 against the new tools. Image localhost/noetl-worker-rust:v5.15.0-tools223 built locally + loaded into kind for the #65 kind-val (execution 322087210360770560 reached playbook.completed). No release tag — release-please skips chore(deps): commits. Container Tool Callback umbrella #43 Round 4 — worker-side pending_callback adoption (v5.14.0, worker#60): executor::command checks tool_result.pending_callback after success, when Some(true) logs INFO + bumps noetl_worker_call_done_skipped_pending_callback_total{tool_kind} + skips its own call.done emit so the server's container-callback endpoint owns the terminal event via the watcher path; when None (every existing tool, default) the emit path is preserved bit-for-bit. 126/0 lib tests against published noetl-executor 0.4.1. Secrets Wallet #61 Phase 5c — sealed credential delivery (v5.13.0, worker#58): long-lived X25519 keypair generated once at startup, pubkey registered in the runtime JSON blob, ControlPlaneClient::get_sealed_credential calls /api/credentials/{id}/sealed, unseals via the same primitives (drift-guard test pins server constants), zeroizes the cleartext after the auth-alias resolver consumes it. Env-gated (`NOETL_SEALED_CREDENTIALS=true
noetl/tools Shared tool registry crate v3.13.0 #103 — deferred (ack-after-processing) ack (v3.13.0, tools#71): AckMode::Defer in the subscription SourceClient surfaces a durable per-message ack handle (NATS $JS.ACK reply subject) instead of acking inline; SourceClient::ack(ack_ids, AckDisposition) = Ack/Nack/Term (NATS + Pub/Sub); tool operation: ack|nack|term. Opt-in — existing callers unchanged. The capability the worker materializer (v5.34.0) drives for ack-after-materialize. Prior: #90 Phase 4 — store-and-forward spool engine + per-downstream circuit breaker (v3.4.0, tools#54): noetl_tools::spool — circuit breaker (trip/half-open/close, NATS-KV-serializable, per-downstream OQ2), SpoolItem (SHA-256 + noetl://spool ref + recv_seq-ordered keys), nats_object/local_disk backends, ordered-replay engine (global/per_key/none + idempotency + dead-letter + retention/GC). 44 unit tests + real-NATS integration test. Prior: #90 Phase 2 — header-directive engine + public build_source factory (v3.3.0, tools#52): source/directives.rs DirectiveSpec/DispatchPlan turn allowlisted message headers into dispatch instructions (redirect dispatch.playbook, dispatch.execution_pool, priority→pool, idempotency_key, content_type/schema_hint, W3C trace), untrusted by default (allowlist + value-allowlist enforced at parse; multi-value last-wins; applied[] audit). Public build_source(cfg, ctx) so the worker continuous runtime constructs the same SourceClient. 12 new tests. Prior (v3.2.0): bounded-drain subscription tool + SourceClient (Phase 1). Prior: Multi-tool sibling references (v3.1.1, tools#48; closes noetl/ai-meta#87): in a tool: [list] step, TaskSequenceTool stored each sub-tool's result for the aggregated output but never injected it into the running context, so a later sub-tool's {{ <label>.<field> }} rendered empty — masked in quoted positions, a syntax error at or near "," in unquoted numeric SQL (save_edge_cases test_large_payload). 
Fix injects each sub-tool's result under its label (with a synthetic .data self-ref matching build_context) so later siblings + a later python sub-tool's stdin variables resolve it. 2 new unit tests; lib 300/0. Kind-validated: save_edge_cases test_large_payload → record_count = 100, save_delegation_test clean. Worker adopts via worker#69 (b97f642). Prior: e2e-sweep cleanup (v3.1.0, tools#47; tracks noetl/ai-meta#49): YAML boolean when: true in policy rules now checks as_bool() before the string-template fallthrough (Value::Bool as_str() returns None); `
noetl/cli Rust CLI + local-mode runner v4.11.0 noetl subscribe — local-mode subscription listener (RFC #90 Phase 6) (cli#60, ai-meta → 2fb3fb0): standalone listener + FileEventSink JSONL + local_disk spool; cli-only (noetl-tools v3.5.0 source+spool reused unchanged). Prior: --include-data flag doc fix (cli#58)
noetl/gateway Gatekeeper — auth + SSE + push-ingress v3.3.0 #90 Phase 3 — push-ingress (Mode C) + auth-gated directive trust (gateway#28): POST /ingress/{listener} verifies HMAC / bearer / Pub-Sub-OIDC → only-then directives → one POST /api/execute per delivery on the dedicated pool (verify-and-forward, no DB on the ingress path); verify_then_plan makes the auth gate a testable invariant; first /metrics surface. Live E2E green (HMAC 12/12 + bearer 12/12). Prior (v3.2.0): Phase F R3b-2 shard-info twin endpoint.
noetl/noetl Python control plane (legacy; retained for back-compat) v4.12.1 Deprioritized per Rust-only direction; pytest debt at noetl/noetl#663 parked
noetl/ops Helm + manifests (untagged) k8s-watcher durable image + pod-level OOM classification (ops@cacc513, ops#168; closes noetl/ai-meta#80, tracks noetl/ai-meta#43): retired the dead bitnami/kubectl:1.30.3 (removed from Docker Hub; cluster was on the bitnamilegacy stopgap) for alpine/k8s:1.30.3 (kubectl + jq + curl baked in) — the prior runtime install never put curl on PATH so callback POSTs returned HTTP 000. classify_pod_failure() now reads the backing Pod's status (RBAC already grants pod reads) to emit failed_oom (OOMKilled) / failed_image_pull (ImagePullBackOff); build_body's completed_at fallback uses RFC3339 `now
noetl/docs Docusaurus site (untagged) ADR Implementation-status block
noetl/travel Reference SPA (domain-fork example) (untagged) Production
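
The Phase 6d cache rule in the noetl/server row above (`min(default_ttl, expires_at - now - safety_margin)`, skipping the cache when the deadline is already past or inside the safety margin) can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in src/secrets/dynamic.rs: the `CacheDecision::Cache` variant, the function signature, and the use of Unix seconds (the row describes `DateTime<Utc>`) are hypothetical. Only the names `cache_decision` and `SkipCacheAlreadyExpired`, the 600 s default TTL, and the 60 s default margin come from the text.

```rust
use std::time::Duration;

/// Outcome of the cache decision. `SkipCacheAlreadyExpired` is named in the
/// dashboard; the `Cache` variant is a hypothetical shape for illustration.
#[derive(Debug, PartialEq)]
enum CacheDecision {
    Cache { ttl: Duration },
    SkipCacheAlreadyExpired,
}

/// Assumption: `now` and `expires_at` are Unix seconds so the sketch needs
/// no external crates.
fn cache_decision(
    default_ttl: Duration,
    expires_at: Option<u64>,
    now: u64,
    safety_margin: Duration,
) -> CacheDecision {
    match expires_at {
        // Providers without an issuer deadline keep the default TTL.
        None => CacheDecision::Cache { ttl: default_ttl },
        Some(deadline) => {
            // The value must outlive `now + safety_margin` to be cacheable.
            let cutoff = now.saturating_add(safety_margin.as_secs());
            if deadline <= cutoff {
                return CacheDecision::SkipCacheAlreadyExpired;
            }
            // min(default_ttl, expires_at - now - safety_margin)
            let remaining = Duration::from_secs(deadline - cutoff);
            CacheDecision::Cache { ttl: default_ttl.min(remaining) }
        }
    }
}
```

With a 600 s default and a 60 s margin, a secret expiring in 300 s would be cached for 240 s, while one expiring in 30 s would bypass the cache entirely.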

Architecture at a glance

Today — kind cluster is fully Rust

The 2026-06-04 session retired all Python deployments + their services + configmaps from the kind cluster (per the user directive "delete all legacy stuff"). Local validation runs against the Rust topology by default:

              ┌──────────────────┐
              │  Gateway         │  noetl/gateway  v3.2.0
              │  (Rust)          │  auth · SSE · subscriptions · shard routing
              └────────┬─────────┘
                       │ HTTPS
              ┌────────▼──────────────────────────────┐
              │  noetl-server-rust  (v2.19.7)         │
              │  catalog · execute · events ·         │
              │  /api/internal · DbPoolMap sharding   │
              │  orchestrator engine · SSE            │
              │  workbook resolution · pipeline parse │
              └────────┬──────────────────────────────┘
                       │
              ┌────────▼─────────┐
              │   NATS JetStream │  NOETL_COMMANDS stream
              │   + Postgres     │  noetl.event + noetl.command
              └────────┬─────────┘
                       │
        ┌──────────────┴──────────────┐
        ▼                             ▼
  ┌──────────────────┐         ┌──────────────────┐
  │ noetl-worker-    │         │ worker-system-   │
  │ rust v5.11.3     │         │ pool             │
  │ (shared pool)    │         │ (Rust, runs      │
  │                  │         │  system          │
  │ noetl-tools      │         │  playbooks)      │
  │ v2.18.1:         │         │                  │
  │ python · shell · │         │ Consumer:        │
  │ http · postgres  │         │  ..._pool_system │
  │ duckdb · rhai ·  │         │ Filter:          │
  │ task_sequence ·  │         │  noetl.commands. │
  │ playbook · noop  │         │  system.>        │
  └──────────────────┘         └──────────────────┘
   Consumer:                     [Outbox publisher +
    ..._pool_shared               projector migrated
   Filter:                        to system playbooks
    noetl.commands.               via Phase 2.a]
    shared.>

Retired this session (2026-06-04): noetl-server (Python deploy), noetl-worker (Python deploy), noetl-outbox-publisher (Python deploy), noetl-projector (Python statefulset), and the noetl/noetl-ext/noetl-projector/noetl-worker-metrics services + 4 legacy configmaps + noetl-worker SA. The kind cluster is the regression-test topology for Rust-only e2e.
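
The two consumer filters in the diagram (`noetl.commands.shared.>` and `noetl.commands.system.>`) rely on NATS subject wildcards: `*` matches exactly one token and `>` matches one or more trailing tokens. A minimal sketch of that matching rule, to show why a command published under the system segment never reaches the shared-pool consumer. The helper `subject_matches` and the concrete subjects in the usage note are hypothetical; NATS performs this matching itself.

```rust
/// Returns true when `subject` matches the NATS-style `filter`.
/// `*` matches exactly one token; `>` matches one or more trailing tokens
/// and is only valid as the last token of the filter.
fn subject_matches(filter: &str, subject: &str) -> bool {
    let mut f = filter.split('.');
    let mut s = subject.split('.');
    loop {
        match (f.next(), s.next()) {
            // `>` needs at least one remaining subject token and must end the filter.
            (Some(">"), Some(_)) => return f.next().is_none(),
            // `*` consumes exactly one token.
            (Some("*"), Some(_)) => continue,
            // Literal tokens must be equal.
            (Some(ft), Some(st)) if ft == st => continue,
            // Both exhausted at the same time: full match.
            (None, None) => return true,
            // Length mismatch or differing literal token.
            _ => return false,
        }
    }
}
```

Under this rule a hypothetical subject such as `noetl.commands.system.orchestrate` matches only the system filter, and the bare prefix `noetl.commands.system` matches neither filter because `>` requires at least one trailing token.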

[Status legend: ✅ = shipped + kind-validated]

v10 playbook compatibility on Rust (closed loop)

Six interlocking gaps fixed in one iteration brought control_flow_workbook end-to-end on the Rust-only stack — exercising the complete v10 control-flow surface:

playbook YAML  ──►  noetl-server-rust orchestrator                 noetl-worker-rust
                                                                   + noetl-tools v2.18
─────────────────  ──────────────────────────────────────────────  ──────────────────
workload: {...}   ► #56 workload + input alias decode               PythonTool wrapper
tool.kind: python ► (existing dispatch)                             globals().update(args)
tool.kind:                                                          ► #17 capture
  workbook        ► #59 parser substitutes inline action            `result = {...}`
                                                                    global → data
tool: [{...}]     ► #57 ToolDefinition::Pipeline accepts both       ► #18 TaskSequenceTool
  (pipeline)        flat (name-as-field) + nested (label-as-key)     runtime
                                                                    
{{ step.field }}  ► #60 build_context exposes step data at top
  in next.arcs      level (not just steps.<name>); apply_event
                    captures call.done before command.completed
                    overwrites
                    
command.failed    ► #58 trigger_orchestrator on command.failed
                    + dedicated short-circuit in process_in_progress
                    emits playbook.failed terminal
                    
worker → server   ► #55 EventEmitRequest accepts i64 wire shape
  event emission    (was rejecting integer execution_id)

Validated end-to-end:

noetl exec tests/fixtures/playbooks/control_flow_workbook
→ playbook_started
→ start (python)
→ eval_flag (workbook→python; is_hot=true captured via marker)
→ hot_path (next.arc when="{{ eval_flag.is_hot == true }}" matched)
→ parallel hot_task_a + hot_task_b
→ playbook.completed ✅

Long-term Python trajectory

Per the Rust-only direction, Python pieces stay only as:

  1. Container payloads — runtime stays Rust; user code that wants Python ships in a container dispatched by the container tool kind (#43, in design).
  2. Back-compat GKE deployments — existing Python pods on the production cluster aren't removed (yet) since GKE traffic still uses them; no new feature work goes there.

The kind cluster is the canary for the Rust-only topology. When the operator runs the validation rigs end-to-end against their live cluster, the R5 cutover decision will move the production topology to match.

Sessions log

Chronological notes on what each session accomplished — see Sessions Log.

Most recent (top of log):

  • 2026-06-19 🐞 #113 off-server drive — recover offloaded drive result + stop drive on cancel (server v3.29.4); #114 opened. Fixed the worker-driven drive stall when an __orchestrate__ result exceeds the 100KB inline budget (worker offloads it with only a reference.ref → server now resolves+decodes it via result_store.resolve, metric ref_resolved, instead of dropping → non-convergent re-loop) + the cancel non-stop facet (match underscore playbook_cancelled + ExecutionState::is_terminal terminal guard, no restart). Server #241 → v3.29.4 + rig e2e#63. Kind gate-ON proven (785KB result → ref_resolved→COMPLETED, 0 decode WARNs; cancel froze a drive-loop instantly; sole-writer lag 0). 5/9 #113 large-context fixtures COMPLETE; the other 4 hit a distinct oversized-command.issued (full upstream context embedded → >1MB NATS payload) stall → #114 (#113 stays open until all 9 close). ai-meta → server 1e844c1 + e2e 12b27e9. No prod default flipped.
  • 2026-06-19 🚀 #103 GKE pre-flip PREP — prod images pushed, GMP monitoring live, manifests staged; NO traffic flip / NO PUBLISH_ONLY. Verified prod already-Rust (the #49 cutover is done; pre-#103 live images), both flip secrets present, monitoring = Google Managed Prometheus (not VM). Pushed server v3.29.3 + worker v5.35.0 to the prod AR (amd64); applied + verified GMP PodMonitoring (worker+server /metrics — the noetl ns had none) + materializer-lag Rules (up{namespace="noetl"}=4 live); staged the roll-forward manifests (not applied — they roll live workloads); runbook gained a "Production (GKE)" section + GMP managedAlertmanager pager stub. Operator-gated: roll images → materializer shadow → pager → flip. ai-meta e5b6d6c → ops 9edd9c4 (ops PR #197). No prod default changed.
  • 2026-06-19 🛡️ #103 materializer-lag GUARDRAIL shipped — the pre-flip observability gate. The server was FLIP-READY; the remaining gate was a materializer-lag metric + alert. Worker #116 → v5.35.0 extends the JetStream lag poller to track the noetl_events/noetl_materializer consumer on an independent task → noetl_worker_nats_consumer_pending{consumer="noetl_materializer"} climbs even when the materializer loop is dead. Ops #195+#196: VMRule (backlog warning>200/critical>2000/growing + stall-under-gate + project-errors + absent-under-gate, stall guarded on backlog>0), worker /metrics VMServiceScrape (was unscraped), VMAlert enabled, Grafana dashboard, flip runbook noetl-cqrs-publish-only-flip.md. Kind-proven full cycle on the VM stack: green baseline (backlog 0) → induced lag (materializer fault-injected, events publishing under the gate) → gauge 0→684, alerts fire (backlog warning+critical + stall) → recover → drains→0 idempotently (0 dup/loss), alerts clear. ai-meta → worker b910341 (v5.35.0) + ops 2fcfa59 + worker-wiki 0030f30. PUBLISH_ONLY stays default-off.
  • 2026-06-19 🎯 #103 server cutover COMPLETE — FLIP-READY. The 2 ExecutionService cancel/finalize sites now route through the emit_event chokepoint (server #240 → v3.29.3 + e2e rig #62); kind-proven both modes; no remaining synchronous server event writers under the gate. Flipping PUBLISH_ONLY on is now a staged operator decision.
  • 2026-06-19 🎯 #104 off-server-drive × gate reconciliation PROVEN — the last real blocker before the PUBLISH_ONLY flip is operator-safe. The combination #103 left unproven (gate-on was only ever validated with the in-process drive) is green on kind: gate-ON (PUBLISH_ONLY=true) with the off-server drive (PLUGIN_DRIVE=true) + materializer sole writer → fresh exec + cursor fan-out → COMPLETED; server wrote 0 noetl.event rows (all 25 PUBLISHED — event_ingest_published_total=25), materializer materialized all 25 exactly once (25 rows == 25 distinct ids, 0 catalog_id=0, 0 dup cycles), drive dispatched=applied, read-your-writes held (the relocated trigger fires post-materialize → server rebuilds state from committed log before bounding the off-server drive input). Server #238 → v3.29.2 (76d29bb): cold-cache apply now rebuilds WorkflowState from the durable log (the #104 WAL-rebuild principle) instead of dropping the in-flight result on a server restart mid-drive — kind crash-recovery proof: hard-kill mid-drive → cold_rebuild metric+log fires → that exec COMPLETES with full event integrity. Committed e2e rig kind_validate_orchestrate_gate.sh (e2e#61). Regression green: gate-off + in-process (prod default) and gate-off + off-server. ai-meta → server 76d29bb + e2e 61f7a5c. Remaining before a safe flip: only the 2 ExecutionService cancel/finalize sites. No prod default changed.
  • 2026-06-18 🎯 #108 (c) — the worker-driven orchestrator drive is now the DEFAULT; #108 CLOSED. Flipped NOETL_ORCHESTRATE_PLUGIN_DRIVE to default true (server#233, v3.28.0 → server@80cc0e6). Gated on a scale soak on kind (images built from the released tips, server v3.27.0 / worker v5.33.0): a single 694-drive cursor+fan-out run (test_pft_flow_v2 3×40) COMPLETED with __orchestrate__ rows in noetl.event = 0 (event_suppressed +2082) and all 694 drives claimed on the system pool (shared pool got only the 671 real-step commands), 0 errors; 5× concurrent self-contained cursor = 5/5 COMPLETED. Then deployed the flipped image with no env var — the default-on path reproduced the identical shape (361 drives, system-isolated, 0 burst); 15/15 regression fixtures green; the revert (=false) verified to fall back to the in-process drive (system delta 0). In-process trigger_orchestrator_inner kept as the fallback. ai-meta → server 80cc0e6 + worker 437b0be + server-wiki 0210012.
  • 2026-06-18 orchestrate drive isolated on the SYSTEM pool via pool affinity (#108 follow-up b, kind-validated). Server stamps execution_pool on the command notification (server#232 → server@846166b); worker declines (ACK+skip) notifications not for its pool segment (worker#114 → worker@e2162b7), so the drive runs on the dedicated system pool even under JetStream consumer-filter drift. No worker HTTP pending-poll — the NATS consumer is the only claim vector. Validated: __orchestrate__ claimed+executed on the system pool (3), zero on the default pool, simple_python COMPLETED; 553+196 tests green. Only (c) the deliberate default-flip remains.
  • 2026-06-18 the orchestrate meta-command touches noetl.event ZERO times (#108 follow-up a, kind-validated). dispatch_orchestrate_command stops writing command.issued to noetl.event; the command lives only in noetl.command, and claim_command/get_command fall back to it on a miss (noetl.event stays authoritative for normal commands) (server#231 → server@9438f3b). So __orchestrate__ writes 0 of its former 5 rows per drive — the directive that system playbooks keep only their own state is met. Validated: cursor+fan-out COMPLETED via the noetl.command fallback, 0 event rows, 20 real steps normal, 0 errors. Remaining: NATS affinity (ops) + default-flip.
  • 2026-06-17 system playbook events no longer burst Postgres + system-pool routing (#108 slice 4b, kind-validated). The __orchestrate__ meta-command is infrastructure, not a workflow step, so the server now skips persisting its lifecycle events to noetl.event (handle_event_inner + claim_command) (server#230 → server@6aef3a6). At scale they'd burst noetl.event/Postgres for no benefit. Validated: __orchestrate__ now writes only the lone command.issued (1 of 5 — 80% fewer rows); cursor+fan-out flow still COMPLETED. Drive routes to the system segment (true isolation pending a NATS-affinity ops fix; resilient via the pending-poll meanwhile). Follow-ups: eliminate the last command.issued (claim from noetl.command) + NATS affinity.
  • 2026-06-17 🎯 the orchestrator drive runs OFF-SERVER on the worker pool (#108 slice 3, kind-validated). With NOETL_ORCHESTRATE_PLUGIN_DRIVE=on the server issues system/orchestrate (entry: run_state, args = the bounded WorkflowState) to the worker pool instead of evaluating in-process; the worker runs the drive, the server applies the result on the command's call.done (server#229 → server@465cdbb v3.23.0). Kind: test/simple_python drove start→end→COMPLETED through the round-trip (dispatched=2, applied=2, 0 decode_error, __orchestrate__ didn't leak as a step, playbook.completed). Default off, in-process fallback. Bug caught+fixed: output_b64 rides call.done not command.completed. Next: shadow→flip at scale, make drive the default, route to the system pool.
  • 2026-06-17 worker-driven cutover slice 2: apply_orchestration_result extracted + slice 3 designed (#108). The post-evaluate emission (events → commands → terminal) is extracted verbatim from trigger_orchestrator_inner into a reusable fn (server#228 → server@586aeae) so the worker-driven drive applies a worker-computed result identically. Behavior-preserving (553 tests green, clippy clean). Slice 3 (dispatch) designed + grounded: apply_event would phantom-step a meta-command, so the design uses a reserved __orchestrate__ step ignored in state, a flag-gated scheduler, apply-on-callback, and loop-prevention. Lands behind NOETL_ORCHESTRATE_PLUGIN_DRIVE (default off), kind-validated before the flip.
  • 2026-06-17 worker-driven cutover slice 1: configurable wasm guest entry (#108). The worker can now dispatch a named plug-in export (worker#113 → worker@04420d0): tool: {kind: wasm, plugin: {path, version, entry}} names the export (default run); the worker-driven orchestrator will use entry: run_state. invoke_bytes_with_entry + run_by_ref_entry/run_and_apply_by_ref_entry (originals delegate with run); test proves run→0xAA vs run_state→0xBB + missing-export error. Purely additive, no live change. Server scheduler+apply (the hot-path round-trip, default-off flag) next.
  • 2026-06-17 orchestrate plug-in drives the real workload identically, live (#108 slice 4). The orchestrator runs the plug-in alongside the in-process drive on every evaluation + diffs commands (server#227 → server@bd652ab) — a process-global wasmtime host (feature orchestrate-shadow) loaded from noetl.plugin_module at boot, gated NOETL_ORCHESTRATE_PLUGIN_SHADOW; in-process result authoritative. Kind-validated over the live 10×1000 PFT: noetl_orchestrate_shadow_total{result="match"} 529, ZERO mismatch/error, workers stable. Plug-in gains a state-input path (run_state); both build configs green (default no wasmtime). Slices 1-4 prove orchestrator-as-plug-in end to end; next is the worker-driven cutover.
  • 2026-06-17 system/orchestrate@1 registered + servable in a deployed server (#108 slice 3). The server bakes the orchestrate wasm into its image and seeds built-in system plug-ins into noetl.plugin_module on boot (server#226 → server@b21b589, kind-validated). New src/system_plugins.rs (pure dir-scan + sha256, unit-tested) + a wasmbuilder Docker stage + NOETL_SYSTEM_PLUGIN_DIR; in-process upsert (not the token-gated HTTP surface); digest-keyed hot-reload. Validated: GET /api/internal/plugins/system/orchestrate?version=1 → 200 application/wasm 1559093 bytes, ETag=digest, stale→409, baked sha256 == served digest. Next: kernel scheduler dispatches the now-registered plug-in.
  • 2026-06-17 system/orchestrate plug-in runs identically to native in wasmtime (#108 slice 2). A wasmtime shadow-diff (server#225 → server@ccec104) loads the built .wasm through a harness mirroring the worker host's invoke_bytes ABI byte-for-byte and asserts the wasm output equals the native drive over auth0 multi-arc when: routing (minijinja in wasm) + cold-start. Finding: command-set identity (parsed Value eq), not raw bytes — the context map serializes in insertion order (serde_json preserve_order ← upstream HashMap iteration, differs wasm32 vs host arch); the scheduler deserializes to Vec<Command>, so the command set is the bar. 2 unit + shadow-diff green; plug-in excluded; test-only. Next: catalog register/serve → kernel scheduler.
  • 2026-06-17 system/orchestrate WASM plug-in exists — drive core runs as a 0-import module (#108 slice 1). New standalone plugins/orchestrate/ crate (server#224 → server@10a629b) wraps the drive behind the worker plug-in ABI (input = JSON event-slice + playbook; output = JSON OrchestrationResult; data-plane = memory/alloc/run) and compiles to wasm32-unknown-unknown — the first non-trivial compiled system playbook. Feasibility risk retired: the .wasm has 0 imports (no WASI, no host render) — the whole drive incl. minijinja runs in-guest. Native parity test reproduces native evaluate byte-for-byte; 551 server tests green, crate excluded from the workspace. Next: worker-host shadow-diff → catalog register/serve → kernel scheduler (NOETL_ORCHESTRATE_PLUGIN, default off).
  • 2026-06-17 Orchestrator drive core fully wasm-resident — Event-ABI round #109 CLOSED. Slice 3 (server#223) moved orchestrator/evaluate from src/engine/ into noetl-orchestrate-core. All 6 drive modules (renderer, playbook model, commands, evaluator, state, orchestrator switch) now compile native + wasm32-unknown-unknown — the system/orchestrate plug-in seed (#108). evaluate reads the pure core::event::Event; server converts db::Event at the trigger_orchestrator boundary (slice-1 From). 122 core + 565 server tests green, 0 WASI imports on wasm32, clippy clean; cargo-chef image (v3.20.0) kind-deployed, PFT 10×1000 — full command lifecycle, 0 errors, 0 restarts. ai-meta → server bfd3f77 (internal refactor, stays v3.20.0).
  • 2026-06-14Transfer tool: Snowflake↔Postgres both directions — #99 CLOSED. Both transfer arms implemented with full credential-alias resolution. tools v3.10.0 (tools#65) + worker v5.22.0 (worker#87) + e2e#58. SF→PG: $n::text::<udt> coercion + RFC3339 timestamp reformat; PG→SF: generated INSERTs. Full bidirectional data_transfer/snowflake_postgres fixture COMPLETED on kind against live sf_test account. tools → 4127b4b · worker → 6d97e7c · e2e → 94aa7f1.
  • 2026-06-14: Snowflake key-pair JWT validated end-to-end — #98 last external-tool gap closed; transfer step → #99. noetl-tools v3.9.0 / v3.9.1 / v3.9.2 (tools#62/#63/#64) — key-pair JWT auth (bypasses MFA) + User-Agent fix (code 391903) + SQL-API context-in-body + multi-statement split (codes 391911 + 000008). Worker bumped to v3.9.2 (worker#83#86) + e2e#57 fixture cleanup. create_sf_database (CREATE DATABASE) + setup_sf_table (CREATE TABLE + INSERT) both COMPLETED via JWT on kind against the live sf_test account (NDCFGPC-MI21697). Transfer step fails (inline creds, no key-pair fields) — filed #99. ai-meta pointers: tools a216ab2 · worker 9d6b127 · e2e e191231.
  • 2026-06-12: #90 Phase 7 shipped — scale hardening; #90 CLOSED (all 7 phases complete), live proof green. Final phase: server v3.5.0 (server#189) POST /api/execute/batch (N→N, partial-failure contained) + opt-in exactly-once dedup window (noetl.subscription_dedup, bounded-by-age, race-safe, default off); worker v5.19.0 (worker#79) batch dispatch + dedup opt-in + per-subscription rate limits (deterministic token-bucket RateGovernor, fetch-side backpressure → source keeps backlog, no loss, subscription.rate_limited event); ops (ops#176) + e2e (e2e#48); no tools change → no crate cascade. Live on kind: batch 12→12 COMPLETED on the subscription pool + per-message traceparent; dedup duplicate→1 execution + subscription.message.deduplicated; rate-limit engaged + 10/10 → executions (no loss); direct-curl within/outside-window + dedup-off + batch partial-failure all green. ai-meta → server 7b217d8 + worker 7531f4a + ops 6db69b9 + e2e 203593b. #90 closed; follow-ups tracked: #91#94 + tools#57.
  • 2026-06-12: #90 Phase 6 shipped — CLI local noetl subscribe + FileEventSink + local_disk spool (live local proof green). Added noetl subscribe <spec.yaml> (cli v4.11.0, cli#60, closes cli#59): a kind: Subscription listener run standalone in local mode — no k8s, no NATS-dispatch server for the listening itself — reusing the same noetl_tools source clients + directive engine + spool engine the in-cluster worker uses, emitting the same ExecutorEvent envelope to a local FileEventSink (one event/line JSONL → replayable trail). Local dispatch (RFC §5.3): in-process via PlaybookRunner (pure-local default) or POST /api/execute. local_disk spool (§8.6): circuit-breaker + buffer + ordered replay + idempotency + dead-letter against a local dir, circuit state in a local file. New src/subscribe/{mod,spec,sink,dispatch,runtime,spool}.rs + examples/subscribe/. cli-only — no tools change / crate cascade (the source+spool surface ships in noetl-tools v3.5.0; bumps the lock 3.0.0 → 3.5.0 via the executor's "3"). Tests: 12 subscribe + full bin suite (53) green, incl. a deterministic outage→spool→ordered-replay→idempotency proof on the real engine. Live (in-cluster NATS on kind): 5 msgs → received=5 dispatched=5 failed=0 (19-event JSONL trail); local_disk spool outage → 6 message.spooled (0 dispatched, no loss) → recovery → 6 message.replayed in order → drained to 0. Finding: the NATS source ignores URL-embedded user:pass (async-nats ConnectOptions) — specs use explicit user/password. ai-meta → cli 2fb3fb0 (v4.11.0); wiki cli subscribe. #90 stays open for Phase 7 (scale hardening, volume-gated).
  • 2026-06-11: #90 Pub/Sub + Kafka brought to live-E2E parity with NATS (validation gap closed). Stood up the two remaining subscription brokers in kind — Pub/Sub emulator (gcloud SDK image) + single-broker KRaft apache/kafka:3.9.1 — under noetl/ops (ops#170), and added bounded-drain fixtures + kind-validate runners under noetl/e2e (e2e#41). Both backends passed the same live bar as NATS: publish/produce 5 → bounded drain count=5 acked=true → execution COMPLETED + call.done/command.completed/playbook.completed event trail. No adapter code change needed — the pure-Rust kafka crate talks to Kafka 3.9 KRaft and the Pub/Sub REST backend works against the emulator as-is. The one fix: the <step>.output.<field> accessor never resolved (both when: arcs skipped → drain stalled); corrected to <step>.<field> in the fixtures + the latent ops subscription_drain.yaml example. Validated on server v3.1.0 + worker v5.15.2 + tools v3.2.0; cluster left on that clean released stack. ai-meta → ops 568a4ac + e2e 8d21e7a. #90 stays open (Phases 2–7 design-only).
  • 2026-06-11: #89 shipped — JSON null round-trips through {{ step }} (server fix, v3.0.6). #89 — null → undefined serialization — CLOSED. The #88 cursor fixture walked all 4 pages but its 4th check_pagination crashed: the terminal page's next_cursor: null, re-injected via the whole {{ fetch_page }} envelope, rendered as the JS token undefined (invalid JSON), so the consuming Python step received response as a str. Traced the corrupt command.issued args.response to the renderer that builds next-step inputs — the server orchestrator (src/template/jinja.rs::render_to_value), not the worker the issue blamed. json_value_to_minijinja maps JSON null → Value::UNDEFINED; minijinja's map repr emits undefined; render_to_value failed from_str and fell through to a raw string. The noetl-tools engine already had a | tojson retry for exactly this; the server's copy had diverged without it. Fix (server#177, v3.0.6) ports the retry. 5 new tests; 619 lib + 8 parity green; clippy clean. Kind-validated end to end on the live test-server (baseline 4th check_pagination error → fixed success; cursor collects 35, matching offset). ai-meta → server 8e17fbe. Standing direction honored — Claude wrote the Rust directly, no Codex.
  • 2026-06-10: #88 shipped — pagination fixtures read response.body.*; #89 filed. #88 — offset/cursor pagination fixture path — CLOSED. The Rust http tool nests the parsed JSON payload under body ({{ fetch_page }} → {body, headers, status_code}); the fixtures read response.get('data', {}), which resolved to {}, so has_more/next_cursor defaulted falsy and the loop exited after page 1 despite the correct post-#85 machinery. Confirmed the shape against a live http-tool result, then switched both check_pagination steps to response.get('body', {}) (e2e#40). Kind-validated: offset walks 0→10→20→30, users 10/10/10/5, validate_results success 35, playbook.completed COMPLETED; cursor path-fixed + walks all 4 pages (Mg==→Mw==→NA==→null, 35 events fetched) but the terminal page surfaced a distinct worker bug → #89 (worker serializes next_cursor: null as JS undefined when re-injecting {{ fetch_page }}, so the consuming Python step gets an unparseable str). Other pagination fixtures (retry/max_iterations/pipeline*/loop_with_pagination) share the same envelope-key assumption over /api/v1/assessments|flaky ({data, paging}) — flagged, left for follow-up. ai-meta → e2e 72a7525.
  • 2026-06-10: #87 shipped, #85 deferred (e2e sweep follow-ups #85/#87). #87 — multi-tool sibling references — CLOSED. task_sequence (the tool: [list] pipeline runtime) stored each sub-tool's result for the aggregated output but never injected it into the running context, so a later sub-tool's {{ <label>.<field> }} rendered empty — masked in quoted positions, a syntax error at or near "," in unquoted numeric SQL (save_edge_cases test_large_payload). Fix (tools#48, v3.1.1) injects each sub-tool's result under its label (synthetic .data self-ref); worker adopts via worker#69. Kind-validated on a worker built from the fix: save_edge_cases test_large_payload → record_count = 100 (no syntax error), save_delegation_test clean. ai-meta → tools 76f942a + tools-wiki 4962f8b + worker b97f642. #85 — workflow-arc loop re-entry — DEFERRED (kept open). Implemented the dispatch-guard layer (draft server#176): a back-edge detector (cycle + recency) re-enters a completed loop head, so the loop no longer hangs (608 lib tests + 5 new pass). But kind validation surfaced a second blocker — set: ctx.X loop variables are recomputed per orchestrator pass and revert to the workload default when the producing step is re-dispatched (a minimal counter-loop thrashes 0,0,1,0,1,2,…). Full multi-page pagination needs durable event-sourced ctx propagation across iterations — larger than is safe to land well-tested in one session; held as a draft, not merged. Standing direction honored: Claude wrote all Rust directly (no Codex).
  • 2026-06-10: #80 closed — container_callback chain green end to end. Fixing the watcher's missing curl (the literal #80 goal) surfaced two more layered bugs beneath it. Watcher image (ops#168): the manifest used the retired bitnami/kubectl:1.30.3 (removed from Docker Hub; the live cluster was patched to the bitnamilegacy archive) with a runtime apt/apk install step that never put curl on PATH → callback POST returned HTTP 000. Switched to alpine/k8s:1.30.3 (kubectl + jq + curl baked in), dropped the install hack. Server insert (server#173, v3.0.3): once curl worked the POST reached the server and 500'd — the container-callback handler inserted call.done via a stale query targeting an attempt column that doesn't exist on the deployed noetl.event; fixed to the working handlers::events column set. OOM path: the watcher only read Job-level conditions so failed_oom could never fire — added pod-level OOMKilled → failed_oom classification (ops#168); the completed_at fallback for failed Jobs used bare jq now (numeric epoch → HTTP 422), fixed to RFC3339 now | todate; and the e2e fixture's bytes(40MiB) was calloc-lazy (mapped to the zero page, never faulted in) so the container exited 0 — switched to a written-into bytearray that dirties pages and reliably OOM-kills (e2e#38). Verified the kind cluster actually enforces memory limits (120 MiB in a 32Mi pod → OOMKilled exit 137). Rebuilt the server image + reloaded into kind; kind_validate_container_callback.sh both probes GREEN — happy_path → succeeded (delta 1), oom → failed_oom (delta 1). This is the last blocker on the #43 container-callback chain. ai-meta → ops cacc513 + server 5d2cf58 (v3.0.3) + e2e 6aaf06e.
  • 2026-06-10: #79 closed — e2e kind-val runners back on the current noetl CLI surface. Both scripts/kind_validate_*.sh runners aborted immediately on error: unrecognized subcommand 'playbook' — they targeted the retired noetl playbook register/execute + noetl execution status/events verbs. The validation logic and the event taxonomy (step.enter / command.completed / node_name / the fan-in barrier) were intact; only the invocation layer had drifted. Fix (e2e#37): noetl register playbook --file, noetl exec <catalog-path> --runtime distributed --json (exec by metadata.path, not the bare name), noetl status <id> --json, and the event log over noetl query (no events verb today — rows wrap under .result, order by event_id since noetl.event has no timestamp column). Added a fail-fast CLI-surface guard to each runner. Validated on kind (server-rust v3.0.1 + worker-rust, :8082): fanout_reduce PASS start-to-finish with no manual workaround; container_callback drives register→exec→COMPLETED cleanly and stops at the metric-delta assertion because the deployed noetl-k8s-watcher image lacks curl (watcher.sh: curl: not found → HTTP 000) — a cluster-side watcher gap tracked on #80. Version-skew note: PATH binary is noetl 2.17.0, repos/cli submodule is v4.10.0; the targeted surface is identical across both, so the runners work on either (the binary lags the submodule by a major line — worth refreshing for parity, not required here). Pointer: e2e → a3594b3; e2e wiki: new Kind-Val Runners page.
  • 2026-06-10: #82 closed — GUI credential View/Edit recovered for pre-wallet records. The Secrets Wallet (#61) moved credential storage to forward-only envelope encryption; pre-wallet records now 500 on GET /api/credentials/{id}?include_data=true (Decryption failed: aead::Error), so the GUI View/Edit flow dead-ended on a generic toast (response shape unchanged). Fix (gui#36): View surfaces the real reason + points to Edit; Edit still opens with the list-row metadata (name/type/description/tags) + a warning banner and an empty-but-required data field, so re-entering the secret and saving re-seals the record under the current wallet — recovering it. Validated live against kind + the dev:kind UI on :3001. Also landed e2e#36 (duplicate workload probe-flag keys removed from tooling_non_blocking) and gui#35 (dev:kind convenience script). Pointers: gui → 8cacc9e (v1.11.1), e2e → 4a9ffbc.
  • 2026-06-10: #81 closed — noetl-server v3.0.2 fixes the container-tool command type contradiction. ToolSpec.command was Option<String> (scalar) but the container tool kind writes a K8s-Job-style array — an array failed the server's ToolDefinition untagged-enum match (400), a scalar was rejected by the worker's ContainerConfig.command: Option<Vec<String>>. Typed command as Option<serde_json::Value> (same as args); ToolCall::from_spec forwards it verbatim. 2 regression tests; clippy clean (server#172, v3.0.2). Kind-val GREEN end-to-end: server accepts the array command, worker creates the K8s Job, Job reaches Complete 1/1. Server pointer bumped (ai-meta → server bd36672). Chain counter-bump validation stays gated on #79 (runner CLI) / #43.
  • 2026-06-09: E2E sweep cleanup — noetl-tools v3.1.0 + noetl-server v3.0.1. Stripped the diagnostic tracing::debug! scaffolding added during the e2e triage, kept the production fixes: YAML when: true boolean + |tojson object-template fallback (tools#47), 64 MB result-store body limit + pipeline command/spec stash (server#171). Pointers bumped (ai-meta@316048c tools, @6590bd6 server); tracks #49. All 7 sweep playbooks PASS on Rust-only kind. Worker crates.io dep-revert deferred — v3.1.0 not yet on crates.io ([skip ci] release commit).
  • 2026-06-08: noetl-tools v2.24.2 clippy cleanup + noetl/server#22 closed. Cleared the clippy -D warnings CI gate on noetl-tools (15 warnings across 7 files; all mechanical lint fixes). Closed stale noetl/server#22 (Phase D orchestrator engine port — complete). noetl/server PR #167 (same clippy shape) opened, awaiting merge.
  • 2026-06-05: Rust-only regression rig — canonical v10 SQL + http config shapes. Swept ~30 self-contained e2e fixtures against the Rust-only kind stack and fixed three config-shape classes in noetl-tools: postgres command: alias + multi-statement SQL (tools#24, v2.18.3), a task_sequence→duckdb regression test (tools#25), and the duckdb command: alias + http params/headers/form non-string coercion (tools#26, v2.18.4). Worker adopted both (worker#50, worker#51). Newly GREEN: duckdb_test, json_serialization_save, duckdb_retry_query, pagination/{offset,cursor,max_iterations,pipeline}, retry_simple_config. Recovered the cluster first (server had latched into NATS not configured after a podman restart). Server-side follow-up noted: loop_with_pagination renders {{ execution_id }} empty in a multi-statement postgres command.
  • 2026-06-05: postgres-tool observability — real SQLSTATE errors. noetl-tools 2.18.2 (tools#21) + worker dep bump (worker#49): the postgres tool surfaces the real SQLSTATE + message instead of the opaque db error. Validated end-to-end — a bad query reports ERROR: relation "..." does not exist (SQLSTATE 42P01) in the call.error event. Closes the last follow-up from the credential/iterator saga.
  • 2026-06-05: iterator_save_test GREEN — full v10 + credential + iterator-pipeline surface validated. server#73 (v2.19.7) defers task_sequence _prev/_results refs at command-build so nested-pipeline templates render at runtime. iterator_save_test reaches playbook.completed and writes 3 rows to the real demo_noetl DB — the deepest v10 path (iterator → pipeline → _prev chaining → nested credential → postgres write). Closes the credential + iterator + pipeline chain (server#71, worker#46, worker#48, server#73).
  • 2026-06-05: Nested-pipeline credentials + template-timing finding. worker#48 (v5.11.3) — the worker now pre-resolves keychain aliases on task_sequence SUB-tasks; iterator_save_test's nested save_item postgres step connects to demo_noetl. Closes the credential-path chain (store → alias-key → nested resolution, all validated). Last iterator_save_test blocker found + filed: server#72 — the server pre-renders task_sequence {{ _prev.* }} refs (runtime-only) to empty → malformed SQL (a symptom the v2.19.5 Chainable change surfaced).
  • 2026-06-05: Keychain-credential path validated on Rust-only. Continuing R5 Tier 4, registered the pg_k8s postgres credential and probed the DB-backed fixtures. Surfaced + fixed a 3-bug chain in the keychain subsystem: credential store bound AES-GCM Vec<u8> to a TEXT column (server#71, v2.19.6); alias resolution read only the auth: key not v10's credential: (worker#46, v5.11.2). Proven: iterator_save_test's create_table connects + runs DDL against the real demo_noetl DB. Third bug — nested-pipeline credentials (task_sequence sub-tasks bypass worker resolution) — filed as worker#47 for a follow-up round. Session details
  • 2026-06-05: v10 control-flow runs end-to-end on Rust-only. Phase F R5 Tier 4 re-probe found + fixed 7 more bugs across the Rust stack (server v2.19.5 server#69 6 commits, worker v5.11.1 worker#44, tools v2.18.1). Four v10 fixtures now reach playbook.completed: start_with_action, end_with_action, loop_test, control_flow_workbook; actions_test correct-fails on a missing TEST_SECRET env. Root-cause chain: catalog SQL type drift → ToolSpec null-serialization → worker array-config drop → orchestrator end-step skip + task_sequence label-wrap → minijinja Lenient-vs-Chainable undefined → end-step trigger gate. Also: rust-analyzer workspace setup + rule (ai-meta@38287b7). Session details
  • 2026-06-04 (late evening): Rust-only e2e complete + legacy cleanup. Six interlocking server gaps closed in one iteration (#55–#60) + two noetl-tools fixes (#15, #16); worker dep bump; kind cluster legacy Python deployments retired. control_flow_workbook runs fully end-to-end on the Rust-only stack. Standing direction pinned: Rust-only focus, ignore Python tasks. Session details
  • 2026-06-04 (afternoon): Pipeline + failure termination + workbook resolution. Three server PRs landed together as v2.19.3 (#61, #63, #65).
  • 2026-06-04 (morning): EE-5 lax decode + workload + input alias. v2.19.1 + v2.19.2 — unblocked Rust worker → Rust server emission + canonical v10 playbook compatibility.
  • 2026-06-04 (early morning): Phase F R4-5 + R4 complete. N=2 shard kind validation script + ExecutionService refactor.
  • 2026-06-03: Phase F R4 series — DbPoolMap N+1 pool layer, AppState wiring, per-execution handler cutover, cluster-wide list fan-out.
  • 2026-06-02 (afternoon): Architecture pivot: rest of migration moves to system playbooks. Closed #30, #45; promoted #46.
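
The #88 entry above (pagination fixtures reading the wrong envelope key) can be illustrated with a minimal Python sketch. It assumes the {body, headers, status_code} envelope stated in that entry; the function names and the placement of has_more / next_cursor directly under body are hypothetical and may not match the real fixtures.

```python
# Hypothetical stand-ins for a fixture's check_pagination step; not the real fixture code.

def read_page_buggy(response):
    # Pre-fix shape assumption: payload under "data". The key is absent from the
    # envelope, so this resolves to {} and both fields default falsy.
    payload = response.get("data", {})
    return payload.get("has_more", False), payload.get("next_cursor")


def read_page_fixed(response):
    # Post-fix: the http tool nests the parsed JSON payload under "body".
    payload = response.get("body", {})
    return payload.get("has_more", False), payload.get("next_cursor")


# Assumed envelope for illustration only.
envelope = {
    "status_code": 200,
    "headers": {},
    "body": {"has_more": True, "next_cursor": "Mg=="},
}

print(read_page_buggy(envelope))
print(read_page_fixed(envelope))
```

With the buggy reader the loop sees no further pages and exits after page 1, which is the symptom the entry describes.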

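The #90 Phase 7 entry above mentions a deterministic token-bucket rate limit with fetch-side backpressure (the source keeps its backlog, so nothing is lost). A minimal Python sketch of that idea follows. It is not the worker's RateGovernor: the class, the drain helper, and all parameters are hypothetical, and the caller passes the clock value explicitly so the behavior is reproducible.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Hypothetical token bucket; time is injected so runs are deterministic."""
    rate_per_sec: float
    capacity: float
    tokens: float
    last: float = 0.0

    def try_acquire(self, now: float, cost: float = 1.0) -> bool:
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_sec)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


def drain(backlog, bucket, now):
    """Fetch only what the bucket allows; the remainder stays in the backlog."""
    fetched = []
    while backlog and bucket.try_acquire(now):
        fetched.append(backlog.pop(0))
    return fetched


bucket = TokenBucket(rate_per_sec=2.0, capacity=2.0, tokens=2.0)
backlog = list(range(5))
first = drain(backlog, bucket, now=0.0)
second = drain(backlog, bucket, now=1.0)
third = drain(backlog, bucket, now=1.5)
```

Because the limiter throttles the fetch rather than dropping messages, every item is eventually delivered in order once enough tokens accrue.
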
Releases

See Releases for the per-repo release log with links to GitHub Releases pages.

Recent (2026-06):

  • 2026-06-19: noetl/server v3.29.4 — 🐞 #113 off-server drive: recover offloaded drive result + stop drive on cancel (server#241, rig e2e#63). apply_worker_orchestration resolves+decodes an offloaded __orchestrate__ result (over the 100KB inline budget → durable reference.ref) instead of dropping it → non-convergent re-loop (metric ref_resolved); cancel now matches underscore playbook_cancelled + a terminal guard evicts the orch-cache (no restart). Kind gate-ON proven; 5/9 #113 fixtures COMPLETE, the other 4 hit a distinct oversized-command.issued stall → #114. No prod default flipped.
  • 2026-06-19: noetl/server v3.29.3 — 🎯 #103 cutover COMPLETE, FLIP-READY: the 2 ExecutionService cancel/finalize writers route through the emit_event chokepoint (server#240, e2e rig e2e#62) — the last synchronous server noetl.event writer under the gate is closed. Kind-proven both modes (gate-off byte-identical INSERT; gate-on PUBLISHED + materializer sole writer + terminal state + 0 loss/dup). All three flip blockers closed → PUBLISH_ONLY flip is a staged operator decision. Default-off; no prod default changed.
  • 2026-06-19: noetl/server v3.29.2 — off-server-drive × gate crash-recovery: cold-cache apply rebuilds WorkflowState from the durable log instead of dropping the in-flight drive result (server#238, refs #104/#103). Unblocks the PUBLISH_ONLY flip (off-server drive × gate now kind-proven). Confined to the cold branch; no prod default changed.
  • 2026-06-19: noetl/tools v3.13.0 + noetl/worker v5.34.0 — #103 ack-after-materialize durability: deferred ack-after-processing capability (tools#71: AckMode::Defer + $JS.ACK durable handles + ack/nack/term) + in-process CQRS materializer consume-loop (worker#115: drain→project→ack-only-on-success, redeliver on failure) + system-pool wiring (ops#194). Kind fault-injection: gate-on sole-writer loss=0 across a mid-drain failure. Default-off.
  • 2026-06-18: noetl/server v3.28.0 — worker-driven orchestrator drive now default ON (NOETL_ORCHESTRATE_PLUGIN_DRIVE defaults true) (server#233, closes #108; scale-soak-gated, revert = =false).
  • 2026-06-18: noetl/worker v5.33.0 — pool-affinity decline (drive isolated on the system pool) (worker#114, refs #108 (b)).
  • 2026-06-14: noetl/worker v5.22.0 — transfer endpoint credential-alias resolution, both Snowflake↔Postgres directions (worker#87, closes #99).
  • 2026-06-14: noetl/tools v3.10.0 — Snowflake↔Postgres transfer arms + flatten credential config (tools#65, closes #99).
  • 2026-06-14: noetl/tools v3.9.2 — Snowflake SQL-API context in request body + multi-statement split (tools#64, refs #98).
  • 2026-06-14: noetl/tools v3.9.1 — set User-Agent on the Snowflake HTTP client (tools#63).
  • 2026-06-14: noetl/tools v3.9.0 — Snowflake key-pair JWT authentication (tools#62; kind-validated on live sf_test account).
  • 2026-06-12: noetl/server v3.5.0 — POST /api/execute/batch + opt-in exactly-once dedup window (server#189, RFC #90 Phase 7 — scale hardening).
  • 2026-06-12: noetl/worker v5.19.0 — batch dispatch + dedup opt-in + per-subscription rate limits (worker#79, RFC #90 Phase 7 — scale hardening, closes #90).
  • 2026-06-12: noetl/cli v4.11.0 — noetl subscribe, local-mode subscription listener (cli#60, closes cli#59, RFC #90 Phase 6). Standalone kind: Subscription listener + FileEventSink JSONL trail + local_disk store-and-forward spool; cli-only (reuses noetl-tools v3.5.0 source+spool). ai-meta → cli 2fb3fb0.
  • 2026-06-11: noetl/server v3.0.6 — round-trip JSON null in whole-object {{ step }} references (server#177, closes noetl/ai-meta#89). A null field in a {{ step }} envelope rendered as the JS token undefined (invalid JSON), so the consuming step received an unparseable str; render_to_value now retries with | tojson (undefined/none → JSON null) — the server renderer had diverged from the noetl-tools engine that already did this. Kind-validated: cursor pagination collects all 35 events through the terminal next_cursor: null page. ai-meta pointer → server 8e17fbe.
  • 2026-06-10: noetl/tools v3.1.1 — multi-tool sibling references (tools#48, closes noetl/ai-meta#87). TaskSequenceTool now injects each sub-tool's result under its label so a later sub-tool resolves {{ <label>.<field> }} (was rendering empty — a syntax error at or near "," in unquoted numeric SQL positions). Worker adopts via worker#69. Kind-validated (save_edge_cases test_large_payload → record_count = 100). ai-meta pointer → tools 76f942a + worker b97f642.
  • 2026-06-10: noetl/server v3.0.3 — container-callback insert matches the deployed noetl.event schema (server#173, tracks noetl/ai-meta#43). The handler's call.done insert targeted a non-existent attempt column → HTTP 500 on every watcher callback; replaced with an inline INSERT matching the working ingestion path. Unblocked the container-callback chain (kind-val GREEN both probes). ai-meta pointer → 5d2cf58.
  • 2026-06-10: noetl/gui v1.11.0 + v1.11.1 — credential View/Edit recovery for pre-wallet records (gui#36, closes noetl/ai-meta#82) + dev:kind convenience script (gui#35). ai-meta pointer → 8cacc9e.
  • 2026-06-10: noetl/server v3.0.2 — container-tool command type contradiction fix (server#172, closes noetl/ai-meta#81). ToolSpec.command Option<String> → Option<serde_json::Value>: the container tool's array command now decodes server-side + passes through to the worker's Vec<String>; scalars stay JSON strings for shell/db tools. Kind-val GREEN (K8s Job reaches Complete 1/1).
  • 2026-06-08: noetl/tools v2.24.2 — clippy cleanup: 15 warnings resolved across 7 files (tools#44, closes tools#42). Mechanical lint fixes, zero behavioral changes.
  • 2026-06-05: noetl/tools v2.18.4 — duckdb command: alias (parity with postgres) + http params/headers/form non-string coercion (tools#26); worker adopts it (worker#51). Unblocks the pagination + http + duckdb-command fixtures.
  • 2026-06-05: noetl/tools v2.18.3 — postgres command: alias + multi-statement SQL on postgres + duckdb (tools#24, closes tools#23); worker adopts it (worker#50). duckdb_test + json_serialization_save GREEN.
  • 2026-06-05: noetl/tools v2.18.2 — postgres tool surfaces the real SQLSTATE + message instead of db error (tools#21); worker bumped to it (worker#49).
  • 2026-06-05: noetl/server v2.22.0 — Secrets Wallet Phase 2: GCP Cloud KMS KeyManager (Cloud KMS :encrypt/:decrypt + Workload Identity); runtime NOETL_KMS_PROVIDER (local/gcp-kms); KEK can leave the process (server#81, tracks #61). Kind-validated on local.
  • 2026-06-05: noetl/server v2.21.0 — Secrets Wallet Phase 1c/1d: credentials + keychain store envelope-encrypted (per-record DEK wrapped by the KEK); self-describing {"v":1,…} blob, forward-only (server#79, tracks #61). Kind-validated end-to-end.
  • 2026-06-05: noetl/server v2.20.0 — Secrets Wallet Phase 1b: envelope-encryption core — KeyManager/LocalDevKms/EnvelopeCipher (server#77).
  • 2026-06-05: noetl/server v2.19.8 — Secrets Wallet Phase 1a: remove the all-zeros default encryption key, fail closed (server#75, tracks #61). Kind-validated.
  • 2026-06-05: noetl/tools v2.18.5 — dollar-quote-aware statement splitter; the 2.18.3 splitter shredded plpgsql $$ … $$ blocks (tools#27).
  • 2026-06-05: noetl/server v2.19.7 — defer task_sequence _prev/_results refs at command-build (server#73); nested-pipeline templates render at runtime → iterator_save_test GREEN.
  • 2026-06-05: noetl/worker v5.11.3 — resolve keychain aliases on task_sequence sub-tasks (worker#48); nested postgres-in-pipeline steps connect.
  • 2026-06-05: noetl/server v2.19.6 — credential store base64-armors the AES-GCM blob for the TEXT data_encrypted column (server#71); keychain creds register + round-trip.
  • 2026-06-05: noetl/worker v5.11.2 — resolves keychain alias under the v10 credential: key (worker#46).
  • 2026-06-05: noetl/server v2.19.5 — v10 control-flow end-to-end (server#69, 6 commits): catalog INT4 + catalog_id alias, ToolSpec skip-null, orchestrator end-step-with-action + task_sequence flatten + intra-pass dedup, template Chainable-undefined, end-step trigger gate.
  • 2026-06-05: noetl/worker v5.11.1 — preserve array tool_config for task_sequence (worker#44).
  • 2026-06-05: noetl/tools v2.18.1 — task_sequence parse_tasks accepts worker-envelope shape.
  • 2026-06-04: noetl/server v2.19.4 — orchestrator template context: step data at top level + call.done capture.
  • 2026-06-04: noetl/tools v2.18.0 — TaskSequenceTool.
  • 2026-06-04: noetl/tools v2.17.1 — PythonTool result-global capture.
  • 2026-06-04: noetl/server v2.19.3 — three fixes shipped together: pipeline flat shape decode (#61), failure termination (#63), workbook resolution (#65).
  • 2026-06-04: noetl/server v2.19.2 — v10 workload + input alias (#59).
  • 2026-06-04: noetl/server v2.19.1 — EE-5 lax decode for integer execution_id (#57).
  • 2026-06-04: noetl/server v2.19.0 — Phase F R4-4b (ExecutionService refactor + cluster-wide list fan-out).
  • 2026-06-04: noetl/server v2.13.0 → v2.19.0 — Phase F R4 series (DbPoolMap N+1 pool layer through R4-5 kind validation).
  • 2026-06-04: noetl/gateway v3.2.0 — Phase F R3b-2 shard-info twin endpoint.
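
The v3.0.6 entry above (JSON null rendering as undefined, then a retry through | tojson) can be sketched in Python. This is a toy model, not the server's Rust renderer: render_repr is a hypothetical stand-in for a template engine's map repr, and the two render_to_value_* functions only mimic the fall-through and the retry described in that entry.

```python
import json


def render_repr(value):
    """Toy stand-in for a template engine's repr: None prints as `undefined`."""
    if value is None:
        return "undefined"
    if isinstance(value, dict):
        items = ", ".join(f"{json.dumps(k)}: {render_repr(v)}" for k, v in value.items())
        return "{" + items + "}"
    if isinstance(value, list):
        return "[" + ", ".join(render_repr(v) for v in value) + "]"
    return json.dumps(value)


def render_to_value_buggy(value):
    text = render_repr(value)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Pre-fix behavior: fall through to the raw, unparseable string.
        return text


def render_to_value_fixed(value):
    text = render_repr(value)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Post-fix behavior: retry through a JSON serializer (the role of `| tojson`),
        # so None becomes JSON null and the structure survives.
        return json.loads(json.dumps(value))


page = {"events": [1, 2], "next_cursor": None}
print(type(render_to_value_buggy(page)).__name__)
print(type(render_to_value_fixed(page)).__name__)
```

The consuming step only gets a structured object in the second case, which matches the reported symptom of the step receiving a str on the terminal page.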

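The v2.18.5 entry above refers to a dollar-quote-aware statement splitter. A minimal Python sketch of the technique follows. It is an illustration under simplifying assumptions, not the noetl-tools implementation: split_statements is a hypothetical name, and the sketch ignores semicolons inside single-quoted strings and SQL comments.

```python
import re

# Matches $$ or a tagged delimiter such as $body$ (the tag must not start with a digit,
# so positional parameters like $1 are not mistaken for delimiters).
_DOLLAR_TAG = re.compile(r"\$(?:[A-Za-z_][A-Za-z_0-9]*)?\$")


def split_statements(sql):
    """Split on ';' except inside dollar-quoted blocks."""
    statements, buf, i, open_tag = [], [], 0, None
    while i < len(sql):
        m = _DOLLAR_TAG.match(sql, i)
        if m and (open_tag is None or m.group() == open_tag):
            # Opening delimiter, or the closing delimiter that matches the open one.
            open_tag = None if open_tag else m.group()
            buf.append(m.group())
            i = m.end()
            continue
        ch = sql[i]
        if ch == ";" and open_tag is None:
            stmt = "".join(buf).strip()
            if stmt:
                statements.append(stmt)
            buf = []
        else:
            buf.append(ch)
        i += 1
    tail = "".join(buf).strip()
    if tail:
        statements.append(tail)
    return statements


sql = (
    "CREATE FUNCTION f() RETURNS int AS $$ BEGIN RETURN 1; END; $$ LANGUAGE plpgsql; "
    "SELECT f();"
)
parts = split_statements(sql)
```

A naive split on every semicolon would cut the function body into fragments, which is the failure mode the entry attributes to the 2.18.3 splitter.
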
Conventions

How agents (Claude / Codex / Cursor) operate across this ecosystem — pointers into the rule files in agents/rules/:

How to use this dashboard

  • Just landed in this codebase? Read Repo Map, then Execution Model, then the umbrella for whatever you're working on.
  • Picking up an in-flight task? Find the matching umbrella page above; it has the full state of the work + the next concrete step.
  • Need to file new work? Follow the issue tracking convention — open the ai-task issue on noetl/ai-meta, then add the umbrella to the table above and create the corresponding wiki page.
  • Maintenance pass? Refresh this Home + the Sessions Log + the Releases page + the matching Umbrella-*.md page when you bump a submodule pointer. All four pages drift together — see Rule 0a's checklist.
