Kadyapam edited this page Jun 23, 2026 · 265 revisions

NoETL Ecosystem Dashboard

Last refreshed: 2026-06-23 (Claude session: 🪶 #104 OQ5 Option A SHIPPED, the producer-staged result tier (noetl-worker v5.46.0). The producing worker stages the over-budget result tier object at emit time under NOETL_RESULT_PRODUCER_STAGE (default off → byte-identical no-op), decoupling the tier write from noetl.result_store; materializer skip-on-exists; shared decide_tier → byte-identical. worker#132 → v5.46.0 (27c7c17), e2e#81 + docs#186 merged, ai-meta worker pointer bumped (#128); ops#206 (open, do-not-apply) adds the soak-gate alert rules. The result_store-retirement soak is now LIVE on prod, T0 2026-06-23T21:45Z (48h floor / 72h target): both worker pools upgraded to v5.46.0 + NOETL_RESULT_PRODUCER_STAGE=true. The original soak gate mint_authoritative{legacy_fallback}=0 was vacuous (MINT is unset on noetl-worker-rust, so it carried no GMP data) and was repointed to resolve_total{fallback_*}=0 + materializer_skip_exists>0 + staged>0 (ops#207); a GMP rule-label fix (noetl.io/* → noetl_io_*) restored evaluation of both this gate and the #103 flip guardrail (ops#208). ai-meta ops pointer → 8940fd5. Soak plan + start criteria on #104. No prod default changed; the result_store dual-write stays in place. Earlier: see Sessions Log for the A–D enablement, Phases E/F, and WI/ADC GCS history.)
Refresh cadence: every session that lands meaningful cross-repo work (per agents/rules/wiki-maintenance.md Rule 0a)
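
Why the original soak gate was vacuous can be sketched as a predicate over metric samples. This is an illustrative Python sketch, not the alert rules from ops#206/ops#207: the dictionary keys are the shorthand names used in the note above, the `fallback_legacy` label is a hypothetical instance of `fallback_*`, and treating an absent series as 0 is an assumption made for the illustration.

```python
from typing import Mapping


def original_gate(samples: Mapping[str, float]) -> bool:
    # Assumption: an absent series reads as 0, so this check passes
    # even when the metric is never emitted (the vacuous case).
    return samples.get("mint_authoritative{legacy_fallback}", 0.0) == 0.0


def repointed_gate(samples: Mapping[str, float]) -> bool:
    # Sum every series whose (shorthand) name matches resolve_total{fallback_*}.
    fallback_total = sum(
        value
        for name, value in samples.items()
        if name.startswith("resolve_total{fallback_")
    )
    # Zero fallbacks alone is not enough: the gate also demands positive
    # evidence that the producer-staged path is actually being exercised.
    return (
        fallback_total == 0.0
        and samples.get("materializer_skip_exists", 0.0) > 0.0
        and samples.get("staged", 0.0) > 0.0
    )


no_data: dict = {}
healthy = {
    "resolve_total{fallback_legacy}": 0.0,  # hypothetical label value
    "materializer_skip_exists": 12.0,
    "staged": 12.0,
}
regressed = {**healthy, "resolve_total{fallback_legacy}": 1.0}
```

With no data at all, `original_gate` passes while `repointed_gate` fails, which is the property the repointing is after: the gate can only go green when the staged path has demonstrably run.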

Standing direction (2026-06-04). Per memory entry, Python tiers are deprioritized. Forward Rust-only e2e work is tracked under [#54][i54] (Phase F R5). Python pieces stay deployable for backwards-compat on GKE but are NOT a target for new feature work.

Single pane of glass for the NoETL platform. Every active umbrella, every submodule, every release lands here so a single page shows what's in flight, what shipped, what's next.

Convention. This wiki is the cross-repo dashboard. Per-repo wikis (e.g. noetl/server wiki, noetl/ops wiki) document that repo's surface; this wiki documents the system of repos. See Wiki convention for the split.

Active umbrellas

Umbrellas open in the ai-task queue: the Rust server parity-port umbrella ([#49][i49]), the distributed-OS program (#107), the event-WAL + derivable-storage model (#104), the decouple-result-data RFC (#115), the worker-driven orchestrate E2E coverage (#111), the Rust regression baseline migration (#98), and the Python-era deploy legacy cleanup (#97). #101 (orchestrator scaling, subsumed by the #115 RFC) and #95 (postgres tz-naive timestamp / NaiveDateTime bug) both closed 2026-06-22. #100 (cursor/claim loop mode) closed 2026-06-15 with server v3.8.0 + tools v3.10.1; test_pft_flow_v2 all_passed:true on kind (see Recently closed). #99 (transfer-tool credential aliases) closed 2026-06-14 via tools#65 + worker#87 + e2e#58 (see Recently closed). The subscription / listener tool RFC (#90) closed 2026-06-12 with all 7 phases shipped + live-proven (see Recently closed); refinement follow-ups #91–#94

# Opened Last update Umbrella Status Wiki page
#107 2026-06-17 2026-06-17 Program: Distributed Multitenant OS - Server Dissolution → Global Grid The strategic roof over #101–#105. Blueprint names NoETL a distributed multitenant OS (server→stateless edge; NATS WAL + object store the only durable state; processing = event-driven system playbooks on a sharded grid; foundation for quantum-cloud-hybrid). 5-step path: CQRS cutover (step 1, shadow green) → orchestrator-as-plug-in (#108, done 2026-06-18: drive core fully wasm-resident (#109 closed); the system/orchestrate plug-in compiles to a 0-import .wasm (server#224), runs identically to native in wasmtime (server#225), is seeded into the registry on boot + servable (server#226), and drives the real workload identically live (shadow over the 10×1000 PFT, 529 evals 0 mismatch; server#227); worker-driven cutover (the drive runs OFF-SERVER on the pool), kind-validated: entry/run_state dispatch (worker#113) + apply_orchestration_result (server#228) + the flag-gated scheduler/apply/state-guard (server#229, NOETL_ORCHESTRATE_PLUGIN_DRIVE); simple_python drove start→end→COMPLETED through the worker round-trip. #108 CLOSED 2026-06-18: (c) the default-flip shipped (server#233, v3.28.0, drive default ON) after a clean scale soak; orchestrator-as-plug-in is done; the in-server shadow + wasmtime server dep retired in #110 / server#234) → per-shard WAL → drop Postgres → cross-shard federation. docs blueprint
#115 2026-06-19 2026-06-20 RFC: Decouple result data from context — reference-only schema + one-level event chain + worker-side state builder PROGRAM-SCALE STEP 2 SHIPPED + multi-replica validated 2026-06-20 — execution-affinity write ordering (#116). Step 1 (KV coherence, server v3.38.0) was necessary-not-sufficient; affinity routes every trigger for an execution to the replica that ShardConfig::owns it (non-owner forwards) so the read→advance is atomic per execution → 2+ replicas produce one unforked chain. server#252 → server v3.39.0 5e00d0a (src/affinity.rs; NOETL_EXECUTION_AFFINITY/NOETL_PEER_URL_TEMPLATE/NOETL_SHARD_INDEX_FROM_HOSTNAME, all default off) + e2e#71 66b6e1b (2-replica StatefulSet topology + rig HARD gate). Multi-replica gate-ON kind PASS: linear/loop/fanout COMPLETE, every chain roots=1/dangling=0/walk==total, forwarded_ok proof, never-scan + sole-writer across replicas; single-replica unchanged; 595 server tests + clippy green; baseline restored. Remaining #117: off-server from_events spine ordered by event_id wedges fan-in under a chain-order≠id-order inversion (affinity + high-concurrency fanout) — fix = order spine by prev_event_id walk; linear/loop already reliable. PROD GKE untouched; all affinity flags default off. Phase 1 SHIPPED 2026-06-19 — references-in-state consume side. Worker resolve_context_references made selective (resolve a noetl:// ref only when this command's tool input binds the step's bulk — a path the bounded extracted summary can't satisfy; predicate / scalar / _ref access reads off the summary; worker#117 v5.36.0). Server hydrate_result_references surfaces _ref/_store/_uri on the kept summary + refs_in_state default flipped to true (server#243 v3.30.0). Kind gate-ON: all 9 #113 stalls COMPLETE (max command ctx 412KB, 0 __orchestrate__ event rows, materializer lag 0); the 3 targets bounded (output_select 1.32MB→10KB, storage_tiers 17.4MB→36KB, lease_expiry 201 spinning orch cmds→16). Closed #113 + #114. 
Off-server-drive prod cutover (#107/#111) unblocked on the size axis; PROD GKE untouched. Phase 2 MERGED 2026-06-19: one-level prev_event_id event chain (server#244 → server v3.31.0 f5bd4a8 + noetl#667 → noetl ecd16a2; ai-meta pointer afdb365): each noetl.event/noetl.command carries the chain link, stamped at the emit chokepoint from a per-execution chain-head watermark, covering both gate paths + the materializer; chain-correctness proven walkable / 1-root / no-gap / no-scan across 6 gate-ON executions (incl. a real fan-out branch origin), sole-writer + Phase-1 bounded sizes intact, 573 tests + clippy green. Post-merge verified on live kind (both prev_event_id columns present, server image reflects merged code, gate-ON baseline live). Phase 3 MERGED 2026-06-19: chain-walk state builder (server-side, flagged) (server#245 → server v3.32.0 8338417; ai-meta pointer bumped): the drive rebuilds WorkflowState by walking prev_event_id head→root (in-memory ChainHeads head + (execution_id,event_id) PK lookups, no WHERE execution_id scan) → same from_events (orchestrate-core unchanged; parity by construction), behind NOETL_STATE_BUILD_MODE=chain_walk|event_scan (default event_scan, prod unchanged), event-scan kept as the fallback (cold-head / lag / non-genesis). Gate-ON validated: parity 41/41 MATCH 0-mismatch (tx-isolated), NO-SCAN proven (event_scans_total=0, 40 builds / 1064 PK hops / 0 fallbacks), all topologies COMPLETE, sole-writer + lag-0 + gate rig PASS, 577 tests + clippy green. Self-merged (no classifier block).
**Phase 4 KERNEL + FLAG SHIPPED + shadow kind-validated** ([worker#118](https://github.com/noetl/worker/pull/118) → worker **v5.37.0** fef961c + [server#246](https://github.com/noetl/server/pull/246) → server **v3.33.0** 3e6006d): the pool-side state_builder reconstructs WorkflowState from the **noetl_events WAL**: a per-execution chain index walks prev_event_id head→root, caching the spine keyed by the immutable chain head + incremental tail-advance + cold-rebuild on miss/restart. A live WAL **shadow** loop (NOETL_STATE_BUILDER_SHADOW, default off) + the NOETL_STATE_BUILDER=offserver
[#49][i49] 2026-06-02 2026-06-14 Rust server FastAPI parity port — primary server v3.6.0 (system worker pool + cleanup/purge endpoint, server#193); prod is Rust-only (server-rust + worker-rust + system pool). Remaining entangled refactor tracked in #97. Umbrella: Rust Server Port
#104 2026-06-16 2026-06-23 Event WAL + derivable result storage: NATS-as-WAL, logical-URI naming, Feather tier In progress. 🪶 OQ5 Option A SHIPPED 2026-06-23: the producer-staged result tier (worker v5.46.0, NOETL_RESULT_PRODUCER_STAGE default-off) decouples the over-budget tier write from result_store; result_store-retirement soak LIVE on prod, T0 2026-06-23T21:45Z (48h floor / 72h target): both worker pools on v5.46.0 + PRODUCER_STAGE=true; the original mint_authoritative{legacy_fallback}=0 gate was vacuous (MINT unset on worker-rust) so it was repointed to resolve_total{fallback_*}=0 + materializer_skip_exists>0 + staged>0 (ops#207), and a GMP rule-label fix noetl.io/* → noetl_io_* (ops#208) restored evaluation of both this gate and the #103 flip guardrail; ai-meta ops pointer 8940fd5. ✅ Phase D ENABLED LIVE on prod 2026-06-23: staged A–D enablement complete. NOETL_RESULT_MINT_AUTHORITATIVE=true on system-pool + server only (NOT worker-rust, where it would spawn a stray result materializer at the wrong cell; its consume already resolves from the tier via Phase C): the URN→Feather/GCS tier is now the authoritative result store with noetl.result_store kept as the reversible dual-write fail-safe. Validated across 4 over-budget execs: Feather object only at env=prod/cell=usc1-a (no stray), result_store_dual_write_total 1:1 with execs (+ result_store row each time), resolved_feather correct (1200/1100/test_passed=true), event-mat sole-writer projected==acked, lag 0, never-scan 0, 0 restarts, 0 auth errors. LEFT ENABLED, revert armed (set env … NOETL_RESULT_MINT_AUTHORITATIVE-); GC/DR stay OFF. Remaining: OQ5 byte-source re-plumb + result_store retirement (gated). Prior: Phases E + F MERGED 2026-06-23 (the FINAL build phases; all PRs squash-merged in dependency order; flags default-off → inert in prod).
Phase E (side-effect durability barrier): tools#78 publishes noetl_tools::registry::kind_is_side_effecting (noetl-tools v3.17.0 1d49dd5; members noetl-directives + noetl-locator re-published first per the workspace publish order), worker#130 the consume-pool barrier (NOETL_SIDE_EFFECT_BARRIER, adopt-only → side effects fire exactly once across re-drive) → noetl-worker v5.44.0 d696f7e, e2e#79 rig. Phase F (result-tier GC + DR): server#264 conservative dry-run-first GC sweeper (NOETL_RESULT_TIER_GC, never deletes a live-referenced object) → noetl-server v3.44.0 341b614, worker#131 materializer verify-and-repair DR (NOETL_RESULT_TIER_DR, byte-identical re-derive) → noetl-worker v5.45.0 dd07016, ops#205 + e2e#80. worker#130/#131 depend on the published noetl-tools 3.17 (no patch/path/git dep). ai-meta pointers bumped. A–F BUILD phases now complete; only operational items remain on #104 (prod GCS infra, OQ5 byte-source re-plumb, staged prod enablement/minting cutover). #104 stays OPEN. Prior: Phase D MERGED 2026-06-23 (all 4 PRs squash-merged in dependency order server → worker → ops → e2e; flags default-off → inert in prod): the minting flip ships the dual-write window: NOETL_RESULT_MINT_AUTHORITATIVE makes the URN → Feather/GCS tier the authoritative result store (materializer = authoritative tier writer, resolve-by-URN = primary consume path) with noetl.result_store kept as the reversible dual-write fallback. No new crate publish: no Cargo.toml dep change in either PR; server resolves noetl-locator 0.1.1, worker resolves from the registry. semantic-release cut noetl-server v3.43.0 + noetl-worker v5.43.0. server#263 → noetl-server v3.43.0 6f6b9ef; worker#129 → noetl-worker v5.43.0 be6863a; ops#204 b19b759; e2e#78 07e85aa. ai-meta pointers bumped.
OQ5 (result_store retirement): DECIDED metric-gated (drop dual-write once mint_authoritative_total{path=legacy_fallback} holds 0 across a staging soak + retention time floor), gated on a not-yet-done byte-source re-plumbing prerequisite. Prod minting cutover (rolling server 3.43.0 + worker 5.43.0 to GKE) is a separate next task. #104 stays OPEN (Phase E/F + minting cutover remain). Prior: Phase C MERGED 2026-06-23 (all flags default-off → inert in prod): the result-DATA read half: server GCS object backend + cell-endpoint registry + GET /api/internal/cells, worker resolve-by-URN read path (closes OQ6) + fixes B/B1; server#262 → noetl-server v3.42.0 c2d5ca9, worker#128 → noetl-worker v5.42.0 7971041, e2e#77 39dc880. Prior: Phase B MERGED 2026-06-22 (flag default-off → inert in prod until a future rollout enables NOETL_RESULT_MATERIALIZER_ENABLED): the shadow Feather result tier: a separate noetl_result_materializer consume-loop on the system pool writes the over-budget result (tabular → Arrow Feather, non-tabular → JSON, small → inline no-op) to the derived §7 key alongside noetl.result_store; nothing reads it until Phase C. All 5 PRs squash-merged in dependency order; noetl-locator member bumped 0.1.0→0.1.1 pre-merge so the additive ResultCoordinates::parse/from_locator API publishes. tools#77 → noetl-tools v3.16.0 7da39d8 + noetl-locator 0.1.1 on crates.io; server#261 → noetl-server v3.41.0 4a6659e (sibling consumer); worker#127 → noetl-worker v5.41.0 4b1c15b (consume-loop); ops#203 c92753c (flag + cell seed); e2e#76 04c3332 (kind rig). No downstream repoint needed: worker resolves noetl-tools ^3.14.2 + a self-contained local inversion; server resolves noetl-locator ^0.1.0; both from the registry, no git/branch dep. #104 stays OPEN (Phases C–F remain).
Prior: Phase A MERGED 2026-06-22 (flag default-off, inert in prod until a future server rollout): slim dependency-free noetl-locator 0.1.0 extracted + published to crates.io (resolves OQ7: control plane parses the canonical result URI without noetl-tools' heavy graph; duckdb/kube/arrow/tonic/rhai/gcp_auth confirmed absent from the server tree; noetl-tools re-exports it), and the server accepts the canonical URI behind NOETL_RESULT_URI_ACCEPT. tools#76 → noetl-tools v3.15.0 dc0c5d8; server#260 → noetl-server v3.40.0 c89d078 (dep repointed git→0.1.0 pre-merge, 623 tests green); e2e#75 rig eeca8b7. Prior: RFC upgraded for review (2026-06-22): now that the event-WAL half is live on prod (materializer sole writer, off-server chain-walk, prev_event_id, off-server CQRS cutover v3.39.5), the RFC is re-scoped to the result-DATA half: move result bytes off Postgres noetl.result_store into a derivable-URN-addressed Arrow Feather tier in object store, add the resolve-by-URN read path + cell endpoint registry, place the durability barrier at side-effecting tool boundaries. Decided/proposed/open separated; phased plan A–F; #104 is on the critical path for #107 steps 4–5. Review requested on #104. Prior: off-server-drive × gate reconciliation PROVEN (2026-06-19): gate-ON (PUBLISH_ONLY=true) with the off-server drive (PLUGIN_DRIVE=true) + materializer sole writer is now green on kind (the combo #103 left unproven): fresh exec + cursor fan-out → COMPLETED, server wrote 0 noetl.event rows (all PUBLISHED), materializer materialized all exactly once (rows==distinct ids, 0 dup), read-your-writes held via the relocated trigger. Server #238 → v3.29.2 rebuilds WorkflowState from the durable log on cold-cache apply (crash-recovery, kind-proven cold_rebuild); committed e2e rig kind_validate_orchestrate_gate.sh (e2e#61). This unblocked the #103 flip (the 2 cancel/finalize sites are now also done, server v3.29.3 #240; #103 is FLIP-READY).
Prior: naming foundation noetl_tools::locator (tools#68/#70, v3.12.0) + worker URI stamp (worker#99); blueprint (docs#180). Umbrella: Event WAL Storage
#98 2026-06-14 2026-06-14 Grow the Rust regression baseline: migrate Python-era e2e fixtures Snowflake key-pair JWT validated — the last external-tool gap is closed. tools v3.9.2 (tools#62/#63/#64); create_sf_database + setup_sf_table COMPLETED via JWT on kind. Transfer step deferred to #99. Core green: 64 fixtures.
#97 2026-06-14 2026-06-14 Retire remaining Python-era deploy legacy (manifests, kind automation, helm chart) Open — Todo. Python manifests, kind redeploy automation that hardcodes Python deployment names, stale helm release rev 185.
#111 2026-06-18 2026-06-19 E2E: worker-driven orchestrate topology coverage + server-API-only gap tracking In progress — three committed kind rigs now: kind_validate_orchestrate_offserver.sh (e2e#59) asserts the off-server topology (gate-off), kind_validate_orchestrate_gate.sh (e2e#61) asserts it composes with the PUBLISH_ONLY gate, and kind_validate_cancel_finalize_gate.sh (e2e#62, 2026-06-19) dual-mode-asserts the ExecutionService cancel/finalize writers honour the gate (gate-off byte-identical INSERT; gate-on PUBLISHED, materializer sole writer, terminal state, 0 loss/dup) — the rig that closed the last #103 flip blocker. Off-server rig live-green (COMPLETED, __orchestrate__ in noetl.event = 0, dispatched=applied, shadow metric absent). Durable home for the server-API-only gap (server still sole-writer + rebuilds state — moves under #103/#104) + two operator decisions: (A) retire in-process drive fallback (gated on prod adopting a post-#108 image; prod still pre-#108), (B) reap accumulating __orchestrate__ PENDING delivery rows in noetl.command.
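
The chain-walk state build that recurs in the #115 and #104 rows (walk prev_event_id head→root, then replay in chain order) can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the Rust state builder: `Event`, `build_spine`, `IncompleteChain`, and the explicit `genesis_id` parameter are hypothetical names, and a dictionary keyed by event id stands in for the (execution_id, event_id) primary-key lookup.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Event:
    event_id: int
    prev_event_id: Optional[int]  # None models a NULL prev_event_id
    name: str


class IncompleteChain(Exception):
    """Raised when the walk cannot reach the genesis event."""


def build_spine(events: Dict[int, Event], head_id: int, genesis_id: int) -> List[Event]:
    spine: List[Event] = []
    seen = set()
    cursor: Optional[int] = head_id
    while cursor is not None:
        event = events.get(cursor)  # one keyed lookup per hop, never a scan
        if event is None or cursor in seen:
            raise IncompleteChain(f"missing or repeated event {cursor}")
        if event.prev_event_id is None and event.event_id != genesis_id:
            # A NULL prev on a non-genesis event is an orphaned second root.
            raise IncompleteChain(f"NULL prev on non-genesis event {cursor}")
        seen.add(cursor)
        spine.append(event)
        cursor = event.prev_event_id
    spine.reverse()  # the walk runs head→root; replay wants root→head
    return spine


# Chain order is issued → claimed → started, but the ids are inverted:
# the claim (id 3) was written before the start (id 2) in chain terms.
events = {
    1: Event(1, None, "command.issued"),
    3: Event(3, 1, "command.claimed"),
    2: Event(2, 3, "command.started"),
}
```

Walking from head 2 yields issued, claimed, started, whereas sorting by event_id would put started before claimed. That is the chain-order≠id-order inversion described under #117, and the NULL-prev check mirrors the Incomplete result mentioned in the closed items.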
[i43]: https://github.com/noetl/ai-meta/issues/43
[i49]: https://github.com/noetl/ai-meta/issues/49
[i77]: https://github.com/noetl/ai-meta/issues/77
[i78]: https://github.com/noetl/ai-meta/issues/78
[i79]: https://github.com/noetl/ai-meta/issues/79
[i80]: https://github.com/noetl/ai-meta/issues/80
[i54]: https://github.com/noetl/ai-meta/issues/54
[i61]: https://github.com/noetl/ai-meta/issues/61

Recently closed (last 7 days)

# Closed Title
#95 2026-06-22 postgres pg_value_to_json returned null for tz-naive timestamp (NaiveDateTime) + date/time/uuid/numeric/bytea. FIXED + shipped to prod. The Rust postgres tool probed i64/i32/f64/bool/String/json/DateTime<Utc> and fell through to Value::Null for everything else; a plain timestamp column hit the fall-through and serialized to null even though the value was present (the auth0_login expires_at: null repro that tripped the gateway's validation). Added arms for timestamp (ISO-8601, no offset), date/time, uuid, numeric/decimal (exact decimal string via lossless binary-wire decode), bytea (base64). tools#75 → noetl-tools v3.14.2 6d9b674; worker#126 pin bump → noetl-worker v5.40.5 da24952. Built prod noetl-worker-rust:v5.40.5 @sha256:45212dbe… (Cloud Build us-central1) + rolled by digest onto noetl-worker-rust + noetl-worker-system-pool (server stays v3.39.5, CPU 250m/2 kept): rolling restart clean, off-server cutover stayed healthy (materializer sole-writer lag 0, command lag 0, WAL rehydrated per #119). 409 tools tests + clippy + a kind-gated live-postgres before/after test.
#103 2026-06-22 Step 2 — CQRS event log — DONE. Off-server CQRS cutover is live + validated on prod (server v3.39.5 PUBLISH_ONLY=true+STATE_BUILDER=offserver / worker v5.40.x Rust in-process materializer src/materializer.rs as sole noetl.event writer, replacing the Python system/event_materializer playbook). Prod-scoped regression validator landed via e2e#74 (e2e@1deadf1) — 28/30 PASS against live prod (2 non-PASS = pg-cred prod-env diffs, not cutover bugs). Superseded Python ops PRs #189/#190/#192 closed in the same pass.
#102 2026-06-22 Step 1 orchestrator throughput — DONE. Batch noetl.event INSERTs shipped (server#198 v3.10.0 + server#199, v3.15.x); Part B's worker-side batched event emission was superseded by #103 (worker publishes granularity-preserving events to JetStream instead of HTTP-POST), now live on prod. No residual.
#101 2026-06-22 Orchestrator scaling — DONE. Acceptance met + kind-validated: incremental OrchStateCache, server-side hydrate_result_references, env-configurable worker inline budget NOETL_EVENT_RESULT_CONTEXT_MAX_BYTES. Remaining consume side (worker render-time ref resolution) was reassigned to #115 Phase 1 — no residual on this issue.
#127 2026-06-22 Batch PFT throughput plateau. CLOSED: the task_sequence per-sub-task context optimization shipped to prod, compounding with the CPU-limit bump. Two compounding levers landed against the same hot path: (1) the prod worker CPU limit raised 1→2 (applied live earlier this session; removed the 38–47% CPU throttle); (2) this change, the behavior-preserving per-sub-task context optimization. The task_sequence drain rebuilt the template context per sub-task (running_ctx.clone() + 2–4× to_template_context() deep-clones + per-block ExecutionContext clones + a fresh context_to_value() per templated field). noetl-tools v3.14.1 (tools#74, squash 9dd9aa6 → release c8656c1, published to crates.io): TemplateEngine::render_value builds the proxied minijinja context ONCE and threads it through the recursion (render_value_with/render_with; minijinja Value is Arc-backed → reuse is a refcount bump), and new build_context_with_overlay(&variables, overlay) builds straight from &variables + a small overlay, skipping the intermediate to_template_context() HashMap deep-clone + per-block ExecutionContext clones in the set/policy paths. Isolated micro-bench (CPU held constant): per-sub-task context cost 2988.9µs→1147.1µs (−61.6%, 2.6×); 407 lib tests + 2 new equivalence pins + clippy clean. noetl-worker v5.40.4 (worker#125, squash 1a10a73 → release 0afbf5c): Cargo.lock pin 3.14 → 3.14.1, deps-only, no worker source change, clippy clean. Built the prod image via Cloud Build (noetl-worker-rust:0afbf5c) and rolled it onto prod noetl-worker-rust + noetl-worker-system-pool (server stays v3.39.5/.6, worker CPU req 250m/limit 2 kept): rolling restart clean (pods Ready, 0 crashloop), off-server CQRS cutover stayed healthy (materializer sole-writer projected==acked, lag ~0, command lag 0, executions completing, system-pool WAL index rehydrated per #119). ai-meta repos/tools → c8656c1 + repos/worker → 0afbf5c.
The code-opt + CPU-bump compound: more headroom AND less work per slot on the batch hot path.
#123 2026-06-22 Loop with a non-iterable in: silently wedged (commands=0, RUNNING forever) instead of failing loudly. FIXED: the off-server drive now decodes the error envelope and emits a terminal playbook.failed. evaluate_loop already returns CoreError::Validation("… did not evaluate to an iterable") and the in-process drive already turned that into a terminal playbook.failed, but under the prod-default off-server drive (NOETL_ORCHESTRATE_PLUGIN_DRIVE=true) the system/orchestrate wasm plug-in returns a structured {"error":…} envelope instead of an OrchestrationResult, which apply_worker_orchestration couldn't decode → logged a WARN, recorded decode_error, returned Ok(0) → no terminal event → the run sat RUNNING forever (this observability gap produced the false-positive #122 during the #120 2×2 repro). Fix: server apply_worker_orchestration decodes the drive ERROR envelope (decode_orchestrate_error) and emits a terminal playbook.failed (metric noetl_orchestrate_drive_total{stage="drive_error"}, structured execution_id), matching the in-process drive; a transient decode miss stays on the benign re-drive path; orchestrate-core prefixes the offending step name onto the existing evaluate_loop error (prefix_loop_step_error). An empty iterable ([]/{}) still short-circuits to next; only a non-iterable errors. noetl-server v3.39.6 (server#258, squash 275b914 → release 7f109a9); 600 server + 135 orchestrate-core tests + clippy clean; kind-validated prod-exact (PLUGIN_DRIVE=true+PUBLISH_ONLY=true+STATE_BUILDER=offserver): absent workload.batch_slots → FAILED with loop step 'process': Loop expression '{{ workload.batch_slots }}' did not evaluate to an iterable (got null), a valid [1,2,3] loop still COMPLETED 3-way, #120/#124 unaffected, 0 pod restarts. Code-only fix (worker stays v5.40.3) that ships to prod on a future server rollout; PROD stays healthy on v3.39.5 + the live off-server cutover (NOT redeployed this session). #127 stays OPEN (perf follow-up, separate).
#121 2026-06-22 Off-server WAL-chain-incomplete loop on system/ executions — FULLY FIXED in two halves (v3.39.4 #256 + v3.39.5 #257). Second half (server#257, squash 54ac277 → release v3.39.5 c421273): #256 alone did not hold — a live-prod re-cutover on v3.39.4 still wedged system/scheduled_cleanup because the system-pool worker drives system execs off-server regardless of the server-side gate (STATE_BUILDER=offserver on the worker → INSERT-not-publish chain leaves a NULL-prev orphan). #257 gates both off-server-drive decision sites in trigger_orchestrator_inner on should_publish(catalog_id) (new pure is_system_path helper + unit test) so system/* execs fall through to server-built run_state; regular publishable execs keep the off-server path (preserves the #256 win). 612 tests + clippy; kind full-gate system/scheduled_cleanup BEFORE 9× WAL chain incomplete → AFTER COMPLETED 0 loops, regular test/simple_loop still COMPLETED off-server. Server-only (worker stays v5.40.3). First half (server#256, v3.39.4): two distinct defects. (1) The gate-off claim_command raw in-tx INSERT (and the handle_batch_events gate-off batch INSERT) bypassed the event_write::emit_events/ChainHeads chokepoint, so command.claimed got prev_event_id=NULL and the chain head was never advanced — the next event (command.started) linked back to command.issued, skipping the orphaned claim, so the off-server chain_walk_from hit a NULL-prev non-genesis head → build_spine_to Incomplete. Fired for every system/ playbook because should_publish is false for system executions even under global PUBLISH_ONLY=true. Fixed by stamping prev_event_id via link_batch on both gate-off INSERT paths (same advance-then-write ordering emit_events uses). (2) The actual loop — the stateless off-server drive builds state from the noetl_events WAL, but system-execution events INSERT to noetl.event and never enter the WAL → the worker's WAL build can never complete → __offserver_retry__ re-drive loop. 
Fixed by gating the off-server WAL drive on should_publish(catalog_id); system/ executions drive server-built (reads noetl.event). noetl-server v3.39.4 (server#256, squash 28b17cb → release 77aaa06); changes confined to src/handlers/events.rs + src/state.rs → server-only rebuild, no worker bump. 598 tests + clippy clean; kind-validated prod-exact (PUBLISH_ONLY=true + STATE_BUILDER=offserver): system/scheduled_cleanup went orphan + 17×+ WAL chain incomplete (wedged RUNNING 120s+) → linked chain, 0 loop lines, COMPLETED in 6s; non-system off-server unaffected. Live prod repro: the full off-server cutover on prod (server v3.39.3 + off-server gate) WEDGED: drive applied froze while dispatched_offserver_stateless/offserver_retry climbed in lockstep to ~389, system pool logged the WAL-chain-incomplete no-op for multiple execs (incl. prod's own 327130580493803520), all prev_event_id=NULL; the armed revert (PUBLISH_ONLY=false+STATE_BUILDER=server) recovered cleanly. This is what reopened #121 and motivated the #257 second half above. PROD GKE default (STATE_BUILDER=server) + all defaults untouched. #123/#127 stay OPEN (separate).
#125 2026-06-21 task_sequence runner implemented only do: fail, silently ignoring do: jump/break/retry — FIXED: noetl-tools v3.14.0 honours all control-flow variants. The Rust task_sequence tool parsed the do: directive but only matched "fail", so do: jump/break/retry were silently ignored — each slot ran once and returned, leaving batches in progress with no retry/loop/jump path (pft_flow_test produced 16 done / rest pending, patient loss). Fix in tools#73 (noetl-tools v3.14.0, squash 62d0948 → release 638c3c6): honour do: jump/to: (loop to named step), do: break (exit the sequence), do: retry (with attempts/backoff + infinite-jump guard). Worker adopts via worker#124 (v5.40.3). Together with #126, makes the full 10×1000 batch pft_flow_test pass end-to-end on kind (zero patient loss, invariants clean). PROD + defaults untouched.
#126 2026-06-21 Rust http tool nested parsed response body under data.body instead of data — FIXED: noetl-tools v3.13.1 restores the Python-era output.data.data contract. The Rust http tool exposed the parsed response body under data.body, but playbooks written against the Python-era contract read output.data.data (or equivalently output.data for the body). pft_flow_test's save_batch step called jsonb_to_recordset on a non-array value → error. Fix in tools#72 (noetl-tools v3.13.1, squash 86f0216 → release 8dd0e1f): expose the parsed body under data again; keep body as a back-compat alias. PROD + defaults untouched.
#124 2026-06-21 Distributed task_sequence forward set:/sibling bindings rendered empty (data not propagated between sub-tasks) — FIXED: defer rendering of sub-task templates whose vars aren't resolvable at command-build time. Inside a multi-tool (task_sequence) step, a later sub-task's templates that reference a value a prior sub-task produces at runtime — a forward set:, a policy-rule set:, or a sibling result — were rendered to empty at command-build time, before the worker's per-sub-task binding ran. orchestrate-core/src/commands.rs::render_pipeline_config preserved set/args/spec/command verbatim but rendered every other sub-task field (url/params/method/…) against the step-entry context under UndefinedBehavior::Chainable, so runtime-only refs ({{ iter.* }}, sibling labels) silently collapsed to empty — concretely pft_flow_test's batch path: claim_batch writes iter.data_type via policy set:, then fetch_batch's url: …/api/v1/pft/batch/{{ iter.data_type }} pre-rendered to …/batch/ → 404 → 0 rows. Fix: new TemplateRenderer::render_value_deferring_unresolved renders only templates whose variable paths all resolve in the build-time context; any template referencing an unresolved path is preserved verbatim so the worker re-renders it against the per-sub-task running context (a superset). noetl-server v3.39.3 (server#255, 365d3be); cargo test 134/134 + clippy clean; kind-verified (fetch_batch now hits the real per-type URLs). PROD + defaults untouched. #121/#123/#125/#126 stay OPEN (separate).
#120 2026-06-21 Reduce barrier deadlocked (commands=0) on open/asymmetric loop joins — FIXED: a runtime liveness filter in the orchestrate-core reduce barrier. A back-edge U → T whose forward return path from T is absent (T does NOT forward-reach U) was counted as a genuine fan-in parent of T, so dispatch of T deferred forever (U never runs on the taken path; pft_flow_test setup_facility_work stalled commands=0 for 4.5 min). The barrier now blocks on an upstream only if it is live on the current path (done/skipped, entered/in-flight, or reachable from an active not-yet-terminal step); build_incoming_arcs unchanged (an open back-edge is still a genuine static fan-in). Affects the in-server and off-server drives identically (shared orchestrate-core); closed loops + genuine reduces unaffected. noetl-server v3.39.2 (server#254, 28e8950); new unit test test_open_loop_back_edge_does_not_block_dispatch; 133/133 + clippy clean; kind-validated (post-fix 2×2 off-server/gate matrix all COMPLETE, fanout_reduce/pagination/loop green). PROD + defaults untouched.
#119 2026-06-20 Off-server WAL state-builder drain stranded executions after a worker restart — FIXED: rebuild the in-memory index from the retained noetl_events WAL on every boot. The authoritative drain used a durable noetl_state_builder consumer whose cursor persists across restarts while the in-memory WalEventIndex rebuilds empty → the cursor outran the fresh index → build_spine_to(expected_head) permanently Incomplete → off-server execs looped offserver_retry and never completed (this is what hid the #118 symptom). Fix (worker-only, inside NOETL_STATE_BUILDER=offserver; PROD runs the in-server drive so untouched): the drain defaults to an ephemeral DeliverPolicy::All consumer rebuilding the full index from the retained WAL on every boot (no persisted cursor to outrun; also correct for >1 worker pod); instant revert NOETL_STATE_BUILDER_DURABLE=1; proof = index rehydrated… log + new noetl_worker_state_builder_indexed_executions gauge; never reintroduces a noetl.event scan. noetl-worker v5.40.2 (worker#123, 48b0bde). Gate-ON kind: forced mid-flight pod delete --force → new pod index rehydrated … indexed_executions=17 wal_events=200; single-replica 6/6 stress + multi-replica 21 execs COMPLETE, zero scans.
#118 2026-06-20 Single-replica off-server terminal-finalize chain fork (a duplicate playbook.completed orphaned the chain as a NULL-prev_event_id 2nd root) — FIXED: a bounded FinalizedGuard (exactly-one-terminal-per-execution) suppresses the duplicate at emit_events before the chain linker. A straggler drive on a materializer-lagged single replica re-drove off the lagged WAL (state not terminal yet) and emitted a 2nd terminal that linked to the now-evicted head → 2 roots + a benign event-scan. The guard makes the first terminal win; gate-off byte-identical (a duplicate never occurs on the synchronous in-process drive); metric noetl_terminal_dedup_total{suppressed}; rig gains a HARD terminals==1 assertion. Absent under multi-replica execution-affinity (#116 serializes finalize to the owner). noetl-server v3.39.1 (server#253, c5f8cb2) + e2e (e2e#73, fe97d92). Gate-ON kind (unblocked by #119): single-replica 6/6 stress iterations / ~126 execs every chain roots=1(incl. terminal)/terminals=1/zero-scan; multi-replica 21 execs clean.
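The exactly-one-terminal-per-execution guard from #118 can be sketched as follows. This is a minimal Python illustration, not the Rust FinalizedGuard itself: the class layout, the `capacity` parameter, the `admit` method, and the oldest-first eviction policy are all assumptions made for the example.

```python
from collections import OrderedDict


class FinalizedGuard:
    """Bounded record of executions that already emitted a terminal event.

    Hypothetical sketch: the names and the eviction policy are illustrative.
    """

    def __init__(self, capacity=1024):
        self._seen = OrderedDict()
        self._capacity = capacity
        # Stands in for a counter such as noetl_terminal_dedup_total{suppressed}.
        self.suppressed = 0

    def admit(self, execution_id):
        """Return True for the first terminal of an execution, False for a duplicate."""
        if execution_id in self._seen:
            self._seen.move_to_end(execution_id)
            self.suppressed += 1
            return False
        self._seen[execution_id] = None
        if len(self._seen) > self._capacity:
            # Evict the oldest entry so the guard stays bounded.
            self._seen.popitem(last=False)
        return True


guard = FinalizedGuard(capacity=2)
first = guard.admit("exec-1")      # first terminal wins
duplicate = guard.admit("exec-1")  # straggler terminal is suppressed
```

Placing the check before the chain linker means a suppressed duplicate never gets a chance to link against an evicted head.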
#117 2026-06-20 Off-server from_events spine ordered by event_id broke fan-in under a chain-order≠id-order inversion (high-concurrency fan-out reduce wedge) — FIXED: order the spine by the prev_event_id chain walk + walk from the real tip (expected_head). worker v5.40.1 (worker#122, baeae78) + e2e (#72, cdf1768). 2-replica affinity gate-ON stress 6/6 iterations / 108 execs COMPLETE; 15 real id-inversions all fired the fan-in reduce. The residual single-replica terminal-finalize chain-linking race is now FIXED as #118.
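The difference between id order and chain order in #117 can be illustrated with a small sketch. The event dictionaries and the `spine_by_chain` helper are hypothetical; only the idea (walk `prev_event_id` links back from the real tip, then reverse) comes from the entry above.

```python
def spine_by_chain(events, expected_head):
    """Order events by walking prev_event_id links back from the tip.

    Returns None when a link is missing or a cycle is found, which models
    an incomplete spine. Hypothetical helper for illustration only.
    """
    by_id = {event["event_id"]: event for event in events}
    spine = []
    visited = set()
    cursor = expected_head
    while cursor is not None:
        if cursor in visited or cursor not in by_id:
            return None
        visited.add(cursor)
        event = by_id[cursor]
        spine.append(event)
        cursor = event["prev_event_id"]
    spine.reverse()
    return spine


# Event 11 was linked after event 12: chain order differs from id order.
events = [
    {"event_id": 10, "prev_event_id": None},
    {"event_id": 12, "prev_event_id": 10},
    {"event_id": 11, "prev_event_id": 12},
]
chain_order = [e["event_id"] for e in spine_by_chain(events, expected_head=11)]
id_order = sorted(e["event_id"] for e in events)
```

Sorting by `event_id` would place event 11 before event 12 and hide the true tip, which is the inversion that wedged the fan-in.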
#113 2026-06-19 Worker-driven orchestrate drive stalled when the drive result / accumulated context crossed the inline budget — fixed at the source by #115 Phase 1 (references-in-state consume side). All 9 of 9 stalled core fixtures now reach playbook.completed gate-ON (PUBLISH_ONLY + off-server drive + materializer sole-writer), server v3.30.0 + worker v5.36.0: output_select / storage_tiers / lease_expiry / pipeline_heavy_payload / save_edge_cases / large_result_extraction / http_to_postgres_{direct,simple,bulk_python}. Max next-command context across the 9 = 412KB (save_edge_cases); 0 __orchestrate__ event rows on every run; materializer lag = 0. The #113 decode fix (server#241 v3.29.4) + #114 offload cap (server#242 v3.29.5) stay as safety nets; the growth that pushed past budget is gone (worker selective resolve keeps foreign bulk out of the render; refs_in_state-true keeps the drive state + command.issued bounded). Landed via worker#117 + server#243.
#114 2026-06-19 Off-server drive oversized command.issued (full upstream context embedded) — resolved by #115 Phase 1 (refs_in_state default true). With the worker selective consume side in place, refs_in_state defaults true so the next command's render_context carries {reference, extracted} (small) instead of the full upstream payload — command.issued no longer approaches NATS max_payload. The #114 offload safety cap (maybe_offload_command_context, 512KB) stays as belt-and-suspenders. Verified gate-ON: all 4 of this issue's fixtures COMPLETE; max next-command context 412KB across the full 9. Landed via server#243 + worker#117.
#112 2026-06-18 Worker /dev/shm SIGBUS: k8s default 64 MiB tmpfs vs the 256 MiB Arrow IPC cache budget. Every worker process (Rust noetl-worker + legacy Python worker) allocates an Arrow IPC shared-memory cache at init (NOETL_IPC_CACHE_BUDGET_BYTES, default 256 MiB) backed by POSIX shm on /dev/shm; the k8s container-runtime default /dev/shm is a 64 MiB tmpfs, so under shm-heavy load the cache writes past 64 MiB, the store page-faults against the full tmpfs, and the worker dies with SIGBUS (exit 135) and crash-loops. Surfaced during #103 CQRS kind validation on the system pool; a transient live fix was reverted, leaving the bug latent in the committed manifests. Fix (ops#193) gives every worker deployment a memory-backed /dev/shm (emptyDir medium: Memory, sizeLimit: 320Mi > budget), pins NOETL_IPC_CACHE_BUDGET_BYTES=268435456 next to the sizeLimit so the two can't drift, and raises the memory limit to 768Mi (tmpfs is charged to the pod cgroup). Applied to all 7 worker manifests (system / shared-rust / subscription / subscription-runtime / Python-cpu + 2 prod variants). Kind-validated on the system pool: reproduced SIGBUS (exit 135) on the 64 MiB tmpfs → after fix /dev/shm is 320 MiB and a full 256 MiB write completes (exit 0, peak 256M/320M) with the pod healthy (restarts=0, no OOM); cluster restored to baseline. ai-meta → ops f4df4c1 + wiki worker deployment-specification.
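The sizing arithmetic behind #112 can be checked directly. The snippet below is only a worked calculation over the figures quoted in the entry; the variable names are illustrative.

```python
MIB = 1024 * 1024

ipc_cache_budget_bytes = 268_435_456  # pinned NOETL_IPC_CACHE_BUDGET_BYTES
default_shm_bytes = 64 * MIB          # container-runtime default /dev/shm tmpfs
size_limit_bytes = 320 * MIB          # emptyDir sizeLimit: 320Mi

budget_mib = ipc_cache_budget_bytes // MIB
fits_default = ipc_cache_budget_bytes <= default_shm_bytes
fits_fixed = ipc_cache_budget_bytes <= size_limit_bytes
headroom_mib = (size_limit_bytes - ipc_cache_budget_bytes) // MIB
```

The pinned budget is exactly 256 MiB, so it cannot fit in the 64 MiB default and leaves 64 MiB of headroom under the 320Mi limit.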
#110 2026-06-18 Retired the in-server orchestrate shadow + the wasmtime server dependency. The separable server-slimming follow-up to #108 (closed): the in-server shadow was the slice-4 cutover-confidence harness (ran the system/orchestrate plug-in inside the server via an embedded wasmtime host, diffed 529/0 against the in-process drive). With the worker-driven drive default-on + proven, the live drive uses the worker's wasmtime host — never the server's — so the shadow was dead weight. server#234 removed orchestrate_shadow.rs, the orchestrate-shadow cargo feature + the optional wasmtime dep (the cranelift/wasmtime tree — ~1000 Cargo.lock lines — fell out; cargo tree -i wasmtime now matches nothing), the trigger_orchestrator_inner shadow hook, the main.rs boot loader, the NOETL_ORCHESTRATE_PLUGIN_SHADOW config field, and the noetl_orchestrate_shadow_total metric. Kept run_state + NOETL_ORCHESTRATE_PLUGIN_DRIVE (default true). refactor: = no version bump (stays v3.28.0). Kind smoke on a 4-page cursor loop, both drive modes: COMPLETED with __orchestrate__ rows in noetl.event = 0 (worker-driven: 10 drive cmds on the system pool, dispatched=applied=10, event_suppressed=30); shadow metric gone from /metrics. ai-meta → server f3043c9 + wiki deployment-specification. See Umbrella-System-Pool-Design.
#108 2026-06-18 🎯 The orchestrator drive runs OFF-SERVER on the system pool, and it's now the DEFAULT. Step 2 of the dissolution (#107) — the server's brain moved onto the worker pool as the system/orchestrate WASM plug-in. Arc: slices 1–3 (off-server drive, server#229) → 4a (cursor+fan-out validated) → 4b + follow-up a (the __orchestrate__ meta-command touches noetl.event zero times, server#230/#231) → (b) system-pool isolation (server#232 + worker#114 + ops#191) → (c) the deliberate default-flip (server#233, v3.28.0): NOETL_ORCHESTRATE_PLUGIN_DRIVE defaults true. Gated on a scale soak (kind): a 694-drive cursor+fan-out run COMPLETED with __orchestrate__ rows in noetl.event = 0 and all 694 drives claimed on the system pool (shared pool got only the real steps); the default-on path (no env var) reproduced the identical shape (361 drives, isolated, 0 burst); 15/15 regression fixtures green; revert (=false → in-process drive) verified. ai-meta → server 80cc0e6 + worker 437b0be. See Umbrella-System-Pool-Design.
#109 2026-06-17 Event-ABI round — orchestrator + evaluate moved into the wasm core (#108 final slices). Slice 3 (server#223) relocated orchestrator/evaluate to noetl-orchestrate-core; the whole drive (renderer, playbook, commands, evaluator, state, orchestrator switch) now compiles native + wasm32-unknown-unknown. evaluate reads the pure core::event::Event; db::Event converts at the trigger_orchestrator boundary. 122 core + 565 server tests green, 0 WASI imports, kind PFT 10×1000 clean (0 errors, 0 restarts). Plug-in round (data-plane ABI + command_emit + scheduler + shadow-diff) tracked under #108.
#106 2026-06-17 EventEnvelope rejects timezone-less timestamps — blocked the CQRS materializer. The tailer publishes to_jsonb(noetl.event row) and created_at is timestamp WITHOUT time zone, so the timestamp had no offset; EventEnvelope.timestamp: Option<DateTime<Utc>> rejected it with the misleading premature end of input. Fixed (server#217) with a flexible deserialize_with (RFC3339 + tz-less→UTC). Validated live: materializer projects {projected:0, duplicates:20} = byte-identical. The #103 shadow gate is green.
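The flexible parse described in #106 (accept RFC 3339, treat a timezone-less value as UTC) looks roughly like this in Python. The real fix is a serde `deserialize_with` in the Rust server; `parse_event_timestamp` is a hypothetical analogue written for illustration.

```python
from datetime import datetime, timezone


def parse_event_timestamp(raw):
    """Parse an RFC 3339 style timestamp; a missing offset is read as UTC."""
    text = raw.strip()
    if text.endswith(("Z", "z")):
        # Older Python versions reject a trailing 'Z' in fromisoformat.
        text = text[:-1] + "+00:00"
    parsed = datetime.fromisoformat(text)
    if parsed.tzinfo is None:
        # Timezone-less input, e.g. a `timestamp WITHOUT time zone` column.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


naive = parse_event_timestamp("2026-06-17T12:00:00.123456")
aware = parse_event_timestamp("2026-06-17T14:00:00+02:00")
zulu = parse_event_timestamp("2026-06-17T12:00:00Z")
```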
#105 2026-06-17 Plug-in compilation & hot-reload: WASM-compiled system playbooks + managed library. The WASM plug-in runtime is complete + fully live-proven on kind: dispatch (tool_kind: "wasm" → wasmtime host → digest from registry → run), capability flush (object_put → noetl.object_store), real data flow (input → args, worker#110), and hot-replace (worker#112): republishing the same path@version swaps the running pool's behavior with 0 restarts. Landed across worker#93/#95/#97/#99/#101/#103/#105/#107/#108/#110/#112 + server#210/#212/#214 + tools#68 + docs#181. Remaining (executor: author sugar; compiled materializer port) deferred to #104/#103. See Umbrella-WASM-Plugin-Compilation.
#100 2026-06-15 Cursor/claim loop mode: loop.spec.mode: cursor in the Rust orchestrator. noetl-server v3.8.0 (server#196) cursor loop engine + output namespace; noetl-tools v3.10.1 (tools#66) postgres -- comment splitter fix; worker#88 dep bump. test_pft_flow_v2 all_passed:true, 5/5 per data type on kind against the throttling/error-injecting paginated-api. See Umbrella-Cursor-Loop-Mode.
#99 2026-06-14 Transfer tool: Snowflake↔Postgres both directions, with credential-alias resolution. Both transfer arms implemented in noetl-tools v3.10.0 (tools#65); worker v5.22.0 (worker#87) pre-resolves source.auth/target.auth aliases; SF→PG coerces string cells via $n::text::<udt> + reformats Snowflake-epoch timestamps to RFC3339; PG→SF generates SQL-escaped INSERTs. e2e fixture migrated (e2e#58). Kind-validated bidirectionally against the live sf_test account: every step COMPLETED, real types preserved.
#90 2026-06-12 Subscription / listener tool (RFC): all 7 phases shipped + live-proven. Bounded-drain tool (Mode A) → kind: Subscription continuous runtime (Mode B) + header-directive engine → gateway push-ingress (Mode C) + auth-gated directive trust → store-and-forward spool + circuit breaker → out-of-cluster Cloud Run + gcs spool → CLI local noetl subscribe → Phase 7 scale hardening. Final phase: server v3.5.0 (server#189) POST /api/execute/batch (N→N, partial-failure contained) + opt-in exactly-once dedup window (noetl.subscription_dedup, bounded-by-age, race-safe, default off); worker v5.19.0 (worker#79) batch dispatch + dedup opt-in + per-subscription rate limits (token-bucket fetch-side backpressure → source keeps backlog, no loss); ops (ops#176) + e2e (e2e#48); no tools change. Live on kind: batch 12→12 COMPLETED on the subscription pool + per-message traceparent; dedup duplicate→1 execution + subscription.message.deduplicated; rate-limit engaged + 10/10 → executions (no loss). Refinement follow-ups tracked: #91→#94 + tools#57. ai-meta → server 7b217d8 + worker 7531f4a + ops 6db69b9 + e2e 203593b.
#89 2026-06-11 JSON null re-injected via {{ step }} serialized as JS undefined (fixed in the server template renderer). The cursor pagination fixture's terminal page returns next_cursor: null; re-injecting the whole {{ fetch_page }} envelope into the next step's input rendered that field as the bare token undefined, producing invalid JSON. render_to_value then failed serde_json::from_str and returned the entire envelope as a raw string, so the consuming Python step got response as a str and crashed ('str' object has no attribute 'get'). Root cause was the server, not the worker as the issue hypothesized: src/template/jinja.rs::json_value_to_minijinja maps JSON null → Value::UNDEFINED and minijinja's map repr emits undefined; the server's renderer was a divergent copy missing the | tojson retry the noetl-tools engine already had. Fix (server#177, v3.0.6) adds that retry: a lone {{ expr }} rendering container-shaped-but-invalid JSON re-renders with | tojson; minijinja_to_json maps undefined/none → JSON null. 5 new regression tests; 619 lib + 8 parity green. Kind-validated: cursor walks all 4 pages, terminal next_cursor: null handled, validate_results collects 35 (first_id=1, last_id=35, success), matching the offset fixture. ai-meta → server 8e17fbe.
#88 2026-06-10 e2e offset/cursor pagination fixtures now read response.body.*, not response.data.*: the Rust http tool nests the parsed JSON payload under body ({{ fetch_page }} → {body, headers, status_code}), so the fixtures' response.get('data', {}) resolved to {}, has_more/next_cursor defaulted falsy, and the loop exited after page 1 even with the post-#85 loop machinery correct. Fixed both check_pagination steps to response.get('body', {}) via e2e#40. Kind-validated against the live paginated-api test-server (Rust server/worker :dev): offset walks 0→10→20→30, has_more T/T/T/F, users 10/10/10/5, validate_results success 35 (first_id=1, last_id=35), playbook.completed COMPLETED. cursor path-fix correct + walks all 4 pages (Mg== → Mw== → NA== → null, 35 events fetched) but final collection blocked by a distinct bug (then suspected in the worker; #89 later traced it to the server) → filed #89 (terminal next_cursor: null serialized as JS undefined). Other pagination fixtures (retry/max_iterations/pipeline*/loop_with_pagination) share the same envelope-key assumption over /api/v1/assessments|flaky ({data, paging}); flagged, left for follow-up. ai-meta → e2e 72a7525.
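The envelope-key mismatch in #88 is easy to reproduce in isolation. The sketch below assumes a payload shaped like the fixture's (users, has_more, next_cursor under body); `read_page` and the sample envelope are hypothetical stand-ins for the check_pagination step.

```python
def read_page(response):
    """Read pagination fields from the http tool envelope (payload under 'body')."""
    payload = response.get("body", {})
    return {
        "users": payload.get("users", []),
        "has_more": bool(payload.get("has_more", False)),
        "next_cursor": payload.get("next_cursor"),
    }


# Hypothetical envelope in the {body, headers, status_code} shape.
envelope = {
    "status_code": 200,
    "headers": {},
    "body": {"users": [{"id": 1}], "has_more": True, "next_cursor": "Mg=="},
}

page = read_page(envelope)
# The pre-fix lookup: 'data' is absent, so every field falls back to its default.
old_payload = envelope.get("data", {})
old_has_more = bool(old_payload.get("has_more", False))
```

With the old key the loop sees `has_more` as false on the first page and exits, which matches the "exited after page 1" symptom.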
#85 2026-06-10 Workflow-arc loops now advance across iterations + terminate cleanly — built on the dispatch-guard re-entry layer, two coupled orchestrator fixes via server#176 (v3.0.5). (1) Durable event-sourced loop-ctx propagation: step-level set: ctx.* loop variables were recomputed per pass and reverted to the workload default (loop thrashed 0,0,1,0,1,2,…); root cause was start's initializer set re-firing every pass in random HashMap order against check_pagination's advancing set. Fix persists each completion's rendered set: as a ctx.updated event (latest-wins fold + build_context overlay), emitted once per completion keyed by the stable completion event_id (not Utc::now()-fallback completed_at). (2) Loop-exit hang: the exit branch was marked step.skipped on a loop-body-completion pass (recency-based branch-point detector missed it), turning it terminal so the exit dispatch was suppressed; fixed with a structural loop-branch-point test. 614 lib tests (6 new; 2 verified to fail without their guard); clippy-clean. Kind-validated: counter loop advances 0→1→2→3 + terminates; real-http offset pagination advances 4 pages collecting 35. (Separate finding, filed follow-up: the e2e offset/cursor fixtures read response.data.users but the Rust http tool nests it at response.body.users.) ai-meta → server e519fdc.
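The latest-wins fold over ctx.updated events from #85 can be sketched as below. The event field names (`name`, `completion_event_id`, `set`) and the `fold_ctx` helper are assumptions for illustration; the entry only states that each completion contributes one update keyed by its stable completion event_id and that later updates override earlier ones.

```python
def fold_ctx(workload_defaults, events):
    """Fold ctx.updated events in order; the latest value for each key wins."""
    ctx = dict(workload_defaults)
    applied = set()
    for event in events:
        if event["name"] != "ctx.updated":
            continue
        key = event["completion_event_id"]
        if key in applied:
            # A completion contributes at most once, even if replayed.
            continue
        applied.add(key)
        ctx.update(event["set"])
    return ctx


events = [
    {"name": "ctx.updated", "completion_event_id": 101, "set": {"offset": 10}},
    {"name": "step.done", "completion_event_id": 102, "set": {}},
    {"name": "ctx.updated", "completion_event_id": 103, "set": {"offset": 20}},
    {"name": "ctx.updated", "completion_event_id": 101, "set": {"offset": 10}},
]
ctx = fold_ctx({"offset": 0, "limit": 10}, events)
```

Because the fold reads persisted events instead of recomputing every `set:` block per pass, the initializer on the start step cannot reset the counter.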
#87 2026-06-10 Multi-tool step: a later sub-tool can now reference an earlier sibling's output. In a tool: [list] step, each sub-tool's result was stored for the aggregated output but never injected into the running context, so {{ <label>.<field> }} rendered empty (masked in quoted positions; a syntax error at or near "," in unquoted numeric SQL positions, e.g. save_edge_cases test_large_payload). Fixed via tools#48 (v3.1.1): inject each sub-tool's result under its label (with a synthetic .data self-ref) so later siblings resolve it. Worker adopts 3.1.1 via worker#69. Kind-validated: save_edge_cases test_large_payload → record_count = 100 (no syntax error), save_delegation_test clean. ai-meta → tools 76f942a + worker b97f642.
#83 2026-06-10 Orchestrator fan-in barrier deadlocked workflow loops: build_incoming_arcs counted a loop back-edge (check_pagination → fetch_page) as an upstream so the barrier deferred the loop head forever. Fixed via server#175 (v3.0.4): exclude back-edges via a new forward_reachable helper (genuine fan-in unaffected). Kind-validated; fanout_reduce green. ai-meta → server 480ba72.
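The back-edge exclusion from #83 can be sketched with a plain reachability check. This Python version is an illustration under assumptions: the adjacency-dict graph representation and the `fan_in_parents` helper are hypothetical, and only the name forward_reachable comes from the entry.

```python
def forward_reachable(arcs, start, target):
    """True if target can be reached from start by following arcs forward."""
    stack = [start]
    seen = set()
    while stack:
        node = stack.pop()
        for nxt in arcs.get(node, []):
            if nxt == target:
                return True
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


def fan_in_parents(arcs, step):
    """Upstreams of step, excluding back-edges (parents that step itself reaches)."""
    parents = [src for src, dsts in arcs.items() if step in dsts]
    return [p for p in parents if not forward_reachable(arcs, step, p)]


loop = {
    "start": ["fetch_page"],
    "fetch_page": ["check_pagination"],
    "check_pagination": ["fetch_page", "validate_results"],
}
diamond = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}

loop_head_parents = fan_in_parents(loop, "fetch_page")
reduce_parents = fan_in_parents(diamond, "d")
```

The loop head waits only on `start`, while the diamond's reduce step still waits on both branches, so a genuine fan-in is unaffected.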
#84 2026-06-10 Orchestrator never populated event.name → loop.done arc gates always skipped: when: {{ event.name == "loop.done" }} (10+ fixtures) never matched, so in-step loop: steps hung after completion. Fixed via server#175 (v3.0.4): inject event.name = "loop.done" into a completed loop step's next-arc context. Kind-validated (test_pagination_basic completes). ai-meta → server 480ba72.
#86 2026-06-10 e2e fixtures: duckdb tool field is command/query, not commands. The plural commands: key failed pre-dispatch (missing field 'query' / malformed tool config). Renamed across 4 storage/gcs fixtures via e2e#39; save_all_storage_types green. ai-meta → e2e b0a5c85.
#78 2026-06-10 noetl-worker: pre-dispatch errors now emit terminal call.error instead of hanging — credential-alias resolution + tool-config deserialization failures used to ?-propagate out of execute_with_server_url; the dispatch loop only logged them, so the execution sat at command.started forever. Fixed via worker#68 (v5.15.1): typed CredentialResolutionError (terminal AliasNotFound/Invalid vs retryable Transient) + CredentialHttpError carrying the HTTP status so classify_fetch_error decides retryability by code (terminal: 404/400/401/403/500; retryable: 408/429/502/503/504 + transport), and handle_predispatch_failure emits call.error + command.failed. Diagnosis correction: the live pg_noetl_k8s repro is an HTTP 500 "Decryption failed: aead::Error", not a 404 (the worker has no /api/keychain/ call). Kind-val GREEN: test/postgres now → call.error/command.failed/playbook.failed (no hang); hello_world still completes. ai-meta → worker 99e2c66.
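The status-code split described in #78 can be written down as a small lookup. This is a Python sketch, not the worker's Rust classify_fetch_error: the function name `classify_fetch_status`, the use of `None` for a transport failure, and the treatment of unlisted codes as terminal are assumptions made for the example.

```python
TERMINAL_STATUSES = {400, 401, 403, 404, 500}
RETRYABLE_STATUSES = {408, 429, 502, 503, 504}


def classify_fetch_status(status):
    """Classify a credential-fetch failure as 'terminal' or 'retryable'.

    status is an HTTP status code, or None for a transport-level failure.
    """
    if status is None or status in RETRYABLE_STATUSES:
        return "retryable"
    if status in TERMINAL_STATUSES:
        return "terminal"
    # Assumption: codes the entry does not list are treated as terminal,
    # so an unknown failure surfaces as call.error instead of hanging.
    return "terminal"


decrypt_failure = classify_fetch_status(500)  # the pg_noetl_k8s repro
throttled = classify_fetch_status(429)
transport = classify_fetch_status(None)
```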
#80 2026-06-10 container_callback chain green end to end: fixing the watcher's missing curl (retired bitnami/kubectl → alpine/k8s:1.30.3, ops#168) surfaced two more layered bugs. Server: container-callback call.done insert targeted a non-existent attempt column → HTTP 500; fixed to the deployed noetl.event schema (server#173, v3.0.3). OOM path: the watcher only read Job-level conditions so failed_oom could never fire; added pod-level OOMKilled → failed_oom / ImagePullBackOff → failed_image_pull classification + RFC3339 completed_at (bare now was returning HTTP 422) (ops#168); and the e2e fixture's calloc-lazy bytes(40MiB) never dirtied pages so it never OOM'd (e2e#38). kind_validate_container_callback.sh both probes GREEN: happy_path → succeeded, oom → failed_oom. ai-meta → ops cacc513 + server 5d2cf58 + e2e 6aaf06e.
#79 2026-06-10 e2e kind-val runner scripts updated to current noetl CLI surface — both scripts/kind_validate_*.sh aborted on unrecognized subcommand 'playbook'; the validation logic + event taxonomy were intact, only the invocation layer drifted. Fixed via e2e#37: register playbook / exec <catalog-path> --runtime distributed --json / status --json / event log via noetl query (.result, order by event_id); fail-fast CLI-surface guard. fanout_reduce PASS start-to-finish on kind (server-rust v3.0.1); container_callback drives cleanly and stops at the watcher curl gap (#80). ai-meta → e2e a3594b3; wiki Kind-Val Runners.
#82 2026-06-10 GUI: credential View/Edit recovered for pre-wallet (legacy-encrypted) records — Secrets Wallet (#61) made credential storage forward-only, so pre-wallet records 500 on decrypt and the GUI View/Edit flow dead-ended on a generic toast. Fixed via gui#36: View explains the cause + points to Edit; Edit reopens with the list-row metadata + a warning banner so re-saving re-seals the record under the current wallet. ai-meta → gui 8cacc9e (v1.11.1).
#81 2026-06-10 Container tool unusable: ToolSpec.command (String) vs ContainerConfig.command (Vec) type contradiction; fix landed via server#172 (v3.0.2). ToolSpec.command changed from Option<String> → Option<serde_json::Value> so the container tool's array command decodes server-side + passes through to the worker's Vec<String> (scalar stays a JSON string for shell/db tools); ToolCall::from_spec forwards verbatim. 2 regression tests; clippy clean. Kind-val GREEN: server accepts command: ["/bin/sh","-c"], worker creates the K8s Job, Job reaches Complete 1/1. Chain counter-bump validation stays gated on #79/#43.
#77 2026-06-09 Explicit input:/set: forward-only data binding — BREAKING v3.0.0 across noetl-tools + noetl-server. All 5 PRs merged: tools#45 (v3.0.0) + server#169 (v3.0.0) + e2e#35 (13 fixtures migrated) + cli#57 (v4.10.0, executor 0.5.0) + worker#66 (dep bump). Kind-val GREEN.
#76 2026-06-08 Sequential-mode iterator dispatch — LoopMode enum (Sequential default / Parallel), StepInfo.iterations_dispatched guard. Landed via noetl/server#166 (v2.62.0). First Claude-direct Rust PR under agents/rules/handoff-routing.md. Kind-val GREEN: test/loop COMPLETED 5/5 + iterator_save_test COMPLETED 4 steps.
#70 2026-06-08 noetl-server missing PUT /api/result/<execution_id> endpoint — landed via noetl/server#160 (v2.58.0). Durable result-store endpoints (PUT + GET /api/result/<eid>/<step>/<ref>) ported to the Rust server. Kind-val GREEN: output_select_test reached playbook.completed with test_result: "PASSED".
#69 2026-06-08 noetl-worker: over-budget call.done returned reference-only envelope, missing inline _ref for downstream {{ step._ref }} templates. Landed via noetl/worker#63 (v5.15.0): build_call_done_result's durable-success branch now embeds context: { data: { _ref: <noetl://...> } } alongside the existing reference block. Kind re-val pending noetl/ai-meta#70 (server-side PUT /api/result/<eid> endpoint missing — falls back to degraded shm-only path where there's no noetl:// URI to embed).
#68 2026-06-08 noetl-tools: ArtifactTool config required input but server pipeline emitted args (the post-#56 normalized name) — landed via noetl/tools#40 (v2.23.1) + worker dep bump via noetl/worker#62. One-line #[serde(alias = "args")] on ArtifactConfig.input accepts both shapes. Re-val surfaced a downstream _ref/output_select gap → filed as noetl/ai-meta#69.
#67 2026-06-08 Rust orchestrator hangs on mode: exclusive routing — untaken sibling never emits step.skipped, R4 fan-in barrier deadlocks. Landed via noetl/server#159 (v2.57.2). Three-part fix: (1) evaluator::evaluate_next_transitions surfaces unmatched siblings as not_matched_with_target; (2) orchestrator::process_in_progress two-pass emit-skipped-then-dispatch (HashMap-order-independent); (3) +2 unit tests (lib 568/0/0). Kind-val GREEN — comprehensive_test.yaml reached playbook.completed in ~4s (was hanging forever pre-fix).
#66 2026-06-07 Rust orchestrator: cross-step {{ step.data }} template resolves to None — landed via noetl/server#158 (v2.57.1). WorkflowState::build_context injects a self-referencing .data key on extracted user_data, guarded so the task_sequence flatten path's existing .data stays intact. 2 new unit tests + 566/0/0 lib. Surfaced by #65 kind-val; concrete repro kind execution 322087210360770560.
#65 2026-06-07 noetl-tools: external python script loaders (file/gcs/http source types) + legacy main() function convention — landed via tools#38 (v2.22.0) + tools#39 (v2.23.0); kind-val GREEN on the live worker (execution 322087210360770560 reached playbook.completed in ~6s; loaded main(name, count) returned the expected payload). noetl/worker#61 OPEN+mergeable to bump the worker pin. Surfaced finding: noetl/ai-meta#66.
#43 2026-06-07 R-3 Phase C-2: container tool kind — design callback pattern for K8s Job dispatch. All four Rust rounds shipped (1 k8s-watcher ops@8892043 / 2 callback endpoint server v2.48.0 / 3 Tool::Container tools v2.21.0 / 5 kind-val rig e2e@17de21d). Round 4 (Python parity) parked per Rust-only direction. Worker-side pending_callback adoption is a coordinated follow-up.
#64 2026-06-07 noetl-tools: artifact tool kind missing from Rust registry — landed via tools#35 v2.20.0 (thin ArtifactTool adapter translating the Python-era YAML shape into a ResultFetchTool call)
#61 2026-06-07 Secrets Wallet (Rust): envelope encryption + KMS/secret-provider plugins + sealed worker delivery + distributed multi-region resolution — all named phases + 6d.X cloud dynamic providers shipped; umbrella feature-complete
#54 2026-06-06 Phase F R5 — regression + e2e validation under sharded topology — closed at the umbrella level (Tier 1 + Tier 3 + Tier 4 e2e all GREEN on the Rust-only stack; subsequent regression findings filed as their own ai-task issues — #62/#63/#64/#65/#66 all now also closed)
#62 2026-06-05 noetl-server: /api/executions list query candidate-first rewrite + status-drift fix (server#99 v2.28.1) — 6.5 s → 0.015 s (~430×)
#63 2026-06-05 noetl-tools: python tool accepts nested script.source.code (inline) — tools#33 v2.19.3 + worker#54 adoption, test_script_loading kind-validated; external loaders split to #65
#60 2026-06-04 Rust orchestrator template context doesn't expose step data for next.arcs / step.when
#59 2026-06-04 Rust orchestrator doesn't resolve tool.kind:workbook references to inline actions
#58 2026-06-04 Rust orchestrator doesn't emit playbook.failed on command.failed — executions stall
#57 2026-06-04 Rust server rejects flat (name-as-field) pipeline shape in v10 playbook YAML
#56 2026-06-04 Canonical v10 playbook workload + input alias unrecognized on Rust stack
#55 2026-06-04 Rust server EventEmitRequest.execution_id wire-type drift blocks worker traffic
#53 2026-06-04 Rust worker → Rust server e2e compatibility (Phase D R3b/R3c terminal completion gap)
#52 2026-06-03 noetl-tools: add js_consume operation to nats tool kind
#51 2026-06-03 Fix system/outbox_publisher.yaml auth block to use AuthResolver pattern
#50 2026-06-03 Phase 2.a — system/outbox_publisher playbook + routing + auth wiring

Ecosystem map

See the Repo Map page for the full submodule inventory. Quick view of the production repos and their current versions (2026-06-23):

Repo Role Version Recent
noetl/server Rust control plane v3.45.0 🚀 #104 — WI/ADC GCS auth MERGED + ROLLED TO PROD as KSA noetl-server-rust (3-mode none/static/adc auth matrix via gcp_auth; auth=autoadc on real GCS, mints+auto-refreshes a short-lived WI/ADC OAuth token for the PAP-enforced bucket) (server#265, v3.45.0 fad5d8a; prod server @sha256:d3cbf1ad… runs as the WI-bound KSA with result-tier GCS ENV applied — all tier-enable flags OFF → tier inert, off-server cutover stayed sole-writer/lag-0/never-scan, 0 restarts; ready for staged B→C→D enablement). Prior: 📦 #104 Phase F MERGED — result-tier GC sweeper (NOETL_RESULT_TIER_GC, default off): conservative dry-run-first sweep reclaiming only provably-dead tier objects, never deletes a live-referenced one (server#264, v3.44.0 341b614; default-off byte-identical, inert in prod). Prior: 📦 #104 Phase D MERGED — mint-authoritative flag + result_store dual-write counter (NOETL_RESULT_MINT_AUTHORITATIVE, default off; the result_store PUT records each write on noetl_result_store_dual_write_total as the reversible dual-write fallback leg) (server#263, v3.43.0 6f6b9ef; no Cargo.toml change, resolves noetl-locator 0.1.1 from the registry; default-off byte-identical, inert in prod). Prior: 📦 #104 Phase C — GCS object backend + cell-endpoint registry + GET /api/internal/cells (the server side of the resolve-by-URN read path; default-off, inert in prod until a future rollout) (server#262, v3.42.0 c2d5ca9; no Cargo.toml change, resolves noetl-locator 0.1.1 from the registry). Prior: 📦 #104 Phase B — ensure the sibling noetl_result_materializer durable consumer at stream-birth so the worker's result-materializer loop drains on its own ack cursor (object-store latency never back-pressures the noetl.event audit fold); no new deps, control plane stays slim (server#261, v3.41.0 4a6659e; flag-gated on the worker side, inert in prod). 
Prior: 📦 #104 Phase A — accept the canonical result URI behind NOETL_RESULT_URI_ACCEPT (default off), parsed via the slim dependency-free noetl-locator 0.1.0 (heavy graph stays off the control plane) (server#260, v3.40.0 c89d078): dep repointed git→0.1.0 pre-merge, heavy crates (duckdb/kube/arrow/tonic/rhai/gcp_auth) confirmed absent, 623 tests green; inert in prod until a future server rollout enables the flag. Prior: ✅ #123 — surface a non-iterable loop in: as a terminal playbook.failed (off-server drive decodes the {"error":…} envelope via decode_orchestrate_error; orchestrate-core prefixes the offending step name onto the existing evaluate_loop error) (server#258, v3.39.6 7f109a9, squash 275b914): a non-iterable loop.in: used to silently wedge (commands=0, RUNNING forever) under the off-server drive because apply_worker_orchestration couldn't decode the wasm plug-in's error envelope (logged WARN, recorded decode_error, returned Ok(0) — no terminal event); now it emits a terminal playbook.failed (metric noetl_orchestrate_drive_total{stage="drive_error"}, structured execution_id), matching the in-process drive. An empty iterable ([]/{}) still short-circuits to next. Code-only (worker stays v5.40.3); 600 server + 135 orchestrate-core tests + clippy clean; kind-validated prod-exact (absent workload.batch_slots → FAILED with a clear message, valid [1,2,3] still COMPLETED, #120/#124 unaffected, 0 restarts); ships to prod on a future rollout. 
Prior: ✅ #121 SECOND-HALF — gate BOTH off-server-drive sites in trigger_orchestrator_inner on should_publish so system/* execs drive server-built run_state (they INSERT to noetl.event, never enter the noetl_events WAL); regular execs keep the off-server path (server#257, v3.39.5 c421273, squash 54ac277); pure is_system_path helper + unit test; server-only (worker stays v5.40.3); 612 tests + clippy; kind full-gate system/scheduled_cleanup BEFORE 9× WAL chain incomplete → AFTER COMPLETED 0 loops, regular loop still off-server (#256 was only the first half; #121 reopened after the v3.39.4 prod re-cutover still wedged). Prior: ✅ #121 first-half — link the gate-off command.claimed through ChainHeads (server#256, v3.39.4 77aaa06, squash 28b17cb). Prior: ✅ #124 — distributed task_sequence forward set:/sibling bindings no longer render empty (defer sub-task templates with unresolved vars to the worker re-render) (server#255, v3.39.2 28e8950): a runtime liveness filter in the orchestrate-core reduce barrier — an open/asymmetric loop back-edge predecessor that never runs on the taken path is no longer counted as a pending fan-in dependency (build_incoming_arcs unchanged); affects in-server + off-server identically; test test_open_loop_back_edge_does_not_block_dispatch, 133/133 + clippy, kind-validated. Prior: ✅ #116 program-scale step 2 — execution-affinity single-owner WRITE ORDERING (multi-replica gate-ON validated) (server#252, v3.39.0 5e00d0a): closes the chain-fork race step 1 left open (the command.issued prev-read + head CAS-advance are two non-atomic steps → concurrent cross-replica emits forked the chain). Affinity routes every trigger for an execution (POST /api/events, which fires the drive) to the single replica that sharding::ShardConfig::owns(execution_id) owns (stable XxHash64); non-owner forwards a reverse-proxy POST (one-hop loop guard, degrade-to-local). 
On the owner the single-process drive lock + in-memory ChainHeads make the read→advance atomic, no distributed lock; KV is the genesis/handoff vehicle (owner resolves LOCAL → kv_remote_hit→0). New src/affinity.rs; NOETL_EXECUTION_AFFINITY/NOETL_PEER_URL_TEMPLATE/NOETL_SHARD_INDEX_FROM_HOSTNAME (all default off, prod unchanged); metric noetl_execution_affinity_total{outcome} (forwarded_ok = proof). Multi-replica gate-ON kind (2-replica StatefulSet, offserver+audit_only+nats_kv+affinity+publish_only): rig PASS — chains roots=1/dangling=0/walk==total (NO fork), forwarded_ok +9, never-scan + sole-writer across replicas, executions COMPLETE; single-replica unchanged; 595 tests + clippy green. Follow-up #117: off-server from_events spine ordered by event_id wedges fan-in under a chain-order≠id-order inversion (high-concurrency fanout). Prior: ✅ #115 program-scale step 1 — multi-replica coherence DATA LAYER (NATS-KV-backed ChainHeads + ExecDescriptor); execution-affinity STAGED (server#251, v3.38.0 8f39a79): NOETL_REPLICA_COHERENCE=nats_kv (default local, prod unchanged) backs the off-server drive's watermark + descriptor with JetStream KV buckets — head advance = CAS (one chain under concurrent emits), descriptor = CAS merge; in-process maps = write-through cache / degraded fallback. src/coherence.rs; ChainHeads/ExecDescriptors async; metric noetl_replica_coherence_total{structure,op,outcome} (proof kv_remote_hit). Kind: single-replica nats_kv bit-for-bit parity with local (all topologies COMPLETE, clean chains, scans +0); 2-replica proved cross-replica resolves work. Necessary-not-sufficient: 2+ replicas still fork the chain (non-atomic issuing_event-read vs head-advance) → execution-affinity STAGED as step 2 (substrate in src/sharding.rs). 
Prior: ✅ #115 Phase 5 — atomic-working-item context (tenet 6): the drive hands a worker only its minimal declared slice (server#250, v3.37.0 a96ade8): new orchestrate-core::input_binding (analyze/project_context over minijinja undeclared_variables, conservative bail-to-full) + CommandBuilder::with_atomic_item_context; NOETL_ATOMIC_ITEM_CONTEXT (default false, prod unchanged); metric noetl_atomic_item_context_total{outcome}. Builds on #77 (Explicit Input Binding, CLOSED). Gate-ON kind-validated: flag-ON consumer ctx narrowed to the one declared key, COMPLETED; flag-OFF full ctx (back-compat). Prior: ✅ #115 Phase 6 — retire the hot-path noetl.event read class; the table is AUDIT-ONLY (server#249, v3.36.0 b71ca1d): NOETL_EVENT_READ_PATH=event_scan|audit_only (default event_scan, prod unchanged). Under audit_only, the remaining lifecycle readers of noetl.event (the WHERE execution_id replay class outside the drive — get_catalog_id per-ingest, inherit_parent_trace, subscription dedup-audit + container-callback catalog/existence) serve from the in-memory execute-time ExecDescriptor; a cold descriptor (post-terminal straggler after eviction / restart) resolves catalog_id from noetl.command (the synchronous queue) — never a noetl.event scan. New proof metric noetl_event_hotpath_reads_total{site,outcome}. Gate-ON kind-validated: hot-path scan Δ0 (served_descriptor +96 + served_command +3), drive state_build_total Δ0 + event_scans Δ0 ⇒ ZERO noetl.event scans anywhere on the hot path, end-to-end; linear/loop/fan-out/output_select COMPLETE; sole-writer + lag-0; audit still works (direct SELECT + status + replay event_count=25); 585 tests + clippy green; default event_scan, prod unchanged. The RFC never-scan end state (tenet 3) is reached under the flag. Prior: ✅ #115 Phase 4 REMAINDER — stateless off-server drive edge (zero state rebuild + zero noetl.event reads) (server#248, v3.35.0 6e30fc3): removes the residual server-side chain-walk bookkeeping on the drive path. 
Under NOETL_STATE_BUILDER=offserver a warm execute-time ExecDescriptor (catalog_id + routing seeded at playbook_started; terminal stamped at the emit_events chokepoint) lets trigger_orchestrator_inner route the system/orchestrate command WITHOUT building WorkflowState: expected_head comes from the in-memory ChainHeads, trigger_event_id is passed so the worker resolves the trigger type off its WAL, and NO server-built state is attached (__stateless__). apply_worker_orchestration sources catalog_id+routing from the descriptor (skips cold-rebuild) + evicts on terminal worker-built state. Cold descriptor (restart) falls through to the server-built path (re-seeds) → chain_walk + event_scan stay fallbacks. Proof: noetl_orchestrate_drive_total{dispatched_offserver_stateless,applied_stateless} advance while noetl_state_build_total stays flat. Gate-ON kind-validated: state_build_total Δ0, event_scans Δ0, dispatched_offserver_stateless +3 / applied_stateless +3, linear(13)/loop(62)/fan-out(25)/output_select(31) COMPLETE, offserver==server parity, sole-writer 25==25, lag-0; 583 tests + clippy green; default server, prod unchanged. Completes #107 step 2 server-side. Prior: ✅ #115 Phase 4 DRIVE CUTOVER: mark the off-server drive command + carry expected_head (server#247, v3.34.0 f0922bd): under NOETL_STATE_BUILDER=offserver (default server, prod unchanged) trigger_orchestrator_inner marks the system/orchestrate command __offserver_build__ + carries execution_id + expected_head (the staleness watermark) so the worker self-sources the drive state from the WAL; the server-built state rides along as the worker's incomplete-chain fallback. Gate-ON parity rig PASS (offserver==server, fan-in exactly-once, server scans 0). 580 tests + clippy green; no prod default changed. Prior: ✅ #115 Phase 4: NOETL_STATE_BUILDER=offserver|server flag scaffold (default server) (server#246, v3.33.0 3e6006d): the server-side flag for the off-server state-builder drive cutover. 
server (default, prod unchanged) builds WorkflowState in-process; offserver routes the drive to obtain state from the pool-side off-server builder (worker state_builder, reading the noetl_events WAL). Flag only — the offserver drive-cutover wiring is staged (pool-side builder landed in worker v5.37.0). 2 config tests; no prod default changed. Prior: ✅ #115 Phase 3 MERGED — chain-walk state builder (flagged, default-off) (server#245, v3.32.0 8338417): behind NOETL_STATE_BUILD_MODE=chain_walk the drive reconstructs WorkflowState by following the one-level prev_event_id chain head→root (head from the in-memory ChainHeads watermark, then (execution_id,event_id) PK point-lookups — never a WHERE execution_id scan of noetl.event) and feeds the same from_events (orchestrate-core unchanged → parity by construction). Falls back to event_scan (default) on cold-head / materializer-lag / non-genesis so correctness is never sacrificed; NOETL_STATE_BUILD_PARITY_CHECK shadow-builds both ways in one REPEATABLE READ snapshot and asserts equal. New metrics noetl_state_build_total{mode,outcome} + noetl_state_build_event_scans_total (no-scan proof, 0 under chain_walk) + noetl_state_build_chain_hops + noetl_state_build_parity_total{result}. Gate-ON kind-validated: parity 41/41 MATCH, event_scans_total=0 / 1064 PK hops / 0 fallbacks, all topologies COMPLETE, sole-writer + lag-0, 577 lib tests + clippy green. Self-merged; prod default unchanged. Phase 4 (off-server builder + WAL cache) builds on this. Prior: ✅ #115 Phase 2 MERGED — one-level prev_event_id event chain (server#244, v3.31.0): each noetl.event carries prev_event_id (the immediately-previous event in causal order) + each noetl.command the issuing-event link, stamped at the emit chokepoint emit_events from a per-execution chain-head watermark (ChainHeads) — covering drive events + command.issued + worker-lifecycle on both gate paths, the materializer persisting it. 
Additive (nothing reads the link yet — that is Phase 3, in progress). Chain-correctness kind-proven walkable/1-root/no-gap/no-scan across 6 gate-ON topologies; 573 lib tests + clippy green. ai-meta pointer afdb365. Prior: ✅ #115 Phase 1 — surface _ref/_store on kept refs + refs_in_state default true (server#243, v3.30.0): hydrate_result_references keep_refs branch merges ref/store/uri onto the bounded extracted summary it surfaces as context.data (so {{ step._ref }} lazy-load + {{ step._ref is defined }}/{{ step._store }} predicates resolve off the summary without bulk), and refs_in_state now defaults true — references stay out of state + commands (drive state + command.issued bounded; the worker selective consume side landed in worker#117). Closed #113 + #114; kind gate-ON all 9 stalls COMPLETE. Prior: 🐞 #114 — offload oversized command context (server#242, v3.29.5): a command.issued context over NOETL_COMMAND_CONTEXT_MAX_BYTES (512KB) is offloaded to noetl.result_store with a {__context_ref__} marker; get_command/claim_command resolve it before the worker sees the command (metrics context_offloaded/context_ref_resolved) — so the published event stays under NATS max_payload and never wedges the publish-only gate. Kind gate-ON: rig PASS, every command.issued event <1MB, 6 of #113's 9 fixtures COMPLETE; chose ref-on-oversize over refs_in_state=true (the latter breaks bulk-consuming fixtures — consume side not impl). Remaining 3 fixtures + the cutover gated on the refs_in_state consume side (#101). 
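The #114 offload rule above (a command context over the byte budget is parked in the result store and replaced by a reference marker, then resolved before the worker sees it) can be sketched roughly as follows. This is an in-memory illustration only: ResultStore, CommandContext, offload_if_oversized, and resolve_context are hypothetical names, contexts are plain strings rather than JSON, and the real noetl.result_store and get_command/claim_command paths are not shown on this page.

```rust
use std::collections::HashMap;

/// Mirrors the 512KB figure quoted for NOETL_COMMAND_CONTEXT_MAX_BYTES.
const COMMAND_CONTEXT_MAX_BYTES: usize = 512 * 1024;

/// Hypothetical stand-in for a command's context: either carried inline
/// or replaced by a marker pointing at the stored copy.
#[derive(Debug, PartialEq)]
enum CommandContext {
    Inline(String),
    ContextRef(String),
}

/// Hypothetical in-memory stand-in for the durable result store.
#[derive(Default)]
struct ResultStore {
    objects: HashMap<String, String>,
}

/// Emit side: keep small contexts inline, offload oversized ones.
fn offload_if_oversized(
    store: &mut ResultStore,
    key: &str,
    context: String,
    max_bytes: usize,
) -> CommandContext {
    if context.len() <= max_bytes {
        return CommandContext::Inline(context);
    }
    store.objects.insert(key.to_string(), context);
    CommandContext::ContextRef(key.to_string())
}

/// Claim side: hand back the full context, resolving a marker if present.
/// Returns None when a marker points at a missing object.
fn resolve_context(store: &ResultStore, ctx: &CommandContext) -> Option<String> {
    match ctx {
        CommandContext::Inline(body) => Some(body.clone()),
        CommandContext::ContextRef(key) => store.objects.get(key).cloned(),
    }
}
```

The point of the design is that the published event only ever carries the small marker, so its size stays bounded regardless of how large the context grows.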
Prior: 🐞 #113 — off-server drive: recover offloaded drive result + stop drive on cancel (server#241, v3.29.4): apply_worker_orchestration resolves+decodes an OFFLOADED __orchestrate__ drive result (over the 100KB inline budget → durable reference.ref, no inline output_b64) instead of dropping it → non-convergent re-loop (new noetl_orchestrate_drive_total{stage=ref_resolved}); cancel now stops the drive (match underscore playbook_cancelled + ExecutionState::is_terminal terminal guard evicts the orch-cache, no restart). Kind gate-ON proven (785KB result → ref_resolved→COMPLETED, 0 decode WARNs; cancel froze a drive-loop instantly; sole-writer lag 0); rig e2e#63. 5/9 #113 fixtures COMPLETE; 4 hit a distinct oversized-command.issued stall → #114. No prod default flipped. Prior: 🎯 #103 — CQRS write-path cutover COMPLETE, FLIP-READY (server#240, v3.29.3): the 2 ExecutionService cancel/finalize writers route through the emit_event chokepoint — the last synchronous server noetl.event writer under the gate is closed, so with NOETL_EVENT_INGEST_PUBLISH_ONLY=on the server writes zero event rows (materializer is the sole writer). Kind-proven both modes (gate-off byte-identical INSERT; gate-on PUBLISHED + materializer sole writer + terminal state + 0 loss/dup). All three flip blockers closed → flipping the gate on is a staged operator decision. Default-off; no prod default changed. Prior: #104 — off-server-drive × gate crash-recovery (v3.29.2, server#238): cold-cache apply rebuilds WorkflowState from the durable log. Prior: #108 (c) — worker-driven orchestrator drive now DEFAULT ON (server#233, v3.28.0): NOETL_ORCHESTRATE_PLUGIN_DRIVE defaults true after the scale soak proved zero noetl.event burst + full system-pool isolation; in-process drive kept as the =false revert. 
Prior: #101 — bounded-memory orchestrator + stall-proof reconcile (v3.9.0, server#197): projection-snapshot bounded rebuild (flat memory — 167KB snapshot at 200k events, was OOM at ~19k); throttled O(events) consistency COUNT off the hot path; background reconcile poller (force-advances every active execution every 8s → no permanent deadlock under DB backpressure); results-by-reference resolution; GET /api/executions/{id} memory-bomb fix. Validated kind 10×1000 (flat memory) + GKE db-g1-small/PgBouncer 10×200 (poller broke a stall, 0 fails/restarts, Cloud SQL ~15 backends). Prior: #90 Phase 7 — POST /api/execute/batch + opt-in exactly-once dedup window (v3.5.0, server#189, closes server#188): batch endpoint creates N executions in one round-trip with partial-failure containment, reusing the single-execute path so per-message routing/trace/dedup are intact; the opt-in dedup window (noetl.subscription_dedup, bounded-by-age, race-safe INSERT … ON CONFLICT, scoped by subscription, subscription.message.deduplicated audit, default off — RFC §10 OQ1); validation of the new dispatch.batch_dispatch/batch_max/dedup/limits spec blocks; noetl_execute_outcomes_total + noetl_execute_batch_size. Live on kind: batch 12→12 COMPLETED; dedup duplicate→1 execution; direct-curl within/outside-window + dedup-off all green. ai-meta → server 7b217d8. Prior (v3.4.2): #90 Phase 5 gcs/s3 spool credential optional (ADC). Prior: #90 Phase 4 — spool config validation + subscription lifecycle-status fix (v3.4.1, server#184+server#185): validate the spec.spool block at registration; lifecycle-status reconstruction now matches only the six lifecycle event types so spool/circuit events (which share the subscription's execution_id) can't 500 subscription_get/activate. 
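For readers unfamiliar with the opt-in dedup window from #90 Phase 7, here is a minimal in-memory sketch of the idea: a message is admitted once per subscription within a bounded age window. The names DedupWindow and admit are hypothetical, timestamps are plain seconds, and the HashMap replaces the noetl.subscription_dedup table and its INSERT … ON CONFLICT; the real implementation may differ in every detail.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory model of an age-bounded dedup window,
/// scoped by subscription.
struct DedupWindow {
    window_secs: u64,
    /// (subscription, message_id) -> time the message was last admitted.
    seen: HashMap<(String, String), u64>,
}

impl DedupWindow {
    fn new(window_secs: u64) -> Self {
        DedupWindow { window_secs, seen: HashMap::new() }
    }

    /// Returns true if the message should create an execution, false if it
    /// is a duplicate inside the window.
    fn admit(&mut self, subscription: &str, message_id: &str, now_secs: u64) -> bool {
        let key = (subscription.to_string(), message_id.to_string());
        if let Some(&admitted_at) = self.seen.get(&key) {
            if now_secs.saturating_sub(admitted_at) < self.window_secs {
                return false;
            }
        }
        self.seen.insert(key, now_secs);
        true
    }
}
```

With a 60-second window, a message id repeated 30 seconds later on the same subscription is rejected, while the same id on a different subscription, or on the same subscription after the window has elapsed, is admitted.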
Prior: #90 Phase 3: push-ingress config endpoint + push catalog validation (v3.3.0, server#182): mode: push requires an ingress.verify block (hmac_sha256 | bearer | pubsub_oidc; none rejected) + new GET /api/internal/ingress/{listener} (service-account-gated) resolving the verify-secret alias through the Wallet + idempotent subscription registration; this is the gateway's DB-free config source. Live-validated on kind. Prior: #90 Phase 2: kind: Subscription type + lifecycle + pool routing + W3C trace (v3.2.0, server#180): first-class kind: Subscription catalog type (source/mode/dispatch validation, no step-DAG) + event-sourced lifecycle endpoints /api/subscriptions (register→activate→pause/resume→drain→deactivate, idempotent register, GET list/get) + execution_pool override on /api/execute routing the whole run to noetl.commands.<pool>.<eid> (persisted in playbook_started meta, orchestrator reads back) + W3C trace into meta.trace + command notification + child inheritance. Startup-seeds the subscription resource kind; decodes noetl.event.created_at as TIMESTAMP. Live E2E green. Prior (v3.1.0): subscription ToolKind. Prior: Round-trip JSON null in whole-object {{ step }} references (v3.0.6, server#177; closes noetl/ai-meta#89): a single-expression {{ step }} reference to a prior result envelope carrying a null field rendered that field as the JS token undefined via minijinja's map repr (json_value_to_minijinja maps JSON null → Value::UNDEFINED); render_to_value then failed serde_json::from_str and returned the whole envelope as a raw string, so the consuming python/rhai step received an unparseable str and crashed. Fix adds a | tojson retry to render_to_value (mirrors the noetl-tools TemplateEngine::render_value the server copy had diverged from): a lone {{ expr }} whose plain render is container-shaped-but-invalid JSON re-renders with | tojson, and minijinja_to_json maps undefined/none → JSON null so the field round-trips as null. 
5 new regression tests (null in nested + top-level objects, null array element, explicit | tojson no-double-pipe, scalars unchanged); 619 lib + 8 parity green; clippy clean. Kind-validated against the live paginated-api test-server: cursor fixture walks all 4 pages, terminal next_cursor: null handled, validate_results collects 35 (first_id=1, last_id=35, success) — matching offset; pre-fix the 4th check_pagination was command.completed error. ai-meta → server 8e17fbe. Prior: Workflow-arc loops advance across iterations + terminate (v3.0.5, server#176; closes noetl/ai-meta#85): two coupled fixes atop the dispatch-guard re-entry layer. (1) Durable event-sourced loop-ctx propagation — step-level set: ctx.* loop variables were recomputed per pass and reverted to the workload default (loop thrashed 0,0,1,0,1,2,…); root cause was start's initializer set re-firing every pass in random HashMap order against the loop's advancing set. Fix persists each completion's rendered set: as a ctx.updated event (latest-wins fold in WorkflowState.ctx + build_context overlay), emitted once per completion keyed by the stable completion event_id (the Utc::now()-fallback completed_at is unstable across reconstructions). (2) Loop-exit hang — the exit branch was marked step.skipped on a loop-body-completion pass (recency-based branch-point detector missed it), turning it terminal so the is_step_done guard suppressed the exit dispatch; fixed with a structural loop-branch-point test (any step with a back-edge arc). 614 lib tests (+6; 2 verified to fail without their guard); clippy-clean. Kind-validated: counter loop 0→1→2→3 + terminates; real-http offset pagination 4 pages collecting 35. ai-meta → server e519fdc. (Separate finding: the e2e offset/cursor fixtures read response.data.users but the Rust http tool nests it at response.body.users — filed as a follow-up.) 
Prior: Unblock workflow loops + loop.done-gated transitions (v3.0.4, server#175; closes noetl/ai-meta#83 + #84): the fan-in/reduce barrier counted a loop back-edge (check_pagination → fetch_page) as an upstream and deferred the loop head forever; the fix excludes back-edges via a new forward_reachable helper (genuine fan-in unaffected). Separately, event.name was never populated for arc evaluation, so when: {{ event.name == "loop.done" }} never matched (10+ fixtures hung after an in-step loop:); the fix injects event.name = "loop.done" into a completed loop step's next-arc context. Found in a full e2e regression re-sweep (19→27/36 on kind); landed with e2e fixture fix e2e#39 (duckdb commands → command, closes #86). Follow-ups #85/#87 filed open. 26 orchestrator tests +2 new; kind-validated. Prior: container-callback insert matches the deployed event schema (v3.0.3, server#173; tracks noetl/ai-meta#43): the container-callback handler emitted its resume call.done via a stale query targeting attempt + id columns that don't exist on the deployed noetl.event (PK (execution_id, event_id)), so every watcher POST 500'd with column "attempt" of relation "event" does not exist, blocking the #43 chain. Replaced with an inline INSERT matching the working handlers::events column set; terminal outcome rides in a chk_event_result_shape-conforming result envelope. Kind-val GREEN: kind_validate_container_callback.sh both probes pass (happy_path → succeeded, oom → failed_oom). 
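To make the v3.0.4 back-edge exclusion concrete, the sketch below computes a step's fan-in upstreams while ignoring loop back-edges. The actual forward_reachable helper is not shown on this page, so this swaps in a standard depth-first back-edge classification from a known entry step; back_edges, visit, and fan_in_upstreams are hypothetical names and the step graph is a plain list of (from, to) arcs.

```rust
use std::collections::{HashMap, HashSet};

/// Depth-first walk that records arcs pointing back to a step still on the
/// current path; those are the loop back-edges.
fn visit<'a>(
    node: &'a str,
    adjacency: &HashMap<&'a str, Vec<&'a str>>,
    on_path: &mut Vec<&'a str>,
    done: &mut HashSet<&'a str>,
    back: &mut HashSet<(&'a str, &'a str)>,
) {
    on_path.push(node);
    if let Some(targets) = adjacency.get(node) {
        for &target in targets {
            if on_path.contains(&target) {
                back.insert((node, target));
            } else if !done.contains(target) {
                visit(target, adjacency, on_path, done, back);
            }
        }
    }
    on_path.pop();
    done.insert(node);
}

/// All back-edges reachable from `entry`.
fn back_edges<'a>(arcs: &[(&'a str, &'a str)], entry: &'a str) -> HashSet<(&'a str, &'a str)> {
    let mut adjacency: HashMap<&'a str, Vec<&'a str>> = HashMap::new();
    for &(source, target) in arcs {
        adjacency.entry(source).or_default().push(target);
    }
    let mut back = HashSet::new();
    visit(entry, &adjacency, &mut Vec::new(), &mut HashSet::new(), &mut back);
    back
}

/// Predecessors a fan-in barrier should wait for: every incoming arc
/// except back-edges. Sorted for deterministic output.
fn fan_in_upstreams<'a>(arcs: &[(&'a str, &'a str)], entry: &'a str, step: &'a str) -> Vec<&'a str> {
    let back = back_edges(arcs, entry);
    let mut upstreams: Vec<&'a str> = arcs
        .iter()
        .filter(|arc| arc.1 == step && !back.contains(*arc))
        .map(|arc| arc.0)
        .collect();
    upstreams.sort_unstable();
    upstreams.dedup();
    upstreams
}
```

In the pagination loop from the entry above, fetch_page then waits only on its forward predecessor and not on check_pagination, while a genuine two-way merge still waits on both branches.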
Prior: container-tool command type contradiction fixed (v3.0.2, server#172; closes noetl/ai-meta#81): ToolSpec.command Option<String> → Option<serde_json::Value> so the container tool kind's K8s-Job-style array command decodes server-side (was a 400 data did not match any variant of untagged enum ToolDefinition) and passes through unchanged to the worker's ContainerConfig.command: Option<Vec<String>>; a scalar stays a JSON string for the shell/db consumers; ToolCall::from_spec forwards the value verbatim instead of wrapping in Value::String. 2 new regression tests (playbook::types 18/18); clippy clean. Kind-val GREEN end-to-end: server accepts command: ["/bin/sh","-c"], worker dispatches the container tool, K8s Job reaches Complete 1/1 (pre-fix kubectl get jobs empty). Prior: e2e-sweep cleanup (v3.0.1, server#171; tracks noetl/ai-meta#49): 64 MB result-store PUT body limit (DefaultBodyLimit; was rejecting 15 MB+ payloads with HTTP 413); render_pipeline_config stashes set/args/spec/command blocks before Tera rendering; iter namespace map in build_iteration_command; cmd_render_ctx uses command.context override; stripped diagnostic tracing::debug! blocks. All 7 e2e sweep playbooks PASS on Rust-only kind stack. Prior: Sequential-mode iterator dispatch (v2.62.0, server#166; closes noetl/ai-meta#76): LoopMode enum (Sequential default / Parallel); LoopSpec.mode parsed from loop.spec.mode YAML; StepInfo.iterations_dispatched tracks command.issued count for the sequential dispatch guard; sequential pattern dispatches iteration 0 at fan-out then on each command.completed dispatches next if iterations_dispatched == iterations_completed(). Default is Sequential, so existing playbooks without explicit spec.mode get sequential behavior. 3 new tests; lib pass; clippy clean. Kind-val GREEN: test/loop COMPLETED 5/5 + iterator_save_test COMPLETED 4 steps. First Claude-direct Rust PR under agents/rules/handoff-routing.md. 
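The sequential dispatch guard from v2.62.0 can be pictured with a small state machine. This is a sketch under assumptions: the StepInfo here is a stripped-down hypothetical with plain counters (the entry describes iterations_completed() as a method on the real type), and fan_out / on_command_completed are illustrative names rather than the orchestrator's API.

```rust
/// Hypothetical, simplified per-step loop bookkeeping.
struct StepInfo {
    total_iterations: usize,
    iterations_dispatched: usize,
    iterations_completed: usize,
}

impl StepInfo {
    fn new(total_iterations: usize) -> Self {
        StepInfo { total_iterations, iterations_dispatched: 0, iterations_completed: 0 }
    }

    /// At fan-out, sequential mode dispatches only iteration 0.
    /// Returns the iteration index to dispatch, if any.
    fn fan_out(&mut self) -> Option<usize> {
        if self.total_iterations == 0 || self.iterations_dispatched > 0 {
            return None;
        }
        self.iterations_dispatched = 1;
        Some(0)
    }

    /// On each command.completed, dispatch the next iteration only when
    /// everything dispatched so far has completed.
    fn on_command_completed(&mut self) -> Option<usize> {
        self.iterations_completed += 1;
        let caught_up = self.iterations_dispatched == self.iterations_completed;
        if caught_up && self.iterations_dispatched < self.total_iterations {
            let next = self.iterations_dispatched;
            self.iterations_dispatched += 1;
            Some(next)
        } else {
            None
        }
    }
}
```

For a three-item loop this yields iteration 0 at fan-out, then 1 and 2 on successive completions, and nothing after the third completion, so at most one iteration is ever in flight.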
Prior: Durable result-store endpoints (v2.58.0, server#160; closes noetl/ai-meta#70): PUT + GET /api/result/<eid>/<step>/<ref>. Kind-val GREEN: output_select_test reached playbook.completed. Prior: Rust orchestrator exclusive-routing fix — step.skipped for untaken siblings (v2.57.2, server#159; closes noetl/ai-meta#67): under mode: exclusive only one arc fires; pre-fix, the static planner declared the untaken sibling's target as an upstream of any downstream merge step, then the R4 fan-in barrier waited for it forever. Three-part fix: (1) evaluator::evaluate_next_transitions stops break-ing on exclusive-mode match — surfaces unmatched siblings as not_matched_with_target results; (2) orchestrator::process_in_progress two-pass refactor — emit step.skipped for ALL unmatched arc targets first, then dispatch matched (HashMap-order-independent); (3) +2 unit tests + new defensive Jinja regression guard (server lib 568/0/0). Kind-val GREEN: e2e/fixtures/playbooks/comprehensive_test.yaml reaches playbook.completed in ~4s (was hanging forever). Single-commit patch on top of v2.57.1. Prior: Rust orchestrator step.data template accessor fix (v2.57.1, server#158; closes noetl/ai-meta#66): WorkflowState::build_context now injects a self-referencing .data key on the extracted user_data so {{ step.field }} (existing flat path) AND {{ step.data.field }} (new wrapped path) both resolve. Guarded by !map.contains_key("data") to preserve task_sequence flatten back-compat (a labeled sub-task's data field stays addressable as both <step>.<label>.data.x AND <step>.data.x). 2 new unit tests; cargo test lib 566/0/0; release build clean. Single-commit patch on top of v2.57.0. Prior: Phase D R5 R7 — cross-server parity harness; Replay engine port complete (v2.57.0, server#157; closes server#148; tracks noetl/ai-meta#49 Phase D R5). 
Final slice: hermetic parity rig with events.json (13 synthetic events) + expected.json (Python's pre-recorded fold output) + regenerate_expected.py (standalone Python script; verbatim extract of service.py fold + helpers, no noetl-package imports). tests/parity_harness.rs 8-test integration suite asserts structural parity field-by-field across all six projections + payload refs. Parity is structural, not byte-for-byte hex (Python and Rust hash different digest inputs per R4's design). All 8 tests pass; lib 564/0/0; release build clean. No kind-val needed (test-only PR). All 7 Phase D R5 rounds shipped today (v2.51.0 → v2.57.0); the Replay engine port is complete. Phase D R5 R6: payload resolver (v2.56.0, server#156; Refs server#148; tracks noetl/ai-meta#49 Phase D R5): every event's result.reference JSON gets parsed into a typed PayloadSummary and appended to the relevant projection's payload-refs list. PayloadSummary + PayloadRefEntry types mirror Python's dict shapes; extract_payload_ref + payload_summary mirror Python's helpers with three-tier fallback (reference.<field> → rows_ref.meta.<field> → rows_ref.ipc.<field>). ReplayExecutionState.payload_refs + ReplayFrameState.output_ref/output_ref_summary + ReplayBusinessObjectState.payload_refs/last_payload_ref all populated. 15 new unit tests; lib 564/0/0. Kind-validated against live execution showing 3 populated payload_refs with real SHA-256 digests. Only R7 remains. Phase D R5 R5: snapshot seed + base_state + upcaster digest (v2.55.0, server#155; Refs server#148; tracks noetl/ai-meta#49 Phase D R5): the replay fold can now start from a prior fold's output and continue from there. ReplaySnapshotSeed mirrors Python's frozen dataclass; ReplaySnapshotInfo is the output subset; ReplayFoldOptions carries the optional inputs; new fold_replay_state_with_options entry point + 5-arg fold_replay_state as a back-compat shim. 
R5-introduced ReplayState fields (replay_snapshot, upcaster_registry_digest) both skip_serializing_if None — default folds produce R1–R4-identical JSON. Snapshot-storage backend deferred to a downstream sub-issue. 8 new unit tests; lib 549/0/0. Kind-validated wire-shape back-compat. Phase D R5 R4 — typed Checksum + projection_checksums (v2.54.0, server#154; Refs server#148; tracks noetl/ai-meta#49 Phase D R5): every replay fold now produces a typed Checksum over the full state + a 6-entry projection_checksums map. ChecksumType enum (initial variant Sha256) + Checksum { type, value } struct + stable_json_bytes deterministic JSON encoder + compute_checksums per-projection + top-level digest run at the end of fold_replay_state. Design decision: digest input is the typed Rust state directly, not Python's flat-row normalize layer; Python byte-for-byte parity deferred to R7. 9 new unit tests; lib 541/0/0. Kind-validated: top-level checksum 41265876487f...ae426; six projection_checksums entries all populated. Phase D R5 R3 — loops + business_objects projections (v2.53.0, server#153; Refs server#148; tracks noetl/ai-meta#49 Phase D R5): third slice of Phase D R5 — replay fold populates the last two per-projection maps. Two new typed state structs (ReplayLoopState, ReplayBusinessObjectState) replace R2's serde_json::Map placeholders; ReplayState.{loops,business_objects} flip to BTreeMap for deterministic key ordering. Two new ID extractors mirror Python's _loop_id / _business_object_identity resolution order; business_object_status helper mirrors Python's _business_object_status suffix-derived ACTIVE/DELETED transitions; two new populate functions with full event-shape coverage (loop counters bump on command.{completed,failed} + loop.shard.{done,failed} + loop.{done,fanin.completed}; business-object attributes REPLACE/PATCH from meta). payload_refs deferred to R6. 13 new unit tests; lib 532/0/0. 
Kind-validated: re-probe returns loops + business_objects empty (expected — fixture doesn't emit those events). R4 design captured (per user direction): typed ChecksumType enum + Checksum { type, value } struct; future checksum types slot in via enum. Phase D R5 R2 — stages + frames + commands projections (v2.52.0, server#152; Refs server#148; tracks noetl/ai-meta#49 Phase D R5): second slice of Phase D R5 — replay fold populates stages + frames + commands projections. Mirrors Python's state["stages"] / state["frames"] / state["commands"] per-projection dicts. ReplayEventRow extended with stage_id / frame_id / command_id / worker_id / aggregate_type / aggregate_id / meta columns (all #[sqlx(default)]); three new typed state structs (ReplayStageState / ReplayFrameState / ReplayCommandState); ReplayState.{stages,frames,commands} flip from serde_json::Map to BTreeMap for deterministic ordering; three new ID extractors mirror Python's resolution order; three new populate functions with full status transitions (stage opened/closed; frame dispatched/started/committed/failed/abandoned; command full lifecycle). 10 new unit tests; lib 518/0/0. Kind-validated: re-probe of prior fanout_reduce execution returns commands map populated with 4 entries carrying worker_id + issued_event_id + last_event_id. Phase D R5 R1 — Replay endpoint scaffold + execution projection (v2.51.0, server#149; tracks server#148): opens Phase D Round 5 (Python's noetl/server/api/replay/service.py ~1236 LoC → Rust). Sub-issue server#148 documents the 7-round decomposition (R1 scaffold + execution / R2 stages+frames+commands ✅ / R3 loops+business_objects / R4 typed Checksum + projection_checksums / R5 snapshot seeds / R6 payload resolver / R7 cross-server parity harness). 
R1 ships new GET /api/replay/state route mirroring Python's endpoint.py byte-for-byte (query params + defaults + projection enum + mutually-exclusive cutoffs returning 400); new services::replay module with ReplayService + ReplayCutoff + ReplayProjection + ReplayState + ReplayExecutionState + pure deterministic fold_replay_state; minimal execution projection fold reuses the Phase D R4 terminal-event short-circuit. Phase D R4 follow-up: status endpoint short-circuits on terminal events (v2.50.1, server#147; closes server#146): ExecutionService::get_status now (1) looks up playbook.completed / playbook.failed FIRST and returns COMPLETED / FAILED directly, and (2) accepts 'success' lowercase in the completed_steps counter. Kind-validated: prior execution flipped RUNNING → COMPLETED on the same DB data. Phase D R4 slice 2: apply_event handles step.skipped (v2.50.0, server#145; closes server#144): new step.skipped arm in state::WorkflowState::apply_event records the step with StepState::Skipped, so the fan-in barrier no longer defers forever when an upstream's when guard evaluates false. Container Tool Callback umbrella #43 Round 2: POST /api/internal/container-callback/{execution_id}/{step} (v2.48.0, server#141; closes server#140; tracks noetl/ai-meta#43): external K8s watcher (Round 1, ops#166) POSTs Job terminal-state events here; handler validates path params, checks staleness via a single indexed SELECT on noetl.event, and emits a call.done event with the structured terminal state on match (or bumps noetl_container_callback_stale_total + returns 202 if no events exist for the execution). Six TerminalState variants survive in meta.terminal_state so playbooks can branch on the specific failure reason. Two new counters; 7 new unit tests; lib 487/0. Round 2 lands first per the umbrella's recommended ordering (smallest blast radius); unblocks Round 1 (watcher Deployment) + Round 3 (tools#36, Tool::Container). 
Secrets Wallet #61 cloud-specific dynamic-secret providers shipped — umbrella feature-complete (v2.45.0 server#137 + v2.46.0 server#139 + v2.47.0 server#138): 6d.1 AWS STS AssumeRoleWithWebIdentity (EKS IRSA path; no SigV4 — token IS the credential); 6d.3 Azure AAD client-credentials (off-cluster non-IMDS path; sovereign-cloud overrides via env); 6d.2 GCP iamcredentials.generateAccessToken (workload-identity impersonation of a target SA). All three return SecretValue.expires_at populated — Phase 6d's cache_decision clamps cache TTL accordingly; Phase 7c.3 background refresh re-resolves inside the window. 39 new unit tests across the three providers; lib 470/0. Secrets Wallet umbrella is feature-complete: envelope encryption + KMS + 5 static-secret providers (GCP-SM, K8s, Vault, AWS-SM, Azure-KV) + 3 dynamic-secret providers (AWS-STS, GCP-IAM, Azure-AAD) + residency policy + cross-region broker + KEK rotation + audit + auto-renewal with stampede collapse. Phase 7c.3 — background-refresh wire-up + stampede collapse (v2.44.0, server#136): cache-hit path now spawns a background tokio::spawn to re-resolve via the provider + update the cache via KeychainService::set when the row is inside the refresh window. Cached value returns IMMEDIATELY to the caller — worker fetches stay on the fast path. Stampede collapse via new src/services/keychain_refresh.rs RefreshInflight (Arc<tokio::sync::Mutex<HashSet<(i64, String)>>>); concurrent refreshes for the same (catalog_id, alias) collapse to one provider call. Refactor: extracted resolve_via_provider from try_resolve_keychain so cache-miss inline + background refresh share identical code. Phase 7c series wire-complete (7c + 7c.2 + 7c.3). 
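The stampede collapse in Phase 7c.3 boils down to "only the first caller for a given (catalog_id, alias) starts a refresh". The sketch below shows that guard with a std::sync::Mutex instead of the tokio::sync::Mutex the entry names, so it runs without an async runtime; the try_begin and finish method names are assumptions, not the API of src/services/keychain_refresh.rs.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};

/// Tracks which (catalog_id, alias) pairs currently have a refresh running.
/// Cloning shares the same underlying set.
#[derive(Clone, Default)]
struct RefreshInflight {
    keys: Arc<Mutex<HashSet<(i64, String)>>>,
}

impl RefreshInflight {
    /// Returns true if the caller won the right to refresh this key;
    /// false means another refresh is already in flight and the caller
    /// should skip the provider call.
    fn try_begin(&self, catalog_id: i64, alias: &str) -> bool {
        self.keys
            .lock()
            .unwrap()
            .insert((catalog_id, alias.to_string()))
    }

    /// Must be called when the refresh ends, on success or failure,
    /// so later refreshes for the same key are allowed again.
    fn finish(&self, catalog_id: i64, alias: &str) {
        self.keys
            .lock()
            .unwrap()
            .remove(&(catalog_id, alias.to_string()));
    }
}
```

A caller that gets false from try_begin would simply return the cached value, which matches the entry's statement that cache hits stay on the fast path while a single background refresh runs.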
Phase 7a.2 / 7b.2 / 7c.2 — operator-facing rotation + audit storage + cache-refresh primitive (v2.42.0 server#127 + v2.43.0 server#129 + server#131): 7a.2 wraps the Phase-7a rewrap_storage_string primitive with POST /api/internal/wallet/rotate-kek (batched cursor scan of noetl.credential + noetl.keychain, RotateSummary { processed, rewrapped, skipped, failed, last_id } for progress checkpointing) and GET /api/internal/wallet/key-status (per-version row counts). 7b.2 adds the noetl.secret_audit table (CREATE TABLE IF NOT EXISTS at startup; server-owned), DbAuditSink impl, and GET /api/internal/secret-audit?credential=&execution_id=&from=&to=&limit= (bounded; ORDER BY occurred_at DESC; hard cap 10_000). 7c.2 adds KeychainService::should_refresh(catalog_id, keychain_name, execution_id, scope_type, now) — cache-side companion of the Phase-7c decision primitive; reads the row's expires_at, asks secrets::dynamic::should_refresh_default (honours KEYCHAIN_CACHE_REFRESH_WINDOW_SECS), bumps noetl_secret_refresh_total{outcome="triggered"} on true. Backward compatible. Next: 7c.3 stampede-collapse mutex + background re-resolve; 6d.1/.2/.3 cloud-specific dynamic providers. Phase 7c — token auto-renewal primitives (closes Phase 7) (v2.41.0, server#125): final named round of the Secrets Wallet umbrella. secrets::dynamic::should_refresh(expires_at, refresh_window, now) decision primitive (true iff expires_at set + still valid + within refresh window); KEYCHAIN_CACHE_REFRESH_WINDOW_SECS env (default 60). noetl_secret_refresh_total{outcome} counter (triggered|succeeded|failed|stampede_collapsed; failed alert-worthy) + noetl_secret_refresh_duration_seconds histogram (50ms–5s buckets). 5 new unit tests; lib 427/0. Lib-only. All named phases (1–7) of the Secrets Wallet umbrella are complete. Remaining queue is discrete follow-up sub-issues only. Phase 7b primitives — secret-resolution audit service (v2.40.0, server#123): durable audit trail of every credential resolution. 
AuditEvent struct (NEVER contains the secret value); bounded Operation + Outcome enums; AuditSink trait + NoopAuditSink default + SecretAuditService wrapper with record_async (fire-and-forget, never blocks resolver) + record_strict (await, used when compliance mandates the row exist before the value releases) + record (dispatches by strict-mode). NOETL_SECRET_AUDIT_REQUIRED env (default false; 1/true/TRUE/yes/YES enable strict). noetl_secret_audit_writes_total{operation, outcome, status} counter (failed_strict alert-worthy). 8 new unit tests; lib 422/0. Lib-only. Phase 7a — KEK rotation primitives (v2.39.0, server#121): starts Phase 7. KeyManager::current_key_version() trait accessor; EnvelopeCipher::rewrap_storage_string primitive (parse → if same version Skipped; else unwrap with historical version → re-wrap with current → Rewrapped { old_key_version, new_key_version, new_storage_string }). Plaintext payload NEVER reconstructed — pure DEK re-wrap, AES-GCM ciphertext bytes stay byte-identical. noetl_wallet_rotate_total{table, status} counter (skipped|rewrapped|failed_unwrap|failed_wrap|parse_error). 4 new unit tests; lib 414/0. Lib-only. Phase 6e — cross-region broker (closes Phase 6) (v2.38.0, server#119): BrokerRegistry (region → broker_url from NOETL_SECRET_BROKER_REGISTRY env; empty default = pre-6e fail-closed); POST /api/internal/cross-region/resolve peer endpoint validates expected_entry_region == server_region() (defensive against stale peer registries), resolves locally, seals via Phase-5a primitives to the requesting worker's pubkey directly; get_sealed handler falls back to broker on AppError::ResidencyViolation; KeychainDef.no_broker_fallback per-credential opt-out; AppError::CrossRegionUnreachable → HTTP 502. Two new metrics: noetl_secret_broker_call_total{broker_region, outcome} + noetl_secret_broker_call_duration_seconds{broker_region} histogram (50ms–5s). 10 new unit tests; lib 410/0. 
Both residency shapes operational: hard isolation (strict + no broker = HTTP 403) + soft federation (strict + broker registered = transparent cross-region routing). Phase 6 closes. Phase 6d primitives — dynamic-secret support + cache honors issuer TTL (v2.37.0, server#117): SecretValue.expires_at: Option<DateTime<Utc>> field; src/secrets/dynamic.rs cache_decision() honors min(default_ttl, expires_at - now - safety_margin) and returns SkipCacheAlreadyExpired when the deadline is already past or inside the operator's safety margin; KEYCHAIN_CACHE_DYNAMIC_SAFETY_MARGIN_SECS env (default 60); resolve_keychain_entry_with_meta returns the bundle's earliest expires_at; CredentialService::resolve_via_provider consumes the helper. Two new metrics: noetl_secret_dynamic_ttl_seconds histogram (1m/5m/15m/1h/4h/12h buckets) + noetl_secret_cache_skip_total{reason} counter. 7 new unit tests; lib 398/0. Backward compatible (providers without expires_at keep the 600 s default). Phase 6c — residency-policy gate (v2.36.0, server#115): KeychainDef.residency enum (none|advisory|strict, default none) + allowed_regions allowlist; resolver runs the gate at the top of resolve_keychain_entry BEFORE any provider call so strict-mode mismatches short-circuit with AppError::ResidencyViolation (HTTP 403, clear "credential X is region-locked to Y; this server is in Z" message that NEVER includes the value). noetl_secret_residency_check_total{policy, decision} counter — strict + violation_blocked is alert-worthy, advisory + violation_allowed is the migration-window signal. 8 new unit tests; lib 391/0. Phase 6b — ProviderRegistry + per-(provider, region) metrics (v2.35.0, server#113): server-side cache of (provider_id, region) → Arc<dyn SecretProvider> so the resolver doesn't rebuild from env on every cache-miss; RwLock + double-checked locking on the build path so concurrent get_or_build for the same key only builds once. Optional TTL via NOETL_SECRET_PROVIDER_TTL_SECONDS. 
New noetl_secret_provider_build_total{provider,region,status="cache_hit|ok|error"} counter + noetl_secret_resolve_duration_seconds{provider,region} histogram (5 ms – 5 s buckets). 7 new unit tests; lib 383/0. Phase 6a — region tag on keychain entries + per-region routing (v2.34.0, server#111): starts Phase 6 (residency-aware distributed resolution). KeychainDef.region optional field (no schema migration — lives in existing JSON blob); SecretRef.region provider-agnostic; AWS provider consumes it as the regional endpoint with explicit precedence (<region>: ref prefix > field > legacy project overload > AWS_REGION env). New NOETL_SERVER_REGION env + server_region() / effective_region() fallback helpers. noetl_secret_resolve_total{provider,region,status} counter per observability.md Principle 1. 5 new unit tests; lib 376/0. Lib-only — backward compatible. Phase 5b — wire format + sealing endpoint (v2.33.0, server#107): new GET /api/credentials/{id}/sealed?worker_id=<name> returns a SealedEnvelope (X25519-sealed credential JSON) addressed to the named worker; workers opt in by including worker_public_key in their register payload's runtime JSON blob — no schema migration; 400 BadRequest when the worker_pool row exists but didn't register a key; noetl_credentials_sealed_total{status} counter + credential.seal span per observability.md. Kind-validated end-to-end (Python cryptography + HKDF + ChaCha20-Poly1305 opens the envelope → recovers the bearer token + scope round-trip). Phase 5a — sealed payload crypto primitives (v2.32.0, server#107): src/crypto/sealed.rs X25519 ECDH + HKDF-SHA256 + ChaCha20-Poly1305 sealed-box (nonce derived from the shared secret, AAD pins alg+v for clean alg-mismatch rejection); 12 unit tests (round-trip, tamper, alg/version-mismatch, JSON wire stability); lib 369/0. Defense-in-depth on top of Phase-4 mTLS — cleartext never enters the response body. Lib-only; 5b adds the runtime-registry worker pubkey + sealing endpoint, 5c the worker side. 
Providers 3.x — AWS Secrets Manager + Azure Key Vault (v2.31.0, server#105): two new backends behind the one SecretProvider trait completing the 5-provider matrix. AWS SM uses hand-rolled AWS Signature Version 4 signing (no aws-sdk dep tree; signing key verified by a unit test against AWS's published reference vector); ref shape [<region>:]<secret-id>[#<json-key>] with JSON-key extraction for multi-field secrets; creds from env (the IRSA-injected triple). Azure KV uses IMDS Managed Identity (AKS/VMs) with TTL-cached bearer; ref shape [<vault>/]<secret-name>[#<version>]; sovereign clouds via NOETL_AZURE_KEYVAULT_DNS_SUFFIX. 21 new unit tests; lib 357/0; cloud-only backends (kind-val at unit-test layer like GCP). Phase 4a — opt-in TLS/mTLS listener (v2.30.0, server#103): the worker↔server credential channel (GET /api/credentials/<alias>) was plain HTTP; opt-in TLS via NOETL_TLS_CERT+NOETL_TLS_KEY (+NOETL_TLS_CLIENT_CA ⇒ mTLS), ring rustls + axum-server bind_rustls, kind-validated (200 w/ client cert, rejected w/o, plain HTTP refused). Providers 3.x — HashiCorp Vault provider (v2.29.0, server#101): a provider: vault keychain alias resolves from a Vault KV v2 secret (X-Vault-Token; ref [<mount>/]<path>#<key>), kind-validated end-to-end against an in-cluster Vault — second backend validatable on kind after K8s. /api/executions list perf + status fix (v2.28.1, server#99, #62): candidate-first rewrite (start-event index, not a 3.2M-row seq scan) — 6.5 s → 0.015 s (~430×), identical list; bool_or status-drift fix (was all-RUNNING). Secrets Wallet #61 providers 3.x — Kubernetes Secrets provider (v2.28.0, server#97): a provider: k8s keychain alias resolves from an in-cluster Secret via the API server + ServiceAccount token + cluster CA — the first secret backend kind-validated end-to-end with a real value (GCP needs GKE). 
Orchestrator-strand fix (v2.27.2, server#95): a deterministic evaluate failure (an invalid template in a step code body, an unknown step in a next arc, malformed routing) now emits a terminal playbook.failed instead of stranding the run in RUNNING forever — surfaced by the #54 e2e sweep (closed server#94). Parser fix (v2.27.1, server#93): NextSpec untagged-variant order — the list form next: [{step: x}] was deserialized into a struct Router positionally, silently dropping its arcs (and defeating unknown-step validation); sequence-shaped variants now precede the struct. Secrets Wallet Phase 3c — keychain cache (v2.27.0, server#91): execution-scoped, envelope-encrypted, TTL'd cache so an auth: "{{ alias }}" lookup isn't re-fetched from the secret manager per step; + fixed the keychain storage layer (queries never matched the table — also repairs the /api/keychain endpoints). Phase 3 (resolution) complete — R3b (v2.26.0) resolves a provider: gcp keychain alias from GCP Secret Manager on a credential miss; built on R3a/R2/R1 (v2.23.0–v2.25.0). Phases 1–2: Cloud KMS for the KEK (v2.22.0); envelope encryption (v2.21.0)
noetl/worker Rust NATS pull worker v5.45.0 📦 #104 Phase E + F MERGED — side-effect durability barrier + result-tier DR re-derive (NOETL_SIDE_EFFECT_BARRIER E adopt-only exactly-once side effects, worker#130 → v5.44.0 d696f7e, depends on published noetl-tools 3.17; NOETL_RESULT_TIER_DR F materializer verify-and-repair byte-identical, worker#131 → v5.45.0 dd07016; both default-off, inert in prod). Prior: 🪶 #104 Phase D MERGED — the minting flip (NOETL_RESULT_MINT_AUTHORITATIVE, default off: the result materializer becomes the authoritative tier writer (implies the Phase B flag) + resolve-by-URN becomes the primary consume read path (implies the Phase C flag), with the dual-written result_store as the fail-safe fallback; new noetl_worker_result_mint_authoritative_total{path} = tier
noetl/tools Shared tool registry crate v3.17.0 📦 #104 Phase E MERGED — registry::kind_is_side_effecting side-effect classifier the worker durability barrier consumes (conservative default true; only noop/rhai false) (tools#78, v3.17.0 1d49dd5; members noetl-directives + noetl-locator re-published first). Prior: 📦 #104 Phase B — ResultCoordinates::parse/from_locator on the slim noetl-locator (the inverse of logical_uri — recover (tenant, project, execution_id, step, frame, row, attempt) from the worker-stamped URI). Additive, so the member bumped to noetl-locator 0.1.1 (published) + noetl-tools v3.16.0 7da39d8 re-exports it (tools#77). Prior: 📦 #104 Phase A — extract the slim dependency-free noetl-locator crate (noetl-locator 0.1.0, pure std); noetl-tools re-exports it as noetl_tools::locator (tools#76, v3.15.0 dc0c5d8): the Resource Locator (ResourceLocator, ResultCoordinates, shard_key, CellPlacement, legacy-ref parse) is now a lean workspace member so the control-plane noetl-server can parse the canonical result URI without pulling noetl-tools' heavy graph (duckdb/kube/arrow/tonic/rhai). feat(locator): subject → noetl-tools minor + the release CI published the new member noetl-locator 0.1.0 to crates.io (member-publish order: locator before the root). Worker stamp path (noetl_tools::locator::ResultCoordinates) unchanged. Mirrors the lean noetl-directives member (#92). Prior: ✅ #95 — postgres pg_value_to_json temporal/identity serialization (tools#75, v3.14.2 6d9b674): the postgres tool probed i64/i32/f64/bool/String/json/DateTime<Utc> and fell through to Value::Null for everything else — a tz-naive timestamp (NaiveDateTime) column serialized to null even though the value was present (the auth0_login expires_at: null repro). 
Added arms for timestamp (ISO-8601, no offset suffix — a tz-naive value carries no zone), date/time, uuid (hyphenated lowercase), numeric/decimal (exact decimal string via a direct lossless decode of the postgres numeric binary wire format), bytea (base64). 409 lib tests + clippy + a kind-gated live-postgres before/after integration test (tests/postgres_temporal_kind.rs). Prior: ✅ #127 — task_sequence per-sub-task context optimization (behavior-preserving) (tools#74, v3.14.1 c8656c1): the drain rebuilt the template context per sub-task (running_ctx.clone() + 2–4× to_template_context() deep-clones + per-block ExecutionContext clones + a fresh context_to_value() per templated field). TemplateEngine::render_value now builds the proxied minijinja context ONCE and threads it through the recursion (render_value_with/render_with; minijinja Value is Arc-backed → reuse is a refcount bump), and new build_context_with_overlay(&variables, overlay) builds straight from &variables + a small overlay, skipping the intermediate HashMap deep-clone. Isolated micro-bench: per-sub-task context cost 2988.9µs→1147.1µs (−61.6%, 2.6×). 407 lib tests + 2 new equivalence pins + clippy clean. Worker adopts via worker#125 (v5.40.4). Closes #127. Prior: ✅ #125 — task_sequence honours all control-flow directives (do: jump/break/retry) (tools#73, v3.14.0 638c3c6): the Rust runner previously matched only "fail"; do: jump/to:, do: break, do: retry (attempts/backoff + infinite-jump guard) now work. ✅ #126 — http tool exposes parsed body under data (Python-era contract restored) (tools#72, v3.13.1 8dd0e1f): body was under data.body; now under data with body as a back-compat alias. Together, the two fixes make the 10×1000 batch pft_flow_test pass end-to-end on kind. Perf follow-up #127 filed. 
Prior: #103 — deferred (ack-after-processing) ack (v3.13.0, tools#71): AckMode::Defer in the subscription SourceClient surfaces a durable per-message ack handle (NATS $JS.ACK reply subject) instead of acking inline; SourceClient::ack(ack_ids, AckDisposition) = Ack/Nack/Term (NATS + Pub/Sub); tool operation: ack|nack|term. Opt-in — existing callers unchanged. The capability the worker materializer (v5.34.0) drives for ack-after-materialize. Prior: #90 Phase 4 — store-and-forward spool engine + per-downstream circuit breaker (v3.4.0, tools#54): noetl_tools::spool — circuit breaker (trip/half-open/close, NATS-KV-serializable, per-downstream OQ2), SpoolItem (SHA-256 + noetl://spool ref + recv_seq-ordered keys), nats_object/local_disk backends, ordered-replay engine (global/per_key/none + idempotency + dead-letter + retention/GC). 44 unit tests + real-NATS integration test. Prior: #90 Phase 2 — header-directive engine + public build_source factory (v3.3.0, tools#52): source/directives.rs DirectiveSpec/DispatchPlan turn allowlisted message headers into dispatch instructions (redirect dispatch.playbook, dispatch.execution_pool, priority→pool, idempotency_key, content_type/schema_hint, W3C trace), untrusted by default (allowlist + value-allowlist enforced at parse; multi-value last-wins; applied[] audit). Public build_source(cfg, ctx) so the worker continuous runtime constructs the same SourceClient. 12 new tests. Prior (v3.2.0): bounded-drain subscription tool + SourceClient (Phase 1). Prior: Multi-tool sibling references (v3.1.1, tools#48; closes noetl/ai-meta#87): in a tool: [list] step, TaskSequenceTool stored each sub-tool's result for the aggregated output but never injected it into the running context, so a later sub-tool's {{ <label>.<field> }} rendered empty — masked in quoted positions, a syntax error at or near "," in unquoted numeric SQL (save_edge_cases test_large_payload). 
Fix injects each sub-tool's result under its label (with a synthetic .data self-ref matching build_context) so later siblings + a later python sub-tool's stdin variables resolve it. 2 new unit tests; lib 300/0. Kind-validated: save_edge_cases test_large_payload → record_count = 100, save_delegation_test clean. Worker adopts via worker#69 (b97f642). Prior: e2e-sweep cleanup (v3.1.0, tools#47; tracks noetl/ai-meta#49): YAML boolean when: true in policy rules now checks as_bool() before the string-template fallthrough (Value::Bool as_str() returns None); `
noetl/cli Rust CLI + local-mode runner v4.11.0 noetl subscribe — local-mode subscription listener (RFC #90 Phase 6) (cli#60, ai-meta → 2fb3fb0): standalone listener + FileEventSink JSONL + local_disk spool; cli-only (noetl-tools v3.5.0 source+spool reused unchanged). Prior: --include-data flag doc fix (cli#58)
noetl/gateway Gatekeeper — auth + SSE + push-ingress v3.3.0 #90 Phase 3 — push-ingress (Mode C) + auth-gated directive trust (gateway#28): POST /ingress/{listener} verifies HMAC / bearer / Pub-Sub-OIDC → only-then directives → one POST /api/execute per delivery on the dedicated pool (verify-and-forward, no DB on the ingress path); verify_then_plan makes the auth gate a testable invariant; first /metrics surface. Live E2E green (HMAC 12/12 + bearer 12/12). Prior (v3.2.0): Phase F R3b-2 shard-info twin endpoint.
noetl/noetl Python control plane (legacy; retained for back-compat) v4.12.1 (ecd16a2) ✅ #115 Phase 2 DDL (noetl#667, ecd16a2): canonical schema_ddl.sql prev_event_id columns on noetl.event + noetl.command + idx_event_prev_event_id for fresh installs (the Rust server also ensures them idempotently at startup). Otherwise deprioritized per Rust-only direction; pytest debt at noetl/noetl#663 parked
noetl/ops Helm + manifests (untagged) 🪶 #104 Phase B — NOETL_RESULT_MATERIALIZER_ENABLED (default false) + single-cell seed NOETL_RESULT_CELL_ENV/REGION/CELL on the kind system-pool deployment; prod manifests untouched (ops#203, c92753c). Prior: k8s-watcher durable image + pod-level OOM classification (ops@cacc513, ops#168; closes noetl/ai-meta#80, tracks noetl/ai-meta#43): retired the dead bitnami/kubectl:1.30.3 (removed from Docker Hub; cluster was on the bitnamilegacy stopgap) for alpine/k8s:1.30.3 (kubectl + jq + curl baked in) — the prior runtime install never put curl on PATH so callback POSTs returned HTTP 000. classify_pod_failure() now reads the backing Pod's status (RBAC already grants pod reads) to emit failed_oom (OOMKilled) / failed_image_pull (ImagePullBackOff); build_body's completed_at fallback uses RFC3339 `now
noetl/docs Docusaurus site (untagged) ADR Implementation-status block
noetl/travel Reference SPA (domain-fork example) (untagged) Production
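
The noetl/server row above describes two timing decisions for dynamic secrets: the Phase 7c `should_refresh(expires_at, refresh_window, now)` primitive and the Phase 6d `cache_decision()` TTL clamp. The following is a minimal std-only sketch of both rules as the row states them, not the server's actual code. The `Epoch` alias (seconds instead of `DateTime<Utc>`), the `Cache` variant name, the exact parameter order of `cache_decision`, and the boundary comparisons (`<=` vs `<`) are assumptions made for illustration.

```rust
use std::time::Duration;

/// Seconds since the Unix epoch. Assumption: stands in for `DateTime<Utc>`.
type Epoch = u64;

/// True only when an expiry is set, the secret is still valid,
/// and the remaining lifetime is within the refresh window.
fn should_refresh(expires_at: Option<Epoch>, refresh_window: Duration, now: Epoch) -> bool {
    match expires_at {
        None => false,                    // no expiry: nothing to renew
        Some(exp) if exp <= now => false, // already expired: not "still valid"
        Some(exp) => exp - now <= refresh_window.as_secs(),
    }
}

#[derive(Debug, PartialEq)]
enum CacheDecision {
    /// Hypothetical variant name: cache the value for this long.
    Cache(Duration),
    /// Named in the dashboard: the deadline is past or inside the safety margin.
    SkipCacheAlreadyExpired,
}

/// TTL = min(default_ttl, expires_at - now - safety_margin).
fn cache_decision(
    default_ttl: Duration,
    expires_at: Option<Epoch>,
    safety_margin: Duration,
    now: Epoch,
) -> CacheDecision {
    match expires_at {
        // Providers without an expiry keep the default TTL.
        None => CacheDecision::Cache(default_ttl),
        Some(exp) => {
            let deadline = exp.saturating_sub(safety_margin.as_secs());
            if deadline <= now {
                CacheDecision::SkipCacheAlreadyExpired
            } else {
                CacheDecision::Cache(default_ttl.min(Duration::from_secs(deadline - now)))
            }
        }
    }
}

fn main() {
    let now: Epoch = 1_000;
    let sixty = Duration::from_secs(60);   // default window and margin per the row
    let default_ttl = Duration::from_secs(600);
    println!("refresh: {}", should_refresh(Some(1_030), sixty, now));
    println!("cache:   {:?}", cache_decision(default_ttl, Some(1_300), sixty, now));
}
```

With a 60 s safety margin, a token expiring 300 s from now is cached for 240 s rather than the 600 s default, and a token expiring 50 s from now is not cached at all.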

Architecture at a glance

Today — kind cluster is fully Rust

The 2026-06-04 session retired all Python deployments + their services + configmaps from the kind cluster (per the user directive "delete all legacy stuff"). Local validation runs against the Rust topology by default:

              ┌──────────────────┐
              │  Gateway         │  noetl/gateway  v3.2.0
              │  (Rust)          │  auth · SSE · subscriptions · shard routing
              └────────┬─────────┘
                       │ HTTPS
              ┌────────▼──────────────────────────────┐
              │  noetl-server-rust  (v2.19.7)         │
              │  catalog · execute · events ·         │
              │  /api/internal · DbPoolMap sharding   │
              │  orchestrator engine · SSE            │
              │  workbook resolution · pipeline parse │
              └────────┬──────────────────────────────┘
                       │
              ┌────────▼─────────┐
              │   NATS JetStream │  NOETL_COMMANDS stream
              │   + Postgres     │  noetl.event + noetl.command
              └────────┬─────────┘
                       │
        ┌──────────────┴──────────────┐
        ▼                             ▼
  ┌──────────────────┐         ┌──────────────────┐
  │ noetl-worker-    │         │ worker-system-   │
  │ rust v5.11.3     │         │ pool             │
  │ (shared pool)    │         │ (Rust, runs      │
  │                  │         │  system          │
  │ noetl-tools      │         │  playbooks)      │
  │ v2.18.1:         │         │                  │
  │ python · shell · │         │ Consumer:        │
  │ http · postgres  │         │  ..._pool_system │
  │ duckdb · rhai ·  │         │ Filter:          │
  │ task_sequence ·  │         │  noetl.commands. │
  │ playbook · noop  │         │  system.>        │
  └──────────────────┘         └──────────────────┘
   Consumer:                     [Outbox publisher +
    ..._pool_shared               projector migrated
   Filter:                        to system playbooks
    noetl.commands.               via Phase 2.a]
    shared.>

Retired this session (2026-06-04): noetl-server (Python deploy), noetl-worker (Python deploy), noetl-outbox-publisher (Python deploy), noetl-projector (Python statefulset), and the noetl/noetl-ext/noetl-projector/noetl-worker-metrics services + 4 legacy configmaps + noetl-worker SA. The kind cluster is the regression-test topology for Rust-only e2e.
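
The two worker pools in the diagram are separated by their consumer filters (`noetl.commands.shared.>` and `noetl.commands.system.>`). As a rough illustration of how such a filter selects commands, here is a small std-only matcher for NATS-style subject wildcards, where `*` matches exactly one token and `>` matches one or more trailing tokens. The function name and the sample subject tails are hypothetical, and the sketch does not validate that `>` appears only as the last filter token.

```rust
/// Returns true when `subject` matches the NATS-style `filter`.
/// `*` matches exactly one token; `>` matches one or more remaining tokens.
fn subject_matches(filter: &str, subject: &str) -> bool {
    let mut f = filter.split('.');
    let mut s = subject.split('.');
    loop {
        match (f.next(), s.next()) {
            // `>` needs at least one remaining subject token.
            (Some(">"), Some(_)) => return true,
            // `*` consumes exactly one token.
            (Some("*"), Some(_)) => continue,
            // Literal tokens must be equal.
            (Some(ft), Some(st)) if ft == st => continue,
            // Both exhausted at the same time: full match.
            (None, None) => return true,
            // Length mismatch or differing literal.
            _ => return false,
        }
    }
}

fn main() {
    let shared = "noetl.commands.shared.>";
    let system = "noetl.commands.system.>";
    // Hypothetical subject tails, for illustration only.
    for subject in ["noetl.commands.shared.http.42", "noetl.commands.system.cleanup"] {
        println!(
            "{subject}: shared={} system={}",
            subject_matches(shared, subject),
            subject_matches(system, subject)
        );
    }
}
```

Because the two filters differ in a literal token, no command subject can match both, so each command is delivered to exactly one pool's consumer.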

[Status legend: ✅ = shipped + kind-validated]

v10 playbook compatibility on Rust (closed loop)

Six interlocking gaps fixed in one iteration brought control_flow_workbook end-to-end on the Rust-only stack — exercising the complete v10 control-flow surface:

playbook YAML  ──►  noetl-server-rust orchestrator                 noetl-worker-rust
                                                                   + noetl-tools v2.18
─────────────────  ──────────────────────────────────────────────  ──────────────────
workload: {...}   ► #56 workload + input alias decode               PythonTool wrapper
tool.kind: python ► (existing dispatch)                             globals().update(args)
tool.kind:                                                          ► #17 capture
  workbook        ► #59 parser substitutes inline action            `result = {...}`
                                                                    global → data
tool: [{...}]     ► #57 ToolDefinition::Pipeline accepts both       ► #18 TaskSequenceTool
  (pipeline)        flat (name-as-field) + nested (label-as-key)     runtime
                                                                    
{{ step.field }}  ► #60 build_context exposes step data at top
  in next.arcs      level (not just steps.<name>); apply_event
                    captures call.done before command.completed
                    overwrites
                    
command.failed    ► #58 trigger_orchestrator on command.failed
                    + dedicated short-circuit in process_in_progress
                    emits playbook.failed terminal
                    
worker → server   ► #55 EventEmitRequest accepts i64 wire shape
  event emission    (was rejecting integer execution_id)

Validated end-to-end:

noetl exec tests/fixtures/playbooks/control_flow_workbook
→ playbook_started
→ start (python)
→ eval_flag (workbook→python; is_hot=true captured via marker)
→ hot_path (next.arc when="{{ eval_flag.is_hot == true }}" matched)
→ parallel hot_task_a + hot_task_b
→ playbook.completed ✅
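
The #60 row above says `build_context` exposes step data at the top level as well as under `steps.<name>`, which is what lets the arc condition `{{ eval_flag.is_hot == true }}` resolve. A minimal sketch of that idea follows, assuming a simplified `Value` type and a dotted-path `lookup` helper; both are hypothetical stand-ins rather than the orchestrator's real types, and a step literally named `steps` would collide in this simplified version.

```rust
use std::collections::HashMap;

/// Simplified value type (assumption: the real engine uses a richer JSON-like value).
#[derive(Clone, Debug, PartialEq)]
enum Value {
    Bool(bool),
    Map(HashMap<String, Value>),
}

/// Expose every step result both at the top level and under `steps.<name>`.
fn build_context(step_results: &HashMap<String, Value>) -> Value {
    let mut root = step_results.clone(); // top-level exposure
    root.insert("steps".to_string(), Value::Map(step_results.clone()));
    Value::Map(root)
}

/// Resolve a dotted path such as `eval_flag.is_hot`; `None` if any segment is missing.
fn lookup<'a>(ctx: &'a Value, path: &str) -> Option<&'a Value> {
    path.split('.').try_fold(ctx, |node, key| match node {
        Value::Map(m) => m.get(key),
        _ => None,
    })
}

/// Context after the `eval_flag` step from the validated run has produced `is_hot = true`.
fn sample_context() -> Value {
    let mut eval_flag = HashMap::new();
    eval_flag.insert("is_hot".to_string(), Value::Bool(true));
    let mut results = HashMap::new();
    results.insert("eval_flag".to_string(), Value::Map(eval_flag));
    build_context(&results)
}

fn main() {
    let ctx = sample_context();
    // The arc condition compares the resolved field against `true`.
    let take_hot_path = lookup(&ctx, "eval_flag.is_hot") == Some(&Value::Bool(true));
    println!("take hot_path: {take_hot_path}");
}
```

Both `eval_flag.is_hot` and `steps.eval_flag.is_hot` resolve to the same value in this sketch, so playbooks written against either form keep working.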

Long-term Python trajectory

Per the Rust-only direction, Python pieces stay only as:

  1. Container payloads — runtime stays Rust; user code that wants Python ships in a container dispatched by the container tool kind (#43, in design).
  2. Back-compat GKE deployments — existing Python pods on the production cluster aren't removed (yet) since GKE traffic still uses them; no new feature work goes there.

The kind cluster is the canary for the Rust-only topology. Once the operator runs the validation rigs end-to-end against their live cluster, the R5 cutover decision will move the production topology to match.

Sessions log

Chronological notes on what each session accomplished — see Sessions Log.

Most recent (top of log):

  • 2026-06-23🪶 #104 OQ5 Option A SHIPPED — producer-staged result tier (worker v5.46.0); decouples the over-budget tier write from result_store (flag default-off). worker#132 + e2e#81 + docs#186 merged, ai-meta pointer #128; ops#206 soak-gate rules open. result_store-retirement soak defined but NOT started (gated on a prod producer-stage flip).
  • 2026-06-23🚀 #104 — WI/ADC GCS auth MERGED (server#265, fad5d8a → noetl-server v3.45.0) + ROLLED TO PROD GKE: server runs as the WI-bound KSA noetl-server-rust (@sha256:d3cbf1ad…) with the result-tier GCS ENV applied (OBJECT_STORE_BACKEND=gcs, bucket noetl-demo-19700101-results, auth=adc, cell usc1-a/256); /api/internal/cells reads it back; all tier-enable flags stay OFF → tier inert. Server healthy auth=adc (no token minted — lazy), off-server cutover sole-writer/lag-0/never-scan (smoke + system/scheduled_cleanup COMPLETED, published==acked=13, chain clean), 0 restarts. Prod configured + WI-live, ready for staged B→C→D enablement.
  • 2026-06-23🪶 #104 Phases E + F MERGED — the FINAL build phases (side-effect durability barrier + result-tier GC + DR), repo-only, flags default-off → inert in prod. Phase E: tools#78 → noetl-tools v3.17.0 (registry::kind_is_side_effecting, members noetl-directives + noetl-locator re-published first), worker#130 barrier → v5.44.0 (repointed onto published 3.17, no patch), e2e#79. Phase F: server#264 GC sweeper → noetl-server v3.44.0, worker#131 DR re-derive → noetl-worker v5.45.0, ops#205 + e2e#80. ai-meta pointers + deployment-spec wiki rows bumped. A–F build phases complete; #104 stays OPEN (operational items remain).
  • 2026-06-23🪶 #104 Phase D MERGED — the minting flip (dual-write window). All 4 PRs squash-merged in dependency order (server → worker → ops → e2e), flags default-off → inert in prod. Repo-only; prod minting cutover is a separate next task. NOETL_RESULT_MINT_AUTHORITATIVE makes the URN → Feather/GCS tier the authoritative result store with noetl.result_store kept as the reversible dual-write fallback. No new crate publish — no Cargo.toml dep change in either PR; server resolves noetl-locator 0.1.1, worker resolves from the registry. semantic-release cut noetl-server v3.43.0 + noetl-worker v5.43.0: server#263 → v3.43.0 6f6b9ef, worker#129 → v5.43.0 be6863a, ops#204 b19b759, e2e#78 07e85aa. ai-meta pointers bumped. OQ5 — result_store retirement — DECIDED metric-gated (drop dual-write once mint_authoritative_total{path=legacy_fallback} holds 0 across a staging soak + retention time floor), gated on a not-yet-done byte-source re-plumbing prerequisite (materializer fetches the payload from result_store today). #104 stays OPEN (Phase E/F + minting prod-cutover remain).
  • 2026-06-23🪶 #104 Phase C MERGED — resolve-by-URN read path (server GCS object backend + cell registry + GET /api/internal/cells; worker resolve-by-URN + fixes B/B1, closes OQ6). All flags default-off → inert in prod. Merged in dependency order (server → worker → e2e): server#262 → noetl-server v3.42.0 c2d5ca9, worker#128 → noetl-worker v5.42.0 7971041 (38 unit tests), e2e#77 39dc880 (3-pass rig + fake-gcs). No new crate publish — worker resolves noetl-tools 3.16.0 + adds published arrow=53; server resolves noetl-locator 0.1.1, both from the registry (no repoint). ai-meta pointers bumped; #104 stays OPEN (Phase D minting flip remains).
  • 2026-06-22🪶 #104 Phase B MERGED — shadow Feather result tier (5 PRs, flags default-off → inert in prod). Merged in dependency order: tools#77 (→ noetl-tools v3.16.0 + noetl-locator 0.1.1 published; member bumped 0.1.0→0.1.1 pre-merge so the additive parse/from_locator API publishes) · server#261 (→ v3.41.0, sibling noetl_result_materializer consumer) · worker#127 (→ v5.41.0, the consume-loop) · ops#203 (flag + cell seed) · e2e#76 (kind rig). No downstream repoint needed (both resolve from the registry, no git/branch dep). ai-meta pointers bumped; #104 stays OPEN (C–F remain). Session details
  • 2026-06-22📦 #104 Phase A MERGED — slim noetl-locator 0.1.0 extracted + published to crates.io; server accepts the canonical result URI behind NOETL_RESULT_URI_ACCEPT (default off). Repo-only, no prod deploy. Squash-merged tools#76 with a feat(locator): subject → noetl-tools v3.15.0 (dc0c5d8) and the release CI published the new workspace member noetl-locator 0.1.0 (member-publish order: locator before the root crate). Repointed server#260's temporary git-dep to noetl-locator = "0.1.0", confirmed cargo build/cargo tree resolve from the registry with the heavy graph absent (duckdb/kube/arrow/tonic/rhai/gcp_auth = 0), 623 tests green; squash-merged → noetl-server v3.40.0 (c89d078) + the e2e#75 kind rig (eeca8b7). Resolves the umbrella's OQ7. Flag default-off → inert in prod until a later server rollout. ai-meta pointers bumped (repos/tools → dc0c5d8, repos/server → c89d078, repos/e2e → eeca8b7); #104 stays OPEN.
  • 2026-06-22🧹 CQRS / off-server program close-out + tracking reconciliation (repo-only, no prod changes). Merged e2e#74 (squash 1deadf1, the prod-scoped Rust regression validator, 28/30 against live prod) + bumped repos/e2e. Closed #103 (CQRS event log — prod cutover live + validated; e2e#74 was its last piece), #102 (server#198/#199 landed; worker-side superseded by #103), #101 (acceptance met + kind-validated; consume side moved to #115 Phase 1). Closed superseded Python ops PRs #189/#190/#192 (replaced by the Rust in-process worker materializer noetl-worker/src/materializer.rs). Archived the completed 2026-06-18-orchestrate-plugin-dissolution + stale 2026-06-09-rust-stack-session-snapshot handoffs. Reconciled roadmap board 3 + this Active-umbrellas table. Fixed the failing release-server CI (cargo publish — added a version to the noetl-orchestrate-core path dep + publish the member first, mirroring noetl/tools).
  • 2026-06-22🐘 #95 CLOSED — postgres pg_value_to_json temporal/identity serialization shipped to prod end-to-end. The Rust postgres tool returned null for a tz-naive timestamp (NaiveDateTime) column plus date/time/uuid/numeric/bytea (the auth0_login expires_at: null repro). Merged tools#75 (squash 06302ac → semantic-release noetl-tools v3.14.2 6d9b674, published to crates.io) + worker#126 pin bump 3.14.1 → 3.14.2 (squash 60a849d → noetl-worker v5.40.5 da24952). Built the prod image via Cloud Build us-central1 (noetl-worker-rust:v5.40.5 @sha256:45212dbe…) and rolled it by digest onto prod noetl-worker-rust + noetl-worker-system-pool (server stays v3.39.5, CPU 250m/2 kept) — rolling restart clean (2/2 + 1/1, 0 crashloop), off-server CQRS cutover stayed healthy (materializer sole-writer lag 0, command lag 0, project_errors 0, WAL index rehydrated 5 execs/198 events per #119). ai-meta repos/tools → 6d9b674 + repos/worker → da24952.
  • 2026-06-22✅ #127 CLOSED — the task_sequence per-sub-task context optimization merged, released, and shipped to prod; the code-opt + CPU-limit bump now compound on the live off-server cutover. Merged tools#74 (squash 9dd9aa6 → semantic-release noetl-tools v3.14.1 c8656c1, published to crates.io) — render_value builds the proxied minijinja context ONCE + threads it through the recursion (render_value_with/render_with), and new build_context_with_overlay skips the redundant to_template_context() HashMap clone + per-block ExecutionContext clones; micro-bench −61.6%/2.6× per-sub-task context cost (407 tests + 2 equivalence pins + clippy). Bumped the worker noetl-tools pin 3.14 → 3.14.1 (worker#125, squash 1a10a73 → noetl-worker v5.40.4 0afbf5c; deps-only, no worker source change, clippy clean). Built the prod image via Cloud Build (us-central1-docker.pkg.dev/noetl-demo-19700101/noetl/noetl-worker-rust:0afbf5c) and rolled it onto prod noetl-worker-rust + noetl-worker-system-pool (server stays v3.39.5/.6, worker CPU limits kept req 250m/limit 2). Rolling restart clean (pods Ready, 0 crashloop), and the off-server CQRS cutover stayed healthy throughout — materializer sole-writer projected==acked, lag ~0, command lag 0, executions completing, system-pool rehydrated its WAL index (#119). ai-meta repos/tools → c8656c1 + repos/worker → 0afbf5c pointers bumped + wiki. #127 closed + board 3 → Done.
  • 2026-06-22⚡ #127 PROD WORKER CPU LIMIT RAISED 1→2 + APPLIED LIVE — the ~20% batch-throughput win materialized on prod. ops#202 (85e3c23) raised request 100m→250m / limit 1→2 on both noetl-worker-rust + noetl-worker-system-pool; ai-meta repos/ops pointer bumped. Kind profiling found the step workers pegged one CPU and spent 38–47% throttled at the 1-CPU limit (per-sub-task running_ctx.clone() + to_template_context() hot path); 2 CPU cut a 10k batch 166–172s→137.6s (~20%, zero patient loss). Live prod baseline confirmed 1 CPU (kind had 0.5); nodes 4× ek-standard-16 (~63.5 CPU) at ~2% → headroom. Applied via kubectl set resources: rolling restart clean (all pods Ready, 0 restarts), off-server CQRS cutover stayed healthy (materializer sole-writer lag 0, WAL index rehydrated). PROD gate vars / image / DB untouched; the deeper context-clone hot path is the structural follow-up. #127 stays OPEN.
  • 2026-06-22✅ #123 SHIPPED + CLOSED — a non-iterable loop in: no longer silently wedges (commands=0, RUNNING forever); the off-server drive now decodes the {"error":…} envelope and emits a terminal playbook.failed (server v3.39.6). server#258 (squash 275b914 → v3.39.6 7f109a9): apply_worker_orchestration decodes the drive ERROR envelope (decode_orchestrate_error) and emits the terminal failure (metric noetl_orchestrate_drive_total{stage="drive_error"}, structured execution_id), matching the in-process drive; orchestrate-core prefixes the offending step name onto the existing evaluate_loop error. An empty iterable ([]/{}) still short-circuits to next. 600 server + 135 orchestrate-core tests + clippy clean; kind-validated prod-exact (absent workload.batch_slots → FAILED with a clear message, valid [1,2,3] still COMPLETED 3-way, #120/#124 unaffected, 0 restarts). Code-only fix (worker stays v5.40.3); ai-meta pointer bumped → v3.39.6; #123 closed + board 3 → Done. PROD stays healthy on v3.39.5 + the live off-server cutover (NOT redeployed — ships to prod on a future server rollout). #127 stays OPEN.
  • 2026-06-22✅ #121 SECOND-HALF SHIPPED + CLOSED — off-server system/* WAL-chain wedge fully fixed (server v3.39.5). server#257 (squash 54ac277 → v3.39.5 c421273) gates BOTH off-server-drive sites in trigger_orchestrator_inner on should_publish(catalog_id) so system/* execs drive server-built run_state; regular execs keep the off-server path. #256 was only the first half (#121 reopened after the v3.39.4 prod re-cutover still wedged system/scheduled_cleanup on the system-pool worker). Pure is_system_path helper + unit test; 612 tests + clippy; kind full-gate BEFORE 9× WAL chain incomplete → AFTER COMPLETED 0 loops + regular loop still off-server. Server-only (worker stays v5.40.3); ai-meta pointer bumped; #121 closed + board 3 → Done. PROD re-cutover = Phase-B follow-up this session.
  • 2026-06-21 ✅ #121 first-half SHIPPED (partial) — off-server WAL-chain-incomplete loop on system/* execs (link gate-off command.claimed through ChainHeads + don't route system/* execs to the off-server WAL drive); server v3.39.4; server-only; live prod off-server cutover wedge is the real-world repro.
  • 2026-06-21✅ #125 + #126 SHIPPED + CLOSED — task_sequence do: jump/break/retry control flow + http tool body data-shape fixed (noetl-tools v3.14.0 tools#73 + v3.13.1 tools#72); worker v5.40.3 adopts the fixes (worker#124). Full 10×1000 batch pft_flow_test now passes end-to-end on kind, correctness-clean (zero patient loss). Perf follow-up #127 filed. #121/#123 stay OPEN.
  • 2026-06-21✅ #124 SHIPPED + CLOSED — distributed task_sequence forward set:/sibling bindings no longer render empty; defer sub-task templates with unresolved vars to the worker re-render (render_value_deferring_unresolved); server v3.39.3 (server#255). #121/#123/#125/#126 stay OPEN.
  • 2026-06-21✅ #120 SHIPPED + CLOSED — reduce barrier no longer deadlocks (commands=0) on open/asymmetric loop joins (orchestrate-core); runtime liveness filter in the barrier (server v3.39.2 28e8950). server#254 (squash fbb855f); test test_open_loop_back_edge_does_not_block_dispatch, 133/133 + clippy, kind-validated 2×2 matrix COMPLETE; PROD/defaults untouched. #121/#123 stay OPEN.
  • 2026-06-20✅ #116 program-scale step 2 SHIPPED + multi-replica gate-ON validated — execution-affinity single-owner WRITE ORDERING. server#252 v3.39.0 + e2e#71. Step 1 (KV coherence) was necessary-not-sufficient — the command.issued prev-read + head CAS-advance are two non-atomic steps, so concurrent cross-replica emits forked the chain. Affinity routes every trigger (POST /api/events, which fires the drive) to the single replica that ShardConfig::owns(execution_id) owns; a non-owner forwards a reverse-proxy POST (one-hop loop guard, degrade-to-local). On the owner the single-process drive lock + in-memory ChainHeads make the read→advance atomic, no distributed lock; KV is the genesis/handoff vehicle (owner resolves LOCAL → kv_remote_hit→0). server#252 (5e00d0a, v3.39.0): src/affinity.rs, flags NOETL_EXECUTION_AFFINITY/NOETL_PEER_URL_TEMPLATE/NOETL_SHARD_INDEX_FROM_HOSTNAME (all default off), metric noetl_execution_affinity_total{outcome}. e2e#71 (66b6e1b): 2-replica StatefulSet topology + rig HARD gate. Multi-replica gate-ON kind PASS: linear/loop/fanout COMPLETE, every chain roots=1/dangling=0/walk==total (NO fork), forwarded_ok +9, never-scan (scans+0) + sole-writer across replicas; single-replica unchanged; 595 server tests + clippy green; baseline restored. Follow-up #117: off-server from_events spine ordered by event_id wedges fan-in under a chain-order≠id-order inversion (affinity + high-concurrency fanout) — fix = order spine by prev_event_id walk; linear/loop already reliable. Prod multi-replica verdict: write-ordering COMPLETE (no fork) — prod can horizontally scale the off-server stack for linear/loop; high-concurrency fan-out needs #117 first. All affinity flags default off; PROD GKE untouched.
  • 2026-06-20✅ #115 program-scale step 1 SHIPPED — multi-replica coherence DATA LAYER (NATS-KV-backed ChainHeads + ExecDescriptor); execution-affinity STAGED. server#251 v3.38.0 + e2e#70. NOETL_REPLICA_COHERENCE=nats_kv (default local, prod unchanged) backs the off-server drive's watermark + descriptor with JetStream KV buckets so 2+ replicas resolve the same value — head advance = CAS (one chain), descriptor = CAS merge; in-process maps = write-through cache / degraded fallback (local → bit-identical). server#251 (8f39a79): src/coherence.rs (CoherenceKv + lazy buckets), ChainHeads/ExecDescriptors async, ExecDescriptor serde, metric noetl_replica_coherence_total{structure,op,outcome} (proof kv_remote_hit). e2e#70 (e222877): kind_validate_replica_coherence.sh + un-staled the #113/#114 offload asserts behind NOETL_RIG_EXPECT_OFFLOAD (default false; under refs_in_state=true the offload paths legitimately stay flat). Kind-validated: single-replica nats_kv is bit-for-bit parity with local (linear/loop/fan-out ×2 all COMPLETE, roots=1/dangling=0/walk==total, state_build_event_scans +0, hotpath scan +0, sole-writer intact); 2-replica proved cross-replica resolves work (kv_remote_hit advanced for head + descriptor, no kv_unavailable). Necessary but NOT sufficient: on 2+ replicas concurrent cross-replica emits still fork the chain (the issuing_event head-read vs head-advance is non-atomic across replicas → observed forked chains + a cross-execution prev), so executions don't reliably COMPLETE on 2+ replicas yet — the remaining piece is execution-affinity (one replica owns an execution's drive + chain write; substrate present in src/sharding.rs shard_for/owns), STAGED as program-scale step 2. 588 server tests + clippy green; baseline restored. ai-meta pointers → server 8f39a79 (v3.38.0) + e2e e222877. PROD GKE untouched; default local; no gate/mode/builder default changed. 
    Off-server architecture is multi-replica-COHERENT (data) but not yet multi-replica-COMPLETE (write-ordering) — prod cutover stays single-replica until affinity lands.
  • 2026-06-20✅ #115 Phase 5 SHIPPED + gate-ON validated — atomic-working-item context (tenet 6): the drive hands a worker only its minimal declared slice; NOETL_ATOMIC_ITEM_CONTEXT (default off). #77 (Explicit Input Binding) resolved. server#250 v3.37.0 + worker#121 v5.40.0 + e2e#69.
  • 2026-06-20✅ #115 Phase 6 SHIPPED + gate-ON literal-zero validated — hot-path noetl.event read class RETIRED; the table is AUDIT-ONLY (server v3.36.0). NOETL_EVENT_READ_PATH=event_scan|audit_only (default event_scan, prod unchanged) retires the remaining lifecycle readers of noetl.event (the WHERE execution_id replay class outside the drive). server#249 (b71ca1d): under audit_only get_catalog_id (per-ingest) + inherit_parent_trace + subscription dedup-audit + container-callback catalog/existence serve from the in-memory execute-time ExecDescriptor; a cold descriptor (post-terminal straggler after eviction / restart) resolves catalog_id from noetl.command (synchronous queue) — never a noetl.event scan. Proof metric noetl_event_hotpath_reads_total{site,outcome}. ops#199 (e5b0737) pins event_scan on the prod server manifest (operator-gated flip). e2e#67+#68 (0ab3c0a) kind_validate_event_read_path_phase6.sh. Gate-ON kind-validated (PUBLISH_ONLY + offserver + materializer + audit_only): hot-path scan Δ0 (served_descriptor +96 + served_command +3), drive state_build_total Δ0 + event_scans Δ0 ⇒ ZERO noetl.event scans anywhere on the hot path, end-to-end; linear/loop/fan-out/output_select COMPLETE; sole-writer + lag-0; audit still works (direct SELECT + status COMPLETED + replay event_count=25); committed gate rig PASS with audit_only on (no regression); 585 server tests + clippy green; baseline restored. Completes the RFC never-scan end state (tenet 3) under the flag. ai-meta pointers → server b71ca1d (v3.36.0) + ops e5b0737 + e2e 0ab3c0a. PROD GKE untouched; default event_scan; no gate/mode/builder default changed. Remainder = Phase 5 (atomic-item, needs #77) + program-scale (per-shard WAL, multi-replica descriptor coherence).
  • 2026-06-19✅ #115 Phase 4 KERNEL + FLAG SHIPPED + shadow kind-validated — off-server state builder (worker v5.37.0 + server v3.33.0); drive cutover staged. The pool-side state_builder (worker#118, fef961c) reconstructs WorkflowState from the noetl_events WAL — a per-execution chain index walks prev_event_id head→root, caches the spine keyed by the immutable chain head, advances only the new tail. A live WAL shadow loop (NOETL_STATE_BUILDER_SHADOW, default off) + the server NOETL_STATE_BUILDER=offserver|server flag (server#246, 3e6006d, default server). Gate-ON kind-validated on live cluster: shadow replayed the WAL → chain-walked spines whose indexed==spine sizes match Phase-3 topologies (linear 13, loop 62, fan-out 25, output_select 31, storage_tiers 55) = parity by construction; WAL-read wal_events_total=993 with event_scans_total=0; cache cold_rebuild=28 (replay/restart) + incremental=21 (live tail-advance — fresh fan-out indexed Incremental(5), indexed==spine==25==DB event_rows 25); fresh fan-out COMPLETED gate-ON, event_rows==distinct, 0 __orchestrate__ event rows, materializer pending=0/project_errors=0; 8 worker unit tests + 2 server config tests + clippy green; baseline restored. ai-meta pointers → worker fef961c + server 3e6006d. PROD untouched; no default changed. The offserver drive cutover (drive consumes the builder's state) + Phase 5 (atomic-item context, #77) / Phase 6 (retire event read path) remain.
  • 2026-06-19✅ #115 Phase 3 MERGED — chain-walk state builder (server v3.32.0); Phase 4 (off-server state builder + WAL cache) started. server#245 self-merged (no classifier block) → server v3.32.0 (8338417); ai-meta repos/server pointer bumped. Behind NOETL_STATE_BUILD_MODE=chain_walk (default event_scan, prod unchanged) the drive reconstructs WorkflowState by walking prev_event_id head→root (in-memory ChainHeads head + (execution_id,event_id) PK lookups — never a WHERE execution_id scan) → same from_events (orchestrate-core unchanged; parity by construction); falls back to event-scan on cold-head / lag / non-genesis. Gate-ON kind-validated (prior session): parity 41/41 MATCH, event_scans_total=0 / 1064 PK hops / 0 fallbacks, all topologies COMPLETE, sole-writer + lag-0 + gate rig PASS, 577 tests + clippy green. Phase 4 now in progress — move the chain-walk state construction OFF the server onto the system worker pool reading the WAL/NATS stream, pool-side cache keyed by the immutable chain head + incremental tail-advance (server chain-walk + event-scan remain fallbacks). PROD GKE untouched; no gate default changed.
  • 2026-06-19✅ #115 Phase 2 implemented + kind-validated — one-level prev_event_id event chain (server#244 + noetl#667 merged). Each noetl.event carries prev_event_id (the immediately-previous event in causal order) + each noetl.command the issuing-event link, so per-execution events form a walkable singly-linked list followable pointer-by-pointer without scanning noetl.event (additive; no reader yet — Phase 3). The emit chokepoint emit_events stamps the event link from a per-execution chain-head watermark (ChainHeads) — one path covers drive events + command.issued + worker-lifecycle on both the gate-off INSERT and the gate-on publish, the materializer persisting it; the command link = the real step.enter/unblocking completion so cursor-fan-out bodies share their branch origin (§4.4). Server-only (no orchestrate-core change). Chain-correctness proven gate-ON across 6 executions (linear 13/13, loop 62/62, fan-out 25/25 with a real shared branch origin, sub-playbook 46/46, + Phase-1 output_select 31/31 & storage_tiers 55/55 bounded): each has 1 root, 0 dangling / 0 duplicate event prev, 1 head, pointer-walk == full sequence (no gaps), real-step command dangling=0; kind_validate_orchestrate_gate.sh PASS (sole-writer 25==25, 0 dup cycles, catalog0=0, lag 0); 573 lib tests + clippy green. PROD untouched; no gate default changed. Awaiting merge → ai-meta pointer bump; Phase 3 (chain-walk state builder) next.
  • 2026-06-19 🐞 #114 oversized-command.issued offload shipped (server v3.29.5); refs_in_state consume side (#101) is the remaining off-server-drive cutover blocker. With #113's decode fix in place, 4 large-context fixtures wedged at a distinct second stall: under refs_in_state=false the off-server drive embeds the full resolved upstream context into the next command, so its command.issued event (~1.32MB) exceeded NATS max_payload → publish ack-timeout → wedge. Fix: a command context over NOETL_COMMAND_CONTEXT_MAX_BYTES (512KB) is offloaded to noetl.result_store with a {__context_ref__} marker; get_command/claim_command resolve it before the worker sees it (metrics context_offloaded/context_ref_resolved). Server #242 → v3.29.5 + rig phase 8 e2e#64. Kind gate-ON: off-server rig PASS (new test_oversize_command_context COMPLETED, max command.issued ctx 585B, offload+resolve fired, 0 __orchestrate__ event rows, materializer lag 0); every command.issued event <1MB across all fixtures; 6 of #113's 9 now COMPLETE. Chose ref-on-oversize over refs_in_state=true (candidate #1): a kind experiment proved refs_in_state=true fixes the state-bloat (kind_playbook_lease_expiry completes — drive-state 29KB vs ~1MB loop) but breaks the bulk-consuming fixtures (test_storage_tiers/test_output_select fail at the bulk step) because the worker render-time ref-resolution isn't implemented — so the default stays false. The remaining 3 fixtures + the cutover (#107/#111) now hinge on the refs_in_state consume side (#101): __orchestrate__ drive-state bloat (17.4MB for storage_tiers) + the _ref/bulk-resolve gap. ai-meta → server 385d21f (v3.29.5) + e2e 9919392. No prod default flipped; prod is pre-#108 in-server drive (unaffected).
  • 2026-06-19 🐞 #113 off-server drive — recover offloaded drive result + stop drive on cancel (server v3.29.4); #114 opened. Fixed the worker-driven drive stall when an __orchestrate__ result exceeds the 100KB inline budget (worker offloads it with only a reference.ref → server now resolves+decodes it via result_store.resolve, metric ref_resolved, instead of dropping → non-convergent re-loop) + the cancel non-stop facet (match underscore playbook_cancelled + ExecutionState::is_terminal terminal guard, no restart). Server #241 → v3.29.4 + rig e2e#63. Kind gate-ON proven (785KB result → ref_resolved→COMPLETED, 0 decode WARNs; cancel froze a drive-loop instantly; sole-writer lag 0). 5/9 #113 large-context fixtures COMPLETE; the other 4 hit a distinct oversized-command.issued (full upstream context embedded → >1MB NATS payload) stall → #114 (#113 stays open until all 9 close). ai-meta → server 1e844c1 + e2e 12b27e9. No prod default flipped.
  • 2026-06-19🚀 #103 GKE pre-flip PREP — prod images pushed, GMP monitoring live, manifests staged; NO traffic flip / NO PUBLISH_ONLY. Verified prod already-Rust (the #49 cutover is done; pre-#103 live images), both flip secrets present, monitoring = Google Managed Prometheus (not VM). Pushed server v3.29.3 + worker v5.35.0 to the prod AR (amd64); applied + verified GMP PodMonitoring (worker+server /metrics — the noetl ns had none) + materializer-lag Rules (up{namespace="noetl"}=4 live); staged the roll-forward manifests (not applied — they roll live workloads); runbook gained a "Production (GKE)" section + GMP managedAlertmanager pager stub. Operator-gated: roll images → materializer shadow → pager → flip. ai-meta e5b6d6c → ops 9edd9c4 (ops PR #197). No prod default changed.
  • 2026-06-19 🛡️ #103 materializer-lag GUARDRAIL shipped — the pre-flip observability gate. The server was FLIP-READY; the remaining gate was a materializer-lag metric + alert. Worker #116 → v5.35.0 extends the JetStream lag poller to track the noetl_events/noetl_materializer consumer on an independent task → noetl_worker_nats_consumer_pending{consumer="noetl_materializer"} climbs even when the materializer loop is dead. Ops #195+#196: VMRule (backlog warning>200/critical>2000/growing + stall-under-gate + project-errors + absent-under-gate, stall guarded on backlog>0), worker /metrics VMServiceScrape (was unscraped), VMAlert enabled, Grafana dashboard, flip runbook noetl-cqrs-publish-only-flip.md. Kind-proven full cycle on the VM stack: green baseline (backlog 0) → induced lag (materializer fault-injected, events publishing under the gate) → gauge 0→684, alerts fire (backlog warning+critical + stall) → recover → drains→0 idempotently (0 dup/loss), alerts clear. ai-meta → worker b910341 (v5.35.0) + ops 2fcfa59 + worker-wiki 0030f30. PUBLISH_ONLY stays default-off.
  • 2026-06-19🎯 #103 server cutover COMPLETE — FLIP-READY. The 2 ExecutionService cancel/finalize sites now route through the emit_event chokepoint (server #240 → v3.29.3 + e2e rig #62); kind-proven both modes; no remaining synchronous server event writers under the gate. Flipping PUBLISH_ONLY on is now a staged operator decision.
  • 2026-06-19 🎯 #104 off-server-drive × gate reconciliation PROVEN — the last real blocker before the PUBLISH_ONLY flip is operator-safe. The combination #103 left unproven (gate-on was only ever validated with the in-process drive) is green on kind: gate-ON (PUBLISH_ONLY=true) with the off-server drive (PLUGIN_DRIVE=true) + materializer sole writer → fresh exec + cursor fan-out → COMPLETED; server wrote 0 noetl.event rows (all 25 PUBLISHED — event_ingest_published_total=25), materializer materialized all 25 exactly once (25 rows == 25 distinct ids, 0 catalog_id=0, 0 dup cycles), drive dispatched=applied, read-your-writes held (the relocated trigger fires post-materialize → server rebuilds state from committed log before bounding the off-server drive input). Server #238 → v3.29.2 (76d29bb): cold-cache apply now rebuilds WorkflowState from the durable log (the #104 WAL-rebuild principle) instead of dropping the in-flight result on a server restart mid-drive — kind crash-recovery proof: hard-kill mid-drive → cold_rebuild metric+log fires → that exec COMPLETES with full event integrity. Committed e2e rig kind_validate_orchestrate_gate.sh (e2e#61). Regression green: gate-off + in-process (prod default) and gate-off + off-server. ai-meta → server 76d29bb + e2e 61f7a5c. Remaining before a safe flip: only the 2 ExecutionService cancel/finalize sites. No prod default changed.
  • 2026-06-18🎯 #108 (c) — the worker-driven orchestrator drive is now the DEFAULT; #108 CLOSED. Flipped NOETL_ORCHESTRATE_PLUGIN_DRIVE to default true (server#233, v3.28.0 → server@80cc0e6). Gated on a scale soak on kind (images built from the released tips, server v3.27.0 / worker v5.33.0): a single 694-drive cursor+fan-out run (test_pft_flow_v2 3×40) COMPLETED with __orchestrate__ rows in noetl.event = 0 (event_suppressed +2082) and all 694 drives claimed on the system pool (shared pool got only the 671 real-step commands), 0 errors; 5× concurrent self-contained cursor = 5/5 COMPLETED. Then deployed the flipped image with no env var — the default-on path reproduced the identical shape (361 drives, system-isolated, 0 burst); 15/15 regression fixtures green; the revert (=false) verified to fall back to the in-process drive (system delta 0). In-process trigger_orchestrator_inner kept as the fallback. ai-meta → server 80cc0e6 + worker 437b0be + server-wiki 0210012.
  • 2026-06-18 orchestrate drive isolated on the SYSTEM pool via pool affinity (#108 follow-up b, kind-validated). Server stamps execution_pool on the command notification (server#232 → server@846166b); worker declines (ACK+skip) notifications not for its pool segment (worker#114 → worker@e2162b7), so the drive runs on the dedicated system pool even under JetStream consumer-filter drift. No worker HTTP pending-poll — the NATS consumer is the only claim vector. Validated: __orchestrate__ claimed+executed on the system pool (3), zero on the default pool, simple_python COMPLETED; 553+196 tests green. Only (c) the deliberate default-flip remains.
  • 2026-06-18 the orchestrate meta-command touches noetl.event ZERO times (#108 follow-up a, kind-validated). dispatch_orchestrate_command stops writing command.issued to noetl.event; the command lives only in noetl.command, and claim_command/get_command fall back to it on a miss (noetl.event stays authoritative for normal commands) (server#231 → server@9438f3b). So __orchestrate__ writes 0 of its former 5 rows per drive — the directive that system playbooks keep only their own state is met. Validated: cursor+fan-out COMPLETED via the noetl.command fallback, 0 event rows, 20 real steps normal, 0 errors. Remaining: NATS affinity (ops) + default-flip.
  • 2026-06-17 system playbook events no longer burst Postgres + system-pool routing (#108 slice 4b, kind-validated). The __orchestrate__ meta-command is infrastructure, not a workflow step, so the server now skips persisting its lifecycle events to noetl.event (handle_event_inner + claim_command) (server#230 → server@6aef3a6). At scale they'd burst noetl.event/Postgres for no benefit. Validated: __orchestrate__ now writes only the lone command.issued (1 of 5 — 80% fewer rows); cursor+fan-out flow still COMPLETED. Drive routes to the system segment (true isolation pending a NATS-affinity ops fix; resilient via the pending-poll meanwhile). Follow-ups: eliminate the last command.issued (claim from noetl.command) + NATS affinity.
  • 2026-06-17 🎯 the orchestrator drive runs OFF-SERVER on the worker pool (#108 slice 3, kind-validated). With NOETL_ORCHESTRATE_PLUGIN_DRIVE=on the server issues system/orchestrate (entry: run_state, args = the bounded WorkflowState) to the worker pool instead of evaluating in-process; the worker runs the drive, the server applies the result on the command's call.done (server#229 → server@465cdbb v3.23.0). Kind: test/simple_python drove start→end→COMPLETED through the round-trip (dispatched=2, applied=2, 0 decode_error, __orchestrate__ didn't leak as a step, playbook.completed). Default off, in-process fallback. Bug caught+fixed: output_b64 rides call.done not command.completed. Next: shadow→flip at scale, make drive the default, route to the system pool.
  • 2026-06-17 worker-driven cutover slice 2: apply_orchestration_result extracted + slice 3 designed (#108). The post-evaluate emission (events → commands → terminal) is extracted verbatim from trigger_orchestrator_inner into a reusable fn (server#228 → server@586aeae) so the worker-driven drive applies a worker-computed result identically. Behavior-preserving (553 tests green, clippy clean). Slice 3 (dispatch) designed + grounded: apply_event would phantom-step a meta-command, so the design uses a reserved __orchestrate__ step ignored in state, a flag-gated scheduler, apply-on-callback, and loop-prevention. Lands behind NOETL_ORCHESTRATE_PLUGIN_DRIVE (default off), kind-validated before the flip.
  • 2026-06-17 worker-driven cutover slice 1: configurable wasm guest entry (#108). The worker can now dispatch a named plug-in export (worker#113 → worker@04420d0): tool: {kind: wasm, plugin: {path, version, entry}} names the export (default run); the worker-driven orchestrator will use entry: run_state. invoke_bytes_with_entry + run_by_ref_entry/run_and_apply_by_ref_entry (originals delegate with run); test proves run→0xAA vs run_state→0xBB + missing-export error. Purely additive, no live change. Server scheduler+apply (the hot-path round-trip, default-off flag) next.
  • 2026-06-17 orchestrate plug-in drives the real workload identically, live (#108 slice 4). The orchestrator runs the plug-in alongside the in-process drive on every evaluation + diffs commands (server#227 → server@bd652ab) — a process-global wasmtime host (feature orchestrate-shadow) loaded from noetl.plugin_module at boot, gated NOETL_ORCHESTRATE_PLUGIN_SHADOW; in-process result authoritative. Kind-validated over the live 10×1000 PFT: noetl_orchestrate_shadow_total{result="match"} 529, ZERO mismatch/error, workers stable. Plug-in gains a state-input path (run_state); both build configs green (default no wasmtime). Slices 1-4 prove orchestrator-as-plug-in end to end; next is the worker-driven cutover.
  • 2026-06-17 system/orchestrate@1 registered + servable in a deployed server (#108 slice 3). The server bakes the orchestrate wasm into its image and seeds built-in system plug-ins into noetl.plugin_module on boot (server#226 → server@b21b589, kind-validated). New src/system_plugins.rs (pure dir-scan + sha256, unit-tested) + a wasmbuilder Docker stage + NOETL_SYSTEM_PLUGIN_DIR; in-process upsert (not the token-gated HTTP surface); digest-keyed hot-reload. Validated: GET /api/internal/plugins/system/orchestrate?version=1 → 200 application/wasm 1559093 bytes, ETag=digest, stale→409, baked sha256 == served digest. Next: kernel scheduler dispatches the now-registered plug-in.
  • 2026-06-17 system/orchestrate plug-in runs identically to native in wasmtime (#108 slice 2). A wasmtime shadow-diff (server#225 → server@ccec104) loads the built .wasm through a harness mirroring the worker host's invoke_bytes ABI byte-for-byte and asserts the wasm output equals the native drive over auth0 multi-arc when: routing (minijinja in wasm) + cold-start. Finding: command-set identity (parsed Value eq), not raw bytes — the context map serializes in insertion order (serde_json preserve_order ← upstream HashMap iteration, differs wasm32 vs host arch); the scheduler deserializes to Vec<Command>, so the command set is the bar. 2 unit + shadow-diff green; plug-in excluded; test-only. Next: catalog register/serve → kernel scheduler.
  • 2026-06-17 system/orchestrate WASM plug-in exists — drive core runs as a 0-import module (#108 slice 1). New standalone plugins/orchestrate/ crate (server#224 → server@10a629b) wraps the drive behind the worker plug-in ABI (input = JSON event-slice + playbook; output = JSON OrchestrationResult; data-plane = memory/alloc/run) and compiles to wasm32-unknown-unknown — the first non-trivial compiled system playbook. Feasibility risk retired: the .wasm has 0 imports (no WASI, no host render) — the whole drive incl. minijinja runs in-guest. Native parity test reproduces native evaluate byte-for-byte; 551 server tests green, crate excluded from the workspace. Next: worker-host shadow-diff → catalog register/serve → kernel scheduler (NOETL_ORCHESTRATE_PLUGIN, default off).
  • 2026-06-17 Orchestrator drive core fully wasm-resident — Event-ABI round #109 CLOSED. Slice 3 (server#223) moved orchestrator/evaluate from src/engine/ into noetl-orchestrate-core. All 6 drive modules (renderer, playbook model, commands, evaluator, state, orchestrator switch) now compile native + wasm32-unknown-unknown — the system/orchestrate plug-in seed (#108). evaluate reads the pure core::event::Event; server converts db::Event at the trigger_orchestrator boundary (slice-1 From). 122 core + 565 server tests green, 0 WASI imports on wasm32, clippy clean; cargo-chef image (v3.20.0) kind-deployed, PFT 10×1000 — full command lifecycle, 0 errors, 0 restarts. ai-meta → server bfd3f77 (internal refactor, stays v3.20.0).
  • 2026-06-14 Transfer tool: Snowflake↔Postgres both directions — #99 CLOSED. Both transfer arms implemented with full credential-alias resolution. tools v3.10.0 (tools#65) + worker v5.22.0 (worker#87) + e2e#58. SF→PG: $n::text::<udt> coercion + RFC3339 timestamp reformat; PG→SF: generated INSERTs. Full bidirectional data_transfer/snowflake_postgres fixture COMPLETED on kind against live sf_test account. tools → 4127b4b · worker → 6d97e7c · e2e → 94aa7f1.
  • 2026-06-14Snowflake key-pair JWT validated end-to-end — #98 last external-tool gap closed; transfer step → #99. noetl-tools v3.9.0 / v3.9.1 / v3.9.2 (tools#62/#63/#64) — key-pair JWT auth (bypasses MFA) + User-Agent fix (code 391903) + SQL-API context-in-body + multi-statement split (codes 391911 + 000008). Worker bumped to v3.9.2 (worker#83#86) + e2e#57 fixture cleanup. create_sf_database (CREATE DATABASE) + setup_sf_table (CREATE TABLE + INSERT) both COMPLETED via JWT on kind against the live sf_test account (NDCFGPC-MI21697). Transfer step fails (inline creds, no key-pair fields) — filed #99. ai-meta pointers: tools a216ab2 · worker 9d6b127 · e2e e191231.
  • 2026-06-12#90 Phase 7 shipped — scale hardening; #90 CLOSED (all 7 phases complete), live proof green. Final phase: server v3.5.0 (server#189) POST /api/execute/batch (N→N, partial-failure contained) + opt-in exactly-once dedup window (noetl.subscription_dedup, bounded-by-age, race-safe, default off); worker v5.19.0 (worker#79) batch dispatch + dedup opt-in + per-subscription rate limits (deterministic token-bucket RateGovernor, fetch-side backpressure → source keeps backlog, no loss, subscription.rate_limited event); ops (ops#176) + e2e (e2e#48); no tools change → no crate cascade. Live on kind: batch 12→12 COMPLETED on the subscription pool + per-message traceparent; dedup duplicate→1 execution + subscription.message.deduplicated; rate-limit engaged + 10/10 → executions (no loss); direct-curl within/outside-window + dedup-off + batch partial-failure all green. ai-meta → server 7b217d8 + worker 7531f4a + ops 6db69b9 + e2e 203593b. #90 closed; follow-ups tracked: #91#94 + tools#57.
  • 2026-06-12#90 Phase 6 shipped — CLI local noetl subscribe + FileEventSink + local_disk spool (live local proof green). Added noetl subscribe <spec.yaml> (cli v4.11.0, cli#60, closes cli#59): a kind: Subscription listener run standalone in local mode — no k8s, no NATS-dispatch server for the listening itself — reusing the same noetl_tools source clients + directive engine + spool engine the in-cluster worker uses, emitting the same ExecutorEvent envelope to a local FileEventSink (one event/line JSONL → replayable trail). Local dispatch (RFC §5.3): in-process via PlaybookRunner (pure-local default) or POST /api/execute. local_disk spool (§8.6): circuit-breaker + buffer + ordered replay + idempotency + dead-letter against a local dir, circuit state in a local file. New src/subscribe/{mod,spec,sink,dispatch,runtime,spool}.rs + examples/subscribe/. cli-only — no tools change / crate cascade (the source+spool surface ships in noetl-tools v3.5.0; bumps the lock 3.0.0 → 3.5.0 via the executor's "3"). Tests: 12 subscribe + full bin suite (53) green, incl. a deterministic outage→spool→ordered-replay→idempotency proof on the real engine. Live (in-cluster NATS on kind): 5 msgs → received=5 dispatched=5 failed=0 (19-event JSONL trail); local_disk spool outage → 6 message.spooled (0 dispatched, no loss) → recovery → 6 message.replayed in order → drained to 0. Finding: the NATS source ignores URL-embedded user:pass (async-nats ConnectOptions) — specs use explicit user/password. ai-meta → cli 2fb3fb0 (v4.11.0); wiki cli subscribe. #90 stays open for Phase 7 (scale hardening, volume-gated).
  • 2026-06-11 #90 Pub/Sub + Kafka brought to live-E2E parity with NATS (validation gap closed). Stood up the two remaining subscription brokers in kind — Pub/Sub emulator (gcloud SDK image) + single-broker KRaft apache/kafka:3.9.1 — under noetl/ops (ops#170), and added bounded-drain fixtures + kind-validate runners under noetl/e2e (e2e#41). Both backends passed the same live bar as NATS: publish/produce 5 → bounded drain count=5 acked=true → execution COMPLETED → call.done/command.completed/playbook.completed event trail. No adapter code change needed — the pure-Rust kafka crate talks to Kafka 3.9 KRaft and the Pub/Sub REST backend works against the emulator as-is. The one fix: the <step>.output.<field> accessor never resolved (both when: arcs skipped → drain stalled); corrected to <step>.<field> in the fixtures + the latent ops subscription_drain.yaml example. Validated on server v3.1.0 + worker v5.15.2 + tools v3.2.0; cluster left on that clean released stack. ai-meta → ops 568a4ac + e2e 8d21e7a. #90 stays open (Phases 2–7 design-only).
  • 2026-06-11 #89 shipped — JSON null round-trips through {{ step }} (server fix, v3.0.6). #89 (null → undefined serialization) — CLOSED. The #88 cursor fixture walked all 4 pages but its 4th check_pagination crashed: the terminal page's next_cursor: null, re-injected via the whole {{ fetch_page }} envelope, rendered as the JS token undefined (invalid JSON), so the consuming Python step received response as a str. Traced the corrupt command.issued args.response to the renderer that builds next-step inputs — the server orchestrator (src/template/jinja.rs::render_to_value), not the worker the issue blamed. json_value_to_minijinja maps JSON null → Value::UNDEFINED; minijinja's map repr emits undefined; render_to_value failed from_str and fell through to a raw string. The noetl-tools engine already had a | tojson retry for exactly this; the server's copy had diverged without it. Fix (server#177, v3.0.6) ports the retry. 5 new tests; 619 lib + 8 parity green; clippy clean. Kind-validated end to end on the live test-server (baseline 4th check_pagination error → fixed success; cursor collects 35, matching offset). ai-meta → server 8e17fbe. Standing direction honored — Claude wrote the Rust directly, no Codex.
  • 2026-06-10 #88 shipped — pagination fixtures read response.body.*; #89 filed. #88 — offset/cursor pagination fixture path — CLOSED. The Rust http tool nests the parsed JSON payload under body ({{ fetch_page }} → {body, headers, status_code}); the fixtures read response.get('data', {}), which resolved to {}, so has_more/next_cursor defaulted falsy and the loop exited after page 1 despite the correct post-#85 machinery. Confirmed the shape against a live http-tool result, then switched both check_pagination steps to response.get('body', {}) (e2e#40). Kind-validated: offset walks 0→10→20→30, users 10/10/10/5, validate_results success 35, playbook.completed COMPLETED; cursor path-fixed + walks all 4 pages (Mg==→Mw==→NA==→null, 35 events fetched) but the terminal page surfaced a distinct worker bug → #89 (worker serializes next_cursor: null as JS undefined when re-injecting {{ fetch_page }}, so the consuming Python step gets an unparseable str). Other pagination fixtures (retry/max_iterations/pipeline*/loop_with_pagination) share the same envelope-key assumption over /api/v1/assessments|flaky ({data, paging}) — flagged, left for follow-up. ai-meta → e2e 72a7525.
  • 2026-06-10 #87 shipped, #85 deferred (e2e sweep follow-ups #85/#87). #87 — multi-tool sibling references — CLOSED. task_sequence (the tool: [list] pipeline runtime) stored each sub-tool's result for the aggregated output but never injected it into the running context, so a later sub-tool's {{ <label>.<field> }} rendered empty — masked in quoted positions, a syntax error at or near "," in unquoted numeric SQL (save_edge_cases test_large_payload). Fix (tools#48, v3.1.1) injects each sub-tool's result under its label (synthetic .data self-ref); worker adopts via worker#69. Kind-validated on a worker built from the fix: save_edge_cases test_large_payload → record_count = 100 (no syntax error), save_delegation_test clean. ai-meta → tools 76f942a + tools-wiki 4962f8b + worker b97f642. #85 — workflow-arc loop re-entry — DEFERRED (kept open). Implemented the dispatch-guard layer (draft server#176): a back-edge detector (cycle + recency) re-enters a completed loop head, so the loop no longer hangs (608 lib tests + 5 new pass). But kind validation surfaced a second blocker — set: ctx.X loop variables are recomputed per orchestrator pass and revert to the workload default when the producing step is re-dispatched (a minimal counter-loop thrashes 0,0,1,0,1,2,…). Full multi-page pagination needs durable event-sourced ctx propagation across iterations — larger than is safe to land well-tested in one session; held as a draft, not merged. Standing direction honored: Claude wrote all Rust directly (no Codex).
  • 2026-06-10 #80 closed — container_callback chain green end to end. Fixing the watcher's missing curl (the literal #80 goal) surfaced two more layered bugs beneath it. Watcher image (ops#168): the manifest used the retired bitnami/kubectl:1.30.3 (removed from Docker Hub; the live cluster was patched to the bitnamilegacy archive) with a runtime apt/apk install step that never put curl on PATH → callback POST returned HTTP 000. Switched to alpine/k8s:1.30.3 (kubectl + jq + curl baked in), dropped the install hack. Server insert (server#173, v3.0.3): once curl worked the POST reached the server and 500'd — the container-callback handler inserted call.done via a stale query targeting an attempt column that doesn't exist on the deployed noetl.event; fixed to the working handlers::events column set. OOM path: the watcher only read Job-level conditions so failed_oom could never fire — added pod-level OOMKilled → failed_oom classification (ops#168); the completed_at fallback for failed Jobs used bare jq now (numeric epoch → HTTP 422), fixed to RFC3339 now | todate; and the e2e fixture's bytes(40MiB) was calloc-lazy (mapped to the zero page, never faulted in) so the container exited 0 — switched to a written-into bytearray that dirties pages and reliably OOM-kills (e2e#38). Verified the kind cluster actually enforces memory limits (120 MiB in a 32Mi pod → OOMKilled exit 137). Rebuilt the server image + reloaded into kind; kind_validate_container_callback.sh both probes GREEN — happy_path → succeeded (delta 1), oom → failed_oom (delta 1). This is the last blocker on the #43 container-callback chain. ai-meta → ops cacc513 + server 5d2cf58 (v3.0.3) + e2e 6aaf06e.
  • 2026-06-10#79 closed — e2e kind-val runners back on the current noetl CLI surface. Both scripts/kind_validate_*.sh runners aborted immediately on error: unrecognized subcommand 'playbook' — they targeted the retired noetl playbook register/execute + noetl execution status/events verbs. The validation logic and the event taxonomy (step.enter / command.completed / node_name / the fan-in barrier) were intact; only the invocation layer had drifted. Fix (e2e#37): noetl register playbook --file, noetl exec <catalog-path> --runtime distributed --json (exec by metadata.path, not the bare name), noetl status <id> --json, and the event log over noetl query (no events verb today — rows wrap under .result, order by event_id since noetl.event has no timestamp column). Added a fail-fast CLI-surface guard to each runner. Validated on kind (server-rust v3.0.1 + worker-rust, :8082): fanout_reduce PASS start-to-finish with no manual workaround; container_callback drives register→exec→COMPLETED cleanly and stops at the metric-delta assertion because the deployed noetl-k8s-watcher image lacks curl (watcher.sh: curl: not found → HTTP 000) — a cluster-side watcher gap tracked on #80. Version-skew note: PATH binary is noetl 2.17.0, repos/cli submodule is v4.10.0; the targeted surface is identical across both, so the runners work on either (the binary lags the submodule by a major line — worth refreshing for parity, not required here). Pointer: e2e → a3594b3; e2e wiki: new Kind-Val Runners page.
  • 2026-06-10#82 closed — GUI credential View/Edit recovered for pre-wallet records. The Secrets Wallet (#61) moved credential storage to forward-only envelope encryption; pre-wallet records now 500 on GET /api/credentials/{id}?include_data=true (Decryption failed: aead::Error), so the GUI View/Edit flow dead-ended on a generic toast (response shape unchanged). Fix (gui#36): View surfaces the real reason + points to Edit; Edit still opens with the list-row metadata (name/type/description/tags) + a warning banner and an empty-but-required data field, so re-entering the secret and saving re-seals the record under the current wallet — recovering it. Validated live against kind + the dev:kind UI on :3001. Also landed e2e#36 (duplicate workload probe-flag keys removed from tooling_non_blocking) and gui#35 (dev:kind convenience script). Pointers: gui → 8cacc9e (v1.11.1), e2e → 4a9ffbc.
  • 2026-06-10#81 closed — noetl-server v3.0.2 fixes the container-tool command type contradiction. ToolSpec.command was Option<String> (scalar) but the container tool kind writes a K8s-Job-style array — an array failed the server's ToolDefinition untagged-enum match (400), a scalar was rejected by the worker's ContainerConfig.command: Option<Vec<String>>. Typed command as Option<serde_json::Value> (same as args); ToolCall::from_spec forwards it verbatim. 2 regression tests; clippy clean (server#172, v3.0.2). Kind-val GREEN end-to-end: server accepts the array command, worker creates the K8s Job, Job reaches Complete 1/1. Server pointer bumped (ai-meta → server bd36672). Chain counter-bump validation stays gated on #79 (runner CLI) / #43.
  • 2026-06-09E2E sweep cleanup — noetl-tools v3.1.0 + noetl-server v3.0.1. Stripped the diagnostic tracing::debug! scaffolding added during the e2e triage, kept the production fixes: YAML when: true boolean + |tojson object-template fallback (tools#47), 64 MB result-store body limit + pipeline command/spec stash (server#171). Pointers bumped (ai-meta@316048c tools, @6590bd6 server); tracks #49. All 7 sweep playbooks PASS on Rust-only kind. Worker crates.io dep-revert deferred — v3.1.0 not yet on crates.io ([skip ci] release commit).
  • 2026-06-08noetl-tools v2.24.2 clippy cleanup + noetl/server#22 closed. Cleared the clippy -D warnings CI gate on noetl-tools (15 warnings across 7 files; all mechanical lint fixes). Closed stale noetl/server#22 (Phase D orchestrator engine port — complete). noetl/server PR #167 (same clippy shape) opened, awaiting merge.
  • 2026-06-05Rust-only regression rig — canonical v10 SQL + http config shapes. Swept ~30 self-contained e2e fixtures against the Rust-only kind stack and fixed three config-shape classes in noetl-tools: postgres command: alias + multi-statement SQL (tools#24, v2.18.3), a task_sequence→duckdb regression test (tools#25), and the duckdb command: alias + http params/headers/form non-string coercion (tools#26, v2.18.4). Worker adopted both (worker#50, worker#51). Newly GREEN: duckdb_test, json_serialization_save, duckdb_retry_query, pagination/{offset,cursor,max_iterations,pipeline}, retry_simple_config. Recovered the cluster first (server had latched into NATS not configured after a podman restart). Server-side follow-up noted: loop_with_pagination renders {{ execution_id }} empty in a multi-statement postgres command.
  • 2026-06-05postgres-tool observability — real SQLSTATE errors. noetl-tools 2.18.2 (tools#21) + worker dep bump (worker#49): the postgres tool surfaces the real SQLSTATE + message instead of the opaque db error. Validated end-to-end — a bad query reports ERROR: relation "..." does not exist (SQLSTATE 42P01) in the call.error event. Closes the last follow-up from the credential/iterator saga.
  • 2026-06-05iterator_save_test GREEN — full v10 + credential + iterator-pipeline surface validated. server#73 (v2.19.7) defers task_sequence _prev/_results refs at command-build so nested-pipeline templates render at runtime. iterator_save_test reaches playbook.completed and writes 3 rows to the real demo_noetl DB — the deepest v10 path (iterator → pipeline → _prev chaining → nested credential → postgres write). Closes the credential + iterator + pipeline chain (server#71, worker#46, worker#48, server#73).
  • 2026-06-05Nested-pipeline credentials + template-timing finding. worker#48 (v5.11.3) — the worker now pre-resolves keychain aliases on task_sequence SUB-tasks; iterator_save_test's nested save_item postgres step connects to demo_noetl. Closes the credential-path chain (store → alias-key → nested resolution, all validated). Last iterator_save_test blocker found + filed: server#72 — the server pre-renders task_sequence {{ _prev.* }} refs (runtime-only) to empty → malformed SQL (a symptom the v2.19.5 Chainable change surfaced).
  • 2026-06-05Keychain-credential path validated on Rust-only. Continuing R5 Tier 4, registered the pg_k8s postgres credential and probed the DB-backed fixtures. Surfaced + fixed a 3-bug chain in the keychain subsystem: credential store bound AES-GCM Vec<u8> to a TEXT column (server#71, v2.19.6); alias resolution read only the auth: key not v10's credential: (worker#46, v5.11.2). Proven: iterator_save_test's create_table connects + runs DDL against the real demo_noetl DB. Third bug — nested-pipeline credentials (task_sequence sub-tasks bypass worker resolution) — filed as worker#47 for a follow-up round. Session details
  • 2026-06-05 v10 control-flow runs end-to-end on Rust-only. Phase F R5 Tier 4 re-probe found + fixed 7 more bugs across the Rust stack (server v2.19.5 server#69 6 commits, worker v5.11.1 worker#44, tools v2.18.1). Four v10 fixtures now reach playbook.completed: start_with_action, end_with_action, loop_test, control_flow_workbook; actions_test fails as expected on a missing TEST_SECRET env. Root-cause chain: catalog SQL type drift → ToolSpec null-serialization → worker array-config drop → orchestrator end-step skip + task_sequence label-wrap → minijinja Lenient-vs-Chainable undefined → end-step trigger gate. Also: rust-analyzer workspace setup + rule (ai-meta@38287b7). Session details
  • 2026-06-04 (late evening) Rust-only e2e complete + legacy cleanup. Six interlocking server gaps closed in one iteration (#55–#60) + two noetl-tools fixes (#15, #16); worker dep bump; kind cluster legacy Python deployments retired. control_flow_workbook runs fully end-to-end on the Rust-only stack. Standing direction pinned: Rust-only focus, ignore Python tasks. Session details
  • 2026-06-04 (afternoon) Pipeline + failure termination + workbook resolution. Three server PRs landed together as v2.19.3 (#61, #63, #65).
  • 2026-06-04 (morning) EE-5 lax decode + workload + input alias. v2.19.1 + v2.19.2 — unblocked Rust worker → Rust server emission + canonical v10 playbook compatibility.
  • 2026-06-04 (early morning) Phase F R4-5 + R4 complete. N=2 shard kind validation script + ExecutionService refactor.
  • 2026-06-03 Phase F R4 series — DbPoolMap N+1 pool layer, AppState wiring, per-execution handler cutover, cluster-wide list fan-out.
  • 2026-06-02 (afternoon) Architecture pivot: rest of migration moves to system playbooks. Closed #30, #45; promoted #46.
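
The #88 and #89 entries above describe two separate failure modes in the pagination fixtures: reading the wrong envelope key, and a null that renders as the token undefined and falls through to a raw string. The sketch below is a minimal Python model of both, not the fixture or server code. The function names, the payload keys (data, has_more, next_cursor) and the toy render_to_value are assumptions made for illustration; only the envelope shape {body, headers, status_code} comes from the log.

```python
import json


def make_envelope(payload):
    # Envelope shape reported in the log for the http tool result.
    return {"status_code": 200, "headers": {}, "body": payload}


def check_pagination(response, key="body"):
    # Hypothetical stand-in for the fixture step: read paging flags
    # from whichever top-level envelope key the fixture assumes.
    payload = response.get(key, {})
    return bool(payload.get("has_more", False)), payload.get("next_cursor")


def render_to_value(rendered):
    # Toy model of the pre-fix fallback: try to parse the rendered text
    # as JSON and pass the raw string through when parsing fails.
    try:
        return json.loads(rendered)
    except json.JSONDecodeError:
        return rendered


page = make_envelope({"data": [1, 2, 3], "has_more": True, "next_cursor": "Mg=="})

# Pre-#88 read: there is no top-level "data" key, so the flags default falsy
# and a loop driven by has_more would stop after the first page.
old_read = check_pagination(page, key="data")

# Post-#88 read: the parsed payload lives under "body".
new_read = check_pagination(page, key="body")

# #89: a null emitted as the token `undefined` is not valid JSON, so the
# consumer receives a str; serializing with a JSON encoder keeps it as null.
broken = render_to_value('{"next_cursor": undefined}')
fixed = render_to_value(json.dumps({"next_cursor": None}))

print(old_read, new_read, type(broken).__name__, fixed)
```

In this model the wrong key fails silently (the step simply sees no more pages), while the undefined token changes the value's type, which is why the two bugs surfaced at different points in the same fixture run.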

Releases

See Releases for the per-repo release log with links to GitHub Releases pages.

Recent (2026-06):

  • 2026-06-23noetl/worker v5.46.0 — 🪶 #104 OQ5 Option A: producer-staged result tier (NOETL_RESULT_PRODUCER_STAGE, default off) — decouples the over-budget tier write from result_store; materializer skip-on-exists.
  • 2026-06-23noetl/server v3.45.0 — 🚀 #104: WI/ADC GCS auth (3-mode none/static/adc) MERGED + ROLLED TO PROD as WI KSA noetl-server-rust (@sha256:d3cbf1ad…); result-tier GCS ENV applied, all tier-enable flags OFF → tier inert; off-server cutover stayed healthy, 0 restarts (server#265, fad5d8a).
  • 2026-06-23noetl/server v3.44.0 — 🪶 #104 Phase F: result-tier GC sweeper (NOETL_RESULT_TIER_GC, default off) — conservative dry-run-first, never deletes a live-referenced object (server#264, 341b614).
  • 2026-06-23noetl/worker v5.45.0 — 🪶 #104 Phase F: result-tier DR re-derive (NOETL_RESULT_TIER_DR, default off) — materializer verify-and-repair, byte-identical rebuild of a missing/corrupt tier object (worker#131, dd07016).
  • 2026-06-23noetl/worker v5.44.0 — 🪶 #104 Phase E: side-effect durability barrier (NOETL_SIDE_EFFECT_BARRIER, default off) — adopt-only, side effects fire exactly once across re-drive; depends on published noetl-tools 3.17 (worker#130, d696f7e).
  • 2026-06-23noetl/tools v3.17.0 — 🪶 #104 Phase E: registry::kind_is_side_effecting side-effect classifier (conservative default true; only noop/rhai false) (tools#78, 1d49dd5; members noetl-directives + noetl-locator re-published first).
  • 2026-06-23noetl/server v3.43.0 — 🪶 #104 Phase D: mint-authoritative flag (NOETL_RESULT_MINT_AUTHORITATIVE, default off) + result_store dual-write counter (noetl_result_store_dual_write_total) — the reversible dual-write fallback leg of the minting flip; no Cargo.toml change (resolves noetl-locator 0.1.1 from the registry) (server#263, 6f6b9ef).
  • 2026-06-23noetl/worker v5.43.0 — 🪶 #104 Phase D: the minting flip — NOETL_RESULT_MINT_AUTHORITATIVE (default off) makes the materializer the authoritative tier writer + resolve-by-URN the primary consume path, with the dual-written result_store as the fail-safe fallback; new noetl_worker_result_mint_authoritative_total{path} (tier | legacy_fallback) (worker#129, be6863a).
  • 2026-06-23noetl/server v3.42.0 — 🪶 #104 Phase C: GCS object backend + cell-endpoint registry + GET /api/internal/cells (resolve-by-URN read side; default-off) (server#262, c2d5ca9).
  • 2026-06-23noetl/worker v5.42.0 — 🪶 #104 Phase C: resolve-by-URN read path + fixes B/B1 (references-in-state behavior, flatten_single_tool_result, closes OQ6; adds published arrow=53, resolves noetl-tools 3.16.0; 38 unit tests) (worker#128, 7971041).
  • 2026-06-22noetl/tools v3.16.0 + noetl-locator v0.1.1 — 🪶 #104 Phase B: ResultCoordinates::parse/from_locator (the inverse of logical_uri); additive → member bumped 0.1.0→0.1.1, published (tools#77).
  • 2026-06-22noetl/server v3.41.0 — 🪶 #104 Phase B: ensure the sibling noetl_result_materializer durable consumer at stream-birth (own ack cursor) (server#261).
  • 2026-06-22noetl/worker v5.41.0 — 🪶 #104 Phase B: shadow Feather result tier — separate noetl_events consume-loop writes over-budget results (tabular → Feather, non-tabular → JSON) at the derived §7 key; gated NOETL_RESULT_MATERIALIZER_ENABLED (default off) (worker#127).
  • 2026-06-22noetl/server v3.40.0 — 📦 #104 Phase A: accept the canonical result URI behind NOETL_RESULT_URI_ACCEPT (default off), parsed via the slim noetl-locator 0.1.0 (heavy graph absent from the control plane) (server#260, c89d078).
  • 2026-06-22noetl/tools v3.15.0 + new crate noetl-locator v0.1.0 — 📦 #104 Phase A: extract the slim dependency-free noetl-locator crate (pure std; ResourceLocator/ResultCoordinates/shard_key/CellPlacement/legacy parse) so the control-plane server parses the result URI without noetl-tools' heavy graph; noetl-tools re-exports it as noetl_tools::locator (tools#76, dc0c5d8; noetl-locator 0.1.0 published to crates.io).
  • 2026-06-22 noetl/worker v5.40.5 — ✅ #95: adopt noetl-tools 3.14.2 (postgres temporal/identity serialization); Cargo.lock pin 3.14.1→3.14.2, no worker source change; built noetl-worker-rust:v5.40.5 @sha256:45212dbe… + rolled by digest to prod under the live off-server cutover (worker#126, da24952).
  • 2026-06-22 noetl/worker v5.40.4 — ✅ #127: adopt noetl-tools 3.14.1 (task_sequence per-sub-task context optimization); Cargo.lock pin 3.14→3.14.1, no worker source change; built noetl-worker-rust:0afbf5c + rolled to prod under the live off-server cutover (worker#125, 0afbf5c).
  • 2026-06-22noetl/tools v3.14.2 — ✅ #95: postgres pg_value_to_json now serializes tz-naive timestamp (NaiveDateTime) + date/time/uuid/numeric (lossless binary-wire decode)/bytea instead of returning null; the auth0_login expires_at: null repro (tools#75, 6d9b674).
  • 2026-06-22noetl/tools v3.14.1 — ✅ #127: behavior-preserving task_sequence per-sub-task context reuse — render_value builds the proxied minijinja context once + build_context_with_overlay skips the redundant to_template_context() clone; micro-bench −61.6%/2.6× per-sub-task context cost (tools#74, c8656c1).
  • 2026-06-22noetl/server v3.39.6 — ✅ #123: surface a non-iterable loop in: as a terminal playbook.failed — the off-server drive decodes the {"error":…} envelope (decode_orchestrate_error, metric noetl_orchestrate_drive_total{stage="drive_error"}) instead of silently wedging (commands=0); orchestrate-core prefixes the offending step name onto the evaluate_loop error (server#258, 7f109a9, squash 275b914). Code-only (worker stays v5.40.3); 600 server + 135 orchestrate-core tests + clippy; kind-validated prod-exact (absent iterable → FAILED with a clear message, valid [1,2,3] still COMPLETED).
  • 2026-06-22noetl/server v3.39.5 — ✅ #121 second-half: gate BOTH off-server-drive sites in trigger_orchestrator_inner on should_publish so system/* execs drive server-built run_state (they INSERT to noetl.event, never enter the WAL); regular execs keep the off-server path (server#257, c421273, squash 54ac277). Server-only (worker stays v5.40.3); 612 tests + clippy; kind full-gate system/scheduled_cleanup BEFORE 9× WAL chain incomplete → AFTER COMPLETED 0 loops.
  • 2026-06-21noetl/server v3.39.4 — ✅ #121 first-half (partial): link gate-off command.claimed through ChainHeads + don't route system/ execs to the off-server WAL drive (server#256, 77aaa06)
  • 2026-06-21noetl/worker v5.40.3 — ✅ #125 + #126 adopt noetl-tools 3.14.0 (task_sequence control flow + http body data-shape fixes); 10×1000 batch pft_flow_test clean (worker#124, 6dd3449).
  • 2026-06-21noetl/tools v3.14.0 — ✅ #125 task_sequence honours do: jump/break/retry (tools#73, 638c3c6).
  • 2026-06-21noetl/tools v3.13.1 — ✅ #126 http tool body under data (Python-era contract restored; back-compat body alias) (tools#72, 8dd0e1f).
  • 2026-06-21noetl/server v3.39.3 — ✅ #124 distributed task_sequence forward set:/sibling bindings no longer render empty; defer sub-task templates with unresolved vars to the worker re-render (server#255, 365d3be).
  • 2026-06-21noetl/server v3.39.2 — ✅ #120 reduce barrier no longer deadlocks on open/asymmetric loop joins; runtime liveness filter in the orchestrate-core barrier (server#254 28e8950)
  • 2026-06-20noetl/server v3.39.0 — ✅ #116 program-scale step 2: execution-affinity single-owner WRITE ORDERING (multi-replica gate-ON validated) — closes the chain-fork race step 1 left open. Affinity routes every trigger for an execution to the replica that ShardConfig::owns it (non-owner forwards a reverse-proxy POST); owner's single-process drive lock + in-memory ChainHeads make the read→advance atomic, no distributed lock; KV = genesis/handoff vehicle (kv_remote_hit→0). src/affinity.rs; NOETL_EXECUTION_AFFINITY/NOETL_PEER_URL_TEMPLATE/NOETL_SHARD_INDEX_FROM_HOSTNAME (all default off); metric noetl_execution_affinity_total{outcome} (server#252 5e00d0a). 2-replica gate-ON kind PASS: chains roots=1/dangling=0/walk==total (NO fork), forwarded_ok +9, never-scan + sole-writer across replicas. Follow-up #117 (off-server from_events spine event_id-order wedges fan-in under inversion).
  • 2026-06-20noetl/server v3.38.0 — ✅ #115 program-scale step 1: multi-replica coherence DATA LAYER — NOETL_REPLICA_COHERENCE=nats_kv (default local, prod unchanged) backs ChainHeads + ExecDescriptor with JetStream KV buckets (head CAS + descriptor CAS merge); src/coherence.rs; metric noetl_replica_coherence_total{structure,op,outcome} (server#251 8f39a79). Kind: single-replica parity with local; 2-replica cross-replica resolves proven. Necessary-not-sufficient → execution-affinity STAGED (2+ replicas still fork the chain).
  • 2026-06-20noetl/server v3.37.0 — ✅ #115 Phase 5 atomic-working-item context (tenet 6): input_binding + NOETL_ATOMIC_ITEM_CONTEXT (default off) (server#250 a96ade8)
  • 2026-06-20noetl/worker v5.40.0 — ✅ #115 Phase 5 forward the atomic-item-context flag onto the off-server from_events drive (worker#121 2484d17)
  • 2026-06-20noetl/server v3.36.0 — ✅ #115 Phase 6 retire the hot-path noetl.event read class; the table is AUDIT-ONLY (server#249, b71ca1d): NOETL_EVENT_READ_PATH=event_scan|audit_only (default event_scan, prod unchanged). Under audit_only the remaining lifecycle readers (get_catalog_id, inherit_parent_trace, dedup-audit + container-callback catalog/existence) serve from the in-memory ExecDescriptor; cold → noetl.command (synchronous queue) — never a noetl.event scan. Proof metric noetl_event_hotpath_reads_total{site,outcome}. Gate-ON kind-validated: hot-path scan Δ0 + drive state_build_total/event_scans Δ0 ⇒ ZERO noetl.event scans on the hot path, end-to-end; audit/replay still work; 585 tests + clippy green. RFC never-scan end state (tenet 3) reached under the flag.
  • 2026-06-19noetl/worker v5.37.0 — ✅ #115 Phase 4 off-server state-builder kernel + WAL shadow loop (worker#118, fef961c): pool-side per-execution chain index sourced from the noetl_events WAL; chain_walk() head→root spine in event_id order (parity by construction); cache keyed by the immutable chain head (CacheHit / Incremental tail-advance / ColdRebuild). Live WAL shadow loop (NOETL_STATE_BUILDER_SHADOW, default off) + metrics. Gate-ON kind-validated (993 WAL events, 0 noetl.event scans, 28 cold + 21 incremental); 8 unit tests + clippy green; default off. Drive cutover staged.
  • 2026-06-19noetl/server v3.33.0 — ✅ #115 Phase 4 NOETL_STATE_BUILDER=offserver|server flag scaffold (server#246, 3e6006d, default server): the server-side flag for the off-server state-builder drive cutover (staged). 2 config tests; no prod default changed.
  • 2026-06-19noetl/server v3.32.0 — ✅ #115 Phase 3 chain-walk state builder (flagged, default-off) (server#245, ai-meta pointer bumped): behind NOETL_STATE_BUILD_MODE=chain_walk the drive reconstructs WorkflowState by following the one-level prev_event_id chain head→root (in-memory ChainHeads head + (execution_id,event_id) PK lookups — never a WHERE execution_id scan) → same from_events (parity by construction); event-scan kept as the default + fallback (cold-head / lag / non-genesis). NOETL_STATE_BUILD_PARITY_CHECK shadow-builds both ways in one REPEATABLE READ snapshot. New metrics noetl_state_build_total{mode,outcome} / _event_scans_total (no-scan proof) / _chain_hops / _parity_total{result}. Gate-ON kind-validated: parity 41/41 MATCH, scans=0 / 1064 hops / 0 fallbacks, all topologies COMPLETE, sole-writer + lag-0, 577 tests + clippy green. No prod default changed.
  • 2026-06-19noetl/server v3.31.0 — ✅ #115 Phase 2 one-level prev_event_id event chain (server#244, ai-meta afdb365): each noetl.event/noetl.command carries the chain link, stamped at the emit chokepoint from a per-execution chain-head watermark (ChainHeads), covering both gate paths + the materializer. Additive; nothing reads it yet (Phase 3). Kind-proven walkable/1-root/no-gap/no-scan across 6 gate-ON topologies; 573 tests + clippy green. Companion DDL noetl/noetl ecd16a2 (noetl#667). No prod default changed.
  • 2026-06-19noetl/server v3.30.0 — ✅ #115 Phase 1 surface _ref/_store on kept refs + refs_in_state default true (server#243): consume-side accessors for {{ step._ref }} lazy-load + storage-tier predicates; references stay out of state/commands by default (worker selective-resolve landed in worker#117 v5.36.0). Closed #113 + #114; kind gate-ON all 9 stalls COMPLETE.
  • 2026-06-19 noetl/server v3.29.5 — 🐞 #114 offload oversized command context: a command.issued context over NOETL_COMMAND_CONTEXT_MAX_BYTES (512KB) is offloaded to noetl.result_store with a {__context_ref__} marker, resolved in get_command/claim_command (server#242, rig e2e#64) — published events stay under NATS max_payload, no publish-wall wedge. Kind gate-ON: rig PASS, all command.issued <1MB, 6 of #113's 9 fixtures COMPLETE; remaining 3 + cutover gated on the refs_in_state consume side (#101). No prod default changed.
  • 2026-06-19 noetl/server v3.29.4 — 🐞 #113 off-server drive: recover offloaded drive result + stop drive on cancel (server#241, rig e2e#63). apply_worker_orchestration resolves+decodes an offloaded __orchestrate__ result (over the 100KB inline budget → durable reference.ref) via result_store.resolve (metric ref_resolved) instead of dropping it, which caused a non-convergent re-loop; cancel now matches underscore playbook_cancelled + a terminal guard evicts the orch-cache (no restart). Kind gate-ON proven; 5/9 #113 fixtures COMPLETE, the other 4 hit a distinct oversized-command.issued stall → #114. No prod default flipped.
  • 2026-06-19noetl/server v3.29.3 — 🎯 #103 cutover COMPLETE, FLIP-READY: the 2 ExecutionService cancel/finalize writers route through the emit_event chokepoint (server#240, e2e rig e2e#62) — the last synchronous server noetl.event writer under the gate is closed. Kind-proven both modes (gate-off byte-identical INSERT; gate-on PUBLISHED + materializer sole writer + terminal state + 0 loss/dup). All three flip blockers closed → PUBLISH_ONLY flip is a staged operator decision. Default-off; no prod default changed.
  • 2026-06-19noetl/server v3.29.2 — off-server-drive × gate crash-recovery: cold-cache apply rebuilds WorkflowState from the durable log instead of dropping the in-flight drive result (server#238, refs #104/#103). Unblocks the PUBLISH_ONLY flip (off-server drive × gate now kind-proven). Confined to the cold branch; no prod default changed.
  • 2026-06-19noetl/tools v3.13.0 + noetl/worker v5.34.0 — #103 ack-after-materialize durability: deferred ack-after-processing capability (tools#71: AckMode::Defer + $JS.ACK durable handles + ack/nack/term) + in-process CQRS materializer consume-loop (worker#115: drain→project→ack-only-on-success, redeliver on failure) + system-pool wiring (ops#194). Kind fault-injection: gate-on sole-writer loss=0 across a mid-drain failure. Default-off.
  • 2026-06-18noetl/server v3.28.0 — worker-driven orchestrator drive now default ON (NOETL_ORCHESTRATE_PLUGIN_DRIVE defaults true) (server#233, closes #108; scale-soak-gated, revert = =false).
  • 2026-06-18noetl/worker v5.33.0 — pool-affinity decline (drive isolated on the system pool) (worker#114, refs #108 (b)).
  • 2026-06-14noetl/worker v5.22.0 — transfer endpoint credential-alias resolution, both Snowflake↔Postgres directions (worker#87, closes #99).
  • 2026-06-14noetl/tools v3.10.0 — Snowflake↔Postgres transfer arms + flatten credential config (tools#65, closes #99).
  • 2026-06-14noetl/tools v3.9.2 — Snowflake SQL-API context in request body + multi-statement split (tools#64, refs #98).
  • 2026-06-14noetl/tools v3.9.1 — set User-Agent on the Snowflake HTTP client (tools#63).
  • 2026-06-14noetl/tools v3.9.0 — Snowflake key-pair JWT authentication (tools#62; kind-validated on live sf_test account).
  • 2026-06-12 noetl/server v3.5.0 — POST /api/execute/batch + opt-in exactly-once dedup window (server#189, RFC #90 Phase 7 — scale hardening).
  • 2026-06-12 noetl/worker v5.19.0 — batch dispatch + dedup opt-in + per-subscription rate limits (worker#79, RFC #90 Phase 7 — scale hardening, closes #90).
  • 2026-06-12 noetl/cli v4.11.0 — noetl subscribe, local-mode subscription listener (cli#60, closes cli#59, RFC #90 Phase 6). Standalone kind: Subscription listener + FileEventSink JSONL trail + local_disk store-and-forward spool; cli-only (reuses noetl-tools v3.5.0 source+spool). ai-meta → cli 2fb3fb0.
  • 2026-06-11 noetl/server v3.0.6 — round-trip JSON null in whole-object {{ step }} references (server#177, closes noetl/ai-meta#89). A null field in a {{ step }} envelope rendered as the JS token undefined (invalid JSON), so the consuming step received an unparseable str; render_to_value now retries with | tojson (undefined/none → JSON null) — the server renderer had diverged from the noetl-tools engine that already did this. Kind-validated: cursor pagination collects all 35 events through the terminal next_cursor: null page. ai-meta pointer → server 8e17fbe.
  • 2026-06-10 noetl/tools v3.1.1 — multi-tool sibling references (tools#48, closes noetl/ai-meta#87). TaskSequenceTool now injects each sub-tool's result under its label so a later sub-tool resolves {{ <label>.<field> }} (was rendering empty — a syntax error at or near "," in unquoted numeric SQL positions). Worker adopts via worker#69. Kind-validated (save_edge_cases test_large_payload → record_count = 100). ai-meta pointer → tools 76f942a + worker b97f642.
  • 2026-06-10 noetl/server v3.0.3 — container-callback insert matches the deployed noetl.event schema (server#173, tracks noetl/ai-meta#43). The handler's call.done insert targeted a non-existent attempt column → HTTP 500 on every watcher callback; replaced with an inline INSERT matching the working ingestion path. Unblocked the container-callback chain (kind-val GREEN both probes). ai-meta pointer → 5d2cf58.
  • 2026-06-10 noetl/gui v1.11.0 + v1.11.1 — credential View/Edit recovery for pre-wallet records (gui#36, closes noetl/ai-meta#82) + dev:kind convenience script (gui#35). ai-meta pointer → 8cacc9e.
  • 2026-06-10 noetl/server v3.0.2 — container-tool command type contradiction fix (server#172, closes noetl/ai-meta#81). ToolSpec.command Option<String> → Option<serde_json::Value>: the container tool's array command now decodes server-side + passes through to the worker's Vec<String>; scalars stay JSON strings for shell/db tools. Kind-val GREEN (K8s Job reaches Complete 1/1).
  • 2026-06-08 noetl/tools v2.24.2 — clippy cleanup: 15 warnings resolved across 7 files (tools#44, closes tools#42). Mechanical lint fixes, zero behavioral changes.
  • 2026-06-05 noetl/tools v2.18.4 — duckdb command: alias (parity with postgres) + http params/headers/form non-string coercion (tools#26); worker adopts it (worker#51). Unblocks the pagination + http + duckdb-command fixtures.
  • 2026-06-05 noetl/tools v2.18.3 — postgres command: alias + multi-statement SQL on postgres + duckdb (tools#24, closes tools#23); worker adopts it (worker#50). duckdb_test + json_serialization_save GREEN.
  • 2026-06-05 noetl/tools v2.18.2 — postgres tool surfaces the real SQLSTATE + message instead of db error (tools#21); worker bumped to it (worker#49).
  • 2026-06-05 noetl/server v2.22.0 Secrets Wallet Phase 2: GCP Cloud KMS KeyManager (Cloud KMS :encrypt/:decrypt + Workload Identity); runtime NOETL_KMS_PROVIDER (local/gcp-kms); KEK can leave the process (server#81, tracks #61). Kind-validated on local.
  • 2026-06-05 noetl/server v2.21.0 Secrets Wallet Phase 1c/1d: credentials + keychain store envelope-encrypted (per-record DEK wrapped by the KEK); self-describing {"v":1,…} blob, forward-only (server#79, tracks #61). Kind-validated end-to-end.
  • 2026-06-05 noetl/server v2.20.0 Secrets Wallet Phase 1b: envelope-encryption core — KeyManager/LocalDevKms/EnvelopeCipher (server#77).
  • 2026-06-05 noetl/server v2.19.8 Secrets Wallet Phase 1a: remove the all-zeros default encryption key, fail closed (server#75, tracks #61). Kind-validated.
  • 2026-06-05 noetl/tools v2.18.5 — dollar-quote-aware statement splitter; the 2.18.3 splitter shredded plpgsql $$ … $$ blocks (tools#27).
  • 2026-06-05 noetl/server v2.19.7 — defer task_sequence _prev/_results refs at command-build (server#73); nested-pipeline templates render at runtime → iterator_save_test GREEN.
  • 2026-06-05 noetl/worker v5.11.3 — resolve keychain aliases on task_sequence sub-tasks (worker#48); nested postgres-in-pipeline steps connect.
  • 2026-06-05 noetl/server v2.19.6 — credential store base64-armors the AES-GCM blob for the TEXT data_encrypted column (server#71); keychain creds register + round-trip.
  • 2026-06-05 noetl/worker v5.11.2 — resolves keychain alias under the v10 credential: key (worker#46).
  • 2026-06-05 noetl/server v2.19.5 — v10 control-flow end-to-end (server#69, 6 commits): catalog INT4 + catalog_id alias, ToolSpec skip-null, orchestrator end-step-with-action + task_sequence flatten + intra-pass dedup, template Chainable-undefined, end-step trigger gate.
  • 2026-06-05 noetl/worker v5.11.1 — preserve array tool_config for task_sequence (worker#44).
  • 2026-06-05 noetl/tools v2.18.1 — task_sequence parse_tasks accepts worker-envelope shape.
  • 2026-06-04 noetl/server v2.19.4 — orchestrator template context: step data at top level + call.done capture.
  • 2026-06-04 noetl/tools v2.18.0 — TaskSequenceTool.
  • 2026-06-04 noetl/tools v2.17.1 — PythonTool result-global capture.
  • 2026-06-04 noetl/server v2.19.3 — three fixes shipped together: pipeline flat shape decode (#61), failure termination (#63), workbook resolution (#65).
  • 2026-06-04 noetl/server v2.19.2 — v10 workload + input alias (#59).
  • 2026-06-04 noetl/server v2.19.1 — EE-5 lax decode for integer execution_id (#57).
  • 2026-06-04 noetl/server v2.19.0 — Phase F R4-4b (ExecutionService refactor + cluster-wide list fan-out).
  • 2026-06-04 noetl/server v2.13.0 → v2.19.0 — Phase F R4 series (DbPoolMap N+1 pool layer through R4-5 kind validation).
  • 2026-06-04 noetl/gateway v3.2.0 — Phase F R3b-2 shard-info twin endpoint.
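
Illustration for the tools v2.18.5 entry above: a minimal sketch of what "dollar-quote-aware" statement splitting means, so a reader can see why a naive split on `;` shreds a plpgsql body. This is not the noetl-tools implementation; `split_statements` and `_DOLLAR_TAG` are hypothetical names, and the sketch assumes only single-quoted literals and dollar-quoted bodies need protecting (SQL comments and double-quoted identifiers are not handled).

```python
import re

# Hypothetical helper: matches a dollar-quote tag such as $$ or $body$.
# A positional parameter like $1 does not match, because a tag cannot start with a digit.
_DOLLAR_TAG = re.compile(r"\$(?:[A-Za-z_][A-Za-z_0-9]*)?\$")


def split_statements(sql: str) -> list[str]:
    """Split SQL on top-level semicolons, keeping quoted and dollar-quoted bodies whole."""
    statements = []
    start = 0  # index where the current statement begins
    i = 0
    n = len(sql)
    while i < n:
        ch = sql[i]
        if ch == "'":
            # Skip a single-quoted literal; '' inside it is an escaped quote.
            i += 1
            while i < n:
                if sql[i] == "'":
                    if i + 1 < n and sql[i + 1] == "'":
                        i += 2
                        continue
                    break
                i += 1
            i += 1  # step past the closing quote
        elif ch == "$":
            tag = _DOLLAR_TAG.match(sql, i)
            if tag:
                # Jump past the matching closing tag; an unterminated body runs to the end.
                close = sql.find(tag.group(0), tag.end())
                i = n if close == -1 else close + len(tag.group(0))
            else:
                i += 1
        elif ch == ";":
            piece = sql[start:i].strip()
            if piece:
                statements.append(piece)
            i += 1
            start = i
        else:
            i += 1
    tail = sql[start:].strip()
    if tail:
        statements.append(tail)
    return statements
```

Under these assumptions, a `CREATE FUNCTION … AS $$ BEGIN RETURN 1; END; $$ LANGUAGE plpgsql;` statement comes back as one piece, where a plain `sql.split(";")` would cut it at each semicolon inside the body.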

Conventions

How agents (Claude / Codex / Cursor) operate across this ecosystem — pointers into the rule files in agents/rules/:

How to use this dashboard

  • Just landed in this codebase? Read Repo Map, then Execution Model, then the umbrella for whatever you're working on.
  • Picking up an in-flight task? Find the matching umbrella page above; it has the full state of the work + the next concrete step.
  • Need to file new work? Follow the issue tracking convention — open the ai-task issue on noetl/ai-meta, then add the umbrella to the table above and create the corresponding wiki page.
  • Maintenance pass? Refresh this Home + the Sessions Log + the Releases page + the matching Umbrella-*.md page when you bump a submodule pointer. All four pages drift together — see Rule 0a's checklist.
