Sessions Log

Kadyapam edited this page Jun 23, 2026 · 300 revisions

Chronological log of agent sessions that touched the NoETL ecosystem. Each entry links to the durable artefacts (PRs, commits, wiki updates, ai-task issues) so future sessions can pick up where the last one ended.

How this log stays current. Each session adds an entry at the top when meaningful work lands (issues opened, PRs merged, pointer bumps, design decisions captured). Use the Issue Tracking convention to decide what's worth logging — same threshold: anything that will outlive the session.


2026-06-23 — 🪶 #104 OQ5 Option A SHIPPED — producer-staged result tier (worker v5.46.0); result_store-retirement soak gate defined (NOT started — gated)

Landed. OQ5 Option A merged + released: the producing worker stages the over-budget result tier object at emit time under NOETL_RESULT_PRODUCER_STAGE (default off → byte-identical no-op), decoupling the tier write from noetl.result_store; the materializer skips its result_store fetch on already-staged objects (skip-on-exists). Shared decide_tier → byte-identical to a materializer-written object.

  • worker #132 (0d9ca18) → semantic-release noetl-worker v5.46.0 (27c7c17); crate + multi-arch image published.
  • e2e #81 (59352b4) kind rig; docs #186 (566333b) RFC Option A section.
  • ai-meta worker pointer dd07016 → 27c7c17 via #128.
  • ops #206 (open, do-not-apply): OQ5 soak-gate alert rules (GMP + VMRule) on noetl_worker_result_mint_authoritative_total{path="legacy_fallback"}.

Soak — NOT started, gated. The result_store-retirement gate is legacy_fallback holding at 0 while producers stage in steady state. That requires deploying worker v5.46.0 to prod + NOETL_RESULT_PRODUCER_STAGE=true on noetl-worker-rust — both gated prod changes, not done this session. Plan + start criteria + 72h/48h-floor window recorded on #104. No prod default changed; the result_store dual-write stays in place; retirement (dropping it) is a separate explicit-go step.

Note: the docs Cloudflare deploy is pre-existing red (unrelated MDX errors in two other docs/architecture/*.md files since 2026-06-17); the OQ5 RFC file compiles clean. Flagged for a separate fix.

2026-06-23 — ✅ #104 — Phase D (mint-authoritative + dual-write) ENABLED LIVE on prod — staged A–D enablement COMPLETE

Headline. Flipped the final staged result-tier flag on prod GKE (noetl-demo-19700101, ns noetl): NOETL_RESULT_MINT_AUTHORITATIVE=true makes the URN→Feather/GCS tier the authoritative result store, with noetl.result_store kept as the reversible dual-write fail-safe. A–D are now all live on prod. B (RESULT_MATERIALIZER_ENABLED, system-pool) + C (RESULT_URI_RESOLVE, worker-rust) were already on; this session added D.

Flag topology (corrected live). The materializer spawn is gated materializer_enabled() OR mint_authoritative() (worker/src/result_materializer.rs:154), so the flag on worker-rust spawns a stray result materializer at the wrong cell (local-0) that would fan out as worker-rust autoscales 1→20. ops#204 templates the flag only on system-pool — the intended design. Final prod placement: noetl-worker-system-pool (authoritative materializer, cell usc1-a; log AUTHORITATIVE Feather tier; #104 Phase D … authoritative=true) + noetl-server-rust (dual-write counter; no materializer there → side-effect-free). NOT worker-rust — its consume already resolves from the tier via Phase C; D there was redundant and only added the mint counter at the cost of the stray writer. (Captured result_mint_authoritative_total{path=tier}=2 during a brief all-on window as one-time consume-path proof, then removed it from worker-rust.)
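The topology issue above follows directly from the OR gate. The sketch below is a hypothetical Python mirror of that gate, not the code at worker/src/result_materializer.rs; the helper names, the "true"-string parsing rule, and the `NOETL_` prefix on the resolve flag are assumptions made for illustration.

```python
# Hypothetical mirror of the spawn gate: either flag being on spawns the loop.
def _flag_on(env, name):
    # Assumption: a flag counts as on only when its value is the string "true".
    return env.get(name, "").strip().lower() == "true"


def spawns_result_materializer(env):
    # materializer_enabled() OR mint_authoritative()
    return (_flag_on(env, "NOETL_RESULT_MATERIALIZER_ENABLED")
            or _flag_on(env, "NOETL_RESULT_MINT_AUTHORITATIVE"))


# Intended placement: the system pool carries both flags.
system_pool = {
    "NOETL_RESULT_MATERIALIZER_ENABLED": "true",
    "NOETL_RESULT_MINT_AUTHORITATIVE": "true",
}
# worker-rust only needs the Phase C resolve flag.
worker_rust = {"NOETL_RESULT_URI_RESOLVE": "true"}
# Adding the Phase D flag to worker-rust trips the OR gate.
worker_rust_with_d = dict(worker_rust, NOETL_RESULT_MINT_AUTHORITATIVE="true")

for name, env in [("system-pool", system_pool),
                  ("worker-rust", worker_rust),
                  ("worker-rust + D", worker_rust_with_d)]:
    print(name, spawns_result_materializer(env))
```

Under these assumptions only the third environment spawns an unintended materializer, which is why the flag is kept off worker-rust.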

Validation (LIVE, 4 over-budget execs, 1200-row producer→settle→consume, tenant-segregated prefixes phased-mint*/phased-soak*). Every round clean:

  • tier authoritative: Feather object written only at env=prod/region=usc1/cell=usc1-a/…/results/start/0/0/1.feather (shards s0118/s0096/s0134), system-pool sole result materializer, result_materializer_errors_total=0, no local-0 stray on any round.
  • dual-write preserved: noetl_result_store_dual_write_total=4 (1:1 with execs), result_store_put_total{ok}=4, noetl.result_store row lands each time.
  • resolve from tier: result_resolve_total{resolved_feather}=6, consume bound start.rows[1100] → row_count=1200/deep_id=1100/test_passed=true, all COMPLETED.
  • cutover invariants: event-mat sole-writer projected==acked==100, duplicates=0/project_errors=0, lag 0 (nats_consumer_pending{noetl_materializer}=0), never-scan state_build_event_scans=0, 0 pod restarts, 0 GCS/ADC auth errors.

Off-server CQRS gate (PUBLISH_ONLY=true+STATE_BUILDER=offserver), cell env, CPU limit 2 all preserved; GC/DR stay OFF; all three rollouts clean.

Decision: D LEFT ENABLED (system-pool + server). Revert armed: kubectl -n noetl set env deploy/noetl-worker-system-pool deploy/noetl-server-rust NOETL_RESULT_MINT_AUTHORITATIVE-. Remaining (gated, NOT touched): OQ5 byte-source re-plumb (materializer fetches payload FROM result_store today → retirement needs re-plumbing the tier-write byte source) + noetl.result_store retirement + OQ5 soak. #104 stays OPEN. #104 comment.


2026-06-23 — 🚀 #104 — WI/ADC GCS auth MERGED (server v3.45.0) + ROLLED TO PROD as WI KSA (tier still OFF)

Headline. Merged server#265 (fad5d8a, the 3-mode none/static/adc GCS auth matrix via gcp_auth) → semantic-release cut noetl-server v3.45.0 (21da3ef). Built the prod image via Cloud Build (e2-highcpu-8, us-central1) → us-central1-docker.pkg.dev/noetl-demo-19700101/noetl/server-rust@sha256:d3cbf1ad7e44d52dbee60ee2472073b4ef62f8787aeb0722b6d33f1b8eda8fe0 (tags 21da3ef + v3.45.0). Rolled it to prod GKE (gke_noetl-demo-19700101_us-central1_noetl-cluster, ns noetl) and set serviceAccountName: noetl-server-rust so the server now runs as the operator-provisioned WI-bound KSA. Applied the result-tier GCS ENV to the server (NOETL_OBJECT_STORE_BACKEND=gcs, GCS_BUCKET=noetl-demo-19700101-results, GCS_ENDPOINT=https://storage.googleapis.com, GCS_AUTH=auto, RESULT_CELL_ENV=prod/REGION=usc1/CELL=usc1-a/SHARD_COUNT=256) + seeded the matching cell ENV on noetl-worker-system-pool (no GCS_* on the pool — it writes via the server endpoint).

Tier stays INERT. No tier-enable flag was set on any deployment — NOETL_RESULT_MATERIALIZER_ENABLED / RESULT_URI_RESOLVE / RESULT_MINT_AUTHORITATIVE / RESULT_TIER_GC / RESULT_TIER_DR all remain unset/false. Behavior == prior stack. The off-server gate (NOETL_EVENT_INGEST_PUBLISH_ONLY=true + STATE_BUILDER=offserver) and CPU limits (limit 2 / req 250m) were preserved.

Validation (LIVE on prod). Server pod up healthy on v3.45.0 as KSA noetl-server-rust, 0 restarts; logs show object store backend: GCS … auth=adc resolved correctly with no token minted (lazy on first GCS I/O — none happens with the tier off), DB + NATS connected, Server listening. /api/internal/cells returns the applied config (shard_count:256, default_cell:usc1-a, cells:[usc1-a/prod/usc1/gcs/noetl-demo-19700101-results]). /health ok. Off-server cutover stayed healthy: a tenant-segregated smoke (test/simple_python) + system/scheduled_cleanup both COMPLETED; materializer sole writer (server published==projected==acked=13), consumer lag 0, state_builder_event_scans_total=0 (never-scan), replay reconstructed the full 13-event chain to terminal playbook.completed; result_materializer drained=0/errors=0 (inert). All pods 0 restarts.

Result. Prod is fully configured + the WI server is live, ready for staged B→C→D result-tier enablement (a later operator-gated task). Revert on standby: roll back the server image + serviceAccountName + the GCS/cell ENV; the cutover-gate revert (PUBLISH_ONLY=false STATE_BUILDER=server) stays armed. ai-meta pointer bumped repos/server → 21da3ef (v3.45.0).


2026-06-23 — 🪶 #104 Phases E + F MERGED to main — the FINAL build phases (side-effect barrier + result-tier GC/DR); repo-only, flags default-off

Headline. Merged the #104 Phase E and Phase F PRs in dependency order (E before F — F's worker/e2e branches were stacked on the Phase E branch). All flags default-off → inert in prod; this is repo-only, no prod deploy. The A–F build phases of #104 are now complete; only operational items remain (prod GCS infra, OQ5 byte-source re-plumb, staged enablement).

Phase E (side-effect durability barrier).

  • tools#78 feat(registry) → noetl-tools v3.17.0 (1d49dd5): publishes registry::kind_is_side_effecting (conservative default true; only noop/rhai false). semantic-release cut + the release workflow published to crates.io — members noetl-directives + noetl-locator re-published first (workspace publish order), then the root noetl-tools 3.17.0.
  • worker#130 feat(barrier) → noetl-worker v5.44.0 (d696f7e): the consume-pool barrier (NOETL_SIDE_EFFECT_BARRIER, default off) — before re-dispatching a side-effecting tool it adopts an already-durable result, so external side effects fire exactly once across re-drive. Repointed onto the published noetl-tools 3.17 (Cargo.toml ^3.17 + lockfile) — no patch/path/git dep; cargo check/clippy/tests green.
  • e2e#79 (6dd1432): barrier kind rig.

Phase F (result-tier GC + DR).

  • server#264 merged with a feat: subject → noetl-server v3.44.0 (341b614): conservative dry-run-first GC sweeper (NOETL_RESULT_TIER_GC, default off) — reclaims only provably-dead tier objects, never deletes a live-referenced one (unit-tested decide).
  • worker#131 merged with a feat: subject → noetl-worker v5.45.0 (dd07016): materializer verify-and-repair DR mode (NOETL_RESULT_TIER_DR, default off) — rebuilds a missing/corrupt tier object byte-identically from its WAL-derivable source. Rebased onto main after Phase E merged; inherits the published noetl-tools 3.17 pin (no patch).
  • ops#205 (26185ff): GC/DR flags (both "false") on server-rust + worker-system-pool manifests + system/scheduled_cleanup GC step. e2e#80 (d7372be): GC + DR kind rig.

Cleanup (this change set). ai-meta pointers bumped (tools 1d49dd5, server 341b614, worker dd07016, ops 26185ff, e2e d7372be); deployment-spec wiki rows committed — worker-wiki NOETL_SIDE_EFFECT_BARRIER (E) + held NOETL_RESULT_TIER_DR (F), server-wiki held NOETL_RESULT_TIER_GC (F); Home/Sessions-Log/Releases/Umbrella-Event-WAL-Storage updated. #104 updated (Refs, stays OPEN).

2026-06-23 — 🪶 #104 Phase F implemented + kind-validated — result-tier GC + DR (→ MERGED 2026-06-23, see entry above)

Phase F (GC + DR) — the last #104 build phase — implemented + kind-validated, in review. GC (server): POST /api/internal/result-tier/gc (gated NOETL_RESULT_TIER_GC, dry-run default) — a conservative sweeper reclaiming only provably-dead tier objects (execution aged out of noetl.event, past a grace window) that never deletes a live-referenced object (unit-tested decide); object backend list/delete; system/scheduled_cleanup GC step (double-gated). DR (worker): result-materializer verify-and-repair mode (NOETL_RESULT_TIER_DR) rebuilding a missing/corrupt object from its WAL-derivable source byte-identically. OQ1 version-reaping + OQ5 retirement deliberately scoped OUT (reported, not guessed). Tests: server 620 / worker 255 lib, clippy clean. Kind gate-ON 5-pass green (off-server gate + fake-gcs, Phase F images built + loaded): GC-1 dry-run lists the dead orphan + skips the live object (live_candidates=0) + deletes nothing; GC-2 delete reclaims only the orphan, live survives + serves; GC-3 flag-off no-op; DR-1 deleted object re-derived byte-identically (sha256 match) + served by read path; DR-2 flag-off no-op; invariants intact every execution; baseline restored. A–F build phases complete — only operational items remain on #104 (prod GCS infra, OQ5 byte-source re-plumbing, staged prod enablement/minting cutover). Worker/e2e PRs stacked on the unmerged Phase E branch (merge Phase E first). PRs (review-only): server#264 · worker#131 · ops#205 · e2e#80 · #104. Deployment-spec wiki rows for the two flags staged (held until merge).
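The GC rule described above (reclaim only provably-dead objects, dry-run by default) can be sketched as follows. This is an illustrative Python model, not the server's unit-tested decide: the `TierObject` fields, the `sweep` helper, and the one-day grace value used in the example are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class TierObject:
    key: str
    live_referenced: bool
    # When the owning execution aged out of noetl.event; None while it is still there.
    aged_out_at: Optional[datetime]


def decide(obj, now, grace):
    """Return "reclaim" only when the object is provably dead, else "keep"."""
    if obj.live_referenced:
        return "keep"          # never delete a live-referenced object
    if obj.aged_out_at is None:
        return "keep"          # execution has not aged out yet
    if now - obj.aged_out_at < grace:
        return "keep"          # still inside the grace window
    return "reclaim"


def sweep(objects, now, grace, dry_run=True):
    """List reclaim candidates; report deletions only when dry_run is False."""
    candidates = [o.key for o in objects if decide(o, now, grace) == "reclaim"]
    deleted = [] if dry_run else list(candidates)
    return candidates, deleted


now = datetime(2026, 6, 23)
objects = [
    TierObject("orphan", False, datetime(2026, 6, 20)),
    TierObject("live", True, datetime(2026, 6, 20)),
    TierObject("fresh", False, now - timedelta(hours=12)),
]
print(sweep(objects, now, timedelta(days=1)))                 # dry run
print(sweep(objects, now, timedelta(days=1), dry_run=False))  # delete mode
```

Every uncertain case falls through to "keep", which matches the conservative intent: a missed reclaim costs storage, while a wrong delete costs data.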

2026-06-23 — 🪶 #104 Phase E implemented + kind-validated — side-effect durability barrier (→ MERGED 2026-06-23, see entry above)

Headline. Implemented + kind-validated Phase E of #104 — the side-effect durability barrier (RFC §4.4 / T6). Before re-dispatching a side-effecting cycle whose derived result URN already resolves to a durable result (the Phase C read path), the worker skips re-execution and adopts the recorded result, so an external side effect fires exactly once across a crash-resume / re-drive; non-side-effecting cycles are never blocked. PRs open, not merged; no ai-meta pointer bump; server unchanged (worker-only, reuses Phase C GCS + cells).

What landed (branches).

  • tools#78 — registry::kind_is_side_effecting + Tool::side_effecting() + ToolRegistry::is_side_effecting; conservative default true, only noop/rhai false (5 tests).
  • worker#130 — the barrier in execute_with_server_url, gated NOETL_SIDE_EFFECT_BARRIER (default off → true no-op). Gate looks through the task_sequence wrapper (command_is_side_effecting); existence+adopt reuses result_resolver::resolve_by_urn; cycle_logical_uri factored from stamp_logical_uri so a re-drive resolves the identical URN. Metric noetl_worker_side_effect_barrier_total{outcome,tool} (11 Phase-E tests; 253 lib green; clippy clean).
  • e2e#79 — kind_validate_side_effect_barrier.sh + test_side_effect_barrier.yaml.

OQ4 resolved → static. Per-kind classification is the baseline; per-invocation (http GET vs POST) is deliberately skipped because the barrier is adopt-only, so over-classification is safe. attempt=1 is fixed → the barrier keys on durable-success existence at the coordinate, not the attempt number, so OQ1 keep-every-attempt + #125 retry compose cleanly (retry-after-failure re-executes; resume-after-success skips).
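A minimal sketch of the adopt-only barrier, assuming an in-memory dict stands in for the durable result lookup. The function name `kind_is_side_effecting` comes from the entry above; `run_cycle`, the `"executed"` outcome label, the URN strings, and the marker counter are hypothetical and only mirror the behaviour described.

```python
# Conservative classification: everything is side-effecting except noop/rhai.
NON_SIDE_EFFECTING = {"noop", "rhai"}


def kind_is_side_effecting(kind):
    return kind not in NON_SIDE_EFFECTING


def run_cycle(kind, urn, durable, execute, barrier_on):
    """Adopt an already-durable result instead of re-executing a side effect.

    The key is the result URN only, never the attempt number, so a retry after
    a failure (nothing durable yet) re-executes while a resume after a success
    is skipped.
    """
    if barrier_on and kind_is_side_effecting(kind) and urn in durable:
        return durable[urn], "skipped"
    result = execute()      # may raise; nothing is recorded on failure
    durable[urn] = result   # stand-in for the result becoming durable
    return result, "executed"


marker = {"count": 0}


def external_side_effect():
    marker["count"] += 1
    return {"ok": True}


durable = {}
print(run_cycle("http", "urn:example:step", durable, external_side_effect, True))
print(run_cycle("http", "urn:example:step", durable, external_side_effect, True))
print(marker["count"])
```

Over-classifying a kind as side-effecting is safe in this model because the barrier only ever adopts a result that already exists; it never blocks a first execution.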

Kind validation (prod-exact off-server gate + fake-gcs; 3-pass green). A deterministic forged re-drive (the worker acks before dispatch, and the claim terminal-guard rejects re-publishing the same command_id, so the rig copies the command row with a fresh event_id → a non-terminal command for the same (execution_id, step) → same URN) + a marker-object side-effect counter:

  • PASS A (barrier ON) — primary fired once (marker=1); re-drive SKIPPED (marker stayed 1, barrier{skipped} Δ=1).
  • PASS B (barrier OFF) — re-drive RE-EXECUTED (marker 1→2); barrier metric Δ0.
  • PASS C (barrier ON) — a terminal-noop re-drive never checked (barrier_total Δ0); counter unchanged.
  • Invariants intact every primary: sole-writer, roots=1, dangling=0, walk==rows, terminal=1, materializer lag 0. Baseline restored (workers back to :104-phase-c).

Scope follow-up (not a blocker). The shipped existence signal is the tier-object half of §4.4 ("object HEAD"); small/inline side-effecting results (not tiered) re-execute today — the event-completion ("cycle acked") signal is a Phase-E follow-up. #104 stays OPEN (F + minting cutover remain).

Pointers. #104 comment · tools#78 · worker#130 · e2e#79.


2026-06-23 — 🪶 #104 Phase D MERGED — minting flip (4 PRs squash-merged in dependency order; flags default-off → inert in prod; repo-only)

Headline. Merged Phase D of #104 — the minting flip. All 4 PRs squash-merged in dependency order with release-triggering subjects: server#263 → noetl-server v3.43.0 6f6b9ef, worker#129 → noetl-worker v5.43.0 be6863a, ops#204 b19b759, e2e#78 07e85aa. Flags default-off → inert in prod (a true no-op). Repo-only; the prod minting cutover (rolling server 3.43.0 + worker 5.43.0 to GKE) is the separate next task. #104 stays OPEN (Phase E/F + the minting prod-cutover remain).

Dependency / version analysis. No new crate publish was needed. Neither server#263 nor worker#129 changes Cargo.toml — the only path deps are the in-workspace noetl-orchestrate-core, and the registry deps (noetl-locator 0.1.1, noetl-events 0.1, noetl-tools 3.16.0, arrow 53) all already published in Phase A–C and resolve from crates.io. No git/branch dep, so no pre-merge repoint and no noetl-locator/noetl-tools member-publish ordering. semantic-release cut the two expected minors (server v3.42.0 → v3.43.0, worker v5.42.0 → v5.43.0). ops/e2e have no semantic-release — their squash subjects are conventional-commit hygiene only.

OQ5 — result_store retirement — DECIDED metric-gated. Recorded on #104 and the RFC umbrella page: drop the dual-write once noetl_worker_result_mint_authoritative_total{path="legacy_fallback"} holds 0 across a staging soak (the tier never misses) plus a retention-period time floor (dual-write runs ≥ one full result retention period at flag-on so any in-flight resume can still fall back); historical rows age out by expires_at (no back-migration). The actual retirement is blocked on a not-yet-done prerequisite (NOT Phase D scope): the materializer fetches the over-budget payload from result_store today, so the tier-write byte source must be re-plumbed first (producer stages directly to the tier, or the materializer reads the inline over-budget payload off the event). Until that lands, dropping result_store would starve the tier writer even with the metric gate green.
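The two-part gate (metric holds 0 across the soak, plus a retention-period time floor) can be written as a single predicate. This is a sketch under stated assumptions: the sample format, the function name, and the example durations are illustrative, and the real soak window and retention period are the ones recorded on #104.

```python
from datetime import datetime, timedelta


def retirement_gate_open(samples, flag_on_since, now, soak, retention):
    """samples: (timestamp, value) readings of the legacy_fallback counter."""
    soak_start = now - soak
    if flag_on_since > soak_start:
        return False  # the flag has not been on for the whole soak window
    if now - flag_on_since < retention:
        return False  # time floor: one full retention period of dual-write
    window = [value for ts, value in samples if ts >= soak_start]
    # No data is treated as "not proven", never as green.
    return bool(window) and all(value == 0 for value in window)


now = datetime(2026, 7, 10)
flag_on = datetime(2026, 7, 1)
hourly_zero = [(now - timedelta(hours=h), 0) for h in range(72)]
print(retirement_gate_open(hourly_zero, flag_on, now,
                           soak=timedelta(hours=72),
                           retention=timedelta(days=7)))
```

Even with this predicate returning true, the entry above still applies: retirement stays blocked until the tier-write byte source no longer depends on result_store.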

Pointers. ai-meta repos/server → 6f6b9ef, repos/worker → be6863a, repos/ops → b19b759, repos/e2e → 07e85aa; wiki Home / Sessions-Log / Releases / Umbrella-Event-WAL-Storage (Phase D → MERGED, OQ5 → metric-gated) updated in the same change set; the held Phase D wiki-doc pointer bumps (ai-meta-wiki, noetl-server-wiki deployment-spec, noetl-worker-wiki deployment-spec) committed alongside.


2026-06-23 — 🪶 #104 Phase D implemented + kind-validated — minting flip, in review (4 PRs, flag default-off → inert in prod)

Headline. Implemented Phase D of #104 — the minting flip: one flag NOETL_RESULT_MINT_AUTHORITATIVE (default off → byte-identical to Phase A–C) makes the URN → Feather/GCS result tier the authoritative result store, with noetl.result_store demoted to the reversible dual-write fallback. Validated green on local kind under the prod-exact off-server gate; PRs open for review, not merged; no ai-meta pointer bump, no prod change. #104 stays OPEN.

What Phase D added.

  • Worker (the flip): the result materializer becomes the authoritative tier writer under the flag (implies the Phase B flag); resolve-by-URN becomes the primary consume read path (implies the Phase C flag). A tier miss falls back fail-safe to the dual-written result_store (rollback safety). New noetl_worker_result_mint_authoritative_total{path} (tier | legacy_fallback).
  • Server: config flag result_mint_authoritative; the result_store PUT counts each write on the new noetl_result_store_dual_write_total (the reversible dual-write leg) under the flag. The tier write stays worker-side because the slim control plane cannot encode Feather (OQ7).
  • ops: the flag plumbed into the kind system-pool manifest, default off; prod manifests untouched.
  • e2e: kind_validate_result_mint_authoritative.sh — a 3-pass rig.
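The worker-side read order in the list above (tier first, fail-safe fallback to the dual-written store) can be sketched with two dicts standing in for the stores. The counter labels `tier` and `legacy_fallback` are taken from the metric described above; `resolve_result` and the URN string are hypothetical.

```python
from collections import Counter

# Stand-in for noetl_worker_result_mint_authoritative_total{path}.
mint_authoritative_total = Counter()


def resolve_result(urn, tier, result_store):
    """Read the tier first; on a miss fall back to the dual-written store."""
    obj = tier.get(urn)
    if obj is not None:
        mint_authoritative_total["tier"] += 1
        return obj
    mint_authoritative_total["legacy_fallback"] += 1
    return result_store[urn]


tier = {"urn:example:result": b"feather-bytes"}
result_store = {"urn:example:result": b"feather-bytes"}

resolve_result("urn:example:result", tier, result_store)   # served by the tier
del tier["urn:example:result"]                              # simulate a tier miss
resolve_result("urn:example:result", tier, result_store)   # served by the fallback
print(dict(mint_authoritative_total))
```

In this model a non-zero `legacy_fallback` count is exactly the signal that the tier missed, which is why the same label is a natural retirement gate.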

Kind validation (off-server gate + fake-gcs, 3-pass green).

  • PASS 1 (flag ON): over-budget result AUTHORITATIVE in the GCS tier (gcs put Δ4), DUAL-WRITTEN to result_store (row present + server dual_write Δ1), consumer RESOLVES FROM THE TIER (gcs get Δ2, worker mint{tier} Δ2); 1200 rows; sole-writer intact.
  • PASS 2 (flag OFF): true no-op (dual_write/mint/resolve Δ0), legacy store authoritative; parity with PASS 1 (1200 rows).
  • PASS 3 (tier-miss ROLLBACK): the execution's tier object is DELETED during the fixture settle window (deterministic miss, independent of materializer/rollout timing), so resolve-by-URN misses and falls back to the dual-written result_store (mint{legacy_fallback} Δ1, fallback_object_miss Δ1); full payload still bound.

Server 613 + worker 247 lib tests + clippy green; heavy graph (duckdb/arrow/tonic/rhai/gcp_auth/kube) stays absent from the control plane. Baseline restored on kind.

OQ5 surfaced as the open DECISION before the prod minting cutover — the result_store retirement window (how long to dual-write + historical-row handling + the materializer-payload-source prerequisite). Framed on #104, not decided here.

Pointers. PRs (unmerged, review-only): server#263 · worker#129 · ops#204 · e2e#78. Wiki deployment-spec env-var docs: server.wiki + worker.wiki. Umbrella page Phase D → implemented/in-review; #104 stays OPEN.


2026-06-23 — 🪶 #104 Phase C MERGED — resolve-by-URN read path (3 PRs, flags default-off → inert in prod)

Headline. Merged the 3 #104 Phase C PRs in dependency order (server → worker → e2e) and bumped the ai-meta pointers. The result-DATA read half now lives on main: the server has a GCS object backend + a cell-endpoint registry + GET /api/internal/cells, and the worker has the resolve-by-URN read path (references-in-state behavior, flatten_single_tool_result; closes OQ6) plus fixes B and B1. Every flag is default-off → inert in prod until a future rollout enables the read path. #104 stays OPEN (Phase D minting flip remains).

Dependency / version analysis. No new crate publish was needed. server#262 does not touch Cargo.toml and resolves noetl-locator 0.1.1 (Phase B) from crates.io; worker#128's only manifest change is adding the published arrow = "53" direct dep and it resolves noetl-tools 3.16.0 (Phase B) from the registry. No git/branch deps anywhere → no downstream repoint. Lockfile confirms noetl-tools 3.16.0 / noetl-locator 0.1.1 / arrow 53.4.1.

What landed.

  • server#262 (squash 082a955) → semantic-release noetl-server v3.42.0 (c2d5ca9). GCS object backend + cell registry + GET /api/internal/cells.
  • worker#128 (squash 379bf31) → semantic-release noetl-worker v5.42.0 (7971041). Resolve-by-URN read path + fixes B/B1; 38 unit tests.
  • e2e#77 (squash 39dc880) — 3-pass resolve-by-URN rig + fixture + fake-gcs manifest. No semantic-release (pointer-only).

ai-meta pointers bumped. repos/server → c2d5ca9 (v3.42.0), repos/worker → 7971041 (v5.42.0), repos/e2e → 39dc880. Wiki: Home + Sessions-Log + Releases + ecosystem-map cells + Umbrella-Event-WAL-Storage (Phase C → MERGED, OQ6 resolved).

No prod deploy — Phase C ships default-off and reaches prod on a later rollout. PROD GKE untouched.


2026-06-22 — 🪶 #104 Phase B MERGED — shadow Feather result tier

Headline. Merged the 5 #104 Phase B PRs in dependency order and bumped the ai-meta pointers. The shadow Feather result-DATA tier now lives on main across worker / tools / server / ops / e2e, with every flag default-off → inert in prod until a future rollout. #104 stays OPEN (Phases C–F remain).

What landed (dependency order).

  1. noetl/tools#77 → noetl-tools v3.16.0 (740a2c2 merge → release 7da39d8) + new noetl-locator v0.1.1 published to crates.io. The PR adds the additive ResultCoordinates::parse / from_locator (the inverse of logical_uri) to the slim, dependency-free noetl-locator member. Release mechanics correction applied pre-merge: semantic-release's prepareCmd bumps only the root Cargo.toml, and the publish step skips noetl-locator when its version is already on crates.io (0.1.0, Phase A) — so a member version bump is required for the locator to publish. Bumped noetl-locator/Cargo.toml 0.1.0→0.1.1 (additive → patch keeps the root ^0.1 dep + server's ^0.1.0 resolving the new version) and pushed to the PR branch before merging. Confirmed on crates.io: noetl-locator 0.1.1 + noetl-tools 3.16.0.
  2. No downstream repoint needed. Verified before merging server/worker: neither carries a temporary git/branch dep on unpublished code. noetl/server resolves noetl-locator ^0.1.0 from the registry (registry+…crates.io-index, doesn't use the new API); noetl/worker resolves noetl-tools ^3.14.2 from the registry and uses a self-contained local coords_from_uri inversion (the swap to from_locator is the explicitly-deferred Phase-B follow-up). No [patch], no git =, no branch =. The task's conditional repoint did not fire.
  3. noetl/server#261 → noetl-server v3.41.0 (48ad318 merge → release 4a6659e). Ensures the sibling noetl_result_materializer durable consumer at stream-birth (own ack cursor); no new deps, control plane stays slim.
  4. noetl/worker#127 → noetl-worker v5.41.0 (c1adb7f merge → release 4b1c15b). The src/result_materializer.rs consume-loop writes the over-budget result (tabular → Arrow Feather, non-tabular → JSON, small → inline no-op) to the derived §7 key alongside noetl.result_store; gated NOETL_RESULT_MATERIALIZER_ENABLED (default off → not spawned → no-op).
  5. noetl/ops#203 (c92753c) — NOETL_RESULT_MATERIALIZER_ENABLED (default false) + single-cell seed NOETL_RESULT_CELL_ENV/REGION/CELL on the kind system-pool deployment; prod manifests untouched.
  6. noetl/e2e#76 (04c3332) — kind_validate_result_materializer.sh + test_large_tabular_result.yaml (the flag-on Feather/JSON + flag-off Δ0 rig, validated pre-merge).
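The release-mechanics note in step 1 relies on two facts: a member crate only publishes when its version is new, and a patch bump stays inside a `^0.1` or `^0.1.0` requirement. The sketch below models Cargo's caret rule for plain numeric versions; it ignores pre-release and build metadata, and both helper names are hypothetical.

```python
def _parts(text):
    return [int(p) for p in text.split(".")]


def caret_matches(requirement, version):
    """Caret rule: the leftmost non-zero component of the requirement may not change."""
    req = _parts(requirement)
    # Index of the component that bounds the range: first non-zero, else the last given.
    idx = next((i for i, p in enumerate(req) if p != 0), len(req) - 1)
    lower = tuple(req + [0] * (3 - len(req)))
    upper = tuple(req[:idx] + [req[idx] + 1] + [0] * (2 - idx))
    ver = _parts(version)
    ver = tuple(ver + [0] * (3 - len(ver)))
    return lower <= ver < upper


def needs_publish(version, published_versions):
    # The publish step skips a member whose version already exists on the registry.
    return version not in published_versions


print(needs_publish("0.1.0", {"0.1.0"}))   # skipped without a member bump
print(needs_publish("0.1.1", {"0.1.0"}))   # publishes after the bump
print(caret_matches("0.1", "0.1.1"), caret_matches("0.1.0", "0.1.1"))
print(caret_matches("0.1", "0.2.0"))       # a minor bump on 0.x would break resolution
```

This is why the additive change was shipped as 0.1.1 rather than 0.2.0: both existing requirements pick it up without any downstream manifest edit.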

Validation. All 5 PRs were kind-validated upstream pre-merge (gate-ON: tabular → real Arrow .feather 269 KB, non-tabular → .json, flag-off Δ0, event-materializer sole-writer intact every leg). No prod deploy this session — repo-only; the tier reaches prod only on a future rollout that enables the flag.

Pointers bumped (ai-meta). repos/tools → 7da39d8, repos/server → 4a6659e, repos/worker → 4b1c15b, repos/ops → c92753c, repos/e2e → 04c3332. Wiki: Home + Sessions-Log + Releases + Umbrella-Event-WAL-Storage (Phase B → MERGED).


2026-06-22 — 🪶 #104 Phase B implemented + kind-validated — shadow Feather result tier (in review, flag default-off)

Headline. Built the result-DATA half of #104: a separate noetl_events consume-loop on the system pool (noetl_result_materializer, own ack cursor) that resolves an over-budget result's payload and shadow-writes the body to the derived §7 object key — tabular → Arrow Feather, non-tabular → JSON (OQ3), small → inline no-op — alongside noetl.result_store. Nothing reads it until Phase C. Gated behind NOETL_RESULT_MATERIALIZER_ENABLED (default off → true no-op). #104 stays OPEN (C–F remain). No prod change, no pointer bump (PRs unmerged).

What landed (PRs in review).

  • worker worker#127 — src/result_materializer.rs: the shadow consume-loop. Tiering mirrors the worker's arrow_codec (detects a tabular rowset top-level or under the conventional data.<tool> envelope); keep-every attempt URN (OQ1); single-cell seed (RFC §4.3); never alters the authoritative result, never fails an event (errors counted + acked); metrics noetl_worker_result_materializer_*. 12 unit tests + 236 lib + clippy.
  • tools tools#77 — ResultCoordinates::parse/from_locator (single-source URI→coords).
  • server server#261 — ensure the sibling noetl_result_materializer consumer (no new deps).
  • ops ops#203 — system-pool flag + single-cell seed (default off; prod manifests untouched).
  • e2e e2e#76 — kind_validate_result_materializer.sh + a tabular over-budget fixture.
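The three-way tiering rule from the worker bullet (small → inline, tabular → Feather, non-tabular → JSON) is sketched below. It is a hypothetical Python model, not the worker's arrow_codec: the rowset test (a non-empty list of dicts sharing one key set), the assumption that the rowset sits directly at data.<tool>, and the byte budget are all illustrative.

```python
import json


def is_rowset(value):
    """Assumed shape of a tabular rowset: non-empty list of dicts with equal keys."""
    return (isinstance(value, list) and len(value) > 0
            and all(isinstance(row, dict) for row in value)
            and all(row.keys() == value[0].keys() for row in value))


def find_rowset(payload):
    # Top-level rowset, or one level down under the data.<tool> envelope.
    if is_rowset(payload):
        return payload
    data = payload.get("data") if isinstance(payload, dict) else None
    if isinstance(data, dict):
        for value in data.values():
            if is_rowset(value):
                return value
    return None


def decide_tier(payload, budget_bytes):
    size = len(json.dumps(payload).encode("utf-8"))
    if size <= budget_bytes:
        return "inline"   # small result: no tier object is written
    return "feather" if find_rowset(payload) is not None else "json"


rows = [{"id": i, "name": f"row-{i}"} for i in range(1200)]
print(decide_tier({"ok": True}, budget_bytes=1024))
print(decide_tier({"data": {"start": rows}}, budget_bytes=1024))
print(decide_tier({"blob": "x" * 5000}, budget_bytes=1024))
```

Keeping the decision in one pure function is what allows two writers to produce the same object for the same payload.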

Validation. Local kind under the prod-exact off-server gate (PUBLISH_ONLY + off-server drive + materializer sole-writer), server + system-pool on Phase B images: flag-on tabular → real Arrow .feather (269 KB, Arrow magic) at …/cell=local-0/shard=s0053/…/results/start/0/0/1.feather, non-tabular → .json; flag-off → 0 objects (true no-op); both COMPLETED; event-materializer sole-writer invariants intact every leg (event_rows==distinct, catalog0=0, orchestrate=0, mat dup=0). Cluster restored to baseline (images reverted, env removed, server healthy v3.39.5).

Decisions. OQ1 = keep-every; OQ3 = JSON; OQ6's multi-cell registry + miss behaviour deferred to Phase C (one cell ⇒ no miss); object-store backend = the shipped server-mediated PUT /api/internal/objects/{key} seam. No new blocker before Phase C.


2026-06-22 — 📦 #104 Phase A MERGED — slim noetl-locator 0.1.0 extracted + published; server accepts the canonical result URI (flag default-off, repo-only)

Headline. Executed the #104 Phase A merge sequence end-to-end — extracted the slim, dependency-free noetl-locator crate, published noetl-locator 0.1.0 to crates.io, and merged the server accept hook behind a default-off flag. No prod deploy (Phase A reaches prod on a later server rollout).

What landed.

  1. tools#76 — squash-merged with a feat(locator): subject (the PR was titled refactor(locator):, which semantic-release treats as non-bumping → would have published no release). The feat subject cut noetl-tools v3.15.0 (dc0c5d8) and the release CI published the new workspace member noetl-locator 0.1.0 to crates.io (member-publish ordering: locator before the root crate, same shape as noetl-directives). The crate is pure std — ResourceLocator, ResultCoordinates, shard_key, CellPlacement, legacy-ref parse — re-exported as noetl_tools::locator so the worker stamp path is unchanged.
  2. server#260 — repointed the temporary git-dep (noetl-locator = { git = …, branch = … }) to noetl-locator = "0.1.0", re-resolved the lockfile off crates.io, confirmed cargo build/cargo tree resolve from the registry and the heavy graph stays absent from the control plane (duckdb / kube / arrow / tonic / rhai / gcp_auth = 0 occurrences), 623 server tests green; committed (92fbceb) + pushed, then squash-merged → noetl-server v3.40.0 (c89d078). The server accepts the canonical result URI behind NOETL_RESULT_URI_ACCEPT (default off / no-op).
  3. e2e#75 — squash-merged the Phase A kind validation rig (eeca8b7).
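The retitle in step 1 matters because the squash subject alone decides whether a release is cut. The sketch below assumes semantic-release's default commit-analyzer mapping (feat → minor, fix/perf → patch, other types → no release); the repository's actual release configuration is not shown in this log, breaking-change handling is omitted, and the example subjects are made up.

```python
import re

# Assumed default mapping; any other commit type produces no release.
RELEASE_BY_TYPE = {"feat": "minor", "fix": "patch", "perf": "patch"}

SUBJECT = re.compile(r"^(\w+)(?:\([^)]*\))?:\s")


def release_type(subject):
    """Return the bump a conventional-commit subject would trigger, or None."""
    match = SUBJECT.match(subject)
    if match is None:
        return None
    return RELEASE_BY_TYPE.get(match.group(1))


print(release_type("refactor(locator): extract slim crate"))  # no release
print(release_type("feat(locator): extract slim crate"))      # minor release
```

Under this mapping a `refactor(...)` subject would have merged the code without publishing the new member crate, which is the failure the retitle avoided.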

Resolves OQ7 (the umbrella's open question — noetl-tools dragged its whole dependency graph onto the control plane just to parse a URI).

ai-meta change set. Pointers bumped repos/tools → dc0c5d8, repos/server → c89d078, repos/e2e → eeca8b7; wiki Home + Sessions-Log + Releases + ecosystem-map + Umbrella-Event-WAL-Storage (Phase A → MERGED). #104 stays OPEN (Phases B–F ahead).


2026-06-22 — 🐘 #95 CLOSED — postgres pg_value_to_json temporal/identity serialization shipped to prod

Headline. Shipped the #95 postgres timestamp fix to prod end-to-end (same pattern as the #127 worker ship): merge tools#75 → release noetl-tools v3.14.2 → bump the worker pin → build + roll the worker to prod under the live off-server CQRS cutover → close #95.

Root cause. noetl_tools::tools::postgres::pg_value_to_json probed i64/i32/f64/bool/String/serde_json::Value/DateTime<Utc> and fell through to Value::Null for everything else. A timestamp without time zone column (chrono NaiveDateTime) hit that fall-through and serialized to null even though the value was present — the #95 repro, where auth.sessions.expires_at came back null and tripped the gateway's expires_at validation. timestamptz (DateTime<Utc>) already had an arm and was unaffected.

Fix (tools#75). Added arms for the temporal + identity types that shared the gap: timestamp → NaiveDateTime (ISO-8601 with NO offset suffix — a tz-naive value carries no zone), date → NaiveDate, time → NaiveTime, uuid → hyphenated lowercase string, numeric/decimal → exact decimal string via a direct lossless decode of the postgres numeric binary wire format (matches the duckdb decimal-as-string convention), bytea → base64. 409 lib tests + clippy clean + a kind-gated live-postgres before/after integration test (tests/postgres_temporal_kind.rs, behind NOETL_PG_KIND_DSN) proving the pre-fix arm nulled timestamp/date/time/uuid/numeric/bytea while PostgresTool now returns the real values.
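The type-to-JSON mapping in the fix can be illustrated with Python's standard types standing in for the chrono and postgres types. This is an analogy, not the Rust implementation: the function below is a hypothetical Python counterpart, and raising on an unknown type is a choice made for this sketch to show why a silent fall-through to null is risky.

```python
import base64
import uuid
from datetime import date, datetime, time
from decimal import Decimal


def pg_value_to_json(value):
    """Map a decoded column value to a JSON-safe value without losing data."""
    if value is None or isinstance(value, (bool, int, float, str)):
        return value
    if isinstance(value, datetime):   # checked before date: datetime subclasses date
        return value.isoformat()      # a naive value carries no offset suffix
    if isinstance(value, (date, time)):
        return value.isoformat()
    if isinstance(value, uuid.UUID):
        return str(value)             # hyphenated lowercase
    if isinstance(value, Decimal):
        return format(value, "f")     # exact decimal string, no float round-trip
    if isinstance(value, (bytes, bytearray)):
        return base64.b64encode(bytes(value)).decode("ascii")
    # Surface the gap instead of returning null for a value that is present.
    raise TypeError(f"unsupported column type: {type(value).__name__}")


print(pg_value_to_json(datetime(2026, 6, 22, 12, 0, 0)))
print(pg_value_to_json(Decimal("12345678901234567890.123456789")))
print(pg_value_to_json(b"\x00\xff"))
```

The #95 symptom corresponds to the first case: a naive timestamp that is present in the row but has no matching arm.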

What landed.

  • tools#75 squash-merged 06302ac → semantic-release noetl-tools v3.14.2 (6d9b674), published to crates.io.
  • worker#126 bumps the noetl-tools pin 3.14.1 → 3.14.2 — deps-only, no worker source change; cargo check + cargo clippy --all-targets clean. Squash 60a849d → semantic-release noetl-worker v5.40.5 (da24952).
  • Built the prod worker image via Cloud Build us-central1 — tag noetl-worker-rust:v5.40.5, digest @sha256:45212dbe7410920aaa6311074bb9ef78f161c369dbd995364cda2ecfda2f0af2 (--machine-type=e2-highcpu-8 --timeout=5400s per the #127 cold-musl-build lesson; ~24 min).
  • Rolled by digest onto prod noetl-worker-rust + noetl-worker-system-pool via kubectl set image (server stays v3.39.5 @feaac0c5; worker CPU req 250m / limit 2 preserved).

Rollout health (live). Rolling restart clean — noetl-worker-rust 2/2 + noetl-worker-system-pool 1/1 Ready, 0 crashloop/restarts after a ~3-min soak, both on the new digest, resources preserved. Off-server CQRS cutover stayed healthy: materializer started (ack-after-materialize, sole noetl.event writer), noetl_worker_nats_consumer_pending{consumer="noetl_materializer"}=0, system command lag (NOETL_COMMANDS_RUST)=0, materializer_project_errors=0, duplicates=0, and the system-pool WAL index rehydrated from the retained noetl_events WAL (indexed_executions=5, wal_events=198, durable=false — the #119 ephemeral rebuild). Server untouched. No rollback needed.

Prod fix confirmation. Relied on the kind-gated before/after integration test (postgres_temporal_kind.rs) rather than running a new postgres playbook against prod — the test already proves the before/after on a live postgres, and running a fresh playbook would touch prod data. Rollout was clean.

Pointers. ai-meta repos/tools → 6d9b674 (v3.14.2) + repos/worker → da24952 (v5.40.5); wiki Home/Sessions-Log/Releases + ecosystem-map cells updated; board 3 → Done; #95 closed.


2026-06-22 — 📐 #104 RFC upgraded for review (Event WAL + derivable result storage)

Headline. With the event-WAL half now live on prod, upgraded the #104 umbrella page from a tracking dashboard into a full RFC and re-scoped it to the remaining result-DATA half. Design-only — no prod changes, no code, no pointer bumps to runtime repos.

What landed.

  • RFC rewritten at Umbrella: Event WAL Storage (matching the #115 RFC format): numbered sections, a decided/proposed/open tenet table, NATS-as-WAL (live), convention URNs (locator shipped, adoption proposed), the Feather result tier, correctness, red-team open questions, a phased plan (A–F), and a §9 dependency/sequencing analysis vs #115 / #107 / #101.
  • Grounded in shipped code: the event half is live (materializer.rs sole writer of noetl.event; state_builder.rs chain-walk from expected_head; ChainHeads.link_batch prev_event_id; off-server CQRS cutover v3.39.5 on prod). The result half is not: result bytes still live in Postgres noetl.result_store.data (JSONB), the logical reference.uri is stamped but unconsumed, and no production object-store Feather writer exists (only the reference-materializer WASM worked-example proving the object_put capability path).
  • Sequencing call: #104's result-data tier finishes #107 program step 3 (demote the result store, as the event store was already demoted) and is on the critical path for #107 steps 4–5.
  • Home.md Active-umbrella #104 row + Last-refreshed headline updated; board 3 stays In progress (RFC under review, no lifecycle change).
  • Summary comment + review request posted on #104.

Pointers. Wiki: Umbrella-Event-WAL-Storage.md, Home.md, Sessions-Log.md. Issue: #104. Blueprint it reconciles: event_wal_and_derivable_storage.md.


2026-06-22 — 🔧 Merge release-server CI fix (repo-only, no version cut)

Headline. Merged the release-server workflow fix that unblocks the crate publish pipeline. CI-only change — semantic-release cut no new version (latest stays v3.39.6), so no repos/server pointer bump.

What landed.

  • server#259 merged (squash be094a2) — fixes the failing release-server workflow. cargo publish of noetl-server failed on the bare path dep to noetl-orchestrate-core ("all dependencies must have a version requirement specified when publishing"). The fix adds version = "0.1.0" alongside the path in Cargo.toml and splits the publish step to publish the workspace member noetl-orchestrate-core first (idempotent skip-if-published), mirroring the noetl-tools/noetl-directives pattern (noetl/ai-meta#92, #108).
  • No version cut. The merge's ci(release): commit is non-bumping; Semantic Release logged "Analysis of 1 commits complete: no release". Latest release remains v3.39.6. No pointer bump, no prod touch.
  • Validation. release.yml fetched from the PR head parses clean (4 jobs: verify-version / publish-crate / publish-image / github-release); the author verified cargo publish resolution locally via dry-run. Real end-to-end validation is the next actual release-server workflow_dispatch run — not exercised here.

Why it matters. The last two release-server runs failed at the publish step; this unblocks the crate publish pipeline for the next server release.


2026-06-22 — 🧹 CQRS / off-server program close-out + tracking reconciliation (repo-only)

Headline. Closed the loop on the now-shipped CQRS / off-server-drive program and reconciled the drifted tracking surfaces. No prod changes.

What landed.

  • e2e#74 merged (squash 1deadf1) — scripts/prod_regression_validate.py, the prod-scoped Rust regression validator for the CQRS gate-ON cutover; validated 28/30 against live prod. ai-meta repos/e2e pointer bumped to 1deadf1. This was the last open artifact of the live CQRS cutover.
  • Closed #103 (Step 2 — CQRS event log): prod cutover live + validated (server v3.39.5 PUBLISH_ONLY+offserver / worker v5.40.x Rust materializer sole-writer), e2e#74 was its last piece.
  • Closed #102 (batch event-log writes): server#198/#199 landed; worker-side superseded by #103. Closed #101 (incremental state / results-by-reference): acceptance met + kind-validated; consume side reassigned to #115 Phase 1. No residual on either.
  • Closed superseded Python ops PRs #189/#190/#192 (system/event_materializer + system/projector playbooks/CronJobs): the cutover shipped the Rust in-process worker materializer (noetl-worker/src/materializer.rs) instead.
  • Archived handoffs: 2026-06-18-orchestrate-plugin-dissolution (round-02 complete) + 2026-06-09-rust-stack-session-snapshot (stale read-only orientation, superseded).
  • Reconciled roadmap board 3 (added the open umbrellas missing from it) and Home.md Active-umbrellas (moved closed #101/#102/#103 to Recently closed).
  • Fixed the failing release-server CI in noetl/server (cargo publish failed — noetl-orchestrate-core path dep had no version): added version = "0.1.0" alongside the path + the release workflow now publishes the member crate first, mirroring how noetl/tools handles noetl-directives.

2026-06-22 — ✅ #127 CLOSED — task_sequence per-sub-task context optimization merged, released, and shipped to prod

Headline. Landed the structural half of #127: the behavior-preserving task_sequence per-sub-task context optimization is merged, released to crates.io, adopted by the worker, built into a prod image, and rolled onto prod under the live off-server CQRS cutover. The code-opt now compounds with the CPU-limit bump applied earlier this session — more headroom AND less work per slot on the batch hot path. #127 fully closed.

What landed.

  • noetl-tools v3.14.1 — merged tools#74 (squash 9dd9aa6); semantic-release cut c8656c1 and the release-tools workflow published the crate to crates.io. The task_sequence drain rebuilt the template context per sub-task (running_ctx.clone() + 2–4× to_template_context() deep-clones + per-block ExecutionContext clones + a fresh context_to_value() per templated field). TemplateEngine::render_value now builds the proxied minijinja context ONCE and threads it through the recursion (render_value_with / render_with; minijinja Value is Arc-backed → reuse is a refcount bump — helps every tool dispatch); new build_context_with_overlay(&variables, overlay) builds straight from &variables + a small overlay, skipping the intermediate to_template_context() HashMap deep-clone + per-block ExecutionContext clones in the set/policy paths. Isolated micro-bench (CPU held constant): per-sub-task context cost 2988.9µs→1147.1µs (−61.6%, 2.6×). 407 lib tests + 2 new equivalence pins + clippy clean.
  • noetl-worker v5.40.4: merged worker#125 (squash 1a10a73); semantic-release cut 0afbf5c. Cargo.lock pin noetl-tools 3.14 → 3.14.1; deps-only, no worker source change; cargo clippy --all-targets clean (no new warnings vs baseline).
  • Prod rollout — built the worker image via Cloud Build (us-central1-docker.pkg.dev/noetl-demo-19700101/noetl/noetl-worker-rust:0afbf5c) and rolled it onto prod noetl-worker-rust + noetl-worker-system-pool. Server left on v3.39.5/.6 (worker-only change). Worker CPU req 250m / limit 2 kept. Rolling restart clean (pods Ready, 0 crashloop) and the off-server CQRS cutover stayed healthy throughout — materializer sole-writer projected==acked, lag ~0, command consumer lag 0, executions completing, and the system-pool rehydrated its WAL index on restart (#119 proof).
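Sketch (illustrative only). The build-the-context-once idea from the tools#74 bullet above, reduced to a toy renderer in Python. SketchRenderer, its methods, and the flat {{ name }} template syntax are all hypothetical stand-ins; the counter only shows how many times the (expensive) context build runs in each shape.

```python
import re

_VAR = re.compile(r"\{\{\s*(\w+)\s*\}\}")


class SketchRenderer:
    """Toy template renderer that counts how often a context is built."""

    def __init__(self):
        self.context_builds = 0

    def _build_context(self, variables):
        # Stands in for the expensive deep-clone of the variable map.
        self.context_builds += 1
        return dict(variables)

    def _render_str(self, text, ctx):
        return _VAR.sub(lambda m: str(ctx[m.group(1)]), text)

    def render_value_rebuilding(self, value, variables):
        # Costly shape: every templated leaf builds its own context.
        if isinstance(value, dict):
            return {k: self.render_value_rebuilding(v, variables)
                    for k, v in value.items()}
        if isinstance(value, list):
            return [self.render_value_rebuilding(v, variables) for v in value]
        if isinstance(value, str):
            return self._render_str(value, self._build_context(variables))
        return value

    def render_value(self, value, variables):
        # Cheaper shape: build once, thread the same context through.
        return self._render_with(value, self._build_context(variables))

    def _render_with(self, value, ctx):
        if isinstance(value, dict):
            return {k: self._render_with(v, ctx) for k, v in value.items()}
        if isinstance(value, list):
            return [self._render_with(v, ctx) for v in value]
        if isinstance(value, str):
            return self._render_str(value, ctx)
        return value
```

Both paths produce the same rendered value; only the number of context builds differs (one per templated leaf versus one per call), which is the equivalence the two new pins guard.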

Pointers + bookkeeping. ai-meta repos/tools → c8656c1 + repos/worker → 0afbf5c; Home dashboard + this log + Releases + ecosystem map updated; #127 closed + board 3 → Done. PROD gate env vars / server image / DB untouched.


2026-06-22 — ⚡ #127 PROD WORKER CPU LIMIT RAISED 1→2 + APPLIED LIVE — ~20% batch-throughput win materialized on prod

Headline. Raised the CPU limit 1→2 (request 100m→250m) on both prod Rust worker deployments and applied it to live prod, so the ~20% batch-throughput win profiled on kind actually lands in production. ops#202 (squash 85e3c23); ai-meta repos/ops pointer bumped.

Why. #127 — batch PFT throughput plateaus because the workers are CPU-throttled. Kind profiling of a 10k-patient batch showed the Rust step workers peg one CPU and burn 38–47% of their time CPU-throttled at the 1-CPU limit (the per-sub-task running_ctx.clone() + to_template_context() is the hot path). Lifting the limit to 2 CPU removed the throttle and cut the run ~166–172s → ~137.6s (~20%), zero patient loss.

Prod baseline (verified read-only first). Live prod worker limit was already 1 CPU (kind had measured 0.5 — confirmed prod's real value before changing anything): noetl-worker-rust (2 replicas, KEDA min2/max20) and noetl-worker-system-pool (1 replica) both at request 100m / limit 1. Nodes: 4× ek-standard-16, ~15.89 CPU allocatable each (~63.5 total), running ~2% CPU — ample headroom (KEDA-max 20×2 = 40 CPU still fits; requests stay tiny at 20×250m = 5 CPU). The cluster already uses a low-request/high-limit bursting model (server-rust is 250m/2); the new worker values mirror that ratio.
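Worked arithmetic (a sketch). The headroom check above, restated as a small Python helper so the numbers can be re-run if node count or KEDA max changes. The function name cpu_headroom is hypothetical, and it only accounts for the shared worker pool at KEDA max, not the system-pool or server pods.

```python
def cpu_headroom(nodes, allocatable_per_node, max_replicas, request_cpu, limit_cpu):
    """Compare worst-case CPU requests/limits against cluster allocatable CPU."""
    total = nodes * allocatable_per_node
    limits = max_replicas * limit_cpu
    return {
        "allocatable_total": total,
        "requests_at_max": max_replicas * request_cpu,
        "limits_at_max": limits,
        "limits_fit": limits <= total,
    }


# Values from the prod baseline: 4 nodes at ~15.89 CPU, KEDA max 20,
# new worker settings request 250m / limit 2.
prod = cpu_headroom(nodes=4, allocatable_per_node=15.89,
                    max_replicas=20, request_cpu=0.25, limit_cpu=2)
```

With these inputs the limits at KEDA max sum to 40 CPU against roughly 63.5 allocatable, and the scheduler-visible requests stay at 5 CPU.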

Chosen value + rationale. request 250m (a real guaranteed floor, matching server-rust) / limit 2 (the throttle relief = the kind-validated win). The throttle ceiling is the limit, not the request. Applied to both worker pools — the system-pool is a single pod carrying the off-server orchestrate drive + sole-writer CQRS materializer, so it pegs CPU the same way under a large batch; raising its limit is pure upside (request stays low).

Prod apply (live action, monitored). kubectl set resources on both deployments (surgical: touches only CPU req/limit, leaves image, env, gate vars, memory untouched). Rolling restart completed cleanly: all pods Ready, 0 restarts/crashloops, new limits live (req=250m lim=2). Off-server CQRS cutover stayed healthy throughout: pre/post materializer lag = 0 (noetl_materializer consumer), command-consumer lag = 0, and the restarted system-pool pod rehydrated its WAL index (noetl_worker_state_builder_indexed_executions > 0, the #119 restart-rehydration proof). No revert needed; the armed revert (restore prior limits) was kept ready but unused. PROD gate env vars / images / DB untouched.

Perf confirmation. Relying on the kind-measured ~20% reference (~166–172s → ~137.6s); no prod benchmark run this session (the change is a resource bump validated upstream + a clean live rollout).

Pointers.

  • ops PR: ops#202 → 85e3c23
  • Manifests: ci/manifests/noetl/worker-rust-deployment-prod.yaml, worker-system-pool-deployment-prod.yaml
  • Issue: #127 (stays OPEN — the deeper per-sub-task context-clone hot path is the structural follow-up).

2026-06-22 — ✅ #123 SHIPPED + CLOSED — a non-iterable loop in: now fails loudly instead of silently wedging (server v3.39.6)

Headline. Merged the reviewed fix for the silent off-server commands=0 wedge on a non-iterable loop in: (server#258, squash 275b914 → release v3.39.6 7f109a9). A loop step whose in: expression rendered to a non-iterable value (e.g. loop: { in: '{{ workload.batch_slots }}' } when batch_slots is absent → null) silently wedged the execution at commands=0 / RUNNING-forever under the prod-default off-server drive — looking identical to a hang. (This observability gap produced the false-positive #122 during the #120 2×2 repro.)

Root cause. evaluate_loop already returns CoreError::Validation("… did not evaluate to an iterable"), and the in-process drive already turned that into a terminal playbook.failed. But under the off-server drive (NOETL_ORCHESTRATE_PLUGIN_DRIVE=true) the system/orchestrate wasm plug-in returns a structured {"error":…} envelope instead of an OrchestrationResult; apply_worker_orchestration couldn't decode that as a result, logged a WARN, recorded decode_error, and returned Ok(0) — no terminal event, so the run sat RUNNING forever. The v3.4.2 explicit validation error was lost when the drive moved off-server.

Fix. Server apply_worker_orchestration now decodes the drive ERROR envelope (decode_orchestrate_error) and emits a terminal playbook.failed (metric noetl_orchestrate_drive_total{stage="drive_error"}, structured execution_id), matching the in-process drive — a transient decode miss (no envelope) stays on the benign re-drive path. orchestrate-core prefixes the offending step name onto the existing evaluate_loop error (prefix_loop_step_error). An empty iterable ([]/{}) still short-circuits to next — only a non-iterable errors.
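Sketch (illustrative, not the server#258 code). The two decisions described above, in Python: how a loop in: value is classified, and how a drive output is turned into an outcome. All names are hypothetical, as are the "commands" key and the assumption that only lists and mappings count as iterable; the error text follows the message quoted in this entry's validation notes.

```python
class LoopValidationError(ValueError):
    """Raised when a loop `in:` expression renders to a non-iterable."""


def evaluate_loop_in(step_name, expression, rendered):
    # Assumption: lists and mappings are the iterable shapes.
    if isinstance(rendered, (list, dict)):
        if len(rendered) == 0:
            return ("next", [])  # empty iterable short-circuits
        items = list(rendered.items()) if isinstance(rendered, dict) else list(rendered)
        return ("iterate", items)
    got = "null" if rendered is None else type(rendered).__name__
    raise LoopValidationError(
        f"loop step '{step_name}': Loop expression '{expression}' "
        f"did not evaluate to an iterable (got {got})"
    )


def apply_drive_output(output):
    """Map a drive output to an outcome; never swallow an error envelope."""
    if isinstance(output, dict) and "error" in output:
        # Structured error envelope: emit a terminal failure.
        return {"event": "playbook.failed", "error": output["error"]}
    if isinstance(output, dict) and "commands" in output:
        return {"event": "applied", "commands": len(output["commands"])}
    # Transient decode miss (no envelope): stay on the benign re-drive path.
    return {"event": "redrive"}
```

The point of the second function is the ordering: the error envelope is checked before the result decode, so a validation error can no longer fall through to a silent zero-command return.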

Validation. 600 server + 135 orchestrate-core tests + clippy clean. Kind-validated prod-exact (PLUGIN_DRIVE=true + PUBLISH_ONLY=true + STATE_BUILDER=offserver): an absent workload.batch_slots → FAILED with loop step 'process': Loop expression '{{ workload.batch_slots }}' did not evaluate to an iterable (got null); a valid [1,2,3] loop still COMPLETED 3-way; #120 barrier and #124 binding behavior unaffected; 0 pod restarts.

Wrap-up. Code-only fix — server-only (worker stays v5.40.3); ships to prod on a future server rollout. PROD was NOT redeployed this session — it stays healthy on v3.39.5 + the live off-server cutover. ai-meta repos/server pointer bumped → v3.39.6; #123 closed (deliberate Closes #123) + roadmap board 3 → Done. #127 (batch PFT throughput plateau) stays OPEN — separate perf follow-up.

Pointers. PR server#258 (squash 275b914 → release 7f109a9 = v3.39.6) · ai-meta umbrella #123 · Umbrella-Decoupled-Context-Event-Chain.


2026-06-22 — ✅ #121 SECOND-HALF SHIPPED + CLOSED — off-server system/* WAL-chain wedge fully fixed (server v3.39.5); PROD off-server re-cutover (3rd attempt)

Headline. Merged the reviewed second-half fix for the off-server WAL-chain-incomplete wedge on system/* executions (server#257, squash 54ac277 → release v3.39.5 c421273). #256 (v3.39.4) was only the first half — a live-prod re-cutover on v3.39.4 still wedged system/scheduled_cleanup because the system-pool worker drives system executions off-server regardless of the server-side gate (STATE_BUILDER=offserver on the worker → the INSERT-not-publish chain leaves a NULL-prev orphan), so #121 was reopened. #257 gates both off-server-drive decision sites in trigger_orchestrator_inner on should_publish(catalog_id) (new pure is_system_path helper + unit test) so system/* execs fall through to server-built run_state; regular publishable execs keep the off-server path (preserves the #256 win). 612 tests + clippy green; kind full-gate before/after (system/scheduled_cleanup BEFORE 9× WAL chain incomplete → AFTER COMPLETED 0 loops; regular test/simple_loop still COMPLETED off-server). Server-only (worker stays v5.40.3).

Phase A — merge + cleanup. Squash-merged server#257 → semantic-release cut v3.39.5 (c421273); ai-meta repos/server pointer bumped to v3.39.5; ai-meta#121 closed (deliberate Closes #121) + roadmap board 3 → Done. Wiki dashboard (Home/Sessions-Log/Releases/ecosystem-map server cell) updated in the same change set.

Phase B — PROD rollout v3.39.5 + off-server re-cutover (3rd attempt) = ✅ CLEAN SUCCESS, LEFT ON. Re-verified the live baseline (server v3.39.4 @ef7536d4 + workers v5.40.3, safe in-server drive, 0 restarts, health green, lag 0). Built+pushed server v3.39.5 to the prod AR (server-rust@sha256:feaac0c5…, Cloud Build us-central1 E2_HIGHCPU_8, ~9 min). Rolled the server by digest (workers/system-pool kept v5.40.3; server reported v3.39.5, 0 restarts). Re-enabled the full off-server gate — server PUBLISH_ONLY=true+STATE_BUILDER=offserver, system-pool STATE_BUILDER=offserver + materializer on. Validation (tenant prod_cutover3_20260622, both previously-wedging classes exercised twice): 2 regular loop execs COMPLETED off-server (each 62 events, roots=1/dangling=0/rows==distinct; worker state_builder_event_scans=0 = never-scan) + 2 system/scheduled_cleanup execs COMPLETED server-built (each 13 events, roots=1/dangling=0; server state_build_event_scans=6 total = the system execs' intended noetl.event reads, by design, NOT a regression). 0 WAL chain incomplete lines across the soak (prior v3.39.4 attempt looped 37+), materializer sole-writer published==drained==acked=124, lag 0, 0 project errors, 0 pod restarts, health green. Per the decision criterion the gate was LEFT ON — this is the successful full off-server CQRS cutover on prod. Revert (not needed) stays one command away: kubectl -n noetl set env deploy/noetl-server-rust NOETL_EVENT_INGEST_PUBLISH_ONLY=false NOETL_STATE_BUILDER=server (+ system-pool STATE_BUILDER=server).

Phase C: pin the live cutover into the prod manifests (ops#201, squash 9c94de5). The Phase-B cutover was applied imperatively (kubectl set image / set env), so the committed prod manifests drifted from live: they still pinned the v3.39.1 server / v5.40.2 worker images and the pre-cutover gate-off config, so a future kubectl apply would have silently reverted the cutover. ops#201 rewrites the three prod manifests to match the running state field-by-field (verified against the live deployments before editing): server-rust-deployment-prod.yaml → image @feaac0c5 (v3.39.5) + NOETL_EVENT_INGEST_PUBLISH_ONLY=true + adds NOETL_STATE_BUILDER=offserver; worker-rust-deployment-prod.yaml → image @5e80493b (v5.40.3); worker-system-pool-deployment-prod.yaml → image @5e80493b (v5.40.3) + NOETL_MATERIALIZER_ENABLED=true + NOETL_STATE_BUILDER=server → offserver. Apply is now a no-op (all managed fields incl. replicas equal live; KEDA still owns worker-rust's count above its floor of 2, by design). No cluster apply was run, since the live state was already the target. ai-meta repos/ops pointer bumped d6633f6 → 9c94de5. The ops repo has no auto-apply CI, so merging only aligned source-of-truth. Refs noetl/ai-meta#107.

2026-06-21 — ✅ #121 FIRST-HALF SHIPPED (partial) — off-server WAL-chain-incomplete loop on system/ executions FIXED; live prod off-server-cutover wedge captured as the real-world repro

Headline. Merged the reviewed fix for the off-server WAL-chain-incomplete re-drive loop (server#256, squash 28b17cb → semantic release noetl-server v3.39.4 77aaa06). Server-only — the diff is confined to src/handlers/events.rs + src/state.rs (the server binary's HTTP claim handlers + the server-side ChainHeads struct); it touches no shared crate (orchestrate-core, noetl-tools) the worker consumes, so the worker / system-pool need no rebuild.

Two distinct defects.

  1. Orphaned command.claimed (NULL prev_event_id). The gate-off claim_command writes the command.claimed row with a raw in-tx INSERT (atomic with the claim), and handle_batch_events does the same for the gate-off batch path — both bypass the event_write::emit_events / ChainHeads link chokepoint, so prev_event_id was never stamped AND the per-execution chain head was never advanced. The next event (command.started) then linked back to command.issued, skipping the orphaned claim, so the off-server chain_walk_from hit a NULL-prev non-genesis head → build_spine_to Incomplete. This fired for every system/ playbook because should_publish is false for system executions even under a global PUBLISH_ONLY=true. Fixed by stamping prev_event_id via ChainHeads::link_batch on both gate-off INSERT paths (same advance-then-write ordering emit_events already uses).

  2. The actual loop. The stateless off-server drive builds state from the noetl_events WAL, but system-execution events INSERT to noetl.event and never enter the WAL, so the worker's WAL build could never complete → the drive returned the __offserver_retry__ no-op → the server reconciler re-drove → repeat. Fixed by gating the off-server WAL drive on should_publish(catalog_id); system/ executions drive server-built (read noetl.event).
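Sketch (illustrative, not the server#256 code). Defect 1 as a pointer walk, in Python: a claim row written without going through the link chokepoint leaves a NULL-prev non-genesis event, and a walk that starts there can never reach genesis. The class and function bodies are hypothetical simplifications, and event names stand in for event ids.

```python
class ChainHeads:
    """Per-execution chain head; link_batch stamps prev pointers and advances it."""

    def __init__(self):
        self._heads = {}

    def link_batch(self, execution_id, event_ids):
        prev = self._heads.get(execution_id)
        rows = []
        for event_id in event_ids:
            rows.append((event_id, prev))  # stamp prev_event_id
            prev = event_id
        self._heads[execution_id] = prev   # advance the head
        return rows


def walk_chain(prev_of, head, genesis):
    """Follow prev pointers from head; spine oldest-first, or None if incomplete."""
    spine, seen, cur = [], set(), head
    while cur is not None and cur in prev_of and cur not in seen:
        seen.add(cur)
        spine.append(cur)
        if cur == genesis:
            return spine[::-1]
        cur = prev_of[cur]
    return None  # hit a NULL prev (or a gap) before reaching genesis


def build_chain(link_claim):
    """Build one execution's chain with the claim either linked or orphaned."""
    heads, prev_of = ChainHeads(), {}
    prev_of.update(heads.link_batch("exec-1", ["playbook.started", "command.issued"]))
    if link_claim:
        prev_of.update(heads.link_batch("exec-1", ["command.claimed"]))
    else:
        prev_of["command.claimed"] = None  # raw INSERT: no prev, head not advanced
    prev_of.update(heads.link_batch("exec-1", ["command.started"]))
    return prev_of
```

With link_claim=False the walk from command.claimed returns None, and the walk from command.started silently skips the claim; with link_claim=True every event sits on one spine.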

Validation. 598 server tests pass + clippy clean (new state.rs unit test chain_head_claim_orphan_vs_linked models the orphan-vs-linked pointer walk). Kind, prod-exact (PUBLISH_ONLY=true + STATE_BUILDER=offserver, single replica): BEFORE — command.claimed prev=NULL, 17×+ WAL chain incomplete, wedged RUNNING 120s+; AFTER — chain fully linked (0 orphans), 0 loop lines, system/scheduled_cleanup COMPLETED in 6s; non-system off-server unaffected.

Live prod repro (read-only; no prod change made this session). The full off-server cutover was attempted on prod earlier (server v3.39.3 + off-server gate) and WEDGED: the drive applied counter froze while dispatched_offserver_stateless / offserver_retry climbed in lockstep to ~389, the system pool logged WAL chain incomplete; returning no-op for multiple executions (including prod's own 327130580493803520), all with prev_event_id=NULL. The armed revert (PUBLISH_ONLY=false + STATE_BUILDER=server) recovered cleanly. This is the real-world reproduction of #121 and confirms #121 is the blocker for the prod off-server cutover (a separate follow-up task). PROD GKE default (STATE_BUILDER=server) + all defaults untouched here.

Pointers. ai-meta repos/server pointer bumped to v3.39.4 77aaa06; wiki Home (Last-refreshed + Recently-closed + Ecosystem-map server cell + Sessions-log preview + Releases preview) + Releases v3.39.4 row + Umbrella-Decoupled-Context-Event-Chain Recent-activity row; roadmap board 3 #121 → Done. #123 (loop-non-iterable observability) + #127 (batch PFT throughput plateau) stay OPEN — separate, not resolved by this.


2026-06-21 — ✅ #125 + #126 SHIPPED + CLOSED — task_sequence control flow and http tool body data-shape fixed; 10×1000 batch pft_flow_test clean

Headline. Fixed the two Python→Rust regressions that blocked the batch pft_flow_test after #124 unblocked the forward-binding path. Both live in noetl-tools; the worker adopts them via a Cargo.lock bump.

#126 — http tool body data-shape (tools#72): the Rust http tool exposed the parsed response body under data.body, but playbooks written against the Python-era contract read output.data.data (or equivalently output.data for the body). The pft_flow_test save_batch step passed the http result to jsonb_to_recordset, which received a non-array value → Postgres error → zero rows written. Fix: expose the parsed body under data again; keep body as a back-compat alias. Squash 86f0216 → semantic release noetl-tools v3.13.1 (8dd0e1f).

#125 — task_sequence do: jump/break/retry (tools#73): the Rust task_sequence runner parsed do: but matched only "fail", so do: jump/break/retry were silently ignored — each slot ran its steps once and returned regardless of the directive. In pft_flow_test, the batch loop's do: jump never re-entered, leaving 16 batches done and the rest pending (patient loss across slots). Fix: honour do: jump/to: (loop to named step with infinite-jump guard), do: break (exit the sequence), do: retry (with configurable attempts/backoff). Squash 62d0948 → semantic release noetl-tools v3.14.0 (638c3c6).
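Sketch (illustrative, not the tools#73 code). A minimal sequence runner in Python that honours the four directives named above. The directive dict shape, the max_jumps guard value, and the attempts/backoff keys are assumptions made for the example.

```python
import time


def run_sequence(steps, state, max_jumps=100, sleep=time.sleep):
    """steps: list of (name, task); task(state) returns a directive dict or None."""
    index = {name: i for i, (name, _) in enumerate(steps)}
    retries = {}
    i = jumps = 0
    while i < len(steps):
        name, task = steps[i]
        directive = task(state) or {}
        action = directive.get("do", "continue")
        if action == "fail":
            raise RuntimeError(f"step '{name}' failed")
        if action == "break":
            break  # exit the whole sequence
        if action == "jump":
            jumps += 1
            if jumps > max_jumps:  # infinite-jump guard
                raise RuntimeError(f"jump limit {max_jumps} exceeded at '{name}'")
            i = index[directive["to"]]
            continue
        if action == "retry":
            retries[name] = retries.get(name, 0) + 1
            if retries[name] > directive.get("attempts", 3):
                raise RuntimeError(f"step '{name}' exhausted its retries")
            sleep(directive.get("backoff", 0.0))
            continue  # re-run the same step
        i += 1
    return state


def claim(state):
    # Claim the next pending batch, or leave the sequence when none remain.
    if not state["pending"]:
        return {"do": "break"}
    state["current"] = state["pending"].pop(0)


def process(state):
    state["done"].append(state["current"])
    return {"do": "jump", "to": "claim"}  # re-enter the batch loop
```

A runner that matches only "fail" would run claim and process once and return, which is the shape of the patient-loss symptom: one batch done per slot, the rest left pending.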

Worker adoption (worker#124): bumps the noetl-tools Cargo.toml pin 3.13 → 3.14 (Cargo.lock resolves 3.14.0) so the worker binary actually runs both fixes at runtime. No worker source change; clippy clean. Squash 87b85e8 → semantic release noetl-worker v5.40.3 (6dd3449).

Result. With tools#72 + tools#73 in place, the full 10×1000 = 10,000-patient batch pft_flow_test passes end-to-end on kind — zero patient loss, invariants clean (roots=1/dangling=0/forks=0, materializer sole-writer, never-scan).

Perf follow-up #127. The benchmark plateaus at ~60 patients/s (10k in ~166–172 s), roughly 3× slower than the Python ~54 s kind baseline. Throttled (50 req/s, 172 s) ≈ unthrottled (166–168 s) — the rate limiter is NOT the bottleneck. Capped by the NoETL pipeline (worker concurrency 4×2 + serial per-slot drain + off-server/publish-only per-step overhead). Correctness is clean; throughput is a perf-investigation task, not a tools bug.

Still open. #121 (orphaned-spine-head WAL-chain-incomplete loop) + #123 (loop-non-iterable observability) are separate; not resolved by this. PROD GKE + all defaults untouched.

Pointers. tools PR#72 86f0216 → release 8dd0e1f (v3.13.1) · tools PR#73 62d0948 → release 638c3c6 (v3.14.0) · worker PR#124 87b85e8 → release 6dd3449 (v5.40.3) · closes noetl/ai-meta#125


2026-06-21 — ✅ #124 SHIPPED + CLOSED — distributed task_sequence forward set:/sibling bindings no longer render empty (orchestrate-core)

Headline. Merged the already-reviewed #124 fix and landed the standard post-merge bookkeeping. noetl/server#255 (squash d53e095) fixes a distributed task_sequence forward-data-binding regression: inside a multi-tool (task_sequence) step, a later sub-task's templates that reference a value a prior sub-task produces at runtime — a forward set:, a policy-rule set:, or a sibling result — were rendered to empty at command-build time, before the worker's per-sub-task binding ran. orchestrate-core/src/commands.rs::render_pipeline_config preserved set/args/spec/command verbatim but rendered every other sub-task field (url/params/method/…) against the step-entry context under UndefinedBehavior::Chainable, so runtime-only references ({{ iter.* }}, sibling labels) silently collapsed to empty — concretely in pft_flow_test's batch path, claim_batch writes iter.data_type via its policy set:, then fetch_batch's url: …/api/v1/pft/batch/{{ iter.data_type }} pre-rendered to …/batch/ → 404 → 0 rows. Fix: new TemplateRenderer::render_value_deferring_unresolved renders only templates whose variable paths all resolve in the build-time context; any template referencing an unresolved path is preserved verbatim so the worker re-renders it against the per-sub-task running context (a superset). cargo test 134/134 + clippy clean; kind-verified (fetch_batch now hits the real per-type URLs). Semantic release → noetl-server v3.39.3 (365d3be); ai-meta pointer bumped + #124 closed + roadmap board 3 → Done. PROD GKE + all defaults untouched.
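Sketch (illustrative, not the server#255 code). The deferring render described above, in Python with a deliberately tiny template syntax: only {{ dotted.path }} references, no filters or expressions. The function names, the workload.base_url variable, and the example host are hypothetical.

```python
import re

_TEMPLATE = re.compile(r"\{\{\s*([A-Za-z_][\w.]*)\s*\}\}")


def _lookup(context, path):
    """Return (found, value) for a dotted path in nested dicts."""
    cur = context
    for part in path.split("."):
        if not isinstance(cur, dict) or part not in cur:
            return False, None
        cur = cur[part]
    return True, cur


def render_deferring_unresolved(value, context):
    """Render only strings whose every referenced path resolves; keep the rest verbatim."""
    if isinstance(value, dict):
        return {k: render_deferring_unresolved(v, context) for k, v in value.items()}
    if isinstance(value, list):
        return [render_deferring_unresolved(v, context) for v in value]
    if not isinstance(value, str):
        return value
    paths = _TEMPLATE.findall(value)
    if any(not _lookup(context, p)[0] for p in paths):
        return value  # defer: the worker re-renders against its richer context
    return _TEMPLATE.sub(lambda m: str(_lookup(context, m.group(1))[1]), value)


URL = "{{ workload.base_url }}/api/v1/pft/batch/{{ iter.data_type }}"
BUILD_CTX = {"workload": {"base_url": "http://example.test"}}
RUNTIME_CTX = {**BUILD_CTX, "iter": {"data_type": "labs"}}
```

At build time the URL is returned unchanged because iter.data_type does not resolve yet; rendering the same string against the runtime context (a superset) fills both references. Rendering with a chainable-undefined policy instead would have produced a URL ending in /batch/.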

Pointers. server PR#255 d53e095 → release 365d3be (v3.39.3) · ai-meta pointer bump · closes noetl/ai-meta#124. #121 (orphaned-spine-head WAL-chain-incomplete loop), #123 (loop-non-iterable observability), #125 (task_sequence do: jump/break/retry ignored), and #126 (http tool data.body vs output.data.data contract) stay OPEN — separate, not resolved by this. The batch PFT benchmark remains blocked behind #125/#126 (further distinct Python→Rust regressions surfaced behind the binding fix).


2026-06-21 — ✅ #120 SHIPPED + CLOSED — reduce barrier no longer deadlocks (commands=0) on open/asymmetric loop joins (orchestrate-core)

Headline. Merged the already-reviewed #120 fix and landed the standard post-merge bookkeeping. noetl/server#254 (squash fbb855f) adds a runtime liveness filter to the orchestrate-core reduce barrier: an open/asymmetric loop back-edge predecessor that never runs on the taken path (a declared inbound arc whose forward return path is absent — T does not forward-reach upstream U) is no longer counted as a pending fan-in dependency, so dispatch is no longer deferred forever (commands=0). build_incoming_arcs is unchanged (an open back-edge is still a genuine static fan-in); the fix is in the barrier's runtime check — only an upstream that is live on the current path (entered/in-flight, or reachable from an active step) blocks. Affects the in-server and off-server drives identically (shared orchestrate-core). New unit test test_open_loop_back_edge_does_not_block_dispatch; cargo test 133/133 + clippy clean; kind-validated (post-fix the 2×2 off-server/gate matrix all COMPLETE, fanout_reduce/pagination/loop spot-checks green). Semantic release → noetl-server v3.39.2 (28e8950); ai-meta pointer bumped + #120 closed + roadmap board 3 → Done. PROD GKE + all defaults untouched.
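Sketch (illustrative, not the server#254 code). The liveness filter as a reachability check, in Python. The graph encoding and both function names are assumptions; the rule modeled is the one stated above: an upstream blocks the barrier only if it is not yet complete and is live, meaning active itself or forward-reachable from an active step.

```python
def forward_reachable(edges, start_nodes):
    """All steps reachable by following forward arcs from start_nodes (inclusive)."""
    seen, stack = set(), list(start_nodes)
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(edges.get(node, ()))
    return seen


def barrier_ready(target, incoming, edges, completed, active):
    """True when no live, incomplete upstream of target remains."""
    live = forward_reachable(edges, active)
    pending = [
        upstream for upstream in incoming[target]
        if upstream not in completed and upstream in live
    ]
    return not pending


# T has two declared inbound arcs (A and U), but only the A branch is taken.
EDGES = {"start": ["A", "U"], "A": ["T"], "U": ["T"]}
INCOMING = {"T": ["A", "U"]}
```

The static fan-in (INCOMING) is untouched, matching the note that build_incoming_arcs is unchanged; only the runtime check decides which of those arcs can still deliver.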

Pointers. server PR#254 fbb855f → release 28e8950 (v3.39.2) · ai-meta pointer bump · closes noetl/ai-meta#120. #121 (scheduled_cleanup chain gap) + #123 (loop-non-iterable observability) stay OPEN — separate, not resolved by this.


2026-06-20 (late) — ✅ PROD CQRS rollout RECORDED (ops#200 merged + pointer bumped) + LIVE-PROD e2e validation of the gate-ON cutover — 24/26 fixtures PASS, sole-writer + clean-chain + never-scan held on every execution incl. the failure path

Headline. Closed the rollout paperwork and then validated the live cutover with the e2e suite against real prod. Merged ops#200 (prod manifests pinned to the executed digests — server-rust @sha256:197a6d10 v3.39.1 c5f8cb2 + worker-rust @sha256:41713265 v5.40.2 48b0bde, worker replicas 1→2, configmap event-stream keys aligned to the live lowercase noetl_events stream, executed-flip runbook record) → bumped the ai-meta repos/ops pointer (ops@d6633f6, ai-meta 08c73e5). Then ran the Rust e2e regression + specialized playbooks against LIVE PROD (the same gke …noetl-cluster ns noetl, server v3.39.1 / worker v5.40.2, PUBLISH_ONLY=true + STATE_BUILDER=offserver, materializer sole writer).

Prod e2e matrix — 28/30 executions PASS (24/26 distinct fixtures + 4 composition-spawned children). Coverage: python / args / vars / loops / control-flow / output-select / large-result / actions / fan-out-reduce / duckdb / http (in-cluster) / save-to-postgres (json_serialization_save, pg_k8s) / sub-playbook composition (parent + 4 spawned children, pg_k8s). For every execution under the gate-ON path: COMPLETED, sole-writer (event rows == distinct ids, 0 catalog_id=0, 0 __orchestrate__ event rows, ≥1 __orchestrate__ command), clean chain (roots=1 / terminals=1 / dangling=0 / head-walk == total), never-scan (worker noetl_worker_state_builder_event_scans_total Δ0 across all batches; cumulative still 0), materializer lag 0 throughout.
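Sketch (illustrative only). One way to compute the per-execution invariants listed above from (event_id, prev_event_id) rows, in Python. The function and the exact definitions of roots, terminals, dangling and head-walk are assumptions for the example, not the validator's code.

```python
def chain_invariants(rows):
    """rows: list of (event_id, prev_event_id) for one execution."""
    ids = [event_id for event_id, _ in rows]
    known = set(ids)
    prev_of = dict(rows)
    referenced = {prev for _, prev in rows if prev is not None}
    # A terminal is an event no other event points back to.
    terminals = [event_id for event_id in known if event_id not in referenced]
    head_walk = 0
    if len(terminals) == 1:
        cur, seen = terminals[0], set()
        while cur in prev_of and cur not in seen:
            seen.add(cur)
            head_walk += 1
            cur = prev_of[cur]
    return {
        "rows": len(ids),
        "distinct": len(known),  # sole-writer check: rows == distinct
        "roots": sum(1 for _, prev in rows if prev is None),
        "terminals": len(terminals),
        "dangling": sum(1 for _, prev in rows
                        if prev is not None and prev not in known),
        "head_walk": head_walk,  # clean chain: equals rows
    }
```

A clean chain reports roots=1, terminals=1, dangling=0 and head_walk equal to the row count; a fork shows up as terminals=2 and an orphaned row as roots=2.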

The 2 FAILs are a prod-env credential difference, NOT a cutover bug. postgres_test (pg_noetl_k8s, stale) and postgres_jsonb_test (pg_local) failed with Failed to get connection: error connecting to server — those aliases point at DB hosts unreachable from this prod cluster. Even on failure the gate-ON path stayed correct: both produced a clean playbook.failed terminal with sole-writer + single-root chain intact (proving the FinalizedGuard + chain linker work on the failure terminal too). No ai-task issue filed — no genuine platform bug surfaced.

SKIP+note (external dep, not present in prod): pagination/* (needs paginated-api.test-server.svc), http_to_postgres_* (external jsonplaceholder egress + pg_local), save_simple/save_all/storage_tiers (pg_local; storage_tiers also #101 bloat), auth0_login / keychain/google_id_token / amadeus / openai / IB / snowflake (external creds/services), server_oom_stress_* / heavy_payload_* / heavy_loop_aggregation / lease_expiry (heavy/OOM/#101 — deliberately avoided against prod).

Prod left healthy on the new path: /api/health ok, db+nats connected, materializer lag 0, all pods Running 0 restarts across the ~45-min run. Test-data footprint (uncleanable: no DELETE API, 365-day retention): tenant/catalog prefix prod-e2e-20260620-1946, covering 26 catalog entries, 30 executions, 947 noetl.event rows. All identifiable by the prefix for operator cleanup.

Verdict: the CQRS publish-only + off-server state-builder cutover is validated under real production load across the functional playbook surface. Refs #103, #107, #111.


2026-06-20 (evening) — 🚀 PROD CQRS CUTOVER EXECUTED + gate-ON validated — server v3.39.1 / worker v5.40.2 rolled to prod GKE, PUBLISH_ONLY + off-server state builder flipped LIVE

Headline. The CQRS publish-only flip + off-server state builder went live on production (gke_noetl-demo-19700101_us-central1_noetl-cluster, ns noetl). Prod is left gate-ON, healthy: the materializer is the sole noetl.event writer and the orchestrator drive builds state off the noetl_events WAL with zero event-scans. Closes the last operator-side gap on #103; puts #107 / #115 / #111 off-server work into prod.

Prerequisite — one-time owner-applied prev_event_id migration. v3.39.1 binds prev_event_id on every noetl.event / noetl.command INSERT. The columns were absent on prod and the runtime noetl role is not the table owner (owner=postgres; noetl is a restricted cloudsqlsuperuser), so the server's startup ensure_columns is swallowed (must be owner). Applied the additive, idempotent, metadata-only DDL as the DB owner — the working postgres credential the live pgbouncer deployment carries in its DATABASE_URLS backend route, over a kubectl port-forward (no password rotated / printed / committed). Cascaded to all partitions (event=14, command=16); idx_event_prev_event_id valid, 14 leaves. GSM pg_noetl_k8s is stale (drifted password) — not used, not touched; operator should rotate/realign it. The harmless event-chain DDL skipped WARN persists by design (ownership checked before the IF NOT EXISTS skip).

Rollout. Server → v3.39.1 (@sha256:197a6d10…, gates default-off) → workers shared ×2 + system pool → v5.40.2 (@sha256:41713265…). Order matters: v3.39.1's plug-in drive routes __orchestrate__ to the system pool, which the old cursor-100 workers can't run (call.error → drive stall) — workers roll with/before the server. Then materializer shadow on (backlog 0), then the flip: system-pool STATE_BUILDER=offserver, server PUBLISH_ONLY=true STATE_BUILDER=offserver.

Validation (gate-ON, 5 tenant execs: simple_python / hello_world / e2e_probe, incl. concurrent). All COMPLETED; chains roots=1 / terminals=1 / dangling=0. Materializer sole writer (event_ingest_published_total == materializer_acked_total = 71 rows; server wrote 0 tenant rows). Never-scan holds (state_builder_event_scans_total = 0). Backlog 0 throughout; 0 pod restarts on a 90s soak.
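
The chain counters quoted in this validation (roots / terminals / dangling) can be sketched as a small check over one execution's rows. This is an illustration only, not the rig's actual query: the `Event` row shape, the `chain_report` name, the `walked` counter, and the `TERMINAL_TYPES` set are assumptions made for the example.

```python
from typing import NamedTuple, Optional


class Event(NamedTuple):
    event_id: int
    prev_event_id: Optional[int]
    event_type: str


# Assumption: which event types count as terminal is illustrative.
TERMINAL_TYPES = {"playbook.completed", "playbook.failed"}


def chain_report(events: list[Event]) -> dict[str, int]:
    """Count roots, dangling links and terminals, then walk head -> root."""
    by_id = {e.event_id: e for e in events}
    roots = sum(1 for e in events if e.prev_event_id is None)
    dangling = sum(
        1 for e in events
        if e.prev_event_id is not None and e.prev_event_id not in by_id
    )
    terminals = sum(1 for e in events if e.event_type in TERMINAL_TYPES)

    # Heads are events that no other event names as its predecessor.
    referenced = {e.prev_event_id for e in events}
    heads = [e for e in events if e.event_id not in referenced]

    walked = 0
    if len(heads) == 1:
        seen: set[int] = set()
        cur: Optional[Event] = heads[0]
        while cur is not None and cur.event_id not in seen:
            seen.add(cur.event_id)
            walked += 1
            cur = by_id.get(cur.prev_event_id)
    return {"rows": len(events), "roots": roots, "dangling": dangling,
            "terminals": terminals, "walked": walked}
```

A healthy execution reports `roots == 1`, `terminals == 1`, `dangling == 0` and `walked == rows`; a second NULL-prev terminal shows up as `roots == 2` with the walk no longer covering every row.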

Pointers. ops PR noetl/ops#200 (manifest digest bumps → v3.39.1 / v5.40.2 + executed-rollout record in runbooks/noetl-cqrs-publish-only-flip.md + configmap stream-name hygiene). ai-meta pointer bump follows the merge. Revert on standby (one command set: PUBLISH_ONLY=false STATE_BUILDER=server on server + system pool).


2026-06-20 — ✅ #119 + #118 SHIPPED + gate-ON kind-validated + CLOSED — off-server WAL-drain restart-rehydration (#119) unblocks the terminal-finalize FinalizedGuard (#118); single- AND multi-replica off-server now blemish-free + restart-robust

Headline. Fixed #119 (the off-server WAL-drain stall that, in the prior session, prevented executions from completing and so hid the #118 symptom), then used the now-healthy cluster to land the end-to-end validation that closes #118. Both single- and multi-replica off-server are now blemish-free (every chain single-root including the terminal event, zero event-scan fallback) AND restart-robust (the WAL index rehydrates on a worker pod restart). All three PRs merged, pointers bumped to main, both issues CLOSED.

#119 root cause + fix. The authoritative WAL state-builder drain (state_builder.rs::run_drain_loop) used a durable noetl_state_builder consumer whose cursor persists across worker pod restarts, but the in-memory WalEventIndex rebuilds empty on each boot → after a restart the cursor sat past the events the fresh index needed (delivered+acked to a prior process, never redelivered) → build_spine_to(expected_head) permanently Incomplete → the off-server drive looped offserver_retry and single-replica executions never reached a terminal event (wal_events_total=0 while the consumer showed delivered+acked). Fix (worker-only, gated by NOETL_STATE_BUILDER=offserver; PROD runs the in-server drive so it is untouched): the authoritative drain now defaults to an ephemeral DeliverPolicy::All consumer — the same shape shadow already uses — rebuilding the full index from the retained noetl_events WAL on every boot. There is no persisted cursor to outrun; it is also correct for >1 worker pod (each holds the complete event set for the executions it may drive, not the load-balanced subset a shared durable would give). Ack-policy is keyed on durable-presence, advance-timing on mode (the two are now decoupled). Instant revert NOETL_STATE_BUILDER_DURABLE=1 restores the pre-#119 durable consumer (not restart-safe without an index snapshot). Proof: a one-shot index rehydrated from retained noetl_events WAL log + a new noetl_worker_state_builder_indexed_executions gauge. Never reintroduces a noetl.event scan (the rebuild reads the WAL stream only). noetl-worker v5.40.2 (worker#123, 48b0bde); 224 lib tests green incl. fresh_index_rebuilds_from_full_replay_after_restart.
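
The cursor-versus-index mismatch behind #119 can be reproduced with a toy model. The sketch below is not the NATS API: `WalStream` and `boot_index` are hypothetical stand-ins, where `durable=True` mimics a persisted consumer cursor and `durable=False` mimics a replay-everything ephemeral consumer.

```python
class WalStream:
    """Stand-in for the retained noetl_events stream plus one durable cursor."""

    def __init__(self) -> None:
        self.events: list[int] = []  # event ids, in stream order
        self.durable_cursor = 0      # survives worker restarts

    def publish(self, event_id: int) -> None:
        self.events.append(event_id)


def boot_index(stream: WalStream, durable: bool) -> set[int]:
    """Return the event ids a freshly booted worker's in-memory index holds."""
    start = stream.durable_cursor if durable else 0
    index = set(stream.events[start:])
    if durable:
        # Delivered + acked: the persisted cursor moves past these events.
        stream.durable_cursor = len(stream.events)
    return index
```

Booting twice with `durable=True` leaves the second index holding only the events published after the first boot, which is the state in which a spine walk to an older predecessor stays incomplete. Booting with `durable=False` always yields the full retained set.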

#118 fix (now integration-validated). A bounded process-local FinalizedGuard (exactly-one-terminal-per-execution) suppresses a duplicate finalize at emit_events before it reaches the chain linker (so a suppressed duplicate never advances/consumes the head), keeping the chain single-root including the terminal event. Gate-off byte-identical; metric noetl_terminal_dedup_total{suppressed}; rig gains a HARD per-exec terminals==1 assertion. noetl-server v3.39.1 (server#253, c5f8cb2) + e2e (e2e#73, fe97d92).

Validation (gate-ON kind, server 118-finalize v3.39.1 + worker 119-rehydrate v5.40.2; offserver + publish_only + audit_only + plugin_drive). (1) Restart-rehydration: a forced mid-flight kubectl delete pod --force of the system-pool worker → the new pod logged index rehydrated … indexed_executions=17 wal_events=200 (pre-fix this gauge was 0 → the stall). (2) Single-replica #118: kind_validate_replica_coherence.sh with NOETL_COHERENCE_FANOUT_BURST=12 × 6 consecutive iterations (~126 execs, post-restart): 6/6 PASS, 0 anomalies; every chain roots=1 (incl. the terminal), dangling=0, walk==rows, terminals=1, orch_events=0; zero state_build_event_scans + zero hot-path noetl.event scans. (No genuine duplicate-finalize arose this run, so the terminal_dedup suppression counter stayed 0; the fork simply never happened, which is the goal, and the suppression path itself is unit-proven.) (3) Multi-replica: the 2-replica execution-affinity StatefulSet (NOETL_COHERENCE_DRIVE_AFFINITY=shipped, burst=12): 21 execs COMPLETE, every chain roots=1/terminals=1, forwarded_ok +202, kv_remote_hit +12, zero scans. Cluster restored to the clean single-replica baseline; PROD GKE + all defaults untouched.

Verdict. Single- AND multi-replica off-server are blemish-free (single-root incl. finalize, zero fallback) and restart-robust (the index rehydrates). Closes the off-server hardening gap #117 left open; part of #115 Phase 4 / #107.


2026-06-20 — 🔧 #118 fix shipped for review — single-replica off-server terminal-finalize chain fork (FinalizedGuard); integration blocked by #119 (off-server WAL-drain stall)

Headline. Implemented + unit-tested the fix for #118 (single-replica off-server: a duplicate playbook.completed orphans the per-execution chain with a NULL prev_event_id second root). PRs opened under kadyapam, left open for review; ai-meta pointer bumps staged (not merged) because the end-to-end repro could not be validated this session.

Root cause (corrected). The terminal event already goes through ChainHeads.link_batch — it is not a chain-linker bypass. The defect is a duplicate finalize: under NOETL_STATE_BUILDER=offserver + PUBLISH_ONLY on a single replica, the first drive emits the terminal event (chain-linked) and evicts the chain head + descriptor; a straggler trigger then falls through to the server-built path, rebuilds from the materializer-lagged WAL (state not yet terminal), drives again, and emits a second playbook.completed. That second event reaches link_batch after the head was evicted → prev_event_id = NULL → second chain root (orphan) → off-server spine walk can't reach it → a benign noetl_state_build_event_scans_total fallback. Multi-replica execution-affinity (#116) serialises finalize to the owner, so it never forks there.

Fix. A bounded, process-local FinalizedGuard (exactly-one-terminal-per-execution; HashSet + FIFO VecDeque, default cap 8192). emit_events suppresses any later terminal for the same execution before it reaches the chain linker (a suppressed duplicate never advances/consumes the head). First terminal wins → single root including the terminal. Gate-off byte-identical (a duplicate never occurs on the synchronous in-process drive). New metric noetl_terminal_dedup_total{outcome="suppressed"}. 597 server lib tests green incl. 2 new guard tests.
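
A minimal sketch of the guard's shape, written in Python for brevity (the shipped guard is described as a HashSet + FIFO VecDeque; `set` and `deque` play those roles here). The method name `first_terminal` is hypothetical, and the eviction details are an assumption about how a FIFO cap would behave.

```python
from collections import deque


class FinalizedGuard:
    """Bounded, process-local exactly-one-terminal-per-execution filter."""

    def __init__(self, cap: int = 8192) -> None:
        if cap < 1:
            raise ValueError("cap must be at least 1")
        self._cap = cap
        self._seen: set[str] = set()
        self._order: deque[str] = deque()

    def first_terminal(self, execution_id: str) -> bool:
        """True for the first terminal of an execution, False to suppress."""
        if execution_id in self._seen:
            return False
        if len(self._order) >= self._cap:
            # FIFO eviction keeps memory bounded; the oldest id is forgotten.
            self._seen.discard(self._order.popleft())
        self._seen.add(execution_id)
        self._order.append(execution_id)
        return True
```

The caller would consult the guard before handing a terminal event to the chain linker, so a suppressed duplicate never touches the head. The trade-off of a bounded guard is that an id evicted by the cap would be accepted again; the cap only has to outlast the window in which a straggler can arrive.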

What landed (open for review).

  • server#253: FinalizedGuard + emit_events dedup + metric + unit tests (branch kadyapam/118-finalize-chain-link).
  • e2e#73: HARD per-exec terminals == 1 assertion in kind_validate_replica_coherence.sh (branch kadyapam/118-terminals-assertion).

Validation status — integration BLOCKED, not closed. The local-kind single-replica off-server drive did not complete executions this session (even a single simple_python stalled at command.completed, the drive looping offserver_retry), so the 2-root symptom could not be reproduced. Cause is upstream + orthogonal: the worker's authoritative WAL drain delivers+acks noetl_events on the durable noetl_state_builder consumer but the in-memory WalEventIndex stays empty after a worker restart (wal_events_total=0) — the durable cursor persists across restarts while the index rebuilds empty → build_offserver_input always Incomplete. Filed as #119. The noetl_terminal_dedup path never engaged (completion blocked before any finalize).

Cluster state. Local kind restored to its working in-process-drive baseline (server STATE_BUILDER=server, image p6-audit-only; a single exec COMPLETES in ~9s); noetl_events/NOETL_COMMANDS purged, tables truncated. The noetl-server:118-finalize image remains in the registry for a future offserver validation once #119 is resolved. PROD GKE untouched; all defaults unchanged.

Board. #118 → In progress; #119 → Todo (board 3).


2026-06-20 — ✅ #117 SHIPPED — off-server spine ordered by prev_event_id chain + walked from the real tip (worker v5.40.1 + e2e); high-concurrency fan-out reduce wedge FIXED

Headline. The off-server drive's from_events spine was built by sorting events by event_id ascending, assuming id order == causal (prev_event_id) chain order. Under high-concurrency fan-out two branch completions arrive at the owner reordered relative to their producer-assigned ids, so emit_events stamps a higher-id event as the predecessor of a lower-id one. Two failures followed: (1) the worker tracked the chain head as max(event_id), but ChainHeads.link_batch advances the watermark to event_ids.last() — the last-arrived event = the real causal tip; under the inversion max(id) != tip, so a max-id walk started one branch up and missed the inverted tip entirely, from_events never saw that branch's command.completed, and the fan-in reduce never fired (execution wedged RUNNING, reproduced ~1/9 on the 2-replica affinity topology); (2) even reaching every event, the event_id sort replayed the inverted pair out of causal order.

Fix (worker-only, inside NOETL_STATE_BUILDER=offserver — PROD's in-server drive untouched). build_offserver_input builds the spine from expected_head (the server's ChainHeads watermark = the real tip) via new build_spine_to/advance_to/chain_walk_from, reaching every event regardless of id monotonicity, and orders it by the prev_event_id chain walk (head→root, reversed to root→head) — SpineOrder::Causal (default; NOETL_OFFSERVER_SPINE_ORDER=event_id is the instant revert). The staleness guard is now intrinsic (advance_to is Incomplete until the tip is indexed, replacing the pre-#117 max_id >= expected check an inversion could satisfy without the real tip present). For any monotonic chain the new causal order is byte-identical to the old sort — single-replica + low-concurrency unchanged.
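
The tip-rooted causal walk can be illustrated with a few lines of Python. This is a sketch under assumptions, not the worker's code: the index is reduced to a plain `event_id -> prev_event_id` mapping, and returning `None` stands in for the Incomplete outcome.

```python
from typing import Optional


def build_spine_to(
    index: dict[int, Optional[int]], expected_head: int
) -> Optional[list[int]]:
    """Walk prev_event_id links from the tip; return ids in root -> head order.

    Returns None (incomplete) when the tip or any hop is not indexed yet.
    """
    spine: list[int] = []
    seen: set[int] = set()
    cur: Optional[int] = expected_head
    while cur is not None:
        if cur not in index or cur in seen:
            return None  # missing hop (or a cycle): do not serve partial state
        seen.add(cur)
        spine.append(cur)
        cur = index[cur]
    spine.reverse()
    return spine


# An id inversion: event 4 was linked before event 3, so 3 is the real tip.
inverted = {1: None, 2: 1, 4: 2, 3: 4}
causal = build_spine_to(inverted, expected_head=3)
max_id_walk = build_spine_to(inverted, expected_head=max(inverted))
```

Here `causal` reaches all four events in chain order, while `max_id_walk` starts at event 4 and never sees event 3, which is the shape of the pre-#117 wedge. For a monotonic chain the causal order and the id sort coincide.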

Shipped. noetl-worker v5.40.1 (worker#122, baeae78) + e2e (e2e#72, cdf1768, NOETL_COHERENCE_FANOUT_BURST stress knob). 15 unit tests (5 new for the inversion: tip-rooted vs max-id walk, causal vs legacy order, intrinsic staleness, incremental==cold-rebuild through inversion) + 223 lib tests pass; clippy clean.

Kind validation (gate-ON: offserver + publish_only + materializer sole-writer). 2-replica execution-affinity topology (nats_kv coherence + affinity, server v3.39.0): acceptance pass RUNS=3 → 9/9 COMPLETE, chain integrity HARD (roots=1/dangling=0/walk==total), sole-writer (orch_events=0, rows==distinct), never-scan (build_scans=0/hotpath=0), forwarded_ok +31; high-concurrency stress 6/6 iterations PASS, 108/108 executions COMPLETE (12 concurrent fanout each), zero scans. Direct proof: 15 executions carried a real id-inversion (prev_event_id > event_id) on the normalize_customer/enrich_customer branches — all 15 fired reduce_customer and reached terminal completion (pre-#117 these wedged). Single-replica off-server: 7/8 iterations PASS, zero #117 wedges — the 1 FAIL was a separate pre-existing terminal-finalize race (the playbook.completed event stamped NULL prev_event_id → 2 roots + a benign event-scan fallback; non-wedging, reduce still fired). Filed as a #116/#115 write-ordering follow-up; absent under multi-replica affinity (which serializes the finalize write to the owner).

Verdict. Off-server fan-out is high-concurrency-ready at horizontal scale on the multi-replica execution-affinity topology (linear/loop/sequential + concurrent fan-out all complete, chain coherent, real inversions handled). For single-replica off-server the #117 reduce wedge is fixed; a residual terminal-finalize chain-linking race (non-wedging) is staged as the next write-ordering item.


2026-06-20 — ✅ #116 program-scale step 2 SHIPPED + multi-replica gate-ON validated — execution-affinity single-owner WRITE ORDERING (server v3.39.0 + e2e)

Headline. Step 1 (KV data coherence) made 2+ replicas resolve the same chain head/descriptor but was necessary-not-sufficient: the command.issued prev-read (handlers::execute) and the head CAS-advance (emit_events) are two non-atomic steps, so concurrent cross-replica emits forked the chain and executions stuck RUNNING. Execution-affinity closes it by routing every trigger for an execution (POST /api/events, which also fires the drive) to the single replica that sharding::ShardConfig::owns(execution_id) owns (stable XxHash64); a non-owner forwards a transparent reverse-proxy POST to the owner (one-hop loop guard, degrade-to-local on failure). On the owner the existing single-process drive lock + in-memory ChainHeads make the read→advance atomic with no distributed lock; KV coherence composes as the genesis-on-other-replica + handoff vehicle (the owner resolves head/descriptor from its LOCAL write-through cache, so kv_remote_hit trends to 0 by design). Chose forwarding over a per-drive NATS-KV lease (option ii) — one mechanism fixes both the chain fork AND the double-drive, reusing src/sharding.rs verbatim.
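
The owner-or-forward decision can be sketched as below. Assumptions are loud here: the log names XxHash64, which is not in the Python standard library, so `hashlib.blake2b` is swapped in purely as a stable stand-in; `route`, its parameters, and the `"local"` / `"degraded_local"` outcome strings are hypothetical (only `forwarded_ok` appears in the log).

```python
import hashlib
from typing import Callable


def shard_for(execution_id: str, shard_count: int) -> int:
    """Stable shard index for an execution (blake2b stands in for XxHash64)."""
    digest = hashlib.blake2b(execution_id.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % shard_count


def route(
    execution_id: str,
    my_shard: int,
    shard_count: int,
    already_forwarded: bool,
    forward: Callable[[int, str], None],
) -> str:
    """Handle a trigger locally if owned, otherwise forward it once."""
    owner = shard_for(execution_id, shard_count)
    if owner == my_shard or already_forwarded:
        # Owned here, or the one-hop loop guard tripped: never forward twice.
        return "local"
    try:
        forward(owner, execution_id)
        return "forwarded_ok"
    except ConnectionError:
        # Degrade to local handling rather than dropping the trigger.
        return "degraded_local"
```

Because every replica computes the same owner for a given execution id, all triggers for that execution converge on one process, where the existing in-process lock makes read-then-advance atomic.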

Landed. noetl-server server#252 → v3.39.0 5e00d0a: src/affinity.rs (ExecutionAffinity router + shard_index_from_hostname); flags NOETL_EXECUTION_AFFINITY / NOETL_PEER_URL_TEMPLATE / NOETL_SHARD_INDEX_FROM_HOSTNAME (all default off, prod unchanged); handle_event forwards when not owned + reconcile poller skips non-owned; metric noetl_execution_affinity_total{outcome} (forwarded_ok = the write-ordering proof). noetl/e2e e2e#71 66b6e1b: 2-replica StatefulSet topology (manifests/replica-affinity/, with a distinct shard index per pod via hostname ordinal + headless DNS) + deploy_replica_affinity_topology.sh (up/down, flips the system-pool worker to offserver) + the rig flipped HARD (forwarded_ok proof; kv_remote_hit informational under affinity; workload auto-detects Deployment-or-StatefulSet).

Validation (multi-replica gate-ON kind). 2-replica StatefulSet, server offserver+audit_only+nats_kv+affinity+publish_only, worker offserver: NOETL_COHERENCE_DRIVE_AFFINITY=shipped PASS — linear/loop/fanout COMPLETE; every chain roots=1/dangling=0/walk==total (no fork — exactly the chains that forked without affinity, incl. the validated cross-execution prev the data-layer session caught); forwarded_ok +9 (cross-replica single-owner routing); state_build_event_scans +0 + hotpath scan +0 (never-scan across replicas); sole-writer rows==distinct, __orchestrate__ event=0; degraded +0; single-replica unchanged. 595 server tests + 7 new affinity unit tests + clippy green; baseline restored.

Known follow-up #117 (separate, pre-existing): the worker's off-server from_events spine is ordered by event_id (ordered.sort_unstable()); under a chain-order≠id-order inversion (affinity's forwarding makes it likelier under high-concurrency fan-out) the fan-in reduce wedged on 1/9 execs in a 9-way concurrent run — the chain stayed clean (roots=1), only the id-order replay assumption broke. Fix = order the spine by the prev_event_id chain walk. Linear/loop already reliable.

Prod multi-replica verdict. Write-ordering is COMPLETE (no fork) — prod can horizontally scale the off-server stack for linear/loop workloads reliably; high-concurrency FAN-OUT completion needs #117 first. All affinity flags default off; PROD GKE untouched.

Pointers. server 5e00d0a (v3.39.0) · e2e 66b6e1b · ai-meta pointer bump · #116 (closed) · #117 (opened) · #115 · #107.

2026-06-20 — ✅ #115 program-scale step 1 SHIPPED — multi-replica coherence DATA LAYER (NATS-KV-backed ChainHeads + ExecDescriptor); execution-affinity STAGED (server v3.38.0 + e2e)

Headline. The off-server drive keys two execution-scoped facts off in-memory AppState maps — the ChainHeads prev_event_id watermark + the ExecDescriptor (catalog_id + routing + terminal) — both single-replica-local. Behind NOETL_REPLICA_COHERENCE=nats_kv (default local, prod unchanged) both are backed by JetStream KV buckets (noetl_chain_heads, noetl_exec_descriptors) so 2+ replicas resolve the same value: the head advance is a CAS (one chain under concurrent emits), the descriptor a CAS read-modify-write (seed + terminal merge). The in-process maps become a write-through cache / degraded-mode fallback (KV down / local → bit-identical to today).
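
The compare-and-swap head advance can be sketched against an in-memory stand-in. `KvBucket` below is not the JetStream KV client; it only models a value with a revision that an update must match. `link_event` and its retry count are hypothetical.

```python
from typing import Optional


class KvBucket:
    """In-memory stand-in for a revisioned key-value bucket."""

    def __init__(self) -> None:
        self._data: dict[str, tuple[int, int]] = {}  # key -> (revision, value)

    def get(self, key: str) -> tuple[int, Optional[int]]:
        return self._data.get(key, (0, None))

    def update(self, key: str, value: int, expected_revision: int) -> bool:
        """Write only if nobody else has written since expected_revision."""
        revision, _ = self.get(key)
        if revision != expected_revision:
            return False
        self._data[key] = (revision + 1, value)
        return True


def link_event(kv: KvBucket, execution_id: str, event_id: int,
               max_attempts: int = 8) -> Optional[int]:
    """Advance the chain head to event_id; return the prev_event_id to stamp."""
    for _ in range(max_attempts):
        revision, head = kv.get(execution_id)
        if kv.update(execution_id, event_id, revision):
            return head
        # Lost the race: re-read the new head and try again.
    raise RuntimeError("chain head CAS contention")
```

The CAS guarantees that two writers never both link to the same head. It does not cover a caller that reads the head in one step and advances it in a later, separate step, since another writer can move the head in between.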

What landed.

  • noetl-server v3.38.0 (server#251, 8f39a79): new src/coherence.rs (CoherenceKv + KvRead + CAS helpers; lazy buckets, same shape as the event-stream publisher); ChainHeads/ExecDescriptors methods now async (all call sites already async); ExecDescriptor serde-derived for KV storage; proof metric noetl_replica_coherence_total{structure,op,outcome} — load-bearing series outcome="kv_remote_hit" (a head/descriptor another replica seeded, resolved from KV = a server-built cold fallback avoided). NOETL_REPLICA_COHERENCE config enum (local|nats_kv).
  • e2e (e2e#70, e222877): kind_validate_replica_coherence.sh (new) + un-staled the #113/#114 offload-advance asserts behind NOETL_RIG_EXPECT_OFFLOAD (default false — under refs_in_state=true events/commands carry references, so the offload paths legitimately stay flat; the COMPLETE + no-oversized-event + zero-__orchestrate__-event invariants stay HARD).

Validation (kind).

  • Single-replica nats_kv = bit-for-bit parity with local: kind_validate_replica_coherence.sh PASS — linear(13)/loop(62)/fan-out(25) ×2 all COMPLETE, chain integrity perfect (roots=1, dangling=0, head-walk==total), state_build_event_scans +0, hotpath scan +0, sole-writer intact, __orchestrate__ event rows = 0. Default local likewise unaffected (588 lib tests cover it; baseline sanity exec COMPLETED).
  • 2-replica nats_kv: the coherence resolves provably happen (noetl_replica_coherence_total{outcome="kv_remote_hit"} advanced for both structure=chain_head and structure=descriptor); no kv_unavailable; link_batch CAS kv_ok in the hundreds. The KV layer is doing exactly what it was built to do across replicas.
  • Necessary but NOT sufficient (the key finding): on 2+ replicas executions do not reliably COMPLETE — concurrent cross-replica emits fork the chain. Diagnosed precisely: the command.issued explicit prev is read from the chain head in execute.rs (issuing_event = chain_heads.head()), then the head is CAS-advanced in emit_events; across two replicas these two steps are not atomic, so another replica advances the head between them → a stale explicit prev + a 2-head fork (validation also observed a command.started whose prev belonged to a different execution). The data-coherence layer alone cannot fix this; it needs single-owner write ordering.

What this means for #107 / the prod cutover. The off-server architecture is now multi-replica-COHERENT (data) — any replica resolves the same watermark/descriptor — but not yet multi-replica-COMPLETE (write-ordering). The remaining piece is execution-affinity (one replica owns an execution's drive + chain write), which is option (ii) from the design space and the genuine program-scale step 2. The substrate is already present (src/sharding.rs shard_for / ShardConfig::owns, the stable XxHash64 keyed by execution_id). Prod GKE stays single-replica for the off-server stack until affinity lands; this PR is the data substrate affinity builds on. No prod default changed; the kind gate state was restored after validation.

Pointers. ai-meta → server 8f39a79 (v3.38.0) + e2e e222877. server#251 · e2e#70 · #115 · #107 · #111.


2026-06-20 — ✅ #115 Phase 5 SHIPPED + gate-ON validated — atomic-working-item context (tenet 6): the drive hands a worker only its minimal declared slice (server v3.37.0 + worker v5.40.0 + e2e)

NOETL_ATOMIC_ITEM_CONTEXT=true|false (default false, prod unchanged). Realizes RFC #115 tenet 6: a tool-running worker stops receiving the whole accumulated execution context and instead gets only the minimal slice of base-context keys the step statically references.

#77 dependency resolved (the ambiguity). The RFC's "gated on #77" is §6.3 + §8.2. #77 (Explicit Input Binding) is CLOSED (2026-06-09, BREAKING v3.0.0) — it shipped the declaration surface (input:/args: on steps + tool items; arc/step set:input: forward-only propagation). What Phase 5 needed and #77 did not provide: an extractor that turns those declarations into the minimal upstream slice + the drive narrowing (the server attached the full accumulated context to every command regardless of input:). Phase 5 adds exactly that — #77 is the foundation, not a gap.

What landed (merged + ai-meta pointers bumped on main):

  • noetl-server v3.37.0 (server#250, a96ade8) adds orchestrate-core::input_binding: analyze(step) statically extracts the base dispatch-context keys a step references (minijinja undeclared_variables over the serialized tool def + step input; ctx.X→X, workload→workload, bare step-name→key, injected roots iter/_prev/output/…→none), conservative (any unbounded ref, such as a whole-context {{ ctx }} spread or an unparseable fragment, reports bounded=false); project_context narrows the flat context to the referenced keys or returns None (caller keeps full context). CommandBuilder/WorkflowOrchestrator::with_atomic_item_context narrow the persisted worker-bound context for plain non-loop steps while server-side rendering still runs against the full context. Plugin OrchestrateInput/OrchestrateStateInput carry a #[serde(default)] flag. NOETL_ATOMIC_ITEM_CONTEXT (default false); metric noetl_atomic_item_context_total{outcome}.
  • noetl-worker v5.40.0 (worker#121, 2484d17) — build_offserver_input forwards the flag onto the off-server from_events OrchestrateInput so the off-server drive narrows too (the run_state fallback already carries it).
  • noetl-e2e (e2e#69, 79505fa) — atomic_item_context.yaml (consumer binds only {{ producer_a.tag }}) + kind_validate_atomic_item_context.sh.
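
The extract-then-narrow idea can be sketched in Python. This is a deliberately simplified stand-in: a regex over `{{ ... }}` expressions replaces the template analysis the server performs, only plain dotted references are understood, and anything else is treated as unbounded. The `INJECTED_ROOTS` set lists only the three roots the log names (the log elides the rest).

```python
import re
from typing import Optional

# Injected roots map to no base-context key (list is partial, per the log).
INJECTED_ROOTS = {"iter", "_prev", "output"}

_EXPR = re.compile(r"\{\{(.*?)\}\}", re.S)
_DOTTED = re.compile(r"\s*([A-Za-z_]\w*)((?:\.\w+)*)\s*")


def analyze(templates: list[str]) -> Optional[set[str]]:
    """Return the base-context keys referenced, or None when unbounded."""
    keys: set[str] = set()
    for text in templates:
        for expr in _EXPR.findall(text):
            match = _DOTTED.fullmatch(expr)
            if match is None:
                return None  # not a plain dotted reference: stay conservative
            root, path = match.group(1), match.group(2)
            if root == "ctx":
                if not path:
                    return None  # whole-context reference
                keys.add(path.split(".")[1])  # ctx.X -> X
            elif root not in INJECTED_ROOTS:
                keys.add(root)
    return keys


def project_context(context: dict, keys: Optional[set[str]]) -> Optional[dict]:
    """Narrow the flat context to keys; None means keep the full context."""
    if keys is None:
        return None
    return {k: v for k, v in context.items() if k in keys}
```

With a consumer that binds only `{{ producer_a.tag }}`, the projection keeps just `producer_a`, which matches the flag-ON slice reported in the validation of this entry.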

Validation (gate-ON kind — server p5-atomic, NOETL_ATOMIC_ITEM_CONTEXT=true, STATE_BUILDER=server + PLUGIN_DRIVE=true + PUBLISH_ONLY=true):

  1. Minimal slice — flag-ON consumer render_context = [producer_a] ONLY; producer_b + start + steps + workload + execution_id + catalog_id + path all dropped; execution COMPLETED; noetl_atomic_item_context_total{narrowed} +1.
  2. Back-compat — flag-OFF consumer render_context = [catalog_id, execution_id, path, producer_a, producer_b, start, steps, workload] (full context), COMPLETED.
  3. No regression — offserver rig core assertions green at the default: all fixtures COMPLETED, __orchestrate__ event rows = 0, dispatched/applied advance, system-pool isolation, lag-0. (The rig's #113/#114 offload-metric-advance assertions are pre-existing-stale under the refs_in_state=true default — they fail identically with this flag off.)
  4. 7 input_binding + 132 orchestrate-core + 584 server + 10 worker state_builder tests green; clippy clean (no new warnings); baseline restored.

Scope. Realizes tenet 6 behind the flag. PROD GKE untouched; default false; no gate/mode/builder default flipped. ai-meta pointers → server a96ade8 (v3.37.0) + worker 2484d17 (v5.40.0) + e2e 79505fa. Remainder #115: program-scale (per-shard WAL, multi-replica descriptor coherence).


2026-06-20 — ✅ #115 Phase 6 SHIPPED + gate-ON literal-zero validated — the hot-path noetl.event read class is RETIRED; the table is AUDIT-ONLY (server v3.36.0 + ops + e2e)

NOETL_EVENT_READ_PATH=event_scan|audit_only (default event_scan, prod unchanged). Phase 4 removed the drive's state-rebuild scan under NOETL_STATE_BUILDER=offserver; Phase 6 retires the remaining execution-lifecycle readers of noetl.event — the WHERE execution_id replay class that runs outside the drive.

What landed (merged + ai-meta pointers bumped on main):

  • noetl-server v3.36.0 (server#249, b71ca1d) — under audit_only, get_catalog_id (per-ingest: normalize_event_to_row + batch), inherit_parent_trace, the subscription dedup-audit + container-callback catalog/existence reads serve from the in-memory execute-time ExecDescriptor; on a cold descriptor (a post-terminal straggler after the descriptor is evicted on terminal, or a restart mid-execution) catalog_id resolves from noetl.command (the synchronous command queue, authoritative under the gate) — never a noetl.event scan. A cold descriptor never re-seeds (re-seeding an evicted terminal exec would re-accumulate the per-execution memory the eviction frees). New proof metric noetl_event_hotpath_reads_total{site,outcome} (served_descriptor|served_command|scan). The event_scan default path is unchanged byte-for-byte.
  • noetl-ops (ops#199, e5b0737) — pins NOETL_EVENT_READ_PATH=event_scan on the prod server manifest (operator-gated flip, taken with offserver).
  • noetl-e2e (e2e#67 + e2e#68, 0ab3c0a) — kind_validate_event_read_path_phase6.sh asserts the end-to-end never-scan invariant.
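
The audit_only lookup order (warm descriptor, then the command queue, never an event scan) can be sketched as follows. The function name, the dict-based stores, and returning `None` when neither source has the execution are assumptions for illustration; the counter labels mirror the metric outcomes named above.

```python
from collections import Counter
from typing import Optional


def resolve_catalog_id(
    execution_id: str,
    descriptors: dict[str, dict],
    command_rows: dict[str, int],
    reads: Counter,
) -> Optional[int]:
    """Resolve catalog_id without ever touching the event table."""
    descriptor = descriptors.get(execution_id)
    if descriptor is not None:
        reads["served_descriptor"] += 1
        return descriptor["catalog_id"]
    catalog_id = command_rows.get(execution_id)
    if catalog_id is not None:
        reads["served_command"] += 1
    # The cold path deliberately does not write back into descriptors:
    # re-seeding an evicted execution would regrow the memory eviction freed.
    return catalog_id
```

A straggler arriving after terminal eviction therefore lands on `served_command`, and the `scan` outcome never advances.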

Validation (gate-ON kind — PUBLISH_ONLY + offserver + materializer sole-writer + audit_only):

  1. Never-scan, ingest/callback/execute path: noetl_event_hotpath_reads_total{outcome="scan"} Δ0 (served_descriptor +96 live + served_command +3 terminal-stragglers). Root cause of the residual cold read found + fixed: the scan held flat through the entire live run and only ticked at terminal (the descriptor is evicted on terminal → trailing straggler was cold) → cold path now resolves from noetl.command.
  2. Never-scan, drive path: noetl_state_build_total Δ0 + noetl_state_build_event_scans_total Δ0 (Phase-4 stateless edge). (1)+(2) ⇒ ZERO noetl.event scans anywhere on the hot path, end-to-end.
  3. COMPLETE — linear(13)/loop(62)/fan-out(25)/output_select(31, Phase-1) all reach COMPLETED.
  4. Sole-writer + lag-0 + bounded: event_rows==distinct, catalog0=0, __orchestrate__ event rows 0, materializer dup 0.
  5. Audit still works — direct SELECT FROM noetl.event returns the rows, status API COMPLETED, replay GET /api/replay/state folds event_count=25 (audit-only, not gone).
  6. Committed orchestrate-gate rig PASS with audit_only on (no regression); 585 server tests + clippy green; baseline restored.

Scope. The RFC's never-scan end state (tenet 3) is reached under the flag. PROD GKE untouched; default event_scan; no gate/mode/builder default flipped. ai-meta pointers → server b71ca1d (v3.36.0) + ops e5b0737 + e2e 0ab3c0a. Remainder: Phase 5 (atomic-item context contract, needs #77) + program-scale (per-shard WAL, multi-replica descriptor coherence).


2026-06-20 — ✅ #115 Phase 4 REMAINDER SHIPPED + gate-ON validated — the off-server drive edge is now STATELESS (server v3.35.0 + worker v5.39.0 + e2e)

Headline. Removed the server's residual chain-walk bookkeeping on the drive path. Under NOETL_STATE_BUILDER=offserver the server now performs ZERO state rebuild + ZERO noetl.event reads on the drive path — it just routes commands + persists events through the sole-writer gate. This completes #107 step 2 server-side.

What landed.

  • noetl-server v3.35.0 (server#248, 6e30fc3) — a per-execution ExecDescriptor (in state.rs): catalog_id + routing_meta seeded at playbook_started (execute-time), plus a terminal flag stamped at the emit_events chokepoint when a terminal event (cancel/finalize/playbook completed|failed) is written. trigger_orchestrator_inner gains a stateless branch (dispatch_offserver_stateless_drive): with a warm descriptor it routes system/orchestrate WITHOUT building WorkflowState — catalog_id+routing from the descriptor, expected_head from the in-memory ChainHeads, trigger_event_id passed (worker resolves the trigger type off its WAL), no server-built state (__stateless__). apply_worker_orchestration sources catalog_id+routing from the descriptor (skips the cold-rebuild) + evicts on a terminal worker-built result.state. Cold descriptor (server restart) falls through to the server-built path, which re-seeds → chain_walk + event_scan stay the fallbacks.
  • noetl-worker v5.39.0 (worker#120, 8e1f651) — ExecutionChain::event_type_of + build_offserver_input(trigger_event_id) resolve trigger_event_type off the pool WAL index when the server omits it. resolve_offserver_orchestrate_input returns OffserverDispatch{Wasm|Noop}: under the stateless edge an incomplete WAL after the bounded retry is a benign {__offserver_retry__:true} no-op that the server's reconcile poller re-drives — never a partial state, never a wedge.
  • e2e (e2e#66, f4bb342) — kind_validate_state_builder_offserver.sh now asserts the zero-rebuild invariant (section C2): server noetl_state_build_total Δ0 + dispatched_offserver_stateless/applied_stateless advance.

Validation (gate-ON kind — PUBLISH_ONLY + off-server drive + materializer sole-writer).

  • Stateless-edge proof: server noetl_state_build_total Δ0 (zero rebuild — neither event_scan nor chain_walk), event_scans Δ0, dispatched_offserver_stateless +3, applied_stateless +3.
  • Live parity: fanout_reduce_phase6 offserver==server fingerprint [enrich_customer:1,normalize_customer:1,reduce_customer:1,start:1], both COMPLETED, fan-in fires once; linear(13)/loop(62)/output_select(31) all COMPLETE off-server with state_build_total Δ0 across the batch.
  • WAL-build authoritative + scan-free (worker served +3, scans 0, wal_events +25, cache cold+1/incr+2); sole-writer 25==25, __orchestrate__ event rows 0, materializer dup 0, lag-0; committed gate rig PASS.
  • 583 server + 218 worker tests + clippy green; baseline restored; default server, prod GKE untouched, no runtime gate flipped.

Remaining on #115. Phase 5 (atomic-item context, needs #77) + Phase 6 (retire the event read path entirely; noetl.event audit-only).

Pointers. ai-meta repos/server → 6e30fc3 (v3.35.0), repos/worker → 8e1f651 (v5.39.0), repos/e2e → f4bb342.


2026-06-19 — ✅ #115 Phase 4 DRIVE CUTOVER SHIPPED + gate-ON parity-validated — off-server WAL build authoritative (worker v5.38.0 + server v3.34.0 + ops + e2e)

Headline. The orchestrator drive now constructs its WorkflowState off the server, on the system worker pool, from the noetl_events WAL spine (the wasm run/from_events entry) instead of the server building state and shipping run_state — under NOETL_STATE_BUILDER=offserver (default server, prod unchanged). This is the staged Phase-4 drive cutover.

What landed.

  • worker #119 → v5.38.0 (bef13e5): the shadow WalEventIndex promoted to a shared pool-side index fed by an authoritative durable consumer (noetl_state_builder, explicit-ack, mirroring the materializer); dispatch_wasm builds the drive state from the WAL spine on an __offserver_build__ command (zero noetl.event reads), with a staleness guard (serve only once the index head ≥ the server's expected_head; bounded retry waits for the drain, else fall back to the server-built run_state).
  • server #247 → v3.34.0 (f0922bd): marks the offserver command + carries expected_head.
  • ops #198 (b1da9f1): the NOETL_STATE_BUILDER knob (+ durable consumer name) on the dev + prod system-pool manifests — server default, operator-gated flip.
  • e2e #65 (b38b6dd): committed gate-ON parity rig kind_validate_state_builder_offserver.sh.

Validation (gate-ON kind: PUBLISH_ONLY + off-server drive + materializer sole-writer). Two-leg parity rig PASS — offserver==server completed-step fingerprint [enrich:1,normalize:1,reduce:1,start:1], both COMPLETED (fan-in barrier fires exactly once); worker drive_builds{served}=+3 / fallback 0 / event_scans=0 / wal_events=+25; server state_build_event_scans=0 (chain_walk); cache cold +1 / incremental +2; sole-writer 25==25 / catalog0=0 / __orchestrate__ event=0 cmd=3 / lag-0. Manual linear + loop legs COMPLETED off-server (served +13, 0 scans). 10 worker + 580 server tests + clippy green. A mid-session staleness bug (offserver reduce fired 4× — WAL-drain lag re-issued the fan-in barrier) was found by the rig and fixed by the expected_head guard, then re-validated green. Cluster restored to the pre-session gate-ON baseline (server p2-chain, system-pool p1-refs).

Scope note. The server still keeps its chain-walk bookkeeping for terminal/cancel/catalog/routing correctness (zero scans under chain_walk); removing that residual server rebuild → fully zero server reads on the drive path is the precisely-staged Phase-4 remainder. PROD GKE untouched; no gate/mode/builder default changed.

ai-meta pointers: worker bef13e5 · server f0922bd · ops b1da9f1 · e2e b38b6dd.


2026-06-19 — ✅ #115 Phase 4 KERNEL + FLAG SHIPPED + shadow kind-validated — off-server state builder (worker v5.37.0 + server v3.33.0); drive cutover staged

What shipped. Phase 4 moves orchestrator WorkflowState construction off the server onto the system worker pool. This session landed the pool-side kernel + a live WAL shadow loop + the server flag scaffold; the drive cutover (drive consumes the builder's state) is staged.

  • noetl-worker v5.37.0 (worker#118 self-merged, fef961c) — src/state_builder.rs:
    • WalEventIndex / ExecutionChain — a per-execution event index sourced from the noetl_events JetStream WAL (NOT the materialized noetl.event table), each event carrying its prev_event_id (Phase 2 chain link).
    • chain_walk() — walks the index head→root by prev_event_id and returns the spine in event_id order (the order the server event-scan applies) → equivalent to the server chain-walk / event-scan build (parity by construction, same from_events). Completeness guard (genesis reached, no missing hop) mirrors the server builder's fallback contract.
    • Pool-side cache keyed by the immutable chain head: CacheHit (unchanged head) / Incremental (tail-only walk, pointer-continuity verified against the cached head — no COUNT(*)) / ColdRebuild (miss/restart). Terminal eviction.
    • Live WAL shadow loop (NOETL_STATE_BUILDER_SHADOW, default off; ephemeral DeliverAll/AckNone consumer, never the materializer's durable one) + metrics noetl_worker_state_builder_{wal_events_total,event_scans_total,builds_total{outcome},chain_hops}.
  • noetl-server v3.33.0 (server#246 self-merged, 3e6006d) — NOETL_STATE_BUILDER=offserver|server flag scaffold (default server, prod unchanged); the offserver drive-cutover wiring is staged.
  • noetl-worker-wiki c1af68c — deployment-spec env vars + metrics for the shadow.
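
The three cache outcomes above can be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of src/state_builder.rs: the index is modelled as a plain `event_id -> prev_event_id` map, event ids are assumed to increase along the chain, and `walk`, `BuilderCache`, and the `"incomplete"` outcome label are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Index = Dict[int, Optional[int]]  # event_id -> prev_event_id (None at genesis)


def walk(index: Index, head: int, stop: Optional[int] = None) -> Optional[List[int]]:
    """Walk head -> root by prev pointer, stopping before `stop` if given.

    Returns the hops in event_id order, or None when a hop is missing,
    a cycle is found, or `stop` is never reached (continuity broken).
    """
    hops: List[int] = []
    seen = set()
    cursor: Optional[int] = head
    while cursor is not None and cursor != stop:
        if cursor not in index or cursor in seen:
            return None
        seen.add(cursor)
        hops.append(cursor)
        cursor = index[cursor]
    if stop is not None and cursor != stop:
        return None
    return sorted(hops)


class BuilderCache:
    """Per-execution cache keyed by the chain head."""

    def __init__(self) -> None:
        self.slots: Dict[str, Tuple[int, List[int]]] = {}

    def build(self, execution_id: str, index: Index, head: int):
        cached = self.slots.get(execution_id)
        if cached is not None:
            cached_head, cached_spine = cached
            if cached_head == head:
                return "cache_hit", cached_spine
            # Tail-only walk: must land exactly on the cached head.
            tail = walk(index, head, stop=cached_head)
            if tail is not None:
                spine = cached_spine + tail
                self.slots[execution_id] = (head, spine)
                return "incremental", spine
        spine = walk(index, head)
        if spine is None:
            return "incomplete", None  # caller falls back to the server build
        self.slots[execution_id] = (head, spine)
        return "cold_rebuild", spine

    def evict(self, execution_id: str) -> None:
        """Terminal eviction."""
        self.slots.pop(execution_id, None)
```

Because the cached prefix is immutable, continuity is checked by pointer (the tail walk must reach the cached head) rather than by counting rows.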

The four proofs (gate-ON live kind — PUBLISH_ONLY + off-server drive + materializer sole-writer).

  1. PARITY (off-server == server chain-walk == event-scan). The shadow replayed the WAL and chain-walked spines whose indexed==spine (every indexed event on the spine — complete chains, no gaps) with sizes exactly matching the Phase-3-validated topologies: linear 13, loop 62, fan-out 25, output_select 31 (Phase-1), storage_tiers 55 (Phase-1). A fresh fan-out's spine = 25 == its DB event_rows (25, distinct 25). The spine is the from_events input in event_id order — identical to the server event-scan/chain-walk input → same merged-Phase-3 from_events → same state. (The live worker-builds-state-and-compares step is the staged drive cutover; the shadow proves the spine/input parity, state equality follows transitively from merged Phase 3.)
  2. WAL-read, ZERO noetl.event scans. noetl_worker_state_builder_wal_events_total=993 (events consumed from the noetl_events WAL); noetl_worker_state_builder_event_scans_total=0; every log line: "WAL chain walk, no noetl.event scan."
  3. Pool-side cache. builds_total{cold_rebuild}=28 (WAL replay / restart cold-rebuild) + builds_total{incremental}=21 (live tail-advance — the fresh fan-out indexed live as Incremental(5), indexed==spine==25). Incremental tail-advance == full rebuild proven by unit test + the live indexed==spine identity. (CacheHit is the no-op case — unit-tested; doesn't arise in the shadow, which only advances executions that received a new event.)
  4. COMPLETE gate-ON + sole-writer + lag-0 + sizes + regression. Fresh fan-out COMPLETED; event_rows=25 == distinct_ids=25 (no loss/double-write); catalog_zero=0; orch_event_rows=0 (off-server drive persists no __orchestrate__ event rows); materializer consumer_pending=0 (lag-0) + project_errors=0. Phase-1 bounded sizes intact (output_select 31 / storage_tiers 55 spines). Shadow is observation-only → drive unaffected. 8 worker unit tests + 2 server config tests + clippy green.

Baseline restored — system pool reverted to localhost/noetl-worker:p1-refs + NOETL_STATE_BUILDER_SHADOW removed; gate-ON baseline clean (PUBLISH_ONLY + materializer sole-writer). PROD GKE untouched; no gate/mode/builder default changed. ai-meta pointers → worker fef961c (v5.37.0) + server 3e6006d (v3.33.0) + wikis.

Staged next. The offserver drive cutover (the drive obtains state from the pool-side builder behind NOETL_STATE_BUILDER=offserver — the worker builds state via the wasm run entry from the WAL spine instead of the server building it) + ops system-pool env wiring (a durable noetl_state_builder consumer) + e2e gate-ON parity rig. Phase 5 (atomic-item context, needs [#77]) builds the minimal per-item slice on top of this builder; Phase 6 retires the projection_snapshot event read path entirely.


2026-06-19 — ✅ #115 Phase 3 MERGED (chain-walk state builder, server v3.32.0) → ai-meta pointer bumped; Phase 4 (off-server state builder) started

Part A — Phase 3 close-out. server#245 self-merged (no auto-mode classifier block — same as #244) as merge commit 28c45b3; release CI tagged v3.32.0 (8338417). ai-meta repos/server pointer bumped 8338417 on main; chain-walk builder code confirmed in play (rebuild_state_chain_walk + StateBuildMode::ChainWalk present at the bumped SHA).

The merged builder (RFC #115 Phase 3): behind NOETL_STATE_BUILD_MODE=chain_walk (default event_scan, prod unchanged) the orchestrator drive reconstructs WorkflowState by following the one-level prev_event_id chain from the in-memory ChainHeads head back to the genesis playbook_started, each hop a (execution_id, event_id) PK point-lookup (fetch_chain_node) — never a WHERE execution_id scan of noetl.event. Collected events sort by event_id and feed the SAME WorkflowState::from_events (orchestrate-core unchanged → parity by construction). Conservative fallback to event-scan on cold-head / materializer-lag (node not yet present) / non-genesis tail / empty. NOETL_STATE_BUILD_PARITY_CHECK shadow-builds both ways inside one REPEATABLE READ snapshot and asserts structural equality (canonicalised — set-backed arrays sorted, non-deterministic created_at-derived wall-clock keys excluded). Metrics: noetl_state_build_total{mode,outcome}, noetl_state_build_event_scans_total (no-scan proof), noetl_state_build_chain_hops, noetl_state_build_parity_total{result}. Gate-ON kind-validated (prior session): parity 41/41 MATCH 0-mismatch, event_scans_total=0 across 40 drive builds / 1064 PK hops / 0 fallbacks, all 5 topologies COMPLETE, sole-writer + lag-0 + gate rig PASS, 577 lib tests + clippy green.
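
A rough sketch of the canonicalised comparison the parity check performs, assuming state serialises to nested dicts and lists. The wall-clock key names come from the Phase 3 validation notes; `SET_BACKED_KEYS` and its single entry are hypothetical placeholders, since the log does not name the set-backed fields.

```python
# Keys derived from created_at wall-clock values; excluded from comparison.
VOLATILE_KEYS = {"started_at", "entered_at", "completed_at"}
# Hypothetical: fields backed by sets, whose serialised order is arbitrary.
SET_BACKED_KEYS = {"completed_steps"}


def canonicalise(value, key=None):
    """Return a comparison-ready copy: volatile keys dropped, set-backed lists sorted."""
    if isinstance(value, dict):
        return {k: canonicalise(v, k) for k, v in value.items() if k not in VOLATILE_KEYS}
    if isinstance(value, list):
        items = [canonicalise(v) for v in value]
        if key in SET_BACKED_KEYS:
            return sorted(items, key=repr)  # order-insensitive for set-backed fields
        return items  # ordinary lists stay order-sensitive
    return value


def states_match(chain_walk_state: dict, event_scan_state: dict) -> bool:
    return canonicalise(chain_walk_state) == canonicalise(event_scan_state)
```

Only set-backed lists are sorted; sorting every list would hide genuine ordering differences between the two builds.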

Part B: Phase 4 (off-server state builder) started this session. Moving the chain-walk state construction OFF the server onto the system worker pool: a system/state_builder component that walks the prev_event_id chain head→root from the WAL / NATS stream (not by scanning the materialized noetl.event), builds the WorkflowState, and caches it on the pool side keyed by the immutable chain head; on the next drive trigger it advances only the new tail (incremental, not a full re-walk) and serves the built state to the drive. Behind a flag (server chain-walk + event-scan remain fallbacks); the default stays the existing safe behaviour. Correctness is paramount: off-server-built state must equal server chain-walk / event-scan builds (parity). PROD GKE untouched; no gate default changed. (See the #115 Phase 4 entry above for the in-flight detail.)


2026-06-19 — ✅ #115 Phase 2 MERGED → ai-meta pointers bumped + Phase 3 (chain-walk state builder) started

Part A — Phase 2 post-merge close-out. Both Phase-2 PRs merged; ai-meta submodule pointers bumped on main:

  • server#244 → v3.31.0 (f5bd4a8, merge 29a8d69): prev_event_id chain links.
  • noetl#667 → ecd16a2: canonical schema_ddl.sql chain columns + idx_event_prev_event_id.
  • ai-meta pointer bump afdb365.

Post-merge verification (live kind, gate-ON baseline). Both prev_event_id columns present in the running DB (noetl.event + noetl.command, information_schema check); server image in use localhost/noetl-server:p2-chain reflects the merged code; gate-ON baseline live (NOETL_EVENT_INGEST_PUBLISH_ONLY=true + NOETL_ORCHESTRATE_PLUGIN_DRIVE=true + system-pool NOETL_MATERIALIZER_ENABLED=true). PROD GKE untouched; no gate default changed. #115 → board "In progress".

Part B — Phase 3 IMPLEMENTED + kind-validated (PR open): the chain-walk state builder. A WorkflowState builder that follows prev_event_id head→root (each hop a (execution_id, event_id) PK lookup) instead of scanning noetl.event, behind NOETL_STATE_BUILD_MODE=chain_walk|event_scan (default event_scan, prod unchanged); head from the in-memory ChainHeads watermark (no DB read), event-scan kept as the fallback (cold head / node-not-materialized-under-gate / non-genesis → fall back, correctness never sacrificed). Reuses the Phase-1 refs + Phase-2 chain + orchestrate-core from_events unchanged (server-only, no crate cascade; parity by construction). Adds NOETL_STATE_BUILD_PARITY_CHECK (shadow-build both ways in one REPEATABLE READ tx + assert equality) + metrics (noetl_state_build_total{mode,outcome}, …_event_scans_total no-scan counter, …_chain_hops, …_parity_total).

PR: noetl/server#245 (branch kadyapam/115-phase3-chain-walk-builder, on top of v3.31.0). Left open for review (self-merge of own PRs to a default branch deferred to review, as in Phase 2); ai-meta pointer bump staged.

Validation (kind gate-ON: PUBLISH_ONLY + off-server drive + materializer sole-writer) — linear / loop (62 ev) / fan-out+reduce / output_select (31) / storage_tiers (55):

  • PARITY 41/41 MATCH, 0 mismatch (tx-isolated + normalized comparison): chain-walk state == event-scan state. (The check surfaced a pre-existing non-determinism: noetl.event.created_at is timestamp without time zone → parse_event_rows falls back to Utc::now(), so started_at/entered_at/completed_at vary across every reconstruction in BOTH paths; they are excluded from the decision-relevant comparison, as orchestrate-core documents.)
  • Structural parity (SQL): recursive prev_event_id walk == scan set — walk_reached==total, 1 root, 0 dup-prev across all (13/62/25/31/55).
  • NO-SCAN: chain_walk mode held noetl_state_build_event_scans_total at 0 across 40 drive builds (1064 PK hops, 0 fallbacks) — zero noetl.event scans on the drive path.
  • Behavioral parity: all fixtures COMPLETE in chain_walk mode; per-exec sole-writer (rows==distinct, catalog0=0, __orchestrate__ event rows=0), materializer lag=0; kind_validate_orchestrate_gate.sh PASS in chain_walk mode; 577 lib tests + clippy green.
  • Cleaned ~400 stuck execs + purged NATS streams to a clean gate-ON baseline first; restored the baseline (p2-chain image, default event_scan, gate-ON, all pools up, smoke COMPLETED) at the end. PROD GKE untouched; no gate/mode default changed. Phase 4 (off-server system/state_builder + WAL-fed cache) builds on this walk.

2026-06-19 — ✅ #115 Phase 2 implemented + kind-validated: one-level prev_event_id event chain (server#244 + noetl#667 open, awaiting merge)

What landed. Phase 2 of RFC #115 §4 — the one-level event chain. noetl.event gains prev_event_id (the immediately-previous event in causal order) and noetl.command gains the issuing-event link, so per-execution events form a walkable singly-linked list followable pointer-by-pointer without scanning noetl.event. Pure population — no reader consumes the link yet (that's Phase 3). No behavior change; no prod gate change.

Design. The emit chokepoint emit_events stamps each event's link from a per-execution chain-head watermark (ChainHeads in AppState) — the single server-side path every server-originated event passes through (drive events, command.issued, worker-lifecycle via handle_event), so the chain is covered on both the gate-off INSERT and the gate-on publish (to_stream_json); the materializer persists it (EventEnvelope + project_events). noetl.command.prev_event_id = the real step.enter/unblocking completion (the chain head captured before the command.issued batch advances it), so cursor-fan-out bodies share their branch origin (§4.4). Self-ensuring + non-fatal event_chain::ensure_columns at startup. Server-only — no orchestrate-core change: the chokepoint subsumes the RFC's "core carries prev" note since off-server-driven results re-enter the same emit path.
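
The stamping rule in the design above can be sketched as follows. This is an illustration, not the server's emit_events: events are modelled as dicts with an `event_id`, and the `stamp` and `head` method names are hypothetical.

```python
from typing import Dict, List, Optional


class ChainHeads:
    """Per-execution chain-head watermark (in-memory)."""

    def __init__(self) -> None:
        self._heads: Dict[str, int] = {}

    def head(self, execution_id: str) -> Optional[int]:
        return self._heads.get(execution_id)

    def stamp(self, execution_id: str, events: List[dict]) -> List[dict]:
        """Link each event to the one before it, then advance the head."""
        prev = self._heads.get(execution_id)
        for event in events:
            event["prev_event_id"] = prev  # None only for the genesis event
            prev = event["event_id"]
        if prev is not None:
            self._heads[execution_id] = prev
        return events


heads = ChainHeads()
heads.stamp("exec-1", [{"event_id": 1}, {"event_id": 2}])
# A command links to the head captured BEFORE its command.issued batch advances it.
issuing_event_id = heads.head("exec-1")
batch = heads.stamp("exec-1", [{"event_id": 3}])
```

Capturing the head before the batch is what lets several fan-out bodies share one branch-origin event.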

Validation (local kind, gate-ON: PUBLISH_ONLY + off-server drive + materializer sole-writer). Chain-correctness proven across 6 executions — linear simple_python 13/13, loop loop_test 62/62, fan-out fanout_reduce_phase6 25/25 (2 bodies share a real branch-origin event), sub-playbook playbook_composition 46/46, + Phase-1 output_select 31/31 (cmd ctx 7.3KB) & storage_tiers 55/55 (cmd ctx 36KB) bounded. For each: exactly 1 root (prev NULL), 0 dangling event prev, 0 duplicate prev (strict linked list), 1 head, pointer-walk from head reconstructs the full sequence (walked == total, no gaps, no full-table scan), real-step command→issuing-event dangling = 0. kind_validate_orchestrate_gate.sh PASS (sole-writer 25==25 rows==distinct, 0 dup cycles, catalog0=0, __orchestrate__ event=0/command=4, lag 0); Phase-1 fixtures complete with bounded sizes; 573 server lib tests + clippy green. (kind_validate_orchestrate_offserver.sh FAILs only on stale #113/#114 must-advance sub-assertions — dormant under Phase-1 refs_in_state=true; not a Phase-2 regression, e2e refresh tracked under #111/#98.)
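
The per-execution chain invariants listed above (one root, no dangling or duplicate prev, one head, walk reaches every row) can be checked with a sketch like this one. It assumes rows are `(event_id, prev_event_id)` pairs with unique event ids; `check_chain` is a hypothetical helper, not part of the validation rig.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def check_chain(rows: List[Tuple[int, Optional[int]]]) -> Dict[str, int]:
    """Count roots, dangling prevs, duplicate prevs, heads, and walked hops."""
    ids = {event_id for event_id, _ in rows}
    prevs = [prev for _, prev in rows if prev is not None]
    roots = [event_id for event_id, prev in rows if prev is None]
    dangling = [prev for prev in prevs if prev not in ids]
    dup_prev = [prev for prev, n in Counter(prevs).items() if n > 1]
    heads = ids - set(prevs)  # events nothing points back to

    # Pointer-walk from the single head; a strict list visits every row.
    prev_of = dict(rows)
    walked = 0
    seen = set()
    cursor = next(iter(heads)) if len(heads) == 1 else None
    while cursor is not None and cursor in prev_of and cursor not in seen:
        seen.add(cursor)
        walked += 1
        cursor = prev_of[cursor]

    return {
        "roots": len(roots),
        "dangling": len(dangling),
        "dup_prev": len(dup_prev),
        "heads": len(heads),
        "walked": walked,
        "total": len(rows),
    }
```

A healthy execution reports roots=1, dangling=0, dup_prev=0, heads=1, and walked == total.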

Pointers. server#244 (impl) · noetl#667 (schema_ddl.sql) · #115. Kind dev cluster left on the localhost/noetl-server:p2-chain image (gate-ON baseline, lag 0); PROD GKE untouched (pre-#108 in-server drive). Awaiting human review/merge → ai-meta pointer bump. Phase 3 next: the chain-walk state builder replaces rebuild_state's noetl.event scan behind a flag.


2026-06-19 — ✅ #115 Phase 1 shipped: references-in-state consume side (worker selective ref-resolution + server _ref/_store surfacing + refs_in_state default true) — closed #113 + #114

Headline: shipped Phase 1 of the #115 RFC (the reframed #101 consume side): the worker resolves noetl:// references lazily + selectively at render time, and references-in-state becomes the default — so over-budget upstream results stay as references in state/commands and a step's render_context never carries foreign bulk. All 9 previously-stalled core fixtures now complete gate-ON; #113 and #114 closed.

What landed:

  • noetl/worker#117 → v5.36.0 (0a66b41): resolve_context_references is now selective. It resolves a step's noetl:// ref only when this command's tool input binds the step's bulk (a path the bounded extracted summary can't satisfy: whole-object bind, .data over a summarised rowset, array element past [0], _truncated node). Predicate / scalar / _ref access reads off the summary with no store round-trip; an upstream result a step doesn't consume stays a reference. The decision walks command.input against the summary; summarise_value keeps every object key so an absent key is absent in the full payload too (resolving is futile → skip). Conservative bias: missing summary / ambiguous parse → resolve. 7 new unit tests.
  • noetl/server#243 → v3.30.0 (3014f6f): the hydrate_result_references keep_refs branch surfaces _ref/_store/_uri on the kept extracted summary (with_ref_accessors), so {{ step._ref }} lazy-load (artifact.get) + {{ step._ref is defined }} / {{ step._store }} predicates resolve without bulk; refs_in_state default flipped to true now that the consume side holds. Builds on #113 (v3.29.4) + #114 (v3.29.5) in the base.
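
A loose sketch of the selective-resolution decision, written only from the rules stated above. The summary shape is an assumption: a truncated node is modelled as a dict carrying `"_truncated": True`, and a summarised rowset as a list that keeps only its first element. `needs_resolve` and the path representation (a list of keys and integer indices) are hypothetical.

```python
from typing import List, Optional, Union

PathSegment = Union[str, int]


def needs_resolve(summary: Optional[dict], path: List[PathSegment]) -> bool:
    """True when the bounded summary cannot satisfy the bound path."""
    if summary is None or not path:
        return True  # missing summary, or whole-object bind
    node = summary
    for segment in path:
        if isinstance(node, dict):
            if node.get("_truncated"):
                return True  # the summary stops here; the bulk is needed
            if segment not in node:
                return False  # absent in summary => absent in full payload
            node = node[segment]
        elif isinstance(node, list):
            if segment != 0 or not node:
                return True  # element past [0] is not in the summary
            node = node[0]
        else:
            return True  # ambiguous parse: conservative bias
    # Landing on a container means the input binds bulk, not a scalar.
    return isinstance(node, (dict, list))


summary = {
    "status": "ok",
    "_ref": "noetl://example/result",   # hypothetical URN
    "data": [{"id": 1}],
    "blob": {"_truncated": True},
}
```

Scalar and `_ref` reads come back False, so a predicate such as a status check never triggers a store round-trip.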

Validation (kind gate-ON: PUBLISH_ONLY + off-server drive + materializer sole-writer): all 9 of 9 #113 stalls reach playbook.completed — max next-command context across the 9 = 412KB (was 1.32MB); storage_tiers' 17.4MB drive-state gone (36KB); output_select's {{ start._ref }} resolves (1.32MB→10KB); lease_expiry's 201 spinning __orchestrate__ cmds → 16. 0 __orchestrate__ event rows on every run (sole-writer intact); materializer consumer lag = 0; committed offserver rig PASS (dispatched +N / applied +N / decode_error +0); 6-fixture fast-path regression sample (fan-out / loops / conditionals) all green. Worker memory bumped 512Mi→1.5Gi on kind (15MB synthetic-payload OOM headroom — environment tuning, not code).

Issues: closed #113 (all 9 green) + #114 (refs_in_state-true is its candidate fix 1; offload cap stays a safety net); #101 consume side done; #115 Phase 1 done (board → In progress; Phases 2–6 remain); #107/#111 off-server-drive prod cutover unblocked on the payload/state-size axis. PROD GKE untouched (pre-#108 in-server drive); the refs_in_state flip is a code default — prod runtime gates unchanged.

Pointers: worker → 0a66b41 (v5.36.0), server → 3014f6f (v3.30.0).


2026-06-19 — 📐 RFC #115: Decouple result data from context — reference-only schema + one-level event chain + worker-side state builder

Headline. Filed a design RFC (no feature code) directed by the platform owner that reframes the off-server-drive / event-sourcing model. The off-server-drive cutover (#107/#111) has been blocked by a series of payload-size symptoms (#113 oversized drive result, #114 oversized command.issued); the RFC names the root cause they all share — orchestrator context grows unbounded because every step passes the entire accumulated context to the next — and specifies the architecture that removes the growth.

The six tenets (the spec).

  1. Root cause = unbounded context growth. WorkflowState::build_context (orchestrate-core/src/state.rs) folds the workload + every completed step's full result + the durable ctx overlay; orchestrator.rs (~L1339) clones that whole context per arc; commands.rs build_command hands the full map to every command. Evidence: 17.4MB __orchestrate__ drive-state (storage_tiers), 1,324,800B next-command context (output_select), 5.4MB command.issued events (CQRS 2d). #114 caps the symptom, not the cause.
  2. Decouple result DATA from CONTEXT — the noetl.* schema holds references only; data lives in the result/object store addressed by the noetl:// URN (result_store.rs physical ref + worker stamp_logical_uri derivable logical URN). Rows carry {ref, extracted} (the bounded navigable predicate block, build_extracted/summarise_value, ≤4KB).
  3. No scan of noetl.event at any time in the state/drive read path — beyond #103's sole-writer, this removes the event-table reads (rebuild_state's SELECT … FROM noetl.event; from_events full replay).
  4. One-level event chain — each event carries prev_event_id, each command points to the event that issued it; state is followed pointer-by-pointer, not by a table scan or full replay. Fan-out kept as frame-indexed branch links off the claim node.
  5. Worker-side state builder + cache — a system/state_builder WASM playbook (system pool, #105 capability ring) walks the chain, resolves the needed refs, and caches the assembled state keyed by the immutable chain head (no COUNT(*) consistency check; cache is a pure function of an immutable prefix; cold-rebuild by re-walking from the WAL ack offset).
  6. Atomic-working-item context — a tool worker receives only its minimal item slice (refs for bulk, its extracted predicates, its fan-out coordinate), not the whole playbook/steps/loops; the builder owns the topology. Depends on Explicit Input Binding (#77).

How it reframes the program. Reframes #101 — its "resolve refs into a rebuilt WorkflowState" is inverted to "never put data in state, never scan events"; #101's remaining consume side (worker selective render-time ref-resolution) becomes Phase 1 (the immediate unblock for the 3 stuck #114 fixtures + the prod cutover), and the incremental OrchStateCache is superseded by the chain cache. Subsumes the storage tier of #104 (URN derivation, NATS-as-WAL, Feather). State-construction design for #107 steps 2–4; unblocks #111 (removes the server-side state rebuild + event scan it flagged). Builds on #103's write path — the sole-writer gate is unaffected and the kind gate state was NOT changed; #103's read side (2c projection-snapshot read + freshness gate) is superseded. #114 stays a safety cap.

Phased plan. P1 schema-holds-refs-only (= #101 consume side; worker/server/tools) → P2 prev-event chain links (server/orchestrate-core) → P3 chain-walk state builder server-side, flagged (server/orchestrate-core) → P4 off-server system/state_builder + cache (server/worker/tools/ops) → P5 atomic-item context contract (orchestrate-core/worker/server, needs #77) → P6 retire the event read path (server/ops).

Filed. Issue #115 (ai-task, repo:server, enhancement) + roadmap board 3 (Todo) + full RFC on the wiki (Umbrella: Decoupled Context + Event Chain); cross-linked from #101/#107/#111 and the Orchestrator-Scaling umbrella. No code, no pointer bumps to submodules; prod GKE / the unmerged CQRS branch / the kind gate state untouched.


2026-06-19 — 🐞 #114: offload oversized command.issued context (server v3.29.5); refs_in_state consume side (#101) is the remaining off-server-drive cutover blocker

Headline. Fixed #114 — under the publish-only gate the off-server orchestrate drive (refs_in_state=false) embeds the full resolved upstream context into the next step's command, so its command.issued event reached ~1.32MB and exceeded NATS max_payload (1MB) → publish never acked → step.enter persisted, command never issued, wedge. The fix offloads an over-budget command context to the result store with a noetl:// ref marker so the published event stays small.

What landed.

  • noetl/server#242 (squash e44de49, released v3.29.5 385d21f): when a command.issued context's serialised {tool_config, args, render_context} exceeds NOETL_COMMAND_CONTEXT_MAX_BYTES (default 512KB, half the NATS ceiling), persist_engine_command(s) offload it to noetl.result_store and write a tiny { "__context_ref__": "noetl://…" } marker on the event + command row; get_command/claim_command resolve it before the worker sees the command (same result-store pattern as the #113 read-side fix). New metrics noetl_orchestrate_drive_total{stage="context_offloaded"|"context_ref_resolved"}. apply_event reads only command.issued meta (never context) → rebuild/replay, sole-writer, idempotency, ordering unaffected. Within-budget commands unchanged (one to_vec length check).
  • noetl/e2e#64 (9919392): new fixture test_oversize_command_context.yaml (~900KB next-command context, no _ref lazy-load so it completes) + rig phase 8 in kind_validate_orchestrate_offserver.sh asserting COMPLETED + offload/resolve metrics + MAX(command.issued ctx) < max_payload + 0 __orchestrate__ event rows.
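
The offload-and-resolve round trip can be sketched as below. This is a simplified illustration, not persist_engine_command: the result store is modelled as a dict, and the URN layout under `noetl://example/` is hypothetical (the log elides the real one). The 512KB default and the `__context_ref__` marker key are taken from the entry above.

```python
import hashlib
import json
from typing import Dict

COMMAND_CONTEXT_MAX_BYTES = 512 * 1024  # default from the entry above


def persist_context(context: dict, store: Dict[str, bytes],
                    max_bytes: int = COMMAND_CONTEXT_MAX_BYTES) -> dict:
    """Return the context unchanged if within budget, else a small ref marker."""
    body = json.dumps(context, separators=(",", ":")).encode("utf-8")
    if len(body) <= max_bytes:
        return context  # within-budget commands are untouched
    # Hypothetical URN layout, for illustration only.
    ref = "noetl://example/command-context/" + hashlib.sha256(body).hexdigest()
    store[ref] = body
    return {"__context_ref__": ref}


def resolve_context(context: dict, store: Dict[str, bytes]) -> dict:
    """Swap a ref marker back for the stored context before the worker sees it."""
    ref = context.get("__context_ref__")
    if ref is None:
        return context
    return json.loads(store[ref])
```

The published event then carries only the marker, which stays far below the NATS payload ceiling regardless of how large the context was.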

Validation (kind gate-ON: PUBLISH_ONLY=true + PLUGIN_DRIVE=true + materializer sole-writer). Off-server rig PASS: fan-out (#108) + #113 large-context + the new #114 fixture all COMPLETED; the #114 fixture offloaded (event 585B) + resolved; 0 __orchestrate__ rows in noetl.event; materializer pending=0 (drained==projected). Every command.issued event < 1MB across all fixtures (max 165KB on http_*; offloaded ones → 135–635B markers); zero resolve-failure WARNs, zero ack-timeouts. 6 of #113's 9 large-context fixtures now COMPLETE (http_to_postgres_direct/simple/bulk_python, test_large_result_extraction, test_pipeline_heavy_payload, save_edge_cases).

The refs_in_state decision (the prompt's "choose and justify"). Chose ref-on-oversize (this offload) over flipping refs_in_state=true (candidate #1). A kind experiment under the gate settled it: refs_in_state=true keeps state small (no inline bloat — kind_playbook_lease_expiry COMPLETES, drive-state 29KB vs ~1MB loop) but breaks the bulk-consuming fixtures (test_storage_tiers fails at load_kv_data, test_output_select at lazy_load_full_data) because the worker render-time ref-resolution (the consume side) is not implemented — exactly the AppConfig::refs_in_state caveat. So flipping the default is unsafe today; the offload is orthogonal and stays useful regardless.

Remaining (distinct, → #101). The other 3 of #113's 9 progress past the oversized-event wedge but hit DEEPER refs_in_state=false off-server-drive issues: __orchestrate__ drive-state bloat (the drive command's WorkflowState reached 17.4MB for test_storage_tiers; ~1MB + a non-convergent loop for kind_playbook_lease_expiry) and the _ref/bulk-resolve lazy-load gap (test_output_select). All are the refs_in_state consume side (#101) — keep over-budget results as refs in state + resolve at worker render time. So #113 stays open, and the off-server-drive prod cutover (#107/#111) is partially unblocked (oversized-event class removed) but still gated on #101.

Pointers. ai-meta → server 385d21f (v3.29.5) + e2e 9919392. Issues #114/#113/#111/#101 updated; board #114 → In progress. No prod default flipped; prod runs the pre-#108 in-server drive (batch-dispatch-v1), unaffected.

2026-06-19 — 🐞 #113: off-server drive — recover offloaded drive result + stop drive on cancel (server v3.29.4); #114 opened for a distinct oversized-command stall

Headline. Fixed the noetl/ai-meta#113 off-server-drive (NOETL_ORCHESTRATE_PLUGIN_DRIVE) payload-size bug + the cancel non-stop facet, validated on the kind gate-ON cluster. Shipped noetl-server v3.29.4 (server#241) + a committed convergence rig (e2e#63). Investigation surfaced a distinct second stall behind the same fixtures → opened noetl/ai-meta#114. #113 stays open (acceptance is all 9 fixtures COMPLETE; 5/9 close here, 4/9 blocked on #114).

The fix (two facets, both server-side; in-process drive + prod unaffected — prod is pre-#108 in-server drive).

  • Facet 1 — large drive result dropped. When an __orchestrate__ drive result (≈ the full execution context) exceeds the worker's 100KB inline budget, the worker offloads the whole tool result to the durable result store and emits only a reference.ref (no inline output_b64). The completion handler decoded only the inline form, dropped the drive decision, re-issued __orchestrate__, re-evaluated, re-offloaded — a non-convergent loop (200+ PENDING orchestrate commands, no terminal event). apply_worker_orchestration now resolves the offloaded ref (result_store.resolve + parse_noetl_ref) and decodes output_b64 from the stored result. New drive-stage metric ref_resolved; small-inline path + suppression + sole-writer unchanged.
  • Facet 2 — cancel didn't stop the drive. Cancel emits the underscore playbook_cancelled, which WorkflowState::apply_event didn't match (only the dotted form), so the cached state never went terminal and the reconcile poller + straggler completions kept re-issuing __orchestrate__ (only a restart cleared it). Now matches both spellings; added ExecutionState::is_terminal() + a terminal guard in trigger_orchestrator_inner that evicts the orch-cache slot and skips dispatch.
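
Facet 2 can be illustrated with a small sketch. It assumes the dotted spelling is `playbook.cancelled` and models the orch-cache as a dict. The class mirrors the names in the entry but is a simplified stand-in, and `trigger_orchestrator` with its `dispatch` callback is hypothetical.

```python
from typing import Callable, Dict

# Both spellings are treated as the same terminal event (dotted form assumed).
CANCEL_EVENTS = {"playbook.cancelled", "playbook_cancelled"}


class ExecutionState:
    def __init__(self) -> None:
        self.status = "RUNNING"

    def apply_event(self, event_type: str) -> None:
        if event_type in CANCEL_EVENTS:
            self.status = "CANCELLED"
        elif event_type == "playbook.completed":
            self.status = "COMPLETED"

    def is_terminal(self) -> bool:
        return self.status in {"CANCELLED", "COMPLETED"}


def trigger_orchestrator(execution_id: str,
                         cache: Dict[str, ExecutionState],
                         dispatch: Callable[[str], None]) -> bool:
    """Skip dispatch and evict the cache slot once the execution is terminal."""
    state = cache.get(execution_id)
    if state is not None and state.is_terminal():
        cache.pop(execution_id, None)
        return False
    dispatch(execution_id)
    return True
```

With the guard in place, a late straggler completion or a reconcile tick finds a terminal state and issues nothing further.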

Validation (kind gate-ON: PUBLISH_ONLY=true + PLUGIN_DRIVE=true + materializer sole-writer). test_large_result_extraction drove a 785KB result → ref_resolved=1, COMPLETED, 0 __orchestrate__ event rows, no decode WARN. 5 of #113's 9 fixtures now COMPLETE (http_to_postgres_direct/simple/bulk_python, save_edge_cases, test_large_result_extraction) with bounded orch counts (2–8). Cancel: a drive-looping execution froze its orchestrate-command count the instant playbook_cancelled landed (0 new across 26s), CANCELLED, no server restart. Sole-writer lag (noetl_worker_nats_consumer_pending{consumer="noetl_materializer"}) = 0 throughout. Core regression 61/65 in-window; cargo test + clippy green. Off-server rig PASS with new assertion 7.

#114 — the distinct second stall. The other 4 #113 fixtures (test_output_select, test_storage_tiers, pagination/pipeline/test_pipeline_heavy_payload, batch_execution/kind_playbook_lease_expiry) now progress past the decode loop (the fix works — ref_resolved fires, 0 decode WARNs) but wedge at a separate failure: with refs_in_state=false the drive embeds the full resolved upstream context into the next step's command, so its command.issued event (~1.32MB for verify_extracted_fields, ctx[start]=1.06MB) exceeds NATS max_payload=1MB → publish ack-timeout → step.enter persisted, command never issued, wedge. Larger/riskier fix (refs-in-state / context trim / event sizing) → tracked in #114.

Pointers. ai-meta → server 1e844c1 (v3.29.4) + e2e 12b27e9. Issues: #113 commented + kept open; #114 opened (ai-task, repo:server, board 3). Prod GKE untouched; no prod default flipped.

2026-06-19 — 🚀 #103: GKE pre-flip prep — prod images pushed, GMP monitoring LIVE, parallel stack staged (NO traffic flip, NO PUBLISH_ONLY)

Headline. Staged, non-traffic-affecting prep for the prod CQRS PUBLISH_ONLY flip on gke_noetl-demo-19700101_us-central1_noetl-cluster. The post-#108/#103 images (server v3.29.3, worker v5.35.0) are built + pushed to the prod Artifact Registry; the materializer-lag monitoring is applied and verified live (as GMP, not VictoriaMetrics — see below); the roll-forward manifests are staged in a PR but not applied. No production default changed — no traffic flip, no PUBLISH_ONLY=true, no secret created.

Live prod reconciliation (read-only, the brief was stale on three counts). The prep brief inherited the #49 cutover-era view (Python serving, secrets missing, VM stack on prod). Verified against the live cluster, all three are now false:

  1. Prod already runs the full Rust stack — the #49 Python→Rust cutover is done. noetl Service selector = app=noetl-server-rust; no Python deployment exists. Live images are the pre-#103 generation (server-rust:batch-dispatch-v1, noetl-worker-rust:cursor-100).
  2. Both flip secrets already exist (created when the Rust cutover landed): NOETL_ENCRYPTION_KEY on noetl-secret, noetl-internal-api-token. The "create the secrets" operator step is done.
  3. Prod monitoring is Google Managed Prometheus (GMP), not VictoriaMetrics (namespaces gke-gmp-system/gmp-public; no VM operator). The kind VMRule/VMServiceScrape cannot apply here — they were translated to GMP-native objects.

What shipped (all in ops, PR staged under kadyapam):

  • Prod AR images (non-traffic-affecting push): server-rust:v3.29.3 (@sha256:6d2de321cd0938182c85cfaa500ac922e31f128d046e6306816f0983b40e6d1e) + noetl-worker-rust:v5.35.0 (@sha256:9912b032473b20893a1f06c1ae0c13f9a0120c3a73ee6c616407622c990eac94), linux/amd64 (Cloud Build, matches the GKE node pool). Built from the pinned submodule checkouts (server b6e5d31, worker b910341).
  • GMP monitoring, APPLIED + VERIFIED LIVE (ci/manifests/noetl/gmp/): podmonitoring-noetl.yaml (GMP PodMonitoring for the worker pools + the server /metrics; GMP does not honor prometheus.io/scrape annotations, and the noetl namespace had no PodMonitoring before, so noetl app metrics were not being scraped at all) + rules-materializer-lag.yaml (GMP Rules, identical PromQL/thresholds to the kind VMRule). Proven via the Managed Prometheus query API: up{namespace="noetl"} = 4 series all 1 (server + 2 worker-rust + system-pool), noetl_worker_nats_consumer_pending + noetl_events_ingested_total flowing. The materializer-specific series appear once the v5.35.0 worker rolls.
  • Staged (NOT applied) roll-forward manifests: server-rust-deployment-prod.yaml → v3.29.3 (explicit NOETL_EVENT_INGEST_PUBLISH_ONLY=false), worker-system-pool-deployment-prod.yaml → v5.35.0 (NOETL_MATERIALIZER_ENABLED=false). These roll live workloads, so they are operator-gated.
  • Runbooks: noetl-cqrs-publish-only-flip.md gained a "Production (GKE) — environment specifics" section (GMP not VM, secrets exist, image-roll prerequisite, the exact 5-step operator sequence) + a GMP managedAlertmanager pager-wiring section with a templated receiver stub; the #49 cutover runbook got an "ALREADY EXECUTED — historical record" banner.

Operator-gated (surfaced, NOT done — conservative prod bias):

  1. Roll the live system pool → v5.35.0 (low blast radius), then the live server → v3.29.3 (zero-downtime rolling update of the traffic-serving deployment).
  2. Enable the materializer as a shadow (NOETL_MATERIALIZER_ENABLED=true, server gate still off) — the green-baseline check.
  3. Wire the GMP managedAlertmanager pager (needs the receiver endpoint this prep does not hold — templated stub in the runbook).
  4. The PUBLISH_ONLY=true flip itself, behind the live lag alerts, one revert away.

Pointers. ai-meta e5b6d6c → ops 9edd9c4 (PR #197 — GMP monitoring applied + staged manifests + runbook prod section). Tracks #103; refs #49 (cutover already done), #107 step 1, #111. Prod state restored to as-found except the three non-traffic artifacts (AR images, GMP monitoring, staged PR).


2026-06-19 — 🛡️ #103: materializer-lag GUARDRAIL shipped — the pre-flip observability gate (worker v5.35.0 + ops VMRule/dashboard/runbook)

Headline. The server was already FLIP-READY for NOETL_EVENT_INGEST_PUBLISH_ONLY; the one remaining operator gate was observability of materializer lag so the staged flip is safe and one revert away. That guardrail is now shipped end-to-end and kind-proven (induce → fire → recover → clear). PUBLISH_ONLY stays default-off; no prod default changed.

Why it matters. Under the gate the server writes zero noetl.event rows and publishes every event to the noetl_events JetStream stream; the worker-side materializer (system pool) is the sole noetl.event writer. If it falls behind or dies, published events pile up un-materialized and the event log silently stops advancing. The guardrail is the early-warning + page surface for exactly that — the cutover design note's "materializer availability is now load-bearing."

The lag signal chosen. Primary = the JetStream consumer backlog on the materializer consumer: noetl_worker_nats_consumer_pending{stream="noetl_events",consumer="noetl_materializer"} + _ack_pending, reported by the worker's existing lag poller on an independent task. That independence is the point — a stalled or dead materializer loop can't report its own lag, but the separate poller keeps the gauge climbing. It's restart-robust (a gauge of current stream state, not a process counter) and is the earliest "falling behind / down" indication. Backed by a rate-based stall cross-check (published rate>0 AND acked rate==0) that self-scopes to the gate (the published counter only moves under the gate).

Shipped (default-off).

  • worker #116 → v5.35.0 (b910341): NatsSubscriber::consumer_lag_for(stream,consumer) queries an arbitrary consumer over the same JetStream connection; the lag poller now also tracks the materializer consumer when NOETL_MATERIALIZER_ENABLED is set, recording into the existing labelled noetl_worker_nats_consumer_pending{stream,consumer} gauge (no new metric, one extra consumer-info round-trip per tick, system pool only). 201 lib tests + clippy clean.
  • ops #195 + #196 (2fcfa59): ci/manifests/noetl/vmrule-materializer-lag.yaml — backlog warning (>200/10m) + critical (>2000/5m) + growing-unbounded + stall-under-gate (5m, guarded on backlog>0 to suppress a post-burst false positive) + project-errors + absent-under-gate; ci/manifests/noetl-scrape/vmscrape-worker.yaml — worker /metrics VMServiceScrape (the worker pools were previously unscraped); vmstack-values.yaml — VMAlert enabled (was off; rules don't evaluate without it) + documented alertmanager routing; Grafana dashboard; flip runbook runbooks/noetl-cqrs-publish-only-flip.md (pre-flip green-baseline check, per-alert meaning, one-command revert NOETL_EVENT_INGEST_PUBLISH_ONLY=false).
  • worker-wiki 0030f30 — deployment-spec Observability section documents the lag gauge + materializer counters.
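
The alert logic described above can be sketched as a pure function. This is only an illustration, not the shipped VMRule: the thresholds (200 and 2000) and the backlog>0 guard come from this entry, while the `Sample` type, the alert names, and the omission of the 10m/5m hold windows are assumptions made for brevity.

```python
from dataclasses import dataclass

WARNING_BACKLOG = 200    # from this entry: warning when backlog > 200 (held 10m)
CRITICAL_BACKLOG = 2000  # from this entry: critical when backlog > 2000 (held 5m)


@dataclass
class Sample:
    """One evaluation point (hypothetical shape, for illustration only)."""
    backlog: int           # noetl_worker_nats_consumer_pending for the materializer consumer
    published_rate: float  # events/s published by the server under the gate
    acked_rate: float      # events/s acked by the materializer


def classify(sample: Sample) -> list:
    """Return the alert names that would be active for this sample.

    Hold windows ("for" durations) are deliberately not modelled here.
    """
    alerts = []
    if sample.backlog > WARNING_BACKLOG:
        alerts.append("backlog_warning")
    if sample.backlog > CRITICAL_BACKLOG:
        alerts.append("backlog_critical")
    # Stall cross-check: events are being published but nothing is acked.
    stalled = sample.published_rate > 0 and sample.acked_rate == 0
    # Guard on backlog > 0 so a lingering published-rate window after a
    # fully drained burst does not re-assert the stall alert.
    if stalled and sample.backlog > 0:
        alerts.append("stalled_under_gate")
    return alerts
```

With the induced-lag reading from the kind run (backlog 684, published rate above zero, acked rate zero) this sketch reports the warning and the stall; at backlog 0 it reports nothing even while the published rate is still non-zero, which is the false positive the guard suppresses.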

Kind validation (the proof, not just deploy). Deployed the VictoriaMetrics stack (operator + vmsingle + vmagent + vmalert) on kind-noetl against a gate-on server (v3.29.x) + the v5.35.0 worker:

  • Green baseline — gate-on, healthy materializer: backlog noetl:materializer_backlog=0; an exec COMPLETED with published==projected==acked==25 (each event published once, materialized once, acked once); all alerts inactive.
  • Induced lag: NOETL_MATERIALIZER_FAULT_FAIL_FIRST=1000000 (materializer drains but skips project+ack) while driving executions under the gate: the backlog gauge climbed 0 → 120 → 396 → 684 via the independent poller; published rate>0, acked rate→0. Alerts fired — backlog warning + critical + stall (proven on the same-expression short-window validation copy); the shipped MaterializerProjectErrors and MaterializerStalledUnderGate reached firing on the prod windows.
  • Recovery — fault removed: the healthy materializer drained the whole backlog idempotently (drained=N projected=N acked=N duplicates=0), backlog → 0, alerts cleared. The recovery also surfaced + fixed a post-burst false positive (stall expr re-asserting at backlog=0 due to the lingering published-rate window) — guarded on backlog>0 (#196, verified inactive at backlog=0).
  • Cluster restored to baseline (gate off, system pool back to the baseline image, FAULT removed, stream purged, rust pool restored).

Pointers. ai-meta → worker b910341 (v5.35.0) + ops 2fcfa59 + worker-wiki 0030f30. Tracks #103; refs #107 step 1, #111. Flip-readiness now includes the monitoring gate.


2026-06-19 — 🎯 #103: server CQRS cutover COMPLETE — FLIP-READY (cancel/finalize through the chokepoint, server v3.29.3)

Headline: the third and final PUBLISH_ONLY flip blocker is closed. The two ExecutionService terminal writers — POST /api/executions/{id}/cancel (playbook_cancelled) and POST /api/executions/{id}/finalize (playbook_completed/playbook_failed) — now route their noetl.event writes through the emit_event chokepoint, so they honour NOETL_EVENT_INGEST_PUBLISH_ONLY like the other 13 producer sites instead of INSERTing synchronously under the gate. The server is now a complete non-writer of the event log when PUBLISH_ONLY is on.

What shipped (default-off):

  • server#240 → v3.29.3 (b6e5d31): ExecutionService carries the full AppState (cheap, all-Arc; no cycle — AppState doesn't reference the service). cancel/finalize build an EventRow and call emit_event. New resolve_catalog_id falls back from noetl.event to noetl.command (mirrors #236) because noetl.event is empty under the gate for a fresh exec. require_state guards the pool-less test shim so there's exactly one event write path. 558 lib tests + clippy green.
  • e2e#62 (dee459f) — dual-mode rig kind_validate_cancel_finalize_gate.sh (auto-detects the gate; sibling of the #61 orchestrate-gate rig).

Kind proof — both modes:

  • gate-OFF: cancel → CANCELLED, finalize → FAILED; both INSERT synchronously (published_delta=+0), byte-identical columns (node_id=node_name='playbook', finalize error preserved); rows==distinct, 0 catalog_id=0; natural completion still COMPLETED (25==25) — no regression.
  • gate-ON (PUBLISH_ONLY=true, materializer = system pool sole writer): server noetl_event_ingest_published_total{playbook_cancelled}=1, {playbook_failed}=1 → PUBLISHED, not inserted; materializer cycles drained→projected→acked duplicates=0; both terminal events materialized with byte-identical columns + correct catalog_id (command fallback held); both executions reach the correct terminal state; rows==distinct, 0 catalog_id=0, no loss/dup.

No remaining synchronous server noetl.event writers under the gate. The only INSERT INTO noetl.event left server-side are the chokepoint's own gate-off INSERT and the two materializer sinks (events_project + project_events, which ARE the sole writer under the gate). claim_command / handle_batch_events in-tx INSERTs are should_publish-gated. EventService / db::queries::event is dead code (never instantiated).

Flip-readiness (for the operator). All three flip blockers are now closed — (1) ack-after-materialize durability, (2) off-server-drive × gate reconciliation (#104), (3) cancel/finalize. Flipping NOETL_EVENT_INGEST_PUBLISH_ONLY on (materializer sole writer) is a staged operator decision, behind a materializer-lag metric/alert, one revert away. No production default changed.

Pointers: ai-meta repos/server → b6e5d31 (v3.29.3), repos/e2e → dee459f. Cluster restored to baseline (server/system-pool → oc-pool, gate off, materializer off, stream purged). Closes server#239.


2026-06-19 — #104: off-server-drive × gate reconciliation PROVEN (server v3.29.2 cold-cache rebuild + committed gate e2e rig)

Headline: the last real blocker before the PUBLISH_ONLY sole-writer flip is operator-safe is closed. The combination #103 left unproven — gate-ON (NOETL_EVENT_INGEST_PUBLISH_ONLY=true) with the off-server worker-driven drive (NOETL_ORCHESTRATE_PLUGIN_DRIVE=true) and the materializer as sole noetl.event writer — is now green end-to-end on kind.

Why it composes (the reconciliation). The off-server drive does not read noetl.event itself: the server rebuilds WorkflowState from the event log and passes the bounded state into the __orchestrate__ command. Under the gate the orchestrator trigger is relocated to fire from the materializer's write endpoint (events_project) after the row is durably materialized — so when the server rebuilds state it reads committed noetl.event, giving read-your-writes. No worker-side read-cache was needed for the steady state; the existing trigger relocation already reconciles it. What was missing was (a) a committed proof, and (b) crash-recovery for the cold-cache apply window.

Live proof on kind (clean cluster; server v3.29.1→.2 + gate + off-server drive; system pool NOETL_MATERIALIZER_ENABLED=true):

  • Happy path: fresh exec + cursor fan-out → COMPLETED; server wrote 0 noetl.event rows (all 25 PUBLISHED — noetl_event_ingest_published_total=25, every write via events/project); materializer materialized all 25 exactly once (drained=projected=acked, 25 rows == 25 distinct ids, 0 catalog_id=0, 0 duplicate cycles); drive dispatched=applied, decode_error=0, event_suppressed>0 (meta-command never hit noetl.event).
  • Crash-recovery: a server hard-killed mid-drive → the in-flight __orchestrate__ call.done lands on the cold new pod → the new cold_rebuild path fires (metric + log) → that execution COMPLETES with full integrity (25 rows == 25 distinct, correct fan-out, playbook.completed). Graceful-restart soak: 20/20 executions across 2 rolling restarts COMPLETED, 0 loss (reconcile poller + worker emit_event_with_retry recover all).
  • Regression: gate-off + in-process drive (prod default) → COMPLETED, synchronous INSERT, no published/drive metrics; gate-off + off-server drive (kind_validate_orchestrate_offserver.sh) → PASS.

Shipped (default-off; no prod default changed):

  • noetl/server#238 → v3.29.2 (76d29bb, closes server#237) — apply_worker_orchestration rebuilds WorkflowState from the durable log on a cold-cache apply (server restarted mid-drive) instead of dropping the in-flight result; the #104 "rebuild from the WAL/projection" principle. Confined to the cold branch the warm happy path never enters; idempotent re-apply (deterministic command_id + cursor id-set gating). Adds noetl_orchestrate_drive_total{stage=cold_rebuild|cold_rebuild_failed}.
  • noetl/e2e#61 (61f7a5c, closes e2e#60) — committed kind rig kind_validate_orchestrate_gate.sh asserting the gate × off-server-drive combination (server published / materializer sole writer / rows==distinct / off-server topology under the gate); documented in docs/operations/local-kind.md.
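
The cold-cache rebuild plus idempotent re-apply can be pictured with a toy model. Everything below is an assumption made for illustration: the hashing scheme behind `command_id`, the shape of the state, and the function names are hypothetical and do not reflect the server's actual `apply_worker_orchestration` code. The sketch only shows why a deterministic id makes re-applying the same worker result after a restart a no-op.

```python
import hashlib


def command_id(execution_id: str, step: str, iteration: int) -> str:
    """Deterministic id: the same inputs always give the same id (assumed scheme)."""
    key = f"{execution_id}:{step}:{iteration}".encode()
    return hashlib.sha256(key).hexdigest()[:16]


def rebuild_state(event_log, execution_id):
    """Cold path: recover the set of already-issued commands from the durable log."""
    issued = {e["command_id"] for e in event_log if e["execution_id"] == execution_id}
    return {"issued": issued}


def apply_result(cache, event_log, execution_id, commands):
    """Apply a worker-computed result; returns (path taken, commands newly applied)."""
    state = cache.get(execution_id)
    path = "warm"
    if state is None:
        # The in-memory cache is empty (for example after a restart), so
        # rebuild from the log instead of dropping the in-flight result.
        state = rebuild_state(event_log, execution_id)
        cache[execution_id] = state
        path = "cold_rebuild"
    applied = 0
    for step, iteration in commands:
        cid = command_id(execution_id, step, iteration)
        if cid in state["issued"]:
            continue  # already applied once: re-apply is a no-op
        state["issued"].add(cid)
        event_log.append({"execution_id": execution_id, "command_id": cid})
        applied += 1
    return path, applied
```

Clearing `cache` between two identical calls models a server restart mid-drive: the second call takes the cold path and applies nothing new.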

Pointers: ai-meta → server 76d29bb + e2e 61f7a5c.

Remaining before a safe PUBLISH_ONLY flip: only the 2 ExecutionService cancel/finalize sites (still synchronous under the gate; correct — no lost/double writes; need AppState). The off-server-drive×gate item is closed.


2026-06-19 — #103: ack-after-materialize durability RESOLVED + fault-tested (deferred ack + worker materializer loop)

Headline. Closed the one gating correctness item for a safe PUBLISH_ONLY flip: the materializer used to ack noetl_events messages on fetch, before events/project ran, so a transient project failure (server restart mid-drain) lost the acked-but-unmaterialized events. Now there is true ack-after-materialize — acked only after the durable insert; on failure the batch redelivers. Default-off; no prod default changed.

Shipped (PRs open; pointer bumps staged on the crate-publish cascade):

  • noetl/tools#71 — deferred (ack-after-processing) ack in the subscription SourceClient + NATS source. AckMode::Defer surfaces a durable per-message ack handle (the NATS $JS.ACK.* reply subject — connection/process-independent within ack-wait); SourceClient::ack(ack_ids, AckDisposition) = Ack/Nack/Term (NATS + Pub/Sub impls); tool operation: ack|nack|term. Opt-in — existing on_success/manual callers unchanged.
  • noetl/worker#115 — in-process CQRS materializer consume-loop (src/materializer.rs, NOETL_MATERIALIZER_ENABLED, default off, system pool only): drain noetl_events (deferred) → events/project → ack only on 2xx; failure → un-acked → redeliver. Chosen over playbook deferred-ack because the step model can't hold an ack handle across drain→build→project on different pods (result-store _ref stall, concurrent-drain batch splits). Observability triad metrics added.
  • noetl/ops#194 — system-pool wiring (kind on; prod wired-but-false, staging order documented).
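
A minimal in-memory sketch of the ack-after-materialize loop described above. `FakeStream` is a hypothetical stand-in for a JetStream consumer with deferred acks, and `project` stands in for the events/project call (returning True models a 2xx); neither reflects the real noetl-tools or worker API.

```python
from collections import deque


class FakeStream:
    """In-memory stand-in for a stream consumer with deferred acks."""

    def __init__(self, messages):
        self.pending = deque(messages)  # not yet delivered
        self.in_flight = {}             # delivered, awaiting ack

    def fetch(self, batch):
        out = []
        while self.pending and len(out) < batch:
            msg = self.pending.popleft()
            self.in_flight[msg["id"]] = msg
            out.append(msg)
        return out

    def ack(self, ids):
        for msg_id in ids:
            self.in_flight.pop(msg_id, None)

    def redeliver_unacked(self):
        """Models ack-wait expiry: un-acked messages become deliverable again."""
        self.pending.extend(self.in_flight.values())
        self.in_flight.clear()


def materialize_cycle(stream, project, batch=25):
    """Drain a batch, project it, and ack only if the projection succeeded."""
    msgs = stream.fetch(batch)
    if not msgs:
        return 0
    if project(msgs):
        stream.ack([m["id"] for m in msgs])
        return len(msgs)
    return 0  # leave the batch un-acked so it redelivers
```

If `project` fails on the first cycle, nothing is acked; after `redeliver_unacked()` the same batch comes back and a second, successful cycle acks it. Paired with an idempotent sink, that gives zero loss and no duplicate rows, which is the property the fault-injection run checks.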

Fault-injection proof (kind, gate-ON, PUBLISH_ONLY=true, in-process drive, materializer = sole noetl.event writer; server mat-gate=v3.29.1+gate, worker mat-defer):

  • Happy: execution COMPLETED; materializer cycles drained=N projected=N acked=N duplicates=0; rows==distinct (25==25); stream unprocessed=0 — sole writer, zero loss.
  • Fault before ack: FAULT-INJECT: skipping project + ack; batch will redeliver → after ack-wait the batch redelivered + projected+acked → execution COMPLETED; rows==distinct (907==907); unprocessed=0 — loss=0 across a mid-drain failure, idempotent (no dup rows).
  • Also: crate-level live-NATS integration test (nats_integration_deferred_ack_and_redelivery) proves defer→no-ack→redeliver→ack→empty. Cluster restored to baseline.

Cascade LANDED (same day): tools#71 merged → noetl-tools 3.13.0 published (crates.io); worker#115 rebuilt green against 3.13.0 + merged → noetl-worker 5.34.0; ops#194 merged. ai-meta pointers bumped on main (tools 1ba739f + worker 92ee58f + ops 30e194d, ai-meta 0cc74cd). Squash-merges with conventional titles drove semantic-release.

Remaining before the flip is operator-safe: (1) ack-after-materialize done, (2) off-server-drive×gate reconciliation (#104 read-cache — gate-on validated with the in-process drive), (3) 2 ExecutionService cancel/finalize sites. Default-off.


2026-06-18 — #103: gate-on sole-writer PROVEN end-to-end (catalog_id FK fix)

Headline. A full gate-ON execution now runs all the way to COMPLETED via the materializer-as-sole-writer path, with zero loss — server writes 0 noetl.event rows, the materializer writes all 31, drained=materialized=31. (server#236 → v3.29.1 994da30, ai-meta db7ceeb).

Root cause (found via a serial clean-cluster soak — the earlier playbook ack/routing/concurrency hypotheses were red herrings). Under PUBLISH_ONLY noetl.event is empty, so get_catalog_id (noetl.event-only lookup) returned catalog_id=0 for worker-emitted events → published rows carried catalog_id=0 → event_catalog_id_fkey violation → events/project batch 500 → the ack-on-fetch'd events were lost. Fix: get_catalog_id falls back to noetl.command (synchronous under the gate).
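
The fallback can be illustrated with plain lists standing in for the two tables. This is a sketch only: the row shape and the function signature are assumptions, not the server's real `get_catalog_id`.

```python
from typing import Optional


def get_catalog_id(execution_id, event_rows, command_rows) -> Optional[int]:
    """Look up catalog_id for an execution: event log first, then commands.

    event_rows / command_rows are lists of dicts standing in for
    noetl.event and noetl.command (hypothetical shape).
    """
    for row in event_rows:
        if row["execution_id"] == execution_id:
            return row["catalog_id"]
    # Under the publish-only gate the event table can be empty for a fresh
    # execution, while commands are still written synchronously.
    for row in command_rows:
        if row["execution_id"] == execution_id:
            return row["catalog_id"]
    return None  # caller must not substitute 0: that breaks the FK
```

With an empty event list and one command row for the execution, the lookup returns the command's catalog_id rather than a zero that would violate the foreign key.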

Proof. Gate-on: COMPLETED, 0 server writes, materializer projected 6→12→12→1 (no dup/FK), 31 events / 31 distinct ids, strictly ordered, 0 catalog_id=0. Gate-off: off-server orchestrate e2e PASS (no regression); 557 lib tests. Cluster restored to baseline.

Remaining before the PUBLISH_ONLY flip is operator-safe: (1) materializer ack-after-materialize durability hardening (it acks on fetch → a future transient events/project failure could still lose events; needs deferred-ack tooling or a worker-side consume loop), (2) off-server-drive×gate reconciliation (#104 read-cache; gate-on validated with in-process drive), (3) 2 cancel/finalize sites. Default-off; no prod flip.

2026-06-18 — #103: 2d-3 sole-writer cutover implemented (emit_event chokepoint + PUBLISH_ONLY gate, default-off)

Update — MERGED + clean-cluster soak. server#235 merged → server v3.29.0 (e2dc0ce, ai-meta 319985d). Clean-cluster gate-on soak re-proved the server cutover (a fresh gate-on exec writes 0 noetl.event rows; worker claims+runs, events published not inserted), but exposed a materializer-playbook lost-events bug (ops#192, not the server): ack: on_success is on the drain step, so the materializer acks the 6 published events off the stream before materializing them, and when {{ drain_events.count }} mis-routes to empty the events are acked-but-not-materialized = lost. Fix staged on ops#192 (ack-after-materialize + count-path fix + serialize the drainer); ties to #104's WAL-durability. Gate stays default-off until fixed.

Headline. The server-side CQRS write-path cutover is implemented and default-off (server#235). Under the gate the server stops writing noetl.event — events publish to the stream and the system/event_materializer becomes the sole writer. Proven on kind: a gate-on execution writes 0 noetl.event rows; gate-off is byte-identical. No production default flipped (operator decision).

Shipped (server#235, branch kadyapam/cqrs-2d3-publish-only-gate, 2 commits).

  • handlers::event_write::emit_event/emit_events chokepoint over EventRow: gate-off = canonical full-column INSERT (byte-identical — absent columns bind NULL = today's defaults); gate-on = publish to_jsonb(row) to noetl_events.
  • NOETL_EVENT_INGEST_PUBLISH_ONLY config (default false) + lazy publisher in AppState.
  • 13 producer sites routed through it; noetl.command stays synchronous; the two sink writers (events_materialize, events_project) untouched.
  • Trigger relocation: under the gate events_project fires the orchestrator trigger after the row materializes (read-your-writes); ingest no longer triggers.
  • System-pool exemption: system/* playbooks (the drainers) write synchronously even under the gate — else the materializer deadlocks waiting to drain its own events (found live).
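
The chokepoint behaviour in the list above reduces to one branch. The sketch below is hypothetical: the `playbook_path` field and the `insert` / `publish` callables are illustrative names, not the server's `emit_event` signature.

```python
def emit_event(row, *, publish_only, insert, publish):
    """Route one event row through a single write path.

    gate off              -> synchronous insert
    gate on               -> publish to the stream (materializer inserts later)
    gate on, system/* row -> still insert, so the drainers never wait on
                             their own events
    """
    is_system = row.get("playbook_path", "").startswith("system/")
    if publish_only and not is_system:
        publish(row)
        return "published"
    insert(row)
    return "inserted"
```

Passing two lists' `append` methods as `insert` and `publish` is enough to see the routing: with the gate on, a normal row lands in the published list and a `system/event_materializer` row lands in the inserted list.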

Live proof on kind (image built from the branch). Gate-OFF: off-server orchestrate e2e PASS + 25/25 event-row regression (distinct ids, all columns correct, tenant/org defaults, playbook_started node_id≠node_name, command.issued parent+nodetype) → byte-identical. Gate-ON: a gate-on execution wrote 0 noetl.event rows (twice) — events PUBLISHED (noetl_event_ingest_published_total confirms), noetl.command synchronous; the exemption + materialize ({projected:0,duplicates:N}) + relocated trigger all work. Not cleanly shown: a single fresh exec driven fully to COMPLETED under the gate — the shared kind cluster was saturated by accumulated stuck test execs (reconcile re-drives re-flooding the stream); a test-environment artifact, not a defect. 557 lib tests + clippy green; 3 new unit tests. Cluster restored to baseline.

Staged / operator decision. PUBLISH_ONLY stays default-off; the flip is an operator decision behind a materializer-lag metric/alert, one revert away. ai-meta pointer bump waits for server#235 merge. Follow-ups: clean-cluster end-to-end soak; off-server-drive × gate reconciliation (ties to #104 read-cache); ExecutionService cancel/finalize (2 sites still synchronous, correct, staged).


2026-06-18 — #112: fixed the worker /dev/shm SIGBUS (64 MiB tmpfs vs 256 MiB Arrow IPC cache)

Headline. Every NoETL worker pod was a latent crash-loop: the Kubernetes container-runtime default /dev/shm is a 64 MiB tmpfs, but every worker process allocates a 256 MiB Arrow IPC shared-memory cache at init (NOETL_IPC_CACHE_BUDGET_BYTES, default 268435456) backed by POSIX shm on /dev/shm. Under shm-heavy load the cache writes past 64 MiB, the store page-faults against the full tmpfs, and the kernel delivers SIGBUS — the worker dies with exit 135 and crash-loops. Surfaced during the #103 CQRS kind validation on the system pool; a transient live fix was reverted, leaving the committed manifests carrying the bug.

What shipped. ops#193 (merged → ops f4df4c1) fixes all 7 worker deployments that allocate the cache — system pool, shared/rust pool, rust subscription pool, subscription runtime, the legacy Python cpu pool, and the two prod variants (config only; not rolled out by the PR). Each gets a memory-backed /dev/shm (emptyDir medium: Memory, sizeLimit: 320Mi — budget + 64 MiB headroom), the budget pinned explicitly via NOETL_IPC_CACHE_BUDGET_BYTES=268435456 next to the sizeLimit so the two can't silently drift, and the container memory limit raised to 768Mi (the tmpfs is charged to the pod cgroup, so the limit must cover sizeLimit + worker RSS).
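
The three values that move together can be checked with simple arithmetic. The numbers are the ones quoted in this entry; the helper and its field names are only an illustrative sketch.

```python
MIB = 1024 * 1024


def shm_plan(budget_bytes: int, headroom_mib: int, memory_limit_mib: int) -> dict:
    """Derive the /dev/shm sizeLimit and what remains for worker RSS."""
    budget_mib = budget_bytes // MIB
    size_limit_mib = budget_mib + headroom_mib
    return {
        "budget_mib": budget_mib,
        "shm_size_limit_mib": size_limit_mib,
        # The tmpfs is charged to the pod cgroup, so the container limit
        # must cover the sizeLimit plus the worker's own RSS.
        "left_for_rss_mib": memory_limit_mib - size_limit_mib,
        # Would the cache fit in the 64 MiB container-runtime default?
        "fits_default_shm": budget_mib <= 64,
    }


plan = shm_plan(268435456, headroom_mib=64, memory_limit_mib=768)
print(plan)
```

For the shipped values this gives a 256 MiB budget, a 320 MiB sizeLimit, and 448 MiB left under the 768Mi limit for the worker itself; the budget does not fit the 64 MiB default, which is the original bug.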

Validation (kind, system pool). Reproduced the crash faithfully (the cache's ftruncate+mmap+page-store pattern): on the 64 MiB default tmpfs the write SIGBUSed at the boundary (exit 135, "Bus error"). After applying the fix (strategic patch preserving the running image, to avoid disturbing other sessions on the shared cluster), /dev/shm is a 320 MiB tmpfs, the pod is ready/restarts=0, the cache initialises with budget_bytes=268435456, and a full 256 MiB write completes (exit 0, peak 256M/320M, 80%) with no OOM against the 768Mi limit. The kind system pool was restored to its pre-validation baseline (image, 64 MiB shm, 512Mi limit) afterwards.

Wiki. worker deployment-specification gains a Shared memory (/dev/shm) section (the required memory-backed mount + the three values that move together) and corrects the Resources memory guidance (was 256Mi/"cap at 384Mi" with no /dev/shm mention — the latent bug); the NOETL_IPC_CACHE_BUDGET_BYTES env-var row now points at it.

Pointers. ai-meta → ops f4df4c1 + ai-meta-wiki. Did NOT touch the #49 prod GKE cluster, the in-flight CQRS chokepoint work, or the unmerged CQRS branch.


2026-06-18 — #103: CQRS 2d shadow validated; materializer proven sole-writer-capable; 2d-3 cutover designed + staged

Headline. The CQRS materializer (system/event_materializer) is now a deployed, kind-validated, sole-writer-capable component — and the 2d-3 producer cutover that makes it the sole noetl.event writer is designed and precisely staged. No production default flipped (operator-gated, like the prior default-flips).

What shipped. ops#192 — the system/event_materializer + system/projector system playbooks (+ looping CronJobs). The materializer drains the noetl_events JetStream stream and writes noetl.event via POST /api/internal/events/project (idempotent ON CONFLICT DO NOTHING), talking only to the server's internal API (data-access-boundary). Built on the in-flight feat/cqrs-2d1-materializer-playbook branch, rebased + reconciled against the merged 2a/2b/2d-1/2d-2 server scaffold.
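
The idempotent write can be demonstrated with SQLite as a stand-in for Postgres (an assumption made so the sketch runs anywhere; the two-column table and the `project` helper are illustrative, not the real noetl.event schema or endpoint). It needs an SQLite build with upsert support (3.24 or newer).

```python
import sqlite3


def open_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE event (event_id INTEGER PRIMARY KEY, payload TEXT)")
    return conn


def project(conn, rows):
    """Insert rows idempotently and report how many were new vs already present."""
    before = conn.execute("SELECT COUNT(*) FROM event").fetchone()[0]
    conn.executemany(
        "INSERT INTO event (event_id, payload) VALUES (?, ?) "
        "ON CONFLICT (event_id) DO NOTHING",
        rows,
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM event").fetchone()[0]
    projected = after - before
    return {"projected": projected, "duplicates": len(rows) - projected}
```

Projecting the same 25 rows twice yields 25 projected on the first call and 25 duplicates on the second, matching the `{projected:0, duplicates:25}` shape reported for rows that were already in the log.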

Live proof on kind (shadow; tailer on, synchronous INSERT still on). Materializer reproduces the log byte-identically + idempotently: events/project of 25 real off-server-orchestrate event rows → {projected:0, duplicates:25} (every row already present, zero double-writes, zero errors); a full playbook cycle returned {projected:0, duplicates:20}. Tailer publishes every committed event; both consumers (noetl_projector/noetl_materializer) ensured at startup. Off-server orchestrate e2e (kind_validate_orchestrate_offserver.sh) stayed green with the tailer on.

Findings (fixed/surfaced). (1) Playbook stall: batch: 500→25 — a drain over the worker's 100 KiB inline-context budget stages to a _ref and {{ drain_events.count/messages }} stop resolving (fixed in ops#192). (2) The system-pool /dev/shm (64 MiB k8s default) < the worker's 256 MiB Arrow IPC cache budget → SIGBUS under shm-heavy load (surfaced separately; not required with batch=25).

2d-3 cutover scope (designed, NOT shipped). Discovered the sole-writer cutover is a server-wide ~18-site event-write chokepoint refactor (no single chokepoint today) + a default-off NOETL_EVENT_INGEST_PUBLISH_ONLY gate (publish the normalized row instead of INSERT; materializer becomes sole INSERTer) + orchestrator-trigger relocation to the materializer write endpoint (read-your-writes — the trigger reads noetl.event with a COUNT consistency check today, so it hard-depends on synchronous writes). Correctness-critical, multi-round, deliberately not rushed. Tracked on #103; the producer cutover is an operator decision for relay. Cluster restored to baseline.


2026-06-18 — #111: e2e topology coverage for the off-server orchestrate drive + server-API-only gap assessment

Headline. Closed the missing-e2e gap on the Server-Dissolution program's step 2 (#107): the worker-driven orchestrate topology (default-on since server v3.28.0 / #108, shadow retired in #110) had been validated ad-hoc during the shipping sessions but had no committed, repeatable rig. Added one, and used the wrap-up to produce the server-API-only status assessment + surface two operator decisions.

What landed.

  • e2e rig scripts/kind_validate_orchestrate_offserver.sh (e2e#59, squash-merged → e2e 977efc2). Self-contained kind rig over the fanout_reduce_phase6 fixture; hard-asserts the drive runs off-server on the system pool: final COMPLETED; 0 __orchestrate__ rows in noetl.event (no event-log burst); __orchestrate__ rows present in noetl.command (off-server dispatch); noetl_orchestrate_drive_total dispatched + applied both advance with no decode_error (the server applied a worker-computed result); noetl_orchestrate_shadow_total absent (post-#110). Documented under docs/operations/local-kind.md.
  • Live validation on kind-noetl against server v3.28.0 (post-#110 oc-noshadow image, drive ON): PASS — COMPLETED, event rows = 0, 4 off-server drives, dispatched=applied=4, 0 decode errors, shadow series absent. Kind restored to its as-found baseline (oc-pool image, DRIVE=false, SHADOW=false) afterward.
  • Tracking #111 opened (ai-task, repo:e2e; board → In progress) as the durable home for the e2e coverage + the gap assessment + the operator decisions.

Server-API-only assessment. Step 2 (orchestrator → plug-in) is complete — the evaluate loop runs off-server. The server is not yet API-only: it remains the sole writer to noetl.event/noetl.command (the drive's apply path, apply_worker_orchestration → apply_orchestration_result in server/src/handlers/events.rs) and still rebuilds WorkflowState from events to bound the drive input. Both move under the later program steps — CQRS write-path (#103) makes the materializer sole writer; NATS-as-WAL (#104) + Postgres-demotion (#107 step 4) remove the from-events rebuild. Not owned by the orchestrator-dissolution thread.

Surfaced for the operator (not done unilaterally).

  • (A) Retire the in-process drive fallback — gated on prod adopting a post-#108 image first. Prod GKE still runs server-rust:batch-dispatch-v1 (pre-#108), so the worker-driven drive is not live in prod; removing the =false revert now is premature.
  • (B) Reap __orchestrate__ delivery rows — each drive writes one PENDING row to noetl.command (worker_id=null) that is never reconciled to terminal (its lifecycle events are suppressed from noetl.event). Accumulates one row per drive (~694 in a single #108 soak). Wants a TTL / mark-terminal-on-apply / separate-table strategy. Scale-relevant, not a correctness bug.

Pointers. ai-meta repos/e2e bumped 94aa7f1 → 977efc2. Handoff 2026-06-18-orchestrate-plugin-dissolution round-02-result.


2026-06-18 — #110: retired the in-server orchestrate shadow + the wasmtime server dependency

Headline. Server-slimming follow-up to the now-closed #108. The in-server shadow was the slice-4 cutover-confidence harness — it ran the system/orchestrate plug-in inside the server via an embedded wasmtime host and diffed its commands against the in-process drive (529 match / 0 mismatch). With the worker-driven drive default-on and proven, that harness is dead weight: the live drive uses the worker's wasmtime host, never the server's. Retiring it drops the heavy wasmtime server dependency and collapses the build to one config.

What landed. server#234 (squash f3043c9, refactor: → no version bump, stays v3.28.0) removed src/orchestrate_shadow.rs, the orchestrate-shadow cargo feature + the optional wasmtime dep (the cranelift/wasmtime tree — ~1000 Cargo.lock lines — fell out; cargo tree -i wasmtime now matches nothing), the trigger_orchestrator_inner shadow hook (shadow_pre_state + shadow_diff), the main.rs boot loader, the orchestrate_plugin_shadow config field + NOETL_ORCHESTRATE_PLUGIN_SHADOW, the noetl_orchestrate_shadow_total metric, and --features orchestrate-shadow from the Dockerfile. Kept noetl-orchestrate-plugin's run_state (the drive uses it) and NOETL_ORCHESTRATE_PLUGIN_DRIVE (default true).

Validation. cargo build/test/clippy --all-targets clean (single config). Kind smoke on a 4-page self-contained cursor loop (tests/oc_smoke/cursor, alternating fetch/check steps), freshly-built image, both drive modes: drive=false (in-process — the exact edited path) → COMPLETED, 0 __orchestrate__ rows in noetl.event; drive=true (worker-driven default) → COMPLETED, 0 __orchestrate__ rows, 10 drive commands on the system pool (dispatched=applied=10, event_suppressed=30, skipped_in_flight=2). noetl_orchestrate_shadow_total confirmed gone from /metrics. Kind cluster restored to the as-found baseline (image oc-pool, drive=false).

Surprise (unrelated to #110, flagged): a self-referencing cursor arc (fetch_page → fetch_page) stalled the worker-driven drive in RUNNING after one iteration — restructuring to the proven alternating two-step shape fixed it. Arc evaluation lives in noetl-orchestrate-core (untouched by this PR), so it's a possibly-real pre-existing limitation of self-loop arcs, not a regression.

Pointers. ai-meta → server f3043c9; wiki noetl-server-wiki@be76279 (deployment-specification env-var catalogue trimmed). Handoff thread handoffs/active/2026-06-18-orchestrate-plugin-dissolution/. Closes #110.


2026-06-18 — 🎯 #108 (c): worker-driven orchestrator drive is now the DEFAULT; #108 CLOSED

Headline. Flipped NOETL_ORCHESTRATE_PLUGIN_DRIVE to default true (server#233, v3.28.0 → server@80cc0e6) — the orchestrate-on-the-system-pool drive (the dissolution path) is on by default instead of opt-in. This is the last item of #108; the issue is closed and the orchestrator-as-plug-in step of the dissolution program (#107) is complete.

Why it was safe now. The flip was deferred as a production-policy decision until the drive was proven off-server, burst-free, and pool-isolated at scale — all of which shipped (slices 1–3 #229, zero-event-burst #230/#231, system-pool isolation #232 + worker#114 + ops#191).

Validation (kind, staged). Images built from the released tips (server v3.27.0 / worker v5.33.0).

  • Pre-flip scale soak, drive ON via env (code default still false): a single cursor+fan-out test_pft_flow_v2 (3 fac × 40 pat, page_size 1) COMPLETED in 511s with 694 drives (dispatched=applied, decode_error=0); system pool +694 (= the drives), shared pool +671 (real steps only) → full isolation; __orchestrate__ rows in noetl.event = 0 (event_suppressed +2082); 0 errors; 23 distinct workflow steps. 5× concurrent self-contained cursor = 5/5 COMPLETED, all drives system-isolated, 0 burst. (A 3× concurrent PFT run saw 2 fixture-level DDL deadlocks on shared pft_test_* tables — a fixture artifact, not a drive defect; the drive stayed correct through them.)
  • Post-flip, drive via the new code default (no env var): PFT 2×30 COMPLETED, 361 drives, system +361 / shared +349, 0 burst — identical shape to explicit-on. Drive metrics started at 0 on the fresh pod and climbed to 361, confirming the drive fired purely from the default.
  • Regression: 15/15 normal fixtures green (python/http/postgres/duckdb/loops/fanout/sub-playbooks/cursor+offset pagination/control-flow/vars/args).
  • Revert verified: NOETL_ORCHESTRATE_PLUGIN_DRIVE=false on the flipped image → simple_python COMPLETED with system delta 0, dispatched 0 → clean fallback to the in-process drive (trigger_orchestrator_inner, kept), no rebuild.

Revert. NOETL_ORCHESTRATE_PLUGIN_DRIVE=false on the deployment — per-deployment, immediate, no rebuild. Or revert server#233.
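
A minimal sketch of a default-true flag with a per-deployment off switch, as the revert above relies on. The function name and the set of accepted "off" spellings are assumptions for illustration; the log only confirms that `=false` reverts and that an unset variable now means on.

```rust
/// Hypothetical helper: resolves the drive flag from the raw env value.
/// Unset means "on" (the post-flip default). The accepted "off" spellings
/// below are an assumption, not taken from the server source.
fn plugin_drive_enabled(raw: Option<&str>) -> bool {
    match raw.map(|v| v.trim().to_ascii_lowercase()) {
        None => true,
        Some(v) => !matches!(v.as_str(), "false" | "0" | "off" | "no"),
    }
}

fn main() {
    let raw = std::env::var("NOETL_ORCHESTRATE_PLUGIN_DRIVE").ok();
    println!("drive enabled: {}", plugin_drive_enabled(raw.as_deref()));
}
```

Reading the value once at startup keeps the revert a plain deployment change with no rebuild.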

Pointers. ai-meta → server 80cc0e6 (v3.28.0) + worker 437b0be (v5.33.0, release hygiene) + server-wiki 0210012 (deployment-spec default + revert). Board #3: #108 → Done.


2026-06-18 — ops hardening for (b): system-pool consumer name matches the KEDA scaler (#108)

Headline. Closed a committed-config defect complementary to follow-up (b)'s server/worker execution_pool decline: the system-pool deployment bound NATS consumer noetl_worker_system_rust, but its KEDA scaler reads noetl_worker_pool_system and the other pools follow the noetl_worker_pool_<segment> convention with deployment-name == scaler-name. The system pool was the lone exception — the scaler watched a consumer the worker never created (no backlog scaling), and the live kind state had drifted to a hand-applied noetl_worker_system_kindtest present in no committed manifest (the very broad-filter drift the server-side fix guards against). ops#191 → ops@4816af0.

Fix. Aligned the dev system-pool deployment's NATS_CONSUMER to noetl_worker_pool_system. Single-stream consumer-filter affinity model unchanged; the filter_subject (noetl.commands.system.>) isolates the consumer and the shared pool's consumer (noetl.commands.shared.>) cannot see system subjects.
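
Why the two filters cannot overlap can be sketched with NATS-style subject matching, where `*` matches exactly one token and `>` matches one or more trailing tokens. This is a standalone illustration, not the NATS server's matcher; `subject_matches` is a hypothetical name.

```rust
/// Returns true when `subject` matches the NATS-style `filter`.
/// `*` matches exactly one token; `>` (last token only) matches one or
/// more remaining tokens.
fn subject_matches(filter: &str, subject: &str) -> bool {
    let mut f = filter.split('.');
    let mut s = subject.split('.');
    loop {
        match (f.next(), s.next()) {
            (Some(">"), Some(_)) => return true, // tail wildcard needs >= 1 token
            (Some("*"), Some(_)) => continue,
            (Some(a), Some(b)) if a == b => continue,
            (None, None) => return true,
            _ => return false,
        }
    }
}

fn main() {
    let subject = "noetl.commands.system.42";
    println!("system filter: {}", subject_matches("noetl.commands.system.>", subject));
    println!("shared filter: {}", subject_matches("noetl.commands.shared.>", subject));
}
```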

Kind-validated (drive on, cursor+fan-out test_pft_flow_v2, 1 facility × 3 patients): the noetl_worker_pool_system consumer claimed all 44 orchestrate drives (Δ+44 = noetl_orchestrate_drive_total{dispatched=44, applied=44}, skipped_in_flight=3); the shared consumer received only the 42 real postgres/http step commands; the orphaned kindtest consumer received 0. Execution COMPLETED across 23 distinct workflow steps; __orchestrate__ rows in noetl.event = 0. Validated on the :oc-pool worker image (pre-#114), so this proves the consumer-filter affinity independent of the #114 decline layer. Cluster restored to clean stable state (drive off; orphaned kindtest consumer removed). Prod cluster untouched.

Pointer bump. ai-meta repos/ops → ops@4816af0. #108 stays open for (c) the deliberate default-flip (production-policy; not flipped).


2026-06-18 — orchestrate drive isolated on the SYSTEM pool via pool affinity (#108 follow-up b)

Headline. The worker-driven orchestrate drive now runs on the dedicated system pool, not the default pool — pool affinity that survives a JetStream consumer whose filter_subject drifted broad. server#232 (server@846166b) + worker#114 (worker@e2162b7). Kind-validated.

Root cause (corrected). There is NO worker HTTP pending-poll — a worker claims only what its NATS consumer delivers. The system-routed orchestrate command was claimed by the default pool because that pool's durable consumer's filter had drifted broad (durable consumers keep their creation-time filter) and, with no pool-affinity check, it won the claim race.

Fix. Server (publish_command_notification) stamps the resolved pool segment on the notification as execution_pool ("update context for dispatch"); worker gains CommandNotification.execution_pool + segment_from_filter(NATS_FILTER_SUBJECT) and NatsCommandSource::next declines (ACK + skip) a notification whose execution_pool differs from its own segment — the correct pool's independent delivery then claims it. Enforced only when both name a concrete segment (backward compatible).
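
The decline rule above can be sketched as follows. The names `segment_from_filter` and `execution_pool` come from the entry, but the signatures, the `noetl.commands.` prefix handling, and the treatment of wildcard-only filters as "no concrete segment" are assumptions.

```rust
/// Extracts the pool segment from a filter subject such as
/// `noetl.commands.system.>`. A broad filter (`noetl.commands.>`) names no
/// concrete segment and yields None (assumed behaviour).
fn segment_from_filter(filter_subject: &str) -> Option<&str> {
    let rest = filter_subject.strip_prefix("noetl.commands.")?;
    match rest.split('.').next()? {
        "" | "*" | ">" => None,
        segment => Some(segment),
    }
}

/// Decline (ACK + skip) only when both sides name a concrete segment and
/// they differ; anything else is accepted for backward compatibility.
fn should_decline(execution_pool: Option<&str>, own_segment: Option<&str>) -> bool {
    match (execution_pool, own_segment) {
        (Some(wanted), Some(own)) => wanted != own,
        _ => false,
    }
}

fn main() {
    let own = segment_from_filter("noetl.commands.shared.>");
    println!("decline system-routed command: {}", should_decline(Some("system"), own));
}
```

Under these assumptions a worker whose filter drifted broad declines nothing, which matches the "enforced only when both name a concrete segment" rule rather than closing that gap.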

Validation (kind, server + both worker pools on new images, drive on): test/simple_python → COMPLETED; __orchestrate__ commands pulled + claimed + executed on the system pool (3), zero on the default pool. 553 server + 196 worker tests green.

Remaining for #108: only (c) — the deliberate default-flip (staged production-policy). The worker-driven drive is now functionally complete + scale-hardened (off-server, cursor/fan-out, zero noetl.event burst, system-pool isolated).

Aside (read-only): prod is already fully on the Rust stack (the #49 flip happened ~4 days ago; noetl Service → noetl-server-rust, both secrets present), on an image predating this #108 work; the ops#178 cutover runbook is stale as a "prep" doc.

Pointers. server#232 → server@846166b; worker#114 → worker@e2162b7.


2026-06-18 — the orchestrate meta-command touches noetl.event ZERO times (#108 follow-up a)

Headline. Closing the directive that system pool playbooks keep only their own state, not workflow events: the worker-driven __orchestrate__ meta-command now writes 0 rows to noetl.event (down from 5/drive). server#231 → server@9438f3b. Kind-validated.

What. dispatch_orchestrate_command stops writing command.issued to noetl.event — the command's record lives only in noetl.command (its own state, fatal-on-error as the sole delivery row). claim_command/get_command read noetl.event first (a pri ordering keeps it authoritative, so normal commands' claim path is byte-for-byte unchanged) and fall back to noetl.command only on a miss (the event-free meta-command). Combined with slice 4b's lifecycle suppression, __orchestrate__ writes 0 of its former 5 rows.
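
The event-first lookup with a command-table fallback can be sketched with two in-memory maps standing in for `noetl.event` and `noetl.command`. The types and the `lookup_command` name are hypothetical; the real claim path runs SQL.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq)]
enum Source {
    Event,   // stand-in for a row found in noetl.event
    Command, // stand-in for the noetl.command fallback
}

/// Event table first (so normal commands resolve exactly as before),
/// command table only on a miss (the event-free meta-command).
fn lookup_command<'a>(
    events: &'a HashMap<u64, String>,
    commands: &'a HashMap<u64, String>,
    command_id: u64,
) -> Option<(Source, &'a str)> {
    events
        .get(&command_id)
        .map(|p| (Source::Event, p.as_str()))
        .or_else(|| commands.get(&command_id).map(|p| (Source::Command, p.as_str())))
}

fn main() {
    let events = HashMap::from([(1, "real step".to_string())]);
    let commands = HashMap::from([
        (1, "real step".to_string()),
        (2, "__orchestrate__".to_string()),
    ]);
    println!("{:?}", lookup_command(&events, &commands, 1));
    println!("{:?}", lookup_command(&events, &commands, 2));
}
```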

Validation (kind, drive on, small cursor+fan-out): __orchestrate__ rows in noetl.event = 0; COMPLETED via the noetl.command fallback; 20 real workflow steps drove normally (shared claim path unaffected); 0 errors. At 10×1000 this removes thousands of infrastructure rows from the burst.

Remaining for #108: (b) NATS stream/consumer affinity so the system pool actually claims the drive (ops); (c) the deliberate default-flip (staged).


2026-06-17 — system playbook events no longer burst Postgres + system-pool routing (#108 slice 4b)

Headline. Per the directive that system pool playbooks are part of the NoETL ecosystem and must not flood the workflow event log: the __orchestrate__ meta-command's lifecycle events are no longer persisted to noetl.event. server#230 → server@6aef3a6. Kind-validated.

Postgres burst fix. The meta-command is infrastructure, not a workflow step — yet each drive wrote 5 rows (issued/claimed/started/call.done/completed). At scale that bursts noetl.event + Postgres for no benefit (the drive state is a pure function of the real step events; the result is applied from the in-memory call.done payload). The server now skips persisting them: handle_event_inner for the worker-emitted ones, claim_command for command.claimed. Validated: __orchestrate__ now writes ONLY the lone command.issued delivery row (1 of 5 — 80% fewer rows); the small cursor+fan-out test_pft_flow_v2 still COMPLETED.
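
The slice-4b persistence rule (keep only the `command.issued` delivery row for the meta-command) reduces to a small predicate. This is a sketch: `should_persist` is a hypothetical name, and `command.started` is an assumed spelling of the "started" lifecycle event.

```rust
const ORCHESTRATE_STEP: &str = "__orchestrate__";

/// Real workflow steps persist every event; the meta-command persists only
/// its delivery row.
fn should_persist(step: &str, event_type: &str) -> bool {
    step != ORCHESTRATE_STEP || event_type == "command.issued"
}

fn main() {
    // The five lifecycle events one drive used to write.
    let lifecycle = [
        "command.issued",
        "command.claimed",
        "command.started", // assumed event name
        "call.done",
        "command.completed",
    ];
    let kept = lifecycle
        .iter()
        .filter(|e| should_persist(ORCHESTRATE_STEP, e))
        .count();
    println!("rows kept for the meta-command: {kept} of {}", lifecycle.len());
}
```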

System-pool routing. dispatch_orchestrate_command routes the drive to the system segment (noetl.commands.system.<eid>). Honest scope: the server publishes there, but on kind the default pool still claims it via the HTTP pending-poll — true isolation needs a NATS stream/consumer-affinity fix in ops. The drive stays functional (resilient via the poll).

Follow-ups: (a) eliminate the last command.issued via a noetl.event-free claim path (claim/get reading noetl.command) → zero event rows for the meta-command; (b) NATS affinity so the system pool actually claims the drive.

Pointers. server#230 → server@6aef3a6.


2026-06-17 — worker-driven drive validated on a cursor + fan-out flow (#108 slice 4a)

Headline. The worker-driven drive (#108) handles the complex orchestrator surface, not just the linear case. Ran test_pft_flow_v2 under drive mode (small workload: 1 facility × 3 patients, page_size 1) on kind → COMPLETED, 0 errors. No code change — the slice-3 drive already handles cursors.

What it exercised. 20 distinct workflow steps; cursor fan-out (fetch_* each issued 5 commands across pages); a multi-facility loop (load_next_facility ran twice); 43 __orchestrate__ drive round-trips on the worker (dispatched=43, applied=43), the in-flight guard firing once (skipped_in_flight=1 — concurrent triggers serialized); __orchestrate__ did NOT leak as a workflow step across all 43 round-trips (state guard holds under cursors); playbook.completed emitted.

Remaining (deliberate, not rushed). System-pool routing (the orchestrate command currently co-locates on the execution's segment); and the default-flip (NOETL_ORCHESTRATE_PLUGIN_DRIVE default-true + retire trigger_orchestrator_inner) — a production-policy change recommended via staged rollout (opt-in → one deployment → soak at true scale → then default), not a unilateral flip. The dissolution's core (drive on the pool) is proven for linear AND cursor/fan-out workloads.


2026-06-17 — 🎯 the orchestrator drive runs OFF-SERVER on the worker pool (#108 slice 3, kind-validated)

Headline. The dissolution milestone: with NOETL_ORCHESTRATE_PLUGIN_DRIVE=on the orchestrator drive runs on the worker pool, not in the server. Kind-validated end to end — a real execution drives start→end→completed entirely through the worker round-trip. server#229 → server@465cdbb (v3.23.0).

How. (1) Scheduler (trigger_orchestrator_inner): in drive mode the server issues one system/orchestrate command (entry: run_state, args = the bounded WorkflowState + playbook + trigger) to the worker pool via dispatch_orchestrate_command, with no in-process evaluate; an orchestrate_in_flight cache flag serialises drives per execution. (2) Worker runs the plug-in (run_state, worker#113) and returns the OrchestrationResult (base64 output_b64 on the call.done event). (3) Apply-on-callback (handle_event_inner): a call.done for __orchestrate__ → apply_worker_orchestration → decode → apply_orchestration_result (slice 2) emits events + issues the real commands; their completions re-trigger the drive, the meta-command's own events never do (loop-safe). (4) State guard (apply_event): __orchestrate__ events are ignored so the meta-command never phantom-creates a workflow step.
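
Two of the mechanisms above, the per-execution in-flight flag (1) and the state guard (4), can be sketched in isolation. The types here are hypothetical stand-ins: the real flag lives in a cache and the real `apply_event` folds full events into `WorkflowState`.

```rust
use std::collections::{BTreeMap, HashSet};

const ORCHESTRATE_STEP: &str = "__orchestrate__";

/// Stand-in for the orchestrate_in_flight flag: at most one drive per execution.
#[derive(Default)]
struct InFlight {
    executions: HashSet<u64>,
}

impl InFlight {
    /// Returns false when a drive is already running for this execution.
    fn try_begin(&mut self, execution_id: u64) -> bool {
        self.executions.insert(execution_id)
    }
    fn finish(&mut self, execution_id: u64) {
        self.executions.remove(&execution_id);
    }
}

/// State guard: events of the reserved step never create a workflow step.
fn apply_event(steps: &mut BTreeMap<String, u32>, node_name: &str) {
    if node_name == ORCHESTRATE_STEP {
        return;
    }
    *steps.entry(node_name.to_string()).or_insert(0) += 1;
}

fn main() {
    let mut guard = InFlight::default();
    println!("first trigger dispatches: {}", guard.try_begin(7));
    println!("concurrent trigger dispatches: {}", guard.try_begin(7));
    guard.finish(7);

    let mut steps = BTreeMap::new();
    for node in ["start", ORCHESTRATE_STEP, "end"] {
        apply_event(&mut steps, node);
    }
    println!("steps: {:?}", steps.keys().collect::<Vec<_>>());
}
```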

Validation (server oc-drive2 + worker oc-drive, drive on): test/simple_python → COMPLETED; noetl_orchestrate_drive_total{dispatched=2, applied=2}, zero decode_error, zero skipped; 2 __orchestrate__ round-trips, real steps start+end ran on the worker, playbook.completed emitted; __orchestrate__ did NOT leak as a workflow step (steps entered = ['end']). A first-pass bug was caught + fixed: output_b64 rides call.done, not command.completed (lifecycle-only).

Safety. Default off: the in-process drive is the untouched fallback. core 124 + server 553 tests green; clippy clean both build configs.

Worker-driven arc complete through the drive: slice 1 (worker entry/run_state) → slice 2 (apply_orchestration_result extracted) → slice 3 (drive on the pool, validated). Remaining: slice 4 — shadow→flip at scale (PFT under drive), make drive the default, retire trigger_orchestrator_inner; route the orchestrate command to the dedicated system pool.

Pointers. server#229 → server@465cdbb (v3.23.0). Server wiki: deployment-specification env-var NOETL_ORCHESTRATE_PLUGIN_DRIVE.


2026-06-17 — worker-driven cutover slice 2: apply_orchestration_result extracted + slice 3 designed (#108)

Headline. Slice 2 of the worker-driven cutover (#108): the post-evaluate emission logic (emit pure events → issue commands → terminal event) is extracted verbatim from trigger_orchestrator_inner into a reusable apply_orchestration_result(...), so the worker-driven drive can apply an OrchestrationResult computed on a worker the same way the in-process drive applies its own. server#228 → server@586aeae. Behavior-preserving (553 tests green, clippy clean both configs); internal refactor.

Slice 3 designed (grounded). Checked the load-bearing risk: apply_event (state.rs:408) does steps.entry(name).or_insert_with(StepInfo::new) for every node_name on command.issued/command.completed — so issuing the orchestrate "meta" command as a normal command would create a phantom step and pollute the drive state. The design: (1) flag NOETL_ORCHESTRATE_PLUGIN_DRIVE (default off, in-process fallback); (2) scheduler issues one step: __orchestrate__ wasm command (system/orchestrate, entry: run_state, args: OrchestrateStateInput) to the system pool instead of evaluating in-process; (3) an apply_event/from_events guard ignores command.* events for the reserved __orchestrate__ step so it never pollutes state; (4) on the __orchestrate__ completion the server decodes the OrchestrationResult (data.output_b64) and calls apply_orchestration_result — not a re-dispatch; (5) the loop: real-step completion → dispatch orchestrate → apply → real commands → … (the meta-command's own lifecycle never re-triggers). State coherence holds (drive state is a pure function of the real-step events). Lands behind the flag, kind-validated by driving a real execution through the round-trip before the shadow→flip.

Pointers. server#228 → server@586aeae (internal refactor, stays version 3.22.0). Design: #108 comment.


2026-06-17 — worker-driven cutover slice 1: configurable wasm guest entry (#108)

Headline. First worker-side slice of the worker-driven orchestrator cutover (#108) — moving the drive off the server onto the worker pool. A tool: {kind: wasm, plugin: {path, version, entry}} command can now name the guest export to invoke; the worker-driven orchestrator will dispatch system/orchestrate with entry: "run_state". worker#113 → worker@04420d0.

What. WasmPluginHost::invoke_bytes_with_entry + WasmDispatcher::run_by_ref_entry/run_and_apply_by_ref_entry (the originals delegate with "run", so existing call sites + tests are untouched); wasm_config_to_ref parses optional plugin.entry (default run). Same data-plane ABI, only the export name differs. Test proves run→0xAA vs run_state→0xBB dispatch + a missing-export error. 194 worker tests green; new code clippy-clean. Purely additive — no live behavior change (nothing issues a run_state command yet).
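
The delegation pattern (optional `plugin.entry`, defaulting to `run`) can be sketched without a wasm runtime. The guest here is a stand-in match table mirroring the test's 0xAA/0xBB markers; `resolve_entry` and `invoke` are hypothetical names, and treating an empty entry as unset is an assumption.

```rust
/// Falls back to the original export name when no entry is configured.
fn resolve_entry(plugin_entry: Option<&str>) -> &str {
    match plugin_entry {
        Some(entry) if !entry.is_empty() => entry,
        _ => "run",
    }
}

/// Stand-in guest: each export returns a distinct marker byte, and an
/// unknown export is an error rather than a silent fallback.
fn invoke(entry: &str) -> Result<u8, String> {
    match entry {
        "run" => Ok(0xAA),
        "run_state" => Ok(0xBB),
        other => Err(format!("missing export: {other}")),
    }
}

fn main() {
    for configured in [None, Some("run_state"), Some("nope")] {
        let entry = resolve_entry(configured);
        println!("{configured:?} -> {entry} -> {:?}", invoke(entry));
    }
}
```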

Remaining (server side, the hot path). (2) Scheduler — on a trigger, issue an orchestrate command (entry: run_state, input = OrchestrateStateInput) to the system worker pool instead of driving in-process; (3) Apply — on that command's completion, decode the OrchestrationResult and emit (extracted from trigger_orchestrator_inner) + loop-prevention; (4) shadow-then-flip cutover, then retire trigger_orchestrator_inner. All behind a default-off flag with the in-process drive as the fallback. The per-drive server→worker→server round-trip is the real architectural shift (drive CPU distributes across the pool).

Pointers. worker#113 → worker@04420d0 (stays v5.31.2). Worker wasm-plugin host has no wiki page (pre-existing #105 gap) — flagged for a Rule-1 backfill.


2026-06-17 — orchestrate plug-in drives the real workload identically, live (#108 slice 4)

Headline. Slice 4 of the plug-in round (#108): the orchestrator now runs the system/orchestrate plug-in alongside the in-process drive on every evaluation and diffs the emitted commands — the cutover-confidence gate. The in-process result stays authoritative (observation only) → zero risk to the live platform. server#227 → server@bd652ab. Kind-validated over the live 10×1000 PFT: 529 evaluations, ZERO divergence.

What. The plug-in gains a state-input path (OrchestrateStateInput + run_state export) so the shadow hands it the same WorkflowState the in-process evaluate_state consumes (no event-slice/snapshot reconstruction to confound the diff). New src/orchestrate_shadow.rs (feature orchestrate-shadow, optional wasmtime): a process-global wasmtime host (fresh Store/Instance per call, mirrors the worker invoke ABI) loaded from noetl.plugin_module at boot; trigger_orchestrator_inner clones the pre-evaluate state + diffs after (command-set identity: parsed Value eq, per the slice-2 key-order finding). Metric noetl_orchestrate_shadow_total{result=match|mismatch|error}; config flag orchestrate_plugin_shadow; Dockerfile builds with the feature (runtime-gated by NOETL_ORCHESTRATE_PLUGIN_SHADOW, default off). Always-present no-op wrappers so the production build (default, no wasmtime) is unaffected.

Validation (kind, image oc-shadow, shadow on). noetl_orchestrate_shadow_total{result="match"} 529 is the only label; zero mismatch, zero error, 0 divergence log lines; server + workers stable (r=0); boot log: orchestrate shadow host loaded bytes=1603665. Both build configs green; 553 server + 3 plug-in tests; clippy clean.

The arc so far. Slices 1-4 prove orchestrator-as-plug-in end to end: the plug-in exists (0-import wasm, slice 1) → runs identically in wasmtime (slice 2) → is deployed + servable (seed-on-boot, slice 3) → drives the real workload identically, live (shadow, slice 4).

Next. The worker-driven cutover — the kernel scheduler dispatches system/orchestrate on a worker (fetched from the registry), publishes the emitted commands, and retires trigger_orchestrator_inner; the in-server shadow + the wasmtime server dep come back out once the server no longer drives.

Pointers. server#227 → server@bd652ab (Cargo.toml now 3.22.0, no tag cut). Server wiki: deployment-specification env-var NOETL_ORCHESTRATE_PLUGIN_SHADOW (noetl-server-wiki@50f9965).


2026-06-17 — system/orchestrate@1 registered + servable in a deployed server (#108 slice 3)

Headline. Slice 3 of the plug-in round (#108): the server now bakes the orchestrate wasm into its image and seeds built-in system plug-ins into noetl.plugin_module on boot, so the worker pool can fetch system/orchestrate@1 by (path, version) + digest without an out-of-band operator POST. server#226 → server@b21b589. Kind-validated.

What. New src/system_plugins.rs: scan_system_plugins(dir) (pure: glob *.wasm → read → sha256 digest → path system/<stem> → version 1; unit-tested, no DB) + seed_system_plugins(pool) (upserts each via plugin_module::upsert in-process, not through the token-gated /api/internal/plugins surface, which is for external registration). main.rs seeds after plugin_module::ensure_table, non-fatal. A wasmbuilder Dockerfile stage builds plugins/orchestrate to wasm32 and bakes orchestrate.wasm into /opt/noetl/plugins/; NOETL_SYSTEM_PLUGIN_DIR defaults there. .dockerignore now excludes nested **/target (the excluded plug-in crate's local target is 1.8 GB and broke the context COPY). Digest-keyed hot-reload: every boot re-seeds @1 with current bytes, so a new image hot-reloads the pool with no version bump.
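
The naming half of the scan (file → `system/<stem>`, version 1) can be sketched with the standard library alone. The sha256 digest step is left out because it needs a hashing crate; `PluginRef` and `plugin_ref_for` are hypothetical names, not the server's types.

```rust
use std::path::Path;

#[derive(Debug, PartialEq)]
struct PluginRef {
    path: String,
    version: u32,
}

/// Maps a baked file to its registry coordinates; non-wasm files are skipped.
fn plugin_ref_for(file: &Path) -> Option<PluginRef> {
    if file.extension()?.to_str()? != "wasm" {
        return None;
    }
    let stem = file.file_stem()?.to_str()?;
    Some(PluginRef {
        path: format!("system/{stem}"),
        version: 1, // built-in system plug-ins are always seeded as @1
    })
}

fn main() {
    for name in ["/opt/noetl/plugins/orchestrate.wasm", "/opt/noetl/plugins/README.md"] {
        println!("{name} -> {:?}", plugin_ref_for(Path::new(name)));
    }
}
```

Keeping this part pure (no filesystem reads, no DB) is what makes it unit-testable, as the entry notes for `scan_system_plugins`.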

Validation (kind, image oc-seed). Boot log: seeded system plug-in path=system/orchestrate version=1 digest=823dec… bytes=1559093 → count=1. GET /api/internal/plugins/system/orchestrate?version=1 → 200, application/wasm, 1559093 bytes, magic \0asm, ETag=digest; correct digest → 200, stale → 409. Baked-file sha256 == served digest (823dec…): byte-identical image → registry → API. 553 lib tests green (2 new); clippy clean.

Next. (4) kernel scheduler (NOETL_ORCHESTRATE_PLUGIN, default off) — dispatch the now-registered plug-in alongside the in-process orchestrator, live-shadow over the PFT, flip after green.

Pointers. server#226 → server@b21b589 (internal — stays v3.20.0). Server wiki: deployment-specification env-var NOETL_SYSTEM_PLUGIN_DIR (noetl-server-wiki@5cb58cd).


2026-06-17 — system/orchestrate plug-in runs identically to native in wasmtime (#108 slice 2)

Headline. Slice 2 of the plug-in round (#108): a wasmtime shadow-diff proves the system/orchestrate .wasm doesn't just compile — it executes identically inside the same wasmtime contract the worker uses. server#225 → server@ccec104.

What. tests/shadow_diff.rs loads the built plug-in through a harness mirroring the worker host's invoke_bytes ABI byte-for-byte (alloc → write → run(ptr,len) → unpack packed i64 → read; fresh Store/Instance per call; 0 imports → bare instance) and asserts the wasm output equals the native drive over two fixtures: auth0 multi-arc when: routing (exercises the minijinja template engine in wasm) and a cold-start linear flow. wasmtime dev-dep pinned to the worker's major (27) so "runs in the harness" ⟺ "runs in the worker host".
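
The "unpack packed i64 → read" step can be sketched against a plain byte buffer standing in for guest linear memory. The bit layout used here (pointer in the high 32 bits, length in the low 32) is an assumption for illustration; the entry does not state the actual packing.

```rust
/// ASSUMED layout: high 32 bits = pointer, low 32 bits = length.
fn pack(ptr: u32, len: u32) -> i64 {
    (((ptr as u64) << 32) | (len as u64)) as i64
}

fn unpack(packed: i64) -> (u32, u32) {
    let raw = packed as u64;
    ((raw >> 32) as u32, raw as u32)
}

/// Reads the guest's output slice, rejecting ranges outside memory.
fn read_output(memory: &[u8], packed: i64) -> Option<&[u8]> {
    let (ptr, len) = unpack(packed);
    let start = ptr as usize;
    let end = start.checked_add(len as usize)?;
    memory.get(start..end)
}

fn main() {
    let mut memory = vec![0u8; 16];
    memory[4..8].copy_from_slice(b"done");
    let packed = pack(4, 4);
    println!("unpacked: {:?}", unpack(packed));
    println!("output: {:?}", read_output(&memory, packed));
}
```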

Determinism finding (relevant for the slice-4 live shadow). The diff is command-set identity (parsed Value equality), not raw-byte identity. The drive builds a step context as a serde_json::Value map; with serde_json's preserve_order in the tree, object key order is insertion order, and that order traces to upstream HashMap iteration — which hashes differently on wasm32 vs the host arch. So wire bytes differ in key order while the value is identical. The correct bar: the kernel scheduler deserializes the plug-in output to Vec<Command> and persists it through the server's own encoder, so the plug-in's wire bytes are transient — what must match is the command set. A canonical sorted-key command encoding (if byte-identical persistence is ever wanted) is a separable follow-up, not required for correctness.
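
The distinction between byte identity and value identity can be shown with a toy encoder. This is a sketch using only the standard library: `encode` mimics an insertion-ordered object serializer and `canonical` mimics comparing parsed values; neither is the serde_json API.

```rust
use std::collections::BTreeMap;

/// Serializes pairs in the order given, like an insertion-ordered JSON object.
fn encode(pairs: &[(&str, &str)]) -> String {
    let body: Vec<String> = pairs
        .iter()
        .map(|(k, v)| format!("\"{k}\":\"{v}\""))
        .collect();
    format!("{{{}}}", body.join(","))
}

/// Order-insensitive view of the same object, used for the identity check.
fn canonical<'a>(pairs: &[(&'a str, &'a str)]) -> BTreeMap<&'a str, &'a str> {
    pairs.iter().copied().collect()
}

fn main() {
    // Same keys and values, different insertion order (as if two targets
    // iterated an upstream hash map differently).
    let native = [("step", "start"), ("attempt", "1")];
    let wasm = [("attempt", "1"), ("step", "start")];
    println!("bytes equal:  {}", encode(&native) == encode(&wasm));
    println!("values equal: {}", canonical(&native) == canonical(&wasm));
}
```

A diff on raw bytes would report a false mismatch here, which is why the shadow compares the parsed command set.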

Validation. 2 unit + shadow-diff (2 fixtures) green; clippy clean; test self-skips when the .wasm isn't pre-built (gate: cargo build --release --target wasm32-unknown-unknown && cargo test). Plug-in stays excluded — native server build/test unaffected. Test-only; no kind validation needed.

Next. (3) catalog register/serve; (4) kernel scheduler (NOETL_ORCHESTRATE_PLUGIN, default off) replacing trigger_orchestrator_inner, live-shadowed over the PFT before the flip.

Pointers. server#225 → server@ccec104 (internal — stays v3.20.0).


2026-06-17 — system/orchestrate WASM plug-in exists: drive core runs as a 0-import module (#108)

Headline. First slice of the plug-in round (#108, step 2 of the OS program #107): a new standalone plugins/orchestrate/ crate (in the server repo, next to the core, depending on noetl-orchestrate-core by path) wraps the drive behind the worker plug-in ABI and compiles to wasm32-unknown-unknown: the first non-trivial compiled system playbook. server#224 → server@10a629b.

ABI. Input bytes = JSON OrchestrateInput { events, playbook, trigger_event_type } (the bounded event slice + the catalog playbook — the same read-set trigger_orchestrator loads); output bytes = JSON OrchestrationResult (commands + completion + events_to_emit), or {"error": "..."} on a drive failure. Data-plane = the host's memory + alloc(size)->ptr + run(ptr,len)->packed contract (same as the reference-materializer).

The feasibility risk #108 flagged is retired. #108's "Hard parts" asked whether the template/evaluator compiles to wasm32 or needs a host render callback. The compiled .wasm answers it: zero imports — no WASI, no render — so the whole drive (condition evaluator and minijinja template engine included) runs in-guest. That's the payoff of #109 keeping the entire core wasm-resident (the "keep core compiled to wasm32" call). Exports: exactly memory / alloc / run. Artifact 1.54 MB.

Validation. Native parity test — the shadow-diff in miniature — orchestrate(json_bytes) reproduces native evaluate byte-for-byte (commands + events_to_emit) on the auth0 multi-arc when: fixture; malformed input → error envelope, no panic. wasm32 build green; import/export sections verified programmatically. Server workspace unaffected: the crate is excluded from the workspace so the native build/test/clippy never pull it in — 551 lib tests green, clippy clean both the workspace and the plug-in crate. No kind validation needed: nothing loads the plug-in yet, so the server binary's runtime is byte-identical.

Next. (2) worker host shadow-diff — load the .wasm, feed it the same (events, playbook) the in-process orchestrator gets on the PFT, assert byte-identical commands; (3) catalog register/serve; (4) kernel scheduler (NOETL_ORCHESTRATE_PLUGIN, default off) replacing trigger_orchestrator_inner. Design note: commands return over the byte path, so the command_emit capability the original scope called for is optional (evaluate is referentially transparent — it returns the full Vec<Command>).

Pointers. server#224 → server@10a629b (internal — server stays v3.20.0, no release tag).


2026-06-17 — Orchestrator drive core fully wasm-resident: Event-ABI round closed (#109)

Headline. The last slice of the orchestrator-as-plug-in extraction (#108) landed: orchestrator.rs (WorkflowOrchestrator, evaluate, evaluate_state) moved from src/engine/ into the pure noetl-orchestrate-core crate (server#223). With it, every drive module (renderer, playbook model, commands, evaluator, state, and now the orchestrator Event-type switch) compiles to both native (linked into noetl-server, re-exported through src/engine/ so call sites are unchanged) and wasm32-unknown-unknown (the seed for the future system/orchestrate WASM plug-in). Event-ABI round #109 CLOSED.

What moved. evaluate(events: &[core::event::Event], …) now reads the pure core::event::Event read-set defined in slice 1; the server converts its db::Event (sqlx::FromRow, native-only) at the trigger_orchestrator boundary via the existing From impl (4 conversion sites in handlers/events.rs), so there is no production drive change: the drive already rebuilds state from converted events. Test fixtures ported to the core shape: Utc::now() → fixed DateTime::from_timestamp(0, 0) (the core has no clock under wasm), the db-serial .id ordering field → .event_id (the drive's real ordering key), playbook::types:: → playbook::. From<CoreError> for AppError moved ahead of the test module (clippy items-after-test-module); dropped a useless vec!.

Validation. 122 core tests green (native), 0 WASI imports on wasm32-unknown-unknown; 565 server tests green; clippy clean on both targets. cargo-chef Docker image built (noetl-server v3.20.0, 44.8 MB), loaded into kind, server rolled out (container noetl-server, only the new image running). kind e2e, PFT 10×1000 (test_pft_flow_v2, 10 facilities × 1000 patients, page_size 1): full command lifecycle flowing (issued→claimed→started→call.done→completed, 400 of each in the first event page), 0 errors, 0 restarts across the server + all 8 worker pods, live forward progress confirmed (event timestamps advancing in real time). The drive behaves identically on the relocated core.

Pointers. server#223 → server@bfd3f77 (internal refactor — stays v3.20.0, no release tag). Slice 4 (data-plane ABI + command_emit capability + kernel scheduler + shadow-diff → live system/orchestrate plug-in) is the follow-on round, tracked under #108. Standing direction honored: Claude wrote the Rust directly, no Codex.


2026-06-17 — North-star named: NoETL is a distributed multitenant OS (#107)

Headline. Captured the architecture thesis the in-flight umbrellas converge on. New blueprint noetl_server_dissolution_and_global_grid.md (docs#183 + docs#184): the server dissolves into a stateless edge; the NATS JetStream WAL + object store are the only durable state (no Postgres source of truth); all processing is event-driven system/data playbooks on a sharded global grid. Named plainly, NoETL is a distributed multitenant operating system — process = ephemeral atomic block, scheduler = JetStream-lag pump, syscall = the WASM capability ring, drivers = the tool registry, VFS = the locator namespace, journaling = the WAL, isolation = shard-key + sandbox + capability + keychain — and the foundation for a quantum-cloud-hybrid platform (QPU as a tool driver, a circuit as an atomic block per no-cloning, hybrid as a cursor/loop, queue latency as the callback rule; positioning, not a roadmap).

Program opened. #107 is the strategic roof over #101–#105 with the 5-step path (CQRS cutover → orchestrator-as-plug-in → per-shard WAL → drop Postgres → cross-shard federation). Step 2 scoped in #108 — extract the already-pure evaluate/evaluate_state drive core into a system/orchestrate WASM plug-in + a kernel scheduler replacing trigger_orchestrator, shadow-diffed before any flip. Step 1 (CQRS cutover) is unblocked — its shadow gate went green this session via the #106 fix.

#108 started + first slice landed. Directive: the orchestrator core stays compiled to wasm32 (no host-side template carve-out). Spike proved minijinja + serde_json compile to wasm32-unknown-unknown with no WASI imports. Then stood up noetl-orchestrate-core (server#218) — the runtime-free drive core compiling from one source to both the server (native) and a wasm32 plug-in. First slice = the template renderer (the foundation the evaluator/state/commands/orchestrator build on); the only server coupling (AppError) became the crate's CoreError, mapped at the boundary. minijinja-for-wasm recipe: default features minus loader + serde. 21 core + 651 server tests green; cargo-chef Docker build works with the new workspace; kind-validated (wasm_smoke playbook.completed, 0 template errors). Pure refactor.

Continued through the slices: playbook model (server#219, with the Residency enum untangled), then commands + evaluator (server#220). All four of evaluate's dependencies — renderer, type model, commands, evaluator — now compile to native + wasm32 from one source. E2E checkpoint PASSED on kind: the integrated server (4 of 6 modules extracted) drives test_pft_flow_v2 correctly through the extracted core — 356 commands issued, cursor fan-out, ctx.updated firing, 0 errors, 0 restarts; behavior-preserving. Each slice: 57 core + 615 server tests, wasm clean, cargo-chef Docker build green. Pointer f051779.

Event-ABI boundary designed + opened, slice 1 landed. Weighed minimal-vs-canonical event (the db-row→core-event conversion is unavoidable either way since db::Event is sqlx::FromRow, so unifying buys nothing on the boundary + pre-empts the #104 WAL design) → minimal core::event::Event, named to converge with EventEnvelope. Design note orchestrate_core_event_abi.md (docs#185); round issue #109. Slice 1 (server#221): the pure event type + From<&db::Event> + chrono (no clock) → compiles to wasm32; determinism audit favorable (the drive's only real-output now() is a generated-event timestamp the shadow-diff normalizes; 6 of 8 sites are tests). Pointer 516ef17. Slice 2 landed + e2e-validated (server#222): WorkflowState (state.rs) moved into the core on core::event::Event; the server converts db::Event at the drive boundary (orchestrator's evaluate + 4 events.rs sites). wasm portability surfaced + handled — Instant gated #[cfg(not(wasm32))], tracing + catalog_id (serde-default) added. E2E: PFT drives through the extracted state — 98 commands, ctx.updated firing, 0 errors. 587 server + 86 core tests; pointer 8be15be. 5 of 6 modules now in the core (template, model, commands, evaluator, state). Last slice: move orchestrator/evaluate (the orchestrator Event-type switch + its fixture churn the boundary deferred) → then the whole drive core is wasm-resident and the plug-in round begins.

Pointers. docs dd391f3; program #107 + step-2 #108 on the roadmap boards.


2026-06-17 — Orchestrator-scaling merge train landed + final-e2e validated (#101/#102/#103)

Headline. Worked the queued PR backlog across the orchestrator-scaling + event-WAL line, merging the clean/validated work and validating the integrated result on kind. #105 closed (runtime complete).

Landed on main (each rebased onto a moved main, built + tested green before merge):

  • CQRS #103 phase 2b/2d — projector owns projection_snapshot + noetl_materializer consumer + shared normalize_event_to_row / POST /api/internal/events/materialize (server#215, rebased; the stacked #204/#205/#206 diverged after #202's 2a squash + main's wasm advance — superseded). All default-off. Build hotfix server#216 (the rebase left a rebuild_state call at the old 3-arg signature; v3.15.0 shipped non-compiling — fixed within minutes).
  • Batch event-log INSERT #102 — N inserts → one multi-row QueryBuilder (server#199, v3.15.1).
  • ctx/workload shim-dedup #101 — server stops persisting the shims (the ~5MB command.issued), worker rebuilds them at render (server#207 v3.15.3 + worker#90 v5.31.2). Unflagged → deployed + validated together.
  • Closed worker#89 (superseded — main already has inline_budget_bytes()).
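
The batch-INSERT item above (N inserts → one multi-row statement) can be sketched as plain string building. This is a standard-library stand-in, not sqlx's QueryBuilder, and the column names in the usage are hypothetical.

```rust
/// Builds one multi-row INSERT with numbered placeholders.
/// Returns None when there is nothing to insert.
fn multi_row_insert(table: &str, columns: &[&str], rows: usize) -> Option<String> {
    if rows == 0 || columns.is_empty() {
        return None;
    }
    let mut next = 1;
    let mut groups = Vec::with_capacity(rows);
    for _ in 0..rows {
        let placeholders: Vec<String> = columns
            .iter()
            .map(|_| {
                let p = format!("${next}");
                next += 1;
                p
            })
            .collect();
        groups.push(format!("({})", placeholders.join(", ")));
    }
    Some(format!(
        "INSERT INTO {table} ({}) VALUES {}",
        columns.join(", "),
        groups.join(", ")
    ))
}

fn main() {
    // Hypothetical column names; the real noetl.event schema is not shown in this log.
    let sql = multi_row_insert("noetl.event", &["execution_id", "event_type"], 2);
    println!("{sql:?}");
}
```

One statement means one round-trip per batch instead of one per row, which is where the saving comes from.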

Final e2e (kind). Built server v3.15.3 + worker v5.31.2 from main, deployed to the 8-pod Rust pool, ran test_pft_flow_v2 (distributed): 7300+ context-dependent commands dispatched + completed, ZERO errors, 0 worker/server restarts, no OOM; ctx.updated shows correctly-folded context (the shim-dedup'd worker rebuilds the shims faithfully). The growing cursor backlog is the known 10×1000 scale the umbrellas target, not a regression. Pointers bumped: ai-meta@d469e70.

refs-in-state phase 2 — already on main, now VALIDATED. The open phase-2 PRs (server#208 + worker#91) turned out to be stale duplicates of an old branch — the whole feature already landed on main (server hydrate_result_references(keep_refs) + cursor claim_ref resolution; worker build_extracted emit + resolve_context_references consume + resolve_ref). Both closed as superseded. What was never done — the flag flip — validated on kind (NOETL_REFS_IN_STATE=true, budget 2KB, test_pft_flow_v2): references kept in state (_ref noetl:// URI + arrow_ipc 50-row reference block), command context stays ~10KB (not the inline 50-row payload), cursor fans out (claim-ref resolved, no wrong-drain), 0 errors across 645 context-dependent completions. Flag stays default-off; enabling is a rollout decision, not a code gap. #101's two pillars (shim-dedup + references-in-state) are both landed + validated.

CQRS write-path — 2a producer validated live + 2d-3 cutover scoped. Set NOETL_EVENT_STREAM_ENABLED=true on kind: the tailer started, published a run's 7 events noetl.event → noetl_events stream, advanced its stream_cursor; noetl_projector + noetl_materializer consumers ensured. Reverted to default-off. The 2d-3 cutover (materializer as sole noetl.event writer) is an architecture change, not a flag-flip — the tailer reads committed rows, so dropping the synchronous INSERT requires producer→stream-direct publish (new worker code), a skip-synchronous gate (doesn't exist), and the system/event_materializer playbook deployed on the system pool. Scoped on #103; the durability-contract design note landed at docs/architecture/cqrs_write_path_cutover.md (docs#182) — pins the boundary move (synchronous INSERT → JetStream publish-ack), the producer fork (server-mediated Option A vs worker-direct Option B), and the staged shadow→producer-move→drop-tailer rollout. Shadow phase then stood up on kind (the rollout's first step): producer validated (tailer publishes, cursor advances), materializer deployed on the system pool + draining the consumer (20 msgs acked) — and it did its job, surfacing a real reproduction-blocker #106: the worker renders a large templated http body ({{ build_envelopes.events }}, 47 KB) truncated to ~1182 chars (orchestrator context holds the full 348 KB), so the materializer's /events/project POST fails to deserialize. Cutover producer-move stays gated behind #106. Also a sizing finding — the materializer drove the 512Mi server to OOMKilled under a large backlog.

refs-in-state phase 2 validated (see #101): the open PRs #208/#91 were stale dupes (already on main, closed superseded); the NOETL_REFS_IN_STATE flip proven on kind — refs kept in state, command ctx ~10KB not inline, cursor fans out, 0 errors over 645 completions.

Pointers. server ed691c7 (v3.15.3) + worker faa2b16 (v5.31.2); earlier server bea09db (CQRS).


2026-06-17 — WASM plug-in routing complete + validated LIVE end to end (#105)

Headline. The WASM system-plug-in capability (#105) is functionally complete and proven end to end on the kind cluster — a tool: {kind: wasm, plugin: {...}} playbook runs a compiled Rust→wasm plug-in on the worker's wasmtime host and its object_put lands in the object store.

Routing shipped this session: digest resolution at dispatch (worker#105) · the dispatch branch tool_kind: "wasm" → WasmDispatcher::run_and_apply_by_ref (worker#107) · wasm-plugin flipped on by default (worker#108, v5.31.0) · ToolKind::Wasm in the playbook schema (server#214, v3.13.0).

Live kind validation (the capstone): built + deployed the feature-on worker (v5.31.0, wasmtime carried) + server v3.13.0 (plug-in registry + object store). (1) Flip-safety — a full PFT run playbook.completed (217 events) on the host-carrying workers, 0 restarts. (2) Registered system/reference-materializer@1 via POST /api/internal/plugins (digest c6bd7d05…); GET returns it with the ETag. (3) A wasm playbook playbook.completed; the step routed (command.issued→claimed→started→call.done); the plug-in's object_put landed in noetl.object_store at noetl/results/reference/0/0/1.feather. The whole path — server accepts wasm tool → tool_kind: "wasm" → worker routes to host → digest from registry → load + run → flush to the object store — works against running pods.

Real-data flow closed (input → args): the object payload had been empty because the worker's wasm_config_to_ref read config.input while the server canonicalizes a step's input: to args. Fixed (worker#110, v5.31.1): read args first, input fallback. Re-validated on kind — a wasm playbook with input: {hello: world} now lands {"hello":"world"} (17 bytes) in noetl.object_store (was 0). The full WASM dispatch path now carries real data end to end. #105's runtime is complete; the only remaining scope is the optional playbook→WASM lowering + porting system/materialiser to the compiled path.

Hot-replace pillar proven LIVE (worker#112, test + kind validation): the host resolves the plug-in digest from the registry per dispatch, so republishing the same path@version with new bytes hot-reloads the pool with no restart. Proof on kind — registered @1=variant A → ran → object at .../1.feather; republished @1=variant B (new digest, different object key) → re-ran → object at .../HOTRELOAD-B.feather, all 8 worker restartCount still 0. The full #105 runtime — load + run + capability-flush + hot-replace — is now live-proven. The two remaining items (executor: step-level author sugar; porting the Python system/event_materializer to a compiled plug-in) are deferred: the materialiser port needs nats_drain/events_project capabilities + the Arrow-in-wasm decision, which belong to #104/#103, not #105.
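The hot-replace behaviour above follows from one rule: the host never pins a digest, it asks the registry on every dispatch and caches loaded modules by digest. A minimal sketch of that rule, assuming an in-memory registry; `PluginRegistry` and `Host` are hypothetical names, not the worker's wasmtime host or the server's registry API.

```python
import hashlib


class PluginRegistry:
    """Stand-in registry: 'path@version' maps to the digest of the current bytes."""

    def __init__(self):
        self._digest_by_ref = {}
        self._bytes_by_digest = {}

    def publish(self, ref, module_bytes):
        digest = hashlib.sha256(module_bytes).hexdigest()
        self._digest_by_ref[ref] = digest
        self._bytes_by_digest[digest] = module_bytes
        return digest

    def resolve(self, ref):
        return self._digest_by_ref[ref]

    def fetch(self, digest):
        return self._bytes_by_digest[digest]


class Host:
    """Caches loaded modules by digest; the ref is re-resolved on every dispatch."""

    def __init__(self, registry):
        self.registry = registry
        self.loaded = {}
        self.load_count = 0

    def dispatch(self, ref):
        digest = self.registry.resolve(ref)  # per dispatch, never pinned at startup
        if digest not in self.loaded:
            self.loaded[digest] = self.registry.fetch(digest)
            self.load_count += 1
        return self.loaded[digest]


registry = PluginRegistry()
host = Host(registry)
registry.publish("system/reference-materializer@1", b"variant A")
first = host.dispatch("system/reference-materializer@1")
registry.publish("system/reference-materializer@1", b"variant B")  # same ref, new bytes
second = host.dispatch("system/reference-materializer@1")
```

Because the cache key is the digest rather than the ref, republishing the same `path@version` swaps the module on the next dispatch without restarting the process.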

Pointers. worker df16b83/bump (v5.31.1, worker#110) · worker d6cd215 (v5.31.0) · server 2b21f28 (v3.13.0). (worker#112 hot-reload test — pointer bump pends merge.)


2026-06-16 — WASM plug-in capability shipped + pointers bumped (#105, #104)

Headline. The WASM plug-in host capability landed end to end across worker + server, plus the Resource Locator naming foundation; ai-meta pointers bumped to the merged releases.

Pointer bumps (this change set):

  • server → a162514 (v3.11.0, server#210) — plug-in module registry (noetl.plugin_module + POST/GET /api/internal/plugins/{*path}), the PluginSource backend.
  • worker → 2433492 (v5.23.0 → v5.24.0, worker#93 + worker#95) — wasmtime host (capability ring, hot-reload, Arrow byte data-plane ABI, materialiser capability ring) + HTTP PluginSource. Catalog-loading loop closed: server registry → host → wasmtime. All behind off-by-default wasm-plugin.
  • tools → 2ca2f2a (v3.11.0, tools#68) — noetl_tools::locator (ResourceLocator logical URI + ResultCoordinates §7 physical key + stable FNV-1a shard_key).

Still in flight: #105 Round 5 (dispatcher mode + lowering + materialiser port); #104 R02 — locator fan-out (frame,row) refinement open as tools#70 (v3.12.0); R02a (orchestrator stamps cursor.{frame,row} on body commands) already in place; R02b (worker stamp) gated on the tools 3.12.0 release.

Pointers. ai-meta chore(sync) commits e96b6d6 (server) + e34dbcb (worker) + 2b71d4d (tools).


2026-06-16 — Event-WAL Round 01 (locator) + system-pool plug-in direction (#104, #105)

Headline. Implementation of the Event-WAL umbrella started with the naming foundation; a standing directive routed all async system services to the system-pool plug-in ring and opened the WASM-compilation capability as its own umbrella (#105).

Round 01 — naming foundation (noetl/tools#68, merged; v3.11.0). Shared noetl_tools::locator: ResourceLocator (the stable §8 logical URI), ResultCoordinates (the §7 physical key, collision-free across frame/attempt), shard_key (fixed FNV-1a — reproducible across binaries/arch/time, locked test), CellPlacement, legacy parse. Pure module, 12 unit tests, not yet wired into runtime. Sub-issue tools#67. Single source of truth replacing the divergent noetl://execution/... formatting.
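Why a fixed FNV-1a gives a reproducible shard_key can be shown with a short sketch. FNV-1a is fully specified by its offset basis and prime, so the same bytes hash to the same value on any binary or architecture. The 64-bit width, the modulo reduction, and the locator string below are assumptions for illustration; the entry does not state which variant `noetl_tools::locator` uses.

```python
FNV64_OFFSET_BASIS = 0xCBF29CE484222325
FNV64_PRIME = 0x100000001B3
MASK64 = 0xFFFFFFFFFFFFFFFF


def fnv1a_64(data: bytes) -> int:
    """Standard 64-bit FNV-1a: XOR the byte in, then multiply by the prime."""
    h = FNV64_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV64_PRIME) & MASK64
    return h


def shard_key(locator: str, shards: int) -> int:
    """Hypothetical reduction of a locator string to a shard index."""
    return fnv1a_64(locator.encode("utf-8")) % shards


# Hypothetical locator string; the real URI grammar lives in the locator module.
index = shard_key("noetl://execution/123/step/fetch", 16)
```

A "locked test" in this sense pins known input/output pairs, so any accidental change to the hash constants or byte order fails the build instead of silently re-sharding.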

Directive — system services as plug-in-ring playbooks. The Event-WAL materialiser and every async system service run on the system worker pool as playbooks (like the live system/projector + system/outbox_publisher), not bespoke Rust services; the worker's parallel publish stays compiled-core-thin. "Compile playbook logic to compiled, hot-replaceable, managed library" is the Phase 4 (WASM compilation) of the System Pool + WASM Plug-in ADR: wasmtime host in the worker binary, server-side compile-at-register, hot-reload via catalog version bump, the catalog as the managed plug-in library. Designed, not built; its old umbrella #46 is closed, so opened #105 to track it (gating decision: the lowering model — transpile vs hand-written Rust plug-ins). #104 blueprint reshaped to route services to the plug-in ring (docs#180 836725c).

Pointers. tools#68 (v3.11.0, merged); #105 opened + on board 3; docs#180 updated.


2026-06-16 — References-in-state stall fixed + Event-WAL umbrella opened (#101, #104)

Headline. The references-in-state flag-on stall is fixed and re-validated on kind; the architecture for making the event/result path fast and crash-resilient is captured as a new umbrella (#104).

Stall fix (worker feat/refs-in-state-extracted @ 56c253c, off main). The extracted predicate block was collapsing over-budget results to a flat {_count, _keys} shape, so the orchestrator's {{ output.data.rows[0].facility_mapping_id }} resolved to null and the PFT stalled at 13 events. Replaced the flat collapse with a bounded structural summary (build_extracted/summarise_value recurse: objects keep every key, arrays keep their first element as a real 1-element array so arr[0].<field> resolves; large strings → {_len}; threaded byte budget caps at 4KB). Worker-only — the server command carries no output_select. Re-validated flag-on (2 fac × 5 pat): 253 events (vs 13), full per-facility pipeline ran, 31 reference URIs active, command.issued bounded (max 18KB / avg 11KB). The terminal playbook.failed was the hardcoded 1000/1000 go/no-go assertion vs the 5-patient override (fixture artifact). 2 unit tests added; kind reverted to flag-off. Recorded on #101.
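The shape rule behind the stall fix can be sketched in a few lines: objects keep every key, arrays keep their first element as a real one-element array, and large strings collapse to `{_len}`. This sketch is illustrative Python rather than the worker's Rust `summarise_value`; it omits the threaded 4KB byte budget, and the 64-character string threshold is an assumption.

```python
def summarise_value(value, max_str_len=64):
    """Bounded structural summary that preserves paths such as rows[0].<field>."""
    if isinstance(value, dict):
        # Objects keep every key so field lookups still resolve.
        return {k: summarise_value(v, max_str_len) for k, v in value.items()}
    if isinstance(value, list):
        # Arrays keep only their first element, but stay arrays.
        return [summarise_value(value[0], max_str_len)] if value else []
    if isinstance(value, str) and len(value) > max_str_len:
        return {"_len": len(value)}
    return value


result = {
    "data": {
        "rows": [
            {"facility_mapping_id": 7, "notes": "x" * 1000},
            {"facility_mapping_id": 8, "notes": "y" * 1000},
        ]
    }
}
summary = summarise_value(result)
first_id = summary["data"]["rows"][0]["facility_mapping_id"]
```

A flat `{_count, _keys}` collapse would have no `rows[0]` at all, which is why the template resolved to null before the fix.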

New umbrella #104 — Event WAL + derivable result storage. Design blueprint landed: docs/architecture/event_wal_and_derivable_storage.md (docs branch feat/event-wal-derivable-storage @ fdbc388). The model: NATS JetStream is the write-ahead log (publish-ack = durability, the synchronous noetl.event INSERT leaves the hot path); local memory/temp is a read cache ahead of the last acked offset; result locations are derived from a URN naming convention instead of carried as references; two pools drain the log (projector → projection_snapshot, materialiser → Arrow Feather in object store). The load-bearing decision: the durability barrier sits at side-effecting tool boundaries only. Folds CQRS (#103) + references-in-state (#101) into one model; ~70% already scaffolded by #103. Added to roadmap board 3.

Pointers. worker feat/refs-in-state-extracted 56c253c; docs feat/event-wal-derivable-storage fdbc388; umbrella Event WAL Storage.


2026-06-15 — Orchestrator block b landed + GKE small-tier stall proof (#101, server v3.9.0)

Headline. Block b of the orchestrator-scaling umbrella landed in noetl/server v3.9.0 and was proven stall-proof on the GKE db-g1-small + PgBouncer small tier.

What landed (noetl/server#197 → v3.9.0 1760c19): projection-snapshot bounded rebuild (flat memory — 167KB snapshot at 200k events, was OOM at ~19k); throttled consistency COUNT (O(events) per-trigger COUNT off the hot path); a background reconcile poller force-advancing every active execution every 8s; results-by-reference resolution; and the GET /api/executions/{id} memory-bomb fix (was loading all 200k events → OOM on poll).

The journey. kind 10×1000 held flat memory to 200k events but a status-endpoint poll OOM'd it (self-inflicted — found + fixed the executions endpoint). On the GKE small tier the slow DB backpressure surfaced a deadlock: a non-triggering straggler (a cursor claim's call.done carrying the row batch) missed in a COUNT throttle gap left the cursor unable to fan out, and with no further events there was no trigger to retry → permanent stall. Fixed with the reconcile poller. Re-validated on GKE db-g1-small + PgBouncer (10×200): cleared the prior deadlock point, poller observed advancing a stuck execution, 0 fails / 0 restarts, Cloud SQL bounded ~15 backends. Processing is slow under the small tier (~1.5 items/s — the bottleneck is the Cloud SQL tier + PgBouncer pool, not the orchestrator) but does not stop.

Pointers. Issue #101 · Umbrella: Orchestrator Scaling · server v3.9.0. Full GKE 10×1000 running for the small-tier completion number. Block-b stage 2 (references-in-state + completed-frame pruning) still ahead.


2026-06-15 — Orchestrator scaling: incremental state + results-by-reference (#101)

Headline. Found + fixed two coupled orchestrator-scaling bugs that surfaced validating test_pft_flow_v2 at scale, then committed both as PRs (kind-validated, awaiting merge).

What landed (PRs open).

  • server#197 — incremental orchestrator state (per-execution OrchStateCache: apply-new-events behind a per-exec lock, full-rebuild fallback on count mismatch, evict on terminal; cursor frames moved into StepInfo) + hydrate_result_references (resolve over-budget {data:{_ref}} references from noetl.result_store, nested + top-level envelope shapes, before the orchestrator reads events).
  • worker#89 — env-configurable inline budget NOETL_EVENT_RESULT_CONTEXT_MAX_BYTES (default 100 KB).

Root cause. The orchestrator never resolved references when reading events, so a cursor claim returning its rows by reference looked like zero rows → 0-row frames → infinite re-claim (20,844 events, work-queue stuck pending=5). And the per-trigger full event-log replay OOM'd the server under cursor concurrency.
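The zero-row misread is easy to reproduce in miniature. In the sketch below, a dict stands in for noetl.result_store and any object carrying a `_ref` key is replaced by the stored payload before the reader looks at it. The function body, the event shape, and the URI are hypothetical simplifications of the server's `hydrate_result_references`.

```python
def hydrate_result_references(value, store):
    """Recursively replace {"_ref": uri} envelopes with the stored payload."""
    if isinstance(value, dict):
        ref = value.get("_ref")
        if isinstance(ref, str):
            return store[ref]
        return {k: hydrate_result_references(v, store) for k, v in value.items()}
    if isinstance(value, list):
        return [hydrate_result_references(v, store) for v in value]
    return value


# Hypothetical store contents and event shape.
store = {"noetl://results/claim-1": {"rows": [{"id": 1}, {"id": 2}]}}
event = {"name": "call.done", "result": {"data": {"_ref": "noetl://results/claim-1"}}}

rows_unresolved = event["result"]["data"].get("rows", [])  # what the orchestrator saw
hydrated = hydrate_result_references(event, store)
rows_resolved = hydrated["result"]["data"]["rows"]
```

Without the hydration pass the claim looks empty, so the cursor treats the frame as having zero rows and claims again, which is the runaway loop described above.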

Validation. kind, worker budget=256 (forces every result by reference): PFT test_pft_flow_v2 (1 facility × 5 patients) COMPLETED — 34 references / 9 steps stored+resolved in noetl.result_store, 292 events (vs 20,844 runaway pre-fix), all 25 work-queue rows done. server 666 tests pass, worker 19 pass, clippy clean.

Pointers. Issue #101 (board: In progress) · Umbrella: Orchestrator Scaling · worker wiki deployment-spec env var (worker.wiki@a1d3b09). Pointer bumps deferred to merge. Follow-up (user-steered): extend the contract so event.result + command.context carry references only (never inline data), with an extracted predicate-fields block — projection + reference + step-container / heterogeneous-runtime model.


2026-06-15 — Cursor/claim loop mode + PFT green on Rust (#100)

Headline. The Rust orchestrator now supports loop.spec.mode: cursor — a claim-based work-distribution loop that repeatedly leases a frame of rows from a database table and processes the step body per row until the claim returns nothing. The full test_pft_flow_v2 patient-fetch flow ran end-to-end on the Rust stack with all_passed: true (5/5 per data type: assessments, conditions, medications, vital_signs, demographics) against the throttling/error-injecting paginated-api test server on kind. This proves the playbook handles retries + multi-frame cursor loops.

What landed.

  • noetl/server#196 → v3.8.0 — cursor loop engine: LoopMode::Cursor, CursorClaim, FrameSpec; orchestrator entry hook issues the first claim command; advance_cursor drives subsequent frames; reconstruct_cursor_frames matches completions to frames; StepInfo.is_cursor prevents premature step completion; drain fires when the claim returns 0 rows (__cursor_drained, event.name = loop.done). Also lands: the output namespace (arc when: / step set: may now use {{ output.<field> }} as an alias for the just-completed step's result — unblocked the PFT's output-gated arcs) + cursor loop-back re-entry (re-running a cursor step from a loop-back arc resets frame tracking so prior drained run's frames don't merge).

  • noetl/tools#66 → v3.10.1 — postgres multi-statement splitter fix: the splitter now skips -- line comments before scanning for ;. An apostrophe in a -- comment was swallowing the trailing semicolon and merging subsequent statements, producing "cannot insert multiple commands into a prepared statement" at setup_facility_work.

  • noetl/worker#88 — bumps noetl-tools dep to v3.10.1 to pick up the splitter fix.
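The splitter bug in tools#66 comes down to scan order: if `--` comments are not skipped first, an apostrophe inside a comment opens a string that swallows the next `;`. A minimal sketch of a comment-aware splitter follows. It is illustrative Python, not the noetl-tools implementation, and it deliberately ignores block comments, dollar-quoted bodies, and `E''` strings.

```python
def split_statements(sql):
    """Split on ';' while skipping '--' line comments and single-quoted strings."""
    statements = []
    current = []
    in_string = False
    i = 0
    n = len(sql)
    while i < n:
        ch = sql[i]
        if in_string:
            current.append(ch)
            if ch == "'":
                if i + 1 < n and sql[i + 1] == "'":
                    # '' inside a string is an escaped quote, not the end of it.
                    current.append("'")
                    i += 1
                else:
                    in_string = False
        elif sql.startswith("--", i):
            # Drop the comment up to (not including) the newline.
            end = sql.find("\n", i)
            i = n if end == -1 else end
            continue
        elif ch == "'":
            in_string = True
            current.append(ch)
        elif ch == ";":
            statement = "".join(current).strip()
            if statement:
                statements.append(statement)
            current = []
        else:
            current.append(ch)
        i += 1
    tail = "".join(current).strip()
    if tail:
        statements.append(tail)
    return statements


script = (
    "-- don't merge these\n"
    "CREATE TABLE t (id int);\n"
    "INSERT INTO t VALUES (1); -- done\n"
    "SELECT 'a;b';"
)
parts = split_statements(script)
```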

Orchestrator-driven design. No worker holds a slot between claim frames. Each claim runs as a normal postgres tool command on a worker; the orchestrator issues the next claim when the prior frame's body work completes. Stale-row reclaiming is the playbook's own responsibility via a reclaim-stale CTE in the claim SQL (FOR UPDATE SKIP LOCKED … RETURNING). {{ __frame_max_rows }} is injected from loop.spec.frame.max_rows.
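The control flow of the cursor loop can be simulated without a database. In the sketch below an in-memory list stands in for the work table and `claim_frame` stands in for the claim SQL (the real claim is a postgres command using FOR UPDATE SKIP LOCKED … RETURNING). All names are hypothetical, and the sketch runs in one process, whereas the real orchestrator issues each claim and each body item as separate commands.

```python
def claim_frame(table, max_rows):
    """Lease up to max_rows pending rows; a stand-in for the claim statement."""
    frame = [row for row in table if row["status"] == "pending"][:max_rows]
    for row in frame:
        row["status"] = "claimed"
    return frame


def run_cursor_loop(table, frame_max_rows, body):
    """Claim a frame, run the body per row, repeat until a claim returns nothing."""
    frames = 0
    while True:
        frame = claim_frame(table, frame_max_rows)
        if not frame:
            return frames  # drained: the claim returned zero rows
        for row in frame:
            body(row)
            row["status"] = "done"
        frames += 1


work = [{"id": i, "status": "pending"} for i in range(5)]
seen = []
frame_count = run_cursor_loop(work, frame_max_rows=2, body=lambda row: seen.append(row["id"]))
```

The next claim is only issued after the previous frame's body work finishes, which is why no worker needs to hold a slot between frames.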

Validation. test_pft_flow_v2 on the local kind cluster:

  • Stack: server v3.8.0 + noetl-tools v3.10.1 + worker pulling v3.10.1.
  • Test server: throttling/error-injecting paginated-api (proves error handling + multi-frame loops).
  • Result: all_passed: true, 5/5 per data type.

Pointers. server → 1418a93 (v3.8.0) · tools → 454ab6e (v3.10.1) · worker → e4b4a64.

Umbrella. noetl/ai-meta#100 — closed. Umbrella-Cursor-Loop-Mode page created. Server wiki cursor-loop-mode deep-dive page added.


2026-06-14 — Transfer tool: Snowflake↔Postgres both directions (#99 closed)

Headline. The transfer tool now moves rows between Snowflake and Postgres in both directions with full credential-alias resolution. Validated end-to-end on kind against the live sf_test Snowflake account (NDCFGPC-MI21697) + kind Postgres: the full bidirectional data_transfer/snowflake_postgres fixture turned green — every step COMPLETED, including transfer_sf_to_pg and transfer_pg_to_sf. Real data moved with correct types: SF→PG landed id (int) / name (text) / value (numeric 100.50) / created_at (timestamptz 2026-06-15 05:55:01.262+00) / metadata (jsonb) correctly; PG→SF moved all 5 rows back.

What landed.

  • noetl/tools#65 → v3.10.0 — implemented the (Snowflake,Postgres) and (Postgres,Snowflake) transfer arms (previously validated as supported but not implemented). Reuses SnowflakeTool (new query_rows method) + PostgresTool internally. Two key coercion problems solved: (1) Snowflake returns every cell as a string, so SF→PG looks up the target column types from information_schema.columns and coerces with $n::text::<udt> casts; (2) Snowflake's internal TIMESTAMP_TZ format (<epoch>.<nanos> <tzmin>) is reformatted to RFC3339 before the PG cast. PG→SF writes generated SQL-escaped INSERT statements. SourceConfig/TargetConfig capture worker-injected credential fields via #[serde(flatten)] extra.

  • noetl/worker#87 → v5.22.0 — the worker pre-resolves the keychain alias on each transfer endpoint (source.auth/target.auth), mirroring the task_sequence pre-resolution pattern, and bumps noetl-tools to 3.10. (Earlier worker bumps this session: #83 / #84 / #85 / #86 for the key-pair JWT work under #98.)

  • noetl/e2e#58 — migrated the fixture's transfer steps off the Python-era nested-auth map to string-alias auth + table-based auto-INSERT.
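The TIMESTAMP_TZ reformatting described under tools#65 can be sketched as follows. The input shape `<epoch>.<nanos> <tzmin>` is from the entry above; the assumption here, not stated in the entry, is that the offset field is in minutes biased by 1440 (so 1440 means UTC), which is why the bias is a parameter. The sketch truncates to milliseconds and assumes a non-negative epoch; it is illustrative Python, not the tool's Rust code.

```python
from datetime import datetime, timedelta, timezone


def snowflake_tz_to_rfc3339(raw, offset_bias=1440):
    """Convert '<epoch>.<nanos> <tzmin>' into an RFC3339 string."""
    stamp, tz_field = raw.split(" ")
    seconds, _, fraction = stamp.partition(".")
    # Right-pad or cut the fraction to exactly nine digits of nanoseconds.
    nanos = int((fraction + "000000000")[:9])
    # Assumption: the offset is stored in minutes with a +1440 bias.
    tz = timezone(timedelta(minutes=int(tz_field) - offset_bias))
    moment = datetime.fromtimestamp(int(seconds), tz) + timedelta(microseconds=nanos // 1000)
    return moment.isoformat(timespec="milliseconds")


# Build an input for a known instant rather than hard-coding an epoch value.
epoch = int(datetime(2026, 6, 15, 5, 55, 1, tzinfo=timezone.utc).timestamp())
text = snowflake_tz_to_rfc3339(f"{epoch}.262000000 1440")
```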

Validation (kind, live account). Full bidirectional data_transfer/snowflake_postgres fixture green: create_sf_database, setup_sf_table, create_pg_table, transfer_sf_to_pg, and transfer_pg_to_sf all COMPLETED. Type coercion confirmed live: numeric, text, timestamptz, jsonb all round-tripped correctly in the SF→PG direction; all 5 rows moved cleanly in the PG→SF direction.

Pointers bumped. tools → 4127b4b (v3.10.0) · worker → 6d97e7c (v5.22.0) · e2e → 94aa7f1.


2026-06-14 — Snowflake key-pair JWT validated end-to-end

Headline. Snowflake key-pair (JWT) authentication implemented and validated end-to-end on kind against the live sf_test account (Snowflake account NDCFGPC-MI21697, user NOETL, warehouse SNOWFLAKE_LEARNING_WH). This was the last external-fixture gap for the regression-baseline migration (#98): the Snowflake tool only supported password auth, and the real account requires MFA with TOTP — rejected every attempt. JWT bypasses the password/MFA path entirely.

What landed.

  • noetl/tools#62 → v3.9.0 — key-pair JWT auth for the Snowflake tool. RS256 JWT with iss = <ACCOUNT>.<USER>.SHA256:<base64(SHA256(public-key DER))>, sub = <ACCOUNT>.<USER> (both uppercased, region segment dropped from account); sent as a Bearer token with X-Snowflake-Authorization-Token-Type: KEYPAIR_JWT, bypassing the password/MFA login flow. New SnowflakeConfig.public_key (PEM) field. Deps: jsonwebtoken 9 + pem 3.
  • noetl/tools#63 → v3.9.1 — set a User-Agent header on the Snowflake HTTP client. Snowflake's SQL REST API rejects requests with no User-Agent (400 / code 391903); reqwest sends none by default.
  • noetl/tools#64 → v3.9.2 — set Snowflake session context (warehouse/role/database/schema) in the request body instead of via USE statements (the SQL API rejects USE with code 391911); split multi-statement command: blocks on ; (the SQL API runs one statement per request; a whole multi-statement block fails with code 000008); database/schema omitted for CREATE/DROP DATABASE.
  • noetl/worker#83 — added sf_public_key -> public_key to auth_alias.rs SNOWFLAKE_FIELD_MAP.
  • noetl/worker#84/#85/#86 — bumped noetl-tools dep 3.7 → 3.9 → 3.9.1 → 3.9.2 (binary redeployed on kind; no crates.io worker publish).
  • noetl/e2e#57 — dropped unsupported USE DATABASE; USE SCHEMA statements from the data_transfer/snowflake_postgres fixture.
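The claim layout from tools#62 can be reproduced with the standard library alone. The sketch below builds only the claims (`iss`, `sub`) in the form given above; it does not sign anything, since RS256 signing needs an RSA library. The `iat`/`exp` fields, the one-hour lifetime, and the rule that the region segment is whatever follows the first dot are assumptions, and the account, user, and key bytes are placeholders.

```python
import base64
import hashlib


def normalize_account(account):
    """Uppercase the account and drop everything after the first dot (assumed region)."""
    return account.split(".", 1)[0].upper()


def keypair_jwt_claims(account, user, public_key_der, now, lifetime_s=3600):
    """Build the unsigned claim set for key-pair authentication."""
    fingerprint = base64.b64encode(hashlib.sha256(public_key_der).digest()).decode("ascii")
    subject = f"{normalize_account(account)}.{user.upper()}"
    return {
        "iss": f"{subject}.SHA256:{fingerprint}",
        "sub": subject,
        "iat": now,               # assumption: issued-at in epoch seconds
        "exp": now + lifetime_s,  # assumption: one-hour lifetime
    }


# Placeholder inputs; a real call passes the DER bytes of the registered public key.
claims = keypair_jwt_claims("myorg-myacct.us-east-1", "noetl", b"placeholder-der-bytes", now=1_700_000_000)
```

The fingerprint is computed over the DER encoding of the public key, not the PEM text, so a PEM-configured key has to be decoded before hashing.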

Validation (kind, live account). The two tool: snowflake steps in the fixture — create_sf_database (CREATE DATABASE) and setup_sf_table (CREATE TABLE + INSERT) — both reached COMPLETED via key-pair JWT. The transfer_sf_to_pg step still fails: the transfer tool reads inlined credentials and cannot resolve credential aliases, and its Snowflake source has no key-pair fields. Deferred to #99.

Pointers bumped. tools → a216ab2 (v3.9.2) · worker → 9d6b127 · e2e → e191231.


2026-06-14 (#96 Rust system worker pool live: scheduled cleanup + Python-era legacy removed)

Headline. Stood up the Rust-native system worker pool. The Rust server publishes commands directly to NATS and writes events inline, so the Python-era outbox-publisher/projector system playbooks are obsolete — the system pool's real job is scheduled retention/cleanup of the transient noetl.* tables (nothing did this after the cutover). Deployed + verified in prod; Python-era legacy removed.

What landed.

  • noetl/server#193 (v3.6.0) — POST /api/internal/cleanup/purge, service-account-gated. Purges terminal noetl.command rows + dead noetl.runtime worker registrations; noetl.event retention is opt-in (default off — append-only source of truth). Ships span + noetl_cleanup_rows_purged_total{table} metric + structured log. Prod image server-rust:v3.5.4 (sha256:2bb06d5a…).
  • noetl/ops#185: system/scheduled_cleanup playbook (calls the endpoint via the system pool, per data-access-boundary.md), prod system-pool deployment (NOETL_COMMANDS_RUST stream, noetl_worker_system_rust consumer), hourly CronJob trigger. Deleted the obsolete Python outbox_publisher playbook + outbox-publisher-deployment + configmap-outbox-publisher + projector-statefulset.

Python-era legacy removed. Deleted the dead prod noetl-server + noetl-worker Python deployments (orphaned — the noetl Service already selects app: noetl-server-rust; gateway health stayed 200). Prod now runs Rust only (server-rust + worker-rust + system-pool). The remaining entangled refactor (Python server-deployment/worker-deployment/subscription-runtime manifests, the kind redeploy automation that hardcodes Python deployment names, the stale helm release noetl rev 185 from before the cutover) is tracked in #97.

Validation. Kind end-to-end (/api/execute → system pool → cleanup/purge 200, purged 30 dead runtime rows), then prod: server rolled to v3.5.4, system pool 1/1, CronJob hourly; a trigger run reached playbook.completed with a 200 from the endpoint.

Pointers bumped. server@9f399f7 + ops@7b02727.

Also this session — Rust regression baseline + e2e migration (#98). Built a Rust-stack regression runner + PF-resilient batched runner; grew the green core 26 → 40 → 64 fixtures, all verified on kind. Migrated 2 fixtures to the start-entry convention. Root-caused + fixed a kind credential blocker (noetl/ops#186): the kind noetl-secret lacked NOETL_ENCRYPTION_KEY, so the server used a random default key regenerated per restart → flaky Decryption failed on postgres fixtures; a stable dev key unlocked the whole postgres batch (40→65 green). Remaining red is external-cloud (OpenAI/GCS/IB/Snowflake — can't run in kind) + a few engine cases. Browser e2e on the Rust-only prod stack also green. Tracked in #98.

(superseded by the fuller note above) Also this session — Rust regression baseline. Added a Rust-stack regression runner (noetl/e2e#52, scripts/rust_regression_run.sh) that drives the server's /api/execute directly and ships a green 10-fixture core baseline verified on kind (basic python, loops, control-flow routing, fanout/parallelism, sub-playbook composition, large-result extraction, output selection). Browser e2e on the Rust-only prod stack also green (login → run tests/e2e_probe → playbook.completed). Growing the baseline / migrating the Python-era suite tracked in #98.


2026-06-14 (#49 Phase F R5 — production login RESTORED through the full Rust stack)

Headline. Browser-driven Auth0 login at https://mestumre.dev/login now authenticates and redirects to the dashboard; the auth0_login playbook runs to playbook.completed. Login had been down since the cutover; root-caused and fixed forward (no Python back-compat) per standing guidance.

What landed.

  • noetl/worker#81 (worker@9ce4d6d, released v5.20.1) — SOURCE_FIELD_MAP maps nats_url/nats_user/nats_password credential fields to the flat url/user/password the NATS tool deserializes. The shipped nats_credential used the prefixed names, which apply_source_credential injected verbatim → dropped as serde-unknown → no url → cache_and_callback's NATS kv_put failed (503 auth backend is busy at the gateway). Mirrors the existing POSTGRES_FIELD_MAP pattern. Prod image rolled to sha256:61278cb6…f524658 (ops#184).
  • noetl/e2e#51 (e2e@1c2a0b5) — migrated auth0_login.yaml to Rust execution conventions (prod catalog v102): python libs:→explicit import; context.get()input:+args.get(); http callback body data:json:; dropped Python-era .context step-result wrapper; carried the session token as the non-sensitive sess_ref (the server + worker [REDACTED]-redact any *token* field when persisting step results, which destroyed the real token before downstream steps read it); expires_at::text cast (the Rust postgres tool returned null for the raw timestamptz column).

Validation. Full browser Auth0 login through the prod Rust stack → authenticated /catalog dashboard; execution reached playbook.completed across start → create_user_session → prepare_session_cache → cache_and_callback → send_success_callback.

Pointers bumped. worker@690ef1d + e2e@1c2a0b5 + ops@19d7f84.

Follow-ups filed. #95 — postgres pg_value_to_json returns null for timestamptz/NaiveDateTime. Noted in #49: event-log *token* redaction corrupts inter-step propagation (favor response-boundary redaction per the data-access-boundary rule).


2026-06-13 (#49 — post-cutover edge refresh: gateway v3.4.0 + SPAs on Cloudflare Pages + Rust worker KEDA)

Headline. On top of the full Rust-stack cutover, refreshed the edge and added worker autoscaling. All public endpoints live + verified through Cloudflare.

  • Rust worker KEDA autoscaler (ops#182) — 2→20 on NATS JetStream lag of the dedicated NOETL_COMMANDS_RUST stream / noetl_worker_rust_shared consumer.
  • Gateway → v3.4.0 (noetl-gateway@sha256:97e72c97…49c48, f175a87) — rolled in-cluster by digest; forwards to the Rust control plane; /health ok. Config-compatible (new env all optional). Helm-managed → set image in prod helm values before any helm upgrade.
  • NOETL dashboard → Cloudflare Pages (noetl-gui, mestumre.dev) via automation/cloudflare/gke_gateway_edge.yaml action=pages.
  • Travel SPA → Cloudflare Pages (travel, travel.mestumre.dev) via its npm run build:cf && deploy:cf (Maps key from SM google-maps-widget-key).
  • Frontend pattern = Cloudflare Pages for SPAs + tunnel for the gateway API only (not in-cluster). Verified: gateway.mestumre.dev/health 200, mestumre.dev 200, travel.mestumre.dev 200.

ops pointer 494497f. #49 detail.


2026-06-13 (#49 Phase F R5 — 🎉 PRODUCTION CUTOVER COMPLETE: full Rust stack live)

Headline. Production GKE (noetl-demo-19700101) now runs the full Rust stack (Rust server + Rust worker); Python is scaled to 0. The Rust noetl/server crate is the production control plane.

The decisive lesson. A server-only flip (attempt 1) failed: the Rust server publishes commands to the hierarchical subject noetl.commands.{pool}.{execution_id}, which the Python workers + prod NATS stream (flat noetl.commands) don't consume → /api/execute 500'd. Rolled back cleanly. The Rust worker is what consumes the hierarchical subjects, so the cutover had to be the full Rust stack (the kind-validated config).
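The subject mismatch behind attempt 1 can be checked with a small matcher for NATS subject wildcards (`*` matches exactly one token, `>` matches one or more trailing tokens). The matcher is a sketch written for this note, not code from any NATS client, and the pool name and execution id are placeholders.

```python
def subject_matches(pattern, subject):
    """Return True if a NATS-style subject pattern matches a concrete subject."""
    pattern_tokens = pattern.split(".")
    subject_tokens = subject.split(".")
    for i, token in enumerate(pattern_tokens):
        if token == ">":
            # '>' is only valid as the last token and needs at least one token to match.
            return i == len(pattern_tokens) - 1 and len(subject_tokens) > i
        if i >= len(subject_tokens):
            return False
        if token != "*" and token != subject_tokens[i]:
            return False
    return len(subject_tokens) == len(pattern_tokens)


rust_subject = "noetl.commands.shared.42"  # noetl.commands.{pool}.{execution_id}
flat_stream_sees_it = subject_matches("noetl.commands", rust_subject)
rust_stream_sees_it = subject_matches("noetl.commands.>", rust_subject)
rust_stream_sees_flat = subject_matches("noetl.commands.>", "noetl.commands")
```

The flat subject and the `noetl.commands.>` filter are disjoint in both directions, which is consistent with the dedicated NOETL_COMMANDS_RUST stream leaving the Python stream untouched.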

Attempt 2 (validated-then-flipped).

  1. Built the prod amd64 Rust worker image (noetl-worker-rust:v5.20.0, digest sha256:b808bc60…c8dea9).
  2. Created a dedicated NATS stream NOETL_COMMANDS_RUST (noetl.commands.>), disjoint from Python's flat stream — Python untouched.
  3. Deployed the Rust worker as a canary + proved a real hello_world execution COMPLETED end-to-end off the traffic path (the gate).
  4. Cut over: scaled Rust worker → 3, flipped the noetl selector → Rust, re-encrypted the 19 credentials under Rust (19×200), scaled Python server + workers → 0 (KEDA ScaledObject paused at 0).
  5. Verified through the production noetl Service / gateway: executions COMPLETE, credentials decrypt, health green, logs clean.

Fixes landed this cutover: pgbouncer transaction-mode sqlx (NOETL_PG_STATEMENT_CACHE_CAPACITY=0, server#191), time =0.3.47 build pin (server#190), DB password from NOETL_PASSWORD key (ops#180), prod manifests + runbook (ops#178/#179/#181).

State: noetl-server-rust 1/1, noetl-worker-rust 3/3; Python noetl-server/noetl-worker retained at 0 for rollback. #49 open for a short soak. Full detail: #49 cutover comment.


2026-06-13 (#49 Phase F R5 — live pre-flight: pgbouncer transaction-mode fix)

Headline. Ran the read-only pre-flight against live prod with the operator. Surfaced one real blocker and corrected two assumptions; fixed the blocker. Production still 100% Python — no traffic moved.

Findings.

  • 🔧 DB is Cloud SQL behind transaction-mode pgbouncer. Prod has no direct postgres Service — pgbouncer.postgres.svc runs POOL_MODE=transaction (cloud-sql-proxy → noetl-shared-pg). sqlx's named prepared-statement cache fails intermittently under transaction pooling. Fixed: added NOETL_PG_STATEMENT_CACHE_CAPACITY to the server (default 100 unchanged; set 0 behind a transaction pooler → one-shot unnamed statements) — noetl/server#191 MERGED (0577cc6, v3.5.1). Prod manifest sets it to 0. Rust stays behind pgbouncer like Python.
  • Decision C N/A. Prod noetl Service exposes 8082/TCP only — no 8083/Flight port (that's kind-only). Nothing to break on flip.
  • Image repinned to the rebuilt digest sha256:c3783281…964984 (server-rust:e7df366 / :v3.5.0, Cloud Build 94cc199e) carrying both the time pin and the statement-cache fix.
  • noetl/ops#179 MERGED (1164270): manifest env + repin + runbook corrections (Decision B RESOLVED, C N/A). Server-wiki deployment-spec env catalogue updated (a17cf50).

ai-meta pointers: server 55d2dfc → 0577cc6, ops dd5ede7 → 1164270, server-wiki a17cf50. Operator-pending: provision the two secrets → apply → canary (query-heavy playbook proves the pgbouncer fix) → flip → scale Python to 0. #49 open; board In progress.


2026-06-12 (#49 Phase F R5 — cutover prep: image + manifest + operator runbook)

Headline. Following the NO-GO readiness review (same day, below), all safe, non-traffic-affecting cutover prep is built and merged. Production is still 100% Python — the cutover is operator-gated and fully scripted. #49 stays open; board In progress.

What landed.

  • Prod amd64 image pushed to the prod Artifact Registry: server-rust:4644c49 / :v3.5.0, digest sha256:78cce8f3…b929fa (Cloud Build 00a26c26, linux/amd64).
  • noetl/server#190 MERGED (55d2dfc) — time =0.3.47 build-fix pin. The v3.5.0 release commit 7b217d8 doesn't compile (time 0.3.48 × async-nats 0.38 E0119); server was the last Rust repo missing the pin that tools/worker/gateway already had.
  • noetl/ops#178 MERGED (dd5ede7): prod-shaped server-rust-deployment-prod.yaml (image pinned by digest, encryption key + internal-api-token REQUIRED/fail-closed, pgbouncer DB, NATS prod auth; does NOT touch the noetl Service) + the operator runbook runbooks/noetl-server-rust-cutover.md + the amd64 Cloud Build asset.

Operator-pending (in the runbook): provision NOETL_ENCRYPTION_KEY (+ re-enter the plaintext-stored prod credentials under the Rust AES-GCM scheme) and noetl-internal-api-token; verify pgbouncer session mode + the port-8083 Flight caveat; apply → canary → selector flip → scale Python to 0; one-command rollback on standby.

ai-meta pointers bumped: server 7b217d8 → 55d2dfc, ops 85bfc1f → dd5ede7.


2026-06-12 (#49 Phase F R5 — production cutover readiness: NO-GO)

Headline. Stage-1 readiness review for flipping production GKE (gke_noetl-demo-19700101_us-central1_noetl-cluster) from the Python noetl-server (FastAPI) to the Rust noetl/server crate. Verdict: NO-GO — no production traffic was flipped; prod is untouched and still serving on Python.

What was found. Prod read-access confirmed. Routing is the gateway LoadBalancer → noetl ClusterIP Service (selector app=noetl-server), not a K8s Ingress. The prod baseline is Python only: noetl-server 1/1 on image noetl:coalesce-20260529230422 (cmd ["python"]), noetl-worker 3/3. No noetl-server-rust Deployment, Service, pods, or image exist in prod — the Rust server has never been deployed to GKE; it runs only on the kind stack (healthy at runtime v3.4.2). Submodule pointer repos/server = 7b217d8 (v3.2.0-10-g7b217d8).

Hard blockers (beyond "not deployed").

  • noetl-secret in prod holds only NOETL_PASSWORD + POSTGRES_PASSWORD, with no NOETL_ENCRYPTION_KEY; the Rust manifest references it with optional: true, so a Rust pod would boot on the insecure default key (prereq 3b FAIL).
  • noetl-internal-api-token secret absent — Rust manifest references it as a non-optional secretKeyRef → pod fails to start (3c FAIL).
  • Named gate validate-shard-routing-n2.sh is kind-scoped; fresh re-run blocked on a kind Postgres CREATEDB harness-privilege gap (shard routing previously passed at Phase F R4). Not unblocked (out of authorized scope).

Outcome. Full GO/NO-GO with per-prerequisite pass/fail, the exact selector-flip cutover + one-command rollback, blast radius, and the operator action list (build/push amd64 image → provision the two secrets → deploy + canary → flip) recorded on #49. No PR opened (applying a Rust prod Deployment before the secrets exist would crashloop or silently use the insecure default key). #49 stays open; board stays In progress. Unrelated uncommitted items (repos/.dockerignore, scripts/start_noetl_ui.command) left untouched.


2026-06-12 (#91 — live Google OIDC signature validation for push ingress)

Headline. The gateway push-ingress pubsub_oidc verifier is now proven against the real Google JWKS — closing the one positive-path gap #90 Phase 3 deferred for lack of a real Google-signed token.

What landed.

  • A genuinely Google-signed OIDC token was minted by impersonating the #90 Phase 5 least-privilege runtime SA (noetl-subscription-runtime@noetl-demo-19700101.iam.gserviceaccount.com) with --audiences + --include-email, so it carries the exact claims a Pub/Sub push subscription with OIDC auth sends (iss=accounts.google.com, custom aud, email=<SA>, email_verified=true, RS256 + real Google kid).
  • #[ignore]d live test oidc_live_google_token_against_real_jwks in repos/gateway/src/ingress/verify.rs fetches the live JWKS via the gateway's own fetch_google_jwks and validates the token: valid → verified; wrong-aud → oidc_wrong_audience; wrong-SA → oidc_wrong_sa; tampered → oidc_bad_signature. (gateway#30, test-only)
  • Reproducible runner scripts/live_validate_oidc_verify.sh in noetl/e2e mints the token + runs the test; no secret printed or committed. (e2e#50)

Live HTTP gold-standard (kind-noetl). Ran the gateway binary against the in-cluster server (NATS + server port-forwarded) with a registered pubsub_oidc subscription, then POSTed the real token in a Pub/Sub-push body to /ingress/oidcbilling: 4 received → 1 dispatched (valid token → HTTP 202 + one COMPLETED child execution 323957201221718016 on the subscription pool) → 3 rejected (tampered 401, wrong-aud 403, missing 401), zero executions from the rejected deliveries. Metrics confirm: noetl_ingress_dispatched_total{oidcbilling}=1.

GCP hygiene. No cost-bearing resources created (no Pub/Sub topics/subscriptions, no Cloud Run, no buckets) — only an ephemeral token. The scoped roles/iam.serviceAccountTokenCreator binding added to mint the token was removed at the end.

Status. #91 CLOSED. gateway#30 + e2e#50 merged (test-only; gateway stays v3.4.0 — no release cut); ai-meta pointers bumped (gateway f175a87 + e2e f7a24de). Board → Done. Full HTTP run this round also added the wrong-SA negative at the HTTP layer (→ 403 oidc_wrong_sa).


2026-06-12 (#92 — shared noetl-directives crate extracted, gateway de-vendored)

Headline. The header-directive engine is extracted into a standalone, lean noetl-directives crate (serde+thiserror only) so the security-sensitive allowlist (RFC §7.5) has one implementation — the internet-facing noetl-gateway de-vendors its former serde-only copy (src/ingress/directives.rs) and depends on the shared crate instead, eliminating the drift risk.

  • noetl-directives 0.1.0 published to crates.io (new crate). tools v3.8.0 (tools#61) makes repos/tools a workspace (root noetl-tools + directives/ member), re-exports every symbol from tools::source so the worker's call sites are unchanged, and publishes the member before the root in release.yml.
  • gateway v3.4.0 (gateway#29) drops the vendored copy for noetl-directives = "0.1".
  • Scope: extracted noetl-directives (the drift-risk fix); deferred noetl-spool (single consumer → no drift; couples to PolledMessage) — #92 notes it.

Validation. 13 noetl-directives tests + 376 noetl-tools tests + 69 gateway tests green; clippy clean. Gateway stays lean: cargo tree shows no duckdb/kube/tokio-postgres/noetl-tools creep (the whole point). Directive behavior is byte-identical (the engine + its full suite moved verbatim) and was already exercised live by the #93/#94 e2e earlier today. Also pinned time =0.3.47 across tools/worker/gateway (0.3.48 broke async-nats 0.38 with E0119 under rustc 1.92).

Pointers. ai-meta → tools d8bef36 (v3.8.0) + gateway 2c48c26 (v3.4.0). Closes #92.


2026-06-12 (Spool refinements — #94 s3 backend + #93 cross-restart drain SHIPPED, live proof green)

Headline. Two of the three spool/directives refinements spun out of the closed subscription RFC (#90) land, both live-proven on kind:

  • #94 — s3 spool backend. New noetl_tools::spool::S3Backend (hand-rolled AWS SigV4 over reqwest + hmac/sha2 — no AWS SDK; S3/MinIO/R2/B2) + worker wiring (SpoolBackendKind::S3, keychain-auth credential). tools v3.7.1 (tools#58), worker v5.20.0 (worker#80).
  • #93 — cross-restart drain. recv_seq high-water recovery + SpoolRuntime::recover_on_startup: on boot, list the durable spool and auto-drain (closes the gcs/s3 in-memory-circuit gap where a restart mid-outage otherwise forgot the backlog). tools recovery helpers (tools#59) + the same worker PR.
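The s3 backend above is described as hand-rolled SigV4 over hmac/sha2. The core of that scheme is the public AWS signing-key derivation, a chain of four HMAC-SHA256 steps. The sketch below shows only that derivation; it is not the crate's code, and the function names are hypothetical.

```python
import hashlib
import hmac


def _hmac_sha256(key: bytes, msg: str) -> bytes:
    """One HMAC-SHA256 step of the derivation chain."""
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()


def sigv4_signing_key(secret_key: str, date_stamp: str, region: str,
                      service: str = "s3") -> bytes:
    """Derive the SigV4 signing key: date -> region -> service -> 'aws4_request'."""
    k_date = _hmac_sha256(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    k_region = _hmac_sha256(k_date, region)
    k_service = _hmac_sha256(k_region, service)
    return _hmac_sha256(k_service, "aws4_request")


def sigv4_signature(signing_key: bytes, string_to_sign: str) -> str:
    """Hex signature over an already-built string-to-sign."""
    return hmac.new(signing_key, string_to_sign.encode("utf-8"),
                    hashlib.sha256).hexdigest()
```

Because the key is scoped to a date, region, and service, the same derivation applies to any S3-compatible store (MinIO, R2, B2); only the endpoint and region value differ.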

Build unblock. time 0.3.48 (published 2026-06-12) breaks async-nats 0.38 with E0119 under rustc 1.92; it failed the 3.6.0/3.7.0 crate publishes. Pinned time =0.3.47 in tools (tools#60) + worker → noetl-tools 3.7.1 published. Revisit when async-nats ships a fix.

Live proof (kind, MinIO). kind_validate_subscription_spool_s3.sh (e2e#49, MinIO via ops#177): outage → 6 buffered to MinIO (s3 SigV4) → kill+restart runtime → startup auto-drain (subscription.spool.recovered fired) → 6 replayed in order, 6 COMPLETED, idempotent (no dups), no loss. Also: the s3 backend's put/list/get/delete proven directly against MinIO (s3_live test).

Pointers. ai-meta → tools f362aa1 (v3.7.1) + worker 7b8a09a (v5.20.0) + ops 85bfc1f + e2e 1ea7bd0. Closes #94 + #93. Remaining refinement: #92 (shared noetl-directives/noetl-spool crate extraction).

2026-06-12 (RFC #90 Phase 7 SHIPPED — scale hardening; #90 CLOSED — all 7 phases complete, live proof green)

Headline. The subscription/listener RFC's final phase lands and #90 is closed: all seven phases (bounded-drain tool → kind: Subscription continuous runtime + header directives → gateway push-ingress + auth-gated trust → store-and-forward spool + circuit breaker → out-of-cluster Cloud Run + gcs spool → CLI local noetl subscribe → scale hardening) are shipped and live-proven on kind / Cloud Run.

What shipped (Phase 7).

  • server v3.5.0 (server#189, closes server#188) — POST /api/execute/batch (N→N executions in one round-trip, partial-failure contained, reuses the single-execute execute_one path so per-message routing/trace/dedup are intact; server still owns every DB write) + the opt-in exactly-once dedup window (RFC §10 OQ1): noetl.subscription_dedup (idempotent startup DDL, bounded by age, cluster-pool authority), execute takes dedup: { key, window_secs } scoped by parent_execution_id (the subscription), a duplicate within the window collapses to the existing execution + a subscription.message.deduplicated audit event, race-safe via INSERT … ON CONFLICT, default off; validation of the new dispatch.batch_dispatch/batch_max/dedup/limits blocks; noetl_execute_outcomes_total + noetl_execute_batch_size.
  • worker v5.19.0 (worker#79, closes worker#78): batch dispatch (dispatch.batch_dispatch → execute_batch in chunks of batch_max, each item its own playbook/pool/trace/dedup); opt-in dedup stamps the block (idempotency_key→message_id, OQ8); per-subscription rate limits (RFC §9) via a new deterministic token-bucket RateGovernor (src/ratelimit.rs) enforced on the fetch side: over the cap the runtime stops fetching (the source keeps the backlog and redelivers, so no loss) + a subscription.rate_limited event; new batch/rate-limit counters.
  • ops (ops#176) — subscription_scale_hardened.yaml example. e2e (e2e#48) — kind_validate_subscription_scale.sh + 3 fixtures.
  • No tools change → no crate cascade.
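The fetch-side rate limit described above can be pictured as a token bucket consulted before each fetch. This is a minimal sketch under stated assumptions: `TokenBucket` and `drain` are hypothetical names, the clock is injected to keep the behaviour deterministic, and the worker's actual RateGovernor is Rust code whose API is not shown in this log.

```python
class TokenBucket:
    """Deterministic token bucket: the caller supplies the clock reading."""

    def __init__(self, rate_per_sec: float, burst: int, now: float = 0.0):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.last = now

    def allow(self, now: float, requested: int = 1) -> int:
        """Return how many of the requested fetches may proceed at time `now`."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        granted = min(requested, int(self.tokens))
        self.tokens -= granted
        return granted


def drain(backlog: int, bucket: TokenBucket, tick: float = 0.5):
    """Fetch a backlog under the limit; unfetched messages stay at the source."""
    now, dispatched, throttled = 0.0, 0, False
    while dispatched < backlog:
        wanted = backlog - dispatched
        granted = bucket.allow(now, wanted)
        throttled = throttled or granted < wanted
        dispatched += granted
        now += tick
    return dispatched, throttled
```

Throttling on the fetch side means an over-limit message is never pulled and then dropped; it simply waits at the source, which is why the limit can engage without losing messages.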

Live proof (kind, server v3.5.0 + worker v5.19.0).

  • batch — 12 pre-loaded → children=12 completed=12 pooled=12 traced=12 (12→12, all COMPLETED on the subscription pool, per-message traceparent preserved), runtime used execute_batch (server handled 5 batch calls).
  • dedup — a duplicate (same x-idempotency-key) + a distinct key → children=2 deduplicated_events=1 (the dup collapsed to one execution); direct-curl proved within-window→duplicate, outside-window→allowed, dedup-off→no-collapse, batch partial-failure containment.
  • rate-limit: burst of 10 at max_dispatch_per_sec=2 → rate_limited_events=1 children=10 completed=10 (limit engaged, every message became an execution; no loss).
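The dedup-window outcomes listed above (within-window → duplicate, outside-window → allowed, scoped by the subscription) can be sketched with an in-memory map. This is an illustration only: the server keeps this state in noetl.subscription_dedup and relies on INSERT … ON CONFLICT for race safety, whereas `DedupWindow` and `admit` below are hypothetical names with no concurrency handling.

```python
class DedupWindow:
    """In-memory stand-in for a dedup table keyed by (subscription, key)."""

    def __init__(self):
        # (parent_execution_id, key) -> (execution_id, first_seen_seconds)
        self._seen = {}

    def admit(self, parent_execution_id, key, window_secs, new_execution_id, now):
        """Return (execution_id, is_duplicate) for one incoming message."""
        slot = (parent_execution_id, key)
        hit = self._seen.get(slot)
        if hit is not None and now - hit[1] < window_secs:
            # Duplicate inside the window: collapse to the existing execution.
            return hit[0], True
        # First sighting, or the window has elapsed: record a new execution.
        self._seen[slot] = (new_execution_id, now)
        return new_execution_id, False
```

Scoping the key by the parent execution keeps two subscriptions that happen to reuse the same idempotency key from collapsing each other's messages.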

Unit. server batch/dedup/validation tests; worker RateGovernor (throttle→recover no-loss, clamp, combined caps), Phase-7 spec parse, dedup-key resolution, batch client shapes. Full server + worker suites green; clippy clean.

Pointers. ai-meta → server 7b217d8 (v3.5.0) + worker 7531f4a (v5.19.0) + ops 6db69b9 + e2e 203593b.

#90 closed; refinement follow-ups spun out as separate ai-task issues (none a planned phase or a load-bearing gap): #91 (live OIDC signature), #92 (shared noetl-directives/noetl-spool crates), #93 (cross-restart spool drain auto-trigger), #94 (s3 spool backend wiring), tools#57 (real-Pub/Sub pull default).


2026-06-12 (RFC #90 Phase 6 SHIPPED — CLI local noetl subscribe + FileEventSink + local_disk spool, live local proof green)

Headline. noetl subscribe <spec.yaml> runs a kind: Subscription listener standalone in local mode — no Kubernetes, no NATS-dispatch server is required for the listening itself. It reuses the same noetl_tools::tools::source clients + header-directive engine + noetl_tools::spool engine the in-cluster worker runtime uses, and emits the same ExecutorEvent envelope — to a local FileEventSink (one event per line, JSONL) — so a local run produces a replayable event-sourced log identical in shape to the in-cluster / Cloud Run trail. This completes RFC #90 Phases 1–6.

What shipped. cli v4.11.0 (cli#60, closes cli#59) — new src/subscribe/{mod,spec,sink,dispatch,runtime,spool}.rs + examples/subscribe/:

  • sink.rs: FileEventSink (JSONL, flushed per emit → crash-safe trail) implementing the shared noetl_events::EventSink; app-side snowflake id generator (observability.md Principle 3, hostname+pid machine id).
  • dispatch.rs — the local dispatch model (RFC §5.3): LocalDispatcher runs the target playbook in-process via PlaybookRunner (the pure-local default); ServerDispatcher POSTs /api/execute (--dispatch server). Plus the message→workload envelope (mirrors the worker's build_payload).
  • spec.rs — parse a kind: Subscription for local mode; forces the spool backend to local_disk (RFC §8.6) so a spec authored for the in-cluster nats_object/gcs backend runs locally unchanged.
  • spool.rs — local spool runtime over noetl_tools::spool::LocalDiskBackend: circuit breaker + buffer + ordered replay + idempotency + dead-letter; circuit state in a local control/ file; the six spool/circuit events to the FileEventSink; in-process replay.
  • runtime.rs: the continuous drain loop, covering lifecycle (registered/activated/drained/deactivated) + per-message received → playbook.started → completed/failed + directives_applied; --once / --max-messages stop conditions; Ctrl-C/SIGTERM drain.

cli-only. The source clients + spool engine already ship in noetl-tools v3.5.0 (Phases 1–5), so no tools change / crate cascade was needed — the PR bumps the noetl-tools lock 3.0.0 → 3.5.0 (the executor's "3" constraint).

Tests. 12 subscribe unit/integration tests; full bin suite 53 passed; clippy-clean. Includes a deterministic local outage → local_disk spool → ordered replay → idempotency proof exercising the real noetl_tools::spool engine (a TCP downstream probe toggled in-test).

Live proof (local mode, against the in-cluster NATS broker on kind). (1) Drain + in-process dispatch + event-sourced JSONL: created a JetStream stream + durable consumer, published 5 messages → received=5 dispatched=5 failed=0; the JSONL trail held 19 events (lifecycle×4 + received×5 + playbook.started×5 + playbook.completed×5), every line round-tripping as ExecutorEvent. (2) local_disk spool outage → recovery: a tcp downstream probe pointed at a closed port → subscription.circuit.opened → 6 subscription.message.spooled to the local_disk dir (recv_seq-ordered object keys on disk), 0 dispatched (no loss); bind the port (downstream up) → subscription.circuit.closed → subscription.spool.draining → 6 subscription.message.replayed in receive order → spool drained to 0 (pending_spooled=0). The whole outage is reconstructable from the JSONL trail.

Finding. The NATS source connects via async-nats ConnectOptions, which does not honor user:pass embedded in the URL — the spec uses explicit user/password fields (or auth: <alias> + --credential). Documented in the example specs.

Pointers + wiki. ai-meta → cli 2fb3fb0 (v4.11.0). Wiki: new cli page subscribe; umbrella Subscription / Listener Phase 6 row + recent activity. #90 stays open for Phase 7 (scale hardening — batch execute, opt-in dedup window, rate limits — volume-gated).

2026-06-12 (RFC #90 Phase 5 SHIPPED — out-of-cluster Cloud Run target + gcs spool backend, live out-of-cluster proof green)

Headline. The subscription/listener runtime now runs out-of-cluster on Google Cloud Run (RFC §5.2): a Pub/Sub firehose is consumed off-cluster and dispatched to the NoETL server over HTTPS, never entering the cluster network until it is a well-formed execution. The gcs store-and-forward spool backend landed alongside.

What landed.

  • tools v3.5.0 (tools#56, closes tools#55) — noetl_tools::spool::GcsBackend, the GCS impl of the Phase-4 SpoolBackend trait over the JSON API, reusing the existing GcpAuth (ADC) + reqwest (no new dependency); prefix-shared bucket, live+dlq split, recv_seq-ordered keys, idempotent put/delete; gcs feature (default-on). Live GCS round-trip proven.
  • worker v5.18.0 (worker#77, closes worker#76) — spool.backend: gcs wired into the WORKER_MODE=subscription run-loop (ADC/Workload Identity; in-memory circuit out-of-cluster); optional NOETL_INTERNAL_API_TOKEN bearer auth to the control plane; $PORT-aware metrics/health bind (Cloud Run startup probe, no new HTTP code).
  • server v3.4.2 (server#187, closes server#186) — gcs/s3 spool credential made optional (absent → ADC/Workload Identity for the Cloud Run platform bucket; present → tenant-bucket keychain alias); bucket stays required. A real Phase-5 finding: the Phase-4 validation wrongly rejected the Cloud Run bucket.
  • ops (ops#175, closes ops#174) — automation/cloud-run/: least-priv SA + spool bucket + Pub/Sub setup-gcp.sh, Cloud Build + gcloud run deploy deploy.sh (min=1 --no-cpu-throttling singleton), teardown.sh, declarative service.yaml, README.
  • docs (docs#179) — the Cloud Run runtime architecture page.
  • e2e (e2e#47, closes e2e#46) — Pub/Sub-source + gcs-spool fixture + hybrid Cloud Run validation driver.

Live proof (noetl-demo-19700101, server reached via a cloudflared tunnel to the kind cluster). Live: the GCS backend round-trip; the Cloud Run service deploying + running ($PORT health bound, startup probe green); the out-of-cluster runtime activating against the server over HTTPS (register+activate in the event log); spool runtime active backend=gcs; 6/6 Pub/Sub messages → one POST /api/execute each over HTTPS → COMPLETED on the subscription pool; GCS spool under a live outage — killing the tunnel opened the circuit and a message buffered durably to the real GCS bucket (…/spool/000…001-<id>, sha256 + reason=circuit_open). Not auto-triggered: cross-restart GCS drain (the documented in-memory-circuit limitation — drain+replay+idempotency was proven live in Phase 4 with the same engine). Finding: the pubsub source's synchronous pull (emulator-validated in Phase 1) needs timeout_ms ≥ 10s against real Pub/Sub — the 2s NATS default stalls; filed tools#57.

GCP setup. Dedicated least-privilege runtime SA noetl-subscription-runtime (objectAdmin on the one spool bucket + subscriber on the one Pub/Sub subscription; no project-wide roles, no exported key — Workload Identity / ADC). All test resources torn down at the end (Cloud Run service deleted, tunnels killed, spool bucket + Pub/Sub topic/sub + throwaway AR image deleted) — no cost-bearing resources left; the runtime SA kept (free).

Pointer bumps. ai-meta → tools 0f29c57 (v3.5.0) + server 67669ba (v3.4.2) + worker e1a74ce (v5.18.0) + ops 14c5bf1 + docs cb48772 + e2e 99cda2b. #90 stays open (Phases 6–7 remain).


2026-06-12 (RFC #90 Phase 4 SHIPPED — store-and-forward spool + per-downstream circuit breaker, live outage proof green)

Headline. Phase 4 of the subscription/listener RFC (#90) shipped + was live-validated on kind under a simulated outage: when a downstream a subscription depends on goes offline, incoming messages are durably buffered (buffer_and_ack) and replayed in order on recovery — proven no data loss.

What landed.

  • tools v3.4.0 (tools#54, closes tools#53) — noetl_tools::spool: a pure per-downstream circuit breaker (trip-after-N / half-open probe / close; NATS-KV-serializable; one breaker per declared downstream → resolves OQ2), the SpoolItem envelope (SHA-256 + noetl://spool/<sub>/<recv_seq>/<id> ref + recv_seq-ordered object keys so a lexical list == receive order), a SpoolBackend trait + nats_object + local_disk backends, and the engine (ordering global/per_key/none + idempotency + poison→dead-letter + retention max_age/max_bytes/on_full + GC) + http/tcp/nats probes. 44 unit tests under simulated outage + a real-NATS nats_object integration test.
  • worker v5.17.0 (worker#75, closes worker#74) — wires the spool into the WORKER_MODE=subscription run-loop: probe→circuit→spool-or-dispatch→ack, NATS-KV circuit persistence (survives a restart mid-outage), drain-on-recovery, 6 spool/circuit events + noetl_subscription_spool_bytes gauge.
  • server v3.4.1 (server#184 + server#185, closes server#183) — spool: block validation + the lifecycle-status fix (spool/circuit events share the subscription's execution_id but must not corrupt its lifecycle status; surfaced + fixed during live validation when an open circuit 500'd activate).
  • ops (ops#173) — toggleable spool-downstream-echo + runtime NATS env. e2e (e2e#44 + e2e#45) — kind_validate_subscription_spool.sh.
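The per-downstream breaker described above (trip-after-N, half-open probe, close, serializable for persistence) can be sketched as a small state machine. The transition rules here are an assumption made for illustration: a successful probe while open moves to half-open, and a second success closes. `CircuitBreaker` and `route` are hypothetical names, and JSON stands in for whatever encoding the NATS-KV persistence uses.

```python
import json


class CircuitBreaker:
    """Three-state breaker: closed -> open -> half_open -> closed."""

    def __init__(self, trip_after: int, failures: int = 0, state: str = "closed"):
        self.trip_after = trip_after
        self.failures = failures
        self.state = state

    def record_probe(self, ok: bool) -> str:
        """Feed one downstream probe result and return the new state."""
        if self.state == "closed":
            self.failures = 0 if ok else self.failures + 1
            if self.failures >= self.trip_after:
                self.state = "open"
        elif self.state == "open":
            if ok:
                self.state = "half_open"
        else:  # half_open: one more success closes, a failure re-opens
            self.state = "closed" if ok else "open"
            if ok:
                self.failures = 0
        return self.state

    def to_json(self) -> str:
        """Serialize so the state can survive a restart mid-outage."""
        return json.dumps({"trip_after": self.trip_after,
                           "failures": self.failures,
                           "state": self.state})

    @classmethod
    def from_json(cls, raw: str) -> "CircuitBreaker":
        return cls(**json.loads(raw))


def route(breaker: CircuitBreaker) -> str:
    """Spool while the circuit is not closed; dispatch otherwise."""
    return "dispatch" if breaker.state == "closed" else "spool"
```

Keeping the breaker pure (no I/O, state in, state out) is what makes it cheap to persist and to test under a simulated outage.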

Live proof (kind, 6 messages). Scale the downstream to 0 → subscription.circuit.opened → publish 6 → 6 subscription.message.spooled (recv_seq 1-6, each with the noetl://spool ref + sha256), 0 dispatched while open (no loss); scale back to 1 → subscription.circuit.closed + subscription.spool.draining → 6 subscription.message.replayed → 6 child executions COMPLETED on the subscription pool → spool drained to 0 → exactly 6 distinct children (idempotency held). The entire outage is reconstructable from noetl.event.
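Ordered replay depends on the property that a lexical listing of spool objects equals receive order. One common way to get that property is zero-padding the sequence number in the object key. The sketch below assumes a hypothetical key layout and a padding width of 20; the crate's real key format is not spelled out in this log.

```python
def spool_key(subscription: str, recv_seq: int, message_id: str,
              width: int = 20) -> str:
    """Build an object key whose lexical order matches recv_seq order."""
    return f"{subscription}/spool/{recv_seq:0{width}d}-{message_id}"


def replay_order(keys):
    """A plain lexical sort is enough to recover receive order."""
    return sorted(keys)
```

Without the padding, a key for sequence 10 would sort before the key for sequence 2, and a backend that only offers lexical listing would replay out of order.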

Decisions / deferred. OQ2 per-downstream scope; OQ3 immediate-GC + max_bytes ceiling + gauge; OQ8 idempotency_key wins over message_id; OQ14 ack-after-dispatch stop-ack tracked (buffer_and_ack/hybrid loss-safe). Deferred/tracked: gcs/s3 backends (same trait), gateway edge spool + shared noetl-directives/noetl-spool crate, hybrid stop-ack-blip optimisation.

Pointers. ai-meta → tools 02110a5 (v3.4.0) + server 51dc0d1 (v3.4.1) + worker 65fb27d (v5.17.0) + ops ab9af34 + e2e 2d0ad0a. #90 stays open (Phases 5–7 remain).


2026-06-11 (RFC #90 Phase 3 SHIPPED — gateway push-ingress (Mode C) + auth-gated directive trust, live E2E green)

Headline. Phase 3 of the subscription/listener RFC (#90) shipped: the gateway gains POST /ingress/{listener} — it terminates untrusted webhook / Pub-Sub-push traffic as a verify-and-forward gatekeeper (no DB on the ingress path), and the auth-gated directive trust the Phase-2 directive engine was designed for lands. The gateway verifies a delivery (HMAC / bearer / Pub-Sub OIDC, secret resolved from the Secrets Wallet by alias) and only then applies the header directives and forwards one POST /api/execute per delivery on the dedicated pool. The auth gate is a structural invariant (verify_then_plan): a failed verification yields no dispatch plan, so an unauthenticated caller can never drive routing (RFC §7.5).

What shipped.

  • noetl-gateway v3.3.0 (gateway#28, closes gateway#27) — src/ingress/: verify.rs (HMAC-SHA256 over the raw body, constant-time; bearer, constant-time; Google Pub/Sub OIDC — RS256 vs Google JWKS, aud+email/service_account+email_verified+exp); directives.rs (serde-only vendored port of the tools v3.3.0 engine — the internet-facing edge must not pull duckdb/kube); mod.rs (verify_then_plan fuses verify + directive resolution; Pub/Sub-push envelope unwrap → attributes channel; auth headers stripped from the forwarded workload; first /metrics surface). 25 ingress + verify unit tests incl. every negative + directives_applied_only_after_verification_passes.
  • noetl-server v3.3.0 (server#182, closes server#181) — push catalog validation (ingress.verify required, none rejected) + GET /api/internal/ingress/{listener} (service-account-gated) resolving the verify-secret alias via the Wallet + idempotent subscription registration. subscription::ensure_registered extracted + reused. 9 push-validation unit tests.
  • ops (ops#172) — gateway NOETL_INTERNAL_API_TOKEN env. e2e (e2e#43) — kind_validate_subscription_push.sh + HMAC/bearer push fixtures.
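The verify-then-plan invariant above (a failed verification yields no dispatch plan, so header directives from an unauthenticated caller are never read) can be sketched for the HMAC case. This is a sketch under assumptions: the `x-signature` header name, the hex encoding, and the plan shape are hypothetical; `x-noetl-route` is the redirect header named in this log.

```python
import hashlib
import hmac


def verify_hmac(secret: bytes, raw_body: bytes, signature_hex: str) -> bool:
    """HMAC-SHA256 over the raw body, compared in constant time."""
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected.encode("utf-8"),
                               signature_hex.encode("utf-8"))


def verify_then_plan(secret, raw_body, headers, default_playbook, allowed_routes):
    """Return a dispatch plan, or None when verification fails."""
    if not verify_hmac(secret, raw_body, headers.get("x-signature", "")):
        return None  # no plan: directives are never inspected
    route = headers.get("x-noetl-route")
    playbook = route if route in allowed_routes else default_playbook
    return {"playbook": playbook, "redirected": playbook != default_playbook}
```

Fusing the two steps into one function makes the ordering structural: there is no code path that resolves a route without a passing signature check first.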

Validation (live on kind). Built + loaded + rolled server v3.3.0 (+ internal token) and gateway v3.3.0 (ns gateway, matching token); deployed the dedicated subscription pool. kind_validate_subscription_push.sh: HMAC 12/12 + bearer 12/12 green — N signed deliveries → one execution per delivery on the subscription pool → COMPLETED; allowlisted x-noetl-route redirect honored only after verification; the auth gate — a tampered (bad-signature) and an unsigned/unauth delivery (both carrying the redirect header) → 401, no execution, no directive applied. Pub/Sub-push path proven live (bearer-verified pubsub-source subscription, base64 message.data decoded → order_id, redirect via the attribute x-noetl-route). OIDC signature path unit-proven (every negative: bad-sig / expired / wrong-aud / wrong-SA / unknown-kid).

Pointers. ai-meta → server fa1ff3f (v3.3.0) + gateway 38f024b (v3.3.0) + ops 54f2d65 + e2e 1421267 + gateway-wiki (push-ingress page). #90 stays open — Phases 4–7 (spool, Cloud Run, CLI local, scale-hardening) remain. Fast-follow tracked: extract a lean shared noetl-directives crate so the gateway + tools consume one engine instead of the vendored copy.


2026-06-11 (RFC #90 Phase 2 SHIPPED — kind:Subscription + continuous runtime + header-directive engine, live E2E green)

Headline. Phase 2 of the subscription/listener RFC (#90) shipped across five repos and validated live end-to-end on kind (13/13 assertions). kind: Subscription is now a first-class catalog type; the continuous listener runtime (Mode B) turns each received message into one execution on a dedicated pool segment; and the header-directive engine (redirect / pool / idempotency / content + W3C trace, untrusted by default) lands.

What shipped.

  • tools v3.3.0 (tools#52, closes tools#51) adds the source/directives.rs header-directive engine (DirectiveSpec → DispatchPlan: allowlisted redirect/pool/priority/idempotency/content + W3C trace extraction; value-allowlists enforced at parse; multi-value last-wins; applied[] audit) + public build_source(cfg, ctx) factory. 12 new tests.
  • server v3.2.0 (server#180, closes server#179) adds kind: Subscription catalog validation (source/mode/dispatch, no step-DAG); event-sourced lifecycle endpoints /api/subscriptions (register→activate→pause/resume→drain→deactivate, idempotent register, GET list/get); execution_pool override on /api/execute → noetl.commands.<pool>.<eid> across the whole execution (persisted in playbook_started meta, orchestrator reads back); W3C trace into meta.trace + command notification + child inheritance.
  • worker v5.16.0 (worker#73, closes worker#72) — WORKER_MODE=subscription continuous runtime: build SourceClient via the tools factory, register+activate, loop poll()→one POST /api/execute per message on the dedicated pool, apply directives + emit subscription.message.directives_applied, drain+deactivate on SIGTERM. Observability triad (noetl_subscription_* counters).
  • ops (ops#171) — dedicated noetl-worker-rust-subscription-pool (filter noetl.commands.subscription.>) + noetl-subscription-runtime (Recreate strategy) + KEDA scaler.
  • e2e (e2e#42) — kind_validate_subscription_runtime.sh + a kind: Subscription fixture + two target playbooks + the NATS credential.
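Two directive-engine rules named above, multi-value last-wins and the applied[] audit, can be sketched as a single resolution pass. This is a minimal sketch, assuming a hypothetical spec shape of header name → allowed values; `x-noetl-pool` is an invented header used only for the example, and the real engine's types (DirectiveSpec, DispatchPlan) are not reproduced here.

```python
def resolve_directives(headers, spec):
    """Resolve header directives against an allowlist.

    headers: list of (name, value) pairs in arrival order.
    spec:    {lowercase_header_name: set_of_allowed_values}.
    Returns (plan, applied) where applied is the audit trail.
    """
    last = {}
    for name, value in headers:
        last[name.lower()] = value  # multi-value headers: the last one wins

    plan, applied = {}, []
    for name, allowed in spec.items():
        value = last.get(name)
        if value is not None and value in allowed:
            plan[name] = value
            applied.append({"header": name, "value": value})
    return plan, applied
```

Headers outside the spec, and values outside a header's allowlist, are ignored rather than rejected, so an unknown directive cannot change routing.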

Live E2E (kind). 6 NATS messages → 6 child executions, all COMPLETED on the dedicated subscription pool (playbook_started.meta.execution_pool=subscription); 2 header-redirected (x-noetl-route → a different allowlisted playbook), 4 default; W3C traceparent propagated into all 6 children's meta.trace; 6 directives_applied audit events; full lifecycle registered→activated→paused→resumed→draining→deactivated event-logged; invalid transitions rejected (422). 13/13 assertions PASS.

Three integration gaps the E2E surfaced + fixed in-PR (same pattern Phase 1 hit): (1) noetl.catalog.kind → noetl.resource(name) FK lacked a subscription row → server startup seed (ensure_builtin_kinds); (2) noetl.event.created_at is TIMESTAMP but was decoded as TIMESTAMPTZ → NaiveDateTime; (3) the runtime drained only on SIGINT but K8s sends SIGTERM → select on both. Plus idempotent register (reuse per path) + Recreate strategy (singleton drain handoff).

Decisions. OQ1 resolved — new WORKER_MODE=subscription run-mode of the worker binary (not a new artifact). OQ7 resolved — explicit dispatch.execution_pool wins over a priority map; multi-value headers last-wins. CLI parse deferred to Phase 6 (the noetl subscribe local mode owns it; Phase-2 registration is a thin server POST). Ack-after-dispatch + durable spool is Phase 4; gateway push (Mode C) is Phase 3.

Pointers. ai-meta → server ebd2944 (v3.2.0) + worker 1f74992 (v5.16.0) + tools 4995692 (v3.3.0) + ops 242e420 + e2e 32df918. Cluster left on the clean :dev stack (brokers kept). #90 stays open (Phases 3–7).


2026-06-11 (RFC #90 — Pub/Sub + Kafka brought to live-E2E parity with NATS)

Headline. Closed the Phase-1 validation gap: the subscription tool's Pub/Sub-pull and Kafka-poll backends were proven live end-to-end on kind, the same bar NATS met. (Phase 1 had proven NATS live; Pub/Sub was emulator-gated unit-only and Kafka was adapter-level.)

What landed.

  • Brokers in kind (ops#170) — ci/manifests/pubsub-emulator/ (gcloud SDK :emulators image, gcloud beta emulators pubsub start on 8085) and ci/manifests/kafka/ (single-broker KRaft apache/kafka:3.9.1, advertised on the in-cluster Service DNS; retired bitnami images avoided). Same PR fixes the subscription_drain.yaml example's .output. accessor.
  • E2E fixtures + runners (e2e#41) — subscription_{pubsub,kafka}_drain.yaml + {pubsub,kafka}_e2e.json.example credential aliases (cluster DNS only, no secret) + scripts/kind_validate_subscription_{pubsub,kafka}.sh. The runners provision the broker (topic/sub + publish/produce N), register + execute the playbook (--set unique names per run), then assert source/count=N/acked=true on the drain result, execution COMPLETED, and the event trail.

Live results (server v3.1.0 + worker v5.15.2 + tools v3.2.0 on kind):

| Backend | Result |
| --- | --- |
| Pub/Sub | publish 5 → drain count=5 acked=true → COMPLETED → call.done/command.completed/playbook.completed ✓ |
| Kafka | produce 5 → drain count=5 acked=true → COMPLETED → same trail ✓ |

Re-ran at N=4 against the final committed form — both green.

Adapter fixes. None needed — both backends worked as-is against real brokers (the pure-Rust kafka crate speaks to Kafka 3.9 KRaft; the Pub/Sub REST backend works against the emulator; the worker's v5.15.2 nats|pubsub|kafka credential arm merged endpoint/brokers correctly). The only bug was a playbook accessor: {{ <step>.output.<field> }} never resolved (both when: arcs evaluated false → the drain stalled after the subscription step), corrected to {{ <step>.<field> }} in both fixtures and the latent ops example.

Cluster note. Rebuilt server :dev from the v3.1.0 tree (the running image had been an earlier :dev reporting 3.0.6, pre the version-bump commit) + worker :dev from v5.15.2; rolled both. Cluster left on this clean released stack; Pub/Sub-emulator + Kafka brokers left deployed as first-class e2e infra.

Pointers. ai-meta → ops 568a4ac + e2e 8d21e7a. Wikis: tools SubscriptionTool (new Live validation section), this log, Home, Releases, Umbrella: Subscription / Listener. Board 3: #90 stays In progress (Phases 2–7 remain). Standing direction honored — Claude authored the ops/e2e changes directly.

2026-06-11 (RFC #90 Phase 1 shipped — bounded-drain subscription tool + source-client abstraction)

Headline. Shipped Phase 1 (Mode A — bounded drain) of the subscription/listener RFC: a new atomic subscription registry tool plus the reusable source-client abstraction the later phases build on. First feature code under #90.

What landed.

  • noetl-tools v3.2.0 (tools#50, closes tools#49). New tool kind subscription (operation: poll) — a bounded drain that fetches up to batch / until empty / until timeout_ms (both hard-capped, so the worker slot is never held), acks per policy, and returns the normalized batch.
  • Source-client abstraction (src/tools/source/): the SourceClient trait (poll(&PollOptions) -> PollOutcome), PolledMessage / AckMode types, and shared decode_payload / normalize_headers (RFC §7.1). This is the deliverable later phases reuse — a continuous runtime calls poll in a loop; a gateway push reuses the normalizers.
  • Three backends. NATS (refactors js_consume into the shared drain_pull_consumer; the nats tool now delegates to it), Pub/Sub pull (REST pull+acknowledge via gcp_auth, emulator support, feature pubsub), Kafka poll (pure-Rust kafka crate, feature kafka; Phase-1 limits documented). Worker dispatches via the generic registry — no dispatch-match change.
  • ops example (ops#169): playbooks/examples/subscription_drain.yaml — a scheduled bounded NATS drain.
  • Wiki: SubscriptionTool page on the tools wiki; RFC umbrella marked Phase 1 shipped.
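The bounded-drain contract described above (stop at batch, on empty, or on timeout_ms, with both limits hard-capped) can be sketched roughly as follows. This is a minimal illustration, not the noetl-tools code: the `Source` trait, `InMemorySource`, `StopReason`, and the two cap constants are hypothetical stand-ins, and the real `SourceClient` / `PollOptions` / `PollOutcome` types may be shaped differently.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

// Assumed hard caps for illustration only; the real limits are not stated in this log.
const MAX_BATCH: usize = 1000;
const MAX_TIMEOUT_MS: u64 = 30_000;

#[derive(Debug, PartialEq)]
enum StopReason {
    BatchFull,
    Empty,
    Timeout,
}

// Hypothetical stand-in for a source client.
trait Source {
    /// Fetch at most `max` messages; an empty Vec means the source is drained.
    fn fetch(&mut self, max: usize) -> Vec<String>;
}

struct InMemorySource {
    queue: VecDeque<String>,
}

impl Source for InMemorySource {
    fn fetch(&mut self, max: usize) -> Vec<String> {
        let n = max.min(self.queue.len());
        self.queue.drain(..n).collect()
    }
}

/// Drain until the batch is full, the source is empty, or the deadline passes.
/// Both limits are clamped so the caller can never hold the slot indefinitely.
fn bounded_drain<S: Source>(src: &mut S, batch: usize, timeout_ms: u64) -> (Vec<String>, StopReason) {
    let batch = batch.min(MAX_BATCH);
    let deadline = Instant::now() + Duration::from_millis(timeout_ms.min(MAX_TIMEOUT_MS));
    let mut out = Vec::new();
    loop {
        if out.len() >= batch {
            return (out, StopReason::BatchFull);
        }
        if Instant::now() >= deadline {
            return (out, StopReason::Timeout);
        }
        let got = src.fetch(batch - out.len());
        if got.is_empty() {
            return (out, StopReason::Empty);
        }
        out.extend(got);
    }
}

fn main() {
    let mut src = InMemorySource {
        queue: (1..=5).map(|i| format!("msg-{i}")).collect(),
    };
    let (first, why) = bounded_drain(&mut src, 3, 1_000);
    println!("{} messages, stopped: {:?}", first.len(), why);
    let (rest, why) = bounded_drain(&mut src, 10, 1_000);
    println!("{} messages, stopped: {:?}", rest.len(), why);
}
```

A continuous runtime (Phase 2 onward) would presumably call the same drain in a loop, which is why the loop body holds no state beyond the current batch.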

Validation. 323 lib tests + clippy -D warnings clean across --no-default-features / --features pubsub / --all-features. NATS poll path validated live against the in-cluster NATS JetStream broker (port-forward; create stream → publish 3 → drain → ack → second drain returns 0). Pub/Sub emulator path is emulator-gated; Kafka is adapter/unit level (no broker stood up this session).

Full in-cluster playbook-dispatch E2E — green. Built worker + server images carrying the tool, loaded into kind, rolled, ran examples/subscription_e2e: subscription poll drained count=5, acked=true, the NATS consumer went 5→0 pending, the execution reached COMPLETED with call.done/command.completed/playbook.completed in the event log. The E2E surfaced two integration gaps the unit + live-NATS tests could not — both fixed and re-validated end-to-end:

  1. server v3.1.0 (server#178) — the orchestrator validates every step's tool.kind against a typed ToolKind enum, so kind: subscription was rejected at POST /api/execute with HTTP 400 until the Subscription variant was added.
  2. worker v5.15.2 (worker#71) — apply_credential only knew postgres/bearer/api_key/basic; a type-nats credential errored ("unsupported type 'nats'"), so the no-default-connection auth: alias pattern failed for the nats + subscription tools. Fixed by merging the credential's connection fields into the tool config.
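The first gap is easy to picture with a typed-enum sketch: a kind string that has no matching variant fails validation before anything is dispatched. The variant list and the `validate_step_kinds` helper below are illustrative assumptions, not the server's actual `ToolKind` definition or its error format.

```rust
use std::str::FromStr;

// Illustrative subset of tool kinds; the server's real enum has more variants.
#[derive(Debug, PartialEq)]
enum ToolKind {
    Http,
    Postgres,
    Python,
    Nats,
    Subscription, // without this variant, `kind: subscription` cannot parse
}

impl FromStr for ToolKind {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "http" => Ok(ToolKind::Http),
            "postgres" => Ok(ToolKind::Postgres),
            "python" => Ok(ToolKind::Python),
            "nats" => Ok(ToolKind::Nats),
            "subscription" => Ok(ToolKind::Subscription),
            other => Err(format!("unsupported tool kind '{other}'")),
        }
    }
}

/// Hypothetical helper: validate every step's kind, failing on the first unknown one.
fn validate_step_kinds(kinds: &[&str]) -> Result<Vec<ToolKind>, String> {
    kinds.iter().map(|k| k.parse::<ToolKind>()).collect()
}

fn main() {
    println!("{:?}", validate_step_kinds(&["http", "subscription"]));
    println!("{:?}", validate_step_kinds(&["http", "bogus"]));
}
```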

Also re-confirmed the local dev wallet's ephemeral KEK behaviour (a server roll re-keys, so playbook credentials must be re-registered after — per the roll-staleness note). Cluster restored to a clean :dev build (validated images re-tagged; re-smoke green: drain count=2 acked=true → COMPLETED).

Pointers. ai-meta pointer bumps: tools → v3.2.0, worker → v3.2.0 (#70) then v5.15.2 (#71), server → v3.1.0, ops → subscription example, ai-meta-wiki + noetl-tools-wiki. #90 stays open (umbrella; Phases 2–7 remain). Board: #90 → In progress.

2026-06-11 (RFC #90 → v3 — header / attribute directive layer: redirect, pool routing, W3C trace)

Headline. Revised the subscription RFC to v3 (new §7). Still design-only.

Header-as-instruction, configurable + allowlisted. Each source's metadata channel — Pub/Sub attributes, Kafka record headers, NATS headers, HTTP headers (webhook/push) — is normalized into one uniform message.headers map so playbooks + the runtime see the same shape regardless of source. An opt-in headers.directives allowlist in the kind: Subscription spec declares which keys act as instructions:

  • redirect (dispatch.playbook) — run a different target playbook than the subscription default;
  • pool routing (dispatch.execution_pool) — land the run on a different worker pool / command segment (noetl.commands.<override>, the pool_segment seam at repos/server/src/handlers/execute.rs:705-709);
  • priority (map→pool), idempotency_key (feeds dedup + spool key), content_type/schema_hint;
  • W3C trace / mesh propagation — there is no trace propagation today (only request_id at gateway/src/sse.rs:266 + execution_id); the layer extracts traceparent/tracestate/allowlisted-baggage on ingest, stamps it into event meta.trace (execute.rs:630 threads meta) + span, and propagates it on the command.issued NATS message and into child executions — so a message is traceable upstream-mesh → gateway/runtime → execution → child runs → event log.

Security — untrusted by default. Only allowlisted keys are honored as instructions (rest are data); allowed:/map: constrain even allowlisted headers so they can't pick an arbitrary target; push/webhook directive trust is gated on auth — directives are parsed only after HMAC/bearer/OIDC verification (§6 flow step 4 after step 3), so an unauthenticated caller can't drive routing; a header may name a credential alias only if the alias is itself allowlisted (Secrets Wallet boundary). Applied directives event-logged (subscription.message.directives_applied), honored across in-cluster / Cloud Run (post-auth) / CLI-local.
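A rough sketch of the allowlist idea, under stated assumptions: the header names (`x-noetl-playbook`, `x-noetl-pool`), the `DirectiveRule` / `Target` / `Dispatch` types, and the single `authenticated` flag are all hypothetical, chosen only to show the three gates described above (auth first, allowlisted key second, allowed value third). The RFC's actual spec shape is not reproduced here.

```rust
use std::collections::HashMap;

// Which dispatch field an allowlisted header is permitted to set (illustrative).
enum Target {
    Playbook,
    ExecutionPool,
}

// Hypothetical rule: one allowlisted header key plus the values it may carry.
struct DirectiveRule {
    target: Target,
    allowed: Vec<String>,
}

#[derive(Debug, Default, PartialEq)]
struct Dispatch {
    playbook: Option<String>,
    execution_pool: Option<String>,
}

fn apply_directives(
    headers: &HashMap<String, String>,
    rules: &HashMap<String, DirectiveRule>,
    authenticated: bool,
) -> Dispatch {
    let mut dispatch = Dispatch::default();
    // Gate 1: directives are only parsed after auth has succeeded.
    if !authenticated {
        return dispatch;
    }
    for (key, value) in headers {
        // Gate 2: keys that are not allowlisted stay plain data.
        let Some(rule) = rules.get(key) else { continue };
        // Gate 3: even an allowlisted key cannot pick an arbitrary value.
        if !rule.allowed.iter().any(|a| a == value) {
            continue;
        }
        match rule.target {
            Target::Playbook => dispatch.playbook = Some(value.clone()),
            Target::ExecutionPool => dispatch.execution_pool = Some(value.clone()),
        }
    }
    dispatch
}

fn main() {
    let mut rules = HashMap::new();
    rules.insert(
        "x-noetl-pool".to_string(),
        DirectiveRule { target: Target::ExecutionPool, allowed: vec!["iot".to_string()] },
    );
    rules.insert(
        "x-noetl-playbook".to_string(),
        DirectiveRule { target: Target::Playbook, allowed: vec!["ingest/default".to_string()] },
    );

    let mut headers = HashMap::new();
    headers.insert("x-noetl-pool".to_string(), "iot".to_string());
    headers.insert("x-noetl-playbook".to_string(), "admin/wipe".to_string()); // value not allowed

    println!("{:?}", apply_directives(&headers, &rules, true));
    println!("{:?}", apply_directives(&headers, &rules, false));
}
```

The sketch sidesteps the conflict case where two headers target the same field, which is the precedence question the RFC leaves open.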

Plan. No new phase — the directive engine (normalization + allowlist + redirect/pool/idempotency/content + W3C trace into execution) lands in Phase 2 (dispatch-layer); auth-gated push directive trust in Phase 3 (gateway ingress). New open questions OQ7–OQ10 (directive precedence/conflict, idempotency-key vs message_id, header-chosen credentials [default off, needs security review], trace depth). Issue #90 body + a v3 delta comment updated. No pointer bump beyond the wiki.


2026-06-11 (RFC #90 → v2 — first-class kind: Subscription type + store-and-forward spool)

Headline. Revised the subscription RFC to v2 with two review refinements. Still design-only.

1 — kind: Subscription first-class catalog type. Resolves the old open question (distinct kind vs trigger: block) in favour of a dedicated catalog type registered alongside kind: Playbook. It completely isolates the class: own type + validation, own dedicated runtime + worker pool + command segment (noetl.commands.subscription/ iot.*), own lifecycle (register → activate → pause/resume → drain → deactivate, each event-logged), own KEDA scaling on source backlog. The three prongs (bounded-drain A / continuous runtime B / gateway push C) become activation modes of this one type. A trigger: flag would entangle subscription lifecycle into the step orchestrator and default the firehose onto the shared stream; a distinct type keeps isolation both schematic and operational.

2 — configurable store-and-forward spool (RFC §7). When a downstream (target storage / DB / produced-to Pub/Sub-Kafka topic) is unavailable, a circuit breaker trips and incoming messages accumulate in a fallback buffer, replayed in order on recovery. The durability tradeoff is the explicit spool.mode knob — off (stop-acking, the source is the buffer; cheapest but bounded by source retention + useless for non-redelivering webhooks), buffer_and_ack (write-to-durable-store then ack; survives arbitrary outages + push sources, costs a write/msg), hybrid (escalate). Backends reuse existing object-store primitives — gcs/s3 via the Result-Store noetl:// payload-ref pattern (repos/server/src/services/result_store.rs, repos/tools/src/tools/artifact.rs), nats_object via the NATS Object Store ops already in the nats tool, local_disk for CLI. Ordering (per_key lanes) + idempotency on message_id + poison→dead-letter + retention/drain policy. Fully event-logged (subscription.message.spooled / circuit.opened / circuit.closed / spool.draining / message.replayed / message.dead_lettered) so an entire outage is replayable from the log; payload bytes → tenant object store (external, keychain-auth per data-access-boundary), metadata + ref + SHA-256 → the event log. Works across in-cluster (NATS-KV circuit state) / Cloud Run (HTTPS flow-back) / CLI-local (file).
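The buffer-then-replay behaviour can be illustrated with a small in-memory sketch. Everything here is an assumption made for illustration: the `Spool` type, the `downstream_up` flag standing in for a real write attempt, and the in-memory `VecDeque` standing in for a durable backend. It shows only the ordering guarantee (spooled messages replay in arrival order before newer traffic), not durability, dead-lettering, or per-key lanes.

```rust
use std::collections::VecDeque;

// Hypothetical in-memory stand-in for a store-and-forward spool.
struct Spool {
    circuit_open: bool,
    buffer: VecDeque<String>,
    delivered: Vec<String>,
}

impl Spool {
    fn new() -> Self {
        Spool { circuit_open: false, buffer: VecDeque::new(), delivered: Vec::new() }
    }

    /// `downstream_up` stands in for the outcome of a real downstream write.
    fn handle(&mut self, msg: &str, downstream_up: bool) {
        if !downstream_up {
            // Downstream unavailable: trip the breaker and spool the message.
            self.circuit_open = true;
            self.buffer.push_back(msg.to_string());
            return;
        }
        if self.circuit_open {
            // Recovery: replay spooled messages in arrival order first.
            while let Some(old) = self.buffer.pop_front() {
                self.delivered.push(old);
            }
            self.circuit_open = false;
        }
        self.delivered.push(msg.to_string());
    }
}

fn main() {
    let mut spool = Spool::new();
    spool.handle("a", true);
    spool.handle("b", false); // outage starts
    spool.handle("c", false);
    spool.handle("d", true); // recovery: b and c replay before d
    println!("delivered = {:?}", spool.delivered);
    println!("circuit_open = {}, spooled = {}", spool.circuit_open, spool.buffer.len());
}
```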

Plan re-cut. 7 phases — Phase 1 unchanged (bounded-drain tool); kind: Subscription + runtime = Phase 2; gateway push = Phase 3; spool = Phase 4 (right after push, where buffer_and_ack is non-optional); Cloud Run 5; CLI local 6; scale-hardening 7. Issue #90 body + a v2 delta comment updated. No pointer bump beyond the wiki.


2026-06-11 (RFC filed — subscription / listener tool for Pub/Sub, NATS, Kafka, webhooks — #90, design only)

Headline. Filed #90 (ai-task, repo:tools, board 3 Todo) with a full RFC on the wiki: Umbrella-Subscription-Listener. Design deliverable — no feature code.

The tension. A listener is inherently long-lived; a worker tool is inherently atomic. The nats tool already encodes this — js_consume is a bounded pull, "not subscriptions," because a long-lived subscription would hold a worker slot and violate the execution model (repos/tools/src/tools/nats.rs:8-20). The RFC resolves it by splitting listening from processing across three prongs:

  • A — bounded-drain tool (tool: subscription, op poll): atomic, a registry tool like the other 18, a generalized js_consume across Pub/Sub-pull / NATS / Kafka. Reuses the worker model wholesale, no new runtime — ships first (Phase 1).
  • B — listener runtime: a long-lived ingress host that turns each message into a normal POST /api/execute. Runs on an in-cluster KEDA-scaled dedicated pool or out-of-cluster on Cloud Run. IoT firehose isolated on a dedicated iot command segment + worker pool so the shared stream never degrades (req. 3).
  • C — gateway push-ingress: /ingress/{listener} for webhooks + Pub/Sub-push; verifies HMAC / bearer / Pub/Sub-OIDC; secrets from the Secrets Wallet by alias, never gateway env. Verify-and-forward; never touches domain data.

Traceability. Same event envelope in local (CLI noetl listen + a FileEventSink reusing the executor's pluggable EventSink), in-cluster, and out-of-cluster (Cloud Run emits via HTTPS, holds no DB connection per data-access-boundary). The event log is the single traceability story regardless of where the listener ran.

Grounding. Built on a parallel code study citing real files: tool trait/registry (repos/tools/src/registry.rs:132, tools/mod.rs:56-79), worker pull loop + pending_callback async path (worker/src/worker.rs:234, executor/command.rs:205-457), container callback precedent (server/.../container_callback.rs), command publish + pool routing (server/.../execute.rs:684-742), KEDA scalers (ops/ci/manifests/keda/scaledobject-worker-*.yaml), gateway auth + callback ingress (gateway/src/main.rs:230-288, auth/middleware.rs), Secrets Wallet envelope (server/src/crypto/envelope.rs), CLI local mode + EventSink (cli/src/playbook_runner.rs, cli/executor/src/events.rs).

Plan. 6 phases (tools → worker/server → gateway → cloud-run → cli → scale-hardening). Reuses #46 system-pool primitives, #61 Wallet, #49 server API surface. Next: review the three-prong model + confirm Phase 1 scope. No pointer bump (wiki-only + ai-task issue).


2026-06-11 (#89 shipped — JSON null round-trips through {{ step }}; root cause was the server renderer, not the worker — server v3.0.6)

Headline. Closed #89, the null-serialization bug #88 surfaced. The cursor pagination fixture walked all 4 fetch pages but its 4th check_pagination Python step crashed on the terminal page: the API returns next_cursor: null, and re-injecting the whole {{ fetch_page }} envelope into the next step's input rendered that field as the JS token undefined — invalid JSON — so the consuming step received response as a raw str and died with AttributeError: 'str' object has no attribute 'get'.

Root cause — server, not worker. The issue hypothesized the worker. Tracing the corrupt command.issued args.response showed it's emitted by the server orchestrator, which renders next-step inputs via repos/server/src/template/jinja.rs::render_to_value. Two facts combined: json_value_to_minijinja maps a JSON null to Value::UNDEFINED, and minijinja renders a map with Python-style repr, so the field surfaces as the bare token undefined. render_to_value then failed serde_json::from_str and fell through to returning the entire envelope as a string. The noetl-tools TemplateEngine::render_value already had a | tojson retry for exactly this case; the server's renderer was a divergent copy that lacked it.

Fix. server#177 (v3.0.6) adds the | tojson retry to render_to_value: a lone {{ expr }} whose plain render is container-shaped-but-invalid JSON re-renders with | tojson, and minijinja_to_json maps undefined/none → JSON null, so a null field round-trips as null and the downstream step receives a parsed object. 5 new regression tests (null in nested + top-level objects, null array element, explicit | tojson no-double-pipe, scalars unchanged); 619 lib + 8 parity tests green; clippy clean.

Kind validation. Built noetl-server-rust:dev, loaded into the local kind cluster (podman save → kind load image-archive), rolled noetl-server-rust, re-ran tests/pagination/cursor/cursor against the live paginated-api test-server. Baseline (unfixed) reproduced the bug: 4th check_pagination → command.completed error, args.response carried "next_cursor": undefined. Fixed: all 4 cycles → success, terminal args.response = "next_cursor": null, execution completed through end, validate_results → status=success, total_events=35, first_id=1, last_id=35 — matching the offset fixture's clean full-collection result. No error events.

Pointers. ai-meta → server 8e17fbe (v3.0.6). Standing direction honored — Claude wrote the Rust directly, no Codex. Left the unrelated uncommitted items (repos/.dockerignore, scripts/start_noetl_ui.command) untouched.


2026-06-10 (#88 shipped — pagination fixtures read response.body.*; #89 filed — worker null→undefined serialization)

Headline. Closed the pagination fixture-path follow-up #88 that #85 anticipated. The offset/cursor e2e fixtures read the HTTP response at response.get('data', {}), but the Rust http tool nests the parsed JSON payload under body: {{ fetch_page }} resolves to {body, headers, status_code}. So response.data had no users/events/has_more/next_cursor, the loop saw has_more=False and exited after page 1 even though the post-#85 loop machinery is correct.

Root cause confirmed against a live http-tool response (not guessed): call.done fetch_page → result.context.data = { body: { users:[…], has_more, offset, limit, total }, headers, status_code }. Switched both check_pagination steps to response.get('body', {}).

Fixtures changed. repos/e2e/fixtures/playbooks/pagination/offset/test_pagination_offset.yaml and .../cursor/test_pagination_cursor.yaml ('data' → 'body' + clarifying comments). Other pagination fixtures (retry, max_iterations, pipeline*, loop_with_pagination) share the same envelope-key assumption (http_response.get('data') over /api/v1/assessments|flaky, which return {data, paging}) — flagged on the umbrella, left out of scope.

Kind validation (live paginated-api test-server, Rust server/worker :dev carrying the #85 fix):

  • offset — exec 323345262938427392: offsets 0→10→20→30, has_more T/T/T/F, users 10/10/10/5, validate_results success 35 (first_id=1, last_id=35), playbook.completed COMPLETED. Fully green.
  • cursor — exec 323345263580155904 / 323346678553776128 (deterministic): cursors Mg==→Mw==→NA==→null, 35 events fetched, collected 10→20→30, then the 4th check_pagination crashed ('str' object has no attribute 'get') — the worker re-injected the terminal next_cursor: null as JS undefined, invalid JSON, so the Python step received an unparseable str. Filed #89 (repo:worker) for the serialization bug; not a response-path issue, so out of scope for #88.

Shipped. e2e#40 (squash-merged, author kadyapam) → e2e 72a7525; ai-meta pointer bumped; #88 closed + moved to Done on roadmap board 3; #89 opened (Todo) + added to board 3.


2026-06-10 (#85 deep fix shipped — durable loop-ctx propagation + loop-exit hang, server v3.0.5)

Picked up the deferred deep layer of #85 on the existing draft server#176 (dispatch-guard re-entry already on the branch). Claude wrote all Rust directly per handoff-routing.md.

Root cause (two layers). The dispatch-guard layer made the loop re-enter, but kind validation had shown loop variables thrashing (0,0,1,0,1,2,…). Diagnosed two distinct bugs:

  1. Loop-ctx not durable. Step-level set: ctx.* was recomputed every orchestrator pass from the workload default + step results. The per-pass “apply every completed step’s set” loop re-fired start’s initializer set: ctx.offset: {{ workload.offset }} (= 0) on every pass, competing non-deterministically (random HashMap order) with check_pagination’s advancing set: ctx.offset.
  2. Loop-exit hang. Once the variable advanced, the exit branch (validate_results) was marked step.skipped on a pass triggered by the loop body completing — the recency-based branch-point detector saw the body as newer than the branch point — turning the exit branch terminal so the is_step_done guard later suppressed the exit dispatch.

Fix (server#176, v3.0.5). (1) Persist each completing step’s rendered set: values to the event log as a ctx.updated event; WorkflowState folds them latest-wins into a durable ctx map that build_context overlays. Emission is once per completion, keyed by the completion event’s event_id (StepInfo.completed_event_id) — not completed_at, which the event loader fills with Utc::now() when the row’s created_at is unreadable (observed live: start re-emitting counter=0 every pass, oscillating the fold). (2) Protect the exit branch with a structural loop-branch-point test (a step with any back-edge arc), independent of completion timing.
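The two halves of fix (1) can be sketched as a latest-wins fold plus an emit-once guard. The types below are hypothetical simplifications: `CtxUpdated` carries a single integer value where the real events carry rendered JSON, and `Emitter` is an illustrative stand-in for whatever the orchestrator uses to key emission on `completed_event_id`.

```rust
use std::collections::{HashMap, HashSet};

// Simplified ctx.updated event (illustrative; real values are JSON, not i64).
struct CtxUpdated {
    event_id: u64,
    key: String,
    value: i64,
}

/// Fold events latest-wins by event_id, regardless of the order they were loaded in.
fn fold_ctx(events: &[CtxUpdated]) -> HashMap<String, i64> {
    let mut sorted: Vec<&CtxUpdated> = events.iter().collect();
    sorted.sort_by_key(|e| e.event_id);
    let mut ctx = HashMap::new();
    for e in sorted {
        ctx.insert(e.key.clone(), e.value);
    }
    ctx
}

// Hypothetical guard: emit ctx.updated once per completion event.
struct Emitter {
    seen: HashSet<u64>,
}

impl Emitter {
    /// Returns true only the first time a given completion event id is seen.
    fn should_emit(&mut self, completed_event_id: u64) -> bool {
        self.seen.insert(completed_event_id)
    }
}

fn main() {
    // Loaded out of order on purpose: the initializer (offset = 0) must not win.
    let events = vec![
        CtxUpdated { event_id: 30, key: "offset".into(), value: 20 },
        CtxUpdated { event_id: 10, key: "offset".into(), value: 0 },
        CtxUpdated { event_id: 20, key: "offset".into(), value: 10 },
    ];
    println!("offset = {:?}", fold_ctx(&events).get("offset"));

    let mut emitter = Emitter { seen: HashSet::new() };
    println!("first pass emits: {}", emitter.should_emit(10));
    println!("second pass emits: {}", emitter.should_emit(10));
}
```

Keying the guard on a stable event id rather than a timestamp is what makes it idempotent across orchestrator passes, which is the point the fix description makes about `completed_at`.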

Tests. 614 lib tests (+6 new for this change; 2 verified to fail without their guard): end-to-end counter loop advances 0→1→2 and terminates through a real exit step; ctx.updated emission + once-per-completion idempotency; durable-ctx fold + build_context overlay; exit branch not skipped on a body-completion pass. Clippy-clean.

Kind validation (rebuilt + rolled noetl-server-rust, context kind-noetl):

  • Counter-loop repro: ctx.updated emits start=0, gate=1,2,3 (once each, no thrash), work dispatched 0,1,2, zero step.skipped, validate.final_counter=3 success=True, COMPLETED.
  • Real-http offset pagination (test-server, 35 users / 10 per page): fetch_page completes 4×, offset 0→10→20→30→40, collected 0→10→20→30→35, has_more flips false only at the end, zero skips, validate_results: total_users=35 success=True, COMPLETED.

Separate finding (not #85, follow-up filed): the e2e offset/cursor fixtures’ check_pagination reads the HTTP response as response.data.users; the Rust http tool nests it at response.body.users, so the fixtures exit after page 1 (loop ctx still propagates correctly — confirmed offset 0→10).

Merged + released: server #176 → v3.0.5 (e519fdc). ai-meta pointer → server e519fdc. #85 closed; roadmap board 3 → Done.


2026-06-10 (e2e follow-ups: #87 multi-tool sibling refs shipped; #85 loop re-entry deferred)

Picked up the two open regression-sweep follow-up bugs from the 2026-06-10 sweep. Standing direction honored — Claude wrote all Rust directly (no Codex handoffs).

#87 — multi-tool sibling references — CLOSED. Root cause in noetl-tools TaskSequenceTool::execute (the runtime for tool: [list] multi-tool steps): each sub-tool's result was stored in labeled_results for the aggregated step output but never injected into the running context, so a later sub-tool referencing an earlier sibling via {{ <label>.<field> }} rendered empty. Masked wherever the reference sat in a quoted position (empty render = valid ''); surfaced as a syntax error at or near "," in an unquoted numeric SQL position (save_edge_cases test_large_payload: VALUES ('large_payload_test', {{ generate_large.metadata.record_count }}, ...)). Fix: inject each sub-tool's result under its label after it completes, with a synthetic .data self-reference (mirrors the server's build_context shape) so both {{ label.field }} and {{ label.data.field }} resolve. Also visible to a later python sub-tool's stdin variables. 2 new unit tests; 300/0 tools lib.

  • Ship: tools#48 → noetl-tools v3.1.1 (published to crates.io after a transient HTTP2-flake re-run); worker adopts via worker#69 (Cargo.lock → 3.1.1, ^3 covered it).
  • Validation: built a worker image against the 3.1.1 code (cargo [patch.crates-io] to the local tools tree), loaded into the kind cluster, re-registered credentials. save_edge_cases test_large_payload → record_count = 100 with no SQL syntax error; save_delegation_test completed clean (its unquoted {{ generate_data.value }} postgres insert would have failed if the sibling ref rendered empty).
  • Pointers: ai-meta → tools 76f942a (v3.1.1) + tools-wiki 4962f8b + worker b97f642. Tools wiki: added a Multi-tool steps (task_sequence) section.

#85 — workflow-arc loop can't re-enter a completed step — DEFERRED (kept open). The filed root cause (the pass-2 is_step_done guard suppressing the loop back-edge dispatch) is real and fixed by a back-edge detector — a matched arc src → target where target is terminal, target can forward-reach src (cycle), and src completed strictly after target (recency, which disambiguates the two arcs of a 2-cycle). The guard is bypassed for recognized back-edges and the target re-dispatched; no state reset (the re-dispatch's own step.enter/command.issued re-activate it). 608 server lib tests + 5 new pass.
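The three-part back-edge test described above (target is terminal, target can forward-reach src, src completed strictly after target) can be sketched as below. The arc representation, the `completed_seq` map of completion sequence numbers, and both function bodies are illustrative assumptions rather than the server's implementation.

```rust
use std::collections::{HashMap, HashSet};

/// Depth-first search: can `to` be reached from `from` by following arcs forward?
fn forward_reachable(arcs: &[(&str, &str)], from: &str, to: &str) -> bool {
    let mut stack = vec![from];
    let mut seen = HashSet::new();
    while let Some(node) = stack.pop() {
        if node == to {
            return true;
        }
        if !seen.insert(node) {
            continue;
        }
        for (s, t) in arcs {
            if *s == node {
                stack.push(*t);
            }
        }
    }
    false
}

/// An arc src -> target is treated as a back-edge when both steps have completed
/// (so target is terminal), target can reach src (a cycle exists), and src
/// completed strictly after target (recency picks one arc of a 2-cycle).
fn is_back_edge(
    arcs: &[(&str, &str)],
    src: &str,
    target: &str,
    completed_seq: &HashMap<&str, u64>,
) -> bool {
    let (Some(&src_done), Some(&target_done)) =
        (completed_seq.get(src), completed_seq.get(target))
    else {
        return false;
    };
    forward_reachable(arcs, target, src) && src_done > target_done
}

fn main() {
    let arcs = [
        ("start", "fetch_page"),
        ("fetch_page", "check_pagination"),
        ("check_pagination", "fetch_page"),
        ("check_pagination", "validate_results"),
    ];
    let completed_seq = HashMap::from([("start", 1), ("fetch_page", 2), ("check_pagination", 3)]);

    for (s, t) in [("check_pagination", "fetch_page"), ("fetch_page", "check_pagination")] {
        println!("{s} -> {t}: back-edge = {}", is_back_edge(&arcs, s, t, &completed_seq));
    }
}
```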

But kind validation found a second, deeper layer: the loop re-enters and no longer hangs, yet the loop variable does not propagate across iterations. set: ctx.X mutations are recomputed per orchestrator pass from step results and revert to the workload default whenever the producing step is re-dispatched. A minimal counter-loop driven by set: ctx.counter: {{ work.next_counter }} thrashes (work dispatched with counter 0,0,1,0,1,2,… instead of 0,1,2,3), so offset/cursor pagination still stall. This is the "event-sourced step re-enter / loop iteration signal" the issue body anticipated — the Rust orchestrator has no durable ctx state across passes. Held as draft server#176 (not merged — a clean hang is more debuggable than a non-deterministic thrash); detailed next-step proposal (durable context.set events reconstructed in from_events) recorded on #85. Board #85 → In progress.


2026-06-10 (full e2e regression re-sweep — 19→27/36; two orchestrator loop bugs fixed)

Agent: Claude (direct — Rust + e2e fixtures) · Repos touched: noetl/server (PR #175), noetl/e2e (PR #39), noetl/cli (PR #58, doc fix). Merged + pointer-bumped: ai-meta → server 480ba72 (v3.0.4), e2e b0a5c85, cli a3e22ef; issues #83/#84/#86 closed.

Headline. Fresh full-platform regression sweep against the local kind cluster after rebuilding noetl-server to v3.0.3 (the deployed image reported 3.0.2 — missing the #173 container-callback fix; worker already at v5.15.1 99e2c66). Config-driven 36-playbook sweep went 19→27/36 PASS after fixes. Found three distinct orchestrator/platform bugs; fixed two, filed the third.

Bugs found.

  • #83 (FIXED, server#175) — the fan-in/reduce barrier deadlocked every workflow loop. build_incoming_arcs counted a loop back-edge (check_pagination → fetch_page) as an upstream, so the barrier deferred the loop head forever (Reduce step 'fetch_page' deferring dispatch — 1 of 2 upstream(s) still pending). Fix: exclude back-edges via a new forward_reachable helper. fanout_reduce barrier unaffected.
  • #84 (FIXED, server#175): event.name was never populated for arc evaluation, so the canonical when: {{ event.name == "loop.done" }} gate (10+ fixtures) never matched → in-step loop: steps hung after completion. Fix: inject event.name = "loop.done" into a completed loop step's next-arc context. Validated: test_pagination_basic completes.
  • #85 (FILED, deep) — workflow-arc loops can't re-enter an already-Completed step: the pass-2 is_step_done dispatch guard suppresses the loop-back, so offset/cursor pagination stall on the 2nd iteration (flaky: pass when 1 page suffices). Needs an event-sourced step-reset — left for follow-up. #83 is what lets such loops run their first iteration.
  • #86 (FIXED, e2e#39) — duckdb fixtures used commands: (plural); the tool field is command/query. Renamed across 4 storage/gcs fixtures; save_all_storage_types green.
  • #87 (FILED) — multi-tool (workbook-list) step: a later sub-tool can't reference an earlier sibling's output ({{ generate_large.metadata.record_count }} renders empty); surfaces in save_edge_cases (unquoted SQL position), masked in save_all_storage_types (quoted).

Other setup. Re-registered all 5 test credentials (the dev cluster uses NOETL_ALLOW_INSECURE_DEFAULT_KEY → ephemeral per-pod key, so creds need re-registration after each server restart — by design; the stable fix is a fixed NOETL_ENCRYPTION_KEY in noetl-secret). Built + deployed the missing paginated-api test-server (ops/ci/manifests/test-server/) into kind — pagination/retry fixtures depend on it.

After-matrix (27/36). Remaining 9 non-passes: 4 external (GCS×2, OpenAI, github-URL), 2 distributed-mode local-file sources (python_file/postgres_file — script not shipped to worker), 1 negative-test harness artifact (should_error — server's HTTP 400 reject is the expected failure), 2 multi-tool templating (#87). Targeted validators kind_validate_fanout_reduce.sh + kind_validate_container_callback.sh both PASS.

Pointers. server PR #175 (fix/orchestrator-loop-completion, 26 orchestrator unit tests + 2 new). e2e PR #39. Validated against noetl-server v3.0.3 / noetl-worker v5.15.1.

2026-06-10 (#78 closed — worker pre-dispatch failures emit terminal call.error)

Agent: Claude (direct — Rust, per handoff-routing.md) · Repos touched: noetl/worker (PR #68, v5.15.1, sub-issue worker#67); worker wiki; ai-meta pointer bumps + wiki.

Headline. Fixed noetl/ai-meta#78: a command that failed before tool dispatch (credential-alias resolution, tool-config deserialization) ?-propagated out of CommandExecutor::execute_with_server_url to the dispatch loop, which only logged Command execution failed — no call.error reached the server, so the execution hung at command.started forever (had to be noetl cancel'd by hand).

What landed.

  • Typed CredentialResolutionError — terminal (AliasNotFound / Invalid) vs retryable (Transient), classified by type / HTTP status, not by string-matching anyhow messages.
  • CredentialHttpError surfaces the credential-fetch HTTP status so classify_fetch_error decides retryability by code: terminal for 404/400/401/403/500, retryable for 408/429/502/503/504 + transport errors.
  • CommandExecutor::handle_predispatch_failure emits call.error + command.failed for terminal failures (and retry-exhausted transients via MAX_PREDISPATCH_ATTEMPTS=3); fresh transients emit nothing so the command path's retry runs.
  • Folded in the gated noetl-tools dep revert (path = "../tools" → "3", Cargo.lock → 3.1.0) — this also unblocks the Docker image build (the path dep can't resolve in the build context).
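The status-based classification listed above can be sketched as a small match. The status lists and `MAX_PREDISPATCH_ATTEMPTS = 3` come from the entry; the `Retryability` enum, the use of `None` for transport errors, the treatment of unlisted statuses as terminal, and the 1-based `attempt` counter are assumptions made for illustration.

```rust
#[derive(Debug, PartialEq)]
enum Retryability {
    Terminal,
    Retryable,
}

const MAX_PREDISPATCH_ATTEMPTS: u32 = 3;

/// `status` is None for transport-level errors (no HTTP response at all).
/// Statuses not listed in the log entry are treated as terminal here (an assumption).
fn classify_fetch_error(status: Option<u16>) -> Retryability {
    match status {
        Some(408 | 429 | 502 | 503 | 504) | None => Retryability::Retryable,
        Some(_) => Retryability::Terminal,
    }
}

/// Terminal failures emit immediately; transient ones only once retries are
/// exhausted. `attempt` is assumed to be 1-based.
fn should_emit_call_error(status: Option<u16>, attempt: u32) -> bool {
    match classify_fetch_error(status) {
        Retryability::Terminal => true,
        Retryability::Retryable => attempt >= MAX_PREDISPATCH_ATTEMPTS,
    }
}

fn main() {
    for (status, attempt) in [(Some(500), 1), (Some(404), 1), (Some(503), 1), (Some(503), 3), (None, 2)] {
        println!(
            "status={:?} attempt={} -> {:?}, emit call.error: {}",
            status,
            attempt,
            classify_fetch_error(status),
            should_emit_call_error(status, attempt)
        );
    }
}
```

Classifying on the status code rather than on error-message text is what lets the real HTTP 500 decryption failure and a 404 both land on the terminal path.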

Diagnosis correction. The umbrella framed this as a "clean 404" from /api/keychain/..., but the worker has no such call — it resolves aliases via /api/credentials/{alias}. The live pg_noetl_k8s failure is actually an HTTP 500 "Decryption failed: aead::Error". The status-aware classification handles both 404 and the real 500 as terminal.

Validation. cargo build/test (133 lib + 9 integration, +7 new) / clippy green. Kind-val on local Rust stack (rebuilt worker image loaded into kind noetl): test/postgres (pg_noetl_k8s) → call.error → command.failed → playbook.failed (no hang; worker log: "Pre-dispatch failure is terminal; emitted call.error + command.failed ..."); hello_world → playbook.completed.

Pointers. ai-meta → worker 99e2c66; worker wiki 987abba (worker-credentials Pre-dispatch failure handling). PR worker#68; sub-issue worker#67 (auto-closed). Handoff thread 2026-06-09-worker-predispatch-call-error completed directly by Claude (not Codex) and archived.

2026-06-10 (#80 closed — container_callback chain green end to end)

Agent: Claude (direct — Rust + ops + e2e) · Repos touched: noetl/ops (PR #168), noetl/server (PR #173, v3.0.3), noetl/e2e (PR #38); three pointer bumps in ai-meta; ai-meta wiki.

Headline. The watcher's missing curl (#80) was the named blocker, but fixing it uncovered two more layered bugs beneath it. With all three fixed, kind_validate_container_callback.sh is green both probes — happy_path → succeeded, oom → failed_oom — closing the last blocker on the #43 container-callback chain.

Root cause (three layers).

  1. Watcher image / curl: repos/ops/ci/manifests/k8s-watcher/deployment.yaml used the retired bitnami/kubectl:1.30.3 (removed from Docker Hub; the live cluster was patched to the bitnamilegacy archive image as a stopgap) with a runtime apt/apk install jq curl step that never put curl on PATH. Every callback POST hit curl: not found → HTTP 000, so noetl_container_callback_total never bumped.
  2. Server insert schema — once curl worked, the POST reached the server and 500'd: handlers::container_callback emitted its resume call.done via db::queries::event::insert_event, whose SQL targets attempt + id columns and RETURNING id — none of which exist on the deployed noetl.event (PK (execution_id, event_id)). column "attempt" of relation "event" does not exist. The normal ingestion path (handlers::events) uses the correct column set; only this handler wired into the stale module.
  3. OOM path (never functional) — the watcher's classify_state() only read Job-level conditions, so it could emit succeeded/failed/failed_timeout but never failed_oom; and the e2e fixture's bytes(40 * 1024 * 1024) is calloc-backed (mapped to the zero page, never faulted in), so the container exited 0 instead of OOMing. A third bug hid behind those: build_body's completed_at fallback for failed Jobs used bare jq now (a numeric Unix epoch the server rejects as a DateTime<Utc> → HTTP 422).

Fixes.

  • ops #168 (→ cacc513): image → alpine/k8s:1.30.3 (kubectl + jq + curl baked in), drop the install hack; add classify_pod_failure() reading the backing Pod's status (RBAC already grants pod reads) → failed_oom (OOMKilled) / failed_image_pull (ImagePullBackOff); completed_at fallback → RFC3339 now | todate.
  • server #173 (v3.0.3 → 5d2cf58): replace insert_event with an inline INSERT matching handlers::events; terminal outcome rides in a chk_event_result_shape-conforming result envelope ({status, context}), node_type in meta. cargo build, clippy (no new warnings), 7 container_callback unit tests pass.
  • e2e #38 (→ 6aaf06e): fixture uses a written-into bytearray(64 MiB) (one byte per 4 KiB page) so the kernel backs every page and the kubelet OOM-kills the pod.

Validation. Confirmed the kind cluster enforces memory limits (120 MiB dirty alloc in a 32Mi pod → OOMKilled exit 137). Rebuilt the server image + reloaded into kind; restarted the watcher with the new script. Final run:

kind-val: PASS — happy_path   (state=succeeded  counter delta = 1)
kind-val: PASS — oom          (state=failed_oom counter delta = 1)
kind-val: ALL PROBES PASS — Container Tool Callback chain green

Pointer bumps. ai-meta@811b3da (ops cacc513) + ai-meta@12bb6d6 (server 5d2cf58, v3.0.3) + ai-meta@649c646 (e2e 6aaf06e). #80 closed by the ops pointer-bump commit; #43 (closed design issue) got a closing-the-loop note that acceptance #5 (kind validation) is green.


2026-06-10 (#79 closed — e2e kind-val runner scripts updated to current noetl CLI surface)

Agent: Claude (direct — repos/e2e is non-Rust) · Repos touched: noetl/e2e (PR #37); pointer bump in ai-meta; e2e wiki (new Kind-Val Runners page).

Headline. Both scripts/kind_validate_fanout_reduce.sh and scripts/kind_validate_container_callback.sh aborted immediately on error: unrecognized subcommand 'playbook'. They targeted the retired noetl playbook register/execute + noetl execution status/events verbs. The validation logic and the asserted event taxonomy (step.enter / command.completed / node_name / fan-in barrier) were intact — only the CLI invocation layer had drifted.

Root cause. The maintained CLI surface (stable from v2.17.0 through the v4.x repos/cli line) is register playbook / exec / status / query; the runners were never updated when the old playbook / execution command groups were removed.

Fix (e2e#37, squashed to a3594b3).

  • noetl playbook register --file F → noetl register playbook --file F.
  • noetl playbook execute --path P --output json → noetl exec <catalog-path> --runtime distributed --json (exec by metadata.path, not the bare name).
  • noetl execution status --id ID --output json → noetl status ID --json.
  • noetl execution events --id ID --output json → noetl query "SELECT … FROM noetl.event WHERE execution_id = ID ORDER BY event_id" --format json (no events verb today; rows wrap under .result; noetl.event has no timestamp column so the barrier-ordering assertion compares event_id, replacing the removed created_at).
  • Fail-fast CLI-surface guard added to each runner (names the missing verb + installed version).
  • container_callback: default NOETL_SERVER_DEPLOY → noetl-server-rust; run_probe now takes the catalog path explicitly.
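The event_id-based barrier check mentioned in the list above can be sketched as follows. This is only an illustration, not the runners' code (the runners are shell scripts): the row fields beyond event_id / event_type / node_name, the upstream step names fetch_a and fetch_b, and the helper name barrier_holds are all assumptions made for the example.

```python
import json


def barrier_holds(query_output: str, reduce_step: str, upstream_steps: list[str]) -> bool:
    """Return True if reduce_step completed after every upstream step completed.

    Ordering is compared on event_id, since noetl.event has no timestamp column.
    """
    rows = json.loads(query_output)["result"]  # query rows wrap under .result
    # Keep the event_id of each node's command.completed row
    # (assumption: the last row wins if a node has several).
    completed = {
        row["node_name"]: int(row["event_id"])
        for row in rows
        if row["event_type"] == "command.completed"
    }
    if reduce_step not in completed:
        return False
    if any(step not in completed for step in upstream_steps):
        return False
    return all(completed[step] < completed[reduce_step] for step in upstream_steps)


# Hypothetical query output; step names other than reduce_customer are invented.
sample = json.dumps({"result": [
    {"event_id": 10, "event_type": "command.completed", "node_name": "fetch_a"},
    {"event_id": 11, "event_type": "command.completed", "node_name": "fetch_b"},
    {"event_id": 12, "event_type": "step.enter", "node_name": "reduce_customer"},
    {"event_id": 13, "event_type": "command.completed", "node_name": "reduce_customer"},
]})
print(barrier_holds(sample, "reduce_customer", ["fetch_a", "fetch_b"]))
```

Comparing integers rather than timestamps keeps the assertion valid as long as event_id is monotonically assigned, which is what the ORDER BY event_id in the query relies on.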

Validation (local kind, server-rust v3.0.1 + worker-rust, :8082).

  • fanout_reduce — PASS start-to-finish, no manual workaround (final COMPLETED; exactly one step.enter for reduce_customer; reduce command.completed after both upstreams).
  • container_callback — now drives register → exec → terminal COMPLETED cleanly; stops at the metric-delta assertion because the deployed noetl-k8s-watcher image lacks curl (watcher.sh: curl: not found → callback POST HTTP 000). Cluster-side watcher gap, tracked on #80; not the CLI surface.

Version-skew decision. PATH binary is noetl 2.17.0; repos/cli submodule is v4.10.0. The targeted surface is identical across both, so the runners work on either. The installed binary lags the submodule by a major-version line — worth refreshing for general parity, but not required for these runners.

Wiki. New Kind-Val Runners page on the e2e wiki documents both runners + the CLI-surface contract; linked from Home.

Pointer: e2e → a3594b3; e2e wiki → 59413b1.


2026-06-10 (#82 closed — GUI credential View/Edit recovered for pre-wallet records; e2e fixture dedup + dev:kind script landed)

Agent: Claude (direct — repos/gui/repos/e2e are non-Rust) · Repos touched: noetl/gui (PRs #36, #35; v1.11.0 + v1.11.1), noetl/e2e (PR #36); pointer bumps in ai-meta; wiki

Headline. After the Secrets Wallet migration (#61) the credential section couldn't display some credentials on View or Edit — they failed with a generic toast. Root cause is not a GUI API-shape change: the wallet moved credential storage to forward-only envelope encryption (repos/server/src/services/credential.rs: "Forward-only — there is no legacy single-master-key path; a pre-wallet record must be re-registered"). GET /api/credentials/{id}?include_data=true calls cipher.open_storage_json(&entry.data); records sealed under the old static/all-zeros key can't be unwrapped by the new KEK, so the server returns 500 Decryption failed: aead::Error. Proven live: all 21 stored (pre-wallet) credentials return 500 on include_data=true, while a freshly-POSTed credential reads back its data exactly as before. The GUI met that 500 with a dead-end toast — View silently failed, Edit never opened.

What landed

  • GUI credential recovery (repos/gui/src/components/Credentials.tsx + styles/Credentials.css, gui#36, v1.11.0; closes #82): View surfaces the real reason and points to Edit as the recovery path; Edit still opens the modal using the metadata already in the list row (name/type/description/tags) + a warning banner, with an empty-but-required data field — re-entering the secret and saving POSTs it back, re-sealing the record under the current wallet. Happy path (wallet-era credentials) unchanged.
  • dev:kind convenience script (repos/gui, gui#35, v1.11.1): VITE_API_MODE=direct + skip-auth Vite target that talks straight to the kind server on :8082, plus README.
  • e2e fixture dedup (repos/e2e/.../tooling_non_blocking.yaml, e2e#36): removed a duplicate enable_snowflake_probe/enable_nats_kv_probe pair in the same workload: mapping (a serde_yaml dup-key defect; last-wins under PyYAML). Canonical declarations + {{ workload.* }} references intact; no behavior change.

Validation

npm run type-check + npm run build + npm test (4/4) green. Live against the kind cluster (server :8082) + npm run dev:kind UI on :3001 via Playwright: View on a legacy credential → 500 caught → decrypt-specific toast (no silent failure); Edit on a legacy credential → modal opens with the warning banner, name (pg_local) + type (PostgreSQL) preserved, data field empty + required. The re-save → re-seal loop confirmed at the API layer (a freshly-POSTed credential reads back its data). e2e fixture: yaml.safe_load parses clean, both flags resolve to False, references + auth aliases intact.

Pointers

  • ai-meta → gui 8cacc9e (v1.11.1); ai-meta → e2e 4a9ffbc.
  • The worker Cargo.toml path-dep override (noetl-tools = { path = "../tools" }) remains in the working tree, untouched — gated on #78.

2026-06-10 (#81 closed — noetl-server v3.0.2 fixes the container-tool command type contradiction; kind-val GREEN)

Agent: Claude (direct, Rust per agents/rules/handoff-routing.md) · Repos touched: noetl/server (PR #172, v3.0.2); pointer bump in ai-meta; wiki

Headline. The container tool kind couldn't execute with any command value — the server and worker disagreed on the type. Server ToolSpec.command was Option<String> (scalar): an array command: ["/bin/sh", "-c"] failed the ToolDefinition untagged-enum match at deserialise time (400 Bad Request: data did not match any variant of untagged enum ToolDefinition). Worker ContainerConfig.command is Option<Vec<String>> (a sequence): a scalar cleared the server but was then rejected by the worker (Invalid container config: invalid type: string, expected a sequence). No value satisfied both schemas, so the container-callback chain couldn't be exercised — no K8s Job was ever created.

What landed

  • Fix (repos/server/src/playbook/types.rs, server#172, v3.0.2): typed ToolSpec.command as Option<serde_json::Value> — the same treatment args already gets and which already decoded its array fine. A scalar stays a JSON string for the shell/db consumers (db tools read it via query's #[serde(alias = "command")]); an array passes through unchanged to the worker's ContainerConfig.command: Option<Vec<String>>. ToolCall::from_spec now forwards command verbatim instead of wrapping it in serde_json::Value::String(...). Worker side (noetl-tools) needed no change.
  • Tests: 2 new regression tests (test_container_array_command_decodes_and_passes_through, test_scalar_command_stays_a_string); cargo test --lib playbook::types 18/18; clippy clean on the change.
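The type contradiction and its fix can be modelled in a few lines. The sketch below is not the Rust code; it only mirrors the three schemas named above (Option<String>, Option<Vec<String>>, Option<serde_json::Value>) as Python predicates, and every function name in it is hypothetical.

```python
def legacy_server_accepts(command) -> bool:
    # Pre-fix server schema: ToolSpec.command was a scalar string (or absent).
    return command is None or isinstance(command, str)


def worker_accepts(command) -> bool:
    # Worker schema: ContainerConfig.command is a sequence of strings (or absent).
    return command is None or (
        isinstance(command, list) and all(isinstance(part, str) for part in command)
    )


def fixed_server_accepts(command) -> bool:
    # Post-fix server schema: any JSON value is accepted at decode time.
    return True


def forward_command(command):
    # Post-fix behaviour: forward the value verbatim, no string wrapping.
    return command


array_cmd = ["/bin/sh", "-c"]
scalar_cmd = "/bin/sh -c"

# Pre-fix: no concrete command value passes both the server and the worker.
print(legacy_server_accepts(array_cmd) and worker_accepts(array_cmd))    # False
print(legacy_server_accepts(scalar_cmd) and worker_accepts(scalar_cmd))  # False
# Post-fix: the array clears the server and reaches the worker unchanged.
print(fixed_server_accepts(array_cmd) and worker_accepts(forward_command(array_cmd)))  # True
```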

Validation (kind, manual approach)

Built localhost/noetl-server-rust:dev with the fix, loaded into kind via image-archive, rolled noetl-server-rust (server only — no full-stack redeploy). Reproduced the baseline 400 against the pre-fix server, then re-ran container_callback_happy_path (command: ["/bin/sh","-c"]) against the fixed server:

  • Server accepted it (no 400); execution 323132934439571456 dispatched.
  • Worker logs: container.dispatch → container.dispatched (no "expected a sequence").
  • K8s Job noetl-container-dispatchcontainer-323132934439571456-kdrq7 reached Complete 1/1. Pre-fix kubectl get jobs stayed empty.

The chain's terminal-state counter-bump validation (succeeded / failed_oom) still depends on the resume path + #79 (runner-CLI refresh) / umbrella #43.

Pointers

  • Server pointer bumped: repos/server → bd36672 (v3.0.2).
  • Closed noetl/ai-meta#81; board 3 → Done.
  • Dashboard reconciled: added open #79 + #80 to Home's Active-umbrellas table (were missing).

2026-06-09 (Full e2e regression sweep — v3.0.1 server + v3.1.0 tools: regression-clean; filed worker bug #78)

Agent: Claude (direct) · Repos touched: none (validation + issue filing); credentials registered in the running kind cluster (pg_k8s, pg_local)

Headline. Ran the config-driven regression set (repos/e2e/fixtures/playbook_test_config.yaml, 36 enabled playbooks) against the Rust-only kind stack (noetl-server-rust :dev v3.0.1 code, noetl-worker-rust :dev local tools v3.1.0, Python scaled to 0). Harness: catalog register → POST /api/execute → poll status → compare to the config's expected execution_status. Conclusion: no platform regression from the v3.0.1/v3.1.0 cleanup.

What landed

  • 18 confirmed PASS covering every code path the cleanup touched: core (hello_world, start/end-with-action), control flow / when: / tojson (control_flow_workbook, weather_control_flow), variables (5× test_vars_*), retries (http_retry_*, python_retry_exception, postgres_retry_connection, duckdb_retry_query), nested-playbook composition (playbook_composition), postgres+http (http_to_postgres_simple).
  • Every non-pass root-caused to environment / fixture / harness, not the platform:
    • CLI v2.17.0 path-derivation quirk (bare-basename → 404 when a <name>.yaml sibling exists); the server has them registered and POST /api/execute runs them.
    • External mock-HTTP (api_url flaky server) / GCS not present in kind (pagination, retry_simple_config, python_http_example, python_gcs_example, duckdb_gcs_workload_identity).
    • Missing worker-FS script file (python_file_example).
    • Missing pg_local credential (registered mid-run → postgres fixtures then reach a clean terminal).
    • Fixture SQL drift (save_*: truncate_tables expects simple_test_flat which create_tables doesn't create — surfaced cleanly as SQLSTATE 42P01, i.e. the postgres-observability fix works).

Genuine bug found → filed #78

The missing-pg_local case exposed a worker error-propagation gap: a pre-dispatch failure (credential-alias resolution) ?-propagates out of CommandExecutor::execute_with_server_url; the dispatch loop (repos/worker/src/worker.rs:306) only logs "Command execution failed" and emits no call.error → the execution hangs at command.started forever instead of reaching FAILED. Violates agents/rules/no-default-connection.md. Confirmed by before/after credential registration (missing → hang; present → clean terminal). Pre-existing; independent of the v3.0.1/v3.1.0 work. Labelled ai-task + repo:worker + bug; on roadmap board 3 (Todo).

Pointers

  • Regression result posted on #49.
  • ai-meta main pushed earlier this session (b2e8e83): tools v3.1.0 + server v3.0.1 + ai-meta-wiki pointer bumps.

2026-06-09 (E2E sweep cleanup — noetl-tools v3.1.0 + noetl-server v3.0.1; tracks #49)

Agent: Claude (direct) · Repos touched: noetl/tools (v3.1.0, PR #47), noetl/server (v3.0.1, PR #171)

Headline. Stripped the diagnostic tracing::debug! scaffolding added during the prior-session e2e triage, kept the production fixes, opened + merged the two PRs, bumped pointers. Tracks the #49 Rust-server parity umbrella.

What landed

  • noetl-tools v3.1.0 (noetl/tools#47 MERGED):

    • YAML boolean when: true in policy rules now checks as_bool() before the string-template fallthrough — Value::Bool(true).as_str() returns None, so the pre-fix code never matched.
    • |tojson fallback: when a single {{ expr }} template renders a complex object, minijinja emits Python-style repr (True/None); the engine retries with |tojson appended to produce valid JSON.
    • UndefinedBehavior::Chainable on the tools template engine — matches the server's permissive undefined-variable behaviour so {{ iter.item }} returns undefined rather than throwing.
    • Test hygiene: two tests (test_render_value_roundtrip_complex_object, test_policy_rules_extraction_from_wire_format) were outside the mod tests {} block — moved inside, removed eprintln! debug prints.
  • noetl-server v3.0.1 (noetl/server#171 MERGED):

    • Result-store PUT body limit raised to 64 MB (DefaultBodyLimit::max(64 * 1024 * 1024)) — Axum's default 2 MB limit was rejecting 15 MB+ result payloads with HTTP 413.
    • render_pipeline_config stashes set/args/spec/command blocks before Tera rendering and restores them after.
    • iter namespace map in build_iteration_command so {{ iter.item }} resolves during fan-out.
    • cmd_render_ctx uses command.context.clone().unwrap_or_default() instead of the raw orchestrator context, so per-command overrides propagate.
    • Stripped diagnostic tracing::debug! blocks from commands.rs + execute.rs.
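The when: true fix listed under noetl-tools v3.1.0 comes down to checking the boolean case before falling through to the string-template path. A minimal Python sketch of that ordering, assuming a hypothetical render_template callable and assuming a rendered "true" counts as a match (neither is taken from the tools source):

```python
def when_matches_prefix(when, render_template) -> bool:
    # Pre-fix shape: only the string path existed, so a YAML boolean never matched
    # (the equivalent of Value::Bool(true).as_str() returning None).
    if isinstance(when, str):
        return render_template(when).strip().lower() == "true"
    return False


def when_matches(when, render_template) -> bool:
    # Post-fix shape: honour a literal boolean first, then fall through to templates.
    if isinstance(when, bool):
        return when
    if isinstance(when, str):
        return render_template(when).strip().lower() == "true"
    return False


identity = lambda text: text  # stand-in renderer for the example
print(when_matches_prefix(True, identity))  # False
print(when_matches(True, identity))         # True
```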

Validation

All 7 e2e sweep playbooks PASS on the Rust-only kind stack (Rust server + Rust worker, Python scaled to 0) — confirmed in the prior session before the cleanup; these PRs are logging-only deltas on top of the validated source trees.

Pointers + issues

  • ai-meta@316048c — chore(sync): bump tools to d294a6c (v3.1.0)
  • ai-meta@6590bd6 — chore(sync): bump server to 33789b0 (v3.0.1)
  • Commented landing on #49; corrected the stale Refs #69 citation the PRs carried (#69 is the closed worker output_select bug, not these fixes).

Deferred

repos/worker/Cargo.toml still pins noetl-tools = { path = "../tools" } for local testing. Reverting to noetl-tools = "3" (crates.io) is blocked on noetl-tools v3.1.0 publishing to crates.io — currently crates.io is at v3.0.0; the v3.1.0 release commit carries [skip ci], so the publish CI didn't fire. Reverting now would regress the worker to the pre-fix v3.0.0.


2026-06-09 (#77 CLOSED — all 5 PRs merged: noetl-tools v3.0.0 + noetl-server v3.0.0 + e2e#35 + cli v4.10.0 + worker#66)

Agent: Claude (direct) · Repos touched: noetl/tools (v3.0.0), noetl/server (v3.0.0), noetl/e2e (PR #35), noetl/cli (v4.10.0, PR #57), noetl/worker (PR #66)

Headline. Post-merge propagation for #77 — explicit input:/set: forward-only data binding. All five PRs merged; all pointer bumps landed. Umbrella CLOSED, board → Done.

What landed

  • noetl-tools v3.0.0 (noetl/tools#45 MERGED): breaking — task_sequence pipeline reads each sub-tool's input: map and injects key-value pairs as template variables. _prev/_results positional workaround removed.

  • noetl-server v3.0.0 (noetl/server#169 MERGED): breaking — render_pipeline_config replaces render_value_deferring. Server no longer defers _prev/_results templates.

  • e2e fixture migration (noetl/e2e#35 MERGED): all 13 YAML fixtures rewritten from _prev references to set:/input: pattern.

  • CLI dep bump (noetl/cli#57 MERGED): noetl-tools 2.x -> 3, noetl-executor 0.4.x -> 0.5.0.

Kind validation

PASSES for #77 scope. Server orchestrator correctly renders pipeline config and propagates set: values via the new path. Template resolution failure on {{ ctx.test_var }} is a pre-existing Rust worker ctx-namespace gap, not a #77 regression.

Pointers

  • Tools: 10cc751 -> fdbc407 (v3.0.0)
  • Server: 54fecd3 -> 0f8dc63 (v3.0.0)
  • E2E: 6b3a52a -> f6a9a93 (PR #35)
  • CLI: 77be8be -> c73f99d (PR #57)
  • Worker: 02f18d5 -> 8dd653b (PR #66, dep bump noetl-tools 3.0.0 + noetl-executor 0.5.0)
  • Umbrella page: Umbrella: Explicit Input Binding
  • Issue #77 CLOSED. Board → Done.

2026-06-08 (noetl-server v2.63.0 — step-level set: + ctx shims MERGED; e2e fixture _prev template fixes MERGED; #77 opened)

Agent: Claude (direct) · Repos touched: noetl/server (PR #168 MERGED — v2.63.0), noetl/e2e (PR #34 MERGED)

Headline. Server PR #168 merged (v2.63.0) — ctx/workload namespace shims + step-level set: mutation support. Fixed the last 2 e2e fixture failures (http_to_postgres_simple + save_simple_test) by applying the _prev deferred template pattern for multi-tool steps. E2E PR #34 merged; pointer bumped. Opened #77 — explicit input: binding for multi-tool step arguments (replaces the positional _prev workaround with proper data-binding).

What landed

  • Server v2.63.0 (noetl/server#168 MERGED):

    • with_ctx_shims() helper at 7 orchestrator call sites.
    • set_vars field on Step struct — step-level set: mutations applied after each completed step.
    • Kind-validated in prior session; deployed to kind cluster.
  • E2E fixture _prev fixes (noetl/e2e#34 MERGED):

    • http_to_postgres_simple: merged separate transform + execute steps into single multi-tool step using {{ _prev.id_0 }} etc. (execution 322618534888738816 → COMPLETED).
    • save_simple_test: fixed both postgres_flat_test ({{ _prev.data.record_id }}) and postgres_nested_test ({{ _prev.nested_id }}) — both previously used tool-name references that resolved to empty server-side (execution 322619363913895936 → COMPLETED, affected_rows: 1).
  • #77 opened — explicit input: binding for multi-tool step arguments. The _prev pattern in PR #34 is a positional workaround; the proper fix is task_sequence in noetl-tools extracting each sub-tool's input: map and injecting those key-value pairs as template variables for the tool's own command:/query:.

Key discovery — _prev deferred template pattern

The Rust server's template engine (render_value_deferring() in jinja.rs) pre-renders ALL Jinja references except _prev and _results before dispatching task_sequence to the worker. Cross-tool references like {{ generate_data.data.record_id }} resolve to empty/null server-side because tool names aren't in the server's context. The correct pattern for multi-tool steps is {{ _prev.property }} — deferred by the server, resolved by the worker's task_sequence at runtime. The long-term fix is #77 — explicit input: blocks on each tool.
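The deferral behaviour described here can be illustrated with a toy renderer. This is a regex-based sketch under stated assumptions, not the server's render_value_deferring(): it handles only bare dotted references, and the names render_deferring, lookup, and DEFERRED_ROOTS are invented for the example.

```python
import re

DEFERRED_ROOTS = {"_prev", "_results"}  # left untouched for the worker
TEMPLATE = re.compile(r"\{\{\s*([A-Za-z_][\w.]*)\s*\}\}")


def lookup(context: dict, dotted: str):
    """Walk a dotted path; unknown names resolve to an empty string."""
    value = context
    for part in dotted.split("."):
        if not isinstance(value, dict) or part not in value:
            return ""
        value = value[part]
    return value


def render_deferring(template: str, context: dict) -> str:
    def replace(match: re.Match) -> str:
        path = match.group(1)
        if path.split(".")[0] in DEFERRED_ROOTS:
            return match.group(0)  # keep the template text for worker-side resolution
        return str(lookup(context, path))

    return TEMPLATE.sub(replace, template)


server_ctx = {"workload": {"table": "users"}}
# A tool-name reference is not in the server context, so it renders empty.
print(render_deferring("id={{ generate_data.data.record_id }}", server_ctx))
# A _prev reference survives server-side rendering intact.
print(render_deferring("id={{ _prev.data.record_id }}", server_ctx))
```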

Pointers


2026-06-08 (noetl-server v2.64.0 — step-level set: + ctx shims; e2e validation 25/27 PASS)

Agent: Claude (direct) · Repos touched: noetl/server (PR #168 OPEN — v2.64.0 on branch fix/ctx-shim-orchestrator-eval)

Headline. Two orchestrator bugs fixed + full 38-playbook e2e validation sweep on the Rust-only kind stack (Python scaled to 0). 25 of 27 testable playbooks PASS (92.6% excluding external-dep fixtures). PR #168 carries both fixes; awaiting merge.

What landed (on PR #168 branch, not yet merged)

  • ctx/workload namespace shims (commit b05f978): with_ctx_shims() helper adds ctx.* and workload.* namespace entries to the Jinja evaluation context at all 7 orchestrator call sites. Fixes {{ ctx.probe_delays }} and similar expressions that resolved to undefined.
  • Step-level set: mutation support (commit 48cb008): Added set_vars field to Step struct (#[serde(default, rename = "set")]). Orchestrator applies step-level set: mutations after each completed step. Previously step-level set: YAML was silently dropped during deserialization — root cause of pagination_basic failure (19 events → now 45+ events, progresses past the loop).

E2E validation summary

Tier PASS FAIL (ext) TIMEOUT (ext) Bug
1 — Core 9/10 1 (test_storage_tiers artifact _ref)
2 — Control flow 9/10 1 (OpenAI key)
3 — HTTP/postgres 4/10 3 (fixture) 3 (ext HTTP server)
4 — Composition 6/9 2 (fixture) 1 (YAML parse)
Total 25/38 5 4 1 server + 1 YAML + 2 fixture

Performance highlights: heavy_loop_aggregation 526 events in 31.89s (~16.5 events/s sustained); median simple-playbook time ~1.45s.

Pointers


2026-06-08 (noetl-tools v2.24.2 + noetl-server v2.62.1 clippy cleanup + noetl/server#22 closed)

Agent: Claude (direct) · Repos touched: noetl/tools (PR #44 MERGED, v2.24.2 released), noetl/server (PR #167 MERGED, v2.62.1 released — clippy cleanup; #22 closed)

Headline. Housekeeping session — cleared the clippy -D warnings CI gate on both noetl-tools (15 warnings across 7 files) and noetl-server (14 clippy categories across 17 files). All mechanical lint fixes, zero behavior changes. Closed stale noetl/server#22 (Phase D orchestrator engine port sub-issue — all rounds shipped).

What landed

  • noetl-tools v2.24.2 (noetl/tools#44; closes tools#42):
    • 15 clippy fixes: unused bindings, dead code suppression, identical if/else simplification, doc indent, .clamp() over .min().max(), const assertion blocks.
    • 7 files touched: script.rs, snowflake.rs, task_sequence.rs, transfer.rs, artifact.rs, mcp.rs, nats.rs.
    • 288 lib tests pass; no behavioral changes.
  • noetl-server v2.62.1 (noetl/server#167; closes server#161):
    • 14 clippy categories resolved across 17 files: Box::new wrapping for large enum variants, u32 type mismatches, too_many_arguments allow, and other mechanical lint fixes.
    • No behavioral changes; PATCH bump only.
  • noetl/server#22 closed — Phase D orchestrator engine port complete (all 6 rounds shipped: R1–R3c + R4-5 sharding).

Pointer bumps

  • repos/tools 10cc751 (v2.24.1) → 56225ca (v2.24.2)
  • repos/server 2430bc2 (v2.62.0) → da867c4 (v2.62.1)

2026-06-08 (#76 closes — noetl-server v2.62.0 ships sequential-mode iterator dispatch; first Claude-direct Rust PR under the new rule)

Agent: Claude (direct) · Repos touched: noetl/server (PR #166 MERGED, v2.62.0 released)

Headline. #76 closes — first Rust-submodule change Claude authored directly per agents/rules/handoff-routing.md (codified earlier this session to stop the Codex handoff pattern). Sequential-mode iterator dispatch: when a step declares loop.spec.mode: sequential (or omits mode:, since Sequential is the #[default]), the orchestrator serializes per-iteration commands instead of fanning out all iterations at once. This fixes the remaining 2-of-4 process_users iterations that stalled during the #75 kind-val on playbook_composition.yaml — child playbooks were contending for worker slots because all 4 iterations dispatched simultaneously.

What landed

  • noetl-server v2.62.0 (noetl/server#166; closes noetl/ai-meta#76):
    • LoopMode enum (Sequential default / Parallel) in playbook/types.rs; LoopSpec.mode parsed from playbook YAML loop.spec.mode.
    • StepInfo.iterations_dispatched field tracks command.issued count for the sequential dispatch guard.
    • Sequential dispatch pattern: dispatch iteration 0 at fan-out; on each command.completed, check iterations_dispatched == iterations_completed() and dispatch next.
    • Existing parallel-mode tests updated to explicitly set mode: Parallel.
    • 3 new tests pinning sequential behavior; lib tests pass; release build + clippy clean.
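The sequential dispatch guard above can be simulated in a few lines. The sketch below is a Python model, not the orchestrator code: StepInfo here carries only the counters needed for the illustration, and fan_out, dispatch_next, and on_command_completed are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class StepInfo:
    total: int                      # number of loop iterations
    iterations_dispatched: int = 0  # command.issued count
    completed: int = 0              # command.completed count
    issued: list = field(default_factory=list)

    def iterations_completed(self) -> int:
        return self.completed


def dispatch_next(step: StepInfo) -> None:
    if step.iterations_dispatched < step.total:
        step.issued.append(step.iterations_dispatched)
        step.iterations_dispatched += 1


def fan_out(step: StepInfo, mode: str = "sequential") -> None:
    if mode == "parallel":
        while step.iterations_dispatched < step.total:
            dispatch_next(step)
    else:
        dispatch_next(step)  # sequential: only iteration 0 at fan-out


def on_command_completed(step: StepInfo) -> None:
    step.completed += 1
    # Guard: issue the next iteration only when nothing is still in flight.
    if step.iterations_dispatched == step.iterations_completed():
        dispatch_next(step)


step = StepInfo(total=4)
fan_out(step)
max_in_flight = step.iterations_dispatched - step.completed
while step.completed < step.total:
    on_command_completed(step)
    max_in_flight = max(max_in_flight, step.iterations_dispatched - step.completed)
print(step.issued, max_in_flight)
```

With the guard in place at most one iteration is in flight at a time, which is the property that removes the worker-slot contention described in the headline.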

Kind validation

  • test/loop → COMPLETED (5/5 steps, 0 failures) — parallel iterator regression confirmed working.
  • tests/iterator_save_test → COMPLETED (4 steps, 0 failures) — iterator save regression confirmed.
  • Server submodule pointer bumped: 12eb2b8 → 2430bc2.

Pointers

  • repos/server 12eb2b8 → 2430bc2 (v2.62.0)

2026-06-08 (#75 closes — noetl-tools v2.24.1 ships PlaybookTool polling fix; 7th + final Codex Rust handoff before rule change)

Agent: Claude (dispatcher) + Codex (executor) · Repos touched: noetl/tools (PR #43 MERGED, v2.24.1 released), noetl/worker (PR #65 MERGED for lockfile bump), noetl/ai-meta (agents/rules/handoff-routing.md codified)

Headline. #75 closes — Codex handoff #7 (the day's last Rust-via-Codex round) fixed the PlaybookTool::execute polling loop's terminal-status check. Pre-fix it read payload.completed/payload.failed as booleans, but the status endpoint returns payload.status as a string — both lookups always returned false, so child playbooks dispatched with return_step: end timed out at 300s instead of returning their result. Post-fix, the kind-val on playbook_composition.yaml shows 2 of 4 iterations completing end-to-end with save_profile.status: COMPLETED from the child playbooks (was: zero completions pre-fix; child-side {status: "timeout"} shape).

Rule change codified mid-session: after this PR opened, the user said "stop using codex for rust code - do it yourself". Codified as agents/rules/handoff-routing.md. All future Rust-submodule changes (repos/cli, repos/server, repos/worker, repos/tools, repos/doctor, repos/gateway) Claude does directly via Read/Edit/Bash — no Codex handoff prompt, no Agent dispatch. Codex remains in play for non-Rust work (Python, YAML, docs in non-Rust submodules) but those are rare.

What landed

  • noetl-tools v2.24.1 (noetl/tools#43; closes noetl/ai-meta#75):
    • Extract PlaybookTool::is_terminal_status(payload) helper so the check is unit-testable without HTTP mocks.
    • Replace the two boolean lookups with a string-status check: payload.status == "COMPLETED" | "FAILED" | "CANCELLED", plus the is_cancelled: true fallback.
    • Single file touched (src/tools/playbook.rs); 132 lines added / 9 removed.
    • 7 new unit tests; lib 288 passed / 0 failed (was 281/0); release build + clippy clean.
    • PATCH bump (commit prefix fix:).
    • Authored on a Codex handoff at handoffs/active/2026-06-08-tools-playbook-polling-fix/.
  • noetl-worker PR #65 (merged): Cargo.lock bump to 2.24.1. Cargo.toml's ^2.24 caret already covered it; only the lockfile changed. Claude-authored directly (post-rule-change).
  • agents/rules/handoff-routing.md (commit e5f53fe): codifies "Claude writes Rust directly; no Codex dispatch for Rust" so it survives compaction.
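The terminal-status check from the noetl-tools v2.24.1 item above, contrasted with the pre-fix boolean lookups, might look like this in Python. It is a sketch of the described logic only; exact-case string matching and the function names are assumptions.

```python
TERMINAL_STATUSES = {"COMPLETED", "FAILED", "CANCELLED"}


def is_terminal_prefix(payload: dict) -> bool:
    # Pre-fix shape: reads booleans the status endpoint never sends.
    return bool(payload.get("completed")) or bool(payload.get("failed"))


def is_terminal_status(payload: dict) -> bool:
    # Post-fix shape: string status check plus the is_cancelled fallback.
    if payload.get("status") in TERMINAL_STATUSES:
        return True
    return payload.get("is_cancelled") is True


done = {"status": "COMPLETED"}
print(is_terminal_prefix(done))   # False: the poll loop would run until timeout
print(is_terminal_status(done))   # True
```

Keeping the check as a pure function over the payload is what makes it testable without HTTP mocks.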

Kind re-val PARTIAL — fix is verifiably working

Built localhost/noetl-worker-rust:v5.15.0-tools2241 via podman, loaded into kind, rolled deploy. Re-ran tests/playbook_composition/playbook_composition.yaml (kind exec 322425980322844672):

  • 2 of 4 process_users iterations completed end-to-end. Their call.done payloads carry save_profile.status: COMPLETED from the child playbooks (was: zero completions; {status: "timeout"} shape pre-fix).
  • 4 child executions visible in noetl.event with parent_execution_id = 322425980322844672 (but NOT in noetl.execution — separate parity gap from the Rust orchestrator not populating that table).
  • Remaining 2 iterations stuck on a separate concern (likely worker concurrency under sequential-mode iterator dispatch, or per-iteration child stall). #76 filed as the follow-up.

Pointers bumped

  • repos/tools: b6b80ce (v2.24.0) → 10cc751 (v2.24.1).
  • repos/worker: 802954b → be431a5 (lockfile bump; no version tag).

Session totals — final

7 issues closed today: #70, #71, #72, #73, #74, #75 (umbrellas); the clippy follow-ups tools#42 and server#161 are still open. 7 Codex Rust handoffs + the rule change that stops it from being 8. 7 server releases (v2.58.0 → v2.59.0 → v2.60.0 → v2.61.0 → v2.61.1), 2 tools releases (v2.24.0 → v2.24.1), worker lockfile bumps to track.


2026-06-08 (#72 closes — noetl-server v2.61.1 ships honest in-flight check in get_status via Codex handoff)

Agent: Claude (dispatcher) + Codex (executor) · Repos touched: noetl/server (PR #165 MERGED, v2.61.1 released), noetl/ai-meta

Headline. #72 closes — sixth Codex handoff of the day fixes three compounding bugs in ExecutionService::get_status that caused /api/executions/{id}/status to return COMPLETED while iterator commands were still in-flight. Kind re-val on playbook_composition.yaml confirms the endpoint now correctly reports RUNNING with running_steps: 3 while the 4 process_users iterations are pending (was: COMPLETED, running_steps: 0 pre-fix).

What landed

  • noetl-server v2.61.1 (noetl/server#165; closes noetl/ai-meta#72):
    • Fix 1: running_steps SQL filter changed from status='RUNNING' to event_type IN ('command.claimed', 'command.started') AND status IN ('RUNNING', 'STARTED'). Workers emit command.started with status='STARTED', so the old filter returned 0 even when N commands were mid-execution.
    • Fix 2: New in_flight_commands SQL query counts non-terminal rows in noetl.command using pool_for(execution_id) for sharding consistency.
    • Fix 3: COMPLETED branch now requires both stats.1 == stats.0 && stats.0 > 0 && in_flight_commands.0 == 0. Either signal alone trips prematurely (event-log signal when iterator steps fire one step.enter for N commands; command-table when projection lags). Requiring both is honest.
    • 1 file touched (src/services/execution.rs); 160 lines added / 8 removed.
    • 4 new unit tests; lib 598 passed / 0 failed (was 594/0); release build + clippy clean.
    • Authored on a Codex handoff at handoffs/active/2026-06-08-server-status-endpoint-fix/.
    • PATCH bump (commit prefix fix:).
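Fix 3 can be read as a two-signal conjunction. The Python sketch below models only the COMPLETED branch; it assumes stats.0 is the total step count and stats.1 the completed step count, and it collapses every other outcome to RUNNING, which the real get_status does not necessarily do.

```python
def derive_status(total_steps: int, completed_steps: int, in_flight_commands: int) -> str:
    # Signal 1 (event log): every entered step has completed, and at least one exists.
    event_log_done = total_steps > 0 and completed_steps == total_steps
    # Signal 2 (command table): no non-terminal rows remain in noetl.command.
    command_table_idle = in_flight_commands == 0
    return "COMPLETED" if event_log_done and command_table_idle else "RUNNING"


# Iterator case: one step.enter covers N commands, so the event log alone looks done.
print(derive_status(total_steps=2, completed_steps=2, in_flight_commands=4))  # RUNNING
print(derive_status(total_steps=2, completed_steps=2, in_flight_commands=0))  # COMPLETED
```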

Kind-val GREEN at the status-endpoint layer

Built localhost/noetl-server-rust:v2.61.1 via podman, loaded into kind, rolled deploy. Re-ran tests/playbook_composition/playbook_composition.yaml (kind exec 322392447093051392):

Snapshot  status   running_steps  total_steps  completed_steps
t+2s      RUNNING  3              2            2
t+8s      RUNNING  3              2            2
t+30s     RUNNING  3              2            2

Pre-fix would have returned COMPLETED, running_steps: 0 at t+2s already. Cancelled the stuck execution after verifying the fix.

Underlying worker stall — separate concern, out-of-scope here

The 4 process_users commands still never reach command.completed — the task_sequence → tool: kind: playbook dispatch path stalls. That's the original cause of the kind exec hanging; this PR only fixed the status endpoint's lying. Filing as a separate follow-up issue.

Pointers bumped

  • repos/server: c01d3ce (v2.61.0) → 12eb2b8 (v2.61.1).

2026-06-08 (#74 closes — noetl-server v2.61.0 exposes ctx/workload namespace shim via Codex handoff; test_args_passing fully green; #73→#74 chain complete)

Agent: Claude (dispatcher) + Codex (executor) · Repos touched: noetl/server (PR #164 MERGED, v2.61.0 released), noetl/ai-meta

Headline. #74 closes and the #73→#74 orchestrator-context chain completes. Codex handoff #5 of the day wraps CommandBuilder::build_command + build_iteration_command render contexts with ctx + workload namespace shims mirroring Python's commands.py:915-916. Kind re-val on test_args_passing.yaml is now fully green — every step command.completed | success, {{ ctx.test_var }} resolves to 100, all assertions pass.

What landed

  • noetl-server v2.61.0 (noetl/server#164; closes noetl/ai-meta#74):
    • In build_command + build_iteration_command: clone the incoming context into render_ctx, use entry().or_insert_with() to add ctx + workload keys pointing at a serde_json::Value::Object view of the original context.
    • entry().or_insert_with() preserves any pre-existing workload binding (which generate_initial_commands populates with the structured YAML workload block at execute.rs:453) so {{ workload.session_token }} keeps working.
    • Command.context persists the original flat context (not the shimmed render_ctx) — keeps event-log payloads compact.
    • Iteration shim runs AFTER iterator-var insertions so {{ ctx.<item_var> }} also resolves.
    • 1 file touched; 210 lines added / 5 removed.
    • 5 new unit tests; lib 594 passed / 0 failed (was 589/0).
    • Authored on a Codex handoff at handoffs/active/2026-06-08-server-ctx-namespace-shim/.
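The shim construction described in the list above maps closely onto dict.setdefault. The following Python sketch is an analogy for the entry().or_insert_with() pattern, not a port of CommandBuilder; build_render_ctx is a hypothetical name and the sample context is invented.

```python
def build_render_ctx(context: dict) -> dict:
    """Return a render context with ctx/workload namespace shims added."""
    render_ctx = dict(context)      # clone; the original context stays flat
    flat_view = dict(context)       # object view of the original context
    # setdefault only inserts when the key is absent, so a pre-existing
    # structured workload binding is preserved.
    render_ctx.setdefault("ctx", flat_view)
    render_ctx.setdefault("workload", flat_view)
    return render_ctx


context = {"test_var": 100, "workload": {"session_token": "abc"}}
render_ctx = build_render_ctx(context)
print(render_ctx["ctx"]["test_var"])             # 100, so {{ ctx.test_var }} resolves
print(render_ctx["workload"]["session_token"])   # abc, existing binding kept
print("ctx" in context)                          # False, persisted context is unchanged
```

One difference from the Rust pattern: setdefault evaluates its default eagerly, whereas or_insert_with builds the value only when the key is missing.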

Kind-val GREEN end-to-end

Built localhost/noetl-server-rust:v2.61.0 via podman, loaded into kind, rolled deploy. Re-ran tests/test_args_passing (kind exec 322284179205132288):

event_type          status     node_name
command.completed   success    start
command.completed   success    use_vars
command.completed   success    end
playbook.completed  COMPLETED  playbook

Dispatch context for use_vars now carries tool_config.args: {test_var: 100, computed: 200} (was {test_var: null, computed: null} on v2.60.0). Worker stdout: test_var=100, computed=200, result=300\n. All Python assertions pass (assert test_var == 100; assert computed == 200; assert result == 300).

#73 → #74 chain summary

Three Codex handoffs delivered the orchestrator-context plumbing across the day:

  1. #73 gap 1 (PR #162, v2.59.0): iterator fan-out at the initial-dispatch path. loop_test.yaml fully green.
  2. #73 gap 2 (PR #163, v2.60.0): apply_set_mutations at orchestrator dispatch with ctx./iter./step. scope-strip. actions_test.yaml partially recovers; test_args_passing.yaml still fails because of missing namespace shim.
  3. #74 (PR #164, v2.61.0): ctx + workload namespace shim in render context, mirroring Python commands.py:915-916. test_args_passing.yaml fully green.

Pointers bumped

  • repos/server: 084cad4 (v2.60.0) → c01d3ce (v2.61.0).

2026-06-08 (#73 gap 2 ships — noetl-server v2.60.0 applies arc-level set: via Codex handoff; partial kind-val, #74 filed for namespace shim)

Agent: Claude (dispatcher) + Codex (executor) · Repos touched: noetl/server (PR #163 MERGED, v2.60.0 released), noetl/ai-meta (handoff + pointer bump + #74 filed)

Headline. #73 gap 2 ships — Codex handoff #4 of the day delivered arc-level set: propagation (apply_set_mutations + render at dispatch). Kind re-val on v2.60.0 shows the apply runs correctly (bare keys land in dispatch context) but the downstream Jinja templates {{ ctx.test_var }} still resolve to null because the orchestrator doesn't expose a ctx namespace in the render context. Filed as noetl/ai-meta#74 — small follow-up to wrap the dispatch context with {ctx, iter, step} namespace shims mirroring Python's commands.py:915 (context["ctx"] = state.variables).

What landed

  • noetl-server v2.60.0 (noetl/server#163; closes noetl/ai-meta#73):
    • NextArc.args renamed to set_vars with #[serde(rename = "set")] — YAML key now matches Python canonical (set:, not legacy args:).
    • apply_set_mutations helper in state.rs mirrors Python's _apply_set_mutations verbatim: ctx./iter./step. scopes strip to bare keys; others kept as-is.
    • Orchestrator dispatch sites (main path + skip-chain loop) render set_vars templates against producing-step completion context, apply to dispatch context before issuing downstream command.
    • 4 files touched; 457 lines added / 67 removed; 8 new unit tests; lib 589 passed / 0 failed (was 581/0).
    • Authored on a Codex handoff at handoffs/active/2026-06-08-server-next-set-propagation/.
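The scope-strip rule from the bullets above can be illustrated with a minimal Python sketch. This is neither the server's Rust helper nor the Python `_apply_set_mutations` source: the function signature and the `SCOPE_PREFIXES` name are assumptions, and only the rule itself (`ctx.`/`iter.`/`step.` prefixes strip to bare keys, other keys kept as-is) comes from the entry.

```python
# Hypothetical sketch of the scope-strip rule; not the actual noetl implementation.
SCOPE_PREFIXES = ("ctx.", "iter.", "step.")


def apply_set_mutations(variables: dict, rendered_set: dict) -> dict:
    """Return a copy of `variables` with already-rendered `set:` values applied."""
    updated = dict(variables)
    for key, value in rendered_set.items():
        for prefix in SCOPE_PREFIXES:
            if key.startswith(prefix):
                key = key[len(prefix):]  # ctx.test_var -> test_var
                break
        updated[key] = value
    return updated


state = apply_set_mutations({}, {"ctx.test_var": 100, "ctx.computed": 200, "plain": 1})
print(state)
```

The sketch returns a new dict rather than mutating in place, which is a choice made here for testability, not something the entry states.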

Kind re-val PARTIAL

Built localhost/noetl-server-rust:v2.60.0 via podman, loaded into kind, rolled deploy.

  • actions_test.yaml (kind exec 322277934972801024): 4 downstream steps recover (aggregate, helper, verify, end all command.completed | success — were error pre-fix). process_loop ×3 still error because it uses spec.policy.rules.then.set (per-task pipeline directive), a DIFFERENT code path from arc-level set:.
  • test_args_passing.yaml (kind exec 322277934767280128): use_vars STILL errors with TypeError: 'NoneType' + 'NoneType'. Command-table evidence: tool_config.args: {test_var: null, computed: null} despite state.variables containing the bare keys (post-strip). Root cause: the downstream Jinja template {{ ctx.test_var }} looks up a ctx namespace value, but the dispatch render context doesn't expose ctx as a key. Python handles this in commands.py:915 with context["ctx"] = state.variables.
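Why the bare keys are not enough can be shown with a small sketch. The real renderer is Jinja-based; the `resolve` helper below is a hypothetical stand-in that only walks dotted paths, and the dict shapes are assumptions. The one line taken from the entry is the shim itself, `context["ctx"] = state.variables`.

```python
# Hypothetical stand-in for template lookup; the real renderer is Jinja-based.
def resolve(path: str, context: dict):
    """Walk a dotted path such as 'ctx.test_var'; return None if any part is missing."""
    current = context
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


variables = {"test_var": 100, "computed": 200}  # bare keys after the scope strip

# Without the shim: the bare keys are present, but there is no `ctx` key.
context = dict(variables)
before = resolve("ctx.test_var", context)

# With the shim, mirroring the cited Python line: context["ctx"] = state.variables
context["ctx"] = variables
after = resolve("ctx.test_var", context)

print(before, after)
```

Under these assumptions the first lookup yields `None` (the `null` seen in `tool_config.args`) and the second yields `100`.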

Follow-up filed

  • noetl/ai-meta#74 (repo:server) — bind ctx/iter/step namespaces in the dispatch render context. Pure addition to the render-context construction in CommandBuilder::build_command; should be a small Codex handoff.

Pointers bumped

  • repos/server: 59b743c (v2.59.0) → 084cad4 (v2.60.0).

2026-06-08 (#73 gap 1 closes — noetl-server v2.59.0 fans out start-step iterators via Codex handoff; loop_test fully green)

Agent: Claude (dispatcher) + Codex (executor) · Repos touched: noetl/server (PR #162 MERGED, v2.59.0 released), noetl/ai-meta (handoff + pointer bump)

Headline. #73 gap 1 closes — Codex handoff #3 of the day delivered the iterator-binding fix at the initial-dispatch path. Kind re-val of loop_test.yaml on v2.59.0 produces 5 iterator-bound start commands (was 1 empty-args pre-fix) + all 11 per-step command.completed | success. Gap 2 (next.set: value propagation across step transitions) remains the next round on #73.

What landed

  • noetl-server v2.59.0 (noetl/server#162; Refs noetl/ai-meta#73):
    • generate_initial_commands learns about start_step.loop: if loop: is set, render loop.in_expr to a JSON array, require array (else AppError::Validation), iterate to build per-iteration IteratorMetadata + dispatch via build_iteration_command + persist_engine_command. Mirrors the existing Phase D R3b orchestrator fan-out into the /api/execute initial-command path.
    • 282 lines added / 8 removed in src/handlers/execute.rs.
    • 3 new unit tests (fan-out, back-compat no-loop, non-array rejection); lib 581 passed / 0 failed (was 578/0); release build + clippy clean.
    • Authored on a Codex handoff at handoffs/active/2026-06-08-server-initial-iterator-fanout/.
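The fan-out described above (render the loop collection, require an array, build one command per element) can be sketched as follows. The function name, the command dict shape, and the `loop_index` / `iterator` fields are illustrative assumptions; the Rust code builds `IteratorMetadata` and persists commands, which this sketch does not model.

```python
# Hypothetical sketch of initial-dispatch iterator fan-out; shapes are illustrative.
def fan_out_start_step(step_name: str, rendered_loop_in, iterator_name: str) -> list:
    """Build one command per element of the already-rendered loop collection."""
    if not isinstance(rendered_loop_in, list):
        # The entry says a non-array is rejected with a validation error.
        raise ValueError(f"loop.in for step '{step_name}' must render to an array")
    count = len(rendered_loop_in)
    commands = []
    for index, item in enumerate(rendered_loop_in):
        commands.append({
            "step": step_name,
            "args": {iterator_name: item, "loop_index": index},
            "iterator": {"index": index, "count": count},
        })
    return commands


commands = fan_out_start_step("start", [1, 2, 3, 4, 5], "num")
print(len(commands), commands[0]["args"])
```

A five-element collection yields five commands, each with its own bound iterator value, instead of a single command with empty args.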

Kind-val GREEN end-to-end

Built localhost/noetl-server-rust:v2.59.0 via podman, loaded into kind, rolled deploy. Re-ran tests/test/loop (kind exec 322268172092706816):

step_name | dispatched
----------+------------
start     |          5  ← was 1 with args:{} pre-fix
process   |          1
user_loop |          3  ← already worked via R3b
verify    |          1
end       |          1

All 11 command.completed events carry status: success. playbook.completed | COMPLETED reflects real state.

process step's call.done shows the upstream loop's aggregated path (original_sum: 15, doubled_sum: 30, count: 5); verify step received it via {{ process }} template and passed all assertions; end step's verify_result.test_passed: true.

What's still open on #73

  • Gap 2: next.set: value propagation. Affects test_args_passing.yaml's use_vars step — tool_config.args: {test_var: null, computed: null} because the upstream arc's set: { ctx.test_var: '{{ initial_value }}' } doesn't propagate into the downstream step's context at dispatch. Distinct code path from the iterator fan-out; will land in a follow-up round.

Pointers bumped

  • repos/server: 7dab231 (v2.58.0) → 59b743c (v2.59.0).

2026-06-08 (#71 CLOSED — noetl-tools v2.24.0 ships python wrapper fix via Codex handoff; kind re-val surfaces #73)

Agent: Claude (dispatcher) + Codex (executor) · Repos touched: noetl/tools (PR #41 MERGED, v2.24.0 released), noetl/worker (PR #64 OPEN for the dep bump), noetl/ai-meta (handoff + new follow-ups + pointer bump)

Headline. #71 closes at the noetl-tools layer. Second cross-agent handoff of the day used the same Claude-dispatch + Codex-execute pattern that delivered #70. Kind re-val with the new worker image (localhost/noetl-worker-rust:v5.15.0-tools224) confirms the python wrapper fixes shipped — Style C top-level-return wrap is visible in fixture tracebacks — but surfaces a distinct orchestrator-side binding gap filed as noetl/ai-meta#73.

What landed

  • noetl-tools v2.24.0 (noetl/tools#41; closes noetl/ai-meta#71):
    • wrap_top_level_return() helper detects unindented return before any def/async def/class and wraps user code in def __noetl_step__(args, input_data, **kw): + a call that captures the return value as result.
    • input_data = dict(args) global added to the wrapper template (right after globals().update(args)).
    • Style B (def main() convention from #65) preserved untouched.
    • 7 new unit tests; lib 281 passed / 0 failed (was 274/0); release build clean.
    • Authored on a Codex handoff at handoffs/active/2026-06-08-tools-python-wrapper-contract/ (prompt commit 74ffe97, result commit pushed by Codex itself).
  • noetl-worker PR #64 (noetl/worker#64) — one-commit dep bump (noetl-tools = "2.23" → "2.24" in Cargo.toml + Cargo.lock). Awaiting review/merge.
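The Style C wrap can be sketched in Python as a rough illustration. This is a re-creation from the description above, not the noetl-tools source: the regexes, the `needs_wrap` helper, and the indentation strategy are assumptions (for example, indenting every line would also shift the contents of multi-line string literals).

```python
import re
import textwrap

# Hypothetical re-creation of the detect-and-wrap rule; not the noetl-tools source.
_TOP_LEVEL_DEF = re.compile(r"^(def|async def|class)\b")
_TOP_LEVEL_RETURN = re.compile(r"^return\b")


def needs_wrap(code: str) -> bool:
    """True when an unindented `return` appears before any top-level def/class."""
    for line in code.splitlines():
        if _TOP_LEVEL_DEF.match(line):
            return False
        if _TOP_LEVEL_RETURN.match(line):
            return True
    return False


def wrap_top_level_return(code: str) -> str:
    """Wrap user code in an implicit function and capture its return as `result`."""
    if not needs_wrap(code):
        return code
    body = textwrap.indent(code, "    ")
    return (
        "def __noetl_step__(args, input_data, **kw):\n"
        f"{body}\n"
        "result = __noetl_step__(args, input_data)\n"
    )


user_code = "total = input_data.get('num', 0) * 2\nreturn {'doubled': total}"
namespace = {"args": {"num": 21}, "input_data": {"num": 21}}
exec(wrap_top_level_return(user_code), namespace)
print(namespace["result"])
```

Code that already follows the `def main()` convention is returned unchanged, which matches the note that Style B is preserved.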

Kind re-val GREEN at the wrapper layer, mixed at the orchestrator layer

Built localhost/noetl-worker-rust:v5.15.0-tools224 via podman, loaded into kind, rolled the deployment (pod noetl-worker-rust-68ff5bd99f-dgf65 registered + serving). Re-ran the three #71 fixtures:

| Fixture | Kind exec | Wrapper layer | Orchestrator layer | Net |
|---|---|---|---|---|
| loop_test.yaml | 322253869239242752 | ✅ Style C wrap fires; result = __noetl_step__(args, input_data) in traceback | ❌ start step's command context shows args: {}; iterator's num not bound | partial |
| actions_test.yaml | 322253869834833920 | ✅ input_data global injected | ❌ downstream null-arg dispatch, same as test_args_passing | partial |
| test_args_passing.yaml | 322253869994217472 | ✅ wrapper runs cleanly | ❌ tool_config.args: {test_var: null, computed: null}; next.set not propagated | partial |

Evidence the wrapper IS fixed. The original #71 fingerprints (SyntaxError: 'return' outside function + NameError: name 'input_data' is not defined) are GONE. Replaced by TypeError: unsupported operand type(s) for *: 'NoneType' and 'int' — Python successfully ran the wrapped function, but downstream values arrived as None.

The new finding. Orchestrator dispatches with empty / null args:

  • Iterator binding gap: loop_test's start step gets args: {} despite input: { num: '{{ num }}', loop_index: '{{ loop_index }}' }.
  • next.set: propagation gap: test_args_passing's use_vars step gets args.test_var: null despite the upstream arc's set: { ctx.test_var: '{{ initial_value }}' }.

Both are orchestrator-side template-resolution gaps, distinct from #60 (template context for when:). Filed as noetl/ai-meta#73.

Follow-ups filed

  • noetl/ai-meta#73 (repo:server) — orchestrator doesn't bind iterator values + next.set value templates into downstream step args at dispatch.
  • noetl/tools#42 — 12 pre-existing clippy errors in mcp.rs / nats.rs / snowflake.rs / result_fetch.rs, parallel to noetl/server#161. Mechanical cleanup; blocks PR CI on -D warnings.

Pointers bumped

  • repos/tools: 7d3fcfd (v2.23.1) → b6b80ce (v2.24.0).
  • repos/worker: 401bafc (v5.15.0) → 802954b (worker#64 merged; chore(deps): so no release-please tag).

2026-06-08 (post-#70 sweep — #71 + #72 filed; orchestrator-status-drift surfaced)

Agent: Claude · Repos touched: noetl/ai-meta (issues #71 + #72 opened), noetl/ai-meta.wiki

Headline. After #70 closed, swept four more e2e fixtures against the new v2.58.0 stack. test_storage_tiers clean. loop_test + actions_test surfaced two distinct python-tool wrapper gaps + an orchestrator status-drift concern. Re-audit of the earlier test_args_passing PASS verdict revealed it was a false-pass: command.completed | error masked by playbook.completed | COMPLETED.

Sweep results (all on v2.58.0)

| Fixture | Kind exec | Verdict |
|---|---|---|
| test_storage_tiers.yaml | 322239636480987136 | PASS (5/5 steps, 0 errors); confirms #70 cascade unblocked for this fixture too. |
| loop_test.yaml | 322239728994750464 | FAIL: all 5 python steps fail with SyntaxError: 'return' outside function. |
| actions_test.yaml | 322240336095088640 | FAIL: start step OK; downstream python steps fail with NameError: name 'input_data' is not defined. |
| test_args_passing.yaml (re-audit) | 322228896269340672 | False-pass. use_vars step had `command.completed |

Findings filed

  • noetl/ai-meta#71 (repo:tools) — Rust python tool's wrapper doesn't replicate the legacy noetl/noetl/tools/python/executor.py contract. Two gaps:
    1. Top-level return X triggers SyntaxError: 'return' outside function (need to wrap user code in an implicit function).
    2. input_data global isn't bound in the exec namespace (Python tool sets input_data + args + result as globals before exec). Mechanical fix — wrapper-level only, no fixture rewrites.
  • noetl/ai-meta#72 (repo:server) — orchestrator emits playbook.completed | COMPLETED even when EVERY step's command.completed is error. CI gates / dashboards that watch playbook.completed trust the green status on silently-broken playbooks. Needs a design call: cascade to playbook.failed, add playbook.completed_with_errors, or add an error-count field — pick one and document in event-envelope.md.

Sweep methodology note

This session reaffirmed the pattern that the "did the orchestrator emit a terminal event" check is insufficient for kind-val PASS verdicts — every sweep must also audit the per-step call.done / command.completed events. The Sessions-Log entry from earlier in this session reporting test_args_passing as PASS was a false-positive directly caused by #72.
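The audit rule above can be expressed as a small verdict function. This is a sketch under stated assumptions: events are modelled as dicts with `event_type` and `status` keys, and the `HANG` label for a missing terminal event is invented here; the actual event-log schema may differ.

```python
# Hypothetical event shape: dicts with `event_type` and `status` keys.
TERMINAL_EVENTS = ("playbook.completed", "playbook.failed")


def kind_val_verdict(events: list) -> str:
    """PASS only if a terminal event exists AND every step-level completion succeeded."""
    terminal = [e for e in events if e["event_type"] in TERMINAL_EVENTS]
    if not terminal:
        return "HANG"
    step_errors = [
        e for e in events
        if e["event_type"] == "command.completed" and e.get("status") != "success"
    ]
    if step_errors or terminal[-1]["event_type"] == "playbook.failed":
        return "FAIL"
    return "PASS"


false_pass = [
    {"event_type": "command.completed", "step": "use_vars", "status": "error"},
    {"event_type": "playbook.completed", "status": "COMPLETED"},
]
print(kind_val_verdict(false_pass))
```

With this check, the test_args_passing shape (a failed step followed by a green terminal event) is reported as `FAIL` rather than `PASS`.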

Pointers

  • 7 e2e fixtures hit the #71 pattern (grep input_data\.get|^\s*return over repos/e2e/fixtures/playbooks/*.yaml): actions_test, broken_sql, duckdb_test, loop_test, postgres_test, test_args_passing, test_end_with_action. Many more under subdirectories.
  • The orchestrator status-drift evidence query (event-log slice from #72 body) reproduces on all three failed fixtures above.

2026-06-08 (#70 CLOSED — noetl-server v2.58.0 ships durable result-store endpoints via Codex handoff; kind-val GREEN)

Agent: Claude (dispatcher) + Codex (executor) · Repos touched: noetl/server (PR #160 MERGED, v2.58.0 released), noetl/ai-meta (handoff + pointer bumps), noetl/ai-meta.wiki, noetl/server.wiki

Headline. #70 closes. Cross-agent handoff delivered the worker's missing wire counterpart: noetl-server now serves PUT /api/result/<eid> + GET /api/result/resolve (release commit 7dab231 = v2.58.0). Kind re-val of tests/output_select_test (kind exec 322232289352224768) reaches playbook.completed with test_result: "PASSED" + output_select_worked: true — the cascade that the prior session diagnosed (worker 404 → shm-only branch → null _ref → artifact tool fails) is fully unblocked.

What landed

  • noetl-server v2.58.0 (noetl/server#160; closes noetl/ai-meta#70) — 4 new files + 4 modified across repos/server:
    • src/db/queries/result_store.rs (startup DDL CREATE TABLE IF NOT EXISTS noetl.result_store + insert + get_by_ref; mirrors secret_audit pattern).
    • src/services/result_store.rs (ResultStoreService::{put, resolve}, PutResultBody / ResultPutResponse wire types matching worker ControlPlaneClient::put_result exactly, parse_noetl_ref URI parser, 10 unit tests).
    • src/handlers/result_store.rs (axum handlers with result_store.{put,resolve} tracing spans + 4 new metrics).
    • src/metrics.rs + module index files + src/main.rs (startup wiring + route registration).
    • 578 lib tests pass (+10 new); release build clean.
    • Wiki: noetl/server deployment-specification.md updated to list /api/result/* in the network surface (commit 0d25ef5).

Kind-val GREEN end-to-end

  • Built localhost/noetl-server-rust:v2.58.0 via podman (podman --connection noetl-dev build), loaded into kind (kind load image-archive), rolled deployment (kubectl set image; pod noetl-server-rust-9d5c88779-wb2bp Running; version reports 2.58.0).
  • Smoke-PUT confirmed 200 OK with the expected noetl://execution/12345/result/smoke/<snowflake> URI shape.
  • Re-ran tests/output_select_test (kind exec 322232289352224768) — COMPLETED with the orchestrator's terminal playbook.completed event; summary step's call.done shows output_select_worked: true, result_was_externalized: true, full_data_items_processed: 1000, test_result: "PASSED".
  • Worker logs confirm the SUCCESS branch (was unreachable pre-fix): Tool result exceeds inline budget; staged in durable result store + shared-memory cache. result_ref=noetl://execution/322232289352224768/result/start/322232292384706560 ... put_duration_seconds=0.014976658.
  • Negative check: zero put_result HTTP 404 lines for this execution in the worker log (was 1+ per over-budget step pre-fix).
  • Server logs confirm both endpoints actively serving: result_store.put: stored ... bytes=256945 result_ref=... + result_store.resolve: found ... duration_seconds=0.007878078.
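The URI shape seen in the smoke-PUT and worker logs, `noetl://execution/<eid>/result/<step>/<id>`, can be parsed with a few lines of Python. This is an illustrative sketch, not the server's `parse_noetl_ref`: the returned field names and the error messages are assumptions.

```python
from urllib.parse import urlparse


# Hypothetical parser for the URI shape shown in the logs; field names are assumptions.
def parse_noetl_ref(ref: str) -> dict:
    """Split noetl://execution/<eid>/result/<step>/<id> into its components."""
    parsed = urlparse(ref)
    if parsed.scheme != "noetl" or parsed.netloc != "execution":
        raise ValueError(f"not a noetl execution ref: {ref!r}")
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) != 4 or parts[1] != "result":
        raise ValueError(f"unexpected ref path: {parsed.path!r}")
    execution_id, _, step, result_id = parts
    return {"execution_id": execution_id, "step": step, "result_id": result_id}


ref = "noetl://execution/322232289352224768/result/start/322232292384706560"
print(parse_noetl_ref(ref))
```

IDs are kept as strings in the sketch so that 64-bit snowflake values are never truncated by a consumer that lacks wide integers.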

Followups

  • noetl/server#161 — 14 pre-existing clippy errors in files unrelated to result_store will block the next PR's -D warnings CI gate (didn't block #160 because the merge config tolerates them). Mechanical cleanup; one PR per file or batched.
  • output_select per-field projection at worker call.done — flagged as a follow-up in the prior session, but the kind-val showed output_select_worked: true end-to-end (verify step received status, count, total_size correctly via the artifact-tool lazy_load round-trip). The worker comment at command.rs:909 describing "future expansion" was overly pessimistic; no follow-up issue needed.
  • Flight gRPC fast path — worker logs show Flight transport failed; falling back to HTTP (port 8083 not wired in kind). Out-of-scope here; HTTP fallback successful.

Pointers bumped

  • repos/server: f7ae136 (v2.57.2) → 7dab231 (v2.58.0).
  • repos/noetl-server-wiki: 0ad86de → 0d25ef5 (deployment-spec network-surface row).

2026-06-08 (#70 result-store endpoints opened on noetl/server via Codex handoff; PR #160 awaiting CI + merge)

Agent: Claude (dispatcher) + Codex (executor) · Repos touched: noetl/server (PR #160 opened), noetl/ai-meta (handoff thread + #70 comments + #161 clippy follow-up)

Headline. First cross-agent handoff of the session: Claude scoped the work and authored the prompt, Codex executed Phases A+B unattended on the local feature branch, and Claude opened the PR after the user gave the ship it wait phrase. noetl/server#160 closes noetl/ai-meta#70 by porting PUT /api/result/<eid> + GET /api/result/resolve from Python.

Handoff trail

  • Thread: handoffs/active/2026-06-08-server-result-store-endpoint/.
  • round-01-prompt.md authored by Claude (commit 742e981); round-01-result.md written by Codex (commit 894f513).
  • Codex's report: 4 new files + 4 modified across repos/server; 578 lib tests pass (+10 new); release build clean; Phase C correctly gated on the ship it wait phrase.

What landed on noetl/server

  • PR #160 (branch feat/result-store-put-resolve-endpoints at 0c0d13b; closes noetl/ai-meta#70 when merged + pointer-bumped) —
    • New src/db/queries/result_store.rs (startup DDL CREATE TABLE IF NOT EXISTS noetl.result_store + insert + get_by_ref).
    • New src/services/result_store.rs (ResultStoreService::{put, resolve}, PutResultBody / ResultPutResponse wire types, parse_noetl_ref URI parser with 10 unit tests).
    • New src/handlers/result_store.rs (axum handlers + tracing spans + 4 new metrics).
    • Modified metrics.rs + the three module index files + main.rs (startup wiring + route registration).
    • MVP scope explicit: cluster-wide pool not yet shard-aware; DELETE/list/GC/TTL/scoping/Arrow Flight gRPC fast path all deferred to follow-up rounds.

Kind validation evidence (pre-merge corroboration)

  • Ran e2e/fixtures/playbooks/test_output_select.yaml (kind exec 322228228976545792) against the unpatched Rust server image. Worker logs confirm the exact failure mode put_result failed: HTTP 404 → cascade through the shm-only branch → downstream {{ start._ref }} resolves null → artifact tool errors with Invalid artifact config: invalid type: null, expected a string. Same cascade as the original noetl/ai-meta#70 report. Posted as a corroboration comment on the issue.
  • Post-merge: rebuild noetl-server image, load into kind, re-run test_output_select.yaml + test_storage_tiers.yaml; expect both to advance past the artifact step (output_select per-field projection still pending — see follow-up note in noetl/ai-meta#70 comment).
  • Ran e2e/fixtures/playbooks/iterator_save_test.yaml (kind exec 322227920619704320) — PASS. 3 rows inserted in public.iterator_save_test. Iterator + pipeline + _prev + execution_id substitution all clean.
  • Ran e2e/fixtures/playbooks/test_args_passing.yaml (kind exec 322228896269340672) — PASS. ctx.test_var + ctx.computed propagate cleanly through the next.set block.

Surfaced follow-ups

  • noetl/server#161 — clippy cleanup. cargo clippy --lib --tests --release -- -D warnings currently exits 101 on main with 14 pre-existing errors in files unrelated to result_store. Will block PR #160's CI as-is.
  • output_select per-field projection at worker call.done — separate from #70. The worker comment at command.rs:909 flags it as "future expansion when output.output_select plumbing lands at the server-side ToolSpec layer". To file when #70 merges + the worker can finally write the durable-success branch end-to-end.

Pointers

  • noetl/server PR #160 awaits merge → ai-meta pointer bump → wiki Releases.md row → Closes noetl/ai-meta#70 fires.
  • Roadmap board (Project 3): noetl/ai-meta#70 moved Todo → In progress on PR open.

2026-06-08 (#69 closes — call.done embeds _ref on durable-success branch, worker v5.15.0; #70 filed as server-side gap)

Agent: Claude · Repos touched: noetl/worker (PR #63 merged, v5.15.0 released), noetl/ai-meta, noetl/ai-meta.wiki

Headline. Worker-side fix for noetl/ai-meta#69 shipped (worker v5.15.0); kind re-val surfaced a downstream server-side gap filed as noetl/ai-meta#70.

What landed

  • noetl-worker v5.15.0 (noetl/worker#63; closes noetl/ai-meta#69) — single-commit MINOR bump (feat: prefix). When an over-budget result's durable PUT succeeds, build_call_done_result now emits {status, context: { data: { _ref: <noetl://...> } }, reference: {...}} — the inline context.data._ref block sits next to the existing reference block so the orchestrator's extract_user_data finds it under the v10 nested-envelope path. Downstream {{ step._ref }} resolves to the URI string; consuming artifact / result_fetch tools dispatch the URI-based fetch. 4 existing tests updated; lib 126/0/0; release build clean.
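The envelope shape described above can be sketched in Python to show why the inline `_ref` matters. The contents of the `reference` block are elided in the entry, so the two keys used for it below are placeholders; `extract_user_data` here is a simplified stand-in for the orchestrator function of the same name, and the URI is a made-up example.

```python
# Hypothetical sketch of the durable-success call.done envelope.
def build_call_done_result(status: str, result_ref: str) -> dict:
    return {
        "status": status,
        # Inline block: what downstream templates read as {{ step._ref }}.
        "context": {"data": {"_ref": result_ref}},
        # The real reference block is elided in the entry; these keys are placeholders.
        "reference": {"kind": "result_ref", "uri": result_ref},
    }


def extract_user_data(envelope: dict) -> dict:
    """Simplified stand-in: read user data from the nested context.data path."""
    return envelope.get("context", {}).get("data", {})


envelope = build_call_done_result("success", "noetl://execution/1/result/start/2")
step = extract_user_data(envelope)
print(step.get("_ref"))
```

An envelope that carries only the `reference` block would leave `step.get("_ref")` as `None`, which is the null that the artifact tool rejected.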

Kind re-validation

  • Built localhost/noetl-worker-rust:v5.15.0-feat69 via podman, loaded into kind, rolled the deployment. Pod noetl-worker-rust-7547f66555-t2p8q running v5.15.0.
  • Re-ran e2e/fixtures/playbooks/test_output_select.yaml (kind exec 322223448124297216) + e2e/fixtures/playbooks/test_storage_tiers.yaml (kind exec 322223449223204864).
  • Both fixtures still fail at the artifact step with invalid type: null, expected a string, but the failure shape is different: the over-budget call.done now lands on the degraded shm-only branch (reference.kind: arrow_ipc) instead of the durable-success branch (which would be kind: result_ref).

Surfaced finding

Worker logs reveal the root cause:

Durable result-store PUT failed; falling back to shared-memory cache only.
error=put_result failed: HTTP 404

The Rust noetl-server doesn't expose PUT /api/result/<execution_id> — confirmed via grep 'api/result' repos/server/src/main.rs returning no matches. Worker has been calling this endpoint for the entire Phase D+E+F lifetime (R-2.2 PR-B), but the Python-server route was never ported.

Filed as noetl/ai-meta#70 (server-side parity gap). Once that lands, the worker's #69 embedding works end-to-end.

Pointers bumped

  • repos/worker: ca1daf4 (post-worker#62 merge) → 401bafc (v5.15.0 release commit)

2026-06-08 (#68 closes — artifact tool args alias, v2.23.1 + worker dep bump; #69 filed as downstream gap)

Agent: Claude · Repos touched: noetl/tools (PR #40 merged, v2.23.1 released), noetl/worker (PR #62 merged, lockfile bump), noetl/ai-meta, noetl/ai-meta.wiki

Headline. Two-PR shipment for the artifact-tool config-deserialization friction filed as noetl/ai-meta#68 earlier today. Kind re-val surfaced a new downstream gap (filed as noetl/ai-meta#69).

What landed

  • noetl-tools v2.23.1 (noetl/tools#40; closes noetl/ai-meta#68): PATCH bump on v2.23.0. #[serde(alias = "args")] on ArtifactConfig.input resolves the friction between noetl-server's ToolSpec field-name normalization (input: → args via noetl/ai-meta#56) and the Python-parity expectation that the artifact tool config field be named input. Both shapes now deserialize. 1 new unit test. Crates.io published 2026-06-08T02:48:09Z.
  • noetl-worker (worker#62): lockfile-only bump noetl-tools 2.23.0 → 2.23.1. Cargo.toml already had noetl-tools = "2.23", which accepts any 2.23.x. Worker lib 126/0. Merged at ca1daf4. No release tag, because release-please skips chore(deps): commits.
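The effect of the field alias can be mimicked in Python for illustration. This is not how serde is implemented: the loader name is hypothetical, and rejecting a payload that carries both keys is an assumption of this sketch.

```python
# Hypothetical loader mimicking a field alias: `args` is accepted in place of `input`.
def load_artifact_config(raw: dict) -> dict:
    """Normalize an artifact config so the payload always lives under `input`."""
    if "input" in raw and "args" in raw:
        # Assumption of this sketch: supplying both spellings is treated as an error.
        raise ValueError("both `input` and its alias `args` are present")
    config = {k: v for k, v in raw.items() if k not in ("input", "args")}
    config["input"] = raw.get("input", raw.get("args"))
    return config


canonical = load_artifact_config({"kind": "artifact", "input": {"ref": "x"}})
aliased = load_artifact_config({"kind": "artifact", "args": {"ref": "x"}})
print(canonical == aliased)
```

Both spellings normalize to the same config, which is the property the v2.23.1 alias provides for the server-normalized shape.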

Kind re-validation evidence

  • Built localhost/noetl-worker-rust:v5.15.1-tools231 via podman (cargo-chef warm cache), loaded into kind via podman cp + ctr -n=k8s.io images import, rolled the worker deployment (pod noetl-worker-rust-56cdb9db76-jn5s6, v5.15.1-tools231).

  • Re-ran e2e/fixtures/playbooks/test_output_select.yaml (kind exec 322212285424603136) + e2e/fixtures/playbooks/test_storage_tiers.yaml (kind exec 322212745988542464) — both fixtures now deserialize the artifact config successfully (confirming the v2.23.1 alias fix took effect) but FAIL at the first artifact step with a NEW diagnostic:

    Configuration error: Invalid artifact config: invalid type: null, expected a string
    

    Root-caused: the upstream step's {{ step._ref }} resolves to null because the Rust worker's call.done envelope carries the reference block (durable result_store PUT) without an inline data block containing the output_select fields + the synthetic _ref URI.

Surfaced finding

noetl/ai-meta#69 — worker-side fix needed for output_select / _ref population on call.done. Filed with the kind execution IDs + envelope shape + likely fix site (worker executor::command::build_call_done_result).

Pointers bumped

  • repos/tools: e38046f (v2.23.0) → 7d3fcfd (v2.23.1)
  • repos/worker: bd977f8 (v5.14.0 main pre-bump) → ca1daf4 (worker#62 merge)

2026-06-08 (#67 closes — orchestrator exclusive-routing fix, v2.57.2 + kind-val GREEN)

Agent: Claude · Repos touched: noetl/server (PR #159 merged, v2.57.2 released), noetl/ai-meta, noetl/ai-meta.wiki

Headline. Diagnosed + fixed the orchestrator deadlock filed as noetl/ai-meta#67 earlier today. PR #159 OPEN.

Diagnosis

Instrumented engine/orchestrator.rs::process_in_progress with debug eprintln!s and stepped through the unit-test reproducer. Found:

  • The Jinja template renderer (Chainable undefined) handles {{ A if A else B.x }} correctly when B is undefined — short-circuits to A.x via Jinja2 semantics; the renderer was a red herring.
  • The actual deadlock: build_incoming_arcs (the static planner) declared summarize as having 2 upstreams (process_high + process_low) because both step's next.arcs pointed to it. Under mode: exclusive only process_high fired; process_low stayed Pending forever; the R4 fan-in barrier (server#142) never resolved — orchestrator silently produced new_commands=0 new_events=0 should_complete=false.

Fix (3-part) — noetl/server PR #159

  1. evaluator::evaluate_next_transitions: stop break-ing out of the arcs loop on the first exclusive-mode match. Surface every remaining sibling (and any arc whose when evaluated false in inclusive mode) as EvaluationResult { matched: false, next_step: Some(name) } via a new helper not_matched_with_target.
  2. orchestrator::process_in_progress: refactor into two ordered passes:
    • Pass 1 (new): emit step.skipped for every unmatched arc target across ALL completed steps BEFORE any barrier check runs. Pre-pass design eliminates HashMap-iteration-order non-determinism.
    • Pass 2 (existing): dispatch matched arc targets. The R4 fan-in barrier now also consults the in-pass step.skipped set — so a downstream merge target dispatches in the SAME orchestrator pass as the siblings' skips (no two-trigger gymnastics required).
  3. Unit tests added: test_67_exclusive_routing_emits_step_skipped_for_unmatched_siblings (orchestrator layer — pins the comprehensive_test shape) + test_jinja_conditional_short_circuits_on_undefined_else_branch (template layer — regression guard for Chainable semantics).
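The two-pass shape of the fix can be sketched in Python. This is a simplified model, not the Rust orchestrator: the `arcs` / `matched` / `completed` inputs and the `plan_dispatch` name are assumptions, and step states are reduced to plain sets.

```python
# Hypothetical model of the two ordered passes; data shapes are illustrative.
def plan_dispatch(arcs: dict, completed: set, matched: dict):
    """arcs: source -> list of targets; matched: source -> target chosen under exclusive mode."""
    taken = {matched[s] for s in completed if s in matched}

    # Pass 1: mark every unmatched arc target of a completed step as skipped,
    # before any barrier check runs.
    skipped = set()
    for source in sorted(completed):
        for target in arcs.get(source, []):
            if target not in taken:
                skipped.add(target)

    # Static planner view: which upstreams does each step declare?
    upstreams = {}
    for source, targets in arcs.items():
        for target in targets:
            upstreams.setdefault(target, set()).add(source)

    # Pass 2: dispatch matched targets whose upstreams are all completed OR skipped.
    resolved = completed | skipped
    dispatch = []
    for source in sorted(completed):
        target = matched.get(source)
        if target and target not in completed and target not in dispatch:
            if upstreams[target] <= resolved:
                dispatch.append(target)
    return skipped, dispatch


arcs = {
    "start": ["process_high", "process_low"],
    "process_high": ["summarize"],
    "process_low": ["summarize"],
}
skipped, dispatch = plan_dispatch(
    arcs,
    completed={"start", "process_high"},
    matched={"start": "process_high", "process_high": "summarize"},
)
print(skipped, dispatch)
```

Because `process_low` lands in the skipped set during the first pass, the barrier for `summarize` resolves in the same call; if the barrier consulted only `completed`, `summarize` would never dispatch.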

Build evidence

  • cargo test --lib -- --test-threads=1 → 568/0/0 (was 566, +2 new).
  • cargo build --release --bin noetl-control-plane → clean.

Release + kind-val GREEN

  • noetl/server v2.57.2 released — single-commit patch on top of v2.57.1. PR #159 merged at 02:11 UTC; Closes noetl/ai-meta#67 keyword auto-closed the umbrella; board 3 status auto-moved to Done.

  • Built localhost/noetl-server-rust:v2.57.2 via podman; loaded into kind via podman cp + ctr -n=k8s.io images import; rolled noetl-server-rust deployment (uptime 26s).

  • Re-ran e2e/fixtures/playbooks/comprehensive_test.yaml (kind execution 322196926424420352) — COMPLETED in ~4s (was hanging forever pre-fix). Event trace shows the exact fix shape:

    start.command.completed
      → step.skipped for process_low  (untaken exclusive sibling — THE FIX)
      → step.enter for process_high → process_high.command.completed
      → step.enter for summarize → summarize.command.completed  (barrier saw process_low as Skipped)
      → step.enter for end → end.command.completed
      → playbook.completed
    
  • Server pointer in ai-meta bumped: 9526b26 (v2.57.1) → f7ae136 (v2.57.2).

Pointers

  • ai-task: noetl/ai-meta#67 (In progress on board 3).
  • Server PR: noetl/server#159.
  • Files touched: repos/server/src/engine/evaluator.rs, repos/server/src/engine/orchestrator.rs, repos/server/src/template/jinja.rs (374 insertions, 10 deletions).

2026-06-08 (#49 regression-rig sweep — 7 fixtures swept; 1 real bug filed as #67)

Agent: Claude · Repos touched: noetl/ai-meta (issue #67), noetl/ai-meta.wiki

Headline. Swept 7 self-contained fixtures from repos/e2e/fixtures/playbooks/ against the v2.57.1 Rust kind stack: 3 PASS, 3 correct-FAIL (infra), 1 real orchestrator bug (filed as noetl/ai-meta#67).

Results

| Fixture | Outcome | Notes |
|---|---|---|
| simple_python.yaml | ✅ COMPLETED | clean |
| test_args_passing.yaml | ✅ COMPLETED | clean |
| test_large_result_extraction.yaml | ✅ COMPLETED | clean |
| postgres_test.yaml | ❌ correct-FAIL | fixture's start step has kind: postgres with no auth: block; error: Failed to get connection. Fixture issue, not server bug. |
| http_test.yaml | ❌ correct-FAIL | fixture targets noetl.noetl.svc.cluster.local:8082 (the retired Python noetl-server service); error: error sending request. Fixture is stale; should target noetl-server-rust.noetl.svc.cluster.local. |
| v10_canonical_example.yaml | ❌ correct-FAIL | fixture's fetch_data step targets https://api.example.com/data (fake hostname); error: error sending request. Fixture needs a real test endpoint. |
| comprehensive_test.yaml | ⚠️ BUG FILED | orchestrator hangs after process_high.command.completed; never dispatches summarize; never emits a terminal event. Root cause: Jinja conditional {{ A if A else B }} in input: where B is undefined (because mode: exclusive routed to A only). Minimal repro: tests/k66d/comprehensive_minimal_repro (without the conditional) completes in ~4s. Filed as #67. |

Recommendations for the e2e fixtures

Three stale fixtures need targeted fixes (file a sub-issue on noetl/e2e or fix inline):

  1. postgres_test.yaml start step: add auth: pg_k8s so it uses the keychain alias that already works.
  2. http_test.yaml workload api_url: bump from noetl.noetl.svc.cluster.local:8082 → noetl-server-rust.noetl.svc.cluster.local:8082.
  3. v10_canonical_example.yaml workload api_url: pick a kind-internal endpoint (or skip it from the auto-sweep list).

Pointers

  • Sweep runner: /tmp/sweep_runner.sh (operator-local; can be checked into repos/e2e/scripts/ as a durable rig if next session wants it).
  • Minimal bug reproducer: tests/k66d/comprehensive_minimal_repro (proves the conditional shape is the trigger).
  • Filed sub-issue: noetl/ai-meta#67 (board 3, Todo).

2026-06-08 (#49 housekeeping — {{ execution_id }} multi-statement postgres finding resolved on v2.57.1)

Agent: Claude · Repos touched: noetl/ai-meta.wiki

Headline. Re-ran the open finding from the #54 sweep on the fresh v2.57.1 server — bug no longer reproduces. Umbrella-49 "Next concrete steps" item 4 strikethrough applied; Sessions-Log entry added; pointer bumped.

What was tested

Two minimal reproducer playbooks against the live Rust kind stack (noetl-server-rust:v2.57.1 + noetl-worker-rust:v5.15.0-tools223):

  1. tests/k66b/pg_exec_id_multistatement — postgres query: alias with SELECT {{ execution_id }} AS exec_id_one (single) vs CREATE TEMP TABLE + DELETE WHERE eid = {{ execution_id }} + INSERT VALUES ({{ execution_id }}, ...) + SELECT {{ execution_id }} (multi). Both single + multi return the correct execution_id (322180876613980160) on the wire.

  2. tests/k66c/pg_command_alias_repro — postgres command: alias (the wiki-described failing shape) with the same CREATE+DELETE+INSERT+SELECT pattern that loop_with_pagination's start step uses. {{ execution_id }} renders correctly in every position; INSERT succeeds; verifying SELECT returns the row.

Conclusion

The "WHERE execution_id = ; syntax error at command-build time" fingerprint the #54 sweep recorded no longer reproduces. Likely fixed via the template defer-render series in server#71/#73 (or adjacent template-scoping work between 2026-06-05 and now). Strikethrough applied to Umbrella-49 page; no sub-issue needed.

Pointers

  • Umbrella: Umbrella: Rust Server Port.
  • Original finding: server#72 / server#73 template-defer-render context.
  • Reproducer fixtures (kept in /tmp/ on the operator's machine — not durable; the bug is closed so no checked-in fixture needed).

2026-06-07 (ai-meta#66 closes — Rust orchestrator step.data template accessor fix, v2.57.1)

Agent: Claude · Repos touched: noetl/server (PR #158 merged, v2.57.1 released), noetl/ai-meta, noetl/ai-meta.wiki

Headline. Fix for the cross-step {{ step.data }} template gap surfaced as a finding during the previous session's #65 kind-val. Single-tool python steps producing a flat user dict were reachable via {{ step.field }} but not via {{ step.data }} / {{ step.data.field }} (the wrapped path canonical v10 fixtures use interchangeably). Two-line semantic fix in WorkflowState::build_context + two new unit tests.

What changed

  • noetl/server PR #158 OPEN — branch fix/orchestrator-step-data-template-accessor. In repos/server/src/engine/state.rs, the build_context step-loop now wraps the extracted user_data with a self-referencing .data key when no .data is already present. Guarded by !map.contains_key("data") so the task_sequence flatten path's existing .data (populated from a labeled sub-task's data field) stays intact.
  • 2 new unit tests in engine::state::tests:
    • test_build_context_exposes_step_data_accessor_for_flat_user_dict — reproduces the live kind execution 322087210360770560 envelope from the previous session's #65 kind-val; asserts BOTH step.status (flat) AND step.data.status (wrapped) resolve.
    • test_build_context_data_accessor_does_not_clobber_existing_data_field — pins task_sequence flatten back-compat: a labeled-sub-task data field stays addressable as both <step>.<label>.data.x AND <step>.data.x.
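The accessor rule above, sketched in Python for illustration only. This is not the Rust code in state.rs: expose_data_accessor is a hypothetical name, and a plain copy stands in for however build_context actually creates the self-reference.

```python
def expose_data_accessor(user_data: dict) -> dict:
    """Return a step context where step.field and step.data.field both resolve."""
    context = dict(user_data)
    # Guard: a task_sequence flatten may already have populated .data
    # from a labeled sub-task, so an existing value is left untouched.
    if "data" not in context:
        context["data"] = dict(user_data)
    return context


flat = expose_data_accessor({"status": "success", "total_greetings": 3})
existing = expose_data_accessor({"data": {"x": 1}, "status": "success"})
```

With this shape the flat and wrapped paths read the same user_data, while a pre-existing data field keeps its original contents.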

Build evidence

  • cargo test --lib engine::state → 19/0/0 (was 17, +2 new).
  • cargo test --lib → 566/0/0.
  • cargo build --release --bin noetl-control-plane → clean.

Release + closeout

  • noetl/server v2.57.1 released — single-commit patch on top of v2.57.0 (the Phase D R5 R7 parity harness).
  • PR #158 merged at 19:18 UTC; Closes noetl/ai-meta#66 keyword auto-closed the umbrella; board 3 status auto-moved to Done.
  • Server pointer in ai-meta bumped: 395f8cf → 9526b26.

Kind re-validation — GREEN

  • Built localhost/noetl-server-rust:v2.57.1 via podman; loaded into kind via podman cp + ctr -n=k8s.io images import; rolled noetl-server-rust deployment (uptime 25s, version reports 2.57.1).

  • Registered tests/k66/python_file_loader_v2 and executed (execution 322179661486362624). Playbook reaches playbook.completed in ~4s; verify_data_accessor step's output:

    {
      "all_pass": true,
      "checks": {
        "flat_status_ok": true,
        "flat_total_ok": true,
        "wrapped_status_ok": true,
        "wrapped_total_ok": true,
        "wrapped_script_source_ok": true,
        "flat_vs_wrapped_status_match": true,
        "flat_vs_wrapped_total_match": true
      },
      "flat_status": "success", "wrapped_status": "success",
      "flat_total": 3, "wrapped_total": 3,
      "wrapped_script_source": "file"
    }

    Both {{ run_from_file.<field> }} (existing flat path) AND {{ run_from_file.data.<field> }} (new wrapped #66 path) resolve to the same upstream user_data. #66 fix verified end-to-end on the live Rust orchestrator.


2026-06-07 (ai-meta#65 closes — python external script loaders + main() convention, kind-val GREEN)

Agent: Claude · Repos touched: noetl/tools (PRs #38 + #39 merged), noetl/worker (PR #61 open), noetl/ai-meta, noetl/ai-meta.wiki

Headline. Two follow-on noetl-tools releases that complete the python script: block subsystem, kind-validated end-to-end on the live worker. Surfaced one downstream finding (cross-step template-context gap, filed as noetl/ai-meta#66).

What landed

  • noetl/tools v2.22.0 (noetl/tools#38) — external python script loaders for file / gcs / http source types. PythonSource enum + resolve_source() pure classification + async load_script_code() on PythonTool (owns a reqwest::Client + GcpAuth). GCS uses GCP ADC / workload-identity per the execution-model "already-in-place trust" rule. 15 new unit tests.
  • noetl/tools v2.23.0 (noetl/tools#39) — legacy main() function convention. When the user code doesn't set a non-None result AND defines a callable main, the wrapper introspects the signature, binds named params from args, forwards **kwargs, awaits async main via asyncio.run. Mirrors noetl/tools/python/executor.py _invoke_main. 4 new unit tests.
  • noetl/worker#61 OPEN + mergeable: bumps noetl-tools = "2.21" → "2.23". Worker lib 126/0 against the new tools. Image built locally (v5.15.0-tools223), loaded into kind, kind-val GREEN.
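A minimal Python sketch of the main() convention described for v2.23.0. It is an approximation under assumptions, not the wrapper shipped in noetl-tools: invoke_main is a hypothetical name, "forwards **kwargs" is read here as "pass every arg when main declares **kwargs", and the sample main bodies are stand-ins.

```python
import asyncio
import inspect


def invoke_main(namespace: dict, args: dict):
    """Fall back to main() only when the script set no non-None result."""
    if namespace.get("result") is not None:
        return namespace["result"]
    main = namespace.get("main")
    if not callable(main):
        return None
    params = inspect.signature(main).parameters
    takes_kwargs = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    # Bind only the names main() declares, unless it accepts **kwargs.
    bound = dict(args) if takes_kwargs else {
        k: v for k, v in args.items() if k in params
    }
    outcome = main(**bound)
    # An async main returns a coroutine, which is driven to completion here.
    if inspect.iscoroutine(outcome):
        outcome = asyncio.run(outcome)
    return outcome


def main(name="World", count=1):
    # Stand-in for a script like loader_hello.py; the real body is not shown.
    return {"messages": [f"Hello, {name}! (#{i})" for i in range(1, count + 1)]}


async def async_main(name="World"):
    return f"Hello, {name}!"


sync_result = invoke_main({"main": main}, {"name": "NoETL", "count": 3, "unused": True})
async_result = invoke_main({"main": async_main}, {"name": "NoETL"})
```

The unused key is dropped because the sample main declares neither that parameter nor **kwargs.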

Kind validation evidence

Built noetl-worker-rust:v5.15.0-tools223 → loaded into kind → rolled worker deployment. Registered + executed tests/k65/python_file_loader:

script:
  uri: /tmp/loader_hello.py
  source:
    type: file
input:
  name: NoETL
  count: 3

…where loader_hello.py defines main(name="World", count=1) (no result global).

Result (execution 322087210360770560): playbook.completed in ~6s. The loaded step's call.done data field:

{"status": "success",
 "messages": ["Hello, NoETL! (#1)", "Hello, NoETL! (#2)", "Hello, NoETL! (#3)"],
 "total_greetings": 3, "script_source": "file"}

Proves the external file loader + main(name, count) convention chain end-to-end on the live worker.

Surfaced finding — filed as #66

The kind-val playbook's downstream verify step ({{ run_from_file.data }}) resolved to None. The upstream step's data IS in the event log + correct (SELECT result::jsonb #> '{context,result,context,data}' returns the full payload), but the orchestrator doesn't project upstream-step call.done data into the next step's template scope under that reference path. Filed as a Rust-orchestrator finding (separate from #65; doesn't block the loaders).

Closeout

  • noetl/ai-meta#65 closed with full kind-val evidence cited. gcs / http loaders validated at the unit-test layer (15 tests); hermetic kind-val of those paths needs external infra (GCS object / HTTP server) so they validate in CI/GKE rather than the local kind rig.

Pointer bumps (this session, post-R5-closure)

No ai-meta pointer changes — the tools work is a crates.io dep, not a submodule. The worker pointer bump rides noetl/worker#61's merge.

Pointers

  • noetl/ai-meta#65 (CLOSED).
  • noetl/ai-meta#66 (NEW — surfaced finding).
  • noetl/worker#61 (OPEN — adoption / deployment PR).

2026-06-07 (server#157 merged — Phase D R5 R7: cross-server parity harness, v2.57.0; Phase D R5 umbrella closes)

Agent: Claude · Repos touched: noetl/server (PR merged + sub-issue auto-closed), noetl/ai-meta, noetl/ai-meta.wiki

Headline. Phase D R5 closes. Final slice — a cross-server parity harness that feeds a synthetic event log through both the Rust fold_replay_state port and a pre-recorded Python snapshot, asserts structural equality field-by-field across every projection.

Sub-issue noetl/server#148 auto-closed via PR body's Closes noetl/server#148 keyword at 17:53:40Z. All seven Phase D R5 rounds shipped today (v2.51.0 → v2.57.0).

What landed

  • noetl/server#157 merged (server@395f8cf, v2.57.0, closes server#148):
    • tests/parity_harness/events.json — 13-event synthetic log exercising all six replay projections (execution / stage / frame / command / business_object / loop) plus payload refs.
    • tests/parity_harness/expected.json — Python's structured fold output for that fixture; committed alongside so the harness is hermetic (no live Python runtime needed at test time).
    • tests/parity_harness/regenerate_expected.py — standalone Python 3.10+ script. Verbatim extract of noetl/server/api/replay/service.py fold + helpers, no noetl-package imports (dodges the transitive-dep chain — nats, snowflake-connector-python, env-var validators, …). Re-syncable when the Python implementation changes.
    • tests/parity_harness/README.md — documents the contract.
    • tests/parity_harness.rs — 8 Rust integration tests. Loads both files, folds with Rust, compares slice-by-slice with per-key failure messages. All pass.

Parity contract

  • Structural — same projection keys, same per-key field values (status, counters, summaries, references). This is the load-bearing contract: the Rust port produces the same logical view as Python's source-of-truth.
  • NOT byte-for-byte hex on checksum.value / projection_checksums[*].value. Python and Rust hash different digest inputs (Python normalizes to flat rows; Rust hashes the typed state directly — per R4's design). Both deliver determinism + replay validation; the typed Checksum { type, value } shape (R4) keeps the wire stable for future algorithm additions.
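One way to picture the structural contract is a comparison that masks digest hex values before diffing. The Python sketch below is illustrative and is not the harness in tests/parity_harness.rs: mask_digests and parity_failures are hypothetical names, and it assumes both sides have already been brought to the typed {type, value} checksum shape.

```python
def mask_digests(state: dict) -> dict:
    """Blank out digest hex values so only structure and fields are compared."""
    masked = dict(state)
    if isinstance(masked.get("checksum"), dict):
        masked["checksum"] = {**masked["checksum"], "value": None}
    masked["projection_checksums"] = {
        name: {**entry, "value": None}
        for name, entry in (state.get("projection_checksums") or {}).items()
    }
    return masked


def parity_failures(rust_state: dict, python_state: dict) -> list:
    """Return one message per top-level key whose masked values differ."""
    left, right = mask_digests(rust_state), mask_digests(python_state)
    return [
        f"{key}: rust={left.get(key)!r} python={right.get(key)!r}"
        for key in sorted(set(left) | set(right))
        if left.get(key) != right.get(key)
    ]


rust = {"event_count": 13,
        "checksum": {"type": "sha256", "value": "aa"},
        "projection_checksums": {"command": {"type": "sha256", "value": "bb"}}}
python = {"event_count": 13,
          "checksum": {"type": "sha256", "value": "cc"},
          "projection_checksums": {"command": {"type": "sha256", "value": "dd"}}}
same = parity_failures(rust, python)
drift = parity_failures(rust, {**python, "event_count": 12})
```

Differing hex values alone produce no failure; a differing counter does.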

Closing the Phase D R5 umbrella

All seven rounds shipped:

Round  Topic                                         Status
R1     Endpoint scaffold + execution projection      ✅ v2.51.0
R2     stages + frames + commands projections        ✅ v2.52.0
R3     loops + business_objects projections          ✅ v2.53.0
R4     typed Checksum + projection_checksums         ✅ v2.54.0
R5     snapshot seed + base_state + upcaster digest  ✅ v2.55.0
R6     payload resolver                              ✅ v2.56.0
R7     cross-server parity harness against Python    ✅ v2.57.0

The Replay engine port is complete. Python's ~1236-LoC noetl/server/api/replay/service.py is now ported to Rust with structural-parity unit-test coverage (564/0/0) + 8-test cross-server parity harness.

No kind-val required

R7 is purely a test/fixture addition — no runtime code changes. v2.57.0 is a release-please rolling-MINOR bump (commit message used the feat(replay): prefix) but the deployed image is identical in behaviour to v2.56.0. Kind reload skipped.

Pointer bumps (this session)

  • repos/server → 395f8cf (v2.57.0).

Pointers

  • noetl/server#148 (Phase D R5 sub-issue) — CLOSED.
  • noetl/ai-meta#49 Phase D R5 — Replay engine port complete.

2026-06-07 (server#156 merged — Phase D R5 R6: payload resolver, v2.56.0, kind-val GREEN)

Agent: Claude · Repos touched: noetl/server (PR merged + image build + kind reload + kind-val), noetl/ai-meta, noetl/ai-meta.wiki

Headline. Phase D R5 sixth slice ships. Every event's result.reference JSON gets parsed into a typed PayloadSummary and appended to the relevant projection's payload-refs list. Only R7 (cross-server parity harness) remains in the umbrella.

What landed

  • noetl/server#156 merged (server@a8a054a, v2.56.0, Refs server#148). Mirrors Python's _payload_ref / _payload_summary / per-projection payload_refs population.
    • PayloadSummary struct with five fields: sha256 / schema_digest / row_count / media_type / ref (the last renamed from Rust reference_uri via serde(rename = "ref")). All Option<…> + skip_serializing_if.
    • PayloadRefEntry struct: {event_id, reference, summary} per Python's dict shape.
    • ReplayEventRow.result: Option<serde_json::Value> (the noetl.event.result jsonb column). SQL queries updated across all three load_events variants.
    • ReplayExecutionState.payload_refs: Vec<PayloadRefEntry> — every event with result.reference appended in event_id order.
    • ReplayFrameState.output_ref + output_ref_summary — populated on frame.committed / frame.failed; summary is Some(default) (all-None fields) when the terminal event has no reference (mirrors Python's _payload_summary(None)).
    • ReplayBusinessObjectState.payload_refs + last_payload_ref — every event touching the BO with a result.reference appended; last_payload_ref points at the most recent.
    • extract_payload_ref(event) mirrors Python's _payload_ref — reads event.result.reference, returns None when the result is absent, has no reference key, or reference is null.
    • payload_summary(reference) mirrors Python's _payload_summary: three-tier fallback per field (reference.<field> → reference.rows_ref.meta.<field> → reference.rows_ref.ipc.<field>); sha256 falls back to reference.digest; ref falls back to reference.uri.
  • 15 new unit tests covering all the fallback paths + per-projection population. Server lib 564/0/0 (was 549/0/0).
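The fallback chain in payload_summary can be sketched as follows. This is a hedged Python illustration, not Python's _payload_summary verbatim nor the Rust port: the key names read from reference, and the point at which the digest / uri fallbacks apply, are assumptions drawn from the bullet above.

```python
SUMMARY_FIELDS = ("sha256", "schema_digest", "row_count", "media_type", "ref")


def payload_summary(reference):
    """Resolve each summary field through the three-tier fallback chain."""
    reference = reference or {}
    rows_ref = reference.get("rows_ref") or {}
    # Tier order: reference, then rows_ref.meta, then rows_ref.ipc.
    tiers = (reference, rows_ref.get("meta") or {}, rows_ref.get("ipc") or {})

    def first_present(field):
        for tier in tiers:
            if tier.get(field) is not None:
                return tier[field]
        return None

    summary = {field: first_present(field) for field in SUMMARY_FIELDS}
    # Field-specific fallbacks: sha256 from digest, ref from uri.
    if summary["sha256"] is None:
        summary["sha256"] = reference.get("digest")
    if summary["ref"] is None:
        summary["ref"] = reference.get("uri")
    return summary


nested = payload_summary({"digest": "abc", "rows_ref": {"meta": {"row_count": 3}}})
empty = payload_summary(None)
```

A None reference yields an all-None summary, which matches the Some(default) behaviour described for terminal frame events without a reference.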

Kind validation evidence

Built noetl-server-rust:v2.56.0 → loaded into kind → rolled deployment. Re-probe of the prior fanout_reduce execution (322023958058635264) returns the same shape as v2.55.0, with the new fields all empty (the fixture has no events carrying result.reference).

A second execution (640422512395813188) confirms the live payload resolver path — three populated entries with real SHA-256 hex digests:

"execution.payload_refs": [
  {"event_id": 640422601071788780,
   "summary": {"sha256": "d0de6b8de78fd04b2e752a96ebef12df4a9b32e92565b3f6e55860ae12762133",
               "row_count": null, "ref": null}},
  {"event_id": 640422601071788782, "summary": {"sha256": "d0de6b8..."}},
  {"event_id": 640422601080177391, "summary": {"sha256": "d0de6b8..."}}
]

row_count + ref are null because the originating events' result.reference JSON didn't carry those fields — the fallback chain visited every location and returned None as documented. R6 correctness verified end-to-end.

Pointer bumps (this session)

  • repos/server → a8a054a (v2.56.0).

Phase D R5 status

Round  Topic                                         Status
R1     Endpoint scaffold + execution projection      ✅ MERGED (v2.51.0)
R2     stages + frames + commands projections        ✅ MERGED + kind-val GREEN (v2.52.0)
R3     loops + business_objects projections          ✅ MERGED + kind-val GREEN (v2.53.0)
R4     typed Checksum + projection_checksums         ✅ MERGED + kind-val GREEN (v2.54.0)
R5     snapshot seed + base_state + upcaster digest  ✅ MERGED + kind-val GREEN (v2.55.0)
R6     payload resolver                              ✅ MERGED + kind-val GREEN (v2.56.0)
R7     Cross-server parity harness against Python    ⏳ next (final round)

Pointers

  • noetl/server#148 (umbrella sub-issue; closes after R7).
  • noetl/ai-meta#49 Phase D R5.

2026-06-07 (server#155 merged — Phase D R5 R5: snapshot seed + base_state + upcaster digest, v2.55.0, kind-val GREEN)

Agent: Claude · Repos touched: noetl/server (PR merged + image build + kind reload + kind-val), noetl/ai-meta, noetl/ai-meta.wiki

Headline. Phase D R5 fifth slice ships. The replay fold can now start from a prior fold's output and continue from there rather than always re-folding from event 1. Mirrors Python's base_state + snapshot_seed + upcaster_registry_digest parameters on fold_replay_state.

What landed

  • noetl/server#155 merged (server@3dc6b66, v2.55.0, Refs server#148).
    • ReplaySnapshotSeed struct: aggregate_id, aggregate_type, version: i64, checksum: Checksum, state: ReplayState, meta: Map. Mirrors Python's frozen dataclass from noetl/server/api/replay/types.py.
    • ReplaySnapshotInfo struct — same shape minus state (the full state already went into base_state). Surfaced on the output as ReplayState.replay_snapshot.
    • ReplayFoldOptions struct (Default impl) — carries the three optional inputs: base_state / snapshot_seed / upcaster_registry_digest.
    • New ReplayState fields: upcaster_registry_digest: Option<String> + replay_snapshot: Option<ReplaySnapshotInfo>. Both skip_serializing_if = "Option::is_none" so default-options folds produce the exact same JSON as R1–R4 — wire-shape back-compat preserved.
    • fold_replay_state_with_options new entry point. The existing fold_replay_state (5-arg) is now a thin shim that passes ReplayFoldOptions::default().
    • Continuation semantics: base_state strips its checksum + projection_checksums (they recompute at the end); counters (event_count, last_event_id, …) continue from where the base left off; caller's tenant_id / organization_id / execution_id override whatever the base recorded; caller's upcaster_registry_digest wins over base's, but None from caller preserves base's value.
  • 8 new unit tests covering option propagation, snapshot info surfacing, counter continuation, checksum stripping + recomputation, tenant/org override, upcaster digest precedence rules. Server lib 549/0/0 (was 541/0/0).
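The continuation semantics can be read as a seeding step followed by a normal fold. The Python sketch below is a simplified model under assumptions, not fold_replay_state_with_options itself: seed_state and continue_fold are hypothetical helpers, and only event_count and last_event_id stand in for the full counter set.

```python
def seed_state(base_state, tenant_id, organization_id, execution_id,
               upcaster_registry_digest=None):
    """Prepare a base state for continuation."""
    # Checksums are dropped here because they are recomputed after the fold.
    state = {k: v for k, v in base_state.items()
             if k not in ("checksum", "projection_checksums")}
    # Caller identity overrides whatever the base recorded.
    state.update(tenant_id=tenant_id, organization_id=organization_id,
                 execution_id=execution_id)
    # Caller's digest wins; None from the caller keeps the base's value.
    if upcaster_registry_digest is not None:
        state["upcaster_registry_digest"] = upcaster_registry_digest
    return state


def continue_fold(state, events):
    """Counters pick up where the base left off."""
    for event in events:
        state["event_count"] = state.get("event_count", 0) + 1
        state["last_event_id"] = event["event_id"]
    return state


base = {"tenant_id": "old", "organization_id": "old-org", "execution_id": 1,
        "event_count": 10, "last_event_id": 110,
        "checksum": {"type": "sha256", "value": "stale"},
        "upcaster_registry_digest": "base-digest"}
seeded = seed_state(base, "t1", "o1", 42)
resumed = continue_fold(seeded, [{"event_id": 111}, {"event_id": 112}])
overridden = seed_state(base, "t1", "o1", 42, upcaster_registry_digest="new-digest")
```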

Out of scope — snapshot storage backend

The HTTP handler stays unchanged for R5; it still folds from event 1 with ReplayFoldOptions::default(). Wiring up a snapshot store + load_snapshot_seed shape + deciding when to seed is a downstream sub-issue against server#148. R5 lands the data-contract round; storage + HTTP opt-in are the next slice.

Kind validation evidence

Built noetl-server-rust:v2.55.0 → loaded into kind → rolled deployment. Re-probe of the prior fanout_reduce execution (322023958058635264) against v2.55.0:

{
  "event_count": 25,
  "last_event_type": "playbook.completed",
  "exec_status": "COMPLETED",
  "commands_n": 4,
  "checksum_type": "sha256",
  "checksum_value_len": 64,
  "projection_checksum_keys": ["business_object", "command",
    "execution", "frame", "loop", "stage"],
  "replay_snapshot_in_keys": false,
  "upcaster_in_keys": false
}

Wire-shape back-compat verified: the new R5 fields don't appear in JSON when their Option<…> is None — R1-R4 consumers see identical output. Snapshot-seeded behaviour is covered by the unit-test layer (no snapshot store in kind yet).

Pointer bumps (this session)

  • repos/server → 3dc6b66 (v2.55.0).

Phase D R5 status

Round  Topic                                         Status
R1     Endpoint scaffold + execution projection      ✅ MERGED (v2.51.0)
R2     stages + frames + commands projections        ✅ MERGED + kind-val GREEN (v2.52.0)
R3     loops + business_objects projections          ✅ MERGED + kind-val GREEN (v2.53.0)
R4     typed Checksum + projection_checksums         ✅ MERGED + kind-val GREEN (v2.54.0)
R5     snapshot seed + base_state + upcaster digest  ✅ MERGED + kind-val GREEN (v2.55.0)
R6     Payload resolver                              ⏳ next
R7     Cross-server parity harness against Python

Pointers

  • noetl/server#148 (umbrella sub-issue; stays open across all 7 rounds).
  • noetl/ai-meta#49 Phase D R5.

2026-06-07 (server#154 merged — Phase D R5 R4: typed Checksum + projection_checksums, v2.54.0, kind-val GREEN)

Agent: Claude · Repos touched: noetl/server (PR merged + image build + kind reload + kind-val), noetl/ai-meta, noetl/ai-meta.wiki

Headline. Phase D R5 fourth slice ships. Every replay fold now produces a typed Checksum over the full state plus a 6-entry projection_checksums map covering every per-projection slot. The hash function is the type of the checksum (per user direction this session) — ChecksumType enum gates future types (BLAKE3, SHA-512, …) without a wire-format break.

What landed

  • noetl/server#154 merged (server@adec21c, v2.54.0, Refs server#148). Extends the R5 R3 fold:
    • ChecksumType enum with the initial variant Sha256 (serializes lowercase snake_case "sha256", matching Python's state["checksum_algorithm"] wire form). Future variants (Blake3, Sha512, …) slot in without a schema-level break.
    • Checksum struct = { type: ChecksumType, value: String } — the value field carries the lowercase-hex digest. Replaces Python's flat state["checksum_algorithm"] + state["checksum"] pair with a typed shape.
    • ReplayState.checksum: Option<Checksum> (skip_serializing_if when None) + ReplayState.projection_checksums: BTreeMap<String, Checksum> (six entries on every fold: execution, stage, frame, command, business_object, loop).
    • stable_json_bytes helper — encodes a value as deterministic JSON (sorted keys recursively + compact separators) matching Python's json.dumps(sort_keys=True, separators=(",", ":")) byte form. Used as the SHA-256 input.
    • compute_checksums runs once at the end of fold_replay_state — per-projection SHA-256 over each typed sub-state (state.execution / state.stages / state.frames / state.commands / state.business_objects / state.loops), then the top-level digest over the full state with projection_checksums populated and checksum field still None (skip_serializing_if handles the self-reference cleanly — the digest doesn't depend on itself).
  • 9 new unit tests covering type-serialization shape, hex output format, deterministic re-runs, projection isolation, top-level self-non-dependence. Server lib 541/0/0 (was 532/0/0).
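The digest scheme can be sketched with the Python calls the bullet names. This is an illustrative model rather than the Rust compute_checksums: the slot mapping is taken from the field names above, and a plain dict stands in for the typed state, so the hex values it yields would not necessarily match the server's.

```python
import hashlib
import json

# Projection name -> state slot, following the names listed above.
PROJECTION_SLOTS = {
    "execution": "execution", "stage": "stages", "frame": "frames",
    "command": "commands", "business_object": "business_objects",
    "loop": "loops",
}


def stable_json_bytes(value) -> bytes:
    """Deterministic JSON: recursively sorted keys, compact separators."""
    return json.dumps(value, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sha256_checksum(value) -> dict:
    digest = hashlib.sha256(stable_json_bytes(value)).hexdigest()
    return {"type": "sha256", "value": digest}


def compute_checksums(state: dict) -> dict:
    out = {k: v for k, v in state.items() if k != "checksum"}
    out["projection_checksums"] = {
        name: sha256_checksum(state.get(slot, {}))
        for name, slot in PROJECTION_SLOTS.items()
    }
    # The right-hand side is evaluated while "checksum" is still absent,
    # so the top-level digest never depends on itself.
    out["checksum"] = sha256_checksum(out)
    return out


first = compute_checksums({"execution": {"status": "COMPLETED"},
                           "commands": {"b": 1, "a": 2}})
second = compute_checksums({"commands": {"a": 2, "b": 1},
                            "execution": {"status": "COMPLETED"}})
```

Key insertion order does not affect the result, which is the determinism property the fold relies on.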

Design decision recorded

The R4 digest hashes over the typed Rust state directly, not through Python's normalize_replayed_<projection>_projection flat-row layer. Reasons documented in the PR body:

  1. The typed BTreeMap ordering + stable_json_bytes sorted-key recursion give the same determinism guarantee.
  2. The Rust state IS the source of truth for the server's view; normalizing to a Python-row shape adds a translation layer the Rust server never reads back.
  3. Cross-Python byte-for-byte parity isn't an R4 requirement — that's explicitly R7's "cross-server parity harness", which can add the normalize functions if/when parity matters.

If R7 finds Python parity needed, it adds the normalize_replayed_*_projection helpers and recomputes the projection_checksums over the normalized list — additive work without touching the R4 wire shape.

Kind validation evidence

Built noetl-server-rust:v2.54.0 → loaded into kind → rolled deployment. Re-probe of the prior fanout_reduce execution (322023958058635264) against v2.54.0:

{
  "event_count": 25,
  "last_event_type": "playbook.completed",
  "exec_status": "COMPLETED",
  "commands_n": 4,
  "loops_n": 0,
  "business_objects_n": 0,
  "checksum": {
    "type": "sha256",
    "value": "41265876487f32350fc60c5039358456ded76598b99e7a0833ac4a17ceaae426"
  },
  "projection_checksum_keys": [
    "business_object", "command", "execution",
    "frame", "loop", "stage"
  ]
}

Sample per-projection entry:

.projection_checksums.command = {
  "type": "sha256",
  "value": "58d8220005758b7f18e27d9042b3ef5fa8ca86471c9d2ea33a869fd0db31231b"
}

Same event log → same digest hex (verified by the fold_checksum_deterministic_across_runs unit test). Every projection's hex differs from every other (different sub-state input → different SHA-256 → distinct identity).

Pointer bumps (this session)

  • repos/server → adec21c (v2.54.0).

Phase D R5 status

Round  Topic                                         Status
R1     Endpoint scaffold + execution projection      ✅ MERGED (v2.51.0)
R2     stages + frames + commands projections        ✅ MERGED + kind-val GREEN (v2.52.0)
R3     loops + business_objects projections          ✅ MERGED + kind-val GREEN (v2.53.0)
R4     typed Checksum + projection_checksums         ✅ MERGED + kind-val GREEN (v2.54.0)
R5     Snapshot seeds + base_state                   ⏳ next
R6     Payload resolver
R7     Cross-server parity harness against Python

Pointers

  • noetl/server#148 (umbrella sub-issue; stays open across all 7 rounds).
  • noetl/ai-meta#49 Phase D R5.

2026-06-07 (server#153 merged — Phase D R5 R3: loops + business_objects projections, v2.53.0, kind-val GREEN)

Agent: Claude · Repos touched: noetl/server (PR merged + image build + kind reload + kind-val), noetl/ai-meta, noetl/ai-meta.wiki

Headline. Phase D R5 third slice ships. The replay fold now populates loops + business_objects projections — the last two per-projection slots. All five map fields (stages / frames / commands / loops / business_objects) are now typed BTreeMaps with deterministic key ordering, ready for R4's typed Checksum + projection_checksums bundle.

What landed

  • noetl/server#153 merged (server@3174c75, v2.53.0, Refs server#148). Extends the R5 R2 fold:
    • Two new typed state structs: ReplayLoopState (loop_id + step_name + total + done + failed + completed + last_event_id) and ReplayBusinessObjectState (object_key + object_type + object_id + status + version + event_count + first_event_id + last_event_id + deleted_event_id + last_event_type + attributes). payload_refs deliberately deferred to R6 (the payload-resolver round).
    • ReplayState.loops + ReplayState.business_objects flip from serde_json::Map placeholders to BTreeMap<String, Replay{Loop,BusinessObject}State> for deterministic key ordering.
    • Two new ID extractors: extract_loop_id mirrors Python's _loop_id (reads meta.loop_id, then meta.loop_event_id, then meta.__loop_epoch_id); extract_business_object_identity mirrors Python's _business_object_identity (reads meta.business_object.{object_type|type} / {object_id|id} first, then flat meta.business_object_{type,id} / meta.object_{type,id}, then parses aggregate_type=business_object + aggregate_id=business_object/<type>/<id> as fallback).
    • business_object_status helper mirrors Python's _business_object_status: explicit non-empty event-row status wins; else suffix-match the event type (.deleted/.removed → DELETED, .created/.updated/.upserted → ACTIVE); else returns None (caller leaves existing status unchanged).
    • Two new populate functions: populate_loop increments done on command.completed/loop.shard.done, failed on command.failed/loop.shard.failed, flips completed=true on loop.done/loop.fanin.completed; populate_business_object updates last_event_id + last_event_type always, bumps event_count, recomputes version (meta.business_object.version || meta.business_object_version || event_count), updates status via the helper above (DELETED also sets deleted_event_id), and applies attribute updates (meta.business_object.state REPLACES, meta.business_object.patch/attributes PATCHES).
  • 13 new unit tests covering ID extraction precedence, loop counter aggregation, business-object lifecycle through three stages (created → updated → deleted), status fallbacks, version sources, no-signal skip path. Server lib 532/0/0 (was 518/0/0).
  • Cleanup follow-up (server@a235b60): R3 doc comments dropped the "canonical" qualifier per writing-style banned-word rule + user direction.
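Two of the R3 helpers are small enough to sketch directly. The Python below follows the precedence rules stated above but is not the shipped code: returning the id as a string and returning an explicit row status unchanged are assumptions, and the sample event types in the usage are made up.

```python
LOOP_ID_KEYS = ("loop_id", "loop_event_id", "__loop_epoch_id")
DELETED_SUFFIXES = (".deleted", ".removed")
ACTIVE_SUFFIXES = (".created", ".updated", ".upserted")


def extract_loop_id(meta: dict):
    """First non-null key wins, in the documented precedence order."""
    for key in LOOP_ID_KEYS:
        value = meta.get(key)
        if value is not None:
            return str(value)
    return None


def business_object_status(row_status, event_type: str):
    """Explicit status first, then event-type suffix, else no signal."""
    if row_status:
        return row_status
    if event_type.endswith(DELETED_SUFFIXES):
        return "DELETED"
    if event_type.endswith(ACTIVE_SUFFIXES):
        return "ACTIVE"
    # None tells the caller to leave the existing status unchanged.
    return None


loop_id = extract_loop_id({"loop_event_id": 5, "__loop_epoch_id": 9})
deleted = business_object_status("", "customer.removed")
```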

R4 design refinement (captured during this session)

Per user direction: Round 4 ships the checksum + projection_checksums fields with a typed shape, NOT Python's flat checksum_algorithm + checksum pair. Reasons: the hash function is the type of the checksum (not a sibling field), and future checksum types (BLAKE3, SHA-512, …) slot in via the enum without a wire-format break.

Proposed Rust types (final names land in the R4 PR):

pub enum ChecksumType { Sha256, /* future: Blake3, Sha512, ... */ }
pub struct Checksum { type: ChecksumType, value: String /* hex */ }
// in ReplayState:
pub checksum: Option<Checksum>,
pub projection_checksums: BTreeMap<String, Checksum>,

Underlying hash stays SHA-256 (same algorithm Python's _canonical_checksum uses). Parity test in R7 asserts byte-for-byte hex-value equality. See server#148 comment for the full design.

Kind validation evidence

Built noetl-server-rust:v2.53.0 → loaded into kind → rolled deployment. Re-probe of the prior fanout_reduce execution (322023958058635264, the R5 R1/R2 reference run):

Field                R2 (v2.52.0)        R3 (v2.53.0)
event_count          25                  25
last_event_type      playbook.completed  playbook.completed
execution.status     COMPLETED           COMPLETED
commands (map size)  4                   4
stages / frames      0 / 0               0 / 0
loops                (R2: untyped Map)   0 (typed BTreeMap)
business_objects     (R2: untyped Map)   0 (typed BTreeMap)

The loops + business_objects maps remain empty for this fixture — the v10 control-flow shape of fanout_reduce doesn't emit loop.* events or business-object metadata. R3's fold correctness for those projections is verified through the unit-test layer (fold_populates_loop_with_counters_and_completion, fold_populates_business_object_through_lifecycle). The wire contract change — typed BTreeMap instead of serde_json::Map — is the load-bearing shift, ready for R4's projection_checksums to hash over.

Pointer bumps (this session)

  • repos/server → 3174c75 (v2.53.0).

Phase D R5 status

Round  Topic                                         Status
R1     Endpoint scaffold + execution projection      ✅ MERGED (v2.51.0)
R2     stages + frames + commands projections        ✅ MERGED + kind-val GREEN (v2.52.0)
R3     loops + business_objects projections          ✅ MERGED + kind-val GREEN (v2.53.0)
R4     typed Checksum + projection_checksums         ⏳ next (design captured on server#148)
R5     Snapshot seeds + base_state
R6     Payload resolver
R7     Cross-server parity harness against Python

Pointers

  • noetl/server#148 (umbrella sub-issue; stays open across all 7 rounds).
  • noetl/ai-meta#49 Phase D R5.

2026-06-07 (server#152 merged — Phase D R5 R2: stages + frames + commands projections, v2.52.0, kind-val GREEN)

Agent: Claude · Repos touched: noetl/server (PR merged + image build + kind reload + kind-val), noetl/ai-meta, noetl/ai-meta.wiki

Headline. Phase D R5 second slice ships. The replay fold now populates stages + frames + commands projections from the event stream — mirrors Python's state["stages"] / state["frames"] / state["commands"] per-projection dicts from noetl/server/api/replay/service.py.

What landed

  • noetl/server#152 merged (server@266b4c7, v2.52.0, Refs server#148). Extends the R5 R1 fold:
    • ReplayEventRow extended with stage_id / frame_id / command_id / worker_id / aggregate_type / aggregate_id / meta columns (all #[sqlx(default)] for back-compat). SQL queries updated across all three load_events variants; R1's AT TIME ZONE 'UTC' AS created_at cast preserved.
    • Three new typed state structs (ReplayStageState / ReplayFrameState / ReplayCommandState) replacing R1's serde_json::Map placeholders. ReplayState.{stages,frames,commands} flip from serde_json::Map to BTreeMap<String, Replay{Stage,Frame,Command}State> for deterministic key ordering (matters when R4 lands the typed Checksum + projection_checksums).
    • Three new ID extractors (extract_stage_id / extract_frame_id / extract_command_id) mirror Python's resolution order: top-level column → aggregate_type+aggregate_id fallback (e.g. aggregate_type=stage+aggregate_id=stage/<id>) → meta.<key> fallback.
    • Three new populate functions with full status-transition coverage: stage opened → OPEN / closed → CLOSED; frame dispatched → CLAIMED / started → RUNNING / committed → COMPLETED / failed → FAILED / abandoned → ABANDONED; command full lifecycle (issued → PENDING / claimed → CLAIMED / started → RUNNING / completed → COMPLETED (or uppercased event-type suffix) / failed → FAILED / cancelled → CANCELLED). Each populator is a no-op when the event doesn't carry the relevant identity.
  • 10 new unit tests covering ID extraction precedence, lifecycle population per projection, status-fallback defaults, single-event multi-projection updates, no-identity skip path. Server lib 518/0/0 (was 508/0/0).
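The resolution order and the frame transitions can be sketched as below. This is a Python illustration under assumptions, not the Rust extractors: meta.stage_id as the meta key and the frame.<suffix> event naming are guesses consistent with the text, and ids are returned as strings for simplicity.

```python
FRAME_STATUS = {
    "dispatched": "CLAIMED",
    "started": "RUNNING",
    "committed": "COMPLETED",
    "failed": "FAILED",
    "abandoned": "ABANDONED",
}


def extract_stage_id(event: dict):
    """Top-level column, then aggregate fallback, then meta fallback."""
    if event.get("stage_id") is not None:
        return str(event["stage_id"])
    if event.get("aggregate_type") == "stage":
        aggregate_id = event.get("aggregate_id") or ""
        if aggregate_id.startswith("stage/"):
            return aggregate_id[len("stage/"):]
    value = (event.get("meta") or {}).get("stage_id")
    return None if value is None else str(value)


def frame_status(event_type: str, current=None):
    """Map the event-type suffix to a status; unknown suffixes keep the current one."""
    suffix = event_type.rsplit(".", 1)[-1]
    return FRAME_STATUS.get(suffix, current)


from_column = extract_stage_id(
    {"stage_id": 7, "aggregate_type": "stage", "aggregate_id": "stage/9"}
)
from_aggregate = extract_stage_id({"aggregate_type": "stage", "aggregate_id": "stage/9"})
```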

Kind validation evidence

Built noetl-server-rust:v2.52.0 → loaded into kind → rolled deployment. Re-probe of the prior fanout_reduce execution (322023958058635264, the R5 R1 reference run):

Field                R1 (v2.51.0)        R2 (v2.52.0)
event_count          25                  25
last_event_type      playbook.completed  playbook.completed
execution.status     COMPLETED           COMPLETED
commands (map size)  0 (placeholder)     4 populated
stages / frames      0 / 0               0 / 0

The 4 populated command entries (one per dispatched command in the fixture: start, normalize_customer, enrich_customer, reduce_customer) each carry worker_id, issued_event_id, and last_event_id from the event row; status sits at RUNNING because the fanout_reduce v10 fixture emits step-level termination events (step.exit) rather than command.completed — exactly the shape R2's PR body documented. stages + frames stay empty for the same reason (no stage.* / frame.* events in the v10 control-flow shape). Fold logic correctness verified through the unit-test layer.

Pointer bumps (this session)

  • repos/server → 266b4c7 (v2.52.0).

Phase D R5 status

Round  Topic                                         Status
R1     Endpoint scaffold + execution projection      ✅ MERGED (v2.51.0)
R2     stages + frames + commands projections        ✅ MERGED + kind-val GREEN (v2.52.0)
R3     loops + business_objects projections          ⏳ next
R4     typed Checksum + projection_checksums
R5     Snapshot seeds + base_state
R6     Payload resolver
R7     Cross-server parity harness against Python

Pointers

  • noetl/server#148 (umbrella sub-issue; stays open across all 7 rounds).
  • noetl/ai-meta#49 Phase D R5.

2026-06-07 (server#149 merged — Phase D R5 R1: Replay endpoint scaffold + execution projection, v2.51.0)

Agent: Claude · Repos touched: noetl/server (PR merged + sub-issue opened), noetl/ai-meta, noetl/ai-meta.wiki

Headline. Opens Phase D R5 — the Replay engine port (Python's noetl/server/api/replay/service.py ~1236 LoC → Rust). Round 1 ships the endpoint scaffold + the minimal execution projection on top of the Phase D R4 foundation; subsequent rounds extend the fold incrementally without touching the surface again.

What landed

  • noetl/server sub-issue server#148 documenting the 7-round decomposition (R1 scaffold + execution / R2 stages+frames+commands / R3 loops+business_objects / R4 typed Checksum + projection_checksums / R5 snapshot seeds / R6 payload resolver / R7 cross-server parity harness).
  • noetl/server#149 merged (server@f77dead, v2.51.0). New GET /api/replay/state route mirroring Python's endpoint.py byte-for-byte (query params + defaults + projection enum + mutually-exclusive cutoffs returning 400). New services::replay module: ReplayService, ReplayCutoff, ReplayProjection, ReplayState, ReplayExecutionState, pure deterministic fold_replay_state. Round 1 only fills the execution projection (status + last_node_name) using the same terminal-event short-circuit pattern Phase D R4 landed in the orchestrator + status endpoint. 9 new unit tests (8 service + 1 handler). Server lib 508/0/0 (was 499/0/0).

Pointer bumps (this session)

  • repos/server → f77dead (v2.51.0).

Wire contract

GET /api/replay/state?execution_id=<i64>[&tenant_id=...][&organization_id=...]
                     [&as_of_event_id=N | &as_of_position=N | &as_of_time=<rfc3339>]
                     [&projection=execution|stage|frame|command|business_object|loop|all]
                     [&limit=N] [&resolve_payloads=bool]

Returns {tenant_id, organization_id, execution_id, projection, event_count, last_event_id, last_event_type, execution:{status, last_node_name}, stages:{}, frames:{}, commands:{}, business_objects:{}, loops:{}} — the map fields stay empty in R1 and populate in R2/R3.
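The mutually-exclusive cutoff rule from the wire contract can be sketched as a small validator. This Python is illustrative only: validate_cutoffs is a hypothetical name, and the error message text is invented, since the log records only that conflicting cutoffs return 400.

```python
CUTOFF_PARAMS = ("as_of_event_id", "as_of_position", "as_of_time")


def validate_cutoffs(query: dict):
    """Return (http_status, detail); at most one as_of_* cutoff is allowed."""
    supplied = [name for name in CUTOFF_PARAMS if query.get(name) is not None]
    if len(supplied) > 1:
        return 400, "cutoffs are mutually exclusive: " + ", ".join(supplied)
    # detail is the single cutoff in effect, or None for "latest".
    return 200, (supplied[0] if supplied else None)


ok = validate_cutoffs({"execution_id": 1, "as_of_event_id": 5})
conflict = validate_cutoffs(
    {"execution_id": 1, "as_of_event_id": 5, "as_of_time": "2026-06-07T00:00:00Z"}
)
```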

Phase D R5 status

| Round | Topic | Status |
|---|---|---|
| R1 | Endpoint scaffold + execution projection | ✅ MERGED (v2.51.0) |
| R2 | stages + frames + commands projections | ⏳ next |
| R3 | loops + business_objects projections | |
| R4 | typed Checksum + projection_checksums | |
| R5 | Snapshot seeds + base_state | |
| R6 | Payload resolver | |
| R7 | Cross-server parity harness against Python | |

Pointers

  • noetl/server#148 (umbrella sub-issue; stays open across all 7 rounds).
  • noetl/ai-meta#49 Phase D R5 (the orchestrator + read-side status both already green; Replay is the natural next big piece).

2026-06-07 (server#147 merged — status query bug fixed + kind-validated, v2.50.1)

Agent: Claude · Repos touched: noetl/server (PR merged), noetl-server-rust image build + kind reload, noetl/ai-meta, noetl/ai-meta.wiki

Headline. Phase D R4 follow-up: the read-side status query bug surfaced during the fanout_reduce kind-val (filed as server#146 earlier this session) is fixed + kind-validated. Operators querying GET /api/executions/{id}/status now correctly see COMPLETED / FAILED the moment the orchestrator emits a terminal event.

What landed

  • noetl/server#147 merged (server@d26abf8, v2.50.1, closes server#146). Two compounding fixes:
    1. New terminal SQL query in ExecutionService::get_status looks up playbook.completed / playbook.failed events FIRST and returns COMPLETED / FAILED directly — mirrors the list() endpoint's existing bool_or(playbook.completed) semantics on the per-execution path.
    2. Widened the completed_steps SQL filter to accept status IN ('COMPLETED', 'completed', 'success') so the progress counter matches what workers actually emit (command.completed events carry status='success' lowercase).
  • 6 new unit tests for the in-memory determine_status helper; server lib went 493/0/0 → 499/0/0.
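The two compounding fixes can be illustrated with a small in-memory model. This is a sketch in Python under stated assumptions, not the actual SQL or the Rust determine_status helper: the event dict shape, the `FAILED`-before-`COMPLETED` precedence, and counting one completed step per `command.completed` event are all choices made for the example.

```python
NARROW = {"COMPLETED"}                           # pre-fix filter
WIDENED = {"COMPLETED", "completed", "success"}  # post-fix filter


def determine_status(events):
    """Terminal events win over everything else (checked first)."""
    types = {e["event_type"] for e in events}
    if "playbook.failed" in types:  # precedence is an assumption
        return "FAILED"
    if "playbook.completed" in types:
        return "COMPLETED"
    return "RUNNING"


def completed_steps(events, accepted=WIDENED):
    """Count command.completed events whose status passes the filter."""
    return sum(
        1
        for e in events
        if e["event_type"] == "command.completed" and e.get("status") in accepted
    )


# Workers emit lowercase status='success', so the narrow filter sees nothing.
run = [
    {"event_type": "command.completed", "node_name": n, "status": "success"}
    for n in ("start", "normalize_customer", "enrich_customer", "reduce_customer")
] + [{"event_type": "playbook.completed", "node_name": "playbook"}]

print(determine_status(run), completed_steps(run, NARROW), completed_steps(run))
```

With the four worker events of a fanout_reduce run, the narrow filter counts 0 and the widened filter counts 4.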

Kind validation evidence

Built noetl-server-rust:v2.50.1 → loaded into kind → rolled deployment.

Re-query of prior execution (322018338286866432, the fanout_reduce_phase6 run from earlier in session):

| Field | Pre-fix | Post-fix |
|---|---|---|
| status | "RUNNING" | "COMPLETED" |
| progress.completed_steps | 0 | 4 |

Fresh fanout_reduce execution (322023958058635264):

  • Started 14:48:15.172, COMPLETED returned at 14:48:15.782 — single 2s poll interval was sufficient (previously stuck on RUNNING indefinitely).
  • All three Phase D R4 barrier assertions also green on this fresh run.

Pointer bumps (this session)

  • repos/server → d26abf8 (v2.50.1).

Pointers

  • noetl/server#146 (closed via server#147 merge).
  • noetl/ai-meta#49 Phase D R4 (orchestrator + read-side status both GREEN end-to-end now).

2026-06-07 (Phase D R4 — fanout_reduce kind-val GREEN on Rust-only stack)

Agent: Claude · Repos touched: noetl-server-rust image build + kind reload, noetl/ai-meta.wiki, noetl/server (status-bug sub-issue), noetl/ai-meta#49 comment

Headline. Phase D R4 evidence captured end-to-end on the Rust-only kind cluster. Built noetl-server-rust:v2.50.0 from server@499b079 via podman + Dockerfile, loaded into kind, rolled the deployment, and ran the fanout_reduce_phase6 fixture from the just-merged e2e@5da36ea rig. All three barrier assertions pass.

What landed

  • Image built + loaded into kind: localhost/noetl-server-rust:v2.50.0 (podman build → kind load image-archive → kubectl set image). Server pod uptime confirms v2.50.0 via /api/health: {"status":"ok","version":"2.50.0"}.
  • fanout_reduce_phase6 ran on Rust-only stack: noetl-server-rust + noetl-worker-rust + noetl-worker-system-pool (Python at 0). Execution_id 322018338286866432 completed in 550ms wall.

Barrier assertions (all GREEN)

Direct DB query against noetl.event:

command.completed  start              @ 14:25:56.704
step.enter         normalize_customer @ 14:25:56.708
step.enter         enrich_customer    @ 14:25:56.710
command.completed  normalize_customer @ 14:25:56.954
command.completed  enrich_customer    @ 14:25:57.043   ← barrier waited for THIS
step.enter         reduce_customer    @ 14:25:57.051   ← then dispatched ONCE
command.completed  reduce_customer    @ 14:25:57.249
playbook.completed playbook           @ 14:25:57.254

| Assertion | Expected | Observed | Status |
|---|---|---|---|
| 1. playbook.completed event | exists | event_id 322018349171085312 @ 14:25:57.254 | GREEN |
| 2. step.enter for reduce_customer | count = 1 | 1 (event_id 322018348319641600) | GREEN |
| 3. reduce_customer.command.completed AFTER both branches | r > a ∧ r > b | r=14:25:57.248 > a=14:25:56.954 ∧ > b=14:25:57.042 | GREEN |

Server log confirms: Step 'reduce_customer' already dispatched in this pass, skipping (same-pass dedup caught the sibling arc) + Orchestrator marked execution as terminal terminal_event=playbook.completed.
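The three barrier assertions can be checked mechanically against the event rows listed above. The sketch below is an illustration in Python, not the rig script: the tuple layout and the `check_barrier` helper are hypothetical, while the event types, step names, and timestamps are copied from the DB query output in this entry.

```python
from datetime import time

# (event_type, node_name, timestamp) rows copied from the noetl.event query above.
EVENTS = [
    ("command.completed", "start", "14:25:56.704"),
    ("step.enter", "normalize_customer", "14:25:56.708"),
    ("step.enter", "enrich_customer", "14:25:56.710"),
    ("command.completed", "normalize_customer", "14:25:56.954"),
    ("command.completed", "enrich_customer", "14:25:57.043"),
    ("step.enter", "reduce_customer", "14:25:57.051"),
    ("command.completed", "reduce_customer", "14:25:57.249"),
    ("playbook.completed", "playbook", "14:25:57.254"),
]


def when(events, event_type, node):
    """Timestamps of every matching event, in log order."""
    return [time.fromisoformat(ts) for t, n, ts in events if t == event_type and n == node]


def check_barrier(events, reduce_step, branches):
    reduce_done = when(events, "command.completed", reduce_step)
    branch_done = [when(events, "command.completed", b) for b in branches]
    return {
        # 1. the execution reached its terminal event
        "playbook_completed": any(t == "playbook.completed" for t, _, _ in events),
        # 2. the barrier dispatched the reduce step exactly once
        "single_dispatch": len(when(events, "step.enter", reduce_step)) == 1,
        # 3. the reduce step completed after every branch completed
        "reduce_after_branches": bool(reduce_done)
        and all(done and reduce_done[0] > max(done) for done in branch_done),
    }


result = check_barrier(EVENTS, "reduce_customer", ["normalize_customer", "enrich_customer"])
print(result)
```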

Status-query bug surfaced + filed as follow-up

Separately, the GET /api/executions/{id}/status endpoint continued to return RUNNING for 90s+ after playbook.completed landed in the event log. Read-side bug — doesn't affect orchestrator correctness — filed as noetl/server#146.

Phase D R4 — CLOSED at the orchestrator level

| Slice | PR | Status |
|---|---|---|
| 1: fan-in / reduce barrier | server#143 v2.49.0 | ✅ MERGED + kind-green |
| 2: apply_event step.skipped | server#145 v2.50.0 | ✅ MERGED + kind-green |
| 3: fanout_reduce kind-val rig | e2e#32 | ✅ MERGED + ran |
| 4: context-merging | (TBD) | ⏳ Deferred: kind-val passed without it; playbook templates can already read {{ <upstream>.<field> }} |

The rig script was not run as-is: the v4.8 CLI surface differs from the noetl playbook register shape the script assumes, so the manual run used a mix of v4.8 commands + direct DB queries. The rig's three core assertions are valid + green; the script can be updated in a follow-up to match the v4.8 CLI surface.

Next housekeeping

  • noetl/server#146 — fix the status-query bug so GET /api/executions/{id}/status correctly returns COMPLETED after playbook.completed lands.
  • Optional: update the kind-val rig to match v4.8 CLI shape (noetl register playbook --file vs noetl playbook register --file).

2026-06-07 (e2e#32 merged — Phase D R4 fanout_reduce kind-val rig ready)

Agent: Claude · Repos touched: noetl/e2e (PR merged), noetl/ai-meta (pointer bump), noetl/ai-meta.wiki

Headline. Phase D R4 slice 3: the kind-val rig that runs the canonical fanout_reduce shape end-to-end against the orchestrator's fan-in / reduce barrier (server v2.49.0 + v2.50.0). Durable fixture + script land under noetl/e2e; the actual run-on-kind is gated on a fresh server image rolling into the cluster.

What landed

  • noetl/e2e#32 merged (e2e@5da36ea, closes e2e#31). Two artefacts:
    • fixtures/playbooks/fanout_reduce/fanout_reduce_phase6.yaml — copy of the Python reference fixture (start → branch_a/branch_b → reduce_customer → end) with a header comment documenting the orchestrator contract being exercised.
    • scripts/kind_validate_fanout_reduce.sh — modeled after kind_validate_container_callback.sh. Three assertions on the resulting event log: (1) final execution status COMPLETED, (2) exactly one step.enter for reduce_customer (barrier prevented double-dispatch), (3) reduce_customer.command.completed arrives AFTER both branches' command.completed (orchestrator waited for both upstreams).

Pointer bumps (this session)

  • repos/e2e → 5da36ea.

Phase D R4 status

| Slice | PR | Status |
|---|---|---|
| 1: fan-in / reduce barrier | server#143 v2.49.0 | ✅ MERGED |
| 2: apply_event step.skipped | server#145 v2.50.0 | ✅ MERGED |
| 3: fanout_reduce kind-val rig | e2e#32 | ✅ MERGED |
| 4: context-merging | (TBD) | ⏳ Deferred: playbook templates can read {{ <upstream>.<field> }} individually today; not opening until the kind-val actually surfaces a need |

Next housekeeping

  • Build a fresh noetl-server image carrying v2.50.0 + load into kind. Then run e2e/scripts/kind_validate_fanout_reduce.sh to capture the green-state evidence for Phase D R4.

Pointers

  • noetl/e2e#31 (closed via e2e#32 merge).
  • noetl/ai-meta#49 Phase D R4 (slices 1-3 shipped; kind-val run-on-kind is the next housekeeping step).

2026-06-07 (server#145 merged — Phase D R4 slice 2 closes the barrier follow-up, v2.50.0)

Agent: Claude · Repos touched: noetl/server (PR merged), noetl/ai-meta (pointer bump), noetl/ai-meta.wiki

Headline. Phase D R4 slice 2: apply_event now handles step.skipped events, closing the gap exposed by the slice 1 PR's #[ignore] test. Fan-in barrier now correctly treats a guard-skipped upstream as terminal — no more deferred-forever reduce steps.

What landed

  • noetl/server#145 merged (server@499b079, v2.50.0, closes server#144). New "step.skipped" | "step_skipped" arm in state::WorkflowState::apply_event records the step into state.steps with StepState::Skipped and sets entered_at + completed_at to the event timestamp. is_step_done already treated Skipped as terminal (state.rs:540) — the missing piece was the apply_event mapping. Slice 1's #[ignore] test (test_reduce_step_treats_skipped_upstream_as_done) flipped to active; 2 new state-level tests added. Server lib went from 490/0/+1 ignored to 493/0/0 ignored.
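The missing piece was the event-to-state mapping, which the sketch below models in Python. It is an illustration only: the dict-based step store, the `step.enter` and `command.completed` arms, and treating `FAILED` as terminal are assumptions for the example. The two accepted event spellings and the rule that a skipped step gets identical entered_at and completed_at timestamps come from the entry above.

```python
from enum import Enum


class StepState(Enum):
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    SKIPPED = "skipped"


# Skipped counts as terminal, so a fan-in barrier stops waiting for it.
TERMINAL = {StepState.COMPLETED, StepState.FAILED, StepState.SKIPPED}


def apply_event(steps, event_type, step, ts):
    """Fold one event into the per-step state map (mutates `steps`)."""
    if event_type in ("step.skipped", "step_skipped"):
        # A skipped step never ran: entered and completed at the same instant.
        steps[step] = {"state": StepState.SKIPPED, "entered_at": ts, "completed_at": ts}
    elif event_type == "step.enter":
        steps[step] = {"state": StepState.RUNNING, "entered_at": ts, "completed_at": None}
    elif event_type == "command.completed" and step in steps:
        steps[step].update(state=StepState.COMPLETED, completed_at=ts)


def is_step_done(steps, step):
    return step in steps and steps[step]["state"] in TERMINAL
```

Without the first arm, a guard-skipped upstream never appears in the state map, so any reduce step waiting on it is deferred forever.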

Pointer bumps (this session)

  • repos/server → 499b079 (v2.50.0).

Phase D R4 status

| Slice | PR | Status |
|---|---|---|
| 1: fan-in / reduce barrier | #143 v2.49.0 | ✅ MERGED |
| 2: apply_event step.skipped | #145 v2.50.0 | ✅ MERGED |
| 3: kind-val with fanout_reduce_phase6 fixture | (not yet) | ⏳ pending |
| 4: context-merging (optional; playbook templates can already read {{ <upstream>.<field> }}) | (TBD) | ⏳ deferred until kind-val signals a need |

Pointers

  • noetl/server#144 (closed via server#145 merge).
  • noetl/ai-meta#49 Phase D R4 (slice 1 + 2 shipped at the orchestrator level; remaining work is kind-validation + possibly context-merging).

2026-06-07 (worker#60 merged — Container Tool Callback umbrella #43 Round 4 worker-side adoption complete, v5.14.0)

Agent: Claude · Repos touched: noetl/worker (PR merged), noetl/ai-meta (pointer bump), noetl/ai-meta.wiki

Headline. The last follow-up after the closed Container Tool Callback umbrella (noetl/ai-meta#43) ships. Worker now recognises ToolResult.pending_callback = Some(true) and skips its own call.done emit — terminal call.done arrives via the server's /api/internal/container-callback/... endpoint (Round 2, v2.48.0) driven by noetl-k8s-watcher (Round 1, ops@8892043).

What landed

  • noetl/worker#60 merged (worker@f96da71, v5.14.0, closes worker#59). executor::command checks tool_result.pending_callback after a successful tool execution: when Some(true), it logs INFO (with execution_id), bumps the new noetl_worker_call_done_skipped_pending_callback_total{tool_kind} counter, and skips its own call.done emit. When None (every existing tool, default), the existing emit path is preserved bit-for-bit. Cargo.toml: noetl-tools = "2.18" → "2.21", noetl-executor = "0.3" → "0.4" (Cargo.lock resolves 0.4.1, published earlier this session via cli#56). 126/0 lib tests against published noetl-executor 0.4.1.
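The worker-side decision reduces to a single branch on the marker. The Python sketch below is a hypothetical model of it, not the Rust executor::command code: the `finish_command` name, the `emit` callback, and the in-process `Counter` standing in for the Prometheus counter are assumptions. Treating an explicit `False` the same as `None` is also an assumption, since the entry only specifies the `Some(true)` and `None` cases.

```python
from collections import Counter
from typing import Callable, Optional

# Stand-in for noetl_worker_call_done_skipped_pending_callback_total{tool_kind}.
skipped_pending_callback = Counter()


def finish_command(tool_kind: str, pending_callback: Optional[bool],
                   emit: Callable[[str], None]) -> bool:
    """Return True when the worker emitted call.done itself."""
    if pending_callback is True:
        # An external callback will deliver the terminal call.done later.
        skipped_pending_callback[tool_kind] += 1
        return False
    emit("call.done")  # None (and, by assumption, False): unchanged emit path
    return True
```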

Pointer bumps (this session)

  • repos/worker → f96da71 (worker v5.14.0).

Remaining work on the umbrella

The umbrella itself was closed earlier this session with all four Rust rounds shipped (ops watcher, server callback endpoint, Tool::Container + marker field, e2e kind-val rig). Round 4 worker-side adoption was the LAST coordinated follow-up. Still on the punch list (chip-level, not blocking):

  1. Cloud Build a fresh worker image carrying v5.14.0 + load into kind.
  2. Re-run e2e/scripts/kind_validate_container_callback.sh against the new image.
  3. Expected dashboard fingerprint after Round 4 lands: server's noetl_container_callback_stale_total{state} stops moving (the race window closes — the worker no longer emits early call.done); worker's noetl_worker_call_done_skipped_pending_callback_total{tool_kind="container"} ≈ server's noetl_container_callback_total{state=...} becomes the healthy-steady-state signal.

Pointers

  • noetl/worker#59 (closed via worker#60 merge).
  • noetl/ai-meta#43 (umbrella was closed earlier; this trail wraps the last follow-up).

2026-06-07 (cli#56 + server#143 merged — noetl-executor 0.4.1 released, Phase D R4 fan-in / reduce barrier landed v2.49.0)

Agent: Claude · Repos touched: noetl/cli (PR merged), noetl/server (PR merged), noetl/ai-meta (pointer bumps), noetl/ai-meta.wiki

Headline. Two PRs from earlier this session merged. cli#56 shipped noetl-executor 0.4.1 (bridge propagates the new ToolResult.pending_callback field — unblocks the worker-side adoption PR that's still draft). server#143 shipped the Phase D R4 first slice — fan-in / reduce barrier — closing the cross-pass bug where a reduce step fired on the first completing upstream instead of waiting for all.

What landed

  • noetl/cli#56 merged (cli@77be8be, v4.10.0, closes cli#55). noetl-executor 0.4.1 patch: tools_bridge::reshape_duckdb_result propagates result.pending_callback unchanged; HTTP test fixture pins pending_callback: None; noetl-tools dep bumped to ^2.21. 102/0 unit. The cli release pipeline (release-cli workflow) is publishing noetl-executor 0.4.1 to crates.io now.
  • noetl/server#143 merged (server@be37e5c, v2.49.0, closes server#142). Phase D R4 first slice — fan-in / reduce barrier (tracks noetl/ai-meta#49). Orchestrator gates dispatch of any step with incoming_arcs(target).len() > 1 until ALL upstream steps reach a terminal state. New module-private build_incoming_arcs(steps) helper mirrors the Python planner's incoming map across all four NextSpec variants. Three active tests + one #[ignore] documenting an apply_event step.skipped follow-up (the barrier code already handles Skipped on the reconstructed-state side via is_step_done; missing piece is the apply_event arm). Server lib 490/0 + 1 ignored.
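The barrier rule itself is compact: a step with more than one incoming arc dispatches only once every upstream step is terminal. The Python sketch below illustrates that rule under simplifying assumptions. It models `next` as a plain list of step names rather than the four NextSpec variants, and `ready_to_dispatch` is a hypothetical name.

```python
from collections import defaultdict


def build_incoming_arcs(steps):
    """Map each target step to the set of steps that point at it."""
    incoming = defaultdict(set)
    for step in steps:
        for target in step.get("next", []):
            incoming[target].add(step["name"])
    return incoming


def ready_to_dispatch(target, incoming, done):
    """Gate fan-in steps until ALL upstream steps are terminal."""
    upstream = incoming.get(target, set())
    if len(upstream) <= 1:
        return True  # no fan-in, so the barrier does not apply
    return upstream <= done  # subset test: every upstream is in `done`


# The fanout_reduce shape exercised by the kind-val fixture.
PLAYBOOK = [
    {"name": "start", "next": ["normalize_customer", "enrich_customer"]},
    {"name": "normalize_customer", "next": ["reduce_customer"]},
    {"name": "enrich_customer", "next": ["reduce_customer"]},
    {"name": "reduce_customer", "next": []},
]
arcs = build_incoming_arcs(PLAYBOOK)
```

With only `normalize_customer` done, `reduce_customer` stays gated; that is the cross-pass bug the slice closes, where the first completing upstream used to trigger the reduce step.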

Pointer bumps (this session)

  • repos/cli → 77be8be (cli v4.10.0, noetl-executor 0.4.1).
  • repos/server → be37e5c (server v2.49.0).

Worker PR #60 status

Still draft. cargo build against the local crates.io cache resolves noetl-executor 0.4.0 (not 0.4.1 yet) — the release-cli workflow is in flight. Once it publishes 0.4.1, cargo update -p noetl-executor picks it up and the worker PR can be un-drafted + merged. Pointer bump for worker waits on that.

Pointers

  • noetl/cli#55 (closed via cli#56 merge).
  • noetl/server#142 (closed via server#143 merge).
  • noetl/ai-meta#49 Phase D R4 (in progress; this slice closes the barrier-gate piece; context-merging slice + kind-val with fanout_reduce fixture are separate follow-ups).
  • Umbrella-Container-Tool-Callback page (worker#60 Round-4-follow-up table stays current — still blocked on the publish).

2026-06-07 (Container Tool Callback umbrella #43 Round 4 worker-side pending_callback adoption — PRs open, blocked on noetl-executor 0.4.1 publish)

Agent: Claude · Repos touched: noetl/cli (PR), noetl/worker (PR), noetl/ai-meta.wiki

Headline. Worker-side adoption of the pending_callback marker (the last follow-up after the just-closed Container Tool Callback umbrella #43) coded + PRs opened. Two-PR chain across noetl/cli (executor bridge propagation) → noetl/worker (call.done skip + counter). Both ride a fresh ai-task sub-issue per repo.

What landed

  • noetl/cli sub-issue noetl/cli#55 + PR #56. noetl-executor 0.4.1 (patch): tools_bridge::reshape_duckdb_result propagates the new pending_callback: Option<bool> field through unchanged; bridge test fixture pins pending_callback: None. noetl-tools dep bumped to ^2.21. Built + tested: 102/0 unit. Closes noetl/cli#55, refs noetl/ai-meta#43.
  • noetl/worker sub-issue noetl/worker#59 + PR #60 (draft, blocked). executor::command checks tool_result.pending_callback after successful tool execution. When Some(true): tracing INFO with execution_id + bumps new counter noetl_worker_call_done_skipped_pending_callback_total{tool_kind} + skips its own call.done emit. When None (every existing tool today, default): behaviour preserved bit-for-bit. Cargo.toml: noetl-tools = "2.18" → "2.21", noetl-executor = "0.3" → "0.4". New unit test in metrics.rs. Local validation against the patched cli executor: 126/0 lib tests pass. Closes noetl/worker#59, refs noetl/ai-meta#43, depends on noetl/cli#56.

Why the worker PR is draft

CI on noetl/worker#60 fails until noetl-executor 0.4.1 publishes to crates.io. Sequence:

  1. Merge noetl/cli#56 (the bridge propagation patch).
  2. Tag + publish noetl-executor 0.4.1.
  3. CI on noetl/worker#60 turns green automatically; reviewers can un-draft.

The full Round 4 also will need a kind-validation pass against the e2e rig at e2e/scripts/kind_validate_container_callback.sh once the worker image is rebuilt with this PR — same rig that landed in umbrella #43 Round 5.

Why this matters

After Round 4 lands + kind-validates, the container-callback chain runs in the steady-state shape: worker dispatches the K8s Job, releases its slot WITHOUT emitting call.done, the watcher detects the terminal Pod state, and the server's /api/internal/container-callback/... endpoint emits the only call.done for the step. The server's noetl_container_callback_stale_total{state} counter goes back to ~0 (the race window closes), and dashboards can read worker.skipped_total{tool_kind="container"} ≈ server.container_callback_total as the healthy-steady-state fingerprint.

Pointers

  • noetl/cli#55 (sub-issue) → noetl/cli#56 (PR).
  • noetl/worker#59 (sub-issue) → noetl/worker#60 (PR, draft).
  • Container Tool Callback umbrella: noetl/ai-meta#43 (CLOSED).
  • Wiki: Umbrella-Container-Tool-Callback (the closed umbrella's "Remaining follow-up" section gets a status note in the same change set).

2026-06-07 (Container Tool Callback umbrella #43 CLOSED — Round 5 e2e kind-val rig landed e2e@17de21d)

Agent: Claude · Repos touched: noetl/e2e (PR), noetl/ai-meta.wiki

Headline. Last round of the Container Tool Callback umbrella ships. All four Rust rounds are in; the umbrella closes with the e2e kind-val rig that proves Rounds 1 + 2 + 3 wire together end-to-end.

What landed

  • Round 5 of noetl/ai-meta#43 (e2e#30, closed e2e#29; commit 17de21d). 3 new files (412 lines):
    • fixtures/playbooks/container_callback_happy_path/container_callback_happy_path.yaml — 3-step playbook (python init → kind: container alpine echo + sleep + echo → python complete). Expected terminal state: succeeded.
    • fixtures/playbooks/container_callback_oom/container_callback_oom.yaml — same shape; dispatch step runs python:3.12-alpine with a 40 MiB bytes() allocation under a 32Mi memory limit. Expected: failed_oom (the watcher's jq classifier maps pod state.terminated.reason == "OOMKilled"). Python bytes() over dd if=/dev/zero because Python's allocation actually touches physical pages (defeats lazy-allocation optimisations).
    • scripts/kind_validate_container_callback.sh — rig that preflights (kubectl + noetl + curl in PATH; watcher Deployment exists + rolled out), registers + executes each fixture, scrapes the server's /metrics BEFORE + AFTER, asserts the sum of noetl_container_callback_total{state=...} + noetl_container_callback_stale_total{state=...} moved by ≥ 1. The sum-both-counters strategy handles the worker-side pending_callback adoption transition. On failure: dumps watcher logs (tail 50) + server logs filtered to /container-callback/. Returns 0 if both probes pass; 1 otherwise.
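The sum-both-counters check can be sketched as a small parser over two /metrics scrapes. This Python version is an illustration only; the rig itself is a shell script, and `callback_sum` is a hypothetical helper. It assumes plain Prometheus text exposition lines of the form `name{labels} value` with no trailing timestamp.

```python
COUNTERS = (
    "noetl_container_callback_total",
    "noetl_container_callback_stale_total",
)


def callback_sum(metrics_text: str) -> float:
    """Sum every sample of both callback counters across all label sets."""
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE lines
            continue
        name = line.split("{", 1)[0].split()[0]
        if name in COUNTERS:
            total += float(line.rsplit(None, 1)[1])  # value is the last field
    return total


before = """# TYPE noetl_container_callback_total counter
noetl_container_callback_total{state="succeeded"} 2
noetl_container_callback_stale_total{state="succeeded"} 1
other_total 9
"""
after = before.replace('stale_total{state="succeeded"} 1', 'stale_total{state="succeeded"} 2')

moved = callback_sum(after) - callback_sum(before)
print(moved >= 1)
```

Summing both counters keeps the probe valid on either side of the worker-side adoption: before it, the callback lands in the stale counter; after it, in the main counter.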

Container Tool Callback umbrella — final inventory

| Round | Repo | PR / commit | Status |
|---|---|---|---|
| 1 | noetl/ops | #167 → 8892043 | CLOSED |
| 2 | noetl/server | #141 → v2.48.0 | CLOSED |
| 3 | noetl/tools | #37 → v2.21.0 | CLOSED |
| 4 | noetl/noetl (Python) | | Parked per Rust-only standing direction |
| 5 | noetl/e2e | #30 → 17de21d | CLOSED |

The umbrella closes. Worker-side pending_callback adoption (suppressing the worker's own call.done emit when the marker is set) remains as a follow-up tracked under the umbrella; harmless during the transition (the watcher's callback is recorded by noetl_container_callback_stale_total, which is the migration dashboard signal).

Pointer bumps + wiki sweep

  • Bump repos/e2e to commit 17de21d.
  • ai-meta wiki: Home (Last refreshed + #43 moved from Active umbrellas to Recently closed; preamble count Three → Two), Sessions-Log (this entry), Umbrella-Container-Tool-Callback (mark CLOSED).

2026-06-07 (noetl/ai-meta#43 Round 3 — Tool::Container + ToolResult.pending_callback landed tools v2.21.0)

Agent: Claude · Repos touched: noetl/tools (PR), noetl/ai-meta.wiki

Headline. Third and last code round of the Container Tool Callback umbrella ships. After this round + a worker bump to the new noetl-tools, only Round 5 (e2e kind-val rig) remains to close the umbrella.

What landed

  • Round 3 of noetl/ai-meta#43 (tools#37, closed tools#36; v2.21.0).
    • src/result.rs — extend ToolResult with pending_callback: Option<bool> marker. Additive + skip_serializing_if; existing consumers see no change. Set by Tool::Container to Some(true) to signal "I created an external work item; suppress your normal call.done emit". 10 existing struct-literal sites backfilled with pending_callback: None.
    • src/tools/container.rs new module — ContainerTool impl + 17 unit tests (~570 lines).
    • ContainerConfig mirrors the umbrella's catalog YAML shape: image (required) + command + args + env (literal value XOR value_from { secret_name, secret_key }) + resources (requests/limits maps) + timeout_seconds (Job's activeDeadlineSeconds) + service_account + namespace + backoff_limit + restart_policy.
    • build_job translates ContainerConfig + ExecutionContext into a K8s Job:
      • Labels: noetl.execution-id / noetl.step-name / noetl.tool-kind=container on both Job.metadata.labels AND PodTemplateSpec.metadata.labels.
      • generateName: noetl-container-<step-slug>-<eid>-; slug strips chars outside [a-zA-Z0-9-] (DNS-1123-safe), truncates to 20 chars, lowercases. Empty slug → literal "step".
      • Default namespace: noetl.
      • Default backoffLimit: 0 — the playbook's own retry: block is the right place to express retry semantics; the Job controller's built-in retry would muddle the terminal-state mapping the watcher does.
      • Default restartPolicy: Never.
      • value XOR value_from: mutually exclusive at build time (returns ToolError::Configuration if both set).
    • execute() builds the kube client via Client::try_default() (reads in-cluster SA token + cluster CA), POSTs the Job via api.create(), returns immediately with pending_callback: Some(true) and the Job handle in data.
    • 17 new unit tests covering label propagation, generateName shape, slug stripping, env literal + secret-ref, value XOR value_from, empty image rejection, resources requests/limits propagation, defaults (backoff = 0; restartPolicy = Never), service_account propagation, timeout + deadline. Lib 258/0 (241 + 17 new).
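The generateName and env rules listed above are easy to pin down in a few lines. The Python sketch below follows the stated order (strip, truncate to 20, lowercase, fall back to "step") but is an illustration, not the Rust build_job code: the function names and the dict shape of an env entry are hypothetical, and raising `ValueError` stands in for ToolError::Configuration.

```python
import re


def step_slug(step_name: str) -> str:
    """Strip chars outside [a-zA-Z0-9-], truncate to 20, lowercase."""
    slug = re.sub(r"[^a-zA-Z0-9-]", "", step_name)[:20].lower()
    return slug or "step"  # empty slug falls back to the literal "step"


def generate_name(step_name: str, execution_id: int) -> str:
    """K8s appends a random suffix to generateName, hence the trailing dash."""
    return f"noetl-container-{step_slug(step_name)}-{execution_id}-"


def check_env_entry(entry: dict) -> dict:
    """Reject an env entry that sets both a literal value and a secret ref."""
    if entry.get("value") is not None and entry.get("value_from") is not None:
        raise ValueError(f"env {entry.get('name')!r}: value and value_from are mutually exclusive")
    return entry


print(generate_name("reduce_customer", 322018338286866432))
```

Note that the underscore in a step name is stripped rather than mapped to a dash, so `reduce_customer` becomes `reducecustomer`.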

The pending_callback transition story

Worker-side adoption of the marker (suppressing the worker's own call.done emit when set) is a coordinated follow-up tracked under the same umbrella. Until that lands, the worker will emit call.done immediately, and the watcher's later callback will be treated as stale by the server (recorded by noetl_container_callback_stale_total).

That race is harmless during the transition — playbooks just see early completion; the stale-counter dashboard is the migration signal for when the worker adoption needs to ship.

Umbrella status after this round

| Round | Sub-issue | Repo | State |
|---|---|---|---|
| 1 | ops#166 | noetl/ops | CLOSED (ops@8892043) |
| 2 | server#140 | noetl/server | CLOSED (v2.48.0) |
| 3 | tools#36 | noetl/tools | CLOSED (v2.21.0) |
| 5 | e2e#29 | noetl/e2e | Open (kind-val rig) |

3 of 4 Rust rounds done. Only Round 5 remains; the worker-side pending_callback adoption is a coordinated follow-up (currently tracked as a comment on the umbrella, not yet a sub-issue).

Pointer bumps + wiki sweep

  • Bump repos/tools to v2.21.0 (commit bd8ded8).
  • ai-meta wiki: Home (Last refreshed + ecosystem-map tools cell v2.20.0 → v2.21.0), Sessions-Log (this entry), Releases (prepend v2.21.0 row), Umbrella-Container-Tool-Callback (Recent-activity table + Next-concrete-steps update).

2026-06-07 (noetl/ai-meta#43 Round 1 — noetl-k8s-watcher Deployment landed ops@8892043)

Agent: Claude · Repos touched: noetl/ops (PR), noetl/ai-meta.wiki

Headline. Second concrete round of the Container Tool Callback umbrella ships: the external K8s Job watcher that closes the loop between Round 3's labeled-Job dispatch and Round 2's server callback endpoint. With Round 1 + Round 2 in place, the watcher side can be kind-validated end-to-end against the live endpoint by manually kubectl apply-ing a labeled Job — Round 3 (Tool::Container) can land independently.

What landed

  • Round 1 of noetl/ai-meta#43 (ops#167, closed ops#166; commit 8892043).
    • ci/manifests/k8s-watcher/ new directory with 5 files (527 lines):
      • README.md: full design + kind-val recipe + contract spec.
      • rbac.yaml: ServiceAccount + ClusterRole (Jobs/Pods get,list,watch; read-only, cluster-scoped because K8s watch streams filter by namespace at server side) + ClusterRoleBinding.
      • configmap.yaml: two ConfigMaps; one for env (NOETL_SERVER_URL, NOETL_K8S_WATCH_NAMESPACE, NOETL_K8S_WATCH_LABEL_SELECTOR), one for the 147-line watcher.sh body. Shipping the script in a ConfigMap lets us iterate the contract without rebuilding an image; when the watcher graduates to a Rust binary (follow-up), only the env config stays.
      • deployment.yaml: single-replica Deployment, Recreate strategy (server's Round-2 endpoint is idempotent so a brief rollout race is harmless), bitnami/kubectl:1.30.3 image + jq/curl installed via package manager at startup, initContainer waits for noetl-server's /api/health, NOETL_INTERNAL_API_TOKEN mounted from the existing noetl-internal-api-token Secret.
      • kustomization.yaml: kubectl apply -k entry point.
    • MVP shape: shell wrapper around kubectl get jobs --watch -o json piped through jq + curl. Per the sub-issue, shell is acceptable for round 1 — the contract (POST body, label selector, terminal-state mapping) is what unblocks the umbrella. A pure-Rust binary is a clean follow-up once the contract proves itself.
    • Contract:
      • Watches Jobs in NOETL_K8S_WATCH_NAMESPACE carrying the NOETL_K8S_WATCH_LABEL_SELECTOR label (default noetl.execution-id).
      • Jobs without both noetl.execution-id + noetl.step-name labels are ignored.
      • On terminal-state transition (Complete / Failed condition), classifies into the six TerminalState variants matching the umbrella's failure-mode taxonomy.
      • POSTs JSON body to {NOETL_SERVER_URL}/api/internal/container-callback/{eid}/{step} with Authorization: Bearer header.
      • Retries 3× with backoff on 5xx / transport errors; never on 4xx (Round 2's handler is idempotent so duplicate POSTs are harmless).
      • In-memory dedup by Job UID to avoid double-posting within the watcher's lifetime — server-side idempotency (Round 2) is the actual contract; this is just a hot-path optimization.
    • jq classifier maps K8s Job conditions to the six TerminalState enum variants. Finer-grain mapping for failed_image_pull / failed_oom / failed_node_lost (requires reading pod status, which the watcher has RBAC for) is staged at "minimal correctness now, finer-grain in a follow-up" — server treats unknown reasons as failed so no information is lost.
    • Sanity-checked: kubectl kustomize ci/manifests/k8s-watcher/ renders 327 lines of valid YAML; sh -n watcher.sh clean; jq classification dry-run resolves a Complete Job to succeeded.
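The watcher's retry and dedup rules from the contract can be sketched as follows. This is a Python illustration, not the watcher.sh body: `post_with_retry` and `first_sighting` are hypothetical names, the exponential backoff schedule is an assumption (the contract only says "with backoff"), and "3×" is read here as three retries after the first attempt.

```python
import time
from typing import Callable, Optional, Set


def post_with_retry(send: Callable[[], int], retries: int = 3,
                    base_delay: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> Optional[int]:
    """Call `send` until it returns a non-5xx status or retries run out.

    `send` returns an HTTP status code, or raises OSError on transport errors.
    Returns the last status seen (None if the last attempt failed in transport).
    """
    attempt = 0
    while True:
        try:
            status = send()
        except OSError:  # connection refused, reset, timeout, ...
            status = None
        if status is not None and status < 500:
            return status  # 2xx and 4xx are final; 4xx is never retried
        if attempt == retries:
            return status  # give up after the last retry
        sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s with the defaults
        attempt += 1


def first_sighting(seen: Set[str], job_uid: str) -> bool:
    """In-memory dedup by Job UID; True only the first time a UID appears."""
    if job_uid in seen:
        return False
    seen.add(job_uid)
    return True
```

The dedup set only lives as long as the watcher process, which is why the contract leans on server-side idempotency rather than on this check.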

Umbrella status after this round

| Round | Sub-issue | Repo | State |
|---|---|---|---|
| 1 | ops#166 | noetl/ops | CLOSED (ops@8892043) |
| 2 | server#140 | noetl/server | CLOSED (v2.48.0) |
| 3 | tools#36 | noetl/tools | Open (Tool::Container with PendingCallback marker) |
| 5 | e2e#29 | noetl/e2e | Open (kind-val rig) |

Rounds 1 + 2 are both in. The Round-1 ↔ Round-2 chain can be kind-validated end-to-end against the live endpoint by manually kubectl apply-ing a labeled Job before Round 3 lands the tool side. Round 3 (Tool::Container) is the last code round in the chain — once it lands, the worker dispatches real labeled Jobs and the umbrella's only remaining round is Round 5 (e2e kind-val rig).

Pointer bumps + wiki sweep

  • Bump repos/ops to commit 8892043.
  • ai-meta wiki: Home (Last refreshed + ecosystem-map ops cell with new headline), Sessions-Log (this entry), Umbrella-Container-Tool-Callback (Recent-activity table + Next-concrete-steps update).

2026-06-07 (noetl/ai-meta#43 Round 2 — container-callback endpoint landed v2.48.0)

Agent: Claude · Repos touched: noetl/server (PR), noetl/ai-meta.wiki

Headline. First concrete round of the Container Tool Callback umbrella ships on the server side: a new /api/internal/* handler that consumes the future K8s watcher's POST and emits a call.done event on the orchestrator's pipeline. Smallest blast radius — unblocks the rest of the umbrella's four-round Rust path.

What landed

  • Round 2 of noetl/ai-meta#43: POST /api/internal/container-callback/{execution_id}/{step} (server#141, closed server#140; v2.48.0).
    • src/handlers/container_callback.rs new module — full handler + 7 unit tests in 400 lines.
    • Six TerminalState variants matching the umbrella's failure-mode taxonomy: succeeded / failed / failed_image_pull / failed_oom / failed_node_lost / failed_timeout. Each survives in meta.terminal_state so the playbook can branch on the specific failure reason.
    • Stale check: single indexed SELECT on noetl.event for the execution_id. Zero rows → bump noetl_container_callback_stale_total{state} + log INFO + return 202 without emitting. Match → emit call.done via the standard insert_event path.
    • Returns 202 unconditionally on path-param validation success (the watcher is idempotent + may race with retries; the server should never 4xx on a stale callback).
    • Auth: existing RequireInternalApiToken extractor (same shape as the rest of /api/internal/*).
    • Observability per observability.md Principle 1: span container_callback carrying execution_id + step + state; counters noetl_container_callback_total{state} + noetl_container_callback_stale_total{state}; structured INFO on emit + stale paths.
    • 7 new unit tests; lib 487/0 (480 + 7 new).
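The handler's control flow (normalise the state, run the stale check, answer 202 on both paths) can be modelled in a few lines. The Python sketch below is an illustration, not the Rust handler: `handle_callback`, the `event_rows` argument standing in for the indexed SELECT, the emitted event dict, and bumping the main counter only on the emit path are assumptions. Mapping an unknown state to `failed` follows the watcher entry's note that the server treats unknown reasons as failed.

```python
from collections import Counter

TERMINAL_STATES = {
    "succeeded", "failed", "failed_image_pull",
    "failed_oom", "failed_node_lost", "failed_timeout",
}


def handle_callback(event_rows: int, execution_id: int, step: str, state: str,
                    emitted: list, counters: Counter) -> int:
    """Return the HTTP status; append a call.done event unless the callback is stale."""
    if state not in TERMINAL_STATES:
        state = "failed"  # unknown reasons collapse to failed
    if event_rows == 0:
        # Stale: no events for this execution_id, so count it and emit nothing.
        counters[("stale", state)] += 1
        return 202
    counters[("total", state)] += 1
    emitted.append({
        "execution_id": execution_id,
        "node_name": step,
        "event_type": "call.done",
        "meta": {"terminal_state": state},  # lets the playbook branch on the reason
    })
    return 202  # never a 4xx for a stale or repeated callback
```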

Umbrella status after this round

| Round | Sub-issue | Repo | State |
|---|---|---|---|
| 1 | ops#166 | noetl/ops | Open (watcher Deployment + RBAC) |
| 2 | server#140 | noetl/server | CLOSED (endpoint shipped today) |
| 3 | tools#36 | noetl/tools | Open (Tool::Container with PendingCallback marker) |
| 5 | e2e#29 | noetl/e2e | Open (kind-val rig) |

Round 2 unblocks Round 1: the watcher Deployment now has somewhere to POST. Kind-validation of Round 1 can manually kubectl apply a labeled Job to drive the endpoint end-to-end before Round 3 (Tool::Container) lands the worker side.

Pointer bumps + wiki sweep

  • Bump repos/server to v2.48.0 (commit fb898e5).
  • ai-meta wiki: Home (Last refreshed + ecosystem-map server cell v2.47.0 → v2.48.0), Sessions-Log (this entry), Releases (prepend v2.48.0 row).

2026-06-07 (noetl/ai-meta#64 closes — artifact tool kind added to Rust noetl-tools registry; tools v2.20.0)

Agent: Claude · Repos touched: noetl/tools (PR), noetl/ai-meta.wiki

Headline. Continuing-in-natural-order after the Secrets Wallet umbrella closed, picked up the smallest remaining open umbrella (#64) and shipped it. A thin ArtifactTool adapter in noetl-tools translates the Python-era YAML shape (action: get + input.result_ref) into a ResultFetchTool call. Keeps the three e2e fixtures that use kind: artifact working without modification.

What landed

  • noetl/tools#35 (landed via tools v2.20.0; closes noetl/tools#34; closes noetl/ai-meta#64):
    • New src/tools/artifact.rs: ArtifactTool implements Tool; name() = "artifact".
    • Holds a delegate ResultFetchTool + a TemplateEngine. execute() template-renders the raw config first (so input.result_ref: "{{ start._ref }}" resolves before deserialisation), translates to a synthetic result_fetch-shaped JSON, wraps in a ToolConfig with kind: "result_fetch", and delegates.
    • Pass-throughs honoured: prefer, flight_endpoint, bearer_token, tls_ca_path, client_cert_path, client_key_path all copy through to the delegate unchanged.
    • action: put returns ToolError::Configuration pointing the operator at the worker's call.done emit path (R-2.1) per agents/rules/execution-model.md. The playbook-side push surface is intentionally absent in the Rust path: a step's result lands in the result store via the worker's emit, not via a tool kind invoked by the playbook author.
    • Unknown actions rejected with the unknown name surfaced.
    • Missing input: block — typed deserialiser names the missing field for the operator.
    • 8 new unit tests in tools::artifact::tests covering: get translation with ref-only, get with all six pass-throughs, defaults action to "get" (matches Python worker default), put returns error pointing at emit path, unknown action rejected, missing input returns config error, tool name is "artifact", ToolConfig round-trip translation. Lib 241/0 (8 new).
    • Backward compatible — new tool kind; existing tools untouched; existing fixtures using kind: result_fetch keep working.
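
The translation step described above can be sketched roughly as follows. This is a minimal illustration, not the adapter's actual code: the flat string map, the `translate` function, the `ToolError` stand-in, and the output key names are all assumptions made for the example, and template rendering is taken to have happened already.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the tool's configuration error type.
#[derive(Debug, PartialEq)]
pub enum ToolError {
    Configuration(String),
}

/// Pass-through keys named in the entry above; copied to the delegate unchanged.
const PASS_THROUGHS: [&str; 6] = [
    "prefer",
    "flight_endpoint",
    "bearer_token",
    "tls_ca_path",
    "client_cert_path",
    "client_key_path",
];

/// Translate an `artifact`-shaped config into a flat `result_fetch`-shaped map.
/// `action` defaults to "get". `input` is the already template-rendered block;
/// keeping pass-throughs in the same flat map is an assumption of this sketch.
pub fn translate(
    action: Option<&str>,
    input: Option<&BTreeMap<String, String>>,
) -> Result<BTreeMap<String, String>, ToolError> {
    match action.unwrap_or("get") {
        "get" => {}
        "put" => {
            return Err(ToolError::Configuration(
                "action 'put' is unsupported; results land via the worker's call.done emit path"
                    .to_string(),
            ));
        }
        other => {
            return Err(ToolError::Configuration(format!(
                "unknown artifact action '{}'",
                other
            )));
        }
    }
    let input =
        input.ok_or_else(|| ToolError::Configuration("missing 'input' block".to_string()))?;
    let result_ref = input
        .get("result_ref")
        .ok_or_else(|| ToolError::Configuration("missing 'input.result_ref'".to_string()))?;

    let mut out = BTreeMap::new();
    out.insert("kind".to_string(), "result_fetch".to_string());
    out.insert("result_ref".to_string(), result_ref.clone());
    for key in PASS_THROUGHS.iter() {
        if let Some(value) = input.get(*key) {
            out.insert((*key).to_string(), value.clone());
        }
    }
    Ok(out)
}
```

Calling `translate(None, Some(&input))` with an `input` map that holds `result_ref` yields a map whose `kind` is `result_fetch`; `put`, unknown actions, and a missing `input` all come back as `ToolError::Configuration`.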

Why aliasing over migrating

noetl/ai-meta#64 listed two branches: (a) add artifact to the Rust registry (or alias to result_fetch); (b) treat as Python-era and migrate the fixtures to kind: result_fetch. Chose (a) aliasing because:

  • Smaller blast radius — touches noetl/tools only, not noetl/e2e.
  • Three fixtures already in production use the shape; migration cost exceeds the adapter cost.
  • The adapter is ~80 lines of actual logic (rest is doc + tests); no behavior drift from the underlying ResultFetchTool.

Status

  • noetl/ai-meta#64 closes — the worker bumps to noetl-tools v2.20.0 and the #54 e2e sweep can re-run test_output_select / test_gcs_storage / test_storage_tiers once the worker pointer bumps (separate ai-task issue if not already underway).
  • The Rust noetl-tools registry now matches the Python tool-kind inventory for the surfaces the e2e fixtures exercise.

Pointer bumps + wiki sweep

  • Bump repos/tools to v2.20.0 (commit a48da13).
  • ai-meta wiki: Home (Last refreshed, ecosystem-map tools cell v2.19.3 → v2.20.0, move #64 from Active umbrellas to Recently closed, preamble count Five → Four), Sessions-Log (this entry), Releases (prepend v2.20.0 row).

2026-06-07 (Secrets Wallet umbrella #61 closes — three 6d.X cloud-specific dynamic providers landed; v2.45.0 + v2.46.0 + v2.47.0)

Agent: Claude · Repos touched: noetl/server (3 PRs), noetl/ai-meta.wiki

Headline. All three cloud-specific dynamic-secret providers on the Secrets Wallet umbrella shipped this session. The umbrella noetl/ai-meta#61 is now feature-complete — every named phase + every queued follow-up has landed in noetl/server. The platform-side wallet has nothing left to ship; future work would be new product surface (e.g. additional providers, additional residency-policy modes) rather than completing the original umbrella scope.

What landed

  • Phase 6d.1 — AWS STS AssumeRoleWithWebIdentity provider (server#137, closed server#132; v2.45.0): exchanges the EKS-projected ServiceAccount JWT (AWS_WEB_IDENTITY_TOKEN_FILE) for short-lived AWS temporary credentials via STS. No SigV4 — the WebIdentityToken IS the credential (STS anonymous action), so no static AWS_ACCESS_KEY_ID needed. Response parser accepts both XML (legacy / VPC endpoints) and JSON (modern STS). Reference shape [<region>:]<role-arn>[#<session-name>]. Re-reads the token file on every fetch (kubelet rotates projected tokens every ~hour by default). 15 new unit tests; lib 456/0.
  • Phase 6d.3 — Azure AAD client-credentials provider (server#139, closed server#134; v2.46.0): off-cluster (non-IMDS) AAD client_credentials flow for deployments running outside AKS that need to call Azure APIs. Reads AZURE_TENANT_ID / AZURE_CLIENT_ID / AZURE_CLIENT_SECRET from env. Sovereign-cloud overrides via NOETL_AZURE_AAD_HOST (Azure Gov / China). Reference shape [<tenant>:]<scope>; parser only treats the :-prefix as a tenant if it doesn't look like a URL scheme (https/http). 14 new unit tests.
  • Phase 6d.2 — GCP iamcredentials.generateAccessToken provider (server#138, closed server#133; v2.47.0): mints short-lived OAuth2 access tokens for a target service account via workload-identity impersonation. Reads the caller's Workload-Identity token from the GKE metadata server (shares the env override NOETL_GCP_METADATA_TOKEN_URL with GcpSecretManager). Reference shape <target-sa-email>[#<scope>]. 10 new unit tests.
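
The Azure reference shape `[<tenant>:]<scope>` has one subtlety: a scope such as `https://...` contains a colon itself. A rough sketch of the parsing rule described above, with `AadReference` and `parse_aad_reference` as hypothetical names chosen for the example:

```rust
/// Parsed form of an `[<tenant>:]<scope>` reference (hypothetical type).
#[derive(Debug, PartialEq)]
pub struct AadReference {
    pub tenant: Option<String>,
    pub scope: String,
}

/// Split on the first ':' and treat the prefix as a tenant only when it
/// does not look like a URL scheme (http / https), as the entry describes.
pub fn parse_aad_reference(reference: &str) -> AadReference {
    if let Some((prefix, rest)) = reference.split_once(':') {
        let looks_like_scheme =
            prefix.eq_ignore_ascii_case("https") || prefix.eq_ignore_ascii_case("http");
        if !prefix.is_empty() && !rest.is_empty() && !looks_like_scheme {
            return AadReference {
                tenant: Some(prefix.to_string()),
                scope: rest.to_string(),
            };
        }
    }
    AadReference {
        tenant: None,
        scope: reference.to_string(),
    }
}
```

Under this rule a bare `https://graph.microsoft.com/.default` parses with no tenant, while `contoso:https://graph.microsoft.com/.default` parses with tenant `contoso`. Scopes with other schemes (for example `api://...`) would be misread as tenant-prefixed by this sketch; how the real parser treats them is not stated in the entry.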

All three providers return SecretValue.expires_at populated from the issuer's response — Phase 6d's cache_decision clamps cache TTL to min(default_ttl, expires_at - now - safety_margin); Phase 7c.3's background refresh re-resolves inside the refresh window. Three discrete merge-conflict resolution cycles in src/secrets/mod.rs (each new provider added to the same factory's match arms + supported-providers error message string); all conflicts were one-line string combinations and the additive mod / pub use / match-arm lines auto-merged cleanly.
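
The TTL clamp `min(default_ttl, expires_at - now - safety_margin)` can be illustrated with a small sketch. Times are plain epoch seconds rather than `DateTime<Utc>`, and the way the 5-second floor interacts with the skip case is an assumption of this example, not a statement about the real `cache_decision`.

```rust
/// Outcome of the cache-TTL decision (variant names taken from the log).
#[derive(Debug, PartialEq)]
pub enum CacheDecision {
    CacheFor(u64),
    SkipCacheAlreadyExpired,
}

/// Floor mentioned in the Phase 6d entry.
pub const MIN_EFFECTIVE_TTL_SECS: u64 = 5;

/// Sketch: `expires_at` and `now` are epoch seconds (assumption).
pub fn cache_decision(
    expires_at: Option<i64>,
    default_ttl: u64,
    safety_margin: u64,
    now: i64,
) -> CacheDecision {
    // Long-lived secrets carry no issuer expiry: use the default TTL.
    let expires_at = match expires_at {
        None => return CacheDecision::CacheFor(default_ttl),
        Some(t) => t,
    };
    // Lifetime left once the safety margin is reserved.
    let usable = expires_at - now - safety_margin as i64;
    if usable <= 0 {
        // Already past the deadline, or inside the safety margin.
        return CacheDecision::SkipCacheAlreadyExpired;
    }
    let ttl = default_ttl.min(usable as u64).max(MIN_EFFECTIVE_TTL_SECS);
    CacheDecision::CacheFor(ttl)
}
```

With a 600 s default and a 60 s margin, a token expiring in an hour is cached for 600 s, one expiring in 300 s is cached for 240 s, and one expiring in 60 s or less is not cached at all.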

Umbrella status

Secrets Wallet umbrella noetl/ai-meta#61 is feature-complete:

  • 1 envelope encryption (v2.21.0)
  • 2 GCP Cloud KMS KeyManager (v2.22.0)
  • 3 Secret resolution via the auth:/keychain path (v2.23.0 → v2.26.0)
  • 3b + 3c Keychain caching + provider:-backed entries (v2.27.0)
  • Providers — 5 static (GCP Secret Manager, K8s Secrets, Vault, AWS Secrets Manager, Azure Key Vault) at v2.28.0 → v2.31.0
  • 4 Transport mTLS (4a server v2.30.0 + 4b worker v5.12.0 + 4c cert-manager + 4d Helm)
  • 5 Sealed payload delivery (5a v2.32.0 + 5b v2.33.0 + 5c worker v5.13.0)
  • 6 Residency-aware distributed resolution (6a region tag v2.34.0 + 6b ProviderRegistry v2.35.0 + 6c residency-policy gate v2.36.0 + 6d primitives v2.37.0 + 6e cross-region broker v2.38.0)
  • 6d Dynamic-secret providers (6d.1 AWS STS v2.45.0 + 6d.2 GCP iamcredentials v2.47.0 + 6d.3 Azure AAD v2.46.0)
  • 7 Rotation + audit + auto-renewal (7a KEK rotation primitives v2.39.0 + 7a.2 endpoints v2.42.0 + 7b audit service v2.40.0 + 7b.2 table + GET endpoint v2.43.0 + 7c should_refresh primitive v2.41.0 + 7c.2 cache companion v2.43.0 + 7c.3 resolver-side wire-up + stampede collapse v2.44.0)

The umbrella issue gets closed as part of this change set.

Pointer bumps + wiki sweep

  • Bump repos/server to v2.47.0 (commit 605b8b1).
  • ai-meta wiki: Home (Last refreshed, ecosystem-map server cell v2.44.0 → v2.47.0, move #61 from Active umbrellas to Recently closed, drop the count from Six to Five), Sessions-Log (this entry), Releases (prepend v2.45.0 / v2.46.0 / v2.47.0), Umbrella-Secrets-Wallet (mark feature-complete + closing summary at top).

2026-06-07 (Secrets Wallet #61 Phase 7c.3 — resolver-side stampede mutex + background re-resolve, landed v2.44.0)

Agent: Claude · Repos touched: noetl/server (PR), noetl/ai-meta.wiki

Headline. Phase 7c.3 wires the Phase-7c decision primitive + the Phase-7c.2 cache-side companion into the resolver's cache-hit path. When CredentialService::try_resolve_keychain hits a fresh-but-aging row, the cached value returns IMMEDIATELY (worker fetches stay on the fast path) and a background tokio::spawn re-resolves via the Phase-3b SecretProvider + updates the cache via KeychainService::set. The Phase 7c series (primitive + cache companion + resolver wire-up) is now wire-complete.

What landed

  • Phase 7c.3 — resolver-side stampede mutex + background re-resolve (server#136, closed server#135; v2.44.0):
    • Stampede collapse — new src/services/keychain_refresh.rs RefreshInflight wraps Arc<tokio::sync::Mutex<HashSet<(i64, String)>>> with try_claim (atomic insert returning whether the slot was free) + release. The struct is Clone (cheap Arc clone) so every CredentialService instance derived from the same root shares the same inflight set. N workers crossing the refresh threshold for the same (catalog_id, alias) collapse to one provider call; concurrent callers piggy-back on the in-flight refresh and bump noetl_secret_refresh_total{outcome="stampede_collapsed"}.
    • Refactor — extracted the cache-miss provider resolution (catalog → playbook → provider → cache write) into a separate resolve_via_provider method. Both the cache-miss inline path AND the background-refresh task call it — identical code path, no behavior drift between cold-miss and proactive-refresh.
    • Background-task lifecycle — cache hit → maybe_spawn_refresh → should_refresh check (single indexed cache read; awaited) → try_claim → tokio::spawn with cloned service state → background: resolve_via_provider → record outcome metric (succeeded | failed) + duration histogram → release slot. On stampede: bump stampede_collapsed + return.
    • Failure modes handled — should_refresh read errors → log + skip (cached value already went out, never fail the credential lookup); provider failure in background → log + bump outcome="failed" + release slot; stampede → bump outcome="stampede_collapsed".
    • Six new unit tests in services::keychain_refresh::tests: single-claim succeeds; second-claim for same key returns false (stampede signal); release allows re-claim; distinct keys don't collide; release is idempotent; clone shares inner state (load-bearing: verifies the stampede-collapse invariant works across CredentialService clones). Lib 441/0 passing.
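
The claim/release invariant can be sketched with the standard library alone. This example swaps in `std::sync::Mutex` for the `tokio::sync::Mutex` named above so that it compiles without an async runtime; the method signatures are assumptions for illustration.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};

/// Shared in-flight set; cloning shares the same inner state via the Arc.
#[derive(Clone, Default)]
pub struct RefreshInflight {
    inner: Arc<Mutex<HashSet<(i64, String)>>>,
}

impl RefreshInflight {
    /// Returns true when the slot was free and is now claimed by the caller.
    /// A false return is the stampede signal: a refresh is already running.
    pub fn try_claim(&self, catalog_id: i64, alias: &str) -> bool {
        let mut set = self.inner.lock().unwrap();
        set.insert((catalog_id, alias.to_string()))
    }

    /// Frees the slot; releasing an unclaimed slot is a no-op (idempotent).
    pub fn release(&self, catalog_id: i64, alias: &str) {
        let mut set = self.inner.lock().unwrap();
        set.remove(&(catalog_id, alias.to_string()));
    }
}
```

A caller that gets `false` from `try_claim` would return without spawning a refresh, and the winning caller would call `release` once its background task finishes, whether it succeeded or failed.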

Status

Secrets Wallet umbrella noetl/ai-meta#61: all named phases (1–7) plus Phase 7c.3 resolver wire-up are now wire-complete on the platform side. The only remaining work is the three cloud-specific dynamic-secret providers — each its own sub-issue:

  • server#132 — Phase 6d.1 AWS STS AssumeRoleWithWebIdentity.
  • server#133 — Phase 6d.2 GCP iamcredentials.generateAccessToken.
  • server#134 — Phase 6d.3 Azure AAD client-credentials.

Each provider implementation needs the target cloud reachable for integration testing, so they're best scoped as discrete future rounds that pull the actual cloud credentials when picked up.

Pointer bumps + wiki sweep

  • Bump repos/server to v2.44.0 (commit 1851f68).
  • ai-meta wiki: Home (Last refreshed, ecosystem-map server cell), Sessions-Log (this entry), Releases (v2.44.0), Umbrella-Secrets-Wallet (latest landings).

2026-06-07 (Secrets Wallet #61 — 7a.2 + 7b.2 + 7c.2 follow-up rounds landed; v2.42.0 + v2.43.0)

Agent: Claude · Repos touched: noetl/server (3 PRs), noetl/server.wiki, noetl/ai-meta.wiki

Headline. Three queued follow-up rounds for the Secrets Wallet umbrella shipped this session — operator-facing endpoints and DB storage that wrap the lib-only primitives shipped in v2.39.0 / v2.40.0 / v2.41.0. All three Phase-7 named rounds (rotation + audit + auto-renewal) now have functional endpoints; three cloud-specific dynamic-secret provider sub-issues filed for the remaining 6d.X follow-up work.

What landed

  • Phase 7a.2 — KEK rotation endpoint + key-status + DB scans (server#127, closed server#126; v2.42.0): POST /api/internal/wallet/rotate-kek?batch_size=&max_batches=&table= runs a batched cursor scan across noetl.credential + noetl.keychain, calls the Phase-7a rewrap_storage_string per row, returns RotateSummary { processed, rewrapped, skipped, failed, last_id } for progress checkpointing across runs. GET /api/internal/wallet/key-status reports per-version row counts so an operator can confirm completion before retiring the old KEK version. classify_failure heuristic maps thrown errors to parse_error | failed_unwrap | failed_wrap | failed so the noetl_wallet_rotate_total{table, status} counter splits clean. Plaintext NEVER reconstructed (Phase 7a invariant preserved).

  • Phase 7b.2 — noetl.secret_audit table + DbAuditSink + GET endpoint (server#129, closed server#128; v2.43.0): durable storage path for the Phase-7b service. noetl.secret_audit table provisioned via CREATE TABLE IF NOT EXISTS at server startup (server-owned, no out-of-band migration step) — audit_id PK, credential + execution_id + occurred_at indexed. DbAuditSink impl of AuditSink writes via db::queries::secret_audit::insert with ON CONFLICT DO NOTHING for idempotency. New GET /api/internal/secret-audit?credential=&execution_id=&from=&to=&limit= returns bounded rows ORDER BY occurred_at DESC (hard cap 10_000). NoopAuditSink stays the default when NOETL_SECRET_AUDIT_REQUIRED is unset. Two merge conflicts with Phase-7a.2 in db/queries/mod.rs + main.rs resolved as additive (both modules / route blocks coexist).

  • Phase 7c.2 — KeychainService::should_refresh cache-side primitive (server#131, closed server#130; v2.43.0): cache-layer companion of the Phase-7c decision primitive. KeychainService::should_refresh(catalog_id, keychain_name, execution_id, scope_type, now) reads the cache row's expires_at, asks secrets::dynamic::should_refresh_default (honours KEYCHAIN_CACHE_REFRESH_WINDOW_SECS), bumps noetl_secret_refresh_total{outcome="triggered"} on a true return. Falls through to false when the row is missing, has no expires_at, is already expired (eviction path, not refresh), or is outside the refresh window. Backward compatible — new method, existing call sites unchanged.
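
The batched cursor scan behind the Phase 7a.2 rotation endpoint can be pictured with an in-memory stand-in. The sketch below assumes rows sorted by ascending id and a caller-supplied `rewrap` closure; `rotate_batches` and `RowOutcome` are hypothetical names, and the real endpoint reads from the database rather than a slice.

```rust
/// Summary shape named in the entry above.
#[derive(Debug, Default, PartialEq)]
pub struct RotateSummary {
    pub processed: u64,
    pub rewrapped: u64,
    pub skipped: u64,
    pub failed: u64,
    pub last_id: i64,
}

/// Per-row result of a rewrap attempt (hypothetical, simplified).
pub enum RowOutcome {
    Rewrapped,
    Skipped,
    Failed,
}

/// Scan `rows` (sorted by id) in batches, starting after `start_after`.
/// `last_id` in the summary is the checkpoint for the next run.
pub fn rotate_batches<F>(
    rows: &[(i64, String)],
    start_after: i64,
    batch_size: usize,
    max_batches: usize,
    mut rewrap: F,
) -> RotateSummary
where
    F: FnMut(&str) -> RowOutcome,
{
    let mut summary = RotateSummary {
        last_id: start_after,
        ..RotateSummary::default()
    };
    for _ in 0..max_batches {
        let cursor = summary.last_id;
        // Stand-in for: SELECT ... WHERE id > cursor ORDER BY id LIMIT batch_size
        let batch: Vec<&(i64, String)> = rows
            .iter()
            .filter(|row| row.0 > cursor)
            .take(batch_size)
            .collect();
        if batch.is_empty() {
            break;
        }
        for row in batch {
            match rewrap(row.1.as_str()) {
                RowOutcome::Rewrapped => summary.rewrapped += 1,
                RowOutcome::Skipped => summary.skipped += 1,
                RowOutcome::Failed => summary.failed += 1,
            }
            summary.processed += 1;
            summary.last_id = row.0;
        }
    }
    summary
}
```

Feeding the returned `last_id` back in as `start_after` resumes the scan where the previous run stopped, which is the checkpointing behaviour the entry describes.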

Discrete follow-ups filed

  • server#132 — Phase 6d.1: AWS STS AssumeRoleWithWebIdentity dynamic-secret provider (EKS IRSA path).
  • server#133 — Phase 6d.2: GCP iamcredentials.generateAccessToken dynamic-secret provider (workload-identity impersonation).
  • server#134 — Phase 6d.3: Azure AAD client-credentials dynamic-secret provider (off-cluster, non-IMDS).
  • Phase 7c.3 (next session on this branch): per-(catalog_id, alias) tokio::sync::Mutex stampede collapse + tokio::spawn background re-resolve via the Phase-3b SecretProvider path + write via KeychainService::set. Held back from 7c.2 because the resolver-side wire-up wants the cache-side primitive stable first.

Status

Secrets Wallet umbrella noetl/ai-meta#61: all named phases (1–7) shipped. Remaining work is the three cloud-specific dynamic-secret providers + Phase 7c.3 stampede collapse — all discrete sub-issues; umbrella stays open until they close. noetl/server is at v2.43.0.

Pointer bumps + wiki sweep

  • Bump repos/server to v2.43.0 (commit ee8cebe).
  • ai-meta wiki: Home (Last refreshed, ecosystem-map server cell, Active umbrellas #61 row), Sessions-Log (this entry), Releases (v2.42.0 + v2.43.0), Umbrella-Secrets-Wallet (recent activity + next concrete steps).

2026-06-06 (Secrets Wallet #61 Phase 7c — token auto-renewal, landed v2.41.0 — closes Phase 7; all named rounds 1–7 done)

Agent: Claude · Repos touched: noetl/server (PR), noetl/server.wiki, noetl/ai-meta.wiki

Headline. Phase 7 of the Secrets Wallet umbrella closes — every named phase (1–7) is now complete. OAuth2 / JWT access tokens with expires_in are the dominant short-lived credential shape in production. Phase 6d's cache-TTL plumbing kept dead tokens out; Phase 7c adds proactive refresh-before-expiry so the cache renews when the remaining lifetime drops below a threshold. Worker fetches stay on the cached-fresh-token fast path; the auth playbook runs at most once per natural token lifetime instead of once per worker burst. Tail latency stays flat across rotations.

What landed (server#125, merged → noetl-server v2.41.0 (e1bb4f8), closed server#124):

  • secrets::dynamic::should_refresh(expires_at, refresh_window, now) decision primitive: returns true iff expires_at is set, still valid (expires_at > now), and inside the refresh window (expires_at - refresh_window <= now). Pure function; no side effects.
  • secrets::dynamic::should_refresh_default(expires_at, now) — convenience wrapper that reads the window from env.
  • KEYCHAIN_CACHE_REFRESH_WINDOW_SECS env (default 60 s) — how long before expires_at to mark a cached row "renewable."
  • noetl_secret_refresh_total{outcome} counter per agents/rules/observability.md Principle 1. outcome ∈ {triggered, succeeded, failed, stampede_collapsed}. failed at sustained rate is alert-worthy — provider is unreachable AND a cached token is about to expire. Aliases are NOT a label (cardinality blowup); per-alias detail rides the secret.refresh tracing span.
  • noetl_secret_refresh_duration_seconds histogram — buckets [0.05, 0.1, 0.25, 0.5, 1, 2, 5] (auth round-trips dominate), observed regardless of outcome so dashboards surface "slow" + "failing" independently.
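
The decision rule for `should_refresh` is small enough to restate as code. This sketch uses epoch seconds in place of the real timestamp and duration types, which is an assumption; the conditions and the 60 s default come from the bullets above.

```rust
/// True only if `expires_at` is set, still valid, and inside the refresh window.
/// All values are epoch seconds in this sketch (assumption).
pub fn should_refresh(expires_at: Option<i64>, refresh_window_secs: i64, now: i64) -> bool {
    match expires_at {
        Some(exp) => exp > now && exp - refresh_window_secs <= now,
        None => false,
    }
}

/// Reads the window from KEYCHAIN_CACHE_REFRESH_WINDOW_SECS, defaulting to 60 s.
pub fn refresh_window_from_env() -> i64 {
    std::env::var("KEYCHAIN_CACHE_REFRESH_WINDOW_SECS")
        .ok()
        .and_then(|v| v.parse::<i64>().ok())
        .unwrap_or(60)
}

/// Convenience wrapper mirroring `should_refresh_default`.
pub fn should_refresh_default(expires_at: Option<i64>, now: i64) -> bool {
    should_refresh(expires_at, refresh_window_from_env(), now)
}
```

The boundary case falls out of the `<=` comparison: a token expiring exactly `refresh_window_secs` from now is already renewable, while an expired token returns false because that is the eviction path.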

Tests. Five new in secrets::dynamic::tests: returns false when no expires_at; returns false when already expired (defensive — that's the eviction path); returns false when outside window; returns true inside window; boundary case (expires_at = now + window) returns true. Lib 427 / 0 passing.

Wiki. Server deployment-specification got a new "Token auto-renewal (Phase 7c primitives)" subsection under Secret providers

Phase 7 architectural shape (now complete):

Phase 7a: rotation primitives — rewrap stored envelopes under new KEK
                                version without reconstructing plaintext.
Phase 7b: audit primitives    — AuditEvent + AuditSink + record_strict /
                                record_async modes; NEVER stores the value.
Phase 7c: refresh primitives  — should_refresh() decides when to renew a
                                still-valid cached token in the background.

Phase 7 closes. All named rounds of the Secrets Wallet umbrella (1 envelope encryption → 2 KMS providers → 3 secret resolution + 3c keychain cache → 4 transport mTLS → 5 sealed payload delivery → 6 residency + cross-region broker → 7 rotation + audit + auto-renewal) are complete.

Remaining queue (all discrete follow-up sub-issues, each its own bounded round):

  • 7a.2 — Wallet KEK rotation endpoint (POST /api/internal/wallet/rotate-kek) + DB scans over noetl.credential + noetl.keychain + diagnostic GET /api/internal/wallet/key-status.
  • 7b.2 — noetl.secret_audit table + DbAuditSink impl + GET /api/internal/secret-audit query endpoint + wire-up to the four credential surfaces.
  • 7c.2 — KeychainService::should_refresh + resolver wire-up + per-(catalog_id, alias) stampede mutex + refresh path records its own Phase-7b AuditEvent.
  • 6d.1 / 6d.2 / 6d.3 — AWS STS AssumeRoleWithWebIdentity / GCP iamcredentials.generateAccessToken / Azure AAD client-credentials dynamic-secret providers.

These plug into the existing primitives without further trait / schema surgery.


2026-06-06 (Secrets Wallet #61 Phase 7b primitives — secret-resolution audit service, landed v2.40.0)

Agent: Claude · Repos touched: noetl/server (PR), noetl/server.wiki, noetl/ai-meta.wiki

Headline. Phase 7 round 2. Today the wallet has no durable record of "who accessed credential X at time Y, on which execution, with what outcome." The tracing-span surface evaporates with log retention; compliance regimes (SOC 2, ISO 27001, FedRAMP, PCI-DSS) require a queryable audit trail with retention measured in years. This round ships the in-process service primitives; the actual noetl.secret_audit table + query endpoint + handler integration lands in 7b.2.

What landed (server#123, merged → noetl-server v2.40.0 (eb1840f), closed server#122):

  • services::secret_audit::AuditEvent struct — audit_id (application-side snowflake) + occurred_at + credential + bounded operation + bounded outcome + worker_id / execution_id / server_region / broker_region / kek_version / notes. NEVER contains the secret value.
  • Operation + Outcome bounded enums with as_str() — drift guard against the strings used in the noetl_secret_audit_writes_total{operation, outcome, ...} metric labels.
  • AuditSink trait + NoopAuditSink default impl (audit-disabled deployments + the audit-disabled test path).
  • SecretAuditService wrapper with three calls:
    • record_async — fire-and-forget; spawns a tokio task; never blocks the resolver. Failed writes log + drop + noetl_secret_audit_writes_total{status="dropped_async"}.
    • record_strict — awaits the result. Used when compliance requires the audit row exist before the value releases. Failed writes propagate the error to the handler.
    • record — branches by strict. Typical handler call.
  • NOETL_SECRET_AUDIT_REQUIRED env (default false; 1/true/TRUE/yes/YES enable strict mode).
  • noetl_secret_audit_writes_total{operation, outcome, status} counter — status ∈ {written, dropped_async, failed_strict}. failed_strict is alert-worthy — wallet refused to release a credential because the audit couldn't be recorded.
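
The strict versus non-strict split can be sketched synchronously. The real `record_async` spawns a tokio task; this example collapses it into a blocking call that drops failures, and the trait signature, the trimmed `AuditEvent`, and `FailingSink` are assumptions made for the illustration.

```rust
/// Truthy values listed for NOETL_SECRET_AUDIT_REQUIRED; anything else is non-strict.
pub fn strict_from_value(raw: Option<&str>) -> bool {
    matches!(
        raw,
        Some("1") | Some("true") | Some("TRUE") | Some("yes") | Some("YES")
    )
}

/// Trimmed-down event; it never carries the secret value.
pub struct AuditEvent {
    pub credential: String,
    pub operation: &'static str,
    pub outcome: &'static str,
}

pub trait AuditSink {
    fn write(&self, event: &AuditEvent) -> Result<(), String>;
}

/// Default sink for audit-disabled deployments.
pub struct NoopAuditSink;
impl AuditSink for NoopAuditSink {
    fn write(&self, _event: &AuditEvent) -> Result<(), String> {
        Ok(())
    }
}

/// Test-only sink that always fails (hypothetical).
pub struct FailingSink;
impl AuditSink for FailingSink {
    fn write(&self, _event: &AuditEvent) -> Result<(), String> {
        Err("sink unavailable".to_string())
    }
}

pub struct SecretAuditService<S: AuditSink> {
    pub sink: S,
    pub strict: bool,
}

impl<S: AuditSink> SecretAuditService<S> {
    /// Returns the `status` label that would be counted, or the error in strict mode.
    pub fn record(&self, event: &AuditEvent) -> Result<&'static str, String> {
        match (self.sink.write(event), self.strict) {
            (Ok(()), _) => Ok("written"),
            // Strict: the caller must not release the credential.
            (Err(e), true) => Err(e),
            // Non-strict: log and drop, the resolver is never blocked.
            (Err(_), false) => Ok("dropped_async"),
        }
    }
}
```

In strict mode a failing sink surfaces as an error the handler can turn into a refusal, which corresponds to the `failed_strict` status; in non-strict mode the same failure is only counted.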

Tests. Eight new in services::secret_audit::tests: builder fills audit_id + occurred_at; Operation + Outcome as_str round-trip (drift guard); noop sink always succeeds; record_strict blocks on sink failure (mock sink with fail=true); record_strict persists on success (mock sink records the event); record dispatches async when not strict; noop service records without blocking; from_env respects truthy values. Lib 422 / 0 passing.

Wiki. Server deployment-specification got a new "Secret-resolution audit service (Phase 7b primitives)" subsection with the AuditEvent wire shape + the new env + metric (server-wiki@d50ec40).

Lib-only. No schema migration. Backward compatible — existing deployments get NoopAuditSink + non-strict mode; no behavior change until 7b.2 wires the production sink + the DB table.

Next. Phase 7c — token auto-renewal. OAuth2 / JWT access tokens with expires_in are the dominant short-lived credential shape. Phase-3c cache currently evicts when expires_at is past; on the next worker fetch, the resolver re-runs the auth playbook (slow path). 7c adds a refresh-before-expiry hook so the cache proactively renews when the remaining lifetime drops below a threshold — the worker's next fetch hits the cached fresh token instead of paying the auth round-trip. Plus Phase 7a.2 + 7b.2 still queued (rotation endpoint + audit table + endpoint).


2026-06-06 (Secrets Wallet #61 Phase 7a — KEK rotation primitives, landed v2.39.0)

Agent: Claude · Repos touched: noetl/server (PR), noetl/server.wiki, noetl/ai-meta.wiki

Headline. Starts Phase 7 of the Secrets Wallet umbrella — rotation, audit, and auto-renewal. Phase 6 closed with full residency coverage; Phase 7 hardens the wallet for the operational lifecycle. This round ships the rotation primitives so the actual rotation endpoint + table scans (7a.2) can land cleanly.

What landed (server#121, merged → noetl-server v2.39.0 (3510170), closed server#120):

  • KeyManager::current_key_version() trait accessor with a safe default ("unknown"). LocalDevKms reports its own version string (defaults "v1"; test constructor accepts an explicit label).
  • EnvelopeCipher::rewrap_storage_string(raw) primitive:
    • Parses raw as a stored envelope.
    • If wrapped.key_version == current_key_version() → returns RewrapOutcome::Skipped { key_version } (no KMS call).
    • Else: unwraps DEK under historical KEK version → re-wraps under current → RewrapOutcome::Rewrapped { old_key_version, new_key_version, new_storage_string }.
    • Plaintext payload is NEVER reconstructed. Pure DEK re-wrap — AES-GCM ciphertext bytes stay byte-identical; only the dek field of the stored envelope changes.
  • noetl_wallet_rotate_total{table, status} counter per observability.md Principle 1. table ∈ {credential, keychain}; status ∈ {skipped, rewrapped, failed_unwrap, failed_wrap, parse_error}. failed_unwrap is alert-worthy — it means the KMS deleted the historical key version and the rotation can't complete without operator intervention.
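
The "pure DEK re-wrap" property is easier to see with a toy model. The sketch below uses a single-byte XOR as a placeholder for KMS wrap and unwrap, which is not real cryptography; `ToyKms`, the `Envelope` fields, and the error strings are all invented for the illustration.

```rust
use std::collections::BTreeMap;

/// Simplified stored envelope: payload ciphertext plus the wrapped DEK.
#[derive(Clone, Debug, PartialEq)]
pub struct Envelope {
    pub key_version: String,
    pub wrapped_dek: Vec<u8>,
    pub ciphertext: Vec<u8>,
}

#[derive(Debug, PartialEq)]
pub enum RewrapOutcome {
    Skipped { key_version: String },
    Rewrapped { old_key_version: String, new_key_version: String, envelope: Envelope },
}

/// Toy KMS: one XOR byte per KEK version. NOT real cryptography.
pub struct ToyKms {
    pub current: String,
    pub keks: BTreeMap<String, u8>,
}

impl ToyKms {
    /// XOR is its own inverse, so this serves as both wrap and unwrap.
    fn xor(&self, version: &str, bytes: &[u8]) -> Result<Vec<u8>, String> {
        let kek = self
            .keks
            .get(version)
            .ok_or_else(|| format!("KEK version {} not available", version))?;
        Ok(bytes.iter().map(|b| b ^ kek).collect())
    }

    pub fn rewrap(&self, env: &Envelope) -> Result<RewrapOutcome, String> {
        if env.key_version == self.current {
            // Already on the current version: no KMS call needed.
            return Ok(RewrapOutcome::Skipped { key_version: env.key_version.clone() });
        }
        // Unwrap the DEK under the historical KEK, wrap it under the current one.
        let dek = self.xor(&env.key_version, &env.wrapped_dek)?;
        let wrapped = self.xor(&self.current, &dek)?;
        Ok(RewrapOutcome::Rewrapped {
            old_key_version: env.key_version.clone(),
            new_key_version: self.current.clone(),
            envelope: Envelope {
                key_version: self.current.clone(),
                wrapped_dek: wrapped,
                // The payload ciphertext is copied untouched; plaintext is never rebuilt.
                ciphertext: env.ciphertext.clone(),
            },
        })
    }
}
```

Only the wrapped DEK and the version label change between the old and new envelope. A missing historical KEK version surfaces as an error, which mirrors the `failed_unwrap` status called out above.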

Tests. Four new in crypto::envelope::tests: rewrap skips records already on current version (no KMS call); rewrap emits new envelope under current version when older (new storage string carries kv:"v2" and decrypts to the original plaintext); rewrap rejects non-envelope storage value (forward-only contract preserved); LocalDevKms reports its key version (drift guard). Lib 414 / 0 passing.

Wiki. Server deployment-specification got a new "Wallet KEK rotation primitives (Phase 7a)" subsection + the new metric + the planned 7a.2 operator workflow (server-wiki@36f3cfa).

Lib-only. No schema migration. Backward compatible — existing records resolve unchanged; only an explicit rotation pass touches them.

Next. Phase 7a.2 — the rotation endpoint (POST /api/internal/wallet/rotate-kek) + DB scans over noetl.credential + noetl.keychain + diagnostic GET /api/internal/wallet/key-status reporting per-version row counts. Phase 7b — secret_audit append-only table with one row per resolution attempt (landing alongside, in this same session). Phase 7c — token auto-renewal (OAuth2/JWT refresh-before-expiry inside the Phase-3c cache).


2026-06-06 (Secrets Wallet #61 Phase 6e — cross-region broker, landed v2.38.0 — closes Phase 6)

Agent: Claude · Repos touched: noetl/server (PR), noetl/server.wiki, noetl/ai-meta.wiki

Headline. Phase 6 of the Secrets Wallet umbrella closes. Phase 6c's residency gate is fail-closed: when a server in us-east-1 is asked for a credential whose home is eu-central-1, it denies the request with HTTP 403 and the workflow step fails. That's correct for hard-isolation use cases. The more common operational shape is "the credential should be resolved IN the EU and the cleartext should never leave EU memory, but the worker that needs it happens to run in US." Phase 6e wires this pattern up by chaining residency-denied resolutions through a broker server in the credential's home region that re-seals the result to the requesting worker via the Phase-5 sealing primitives. Cleartext stays in the home region; only the sealed envelope crosses the wire; only the addressed worker can open it.

What landed (server#119, merged → noetl-server v2.38.0 (2803f05), closed server#118):

  • src/secrets/broker.rs
    • BrokerRegistry — region → broker_url map from NOETL_SECRET_BROKER_REGISTRY env (JSON object). Empty by default; deployments without a broker keep the pre-6e fail-closed behaviour.
    • BrokerClient — forwards a sealed-credential request to a peer. Maps every failure mode to AppError::CrossRegionUnreachable.
    • CrossRegionResolveRequest wire-shape struct.
  • src/handlers/cross_region.rs — POST /api/internal/cross-region/resolve peer-server endpoint: validates expected_entry_region == server_region() (defensive against misconfigured peer registries — a stale registry can't silently coerce a server into serving credentials for the wrong region; returns 403 on mismatch), resolves locally, seals via Phase-5a primitives to the requesting worker's pubkey, returns the SealedEnvelope.
  • KeychainDef.no_broker_fallback: bool — per-credential opt-out for credentials whose policy says "this data physically cannot leave its home region, full stop."
  • AppError::CrossRegionUnreachable { broker_url, cause } — new variant → HTTP 502. Distinguishes "policy says no" (403 from ResidencyViolation) from "policy says yes via broker, but broker is down" (transient).
  • get_sealed handler — on AppError::ResidencyViolation, look up the entry's region in the BrokerRegistry; when configured, forward to the broker via BrokerClient; return the broker's envelope directly. Otherwise propagate the violation per Phase-6c semantics.
  • noetl_secret_broker_call_total{broker_region, outcome} counter — outcomes: ok / unreachable / denied_by_broker / wrong_region / bad_pubkey / resolve_error / serialize_error / seal_error. wrong_region is the alert-worthy combination — it means a peer's broker registry is out of date.
  • noetl_secret_broker_call_duration_seconds{broker_region} histogram — buckets [0.05, 0.1, 0.25, 0.5, 1, 2, 5] s, observed regardless of outcome.
  • NOETL_SECRET_BROKER_TIMEOUT_SECS env (default 10 s).
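
The routing decision on a residency violation can be sketched as a pure function. The JSON parsing of `NOETL_SECRET_BROKER_REGISTRY` is left out (the sketch builds the registry from pairs), and `Route`, `from_pairs`, `route_residency_violation`, and `broker_accepts` are hypothetical names for the illustration.

```rust
use std::collections::HashMap;

/// What the requesting server does after a residency denial (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum Route {
    Forward { broker_url: String },
    Deny403,
}

/// region -> broker_url map; empty means pre-6e fail-closed behaviour.
pub struct BrokerRegistry {
    map: HashMap<String, String>,
}

impl BrokerRegistry {
    /// Build from (region, url) pairs, filtering out empty URLs.
    pub fn from_pairs(pairs: &[(&str, &str)]) -> Self {
        let map = pairs
            .iter()
            .filter(|pair| !pair.1.is_empty())
            .map(|pair| (pair.0.to_string(), pair.1.to_string()))
            .collect();
        BrokerRegistry { map }
    }

    pub fn lookup(&self, region: &str) -> Option<&str> {
        self.map.get(region).map(String::as_str)
    }
}

/// Requesting side: forward only when a broker exists and the entry allows it.
pub fn route_residency_violation(
    registry: &BrokerRegistry,
    entry_region: &str,
    no_broker_fallback: bool,
) -> Route {
    if no_broker_fallback {
        return Route::Deny403;
    }
    match registry.lookup(entry_region) {
        Some(url) => Route::Forward { broker_url: url.to_string() },
        None => Route::Deny403,
    }
}

/// Broker side: refuse to serve credentials for a region it does not live in.
pub fn broker_accepts(expected_entry_region: &str, server_region: &str) -> bool {
    expected_entry_region == server_region
}
```

With an empty registry every violation stays a 403, which matches the opt-in behaviour described above; a broker that fails the `broker_accepts` check corresponds to the `wrong_region` outcome.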

Tests. Ten new across secrets::broker::tests and handlers::cross_region::tests: BrokerRegistry default empty; from_map lookup; from_env parses JSON; empty / invalid JSON treated as empty; empty-string values filtered; BrokerClient builds; trailing-slash URL handling; decode_pubkey round-trips X25519 / rejects wrong length / non-base64; CrossRegionResolveRequest round-trips JSON (wire-shape drift guard between requesting server and broker). Existing secrets::residency::tests constructor updated for the new no_broker_fallback field. Lib 410 / 0 passing.

Phase 6 architectural shape:

Worker A (us-east-1) → Server-US: GET /api/credentials/eu_token/sealed
  Server-US: residency check → Deny (entry_region=eu-central-1)
  Server-US: BrokerRegistry["eu-central-1"] → https://broker-eu.example.com
  Server-US → POST broker-eu /api/internal/cross-region/resolve
  Broker-EU: resolve locally, seal to Worker A's pubkey
  Broker-EU → Server-US: SealedEnvelope
  Server-US → Worker A: SealedEnvelope  (cleartext never left EU)

After Phase 6 — both residency shapes operational:

  • Hard isolation — residency: strict + no broker → fail-closed HTTP 403.
  • Soft federation — residency: strict + broker registered → transparent cross-region routing via the broker. Cleartext stays in the home region; the asking worker receives only the sealed envelope it can open.

That covers the original umbrella goal G7 in full.

Wiki. Server deployment-specification got a new "Cross-region broker (Phase 6e)" subsection under Secret providers + the new envs + endpoint + both metrics (server-wiki@51c1dfd).

Lib-only. No schema migration. Broker registry is opt-in via env; deployments without a broker keep pre-6e fail-closed behaviour.

Next. Phase 6 closes. The queue is:

  • Phase 6 follow-ups (each its own sub-issue under #61): 6d.1 AWS STS AssumeRoleWithWebIdentity · 6d.2 GCP iamcredentials.generateAccessToken · 6d.3 Azure AAD client-credentials. These plug into the Phase-6d primitives that already shipped.
  • Phase 7 — KEK rotation (per-record kek_version + online re-encrypt job) + secret_audit append-only table + token auto-renewal (OAuth2/JWT refresh-before-expiry inside the Phase-3c cache).

2026-06-06 (Secrets Wallet #61 Phase 6d primitives — dynamic-secret support + cache honors issuer TTL, landed v2.37.0)

Agent: Claude · Repos touched: noetl/server (PR), noetl/server.wiki, noetl/ai-meta.wiki

Headline. Phase 6 round 4. Some providers return secrets the issuer expires on a clock — AWS STS bearer tokens (15 min – 12 h), AAD access tokens (1 h), GCP iamcredentials.generateAccessToken (1 h default), OAuth2 access tokens with expires_in. The Phase-3c keychain cache used a fixed 600 s TTL; caching a token past expires_at means the next worker fetch gets a 401 and the playbook step fails.

This round ships the primitives + cache plumbing so the concrete cloud-specific dynamic providers can land as separate clean rounds (6d.1 / 6d.2 / 6d.3).

What landed (server#117, merged → noetl-server v2.37.0 (eda5cb3), closed server#116):

  • SecretValue.expires_at: Option<DateTime<Utc>> — issuer-reported expiry. Existing providers (GCP SM / AWS SM / Azure KV / Vault / K8s) pass None since they return long-lived secrets.
  • src/secrets/dynamic.rs — cache_decision(expires_at, default_ttl, safety_margin, now) returns CacheFor(secs) for normal-case secrets, or SkipCacheAlreadyExpired when the deadline is already past or inside the safety margin. Honours min(default_ttl, expires_at - now - safety_margin), floored at MIN_EFFECTIVE_TTL_SECS = 5.
  • KEYCHAIN_CACHE_DYNAMIC_SAFETY_MARGIN_SECS env (default 60 s) — buffer for clock skew + wall-clock between cache write and next worker fetch.
  • resolve_keychain_entry_with_meta — companion to the existing resolve_keychain_entry that also returns the bundle's expires_at. For a map-shaped keychain entry that bundles several secrets, the bundle TTL is the earliest of any contributing secre
