Sessions Log
Chronological log of agent sessions that touched the NoETL ecosystem. Each entry links to the durable artefacts (PRs, commits, wiki updates, ai-task issues) so future sessions can pick up where the last one ended.
How this log stays current. Each session adds an entry at the top when meaningful work lands (issues opened, PRs merged, pointer bumps, design decisions captured). Use the Issue Tracking convention to decide what's worth logging — same threshold: anything that will outlive the session.
2026-06-23 — 🪶 #104 OQ5 Option A SHIPPED — producer-staged result tier (worker v5.46.0); result_store-retirement soak gate defined (NOT started — gated)
Landed. OQ5 Option A merged + released: the producing worker stages the over-budget result tier object at emit time under NOETL_RESULT_PRODUCER_STAGE (default off → byte-identical no-op), decoupling the tier write from noetl.result_store; the materializer skips its result_store fetch on already-staged objects (skip-on-exists). Shared decide_tier → byte-identical to a materializer-written object.
- worker #132 (0d9ca18) → semantic-release noetl-worker v5.46.0 (27c7c17); crate + multi-arch image published.
- e2e #81 (59352b4) kind rig; docs #186 (566333b) RFC Option A section.
- ai-meta worker pointer dd07016 → 27c7c17 via #128.
- ops #206 (open, do-not-apply): OQ5 soak-gate alert rules (GMP + VMRule) on noetl_worker_result_mint_authoritative_total{path="legacy_fallback"}.
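The producer-stage plus skip-on-exists interaction described above can be sketched as follows. This is a minimal Python illustration, not the worker's Rust code: `ObjectStore`, `producer_stage`, and `materialize` are hypothetical names, and the in-memory dict stands in for the real tier backend.

```python
class ObjectStore:
    """Hypothetical in-memory stand-in for the result tier object store."""

    def __init__(self):
        self.objects = {}

    def exists(self, key):
        return key in self.objects

    def put(self, key, body):
        self.objects[key] = body


def producer_stage(store, key, body, stage_enabled):
    """Producer writes the tier object at emit time only when the flag is on."""
    if not stage_enabled:
        return False  # flag off: nothing is written, behavior is unchanged
    store.put(key, body)
    return True


def materialize(store, key, fetch_from_result_store):
    """Materializer skips the result_store fetch when the object is already staged."""
    if store.exists(key):
        return "skipped"
    store.put(key, fetch_from_result_store())
    return "written"
```

With the flag on, the materializer never calls its fetch callback for a staged object, which is the decoupling from noetl.result_store that the entry describes.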
Soak — NOT started, gated. The result_store-retirement gate is legacy_fallback holding at 0 while producers stage in steady state. That requires deploying worker v5.46.0 to prod + NOETL_RESULT_PRODUCER_STAGE=true on noetl-worker-rust — both gated prod changes, not done this session. Plan + start criteria + 72h/48h-floor window recorded on #104. No prod default changed; the result_store dual-write stays in place; retirement (dropping it) is a separate explicit-go step.
Note: the docs Cloudflare deploy is pre-existing red (unrelated MDX errors in two other docs/architecture/*.md files since 2026-06-17); the OQ5 RFC file compiles clean. Flagged for a separate fix.
2026-06-23 — ✅ #104 — Phase D (mint-authoritative + dual-write) ENABLED LIVE on prod — staged A–D enablement COMPLETE
Headline. Flipped the final staged result-tier flag on prod GKE (noetl-demo-19700101, ns noetl): NOETL_RESULT_MINT_AUTHORITATIVE=true makes the URN→Feather/GCS tier the authoritative result store, with noetl.result_store kept as the reversible dual-write fail-safe. A–D are now all live on prod. B (RESULT_MATERIALIZER_ENABLED, system-pool) + C (RESULT_URI_RESOLVE, worker-rust) were already on; this session added D.
Flag topology (corrected live). The materializer spawn is gated materializer_enabled() OR mint_authoritative() (worker/src/result_materializer.rs:154), so the flag on worker-rust spawns a stray result materializer at the wrong cell (local-0) that would fan out as worker-rust autoscales 1→20. ops#204 templates the flag only on system-pool — the intended design. Final prod placement: noetl-worker-system-pool (authoritative materializer, cell usc1-a; log AUTHORITATIVE Feather tier; #104 Phase D … authoritative=true) + noetl-server-rust (dual-write counter; no materializer there → side-effect-free). NOT worker-rust — its consume already resolves from the tier via Phase C; D there was redundant and only added the mint counter at the cost of the stray writer. (Captured result_mint_authoritative_total{path=tier}=2 during a brief all-on window as one-time consume-path proof, then removed it from worker-rust.)
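The stray-materializer effect follows directly from the OR in the spawn gate. A minimal Python sketch, assuming flags are read as the literal string "true" (the real parsing in worker/src/result_materializer.rs may differ) and using a hypothetical `spawns_result_materializer` helper:

```python
def flag(env, name):
    # Assumption: a flag is on only when its value is the string "true".
    return env.get(name, "").strip().lower() == "true"


def spawns_result_materializer(env):
    # Mirrors the described gate: materializer_enabled() OR mint_authoritative().
    return flag(env, "NOETL_RESULT_MATERIALIZER_ENABLED") or flag(
        env, "NOETL_RESULT_MINT_AUTHORITATIVE"
    )


system_pool = {
    "NOETL_RESULT_MATERIALIZER_ENABLED": "true",
    "NOETL_RESULT_MINT_AUTHORITATIVE": "true",
}
worker_rust = {"NOETL_RESULT_URI_RESOLVE": "true"}
worker_rust_with_d = dict(worker_rust, NOETL_RESULT_MINT_AUTHORITATIVE="true")
```

Here `spawns_result_materializer(worker_rust)` is false while `spawns_result_materializer(worker_rust_with_d)` is true, which is why the flag is templated on system-pool only.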
Validation (LIVE, 4 over-budget execs, 1200-row producer→settle→consume, tenant-segregated prefixes phased-mint*/phased-soak*). Every round clean: tier authoritative — Feather object written only at env=prod/region=usc1/cell=usc1-a/…/results/start/0/0/1.feather (shards s0118/s0096/s0134), system-pool sole result materializer, result_materializer_errors_total=0, no local-0 stray on any round; dual-write preserved — noetl_result_store_dual_write_total=4 (1:1 with execs), result_store_put_total{ok}=4, noetl.result_store row lands each time; resolve from tier — result_resolve_total{resolved_feather}=6, consume bound start.rows[1100] → row_count=1200/deep_id=1100/test_passed=true, all COMPLETED; cutover invariants — event-mat sole-writer projected==acked==100, duplicates=0/project_errors=0, lag 0 (nats_consumer_pending{noetl_materializer}=0), never-scan state_build_event_scans=0, 0 pod restarts, 0 GCS/ADC auth errors. Off-server CQRS gate (PUBLISH_ONLY=true+STATE_BUILDER=offserver), cell env, CPU limit 2 all preserved; GC/DR stay OFF; all three rollouts clean.
Decision: D LEFT ENABLED (system-pool + server). Revert armed: kubectl -n noetl set env deploy/noetl-worker-system-pool deploy/noetl-server-rust NOETL_RESULT_MINT_AUTHORITATIVE-. Remaining (gated, NOT touched): OQ5 byte-source re-plumb (materializer fetches payload FROM result_store today → retirement needs re-plumbing the tier-write byte source) + noetl.result_store retirement + OQ5 soak. #104 stays OPEN. #104 comment.
2026-06-23 — 🚀 #104 — WI/ADC GCS auth MERGED (server v3.45.0) + ROLLED TO PROD as WI KSA (tier still OFF)
Headline. Merged server#265 (fad5d8a, the 3-mode none/static/adc GCS auth matrix via gcp_auth) → semantic-release cut noetl-server v3.45.0 (21da3ef). Built the prod image via Cloud Build (e2-highcpu-8, us-central1) → us-central1-docker.pkg.dev/noetl-demo-19700101/noetl/server-rust@sha256:d3cbf1ad7e44d52dbee60ee2472073b4ef62f8787aeb0722b6d33f1b8eda8fe0 (tags 21da3ef + v3.45.0). Rolled it to prod GKE (gke_noetl-demo-19700101_us-central1_noetl-cluster, ns noetl) and set serviceAccountName: noetl-server-rust so the server now runs as the operator-provisioned WI-bound KSA. Applied the result-tier GCS ENV to the server (NOETL_OBJECT_STORE_BACKEND=gcs, GCS_BUCKET=noetl-demo-19700101-results, GCS_ENDPOINT=https://storage.googleapis.com, GCS_AUTH=auto, RESULT_CELL_ENV=prod/REGION=usc1/CELL=usc1-a/SHARD_COUNT=256) + seeded the matching cell ENV on noetl-worker-system-pool (no GCS_* on the pool — it writes via the server endpoint).
Tier stays INERT. No tier-enable flag was set on any deployment — NOETL_RESULT_MATERIALIZER_ENABLED / RESULT_URI_RESOLVE / RESULT_MINT_AUTHORITATIVE / RESULT_TIER_GC / RESULT_TIER_DR all remain unset/false. Behavior == prior stack. The off-server gate (NOETL_EVENT_INGEST_PUBLISH_ONLY=true + STATE_BUILDER=offserver) and CPU limits (limit 2 / req 250m) were preserved.
Validation (LIVE on prod). Server pod up healthy on v3.45.0 as KSA noetl-server-rust, 0 restarts; logs show object store backend: GCS … auth=adc resolved correctly with no token minted (lazy on first GCS I/O — none happens with the tier off), DB + NATS connected, Server listening. /api/internal/cells returns the applied config (shard_count:256, default_cell:usc1-a, cells:[usc1-a/prod/usc1/gcs/noetl-demo-19700101-results]). /health ok. Off-server cutover stayed healthy: a tenant-segregated smoke (test/simple_python) + system/scheduled_cleanup both COMPLETED; materializer sole writer (server published==projected==acked=13), consumer lag 0, state_builder_event_scans_total=0 (never-scan), replay reconstructed the full 13-event chain to terminal playbook.completed; result_materializer drained=0/errors=0 (inert). All pods 0 restarts.
Result. Prod is fully configured + the WI server is live, ready for staged B→C→D result-tier enablement (a later operator-gated task). Revert on standby: roll back the server image + serviceAccountName + the GCS/cell ENV; the cutover-gate revert (PUBLISH_ONLY=false STATE_BUILDER=server) stays armed. ai-meta pointer bumped repos/server → 21da3ef (v3.45.0).
2026-06-23 — 🪶 #104 Phases E + F MERGED to main — the FINAL build phases (side-effect barrier + result-tier GC/DR); repo-only, flags default-off
Headline. Merged the #104 Phase E and Phase F PRs in dependency order (E before F — F's worker/e2e branches were stacked on the Phase E branch). All flags default-off → inert in prod; this is repo-only, no prod deploy. The A–F build phases of #104 are now complete; only operational items remain (prod GCS infra, OQ5 byte-source re-plumb, staged enablement).
Phase E (side-effect durability barrier).
- tools#78 feat(registry) → noetl-tools v3.17.0 (1d49dd5): publishes registry::kind_is_side_effecting (conservative default true; only noop/rhai false). semantic-release cut the release and the release workflow published to crates.io: members noetl-directives + noetl-locator re-published first (workspace publish order), then the root noetl-tools 3.17.0.
- worker#130 feat(barrier) → noetl-worker v5.44.0 (d696f7e): the consume-pool barrier (NOETL_SIDE_EFFECT_BARRIER, default off). Before re-dispatching a side-effecting tool it adopts an already-durable result, so external side effects fire exactly once across re-drive. Repointed onto the published noetl-tools 3.17 (Cargo.toml ^3.17 + lockfile); no patch/path/git dep; cargo check/clippy/tests green.
- e2e#79 (6dd1432): barrier kind rig.
Phase F (result-tier GC + DR).
- server#264 merged with a feat: subject → noetl-server v3.44.0 (341b614): conservative dry-run-first GC sweeper (NOETL_RESULT_TIER_GC, default off) that reclaims only provably-dead tier objects and never deletes a live-referenced one (unit-tested decide).
- worker#131 merged with a feat: subject → noetl-worker v5.45.0 (dd07016): materializer verify-and-repair DR mode (NOETL_RESULT_TIER_DR, default off) that rebuilds a missing/corrupt tier object byte-identically from its WAL-derivable source. Rebased onto main after Phase E merged; inherits the published noetl-tools 3.17 pin (no patch).
- ops#205 (26185ff): GC/DR flags (both "false") on server-rust + worker-system-pool manifests + system/scheduled_cleanup GC step.
- e2e#80 (d7372be): GC + DR kind rig.
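The DR verify-and-repair mode listed above can be pictured with a small Python sketch. It assumes the canonical bytes can be re-derived on demand (the `rebuild` callback is hypothetical) and compares sha256 digests, as the kind rig's DR-1 check does; the real worker may verify against a different reference.

```python
import hashlib


def verify_and_repair(store, key, rebuild, dr_enabled):
    """store: dict key -> bytes; rebuild: callable returning the canonical bytes."""
    if not dr_enabled:
        return "noop"  # flag off: never touches the tier
    expected = rebuild()
    current = store.get(key)
    if current is not None and (
        hashlib.sha256(current).digest() == hashlib.sha256(expected).digest()
    ):
        return "ok"
    # Missing or corrupt: rewrite the object byte-identically from its source.
    store[key] = expected
    return "repaired"
```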
Cleanup (this change set). ai-meta pointers bumped (tools 1d49dd5, server 341b614, worker dd07016, ops 26185ff, e2e d7372be); deployment-spec wiki rows committed — worker-wiki NOETL_SIDE_EFFECT_BARRIER (E) + held NOETL_RESULT_TIER_DR (F), server-wiki held NOETL_RESULT_TIER_GC (F); Home/Sessions-Log/Releases/Umbrella-Event-WAL-Storage updated. #104 updated (Refs, stays OPEN).
2026-06-23 — 🪶 #104 Phase F implemented + kind-validated — result-tier GC + DR (→ MERGED 2026-06-23, see entry above)
Phase F (GC + DR) — the last #104 build phase — implemented + kind-validated, in review.
GC (server): POST /api/internal/result-tier/gc (gated NOETL_RESULT_TIER_GC,
dry-run default) — a conservative sweeper reclaiming only provably-dead tier
objects (execution aged out of noetl.event, past a grace window) that never
deletes a live-referenced object (unit-tested decide); object backend
list/delete; system/scheduled_cleanup GC step (double-gated). DR (worker):
result-materializer verify-and-repair mode (NOETL_RESULT_TIER_DR) rebuilding
a missing/corrupt object from its WAL-derivable source byte-identically.
OQ1 version-reaping + OQ5 retirement deliberately scoped OUT (reported, not
guessed). Tests: server 620 / worker 255 lib, clippy clean. Kind gate-ON 5-pass
green (off-server gate + fake-gcs, Phase F images built + loaded): GC-1 dry-run
lists the dead orphan + skips the live object (live_candidates=0) + deletes
nothing; GC-2 delete reclaims only the orphan, live survives + serves; GC-3
flag-off no-op; DR-1 deleted object re-derived byte-identically (sha256 match) +
served by read path; DR-2 flag-off no-op; invariants intact every execution;
baseline restored. A–F build phases complete — only operational items remain
on #104 (prod GCS infra, OQ5 byte-source re-plumbing, staged prod
enablement/minting cutover). Worker/e2e PRs stacked on the unmerged Phase E
branch (merge Phase E first). PRs (review-only): server#264 · worker#131 · ops#205 · e2e#80 · #104.
Deployment-spec wiki rows for the two flags staged (held until merge).
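The conservative GC rule in this entry (reclaim only provably-dead objects, dry-run by default) can be sketched in Python. The `TierObject` fields and the `sweep` helper are hypothetical; only the name `decide` and the rule itself come from the entry.

```python
from dataclasses import dataclass


@dataclass
class TierObject:
    key: str
    execution_live: bool     # execution still present in noetl.event
    aged_out_seconds: float  # time since the execution aged out
    live_referenced: bool    # something still points at this object


def decide(obj, grace_seconds):
    """Keep unless the object is provably dead and past the grace window."""
    if obj.live_referenced or obj.execution_live:
        return "keep"
    if obj.aged_out_seconds < grace_seconds:
        return "keep"
    return "reclaim"


def sweep(store, objects, grace_seconds, dry_run=True):
    """Return reclaim candidates; delete them only when dry_run is False."""
    candidates = [o.key for o in objects if decide(o, grace_seconds) == "reclaim"]
    if not dry_run:
        for key in candidates:
            store.pop(key, None)
    return candidates
```

Every uncertain case resolves to "keep", so a misclassification can only delay reclamation, never delete a live object.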
2026-06-23 — 🪶 #104 Phase E implemented + kind-validated — side-effect durability barrier (→ MERGED 2026-06-23, see entry above)
Headline. Implemented + kind-validated Phase E of #104 — the side-effect durability barrier (RFC §4.4 / T6). Before re-dispatching a side-effecting cycle whose derived result URN already resolves to a durable result (the Phase C read path), the worker skips re-execution and adopts the recorded result, so an external side effect fires exactly once across a crash-resume / re-drive; non-side-effecting cycles are never blocked. PRs open, not merged; no ai-meta pointer bump; server unchanged (worker-only, reuses Phase C GCS + cells).
What landed (branches).
- tools#78: registry::kind_is_side_effecting + Tool::side_effecting() + ToolRegistry::is_side_effecting; conservative default true, only noop/rhai false (5 tests).
- worker#130: the barrier in execute_with_server_url, gated NOETL_SIDE_EFFECT_BARRIER (default off → true no-op). Gate looks through the task_sequence wrapper (command_is_side_effecting); existence+adopt reuses result_resolver::resolve_by_urn; cycle_logical_uri factored from stamp_logical_uri so a re-drive resolves the identical URN. Metric noetl_worker_side_effect_barrier_total{outcome,tool} (11 Phase-E tests; 253 lib green; clippy clean).
- e2e#79: kind_validate_side_effect_barrier.sh + test_side_effect_barrier.yaml.
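The conservative per-kind classification amounts to a deny-by-default lookup. A Python sketch of the idea (the shipped function is Rust in noetl-tools; the set name here is hypothetical):

```python
# Only these kinds are known to be free of external side effects.
NON_SIDE_EFFECTING_KINDS = frozenset({"noop", "rhai"})


def kind_is_side_effecting(kind: str) -> bool:
    """Conservative default: any kind not explicitly listed counts as side-effecting."""
    return kind not in NON_SIDE_EFFECTING_KINDS
```

An unknown or newly added tool kind is therefore treated as side-effecting until someone explicitly exempts it.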
OQ4 resolved → static. Per-kind classification is the baseline; per-invocation
(http GET vs POST) is deliberately skipped because the barrier is adopt-only,
so over-classification is safe. attempt=1 is fixed → the barrier keys on
durable-success existence at the coordinate, not the attempt number, so OQ1
keep-every-attempt + #125 retry compose cleanly (retry-after-failure re-executes;
resume-after-success skips).
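The adopt-only behavior can be sketched in Python as below. `run_cycle` and the `durable_results` dict are hypothetical stand-ins for the worker's dispatch path and the resolve-by-URN lookup; the point is that the barrier keys on whether a durable success exists at the URN, not on an attempt counter.

```python
def run_cycle(urn, side_effecting, durable_results, execute, barrier_enabled):
    """durable_results: dict urn -> result for cycles that already succeeded."""
    if barrier_enabled and side_effecting and urn in durable_results:
        # Resume-after-success: adopt the recorded result, do not re-fire.
        return durable_results[urn], "skipped"
    # First run, retry-after-failure, barrier off, or non-side-effecting cycle.
    result = execute()
    durable_results[urn] = result
    return result, "executed"
```

If `execute` raises, nothing is recorded, so a later retry finds no durable result and re-executes, matching the retry-after-failure case.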
Kind validation (prod-exact off-server gate + fake-gcs; 3-pass green). A
deterministic forged re-drive (the worker acks before dispatch, and the claim
terminal-guard rejects re-publishing the same command_id, so the rig copies the
command row with a fresh event_id → a non-terminal command for the same
(execution_id, step) → same URN) + a marker-object side-effect counter:
- PASS A (barrier ON): primary fired once (marker=1); re-drive SKIPPED (marker stayed 1, barrier{skipped} Δ=1).
- PASS B (barrier OFF): re-drive RE-EXECUTED (marker 1→2); barrier metric Δ0.
- PASS C (barrier ON): a terminal-noop re-drive never checked (barrier_total Δ0); counter unchanged.
- Invariants intact every primary: sole-writer, roots=1, dangling=0, walk==rows, terminal=1, materializer lag 0. Baseline restored (workers back to :104-phase-c).
Scope follow-up (not a blocker). The shipped existence signal is the tier-object half of §4.4 ("object HEAD"); small/inline side-effecting results (not tiered) re-execute today — the event-completion ("cycle acked") signal is a Phase-E follow-up. #104 stays OPEN (F + minting cutover remain).
Pointers. #104 comment · tools#78 · worker#130 · e2e#79.
2026-06-23 — 🪶 #104 Phase D MERGED — minting flip (4 PRs squash-merged in dependency order; flags default-off → inert in prod; repo-only)
Headline. Merged Phase D of #104 — the minting flip. All 4 PRs
squash-merged in dependency order with release-triggering subjects:
server#263 → noetl-server v3.43.0 6f6b9ef, worker#129 →
noetl-worker v5.43.0 be6863a, ops#204 b19b759, e2e#78
07e85aa. Flags default-off → inert in prod (a true no-op). Repo-only;
the prod minting cutover (rolling server 3.43.0 + worker 5.43.0 to GKE) is
the separate next task. #104 stays OPEN (Phase E/F + the minting
prod-cutover remain).
Dependency / version analysis. No new crate publish was needed.
Neither server#263 nor worker#129 changes Cargo.toml — the only path deps
are the in-workspace noetl-orchestrate-core, and the registry deps
(noetl-locator 0.1.1, noetl-events 0.1, noetl-tools 3.16.0, arrow 53)
all already published in Phase A–C and resolve from crates.io. No git/branch
dep, so no pre-merge repoint and no noetl-locator/noetl-tools member-publish
ordering. semantic-release cut the two expected minors (server v3.42.0 →
v3.43.0, worker v5.42.0 → v5.43.0). ops/e2e have no semantic-release —
their squash subjects are conventional-commit hygiene only.
OQ5 — result_store retirement — DECIDED metric-gated. Recorded on #104
and the RFC umbrella page: drop the dual-write once
noetl_worker_result_mint_authoritative_total{path="legacy_fallback"} holds 0
across a staging soak (the tier never misses) plus a retention-period time
floor (dual-write runs ≥ one full result retention period at flag-on so any
in-flight resume can still fall back); historical rows age out by expires_at
(no back-migration). The actual retirement is blocked on a not-yet-done
prerequisite (NOT Phase D scope): the materializer fetches the over-budget
payload from result_store today, so the tier-write byte source must be
re-plumbed first (producer stages directly to the tier, or the materializer
reads the inline over-budget payload off the event). Until that lands, dropping
result_store would starve the tier writer even with the metric gate green.
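The retirement decision combines three conditions: the metric gate, the time floor, and the byte-source prerequisite. A minimal Python sketch, with a hypothetical function name and inputs (the real gate lives in alert rules and an operator decision, not in code like this):

```python
def may_retire_result_store(
    fallback_deltas, flag_on_seconds, retention_seconds, byte_source_replumbed
):
    """fallback_deltas: legacy_fallback counter increases sampled over the soak."""
    # Metric gate: the tier never missed during the soak (and we have samples).
    metric_green = len(fallback_deltas) > 0 and all(d == 0 for d in fallback_deltas)
    # Time floor: dual-write ran for at least one full result retention period.
    floor_met = flag_on_seconds >= retention_seconds
    # Prerequisite: the tier writer no longer reads its bytes from result_store.
    return metric_green and floor_met and byte_source_replumbed
```

A green metric alone returns false here while the byte source is still result_store, which is the starvation risk the paragraph above calls out.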
Pointers. ai-meta repos/server→6f6b9ef, repos/worker→be6863a,
repos/ops→b19b759, repos/e2e→07e85aa; wiki Home / Sessions-Log /
Releases / Umbrella-Event-WAL-Storage (Phase D → MERGED, OQ5 → metric-gated)
updated in the same change set; the held Phase D wiki-doc pointer bumps
(ai-meta-wiki, noetl-server-wiki deployment-spec, noetl-worker-wiki
deployment-spec) committed alongside.
2026-06-23 — 🪶 #104 Phase D implemented + kind-validated — minting flip, in review (4 PRs, flag default-off → inert in prod)
Headline. Implemented Phase D of #104 — the minting flip: one flag
NOETL_RESULT_MINT_AUTHORITATIVE (default off → byte-identical to Phase A–C)
makes the URN → Feather/GCS result tier the authoritative result store, with
noetl.result_store demoted to the reversible dual-write fallback. Validated
green on local kind under the prod-exact off-server gate; PRs open for review, not
merged; no ai-meta pointer bump, no prod change. #104 stays OPEN.
What Phase D added.
- Worker (the flip): the result materializer becomes the authoritative tier writer under the flag (implies the Phase B flag); resolve-by-URN becomes the primary consume read path (implies the Phase C flag). A tier miss falls back fail-safe to the dual-written result_store (rollback safety). New noetl_worker_result_mint_authoritative_total{path} (tier|legacy_fallback).
- Server: config flag result_mint_authoritative; the result_store PUT counts each write on the new noetl_result_store_dual_write_total (the reversible dual-write leg) under the flag. The tier write stays worker-side because the slim control plane cannot encode Feather (OQ7).
- ops: the flag plumbed into the kind system-pool manifest, default off; prod manifests untouched.
- e2e: kind_validate_result_mint_authoritative.sh, a 3-pass rig.
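The tier-first read with a fail-safe fallback described in the Worker bullet can be sketched in Python. Both stores are plain dicts here and `resolve_result` is a hypothetical name; the counter keys mirror the `path` label values tier and legacy_fallback.

```python
from collections import Counter


def resolve_result(urn, tier, result_store, metrics):
    """Read from the authoritative tier; fall back to the dual-written store on a miss."""
    body = tier.get(urn)
    if body is not None:
        metrics["tier"] += 1
        return body
    metrics["legacy_fallback"] += 1
    return result_store[urn]


metrics = Counter()
tier = {"urn:a": b"feather"}
result_store = {"urn:a": b"feather", "urn:b": b"legacy"}
first = resolve_result("urn:a", tier, result_store, metrics)   # served by the tier
second = resolve_result("urn:b", tier, result_store, metrics)  # tier miss, fallback
```

Because every write is dual-written, a non-zero legacy_fallback count signals a tier miss rather than data loss.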
Kind validation (off-server gate + fake-gcs, 3-pass green).
- PASS 1 (flag ON): over-budget result AUTHORITATIVE in the GCS tier (gcs put Δ4), DUAL-WRITTEN to result_store (row present + server dual_write Δ1), consumer RESOLVES FROM THE TIER (gcs get Δ2, worker mint{tier} Δ2); 1200 rows; sole-writer intact.
- PASS 2 (flag OFF): true no-op (dual_write/mint/resolve Δ0), legacy store authoritative; parity with PASS 1 (1200 rows).
- PASS 3 (tier-miss ROLLBACK): the execution's tier object is DELETED during the fixture settle window (deterministic miss, independent of materializer/rollout timing), so resolve-by-URN misses and falls back to the dual-written result_store (mint{legacy_fallback} Δ1, fallback_object_miss Δ1); full payload still bound.
Server 613 + worker 247 lib tests + clippy green; heavy graph (duckdb/arrow/tonic/ rhai/gcp_auth/kube) stays absent from the control plane. Baseline restored on kind.
OQ5 surfaced as the open DECISION before the prod minting cutover — the
result_store retirement window (how long to dual-write + historical-row handling +
the materializer-payload-source prerequisite). Framed on #104, not decided here.
Pointers. PRs (unmerged, review-only): server#263 · worker#129 · ops#204 · e2e#78. Wiki deployment-spec env-var docs: server.wiki + worker.wiki. Umbrella page Phase D → implemented/in-review; #104 stays OPEN.
2026-06-23 — 🪶 #104 Phase C MERGED — resolve-by-URN read path (3 PRs, flags default-off → inert in prod)
Headline. Merged the 3 #104 Phase C PRs in dependency order
(server → worker → e2e) and bumped the ai-meta pointers. The
result-DATA read half now lives on main: the server has a
GCS object backend + a cell-endpoint registry + GET /api/internal/cells, and the worker has the resolve-by-URN read
path (references-in-state behavior, flatten_single_tool_result;
closes OQ6) plus fixes B and B1. Every flag is default-off →
inert in prod until a future rollout enables the read path.
#104 stays OPEN (Phase D minting flip remains).
Dependency / version analysis. No new crate publish was
needed. server#262 does not touch Cargo.toml and resolves
noetl-locator 0.1.1 (Phase B) from crates.io; worker#128's only
manifest change is adding the published arrow = "53" direct dep
and it resolves noetl-tools 3.16.0 (Phase B) from the registry.
No git/branch deps anywhere → no downstream repoint. Lockfile
confirms noetl-tools 3.16.0 / noetl-locator 0.1.1 / arrow 53.4.1.
What landed.
- server#262 (squash 082a955) → semantic-release noetl-server v3.42.0 (c2d5ca9). GCS object backend + cell registry + GET /api/internal/cells.
- worker#128 (squash 379bf31) → semantic-release noetl-worker v5.42.0 (7971041). Resolve-by-URN read path + fixes B/B1; 38 unit tests.
- e2e#77 (squash 39dc880): 3-pass resolve-by-URN rig + fixture + fake-gcs manifest. No semantic-release (pointer-only).
ai-meta pointers bumped. repos/server→c2d5ca9 (v3.42.0),
repos/worker→7971041 (v5.42.0), repos/e2e→39dc880. Wiki:
Home + Sessions-Log + Releases + ecosystem-map cells +
Umbrella-Event-WAL-Storage (Phase C → MERGED, OQ6 resolved).
No prod deploy — Phase C ships default-off and reaches prod on a later rollout. PROD GKE untouched.
Headline. Merged the 5 #104 Phase B PRs in dependency order and
bumped the ai-meta pointers. The shadow Feather result-DATA tier now
lives on main across worker / tools / server / ops / e2e, with every
flag default-off → inert in prod until a future rollout. #104
stays OPEN (Phases C–F remain).
What landed (dependency order).
- noetl/tools#77 → noetl-tools v3.16.0 (740a2c2 merge → release 7da39d8) + new noetl-locator v0.1.1 published to crates.io. The PR adds the additive ResultCoordinates::parse/from_locator (the inverse of logical_uri) to the slim, dependency-free noetl-locator member. Release mechanics correction applied pre-merge: semantic-release's prepareCmd bumps only the root Cargo.toml, and the publish step skips noetl-locator when its version is already on crates.io (0.1.0, Phase A), so a member version bump is required for the locator to publish. Bumped noetl-locator/Cargo.toml 0.1.0 → 0.1.1 (additive → patch keeps the root ^0.1 dep + server's ^0.1.0 resolving the new version) and pushed to the PR branch before merging. Confirmed on crates.io: noetl-locator 0.1.1 + noetl-tools 3.16.0.
- No downstream repoint needed. Verified before merging server/worker: neither carries a temporary git/branch dep on unpublished code. noetl/server resolves noetl-locator ^0.1.0 from the registry (registry+…crates.io-index, doesn't use the new API); noetl/worker resolves noetl-tools ^3.14.2 from the registry and uses a self-contained local coords_from_uri inversion (the swap to from_locator is the explicitly-deferred Phase-B follow-up). No [patch], no git =, no branch =. The task's conditional repoint did not fire.
- noetl/server#261 → noetl-server v3.41.0 (48ad318 merge → release 4a6659e). Ensures the sibling noetl_result_materializer durable consumer at stream-birth (own ack cursor); no new deps, control plane stays slim.
- noetl/worker#127 → noetl-worker v5.41.0 (c1adb7f merge → release 4b1c15b). The src/result_materializer.rs consume-loop writes the over-budget result (tabular → Arrow Feather, non-tabular → JSON, small → inline no-op) to the derived §7 key alongside noetl.result_store; gated NOETL_RESULT_MATERIALIZER_ENABLED (default off → not spawned → no-op).
- noetl/ops#203 (c92753c): NOETL_RESULT_MATERIALIZER_ENABLED (default false) + single-cell seed NOETL_RESULT_CELL_ENV/REGION/CELL on the kind system-pool deployment; prod manifests untouched.
- noetl/e2e#76 (04c3332): kind_validate_result_materializer.sh + test_large_tabular_result.yaml (the flag-on Feather/JSON + flag-off Δ0 rig, validated pre-merge).
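The reason a patch bump keeps both ^0.1 and ^0.1.0 resolving to 0.1.1 is Cargo's caret rule: the leftmost non-zero component must not change. A small Python sketch of that rule (`caret_matches` is a hypothetical helper that ignores pre-release and build metadata, and expects a full three-part version):

```python
def caret_matches(req, version):
    """Return True when `version` satisfies a Cargo caret requirement like ^0.1."""
    parts = [int(p) for p in req.lstrip("^").split(".")]
    given = len(parts)
    major, minor, patch = parts + [0] * (3 - given)
    ver = tuple(int(p) for p in version.split("."))
    if major > 0 or given == 1:
        upper = (major + 1, 0, 0)   # ^3.14.2 allows anything below 4.0.0
    elif minor > 0 or given == 2:
        upper = (0, minor + 1, 0)   # ^0.1 allows anything below 0.2.0
    else:
        upper = (0, 0, patch + 1)   # ^0.0.3 allows only 0.0.3
    return (major, minor, patch) <= ver < upper
```

Under this rule a 0.2.0 release of noetl-locator would not have been picked up by the existing ^0.1 requirements, which is why the additive change went out as a patch.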
Validation. All 5 PRs were kind-validated upstream pre-merge
(gate-ON: tabular → real Arrow .feather 269 KB, non-tabular → .json,
flag-off Δ0, event-materializer sole-writer intact every leg). No prod
deploy this session — repo-only; the tier reaches prod only on a future
rollout that enables the flag.
Pointers bumped (ai-meta). repos/tools → 7da39d8, repos/server
→ 4a6659e, repos/worker → 4b1c15b, repos/ops → c92753c,
repos/e2e → 04c3332. Wiki: Home + Sessions-Log + Releases +
Umbrella-Event-WAL-Storage (Phase B → MERGED).
2026-06-22 — 🪶 #104 Phase B implemented + kind-validated — shadow Feather result tier (in review, flag default-off)
Headline. Built the result-DATA half of #104: a separate
noetl_events consume-loop on the system pool
(noetl_result_materializer, own ack cursor) that resolves an
over-budget result's payload and shadow-writes the body to the
derived §7 object key — tabular → Arrow Feather, non-tabular →
JSON (OQ3), small → inline no-op — alongside noetl.result_store.
Nothing reads it until Phase C. Gated behind
NOETL_RESULT_MATERIALIZER_ENABLED (default off → true no-op). #104
stays OPEN (C–F remain). No prod change, no pointer bump (PRs unmerged).
What landed (PRs in review).
- worker (worker#127): src/result_materializer.rs, the shadow consume-loop. Tiering mirrors the worker's arrow_codec (detects a tabular rowset top-level or under the conventional data.<tool> envelope); keep-every attempt URN (OQ1); single-cell seed (RFC §4.3); never alters the authoritative result, never fails an event (errors counted + acked); metrics noetl_worker_result_materializer_*. 12 unit tests + 236 lib + clippy.
- tools (tools#77): ResultCoordinates::parse/from_locator (single-source URI→coords).
- server (server#261): ensure the sibling noetl_result_materializer consumer (no new deps).
- ops (ops#203): system-pool flag + single-cell seed (default off; prod manifests untouched).
- e2e (e2e#76): kind_validate_result_materializer.sh + a tabular over-budget fixture.
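The three-way tiering choice (small → inline, tabular → Feather, non-tabular → JSON) can be sketched in Python. The signature, the byte-budget measurement, and the rowset test (a non-empty list of dicts) are assumptions for illustration; the worker's arrow_codec detection may be stricter.

```python
import json


def is_rowset(value):
    # Assumption: a tabular rowset is a non-empty list of dict rows.
    return isinstance(value, list) and len(value) > 0 and all(
        isinstance(row, dict) for row in value
    )


def find_rowset(payload, tool):
    """Look for a rowset at the top level or under the data.<tool> envelope."""
    if is_rowset(payload):
        return payload
    if isinstance(payload, dict):
        inner = payload.get("data")
        if isinstance(inner, dict) and is_rowset(inner.get(tool)):
            return inner[tool]
    return None


def decide_tier(payload, tool, budget_bytes):
    size = len(json.dumps(payload).encode("utf-8"))
    if size <= budget_bytes:
        return "inline"  # small result: no tier object is written
    return "feather" if find_rowset(payload, tool) is not None else "json"
```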
Validation. Local kind under the prod-exact off-server gate
(PUBLISH_ONLY + off-server drive + materializer sole-writer), server +
system-pool on Phase B images: flag-on tabular → real Arrow
.feather (269 KB, Arrow magic) at
…/cell=local-0/shard=s0053/…/results/start/0/0/1.feather, non-tabular
→ .json; flag-off → 0 objects (true no-op); both COMPLETED;
event-materializer sole-writer invariants intact every leg
(event_rows==distinct, catalog0=0, orchestrate=0, mat dup=0).
Cluster restored to baseline (images reverted, env removed, server
healthy v3.39.5).
Decisions. OQ1 = keep-every; OQ3 = JSON; OQ6's multi-cell registry +
miss behaviour deferred to Phase C (one cell ⇒ no miss); object-store
backend = the shipped server-mediated PUT /api/internal/objects/{key}
seam. No new blocker before Phase C.
2026-06-22 — 📦 #104 Phase A MERGED — slim noetl-locator 0.1.0 extracted + published; server accepts the canonical result URI (flag default-off, repo-only)
Headline. Executed the #104 Phase A merge sequence end-to-end —
extracted the slim, dependency-free noetl-locator crate, published
noetl-locator 0.1.0 to crates.io, and merged the server accept hook
behind a default-off flag. No prod deploy (Phase A reaches prod on a
later server rollout).
What landed.
- tools#76: squash-merged with a feat(locator): subject (the PR was titled refactor(locator):, which semantic-release treats as non-bumping → would have published no release). The feat subject cut noetl-tools v3.15.0 (dc0c5d8) and the release CI published the new workspace member noetl-locator 0.1.0 to crates.io (member-publish ordering: locator before the root crate, same shape as noetl-directives). The crate is pure std (ResourceLocator, ResultCoordinates, shard_key, CellPlacement, legacy-ref parse), re-exported as noetl_tools::locator so the worker stamp path is unchanged.
- server#260: repointed the temporary git-dep (noetl-locator = { git = …, branch = … }) to noetl-locator = "0.1.0", re-resolved the lockfile off crates.io, confirmed cargo build / cargo tree resolve from the registry and the heavy graph stays absent from the control plane (duckdb / kube / arrow / tonic / rhai / gcp_auth = 0 occurrences), 623 server tests green; committed (92fbceb) + pushed, then squash-merged → noetl-server v3.40.0 (c89d078). The server accepts the canonical result URI behind NOETL_RESULT_URI_ACCEPT (default off / no-op).
- e2e#75: squash-merged the Phase A kind validation rig (eeca8b7).
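Why the squash subject mattered for tools#76: with semantic-release's default commit analysis, only certain commit types trigger a release. A Python sketch of that mapping (breaking-change handling is left out, and the repo's actual release config is assumed to use the defaults):

```python
import re

# Default commit-analyzer mapping for non-breaking commits.
RELEASE_BY_TYPE = {"feat": "minor", "fix": "patch", "perf": "patch"}


def release_type(subject):
    """Return the bump a commit subject would trigger, or None for no release."""
    match = re.match(r"^(\w+)(\([^)]*\))?:", subject)
    if match is None:
        return None
    return RELEASE_BY_TYPE.get(match.group(1))
```

A subject starting with refactor(locator): maps to no release at all, so the new crate would never have been published; the feat(locator): subject produces the minor bump.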
Resolves OQ7 (the umbrella's open question — noetl-tools dragged its whole dependency graph onto the control plane just to parse a URI).
ai-meta change set. Pointers bumped repos/tools→dc0c5d8,
repos/server→c89d078, repos/e2e→eeca8b7; wiki Home +
Sessions-Log + Releases + ecosystem-map + Umbrella-Event-WAL-Storage
(Phase A → MERGED). #104 stays OPEN (Phases B–F ahead).
2026-06-22 — 🐘 #95 CLOSED — postgres pg_value_to_json temporal/identity serialization shipped to prod
Headline. Shipped the #95 postgres timestamp fix to prod end-to-end (same pattern as the #127 worker ship): merge tools#75 → release noetl-tools v3.14.2 → bump the worker pin → build + roll the worker to prod under the live off-server CQRS cutover → close #95.
Root cause. noetl_tools::tools::postgres::pg_value_to_json probed
i64/i32/f64/bool/String/serde_json::Value/DateTime<Utc> and fell
through to Value::Null for everything else. A timestamp without time zone
column (chrono NaiveDateTime) hit that fall-through and serialized to null
even though the value was present — the #95 repro, where auth.sessions.expires_at
came back null and tripped the gateway's expires_at validation. timestamptz
(DateTime<Utc>) already had an arm and was unaffected.
Fix (tools#75). Added arms for the
temporal + identity types that shared the gap: timestamp → NaiveDateTime
(ISO-8601 with NO offset suffix — a tz-naive value carries no zone), date →
NaiveDate, time → NaiveTime, uuid → hyphenated lowercase string,
numeric/decimal → exact decimal string via a direct lossless decode of the
postgres numeric binary wire format (matches the duckdb decimal-as-string
convention), bytea → base64. 409 lib tests + clippy clean + a kind-gated
live-postgres before/after integration test (tests/postgres_temporal_kind.rs,
behind NOETL_PG_KIND_DSN) proving the pre-fix arm nulled
timestamp/date/time/uuid/numeric/bytea while PostgresTool now returns the real
values.
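The serialization conventions chosen by the fix can be illustrated with a Python analogue. This is not the Rust pg_value_to_json; `value_to_json` is a hypothetical helper using Python's own temporal, UUID, and decimal types to show the target string shapes.

```python
import base64
import datetime
import decimal
import uuid


def value_to_json(value):
    """Map a database value to a JSON-safe value using the conventions above."""
    if value is None or isinstance(value, (bool, int, float, str)):
        return value
    if isinstance(value, (datetime.datetime, datetime.date, datetime.time)):
        return value.isoformat()  # a tz-naive value gets no offset suffix
    if isinstance(value, uuid.UUID):
        return str(value)         # hyphenated lowercase
    if isinstance(value, decimal.Decimal):
        return str(value)         # exact decimal string, no float rounding
    if isinstance(value, (bytes, bytearray)):
        return base64.b64encode(bytes(value)).decode("ascii")
    # Unlike the pre-fix fall-through to null, surface unknown types loudly.
    raise TypeError(f"unsupported type: {type(value).__name__}")
```

Raising on an unknown type is a design choice of this sketch: a silent null is exactly what hid the #95 bug.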
What landed.
- tools#75 squash-merged 06302ac → semantic-release noetl-tools v3.14.2 (6d9b674), published to crates.io.
- worker#126 bumps the noetl-tools pin 3.14.1 → 3.14.2 (deps-only, no worker source change); cargo check + cargo clippy --all-targets clean. Squash 60a849d → semantic-release noetl-worker v5.40.5 (da24952).
- Built the prod worker image via Cloud Build us-central1: tag noetl-worker-rust:v5.40.5, digest @sha256:45212dbe7410920aaa6311074bb9ef78f161c369dbd995364cda2ecfda2f0af2 (--machine-type=e2-highcpu-8 --timeout=5400s per the #127 cold-musl-build lesson; ~24 min).
- Rolled by digest onto prod noetl-worker-rust + noetl-worker-system-pool via kubectl set image (server stays v3.39.5@feaac0c5; worker CPU req 250m / limit 2 preserved).
Rollout health (live). Rolling restart clean — noetl-worker-rust 2/2 +
noetl-worker-system-pool 1/1 Ready, 0 crashloop/restarts after a ~3-min soak,
both on the new digest, resources preserved. Off-server CQRS cutover stayed
healthy: materializer started (ack-after-materialize, sole noetl.event
writer), noetl_worker_nats_consumer_pending{consumer="noetl_materializer"}=0,
system command lag (NOETL_COMMANDS_RUST)=0, materializer_project_errors=0,
duplicates=0, and the system-pool WAL index rehydrated from the retained
noetl_events WAL (indexed_executions=5, wal_events=198, durable=false —
the #119 ephemeral rebuild). Server untouched. No rollback needed.
Prod fix confirmation. Relied on the kind-gated before/after integration
test (postgres_temporal_kind.rs) rather than running a new postgres playbook
against prod — the test already proves the before/after on a live postgres, and
running a fresh playbook would touch prod data. Rollout was clean.
Pointers. ai-meta repos/tools→6d9b674 (v3.14.2) + repos/worker→da24952
(v5.40.5); wiki Home/Sessions-Log/Releases + ecosystem-map cells updated; board
3 → Done; #95 closed.
Headline. With the event-WAL half now live on prod, upgraded the #104 umbrella page from a tracking dashboard into a full RFC and re-scoped it to the remaining result-DATA half. Design-only — no prod changes, no code, no pointer bumps to runtime repos.
What landed.
- RFC rewritten at Umbrella: Event WAL Storage (matching the #115 RFC format): numbered sections, a decided/proposed/open tenet table, NATS-as-WAL (live), convention URNs (locator shipped, adoption proposed), the Feather result tier, correctness, red-team open questions, a phased plan (A–F), and a §9 dependency/sequencing analysis vs #115 / #107 / #101.
- Grounded in shipped code: the event half is live (materializer.rs sole writer of noetl.event; state_builder.rs chain-walk from expected_head; ChainHeads.link_batch prev_event_id; off-server CQRS cutover v3.39.5 on prod). The result half is not: result bytes still live in Postgres noetl.result_store.data (JSONB), the logical reference.uri is stamped but unconsumed, and no production object-store Feather writer exists (only the reference-materializer WASM worked-example proving the object_put capability path).
- Sequencing call: #104's result-data tier finishes #107 program step 3 (demote the result store, as the event store was already demoted) and is on the critical path for #107 steps 4–5.
- Home.md Active-umbrella #104 row + Last-refreshed headline updated; board 3 stays In progress (RFC under review, no lifecycle change).
- Summary comment + review request posted on #104.
Pointers. Wiki: Umbrella-Event-WAL-Storage.md, Home.md,
Sessions-Log.md. Issue: #104.
Blueprint it reconciles: event_wal_and_derivable_storage.md.
Headline. Merged the release-server workflow fix that unblocks the
crate publish pipeline. CI-only change — semantic-release cut no new
version (latest stays v3.39.6), so no repos/server pointer bump.
What landed.
- server#259 merged (squash be094a2): fixes the failing release-server workflow. cargo publish of noetl-server failed on the bare path dep to noetl-orchestrate-core ("all dependencies must have a version requirement specified when publishing"). The fix adds version = "0.1.0" alongside the path in Cargo.toml and splits the publish step to publish the workspace member noetl-orchestrate-core first (idempotent skip-if-published), mirroring the noetl-tools/noetl-directives pattern (noetl/ai-meta#92, #108).
- No version cut. The merge's ci(release): commit is non-bumping; Semantic Release logged "Analysis of 1 commits complete: no release". Latest release remains v3.39.6. No pointer bump, no prod touch.
- Validation. release.yml fetched from the PR head parses clean (4 jobs: verify-version / publish-crate / publish-image / github-release); the author verified cargo publish resolution locally via dry-run. Real end-to-end validation is the next actual release-server workflow_dispatch run, which was not exercised here.
Why it matters. The last two release-server runs failed at the publish
step; this unblocks the crate publish pipeline for the next server release.
Headline. Closed the loop on the now-shipped CQRS / off-server-drive program and reconciled the drifted tracking surfaces. No prod changes.
What landed.
- e2e#74 merged (squash 1deadf1): scripts/prod_regression_validate.py, the prod-scoped Rust regression validator for the CQRS gate-ON cutover; validated 28/30 against live prod. ai-meta repos/e2e pointer bumped to 1deadf1. This was the last open artifact of the live CQRS cutover.
- Closed #103 (Step 2: CQRS event log): prod cutover live + validated (server v3.39.5 PUBLISH_ONLY + offserver / worker v5.40.x Rust materializer sole-writer); e2e#74 was its last piece.
- Closed #102 (batch event-log writes): server#198/#199 landed; worker-side superseded by #103. Closed #101 (incremental state / results-by-reference): acceptance met + kind-validated; consume side reassigned to #115 Phase 1. No residual on either.
- Closed superseded Python ops PRs #189/#190/#192 (system/event_materializer + system/projector playbooks/CronJobs): the cutover shipped the Rust in-process worker materializer (noetl-worker/src/materializer.rs) instead.
- Archived handoffs: 2026-06-18-orchestrate-plugin-dissolution (round-02 complete) + 2026-06-09-rust-stack-session-snapshot (stale read-only orientation, superseded).
- Reconciled roadmap board 3 (added the open umbrellas missing from it) and Home.md Active-umbrellas (moved closed #101/#102/#103 to Recently closed).
- Fixed the failing release-server CI in noetl/server (cargo publish failed: noetl-orchestrate-core path dep had no version): added version = "0.1.0" alongside the path + the release workflow now publishes the member crate first, mirroring how noetl/tools handles noetl-directives.
2026-06-22 — ✅ #127 CLOSED — task_sequence per-sub-task context optimization merged, released, and shipped to prod
Headline. Landed the structural half of #127: the behavior-preserving
task_sequence per-sub-task context optimization is merged, released to
crates.io, adopted by the worker, built into a prod image, and rolled onto
prod under the live off-server CQRS cutover. The code-opt now compounds with
the CPU-limit bump applied earlier this session — more headroom AND less work
per slot on the batch hot path. #127 fully closed.
What landed.
- noetl-tools v3.14.1: merged tools#74 (squash 9dd9aa6); semantic-release cut c8656c1 and the release-tools workflow published the crate to crates.io. The task_sequence drain rebuilt the template context per sub-task (running_ctx.clone() + 2–4× to_template_context() deep-clones + per-block ExecutionContext clones + a fresh context_to_value() per templated field). TemplateEngine::render_value now builds the proxied minijinja context ONCE and threads it through the recursion (render_value_with / render_with; minijinja Value is Arc-backed → reuse is a refcount bump, which helps every tool dispatch); new build_context_with_overlay(&variables, overlay) builds straight from &variables + a small overlay, skipping the intermediate to_template_context() HashMap deep-clone + per-block ExecutionContext clones in the set/policy paths. Isolated micro-bench (CPU held constant): per-sub-task context cost 2988.9µs→1147.1µs (−61.6%, 2.6×). 407 lib tests + 2 new equivalence pins + clippy clean.
- noetl-worker v5.40.4: merged worker#125 (squash 1a10a73); semantic-release cut 0afbf5c. Cargo.lock pin noetl-tools 3.14→3.14.1; deps-only, no worker source change; cargo clippy --all-targets clean (no new warnings vs baseline).
- Prod rollout: built the worker image via Cloud Build (us-central1-docker.pkg.dev/noetl-demo-19700101/noetl/noetl-worker-rust:0afbf5c) and rolled it onto prod noetl-worker-rust + noetl-worker-system-pool. Server left on v3.39.5/.6 (worker-only change). Worker CPU req 250m / limit 2 kept. Rolling restart clean (pods Ready, 0 crashloop) and the off-server CQRS cutover stayed healthy throughout: materializer sole-writer projected==acked, lag ~0, command consumer lag 0, executions completing, and the system-pool rehydrated its WAL index on restart (#119 proof).
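The build-once idea behind the tools#74 change can be sketched in a few lines of Python. This is an analogy only, not the Rust implementation: str.format_map stands in for the minijinja engine, the STATS counter is a hypothetical instrument added for the illustration, and the function names merely echo the ones mentioned above.

```python
from typing import Any, Optional

STATS = {"context_builds": 0}  # counts how often a context is built


def build_context_with_overlay(variables: dict, overlay: Optional[dict] = None) -> dict:
    """Build the render context from the variables plus a small overlay."""
    STATS["context_builds"] += 1
    ctx = dict(variables)  # shallow copy: values are shared, not deep-cloned
    if overlay:
        ctx.update(overlay)
    return ctx


def render_with(value: Any, ctx: dict) -> Any:
    """Recurse through value, reusing the same context at every level."""
    if isinstance(value, str):
        return value.format_map(ctx)  # stand-in for a real template engine
    if isinstance(value, dict):
        return {key: render_with(item, ctx) for key, item in value.items()}
    if isinstance(value, list):
        return [render_with(item, ctx) for item in value]
    return value


def render_value(value: Any, variables: dict, overlay: Optional[dict] = None) -> Any:
    """Build the context once, then thread it through the whole recursion."""
    ctx = build_context_with_overlay(variables, overlay)
    return render_with(value, ctx)
```

Rendering a nested config with several templated fields increments STATS["context_builds"] exactly once per render_value call, where a per-field rebuild would scale with the number of fields.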
Pointers + bookkeeping. ai-meta repos/tools→c8656c1 +
repos/worker→0afbf5c; Home dashboard + this log + Releases + ecosystem
map updated; #127 closed + board 3 → Done. PROD gate env vars / server image
/ DB untouched.
2026-06-22 — ⚡ #127 PROD WORKER CPU LIMIT RAISED 1→2 + APPLIED LIVE — ~20% batch-throughput win materialized on prod
Headline. Raised the CPU limit 1→2 (request 100m→250m) on
both prod Rust worker deployments and applied it to live prod, so
the ~20% batch-throughput win profiled on kind actually lands in
production. ops#202 (squash
85e3c23); ai-meta repos/ops pointer bumped.
Why. #127 — batch
PFT throughput plateaus because the workers are CPU-throttled. Kind
profiling of a 10k-patient batch showed the Rust step workers peg one
CPU and burn 38–47% of their time CPU-throttled at the 1-CPU
limit (the per-sub-task running_ctx.clone() + to_template_context()
is the hot path). Lifting the limit to 2 CPU removed the throttle and
cut the run ~166–172s → ~137.6s (~20%), zero patient loss.
Prod baseline (verified read-only first). Live prod worker limit
was already 1 CPU (kind had measured 0.5 — confirmed prod's real
value before changing anything): noetl-worker-rust (2 replicas, KEDA
min2/max20) and noetl-worker-system-pool (1 replica) both at request
100m / limit 1. Nodes: 4× ek-standard-16, ~15.89 CPU allocatable
each (~63.5 total), running ~2% CPU — ample headroom (KEDA-max
20×2 = 40 CPU still fits; requests stay tiny at 20×250m = 5 CPU). The
cluster already uses a low-request/high-limit bursting model
(server-rust is 250m/2); the new worker values mirror that ratio.
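The headroom and speedup figures quoted in this entry can be re-derived with plain arithmetic. The sketch below uses only numbers stated above; the variable names are illustrative.

```python
# Figures quoted in this entry.
NODES = 4
ALLOCATABLE_CPU_PER_NODE = 15.89
KEDA_MAX_REPLICAS = 20
CPU_LIMIT = 2.0      # new per-pod limit
CPU_REQUEST = 0.25   # new per-pod request (250m)

total_allocatable = NODES * ALLOCATABLE_CPU_PER_NODE   # about 63.5 CPU
worst_case_limit = KEDA_MAX_REPLICAS * CPU_LIMIT       # every pod bursting to its limit
total_requests = KEDA_MAX_REPLICAS * CPU_REQUEST       # guaranteed floor at KEDA max


def improvement_pct(before_s: float, after_s: float) -> float:
    """Relative run-time reduction, in percent."""
    return round((before_s - after_s) / before_s * 100, 1)


print(f"allocatable={total_allocatable:.2f} limit@max={worst_case_limit} requests@max={total_requests}")
print(f"speedup range: {improvement_pct(166, 137.6)}% to {improvement_pct(172, 137.6)}%")
```

Even with all 20 shared-pool replicas bursting to the 2-CPU limit, the sum stays below the cluster's allocatable CPU, and the kind run-time reduction lands between roughly 17% and 20% depending on which end of the 166–172 s baseline is used.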
Chosen value + rationale. request 250m (a real guaranteed floor,
matching server-rust) / limit 2 (the throttle relief = the
kind-validated win). The throttle ceiling is the limit, not the
request. Applied to both worker pools — the system-pool is a single
pod carrying the off-server orchestrate drive + sole-writer CQRS
materializer, so it pegs CPU the same way under a large batch; raising
its limit is pure upside (request stays low).
Prod apply (live action, monitored). kubectl set resources on
both deployments (surgical — touches only CPU req/limit, leaves image,
env, gate vars, memory untouched). Rolling restart completed cleanly:
all pods Ready, 0 restarts/crashloops, new limits live
(req=250m lim=2). Off-server CQRS cutover stayed healthy
throughout — pre/post materializer lag = 0 (noetl_materializer
consumer), command-consumer lag = 0, and the restarted system-pool pod
rehydrated its WAL index (noetl_worker_state_builder_indexed_executions > 0, the #119 restart-rehydration proof). No revert needed; the armed revert (restore prior limits) was kept ready but unused. PROD gate env vars / images / DB untouched.
Perf confirmation. Relying on the kind-measured ~20% reference (~166–172s → ~137.6s); no prod benchmark run this session (the change is a resource bump validated upstream + a clean live rollout).
Pointers.
- ops PR: ops#202 → 85e3c23
- Manifests: ci/manifests/noetl/worker-rust-deployment-prod.yaml, worker-system-pool-deployment-prod.yaml
- Issue: #127 (stays OPEN: the deeper per-sub-task context-clone hot path is the structural follow-up).
2026-06-22 — ✅ #123 SHIPPED + CLOSED — a non-iterable loop in: now fails loudly instead of silently wedging (server v3.39.6)
Headline. Merged the reviewed fix for the silent off-server
commands=0 wedge on a non-iterable loop in:
(server#258, squash
275b914 → release v3.39.6 7f109a9). A loop step whose in:
expression rendered to a non-iterable value (e.g.
loop: { in: '{{ workload.batch_slots }}' } when batch_slots is
absent → null) silently wedged the execution at commands=0 /
RUNNING-forever under the prod-default off-server drive — looking
identical to a hang. (This observability gap produced the
false-positive #122 during the #120 2×2 repro.)
Root cause. evaluate_loop already returns
CoreError::Validation("… did not evaluate to an iterable"), and the
in-process drive already turned that into a terminal
playbook.failed. But under the off-server drive
(NOETL_ORCHESTRATE_PLUGIN_DRIVE=true) the system/orchestrate wasm
plug-in returns a structured {"error":…} envelope instead of an
OrchestrationResult; apply_worker_orchestration couldn't decode that
as a result, logged a WARN, recorded decode_error, and returned
Ok(0) — no terminal event, so the run sat RUNNING forever. The
v3.4.2 explicit validation error was lost when the drive moved
off-server.
Fix. Server apply_worker_orchestration now decodes the drive ERROR
envelope (decode_orchestrate_error) and emits a terminal
playbook.failed (metric noetl_orchestrate_drive_total{stage="drive_error"},
structured execution_id), matching the in-process drive — a transient
decode miss (no envelope) stays on the benign re-drive path.
orchestrate-core prefixes the offending step name onto the existing
evaluate_loop error (prefix_loop_step_error). An empty iterable
([]/{}) still short-circuits to next — only a non-iterable
errors.
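The decision the fix introduces can be sketched as a three-way classification of the drive response. This Python sketch is illustrative only: the payload shapes (a commands list for a result, an error string for the envelope) and the returned action dicts are assumptions, and the real code is the Rust apply_worker_orchestration / decode_orchestrate_error pair.

```python
from typing import Any


class LoopValidationError(ValueError):
    """Raised when a loop in: expression does not render to an iterable."""


def evaluate_loop(step: str, expression: str, rendered: Any) -> list:
    """Return the loop items; an empty iterable is valid, a non-iterable is not."""
    if isinstance(rendered, (list, tuple)):
        return list(rendered)
    if isinstance(rendered, dict):
        return list(rendered.items())
    shown = "null" if rendered is None else type(rendered).__name__
    raise LoopValidationError(
        f"loop step '{step}': Loop expression '{expression}' "
        f"did not evaluate to an iterable (got {shown})"
    )


def apply_worker_orchestration(payload: Any) -> dict:
    """Classify one drive response (hypothetical payload shapes)."""
    # 1. A well-formed result: apply its commands.
    if isinstance(payload, dict) and isinstance(payload.get("commands"), list):
        return {"action": "apply", "commands": len(payload["commands"])}
    # 2. A structured error envelope: fail the run with a terminal event.
    if isinstance(payload, dict) and isinstance(payload.get("error"), str):
        return {"action": "emit", "event": "playbook.failed", "reason": payload["error"]}
    # 3. Anything else is treated as a transient decode miss: re-drive later.
    return {"action": "redrive"}
```

The pre-fix behavior corresponds to collapsing cases 2 and 3 into the re-drive branch, which is why a deterministic validation error looked like a hang.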
Validation. 600 server + 135 orchestrate-core tests + clippy clean.
Kind-validated prod-exact (PLUGIN_DRIVE=true + PUBLISH_ONLY=true +
STATE_BUILDER=offserver): an absent workload.batch_slots → FAILED
with loop step 'process': Loop expression '{{ workload.batch_slots }}' did not evaluate to an iterable (got null); a valid [1,2,3] loop still
COMPLETED 3-way; #120 barrier and #124 binding behavior unaffected;
0 pod restarts.
Wrap-up. Code-only fix — server-only (worker stays v5.40.3); ships
to prod on a future server rollout. PROD was NOT redeployed this
session — it stays healthy on v3.39.5 + the live off-server cutover.
ai-meta repos/server pointer bumped → v3.39.6; #123 closed (deliberate
Closes #123) + roadmap board 3 → Done. #127 (batch PFT throughput
plateau) stays OPEN — separate perf follow-up.
Pointers. PR server#258
(squash 275b914 → release 7f109a9 = v3.39.6) · ai-meta umbrella
#123 ·
Umbrella-Decoupled-Context-Event-Chain.
2026-06-22 — ✅ #121 SECOND-HALF SHIPPED + CLOSED — off-server system/* WAL-chain wedge fully fixed (server v3.39.5); PROD off-server re-cutover (3rd attempt)
Headline. Merged the reviewed second-half fix for the off-server
WAL-chain-incomplete wedge on system/* executions
(server#257, squash
54ac277 → release v3.39.5 c421273). #256 (v3.39.4) was only the
first half — a live-prod re-cutover on v3.39.4 still wedged
system/scheduled_cleanup because the system-pool worker drives
system executions off-server regardless of the server-side gate
(STATE_BUILDER=offserver on the worker → the INSERT-not-publish chain
leaves a NULL-prev orphan), so #121 was reopened. #257 gates both
off-server-drive decision sites in trigger_orchestrator_inner on
should_publish(catalog_id) (new pure is_system_path helper + unit
test) so system/* execs fall through to server-built run_state;
regular publishable execs keep the off-server path (preserves the #256
win). 612 tests + clippy green; kind full-gate before/after
(system/scheduled_cleanup BEFORE 9× WAL chain incomplete → AFTER
COMPLETED 0 loops; regular test/simple_loop still COMPLETED
off-server). Server-only (worker stays v5.40.3).
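A minimal sketch of the gating rule, assuming the decision can be made from the playbook path alone. The real should_publish takes a catalog_id and the real helper lives in the Rust server; choose_drive and its return labels are hypothetical names for this illustration.

```python
def is_system_path(path: str) -> bool:
    """Pure helper: does this playbook path belong to the system/ namespace?"""
    return path.startswith("system/")


def should_publish(path: str, publish_only: bool) -> bool:
    """System executions are never published to the WAL, even under PUBLISH_ONLY."""
    return publish_only and not is_system_path(path)


def choose_drive(path: str, publish_only: bool, state_builder: str) -> str:
    """Pick where run state is built for one execution."""
    if state_builder == "offserver" and should_publish(path, publish_only):
        return "offserver_wal"   # state rebuilt from the event WAL
    return "server_built"        # state read from noetl.event on the server


print(choose_drive("system/scheduled_cleanup", True, "offserver"))
print(choose_drive("test/simple_loop", True, "offserver"))
```

With the full gate on, the system playbook falls through to the server-built path while the regular playbook keeps the off-server path, matching the before/after described above.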
Phase A — merge + cleanup. Squash-merged server#257 → semantic-release
cut v3.39.5 (c421273); ai-meta repos/server pointer bumped to
v3.39.5; ai-meta#121 closed (deliberate Closes #121) + roadmap board 3
→ Done. Wiki dashboard (Home/Sessions-Log/Releases/ecosystem-map server
cell) updated in the same change set.
Phase B — PROD rollout v3.39.5 + off-server re-cutover (3rd attempt) =
✅ CLEAN SUCCESS, LEFT ON. Re-verified the live baseline (server
v3.39.4 @ef7536d4 + workers v5.40.3, safe in-server drive, 0
restarts, health green, lag 0). Built+pushed server v3.39.5 to the
prod AR (server-rust@sha256:feaac0c5…, Cloud Build us-central1
E2_HIGHCPU_8, ~9 min). Rolled the server by digest (workers/system-pool
kept v5.40.3; server reported v3.39.5, 0 restarts). Re-enabled the full
off-server gate — server PUBLISH_ONLY=true+STATE_BUILDER=offserver,
system-pool STATE_BUILDER=offserver + materializer on. Validation
(tenant prod_cutover3_20260622, both previously-wedging classes
exercised twice): 2 regular loop execs COMPLETED off-server (each
62 events, roots=1/dangling=0/rows==distinct; worker
state_builder_event_scans=0 = never-scan) + 2 system/scheduled_cleanup
execs COMPLETED server-built (each 13 events, roots=1/dangling=0;
server state_build_event_scans=6 total = the system execs' intended
noetl.event reads, by design, NOT a regression). 0 WAL chain incomplete lines across the soak (prior v3.39.4 attempt looped 37+),
materializer sole-writer published==drained==acked=124, lag 0, 0
project errors, 0 pod restarts, health green. Per the decision
criterion the gate was LEFT ON — this is the successful full
off-server CQRS cutover on prod. Revert (not needed) stays one command
away: kubectl -n noetl set env deploy/noetl-server-rust NOETL_EVENT_INGEST_PUBLISH_ONLY=false NOETL_STATE_BUILDER=server
(+ system-pool STATE_BUILDER=server).
Phase C — pin the live cutover into the prod manifests
(ops#201, squash 9c94de5).
The Phase-B cutover was applied imperatively (kubectl set image /
set env), so the committed prod manifests drifted from live — they
still pinned the v3.39.1 server / v5.40.2 worker images and the
pre-cutover gate-off config, so a future kubectl apply would have
silently reverted the cutover. ops#201 rewrites the three prod
manifests to match the running state field-by-field (verified against
the live deployments before editing): server-rust-deployment-prod.yaml
→ image @feaac0c5 (v3.39.5) + NOETL_EVENT_INGEST_PUBLISH_ONLY=true + adds NOETL_STATE_BUILDER=offserver; worker-rust-deployment-prod.yaml → image @5e80493b (v5.40.3); worker-system-pool-deployment-prod.yaml → image @5e80493b (v5.40.3) + NOETL_MATERIALIZER_ENABLED=true + NOETL_STATE_BUILDER=server→offserver. Apply is now a no-op (all managed fields incl. replicas equal live; KEDA still owns worker-rust's count above its floor of 2, by design). No cluster apply was run: the live state was already the target. ai-meta repos/ops pointer bumped d6633f6→9c94de5. The ops repo has no auto-apply CI, so merging only aligned source-of-truth. Refs noetl/ai-meta#107.
2026-06-21 — ✅ #121 FIRST-HALF SHIPPED (partial) — off-server WAL-chain-incomplete loop on system/ executions FIXED; live prod off-server-cutover wedge captured as the real-world repro
Headline. Merged the reviewed fix for the off-server WAL-chain-incomplete
re-drive loop (server#256, squash
28b17cb → semantic release noetl-server v3.39.4 77aaa06). Server-only
— the diff is confined to src/handlers/events.rs + src/state.rs (the server
binary's HTTP claim handlers + the server-side ChainHeads struct); it touches
no shared crate (orchestrate-core, noetl-tools) the worker consumes, so the
worker / system-pool need no rebuild.
Two distinct defects.
- Orphaned command.claimed (NULL prev_event_id). The gate-off claim_command writes the command.claimed row with a raw in-tx INSERT (atomic with the claim), and handle_batch_events does the same for the gate-off batch path; both bypass the event_write::emit_events / ChainHeads link chokepoint, so prev_event_id was never stamped AND the per-execution chain head was never advanced. The next event (command.started) then linked back to command.issued, skipping the orphaned claim, so the off-server chain_walk_from hit a NULL-prev non-genesis head → build_spine_to Incomplete. This fired for every system/ playbook because should_publish is false for system executions even under a global PUBLISH_ONLY=true. Fixed by stamping prev_event_id via ChainHeads::link_batch on both gate-off INSERT paths (same advance-then-write ordering emit_events already uses).
- The actual loop. The stateless off-server drive builds state from the noetl_events WAL, but system-execution events INSERT to noetl.event and never enter the WAL, so the worker's WAL build could never complete → the drive returned the __offserver_retry__ no-op → the server reconciler re-drove → repeat. Fixed by gating the off-server WAL drive on should_publish(catalog_id); system/ executions drive server-built (read noetl.event).
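The first defect is easier to see in a toy model. The Python sketch below mimics a per-execution chain head that stamps prev_event_id and advances before the write, plus a backward walk that reports an incomplete chain when it reaches a NULL back-pointer on a non-genesis event. The class and function names echo the ones above, but the data shapes are assumptions and this is not the server's Rust code.

```python
from typing import Optional


class ChainHeads:
    """Tracks the latest event id per execution and links new events to it."""

    def __init__(self) -> None:
        self._heads: dict = {}

    def link_batch(self, execution_id: str, events: list) -> list:
        head = self._heads.get(execution_id)
        for event in events:
            event["prev_event_id"] = head   # stamp the back-pointer
            head = event["event_id"]        # advance before the next event
        self._heads[execution_id] = head
        return events


def walk_chain(events_by_id: dict, head_id: int, genesis_id: int) -> tuple:
    """Follow prev_event_id from head_id; complete only if genesis is reached."""
    visited = 0
    current: Optional[int] = head_id
    # The visited bound guards against a malformed (cyclic) chain.
    while current is not None and current in events_by_id and visited <= len(events_by_id):
        visited += 1
        if current == genesis_id:
            return ("complete", visited)
        current = events_by_id[current]["prev_event_id"]
    return ("incomplete", visited)


heads = ChainHeads()
store = {e["event_id"]: e for e in heads.link_batch("exec-1", [{"event_id": 1}, {"event_id": 2}])}

# Raw INSERT that bypasses the chokepoint: no back-pointer, head not advanced.
orphan = {"event_id": 3, "prev_event_id": None}
print(walk_chain({**store, 3: orphan}, head_id=3, genesis_id=1))

# Same event written through link_batch: the walk reaches genesis.
linked = heads.link_batch("exec-1", [{"event_id": 3}])[0]
print(walk_chain({**store, 3: linked}, head_id=3, genesis_id=1))
```

The orphaned row stops the walk after one event, while the linked row lets it traverse all three.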
Validation. 598 server tests pass + clippy clean (new state.rs unit test
chain_head_claim_orphan_vs_linked models the orphan-vs-linked pointer walk).
Kind, prod-exact (PUBLISH_ONLY=true + STATE_BUILDER=offserver, single
replica): BEFORE — command.claimed prev=NULL, 17×+ WAL chain incomplete,
wedged RUNNING 120s+; AFTER — chain fully linked (0 orphans), 0 loop lines,
system/scheduled_cleanup COMPLETED in 6s; non-system off-server unaffected.
Live prod repro (read-only; no prod change made this session). The full
off-server cutover was attempted on prod earlier (server v3.39.3 + off-server
gate) and WEDGED: the drive applied counter froze while
dispatched_offserver_stateless / offserver_retry climbed in lockstep to
~389, the system pool logged WAL chain incomplete; returning no-op for
multiple executions (including prod's own 327130580493803520), all with
prev_event_id=NULL. The armed revert (PUBLISH_ONLY=false +
STATE_BUILDER=server) recovered cleanly. This is the real-world reproduction
of #121 and confirms #121 is the blocker for the prod off-server cutover (a
separate follow-up task). PROD GKE default (STATE_BUILDER=server) + all
defaults untouched here.
Pointers. ai-meta repos/server pointer bumped to v3.39.4 77aaa06; wiki
Home (Last-refreshed + Recently-closed + Ecosystem-map server cell +
Sessions-log preview + Releases preview) + Releases v3.39.4 row +
Umbrella-Decoupled-Context-Event-Chain Recent-activity row; roadmap board 3
#121 → Done. #123 (loop-non-iterable observability) + #127 (batch PFT
throughput plateau) stay OPEN — separate, not resolved by this.
2026-06-21 — ✅ #125 + #126 SHIPPED + CLOSED — task_sequence control flow and http tool body data-shape fixed; 10×1000 batch pft_flow_test clean
Headline. Fixed the two Python→Rust regressions that blocked the batch
pft_flow_test after #124 unblocked the forward-binding path. Both live in
noetl-tools; the worker adopts them via a Cargo.lock bump.
#126 — http tool body data-shape (tools#72):
the Rust http tool exposed the parsed response body under data.body, but
playbooks written against the Python-era contract read output.data.data (or
equivalently output.data for the body). The pft_flow_test save_batch step
passed the http result to jsonb_to_recordset, which received a non-array value
→ Postgres error → zero rows written. Fix: expose the parsed body under data
again; keep body as a back-compat alias. Squash 86f0216 → semantic release
noetl-tools v3.13.1 (8dd0e1f).
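To make the data-shape contract concrete, here is a small Python sketch of a result envelope that exposes the parsed body under data and keeps body as an alias. The exact envelope, including the status_code field and both function names, is an assumption for illustration; only the data / body relationship comes from the description above.

```python
import json
from typing import Any


def build_http_result(status_code: int, raw_body: str) -> dict:
    """Build a tool result whose parsed body is reachable as data.data and data.body."""
    try:
        parsed: Any = json.loads(raw_body)
    except json.JSONDecodeError:
        parsed = raw_body  # non-JSON bodies pass through as text
    return {"data": {"status_code": status_code, "data": parsed, "body": parsed}}


def rows_for_recordset(output: dict) -> list:
    """What a save step would hand to jsonb_to_recordset: it must be a JSON array."""
    rows = output["data"]["data"]
    if not isinstance(rows, list):
        raise TypeError("jsonb_to_recordset needs a JSON array")
    return rows


output = build_http_result(200, '[{"patient_id": 1}, {"patient_id": 2}]')
print(rows_for_recordset(output))
```

A playbook reading the Python-era path gets the array, and one reading the body alias gets the same object.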
#125 — task_sequence do: jump/break/retry (tools#73):
the Rust task_sequence runner parsed do: but matched only "fail", so do: jump/break/retry were silently ignored — each slot ran its steps once and
returned regardless of the directive. In pft_flow_test, the batch loop's do: jump never re-entered, leaving 16 batches done and the rest pending (patient
loss across slots). Fix: honour do: jump/to: (loop to named step with
infinite-jump guard), do: break (exit the sequence), do: retry (with
configurable attempts/backoff). Squash 62d0948 → semantic release
noetl-tools v3.14.0 (638c3c6).
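The directive handling can be sketched as a small interpreter loop. This Python version is a simplified model, not the noetl-tools runner: tasks are plain dicts with a run callable and an optional policy callable that returns a directive such as {"do": "jump", "to": "claim"}, and those shapes are assumptions made for the example.

```python
import time


def run_task_sequence(tasks: list, ctx: dict, max_jumps: int = 1000) -> dict:
    """Run tasks in order, honouring do: fail / break / retry / jump directives."""
    index = {task["name"]: pos for pos, task in enumerate(tasks)}
    attempts: dict = {}
    pos = jumps = 0
    while pos < len(tasks):
        task = tasks[pos]
        name = task["name"]
        result = task["run"](ctx)
        policy = task.get("policy")
        directive = policy(result, ctx) if policy else {}
        action = directive.get("do", "continue")
        if action == "fail":
            raise RuntimeError(f"task '{name}' failed")
        if action == "break":
            break  # exit the whole sequence
        if action == "retry":
            attempts[name] = attempts.get(name, 0) + 1
            if attempts[name] >= directive.get("attempts", 3):
                raise RuntimeError(f"task '{name}' exhausted its retries")
            time.sleep(directive.get("backoff", 0.0))
            continue  # re-run the same task
        if action == "jump":
            jumps += 1
            if jumps > max_jumps:
                raise RuntimeError("infinite-jump guard tripped")
            pos = index[directive["to"]]
            continue
        pos += 1
    return ctx


def claim(ctx: dict):
    ctx["current"] = ctx["pending"].pop(0) if ctx["pending"] else None
    return ctx["current"]


def save(ctx: dict):
    ctx["done"].append(ctx["current"])


batch_loop = [
    {"name": "claim", "run": claim, "policy": lambda r, c: {"do": "break"} if r is None else {}},
    {"name": "save", "run": save, "policy": lambda r, c: {"do": "jump", "to": "claim"}},
]
print(run_task_sequence(batch_loop, {"pending": [1, 2, 3], "done": []})["done"])
```

If the jump directive were ignored, as in the regression, the loop above would save only the first batch and return, which is the patient-loss pattern described in this entry.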
Worker adoption (worker#124):
bumps the noetl-tools Cargo.toml pin 3.13→3.14 (Cargo.lock resolves
3.14.0) so the worker binary actually runs both fixes at runtime. No worker
source change; clippy clean. Squash 87b85e8 → semantic release
noetl-worker v5.40.3 (6dd3449).
Result. With tools#72 + tools#73 in place, the full 10×1000 = 10,000-patient
batch pft_flow_test passes end-to-end on kind — zero patient loss, invariants
clean (roots=1/dangling=0/forks=0, materializer sole-writer, never-scan).
Perf follow-up #127. The benchmark plateaus at ~60 patients/s (10k in ~166–172 s), roughly 3× slower than the Python ~54 s kind baseline. Throttled (50 req/s, 172 s) ≈ unthrottled (166–168 s) — the rate limiter is NOT the bottleneck. Capped by the NoETL pipeline (worker concurrency 4×2 + serial per-slot drain + off-server/publish-only per-step overhead). Correctness is clean; throughput is a perf-investigation task, not a tools bug.
Still open. #121 (orphaned-spine-head WAL-chain-incomplete loop) + #123 (loop-non-iterable observability) are separate; not resolved by this. PROD GKE + all defaults untouched.
Pointers. tools PR#72 86f0216 → release 8dd0e1f (v3.13.1) · tools PR#73
62d0948 → release 638c3c6 (v3.14.0) · worker PR#124 87b85e8 → release
6dd3449 (v5.40.3) · closes noetl/ai-meta#125 + noetl/ai-meta#126 · perf follow-up noetl/ai-meta#127.
2026-06-21 — ✅ #124 SHIPPED + CLOSED — distributed task_sequence forward set:/sibling bindings no longer render empty (orchestrate-core)
Headline. Merged the already-reviewed #124 fix and landed the standard
post-merge bookkeeping. noetl/server#255
(squash d53e095) fixes a distributed task_sequence forward-data-binding
regression: inside a multi-tool (task_sequence) step, a later sub-task's
templates that reference a value a prior sub-task produces at runtime — a
forward set:, a policy-rule set:, or a sibling result — were rendered to
empty at command-build time, before the worker's per-sub-task binding ran.
orchestrate-core/src/commands.rs::render_pipeline_config preserved
set/args/spec/command verbatim but rendered every other sub-task
field (url/params/method/…) against the step-entry context under
UndefinedBehavior::Chainable, so runtime-only references ({{ iter.* }},
sibling labels) silently collapsed to empty — concretely in pft_flow_test's
batch path, claim_batch writes iter.data_type via its policy set:, then
fetch_batch's url: …/api/v1/pft/batch/{{ iter.data_type }} pre-rendered to
…/batch/ → 404 → 0 rows. Fix: new
TemplateRenderer::render_value_deferring_unresolved renders only templates
whose variable paths all resolve in the build-time context; any template
referencing an unresolved path is preserved verbatim so the worker
re-renders it against the per-sub-task running context (a superset). cargo test 134/134 + clippy clean; kind-verified (fetch_batch now hits the real
per-type URLs). Semantic release → noetl-server v3.39.3 (365d3be);
ai-meta pointer bumped + #124 closed + roadmap board 3 → Done. PROD GKE + all
defaults untouched.
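The deferring rule can be illustrated with a deliberately tiny template renderer. The Python sketch below handles only plain {{ dotted.path }} expressions (no filters or control flow) and is not the orchestrate-core implementation; the function names and the regular expression are assumptions for the example.

```python
import re
from typing import Any

_EXPR = re.compile(r"\{\{\s*([A-Za-z_][\w.]*)\s*\}\}")
_MISSING = object()


def lookup(ctx: dict, path: str) -> Any:
    """Resolve a dotted path, returning _MISSING if any segment is absent."""
    current: Any = ctx
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return _MISSING
        current = current[part]
    return current


def render_deferring_unresolved(value: Any, ctx: dict) -> Any:
    """Render a template only if every path it references resolves in ctx."""
    if isinstance(value, dict):
        return {k: render_deferring_unresolved(v, ctx) for k, v in value.items()}
    if isinstance(value, list):
        return [render_deferring_unresolved(v, ctx) for v in value]
    if not isinstance(value, str):
        return value
    paths = _EXPR.findall(value)
    if any(lookup(ctx, path) is _MISSING for path in paths):
        return value  # preserved verbatim for a later, richer context
    return _EXPR.sub(lambda m: str(lookup(ctx, m.group(1))), value)


build_ctx = {"workload": {"base": "http://api"}}
config = {
    "url": "{{ workload.base }}/api/v1/pft/batch/{{ iter.data_type }}",
    "health": "{{ workload.base }}/health",
}
built = render_deferring_unresolved(config, build_ctx)
worker_ctx = {**build_ctx, "iter": {"data_type": "labs"}}
print(built["url"])
print(render_deferring_unresolved(built, worker_ctx)["url"])
```

At build time the url stays untouched because iter.data_type is not yet known, while health renders immediately; the second pass against the superset context fills in the deferred template.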
Pointers. server PR#255 d53e095 → release 365d3be (v3.39.3) ·
ai-meta pointer bump · closes noetl/ai-meta#124.
#121 (orphaned-spine-head WAL-chain-incomplete loop), #123 (loop-non-iterable
observability), #125 (task_sequence do: jump/break/retry ignored), and
#126 (http tool data.body vs output.data.data contract) stay OPEN —
separate, not resolved by this. The batch PFT benchmark remains blocked behind
#125/#126 (further distinct Python→Rust regressions surfaced behind the
binding fix).
2026-06-21 — ✅ #120 SHIPPED + CLOSED — reduce barrier no longer deadlocks (commands=0) on open/asymmetric loop joins (orchestrate-core)
Headline. Merged the already-reviewed #120 fix and landed the standard
post-merge bookkeeping. noetl/server#254
(squash fbb855f) adds a runtime liveness filter to the orchestrate-core
reduce barrier: an open/asymmetric loop back-edge predecessor that never runs
on the taken path (a declared inbound arc whose forward return path is absent —
T does not forward-reach upstream U) is no longer counted as a pending
fan-in dependency, so dispatch is no longer deferred forever (commands=0).
build_incoming_arcs is unchanged (an open back-edge is still a genuine
static fan-in); the fix is in the barrier's runtime check — only an upstream
that is live on the current path (entered/in-flight, or reachable from an
active step) blocks. Affects the in-server and off-server drives identically
(shared orchestrate-core). New unit test
test_open_loop_back_edge_does_not_block_dispatch; cargo test 133/133 +
clippy clean; kind-validated (post-fix the 2×2 off-server/gate matrix all
COMPLETE, fanout_reduce/pagination/loop spot-checks green). Semantic
release → noetl-server v3.39.2 (28e8950); ai-meta pointer bumped + #120
closed + roadmap board 3 → Done. PROD GKE + all defaults untouched.
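A rough model of the liveness filter, assuming the workflow is a plain adjacency map and that "live" means active or forward-reachable from an active step. The Python sketch below is a simplification of the orchestrate-core barrier, and every name in it is hypothetical.

```python
def reachable_from(edges: dict, starts: set) -> set:
    """All steps reachable from the given starting steps (starts included)."""
    seen: set = set()
    stack = list(starts)
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(edges.get(node, ()))
    return seen


def barrier_ready(target: str, incoming: dict, edges: dict, completed: set, active: set) -> bool:
    """A target may dispatch unless an unfinished upstream is still live."""
    live = reachable_from(edges, active)
    blocking = [up for up in incoming.get(target, ()) if up not in completed and up in live]
    return not blocking


# Open back-edge: U declares an arc into T but never runs on the taken path.
edges = {"start": ["T"], "T": ["end"], "U": ["T"]}
incoming = {"T": ["start", "U"]}
print(barrier_ready("T", incoming, edges, completed={"start"}, active=set()))

# Genuine fan-in: B is still reachable from the active step A, so T must wait.
edges2 = {"A": ["B"], "B": ["T"], "C": ["T"]}
incoming2 = {"T": ["B", "C"]}
print(barrier_ready("T", incoming2, edges2, completed={"C"}, active={"A"}))
```

Counting every declared inbound arc as pending, as the pre-fix barrier did, would leave T waiting forever in the first case.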
Pointers. server PR#254 fbb855f → release 28e8950 (v3.39.2) ·
ai-meta pointer bump · closes noetl/ai-meta#120.
#121 (scheduled_cleanup chain gap) + #123 (loop-non-iterable observability)
stay OPEN — separate, not resolved by this.
2026-06-20 (late) — ✅ PROD CQRS rollout RECORDED (ops#200 merged + pointer bumped) + LIVE-PROD e2e validation of the gate-ON cutover — 24/26 fixtures PASS, sole-writer + clean-chain + never-scan held on every execution incl. the failure path
Headline. Closed the rollout paperwork and then validated the live cutover
with the e2e suite against real prod. Merged ops#200
(prod manifests pinned to the executed digests — server-rust @sha256:197a6d10
v3.39.1 c5f8cb2 + worker-rust @sha256:41713265 v5.40.2 48b0bde, worker
replicas 1→2, configmap event-stream keys aligned to the live lowercase
noetl_events stream, executed-flip runbook record) → bumped the ai-meta
repos/ops pointer (ops@d6633f6, ai-meta 08c73e5). Then ran the Rust e2e
regression + specialized playbooks against LIVE PROD (the same
gke …noetl-cluster ns noetl, server v3.39.1 / worker v5.40.2,
PUBLISH_ONLY=true + STATE_BUILDER=offserver, materializer sole writer).
Prod e2e matrix — 28/30 executions PASS (24/26 distinct fixtures + 4
composition-spawned children). Coverage: python / args / vars / loops /
control-flow / output-select / large-result / actions / fan-out-reduce /
duckdb / http (in-cluster) / save-to-postgres (json_serialization_save,
pg_k8s) / sub-playbook composition (parent + 4 spawned children, pg_k8s).
For every execution under the gate-ON path: COMPLETED, sole-writer
(event rows == distinct ids, 0 catalog_id=0, 0 __orchestrate__ event rows,
≥1 __orchestrate__ command), clean chain (roots=1 / terminals=1 /
dangling=0 / head-walk == total), never-scan (worker
noetl_worker_state_builder_event_scans_total Δ0 across all batches; cumulative
still 0), materializer lag 0 throughout.
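The per-execution chain checks can be expressed as a small function over event rows. The definitions used here are assumptions inferred from the wording above (a root has a NULL prev_event_id, a terminal is an event nothing points to, a dangling pointer references a missing event); the real validator is scripts/prod_regression_validate.py, which this sketch does not reproduce.

```python
def chain_invariants(events: list) -> dict:
    """Compute sole-writer and clean-chain indicators for one execution's events."""
    ids = [e["event_id"] for e in events]
    by_id = {e["event_id"]: e for e in events}
    prevs = [e["prev_event_id"] for e in by_id.values()]
    referenced = {p for p in prevs if p is not None}
    terminals = [i for i in by_id if i not in referenced]

    # Walk back from the single head; a clean chain visits every event.
    walked = 0
    if len(terminals) == 1:
        current = terminals[0]
        while current in by_id and walked < len(by_id):
            walked += 1
            current = by_id[current]["prev_event_id"]

    return {
        "rows_equal_distinct_ids": len(ids) == len(by_id),
        "roots": sum(1 for p in prevs if p is None),
        "terminals": len(terminals),
        "dangling": sum(1 for p in prevs if p is not None and p not in by_id),
        "head_walk_equals_total": walked == len(by_id),
    }


chain = [
    {"event_id": 1, "prev_event_id": None},
    {"event_id": 2, "prev_event_id": 1},
    {"event_id": 3, "prev_event_id": 2},
]
print(chain_invariants(chain))
```

A fork (two events sharing a predecessor) raises the terminal count to 2, and a duplicated row makes rows_equal_distinct_ids false, so either defect shows up without inspecting payloads.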
The 2 FAILs are a prod-env credential difference, NOT a cutover bug.
postgres_test (pg_noetl_k8s, stale) and postgres_jsonb_test (pg_local)
failed with Failed to get connection: error connecting to server — those
aliases point at DB hosts unreachable from this prod cluster. Even on
failure the gate-ON path stayed correct: both produced a clean
playbook.failed terminal with sole-writer + single-root chain intact (proving
the FinalizedGuard + chain linker work on the failure terminal too). No
ai-task issue filed — no genuine platform bug surfaced.
SKIP+note (external dep, not present in prod): pagination/* (needs
paginated-api.test-server.svc), http_to_postgres_* (external jsonplaceholder
egress + pg_local), save_simple/save_all/storage_tiers (pg_local;
storage_tiers also #101 bloat), auth0_login / keychain/google_id_token /
amadeus / openai / IB / snowflake (external creds/services),
server_oom_stress_* / heavy_payload_* / heavy_loop_aggregation /
lease_expiry (heavy/OOM/#101 — deliberately avoided against prod).
Prod left healthy on the new path — /api/health ok, db+nats connected,
materializer lag 0, all pods Running 0 restarts across the ~45-min run.
Test-data footprint (uncleanable — no DELETE API, 365-day retention):
tenant/catalog prefix prod-e2e-20260620-1946 — 26 catalog entries, 30
executions, 947 noetl.event rows. All identifiable by the prefix for operator
cleanup.
Verdict: the CQRS publish-only + off-server state-builder cutover is validated under real production load across the functional playbook surface. Refs #103, #107, #111.
2026-06-20 (evening) — 🚀 PROD CQRS CUTOVER EXECUTED + gate-ON validated — server v3.39.1 / worker v5.40.2 rolled to prod GKE, PUBLISH_ONLY + off-server state builder flipped LIVE
Headline. The CQRS publish-only flip + off-server state builder went
live on production (gke_noetl-demo-19700101_us-central1_noetl-cluster,
ns noetl). Prod is left gate-ON, healthy: the materializer is the sole
noetl.event writer and the orchestrator drive builds state off the
noetl_events WAL with zero event-scans. Closes the last operator-side gap
on #103; puts
#107 /
#115 /
#111 off-server work into prod.
Prerequisite — one-time owner-applied prev_event_id migration. v3.39.1
binds prev_event_id on every noetl.event / noetl.command INSERT. The
columns were absent on prod and the runtime noetl role is not the table
owner (owner=postgres; noetl is a restricted cloudsqlsuperuser), so the
server's startup ensure_columns is swallowed (must be owner). Applied the
additive, idempotent, metadata-only DDL as the DB owner — the working
postgres credential the live pgbouncer deployment carries in its
DATABASE_URLS backend route, over a kubectl port-forward (no password
rotated / printed / committed). Cascaded to all partitions (event=14,
command=16); idx_event_prev_event_id valid, 14 leaves. GSM pg_noetl_k8s
is stale (drifted password) — not used, not touched; operator should
rotate/realign it. The harmless event-chain DDL skipped WARN persists by
design (ownership checked before the IF NOT EXISTS skip).
Rollout. Server → v3.39.1 (@sha256:197a6d10…, gates default-off) →
workers shared ×2 + system pool → v5.40.2 (@sha256:41713265…). Order matters:
v3.39.1's plug-in drive routes __orchestrate__ to the system pool, which the
old cursor-100 workers can't run (call.error → drive stall) — workers roll
with/before the server. Then materializer shadow on (backlog 0), then the flip:
system-pool STATE_BUILDER=offserver, server PUBLISH_ONLY=true STATE_BUILDER=offserver.
Validation (gate-ON, 5 tenant execs: simple_python / hello_world / e2e_probe,
incl. concurrent). All COMPLETED; chains roots=1 / terminals=1 /
dangling=0. Materializer sole writer (event_ingest_published_total ==
materializer_acked_total = 71 rows; server wrote 0 tenant rows).
Never-scan holds (state_builder_event_scans_total = 0). Backlog 0
throughout; 0 pod restarts on a 90s soak.
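The roots / terminals / dangling / walk==rows figures quoted in the validation can be read as per-execution chain checks. A minimal sketch, assuming a simplified row shape; `EventRow`, `ChainReport`, `check_chain`, and the terminal event-type list are hypothetical names for illustration, not the rig's actual code:

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical, simplified event row: only the fields the chain checks need.
#[derive(Debug, Clone)]
pub struct EventRow {
    pub event_id: u64,
    pub prev_event_id: Option<u64>,
    pub event_type: &'static str,
}

#[derive(Debug, PartialEq, Eq)]
pub struct ChainReport {
    pub roots: usize,     // rows with a NULL prev_event_id
    pub terminals: usize, // terminal events in the execution
    pub dangling: usize,  // rows whose prev_event_id points at no known row
    pub walk: usize,      // rows reached walking head -> root
    pub rows: usize,      // total rows for the execution
}

/// Event types treated as terminal in this sketch (an assumption).
fn is_terminal(event_type: &str) -> bool {
    matches!(event_type, "playbook.completed" | "playbook.failed")
}

pub fn check_chain(events: &[EventRow], head: u64) -> ChainReport {
    let by_id: HashMap<u64, &EventRow> = events.iter().map(|e| (e.event_id, e)).collect();
    let roots = events.iter().filter(|e| e.prev_event_id.is_none()).count();
    let terminals = events.iter().filter(|e| is_terminal(e.event_type)).count();
    let dangling = events
        .iter()
        .filter(|e| matches!(e.prev_event_id, Some(p) if !by_id.contains_key(&p)))
        .count();

    // Walk the prev_event_id links from the head; stop at the root,
    // at a missing hop, or if a cycle is detected.
    let mut seen: HashSet<u64> = HashSet::new();
    let mut cursor = Some(head);
    while let Some(id) = cursor {
        let Some(row) = by_id.get(&id) else { break };
        if !seen.insert(id) {
            break;
        }
        cursor = row.prev_event_id;
    }

    ChainReport { roots, terminals, dangling, walk: seen.len(), rows: events.len() }
}
```

A healthy chain reports roots=1, terminals=1, dangling=0 and walk==rows; a duplicate terminal stamped with a NULL prev shows up as roots=2 and walk<rows.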
Pointers. ops PR noetl/ops#200
(manifest digest bumps → v3.39.1 / v5.40.2 + executed-rollout record in
runbooks/noetl-cqrs-publish-only-flip.md + configmap stream-name hygiene).
ai-meta pointer bump follows the merge. Revert on standby (one command set:
PUBLISH_ONLY=false STATE_BUILDER=server on server + system pool).
2026-06-20 — ✅ #119 + #118 SHIPPED + gate-ON kind-validated + CLOSED — off-server WAL-drain restart-rehydration (#119) unblocks the terminal-finalize FinalizedGuard (#118); single- AND multi-replica off-server now blemish-free + restart-robust
Headline. Fixed #119 (the
off-server WAL-drain stall that, the prior session, prevented executions from
completing and so hid the #118 symptom), then used the now-healthy cluster to
land the end-to-end validation that closes #118.
Both single- and multi-replica off-server are now blemish-free (every chain
single-root including the terminal event, zero event-scan fallback) AND
restart-robust (the WAL index rehydrates on a worker pod restart). All three
PRs merged, pointers bumped to main, both issues CLOSED.
#119 root cause + fix. The authoritative WAL state-builder drain
(state_builder.rs::run_drain_loop) used a durable noetl_state_builder
consumer whose cursor persists across worker pod restarts, but the in-memory
WalEventIndex rebuilds empty on each boot → after a restart the cursor sat
past the events the fresh index needed (delivered+acked to a prior process,
never redelivered) → build_spine_to(expected_head) permanently Incomplete →
the off-server drive looped offserver_retry and single-replica executions
never reached a terminal event (wal_events_total=0 while the consumer showed
delivered+acked). Fix (worker-only, gated by NOETL_STATE_BUILDER=offserver;
PROD runs the in-server drive so it is untouched): the authoritative drain now
defaults to an ephemeral DeliverPolicy::All consumer — the same shape
shadow already uses — rebuilding the full index from the retained
noetl_events WAL on every boot. There is no persisted cursor to outrun;
it is also correct for >1 worker pod (each holds the complete event set for the
executions it may drive, not the load-balanced subset a shared durable would
give). Ack-policy is keyed on durable-presence, advance-timing on mode (the two
are now decoupled). Instant revert NOETL_STATE_BUILDER_DURABLE=1 restores the
pre-#119 durable consumer (not restart-safe without an index snapshot). Proof:
a one-shot index rehydrated from retained noetl_events WAL log + a new
noetl_worker_state_builder_indexed_executions gauge. Never reintroduces a
noetl.event scan (the rebuild reads the WAL stream only). noetl-worker
v5.40.2 (worker#123, 48b0bde);
224 lib tests green incl. fresh_index_rebuilds_from_full_replay_after_restart.
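The #119 stall can be reproduced in miniature without NATS. The sketch below is a toy model, not the worker's code: `Wal`, `StartPolicy`, and `Index` are hypothetical stand-ins for the retained stream, the consumer start position, and the in-memory WalEventIndex.

```rust
use std::collections::BTreeSet;

/// Toy stand-in for the retained noetl_events stream: an append-only list of event ids.
pub struct Wal {
    pub events: Vec<u64>,
}

/// Where a consumer starts reading after a (re)start.
#[derive(Clone, Copy)]
pub enum StartPolicy {
    /// Resume from a cursor that survived the restart (durable consumer).
    ResumeFrom(usize),
    /// Replay the whole retained stream (ephemeral, deliver-all consumer).
    ReplayAll,
}

/// In-memory index; it is always empty on boot.
#[derive(Default)]
pub struct Index {
    pub seen: BTreeSet<u64>,
}

impl Index {
    /// Drain the WAL from the position the policy dictates; returns the new cursor.
    pub fn drain(&mut self, wal: &Wal, policy: StartPolicy) -> usize {
        let start = match policy {
            StartPolicy::ResumeFrom(cursor) => cursor.min(wal.events.len()),
            StartPolicy::ReplayAll => 0,
        };
        for id in &wal.events[start..] {
            self.seen.insert(*id);
        }
        wal.events.len()
    }

    /// In this toy model a spine to `expected_head` is buildable only if the head is indexed.
    pub fn can_build_to(&self, expected_head: u64) -> bool {
        self.seen.contains(&expected_head)
    }
}
```

With a persisted cursor, a fresh index after a restart never sees the events acked by the previous process, so `can_build_to` stays false; replaying from the start of the retained stream refills the index on every boot.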
#118 fix (now integration-validated). A bounded process-local
FinalizedGuard (exactly-one-terminal-per-execution) suppresses a duplicate
finalize at emit_events before it reaches the chain linker (so a suppressed
duplicate never advances/consumes the head), keeping the chain single-root
including the terminal event. Gate-off byte-identical; metric
noetl_terminal_dedup_total{suppressed}; rig gains a HARD per-exec
terminals==1 assertion. noetl-server v3.39.1
(server#253, c5f8cb2) + e2e
(e2e#73, fe97d92).
Validation (gate-ON kind, server 118-finalize v3.39.1 + worker
119-rehydrate v5.40.2; offserver + publish_only + audit_only + plugin_drive).
(1) Restart-rehydration — a forced mid-flight kubectl delete pod --force
of the system-pool worker → the new pod logged index rehydrated … indexed_executions=17 wal_events=200 (pre-fix this gauge was 0 → the stall).
(2) Single-replica #118 — kind_validate_replica_coherence.sh with
NOETL_COHERENCE_FANOUT_BURST=12 × 6 consecutive iterations (~126 execs,
post-restart): 6/6 PASS, 0 anomalies — every chain roots=1 (incl. the
terminal), dangling=0, walk==rows, terminals=1, orch_events=0; zero
state_build_event_scans + zero hot-path noetl.event scans. (No genuine
duplicate-finalize arose this run, so the terminal_dedup suppression counter
stayed 0 — the fork simply never happened, which is the goal; the suppression
path itself is unit-proven.) (3) Multi-replica — the 2-replica
execution-affinity StatefulSet (NOETL_COHERENCE_DRIVE_AFFINITY=shipped,
burst=12): 21 execs COMPLETE, every chain roots=1/terminals=1,
forwarded_ok +202, kv_remote_hit +12, zero scans. Cluster restored to the
clean single-replica baseline; PROD GKE + all defaults untouched.
Verdict. Single- AND multi-replica off-server are blemish-free (single-root incl. finalize, zero fallback) and restart-robust (the index rehydrates). Closes the off-server hardening gap #117 left open; part of #115 Phase 4 / #107.
2026-06-20 — 🔧 #118 fix shipped for review — single-replica off-server terminal-finalize chain fork (FinalizedGuard); integration blocked by #119 (off-server WAL-drain stall)
Headline. Implemented + unit-tested the fix for #118
(single-replica off-server: a duplicate playbook.completed orphans the
per-execution chain with a NULL prev_event_id second root). PRs opened
under kadyapam, left open for review; ai-meta pointer bumps staged
(not merged) because the end-to-end repro could not be validated this session.
Root cause (corrected). The terminal event already goes through
ChainHeads.link_batch — it is not a chain-linker bypass. The defect is a
duplicate finalize: under NOETL_STATE_BUILDER=offserver + PUBLISH_ONLY
on a single replica, the first drive emits the terminal event (chain-linked)
and evicts the chain head + descriptor; a straggler trigger then falls through
to the server-built path, rebuilds from the materializer-lagged WAL (state not
yet terminal), drives again, and emits a second playbook.completed. That
second event reaches link_batch after the head was evicted → prev_event_id = NULL → second chain root (orphan) → off-server spine walk can't reach it → a
benign noetl_state_build_event_scans_total fallback. Multi-replica
execution-affinity (#116) serialises finalize to the owner, so it never forks
there.
Fix. A bounded, process-local FinalizedGuard
(exactly-one-terminal-per-execution; HashSet + FIFO VecDeque, default cap
8192). emit_events suppresses any later terminal for the same execution
before it reaches the chain linker (a suppressed duplicate never
advances/consumes the head). First terminal wins → single root including the
terminal. Gate-off byte-identical (a duplicate never occurs on the synchronous
in-process drive). New metric noetl_terminal_dedup_total{outcome="suppressed"}.
597 server lib tests green incl. 2 new guard tests.
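A bounded first-wins guard of the shape described (HashSet for membership, VecDeque for FIFO eviction) might look like the following. This is a sketch under assumptions: the method name `first_terminal` and the `u64` execution id are illustrative, and the real guard's API is not shown in this log.

```rust
use std::collections::{HashSet, VecDeque};

/// Bounded, process-local "exactly one terminal per execution" guard.
pub struct FinalizedGuard {
    cap: usize,
    seen: HashSet<u64>,
    order: VecDeque<u64>, // insertion order, oldest at the front
}

impl FinalizedGuard {
    pub fn new(cap: usize) -> Self {
        assert!(cap > 0, "capacity must be positive");
        Self { cap, seen: HashSet::new(), order: VecDeque::new() }
    }

    /// Returns true for the first terminal of an execution (emit it) and
    /// false for any later one (suppress it before the chain linker).
    pub fn first_terminal(&mut self, execution_id: u64) -> bool {
        if self.seen.contains(&execution_id) {
            return false;
        }
        // At capacity: forget the oldest execution to keep memory bounded.
        if self.order.len() == self.cap {
            if let Some(oldest) = self.order.pop_front() {
                self.seen.remove(&oldest);
            }
        }
        self.seen.insert(execution_id);
        self.order.push_back(execution_id);
        true
    }
}
```

The bound is a trade-off: once an execution has been evicted, a very late duplicate for it would be accepted again, so the cap has to comfortably exceed the number of executions that can finalize within a straggler window.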
What landed (open for review).
- server#253: FinalizedGuard + emit_events dedup + metric + unit tests (branch kadyapam/118-finalize-chain-link).
- e2e#73: HARD per-exec terminals == 1 assertion in kind_validate_replica_coherence.sh (branch kadyapam/118-terminals-assertion).
Validation status — integration BLOCKED, not closed. The local-kind
single-replica off-server drive did not complete executions this session
(even a single simple_python stalled at command.completed, the drive
looping offserver_retry), so the 2-root symptom could not be reproduced. Cause
is upstream + orthogonal: the worker's authoritative WAL drain delivers+acks
noetl_events on the durable noetl_state_builder consumer but the in-memory
WalEventIndex stays empty after a worker restart (wal_events_total=0) — the
durable cursor persists across restarts while the index rebuilds empty →
build_offserver_input always Incomplete. Filed as
#119. The noetl_terminal_dedup
path never engaged (completion blocked before any finalize).
Cluster state. Local kind restored to its working in-process-drive baseline
(server STATE_BUILDER=server, image p6-audit-only; a single exec COMPLETES
in ~9s); noetl_events/NOETL_COMMANDS purged, tables truncated. The
noetl-server:118-finalize image remains in the registry for a future
offserver validation once #119 is resolved. PROD GKE untouched; all defaults
unchanged.
Board. #118 → In progress; #119 → Todo (board 3).
2026-06-20 — ✅ #117 SHIPPED — off-server spine ordered by prev_event_id chain + walked from the real tip (worker v5.40.1 + e2e); high-concurrency fan-out reduce wedge FIXED
Headline. The off-server drive's from_events spine was built by sorting
events by event_id ascending, assuming id order == causal (prev_event_id)
chain order. Under high-concurrency fan-out two branch completions arrive at the
owner reordered relative to their producer-assigned ids, so emit_events stamps
a higher-id event as the predecessor of a lower-id one. Two failures
followed: (1) the worker tracked the chain head as max(event_id), but
ChainHeads.link_batch advances the watermark to event_ids.last() — the
last-arrived event = the real causal tip; under the inversion max(id) != tip,
so a max-id walk started one branch up and missed the inverted tip entirely,
from_events never saw that branch's command.completed, and the fan-in reduce
never fired (execution wedged RUNNING, reproduced ~1/9 on the 2-replica affinity
topology); (2) even reaching every event, the event_id sort replayed the
inverted pair out of causal order.
Fix (worker-only, inside NOETL_STATE_BUILDER=offserver — PROD's in-server
drive untouched). build_offserver_input builds the spine from expected_head
(the server's ChainHeads watermark = the real tip) via new
build_spine_to/advance_to/chain_walk_from, reaching every event regardless
of id monotonicity, and orders it by the prev_event_id chain walk (head→root,
reversed to root→head) — SpineOrder::Causal (default;
NOETL_OFFSERVER_SPINE_ORDER=event_id is the instant revert). The staleness
guard is now intrinsic (advance_to is Incomplete until the tip is indexed,
replacing the pre-#117 max_id >= expected check an inversion could satisfy
without the real tip present). For any monotonic chain the new causal order is
byte-identical to the old sort — single-replica + low-concurrency unchanged.
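The difference between the two spine orders can be shown on a three-event chain with one inversion. A minimal sketch, assuming the index is reduced to a map of event_id → prev_event_id; `spine_causal` and `spine_legacy` are hypothetical names, not the worker's build_spine_to / chain_walk_from:

```rust
use std::collections::HashMap;

/// Walk prev_event_id links from `tip` back to the root and return the spine
/// in root -> tip order. Returns None when a hop is missing (incomplete index)
/// or the links form a cycle.
pub fn spine_causal(prev: &HashMap<u64, Option<u64>>, tip: u64) -> Option<Vec<u64>> {
    let mut spine = Vec::new();
    let mut cursor = Some(tip);
    while let Some(id) = cursor {
        let link = prev.get(&id)?; // missing hop: the tip is not fully indexed yet
        spine.push(id);
        if spine.len() > prev.len() {
            return None; // cycle guard
        }
        cursor = *link;
    }
    spine.reverse();
    Some(spine)
}

/// Pre-#117 behaviour as described above: start at max(event_id), then sort by id.
pub fn spine_legacy(prev: &HashMap<u64, Option<u64>>) -> Option<Vec<u64>> {
    let head = *prev.keys().max()?;
    let mut spine = spine_causal(prev, head)?;
    spine.sort_unstable();
    Some(spine)
}
```

With the chain 10 → 12 → 11 (event 11 stamped with prev 12), the real tip is 11: the causal walk returns [10, 12, 11], while the max-id walk starts at 12 and never sees 11. On a monotonic chain both functions return the same vector.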
Shipped. noetl-worker v5.40.1 (worker#122,
baeae78) + e2e (e2e#72, cdf1768,
NOETL_COHERENCE_FANOUT_BURST stress knob). 15 unit tests (5 new for the
inversion: tip-rooted vs max-id walk, causal vs legacy order, intrinsic
staleness, incremental==cold-rebuild through inversion) + 223 lib tests pass;
clippy clean.
Kind validation (gate-ON: offserver + publish_only + materializer sole-writer).
2-replica execution-affinity topology (nats_kv coherence + affinity, server
v3.39.0): acceptance pass RUNS=3 → 9/9 COMPLETE, chain integrity HARD
(roots=1/dangling=0/walk==total), sole-writer (orch_events=0, rows==distinct),
never-scan (build_scans=0/hotpath=0), forwarded_ok +31; high-concurrency stress
6/6 iterations PASS, 108/108 executions COMPLETE (12 concurrent fanout each),
zero scans. Direct proof: 15 executions carried a real id-inversion
(prev_event_id > event_id) on the normalize_customer/enrich_customer
branches — all 15 fired reduce_customer and reached terminal completion
(pre-#117 these wedged). Single-replica off-server: 7/8 iterations PASS, zero
#117 wedges — the 1 FAIL was a separate pre-existing terminal-finalize race
(the playbook.completed event stamped NULL prev_event_id → 2 roots + a benign
event-scan fallback; non-wedging, reduce still fired). Filed as a #116/#115
write-ordering follow-up; absent under multi-replica affinity (which serializes
the finalize write to the owner).
Verdict. Off-server fan-out is high-concurrency-ready at horizontal scale on the multi-replica execution-affinity topology (linear/loop/sequential + concurrent fan-out all complete, chain coherent, real inversions handled). For single-replica off-server the #117 reduce wedge is fixed; a residual terminal-finalize chain-linking race (non-wedging) is staged as the next write-ordering item.
2026-06-20 — ✅ #116 program-scale step 2 SHIPPED + multi-replica gate-ON validated — execution-affinity single-owner WRITE ORDERING (server v3.39.0 + e2e)
Headline. Step 1 (KV data coherence) made 2+ replicas resolve the same
chain head/descriptor but was necessary-not-sufficient: the command.issued
prev-read (handlers::execute) and the head CAS-advance (emit_events) are two
non-atomic steps, so concurrent cross-replica emits forked the chain and
executions stuck RUNNING. Execution-affinity closes it by routing every
trigger for an execution (POST /api/events, which also fires the drive) to the
single replica that sharding::ShardConfig::owns(execution_id) owns (stable
XxHash64); a non-owner forwards a transparent reverse-proxy POST to the owner
(one-hop loop guard, degrade-to-local on failure). On the owner the existing
single-process drive lock + in-memory ChainHeads make the read→advance atomic
with no distributed lock; KV coherence composes as the
genesis-on-other-replica + handoff vehicle (the owner resolves head/descriptor
from its LOCAL write-through cache, so kv_remote_hit trends to 0 by design).
Chose forwarding over a per-drive NATS-KV lease (option ii) — one mechanism fixes
both the chain fork AND the double-drive, reusing src/sharding.rs verbatim.
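The routing decision reduces to a stable hash of the execution id modulo the replica count. The sketch below is illustrative only: FNV-1a stands in for the XxHash64 the entry names, and the `ShardConfig` fields, `Route`, and `route` are assumptions rather than the contents of src/sharding.rs or src/affinity.rs.

```rust
/// FNV-1a 64-bit: a stand-in stable hash for this sketch (the log names XxHash64).
fn stable_hash(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325;
    for b in bytes {
        h ^= u64::from(*b);
        h = h.wrapping_mul(0x0000_0100_0000_01b3);
    }
    h
}

/// Hypothetical shard configuration: this replica's index and the replica count.
/// `shard_count` must be greater than zero.
pub struct ShardConfig {
    pub shard_index: u64,
    pub shard_count: u64,
}

impl ShardConfig {
    pub fn shard_for(&self, execution_id: u64) -> u64 {
        stable_hash(&execution_id.to_le_bytes()) % self.shard_count
    }

    pub fn owns(&self, execution_id: u64) -> bool {
        self.shard_for(execution_id) == self.shard_index
    }
}

#[derive(Debug, PartialEq, Eq)]
pub enum Route {
    /// Handle the trigger on this replica.
    Local,
    /// Reverse-proxy the trigger to the owning replica.
    Forward { shard: u64 },
}

/// One-hop loop guard: a request that was already forwarded is always handled locally.
pub fn route(cfg: &ShardConfig, execution_id: u64, already_forwarded: bool) -> Route {
    if already_forwarded || cfg.owns(execution_id) {
        Route::Local
    } else {
        Route::Forward { shard: cfg.shard_for(execution_id) }
    }
}
```

Because every replica computes the same owner for a given execution id, all triggers for that execution converge on one process, where the existing drive lock and in-memory head already make read→advance atomic.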
Landed. noetl-server server#252
→ v3.39.0 5e00d0a: src/affinity.rs (ExecutionAffinity router +
shard_index_from_hostname); flags NOETL_EXECUTION_AFFINITY /
NOETL_PEER_URL_TEMPLATE / NOETL_SHARD_INDEX_FROM_HOSTNAME (all default off,
prod unchanged); handle_event forwards when not owned + reconcile poller
skips non-owned; metric noetl_execution_affinity_total{outcome} (forwarded_ok
= the write-ordering proof). noetl/e2e
e2e#71 66b6e1b: 2-replica StatefulSet
topology (manifests/replica-affinity/ — distinct shard index per pod via
hostname ordinal + headless DNS) + deploy_replica_affinity_topology.sh (up/down,
flips the system-pool worker to offserver) + the rig flipped HARD
(forwarded_ok proof; kv_remote_hit informational under affinity; workload
auto-detects Deployment-or-StatefulSet).
Validation (multi-replica gate-ON kind). 2-replica StatefulSet, server
offserver+audit_only+nats_kv+affinity+publish_only, worker offserver:
NOETL_COHERENCE_DRIVE_AFFINITY=shipped PASS — linear/loop/fanout COMPLETE;
every chain roots=1/dangling=0/walk==total (no fork — exactly the chains
that forked without affinity, incl. the validated cross-execution prev the
data-layer session caught); forwarded_ok +9 (cross-replica single-owner
routing); state_build_event_scans +0 + hotpath scan +0 (never-scan across
replicas); sole-writer rows==distinct, __orchestrate__ event=0; degraded +0; single-replica unchanged. 595 server tests + 7 new affinity unit tests +
clippy green; baseline restored.
Known follow-up #117 (separate,
pre-existing): the worker's off-server from_events spine is ordered by
event_id (ordered.sort_unstable()); under a chain-order≠id-order inversion
(affinity's forwarding makes it likelier under high-concurrency fan-out) the
fan-in reduce wedged on 1/9 execs in a 9-way concurrent run — the chain stayed
clean (roots=1), only the id-order replay assumption broke. Fix = order the
spine by the prev_event_id chain walk. Linear/loop already reliable.
Prod multi-replica verdict. Write-ordering is COMPLETE (no fork) — prod can horizontally scale the off-server stack for linear/loop workloads reliably; high-concurrency FAN-OUT completion needs #117 first. All affinity flags default off; PROD GKE untouched.
Pointers. server 5e00d0a (v3.39.0) · e2e 66b6e1b · ai-meta pointer bump · #116 (closed) · #117 (opened) · #115 · #107.
2026-06-20 — ✅ #115 program-scale step 1 SHIPPED — multi-replica coherence DATA LAYER (NATS-KV-backed ChainHeads + ExecDescriptor); execution-affinity STAGED (server v3.38.0 + e2e)
Headline. The off-server drive keys two execution-scoped facts off in-memory
AppState maps — the ChainHeads prev_event_id watermark + the
ExecDescriptor (catalog_id + routing + terminal) — both single-replica-local.
Behind NOETL_REPLICA_COHERENCE=nats_kv (default local, prod
unchanged) both are backed by JetStream KV buckets (noetl_chain_heads,
noetl_exec_descriptors) so 2+ replicas resolve the same value: the head advance
is a CAS (one chain under concurrent emits), the descriptor a CAS
read-modify-write (seed + terminal merge). The in-process maps become a
write-through cache / degraded-mode fallback (KV down / local → bit-identical
to today).
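The head advance as a compare-and-swap can be sketched against an in-memory map. This models the idea only: `KvBucket`, `Versioned`, and `advance_head` are hypothetical, and the real JetStream KV client calls are not reproduced here.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Versioned {
    pub head: u64,
    pub revision: u64,
}

/// In-memory stand-in for a revisioned KV bucket keyed by execution id.
#[derive(Default)]
pub struct KvBucket {
    entries: HashMap<u64, Versioned>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum CasError {
    Conflict { current: Option<Versioned> },
}

impl KvBucket {
    pub fn get(&self, execution_id: u64) -> Option<Versioned> {
        self.entries.get(&execution_id).copied()
    }

    /// Write `new_head` only if the stored revision still equals
    /// `expected_revision` (None means the key must not exist yet).
    pub fn compare_and_swap(
        &mut self,
        execution_id: u64,
        expected_revision: Option<u64>,
        new_head: u64,
    ) -> Result<Versioned, CasError> {
        let current = self.get(execution_id);
        if current.map(|v| v.revision) != expected_revision {
            return Err(CasError::Conflict { current });
        }
        let next = Versioned { head: new_head, revision: expected_revision.map_or(1, |r| r + 1) };
        self.entries.insert(execution_id, next);
        Ok(next)
    }
}

/// Link `event_id` after the current head, retrying on conflict.
/// Returns the prev_event_id to stamp on the new event.
pub fn advance_head(kv: &mut KvBucket, execution_id: u64, event_id: u64) -> Option<u64> {
    loop {
        let current = kv.get(execution_id);
        match kv.compare_and_swap(execution_id, current.map(|v| v.revision), event_id) {
            Ok(_) => return current.map(|v| v.head),
            Err(CasError::Conflict { .. }) => continue,
        }
    }
}
```

The CAS keeps the head itself single-valued, but a caller that reads the head in one step and advances it in a later, separate step can still stamp a stale prev if another replica advances in between.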
What landed.
- noetl-server v3.38.0 (server#251, 8f39a79): new src/coherence.rs (CoherenceKv + KvRead + CAS helpers; lazy buckets, same shape as the event-stream publisher); ChainHeads / ExecDescriptors methods now async (all call sites already async); ExecDescriptor serde-derived for KV storage; proof metric noetl_replica_coherence_total{structure,op,outcome}, with load-bearing series outcome="kv_remote_hit" (a head/descriptor another replica seeded, resolved from KV = a server-built cold fallback avoided). NOETL_REPLICA_COHERENCE config enum (local|nats_kv).
- e2e (e2e#70, e222877): kind_validate_replica_coherence.sh (new) + un-staled the #113/#114 offload-advance asserts behind NOETL_RIG_EXPECT_OFFLOAD (default false: under refs_in_state=true events/commands carry references, so the offload paths legitimately stay flat; the COMPLETE + no-oversized-event + zero-__orchestrate__-event invariants stay HARD).
Validation (kind).
- Single-replica nats_kv = bit-for-bit parity with local: kind_validate_replica_coherence.sh PASS, linear(13)/loop(62)/fan-out(25) ×2 all COMPLETE, chain integrity perfect (roots=1, dangling=0, head-walk==total), state_build_event_scans +0, hotpath scan +0, sole-writer intact, __orchestrate__ event rows = 0. Default local likewise unaffected (588 lib tests cover it; baseline sanity exec COMPLETED).
- 2-replica nats_kv: the coherence resolves provably happen: noetl_replica_coherence_total{outcome="kv_remote_hit"} advanced for both structure=chain_head and structure=descriptor; no kv_unavailable; link_batch CAS kv_ok in the hundreds. The KV layer is doing exactly what it was built to do across replicas.
- Necessary but NOT sufficient (the key finding): on 2+ replicas executions do not reliably COMPLETE; concurrent cross-replica emits fork the chain. Diagnosed precisely: the command.issued explicit prev is read from the chain head in execute.rs (issuing_event = chain_heads.head()), then the head is CAS-advanced in emit_events; across two replicas these two steps are not atomic, so another replica advances the head between them → a stale explicit prev + a 2-head fork (validation also observed a command.started whose prev belonged to a different execution). The data-coherence layer alone cannot fix this; it needs single-owner write ordering.
What this means for #107 / the prod cutover. The off-server architecture is now multi-replica-COHERENT (data) — any replica resolves the same watermark/descriptor — but not yet multi-replica-COMPLETE (write-ordering). The remaining piece is execution-affinity (one replica owns an execution's drive + chain write), which is option (ii) from the design space and the genuine program-scale step 2. The substrate is already present (src/sharding.rs shard_for / ShardConfig::owns, the stable XxHash64 keyed by execution_id). Prod GKE stays single-replica for the off-server stack until affinity lands; this PR is the data substrate affinity builds on. No prod default changed; the kind gate state was restored after validation.
Pointers. ai-meta → server 8f39a79 (v3.38.0) + e2e e222877. server#251 · e2e#70 · #115 · #107 · #111.
2026-06-20 — ✅ #115 Phase 5 SHIPPED + gate-ON validated — atomic-working-item context (tenet 6): the drive hands a worker only its minimal declared slice (server v3.37.0 + worker v5.40.0 + e2e)
NOETL_ATOMIC_ITEM_CONTEXT=true|false (default false, prod unchanged). Realizes RFC #115 tenet 6: a tool-running worker stops receiving the whole accumulated execution context and instead gets only the minimal slice of base-context keys the step statically references.
#77 dependency resolved (the ambiguity). The RFC's "gated on #77" is §6.3 + §8.2. #77 (Explicit Input Binding) is CLOSED (2026-06-09, BREAKING v3.0.0) — it shipped the declaration surface (input:/args: on steps + tool items; arc/step set:→input: forward-only propagation). What Phase 5 needed and #77 did not provide: an extractor that turns those declarations into the minimal upstream slice + the drive narrowing (the server attached the full accumulated context to every command regardless of input:). Phase 5 adds exactly that — #77 is the foundation, not a gap.
What landed (merged + ai-meta pointers bumped on main):
- noetl-server v3.37.0 (server#250, a96ade8): new orchestrate-core::input_binding: analyze(step) statically extracts the base dispatch-context keys a step references (minijinja undeclared_variables over the serialized tool def + step input; ctx.X→X, workload→workload, bare step-name→key, injected roots iter/_prev/output/…→none), conservative (any unbounded ref, i.e. a whole-context {{ ctx }} spread or an unparseable fragment, reports bounded=false); project_context narrows the flat context to the referenced keys or returns None (caller keeps full context). CommandBuilder / WorkflowOrchestrator::with_atomic_item_context narrow the persisted worker-bound context for plain non-loop steps while server-side rendering still runs against the full context. Plugin OrchestrateInput / OrchestrateStateInput carry a #[serde(default)] flag. NOETL_ATOMIC_ITEM_CONTEXT (default false); metric noetl_atomic_item_context_total{outcome}.
- noetl-worker v5.40.0 (worker#121, 2484d17): build_offserver_input forwards the flag onto the off-server from_events OrchestrateInput so the off-server drive narrows too (the run_state fallback already carries it).
- noetl-e2e (e2e#69, 79505fa): atomic_item_context.yaml (consumer binds only {{ producer_a.tag }}) + kind_validate_atomic_item_context.sh.
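The narrowing step described for project_context can be sketched as a key filter with a conservative escape hatch. This is an illustration under assumptions: the `Analysis` struct, the string-valued context, and this function signature are hypothetical, and the minijinja-based extraction itself is not reproduced.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical result of the static analysis of one step.
pub struct Analysis {
    /// Base-context keys the step references.
    pub keys: BTreeSet<String>,
    /// False when any reference could not be bounded (e.g. a whole-context spread).
    pub bounded: bool,
}

/// Narrow the flat context to the referenced keys, or return None so the
/// caller keeps the full context when the analysis is not bounded.
pub fn project_context(
    ctx: &BTreeMap<String, String>,
    analysis: &Analysis,
) -> Option<BTreeMap<String, String>> {
    if !analysis.bounded {
        return None;
    }
    Some(
        ctx.iter()
            .filter(|(k, _)| analysis.keys.contains(*k))
            .map(|(k, v)| (k.clone(), v.clone()))
            .collect(),
    )
}
```

Returning None rather than an empty map keeps the two cases distinct: "the step needs nothing" narrows to an empty slice, while "the analysis cannot tell" falls back to the full context.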
Validation (gate-ON kind — server p5-atomic, NOETL_ATOMIC_ITEM_CONTEXT=true, STATE_BUILDER=server + PLUGIN_DRIVE=true + PUBLISH_ONLY=true):
- Minimal slice: flag-ON consumer render_context=[producer_a] ONLY; producer_b + start + steps + workload + execution_id + catalog_id + path all dropped; execution COMPLETED; noetl_atomic_item_context_total{narrowed} +1.
- Back-compat: flag-OFF consumer render_context=[catalog_id, execution_id, path, producer_a, producer_b, start, steps, workload] (full context), COMPLETED.
- No regression: offserver rig core assertions green at the default: all fixtures COMPLETED, __orchestrate__ event rows = 0, dispatched/applied advance, system-pool isolation, lag-0. (The rig's #113/#114 offload-metric-advance assertions are pre-existing-stale under the refs_in_state=true default; they fail identically with this flag off.)
- 7 input_binding + 132 orchestrate-core + 584 server + 10 worker state_builder tests green; clippy clean (no new warnings); baseline restored.
Scope. Realizes tenet 6 behind the flag. PROD GKE untouched; default false; no gate/mode/builder default flipped. ai-meta pointers → server a96ade8 (v3.37.0) + worker 2484d17 (v5.40.0) + e2e 79505fa. Remainder #115: program-scale (per-shard WAL, multi-replica descriptor coherence).
2026-06-20 — ✅ #115 Phase 6 SHIPPED + gate-ON literal-zero validated — the hot-path noetl.event read class is RETIRED; the table is AUDIT-ONLY (server v3.36.0 + ops + e2e)
NOETL_EVENT_READ_PATH=event_scan|audit_only (default event_scan, prod unchanged). Phase 4 removed the drive's state-rebuild scan under NOETL_STATE_BUILDER=offserver; Phase 6 retires the remaining execution-lifecycle readers of noetl.event — the WHERE execution_id replay class that runs outside the drive.
What landed (merged + ai-meta pointers bumped on main):
- noetl-server v3.36.0 (server#249, b71ca1d): under audit_only, get_catalog_id (per-ingest: normalize_event_to_row + batch), inherit_parent_trace, the subscription dedup-audit + container-callback catalog/existence reads serve from the in-memory execute-time ExecDescriptor; on a cold descriptor (a post-terminal straggler after the descriptor is evicted on terminal, or a restart mid-execution) catalog_id resolves from noetl.command (the synchronous command queue, authoritative under the gate), never a noetl.event scan. A cold descriptor never re-seeds (re-seeding an evicted terminal exec would re-accumulate the per-execution memory the eviction frees). New proof metric noetl_event_hotpath_reads_total{site,outcome} (served_descriptor|served_command|scan). The event_scan default path is unchanged byte-for-byte.
- noetl-ops (ops#199, e5b0737): pins NOETL_EVENT_READ_PATH=event_scan on the prod server manifest (operator-gated flip, taken with offserver).
- noetl-e2e (e2e#67 + e2e#68, 0ab3c0a): kind_validate_event_read_path_phase6.sh asserts the end-to-end never-scan invariant.
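The read order under audit_only (warm descriptor first, then noetl.command, never noetl.event) can be sketched as a lookup chain that also reports which path served the read. The `Stores` struct, the `u64` ids, and this `get_catalog_id` signature are simplifications for illustration; the maps stand in for the in-memory descriptor and the two tables.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Outcome {
    ServedDescriptor,
    ServedCommand,
    Scan,
}

pub enum ReadPath {
    EventScan,
    AuditOnly,
}

/// Stand-ins keyed by execution_id, each holding a catalog_id.
pub struct Stores {
    pub descriptors: HashMap<u64, u64>, // in-memory ExecDescriptor
    pub commands: HashMap<u64, u64>,    // noetl.command
    pub events: HashMap<u64, u64>,      // noetl.event
}

pub fn get_catalog_id(
    stores: &Stores,
    path: &ReadPath,
    execution_id: u64,
) -> Option<(u64, Outcome)> {
    match path {
        // Default path in this sketch: read from the event table.
        ReadPath::EventScan => stores.events.get(&execution_id).map(|c| (*c, Outcome::Scan)),
        // Gated path: warm descriptor first, then the command table; no event read.
        ReadPath::AuditOnly => stores
            .descriptors
            .get(&execution_id)
            .map(|c| (*c, Outcome::ServedDescriptor))
            .or_else(|| {
                stores.commands.get(&execution_id).map(|c| (*c, Outcome::ServedCommand))
            }),
    }
}
```

Tagging each read with its outcome is what lets a counter like noetl_event_hotpath_reads_total prove the scan series stays flat.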
Validation (gate-ON kind — PUBLISH_ONLY + offserver + materializer sole-writer + audit_only):
- Never-scan, ingest/callback/execute path: noetl_event_hotpath_reads_total{outcome="scan"} Δ0 (served_descriptor +96 live + served_command +3 terminal-stragglers). Root cause of the residual cold read found + fixed: the scan held flat through the entire live run and only ticked at terminal (the descriptor is evicted on terminal → trailing straggler was cold) → cold path now resolves from noetl.command.
- Never-scan, drive path: noetl_state_build_total Δ0 + noetl_state_build_event_scans_total Δ0 (Phase-4 stateless edge). (1)+(2) ⇒ ZERO noetl.event scans anywhere on the hot path, end-to-end.
- COMPLETE: linear(13)/loop(62)/fan-out(25)/output_select(31, Phase-1) all reach COMPLETED.
- Sole-writer + lag-0 + bounded: event_rows==distinct, catalog0=0, __orchestrate__ event rows 0, materializer dup 0.
- Audit still works: direct SELECT FROM noetl.event returns the rows, status API COMPLETED, replay GET /api/replay/state folds event_count=25 (audit-only, not gone).
- Committed orchestrate-gate rig PASS with audit_only on (no regression); 585 server tests + clippy green; baseline restored.
Scope. The RFC's never-scan end state (tenet 3) is reached under the flag. PROD GKE untouched; default event_scan; no gate/mode/builder default flipped. ai-meta pointers → server b71ca1d (v3.36.0) + ops e5b0737 + e2e 0ab3c0a. Remainder: Phase 5 (atomic-item context contract, needs #77) + program-scale (per-shard WAL, multi-replica descriptor coherence).
2026-06-20 — ✅ #115 Phase 4 REMAINDER SHIPPED + gate-ON validated — the off-server drive edge is now STATELESS (server v3.35.0 + worker v5.39.0 + e2e)
Headline. Removed the server's residual chain-walk bookkeeping on the
drive path. Under NOETL_STATE_BUILDER=offserver the server now performs
ZERO state rebuild + ZERO noetl.event reads on the drive path — it
just routes commands + persists events through the sole-writer gate. This
completes #107 step 2 server-side.
What landed.
- noetl-server v3.35.0 (server#248, 6e30fc3): a per-execution ExecDescriptor (in state.rs): catalog_id + routing_meta seeded at playbook_started (execute-time), plus a terminal flag stamped at the emit_events chokepoint when a terminal event (cancel/finalize/playbook completed|failed) is written. trigger_orchestrator_inner gains a stateless branch (dispatch_offserver_stateless_drive): with a warm descriptor it routes system/orchestrate WITHOUT building WorkflowState: catalog_id+routing from the descriptor, expected_head from the in-memory ChainHeads, trigger_event_id passed (worker resolves the trigger type off its WAL), no server-built state (__stateless__). apply_worker_orchestration sources catalog_id+routing from the descriptor (skips the cold-rebuild) + evicts on a terminal worker-built result.state. Cold descriptor (server restart) falls through to the server-built path, which re-seeds → chain_walk + event_scan stay the fallbacks.
- noetl-worker v5.39.0 (worker#120, 8e1f651): ExecutionChain::event_type_of + build_offserver_input(trigger_event_id) resolve trigger_event_type off the pool WAL index when the server omits it. resolve_offserver_orchestrate_input returns OffserverDispatch{Wasm|Noop}: under the stateless edge an incomplete WAL after the bounded retry is a benign {__offserver_retry__:true} no-op that the server's reconcile poller re-drives; never a partial state, never a wedge.
- e2e (e2e#66, f4bb342): kind_validate_state_builder_offserver.sh now asserts the zero-rebuild invariant (section C2): server noetl_state_build_total Δ0 + dispatched_offserver_stateless / applied_stateless advance.
Validation (gate-ON kind — PUBLISH_ONLY + off-server drive + materializer sole-writer).
- Stateless-edge proof: server noetl_state_build_total Δ0 (zero rebuild, neither event_scan nor chain_walk), event_scans Δ0, dispatched_offserver_stateless +3, applied_stateless +3.
- Live parity: fanout_reduce_phase6 offserver==server fingerprint [enrich_customer:1,normalize_customer:1,reduce_customer:1,start:1], both COMPLETED, fan-in fires once; linear(13)/loop(62)/output_select(31) all COMPLETE off-server with state_build_total Δ0 across the batch.
- WAL-build authoritative + scan-free (worker served +3, scans 0, wal_events +25, cache cold+1/incr+2); sole-writer 25==25, __orchestrate__ event rows 0, materializer dup 0, lag-0; committed gate rig PASS.
- 583 server + 218 worker tests + clippy green; baseline restored; default server, prod GKE untouched, no runtime gate flipped.
Remaining on #115. Phase 5 (atomic-item context, needs #77) + Phase 6 (retire the event read path entirely; noetl.event audit-only).
Pointers. ai-meta repos/server→6e30fc3 (v3.35.0), repos/worker→8e1f651 (v5.39.0), repos/e2e→f4bb342.
2026-06-19 — ✅ #115 Phase 4 DRIVE CUTOVER SHIPPED + gate-ON parity-validated — off-server WAL build authoritative (worker v5.38.0 + server v3.34.0 + ops + e2e)
Headline. The orchestrator drive now constructs its WorkflowState
off the server, on the system worker pool, from the noetl_events WAL
spine (the wasm run/from_events entry) instead of the server building state
and shipping run_state — under NOETL_STATE_BUILDER=offserver (default
server, prod unchanged). This is the staged Phase-4 drive cutover.
What landed.
- worker #119 → v5.38.0 (bef13e5): the shadow WalEventIndex promoted to a shared pool-side index fed by an authoritative durable consumer (noetl_state_builder, explicit-ack, mirroring the materializer); dispatch_wasm builds the drive state from the WAL spine on an __offserver_build__ command (zero noetl.event reads), with a staleness guard (serve only once the index head ≥ the server's expected_head; bounded retry waits for the drain, else fall back to the server-built run_state).
- server #247 → v3.34.0 (f0922bd): marks the offserver command + carries expected_head.
- ops #198 (b1da9f1): the NOETL_STATE_BUILDER knob (+ durable consumer name) on the dev + prod system-pool manifests; server default, operator-gated flip.
- e2e #65 (b38b6dd): committed gate-ON parity rig kind_validate_state_builder_offserver.sh.
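The staleness guard in this entry (serve from the WAL only once the index head has caught up to expected_head, retry a bounded number of times, else fall back) can be sketched as below. `BuildSource`, `choose_source`, and the attempt count are hypothetical; a real loop would wait for the drain between attempts.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum BuildSource {
    /// Build the drive state from the pool-side WAL index.
    Wal,
    /// Fall back to the state the server built and shipped.
    ServerRunState,
}

/// Poll the index head up to `max_attempts` times. `index_head` is any
/// callable that reports the highest event id currently indexed.
pub fn choose_source<F: FnMut() -> u64>(
    mut index_head: F,
    expected_head: u64,
    max_attempts: u32,
) -> BuildSource {
    for _ in 0..max_attempts {
        if index_head() >= expected_head {
            return BuildSource::Wal;
        }
        // A real implementation would sleep or await the drain here.
    }
    BuildSource::ServerRunState
}
```

Without the guard, a lagging index yields a state that predates events the server has already linked, which is how the fan-in barrier came to fire more than once during this session.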
Validation (gate-ON kind: PUBLISH_ONLY + off-server drive + materializer
sole-writer). Two-leg parity rig PASS — offserver==server completed-step
fingerprint [enrich:1,normalize:1,reduce:1,start:1], both COMPLETED (fan-in
barrier fires exactly once); worker drive_builds{served}=+3 / fallback 0 /
event_scans=0 / wal_events=+25; server state_build_event_scans=0
(chain_walk); cache cold +1 / incremental +2; sole-writer 25==25 / catalog0=0 /
__orchestrate__ event=0 cmd=3 / lag-0. Manual linear + loop legs COMPLETED
off-server (served +13, 0 scans). 10 worker + 580 server tests + clippy green.
A mid-session staleness bug (offserver reduce fired 4× — WAL-drain lag
re-issued the fan-in barrier) was found by the rig and fixed by the
expected_head guard, then re-validated green. Cluster restored to the
pre-session gate-ON baseline (server p2-chain, system-pool p1-refs).
Scope note. The server still keeps its chain-walk bookkeeping for terminal/cancel/catalog/routing correctness (zero scans under chain_walk); removing that residual server rebuild → fully zero server reads on the drive path is the precisely-staged Phase-4 remainder. PROD GKE untouched; no gate/mode/builder default changed.
ai-meta pointers: worker bef13e5 · server f0922bd · ops b1da9f1 ·
e2e b38b6dd.
2026-06-19 — ✅ #115 Phase 4 KERNEL + FLAG SHIPPED + shadow kind-validated — off-server state builder (worker v5.37.0 + server v3.33.0); drive cutover staged
What shipped. Phase 4 moves orchestrator WorkflowState construction off the server onto the system worker pool. This session landed the pool-side kernel + a live WAL shadow loop + the server flag scaffold; the drive cutover (drive consumes the builder's state) is staged.
- noetl-worker v5.37.0 (worker#118 self-merged, fef961c), src/state_builder.rs:
  - WalEventIndex / ExecutionChain: a per-execution event index sourced from the noetl_events JetStream WAL (NOT the materialized noetl.event table), each event carrying its prev_event_id (Phase 2 chain link).
  - chain_walk(): walks the index head→root by prev_event_id and returns the spine in event_id order (the order the server event-scan applies) → equivalent to the server chain-walk / event-scan build (parity by construction, same from_events). Completeness guard (genesis reached, no missing hop) mirrors the server builder's fallback contract.
  - Pool-side cache keyed by the immutable chain head: CacheHit (unchanged head) / Incremental (tail-only walk, pointer-continuity verified against the cached head, no COUNT(*)) / ColdRebuild (miss/restart). Terminal eviction.
  - Live WAL shadow loop (NOETL_STATE_BUILDER_SHADOW, default off; ephemeral DeliverAll/AckNone consumer, never the materializer's durable one) + metrics noetl_worker_state_builder_{wal_events_total,event_scans_total,builds_total{outcome},chain_hops}.
- noetl-server v3.33.0 (server#246 self-merged, 3e6006d): NOETL_STATE_BUILDER=offserver|server flag scaffold (default server, prod unchanged); the offserver drive-cutover wiring is staged.
- noetl-worker-wiki c1af68c: deployment-spec env vars + metrics for the shadow.
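A rough sketch of the chain walk and its completeness guard, in Python rather than the worker's Rust. The index shape (a plain `event_id -> prev_event_id` map), the exception name, and the integer ids are assumptions made for illustration only.

```python
from typing import Dict, List, Optional


class IncompleteChain(Exception):
    """Raised when the walk cannot reach genesis (hypothetical name)."""


def chain_walk(index: Dict[int, Optional[int]], head: int) -> List[int]:
    """Walk head -> root by prev pointer; return the spine in event_id order.

    `index` maps event_id -> prev_event_id, with None marking the genesis event.
    """
    spine: List[int] = []
    seen = set()
    cursor: Optional[int] = head
    while cursor is not None:
        if cursor not in index:
            # Missing hop: the caller should fall back rather than build
            # state from a partial spine.
            raise IncompleteChain(f"missing hop at event {cursor}")
        if cursor in seen:
            raise IncompleteChain(f"cycle at event {cursor}")
        seen.add(cursor)
        spine.append(cursor)
        cursor = index[cursor]
    # Genesis reached (prev is None). Sort so the consumer sees event_id order.
    return sorted(spine)
```

For a three-event chain `{1: None, 2: 1, 3: 2}` walked from head `3`, the result is `[1, 2, 3]`; dropping event `2` from the index raises instead of returning a two-event spine.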
The four proofs (gate-ON live kind — PUBLISH_ONLY + off-server drive + materializer sole-writer).
- PARITY (off-server == server chain-walk == event-scan). The shadow replayed the WAL and chain-walked spines whose indexed==spine (every indexed event on the spine: complete chains, no gaps) with sizes exactly matching the Phase-3-validated topologies: linear 13, loop 62, fan-out 25, output_select 31 (Phase-1), storage_tiers 55 (Phase-1). A fresh fan-out's spine = 25 == its DB event_rows (25, distinct 25). The spine is the from_events input in event_id order, identical to the server event-scan/chain-walk input → same merged-Phase-3 from_events → same state. (The live worker-builds-state-and-compares step is the staged drive cutover; the shadow proves the spine/input parity, state equality follows transitively from merged Phase 3.)
- WAL-read, ZERO noetl.event scans. noetl_worker_state_builder_wal_events_total=993 (events consumed from the noetl_events WAL); noetl_worker_state_builder_event_scans_total=0; every log line: "WAL chain walk, no noetl.event scan."
- Pool-side cache. builds_total{cold_rebuild}=28 (WAL replay / restart cold-rebuild) + builds_total{incremental}=21 (live tail-advance: the fresh fan-out indexed live as Incremental(5), indexed==spine==25). Incremental tail-advance == full rebuild proven by unit test + the live indexed==spine identity. (CacheHit is the no-op case, unit-tested; it doesn't arise in the shadow, which only advances executions that received a new event.)
- COMPLETE gate-ON + sole-writer + lag-0 + sizes + regression. Fresh fan-out COMPLETED; event_rows=25 == distinct_ids=25 (no loss/double-write); catalog_zero=0; orch_event_rows=0 (off-server drive persists no __orchestrate__ event rows); materializer consumer_pending=0 (lag-0) + project_errors=0. Phase-1 bounded sizes intact (output_select 31 / storage_tiers 55 spines). Shadow is observation-only → drive unaffected. 8 worker unit tests + 2 server config tests + clippy green.
Baseline restored — system pool reverted to localhost/noetl-worker:p1-refs + NOETL_STATE_BUILDER_SHADOW removed; gate-ON baseline clean (PUBLISH_ONLY + materializer sole-writer). PROD GKE untouched; no gate/mode/builder default changed. ai-meta pointers → worker fef961c (v5.37.0) + server 3e6006d (v3.33.0) + wikis.
Staged next. The offserver drive cutover (the drive obtains state from the pool-side builder behind NOETL_STATE_BUILDER=offserver — the worker builds state via the wasm run entry from the WAL spine instead of the server building it) + ops system-pool env wiring (a durable noetl_state_builder consumer) + e2e gate-ON parity rig. Phase 5 (atomic-item context, needs [#77]) builds the minimal per-item slice on top of this builder; Phase 6 retires the projection_snapshot event read path entirely.
2026-06-19 — ✅ #115 Phase 3 MERGED (chain-walk state builder, server v3.32.0) → ai-meta pointer bumped; Phase 4 (off-server state builder) started
Part A — Phase 3 close-out. server#245 self-merged (no auto-mode classifier block — same as #244) as merge commit 28c45b3; release CI tagged v3.32.0 (8338417). ai-meta repos/server pointer bumped 8338417 on main; chain-walk builder code confirmed in play (rebuild_state_chain_walk + StateBuildMode::ChainWalk present at the bumped SHA).
The merged builder (RFC #115 Phase 3): behind NOETL_STATE_BUILD_MODE=chain_walk (default event_scan, prod unchanged) the orchestrator drive reconstructs WorkflowState by following the one-level prev_event_id chain from the in-memory ChainHeads head back to the genesis playbook_started, each hop a (execution_id, event_id) PK point-lookup (fetch_chain_node) — never a WHERE execution_id scan of noetl.event. Collected events sort by event_id and feed the SAME WorkflowState::from_events (orchestrate-core unchanged → parity by construction). Conservative fallback to event-scan on cold-head / materializer-lag (node not yet present) / non-genesis tail / empty. NOETL_STATE_BUILD_PARITY_CHECK shadow-builds both ways inside one REPEATABLE READ snapshot and asserts structural equality (canonicalised — set-backed arrays sorted, non-deterministic created_at-derived wall-clock keys excluded). Metrics: noetl_state_build_total{mode,outcome}, noetl_state_build_event_scans_total (no-scan proof), noetl_state_build_chain_hops, noetl_state_build_parity_total{result}. Gate-ON kind-validated (prior session): parity 41/41 MATCH 0-mismatch, event_scans_total=0 across 40 drive builds / 1064 PK hops / 0 fallbacks, all 5 topologies COMPLETE, sole-writer + lag-0 + gate rig PASS, 577 lib tests + clippy green.
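The canonicalised comparison used by the parity check might look roughly like the sketch below. It is written in Python for brevity; the excluded wall-clock keys are the ones this log names (started_at / entered_at / completed_at), while `SET_BACKED_KEYS` and its single entry are hypothetical placeholders for whichever arrays are set-backed in the real state.

```python
from typing import Any, Optional

# Keys derived from created_at wall-clock values, excluded from the comparison.
WALL_CLOCK_KEYS = {"started_at", "entered_at", "completed_at"}
# Hypothetical: names of arrays whose order carries no meaning.
SET_BACKED_KEYS = {"completed_steps"}


def canonicalise(value: Any, parent_key: Optional[str] = None) -> Any:
    """Return a copy with wall-clock keys dropped and set-backed arrays sorted."""
    if isinstance(value, dict):
        return {
            k: canonicalise(v, k)
            for k, v in value.items()
            if k not in WALL_CLOCK_KEYS
        }
    if isinstance(value, list):
        items = [canonicalise(v) for v in value]
        if parent_key in SET_BACKED_KEYS:
            # Order is an artefact of iteration, so sort before comparing.
            return sorted(items, key=repr)
        return items
    return value


def states_match(chain_walk_state: dict, event_scan_state: dict) -> bool:
    """Structural equality on the decision-relevant part of two built states."""
    return canonicalise(chain_walk_state) == canonicalise(event_scan_state)
```

Ordinary lists keep their order, so a genuine ordering difference between the two builders would still surface as a mismatch.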
Part B — Phase 4 (off-server state builder) started this session. Moving the chain-walk state construction OFF the server onto the system worker pool: a system/state_builder component that walks the prev_event_id chain head→root from the WAL / NATS stream (not by scanning the materialized noetl.event), builds the WorkflowState, and caches it on the pool side keyed by the immutable chain head; on the next drive trigger it advances only the new tail (incremental, not a full re-walk) and serves the built state to the drive. Behind a flag (server chain-walk + event-scan remain fallbacks); default the existing safe behaviour. Correctness paramount — off-server-built state must equal server chain-walk / event-scan builds (parity). PROD GKE untouched; no gate default changed. (See the #115 Phase 4 entry below for the in-flight detail.)
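The cache behaviour described above (key on the immutable chain head, advance only the new tail, cold-rebuild otherwise) can be illustrated with a small Python sketch. The class, method names, and outcome strings are hypothetical, the "state" is reduced to the spine itself, and the spine is kept in chain order (root first), which matches event_id order only under the assumption that ids increase along the chain.

```python
from typing import Dict, List, Optional, Tuple


def _walk(index: Dict[int, Optional[int]], head: int, stop: Optional[int]) -> Optional[List[int]]:
    """Walk head -> stop (exclusive) by prev pointer; None if stop is never reached."""
    tail: List[int] = []
    cursor: Optional[int] = head
    while cursor != stop:
        if cursor is None or cursor not in index or len(tail) > len(index):
            return None  # ran off the chain, missing hop, or cycle
        tail.append(cursor)
        cursor = index[cursor]
    tail.reverse()  # root-first order
    return tail


class StateCache:
    def __init__(self) -> None:
        self._slots: Dict[str, Tuple[int, List[int]]] = {}

    def build(self, execution_id: str, index: Dict[int, Optional[int]], head: int) -> Tuple[str, List[int]]:
        cached = self._slots.get(execution_id)
        if cached is not None:
            cached_head, cached_spine = cached
            if cached_head == head:
                return "cache_hit", cached_spine
            # Tail-only walk: must arrive exactly at the cached head
            # (pointer continuity), otherwise the cached prefix is unusable.
            tail = _walk(index, head, cached_head)
            if tail is not None:
                spine = cached_spine + tail
                self._slots[execution_id] = (head, spine)
                return "incremental", spine
        spine = _walk(index, head, None)
        if spine is None:
            raise ValueError("incomplete chain: caller should fall back")
        self._slots[execution_id] = (head, spine)
        return "cold_rebuild", spine

    def evict(self, execution_id: str) -> None:
        """Drop the slot, e.g. once the execution is terminal."""
        self._slots.pop(execution_id, None)
```

Because a chain head never changes once written, equality of heads is enough to prove the cached prefix is still valid, which is why no row-count consistency check is needed in this scheme.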
2026-06-19 — ✅ #115 Phase 2 MERGED → ai-meta pointers bumped + Phase 3 (chain-walk state builder) started
Part A — Phase 2 post-merge close-out. Both Phase-2 PRs merged; ai-meta submodule pointers bumped on main:
- server#244 → v3.31.0 (f5bd4a8, merge 29a8d69): prev_event_id chain links.
- noetl#667 → ecd16a2: canonical schema_ddl.sql chain columns + idx_event_prev_event_id.
- ai-meta pointer bump afdb365.
Post-merge verification (live kind, gate-ON baseline). Both prev_event_id columns present in the running DB (noetl.event + noetl.command, information_schema check); server image in use localhost/noetl-server:p2-chain reflects the merged code; gate-ON baseline live (NOETL_EVENT_INGEST_PUBLISH_ONLY=true + NOETL_ORCHESTRATE_PLUGIN_DRIVE=true + system-pool NOETL_MATERIALIZER_ENABLED=true). PROD GKE untouched; no gate default changed. #115 → board "In progress".
Part B — Phase 3 IMPLEMENTED + kind-validated (PR open): the chain-walk state builder. A WorkflowState builder that follows prev_event_id head→root (each hop a (execution_id, event_id) PK lookup) instead of scanning noetl.event, behind NOETL_STATE_BUILD_MODE=chain_walk|event_scan (default event_scan, prod unchanged); head from the in-memory ChainHeads watermark (no DB read), event-scan kept as the fallback (cold head / node-not-materialized-under-gate / non-genesis → fall back, correctness never sacrificed). Reuses the Phase-1 refs + Phase-2 chain + orchestrate-core from_events unchanged (server-only, no crate cascade; parity by construction). Adds NOETL_STATE_BUILD_PARITY_CHECK (shadow-build both ways in one REPEATABLE READ tx + assert equality) + metrics (noetl_state_build_total{mode,outcome}, …_event_scans_total no-scan counter, …_chain_hops, …_parity_total).
PR: noetl/server#245 (branch kadyapam/115-phase3-chain-walk-builder, on top of v3.31.0). Left open for review (self-merge of own PRs to a default branch deferred to review, as in Phase 2); ai-meta pointer bump staged.
Validation (kind gate-ON: PUBLISH_ONLY + off-server drive + materializer sole-writer) — linear / loop (62 ev) / fan-out+reduce / output_select (31) / storage_tiers (55):
- PARITY 41/41 MATCH, 0 mismatch (tx-isolated + normalized comparison): chain-walk state == event-scan state. (The check surfaced a pre-existing non-determinism: noetl.event.created_at is timestamp without time zone → parse_event_rows falls back to Utc::now(), so started_at/entered_at/completed_at vary across every reconstruction in BOTH paths; they are excluded from the decision-relevant comparison, as orchestrate-core documents.)
- Structural parity (SQL): recursive prev_event_id walk == scan set; walk_reached==total, 1 root, 0 dup-prev across all (13/62/25/31/55).
- NO-SCAN: chain_walk mode held noetl_state_build_event_scans_total at 0 across 40 drive builds (1064 PK hops, 0 fallbacks), i.e. zero noetl.event scans on the drive path.
- Behavioral parity: all fixtures COMPLETE in chain_walk mode; per-exec sole-writer (rows==distinct, catalog0=0, __orchestrate__ event rows=0), materializer lag=0; kind_validate_orchestrate_gate.sh PASS in chain_walk mode; 577 lib tests + clippy green.
- Cleaned ~400 stuck execs + purged NATS streams to a clean gate-ON baseline first; restored the baseline (p2-chain image, default event_scan, gate-ON, all pools up, smoke COMPLETED) at the end. PROD GKE untouched; no gate/mode default changed. Phase 4 (off-server system/state_builder + WAL-fed cache) builds on this walk.
2026-06-19 — ✅ #115 Phase 2 implemented + kind-validated: one-level prev_event_id event chain (server#244 + noetl#667 open, awaiting merge)
What landed. Phase 2 of RFC #115 §4 — the one-level event chain. noetl.event gains prev_event_id (the immediately-previous event in causal order) and noetl.command gains the issuing-event link, so per-execution events form a walkable singly-linked list followable pointer-by-pointer without scanning noetl.event. Pure population — no reader consumes the link yet (that's Phase 3). No behavior change; no prod gate change.
Design. The emit chokepoint emit_events stamps each event's link from a per-execution chain-head watermark (ChainHeads in AppState) — the single server-side path every server-originated event passes through (drive events, command.issued, worker-lifecycle via handle_event), so the chain is covered on both the gate-off INSERT and the gate-on publish (to_stream_json); the materializer persists it (EventEnvelope + project_events). noetl.command.prev_event_id = the real step.enter/unblocking completion (the chain head captured before the command.issued batch advances it), so cursor-fan-out bodies share their branch origin (§4.4). Self-ensuring + non-fatal event_chain::ensure_columns at startup. Server-only — no orchestrate-core change: the chokepoint subsumes the RFC's "core carries prev" note since off-server-driven results re-enter the same emit path.
Validation (local kind, gate-ON: PUBLISH_ONLY + off-server drive + materializer sole-writer). Chain-correctness proven across 6 executions — linear simple_python 13/13, loop loop_test 62/62, fan-out fanout_reduce_phase6 25/25 (2 bodies share a real branch-origin event), sub-playbook playbook_composition 46/46, + Phase-1 output_select 31/31 (cmd ctx 7.3KB) & storage_tiers 55/55 (cmd ctx 36KB) bounded. For each: exactly 1 root (prev NULL), 0 dangling event prev, 0 duplicate prev (strict linked list), 1 head, pointer-walk from head reconstructs the full sequence (walked == total, no gaps, no full-table scan), real-step command→issuing-event dangling = 0. kind_validate_orchestrate_gate.sh PASS (sole-writer 25==25 rows==distinct, 0 dup cycles, catalog0=0, __orchestrate__ event=0/command=4, lag 0); Phase-1 fixtures complete with bounded sizes; 573 server lib tests + clippy green. (kind_validate_orchestrate_offserver.sh FAILs only on stale #113/#114 must-advance sub-assertions — dormant under Phase-1 refs_in_state=true; not a Phase-2 regression, e2e refresh tracked under #111/#98.)
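The per-execution chain-correctness checks listed above (one root, no dangling or duplicate prev, one head, walked == total) can be expressed compactly. The Python sketch below is an illustration over in-memory `(event_id, prev_event_id)` pairs, not the SQL the validation actually ran; the function and field names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def check_chain(events: List[Tuple[int, Optional[int]]]) -> Dict[str, object]:
    """Check that (event_id, prev_event_id) pairs form a strict linked list."""
    by_id = dict(events)
    ids = set(by_id)
    prevs = [prev for _, prev in events if prev is not None]

    roots = [eid for eid, prev in events if prev is None]
    dangling = [p for p in prevs if p not in ids]
    duplicate_prev = [p for p, n in Counter(prevs).items() if n > 1]
    heads = ids - set(prevs)  # events that nothing points back to

    # Pointer-walk from the single head back to the root, counting hops.
    walked = 0
    if len(heads) == 1:
        cursor: Optional[int] = next(iter(heads))
        while cursor is not None and cursor in by_id and walked < len(events):
            walked += 1
            cursor = by_id[cursor]

    ok = (
        len(roots) == 1
        and not dangling
        and not duplicate_prev
        and len(heads) == 1
        and walked == len(events)
    )
    return {
        "roots": len(roots),
        "dangling": len(dangling),
        "duplicate_prev": len(duplicate_prev),
        "heads": len(heads),
        "walked": walked,
        "total": len(events),
        "ok": ok,
    }
```

A fork such as `[(1, None), (2, 1), (3, 1)]` fails on both the duplicate-prev and single-head conditions, which is the "strict linked list" property the validation asserts for events.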
Pointers. server#244 (impl) · noetl#667 (schema_ddl.sql) · #115. Kind dev cluster left on the localhost/noetl-server:p2-chain image (gate-ON baseline, lag 0); PROD GKE untouched (pre-#108 in-server drive). Awaiting human review/merge → ai-meta pointer bump. Phase 3 next: the chain-walk state builder replaces rebuild_state's noetl.event scan behind a flag.
2026-06-19 — ✅ #115 Phase 1 shipped: references-in-state consume side (worker selective ref-resolution + server _ref/_store surfacing + refs_in_state default true) — closed #113 + #114
Headline: shipped Phase 1 of the #115 RFC (the reframed #101 consume side):
the worker resolves noetl:// references lazily + selectively at render
time, and references-in-state becomes the default — so over-budget upstream
results stay as references in state/commands and a step's render_context never
carries foreign bulk. All 9 previously-stalled core fixtures now complete
gate-ON; #113 and #114 closed.
What landed:
- noetl/worker#117 → v5.36.0 (0a66b41): resolve_context_references made selective. It resolves a step's noetl:// ref only when this command's tool input binds the step's bulk (a path the bounded extracted summary can't satisfy: whole-object bind, .data over a summarised rowset, array element past [0], _truncated node). Predicate / scalar / _ref access reads off the summary with no store round-trip; an upstream result a step doesn't consume stays a reference. The decision walks command.input against the summary; summarise_value keeps every object key so an absent key is absent in the full payload too (resolving is futile → skip). Conservative bias: missing summary / ambiguous parse → resolve. 7 new unit tests.
- noetl/server#243 → v3.30.0 (3014f6f): hydrate_result_references keep_refs branch surfaces _ref/_store/_uri on the kept extracted summary (with_ref_accessors), so {{ step._ref }} lazy-load (artifact.get) + {{ step._ref is defined }} / {{ step._store }} predicates resolve without bulk; refs_in_state default flipped to true now that the consume side holds. Builds on #113 (v3.29.4) + #114 (v3.29.5) in the base.
Validation (kind gate-ON: PUBLISH_ONLY + off-server drive + materializer
sole-writer): all 9 of 9 #113 stalls reach playbook.completed — max
next-command context across the 9 = 412KB (was 1.32MB); storage_tiers' 17.4MB
drive-state gone (36KB); output_select's {{ start._ref }} resolves
(1.32MB→10KB); lease_expiry's 201 spinning __orchestrate__ cmds → 16. 0
__orchestrate__ event rows on every run (sole-writer intact); materializer
consumer lag = 0; committed offserver rig PASS (dispatched +N / applied +N /
decode_error +0); 6-fixture fast-path regression sample (fan-out / loops /
conditionals) all green. Worker memory bumped 512Mi→1.5Gi on kind (15MB
synthetic-payload OOM headroom — environment tuning, not code).
Issues: closed #113 (all 9 green) + #114 (refs_in_state-true is its
candidate fix 1; offload cap stays a safety net); #101 consume side done;
#115 Phase 1 done (board → In progress; Phases 2–6 remain); #107/#111
off-server-drive prod cutover unblocked on the payload/state-size axis. PROD
GKE untouched (pre-#108 in-server drive); the refs_in_state flip is a code
default — prod runtime gates unchanged.
Pointers: worker → 0a66b41 (v5.36.0), server → 3014f6f (v3.30.0).
2026-06-19 — 📐 RFC #115: Decouple result data from context — reference-only schema + one-level event chain + worker-side state builder
Headline. Filed a design RFC (no feature code) directed by the platform owner that reframes the off-server-drive / event-sourcing model. The off-server-drive cutover (#107/#111) has been blocked by a series of payload-size symptoms (#113 oversized drive result, #114 oversized command.issued); the RFC names the root cause they all share — orchestrator context grows unbounded because every step passes the entire accumulated context to the next — and specifies the architecture that removes the growth.
The six tenets (the spec).
- Root cause = unbounded context growth. WorkflowState::build_context (orchestrate-core/src/state.rs) folds the workload + every completed step's full result + the durable ctx overlay; orchestrator.rs (~L1339) clones that whole context per arc; commands.rs build_command hands the full map to every command. Evidence: 17.4MB __orchestrate__ drive-state (storage_tiers), 1,324,800B next-command context (output_select), 5.4MB command.issued events (CQRS 2d). #114 caps the symptom, not the cause.
- Decouple result DATA from CONTEXT: the noetl.* schema holds references only; data lives in the result/object store addressed by the noetl:// URN (result_store.rs physical ref + worker stamp_logical_uri derivable logical URN). Rows carry {ref, extracted} (the bounded navigable predicate block, build_extracted / summarise_value, ≤4KB).
- No scan of noetl.event at any time in the state/drive read path: beyond #103's sole-writer, this removes the event-table reads (rebuild_state's SELECT … FROM noetl.event; from_events full replay).
- One-level event chain: each event carries prev_event_id, each command points to the event that issued it; state is followed pointer-by-pointer, not by a table scan or full replay. Fan-out kept as frame-indexed branch links off the claim node.
- Worker-side state builder + cache: a system/state_builder WASM playbook (system pool, #105 capability ring) walks the chain, resolves the needed refs, and caches the assembled state keyed by the immutable chain head (no COUNT(*) consistency check; cache is a pure function of an immutable prefix; cold-rebuild by re-walking from the WAL ack offset).
- Atomic-working-item context: a tool worker receives only its minimal item slice (refs for bulk, its extracted predicates, its fan-out coordinate), not the whole playbook/steps/loops; the builder owns the topology. Depends on Explicit Input Binding (#77).
How it reframes the program. Reframes #101 — its "resolve refs into a rebuilt WorkflowState" is inverted to "never put data in state, never scan events"; #101's remaining consume side (worker selective render-time ref-resolution) becomes Phase 1 (the immediate unblock for the 3 stuck #114 fixtures + the prod cutover), and the incremental OrchStateCache is superseded by the chain cache. Subsumes the storage tier of #104 (URN derivation, NATS-as-WAL, Feather). State-construction design for #107 steps 2–4; unblocks #111 (removes the server-side state rebuild + event scan it flagged). Builds on #103's write path — the sole-writer gate is unaffected and the kind gate state was NOT changed; #103's read side (2c projection-snapshot read + freshness gate) is superseded. #114 stays a safety cap.
Phased plan. P1 schema-holds-refs-only (= #101 consume side; worker/server/tools) → P2 prev-event chain links (server/orchestrate-core) → P3 chain-walk state builder server-side, flagged (server/orchestrate-core) → P4 off-server system/state_builder + cache (server/worker/tools/ops) → P5 atomic-item context contract (orchestrate-core/worker/server, needs #77) → P6 retire the event read path (server/ops).
Filed. Issue #115 (ai-task, repo:server, enhancement) + roadmap board 3 (Todo) + full RFC on the wiki (Umbrella: Decoupled Context + Event Chain); cross-linked from #101/#107/#111 and the Orchestrator-Scaling umbrella. No code, no pointer bumps to submodules; prod GKE / the unmerged CQRS branch / the kind gate state untouched.
2026-06-19 — 🐞 #114: offload oversized command.issued context (server v3.29.5); refs_in_state consume side (#101) is the remaining off-server-drive cutover blocker
Headline. Fixed #114 — under the publish-only gate the off-server orchestrate drive (refs_in_state=false) embeds the full resolved upstream context into the next step's command, so its command.issued event reached ~1.32MB and exceeded NATS max_payload (1MB) → publish never acked → step.enter persisted, command never issued, wedge. The fix offloads an over-budget command context to the result store with a noetl:// ref marker so the published event stays small.
What landed.
- noetl/server#242 (squash e44de49, released v3.29.5 385d21f): when a command.issued context's serialised {tool_config, args, render_context} exceeds NOETL_COMMAND_CONTEXT_MAX_BYTES (default 512KB, half the NATS ceiling), persist_engine_command(s) offload it to noetl.result_store and write a tiny { "__context_ref__": "noetl://…" } marker on the event + command row; get_command/claim_command resolve it before the worker sees the command (same result-store pattern as the #113 read-side fix). New metrics noetl_orchestrate_drive_total{stage="context_offloaded"|"context_ref_resolved"}. apply_event reads only command.issued meta (never context) → rebuild/replay, sole-writer, idempotency, ordering unaffected. Within-budget commands unchanged (one to_vec length check).
- noetl/e2e#64 (9919392): new fixture test_oversize_command_context.yaml (~900KB next-command context, no _ref lazy-load so it completes) + rig phase 8 in kind_validate_orchestrate_offserver.sh asserting COMPLETED + offload/resolve metrics + MAX(command.issued ctx) < max_payload + 0 __orchestrate__ event rows.
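The offload-and-resolve pattern from the server item above can be sketched as follows. This is an illustrative Python model, not the server's Rust: the function names and the dict-backed `store` are hypothetical, and the generated `noetl://` path is a made-up placeholder because the real URN layout is not given in this log. Only the `__context_ref__` marker key and the 512KB default come from the entry.

```python
import json
import uuid
from typing import Dict

# Default from the entry above: 512KB, half the 1MB NATS max_payload.
COMMAND_CONTEXT_MAX_BYTES = 512 * 1024


def persist_context(context: dict, store: Dict[str, bytes],
                    max_bytes: int = COMMAND_CONTEXT_MAX_BYTES) -> dict:
    """Return the context itself if within budget, else a small ref marker."""
    body = json.dumps(context, separators=(",", ":")).encode("utf-8")
    if len(body) <= max_bytes:
        return context  # within budget: unchanged, one length check
    # Hypothetical URN layout, for illustration only.
    ref = f"noetl://example/context/{uuid.uuid4().hex}"
    store[ref] = body
    return {"__context_ref__": ref}


def resolve_context(stored: dict, store: Dict[str, bytes]) -> dict:
    """Expand a ref marker back into the full context before the worker sees it."""
    ref = stored.get("__context_ref__")
    if ref is None:
        return stored
    return json.loads(store[ref])
```

The published event then carries only the marker (a few hundred bytes at most), so it stays under the stream's payload ceiling regardless of how large the upstream context has grown.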
Validation (kind gate-ON: PUBLISH_ONLY=true + PLUGIN_DRIVE=true + materializer sole-writer). Off-server rig PASS: fan-out (#108) + #113 large-context + the new #114 fixture all COMPLETED; the #114 fixture offloaded (event 585B) + resolved; 0 __orchestrate__ rows in noetl.event; materializer pending=0 (drained==projected). Every command.issued event < 1MB across all fixtures (max 165KB on http_*; offloaded ones → 135–635B markers); zero resolve-failure WARNs, zero ack-timeouts. 6 of #113's 9 large-context fixtures now COMPLETE (http_to_postgres_direct/simple/bulk_python, test_large_result_extraction, test_pipeline_heavy_payload, save_edge_cases).
The refs_in_state decision (the prompt's "choose and justify"). Chose ref-on-oversize (this offload) over flipping refs_in_state=true (candidate #1). A kind experiment under the gate settled it: refs_in_state=true keeps state small (no inline bloat — kind_playbook_lease_expiry COMPLETES, drive-state 29KB vs ~1MB loop) but breaks the bulk-consuming fixtures (test_storage_tiers fails at load_kv_data, test_output_select at lazy_load_full_data) because the worker render-time ref-resolution (the consume side) is not implemented — exactly the AppConfig::refs_in_state caveat. So flipping the default is unsafe today; the offload is orthogonal and stays useful regardless.
Remaining (distinct, → #101). The other 3 of #113's 9 progress past the oversized-event wedge but hit DEEPER refs_in_state=false off-server-drive issues: __orchestrate__ drive-state bloat (the drive command's WorkflowState reached 17.4MB for test_storage_tiers; ~1MB + a non-convergent loop for kind_playbook_lease_expiry) and the _ref/bulk-resolve lazy-load gap (test_output_select). All are the refs_in_state consume side (#101) — keep over-budget results as refs in state + resolve at worker render time. So #113 stays open, and the off-server-drive prod cutover (#107/#111) is partially unblocked (oversized-event class removed) but still gated on #101.
Pointers. ai-meta → server 385d21f (v3.29.5) + e2e 9919392. Issues #114/#113/#111/#101 updated; board #114 → In progress. No prod default flipped; prod runs the pre-#108 in-server drive (batch-dispatch-v1), unaffected.
2026-06-19 — 🐞 #113: off-server drive — recover offloaded drive result + stop drive on cancel (server v3.29.4); #114 opened for a distinct oversized-command stall
Headline. Fixed the noetl/ai-meta#113 off-server-drive
(NOETL_ORCHESTRATE_PLUGIN_DRIVE) payload-size bug + the cancel
non-stop facet, validated on the kind gate-ON cluster. Shipped
noetl-server v3.29.4 (server#241) + a committed convergence rig (e2e#63). Investigation surfaced a distinct second stall behind the same fixtures → opened noetl/ai-meta#114. #113 stays open (acceptance is all 9 fixtures COMPLETE; 5/9 close here, 4/9 blocked on #114).
The fix (two facets, both server-side; in-process drive + prod unaffected — prod is pre-#108 in-server drive).
- Facet 1: large drive result dropped. When an __orchestrate__ drive result (≈ the full execution context) exceeds the worker's 100KB inline budget, the worker offloads the whole tool result to the durable result store and emits only a reference.ref (no inline output_b64). The completion handler decoded only the inline form, dropped the drive decision, re-issued __orchestrate__, re-evaluated, and re-offloaded: a non-convergent loop (200+ PENDING orchestrate commands, no terminal event). apply_worker_orchestration now resolves the offloaded ref (result_store.resolve + parse_noetl_ref) and decodes output_b64 from the stored result. New drive-stage metric ref_resolved; small-inline path + suppression + sole-writer unchanged.
- Facet 2: cancel didn't stop the drive. Cancel emits the underscore playbook_cancelled, which WorkflowState::apply_event didn't match (only the dotted form), so the cached state never went terminal and the reconcile poller + straggler completions kept re-issuing __orchestrate__ (only a restart cleared it). Now matches both spellings; added ExecutionState::is_terminal() + a terminal guard in trigger_orchestrator_inner that evicts the orch-cache slot and skips dispatch.
Validation (kind gate-ON: PUBLISH_ONLY=true + PLUGIN_DRIVE=true +
materializer sole-writer). test_large_result_extraction drove a
785KB result → ref_resolved=1, COMPLETED, 0 __orchestrate__
event rows, no decode WARN. 5 of #113's 9 fixtures now COMPLETE
(http_to_postgres_direct/simple/bulk_python, save_edge_cases,
test_large_result_extraction) with bounded orch counts (2–8). Cancel:
a drive-looping execution froze its orchestrate-command count the
instant playbook_cancelled landed (0 new across 26s), CANCELLED,
no server restart. Sole-writer lag (noetl_worker_nats_consumer_pending{consumer="noetl_materializer"})
= 0 throughout. Core regression 61/65 in-window; cargo test +
clippy green. Off-server rig PASS with new assertion 7.
#114 — the distinct second stall. The other 4 #113 fixtures
(test_output_select, test_storage_tiers,
pagination/pipeline/test_pipeline_heavy_payload,
batch_execution/kind_playbook_lease_expiry) now progress past the
decode loop (the fix works — ref_resolved fires, 0 decode WARNs) but
wedge at a separate failure: with refs_in_state=false the drive embeds
the full resolved upstream context into the next step's command, so
its command.issued event (~1.32MB for verify_extracted_fields,
ctx[start]=1.06MB) exceeds NATS max_payload=1MB → publish ack-timeout
→ step.enter persisted, command never issued, wedge. Larger/riskier
fix (refs-in-state / context trim / event sizing) → tracked in
#114.
Pointers. ai-meta → server 1e844c1 (v3.29.4) + e2e 12b27e9.
Issues: #113 commented + kept open; #114 opened (ai-task, repo:server,
board 3). Prod GKE untouched; no prod default flipped.
2026-06-19 — 🚀 #103: GKE pre-flip prep — prod images pushed, GMP monitoring LIVE, parallel stack staged (NO traffic flip, NO PUBLISH_ONLY)
Headline. Staged, non-traffic-affecting prep for the prod CQRS
PUBLISH_ONLY flip on gke_noetl-demo-19700101_us-central1_noetl-cluster. The
post-#108/#103 images (server v3.29.3, worker v5.35.0) are built + pushed
to the prod Artifact Registry; the materializer-lag monitoring is applied and
verified live (as GMP, not VictoriaMetrics — see below); the roll-forward
manifests are staged in a PR but not applied. No production default
changed — no traffic flip, no PUBLISH_ONLY=true, no secret created.
Live prod reconciliation (read-only, the brief was stale on three counts). The prep brief inherited the #49 cutover-era view (Python serving, secrets missing, VM stack on prod). Verified against the live cluster, all three are now false:
- Prod already runs the full Rust stack: the #49 Python→Rust cutover is done. noetl Service selector = app=noetl-server-rust; no Python deployment exists. Live images are the pre-#103 generation (server-rust:batch-dispatch-v1, noetl-worker-rust:cursor-100).
- Both flip secrets already exist (created when the Rust cutover landed): NOETL_ENCRYPTION_KEY on noetl-secret, noetl-internal-api-token. The "create the secrets" operator step is done.
- Prod monitoring is Google Managed Prometheus (GMP), not VictoriaMetrics (namespaces gke-gmp-system / gmp-public; no VM operator). The kind VMRule/VMServiceScrape cannot apply here; they were translated to GMP-native objects.
What shipped (all in ops, PR staged under kadyapam):
- Prod AR images (non-traffic-affecting push): server-rust:v3.29.3 (@sha256:6d2de321cd0938182c85cfaa500ac922e31f128d046e6306816f0983b40e6d1e) + noetl-worker-rust:v5.35.0 (@sha256:9912b032473b20893a1f06c1ae0c13f9a0120c3a73ee6c616407622c990eac94), linux/amd64 (Cloud Build, matches the GKE node pool). Built from the pinned submodule checkouts (server b6e5d31, worker b910341).
- GMP monitoring, APPLIED + VERIFIED LIVE (ci/manifests/noetl/gmp/): podmonitoring-noetl.yaml (GMP PodMonitoring for the worker pools + the server /metrics; GMP does not honor prometheus.io/scrape annotations, and the noetl namespace had no PodMonitoring before, so noetl app metrics were not being scraped at all) + rules-materializer-lag.yaml (GMP Rules, identical PromQL/thresholds to the kind VMRule). Proven via the Managed Prometheus query API: up{namespace="noetl"} = 4 series all 1 (server + 2 worker-rust + system-pool), noetl_worker_nats_consumer_pending + noetl_events_ingested_total flowing. The materializer-specific series appear once the v5.35.0 worker rolls.
- Staged (NOT applied) roll-forward manifests: server-rust-deployment-prod.yaml → v3.29.3 (explicit NOETL_EVENT_INGEST_PUBLISH_ONLY=false), worker-system-pool-deployment-prod.yaml → v5.35.0 (NOETL_MATERIALIZER_ENABLED=false). These roll live workloads, so they are operator-gated.
- Runbooks: noetl-cqrs-publish-only-flip.md gained a "Production (GKE) — environment specifics" section (GMP not VM, secrets exist, image-roll prerequisite, the exact 5-step operator sequence) + a GMP managed Alertmanager pager-wiring section with a templated receiver stub; the #49 cutover runbook got an "ALREADY EXECUTED — historical record" banner.
Operator-gated (surfaced, NOT done — conservative prod bias):
- Roll the live system pool → v5.35.0 (low blast radius), then the live server → v3.29.3 (zero-downtime rolling update of the traffic-serving deployment).
- Enable the materializer as a shadow (NOETL_MATERIALIZER_ENABLED=true, server gate still off): the green-baseline check.
- Wire the GMP managed Alertmanager pager (needs the receiver endpoint, which this prep does not hold; templated stub in the runbook).
- The PUBLISH_ONLY=true flip itself, behind the live lag alerts, one revert away.
Pointers. ai-meta e5b6d6c → ops 9edd9c4 (PR #197 — GMP monitoring applied + staged
manifests + runbook prod section). Tracks #103;
refs #49 (cutover already done),
#107 step 1,
#111. Prod state restored to
as-found except the three non-traffic artifacts (AR images, GMP monitoring,
staged PR).
2026-06-19 — 🛡️ #103: materializer-lag GUARDRAIL shipped — the pre-flip observability gate (worker v5.35.0 + ops VMRule/dashboard/runbook)
Headline. The server was already FLIP-READY for NOETL_EVENT_INGEST_PUBLISH_ONLY; the one remaining operator gate was observability of materializer lag so the staged flip is safe and one revert away. That guardrail is now shipped end-to-end and kind-proven (induce → fire → recover → clear). PUBLISH_ONLY stays default-off; no prod default changed.
Why it matters. Under the gate the server writes zero noetl.event rows and publishes every event to the noetl_events JetStream stream; the worker-side materializer (system pool) is the sole noetl.event writer. If it falls behind or dies, published events pile up un-materialized and the event log silently stops advancing. The guardrail is the early-warning + page surface for exactly that — the cutover design note's "materializer availability is now load-bearing."
The lag signal chosen. Primary = the JetStream consumer backlog on the materializer consumer: noetl_worker_nats_consumer_pending{stream="noetl_events",consumer="noetl_materializer"} + _ack_pending, reported by the worker's existing lag poller on an independent task. That independence is the point — a stalled or dead materializer loop can't report its own lag, but the separate poller keeps the gauge climbing. It's restart-robust (a gauge of current stream state, not a process counter) and is the earliest "falling behind / down" indication. Backed by a rate-based stall cross-check (published rate>0 AND acked rate==0) that self-scopes to the gate (the published counter only moves under the gate).
Shipped (default-off).
- worker #116 → v5.35.0 (b910341): NatsSubscriber::consumer_lag_for(stream,consumer) queries an arbitrary consumer over the same JetStream connection; the lag poller now also tracks the materializer consumer when NOETL_MATERIALIZER_ENABLED is set, recording into the existing labelled noetl_worker_nats_consumer_pending{stream,consumer} gauge (no new metric, one extra consumer-info round-trip per tick, system pool only). 201 lib tests + clippy clean.
- ops #195 + #196 (2fcfa59): ci/manifests/noetl/vmrule-materializer-lag.yaml: backlog warning (>200/10m) + critical (>2000/5m) + growing-unbounded + stall-under-gate (5m, guarded on backlog>0 to suppress a post-burst false positive) + project-errors + absent-under-gate; ci/manifests/noetl-scrape/vmscrape-worker.yaml: worker /metrics VMServiceScrape (the worker pools were previously unscraped); vmstack-values.yaml: VMAlert enabled (was off; rules don't evaluate without it) + documented alertmanager routing; Grafana dashboard; flip runbook runbooks/noetl-cqrs-publish-only-flip.md (pre-flip green-baseline check, per-alert meaning, one-command revert NOETL_EVENT_INGEST_PUBLISH_ONLY=false).
- worker-wiki 0030f30: deployment-spec Observability section documents the lag gauge + materializer counters.
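The backlog and stall conditions in the rules above can be read as the following logic. This is a minimal Python sketch, not the shipped VMRule: the thresholds (200, 2000) and the backlog>0 guard come from this entry, while the `Sample` type, the `classify` function, and the condition names are hypothetical, and the sustained-for windows (10m/5m) are deliberately not modelled.

```python
from dataclasses import dataclass

WARNING_BACKLOG = 200    # from the entry: warning when backlog > 200 (sustained 10m)
CRITICAL_BACKLOG = 2000  # from the entry: critical when backlog > 2000 (sustained 5m)


@dataclass
class Sample:
    backlog: float         # pending messages on the materializer consumer
    published_rate: float  # events/s published by the server under the gate
    acked_rate: float      # events/s acked by the materializer


def classify(sample: Sample) -> list:
    """Return the (hypothetical) condition names that hold for one sample.

    Each condition is evaluated independently, the way separate alert
    rules would be; the time windows of the real rules are omitted.
    """
    firing = []
    if sample.backlog > WARNING_BACKLOG:
        firing.append("backlog_warning")
    if sample.backlog > CRITICAL_BACKLOG:
        firing.append("backlog_critical")
    # Stall cross-check: events are being published but none are acked.
    # The backlog > 0 guard suppresses the post-burst false positive.
    if sample.published_rate > 0 and sample.acked_rate == 0 and sample.backlog > 0:
        firing.append("stalled_under_gate")
    return firing
```

With the induced-lag numbers from the kind run (backlog 684, publishing, nothing acked), the sketch reports the warning and the stall; at backlog 0 the stall condition stays quiet even while the published rate is still non-zero.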
Kind validation (the proof, not just deploy). Deployed the VictoriaMetrics stack (operator + vmsingle + vmagent + vmalert) on kind-noetl against a gate-on server (v3.29.x) + the v5.35.0 worker:
- Green baseline (gate-on, healthy materializer): backlog noetl:materializer_backlog=0; an exec COMPLETED with published==projected==acked==25 (each event published once, materialized once, acked once); all alerts inactive.
- Induced lag: NOETL_MATERIALIZER_FAULT_FAIL_FIRST=1000000 (materializer drains but skips project+ack) while driving executions under the gate: the backlog gauge climbed 0 → 120 → 396 → 684 via the independent poller; published rate>0, acked rate→0. Alerts fired: backlog warning + critical + stall (proven on the same-expression short-window validation copy); the shipped MaterializerProjectErrors and MaterializerStalledUnderGate reached firing on the prod windows.
- Recovery (fault removed): the healthy materializer drained the whole backlog idempotently (drained=N projected=N acked=N duplicates=0), backlog → 0, alerts cleared. The recovery also surfaced + fixed a post-burst false positive (stall expr re-asserting at backlog=0 due to the lingering published-rate window), now guarded on backlog>0 (#196, verified inactive at backlog=0).
- Cluster restored to baseline (gate off, system pool back to the baseline image, FAULT removed, stream purged, rust pool restored).
Pointers. ai-meta → worker b910341 (v5.35.0) + ops 2fcfa59 + worker-wiki 0030f30. Tracks #103; refs #107 step 1, #111. Flip-readiness now includes the monitoring gate.
2026-06-19 — 🎯 #103: server CQRS cutover COMPLETE — FLIP-READY (cancel/finalize through the chokepoint, server v3.29.3)
Headline: the third and final PUBLISH_ONLY flip blocker is closed. The
two ExecutionService terminal writers — POST /api/executions/{id}/cancel
(playbook_cancelled) and POST /api/executions/{id}/finalize
(playbook_completed/playbook_failed) — now route their noetl.event writes
through the emit_event chokepoint, so they honour
NOETL_EVENT_INGEST_PUBLISH_ONLY like the other 13 producer sites instead of
INSERTing synchronously under the gate. The server is now a complete
non-writer of the event log when PUBLISH_ONLY is on.
What shipped (default-off):
- server#240 → v3.29.3 (b6e5d31): ExecutionService carries the full AppState (cheap, all-Arc; no cycle, since AppState doesn't reference the service). cancel/finalize build an EventRow and call emit_event. New resolve_catalog_id falls back noetl.event → noetl.command (mirrors #236) because noetl.event is empty under the gate for a fresh exec. require_state guards the pool-less test shim so there's exactly one event write path. 558 lib tests + clippy green.
- e2e#62 (dee459f): dual-mode rig kind_validate_cancel_finalize_gate.sh (auto-detects the gate; sibling of the #61 orchestrate-gate rig).
Kind proof — both modes:
- gate-OFF: cancel → CANCELLED, finalize → FAILED; both INSERT synchronously (published_delta=+0), byte-identical columns (node_id=node_name='playbook', finalize error preserved); rows==distinct, 0 catalog_id=0; natural completion still COMPLETED (25==25); no regression.
- gate-ON (PUBLISH_ONLY=true, materializer = system pool sole writer): server noetl_event_ingest_published_total{playbook_cancelled}=1, {playbook_failed}=1 (PUBLISHED, not inserted); materializer cycles drained→projected→acked duplicates=0; both terminal events materialized with byte-identical columns + correct catalog_id (command fallback held); both executions reach the correct terminal state; rows==distinct, 0 catalog_id=0, no loss/dup.
No remaining synchronous server noetl.event writers under the gate. The only
INSERT INTO noetl.event left server-side are the chokepoint's own gate-off
INSERT and the two materializer sinks (events_project + project_events,
which ARE the sole writer under the gate). claim_command / handle_batch_events
in-tx INSERTs are should_publish-gated. EventService / db::queries::event
is dead code (never instantiated).
Flip-readiness (for the operator). All three flip blockers are now closed —
(1) ack-after-materialize durability, (2) off-server-drive × gate reconciliation
(#104), (3) cancel/finalize. Flipping NOETL_EVENT_INGEST_PUBLISH_ONLY on
(materializer sole writer) is a staged operator decision, behind a
materializer-lag metric/alert, one revert away. No production default changed.
Pointers: ai-meta repos/server → b6e5d31 (v3.29.3), repos/e2e → dee459f.
Cluster restored to baseline (server/system-pool → oc-pool, gate off, materializer
off, stream purged). Closes server#239.
2026-06-19 — #104: off-server-drive × gate reconciliation PROVEN (server v3.29.2 cold-cache rebuild + committed gate e2e rig)
Headline: the last real blocker before the PUBLISH_ONLY sole-writer
flip is operator-safe is closed. The combination #103 left unproven — gate-ON
(NOETL_EVENT_INGEST_PUBLISH_ONLY=true) with the off-server worker-driven
drive (NOETL_ORCHESTRATE_PLUGIN_DRIVE=true) and the materializer as sole
noetl.event writer — is now green end-to-end on kind.
Why it composes (the reconciliation). The off-server drive does not
read noetl.event itself: the server rebuilds WorkflowState from the
event log and passes the bounded state into the __orchestrate__ command. Under
the gate the orchestrator trigger is relocated to fire from the
materializer's write endpoint (events_project) after the row is durably
materialized — so when the server rebuilds state it reads committed
noetl.event, giving read-your-writes. No worker-side read-cache was needed for
the steady state; the existing trigger relocation already reconciles it. What
was missing was (a) a committed proof, and (b) crash-recovery for the cold-cache
apply window.
Live proof on kind (clean cluster; server v3.29.1→.2 + gate + off-server
drive; system pool NOETL_MATERIALIZER_ENABLED=true):
- Happy path: fresh exec + cursor fan-out → COMPLETED; server wrote 0 noetl.event rows (all 25 PUBLISHED: noetl_event_ingest_published_total=25, every write via events/project); materializer materialized all 25 exactly once (drained=projected=acked, 25 rows == 25 distinct ids, 0 catalog_id=0, 0 duplicate cycles); drive dispatched=applied, decode_error=0, event_suppressed>0 (meta-command never hit noetl.event).
- Crash-recovery: a server hard-killed mid-drive → the in-flight __orchestrate__ call.done lands on the cold new pod → the new cold_rebuild path fires (metric + log) → that execution COMPLETES with full integrity (25 rows == 25 distinct, correct fan-out, playbook.completed). Graceful-restart soak: 20/20 executions across 2 rolling restarts COMPLETED, 0 loss (reconcile poller + worker emit_event_with_retry recover all).
- Regression: gate-off + in-process drive (prod default) → COMPLETED, synchronous INSERT, no published/drive metrics; gate-off + off-server drive (kind_validate_orchestrate_offserver.sh) → PASS.
Shipped (default-off; no prod default changed):
- noetl/server#238 → v3.29.2 (76d29bb, closes server#237): apply_worker_orchestration rebuilds WorkflowState from the durable log on a cold-cache apply (server restarted mid-drive) instead of dropping the in-flight result; the #104 "rebuild from the WAL/projection" principle. Confined to the cold branch the warm happy path never enters; idempotent re-apply (deterministic command_id + cursor id-set gating). Adds noetl_orchestrate_drive_total{stage=cold_rebuild|cold_rebuild_failed}.
- noetl/e2e#61 (61f7a5c, closes e2e#60): committed kind rig kind_validate_orchestrate_gate.sh asserting the gate × off-server-drive combination (server published / materializer sole writer / rows==distinct / off-server topology under the gate); documented in docs/operations/local-kind.md.
Pointers: ai-meta → server 76d29bb + e2e 61f7a5c.
Remaining before a safe PUBLISH_ONLY flip: only the 2 ExecutionService
cancel/finalize sites (still synchronous under the gate; correct — no
lost/double writes; need AppState). The off-server-drive×gate item is closed.
2026-06-19 — #103: ack-after-materialize durability RESOLVED + fault-tested (deferred ack + worker materializer loop)
Headline. Closed the one gating correctness item for a safe PUBLISH_ONLY flip: the materializer used to ack noetl_events messages on fetch, before events/project ran, so a transient project failure (server restart mid-drain) lost the acked-but-unmaterialized events. Now there is true ack-after-materialize — acked only after the durable insert; on failure the batch redelivers. Default-off; no prod default changed.
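The difference between ack-on-fetch and ack-after-materialize can be shown with a small in-memory model. This is a sketch under stated assumptions, not the worker's materializer loop: `FakeStream`, `drain_once`, and the `project` callable are hypothetical stand-ins, and the fake stream redelivers immediately, whereas a real JetStream consumer redelivers only after ack-wait expires.

```python
class FakeStream:
    """In-memory stand-in for a consumer with explicit acks (hypothetical)."""

    def __init__(self, messages):
        self.unacked = list(messages)

    def fetch(self, batch):
        # Fetching does not ack: un-acked messages are handed out again.
        return self.unacked[:batch]

    def ack(self, messages):
        done = {m["event_id"] for m in messages}
        self.unacked = [m for m in self.unacked if m["event_id"] not in done]


def drain_once(stream, project, batch=25):
    """One cycle: fetch, project durably, and ack only if the projection succeeded."""
    messages = stream.fetch(batch)
    if not messages:
        return 0
    try:
        project(messages)  # the durable insert happens first
    except Exception:
        return 0           # nothing acked, so the batch will be redelivered
    stream.ack(messages)   # ack strictly after the insert
    return len(messages)
```

Because a failed batch is seen again, the `project` callable has to be idempotent (for example keyed on the event id), which is the property the fault-injection run checks with rows==distinct.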
Shipped (PRs open; pointer bumps staged on the crate-publish cascade):
- noetl/tools#71: deferred (ack-after-processing) ack in the subscription SourceClient + NATS source. AckMode::Defer surfaces a durable per-message ack handle (the NATS $JS.ACK.* reply subject; connection/process-independent within ack-wait); SourceClient::ack(ack_ids, AckDisposition) = Ack/Nack/Term (NATS + Pub/Sub impls); tool operation: ack|nack|term. Opt-in: existing on_success/manual callers unchanged.
- noetl/worker#115: in-process CQRS materializer consume-loop (src/materializer.rs, NOETL_MATERIALIZER_ENABLED, default off, system pool only): drain noetl_events (deferred) → events/project → ack only on 2xx; failure → un-acked → redeliver. Chosen over playbook deferred-ack because the step model can't hold an ack handle across drain→build→project on different pods (result-store _ref stall, concurrent-drain batch splits). Observability triad metrics added.
- noetl/ops#194: system-pool wiring (kind on; prod wired-but-false, staging order documented).
Fault-injection proof (kind, gate-ON, PUBLISH_ONLY=true, in-process drive, materializer = sole noetl.event writer; server mat-gate=v3.29.1+gate, worker mat-defer):
- Happy: execution COMPLETED; materializer cycles drained=N projected=N acked=N duplicates=0; rows==distinct (25==25); stream unprocessed=0; sole writer, zero loss.
- Fault before ack: FAULT-INJECT: skipping project + ack; batch will redeliver → after ack-wait the batch redelivered + projected+acked → execution COMPLETED; rows==distinct (907==907); unprocessed=0; loss=0 across a mid-drain failure, idempotent (no dup rows).
- Also: crate-level live-NATS integration test (nats_integration_deferred_ack_and_redelivery) proves defer→no-ack→redeliver→ack→empty. Cluster restored to baseline.
Cascade LANDED (same day): tools#71 merged → noetl-tools 3.13.0 published (crates.io); worker#115 rebuilt green against 3.13.0 + merged → noetl-worker 5.34.0; ops#194 merged. ai-meta pointers bumped on main (tools 1ba739f + worker 92ee58f + ops 30e194d, ai-meta 0cc74cd). Squash-merges with conventional titles drove semantic-release.
Remaining before the flip is operator-safe: (1) ack-after-materialize done, (2) off-server-drive×gate reconciliation (#104 read-cache — gate-on validated with the in-process drive), (3) 2 ExecutionService cancel/finalize sites. Default-off.
Headline. A full gate-ON execution now runs all the way to COMPLETED via the materializer-as-sole-writer path, with zero loss — server writes 0 noetl.event rows, the materializer writes all 31, drained=materialized=31. (server#236 → v3.29.1 994da30, ai-meta db7ceeb).
Root cause (found via a serial clean-cluster soak — the earlier playbook ack/routing/concurrency hypotheses were red herrings). Under PUBLISH_ONLY noetl.event is empty, so get_catalog_id (noetl.event-only lookup) returned catalog_id=0 for worker-emitted events → published rows carried catalog_id=0 → event_catalog_id_fkey violation → events/project batch 500 → the ack-on-fetch'd events were lost. Fix: get_catalog_id falls back to noetl.command (synchronous under the gate).
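The fallback described in the fix can be pictured as a two-source lookup. A minimal sketch, assuming each table is reduced to a mapping from execution id to catalog id; the function signature and the dictionaries are illustrative, not the server's query code.

```python
def get_catalog_id(execution_id, event_catalog, command_catalog):
    """Look up the catalog id in the event log first, then fall back to commands.

    Under PUBLISH_ONLY the event mapping is empty for a fresh execution,
    so only the command mapping (still written synchronously) can answer.
    Returns 0 when neither source knows the execution.
    """
    for source in (event_catalog, command_catalog):
        catalog_id = source.get(execution_id)
        if catalog_id:
            return catalog_id
    return 0
```

With an empty event mapping and a populated command mapping, the lookup returns the real id instead of 0, which is what keeps the published rows clear of the foreign-key violation.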
Proof. Gate-on: COMPLETED, 0 server writes, materializer projected 6→12→12→1 (no dup/FK), 31 events / 31 distinct ids, strictly ordered, 0 catalog_id=0. Gate-off: off-server orchestrate e2e PASS (no regression); 557 lib tests. Cluster restored to baseline.
Remaining before the PUBLISH_ONLY flip is operator-safe: (1) materializer ack-after-materialize durability hardening (it acks on fetch → a future transient events/project failure could still lose events; needs deferred-ack tooling or a worker-side consume loop), (2) off-server-drive×gate reconciliation (#104 read-cache; gate-on validated with in-process drive), (3) 2 cancel/finalize sites. Default-off; no prod flip.
2026-06-18 — #103: 2d-3 sole-writer cutover implemented (emit_event chokepoint + PUBLISH_ONLY gate, default-off)
Update — MERGED + clean-cluster soak. server#235 merged → server v3.29.0 (e2dc0ce, ai-meta 319985d). Clean-cluster gate-on soak re-proved the server cutover (a fresh gate-on exec writes 0 noetl.event rows; worker claims+runs, events published not inserted), but exposed a materializer-playbook lost-events bug (ops#192, not the server): ack: on_success is on the drain step, so the materializer acks the 6 published events off the stream before materializing them, and when {{ drain_events.count }} mis-routes to empty the events are acked-but-not-materialized = lost. Fix staged on ops#192 (ack-after-materialize + count-path fix + serialize the drainer); ties to #104's WAL-durability. Gate stays default-off until fixed.
Headline. The server-side CQRS write-path cutover is implemented and
default-off (server#235). Under the
gate the server stops writing noetl.event — events publish to the stream
and the system/event_materializer becomes the sole writer. Proven on kind: a
gate-on execution writes 0 noetl.event rows; gate-off is byte-identical. No
production default flipped (operator decision).
Shipped (server#235, branch kadyapam/cqrs-2d3-publish-only-gate, 2 commits).
- handlers::event_write::emit_event/emit_events chokepoint over EventRow: gate-off = canonical full-column INSERT (byte-identical; absent columns bind NULL = today's defaults); gate-on = publish to_jsonb(row) to noetl_events.
- NOETL_EVENT_INGEST_PUBLISH_ONLY config (default false) + lazy publisher in AppState.
- 13 producer sites routed through it; noetl.command stays synchronous; the two sink writers (events_materialize, events_project) untouched.
- Trigger relocation: under the gate events_project fires the orchestrator trigger after the row materializes (read-your-writes); ingest no longer triggers.
- System-pool exemption: system/* playbooks (the drainers) write synchronously even under the gate; otherwise the materializer deadlocks waiting to drain its own events (found live).
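The routing decision the chokepoint makes, including the system-pool exemption, can be sketched as below. This is an illustrative Python model, not the Rust handler: the `emit_event` signature, the injected `insert`/`publish` callables, and the return strings are assumptions made for the example.

```python
def emit_event(row, *, publish_only, playbook_path, insert, publish):
    """Route one event write through a single decision point.

    insert  -- callable doing the synchronous write (gate-off path)
    publish -- callable publishing the row to the stream (gate-on path)
    """
    # system/* playbooks are the drainers; if their own events were only
    # published, the materializer would wait on events it must itself drain.
    exempt = playbook_path.startswith("system/")
    if publish_only and not exempt:
        publish(row)
        return "published"
    insert(row)
    return "inserted"
```

Keeping the gate check in one function is what lets every producer site honour the flag uniformly; a site that writes around it would keep inserting synchronously under the gate.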
Live proof on kind (image built from the branch). Gate-OFF: off-server
orchestrate e2e PASS + 25/25 event-row regression (distinct ids, all columns
correct, tenant/org defaults, playbook_started node_id≠node_name,
command.issued parent+nodetype) → byte-identical. Gate-ON: a gate-on execution
wrote 0 noetl.event rows (twice) — events PUBLISHED
(noetl_event_ingest_published_total confirms), noetl.command synchronous; the
exemption + materialize ({projected:0,duplicates:N}) + relocated trigger all
work. Not cleanly shown: a single fresh exec driven fully to COMPLETED under
the gate — the shared kind cluster was saturated by accumulated stuck test execs
(reconcile re-drives re-flooding the stream); a test-environment artifact, not a
defect. 557 lib tests + clippy green; 3 new unit tests. Cluster restored to baseline.
Staged / operator decision. PUBLISH_ONLY stays default-off; the flip is an
operator decision behind a materializer-lag metric/alert, one revert away. ai-meta
pointer bump waits for server#235 merge. Follow-ups: clean-cluster end-to-end soak;
off-server-drive × gate reconciliation (ties to #104 read-cache);
ExecutionService cancel/finalize (2 sites still synchronous, correct, staged).
Headline. Every NoETL worker pod was a latent crash-loop: the Kubernetes
container-runtime default /dev/shm is a 64 MiB tmpfs, but every worker
process allocates a 256 MiB Arrow IPC shared-memory cache at init
(NOETL_IPC_CACHE_BUDGET_BYTES, default 268435456) backed by POSIX shm on
/dev/shm. Under shm-heavy load the cache writes past 64 MiB, the store
page-faults against the full tmpfs, and the kernel delivers SIGBUS — the
worker dies with exit 135 and crash-loops. Surfaced during the #103 CQRS
kind validation on the system pool; a transient live fix was reverted, leaving
the committed manifests carrying the bug.
What shipped. ops#193 (merged →
ops f4df4c1) fixes all 7 worker deployments that allocate the cache —
system pool, shared/rust pool, rust subscription pool, subscription runtime,
the legacy Python cpu pool, and the two prod variants (config only; not rolled
out by the PR). Each gets a memory-backed /dev/shm (emptyDir
medium: Memory, sizeLimit: 320Mi — budget + 64 MiB headroom), the budget
pinned explicitly via NOETL_IPC_CACHE_BUDGET_BYTES=268435456 next to the
sizeLimit so the two can't silently drift, and the container memory limit
raised to 768Mi (the tmpfs is charged to the pod cgroup, so the limit must
cover sizeLimit + worker RSS).
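The three values that move together can be checked with simple arithmetic. The sketch below uses the numbers from this entry (budget 268435456 bytes, 64 MiB headroom, 768Mi container limit); the `shm_plan` helper and its field names are hypothetical.

```python
MIB = 1024 * 1024


def shm_plan(budget_bytes, headroom_mib, memory_limit_mib):
    """Derive the /dev/shm sizeLimit and the memory left for worker RSS."""
    size_limit = budget_bytes + headroom_mib * MIB
    # The tmpfs is charged to the pod cgroup, so it comes out of the limit.
    rss_room = memory_limit_mib * MIB - size_limit
    return {
        "budget_mib": budget_bytes // MIB,
        "size_limit_mib": size_limit // MIB,
        "rss_room_mib": rss_room // MIB,
    }


plan = shm_plan(268435456, 64, 768)
print(plan)  # {'budget_mib': 256, 'size_limit_mib': 320, 'rss_room_mib': 448}
```

The same arithmetic shows the original defect: a 256 MiB budget cannot fit in the 64 MiB default tmpfs, and a limit below 320 MiB plus worker RSS would trade the SIGBUS for an OOM kill.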
Validation (kind, system pool). Reproduced the crash faithfully (the
cache's ftruncate+mmap+page-store pattern): on the 64 MiB default tmpfs the
write SIGBUSed at the boundary (exit 135, "Bus error"). After applying the
fix (strategic patch preserving the running image, to avoid disturbing other
sessions on the shared cluster), /dev/shm is a 320 MiB tmpfs, the pod is
ready/restarts=0, the cache initialises with budget_bytes=268435456, and
a full 256 MiB write completes (exit 0, peak 256M/320M, 80%) with no
OOM against the 768Mi limit. The kind system pool was restored to its
pre-validation baseline (image, 64 MiB shm, 512Mi limit) afterwards.
Wiki. worker deployment-specification
gains a Shared memory (/dev/shm) section (the required memory-backed
mount + the three values that move together) and corrects the Resources memory
guidance (was 256Mi/"cap at 384Mi" with no /dev/shm mention — the latent
bug); the NOETL_IPC_CACHE_BUDGET_BYTES env-var row now points at it.
Pointers. ai-meta → ops f4df4c1 + ai-meta-wiki. Did NOT touch the #49
prod GKE cluster, the in-flight CQRS chokepoint work, or the unmerged CQRS
branch.
2026-06-18 — #103: CQRS 2d shadow validated; materializer proven sole-writer-capable; 2d-3 cutover designed + staged
Headline. The CQRS materializer (system/event_materializer) is now a
deployed, kind-validated, sole-writer-capable component — and the 2d-3
producer cutover that makes it the sole noetl.event writer is designed and
precisely staged. No production default flipped (operator-gated, like the prior
default-flips).
What shipped. ops#192 — the
system/event_materializer + system/projector system playbooks (+ looping
CronJobs). The materializer drains the noetl_events JetStream stream and writes
noetl.event via POST /api/internal/events/project (idempotent
ON CONFLICT DO NOTHING), talking only to the server's internal API
(data-access-boundary). Built on the in-flight feat/cqrs-2d1-materializer-playbook
branch, rebased + reconciled against the merged 2a/2b/2d-1/2d-2 server scaffold.
Live proof on kind (shadow; tailer on, synchronous INSERT still on).
Materializer reproduces the log byte-identically + idempotently:
events/project of 25 real off-server-orchestrate event rows →
{projected:0, duplicates:25} (every row already present, zero double-writes,
zero errors); a full playbook cycle returned {projected:0, duplicates:20}.
Tailer publishes every committed event; both consumers
(noetl_projector/noetl_materializer) ensured at startup. Off-server
orchestrate e2e (kind_validate_orchestrate_offserver.sh) stayed green with
the tailer on.
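The {projected, duplicates} counts follow directly from an insert that skips rows already present. A minimal in-memory sketch, assuming rows are keyed by an `event_id` field; the dictionary stands in for noetl.event and the function is illustrative, not the server's events/project handler.

```python
def project_events(store, rows):
    """Insert rows keyed by event_id, skipping ones already present.

    Mirrors the effect of ON CONFLICT DO NOTHING: an existing row is
    left untouched and counted as a duplicate rather than rewritten.
    """
    projected = 0
    duplicates = 0
    for row in rows:
        if row["event_id"] in store:
            duplicates += 1
        else:
            store[row["event_id"]] = row
            projected += 1
    return {"projected": projected, "duplicates": duplicates}
```

In shadow mode the synchronous INSERT has already written every row, so replaying the same 25 rows gives projected 0 and duplicates 25; once the materializer is the only writer, the first delivery projects and any redelivery lands in duplicates.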
Findings (fixed/surfaced). (1) Playbook stall: batch: 500→25 — a drain over
the worker's 100 KiB inline-context budget stages to a _ref and
{{ drain_events.count/messages }} stop resolving (fixed in ops#192). (2) The
system-pool /dev/shm (64 MiB k8s default) < the worker's 256 MiB Arrow IPC cache
budget → SIGBUS under shm-heavy load (surfaced separately; not required with
batch=25).
2d-3 cutover scope (designed, NOT shipped). Discovered the sole-writer cutover
is a server-wide ~18-site event-write chokepoint refactor (no single
chokepoint today) + a default-off NOETL_EVENT_INGEST_PUBLISH_ONLY gate (publish
the normalized row instead of INSERT; materializer becomes sole INSERTer) +
orchestrator-trigger relocation to the materializer write endpoint
(read-your-writes — the trigger reads noetl.event with a COUNT consistency
check today, so it hard-depends on synchronous writes). Correctness-critical,
multi-round, deliberately not rushed. Tracked on
#103; the producer cutover is an
operator decision for relay. Cluster restored to baseline.
2026-06-18 — #111: e2e topology coverage for the off-server orchestrate drive + server-API-only gap assessment
Headline. Closed the missing-e2e gap on the Server-Dissolution program's step 2 (#107): the worker-driven orchestrate topology (default-on since server v3.28.0 / #108, shadow retired in #110) had been validated ad-hoc during the shipping sessions but had no committed, repeatable rig. Added one, and used the wrap-up to produce the server-API-only status assessment + surface two operator decisions.
What landed.
- e2e rig scripts/kind_validate_orchestrate_offserver.sh (e2e#59, squash-merged → e2e 977efc2). Self-contained kind rig over the fanout_reduce_phase6 fixture; hard-asserts the drive runs off-server on the system pool: final COMPLETED; 0 __orchestrate__ rows in noetl.event (no event-log burst); __orchestrate__ rows present in noetl.command (off-server dispatch); noetl_orchestrate_drive_total dispatched + applied both advance with no decode_error (the server applied a worker-computed result); noetl_orchestrate_shadow_total absent (post-#110). Documented under docs/operations/local-kind.md.
- Live validation on kind-noetl against server v3.28.0 (post-#110 oc-noshadow image, drive ON): PASS; COMPLETED, event rows = 0, 4 off-server drives, dispatched=applied=4, 0 decode errors, shadow series absent. Kind restored to its as-found baseline (oc-pool image, DRIVE=false, SHADOW=false) afterward.
- Tracking #111 opened (ai-task, repo:e2e; board → In progress) as the durable home for the e2e coverage + the gap assessment + the operator decisions.
Server-API-only assessment. Step 2 (orchestrator → plug-in) is complete —
the evaluate loop runs off-server. The server is not yet API-only: it
remains the sole writer to noetl.event/noetl.command (the drive's apply
path, apply_worker_orchestration → apply_orchestration_result in
server/src/handlers/events.rs) and still rebuilds WorkflowState from
events to bound the drive input. Both move under the later program steps —
CQRS write-path (#103) makes the materializer sole writer; NATS-as-WAL (#104) +
Postgres-demotion (#107 step 4) remove the from-events rebuild. Not owned by
the orchestrator-dissolution thread.
Surfaced for the operator (not done unilaterally).
- (A) Retire the in-process drive fallback: gated on prod adopting a post-#108 image first. Prod GKE still runs server-rust:batch-dispatch-v1 (pre-#108), so the worker-driven drive is not live in prod; removing the =false revert now is premature.
- (B) Reap __orchestrate__ delivery rows: each drive writes one PENDING row to noetl.command (worker_id=null) that is never reconciled to terminal (its lifecycle events are suppressed from noetl.event). Accumulates one row per drive (~694 in a single #108 soak). Wants a TTL / mark-terminal-on-apply / separate-table strategy. Scale-relevant, not a correctness bug.
Pointers. ai-meta repos/e2e bumped 94aa7f1 → 977efc2. Handoff
2026-06-18-orchestrate-plugin-dissolution round-02-result.
Headline. Server-slimming follow-up to the now-closed #108. The in-server
shadow was the slice-4 cutover-confidence harness — it ran the
system/orchestrate plug-in inside the server via an embedded wasmtime host
and diffed its commands against the in-process drive (529 match / 0 mismatch).
With the worker-driven drive default-on and proven, that harness is dead weight:
the live drive uses the worker's wasmtime host, never the server's. Retiring
it drops the heavy wasmtime server dependency and collapses the build to one
config.
What landed. server#234 (squash
f3043c9, refactor: → no version bump, stays v3.28.0) removed
src/orchestrate_shadow.rs, the orchestrate-shadow cargo feature + the
optional wasmtime dep (the cranelift/wasmtime tree — ~1000 Cargo.lock lines —
fell out; cargo tree -i wasmtime now matches nothing), the
trigger_orchestrator_inner shadow hook (shadow_pre_state + shadow_diff),
the main.rs boot loader, the orchestrate_plugin_shadow config field +
NOETL_ORCHESTRATE_PLUGIN_SHADOW, the noetl_orchestrate_shadow_total metric,
and --features orchestrate-shadow from the Dockerfile. Kept
noetl-orchestrate-plugin's run_state (the drive uses it) and
NOETL_ORCHESTRATE_PLUGIN_DRIVE (default true).
Validation. cargo build/test/clippy --all-targets clean (single config).
Kind smoke on a 4-page self-contained cursor loop (tests/oc_smoke/cursor,
alternating fetch/check steps), freshly-built image, both drive modes:
drive=false (in-process — the exact edited path) → COMPLETED, 0 __orchestrate__
rows in noetl.event; drive=true (worker-driven default) → COMPLETED, 0
__orchestrate__ rows, 10 drive commands on the system pool
(dispatched=applied=10, event_suppressed=30, skipped_in_flight=2).
noetl_orchestrate_shadow_total confirmed gone from /metrics. Kind cluster
restored to the as-found baseline (image oc-pool, drive=false).
Surprise (unrelated to #110, flagged): a self-referencing cursor arc
(fetch_page → fetch_page) stalled the worker-driven drive in RUNNING after one
iteration — restructuring to the proven alternating two-step shape fixed it. Arc
evaluation lives in noetl-orchestrate-core (untouched by this PR), so it's a
possibly-real pre-existing limitation of self-loop arcs, not a regression.
Pointers. ai-meta → server f3043c9; wiki noetl-server-wiki@be76279
(deployment-specification env-var catalogue trimmed). Handoff thread
handoffs/active/2026-06-18-orchestrate-plugin-dissolution/. Closes
#110.
Headline. Flipped NOETL_ORCHESTRATE_PLUGIN_DRIVE to default true
(server#233, v3.28.0 →
server@80cc0e6) — the orchestrate-on-the-system-pool drive (the dissolution
path) is on by default instead of opt-in. This is the last item of
#108; the issue is closed and
the orchestrator-as-plug-in step of the dissolution program
(#107) is complete.
Why it was safe now. The flip was deferred as a production-policy decision until the drive was proven off-server, burst-free, and pool-isolated at scale — all of which shipped (slices 1–3 #229, zero-event-burst #230/#231, system-pool isolation #232 + worker#114 + ops#191).
Validation (kind, staged). Images built from the released tips (server v3.27.0 / worker v5.33.0).
- Pre-flip scale soak, drive ON via env (code default still false): a single cursor+fan-out test_pft_flow_v2 (3 fac × 40 pat, page_size 1) COMPLETED in 511s with 694 drives (dispatched=applied, decode_error=0); system pool +694 (= the drives), shared pool +671 (real steps only) → full isolation; __orchestrate__ rows in noetl.event = 0 (event_suppressed +2082); 0 errors; 23 distinct workflow steps. 5× concurrent self-contained cursor = 5/5 COMPLETED, all drives system-isolated, 0 burst. (A 3× concurrent PFT run saw 2 fixture-level DDL deadlocks on shared pft_test_* tables: a fixture artifact, not a drive defect; the drive stayed correct through them.)
- Post-flip, drive via the new code default (no env var): PFT 2×30 COMPLETED, 361 drives, system +361 / shared +349, 0 burst; identical shape to explicit-on. Drive metrics started at 0 on the fresh pod and climbed to 361, confirming the drive fired purely from the default.
- Regression: 15/15 normal fixtures green (python/http/postgres/duckdb/loops/fanout/sub-playbooks/cursor+offset pagination/control-flow/vars/args).
- Revert verified: NOETL_ORCHESTRATE_PLUGIN_DRIVE=false on the flipped image → simple_python COMPLETED with system delta 0, dispatched 0 → clean fallback to the in-process drive (trigger_orchestrator_inner, kept), no rebuild.
Revert. NOETL_ORCHESTRATE_PLUGIN_DRIVE=false on the deployment —
per-deployment, immediate, no rebuild. Or revert server#233.
Pointers. ai-meta → server 80cc0e6 (v3.28.0) + worker 437b0be (v5.33.0,
release hygiene) + server-wiki 0210012 (deployment-spec default + revert).
Board #3: #108 → Done.
Headline. Closed a committed-config defect complementary to follow-up (b)'s
server/worker execution_pool decline: the system-pool deployment bound NATS
consumer noetl_worker_system_rust, but its KEDA scaler reads
noetl_worker_pool_system and the other pools follow the
noetl_worker_pool_<segment> convention with deployment-name == scaler-name.
The system pool was the lone exception — the scaler watched a consumer the
worker never created (no backlog scaling), and the live kind state had drifted
to a hand-applied noetl_worker_system_kindtest present in no committed
manifest (the very broad-filter drift the server-side fix guards against).
ops#191 → ops@4816af0.
Fix. Aligned the dev system-pool deployment's NATS_CONSUMER to
noetl_worker_pool_system. Single-stream consumer-filter affinity model
unchanged; the filter_subject (noetl.commands.system.>) isolates the
consumer and the shared pool's consumer (noetl.commands.shared.>) cannot see
system subjects.
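The isolation claim rests on NATS subject matching, where `*` matches exactly one token and `>` matches one or more trailing tokens. A small sketch of that rule, assuming a hypothetical execution id in the subject and a hypothetical broad filter (`noetl.commands.>`) to illustrate drift:

```python
def subject_matches(filter_subject, subject):
    """NATS-style match: '*' is one token, '>' is one or more trailing tokens."""
    f_tokens = filter_subject.split(".")
    s_tokens = subject.split(".")
    for i, tok in enumerate(f_tokens):
        if tok == ">":
            # '>' must be the last filter token and must consume at least one token.
            return i == len(f_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False
        if tok != "*" and tok != s_tokens[i]:
            return False
    return len(f_tokens) == len(s_tokens)


SYSTEM_FILTER = "noetl.commands.system.>"
SHARED_FILTER = "noetl.commands.shared.>"
BROAD_FILTER = "noetl.commands.>"              # hypothetical drifted filter
drive_subject = "noetl.commands.system.12345"  # hypothetical execution id
```

Under these rules the system filter accepts the drive subject and the shared filter rejects it, while a broad filter would accept both pools' subjects, which is the drift the committed manifests are meant to prevent.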
Kind-validated (drive on, cursor+fan-out test_pft_flow_v2, 1 facility × 3
patients): the noetl_worker_pool_system consumer claimed all 44 orchestrate
drives (Δ+44 = noetl_orchestrate_drive_total{dispatched=44, applied=44},
skipped_in_flight=3); the shared consumer received only the 42 real
postgres/http step commands; the orphaned kindtest consumer received 0.
Execution COMPLETED across 23 distinct workflow steps; __orchestrate__ rows
in noetl.event = 0. Validated on the :oc-pool worker image (pre-#114), so
this proves the consumer-filter affinity independent of the #114 decline layer.
Cluster restored to clean stable state (drive off; orphaned kindtest consumer
removed). Prod cluster untouched.
Pointer bump. ai-meta repos/ops → ops@4816af0. #108 stays open for (c)
the deliberate default-flip (production-policy; not flipped).
Headline. The worker-driven orchestrate drive now runs on the dedicated
system pool, not the default pool — pool affinity that survives a JetStream
consumer whose filter_subject drifted broad.
server#232 (server@846166b) +
worker#114 (worker@e2162b7).
Kind-validated.
Root cause (corrected). There is NO worker HTTP pending-poll — a worker claims only what its NATS consumer delivers. The system-routed orchestrate command was claimed by the default pool because that pool's durable consumer's filter had drifted broad (durable consumers keep their creation-time filter) and, with no pool-affinity check, it won the claim race.
Fix. Server (publish_command_notification) stamps the resolved pool segment
on the notification as execution_pool ("update context for dispatch"); worker
gains CommandNotification.execution_pool + segment_from_filter(NATS_FILTER_SUBJECT)
and NatsCommandSource::next declines (ACK + skip) a notification whose
execution_pool differs from its own segment — the correct pool's independent
delivery then claims it. Enforced only when both name a concrete segment
(backward compatible).
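The decline rule can be sketched as follows. This is a Python illustration of the described behavior, not the worker's Rust code; the assumption that the segment is the third subject token is inferred from the filter subjects quoted in this log.

```python
def segment_from_filter(filter_subject):
    """Extract the pool segment from 'noetl.commands.<segment>.>' (assumed layout).

    Returns None when the filter does not name a concrete segment.
    """
    tokens = filter_subject.split(".")
    if len(tokens) < 3 or tokens[:2] != ["noetl", "commands"]:
        return None
    segment = tokens[2]
    return None if segment in ("*", ">") else segment


def should_decline(execution_pool, own_filter_subject):
    """True when the worker should ACK + skip a notification meant for another pool."""
    own_segment = segment_from_filter(own_filter_subject)
    # Enforced only when both sides name a concrete segment (backward compatible).
    if execution_pool is None or own_segment is None:
        return False
    return execution_pool != own_segment
```

A shared-pool worker therefore declines a notification stamped `system`, while a notification with no `execution_pool` (an older server) is processed as before.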
Validation (kind, server + both worker pools on new images, drive on):
test/simple_python → COMPLETED; __orchestrate__ commands pulled + claimed +
executed on the system pool (3), zero on the default pool. 553 server +
196 worker tests green.
Remaining for #108: only (c) — the deliberate default-flip (staged
production-policy). The worker-driven drive is now functionally complete +
scale-hardened (off-server, cursor/fan-out, zero noetl.event burst, system-pool
isolated).
Aside (read-only): prod is already fully on the Rust stack (the #49 flip
happened ~4 days ago; noetl Service → noetl-server-rust, both secrets present),
on an image predating this #108 work; the ops#178 cutover runbook is stale as a
"prep" doc.
Pointers. server#232 → server@846166b; worker#114 → worker@e2162b7.
Headline. Closing the directive that system pool playbooks keep only their own
state, not workflow events: the worker-driven __orchestrate__ meta-command now
writes 0 rows to noetl.event (down from 5/drive).
server#231 → server@9438f3b.
Kind-validated.
What. dispatch_orchestrate_command stops writing command.issued to
noetl.event — the command's record lives only in noetl.command (its own state,
fatal-on-error as the sole delivery row). claim_command/get_command read
noetl.event first (a pri ordering keeps it authoritative, so normal commands'
claim path is byte-for-byte unchanged) and fall back to noetl.command only on a
miss (the event-free meta-command). Combined with slice 4b's lifecycle suppression,
__orchestrate__ writes 0 of its former 5 rows.
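One way to picture the event-first lookup with a priority ordering is a union query that ranks the event source ahead of the command source. The sketch below uses SQLite as a stand-in for Postgres; the table shapes, column names, and the `pri` values are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE event   (command_id TEXT, payload TEXT);  -- stand-in for noetl.event
    CREATE TABLE command (command_id TEXT, payload TEXT);  -- stand-in for noetl.command
    INSERT INTO event   VALUES ('step-1', 'from event');
    INSERT INTO command VALUES ('step-1', 'from command');
    INSERT INTO command VALUES ('orch-1', 'from command'); -- event-free meta-command
""")

# pri 0 (event) always sorts ahead of pri 1 (command), so the event row stays
# authoritative whenever it exists and the command row is used only on a miss.
LOOKUP = """
    SELECT payload, pri FROM (
        SELECT payload, 0 AS pri FROM event   WHERE command_id = :id
        UNION ALL
        SELECT payload, 1 AS pri FROM command WHERE command_id = :id
    ) ORDER BY pri LIMIT 1
"""


def get_command(command_id):
    row = conn.execute(LOOKUP, {"id": command_id}).fetchone()
    return row[0] if row else None
```

A normal command resolves from the event row exactly as before, and only the meta-command, which has no event row, falls through to the command table.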
Validation (kind, drive on, small cursor+fan-out): __orchestrate__ rows in
noetl.event = 0; COMPLETED via the noetl.command fallback; 20 real workflow
steps drove normally (shared claim path unaffected); 0 errors. At 10×1000 this
removes thousands of infrastructure rows from the burst.
Remaining for #108: (b) NATS stream/consumer affinity so the system pool actually claims the drive (ops); (c) the deliberate default-flip (staged).
Headline. Per the directive that system pool playbooks are part of the NoETL
ecosystem and must not flood the workflow event log: the __orchestrate__
meta-command's lifecycle events are no longer persisted to noetl.event.
server#230 → server@6aef3a6.
Kind-validated.
Postgres burst fix. The meta-command is infrastructure, not a workflow step —
yet each drive wrote 5 rows (issued/claimed/started/call.done/completed). At scale
that bursts noetl.event + Postgres for no benefit (the drive state is a pure
function of the real step events; the result is applied from the in-memory
call.done payload). The server now skips persisting them: handle_event_inner
for the worker-emitted ones, claim_command for command.claimed. Validated:
__orchestrate__ now writes ONLY the lone command.issued delivery row (1 of 5 —
80% fewer rows); the small cursor+fan-out test_pft_flow_v2 still COMPLETED.
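The row arithmetic can be checked with a small filter. This is an illustrative sketch: the full event-type names are assumed from the abbreviated list above (`command.started` in particular is a guess at the spelling), and `should_persist` is a hypothetical helper, not a server function.

```python
RESERVED_STEP = "__orchestrate__"

# Lifecycle of one command; names expanded from issued/claimed/started/call.done/completed.
LIFECYCLE = [
    "command.issued",
    "command.claimed",
    "command.started",   # assumed spelling
    "call.done",
    "command.completed",
]

# After this change only the delivery row is still written for the meta-command.
KEPT_FOR_DELIVERY = {"command.issued"}


def should_persist(event_type, node_name):
    """Real workflow steps always persist; the meta-command keeps only its delivery row."""
    if node_name != RESERVED_STEP:
        return True
    return event_type in KEPT_FOR_DELIVERY


persisted = [e for e in LIFECYCLE if should_persist(e, RESERVED_STEP)]
reduction = 1 - len(persisted) / len(LIFECYCLE)  # fraction of rows no longer written
```

Here `persisted` holds one of the five lifecycle events, which is the 80% reduction quoted above; a real step such as `fetch_patients` (a made-up name) would still persist all five.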
System-pool routing. dispatch_orchestrate_command routes the drive to the
system segment (noetl.commands.system.<eid>). Honest scope: the server
publishes there, but on kind the default pool still claims it via the HTTP
pending-poll — true isolation needs a NATS stream/consumer-affinity fix in ops.
The drive stays functional (resilient via the poll).
Follow-ups: (a) eliminate the last command.issued via a noetl.event-free
claim path (claim/get reading noetl.command) → zero event rows for the
meta-command; (b) NATS affinity so the system pool actually claims the drive.
Pointers. server#230 → server@6aef3a6.
Headline. The worker-driven drive (#108)
handles the complex orchestrator surface, not just the linear case. Ran
test_pft_flow_v2 under drive mode (small workload: 1 facility × 3 patients,
page_size 1) on kind → COMPLETED, 0 errors. No code change — the slice-3 drive
already handles cursors.
What it exercised. 20 distinct workflow steps; cursor fan-out (fetch_* each
issued 5 commands across pages); a multi-facility loop (load_next_facility ran
twice); 43 __orchestrate__ drive round-trips on the worker
(dispatched=43, applied=43), the in-flight guard firing once
(skipped_in_flight=1 — concurrent triggers serialized); __orchestrate__ did
NOT leak as a workflow step across all 43 round-trips (state guard holds under
cursors); playbook.completed emitted.
Remaining (deliberate, not rushed). System-pool routing (the orchestrate
command currently co-locates on the execution's segment); and the default-flip
(NOETL_ORCHESTRATE_PLUGIN_DRIVE default-true + retire trigger_orchestrator_inner)
— a production-policy change recommended via staged rollout (opt-in → one
deployment → soak at true scale → then default), not a unilateral flip. The
dissolution's core (drive on the pool) is proven for linear AND cursor/fan-out
workloads.
2026-06-17 — 🎯 the orchestrator drive runs OFF-SERVER on the worker pool (#108 slice 3, kind-validated)
Headline. The dissolution milestone: with NOETL_ORCHESTRATE_PLUGIN_DRIVE=on
the orchestrator drive runs on the worker pool, not in the server.
Kind-validated end to end — a real execution drives start→end→completed entirely
through the worker round-trip. server#229
→ server@465cdbb (v3.23.0).
How. (1) Scheduler (trigger_orchestrator_inner): in drive mode the server
issues one system/orchestrate command (entry: run_state, args = the bounded
WorkflowState + playbook + trigger) to the worker pool via
dispatch_orchestrate_command — no in-process evaluate; an orchestrate_in_flight
cache flag serialises drives per execution. (2) Worker runs the plug-in
(run_state, worker#113) and returns the OrchestrationResult (base64
output_b64 on the call.done event). (3) Apply-on-callback (handle_event_inner):
a call.done for __orchestrate__ → apply_worker_orchestration → decode →
apply_orchestration_result (slice 2) emits events + issues the real commands;
their completions re-trigger the drive, the meta-command's own events never do
(loop-safe). (4) State guard (apply_event): __orchestrate__ events are ignored
so the meta-command never phantom-creates a workflow step.
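The per-execution serialization in step (1) can be sketched as a small guard. This is a simplified in-memory model under assumed names (`InFlightGuard`, `trigger`, `on_call_done`); the server's cache and its metric wiring are not shown in this log.

```python
class InFlightGuard:
    """Allow at most one outstanding __orchestrate__ command per execution."""

    def __init__(self):
        self.in_flight = set()
        self.metrics = {"dispatched": 0, "applied": 0, "skipped_in_flight": 0}

    def trigger(self, execution_id):
        """Return True if a drive was dispatched, False if one is already running."""
        if execution_id in self.in_flight:
            self.metrics["skipped_in_flight"] += 1
            return False
        self.in_flight.add(execution_id)
        self.metrics["dispatched"] += 1
        return True

    def on_call_done(self, execution_id):
        """Clear the flag when the worker's result has been applied."""
        if execution_id in self.in_flight:
            self.in_flight.discard(execution_id)
            self.metrics["applied"] += 1
```

Two triggers arriving back to back for the same execution produce one dispatch and one skip; once the result is applied, the next real-step completion can dispatch again.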
Validation (server oc-drive2 + worker oc-drive, drive on): test/simple_python
→ COMPLETED; noetl_orchestrate_drive_total{dispatched=2, applied=2}, zero
decode_error, zero skipped; 2 __orchestrate__ round-trips, real steps start+end
ran on the worker, playbook.completed emitted; __orchestrate__ did NOT leak as
a workflow step (steps entered = ['end']). A first-pass bug was caught + fixed:
output_b64 rides call.done, not command.completed (lifecycle-only).
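A sketch of the decode step that the first-pass bug touched: the result is read from `call.done`, never from `command.completed`. The `event_type` field name and the exact event layout are assumptions; `node_name`, `data.output_b64`, and the three result fields come from this log.

```python
import base64
import json

RESERVED_STEP = "__orchestrate__"


def decode_orchestration_result(event):
    """Return the decoded result for an __orchestrate__ call.done, else None."""
    if event.get("node_name") != RESERVED_STEP:
        return None
    if event.get("event_type") != "call.done":
        return None  # command.completed is lifecycle-only and carries no output
    raw = base64.b64decode(event["data"]["output_b64"])
    return json.loads(raw)


# Illustrative payload only; real results carry full command objects.
sample = {"commands": [{"step": "end"}], "completion": None, "events_to_emit": []}
call_done = {
    "event_type": "call.done",
    "node_name": RESERVED_STEP,
    "data": {"output_b64": base64.b64encode(json.dumps(sample).encode()).decode()},
}
```

Passing a `command.completed` event for the same step returns `None`, which mirrors why reading the result from that event found nothing.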
Safety. Default off; the in-process drive is the untouched fallback. core 124 + server 553 tests green; clippy clean both build configs.
Worker-driven arc complete through the drive: slice 1 (worker entry/run_state)
→ slice 2 (apply_orchestration_result extracted) → slice 3 (drive on the pool,
validated). Remaining: slice 4 — shadow→flip at scale (PFT under drive), make drive
the default, retire trigger_orchestrator_inner; route the orchestrate command to
the dedicated system pool.
Pointers. server#229 → server@465cdbb
(v3.23.0). Server wiki: deployment-specification env-var NOETL_ORCHESTRATE_PLUGIN_DRIVE.
2026-06-17 — worker-driven cutover slice 2: apply_orchestration_result extracted + slice 3 designed (#108)
Headline. Slice 2 of the worker-driven cutover
(#108): the post-evaluate emission
logic (emit pure events → issue commands → terminal event) is extracted verbatim
from trigger_orchestrator_inner into a reusable apply_orchestration_result(...),
so the worker-driven drive can apply an OrchestrationResult computed on a worker
the same way the in-process drive applies its own.
server#228 → server@586aeae.
Behavior-preserving (553 tests green, clippy clean both configs); internal refactor.
Slice 3 designed (grounded). Checked the load-bearing risk: apply_event
(state.rs:408) does steps.entry(name).or_insert_with(StepInfo::new) for every
node_name on command.issued/command.completed — so issuing the orchestrate
"meta" command as a normal command would create a phantom step and pollute the
drive state. The design: (1) flag NOETL_ORCHESTRATE_PLUGIN_DRIVE (default off,
in-process fallback); (2) scheduler issues one step: __orchestrate__ wasm command
(system/orchestrate, entry: run_state, args: OrchestrateStateInput) to the
system pool instead of evaluating in-process; (3) an apply_event/from_events
guard ignores command.* events for the reserved __orchestrate__ step so it never
pollutes state; (4) on the __orchestrate__ completion the server decodes the
OrchestrationResult (data.output_b64) and calls apply_orchestration_result —
not a re-dispatch; (5) the loop: real-step completion → dispatch orchestrate →
apply → real commands → … (the meta-command's own lifecycle never re-triggers).
State coherence holds (drive state is a pure function of the real-step events).
Lands behind the flag, kind-validated by driving a real execution through the
round-trip before the shadow→flip.
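The phantom-step risk and the guard in (3) can be illustrated with a toy state fold. This is a Python sketch of the idea, not `state.rs`; the dictionary-based step record and the `event_type` field name are assumptions.

```python
RESERVED_STEP = "__orchestrate__"


def apply_event(steps, event):
    """Fold one event into the step map, ignoring the reserved meta-command."""
    node = event["node_name"]
    if node == RESERVED_STEP and event["event_type"].startswith("command."):
        return  # guard: without this, setdefault below would create a phantom step
    # Equivalent of steps.entry(name).or_insert_with(StepInfo::new)
    steps.setdefault(node, {"events": []})["events"].append(event["event_type"])


def from_events(events):
    steps = {}
    for event in events:
        apply_event(steps, event)
    return steps
```

Because the guard drops the meta-command's events before the insert, the folded state depends only on real-step events, which is the coherence property the design relies on.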
Pointers. server#228 → server@586aeae (internal refactor, stays version 3.22.0). Design: #108 comment.
Headline. First worker-side slice of the worker-driven orchestrator cutover
(#108) — moving the drive off the
server onto the worker pool. A tool: {kind: wasm, plugin: {path, version, entry}}
command can now name the guest export to invoke; the worker-driven orchestrator
will dispatch system/orchestrate with entry: "run_state".
worker#113 → worker@04420d0.
What. WasmPluginHost::invoke_bytes_with_entry +
WasmDispatcher::run_by_ref_entry/run_and_apply_by_ref_entry (the originals
delegate with "run", so existing call sites + tests are untouched);
wasm_config_to_ref parses optional plugin.entry (default run). Same
data-plane ABI, only the export name differs. Test proves run→0xAA vs
run_state→0xBB dispatch + a missing-export error. 194 worker tests green;
new code clippy-clean. Purely additive — no live behavior change (nothing
issues a run_state command yet).
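The entry-point selection can be modeled in a few lines. This is a Python stand-in for the Rust host: `wasm_config_to_ref` mirrors the name in the log, while `invoke` and the dictionary of exports are hypothetical substitutes for a wasm instance.

```python
def wasm_config_to_ref(tool):
    """Parse a wasm tool config; plugin.entry is optional and defaults to 'run'."""
    if tool.get("kind") != "wasm":
        raise ValueError("not a wasm tool")
    plugin = tool["plugin"]
    return {
        "path": plugin["path"],
        "version": plugin["version"],
        "entry": plugin.get("entry", "run"),
    }


def invoke(exports, ref, payload):
    """Call the named guest export; a missing export is an error, not a fallback."""
    try:
        export = exports[ref["entry"]]
    except KeyError:
        raise LookupError(f"plug-in has no export named {ref['entry']!r}") from None
    return export(payload)


# Two fake exports that return distinct bytes, echoing the 0xAA vs 0xBB test.
exports = {"run": lambda _: b"\xaa", "run_state": lambda _: b"\xbb"}
```

A config without `entry` keeps hitting `run`, so existing playbooks are unaffected, while `entry: "run_state"` reaches the second export through the same byte-in, byte-out path.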
Remaining (server side, the hot path). (2) Scheduler — on a trigger, issue an
orchestrate command (entry: run_state, input = OrchestrateStateInput) to the
system worker pool instead of driving in-process; (3) Apply — on that command's
completion, decode the OrchestrationResult and emit (extracted from
trigger_orchestrator_inner) + loop-prevention; (4) shadow-then-flip cutover,
then retire trigger_orchestrator_inner. All behind a default-off flag with the
in-process drive as the fallback. The per-drive server→worker→server round-trip is
the real architectural shift (drive CPU distributes across the pool).
Pointers. worker#113 → worker@04420d0 (stays v5.31.2). Worker wasm-plugin host has no wiki page (pre-existing #105 gap) — flagged for a Rule-1 backfill.
Headline. Slice 4 of the plug-in round
(#108): the orchestrator now runs
the system/orchestrate plug-in alongside the in-process drive on every
evaluation and diffs the emitted commands — the cutover-confidence gate. The
in-process result stays authoritative (observation only) → zero risk to the live
platform. server#227 → server@bd652ab.
Kind-validated over the live 10×1000 PFT: 529 evaluations, ZERO divergence.
What. The plug-in gains a state-input path (OrchestrateStateInput +
run_state export) so the shadow hands it the same WorkflowState the
in-process evaluate_state consumes (no event-slice/snapshot reconstruction to
confound the diff). New src/orchestrate_shadow.rs (feature orchestrate-shadow,
optional wasmtime): a process-global wasmtime host (fresh Store/Instance per
call, mirrors the worker invoke ABI) loaded from noetl.plugin_module at boot;
trigger_orchestrator_inner clones the pre-evaluate state + diffs after (command-set
identity: parsed Value equality, per the slice-2 key-order finding). Metric
noetl_orchestrate_shadow_total{result=match|mismatch|error}; config flag
orchestrate_plugin_shadow; Dockerfile builds with the feature (runtime-gated by
NOETL_ORCHESTRATE_PLUGIN_SHADOW, default off). Always-present no-op wrappers so
the production build (default, no wasmtime) is unaffected.
Validation (kind, image oc-shadow, shadow on). noetl_orchestrate_shadow_total{result="match"} 529 is the only label; zero mismatch, zero error, 0
divergence log lines; server + workers stable (r=0); boot log: orchestrate shadow host loaded bytes=1603665. Both build configs green; 553 server + 3 plug-in tests;
clippy clean.
The arc so far. Slices 1-4 prove orchestrator-as-plug-in end to end: the plug-in exists (0-import wasm, slice 1) → runs identically in wasmtime (slice 2) → is deployed + servable (seed-on-boot, slice 3) → drives the real workload identically, live (shadow, slice 4).
Next. The worker-driven cutover — the kernel scheduler dispatches
system/orchestrate on a worker (fetched from the registry), publishes the
emitted commands, and retires trigger_orchestrator_inner; the in-server shadow +
the wasmtime server dep come back out once the server no longer drives.
Pointers. server#227 → server@bd652ab
(Cargo.toml now 3.22.0, no tag cut). Server wiki: deployment-specification
env-var NOETL_ORCHESTRATE_PLUGIN_SHADOW (noetl-server-wiki@50f9965).
Headline. Slice 3 of the plug-in round
(#108): the server now bakes
the orchestrate wasm into its image and seeds built-in system plug-ins into
noetl.plugin_module on boot, so the worker pool can fetch system/orchestrate@1
by (path, version) + digest without an out-of-band operator POST.
server#226 → server@b21b589.
Kind-validated.
What. New src/system_plugins.rs: scan_system_plugins(dir) (pure — glob
*.wasm → read → sha256 digest → path system/<stem> → version 1; unit-tested,
no DB) + seed_system_plugins(pool) (upserts each via plugin_module::upsert,
in-process — not the token-gated /api/internal/plugins surface, which is for
external registration). main.rs seeds after plugin_module::ensure_table,
non-fatal. A wasmbuilder Dockerfile stage builds plugins/orchestrate to wasm32
and bakes orchestrate.wasm → /opt/noetl/plugins/; NOETL_SYSTEM_PLUGIN_DIR
defaults there. .dockerignore now excludes nested **/target (the excluded
plug-in crate's local target is 1.8 GB and broke the context COPY). Digest-keyed
hot-reload: every boot re-seeds @1 with current bytes, so a new image hot-reloads
the pool with no version bump.
Validation (kind, image oc-seed). Boot log seeded system plug-in path=system/orchestrate version=1 digest=823dec… bytes=1559093 → count=1. GET /api/internal/plugins/system/orchestrate?version=1 → 200, application/wasm,
1559093 bytes, magic \0asm, ETag=digest; correct digest → 200, stale → 409.
Baked-file sha256 == served digest (823dec…) — byte-identical image → registry →
API. 553 lib tests green (2 new); clippy clean.
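The scan step and the digest check can be sketched in Python. `scan_system_plugins` follows the described pipeline (glob, read, sha256, `system/<stem>`, version 1); the hex digest encoding and the `serve_status` helper are assumptions for illustration.

```python
import hashlib
from pathlib import Path


def scan_system_plugins(directory):
    """Pure scan: every *.wasm in the directory becomes system/<stem> at version 1."""
    found = []
    for wasm in sorted(Path(directory).glob("*.wasm")):
        data = wasm.read_bytes()
        found.append({
            "path": f"system/{wasm.stem}",
            "version": 1,
            "digest": hashlib.sha256(data).hexdigest(),  # hex encoding is assumed
            "bytes": len(data),
        })
    return found


def serve_status(stored_digest, requested_digest=None):
    """200 when no digest is pinned or it matches; 409 when the caller's digest is stale."""
    if requested_digest is None or requested_digest == stored_digest:
        return 200
    return 409
```

Because the digest is derived from the bytes on every boot, a new image with a rebuilt plug-in changes the digest while the version stays at 1, which is what lets workers detect the change without a version bump.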
Next. (4) kernel scheduler (NOETL_ORCHESTRATE_PLUGIN, default off) — dispatch
the now-registered plug-in alongside the in-process orchestrator, live-shadow over
the PFT, flip after green.
Pointers. server#226 → server@b21b589
(internal — stays v3.20.0). Server wiki: deployment-specification env-var
NOETL_SYSTEM_PLUGIN_DIR (noetl-server-wiki@5cb58cd).
Headline. Slice 2 of the plug-in round
(#108): a wasmtime shadow-diff
proves the system/orchestrate .wasm doesn't just compile — it executes
identically inside the same wasmtime contract the worker uses.
server#225 → server@ccec104.
What. tests/shadow_diff.rs loads the built plug-in through a harness
mirroring the worker host's invoke_bytes ABI byte-for-byte (alloc → write →
run(ptr,len) → unpack packed i64 → read; fresh Store/Instance per call; 0
imports → bare instance) and asserts the wasm output equals the native drive over
two fixtures: auth0 multi-arc when: routing (exercises the minijinja template
engine in wasm) and a cold-start linear flow. wasmtime dev-dep pinned to the
worker's major (27) so "runs in the harness" ⟺ "runs in the worker host".
Determinism finding (relevant for the slice-4 live shadow). The diff is
command-set identity (parsed Value equality), not raw-byte identity. The
drive builds a step context as a serde_json::Value map; with serde_json's
preserve_order in the tree, object key order is insertion order, and that
order traces to upstream HashMap iteration — which hashes differently on wasm32
vs the host arch. So wire bytes differ in key order while the value is identical.
The correct bar: the kernel scheduler deserializes the plug-in output to
Vec<Command> and persists it through the server's own encoder, so the plug-in's
wire bytes are transient — what must match is the command set. A canonical
sorted-key command encoding (if byte-identical persistence is ever wanted) is a
separable follow-up, not required for correctness.
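The difference between byte identity and value identity is easy to reproduce. The two payloads below are made-up commands that differ only in key order; `canonical` shows what a sorted-key encoding could look like if byte-identical persistence were ever wanted.

```python
import json

# Same command, different key order (as HashMap iteration order might produce).
host_bytes = b'{"step":"fetch","args":{"page":1,"size":50}}'
wasm_bytes = b'{"args":{"size":50,"page":1},"step":"fetch"}'


def same_command_set(a, b):
    """Compare parsed values; object key order does not affect equality."""
    return json.loads(a) == json.loads(b)


def canonical(raw):
    """Re-encode with sorted keys and fixed separators for a stable byte form."""
    return json.dumps(json.loads(raw), sort_keys=True, separators=(",", ":")).encode()
```

The raw bytes differ, the parsed values are equal, and the canonical encodings are byte-identical, which is the distinction the shadow diff draws.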
Validation. 2 unit + shadow-diff (2 fixtures) green; clippy clean; test
self-skips when the .wasm isn't pre-built (gate: cargo build --release --target wasm32-unknown-unknown && cargo test). Plug-in stays excluded — native
server build/test unaffected. Test-only; no kind validation needed.
Next. (3) catalog register/serve; (4) kernel scheduler
(NOETL_ORCHESTRATE_PLUGIN, default off) replacing trigger_orchestrator_inner,
live-shadowed over the PFT before the flip.
Pointers. server#225 → server@ccec104 (internal — stays v3.20.0).
Headline. First slice of the plug-in round
(#108, step 2 of the OS program
#107): a new standalone
plugins/orchestrate/ crate (in the server repo, next to the core, depending on
noetl-orchestrate-core by path) wraps the drive behind the worker plug-in ABI
and compiles to wasm32-unknown-unknown — the first non-trivial compiled
system playbook. server#224 →
server@10a629b.
ABI. Input bytes = JSON OrchestrateInput { events, playbook, trigger_event_type } (the bounded event slice + the catalog playbook — the same
read-set trigger_orchestrator loads); output bytes = JSON OrchestrationResult
(commands + completion + events_to_emit), or {"error": "..."} on a drive
failure. Data-plane = the host's memory + alloc(size)->ptr +
run(ptr,len)->packed contract (same as the reference-materializer).
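The alloc, write, run, unpack, read sequence can be simulated without a wasm runtime. Note the loud assumption: the log does not state how the packed i64 is laid out, so the sketch below assumes the pointer in the high 32 bits and the length in the low 32 bits, a common convention. `FakeGuest` is a hypothetical stand-in for the guest's linear memory and exports.

```python
def pack(ptr, length):
    # ASSUMED layout: pointer in the high 32 bits, length in the low 32 bits.
    return (ptr << 32) | length


def unpack(packed):
    packed &= 0xFFFFFFFFFFFFFFFF  # read a signed i64 as raw 64 bits
    return packed >> 32, packed & 0xFFFFFFFF


class FakeGuest:
    """Stand-in for a guest exposing memory, alloc(size) and run(ptr, len)."""

    def __init__(self, size=1024):
        self.memory = bytearray(size)
        self._next = 0  # trivial bump allocator

    def alloc(self, size):
        ptr = self._next
        self._next += size
        return ptr

    def run(self, ptr, length):
        data = bytes(self.memory[ptr:ptr + length])
        out = data.upper()  # placeholder for the real drive
        out_ptr = self.alloc(len(out))
        self.memory[out_ptr:out_ptr + len(out)] = out
        return pack(out_ptr, len(out))


def invoke_bytes(guest, payload):
    """Host side: alloc, write input, run, unpack the result, read output."""
    ptr = guest.alloc(len(payload))
    guest.memory[ptr:ptr + len(payload)] = payload
    out_ptr, out_len = unpack(guest.run(ptr, len(payload)))
    return bytes(guest.memory[out_ptr:out_ptr + out_len])
```

The host never interprets the payload; it only moves bytes in and out of guest memory, which is why the same contract serves both the reference-materializer and the orchestrate plug-in.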
The feasibility risk #108 flagged is retired. #108's "Hard parts" asked
whether the template/evaluator compiles to wasm32 or needs a host render
callback. The compiled .wasm answers it: zero imports — no WASI, no
render — so the whole drive (condition evaluator and minijinja template engine
included) runs in-guest. That's the payoff of #109
keeping the entire core wasm-resident (the "keep core compiled to wasm32" call).
Exports: exactly memory / alloc / run. Artifact 1.54 MB.
Validation. Native parity test — the shadow-diff in miniature —
orchestrate(json_bytes) reproduces native evaluate byte-for-byte (commands +
events_to_emit) on the auth0 multi-arc when: fixture; malformed input → error
envelope, no panic. wasm32 build green; import/export sections verified
programmatically. Server workspace unaffected: the crate is excluded from the
workspace so the native build/test/clippy never pull it in — 551 lib tests green,
clippy clean both the workspace and the plug-in crate. No kind validation needed:
nothing loads the plug-in yet, so the server binary's runtime is byte-identical.
Next. (2) worker host shadow-diff — load the .wasm, feed it the same
(events, playbook) the in-process orchestrator gets on the PFT, assert
byte-identical commands; (3) catalog register/serve; (4) kernel scheduler
(NOETL_ORCHESTRATE_PLUGIN, default off) replacing trigger_orchestrator_inner.
Design note: commands return over the byte path, so the command_emit capability
the original scope called for is optional (evaluate is referentially
transparent — it returns the full Vec<Command>).
Pointers. server#224 → server@10a629b (internal — server stays v3.20.0, no release tag).
Headline. The last slice of the orchestrator-as-plug-in extraction
(#108) landed: orchestrator.rs
— WorkflowOrchestrator, evaluate, evaluate_state — moved from
src/engine/ into the pure noetl-orchestrate-core crate
(server#223). With it, every
drive module — renderer, playbook model, commands, evaluator, state, and now
the orchestrator Event-type switch — compiles to both native (linked into
noetl-server, re-exported through src/engine/ so call sites are unchanged)
and wasm32-unknown-unknown (the seed for the future system/orchestrate WASM
plug-in). Event-ABI round #109
CLOSED.
What moved. evaluate(events: &[core::event::Event], …) now reads the pure
core::event::Event read-set defined in slice 1; the server converts its
db::Event (sqlx::FromRow, native-only) at the trigger_orchestrator
boundary via the existing From impl (4 conversion sites in
handlers/events.rs) — no production drive change, since the drive already
rebuilds state from converted events. Test fixtures ported to the core shape:
Utc::now() → fixed DateTime::from_timestamp(0, 0) (the core has no clock
under wasm), the db-serial .id ordering field → .event_id (the drive's real
ordering key), playbook::types:: → playbook::. From<CoreError> for AppError moved ahead of the test module (clippy items-after-test-module);
dropped a useless vec!.
Validation. 122 core tests green (native), 0 WASI imports on
wasm32-unknown-unknown; 565 server tests green; clippy clean on both targets.
cargo-chef Docker image built (noetl-server v3.20.0, 44.8 MB), loaded into kind,
server rolled out (container noetl-server, only the new image running).
kind e2e — PFT 10×1000 (test_pft_flow_v2, 10 facilities × 1000 patients,
page_size 1): full command lifecycle flowing (issued→claimed→started→call.done
→completed, 400 of each in the first event page), 0 errors, 0 restarts
across the server + all 8 worker pods, live forward progress confirmed (event
timestamps advancing in real time). The drive behaves identically on the
relocated core.
Pointers. server#223 →
server@bfd3f77 (internal refactor — stays v3.20.0, no release tag). Slice 4
(data-plane ABI + command_emit capability + kernel scheduler + shadow-diff →
live system/orchestrate plug-in) is the follow-on round, tracked under
#108. Standing direction honored:
Claude wrote the Rust directly, no Codex.
Headline. Captured the architecture thesis the in-flight umbrellas converge
on. New blueprint noetl_server_dissolution_and_global_grid.md
(docs#183 + docs#184):
the server dissolves into a stateless edge; the NATS JetStream WAL + object
store are the only durable state (no Postgres source of truth); all processing
is event-driven system/data playbooks on a sharded global grid. Named plainly,
NoETL is a distributed multitenant operating system — process = ephemeral
atomic block, scheduler = JetStream-lag pump, syscall = the WASM capability
ring, drivers = the tool registry, VFS = the locator namespace, journaling =
the WAL, isolation = shard-key + sandbox + capability + keychain — and the
foundation for a quantum-cloud-hybrid platform (QPU as a tool driver, a
circuit as an atomic block per no-cloning, hybrid as a cursor/loop, queue latency
as the callback rule; positioning, not a roadmap).
Program opened. #107 is the
strategic roof over #101–#105 with the 5-step path (CQRS cutover → orchestrator-as-plug-in
→ per-shard WAL → drop Postgres → cross-shard federation). Step 2 scoped in
#108 — extract the already-pure
evaluate/evaluate_state drive core into a system/orchestrate WASM plug-in +
a kernel scheduler replacing trigger_orchestrator, shadow-diffed before any flip.
Step 1 (CQRS cutover) is unblocked — its shadow gate went green this session via
the #106 fix.
#108 started + first slice landed. Directive: the orchestrator core stays
compiled to wasm32 (no host-side template carve-out). Spike proved minijinja +
serde_json compile to wasm32-unknown-unknown with no WASI imports. Then stood up
noetl-orchestrate-core (server#218)
— the runtime-free drive core compiling from one source to both the server
(native) and a wasm32 plug-in. First slice = the template renderer (the foundation
the evaluator/state/commands/orchestrator build on); the only server coupling
(AppError) became the crate's CoreError, mapped at the boundary. minijinja-for-wasm
recipe: default features minus loader + serde. 21 core + 651 server tests green;
cargo-chef Docker build works with the new workspace; kind-validated (wasm_smoke
playbook.completed, 0 template errors). Pure refactor.
Continued through the slices: playbook model (server#219, with the Residency enum untangled), then commands + evaluator (server#220). All four of evaluate's dependencies — renderer, type model, commands, evaluator — now compile to native + wasm32 from one source. E2E checkpoint PASSED on kind: the integrated server (4 of 6 modules extracted) drives test_pft_flow_v2 correctly through the extracted core — 356 commands issued, cursor fan-out, ctx.updated firing, 0 errors, 0 restarts; behavior-preserving. Each slice: 57 core + 615 server tests, wasm clean, cargo-chef Docker build green. Pointer f051779.
Event-ABI boundary designed + opened, slice 1 landed. Weighed minimal-vs-canonical event (the db-row→core-event conversion is unavoidable either way since db::Event is sqlx::FromRow, so unifying buys nothing on the boundary + pre-empts the #104 WAL design) → minimal core::event::Event, named to converge with EventEnvelope. Design note orchestrate_core_event_abi.md (docs#185); round issue #109.
Slice 1 (server#221): the pure event type + From<&db::Event> + chrono (no clock) → compiles to wasm32; determinism audit favorable (the drive's only real-output now() is a generated-event timestamp the shadow-diff normalizes; 6 of 8 sites are tests). Pointer 516ef17.
Slice 2 landed + e2e-validated (server#222): WorkflowState (state.rs) moved into the core on core::event::Event; the server converts db::Event at the drive boundary (orchestrator's evaluate + 4 events.rs sites). wasm portability surfaced + handled: Instant gated #[cfg(not(wasm32))], tracing + catalog_id (serde-default) added. E2E: PFT drives through the extracted state, with 98 commands, ctx.updated firing, 0 errors. 587 server + 86 core tests; pointer 8be15be.
5 of 6 modules now in the core (template, model, commands, evaluator, state). Last slice: move orchestrator/evaluate (the orchestrator Event-type switch + its fixture churn the boundary deferred) → then the whole drive core is wasm-resident and the plug-in round begins.
Pointers. docs dd391f3; program #107 + step-2 #108 on the roadmap boards.
Headline. Worked the queued PR backlog across the orchestrator-scaling + event-WAL line, merging the clean/validated work and validating the integrated result on kind. #105 closed (runtime complete).
Landed on main (each rebased onto a moved main, built + tested green before merge):
- CQRS #103 phase 2b/2d: projector owns projection_snapshot + noetl_materializer consumer + shared normalize_event_to_row / POST /api/internal/events/materialize (server#215, rebased; the stacked #204/#205/#206 diverged after #202's 2a squash + main's wasm advance, so they were superseded). All default-off. Build hotfix server#216 (the rebase left a rebuild_state call at the old 3-arg signature; v3.15.0 shipped non-compiling and was fixed within minutes).
- Batch event-log INSERT #102: N inserts → one multi-row QueryBuilder (server#199, v3.15.1).
- ctx/workload shim-dedup #101: server stops persisting the shims (the ~5MB command.issued), worker rebuilds them at render (server#207 v3.15.3 + worker#90 v5.31.2). Unflagged → deployed + validated together.
- Closed worker#89 (superseded; main already has inline_budget_bytes()).
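The batch-insert item above turns N round-trips into one statement. A minimal illustration of that technique using SQLite in place of Postgres and a hand-built statement in place of sqlx's QueryBuilder; the table and column names are assumptions.

```python
import sqlite3


def batch_insert(conn, rows):
    """Insert all rows with a single multi-row INSERT instead of one per row."""
    if not rows:
        return 0
    # One "(?, ?, ?)" group per row; values stay bound parameters, never inlined.
    placeholders = ", ".join(["(?, ?, ?)"] * len(rows))
    params = [value for row in rows for value in row]
    conn.execute(
        f"INSERT INTO event (execution_id, event_type, node_name) VALUES {placeholders}",
        params,
    )
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE event (execution_id TEXT, event_type TEXT, node_name TEXT)")
inserted = batch_insert(conn, [
    ("e1", "command.issued", "start"),
    ("e1", "command.claimed", "start"),
    ("e1", "call.done", "start"),
])
```

Databases cap the number of bound parameters per statement, so a production version would split very large batches into chunks.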
Final e2e (kind). Built server v3.15.3 + worker v5.31.2 from main, deployed to
the 8-pod Rust pool, ran test_pft_flow_v2 (distributed): 7300+ context-dependent
commands dispatched + completed, ZERO errors, 0 worker/server restarts, no OOM;
ctx.updated shows correctly-folded context (the shim-dedup'd worker rebuilds the
shims faithfully). The growing cursor backlog is the known 10×1000 scale the
umbrellas target, not a regression. Pointers bumped: ai-meta@d469e70.
refs-in-state phase 2 — already on main, now VALIDATED. The open phase-2 PRs (server#208 + worker#91) turned out to be stale duplicates of an old branch — the whole feature already landed on main (server hydrate_result_references(keep_refs) + cursor claim_ref resolution; worker build_extracted emit + resolve_context_references consume + resolve_ref). Both closed as superseded. What was never done — the flag flip — validated on kind (NOETL_REFS_IN_STATE=true, budget 2KB, test_pft_flow_v2): references kept in state (_ref noetl:// URI + arrow_ipc 50-row reference block), command context stays ~10KB (not the inline 50-row payload), cursor fans out (claim-ref resolved, no wrong-drain), 0 errors across 645 context-dependent completions. Flag stays default-off; enabling is a rollout decision, not a code gap. #101's two pillars (shim-dedup + references-in-state) are both landed + validated.
CQRS write-path: 2a producer validated live + 2d-3 cutover scoped. Set NOETL_EVENT_STREAM_ENABLED=true on kind: the tailer started, published a run's 7 events noetl.event → noetl_events stream, advanced its stream_cursor; noetl_projector + noetl_materializer consumers ensured. Reverted to default-off.
The 2d-3 cutover (materializer as sole noetl.event writer) is an architecture change, not a flag-flip: the tailer reads committed rows, so dropping the synchronous INSERT requires producer→stream-direct publish (new worker code), a skip-synchronous gate (doesn't exist), and the system/event_materializer playbook deployed on the system pool. Scoped on #103; the durability-contract design note landed at docs/architecture/cqrs_write_path_cutover.md (docs#182). It pins the boundary move (synchronous INSERT → JetStream publish-ack), the producer fork (server-mediated Option A vs worker-direct Option B), and the staged shadow→producer-move→drop-tailer rollout.
Shadow phase then stood up on kind (the rollout's first step): producer validated (tailer publishes, cursor advances), materializer deployed on the system pool + draining the consumer (20 msgs acked). It did its job, surfacing a real reproduction-blocker #106: the worker renders a large templated http body ({{ build_envelopes.events }}, 47 KB) truncated to ~1182 chars (orchestrator context holds the full 348 KB), so the materializer's /events/project POST fails to deserialize. Cutover producer-move stays gated behind #106.
Also a sizing finding: the materializer drove the 512Mi server to OOMKilled under a large backlog.
refs-in-state phase 2 validated (see #101): the open PRs #208/#91 were stale dupes (already on main, closed superseded); the NOETL_REFS_IN_STATE flip proven on kind — refs kept in state, command ctx ~10KB not inline, cursor fans out, 0 errors over 645 completions.
Pointers. server ed691c7 (v3.15.3) + worker faa2b16 (v5.31.2); earlier server bea09db (CQRS).
Headline. The WASM system-plug-in capability (#105)
is functionally complete and proven end to end on the kind cluster — a
tool: {kind: wasm, plugin: {...}} playbook runs a compiled Rust→wasm plug-in
on the worker's wasmtime host and its object_put lands in the object store.
Routing shipped this session: digest resolution at dispatch
(worker#105) · the dispatch branch
tool_kind: "wasm" → WasmDispatcher::run_and_apply_by_ref
(worker#107) · wasm-plugin flipped
on by default (worker#108, v5.31.0) ·
ToolKind::Wasm in the playbook schema
(server#214, v3.13.0).
Live kind validation (the capstone): built + deployed the feature-on worker
(v5.31.0, wasmtime carried) + server v3.13.0 (plug-in registry + object store).
(1) Flip-safety — a full PFT run playbook.completed (217 events) on the
host-carrying workers, 0 restarts. (2) Registered system/reference-materializer@1
via POST /api/internal/plugins (digest c6bd7d05…); GET returns it with the
ETag. (3) A wasm playbook playbook.completed; the step routed
(command.issued→claimed→started→call.done); the plug-in's object_put landed
in noetl.object_store at noetl/results/reference/0/0/1.feather. The whole
path — server accepts wasm tool → tool_kind: "wasm" → worker routes to host →
digest from registry → load + run → flush to the object store — works against
running pods.
Real-data flow closed (input→args): the object payload had been empty
because the worker's wasm_config_to_ref read config.input while the server
canonicalizes a step's input: to args. Fixed
(worker#110, v5.31.1): read args
first, input fallback. Re-validated on kind — a wasm playbook with
input: {hello: world} now lands {"hello":"world"} (17 bytes) in
noetl.object_store (was 0). The full WASM dispatch path now carries real
data end to end. #105's runtime is complete; the only remaining scope is the
optional playbook→WASM lowering + porting system/materialiser to the compiled
path.
Hot-replace pillar proven LIVE (worker#112, test + kind validation): the host resolves the plug-in digest from the registry per dispatch, so republishing the same path@version with new bytes hot-reloads the pool with no restart. Proof on kind — registered @1=variant A → ran → object at .../1.feather; republished @1=variant B (new digest, different object key) → re-ran → object at .../HOTRELOAD-B.feather, all 8 worker restartCount still 0. The full #105 runtime — load + run + capability-flush + hot-replace — is now live-proven. The two remaining items (executor: step-level author sugar; porting the Python system/event_materializer to a compiled plug-in) are deferred: the materialiser port needs nats_drain/events_project capabilities + the Arrow-in-wasm decision, which belong to #104/#103, not #105.
Pointers. worker df16b83/bump (v5.31.1, worker#110) ·
worker d6cd215 (v5.31.0) · server 2b21f28 (v3.13.0). (worker#112 hot-reload test — pointer bump pends merge.)
Headline. The WASM plug-in host capability landed end to end across worker + server, plus the Resource Locator naming foundation; ai-meta pointers bumped to the merged releases.
Pointer bumps (this change set):
- server → a162514 (v3.11.0, server#210) — plug-in module registry (noetl.plugin_module + POST/GET /api/internal/plugins/{*path}), the PluginSource backend.
- worker → 2433492 (v5.23.0 → v5.24.0, worker#93 + worker#95) — wasmtime host (capability ring, hot-reload, Arrow byte data-plane ABI, materialiser capability ring) + HTTP PluginSource. Catalog-loading loop closed: server registry → host → wasmtime. All behind off-by-default wasm-plugin.
- tools → 2ca2f2a (v3.11.0, tools#68) — noetl_tools::locator (ResourceLocator logical URI + ResultCoordinates §7 physical key + stable FNV-1a shard_key).
Still in flight: #105 Round 5 (dispatcher mode + lowering + materialiser port); #104 R02 — locator fan-out (frame,row) refinement open as tools#70 (v3.12.0); R02a (orchestrator stamps cursor.{frame,row} on body commands) already in place; R02b (worker stamp) gated on the tools 3.12.0 release.
Pointers. ai-meta chore(sync) commits e96b6d6 (server) + e34dbcb (worker) + 2b71d4d (tools).
Headline. Implementation of the Event-WAL umbrella started with the naming foundation; a standing directive routed all async system services to the system-pool plug-in ring and opened the WASM-compilation capability as its own umbrella (#105).
Round 01 — naming foundation (noetl/tools#68,
merged; v3.11.0). Shared noetl_tools::locator: ResourceLocator (the stable §8
logical URI), ResultCoordinates (the §7 physical key, collision-free across
frame/attempt), shard_key (fixed FNV-1a — reproducible across
binaries/arch/time, locked test), CellPlacement, legacy parse. Pure module,
12 unit tests, not yet wired into runtime. Sub-issue
tools#67. Single source of truth
replacing the divergent noetl://execution/... formatting.
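The property that makes shard_key "reproducible across binaries/arch/time" comes from FNV-1a being a fixed, seedless byte-wise hash. A minimal sketch of the idea follows. It assumes the 64-bit FNV-1a variant and a simple modulo shard mapping; the names `fnv1a_64` and `shard_for` are hypothetical and are not taken from noetl_tools::locator.

```rust
// Standard 64-bit FNV-1a parameters (public constants of the algorithm).
const FNV_OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

/// 64-bit FNV-1a: XOR each byte into the hash, then multiply by the prime.
/// No seed and no platform-dependent state, so the output is identical
/// on every build and architecture.
fn fnv1a_64(bytes: &[u8]) -> u64 {
    let mut hash = FNV_OFFSET_BASIS;
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(FNV_PRIME);
    }
    hash
}

/// Hypothetical shard mapping: hash the logical key, reduce modulo the shard count.
fn shard_for(key: &str, shard_count: u64) -> u64 {
    assert!(shard_count > 0, "shard_count must be non-zero");
    fnv1a_64(key.as_bytes()) % shard_count
}
```

A locked test in this style would pin a few known hash values so that any accidental change to the function is caught at build time.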
Directive — system services as plug-in-ring playbooks. The Event-WAL
materialiser and every async system service run on the system worker pool as
playbooks (like the live system/projector + system/outbox_publisher), not
bespoke Rust services; the worker's parallel publish stays compiled-core-thin.
"Compile playbook logic to compiled, hot-replaceable, managed library" is the
Phase 4 (WASM compilation) of the
System Pool + WASM Plug-in ADR:
wasmtime host in the worker binary, server-side compile-at-register, hot-reload
via catalog version bump, the catalog as the managed plug-in library. Designed,
not built; its old umbrella #46 is
closed, so opened #105 to track
it (gating decision: the lowering model — transpile vs hand-written Rust
plug-ins). #104 blueprint reshaped to route services to the plug-in ring
(docs#180 836725c).
Pointers. tools#68 (v3.11.0, merged); #105 opened + on board 3; docs#180 updated.
Headline. The references-in-state flag-on stall is fixed and re-validated on kind; the architecture for making the event/result path fast and crash-resilient is captured as a new umbrella (#104).
Stall fix (worker feat/refs-in-state-extracted @ 56c253c, off main). The
extracted predicate block was collapsing over-budget results to a flat
{_count, _keys} shape, so the orchestrator's
{{ output.data.rows[0].facility_mapping_id }} resolved to null and the PFT
stalled at 13 events. Replaced the flat collapse with a bounded structural
summary (build_extracted/summarise_value recurse: objects keep every key,
arrays keep their first element as a real 1-element array so arr[0].<field>
resolves; large strings → {_len}; threaded byte budget caps at 4KB). Worker-only
— the server command carries no output_select. Re-validated flag-on (2 fac × 5
pat): 253 events (vs 13), full per-facility pipeline ran, 31 reference URIs
active, command.issued bounded (max 18KB / avg 11KB). The terminal
playbook.failed was the hardcoded 1000/1000 go/no-go assertion vs the
5-patient override (fixture artifact). 2 unit tests added; kind reverted to
flag-off. Recorded on #101.
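The structural-summary rules described above (objects keep every key, arrays keep their first element as a 1-element array, large strings collapse to {_len}) can be sketched as a small recursion. This is an illustration only: the `Value` enum, the `MAX_INLINE_STR` threshold, and the function body are assumptions, and the threaded 4KB byte budget from the real build_extracted/summarise_value is deliberately left out.

```rust
use std::collections::BTreeMap;

/// Hypothetical JSON-like value type standing in for the worker's real one.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Null,
    Num(f64),
    Str(String),
    Arr(Vec<Value>),
    Obj(BTreeMap<String, Value>),
}

/// Assumed threshold above which a string is replaced by its length.
const MAX_INLINE_STR: usize = 64;

/// Recursively summarise a value while preserving its shape, so that a
/// template path such as rows[0].facility_mapping_id still resolves.
fn summarise_value(v: &Value) -> Value {
    match v {
        // Large strings collapse to {_len: <byte length>}.
        Value::Str(s) if s.len() > MAX_INLINE_STR => {
            let mut m = BTreeMap::new();
            m.insert("_len".to_string(), Value::Num(s.len() as f64));
            Value::Obj(m)
        }
        // Arrays keep only their first element, still wrapped in an array.
        Value::Arr(items) => Value::Arr(items.iter().take(1).map(summarise_value).collect()),
        // Objects keep every key; values are summarised recursively.
        Value::Obj(map) => Value::Obj(
            map.iter()
                .map(|(k, v)| (k.clone(), summarise_value(v)))
                .collect(),
        ),
        // Scalars and short strings pass through unchanged.
        other => other.clone(),
    }
}
```

The contrast with the earlier flat {_count, _keys} collapse is that indexing into the first array element keeps working, which is what the orchestrator's predicate templates depend on.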
New umbrella #104 — Event WAL + derivable result storage. Design blueprint
landed: docs/architecture/event_wal_and_derivable_storage.md
(docs branch feat/event-wal-derivable-storage @ fdbc388). The model: NATS
JetStream is the write-ahead log (publish-ack = durability, the synchronous
noetl.event INSERT leaves the hot path); local memory/temp is a read cache
ahead of the last acked offset; result locations are derived from a URN
naming convention instead of carried as references; two pools drain the log
(projector → projection_snapshot, materialiser → Arrow Feather in object
store). The load-bearing decision: the durability barrier sits at side-effecting
tool boundaries only. Folds CQRS (#103) + references-in-state (#101)
into one model; ~70% already scaffolded by #103. Added to roadmap board 3.
Pointers. worker feat/refs-in-state-extracted 56c253c; docs
feat/event-wal-derivable-storage fdbc388; umbrella Event WAL Storage.
Headline. Block b of the orchestrator-scaling umbrella landed in noetl/server v3.9.0 and was proven stall-proof on the GKE db-g1-small + PgBouncer small tier.
What landed (noetl/server#197 →
v3.9.0 1760c19): projection-snapshot bounded rebuild (flat memory — 167KB
snapshot at 200k events, was OOM at ~19k); throttled consistency COUNT (O(events)
per-trigger COUNT off the hot path); a background reconcile poller force-
advancing every active execution every 8s; results-by-reference resolution; and
the GET /api/executions/{id} memory-bomb fix (was loading all 200k events → OOM
on poll).
The journey. kind 10×1000 held flat memory to 200k events but a status-endpoint
poll OOM'd it (self-inflicted — found + fixed the executions endpoint). On the
GKE small tier the slow DB backpressure surfaced a deadlock: a non-triggering
straggler (a cursor claim's call.done carrying the row batch) missed in a COUNT
throttle gap left the cursor unable to fan out, and with no further events there
was no trigger to retry → permanent stall. Fixed with the reconcile poller.
Re-validated on GKE db-g1-small + PgBouncer (10×200): cleared the prior deadlock
point, poller observed advancing a stuck execution, 0 fails / 0 restarts,
Cloud SQL bounded ~15 backends. Processing is slow under the small tier (~1.5
items/s — the bottleneck is the Cloud SQL tier + PgBouncer pool, not the
orchestrator) but does not stop.
Pointers. Issue #101 · Umbrella: Orchestrator Scaling · server v3.9.0. Full GKE 10×1000 running for the small-tier completion number. Block-b stage 2 (references-in-state + completed-frame pruning) still ahead.
Headline. Found + fixed two coupled orchestrator-scaling bugs that surfaced
validating test_pft_flow_v2 at scale, then committed both as PRs (kind-validated,
awaiting merge).
What landed (PRs open).
- server#197 — incremental orchestrator state (per-execution OrchStateCache: apply-new-events behind a per-exec lock, full-rebuild fallback on count mismatch, evict on terminal; cursor frames moved into StepInfo) + hydrate_result_references (resolve over-budget {data:{_ref}} references from noetl.result_store, nested + top-level envelope shapes, before the orchestrator reads events).
- worker#89 — env-configurable inline budget NOETL_EVENT_RESULT_CONTEXT_MAX_BYTES (default 100 KB).
Root cause. The orchestrator never resolved references when reading events, so
a cursor claim returning its rows by reference looked like zero rows → 0-row frames
→ infinite re-claim (20,844 events, work-queue stuck pending=5). And the
per-trigger full event-log replay OOM'd the server under cursor concurrency.
Validation. kind, worker budget=256 (forces every result by reference): PFT
test_pft_flow_v2 (1 facility × 5 patients) COMPLETED — 34 references / 9 steps
stored+resolved in noetl.result_store, 292 events (vs 20,844 runaway pre-fix),
all 25 work-queue rows done. server 666 tests pass, worker 19 pass, clippy clean.
Pointers. Issue #101 (board:
In progress) · Umbrella: Orchestrator Scaling ·
worker wiki deployment-spec env var (worker.wiki@a1d3b09).
Pointer bumps deferred to merge. Follow-up (user-steered): extend the contract
so event.result + command.context carry references only (never inline data),
with an extracted predicate-fields block — projection + reference +
step-container / heterogeneous-runtime model.
Headline. The Rust orchestrator now supports loop.spec.mode: cursor — a claim-based work-distribution loop that repeatedly leases a frame of rows from a database table and processes the step body per row until the claim returns nothing. The full test_pft_flow_v2 patient-fetch flow ran end-to-end on the Rust stack with all_passed: true (5/5 per data type: assessments, conditions, medications, vital_signs, demographics) against the throttling/error-injecting paginated-api test server on kind. This proves the playbook handles retries + multi-frame cursor loops.
What landed.
- noetl/server#196 → v3.8.0 — cursor loop engine: LoopMode::Cursor, CursorClaim, FrameSpec; orchestrator entry hook issues the first claim command; advance_cursor drives subsequent frames; reconstruct_cursor_frames matches completions to frames; StepInfo.is_cursor prevents premature step completion; drain fires when the claim returns 0 rows (__cursor_drained, event.name = loop.done). Also lands: the output namespace (arc when: / step set: may now use {{ output.<field> }} as an alias for the just-completed step's result — unblocked the PFT's output-gated arcs) + cursor loop-back re-entry (re-running a cursor step from a loop-back arc resets frame tracking so a prior drained run's frames don't merge).
- noetl/tools#66 → v3.10.1 — postgres multi-statement splitter fix: the splitter now skips -- line comments before scanning for ;. An apostrophe in a -- comment was swallowing the trailing semicolon and merging subsequent statements, producing "cannot insert multiple commands into a prepared statement" at setup_facility_work.
- noetl/worker#88 — bumps noetl-tools dep to v3.10.1 to pick up the splitter fix.
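The splitter bug in tools#66 is easy to reproduce in miniature: if quote tracking runs over comment text, an apostrophe inside a -- comment opens a "string" that swallows the next semicolon. The following is a minimal sketch of a splitter that skips line comments before looking at quotes or semicolons. It is not the noetl-tools implementation; `split_statements` is a hypothetical name, and dollar-quoting and block comments are intentionally ignored.

```rust
/// Split a SQL script on `;`, ignoring semicolons inside single-quoted
/// strings and dropping `--` line comments entirely.
fn split_statements(sql: &str) -> Vec<String> {
    let mut out = Vec::new();
    let mut cur = String::new();
    let mut in_quote = false;
    let mut chars = sql.chars().peekable();

    while let Some(c) = chars.next() {
        if in_quote {
            // Inside a string literal: copy verbatim until the closing quote.
            // A doubled quote ('') closes and immediately reopens, which is fine.
            cur.push(c);
            if c == '\'' {
                in_quote = false;
            }
            continue;
        }
        match c {
            // Line comment: discard everything up to (not including) the newline,
            // so an apostrophe in the comment never reaches the quote tracker.
            '-' if chars.peek() == Some(&'-') => {
                while let Some(&next) = chars.peek() {
                    if next == '\n' {
                        break;
                    }
                    chars.next();
                }
            }
            '\'' => {
                in_quote = true;
                cur.push(c);
            }
            ';' => {
                let stmt = cur.trim();
                if !stmt.is_empty() {
                    out.push(stmt.to_string());
                }
                cur.clear();
            }
            _ => cur.push(c),
        }
    }
    // Keep a trailing statement that has no terminating semicolon.
    let tail = cur.trim();
    if !tail.is_empty() {
        out.push(tail.to_string());
    }
    out
}
```

With the comment check placed ahead of the quote check, a script whose comment contains "don't" still splits into separate statements instead of merging them into one prepared statement.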
Orchestrator-driven design. No worker holds a slot between claim frames. Each claim runs as a normal postgres tool command on a worker; the orchestrator issues the next claim when the prior frame's body work completes. Stale-row reclaiming is the playbook's own responsibility via a reclaim-stale CTE in the claim SQL (FOR UPDATE SKIP LOCKED … RETURNING). {{ __frame_max_rows }} is injected from loop.spec.frame.max_rows.
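The claim-until-drained control flow can be illustrated with a toy simulation. This sketch models the work table as an in-memory Vec and the claim as a function call; in the real system the claim is a postgres tool command and the body runs on workers. All names here (`claim_frame`, `run_cursor_loop`) are hypothetical.

```rust
/// Stand-in for the claim SQL: lease up to `max_rows` pending rows.
/// An empty result signals that the cursor is drained.
fn claim_frame(queue: &mut Vec<u32>, max_rows: usize) -> Vec<u32> {
    let n = max_rows.min(queue.len());
    queue.drain(..n).collect()
}

/// Drive frames until a claim returns nothing.
/// Returns (frames issued, rows processed).
fn run_cursor_loop(queue: &mut Vec<u32>, max_rows: usize) -> (usize, usize) {
    let mut frames = 0;
    let mut rows_processed = 0;
    loop {
        let frame = claim_frame(queue, max_rows);
        if frame.is_empty() {
            // Zero rows claimed: the loop is done.
            break;
        }
        frames += 1;
        // The step body runs once per claimed row; the next claim is only
        // issued after the whole frame's body work has completed.
        for _row in &frame {
            rows_processed += 1;
        }
    }
    (frames, rows_processed)
}
```

Because each iteration starts with a fresh claim, nothing has to hold a slot between frames, which mirrors the orchestrator-driven design described above.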
Validation. test_pft_flow_v2 on the local kind cluster:
- Stack: server v3.8.0 + noetl-tools v3.10.1 + worker pulling v3.10.1.
- Test server: throttling/error-injecting paginated-api (proves error handling + multi-frame loops).
- Result:
all_passed: true, 5/5 per data type.
Pointers. server → 1418a93 (v3.8.0) · tools → 454ab6e (v3.10.1) · worker → e4b4a64.
Umbrella. noetl/ai-meta#100 — closed. Umbrella-Cursor-Loop-Mode page created. Server wiki cursor-loop-mode deep-dive page added.
Headline. The transfer tool now moves rows between Snowflake and Postgres in both directions
with full credential-alias resolution. Validated end-to-end on kind against the live sf_test
Snowflake account (NDCFGPC-MI21697) + kind Postgres: the full bidirectional
data_transfer/snowflake_postgres fixture turned green — every step COMPLETED, including
transfer_sf_to_pg and transfer_pg_to_sf. Real data moved with correct types: SF→PG landed
id (int) / name (text) / value (numeric 100.50) / created_at (timestamptz
2026-06-15 05:55:01.262+00) / metadata (jsonb) correctly; PG→SF moved all 5 rows back.
What landed.
- noetl/tools#65 → v3.10.0 — implemented the (Snowflake, Postgres) and (Postgres, Snowflake) transfer arms (previously validated as supported but not implemented). Reuses SnowflakeTool (new query_rows method) + PostgresTool internally. Two key coercion problems solved: (1) Snowflake returns every cell as a string, so SF→PG looks up the target column types from information_schema.columns and coerces with $n::text::<udt> casts; (2) Snowflake's internal TIMESTAMP_TZ format (<epoch>.<nanos> <tzmin>) is reformatted to RFC3339 before the PG cast. PG→SF writes generated SQL-escaped INSERT statements. SourceConfig/TargetConfig capture worker-injected credential fields via #[serde(flatten)] extra.
- noetl/worker#87 → v5.22.0 — the worker pre-resolves the keychain alias on each transfer endpoint (source.auth/target.auth), mirroring the task_sequence pre-resolution pattern, and bumps noetl-tools to 3.10. (Earlier worker bumps this session: #83 / #84 / #85 / #86 for the key-pair JWT work under #98.)
- noetl/e2e#58 — migrated the fixture's transfer steps off the Python-era nested-auth map to string-alias auth + table-based auto-INSERT.
Validation (kind, live account). Full bidirectional data_transfer/snowflake_postgres fixture
green: create_sf_database, setup_sf_table, create_pg_table, transfer_sf_to_pg, and
transfer_pg_to_sf all COMPLETED. Type coercion confirmed live: numeric, text, timestamptz, jsonb
all round-tripped correctly in the SF→PG direction; all 5 rows moved cleanly in the PG→SF direction.
Pointers bumped. tools → 4127b4b (v3.10.0) · worker → 6d97e7c (v5.22.0) · e2e → 94aa7f1.
Headline. Snowflake key-pair (JWT) authentication implemented and
validated end-to-end on kind against the live sf_test account
(Snowflake account NDCFGPC-MI21697, user NOETL, warehouse
SNOWFLAKE_LEARNING_WH). This was the last external-fixture gap for the
regression-baseline migration (#98):
the Snowflake tool only supported password auth, and the real account
requires MFA with TOTP — rejected every attempt. JWT bypasses the
password/MFA path entirely.
What landed.
- noetl/tools#62 → v3.9.0 — key-pair JWT auth for the Snowflake tool. RS256 JWT with iss = <ACCOUNT>.<USER>.SHA256:<base64(SHA256(public-key DER))>, sub = <ACCOUNT>.<USER> (both uppercased, region segment dropped from account); sent as a Bearer token with X-Snowflake-Authorization-Token-Type: KEYPAIR_JWT, bypassing the password/MFA login flow. New SnowflakeConfig.public_key (PEM) field. Deps: jsonwebtoken 9 + pem 3.
- noetl/tools#63 → v3.9.1 — set a User-Agent header on the Snowflake HTTP client. Snowflake's SQL REST API rejects requests with no User-Agent (400 / code 391903); reqwest sends none by default.
- noetl/tools#64 → v3.9.2 — set Snowflake session context (warehouse/role/database/schema) in the request body instead of via USE statements (the SQL API rejects USE with code 391911); split multi-statement command: blocks on ; (the SQL API runs one statement per request; a whole multi-statement block fails with code 000008); database/schema omitted for CREATE/DROP DATABASE.
- noetl/worker#83 — added sf_public_key -> public_key to auth_alias.rs SNOWFLAKE_FIELD_MAP.
- noetl/worker#84/#85/#86 — bumped noetl-tools dep 3.7 → 3.9 → 3.9.1 → 3.9.2 (binary redeployed on kind; no crates.io worker publish).
- noetl/e2e#57 — dropped unsupported USE DATABASE; USE SCHEMA statements from the data_transfer/snowflake_postgres fixture.
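The iss/sub claim shapes listed for tools#62 come down to string normalization, which can be shown without any crypto. In this sketch the public-key fingerprint is taken as an already-computed base64 string, and the function names `qualified_user` and `jwt_issuer` are hypothetical rather than the noetl-tools API. Signing with RS256 is out of scope here.

```rust
/// Build <ACCOUNT>.<USER>: drop any region segment from the account
/// (everything after the first '.') and uppercase both parts.
fn qualified_user(account: &str, user: &str) -> String {
    let account_no_region = account.split('.').next().unwrap_or(account);
    format!(
        "{}.{}",
        account_no_region.to_uppercase(),
        user.to_uppercase()
    )
}

/// Build the `iss` claim: <ACCOUNT>.<USER>.SHA256:<fingerprint>.
/// The fingerprint is base64 and therefore case-sensitive, so it is
/// appended unchanged.
fn jwt_issuer(account: &str, user: &str, fingerprint_b64: &str) -> String {
    format!(
        "{}.SHA256:{}",
        qualified_user(account, user),
        fingerprint_b64
    )
}
```

For example, an account written as myorg-acct.us-east-1 with user noetl yields sub = MYORG-ACCT.NOETL, and the same prefix appears in iss ahead of the SHA256 fingerprint.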
Validation (kind, live account). The two tool: snowflake steps in
the fixture — create_sf_database (CREATE DATABASE) and setup_sf_table
(CREATE TABLE + INSERT) — both reached COMPLETED via key-pair JWT. The
transfer_sf_to_pg step still fails: the transfer tool reads inlined
credentials and cannot resolve credential aliases, and its Snowflake source
has no key-pair fields. Deferred to
#99.
Pointers bumped. tools → a216ab2 (v3.9.2) · worker → 9d6b127 · e2e → e191231.
Headline. Stood up the Rust-native system worker pool. The Rust server
publishes commands directly to NATS and writes events inline, so the Python-era
outbox-publisher/projector system playbooks are obsolete — the system pool's
real job is scheduled retention/cleanup of the transient noetl.* tables
(nothing did this after the cutover). Deployed + verified in prod; Python-era
legacy removed.
What landed.
- noetl/server#193 (v3.6.0) — POST /api/internal/cleanup/purge, service-account-gated. Purges terminal noetl.command rows + dead noetl.runtime worker registrations; noetl.event retention is opt-in (default off — append-only source of truth). Ships span + noetl_cleanup_rows_purged_total{table} metric + structured log. Prod image server-rust:v3.5.4 (sha256:2bb06d5a…).
- noetl/ops#185 — system/scheduled_cleanup playbook (calls the endpoint via the system pool, per data-access-boundary.md), prod system-pool deployment (NOETL_COMMANDS_RUST stream, noetl_worker_system_rust consumer), hourly CronJob trigger. Deleted the obsolete Python outbox_publisher playbook + outbox-publisher-deployment + configmap-outbox-publisher + projector-statefulset.
Python-era legacy removed. Deleted the dead prod noetl-server +
noetl-worker Python deployments (orphaned — the noetl Service already selects
app: noetl-server-rust; gateway health stayed 200). Prod now runs Rust only
(server-rust + worker-rust + system-pool). The remaining entangled refactor
(Python server-deployment/worker-deployment/subscription-runtime manifests,
the kind redeploy automation that hardcodes Python deployment names, the stale
helm release noetl rev 185 from before the cutover) is tracked in
#97.
Validation. Kind end-to-end (/api/execute → system pool → cleanup/purge
200, purged 30 dead runtime rows), then prod: server rolled to v3.5.4, system
pool 1/1, CronJob hourly; a trigger run reached playbook.completed with a 200
from the endpoint.
Pointers bumped. server@9f399f7 + ops@7b02727.
Also this session — Rust regression baseline + e2e migration (#98). Built a Rust-stack regression runner + PF-resilient batched runner; grew the green core 26 → 40 → 64 fixtures, all verified on kind. Migrated 2 fixtures to the start-entry convention. Root-caused + fixed a kind credential blocker (noetl/ops#186): the kind noetl-secret lacked NOETL_ENCRYPTION_KEY, so the server used a random default key regenerated per restart → flaky Decryption failed on postgres fixtures; a stable dev key unlocked the whole postgres batch (40→65 green). Remaining red is external-cloud (OpenAI/GCS/IB/Snowflake — can't run in kind) + a few engine cases. Browser e2e on the Rust-only prod stack also green. Tracked in #98.
(superseded by the fuller note above) Also this session — Rust regression baseline. Added a Rust-stack regression runner (noetl/e2e#52, scripts/rust_regression_run.sh) that drives the server's /api/execute directly and ships a green 10-fixture core baseline verified on kind (basic python, loops, control-flow routing, fanout/parallelism, sub-playbook composition, large-result extraction, output selection). Browser e2e on the Rust-only prod stack also green (login → run tests/e2e_probe → playbook.completed). Growing the baseline / migrating the Python-era suite tracked in #98.
Headline. Browser-driven Auth0 login at https://mestumre.dev/login now
authenticates and redirects to the dashboard; the auth0_login playbook runs
to playbook.completed. Login had been down since the cutover; root-caused and
fixed forward (no Python back-compat) per standing guidance.
What landed.
- noetl/worker#81 (worker@9ce4d6d, released v5.20.1) — SOURCE_FIELD_MAP maps nats_url/nats_user/nats_password credential fields to the flat url/user/password the NATS tool deserializes. The shipped nats_credential used the prefixed names, which apply_source_credential injected verbatim → dropped as serde-unknown → no url → cache_and_callback's NATS kv_put failed (503 auth backend is busy at the gateway). Mirrors the existing POSTGRES_FIELD_MAP pattern. Prod image rolled to sha256:61278cb6…f524658 (ops#184).
- noetl/e2e#51 (e2e@1c2a0b5) — migrated auth0_login.yaml to Rust execution conventions (prod catalog v102): python libs: → explicit import; context.get() → input: + args.get(); http callback body data: → json:; dropped Python-era .context step-result wrapper; carried the session token as the non-sensitive sess_ref (the server + worker [REDACTED]-redact any *token* field when persisting step results, which destroyed the real token before downstream steps read it); expires_at::text cast (the Rust postgres tool returned null for the raw timestamptz column).
Validation. Full browser Auth0 login through the prod Rust stack →
authenticated /catalog dashboard; execution reached playbook.completed
across start → create_user_session → prepare_session_cache → cache_and_callback
→ send_success_callback.
Pointers bumped. worker@690ef1d + e2e@1c2a0b5 + ops@19d7f84.
Follow-ups filed. #95 — postgres
pg_value_to_json returns null for timestamptz/NaiveDateTime. Noted in #49:
event-log *token* redaction corrupts inter-step propagation (favor
response-boundary redaction per the data-access-boundary rule).
2026-06-13 (#49 — post-cutover edge refresh: gateway v3.4.0 + SPAs on Cloudflare Pages + Rust worker KEDA)
Headline. On top of the full Rust-stack cutover, refreshed the edge and added worker autoscaling. All public endpoints live + verified through Cloudflare.
- Rust worker KEDA autoscaler (ops#182) — 2→20 on NATS JetStream lag of the dedicated NOETL_COMMANDS_RUST stream / noetl_worker_rust_shared consumer.
- Gateway → v3.4.0 (noetl-gateway@sha256:97e72c97…49c48, f175a87) — rolled in-cluster by digest; forwards to the Rust control plane; /health ok. Config-compatible (new env all optional). Helm-managed → set image in prod helm values before any helm upgrade.
- NOETL dashboard → Cloudflare Pages (noetl-gui, mestumre.dev) via automation/cloudflare/gke_gateway_edge.yaml action=pages.
- Travel SPA → Cloudflare Pages (travel, travel.mestumre.dev) via its npm run build:cf && deploy:cf (Maps key from SM google-maps-widget-key).
- Frontend pattern = Cloudflare Pages for SPAs + tunnel for the gateway API only (not in-cluster). Verified: gateway.mestumre.dev/health 200, mestumre.dev 200, travel.mestumre.dev 200.
ops pointer 494497f. #49 detail.
Headline. Production GKE (noetl-demo-19700101) now runs the full Rust
stack (Rust server + Rust worker); Python is scaled to 0. The Rust
noetl/server crate is the production control plane.
The decisive lesson. A server-only flip (attempt 1) failed: the Rust
server publishes commands to the hierarchical subject
noetl.commands.{pool}.{execution_id}, which the Python workers + prod NATS
stream (flat noetl.commands) don't consume → /api/execute 500'd. Rolled
back cleanly. The Rust worker is what consumes the hierarchical subjects,
so the cutover had to be the full Rust stack (the kind-validated config).
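The mismatch behind attempt 1 follows directly from NATS subject matching, where tokens are separated by dots, * matches exactly one token, and > matches one or more trailing tokens. A minimal matcher sketch makes the failure visible. `subject_matches` is a hypothetical helper written for illustration (it assumes > only appears as the last filter token), and the concrete pool and execution id in the usage note are made up.

```rust
/// Return true if `subject` is matched by the NATS-style `filter`.
/// `*` matches exactly one token; `>` matches one or more remaining tokens.
fn subject_matches(filter: &str, subject: &str) -> bool {
    let mut f = filter.split('.');
    let mut s = subject.split('.');
    loop {
        match (f.next(), s.next()) {
            // `>` needs at least one remaining subject token, then accepts the rest.
            (Some(">"), Some(_)) => return true,
            // `*` consumes exactly one token.
            (Some("*"), Some(_)) => continue,
            // Literal tokens must be equal.
            (Some(a), Some(b)) if a == b => continue,
            // Both exhausted at the same time: full match.
            (None, None) => return true,
            // Length mismatch or differing literal token.
            _ => return false,
        }
    }
}
```

Under these rules a flat filter noetl.commands does not match a hierarchical subject such as noetl.commands.shared.42, while the dedicated stream's noetl.commands.> filter does. That is why the Python workers never saw the Rust server's commands and why the Rust stream could stay disjoint from the flat one.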
Attempt 2 (validated-then-flipped).
- Built the prod amd64 Rust worker image (noetl-worker-rust:v5.20.0, digest sha256:b808bc60…c8dea9).
- Created a dedicated NATS stream NOETL_COMMANDS_RUST (noetl.commands.>), disjoint from Python's flat stream — Python untouched.
- Deployed the Rust worker as a canary + proved a real hello_world execution COMPLETED end-to-end off the traffic path (the gate).
- Cut over: scaled Rust worker → 3, flipped the noetl selector → Rust, re-encrypted the 19 credentials under Rust (19×200), scaled Python server + workers → 0 (KEDA ScaledObject paused at 0).
- Verified through the production noetl Service / gateway: executions COMPLETE, credentials decrypt, health green, logs clean.
Fixes landed this cutover: pgbouncer transaction-mode sqlx
(NOETL_PG_STATEMENT_CACHE_CAPACITY=0, server#191),
time =0.3.47 build pin (server#190),
DB password from NOETL_PASSWORD key (ops#180),
prod manifests + runbook (ops#178/#179/#181).
State: noetl-server-rust 1/1, noetl-worker-rust 3/3; Python
noetl-server/noetl-worker retained at 0 for rollback. #49 open for a short
soak. Full detail: #49 cutover comment.
Headline. Ran the read-only pre-flight against live prod with the operator. Surfaced one real blocker and corrected two assumptions; fixed the blocker. Production still 100% Python — no traffic moved.
Findings.
- 🔧 DB is Cloud SQL behind transaction-mode pgbouncer. Prod has no direct postgres Service — pgbouncer.postgres.svc runs POOL_MODE=transaction (cloud-sql-proxy → noetl-shared-pg). sqlx's named prepared-statement cache fails intermittently under transaction pooling. Fixed: added NOETL_PG_STATEMENT_CACHE_CAPACITY to the server (default 100 unchanged; set 0 behind a transaction pooler → one-shot unnamed statements) — noetl/server#191 MERGED (0577cc6, v3.5.1). Prod manifest sets it to 0. Rust stays behind pgbouncer like Python.
- ✅ Decision C N/A. Prod noetl Service exposes 8082/TCP only — no 8083/Flight port (that's kind-only). Nothing to break on flip.
- Image repinned to the rebuilt digest sha256:c3783281…964984 (server-rust:e7df366 / :v3.5.0, Cloud Build 94cc199e) carrying both the time pin and the statement-cache fix.
- noetl/ops#179 MERGED (1164270): manifest env + repin + runbook corrections (Decision B RESOLVED, C N/A). Server-wiki deployment-spec env catalogue updated (a17cf50).
ai-meta pointers: server 55d2dfc→0577cc6, ops dd5ede7→1164270,
server-wiki a17cf50. Operator-pending: provision the two secrets →
apply → canary (query-heavy playbook proves the pgbouncer fix) → flip → scale
Python to 0. #49 open; board In progress.
Headline. Following the NO-GO readiness review (same day, below), all safe, non-traffic-affecting cutover prep is built and merged. Production is still 100% Python — the cutover is operator-gated and fully scripted. #49 stays open; board In progress.
What landed.
- Prod amd64 image pushed to the prod Artifact Registry: server-rust:4644c49 / :v3.5.0, digest sha256:78cce8f3…b929fa (Cloud Build 00a26c26, linux/amd64).
- noetl/server#190 MERGED (55d2dfc) — time =0.3.47 build-fix pin. The v3.5.0 release commit 7b217d8 doesn't compile (time 0.3.48 × async-nats 0.38 E0119); server was the last Rust repo missing the pin that tools/worker/gateway already had.
- noetl/ops#178 MERGED (dd5ede7) — prod-shaped server-rust-deployment-prod.yaml (image pinned by digest, encryption key + internal-api-token REQUIRED/fail-closed, pgbouncer DB, NATS prod auth; does NOT touch the noetl Service) + the operator runbook runbooks/noetl-server-rust-cutover.md + the amd64 Cloud Build asset.
Operator-pending (in the runbook): provision NOETL_ENCRYPTION_KEY
(+ re-enter the plaintext-stored prod credentials under the Rust AES-GCM
scheme) and noetl-internal-api-token; verify pgbouncer session mode + the
port-8083 Flight caveat; apply → canary → selector flip → scale Python to 0;
one-command rollback on standby.
ai-meta pointers bumped: server 7b217d8→55d2dfc, ops 85bfc1f→dd5ede7.
Headline. Stage-1 readiness review for flipping production GKE
(gke_noetl-demo-19700101_us-central1_noetl-cluster) from the Python
noetl-server (FastAPI) to the Rust noetl/server crate. Verdict:
NO-GO — no production traffic was flipped; prod is untouched and still
serving on Python.
What was found. Prod read-access confirmed. Routing is the
gateway LoadBalancer → noetl ClusterIP Service (selector
app=noetl-server), not a K8s Ingress. The prod baseline is Python
only: noetl-server 1/1 on image noetl:coalesce-20260529230422
(cmd ["python"]), noetl-worker 3/3. No noetl-server-rust
Deployment, Service, pods, or image exist in prod — the Rust server
has never been deployed to GKE; it runs only on the kind stack
(healthy at runtime v3.4.2). Submodule pointer repos/server =
7b217d8 (v3.2.0-10-g7b217d8).
Hard blockers (beyond "not deployed").
- noetl-secret in prod holds only NOETL_PASSWORD + POSTGRES_PASSWORD — no NOETL_ENCRYPTION_KEY; the Rust manifest references it optional: true, so a Rust pod would boot on the insecure default key (prereq 3b FAIL).
- noetl-internal-api-token secret absent — Rust manifest references it as a non-optional secretKeyRef → pod fails to start (3c FAIL).
- Named gate validate-shard-routing-n2.sh is kind-scoped; fresh re-run blocked on a kind Postgres CREATEDB harness-privilege gap (shard routing previously passed at Phase F R4). Not unblocked (out of authorized scope).
Outcome. Full GO/NO-GO with per-prerequisite pass/fail, the exact
selector-flip cutover + one-command rollback, blast radius, and the
operator action list (build/push amd64 image → provision the two
secrets → deploy + canary → flip) recorded on
#49.
No PR opened (applying a Rust prod Deployment before the secrets exist
would crashloop or silently use the insecure default key). #49 stays
open; board stays In progress. Unrelated uncommitted items
(repos/.dockerignore, scripts/start_noetl_ui.command) left untouched.
Headline. The gateway push-ingress pubsub_oidc verifier is now
proven against the real Google JWKS — closing the one positive-path
gap #90 Phase 3 deferred for lack of a real Google-signed token.
What landed.
- A genuinely Google-signed OIDC token was minted by impersonating the #90 Phase 5 least-privilege runtime SA (noetl-subscription-runtime@noetl-demo-19700101.iam.gserviceaccount.com) with --audiences + --include-email, so it carries the exact claims a Pub/Sub push subscription with OIDC auth sends (iss=accounts.google.com, custom aud, email=<SA>, email_verified=true, RS256 + real Google kid).
- #[ignore]d live test oidc_live_google_token_against_real_jwks in repos/gateway/src/ingress/verify.rs fetches the live JWKS via the gateway's own fetch_google_jwks and validates the token: valid → verified; wrong-aud → oidc_wrong_audience; wrong-SA → oidc_wrong_sa; tampered → oidc_bad_signature. (gateway#30, test-only)
- Reproducible runner scripts/live_validate_oidc_verify.sh in noetl/e2e mints the token + runs the test; no secret printed or committed. (e2e#50)
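The four outcomes the live test distinguishes can be pictured as an ordered check that runs once the signature result is known. This is a minimal sketch, not the code in verify.rs: the `Claims` and `Expected` structs, the `check_claims` function, and the order of the checks are assumptions for illustration; only the three reject codes are taken from the entry above.

```rust
/// Hypothetical, already-decoded token claims (signature checking is out of scope here).
pub struct Claims<'a> {
    pub aud: &'a str,
    pub email: &'a str,
    pub email_verified: bool,
}

/// What the subscription expects; field names are illustrative.
pub struct Expected<'a> {
    pub audience: &'a str,
    pub service_account: &'a str,
}

/// Returns Ok(()) or a reject code. `signature_ok` stands in for the
/// RS256 check against the fetched JWKS, which this sketch does not perform.
pub fn check_claims(signature_ok: bool, c: &Claims, e: &Expected) -> Result<(), &'static str> {
    if !signature_ok {
        return Err("oidc_bad_signature");
    }
    if c.aud != e.audience {
        return Err("oidc_wrong_audience");
    }
    // Assumption: an unverified email is treated like a wrong service account.
    if c.email != e.service_account || !c.email_verified {
        return Err("oidc_wrong_sa");
    }
    Ok(())
}
```

Checking the signature first means no claim is trusted before the token is known to be authentic.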
Live HTTP gold-standard (kind-noetl). Ran the gateway binary against
the in-cluster server (NATS + server port-forwarded) with a registered
pubsub_oidc subscription, then POSTed the real token in a Pub/Sub-push
body to /ingress/oidcbilling: 4 received → 1 dispatched (valid token
→ HTTP 202 + one COMPLETED child execution 323957201221718016 on the
subscription pool) → 3 rejected (tampered 401, wrong-aud 403, missing
401), zero executions from the rejected deliveries. Metrics confirm:
noetl_ingress_dispatched_total{oidcbilling}=1.
GCP hygiene. No cost-bearing resources created (no Pub/Sub
topics/subscriptions, no Cloud Run, no buckets) — only an ephemeral token.
The scoped roles/iam.serviceAccountTokenCreator binding added to mint the
token was removed at the end.
Status. #91 CLOSED. gateway#30 + e2e#50 merged (test-only; gateway
stays v3.4.0 — no release cut); ai-meta pointers bumped (gateway f175a87 +
e2e f7a24de). Board → Done. Full HTTP run this round also added the
wrong-SA negative at the HTTP layer (→ 403 oidc_wrong_sa).
Headline. The header-directive engine is extracted into a standalone, lean
noetl-directives crate (serde+thiserror only) so the security-sensitive
allowlist (RFC §7.5) has one implementation — the internet-facing
noetl-gateway de-vendors its former serde-only copy (src/ingress/directives.rs)
and depends on the shared crate instead, eliminating the drift risk.
- noetl-directives 0.1.0 published to crates.io (new crate). tools v3.8.0 (tools#61) makes repos/tools a workspace (root noetl-tools + directives/ member), re-exports every symbol from tools::source so the worker's call sites are unchanged, and publishes the member before the root in release.yml.
- gateway v3.4.0 (gateway#29) drops the vendored copy for noetl-directives = "0.1".
- Scope: extracted noetl-directives (the drift-risk fix); deferred noetl-spool (single consumer → no drift; couples to PolledMessage); #92 notes it.
Validation. 13 noetl-directives tests + 376 noetl-tools tests + 69 gateway tests
green; clippy clean. Gateway stays lean — cargo tree shows no
duckdb/kube/tokio-postgres/noetl-tools creep (the whole point). Directive behavior
is byte-identical (the engine + its full suite moved verbatim) and was already
exercised live by the #93/#94 e2e earlier today. Also pinned time =0.3.47 across
tools/worker/gateway (0.3.48 broke async-nats 0.38 with E0119 under rustc 1.92).
Pointers. ai-meta → tools d8bef36 (v3.8.0) + gateway 2c48c26 (v3.4.0). Closes #92.
Headline. Two of the three spool/directives refinements spun out of the closed subscription RFC (#90) land, both live-proven on kind:
- #94: s3 spool backend. New noetl_tools::spool::S3Backend (hand-rolled AWS SigV4 over reqwest + hmac/sha2, no AWS SDK; S3/MinIO/R2/B2) + worker wiring (SpoolBackendKind::S3, keychain-auth credential). tools v3.7.1 (tools#58), worker v5.20.0 (worker#80).
- #93: cross-restart drain. recv_seq high-water recovery + SpoolRuntime::recover_on_startup: on boot, list the durable spool and auto-drain (closes the gcs/s3 in-memory-circuit gap where a restart mid-outage otherwise forgot the backlog). tools recovery helpers (tools#59) + the same worker PR.
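The recovery idea (list the durable spool on boot, replay in receive order, resume numbering above the highest sequence seen) can be sketched with std only. The key layout below (20-digit zero-padded sequence, a dash, then an id) and the names `spool_key`, `parse_recv_seq`, and `recover_backlog` are assumptions for this sketch; the real helpers live in noetl-tools and are not shown here.

```rust
/// Build a spool object key whose lexical order equals receive order.
/// The 20-digit zero padding is an assumption; it is wide enough for any u64.
pub fn spool_key(recv_seq: u64, id: &str) -> String {
    format!("{:020}-{}", recv_seq, id)
}

/// Parse the sequence back out of a key produced by `spool_key`.
pub fn parse_recv_seq(key: &str) -> Option<u64> {
    key.split_once('-')?.0.parse().ok()
}

/// Given the keys listed from the durable spool at startup, return them in
/// replay order plus the next recv_seq to hand out (high-water mark + 1).
pub fn recover_backlog(mut keys: Vec<String>) -> (Vec<String>, u64) {
    // Ignore objects that do not follow the key layout.
    keys.retain(|k| parse_recv_seq(k).is_some());
    // A plain lexical sort is receive order because the sequence is zero-padded.
    keys.sort();
    let high_water = keys.iter().filter_map(|k| parse_recv_seq(k)).max().unwrap_or(0);
    (keys, high_water.saturating_add(1))
}
```

Recovering the high-water mark from the listing, rather than from process memory, is what lets a restarted runtime avoid reusing sequence numbers that are still sitting in the spool.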
Build unblock. time 0.3.48 (published 2026-06-12) breaks async-nats 0.38 with E0119 under rustc 1.92, which failed the 3.6.0/3.7.0 crate publishes. Pinned time =0.3.47 in tools (tools#60) + worker → noetl-tools 3.7.1 published. Revisit when async-nats ships a fix.
Live proof (kind, MinIO). kind_validate_subscription_spool_s3.sh
(e2e#49, MinIO via ops#177):
outage → 6 buffered to MinIO (s3 SigV4) → kill+restart runtime → startup
auto-drain (subscription.spool.recovered fired) → 6 replayed in order, 6
COMPLETED, idempotent (no dups), no loss. Also: the s3 backend's
put/list/get/delete proven directly against MinIO (s3_live test).
Pointers. ai-meta → tools f362aa1 (v3.7.1) + worker 7b8a09a (v5.20.0) + ops 85bfc1f + e2e 1ea7bd0. Closes #94 + #93. Remaining refinement: #92 (shared noetl-directives/noetl-spool crate extraction).
2026-06-12 (RFC #90 Phase 7 SHIPPED — scale hardening; #90 CLOSED — all 7 phases complete, live proof green)
Headline. The subscription/listener RFC's final phase lands and #90
is closed: all seven phases (bounded-drain tool → kind: Subscription
continuous runtime + header directives → gateway push-ingress + auth-gated
trust → store-and-forward spool + circuit breaker → out-of-cluster Cloud
Run + gcs spool → CLI local noetl subscribe → scale hardening) are
shipped and live-proven on kind / Cloud Run.
What shipped (Phase 7).
- server v3.5.0 (server#189, closes server#188): POST /api/execute/batch (N→N executions in one round-trip, partial-failure contained, reuses the single-execute execute_one path so per-message routing/trace/dedup are intact; server still owns every DB write) + the opt-in exactly-once dedup window (RFC §10 OQ1): noetl.subscription_dedup (idempotent startup DDL, bounded by age, cluster-pool authority), execute takes dedup: { key, window_secs } scoped by parent_execution_id (the subscription), a duplicate within the window collapses to the existing execution + a subscription.message.deduplicated audit event, race-safe via INSERT … ON CONFLICT, default off; validation of the new dispatch.batch_dispatch/batch_max/dedup/limits blocks; noetl_execute_outcomes_total + noetl_execute_batch_size.
- worker v5.19.0 (worker#79, closes worker#78): batch dispatch (dispatch.batch_dispatch → execute_batch in chunks of batch_max, each item its own playbook/pool/trace/dedup); opt-in dedup stamps the block (idempotency_key → message_id, OQ8); per-subscription rate limits (RFC §9) via a new deterministic token-bucket RateGovernor (src/ratelimit.rs) enforced on the fetch side: over the cap the runtime stops fetching (source keeps the backlog and redelivers, so no loss) + a subscription.rate_limited event; new batch/rate-limit counters.
- ops (ops#176): subscription_scale_hardened.yaml example. e2e (e2e#48): kind_validate_subscription_scale.sh + 3 fixtures.
- No tools change → no crate cascade.
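The batch-dispatch shape described above (chunks of batch_max, one outcome per item, a failing item not failing its neighbours) can be sketched as follows. `ItemOutcome` and `dispatch_in_batches` are hypothetical names, and the closure stands in for the real batch client call; this is an illustration under those assumptions, not the worker's implementation.

```rust
/// Per-item outcome of a batch call; one failing item does not fail its neighbours.
#[derive(Debug, PartialEq)]
pub enum ItemOutcome {
    /// Hypothetical execution id returned for an accepted item.
    Accepted(u64),
    /// Per-item error message.
    Rejected(String),
}

/// Split `items` into chunks of at most `batch_max` and call `execute_batch`
/// once per chunk, concatenating the per-item outcomes in input order.
pub fn dispatch_in_batches<T, F>(items: &[T], batch_max: usize, mut execute_batch: F) -> Vec<ItemOutcome>
where
    F: FnMut(&[T]) -> Vec<ItemOutcome>,
{
    // `chunks(0)` would panic, so clamp to at least one item per batch.
    let size = batch_max.max(1);
    let mut out = Vec::with_capacity(items.len());
    for chunk in items.chunks(size) {
        out.extend(execute_batch(chunk));
    }
    out
}
```

Returning one outcome per input item keeps the N→N property visible to the caller even when some items are rejected.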
Live proof (kind, server v3.5.0 + worker v5.19.0).
- batch: 12 pre-loaded → children=12 completed=12 pooled=12 traced=12 (12→12, all COMPLETED on the subscription pool, per-message traceparent preserved), runtime used execute_batch (server handled 5 batch calls).
- dedup: a duplicate (same x-idempotency-key) + a distinct key → children=2 deduplicated_events=1 (the dup collapsed to one execution); direct-curl proved within-window→duplicate, outside-window→allowed, dedup-off→no-collapse, batch partial-failure containment.
- rate-limit: burst of 10 at max_dispatch_per_sec=2 → rate_limited_events=1 children=10 completed=10 (limit engaged, every message became an execution, no loss).
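The within-window / outside-window behaviour proven above can be modelled in memory. This is a sketch only: the server uses a table with INSERT … ON CONFLICT, whereas the `DedupWindow` type below is a hypothetical HashMap stand-in, and treating an age exactly equal to window_secs as outside the window is an assumption.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
pub enum Dedup {
    /// First sighting (or the window expired): a new execution is created.
    New(u64),
    /// Seen within the window: collapse to the existing execution id.
    Duplicate(u64),
}

/// In-memory stand-in for the dedup table:
/// (parent_execution_id, key) -> (execution_id, first_seen_secs).
#[derive(Default)]
pub struct DedupWindow {
    seen: HashMap<(u64, String), (u64, u64)>,
}

impl DedupWindow {
    pub fn check(&mut self, parent: u64, key: &str, window_secs: u64, now_secs: u64, new_execution_id: u64) -> Dedup {
        let slot = (parent, key.to_string());
        if let Some(&(existing, first_seen)) = self.seen.get(&slot) {
            if now_secs.saturating_sub(first_seen) < window_secs {
                return Dedup::Duplicate(existing);
            }
        }
        // Not seen, or seen too long ago: record this execution as the new owner.
        self.seen.insert(slot, (new_execution_id, now_secs));
        Dedup::New(new_execution_id)
    }
}
```

Scoping the map key by the parent execution id mirrors the note that dedup is per subscription: the same idempotency key under two subscriptions does not collide.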
Unit. server batch/dedup/validation tests; worker RateGovernor (throttle→recover no-loss, clamp, combined caps), Phase-7 spec parse, dedup-key resolution, batch client shapes. Full server + worker suites green; clippy clean.
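A deterministic token bucket of the kind the RateGovernor tests describe (throttle, recover, clamp) might look like the sketch below. The `TokenBucket` type, its fields, and the choice of capacity equal to one second of rate are assumptions; the real src/ratelimit.rs is not reproduced here. Time is passed in explicitly so the behaviour is testable without a clock.

```rust
/// Deterministic token bucket: the caller supplies the current time in milliseconds.
pub struct TokenBucket {
    rate_per_sec: f64,
    capacity: f64,
    tokens: f64,
    last_ms: u64,
}

impl TokenBucket {
    pub fn new(rate_per_sec: u32, now_ms: u64) -> Self {
        // Clamp: a zero rate would never refill and would stall the fetch loop forever.
        let rate = rate_per_sec.max(1) as f64;
        TokenBucket { rate_per_sec: rate, capacity: rate, tokens: rate, last_ms: now_ms }
    }

    /// How many messages may be fetched right now, up to `want`.
    /// Granted tokens are consumed; a result of 0 means "stop fetching for now".
    pub fn acquire(&mut self, want: u32, now_ms: u64) -> u32 {
        let elapsed_ms = now_ms.saturating_sub(self.last_ms);
        self.last_ms = now_ms;
        let refill = self.rate_per_sec * elapsed_ms as f64 / 1000.0;
        self.tokens = (self.tokens + refill).min(self.capacity);
        let granted = (self.tokens.floor() as u32).min(want);
        self.tokens -= granted as f64;
        granted
    }
}
```

Because the limiter only reduces how much is fetched, messages over the cap stay in the source and are picked up on a later call rather than being dropped.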
Pointers. ai-meta → server 7b217d8 (v3.5.0) + worker 7531f4a (v5.19.0) + ops 6db69b9 + e2e 203593b.
#90 closed; refinement follow-ups spun out as separate ai-task issues (none a planned phase or a load-bearing gap): #91 (live OIDC signature), #92 (shared noetl-directives/noetl-spool crates), #93 (cross-restart spool drain auto-trigger), #94 (s3 spool backend wiring), tools#57 (real-Pub/Sub pull default).
2026-06-12 (RFC #90 Phase 6 SHIPPED — CLI local noetl subscribe + FileEventSink + local_disk spool, live local proof green)
Headline. noetl subscribe <spec.yaml> runs a kind: Subscription
listener standalone in local mode — no Kubernetes, no NATS-dispatch
server is required for the listening itself. It reuses the same
noetl_tools::tools::source clients + header-directive engine +
noetl_tools::spool engine the in-cluster worker runtime uses, and emits the
same ExecutorEvent envelope — to a local FileEventSink (one event
per line, JSONL) — so a local run produces a replayable event-sourced log
identical in shape to the in-cluster / Cloud Run trail. This completes RFC
#90 Phases 1–6.
What shipped. cli v4.11.0 (cli#60,
closes cli#59) — new
src/subscribe/{mod,spec,sink,dispatch,runtime,spool}.rs + examples/subscribe/:
- sink.rs: FileEventSink (JSONL, flushed per emit → crash-safe trail) implementing the shared noetl_events::EventSink; app-side snowflake id generator (observability.md Principle 3, hostname+pid machine id).
- dispatch.rs: the local dispatch model (RFC §5.3): LocalDispatcher runs the target playbook in-process via PlaybookRunner (the pure-local default); ServerDispatcher POSTs /api/execute (--dispatch server). Plus the message→workload envelope (mirrors the worker's build_payload).
- spec.rs: parse a kind: Subscription for local mode; forces the spool backend to local_disk (RFC §8.6) so a spec authored for the in-cluster nats_object/gcs backend runs locally unchanged.
- spool.rs: local spool runtime over noetl_tools::spool::LocalDiskBackend: circuit breaker + buffer + ordered replay + idempotency + dead-letter; circuit state in a local control/ file; the six spool/circuit events to the FileEventSink; in-process replay.
- runtime.rs: the continuous drain loop: lifecycle (registered/activated/drained/deactivated) + per-message received→playbook.started→completed/failed + directives_applied; --once / --max-messages stop conditions; Ctrl-C/SIGTERM drain.
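The "one event per line, flushed per emit" property of the sink can be illustrated with a small writer-generic type. `JsonlSink` is a hypothetical name and this is not the FileEventSink source; the sketch takes an already-serialized JSON string so it needs no serialization crate.

```rust
use std::io::{self, Write};

/// Append-only JSONL sink: one event per line, flushed on every emit.
pub struct JsonlSink<W: Write> {
    out: W,
    emitted: u64,
}

impl<W: Write> JsonlSink<W> {
    pub fn new(out: W) -> Self {
        JsonlSink { out, emitted: 0 }
    }

    /// `event_json` must already be a single-line JSON document.
    pub fn emit(&mut self, event_json: &str) -> io::Result<()> {
        if event_json.contains('\n') {
            return Err(io::Error::new(io::ErrorKind::InvalidInput, "event must be one line"));
        }
        self.out.write_all(event_json.as_bytes())?;
        self.out.write_all(b"\n")?;
        // Flush per emit so a crash loses at most the event being written.
        self.out.flush()?;
        self.emitted += 1;
        Ok(())
    }

    pub fn emitted(&self) -> u64 {
        self.emitted
    }

    pub fn into_inner(self) -> W {
        self.out
    }
}
```

For a real trail the writer would typically be a file opened with `std::fs::OpenOptions::new().create(true).append(true)`, so restarts extend the log instead of truncating it.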
cli-only. The source clients + spool engine already ship in noetl-tools
v3.5.0 (Phases 1–5), so no tools change / crate cascade was needed — the
PR bumps the noetl-tools lock 3.0.0 → 3.5.0 (the executor's "3" constraint).
Tests. 12 subscribe unit/integration tests; full bin suite 53 passed;
clippy-clean. Includes a deterministic local outage → local_disk spool →
ordered replay → idempotency proof exercising the real noetl_tools::spool
engine (a TCP downstream probe toggled in-test).
Live proof (local mode, against the in-cluster NATS broker on kind).
(1) Drain + in-process dispatch + event-sourced JSONL: created a JetStream
stream + durable consumer, published 5 messages → received=5 dispatched=5 failed=0; the JSONL trail held 19 events (lifecycle×4 + received×5 +
playbook.started×5 + playbook.completed×5), every line round-tripping as
ExecutorEvent. (2) local_disk spool outage → recovery: a tcp downstream
probe pointed at a closed port → subscription.circuit.opened → 6
subscription.message.spooled to the local_disk dir (recv_seq-ordered object
keys on disk), 0 dispatched (no loss); bind the port (downstream up) →
subscription.circuit.closed → subscription.spool.draining → 6
subscription.message.replayed in receive order → spool drained to 0
(pending_spooled=0). The whole outage is reconstructable from the JSONL trail.
Finding. The NATS source connects via async-nats ConnectOptions, which
does not honor user:pass embedded in the URL — the spec uses explicit
user/password fields (or auth: <alias> + --credential). Documented in
the example specs.
Pointers + wiki. ai-meta → cli 2fb3fb0 (v4.11.0). Wiki: new cli page
subscribe; umbrella
Subscription / Listener Phase 6 row + recent
activity. #90 stays open for Phase 7 (scale hardening — batch execute,
opt-in dedup window, rate limits — volume-gated).
2026-06-12 (RFC #90 Phase 5 SHIPPED — out-of-cluster Cloud Run target + gcs spool backend, live out-of-cluster proof green)
Headline. The subscription/listener runtime now runs out-of-cluster
on Google Cloud Run (RFC §5.2): a Pub/Sub firehose is consumed off-cluster
and dispatched to the NoETL server over HTTPS, never entering the cluster
network until it is a well-formed execution. The gcs store-and-forward
spool backend landed alongside.
What landed.
- tools v3.5.0 (tools#56, closes tools#55): noetl_tools::spool::GcsBackend, the GCS impl of the Phase-4 SpoolBackend trait over the JSON API, reusing the existing GcpAuth (ADC) + reqwest (no new dependency); prefix-shared bucket, live+dlq split, recv_seq-ordered keys, idempotent put/delete; gcs feature (default-on). Live GCS round-trip proven.
- worker v5.18.0 (worker#77, closes worker#76): spool.backend: gcs wired into the WORKER_MODE=subscription run-loop (ADC/Workload Identity; in-memory circuit out-of-cluster); optional NOETL_INTERNAL_API_TOKEN bearer auth to the control plane; $PORT-aware metrics/health bind (Cloud Run startup probe, no new HTTP code).
- server v3.4.2 (server#187, closes server#186): gcs/s3 spool credential made optional (absent → ADC/Workload Identity for the Cloud Run platform bucket; present → tenant-bucket keychain alias); bucket stays required. A real Phase-5 finding: the Phase-4 validation wrongly rejected the Cloud Run bucket.
- ops (ops#175, closes ops#174): automation/cloud-run/: least-priv SA + spool bucket + Pub/Sub setup-gcp.sh, Cloud Build + gcloud run deploy deploy.sh (min=1 --no-cpu-throttling singleton), teardown.sh, declarative service.yaml, README.
- docs (docs#179): the Cloud Run runtime architecture page.
- e2e (e2e#47, closes e2e#46): Pub/Sub-source + gcs-spool fixture + hybrid Cloud Run validation driver.
Live proof (noetl-demo-19700101, server reached via a cloudflared tunnel to the kind cluster). Live: the GCS backend round-trip; the Cloud Run service deploying + running ($PORT health bound, startup probe green); the out-of-cluster runtime activating against the server over HTTPS (register+activate in the event log); spool runtime active backend=gcs; 6/6 Pub/Sub messages → one POST /api/execute each over HTTPS → COMPLETED on the subscription pool; GCS spool under a live outage — killing the tunnel opened the circuit and a message buffered durably to the real GCS bucket (…/spool/000…001-<id>, sha256 + reason=circuit_open). Not auto-triggered: cross-restart GCS drain (the documented in-memory-circuit limitation — drain+replay+idempotency was proven live in Phase 4 with the same engine). Finding: the pubsub source's synchronous pull (emulator-validated in Phase 1) needs timeout_ms ≥ 10s against real Pub/Sub — the 2s NATS default stalls; filed tools#57.
GCP setup. Dedicated least-privilege runtime SA noetl-subscription-runtime (objectAdmin on the one spool bucket + subscriber on the one Pub/Sub subscription; no project-wide roles, no exported key — Workload Identity / ADC). All test resources torn down at the end (Cloud Run service deleted, tunnels killed, spool bucket + Pub/Sub topic/sub + throwaway AR image deleted) — no cost-bearing resources left; the runtime SA kept (free).
Pointer bumps. ai-meta → tools 0f29c57 (v3.5.0) + server 67669ba (v3.4.2) + worker e1a74ce (v5.18.0) + ops 14c5bf1 + docs cb48772 + e2e 99cda2b. #90 stays open (Phases 6–7 remain).
2026-06-12 (RFC #90 Phase 4 SHIPPED — store-and-forward spool + per-downstream circuit breaker, live outage proof green)
Headline. Phase 4 of the subscription/listener RFC (#90) shipped + was live-validated on kind under a simulated outage: when a downstream a subscription depends on goes offline, incoming messages are durably buffered (buffer_and_ack) and replayed in order on recovery — proven no data loss.
What landed.
- tools v3.4.0 (tools#54, closes tools#53): noetl_tools::spool: a pure per-downstream circuit breaker (trip-after-N / half-open probe / close; NATS-KV-serializable; one breaker per declared downstream → resolves OQ2), the SpoolItem envelope (SHA-256 + noetl://spool/<sub>/<recv_seq>/<id> ref + recv_seq-ordered object keys so a lexical list == receive order), a SpoolBackend trait + nats_object + local_disk backends, and the engine (ordering global/per_key/none + idempotency + poison→dead-letter + retention max_age/max_bytes/on_full + GC) + http/tcp/nats probes. 44 unit tests under simulated outage + a real-NATS nats_object integration test.
- worker v5.17.0 (worker#75, closes worker#74): wires the spool into the WORKER_MODE=subscription run-loop: probe→circuit→spool-or-dispatch→ack, NATS-KV circuit persistence (survives a restart mid-outage), drain-on-recovery, 6 spool/circuit events + noetl_subscription_spool_bytes gauge.
- server v3.4.1 (server#184 + server#185, closes server#183): spool: block validation + the lifecycle-status fix (spool/circuit events share the subscription's execution_id but must not corrupt its lifecycle status; surfaced + fixed during live validation when an open circuit 500'd activate).
- ops (ops#173): toggleable spool-downstream-echo + runtime NATS env. e2e (e2e#44 + e2e#45): kind_validate_subscription_spool.sh.
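The trip-after-N / half-open probe / close cycle can be sketched as a small pure state machine driven by probe results. The `Breaker` and `State` names and the exact transition rules (for example, one successful probe moving Open to HalfOpen and a second one closing the circuit) are assumptions for illustration, not the noetl_tools::spool implementation.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum State {
    Closed,
    Open,
    HalfOpen,
}

/// Pure circuit breaker: no clock and no I/O, only probe results in and a state out.
pub struct Breaker {
    trip_after: u32,
    failures: u32,
    state: State,
}

impl Breaker {
    pub fn new(trip_after: u32) -> Self {
        Breaker { trip_after: trip_after.max(1), failures: 0, state: State::Closed }
    }

    /// Record one probe result and return the new state.
    pub fn on_probe(&mut self, ok: bool) -> State {
        self.state = match (self.state, ok) {
            (State::Closed, true) => { self.failures = 0; State::Closed }
            (State::Closed, false) => {
                self.failures += 1;
                if self.failures >= self.trip_after { State::Open } else { State::Closed }
            }
            // While open, a successful probe only earns a half-open trial.
            (State::Open, true) => State::HalfOpen,
            (State::Open, false) => State::Open,
            (State::HalfOpen, true) => { self.failures = 0; State::Closed }
            (State::HalfOpen, false) => State::Open,
        };
        self.state
    }

    /// Messages are dispatched only while closed; otherwise they are spooled.
    pub fn should_spool(&self) -> bool {
        self.state != State::Closed
    }
}
```

Keeping the breaker free of clocks and I/O is what makes it easy to serialize (three plain fields) and to test under a simulated outage.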
Live proof (kind, 6 messages). Scale the downstream to 0 → subscription.circuit.opened → publish 6 → 6 subscription.message.spooled (recv_seq 1-6, each with the noetl://spool ref + sha256), 0 dispatched while open (no loss); scale back to 1 → subscription.circuit.closed + subscription.spool.draining → 6 subscription.message.replayed → 6 child executions COMPLETED on the subscription pool → spool drained to 0 → exactly 6 distinct children (idempotency held). The entire outage is reconstructable from noetl.event.
Decisions / deferred. OQ2 per-downstream scope; OQ3 immediate-GC + max_bytes ceiling + gauge; OQ8 idempotency_key wins over message_id; OQ14 ack-after-dispatch stop-ack tracked (buffer_and_ack/hybrid loss-safe). Deferred/tracked: gcs/s3 backends (same trait), gateway edge spool + shared noetl-directives/noetl-spool crate, hybrid stop-ack-blip optimisation.
Pointers. ai-meta → tools 02110a5 (v3.4.0) + server 51dc0d1 (v3.4.1) + worker 65fb27d (v5.17.0) + ops ab9af34 + e2e 2d0ad0a. #90 stays open (Phases 5–7 remain).
2026-06-11 (RFC #90 Phase 3 SHIPPED — gateway push-ingress (Mode C) + auth-gated directive trust, live E2E green)
Headline. Phase 3 of the subscription/listener RFC (#90) shipped: the gateway gains POST /ingress/{listener} — it terminates untrusted webhook / Pub-Sub-push traffic as a verify-and-forward gatekeeper (no DB on the ingress path), and the auth-gated directive trust the Phase-2 directive engine was designed for lands. The gateway verifies a delivery (HMAC / bearer / Pub-Sub OIDC, secret resolved from the Secrets Wallet by alias) and only then applies the header directives and forwards one POST /api/execute per delivery on the dedicated pool. The auth gate is a structural invariant (verify_then_plan): a failed verification yields no dispatch plan, so an unauthenticated caller can never drive routing (RFC §7.5).
What shipped.
- noetl-gateway v3.3.0 (gateway#28, closes gateway#27): src/ingress/: verify.rs (HMAC-SHA256 over the raw body, constant-time; bearer, constant-time; Google Pub/Sub OIDC: RS256 vs Google JWKS, aud + email/service_account + email_verified + exp); directives.rs (serde-only vendored port of the tools v3.3.0 engine, since the internet-facing edge must not pull duckdb/kube); mod.rs (verify_then_plan fuses verify + directive resolution; Pub/Sub-push envelope unwrap → attributes channel; auth headers stripped from the forwarded workload; first /metrics surface). 25 ingress + verify unit tests incl. every negative + directives_applied_only_after_verification_passes.
- noetl-server v3.3.0 (server#182, closes server#181): push catalog validation (ingress.verify required, none rejected) + GET /api/internal/ingress/{listener} (service-account-gated) resolving the verify-secret alias via the Wallet + idempotent subscription registration. subscription::ensure_registered extracted + reused. 9 push-validation unit tests.
- ops (ops#172): gateway NOETL_INTERNAL_API_TOKEN env. e2e (e2e#43): kind_validate_subscription_push.sh + HMAC/bearer push fixtures.
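The "auth gate as a structural invariant" idea can be shown with a proof type: planning takes a value that only a successful verification can produce. The sketch below covers the bearer case only; `Verified`, `verify_bearer`, `plan_route`, the u16 status code, and the fall-back-to-default behaviour for a non-allowlisted route are assumptions, and only the name verify_then_plan comes from the entry above.

```rust
/// Compare two byte strings without exiting early on the first mismatch.
pub fn constant_time_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    let mut diff = 0u8;
    for (x, y) in a.iter().zip(b) {
        diff |= x ^ y;
    }
    diff == 0
}

/// Proof that verification passed. The field is private, so code outside
/// this module can only obtain one through `verify_bearer`.
pub struct Verified(());

pub fn verify_bearer(presented: &str, expected: &str) -> Result<Verified, u16> {
    if constant_time_eq(presented.as_bytes(), expected.as_bytes()) {
        Ok(Verified(()))
    } else {
        Err(401)
    }
}

/// Planning requires a `Verified`, so an unauthenticated delivery cannot reach it.
pub fn plan_route(_proof: &Verified, requested: Option<&str>, allowed: &[&str], default: &str) -> String {
    match requested {
        Some(r) if allowed.iter().any(|a| *a == r) => r.to_string(),
        _ => default.to_string(),
    }
}

/// Verify first; only a verified delivery yields a dispatch plan.
pub fn verify_then_plan(presented: &str, expected: &str, requested: Option<&str>, allowed: &[&str], default: &str) -> Result<String, u16> {
    let proof = verify_bearer(presented, expected)?;
    Ok(plan_route(&proof, requested, allowed, default))
}
```

With this shape a failed verification returns before any header directive is even read, which is the property the negative tests above assert.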
Validation (live on kind). Built + loaded + rolled server v3.3.0 (+ internal token) and gateway v3.3.0 (ns gateway, matching token); deployed the dedicated subscription pool. kind_validate_subscription_push.sh: HMAC 12/12 + bearer 12/12 green — N signed deliveries → one execution per delivery on the subscription pool → COMPLETED; allowlisted x-noetl-route redirect honored only after verification; the auth gate — a tampered (bad-signature) and an unsigned/unauth delivery (both carrying the redirect header) → 401, no execution, no directive applied. Pub/Sub-push path proven live (bearer-verified pubsub-source subscription, base64 message.data decoded → order_id, redirect via the attribute x-noetl-route). OIDC signature path unit-proven (every negative: bad-sig / expired / wrong-aud / wrong-SA / unknown-kid).
Pointers. ai-meta → server fa1ff3f (v3.3.0) + gateway 38f024b (v3.3.0) + ops 54f2d65 + e2e 1421267 + gateway-wiki (push-ingress page). #90 stays open — Phases 4–7 (spool, Cloud Run, CLI local, scale-hardening) remain. Fast-follow tracked: extract a lean shared noetl-directives crate so the gateway + tools consume one engine instead of the vendored copy.
2026-06-11 (RFC #90 Phase 2 SHIPPED — kind:Subscription + continuous runtime + header-directive engine, live E2E green)
Headline. Phase 2 of the subscription/listener RFC (#90) shipped across five repos and validated live end-to-end on kind (13/13 assertions). kind: Subscription is now a first-class catalog type; the continuous listener runtime (Mode B) turns each received message into one execution on a dedicated pool segment; and the header-directive engine (redirect / pool / idempotency / content + W3C trace, untrusted by default) lands.
What shipped.
- tools v3.3.0 (tools#52, closes tools#51): source/directives.rs header-directive engine (DirectiveSpec → DispatchPlan: allowlisted redirect/pool/priority/idempotency/content + W3C trace extraction; value-allowlists enforced at parse; multi-value last-wins; applied[] audit) + public build_source(cfg, ctx) factory. 12 new tests.
- server v3.2.0 (server#180, closes server#179): kind: Subscription catalog validation (source/mode/dispatch, no step-DAG); event-sourced lifecycle endpoints /api/subscriptions (register→activate→pause/resume→drain→deactivate, idempotent register, GET list/get); execution_pool override on /api/execute → noetl.commands.<pool>.<eid> across the whole execution (persisted in playbook_started meta, orchestrator reads back); W3C trace into meta.trace + command notification + child inheritance.
- worker v5.16.0 (worker#73, closes worker#72): WORKER_MODE=subscription continuous runtime: build SourceClient via the tools factory, register+activate, loop poll() → one POST /api/execute per message on the dedicated pool, apply directives + emit subscription.message.directives_applied, drain+deactivate on SIGTERM. Observability triad (noetl_subscription_* counters).
- ops (ops#171): dedicated noetl-worker-rust-subscription-pool (filter noetl.commands.subscription.>) + noetl-subscription-runtime (Recreate strategy) + KEDA scaler.
- e2e (e2e#42): kind_validate_subscription_runtime.sh + a kind: Subscription fixture + two target playbooks + the NATS credential.
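The redirect directive's three rules (allowlisted values only, last value wins for a repeated header, every applied directive recorded) can be sketched as below. `RedirectSpec`, `Plan`, and `resolve` are hypothetical names covering only the redirect case; the real DirectiveSpec and DispatchPlan types have more fields and are not reproduced here.

```rust
/// Hypothetical spec for one directive: which header redirects, and to what.
pub struct RedirectSpec<'a> {
    pub header: &'a str,
    pub allowed: &'a [&'a str],
    pub default_playbook: &'a str,
}

#[derive(Debug, PartialEq)]
pub struct Plan {
    pub playbook: String,
    /// Audit trail of directives that actually took effect.
    pub applied: Vec<String>,
}

pub fn resolve(spec: &RedirectSpec, headers: &[(&str, &str)]) -> Plan {
    let mut plan = Plan { playbook: spec.default_playbook.to_string(), applied: Vec::new() };
    // Multi-value headers: scan from the end so the last occurrence wins.
    let last = headers
        .iter()
        .rev()
        .find(|(k, _)| k.eq_ignore_ascii_case(spec.header))
        .map(|(_, v)| *v);
    if let Some(v) = last {
        // Only an allowlisted value may change the target; anything else is ignored.
        if spec.allowed.iter().any(|p| *p == v) {
            plan.playbook = v.to_string();
            plan.applied.push(format!("redirect={}", v));
        }
    }
    plan
}
```

Note that last-wins is evaluated before the allowlist: if the final value is not allowlisted, the sketch falls back to the default rather than reaching for an earlier value.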
Live E2E (kind). 6 NATS messages → 6 child executions, all COMPLETED on the dedicated subscription pool (playbook_started.meta.execution_pool=subscription); 2 header-redirected (x-noetl-route → a different allowlisted playbook), 4 default; W3C traceparent propagated into all 6 children's meta.trace; 6 directives_applied audit events; full lifecycle registered→activated→paused→resumed→draining→deactivated event-logged; invalid transitions rejected (422). 13/13 assertions PASS.
Three integration gaps the E2E surfaced + fixed in-PR (same pattern Phase 1 hit): (1) noetl.catalog.kind → noetl.resource(name) FK lacked a subscription row → server startup seed (ensure_builtin_kinds); (2) noetl.event.created_at is TIMESTAMP, decoded as TIMESTAMPTZ → NaiveDateTime; (3) the runtime drained only on SIGINT but K8s sends SIGTERM → select on both. Plus idempotent register (reuse per path) + Recreate strategy (singleton drain handoff).
Decisions. OQ1 resolved — new WORKER_MODE=subscription run-mode of the worker binary (not a new artifact). OQ7 resolved — explicit dispatch.execution_pool wins over a priority map; multi-value headers last-wins. CLI parse deferred to Phase 6 (the noetl subscribe local mode owns it; Phase-2 registration is a thin server POST). Ack-after-dispatch + durable spool is Phase 4; gateway push (Mode C) is Phase 3.
Pointers. ai-meta → server ebd2944 (v3.2.0) + worker 1f74992 (v5.16.0) + tools 4995692 (v3.3.0) + ops 242e420 + e2e 32df918. Cluster left on the clean :dev stack (brokers kept). #90 stays open (Phases 3–7).
Headline. Closed the Phase-1 validation gap: the subscription
tool's Pub/Sub-pull and Kafka-poll backends were proven live
end-to-end on kind, the same bar NATS met. (Phase 1 had proven NATS
live; Pub/Sub was emulator-gated unit-only and Kafka was adapter-level.)
What landed.
- Brokers in kind (ops#170): ci/manifests/pubsub-emulator/ (gcloud SDK :emulators image, gcloud beta emulators pubsub start on 8085) and ci/manifests/kafka/ (single-broker KRaft apache/kafka:3.9.1, advertised on the in-cluster Service DNS; retired bitnami images avoided). Same PR fixes the subscription_drain.yaml example's .output. accessor.
- E2E fixtures + runners (e2e#41): subscription_{pubsub,kafka}_drain.yaml + {pubsub,kafka}_e2e.json.example credential aliases (cluster DNS only, no secret) + scripts/kind_validate_subscription_{pubsub,kafka}.sh. The runners provision the broker (topic/sub + publish/produce N), register + execute the playbook (--set unique names per run), then assert source/count=N/acked=true on the drain result, execution COMPLETED, and the event trail.
Live results (server v3.1.0 + worker v5.15.2 + tools v3.2.0 on kind):
| Backend | Result |
|---|---|
| Pub/Sub | publish 5 → drain count=5 acked=true → COMPLETED → call.done/command.completed/playbook.completed ✓ |
| Kafka | produce 5 → drain count=5 acked=true → COMPLETED → same trail ✓ |
Re-ran at N=4 against the final committed form — both green.
Adapter fixes. None needed — both backends worked as-is against real
brokers (the pure-Rust kafka crate speaks to Kafka 3.9 KRaft; the
Pub/Sub REST backend works against the emulator; the worker's v5.15.2
nats|pubsub|kafka credential arm merged endpoint/brokers correctly).
The only bug was a playbook accessor: {{ <step>.output.<field> }}
never resolved (both when: arcs evaluated false → the drain stalled
after the subscription step), corrected to {{ <step>.<field> }} in both
fixtures and the latent ops example.
Cluster note. Rebuilt server :dev from the v3.1.0 tree (the running
image had been an earlier :dev reporting 3.0.6, pre the version-bump
commit) + worker :dev from v5.15.2; rolled both. Cluster left on this
clean released stack; Pub/Sub-emulator + Kafka brokers left deployed as
first-class e2e infra.
Pointers. ai-meta → ops 568a4ac + e2e 8d21e7a. Wikis: tools
SubscriptionTool
(new Live validation section), this log, Home, Releases,
Umbrella: Subscription / Listener.
Board 3: #90 stays In progress (Phases 2–7 remain). Standing direction
honored — Claude authored the ops/e2e changes directly.
Headline. Shipped Phase 1 (Mode A — bounded drain) of the
subscription/listener RFC:
a new atomic subscription registry tool plus the reusable source-client
abstraction the later phases build on. First feature code under #90.
What landed.
- noetl-tools v3.2.0 (tools#50, closes tools#49). New tool kind subscription (operation: poll): a bounded drain that fetches up to batch / until empty / until timeout_ms (both hard-capped, so the worker slot is never held), acks per policy, and returns the normalized batch.
- Source-client abstraction (src/tools/source/): the SourceClient trait (poll(&PollOptions) -> PollOutcome), PolledMessage / AckMode types, and shared decode_payload / normalize_headers (RFC §7.1). This is the deliverable later phases reuse: a continuous runtime calls poll in a loop; a gateway push reuses the normalizers.
- Three backends. NATS (refactors js_consume into the shared drain_pull_consumer; the nats tool now delegates to it), Pub/Sub pull (REST pull + acknowledge via gcp_auth, emulator support, feature pubsub), Kafka poll (pure-Rust kafka crate, feature kafka; Phase-1 limits documented). Worker dispatches via the generic registry, with no dispatch-match change.
- ops example (ops#169): playbooks/examples/subscription_drain.yaml, a scheduled bounded NATS drain.
- Wiki: SubscriptionTool page on the tools wiki; RFC umbrella marked Phase 1 shipped.
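The bounded-drain contract (stop at batch, at empty, or at timeout_ms, with both limits hard-capped) can be sketched over a generic message source. The cap values, the `StopReason` enum, and `bounded_drain` are assumptions for illustration; the closure stands in for a backend fetch and acking is left out.

```rust
use std::time::{Duration, Instant};

// Hard caps are illustrative values, not the ones noetl-tools uses.
pub const MAX_BATCH: usize = 1000;
pub const MAX_TIMEOUT_MS: u64 = 30_000;

#[derive(Debug, PartialEq)]
pub enum StopReason {
    BatchFull,
    Empty,
    Timeout,
}

/// Pull from `next` until the batch is full, the source is empty, or the deadline passes.
pub fn bounded_drain<T, F>(batch: usize, timeout_ms: u64, mut next: F) -> (Vec<T>, StopReason)
where
    F: FnMut() -> Option<T>,
{
    // Clamp both limits so a misconfigured step can never hold the worker slot.
    let batch = batch.min(MAX_BATCH);
    let deadline = Instant::now() + Duration::from_millis(timeout_ms.min(MAX_TIMEOUT_MS));
    let mut out = Vec::new();
    loop {
        if out.len() >= batch {
            return (out, StopReason::BatchFull);
        }
        if Instant::now() >= deadline {
            return (out, StopReason::Timeout);
        }
        match next() {
            Some(m) => out.push(m),
            None => return (out, StopReason::Empty),
        }
    }
}
```

Every exit path returns what was collected so far, so a timeout yields a partial batch instead of an error.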
Validation. 323 lib tests + clippy -D warnings clean across
--no-default-features / --features pubsub / --all-features. NATS poll
path validated live against the in-cluster NATS JetStream broker (port-forward;
create stream → publish 3 → drain → ack → second drain returns 0). Pub/Sub
emulator path is emulator-gated; Kafka is adapter/unit level (no broker stood
up this session).
Full in-cluster playbook-dispatch E2E — green. Built worker + server
images carrying the tool, loaded into kind, rolled, ran
examples/subscription_e2e: subscription poll drained count=5,
acked=true, the NATS consumer went 5→0 pending, the execution reached
COMPLETED with call.done/command.completed/playbook.completed in the
event log. The E2E surfaced two integration gaps the unit + live-NATS tests
could not — both fixed and re-validated end-to-end:
- server v3.1.0 (server#178): the orchestrator validates every step's tool.kind against a typed ToolKind enum, so kind: subscription was rejected at POST /api/execute with HTTP 400 until the Subscription variant was added.
- worker v5.15.2 (worker#71): apply_credential only knew postgres/bearer/api_key/basic; a type-nats credential errored ("unsupported type 'nats'"), so the no-default-connection auth: alias pattern failed for the nats + subscription tools. Fixed by merging the credential's connection fields into the tool config.
Also re-confirmed the local dev wallet's ephemeral KEK behaviour (a server
roll re-keys, so playbook credentials must be re-registered after — per the
roll-staleness note). Cluster restored to a clean :dev build (validated
images re-tagged; re-smoke green: drain count=2 acked=true → COMPLETED).
Pointers. ai-meta pointer bumps: tools → v3.2.0, worker → v3.2.0 (#70) then v5.15.2 (#71), server → v3.1.0, ops → subscription example, ai-meta-wiki + noetl-tools-wiki. #90 stays open (umbrella; Phases 2–7 remain). Board: #90 → In progress.
Headline. Revised the subscription RFC to v3 (new §7). Still design-only.
Header-as-instruction, configurable + allowlisted. Each source's
metadata channel — Pub/Sub attributes, Kafka record headers, NATS
headers, HTTP headers (webhook/push) — is normalized into one uniform
message.headers map so playbooks + the runtime see the same shape
regardless of source. An opt-in headers.directives allowlist in the
kind: Subscription spec declares which keys act as instructions:
- redirect (dispatch.playbook): run a different target playbook than the subscription default;
- pool routing (dispatch.execution_pool): land the run on a different worker pool / command segment (noetl.commands.<override>, the pool_segment seam at repos/server/src/handlers/execute.rs:705-709);
- priority (map→pool), idempotency_key (feeds dedup + spool key), content_type / schema_hint;
- W3C trace / mesh propagation: there is no trace propagation today (only request_id at gateway/src/sse.rs:266 + execution_id); the layer extracts traceparent / tracestate / allowlisted-baggage on ingest, stamps it into event meta.trace (execute.rs:630 threads meta) + span, and propagates it on the command.issued NATS message and into child executions, so a message is traceable upstream-mesh → gateway/runtime → execution → child runs → event log.
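The normalization step (one uniform headers map regardless of source, with only allowlisted keys treated as instructions) might be sketched like this. Lowercasing keys, trimming whitespace, and keeping the last value for a repeated key are assumptions of the sketch rather than statements of the RFC, and `normalize` and `split_directives` are hypothetical names.

```rust
use std::collections::BTreeMap;

/// Normalize a source's metadata channel into one map: lowercase keys, trimmed values.
/// Repeated keys keep the last value (an assumption of this sketch).
pub fn normalize(raw: &[(&str, &str)]) -> BTreeMap<String, String> {
    let mut out = BTreeMap::new();
    for (k, v) in raw {
        out.insert(k.trim().to_ascii_lowercase(), v.trim().to_string());
    }
    out
}

/// Split the normalized map into directives (allowlisted keys) and plain data (the rest).
pub fn split_directives(
    headers: &BTreeMap<String, String>,
    allowlist: &[&str],
) -> (BTreeMap<String, String>, BTreeMap<String, String>) {
    let mut directives = BTreeMap::new();
    let mut data = BTreeMap::new();
    for (k, v) in headers {
        if allowlist.iter().any(|a| a.eq_ignore_ascii_case(k)) {
            directives.insert(k.clone(), v.clone());
        } else {
            data.insert(k.clone(), v.clone());
        }
    }
    (directives, data)
}
```

Non-allowlisted headers are not dropped; they stay available to the playbook as data, which matches the "rest are data" rule in the security note that follows.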
Security — untrusted by default. Only allowlisted keys are honored as
instructions (rest are data); allowed:/map: constrain even allowlisted
headers so they can't pick an arbitrary target; push/webhook directive
trust is gated on auth — directives are parsed only after
HMAC/bearer/OIDC verification (§6 flow step 4 after step 3), so an
unauthenticated caller can't drive routing; a header may name a credential
alias only if the alias is itself allowlisted (Secrets Wallet boundary).
Applied directives event-logged (subscription.message.directives_applied),
honored across in-cluster / Cloud Run (post-auth) / CLI-local.
Plan. No new phase — the directive engine (normalization + allowlist + redirect/pool/idempotency/content + W3C trace into execution) lands in Phase 2 (dispatch-layer); auth-gated push directive trust in Phase 3 (gateway ingress). New open questions OQ7–OQ10 (directive precedence/conflict, idempotency-key vs message_id, header-chosen credentials [default off, needs security review], trace depth). Issue #90 body + a v3 delta comment updated. No pointer bump beyond the wiki.
Headline. Revised the subscription RFC to v2 with two review refinements. Still design-only.
1 — kind: Subscription first-class catalog type. Resolves the old
open question (distinct kind vs trigger: block) in favour of a
dedicated catalog type registered alongside kind: Playbook. It
completely isolates the class: own type + validation, own dedicated
runtime + worker pool + command segment (noetl.commands.subscription/
iot.*), own lifecycle (register → activate → pause/resume → drain →
deactivate, each event-logged), own KEDA scaling on source backlog. The
three prongs (bounded-drain A / continuous runtime B / gateway push C)
become activation modes of this one type. A trigger: flag would
entangle subscription lifecycle into the step orchestrator and default
the firehose onto the shared stream; a distinct type keeps isolation both
schematic and operational.
2 — configurable store-and-forward spool (RFC §7). When a downstream
(target storage / DB / produced-to Pub/Sub-Kafka topic) is unavailable, a
circuit breaker trips and incoming messages accumulate in a fallback
buffer, replayed in order on recovery. The durability tradeoff is the
explicit spool.mode knob — off (stop-acking, the source is the
buffer; cheapest but bounded by source retention + useless for
non-redelivering webhooks), buffer_and_ack (write-to-durable-store
then ack; survives arbitrary outages + push sources, costs a write/msg),
hybrid (escalate). Backends reuse existing object-store primitives —
gcs/s3 via the Result-Store noetl:// payload-ref pattern
(repos/server/src/services/result_store.rs, repos/tools/src/tools/artifact.rs),
nats_object via the NATS Object Store ops already in the nats tool,
local_disk for CLI. Ordering (per_key lanes) + idempotency on
message_id + poison→dead-letter + retention/drain policy. Fully
event-logged (subscription.message.spooled / circuit.opened /
circuit.closed / spool.draining / message.replayed /
message.dead_lettered) so an entire outage is replayable from the log;
payload bytes → tenant object store (external, keychain-auth per
data-access-boundary), metadata + ref + SHA-256 → the event log. Works
across in-cluster (NATS-KV circuit state) / Cloud Run (HTTPS flow-back) /
CLI-local (file).
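A minimal sketch of the off versus buffer_and_ack tradeoff, assuming an in-memory deque as a stand-in for the durable backend and a `ConnectionError` as the outage signal. The `Spool` class and its method names are hypothetical; hybrid mode, per_key lanes, dead-lettering, and event emission are left out.

```python
from collections import deque
from enum import Enum


class SpoolMode(Enum):
    OFF = "off"
    BUFFER_AND_ACK = "buffer_and_ack"


class Spool:
    """Illustrative circuit breaker + spool; not the RFC's implementation."""

    def __init__(self, mode, deliver):
        self.mode = mode
        self.deliver = deliver      # callable(payload); raises ConnectionError when down
        self.circuit_open = False
        self.buffer = deque()       # stand-in for gcs / s3 / nats_object / local_disk
        self.seen = set()           # idempotency on message_id

    def on_message(self, message_id, payload):
        """Return True if the source message may be acked."""
        if message_id in self.seen:
            return True             # duplicate: already delivered or spooled
        if not self.circuit_open:
            try:
                self.deliver(payload)
                self.seen.add(message_id)
                return True
            except ConnectionError:
                self.circuit_open = True    # downstream unavailable: trip
        if self.mode is SpoolMode.OFF:
            return False            # stop acking; the source is the buffer
        self.buffer.append((message_id, payload))   # write first, then ack
        self.seen.add(message_id)
        return True

    def on_recovery(self):
        """Close the circuit and replay spooled messages in arrival order."""
        self.circuit_open = False
        replayed = 0
        while self.buffer:
            _, payload = self.buffer[0]
            try:
                self.deliver(payload)
            except ConnectionError:
                self.circuit_open = True    # still down: keep the rest spooled
                break
            self.buffer.popleft()
            replayed += 1
        return replayed
```

The return value of `on_message` is the whole difference between the modes: off refuses the ack and relies on source redelivery, while buffer_and_ack pays one write per message so that push sources that never redeliver are still covered.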
Plan re-cut. 7 phases — Phase 1 unchanged (bounded-drain tool);
kind: Subscription + runtime = Phase 2; gateway push = Phase 3; spool
= Phase 4 (right after push, where buffer_and_ack is non-optional);
Cloud Run 5; CLI local 6; scale-hardening 7. Issue #90 body + a
v2 delta comment
updated. No pointer bump beyond the wiki.
2026-06-11 (RFC filed — subscription / listener tool for Pub/Sub, NATS, Kafka, webhooks — #90, design only)
Headline. Filed #90 (ai-task, repo:tools, board 3 Todo) with a full RFC on the wiki: Umbrella-Subscription-Listener. Design deliverable — no feature code.
The tension. A listener is inherently long-lived; a worker tool is
inherently atomic. The nats tool already encodes this — js_consume
is a bounded pull, "not subscriptions," because a long-lived
subscription would hold a worker slot and violate the execution model
(repos/tools/src/tools/nats.rs:8-20). The RFC resolves it by splitting
listening from processing across three prongs:
- A, bounded-drain tool (tool: subscription, op poll): atomic, a registry tool like the other 18, a generalized js_consume across Pub/Sub-pull / NATS / Kafka. Reuses the worker model wholesale, no new runtime; ships first (Phase 1).
- B, listener runtime: a long-lived ingress host that turns each message into a normal POST /api/execute. Runs on an in-cluster KEDA-scaled dedicated pool or out-of-cluster on Cloud Run. IoT firehose isolated on a dedicated iot command segment + worker pool so the shared stream never degrades (req. 3).
- C, gateway push-ingress: /ingress/{listener} for webhooks + Pub/Sub-push; verifies HMAC / bearer / Pub/Sub-OIDC; secrets from the Secrets Wallet by alias, never gateway env. Verify-and-forward; never touches domain data.
Traceability. Same event envelope in local (CLI noetl listen +
a FileEventSink reusing the executor's pluggable
EventSink), in-cluster, and out-of-cluster (Cloud Run emits via HTTPS,
holds no DB connection per data-access-boundary). The event log is the
single traceability story regardless of where the listener ran.
Grounding. Built on a parallel code study citing real files: tool
trait/registry (repos/tools/src/registry.rs:132,
tools/mod.rs:56-79), worker pull loop + pending_callback async path
(worker/src/worker.rs:234, executor/command.rs:205-457), container
callback precedent (server/.../container_callback.rs), command publish + pool routing (server/.../execute.rs:684-742), KEDA scalers (ops/ci/manifests/keda/scaledobject-worker-*.yaml), gateway auth + callback ingress (gateway/src/main.rs:230-288, auth/middleware.rs), Secrets Wallet envelope (server/src/crypto/envelope.rs), CLI local mode + EventSink (cli/src/playbook_runner.rs, cli/executor/src/events.rs).
Plan. 6 phases (tools → worker/server → gateway → cloud-run → cli → scale-hardening). Reuses #46 system-pool primitives, #61 Wallet, #49 server API surface. Next: review the three-prong model + confirm Phase 1 scope. No pointer bump (wiki-only + ai-task issue).
2026-06-11 (#89 shipped — JSON null round-trips through {{ step }}; root cause was the server renderer, not the worker — server v3.0.6)
Headline. Closed #89,
the null-serialization bug #88 surfaced. The cursor pagination fixture
walked all 4 fetch pages but its 4th check_pagination Python step
crashed on the terminal page: the API returns next_cursor: null, and
re-injecting the whole {{ fetch_page }} envelope into the next step's
input rendered that field as the JS token undefined — invalid JSON —
so the consuming step received response as a raw str and died with
AttributeError: 'str' object has no attribute 'get'.
Root cause — server, not worker. The issue hypothesized the worker.
Tracing the corrupt command.issued args.response showed it's emitted
by the server orchestrator, which renders next-step inputs via
repos/server/src/template/jinja.rs::render_to_value. Two facts
combined: json_value_to_minijinja maps a JSON null to
Value::UNDEFINED, and minijinja renders a map with Python-style repr,
so the field surfaces as the bare token undefined. render_to_value
then failed serde_json::from_str and fell through to returning the
entire envelope as a string. The noetl-tools TemplateEngine::render_value
already had a | tojson retry for exactly this case; the server's
renderer was a divergent copy that lacked it.
Fix. server#177 (v3.0.6)
adds the | tojson retry to render_to_value: a lone {{ expr }} whose
plain render is container-shaped-but-invalid JSON re-renders with
| tojson, and minijinja_to_json maps undefined/none → JSON null, so
a null field round-trips as null and the downstream step receives a
parsed object. 5 new regression tests (null in nested + top-level
objects, null array element, explicit | tojson no-double-pipe, scalars
unchanged); 619 lib + 8 parity tests green; clippy clean.
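The failure and the retry can be illustrated with a Python analog. This is not the Rust / minijinja code: Python's `str()` of a dict stands in for the engine's repr-style plain render, `None` stands in for the undefined token, and `json.dumps` stands in for the | tojson filter. The function names mirror the entry but the bodies are assumptions.

```python
import json


def render_plain(value):
    """Mimic a repr-style plain render: readable, but not valid JSON."""
    return str(value)


def render_to_value(value):
    """Render a lone expression and parse it back, with a JSON retry."""
    text = render_plain(value)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    if text.startswith(("{", "[")):
        # Container-shaped but invalid JSON: re-render through explicit
        # JSON serialization so a null field round-trips as null.
        return json.loads(json.dumps(value))
    return text   # genuine plain string: leave unchanged
```

Without the retry branch the envelope would come back as one long string, which is the shape that produced 'str' object has no attribute 'get' in the consuming step.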
Kind validation. Built noetl-server-rust:dev, loaded into the local
kind cluster (podman save → kind load image-archive), rolled
noetl-server-rust, re-ran tests/pagination/cursor/cursor against the
live paginated-api test-server. Baseline (unfixed) reproduced the bug:
4th check_pagination → command.completed error, args.response
carried "next_cursor": undefined. Fixed: all 4 cycles → success,
terminal args.response = "next_cursor": null, execution completed
through end, validate_results → status=success, total_events=35, first_id=1, last_id=35 — matching the offset fixture's clean
full-collection result. No error events.
Pointers. ai-meta → server 8e17fbe (v3.0.6). Standing direction
honored — Claude wrote the Rust directly, no Codex. Left the unrelated
uncommitted items (repos/.dockerignore, scripts/start_noetl_ui.command)
untouched.
2026-06-10 (#88 shipped — pagination fixtures read response.body.*; #89 filed — worker null→undefined serialization)
Headline. Closed the pagination fixture-path follow-up
#88 that #85 anticipated.
The offset/cursor e2e fixtures read the HTTP response at
response.get('data', {}), but the Rust http tool nests the parsed
JSON payload under body — {{ fetch_page }} resolves to
{body, headers, status_code}. So response.data had no
users/events/has_more/next_cursor, the loop saw has_more=False
and exited after page 1 even though the post-#85 loop machinery is
correct.
Root cause confirmed against a live http-tool response (not guessed):
call.done fetch_page → result.context.data = { body: { users:[…], has_more, offset, limit, total }, headers, status_code }. Switched both
check_pagination steps to response.get('body', {}).
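A sketch of the corrected read, assuming the envelope shape quoted above. The function signature and the returned keys (`collected`, `next_offset`) are hypothetical; the actual fixture step may carry different fields.

```python
def check_pagination(response, collected):
    """Read one page from the http tool envelope {body, headers, status_code}."""
    # The payload lives under "body"; response.get("data", {}) was always empty,
    # which made has_more look False after the first page.
    body = response.get("body", {})
    users = body.get("users", [])
    return {
        "collected": collected + len(users),
        "has_more": bool(body.get("has_more", False)),
        "next_offset": body.get("offset", 0) + body.get("limit", 0),
    }
```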
Fixtures changed. repos/e2e/fixtures/playbooks/pagination/offset/test_pagination_offset.yaml
and .../cursor/test_pagination_cursor.yaml ('data'→'body' +
clarifying comments). Other pagination fixtures (retry,
max_iterations, pipeline*, loop_with_pagination) share the same
envelope-key assumption (http_response.get('data') over
/api/v1/assessments|flaky, which return {data, paging}) — flagged on
the umbrella, left out of scope.
Kind validation (live paginated-api test-server, Rust server/worker
:dev carrying the #85 fix):
- offset (exec 323345262938427392): offsets 0→10→20→30, has_more T/T/T/F, users 10/10/10/5, validate_results success 35 (first_id=1, last_id=35), playbook.completed COMPLETED. Fully green.
- cursor (exec 323345263580155904 / 323346678553776128, deterministic): cursors Mg==→Mw==→NA==→null, 35 events fetched, collected 10→20→30, then the 4th check_pagination crashed ('str' object has no attribute 'get'): the worker re-injected the terminal next_cursor: null as JS undefined, invalid JSON, so the Python step received an unparseable str. Filed #89 (repo:worker) for the serialization bug; not a response-path issue, so out of scope for #88.
Shipped. e2e#40 (squash-merged,
author kadyapam) → e2e 72a7525; ai-meta pointer bumped; #88 closed +
moved to Done on roadmap board 3; #89 opened (Todo) + added to board 3.
Picked up the deferred deep layer of #85
on the existing draft server#176
(dispatch-guard re-entry already on the branch). Claude wrote all
Rust directly per handoff-routing.md.
Root cause (two layers). The dispatch-guard layer made the loop
re-enter, but kind validation had shown loop variables thrashing
(0,0,1,0,1,2,…). Diagnosed two distinct bugs:
- Loop-ctx not durable. Step-level set: ctx.* was recomputed every orchestrator pass from the workload default + step results. The per-pass “apply every completed step’s set” loop re-fired start’s initializer set: ctx.offset: {{ workload.offset }} (= 0) on every pass, competing non-deterministically (random HashMap order) with check_pagination’s advancing set: ctx.offset.
- Loop-exit hang. Once the variable advanced, the exit branch (validate_results) was marked step.skipped on a pass triggered by the loop body completing (the recency-based branch-point detector saw the body as newer than the branch point), turning the exit branch terminal so the is_step_done guard later suppressed the exit dispatch.
Fix (server#176, v3.0.5).
(1) Persist each completing step’s rendered set: values to the event
log as a ctx.updated event; WorkflowState folds them latest-wins
into a durable ctx map that build_context overlays. Emission is
once per completion, keyed by the completion event’s event_id
(StepInfo.completed_event_id) — not completed_at, which the
event loader fills with Utc::now() when the row’s created_at is
unreadable (observed live: start re-emitting counter=0 every pass,
oscillating the fold). (2) Protect the exit branch with a
structural loop-branch-point test (a step with any back-edge arc),
independent of completion timing.
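The durable-ctx idea in (1) can be sketched as a latest-wins fold over the event log. The event field names (`event_type`, `completed_event_id`, `values`) are hypothetical, and deduplicating inside the fold is an assumption made for illustration; the entry describes once-per-completion at emission time.

```python
def fold_ctx(events):
    """Fold ctx.updated events into one durable ctx map, latest-wins."""
    ctx = {}
    applied = set()   # completion event_ids already folded
    for ev in sorted(events, key=lambda e: e["event_id"]):
        if ev["event_type"] != "ctx.updated":
            continue
        key = ev["completed_event_id"]
        if key in applied:
            continue   # same completion seen again: do not re-apply
        applied.add(key)
        ctx.update(ev["values"])
    return ctx
```

Keying on the completion's event_id rather than a timestamp is what keeps the initializer from overwriting a later value: a re-emitted start=0 carries the same key and is ignored.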
Tests. 614 lib tests (+6 new for this change; 2 verified to fail
without their guard): end-to-end counter loop advances 0→1→2 and
terminates through a real exit step; ctx.updated emission +
once-per-completion idempotency; durable-ctx fold + build_context
overlay; exit branch not skipped on a body-completion pass. Clippy-clean.
Kind validation (rebuilt + rolled noetl-server-rust, context
kind-noetl):
- Counter-loop repro: ctx.updated emits start=0, gate=1,2,3 (once each, no thrash), work dispatched 0,1,2, zero step.skipped, validate.final_counter=3 success=True, COMPLETED.
- Real-http offset pagination (test-server, 35 users / 10 per page): fetch_page completes 4×, offset 0→10→20→30→40, collected 0→10→20→30→35, has_more flips false only at the end, zero skips, validate_results: total_users=35 success=True, COMPLETED.
Separate finding (not #85, follow-up filed): the e2e offset/cursor
fixtures’ check_pagination reads the HTTP response as
response.data.users; the Rust http tool nests it at
response.body.users, so the fixtures exit after page 1 (loop ctx
still propagates correctly — confirmed offset 0→10).
Merged + released: server #176
→ v3.0.5 (e519fdc). ai-meta pointer → server e519fdc.
#85 closed; roadmap board 3 → Done.
Picked up the two open regression-sweep follow-up bugs from the 2026-06-10 sweep. Standing direction honored — Claude wrote all Rust directly (no Codex handoffs).
#87 — multi-tool sibling references — CLOSED.
Root cause in noetl-tools TaskSequenceTool::execute (the
runtime for tool: [list] multi-tool steps): each sub-tool's
result was stored in labeled_results for the aggregated step
output but never injected into the running context, so a later
sub-tool referencing an earlier sibling via {{ <label>.<field> }}
rendered empty. Masked wherever the reference sat in a quoted
position (empty render = valid ''); surfaced as a
syntax error at or near "," in an unquoted numeric SQL position
(save_edge_cases test_large_payload:
VALUES ('large_payload_test', {{ generate_large.metadata.record_count }}, ...)).
Fix: inject each sub-tool's result under its label after it
completes, with a synthetic .data self-reference (mirrors the
server's build_context shape) so both {{ label.field }} and
{{ label.data.field }} resolve. Also visible to a later python
sub-tool's stdin variables. 2 new unit tests; 300/0 tools lib.
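A Python sketch of the injection step, assuming each sub-tool is a callable that takes the running context and returns a dict. `run_task_sequence` is a hypothetical stand-in for TaskSequenceTool::execute, which is Rust.

```python
def run_task_sequence(sub_tools, context):
    """Run labeled sub-tools in order, exposing each result to later siblings."""
    labeled_results = {}
    running = dict(context)
    for label, tool in sub_tools:
        result = tool(running)
        labeled_results[label] = result    # aggregated step output, as before
        injected = dict(result)
        injected["data"] = result          # synthetic .data self-reference
        running[label] = injected          # the part that was missing
    return labeled_results
```

Because the result is injected under both shapes, a later sibling can read either label.field or label.data.field from the running context.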
- Ship: tools#48 → noetl-tools v3.1.1 (published to crates.io after a transient HTTP2-flake re-run); worker adopts via worker#69 (Cargo.lock → 3.1.1, ^3 covered it).
- Validation: built a worker image against the 3.1.1 code (cargo [patch.crates-io] to the local tools tree), loaded into the kind cluster, re-registered credentials. save_edge_cases test_large_payload → record_count = 100 with no SQL syntax error; save_delegation_test completed clean (its unquoted {{ generate_data.value }} postgres insert would have failed if the sibling ref rendered empty).
- Pointers: ai-meta → tools 76f942a (v3.1.1) + tools-wiki 4962f8b + worker b97f642. Tools wiki: added a Multi-tool steps (task_sequence) section.
#85 — workflow-arc loop can't re-enter a completed step — DEFERRED (kept open).
The filed root cause (the pass-2 is_step_done guard suppressing
the loop back-edge dispatch) is real and fixed by a back-edge
detector — a matched arc src → target where target is
terminal, target can forward-reach src (cycle), and src
completed strictly after target (recency, which disambiguates
the two arcs of a 2-cycle). The guard is bypassed for recognized
back-edges and the target re-dispatched; no state reset (the
re-dispatch's own step.enter/command.issued re-activate it).
608 server lib tests + 5 new pass.
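The three back-edge conditions can be sketched as follows. This is a Python illustration under stated assumptions: arcs are (src, target) pairs, `done` is the set of terminal steps, and `completed_at` maps step names to comparable completion markers; none of these names come from the server code.

```python
def forward_reachable(arcs, start, goal):
    """True if goal can be reached from start by following arcs forward."""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(dst for src, dst in arcs if src == node)
    return False


def is_back_edge(arcs, src, target, done, completed_at):
    """A matched arc src -> target is a loop back-edge when all three hold."""
    return (
        target in done                               # target is terminal
        and forward_reachable(arcs, target, src)     # target reaches src: a cycle
        and completed_at[src] > completed_at[target] # src finished strictly later
    )
```

The recency check is what separates the two arcs of a 2-cycle: both satisfy the cycle test, but only the arc whose source completed later is the one looping back.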
But kind validation found a second, deeper layer: the loop
re-enters and no longer hangs, yet the loop variable does not
propagate across iterations. set: ctx.X mutations are
recomputed per orchestrator pass from step results and revert to
the workload default whenever the producing step is re-dispatched.
A minimal counter-loop driven by set: ctx.counter: {{ work.next_counter }}
thrashes (work dispatched with counter 0,0,1,0,1,2,… instead
of 0,1,2,3), so offset/cursor pagination still stall. This is
the "event-sourced step re-enter / loop iteration signal" the
issue body anticipated — the Rust orchestrator has no durable ctx
state across passes. Held as draft server#176 (not merged — a clean hang is more debuggable than a non-deterministic thrash); detailed next-step proposal (durable context.set events reconstructed in from_events) recorded on #85. Board #85 → In progress.
Agent: Claude (direct — Rust + e2e fixtures) · Repos touched: noetl/server (PR #175), noetl/e2e (PR #39), noetl/cli (PR #58, doc fix). Merged + pointer-bumped: ai-meta → server 480ba72 (v3.0.4), e2e b0a5c85, cli a3e22ef; issues #83/#84/#86 closed.
Headline. Fresh full-platform regression sweep against the local kind cluster after rebuilding noetl-server to v3.0.3 (the deployed image reported 3.0.2 — missing the #173 container-callback fix; worker already at v5.15.1 99e2c66). Config-driven 36-playbook sweep went 19→27/36 PASS after fixes. Found three distinct orchestrator/platform bugs; fixed two, filed the third.
Bugs found.
- #83 (FIXED, server#175): the fan-in/reduce barrier deadlocked every workflow loop. build_incoming_arcs counted a loop back-edge (check_pagination → fetch_page) as an upstream, so the barrier deferred the loop head forever (Reduce step 'fetch_page' deferring dispatch — 1 of 2 upstream(s) still pending). Fix: exclude back-edges via a new forward_reachable helper. fanout_reduce barrier unaffected.
- #84 (FIXED, server#175): event.name was never populated for arc evaluation, so the canonical when: {{ event.name == "loop.done" }} gate (10+ fixtures) never matched → in-step loop: steps hung after completion. Fix: inject event.name = "loop.done" into a completed loop step's next-arc context. Validated: test_pagination_basic completes.
- #85 (FILED, deep): workflow-arc loops can't re-enter an already-Completed step: the pass-2 is_step_done dispatch guard suppresses the loop-back, so offset/cursor pagination stall on the 2nd iteration (flaky: pass when 1 page suffices). Needs an event-sourced step-reset; left for follow-up. #83 is what lets such loops run their first iteration.
- #86 (FIXED, e2e#39): duckdb fixtures used commands: (plural); the tool field is command / query. Renamed across 4 storage/gcs fixtures; save_all_storage_types green.
- #87 (FILED): multi-tool (workbook-list) step: a later sub-tool can't reference an earlier sibling's output ({{ generate_large.metadata.record_count }} renders empty); surfaces in save_edge_cases (unquoted SQL position), masked in save_all_storage_types (quoted).
Other setup. Re-registered all 5 test credentials (the dev cluster uses NOETL_ALLOW_INSECURE_DEFAULT_KEY → ephemeral per-pod key, so creds need re-registration after each server restart — by design; the stable fix is a fixed NOETL_ENCRYPTION_KEY in noetl-secret). Built + deployed the missing paginated-api test-server (ops/ci/manifests/test-server/) into kind — pagination/retry fixtures depend on it.
After-matrix (27/36). Remaining 9 non-passes: 4 external (GCS×2, OpenAI, github-URL), 2 distributed-mode local-file sources (python_file/postgres_file — script not shipped to worker), 1 negative-test harness artifact (should_error — server's HTTP 400 reject is the expected failure), 2 multi-tool templating (#87). Targeted validators kind_validate_fanout_reduce.sh + kind_validate_container_callback.sh both PASS.
Pointers. server PR #175 (fix/orchestrator-loop-completion, 26 orchestrator unit tests + 2 new). e2e PR #39. Validated against noetl-server v3.0.3 / noetl-worker v5.15.1.
Agent: Claude (direct — Rust, per handoff-routing.md) · Repos touched: noetl/worker (PR #68, v5.15.1, sub-issue worker#67); worker wiki; ai-meta pointer bumps + wiki.
Headline. Fixed noetl/ai-meta#78: a command that failed before tool dispatch (credential-alias resolution, tool-config deserialization) ?-propagated out of CommandExecutor::execute_with_server_url to the dispatch loop, which only logged Command execution failed — no call.error reached the server, so the execution hung at command.started forever (had to be noetl cancel'd by hand).
What landed.
- Typed CredentialResolutionError: terminal (AliasNotFound / Invalid) vs retryable (Transient), classified by type / HTTP status, not by string-matching anyhow messages.
- CredentialHttpError surfaces the credential-fetch HTTP status so classify_fetch_error decides retryability by code: terminal for 404/400/401/403/500, retryable for 408/429/502/503/504 + transport errors.
- CommandExecutor::handle_predispatch_failure emits call.error + command.failed for terminal failures (and retry-exhausted transients via MAX_PREDISPATCH_ATTEMPTS=3); fresh transients emit nothing so the command path's retry runs.
- Folded in the gated noetl-tools dep revert (path = "../tools" → "3", Cargo.lock → 3.1.0); this also unblocks the Docker image build (the path dep can't resolve in the build context).
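The status-based classification can be sketched in Python as below. The status sets and the attempt limit come from the entry; representing a transport error as `None`, treating unlisted statuses as terminal, and the helper `should_emit_failure` are assumptions made for this sketch, not the worker's Rust code.

```python
TERMINAL_STATUSES = {400, 401, 403, 404, 500}
RETRYABLE_STATUSES = {408, 429, 502, 503, 504}
MAX_PREDISPATCH_ATTEMPTS = 3


def classify_fetch_error(status):
    """Classify a credential-fetch failure; status None means transport error."""
    if status is None or status in RETRYABLE_STATUSES:
        return "retryable"
    # Assumption: anything not explicitly retryable is treated as terminal.
    return "terminal"


def should_emit_failure(status, attempt):
    """True when call.error + command.failed should be emitted for this attempt."""
    if classify_fetch_error(status) == "terminal":
        return True
    return attempt >= MAX_PREDISPATCH_ATTEMPTS   # retry-exhausted transient
```

Classifying by code rather than by message text is what lets the real HTTP 500 decryption failure end the execution instead of hanging it.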
Diagnosis correction. The umbrella framed this as a "clean 404" from /api/keychain/..., but the worker has no such call — it resolves aliases via /api/credentials/{alias}. The live pg_noetl_k8s failure is actually an HTTP 500 "Decryption failed: aead::Error". The status-aware classification handles both 404 and the real 500 as terminal.
Validation. cargo build/test (133 lib + 9 integration, +7 new) / clippy green. Kind-val on local Rust stack (rebuilt worker image loaded into kind noetl): test/postgres (pg_noetl_k8s) → call.error → command.failed → playbook.failed (no hang; worker log: "Pre-dispatch failure is terminal; emitted call.error + command.failed ..."); hello_world → playbook.completed.
Pointers. ai-meta → worker 99e2c66; worker wiki 987abba (worker-credentials Pre-dispatch failure handling). PR worker#68; sub-issue worker#67 (auto-closed). Handoff thread 2026-06-09-worker-predispatch-call-error completed directly by Claude (not Codex) and archived.
Agent: Claude (direct — Rust + ops + e2e) · Repos touched: noetl/ops (PR #168), noetl/server (PR #173, v3.0.3), noetl/e2e (PR #38); three pointer bumps in ai-meta; ai-meta wiki.
Headline. The watcher's missing curl (#80) was the named blocker, but fixing it uncovered two more layered bugs beneath it. With all three fixed, kind_validate_container_callback.sh is green on both probes (happy_path → succeeded, oom → failed_oom), closing the last blocker on the #43 container-callback chain.
Root cause (three layers).
- Watcher image / curl: repos/ops/ci/manifests/k8s-watcher/deployment.yaml used the retired bitnami/kubectl:1.30.3 (removed from Docker Hub; the live cluster was patched to the bitnamilegacy archive image as a stopgap) with a runtime apt/apk install jq curl step that never put curl on PATH. Every callback POST hit curl: not found → HTTP 000, so noetl_container_callback_total never bumped.
- Server insert schema: once curl worked, the POST reached the server and 500'd: handlers::container_callback emitted its resume call.done via db::queries::event::insert_event, whose SQL targets attempt + id columns and RETURNING id, none of which exist on the deployed noetl.event (PK (execution_id, event_id)). column "attempt" of relation "event" does not exist. The normal ingestion path (handlers::events) uses the correct column set; only this handler was wired into the stale module.
- OOM path (never functional): the watcher's classify_state() only read Job-level conditions, so it could emit succeeded / failed / failed_timeout but never failed_oom; and the e2e fixture's bytes(40 * 1024 * 1024) is calloc-backed (mapped to the zero page, never faulted in), so the container exited 0 instead of OOMing. A third bug hid behind those: build_body's completed_at fallback for failed Jobs used bare jq now (a numeric Unix epoch the server rejects as a DateTime<Utc> → HTTP 422).
Fixes.
- ops #168 (→ cacc513): image → alpine/k8s:1.30.3 (kubectl + jq + curl baked in), drop the install hack; add classify_pod_failure() reading the backing Pod's status (RBAC already grants pod reads) → failed_oom (OOMKilled) / failed_image_pull (ImagePullBackOff); completed_at fallback → RFC3339 now | todate.
- server #173 (v3.0.3 → 5d2cf58): replace insert_event with an inline INSERT matching handlers::events; terminal outcome rides in a chk_event_result_shape-conforming result envelope ({status, context}), node_type in meta. cargo build, clippy (no new warnings), 7 container_callback unit tests pass.
- e2e #38 (→ 6aaf06e): fixture uses a written-into bytearray(64 MiB) (one byte per 4 KiB page) so the kernel backs every page and the kubelet OOM-kills the pod.
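The page-touching trick in the e2e fixture can be sketched like this. The function name is hypothetical and the 4 KiB page size is taken from the entry, not detected at runtime; the actual fixture code is not reproduced here.

```python
PAGE_SIZE = 4096   # 4 KiB, as stated in the entry; assumed, not queried from the OS


def allocate_and_touch(size_bytes):
    """Allocate a zeroed buffer and write one byte into every page.

    A freshly zeroed allocation may never be faulted in, so it can stay
    far below a container memory limit. Writing into each page forces the
    kernel to back it with real memory.
    """
    buf = bytearray(size_bytes)
    for offset in range(0, size_bytes, PAGE_SIZE):
        buf[offset] = 1
    return buf
```

Calling it with 64 * 1024 * 1024 inside a pod whose limit is lower than that is the shape of the OOM probe; the caller must keep a reference to the returned buffer so it is not freed before the limit is hit.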
Validation. Confirmed the kind cluster enforces memory limits (120 MiB dirty alloc in a 32Mi pod → OOMKilled exit 137). Rebuilt the server image + reloaded into kind; restarted the watcher with the new script. Final run:
kind-val: PASS — happy_path (state=succeeded counter delta = 1)
kind-val: PASS — oom (state=failed_oom counter delta = 1)
kind-val: ALL PROBES PASS — Container Tool Callback chain green
Pointer bumps. ai-meta@811b3da (ops cacc513) + ai-meta@12bb6d6 (server 5d2cf58, v3.0.3) + ai-meta@649c646 (e2e 6aaf06e). #80 closed by the ops pointer-bump commit; #43 (closed design issue) got a closing-the-loop note that acceptance #5 (kind validation) is green.
Agent: Claude (direct — repos/e2e is non-Rust) · Repos touched: noetl/e2e (PR #37); pointer bump in ai-meta; e2e wiki (new Kind-Val Runners page).
Headline. Both scripts/kind_validate_fanout_reduce.sh and scripts/kind_validate_container_callback.sh aborted immediately on error: unrecognized subcommand 'playbook'. They targeted the retired noetl playbook register/execute + noetl execution status/events verbs. The validation logic and the asserted event taxonomy (step.enter / command.completed / node_name / fan-in barrier) were intact — only the CLI invocation layer had drifted.
Root cause. The maintained CLI surface (stable from v2.17.0 through the v4.x repos/cli line) is register playbook / exec / status / query; the runners were never updated when the old playbook / execution command groups were removed.
Fix (e2e#37, squashed to a3594b3).
- noetl playbook register --file F → noetl register playbook --file F.
- noetl playbook execute --path P --output json → noetl exec <catalog-path> --runtime distributed --json (exec by metadata.path, not the bare name).
- noetl execution status --id ID --output json → noetl status ID --json.
- noetl execution events --id ID --output json → noetl query "SELECT … FROM noetl.event WHERE execution_id = ID ORDER BY event_id" --format json (no events verb today; rows wrap under .result; noetl.event has no timestamp column so the barrier-ordering assertion compares event_id, replacing the removed created_at).
- Fail-fast CLI-surface guard added to each runner (names the missing verb + installed version).
- container_callback: default NOETL_SERVER_DEPLOY → noetl-server-rust; run_probe now takes the catalog path explicitly.
Validation (local kind, server-rust v3.0.1 + worker-rust, :8082).
- fanout_reduce: PASS start-to-finish, no manual workaround (final COMPLETED; exactly one step.enter for reduce_customer; reduce command.completed after both upstreams).
- container_callback: now drives register → exec → terminal COMPLETED cleanly; stops at the metric-delta assertion because the deployed noetl-k8s-watcher image lacks curl (watcher.sh: curl: not found → callback POST HTTP 000). Cluster-side watcher gap, tracked on #80; not the CLI surface.
Version-skew decision. PATH binary is noetl 2.17.0; repos/cli submodule is v4.10.0. The targeted surface is identical across both, so the runners work on either. The installed binary lags the submodule by a major-version line — worth refreshing for general parity, but not required for these runners.
Wiki. New Kind-Val Runners page on the e2e wiki documents both runners + the CLI-surface contract; linked from Home.
Pointer: e2e → a3594b3; e2e wiki → 59413b1.
2026-06-10 (#82 closed — GUI credential View/Edit recovered for pre-wallet records; e2e fixture dedup + dev:kind script landed)
Agent: Claude (direct — repos/gui/repos/e2e are non-Rust) · Repos touched: noetl/gui (PRs #36, #35; v1.11.0 + v1.11.1), noetl/e2e (PR #36); pointer bumps in ai-meta; wiki
Headline. After the Secrets Wallet migration (#61) the credential section couldn't display some credentials on View or Edit — they failed with a generic toast. Root cause is not a GUI API-shape change: the wallet moved credential storage to forward-only envelope encryption (repos/server/src/services/credential.rs — "Forward-only — there is no legacy single-master-key path; a pre-wallet record must be re-registered"). GET /api/credentials/{id}?include_data=true calls cipher.open_storage_json(&entry.data); records sealed under the old static/all-zeros key can't be unwrapped by the new KEK, so the server returns 500 Decryption failed: aead::Error. Proven live: all 21 stored (pre-wallet) credentials 500 on include_data=true, while a freshly-POSTed credential reads back its data exactly as before. The GUI met that 500 with a dead-end toast — View silently failed, Edit never opened.
- GUI credential recovery (repos/gui/src/components/Credentials.tsx + styles/Credentials.css, gui#36, v1.11.0; closes #82): View surfaces the real reason and points to Edit as the recovery path; Edit still opens the modal using the metadata already in the list row (name/type/description/tags) + a warning banner, with an empty-but-required data field; re-entering the secret and saving POSTs it back, re-sealing the record under the current wallet. Happy path (wallet-era credentials) unchanged.
- dev:kind convenience script (repos/gui, gui#35, v1.11.1): VITE_API_MODE=direct + skip-auth Vite target that talks straight to the kind server on :8082, plus README.
- e2e fixture dedup (repos/e2e/.../tooling_non_blocking.yaml, e2e#36): removed a duplicate enable_snowflake_probe / enable_nats_kv_probe pair in the same workload: mapping (a serde_yaml dup-key defect; last-wins under PyYAML). Canonical declarations + {{ workload.* }} references intact; no behavior change.
npm run type-check + npm run build + npm test (4/4) green. Live against the kind cluster (server :8082) + npm run dev:kind UI on :3001 via Playwright: View on a legacy credential → 500 caught → decrypt-specific toast (no silent failure); Edit on a legacy credential → modal opens with the warning banner, name (pg_local) + type (PostgreSQL) preserved, data field empty + required. The re-save → re-seal loop confirmed at the API layer (a freshly-POSTed credential reads back its data). e2e fixture: yaml.safe_load parses clean, both flags resolve to False, references + auth aliases intact.
- ai-meta → gui 8cacc9e (v1.11.1); ai-meta → e2e 4a9ffbc.
- The worker Cargo.toml path-dep override (noetl-tools = { path = "../tools" }) remains in the working tree, untouched; gated on #78.
2026-06-10 (#81 closed — noetl-server v3.0.2 fixes the container-tool command type contradiction; kind-val GREEN)
Agent: Claude (direct, Rust per agents/rules/handoff-routing.md) · Repos touched: noetl/server (PR #172, v3.0.2); pointer bump in ai-meta; wiki
Headline. The container tool kind couldn't execute with any command value — the server and worker disagreed on the type. Server ToolSpec.command was Option<String> (scalar): an array command: ["/bin/sh", "-c"] failed the ToolDefinition untagged-enum match at deserialise time (400 Bad Request: data did not match any variant of untagged enum ToolDefinition). Worker ContainerConfig.command is Option<Vec<String>> (a sequence): a scalar cleared the server but was then rejected by the worker (Invalid container config: invalid type: string, expected a sequence). No value satisfied both schemas, so the container-callback chain couldn't be exercised — no K8s Job was ever created.
- Fix (repos/server/src/playbook/types.rs, server#172, v3.0.2): typed ToolSpec.command as Option<serde_json::Value> — the same treatment args already gets and which already decoded its array fine. A scalar stays a JSON string for the shell/db consumers (db tools read it via query's #[serde(alias = "command")]); an array passes through unchanged to the worker's ContainerConfig.command: Option<Vec<String>>. ToolCall::from_spec now forwards command verbatim instead of wrapping it in serde_json::Value::String(...). Worker side (noetl-tools) needed no change.
- Tests: 2 new regression tests (test_container_array_command_decodes_and_passes_through, test_scalar_command_stays_a_string); cargo test --lib playbook::types 18/18; clippy clean on the change.
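The type contradiction and its fix can be modelled in a few lines. This is a Python sketch, not the Rust in types.rs; the three validator functions are hypothetical stand-ins for the serde decode steps described above.

```python
from typing import Any


def server_accepts_pre_fix(command: Any) -> bool:
    # Stand-in for the pre-fix ToolSpec.command: Option<String>.
    # Only an absent value or a scalar string decodes.
    return command is None or isinstance(command, str)


def server_accepts_post_fix(command: Any) -> bool:
    # Stand-in for the post-fix ToolSpec.command: Option<serde_json::Value>.
    # Any JSON value decodes and is forwarded verbatim.
    return True


def worker_accepts(command: Any) -> bool:
    # Stand-in for ContainerConfig.command: Option<Vec<String>>.
    return command is None or (
        isinstance(command, list) and all(isinstance(c, str) for c in command)
    )


candidates = ["/bin/sh -c", ["/bin/sh", "-c"]]

# Values that clear both the server and the worker schema.
pre_fix_ok = [c for c in candidates if server_accepts_pre_fix(c) and worker_accepts(c)]
post_fix_ok = [c for c in candidates if server_accepts_post_fix(c) and worker_accepts(c)]
```

Under these assumptions neither candidate clears both schemas before the fix, and only the array form clears both after it, which matches the 400 / "expected a sequence" pair in the headline.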
Built localhost/noetl-server-rust:dev with the fix, loaded into kind via image-archive, rolled noetl-server-rust (server only — no full-stack redeploy). Reproduced the baseline 400 against the pre-fix server, then re-ran container_callback_happy_path (command: ["/bin/sh","-c"]) against the fixed server:
- Server accepted it (no 400); execution 323132934439571456 dispatched.
- Worker logs: container.dispatch → container.dispatched (no "expected a sequence").
- K8s Job noetl-container-dispatchcontainer-323132934439571456-kdrq7 reached Complete 1/1. Pre-fix kubectl get jobs stayed empty.
The chain's terminal-state counter-bump validation (succeeded / failed_oom) still depends on the resume path + #79 (runner-CLI refresh) / umbrella #43.
- Server pointer bumped: repos/server → bd36672 (v3.0.2).
- Closed noetl/ai-meta#81; board 3 → Done.
- Dashboard reconciled: added open #79 + #80 to Home's Active-umbrellas table (were missing).
2026-06-09 (Full e2e regression sweep — v3.0.1 server + v3.1.0 tools: regression-clean; filed worker bug #78)
Agent: Claude (direct) · Repos touched: none (validation + issue filing); credentials registered in the running kind cluster (pg_k8s, pg_local)
Headline. Ran the config-driven regression set (repos/e2e/fixtures/playbook_test_config.yaml, 36 enabled playbooks) against the Rust-only kind stack (noetl-server-rust :dev v3.0.1 code, noetl-worker-rust :dev local tools v3.1.0, Python scaled to 0). Harness: catalog register → POST /api/execute → poll status → compare to the config's expected execution_status. Conclusion: no platform regression from the v3.0.1/v3.1.0 cleanup.
- 18 confirmed PASS covering every code path the cleanup touched: core (hello_world, start/end-with-action), control flow / when: / tojson (control_flow_workbook, weather_control_flow), variables (5× test_vars_*), retries (http_retry_*, python_retry_exception, postgres_retry_connection, duckdb_retry_query), nested-playbook composition (playbook_composition), postgres+http (http_to_postgres_simple).
- Every non-pass root-caused to environment / fixture / harness, not the platform:
  - CLI v2.17.0 path-derivation quirk (bare-basename → 404 when a <name>.yaml sibling exists); the server has them registered and POST /api/execute runs them.
  - External mock-HTTP (api_url flaky server) / GCS not present in kind (pagination, retry_simple_config, python_http_example, python_gcs_example, duckdb_gcs_workload_identity).
  - Missing worker-FS script file (python_file_example).
  - Missing pg_local credential (registered mid-run → postgres fixtures then reach a clean terminal).
  - Fixture SQL drift (save_*: truncate_tables expects simple_test_flat which create_tables doesn't create — surfaced cleanly as SQLSTATE 42P01, i.e. the postgres-observability fix works).
Genuine bug found → filed #78
The missing-pg_local case exposed a worker error-propagation gap: a pre-dispatch failure (credential-alias resolution) ?-propagates out of CommandExecutor::execute_with_server_url; the dispatch loop (repos/worker/src/worker.rs:306) only logs "Command execution failed" and emits no call.error → the execution hangs at command.started forever instead of reaching FAILED. Violates agents/rules/no-default-connection.md. Confirmed by before/after credential registration (missing → hang; present → clean terminal). Pre-existing; independent of the v3.0.1/v3.1.0 work. Labelled ai-task + repo:worker + bug; on roadmap board 3 (Todo).
- Regression result posted on #49.
- ai-meta main pushed earlier this session (b2e8e83): tools v3.1.0 + server v3.0.1 + ai-meta-wiki pointer bumps.
Agent: Claude (direct) · Repos touched: noetl/tools (v3.1.0, PR #47), noetl/server (v3.0.1, PR #171)
Headline. Stripped the diagnostic tracing::debug! scaffolding added during the prior-session e2e triage, kept the production fixes, opened + merged the two PRs, bumped pointers. Tracks the #49 Rust-server parity umbrella.
- noetl-tools v3.1.0 (noetl/tools#47 MERGED):
  - YAML boolean when: true in policy rules now checks as_bool() before the string-template fallthrough — Value::Bool(true).as_str() returns None, so the pre-fix code never matched.
  - |tojson fallback: when a single {{ expr }} template renders a complex object, minijinja emits Python-style repr (True/None); the engine retries with |tojson appended to produce valid JSON.
  - UndefinedBehavior::Chainable on the tools template engine — matches the server's permissive undefined-variable behaviour so {{ iter.item }} returns undefined rather than throwing.
  - Test hygiene: two tests (test_render_value_roundtrip_complex_object, test_policy_rules_extraction_from_wire_format) were outside the mod tests {} block — moved inside, removed eprintln! debug prints.
- noetl-server v3.0.1 (noetl/server#171 MERGED):
  - Result-store PUT body limit raised to 64 MB (DefaultBodyLimit::max(64 * 1024 * 1024)) — Axum's default 2 MB limit was rejecting 15 MB+ result payloads with HTTP 413.
  - render_pipeline_config stashes set/args/spec/command blocks before Tera rendering and restores them after.
  - iter namespace map in build_iteration_command so {{ iter.item }} resolves during fan-out.
  - cmd_render_ctx uses command.context.clone().unwrap_or_default() instead of the raw orchestrator context, so per-command overrides propagate.
  - Stripped diagnostic tracing::debug! blocks from commands.rs + execute.rs.
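The when: true fix in the tools list above comes down to checking the boolean case before the string case. A minimal Python sketch of that ordering follows; as_str / as_bool imitate the serde_json::Value accessors, while when_matches_pre_fix, when_matches_post_fix and the render callable are hypothetical names introduced for illustration.

```python
from typing import Any, Callable, Optional


def as_str(value: Any) -> Optional[str]:
    # Imitates serde_json::Value::as_str: a result only for string values.
    return value if isinstance(value, str) else None


def as_bool(value: Any) -> Optional[bool]:
    # Imitates serde_json::Value::as_bool: a result only for boolean values.
    return value if isinstance(value, bool) else None


def when_matches_pre_fix(when: Any, render: Callable[[str], bool]) -> bool:
    # Pre-fix shape: only the string-template path exists, so a YAML
    # boolean falls through and never matches.
    template = as_str(when)
    return template is not None and render(template)


def when_matches_post_fix(when: Any, render: Callable[[str], bool]) -> bool:
    # Post-fix shape: a literal boolean is honoured first, then the
    # string-template fallthrough runs as before.
    literal = as_bool(when)
    if literal is not None:
        return literal
    template = as_str(when)
    return template is not None and render(template)


def always_true(template: str) -> bool:
    # Hypothetical renderer that treats every template as truthy.
    return True
```

With this ordering a rule written as when: true matches without going through the template engine at all, and string templates behave exactly as before.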
All 7 e2e sweep playbooks PASS on the Rust-only kind stack (Rust server + Rust worker, Python scaled to 0) — confirmed in the prior session before the cleanup; these PRs are logging-only deltas on top of the validated source trees.
- ai-meta@316048c — chore(sync): bump tools to d294a6c (v3.1.0)
- ai-meta@6590bd6 — chore(sync): bump server to 33789b0 (v3.0.1)
- Commented landing on #49; corrected the stale Refs #69 citation the PRs carried (#69 is the closed worker output_select bug, not these fixes).
repos/worker/Cargo.toml still pins noetl-tools = { path = "../tools" } for local testing. Reverting to noetl-tools = "3" (crates.io) is blocked on noetl-tools v3.1.0 publishing to crates.io — currently crates.io is at v3.0.0; the v3.1.0 release commit carries [skip ci], so the publish CI didn't fire. Reverting now would regress the worker to the pre-fix v3.0.0.
2026-06-09 (#77 CLOSED — all 5 PRs merged: noetl-tools v3.0.0 + noetl-server v3.0.0 + e2e#35 + cli v4.10.0 + worker#66)
Agent: Claude (direct) · Repos touched: noetl/tools (v3.0.0), noetl/server (v3.0.0), noetl/e2e (PR #35), noetl/cli (v4.10.0, PR #57), noetl/worker (PR #66)
Headline. Post-merge propagation for #77 — explicit input:/set: forward-only data binding. All five PRs merged; all pointer bumps landed. Umbrella CLOSED, board → Done.
- noetl-tools v3.0.0 (noetl/tools#45 MERGED): breaking — task_sequence pipeline reads each sub-tool's input: map and injects key-value pairs as template variables. _prev/_results positional workaround removed.
- noetl-server v3.0.0 (noetl/server#169 MERGED): breaking — render_pipeline_config replaces render_value_deferring. Server no longer defers _prev/_results templates.
- e2e fixture migration (noetl/e2e#35 MERGED): all 13 YAML fixtures rewritten from _prev references to set:/input: pattern.
- CLI dep bump (noetl/cli#57 MERGED): noetl-tools 2.x -> 3, noetl-executor 0.4.x -> 0.5.0.
PASSES for #77 scope. Server orchestrator correctly renders pipeline config and propagates set: values via the new path. Template resolution failure on {{ ctx.test_var }} is a pre-existing Rust worker ctx-namespace gap, not a #77 regression.
- Tools: 10cc751 -> fdbc407 (v3.0.0)
- Server: 54fecd3 -> 0f8dc63 (v3.0.0)
- E2E: 6b3a52a -> f6a9a93 (PR #35)
- CLI: 77be8be -> c73f99d (PR #57)
- Worker: 02f18d5 -> 8dd653b (PR #66, dep bump noetl-tools 3.0.0 + noetl-executor 0.5.0)
- Umbrella page: Umbrella: Explicit Input Binding
- Issue #77 CLOSED. Board → Done.
2026-06-08 (noetl-server v2.63.0 — step-level set: + ctx shims MERGED; e2e fixture _prev template fixes MERGED; #77 opened)
Agent: Claude (direct) · Repos touched: noetl/server (PR #168 MERGED — v2.63.0), noetl/e2e (PR #34 MERGED)
Headline. Server PR #168 merged (v2.63.0) — ctx/workload namespace shims + step-level set: mutation support. Fixed the last 2 e2e fixture failures (http_to_postgres_simple + save_simple_test) by applying the _prev deferred template pattern for multi-tool steps. E2E PR #34 merged; pointer bumped. Opened #77 — explicit input: binding for multi-tool step arguments (replaces the positional _prev workaround with proper data-binding).
- Server v2.63.0 (noetl/server#168 MERGED):
  - with_ctx_shims() helper at 7 orchestrator call sites.
  - set_vars field on Step struct — step-level set: mutations applied after each completed step.
  - Kind-validated in prior session; deployed to kind cluster.
- E2E fixture _prev fixes (noetl/e2e#34 MERGED):
  - http_to_postgres_simple: merged separate transform + execute steps into single multi-tool step using {{ _prev.id_0 }} etc. (execution 322618534888738816 → COMPLETED).
  - save_simple_test: fixed both postgres_flat_test ({{ _prev.data.record_id }}) and postgres_nested_test ({{ _prev.nested_id }}) — both previously used tool-name references that resolved to empty server-side (execution 322619363913895936 → COMPLETED, affected_rows: 1).
- #77 opened — explicit input: binding for multi-tool step arguments. The _prev pattern in PR #34 is a positional workaround; the proper fix is task_sequence in noetl-tools extracting each sub-tool's input: map and injecting those key-value pairs as template variables for the tool's own command: / query:.
The Rust server's template engine (render_value_deferring() in jinja.rs) pre-renders ALL Jinja references except _prev and _results before dispatching task_sequence to the worker. Cross-tool references like {{ generate_data.data.record_id }} resolve to empty/null server-side because tool names aren't in the server's context. The correct pattern for multi-tool steps is {{ _prev.property }} — deferred by the server, resolved by the worker's task_sequence at runtime. The long-term fix is #77 — explicit input: blocks on each tool.
- Server: da867c4 → 54fecd3 (v2.63.0)
- E2E: 8f807c8 → 6b3a52a (PR #34 merged)
- New rule: agents/rules/no-default-connection.md
- New issue: #77 — explicit input binding
Agent: Claude (direct) · Repos touched: noetl/server (PR #168 OPEN — v2.64.0 on branch fix/ctx-shim-orchestrator-eval)
Headline. Two orchestrator bugs fixed + full 38-playbook e2e validation sweep on the Rust-only kind stack (Python scaled to 0). 25 of 27 testable playbooks PASS (92.6% excluding external-dep fixtures). PR #168 carries both fixes; awaiting merge.
- ctx/workload namespace shims (commit b05f978): with_ctx_shims() helper adds ctx.* and workload.* namespace entries to the Jinja evaluation context at all 7 orchestrator call sites. Fixes {{ ctx.probe_delays }} and similar expressions that resolved to undefined.
- Step-level set: mutation support (commit 48cb008): Added set_vars field to Step struct (#[serde(default, rename = "set")]). Orchestrator applies step-level set: mutations after each completed step. Previously step-level set: YAML was silently dropped during deserialization — root cause of pagination_basic failure (19 events → now 45+ events, progresses past the loop).
| Tier | PASS | FAIL (ext) | TIMEOUT (ext) | Bug |
|---|---|---|---|---|
| 1 — Core | 9/10 | — | — | 1 (test_storage_tiers artifact _ref) |
| 2 — Control flow | 9/10 | — | 1 (OpenAI key) | — |
| 3 — HTTP/postgres | 4/10 | 3 (fixture) | 3 (ext HTTP server) | — |
| 4 — Composition | 6/9 | 2 (fixture) | — | 1 (YAML parse) |
| Total | 25/38 | 5 | 4 | 1 server + 1 YAML + 2 fixture |
Performance highlights: heavy_loop_aggregation 526 events in 31.89s (~16.5 events/s sustained); median simple-playbook time ~1.45s.
- Server branch: fix/ctx-shim-orchestrator-eval @ 48cb008
- PR: noetl/server#168 (OPEN)
- Performance report: noetl/ai-meta#49 comment
Agent: Claude (direct) · Repos touched: noetl/tools (PR #44 MERGED, v2.24.2 released), noetl/server (PR #167 MERGED, v2.62.1 released — clippy cleanup; #22 closed)
Headline. Housekeeping session — cleared the clippy -D warnings CI gate on both noetl-tools (15 warnings across 7 files) and noetl-server (14 clippy categories across 17 files). All mechanical lint fixes, zero behavior changes. Closed stale noetl/server#22 (Phase D orchestrator engine port sub-issue — all rounds shipped).
- noetl-tools v2.24.2 (noetl/tools#44; closes tools#42):
  - 15 clippy fixes: unused bindings, dead code suppression, identical if/else simplification, doc indent, .clamp() over .min().max(), const assertion blocks.
  - 7 files touched: script.rs, snowflake.rs, task_sequence.rs, transfer.rs, artifact.rs, mcp.rs, nats.rs.
  - 288 lib tests pass; no behavioral changes.
- noetl-server v2.62.1 (noetl/server#167; closes server#161):
  - 14 clippy categories resolved across 17 files: Box::new wrapping for large enum variants, u32 type mismatches, too_many_arguments allow, and other mechanical lint fixes.
  - No behavioral changes; PATCH bump only.
- noetl/server#22 closed — Phase D orchestrator engine port complete (all 6 rounds shipped: R1–R3c + R4-5 sharding).
- repos/tools 10cc751 (v2.24.1) → 56225ca (v2.24.2)
- repos/server 2430bc2 (v2.62.0) → da867c4 (v2.62.1)
2026-06-08 (#76 closes — noetl-server v2.62.0 ships sequential-mode iterator dispatch; first Claude-direct Rust PR under the new rule)
Agent: Claude (direct) · Repos touched: noetl/server (PR #166 MERGED, v2.62.0 released)
Headline. #76 closes — first Rust-submodule change Claude authored directly per agents/rules/handoff-routing.md (codified earlier this session to stop the Codex handoff pattern). Sequential-mode iterator dispatch: when a step declares loop.spec.mode: sequential (or omits mode:, since Sequential is the #[default]), the orchestrator serializes per-iteration commands instead of fanning out all iterations at once. This fixes the remaining 2-of-4 process_users iterations that stalled during the #75 kind-val on playbook_composition.yaml — child playbooks were contending for worker slots because all 4 iterations dispatched simultaneously.
- noetl-server v2.62.0 (noetl/server#166; closes noetl/ai-meta#76):
  - LoopMode enum (Sequential default / Parallel) in playbook/types.rs; LoopSpec.mode parsed from playbook YAML loop.spec.mode.
  - StepInfo.iterations_dispatched field tracks command.issued count for the sequential dispatch guard.
  - Sequential dispatch pattern: dispatch iteration 0 at fan-out; on each command.completed, check iterations_dispatched == iterations_completed() and dispatch next.
  - Existing parallel-mode tests updated to explicitly set mode: Parallel.
  - 3 new tests pinning sequential behavior; lib tests pass; release build + clippy clean.
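The sequential dispatch guard described above can be sketched as a small state machine. This is an illustrative Python model, not the orchestrator code: the StepInfo fields beyond iterations_dispatched, the method names, and the extra bound check against the total are assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StepInfo:
    total: int                      # number of loop iterations (assumed known at fan-out)
    iterations_dispatched: int = 0  # count of command.issued events
    completed: int = 0              # count of command.completed events
    log: List[str] = field(default_factory=list)

    def iterations_completed(self) -> int:
        return self.completed

    def _dispatch_next(self) -> None:
        self.log.append(f"command.issued[{self.iterations_dispatched}]")
        self.iterations_dispatched += 1

    def fan_out(self) -> None:
        # Sequential mode: only iteration 0 is issued at fan-out.
        if self.total > 0:
            self._dispatch_next()

    def on_command_completed(self) -> None:
        self.completed += 1
        # Guard: issue the next iteration only when everything issued so far
        # has completed and there is still work left.
        if (
            self.iterations_dispatched == self.iterations_completed()
            and self.iterations_dispatched < self.total
        ):
            self._dispatch_next()
```

In this model at most one iteration is in flight at any time, which is the property that stops child playbooks from contending for worker slots.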
- test/loop → COMPLETED (5/5 steps, 0 failures) — parallel iterator regression confirmed working.
- tests/iterator_save_test → COMPLETED (4 steps, 0 failures) — iterator save regression confirmed.
- Server submodule pointer bumped: 12eb2b8 → 2430bc2.
- repos/server 12eb2b8 → 2430bc2 (v2.62.0)
2026-06-08 (#75 closes — noetl-tools v2.24.1 ships PlaybookTool polling fix; 7th + final Codex Rust handoff before rule change)
Agent: Claude (dispatcher) + Codex (executor) · Repos touched: noetl/tools (PR #43 MERGED, v2.24.1 released), noetl/worker (PR #65 MERGED for lockfile bump), noetl/ai-meta (agents/rules/handoff-routing.md codified)
Headline. #75 closes — Codex handoff #7 (the day's last Rust-via-Codex round) fixed the PlaybookTool::execute polling loop's terminal-status check. Pre-fix it read payload.completed/payload.failed as booleans, but the status endpoint returns payload.status as a string — both lookups always returned false, so child playbooks dispatched with return_step: end timed out at 300s instead of returning their result. Post-fix, the kind-val on playbook_composition.yaml shows 2 of 4 iterations completing end-to-end with save_profile.status: COMPLETED from the child playbooks (was: zero completions pre-fix; child-side {status: "timeout"} shape).
Rule change codified mid-session: after this PR opened, the user said "stop using codex for rust code - do it yourself". Codified as agents/rules/handoff-routing.md. All future Rust-submodule changes (repos/cli, repos/server, repos/worker, repos/tools, repos/doctor, repos/gateway) Claude does directly via Read/Edit/Bash — no Codex handoff prompt, no Agent dispatch. Codex remains in play for non-Rust work (Python, YAML, docs in non-Rust submodules) but those are rare.
- noetl-tools v2.24.1 (noetl/tools#43; closes noetl/ai-meta#75):
  - Extract PlaybookTool::is_terminal_status(payload) helper so the check is unit-testable without HTTP mocks.
  - Replace the two boolean lookups with a string-status check: payload.status == "COMPLETED" | "FAILED" | "CANCELLED", plus the is_cancelled: true fallback.
  - Single file touched (src/tools/playbook.rs); 132 lines added / 9 removed.
  - 7 new unit tests; lib 288 passed / 0 failed (was 281/0); release build + clippy clean.
  - PATCH bump (commit prefix fix:).
  - Authored on a Codex handoff at handoffs/active/2026-06-08-tools-playbook-polling-fix/.
- noetl-worker PR #65 (merged): Cargo.lock bump to 2.24.1. Cargo.toml's ^2.24 caret already covered it; only the lockfile changed. Claude-authored directly (post-rule-change).
- agents/rules/handoff-routing.md (commit e5f53fe): codifies "Claude writes Rust directly; no Codex dispatch for Rust" so it survives compaction.
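The terminal-status change is easy to see side by side. The Python sketch below models the before and after checks on a status payload; the function bodies are an illustration of the described behaviour, not a port of src/tools/playbook.rs, and is_terminal_pre_fix is a name invented here.

```python
TERMINAL_STATUSES = ("COMPLETED", "FAILED", "CANCELLED")


def is_terminal_pre_fix(payload: dict) -> bool:
    # Pre-fix shape: reads boolean fields the status endpoint never sends,
    # so the poll loop never sees a terminal state.
    return payload.get("completed") is True or payload.get("failed") is True


def is_terminal_status(payload: dict) -> bool:
    # Post-fix shape: string status check, plus the is_cancelled fallback.
    if payload.get("status") in TERMINAL_STATUSES:
        return True
    return payload.get("is_cancelled") is True


done = {"status": "COMPLETED"}
running = {"status": "RUNNING"}
cancelled_flag = {"status": "RUNNING", "is_cancelled": True}
```

Because the helper takes only the payload, it can be exercised without HTTP mocks, which is the point of extracting it.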
Built localhost/noetl-worker-rust:v5.15.0-tools2241 via podman, loaded into kind, rolled deploy. Re-ran tests/playbook_composition/playbook_composition.yaml (kind exec 322425980322844672):
- 2 of 4 process_users iterations completed end-to-end. Their call.done payloads carry save_profile.status: COMPLETED from the child playbooks (was: zero completions; {status: "timeout"} shape pre-fix).
- 4 child executions visible in noetl.event with parent_execution_id = 322425980322844672 (but NOT in noetl.execution — separate parity gap from the Rust orchestrator not populating that table).
- Remaining 2 iterations stuck on a separate concern (likely worker concurrency under sequential-mode iterator dispatch, or per-iteration child stall). #76 filed as the follow-up.
- repos/tools: b6b80ce (v2.24.0) → 10cc751 (v2.24.1).
- repos/worker: 802954b → be431a5 (lockfile bump; no version tag).
6 issues closed today: #70, #71, #72, #73, #74, #75 (umbrellas); the tools clippy follow-up (noetl/tools#42) and noetl/server#161 are still open. 7 Codex Rust handoffs + the rule change that stops it from being 8. 5 server releases (v2.58.0 → v2.59.0 → v2.60.0 → v2.61.0 → v2.61.1), 2 tools releases (v2.24.0 → v2.24.1), worker lockfile bumps to track.
2026-06-08 (#72 closes — noetl-server v2.61.1 ships honest in-flight check in get_status via Codex handoff)
Agent: Claude (dispatcher) + Codex (executor) · Repos touched: noetl/server (PR #165 MERGED, v2.61.1 released), noetl/ai-meta
Headline. #72 closes — sixth Codex handoff of the day fixes three compounding bugs in ExecutionService::get_status that caused /api/executions/{id}/status to return COMPLETED while iterator commands were still in-flight. Kind re-val on playbook_composition.yaml confirms the endpoint now correctly reports RUNNING with running_steps: 3 while the 4 process_users iterations are pending (was: COMPLETED, running_steps: 0 pre-fix).
- noetl-server v2.61.1 (noetl/server#165; closes noetl/ai-meta#72):
  - Fix 1: running_steps SQL filter changed from status='RUNNING' to event_type IN ('command.claimed', 'command.started') AND status IN ('RUNNING', 'STARTED'). Workers emit command.started with status='STARTED', so the old filter returned 0 even when N commands were mid-execution.
  - Fix 2: New in_flight_commands SQL query counts non-terminal rows in noetl.command using pool_for(execution_id) for sharding consistency.
  - Fix 3: COMPLETED branch now requires both stats.1 == stats.0 && stats.0 > 0 && in_flight_commands.0 == 0. Either signal alone trips prematurely (event-log signal when iterator steps fire one step.enter for N commands; command-table when projection lags). Requiring both is honest.
  - 1 file touched (src/services/execution.rs); 160 lines added / 8 removed.
  - 4 new unit tests; lib 598 passed / 0 failed (was 594/0); release build + clippy clean.
  - Authored on a Codex handoff at handoffs/active/2026-06-08-server-status-endpoint-fix/.
  - PATCH bump (commit prefix fix:).
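Fix 3's two-signal rule can be written out as a small decision function. This Python sketch covers only the COMPLETED-versus-RUNNING choice; the real get_status has other branches (failure, cancellation) that are not modelled, and derive_status is a hypothetical name.

```python
def derive_status(total_steps: int, completed_steps: int, in_flight_commands: int) -> str:
    # Event-log signal: every entered step has completed, and at least one exists
    # (stats.1 == stats.0 && stats.0 > 0 in the list above).
    event_log_done = total_steps > 0 and completed_steps == total_steps
    # Command-table signal: no non-terminal rows remain for this execution.
    command_table_idle = in_flight_commands == 0
    # Both signals must agree before the execution is reported as COMPLETED.
    return "COMPLETED" if event_log_done and command_table_idle else "RUNNING"
```

Applied to the iterator case, total_steps = 2 and completed_steps = 2 with pending iteration commands still yields RUNNING; the in-flight count used to exercise that case (4, one per process_users iteration) is an assumption, since the log records running_steps rather than the raw command count.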
Built localhost/noetl-server-rust:v2.61.1 via podman, loaded into kind, rolled deploy. Re-ran tests/playbook_composition/playbook_composition.yaml (kind exec 322392447093051392):
| Snapshot | status | running_steps | total_steps | completed_steps |
|---|---|---|---|---|
| t+2s | RUNNING | 3 | 2 | 2 |
| t+8s | RUNNING | 3 | 2 | 2 |
| t+30s | RUNNING | 3 | 2 | 2 |
Pre-fix would have returned COMPLETED, running_steps: 0 at t+2s already. Cancelled the stuck execution after verifying the fix.
The 4 process_users commands still never reach command.completed — the task_sequence → tool: kind: playbook dispatch path stalls. That's the original cause of the kind exec hanging; this PR only fixed the status endpoint's lying. Filing as a separate follow-up issue.
- repos/server: c01d3ce (v2.61.0) → 12eb2b8 (v2.61.1).
2026-06-08 (#74 closes — noetl-server v2.61.0 exposes ctx/workload namespace shim via Codex handoff; test_args_passing fully green; #73→#74 chain complete)
Agent: Claude (dispatcher) + Codex (executor) · Repos touched: noetl/server (PR #164 MERGED, v2.61.0 released), noetl/ai-meta
Headline. #74 closes and the #73→#74 orchestrator-context chain completes. Codex handoff #5 of the day wraps CommandBuilder::build_command + build_iteration_command render contexts with ctx + workload namespace shims mirroring Python's commands.py:915-916. Kind re-val on test_args_passing.yaml is now fully green — every step command.completed | success, {{ ctx.test_var }} resolves to 100, all assertions pass.
- noetl-server v2.61.0 (noetl/server#164; closes noetl/ai-meta#74):
  - In build_command + build_iteration_command: clone the incoming context into render_ctx, use entry().or_insert_with() to add ctx + workload keys pointing at a serde_json::Value::Object view of the original context.
  - entry().or_insert_with() preserves any pre-existing workload binding (which generate_initial_commands populates with the structured YAML workload block at execute.rs:453) so {{ workload.session_token }} keeps working.
  - Command.context persists the original flat context (not the shimmed render_ctx) — keeps event-log payloads compact.
  - Iteration shim runs AFTER iterator-var insertions so {{ ctx.<item_var> }} also resolves.
  - 1 file touched; 210 lines added / 5 removed.
  - 5 new unit tests; lib 594 passed / 0 failed (was 589/0).
  - Authored on a Codex handoff at handoffs/active/2026-06-08-server-ctx-namespace-shim/.
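The shim behaviour can be approximated in Python, where dict.setdefault plays the role of entry().or_insert_with(). This is a sketch under that analogy; build_render_ctx is a hypothetical name and the sample keys are taken from the fixtures mentioned in this entry.

```python
def build_render_ctx(context: dict) -> dict:
    # Clone the incoming flat context; the original is what gets persisted.
    render_ctx = dict(context)
    # A view of the original context to hang under the namespace keys.
    view = dict(context)
    # setdefault only inserts when the key is absent, so a pre-existing
    # structured workload block is preserved.
    render_ctx.setdefault("ctx", view)
    render_ctx.setdefault("workload", view)
    return render_ctx


flat_context = {"test_var": 100, "workload": {"session_token": "abc"}}
render_ctx = build_render_ctx(flat_context)
```

Here render_ctx["ctx"]["test_var"] resolves to 100, the existing workload block is left alone, and flat_context itself gains no ctx key, which mirrors the "persist the flat context, render against the shimmed one" split.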
Built localhost/noetl-server-rust:v2.61.0 via podman, loaded into kind, rolled deploy. Re-ran tests/test_args_passing (kind exec 322284179205132288):
| event_type | status | node_name |
|---|---|---|
| command.completed | success | start |
| command.completed | success | use_vars |
| command.completed | success | end |
| playbook.completed | COMPLETED | playbook |
Dispatch context for use_vars now carries tool_config.args: {test_var: 100, computed: 200} (was {test_var: null, computed: null} on v2.60.0). Worker stdout: test_var=100, computed=200, result=300\n. All Python assertions pass (assert test_var == 100; assert computed == 200; assert result == 300).
Three Codex handoffs delivered the orchestrator-context plumbing across the day:
- #73 gap 1 (PR #162, v2.59.0): iterator fan-out at the initial-dispatch path. loop_test.yaml fully green.
- #73 gap 2 (PR #163, v2.60.0): apply_set_mutations at orchestrator dispatch with ctx. / iter. / step. scope-strip. actions_test.yaml partially recovers; test_args_passing.yaml still fails because of missing namespace shim.
- #74 (PR #164, v2.61.0): ctx + workload namespace shim in render context, mirroring Python commands.py:915-916. test_args_passing.yaml fully green.
- repos/server: 084cad4 (v2.60.0) → c01d3ce (v2.61.0).
2026-06-08 (#73 gap 2 ships — noetl-server v2.60.0 applies arc-level set: via Codex handoff; partial kind-val, #74 filed for namespace shim)
Agent: Claude (dispatcher) + Codex (executor) · Repos touched: noetl/server (PR #163 MERGED, v2.60.0 released), noetl/ai-meta (handoff + pointer bump + #74 filed)
Headline. #73 gap 2 ships — Codex handoff #4 of the day delivered arc-level set: propagation (apply_set_mutations + render at dispatch). Kind re-val on v2.60.0 shows the apply runs correctly (bare keys land in dispatch context) but the downstream Jinja templates {{ ctx.test_var }} still resolve to null because the orchestrator doesn't expose a ctx namespace in the render context. Filed as noetl/ai-meta#74 — small follow-up to wrap the dispatch context with {ctx, iter, step} namespace shims mirroring Python's commands.py:915 (context["ctx"] = state.variables).
- noetl-server v2.60.0 (noetl/server#163; closes noetl/ai-meta#73):
  - NextArc.args renamed to set_vars with #[serde(rename = "set")] — YAML key now matches Python canonical (set:, not legacy args:).
  - apply_set_mutations helper in state.rs mirrors Python's _apply_set_mutations verbatim: ctx. / iter. / step. scopes strip to bare keys; others kept as-is.
  - Orchestrator dispatch sites (main path + skip-chain loop) render set_vars templates against producing-step completion context, apply to dispatch context before issuing downstream command.
  - 4 files touched; 457 lines added / 67 removed; 8 new unit tests; lib 589 passed / 0 failed (was 581/0).
  - Authored on a Codex handoff at handoffs/active/2026-06-08-server-next-set-propagation/.
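The scope-strip rule is small enough to show directly. The Python sketch below follows the description above (ctx. / iter. / step. prefixes strip to bare keys, everything else is kept as-is); it is not a copy of either the Rust helper or Python's _apply_set_mutations, and the sample keys are illustrative.

```python
SCOPE_PREFIXES = ("ctx.", "iter.", "step.")


def apply_set_mutations(variables: dict, rendered_set: dict) -> dict:
    # rendered_set holds the arc's set: block after template rendering.
    for key, value in rendered_set.items():
        for prefix in SCOPE_PREFIXES:
            if key.startswith(prefix):
                # Scoped key: drop the namespace and store the bare name.
                key = key[len(prefix):]
                break
        variables[key] = value
    return variables


state = apply_set_mutations({}, {"ctx.test_var": 100, "iter.idx": 2, "plain": "x"})
```

This also shows why v2.60.0 alone was not enough: the value lands under the bare key test_var, while the downstream template asks for ctx.test_var, which needs the namespace shim to resolve.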
Built localhost/noetl-server-rust:v2.60.0 via podman, loaded into kind, rolled deploy.
- actions_test.yaml (kind exec 322277934972801024): 4 downstream steps recover (aggregate, helper, verify, end all command.completed | success — were error pre-fix). process_loop ×3 still error because it uses spec.policy.rules.then.set (per-task pipeline directive), a DIFFERENT code path from arc-level set:.
- test_args_passing.yaml (kind exec 322277934767280128): use_vars STILL errors with TypeError: 'NoneType' + 'NoneType'. Command-table evidence: tool_config.args: {test_var: null, computed: null} despite state.variables containing the bare keys (post-strip). Root cause: the downstream Jinja template {{ ctx.test_var }} looks up a ctx namespace value, but the dispatch render context doesn't expose ctx as a key. Python handles this in commands.py:915 with context["ctx"] = state.variables.
- noetl/ai-meta#74 (repo:server) — bind ctx/iter/step namespaces in the dispatch render context. Pure addition to the render-context construction in CommandBuilder::build_command; should be a small Codex handoff.
- repos/server: 59b743c (v2.59.0) → 084cad4 (v2.60.0).
2026-06-08 (#73 gap 1 closes — noetl-server v2.59.0 fans out start-step iterators via Codex handoff; loop_test fully green)
Agent: Claude (dispatcher) + Codex (executor) · Repos touched: noetl/server (PR #162 MERGED, v2.59.0 released), noetl/ai-meta (handoff + pointer bump)
Headline. #73 gap 1 closes — Codex handoff #3 of the day delivered the iterator-binding fix at the initial-dispatch path. Kind re-val of loop_test.yaml on v2.59.0 produces 5 iterator-bound start commands (was 1 empty-args pre-fix) + all 11 per-step command.completed | success. Gap 2 (next.set: value propagation across step transitions) remains the next round on #73.
- noetl-server v2.59.0 (noetl/server#162; Refs noetl/ai-meta#73):
  - generate_initial_commands learns about start_step.loop: if loop: is set, render loop.in_expr to a JSON array, require array (else AppError::Validation), iterate to build per-iteration IteratorMetadata + dispatch via build_iteration_command + persist_engine_command. Mirrors the existing Phase D R3b orchestrator fan-out into the /api/execute initial-command path.
  - 282 lines added / 8 removed in src/handlers/execute.rs.
  - 3 new unit tests (fan-out, back-compat no-loop, non-array rejection); lib 581 passed / 0 failed (was 578/0); release build + clippy clean.
  - Authored on a Codex handoff at handoffs/active/2026-06-08-server-initial-iterator-fanout/.
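The fan-out step reads as: render, require an array, build one command per element. A minimal Python sketch of that shape follows; ValidationError stands in for AppError::Validation, and the command dictionary layout and the fan_out_initial_commands name are assumptions, not the IteratorMetadata structure.

```python
import json
from typing import Any, Dict, List


class ValidationError(ValueError):
    """Stand-in for AppError::Validation."""


def fan_out_initial_commands(
    step_name: str, rendered_in_expr: str, item_var: str = "item"
) -> List[Dict[str, Any]]:
    # rendered_in_expr is loop.in_expr after template rendering, as JSON text.
    items = json.loads(rendered_in_expr)
    if not isinstance(items, list):
        raise ValidationError(f"loop.in_expr for step '{step_name}' must render to an array")
    # One command per element, each carrying its own iterator binding.
    return [
        {
            "step": step_name,
            "iterator": {"index": index, "count": len(items)},
            "args": {item_var: item},
        }
        for index, item in enumerate(items)
    ]


commands = fan_out_initial_commands("start", "[1, 2, 3, 4, 5]", item_var="num")
```

A five-element array yields five iterator-bound start commands, the same count the kind re-val reports for loop_test.yaml, while a non-array value is rejected before anything is dispatched.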
Built localhost/noetl-server-rust:v2.59.0 via podman, loaded into kind, rolled deploy. Re-ran tests/test/loop (kind exec 322268172092706816):
step_name | dispatched
----------+------------
start | 5 ← was 1 with args:{} pre-fix
process | 1
user_loop | 3 ← already worked via R3b
verify | 1
end | 1
All 11 command.completed events carry status: success. playbook.completed | COMPLETED reflects real state.
process step's call.done shows the upstream loop's aggregated path (original_sum: 15, doubled_sum: 30, count: 5); verify step received it via {{ process }} template and passed all assertions; end step's verify_result.test_passed: true.
- Gap 2: next.set: value propagation. Affects test_args_passing.yaml's use_vars step — tool_config.args: {test_var: null, computed: null} because the upstream arc's set: { ctx.test_var: '{{ initial_value }}' } doesn't propagate into the downstream step's context at dispatch. Distinct code path from the iterator fan-out; will land in a follow-up round.
- repos/server: 7dab231 (v2.58.0) → 59b743c (v2.59.0).
2026-06-08 (#71 CLOSED — noetl-tools v2.24.0 ships python wrapper fix via Codex handoff; kind re-val surfaces #73)
Agent: Claude (dispatcher) + Codex (executor) · Repos touched: noetl/tools (PR #41 MERGED, v2.24.0 released), noetl/worker (PR #64 OPEN for the dep bump), noetl/ai-meta (handoff + new follow-ups + pointer bump)
Headline. #71 closes at the noetl-tools layer. Second cross-agent handoff of the day used the same Claude-dispatch + Codex-execute pattern that delivered #70. Kind re-val with the new worker image (localhost/noetl-worker-rust:v5.15.0-tools224) confirms the python wrapper fixes shipped — Style C top-level-return wrap is visible in fixture tracebacks — but surfaces a distinct orchestrator-side binding gap filed as noetl/ai-meta#73.
- noetl-tools v2.24.0 (noetl/tools#41; closes noetl/ai-meta#71):
  - wrap_top_level_return() helper detects unindented return before any def / async def / class and wraps user code in def __noetl_step__(args, input_data, **kw): + a call that captures the return value as result.
  - input_data = dict(args) global added to the wrapper template (right after globals().update(args)).
  - Style B (def main() convention from #65) preserved untouched.
  - 7 new unit tests; lib 281 passed / 0 failed (was 274/0); release build clean.
  - Authored on a Codex handoff at handoffs/active/2026-06-08-tools-python-wrapper-contract/ (prompt commit 74ffe97, result commit pushed by Codex itself).
- noetl-worker PR #64 (noetl/worker#64) — one-commit dep bump (noetl-tools = "2.23" → "2.24" in Cargo.toml + Cargo.lock). Awaiting review/merge.
Built localhost/noetl-worker-rust:v5.15.0-tools224 via podman, loaded into kind, rolled the deployment (pod noetl-worker-rust-68ff5bd99f-dgf65 registered + serving). Re-ran the three #71 fixtures:
| Fixture | Kind exec | Wrapper layer | Orchestrator layer | Net |
|---|---|---|---|---|
| loop_test.yaml | 322253869239242752 | ✅ Style C wrap fires; result = __noetl_step__(args, input_data) in traceback | ❌ start step's command context shows args: {} — iterator's num not bound | partial |
| actions_test.yaml | 322253869834833920 | ✅ input_data global injected | ❌ downstream null-arg dispatch same as test_args_passing | partial |
| test_args_passing.yaml | 322253869994217472 | ✅ wrapper runs cleanly | ❌ tool_config.args: {test_var: null, computed: null} — next.set not propagated | partial |
Evidence the wrapper IS fixed. The original #71 fingerprints (SyntaxError: 'return' outside function + NameError: name 'input_data' is not defined) are GONE. Replaced by TypeError: unsupported operand type(s) for *: 'NoneType' and 'int' — Python successfully ran the wrapped function, but downstream values arrived as None.
The new finding. Orchestrator dispatches with empty / null args:
- Iterator binding gap: `loop_test`'s `start` step gets `args: {}` despite `input: { num: '{{ num }}', loop_index: '{{ loop_index }}' }`.
- `next.set:` propagation gap: `test_args_passing`'s `use_vars` step gets `args.test_var: null` despite the upstream arc's `set: { ctx.test_var: '{{ initial_value }}' }`.
Both are orchestrator-side template-resolution gaps, distinct from #60 (template context for when:). Filed as noetl/ai-meta#73.
- noetl/ai-meta#73 (`repo:server`): orchestrator doesn't bind iterator values + `next.set` value templates into downstream step args at dispatch.
- noetl/tools#42: 12 pre-existing clippy errors in `mcp.rs` / `nats.rs` / `snowflake.rs` / `result_fetch.rs`, parallel to noetl/server#161. Mechanical cleanup; blocks PR CI on `-D warnings`.
- `repos/tools`: `7d3fcfd` (v2.23.1) → `b6b80ce` (v2.24.0).
- `repos/worker`: `401bafc` (v5.15.0) → `802954b` (worker#64 merged; `chore(deps):` so no release-please tag).
Agent: Claude · Repos touched: noetl/ai-meta (issues #71 + #72 opened), noetl/ai-meta.wiki
Headline. After #70 closed, swept four more e2e fixtures against the new v2.58.0 stack. test_storage_tiers clean. loop_test + actions_test surfaced two distinct python-tool wrapper gaps + an orchestrator status-drift concern. Re-audit of the earlier test_args_passing PASS verdict revealed it was a false-pass — command.completed | error masked by playbook.completed | COMPLETED.
| Fixture | Kind exec | Verdict |
|---|---|---|
| `test_storage_tiers.yaml` | 322239636480987136 | PASS (5/5 steps, 0 errors); confirms #70 cascade unblocked for this fixture too. |
| `loop_test.yaml` | 322239728994750464 | FAIL: all 5 python steps fail with `SyntaxError: 'return' outside function`. |
| `actions_test.yaml` | 322240336095088640 | FAIL: `start` step OK; downstream python steps fail with `NameError: name 'input_data' is not defined`. |
| `test_args_passing.yaml` (re-audit) | 322228896269340672 | False-pass. `use_vars` step had `command.completed \| error`, masked by `playbook.completed \| COMPLETED`. |
- noetl/ai-meta#71 (`repo:tools`): Rust python tool's wrapper doesn't replicate the legacy `noetl/noetl/tools/python/executor.py` contract. Two gaps:
  - Top-level `return X` triggers `SyntaxError: 'return' outside function` (need to wrap user code in an implicit function).
  - `input_data` global isn't bound in the exec namespace (Python tool sets `input_data` + `args` + `result` as globals before exec).

  Mechanical fix: wrapper-level only, no fixture rewrites.
- noetl/ai-meta#72 (`repo:server`): orchestrator emits `playbook.completed | COMPLETED` even when EVERY step's `command.completed` is `error`. CI gates / dashboards that watch `playbook.completed` trust the green status on silently-broken playbooks. Needs a design call: cascade to `playbook.failed`, add `playbook.completed_with_errors`, or add an error-count field. Pick one and document it in `event-envelope.md`.
This session reaffirmed the pattern that the "did the orchestrator emit a terminal event" check is insufficient for kind-val PASS verdicts — every sweep must also audit the per-step call.done / command.completed events. The Sessions-Log entry from earlier in this session reporting test_args_passing as PASS was a false-positive directly caused by #72.
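The audit rule above can be sketched as a small verdict function. This is an illustration under assumptions: the `event_type`, `status`, and `step` field names are hypothetical stand-ins for whatever the event-log query actually returns.

```python
from typing import Iterable, Mapping


def kind_val_verdict(events: Iterable[Mapping[str, str]]) -> str:
    """PASS only if a terminal playbook.completed exists AND no step errored."""
    terminal_seen = False
    failed_steps = []
    for event in events:
        event_type = event.get("event_type")
        status = event.get("status", "")
        if event_type == "playbook.completed":
            terminal_seen = True
        elif event_type in ("command.completed", "call.done") and status.lower() == "error":
            failed_steps.append(event.get("step", "?"))
    if not terminal_seen:
        return "FAIL (no terminal event)"
    if failed_steps:
        return "FALSE-PASS (errors in: " + ", ".join(sorted(set(failed_steps))) + ")"
    return "PASS"
```

Checking the per-step events first keeps a green `playbook.completed | COMPLETED` from masking a step-level `error`, which is the #72 failure mode.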
- 7 e2e fixtures hit the #71 pattern (grep `input_data\.get|^\s*return` over `repos/e2e/fixtures/playbooks/*.yaml`): `actions_test`, `broken_sql`, `duckdb_test`, `loop_test`, `postgres_test`, `test_args_passing`, `test_end_with_action`. Many more under subdirectories.
- The orchestrator status-drift evidence query (event-log slice from #72 body) reproduces on all three failed fixtures above.
2026-06-08 (#70 CLOSED — noetl-server v2.58.0 ships durable result-store endpoints via Codex handoff; kind-val GREEN)
Agent: Claude (dispatcher) + Codex (executor) · Repos touched: noetl/server (PR #160 MERGED, v2.58.0 released), noetl/ai-meta (handoff + pointer bumps), noetl/ai-meta.wiki, noetl/server.wiki
Headline. #70 closes. Cross-agent handoff delivered the worker's missing wire counterpart: noetl-server now serves PUT /api/result/<eid> + GET /api/result/resolve (release commit 7dab231 = v2.58.0). Kind re-val of tests/output_select_test (kind exec 322232289352224768) reaches playbook.completed with test_result: "PASSED" + output_select_worked: true — the cascade that the prior session diagnosed (worker 404 → shm-only branch → null _ref → artifact tool fails) is fully unblocked.
- noetl-server v2.58.0 (noetl/server#160; closes noetl/ai-meta#70): 4 new files + 4 modified across `repos/server`:
  - `src/db/queries/result_store.rs` (startup DDL `CREATE TABLE IF NOT EXISTS noetl.result_store` + `insert` + `get_by_ref`; mirrors `secret_audit` pattern).
  - `src/services/result_store.rs` (`ResultStoreService::{put, resolve}`, `PutResultBody` / `ResultPutResponse` wire types matching worker `ControlPlaneClient::put_result` exactly, `parse_noetl_ref` URI parser, 10 unit tests).
  - `src/handlers/result_store.rs` (axum handlers with `result_store.{put,resolve}` tracing spans + 4 new metrics).
  - `src/metrics.rs` + module index files + `src/main.rs` (startup wiring + route registration).
  - 578 lib tests pass (+10 new); release build clean.
- Wiki: noetl/server `deployment-specification.md` updated to list `/api/result/*` in the network surface (commit `0d25ef5`).
- Built `localhost/noetl-server-rust:v2.58.0` via podman (`podman --connection noetl-dev build`), loaded into kind (`kind load image-archive`), rolled deployment (`kubectl set image`; pod `noetl-server-rust-9d5c88779-wb2bp` Running; version reports `2.58.0`).
- Smoke-PUT confirmed `200 OK` with the expected `noetl://execution/12345/result/smoke/<snowflake>` URI shape.
- Re-ran `tests/output_select_test` (kind exec `322232289352224768`): COMPLETED with the orchestrator's terminal `playbook.completed` event; `summary` step's call.done shows `output_select_worked: true, result_was_externalized: true, full_data_items_processed: 1000, test_result: "PASSED"`.
- Worker logs confirm the SUCCESS branch (was unreachable pre-fix): `Tool result exceeds inline budget; staged in durable result store + shared-memory cache. result_ref=noetl://execution/322232289352224768/result/start/322232292384706560 ... put_duration_seconds=0.014976658`.
- Negative check: zero `put_result HTTP 404` lines for this execution in the worker log (was 1+ per over-budget step pre-fix).
- Server logs confirm both endpoints actively serving: `result_store.put: stored ... bytes=256945 result_ref=...` + `result_store.resolve: found ... duration_seconds=0.007878078`.
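For readers unfamiliar with the `noetl://execution/<eid>/result/<step>/<snowflake>` shape seen in the smoke-PUT and worker logs, here is a hedged Python sketch of a parser for it. The server's actual `parse_noetl_ref` is Rust and may accept other shapes; the `ResultRef` type and validation rules below are assumptions.

```python
from typing import NamedTuple, Optional
from urllib.parse import urlparse


class ResultRef(NamedTuple):
    """Hypothetical parsed form of a result URI."""
    execution_id: int
    step: str
    result_id: int


def _is_ascii_int(text: str) -> bool:
    return text.isascii() and text.isdigit()


def parse_noetl_ref(uri: str) -> Optional[ResultRef]:
    """Parse noetl://execution/<eid>/result/<step>/<id>; None when malformed."""
    parsed = urlparse(uri)
    if parsed.scheme != "noetl" or parsed.netloc != "execution":
        return None
    parts = parsed.path.strip("/").split("/")
    if len(parts) != 4 or parts[1] != "result":
        return None
    execution_id, _, step, result_id = parts
    if not (_is_ascii_int(execution_id) and _is_ascii_int(result_id) and step):
        return None
    return ResultRef(int(execution_id), step, int(result_id))
```

Returning `None` for anything off-shape keeps the resolve path from guessing at a malformed reference.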
- noetl/server#161: 14 pre-existing clippy errors in files unrelated to result_store will block the next PR's `-D warnings` CI gate (didn't block #160 because the merge config tolerates them). Mechanical cleanup; one PR per file or batched.
- `output_select` per-field projection at worker call.done: flagged as a follow-up in the prior session, but the kind-val showed `output_select_worked: true` end-to-end (verify step received `status`, `count`, `total_size` correctly via the artifact-tool lazy_load round-trip). The worker comment at `command.rs:909` describing "future expansion" was overly pessimistic; no follow-up issue needed.
- Flight gRPC fast path: worker logs show `Flight transport failed; falling back to HTTP` (port 8083 not wired in kind). Out-of-scope here; HTTP fallback successful.
- `repos/server`: `f7ae136` (v2.57.2) → `7dab231` (v2.58.0).
- `repos/noetl-server-wiki`: `0ad86de` → `0d25ef5` (deployment-spec network-surface row).
2026-06-08 (#70 result-store endpoints opened on noetl/server via Codex handoff; PR #160 awaiting CI + merge)
Agent: Claude (dispatcher) + Codex (executor) · Repos touched: noetl/server (PR #160 opened), noetl/ai-meta (handoff thread + #70 comments + #161 clippy follow-up)
Headline. First cross-agent handoff of the session: Claude scoped the work and authored the prompt, Codex executed Phases A+B unattended on the local feature branch, and Claude opened the PR after the user's `ship it`. noetl/server#160 closes noetl/ai-meta#70 by porting `PUT /api/result/<eid>` + `GET /api/result/resolve` from Python.
- Thread: `handoffs/active/2026-06-08-server-result-store-endpoint/`.
- `round-01-prompt.md` authored by Claude (commit `742e981`); `round-01-result.md` written by Codex (commit `894f513`).
- Codex's report: 4 new files + 4 modified across `repos/server`; 578 lib tests pass (+10 new); release build clean; Phase C correctly gated on the `ship it` wait phrase.
- PR #160 (branch `feat/result-store-put-resolve-endpoints` at `0c0d13b`; closes noetl/ai-meta#70 when merged + pointer-bumped):
  - New `src/db/queries/result_store.rs` (startup DDL `CREATE TABLE IF NOT EXISTS noetl.result_store` + `insert` + `get_by_ref`).
  - New `src/services/result_store.rs` (`ResultStoreService::{put, resolve}`, `PutResultBody` / `ResultPutResponse` wire types, `parse_noetl_ref` URI parser with 10 unit tests).
  - New `src/handlers/result_store.rs` (axum handlers + tracing spans + 4 new metrics).
  - Modified `metrics.rs` + the three module index files + `main.rs` (startup wiring + route registration).
  - MVP scope explicit: cluster-wide pool not yet shard-aware; DELETE/list/GC/TTL/scoping/Arrow Flight gRPC fast path all deferred to follow-up rounds.
- Ran `e2e/fixtures/playbooks/test_output_select.yaml` (kind exec `322228228976545792`) against the unpatched Rust server image. Worker logs confirm the exact failure mode `put_result failed: HTTP 404` → cascade through the shm-only branch → downstream `{{ start._ref }}` resolves null → artifact tool errors with `Invalid artifact config: invalid type: null, expected a string`. Same cascade as the original noetl/ai-meta#70 report. Posted as a corroboration comment on the issue.
- Post-merge: rebuild noetl-server image, load into kind, re-run `test_output_select.yaml` + `test_storage_tiers.yaml`; expect both to advance past the artifact step (output_select per-field projection still pending; see follow-up note in noetl/ai-meta#70 comment).
- Ran `e2e/fixtures/playbooks/iterator_save_test.yaml` (kind exec `322227920619704320`): PASS. 3 rows inserted in `public.iterator_save_test`. Iterator + pipeline + `_prev` + `execution_id` substitution all clean.
- Ran `e2e/fixtures/playbooks/test_args_passing.yaml` (kind exec `322228896269340672`): PASS. `ctx.test_var` + `ctx.computed` propagate cleanly through the `next.set` block.
- noetl/server#161: clippy cleanup. `cargo clippy --lib --tests --release -- -D warnings` currently exits 101 on `main` with 14 pre-existing errors in files unrelated to result_store. Will block PR #160's CI as-is.
- `output_select` per-field projection at worker call.done: separate from #70. The worker comment at `command.rs:909` flags it as "future expansion when `output.output_select` plumbing lands at the server-side ToolSpec layer". To file when #70 merges + the worker can finally write the durable-success branch end-to-end.
- noetl/server PR #160 awaits merge → ai-meta pointer bump → wiki Releases.md row → `Closes noetl/ai-meta#70` fires.
- Roadmap board (Project 3): noetl/ai-meta#70 moved Todo → In progress on PR open.
2026-06-08 (#69 closes — call.done embeds _ref on durable-success branch, worker v5.15.0; #70 filed as server-side gap)
Agent: Claude · Repos touched: noetl/worker (PR #63 merged, v5.15.0 released), noetl/ai-meta, noetl/ai-meta.wiki
Headline. Worker-side fix for noetl/ai-meta#69 shipped (worker v5.15.0); kind re-val surfaced a downstream server-side gap filed as noetl/ai-meta#70.
- noetl-worker v5.15.0 (noetl/worker#63; closes noetl/ai-meta#69): single-commit MINOR bump (`feat:` prefix). When an over-budget result's durable PUT succeeds, `build_call_done_result` now emits `{status, context: { data: { _ref: <noetl://...> } }, reference: {...}}`. The inline `context.data._ref` block sits next to the existing `reference` block so the orchestrator's `extract_user_data` finds it under the v10 nested-envelope path. Downstream `{{ step._ref }}` resolves to the URI string; consuming `artifact` / `result_fetch` tools dispatch the URI-based fetch. 4 existing tests updated; lib 126/0/0; release build clean.
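The envelope shape above can be pictured with a short Python sketch. Both functions are simplified stand-ins named after the Rust ones; the real `extract_user_data` logic is not shown in this log, so the lookup below is an assumption that only illustrates why the inline `context.data._ref` makes `{{ step._ref }}` resolvable.

```python
def build_call_done_result(status: str, result_ref: str, reference: dict) -> dict:
    """Durable-success branch: inline _ref sits next to the reference block."""
    return {
        "status": status,
        "context": {"data": {"_ref": result_ref}},
        "reference": reference,
    }


def extract_user_data(envelope: dict) -> dict:
    """Assumed orchestrator lookup: read user data from context.data."""
    context = envelope.get("context")
    data = context.get("data") if isinstance(context, dict) else None
    return data if isinstance(data, dict) else {}


uri = "noetl://execution/322232289352224768/result/start/322232292384706560"
envelope = build_call_done_result("success", uri, {"kind": "result_ref"})
step_scope = extract_user_data(envelope)  # what {{ step._ref }} would read from
```

An envelope that carries only `reference` yields an empty scope under this lookup, which matches the `null` `_ref` symptom reported in #69.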
- Built `localhost/noetl-worker-rust:v5.15.0-feat69` via podman, loaded into kind, rolled the deployment. Pod `noetl-worker-rust-7547f66555-t2p8q` running v5.15.0.
- Re-ran `e2e/fixtures/playbooks/test_output_select.yaml` (kind exec `322223448124297216`) + `e2e/fixtures/playbooks/test_storage_tiers.yaml` (kind exec `322223449223204864`).
- Both fixtures still fail at the artifact step with `invalid type: null, expected a string`, but the failure shape is different: the over-budget call.done now lands on the degraded shm-only branch (`reference.kind: arrow_ipc`) instead of the durable-success branch (which would be `kind: result_ref`).
Worker logs reveal the root cause:
Durable result-store PUT failed; falling back to shared-memory cache only.
error=put_result failed: HTTP 404
The Rust noetl-server doesn't expose PUT /api/result/<execution_id> — confirmed via grep 'api/result' repos/server/src/main.rs returning no matches. Worker has been calling this endpoint for the entire Phase D+E+F lifetime (R-2.2 PR-B), but the Python-server route was never ported.
Filed as noetl/ai-meta#70 (server-side parity gap). Once that lands, the worker's #69 embedding works end-to-end.
- repos/worker: `ca1daf4` (post-worker#62 merge) → `401bafc` (v5.15.0 release commit)
2026-06-08 (#68 closes — artifact tool args alias, v2.23.1 + worker dep bump; #69 filed as downstream gap)
Agent: Claude · Repos touched: noetl/tools (PR #40 merged, v2.23.1 released), noetl/worker (PR #62 merged, lockfile bump), noetl/ai-meta, noetl/ai-meta.wiki
Headline. Two-PR shipment for the artifact-tool config-deserialization friction filed as noetl/ai-meta#68 earlier today. Kind re-val surfaced a new downstream gap (filed as noetl/ai-meta#69).
- noetl-tools v2.23.1 (noetl/tools#40; closes noetl/ai-meta#68): PATCH bump on v2.23.0. `#[serde(alias = "args")]` on `ArtifactConfig.input` resolves the friction between noetl-server's `ToolSpec` field-name normalization (`input:` → `args` via noetl/ai-meta#56) and the Python-parity expectation that the artifact tool config field be named `input`. Both shapes now deserialize. 1 new unit test. Crates.io published 2026-06-08T02:48:09Z.
- noetl-worker (worker#62): lockfile-only bump `noetl-tools 2.23.0 → 2.23.1`. Cargo.toml already had `noetl-tools = "2.23"` accepting any 2.23.x. Worker lib 126/0. Merged at `ca1daf4`. No release tag: release-please skips `chore(deps):` commits.
- Built `localhost/noetl-worker-rust:v5.15.1-tools231` via podman (cargo-chef warm cache), loaded into kind via `podman cp` + `ctr -n=k8s.io images import`, rolled the worker deployment (pod `noetl-worker-rust-56cdb9db76-jn5s6`, v5.15.1-tools231).
- Re-ran `e2e/fixtures/playbooks/test_output_select.yaml` (kind exec `322212285424603136`) + `e2e/fixtures/playbooks/test_storage_tiers.yaml` (kind exec `322212745988542464`). Both fixtures now deserialize the artifact config successfully (confirming the v2.23.1 alias fix took effect) but FAIL at the first artifact step with a NEW diagnostic: `Configuration error: Invalid artifact config: invalid type: null, expected a string`

  Root-caused: the upstream step's `{{ step._ref }}` resolves to `null` because the Rust worker's `call.done` envelope carries the `reference` block (durable result_store PUT) without an inline `data` block containing the `output_select` fields + the synthetic `_ref` URI.
noetl/ai-meta#69 — worker-side fix needed for output_select / _ref population on call.done. Filed with the kind execution IDs + envelope shape + likely fix site (worker executor::command::build_call_done_result).
- repos/tools: `e38046f` (v2.23.0) → `7d3fcfd` (v2.23.1)
- repos/worker: `bd977f8` (v5.14.0 main pre-bump) → `ca1daf4` (worker#62 merge)
Agent: Claude · Repos touched: noetl/server (PR #159 merged, v2.57.2 released), noetl/ai-meta, noetl/ai-meta.wiki
Headline. Diagnosed + fixed the orchestrator deadlock filed as noetl/ai-meta#67 earlier today. PR #159 OPEN.
Instrumented engine/orchestrator.rs::process_in_progress with
debug eprintln!s and stepped through the unit-test reproducer.
Found:
- The Jinja template renderer (Chainable undefined) handles `{{ A if A else B.x }}` correctly when B is undefined: it short-circuits to `A.x` via Jinja2 semantics; the renderer was a red herring.
- The actual deadlock: `build_incoming_arcs` (the static planner) declared `summarize` as having 2 upstreams (`process_high` + `process_low`) because both steps' `next.arcs` pointed to it. Under `mode: exclusive` only `process_high` fired; `process_low` stayed Pending forever; the R4 fan-in barrier (server#142) never resolved, so the orchestrator silently produced `new_commands=0 new_events=0 should_complete=false`.
- `evaluator::evaluate_next_transitions`: stop `break`-ing out of the arcs loop on the first exclusive-mode match. Surface every remaining sibling (and any arc whose `when` evaluated false in inclusive mode) as `EvaluationResult { matched: false, next_step: Some(name) }` via a new helper `not_matched_with_target`.
- `orchestrator::process_in_progress`: refactor into two ordered passes:
  - Pass 1 (new): emit `step.skipped` for every unmatched arc target across ALL completed steps BEFORE any barrier check runs. Pre-pass design eliminates HashMap-iteration-order non-determinism.
  - Pass 2 (existing): dispatch matched arc targets. The R4 fan-in barrier now also consults the in-pass `step.skipped` set, so a downstream merge target dispatches in the SAME orchestrator pass as the siblings' skips (no two-trigger gymnastics required).
- Unit tests added: `test_67_exclusive_routing_emits_step_skipped_for_unmatched_siblings` (orchestrator layer; pins the comprehensive_test shape) + `test_jinja_conditional_short_circuits_on_undefined_else_branch` (template layer; regression guard for Chainable semantics).
- `cargo test --lib -- --test-threads=1` → 568/0/0 (was 566, +2 new).
- `cargo build --release --bin noetl-control-plane` → clean.
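The two-pass fix can be modelled in a few lines of Python. This is a simplified sketch, not the Rust orchestrator: the `next_arcs` / `matched` dictionaries and the function names other than `build_incoming_arcs` are hypothetical, and real arc evaluation (`when` conditions, modes) is reduced to a precomputed `matched` map.

```python
def build_incoming_arcs(next_arcs: dict) -> dict:
    """Static planner view: target -> set of upstream steps pointing at it."""
    incoming = {}
    for source, targets in next_arcs.items():
        for target in targets:
            incoming.setdefault(target, set()).add(source)
    return incoming


def orchestrator_pass(next_arcs, completed, matched, prior_skipped=frozenset()):
    """Return (skipped, dispatch) for one pass over the completed steps."""
    incoming = build_incoming_arcs(next_arcs)
    taken = {t for step in completed for t in matched.get(step, ())}

    # Pass 1: mark every unmatched arc target as skipped before any barrier check.
    skipped = set(prior_skipped)
    for step in sorted(completed):  # sorted: deterministic iteration order
        for target in next_arcs.get(step, []):
            if target not in taken and target not in completed:
                skipped.add(target)

    # Pass 2: dispatch matched targets whose upstreams are all completed or skipped.
    resolved = set(completed) | skipped
    dispatch = sorted(
        t for t in taken
        if t not in completed and incoming.get(t, set()) <= resolved
    )
    return skipped, dispatch
```

In the `comprehensive_test` shape, `summarize` has `process_high` and `process_low` as upstreams; counting the skipped `process_low` as resolved is what lets the barrier open once `process_high` completes.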
- noetl/server v2.57.2 released: single-commit patch on top of v2.57.1. PR #159 merged at 02:11 UTC; `Closes noetl/ai-meta#67` keyword auto-closed the umbrella; board 3 status auto-moved to Done.
- Built `localhost/noetl-server-rust:v2.57.2` via podman; loaded into kind via `podman cp` + `ctr -n=k8s.io images import`; rolled `noetl-server-rust` deployment (uptime 26s).
- Re-ran `e2e/fixtures/playbooks/comprehensive_test.yaml` (kind execution `322196926424420352`): COMPLETED in ~4s (was hanging forever pre-fix). Event trace shows the exact fix shape: `start.command.completed` → `step.skipped` for `process_low` (untaken exclusive sibling; THE FIX) → `step.enter` for `process_high` → `process_high.command.completed` → `step.enter` for `summarize` → `summarize.command.completed` (barrier saw `process_low` as Skipped) → `step.enter` for `end` → `end.command.completed` → `playbook.completed`.
- Server pointer in ai-meta bumped: `9526b26` (v2.57.1) → `f7ae136` (v2.57.2).
- ai-task: noetl/ai-meta#67 (In progress on board 3).
- Server PR: noetl/server#159.
- Files touched: `repos/server/src/engine/evaluator.rs`, `repos/server/src/engine/orchestrator.rs`, `repos/server/src/template/jinja.rs` (374 insertions, 10 deletions).
Agent: Claude · Repos touched: noetl/ai-meta (issue #67), noetl/ai-meta.wiki
Headline. Swept 7 self-contained fixtures from
repos/e2e/fixtures/playbooks/ against the v2.57.1 Rust kind
stack: 3 PASS, 3 correct-FAIL (infra), 1 real orchestrator
bug (filed as noetl/ai-meta#67).
| Fixture | Outcome | Notes |
|---|---|---|
| `simple_python.yaml` | ✅ COMPLETED | clean |
| `test_args_passing.yaml` | ✅ COMPLETED | clean |
| `test_large_result_extraction.yaml` | ✅ COMPLETED | clean |
| `postgres_test.yaml` | ❌ correct-FAIL | fixture's `start` step has `kind: postgres` with no `auth:` block; error: `Failed to get connection`. Fixture issue, not server bug. |
| `http_test.yaml` | ❌ correct-FAIL | fixture targets `noetl.noetl.svc.cluster.local:8082` (the retired Python noetl-server service); error: `error sending request`. Fixture is stale; should target `noetl-server-rust.noetl.svc.cluster.local`. |
| `v10_canonical_example.yaml` | ❌ correct-FAIL | fixture's `fetch_data` step targets `https://api.example.com/data` (fake hostname); error: `error sending request`. Fixture needs a real test endpoint. |
| `comprehensive_test.yaml` | real orchestrator bug | orchestrator hangs after `process_high.command.completed`; never dispatches `summarize`; never emits a terminal event. Root cause: Jinja conditional `{{ A if A else B }}` in `input:` where B is undefined (because `mode: exclusive` routed to A only). Minimal repro: `tests/k66d/comprehensive_minimal_repro` (without the conditional) completes in ~4s. Filed as #67. |
Three stale fixtures need targeted fixes (file a sub-issue on noetl/e2e or fix inline):
- `postgres_test.yaml` start step: add `auth: pg_k8s` so it uses the keychain alias that already works.
- `http_test.yaml` workload `api_url`: bump from `noetl.noetl.svc.cluster.local:8082` → `noetl-server-rust.noetl.svc.cluster.local:8082`.
- `v10_canonical_example.yaml` workload `api_url`: pick a kind-internal endpoint (or skip it from the auto-sweep list).
- Sweep runner: `/tmp/sweep_runner.sh` (operator-local; can be checked into `repos/e2e/scripts/` as a durable rig if next session wants it).
- Minimal bug reproducer: `tests/k66d/comprehensive_minimal_repro` (proves the conditional shape is the trigger).
- Filed sub-issue: noetl/ai-meta#67 (board 3, Todo).
2026-06-08 (#49 housekeeping — {{ execution_id }} multi-statement postgres finding resolved on v2.57.1)
Agent: Claude · Repos touched: noetl/ai-meta.wiki
Headline. Re-ran the open finding from the #54 sweep on the fresh v2.57.1 server — bug no longer reproduces. Umbrella-49 "Next concrete steps" item 4 strikethrough applied; Sessions-Log entry added; pointer bumped.
Two minimal reproducer playbooks against the live Rust kind stack (noetl-server-rust:v2.57.1 + noetl-worker-rust:v5.15.0-tools223):
- `tests/k66b/pg_exec_id_multistatement`: postgres `query:` alias with `SELECT {{ execution_id }} AS exec_id_one` (single) vs `CREATE TEMP TABLE + DELETE WHERE eid = {{ execution_id }} + INSERT VALUES ({{ execution_id }}, ...) + SELECT {{ execution_id }}` (multi). Both single + multi return the correct execution_id (`322180876613980160`) on the wire.
- `tests/k66c/pg_command_alias_repro`: postgres `command:` alias (the wiki-described failing shape) with the same CREATE+DELETE+INSERT+SELECT pattern that `loop_with_pagination`'s `start` step uses. `{{ execution_id }}` renders correctly in every position; INSERT succeeds; verifying SELECT returns the row.
The "WHERE execution_id = ; syntax error at command-build time"
fingerprint the #54 sweep recorded no longer reproduces. Likely
fixed via the template defer-render series in server#71/#73 (or
adjacent template-scoping work between 2026-06-05 and now).
Strikethrough applied to Umbrella-49 page; no sub-issue needed.
- Umbrella: Umbrella: Rust Server Port.
- Original finding: server#72 / server#73 template-defer-render context.
- Reproducer fixtures (kept in `/tmp/` on the operator's machine; not durable, and the bug is closed so no checked-in fixture is needed).
Agent: Claude · Repos touched: noetl/server (PR #158 merged, v2.57.1 released), noetl/ai-meta, noetl/ai-meta.wiki
Headline. Fix for the cross-step {{ step.data }} template
gap surfaced as a finding during the previous session's #65
kind-val. Single-tool python steps producing a flat user dict
were reachable via {{ step.field }} but not via
{{ step.data }} / {{ step.data.field }} (the wrapped path
canonical v10 fixtures use interchangeably). Two-line semantic
fix in WorkflowState::build_context + two new unit tests.
- noetl/server PR #158 OPEN, branch `fix/orchestrator-step-data-template-accessor`. In `repos/server/src/engine/state.rs`, the `build_context` step-loop now wraps the extracted user_data with a self-referencing `.data` key when no `.data` is already present. Guarded by `!map.contains_key("data")` so the task_sequence flatten path's existing `.data` (populated from a labeled sub-task's `data` field) stays intact.
- 2 new unit tests in `engine::state::tests`:
  - `test_build_context_exposes_step_data_accessor_for_flat_user_dict`: reproduces the live kind execution `322087210360770560` envelope from the previous session's #65 kind-val; asserts BOTH `step.status` (flat) AND `step.data.status` (wrapped) resolve.
  - `test_build_context_data_accessor_does_not_clobber_existing_data_field`: pins task_sequence flatten back-compat: a labeled-sub-task `data` field stays addressable as both `<step>.<label>.data.x` AND `<step>.data.x`.
- `cargo test --lib engine::state` → 19/0/0 (was 17, +2 new).
- `cargo test --lib` → 566/0/0.
- `cargo build --release --bin noetl-control-plane` → clean.
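A minimal Python sketch of the `.data` accessor rule described for `build_context`. The function name `expose_step_data` is hypothetical, and the Rust code operates on JSON maps inside `WorkflowState`; this only illustrates the guard and the flat/wrapped equivalence.

```python
def expose_step_data(user_data):
    """Add a `.data` view of a flat user dict unless `.data` already exists."""
    if isinstance(user_data, dict) and "data" not in user_data:
        wrapped = dict(user_data)
        wrapped["data"] = dict(user_data)  # same fields, reachable as step.data.<field>
        return wrapped
    return user_data  # existing `.data` (e.g. task_sequence flatten) stays intact


flat = {"status": "success", "total_greetings": 3}
step_scope = expose_step_data(flat)
```

After the wrap, `step_scope["status"]` and `step_scope["data"]["status"]` return the same value, which is the equivalence the two new unit tests pin.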
- noetl/server v2.57.1 released: single-commit patch on top of v2.57.0 (the Phase D R5 R7 parity harness).
- PR #158 merged at 19:18 UTC; `Closes noetl/ai-meta#66` keyword auto-closed the umbrella; board 3 status auto-moved to Done.
- Server pointer in ai-meta bumped: `395f8cf` → `9526b26`.
- Built `localhost/noetl-server-rust:v2.57.1` via podman; loaded into kind via `podman cp` + `ctr -n=k8s.io images import`; rolled `noetl-server-rust` deployment (uptime 25s, version reports 2.57.1).
- Registered `tests/k66/python_file_loader_v2` and executed (execution `322179661486362624`). Playbook reaches `playbook.completed` in ~4s; `verify_data_accessor` step's output: `{ "all_pass": true, "checks": { "flat_status_ok": true, "flat_total_ok": true, "wrapped_status_ok": true, "wrapped_total_ok": true, "wrapped_script_source_ok": true, "flat_vs_wrapped_status_match": true, "flat_vs_wrapped_total_match": true }, "flat_status": "success", "wrapped_status": "success", "flat_total": 3, "wrapped_total": 3, "wrapped_script_source": "file" }`

  Both `{{ run_from_file.<field> }}` (existing flat path) AND `{{ run_from_file.data.<field> }}` (new wrapped #66 path) resolve to the same upstream user_data. #66 fix verified end-to-end on the live Rust orchestrator.
- ai-task: noetl/ai-meta#66 (CLOSED).
- Server PR: noetl/server#158 (MERGED).
- Server release: noetl/server v2.57.1.
- File touched: `repos/server/src/engine/state.rs` (build_context loop + 2 tests).
Agent: Claude · Repos touched: noetl/tools (PRs #38 + #39 merged), noetl/worker (PR #61 open), noetl/ai-meta, noetl/ai-meta.wiki
Headline. Two follow-on noetl-tools releases that complete
the python script: block subsystem, kind-validated
end-to-end on the live worker. Surfaced one downstream
finding (cross-step template-context gap, filed as
noetl/ai-meta#66).
- noetl/tools v2.22.0 (noetl/tools#38): external python script loaders for `file` / `gcs` / `http` source types. `PythonSource` enum + `resolve_source()` pure classification + async `load_script_code()` on `PythonTool` (owns a `reqwest::Client` + `GcpAuth`). GCS uses GCP ADC / workload-identity per the execution-model "already-in-place trust" rule. 15 new unit tests.
- noetl/tools v2.23.0 (noetl/tools#39): legacy `main()` function convention. When the user code doesn't set a non-None `result` AND defines a callable `main`, the wrapper introspects the signature, binds named params from `args`, forwards `**kwargs`, awaits async `main` via `asyncio.run`. Mirrors `noetl/tools/python/executor.py` `_invoke_main`. 4 new unit tests.
- noetl/worker#61 OPEN + mergeable: bumps `noetl-tools = "2.21"` → `"2.23"`. Worker lib 126/0 against the new tools. Image built locally (`v5.15.0-tools223`), loaded into kind, kind-val GREEN.
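The `main()` convention above can be sketched in plain Python. This is an approximation written from the description in this entry, not a copy of `_invoke_main`; the function name `invoke_main` and the exact binding rules (for example, dropping args that `main` does not name unless it accepts `**kwargs`) are assumptions.

```python
import asyncio
import inspect


def invoke_main(namespace: dict, args: dict):
    """Fall back to calling main(...) when the script set no `result`."""
    if namespace.get("result") is not None:
        return namespace["result"]
    main = namespace.get("main")
    if not callable(main):
        return None
    params = inspect.signature(main).parameters
    accepts_kwargs = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    named = {
        name for name, p in params.items()
        if p.kind in (inspect.Parameter.POSITIONAL_OR_KEYWORD,
                      inspect.Parameter.KEYWORD_ONLY)
    }
    # Bind named params from args; pass the rest only if main takes **kwargs.
    call_kwargs = {k: v for k, v in args.items() if k in named or accepts_kwargs}
    if inspect.iscoroutinefunction(main):
        return asyncio.run(main(**call_kwargs))
    return main(**call_kwargs)
```

Parameters missing from `args` keep their declared defaults, so a script defining `main(name="World", count=1)` still runs when only `name` is supplied.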
Built noetl-worker-rust:v5.15.0-tools223 → loaded into kind →
rolled worker deployment. Registered + executed
tests/k65/python_file_loader:
script:
uri: /tmp/loader_hello.py
source:
type: file
input:
name: NoETL
count: 3

…where `loader_hello.py` defines `main(name="World", count=1)`
(no result global).
Result (execution 322087210360770560): playbook.completed
in ~6s. The loaded step's call.done data field:
{"status": "success",
"messages": ["Hello, NoETL! (#1)", "Hello, NoETL! (#2)", "Hello, NoETL! (#3)"],
"total_greetings": 3, "script_source": "file"}

Proves the external file loader + main(name, count)
convention chain end-to-end on the live worker.
Surfaced finding — filed as #66
The kind-val playbook's downstream verify step
({{ run_from_file.data }}) resolved to None. The
upstream step's data IS in the event log + correct
(SELECT result::jsonb #> '{context,result,context,data}'
returns the full payload), but the orchestrator doesn't
project upstream-step call.done data into the next
step's template scope under that reference path. Filed
as a Rust-orchestrator finding (separate from #65; doesn't
block the loaders).
- noetl/ai-meta#65 closed with full kind-val evidence cited. gcs / http loaders validated at the unit-test layer (15 tests); hermetic kind-val of those paths needs external infra (GCS object / HTTP server) so they validate in CI/GKE rather than the local kind rig.
No ai-meta pointer changes — the tools work is a crates.io dep, not a submodule. The worker pointer bump rides noetl/worker#61's merge.
- noetl/ai-meta#65 (CLOSED).
- noetl/ai-meta#66 (NEW — surfaced finding).
- noetl/worker#61 (OPEN — adoption / deployment PR).
2026-06-07 (server#157 merged — Phase D R5 R7: cross-server parity harness, v2.57.0; Phase D R5 umbrella closes)
Agent: Claude · Repos touched: noetl/server (PR merged + sub-issue auto-closed), noetl/ai-meta, noetl/ai-meta.wiki
Headline. Phase D R5 closes. Final slice — a cross-server
parity harness that feeds a synthetic event log through both
the Rust fold_replay_state port and a pre-recorded Python
snapshot, asserts structural equality field-by-field across
every projection.
Sub-issue noetl/server#148
auto-closed via PR body's Closes noetl/server#148 keyword
at 17:53:40Z. All seven Phase D R5 rounds shipped today
(v2.51.0 → v2.57.0).
- noetl/server#157 merged (server@395f8cf, v2.57.0, closes server#148):
  - `tests/parity_harness/events.json`: 13-event synthetic log exercising all six replay projections (execution/stage/frame/command/business_object/loop) plus payload refs.
  - `tests/parity_harness/expected.json`: Python's structured fold output for that fixture; committed alongside so the harness is hermetic (no live Python runtime needed at test time).
  - `tests/parity_harness/regenerate_expected.py`: standalone Python 3.10+ script. Verbatim extract of `noetl/server/api/replay/service.py` fold + helpers, no `noetl`-package imports (dodges the transitive-dep chain: `nats`, `snowflake-connector-python`, env-var validators, …). Re-syncable when the Python implementation changes.
  - `tests/parity_harness/README.md`: documents the contract.
  - `tests/parity_harness.rs`: 8 Rust integration tests. Loads both files, folds with Rust, compares slice-by-slice with per-key failure messages. All pass.
- Structural: same projection keys, same per-key field values (status, counters, summaries, references). This is the load-bearing contract: the Rust port produces the same logical view as Python's source-of-truth.
- NOT byte-for-byte hex on `checksum.value` / `projection_checksums[*].value`. Python and Rust hash different digest inputs (Python normalizes to flat rows; Rust hashes the typed state directly, per R4's design). Both deliver determinism + replay validation; the typed `Checksum { type, value }` shape (R4) keeps the wire stable for future algorithm additions.
All seven rounds shipped:
| Round | Topic | Status |
|---|---|---|
| R1 | Endpoint scaffold + execution projection | ✅ v2.51.0 |
| R2 | stages + frames + commands projections | ✅ v2.52.0 |
| R3 | loops + business_objects projections | ✅ v2.53.0 |
| R4 | typed Checksum + projection_checksums | ✅ v2.54.0 |
| R5 | snapshot seed + base_state + upcaster digest | ✅ v2.55.0 |
| R6 | payload resolver | ✅ v2.56.0 |
| R7 | cross-server parity harness against Python | ✅ v2.57.0 |
The Replay engine port is complete. Python's ~1236-LoC
noetl/server/api/replay/service.py is now ported to Rust
with structural-parity unit-test coverage (564/0/0) +
8-test cross-server parity harness.
R7 is purely a test/fixture addition — no runtime code
changes. v2.57.0 is a release-please rolling-MINOR bump
(commit message used the feat(replay): prefix) but the
deployed image is identical in behaviour to v2.56.0. Kind
reload skipped.
- `repos/server` → `395f8cf` (v2.57.0).
- noetl/server#148 (Phase D R5 sub-issue) — CLOSED.
- noetl/ai-meta#49 Phase D R5 — Replay engine port complete.
Agent: Claude · Repos touched: noetl/server (PR merged + image build + kind reload + kind-val), noetl/ai-meta, noetl/ai-meta.wiki
Headline. Phase D R5 sixth slice ships. Every event's
result.reference JSON gets parsed into a typed
PayloadSummary and appended to the relevant projection's
payload-refs list. Only R7 (cross-server parity harness)
remains in the umbrella.
- noetl/server#156 merged (server@a8a054a, v2.56.0, Refs server#148). Mirrors Python's `_payload_ref` / `_payload_summary` / per-projection `payload_refs` population.
  - `PayloadSummary` struct with five fields: `sha256` / `schema_digest` / `row_count` / `media_type` / `ref` (the last renamed from Rust `reference_uri` via `serde(rename = "ref")`). All `Option<…>` + `skip_serializing_if`.
  - `PayloadRefEntry` struct: `{event_id, reference, summary}` per Python's dict shape.
  - `ReplayEventRow.result: Option<serde_json::Value>`: the `noetl.event.result` jsonb column. SQL queries updated across all three `load_events` variants.
  - `ReplayExecutionState.payload_refs: Vec<PayloadRefEntry>`: every event with `result.reference` appended in event_id order.
  - `ReplayFrameState.output_ref` + `output_ref_summary`: populated on `frame.committed` / `frame.failed`; summary is `Some(default)` (all-None fields) when the terminal event has no reference (mirrors Python's `_payload_summary(None)`).
  - `ReplayBusinessObjectState.payload_refs` + `last_payload_ref`: every event touching the BO with a `result.reference` appended; `last_payload_ref` points at the most recent.
  - `extract_payload_ref(event)` mirrors Python's `_payload_ref`: reads `event.result.reference`, returns `None` when the result is absent, has no `reference` key, or `reference` is `null`.
  - `payload_summary(reference)` mirrors Python's `_payload_summary`: three-tier fallback per field (`reference.<field>` → `reference.rows_ref.meta.<field>` → `reference.rows_ref.ipc.<field>`); `sha256` falls back to `reference.digest`; `ref` falls back to `reference.uri`.
  - 15 new unit tests covering all the fallback paths + per-projection population. Server lib 564/0/0 (was 549/0/0).
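The fallback chain described above can be sketched in a few lines of Python. This is an illustration of the documented lookup order, not the code from `service.py` or the Rust port: the helper names `_as_dict` and `_lookup` are hypothetical, as are two choices the log does not spell out (that the `ref` field is looked up under the key `ref` before falling back to `uri`, and that the `digest` / `uri` fallbacks apply only after all three tiers miss).

```python
from typing import Any, Optional

SUMMARY_FIELDS = ("sha256", "schema_digest", "row_count", "media_type", "ref")


def _as_dict(value: Any) -> dict:
    """Hypothetical helper: treat anything that is not a dict as empty."""
    return value if isinstance(value, dict) else {}


def extract_payload_ref(event: dict) -> Optional[dict]:
    """Return event.result.reference; None when result or reference is absent or null."""
    return _as_dict(event.get("result")).get("reference")


def _lookup(reference: dict, field: str) -> Optional[Any]:
    """Three-tier lookup: reference, then rows_ref.meta, then rows_ref.ipc."""
    rows_ref = _as_dict(reference.get("rows_ref"))
    scopes = (reference, _as_dict(rows_ref.get("meta")), _as_dict(rows_ref.get("ipc")))
    for scope in scopes:
        value = scope.get(field)
        if value is not None:
            return value
    return None


def payload_summary(reference: Optional[dict]) -> dict:
    """Build the five-field summary; every field is None when nothing matches."""
    if not reference:
        return {field: None for field in SUMMARY_FIELDS}
    summary = {field: _lookup(reference, field) for field in SUMMARY_FIELDS}
    if summary["sha256"] is None:  # assumed order: tiers first, then digest
        summary["sha256"] = reference.get("digest")
    if summary["ref"] is None:  # assumed order: tiers first, then uri
        summary["ref"] = reference.get("uri")
    return summary
```

A reference that carries only `digest` yields a summary whose `sha256` is set and whose other fields stay `None`, which matches the all-null `row_count` / `ref` pattern seen in the kind probe for this round.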
Built noetl-server-rust:v2.56.0 → loaded into kind → rolled
deployment. Re-probe of the prior fanout_reduce execution
(322023958058635264) returns identical shape as v2.55.0 with
new fields all empty (the fixture has no events carrying
result.reference).
A second execution (640422512395813188) confirms the live payload resolver path — three populated entries with real SHA-256 hex digests:
"execution.payload_refs": [
{"event_id": 640422601071788780,
"summary": {"sha256": "d0de6b8de78fd04b2e752a96ebef12df4a9b32e92565b3f6e55860ae12762133",
"row_count": null, "ref": null}},
{"event_id": 640422601071788782, "summary": {"sha256": "d0de6b8..."}},
{"event_id": 640422601080177391, "summary": {"sha256": "d0de6b8..."}}
]
`row_count` + `ref` are null because the originating events' `result.reference` JSON didn't carry those fields: the fallback chain visited every location and returned `None` as documented. R6 correctness verified end-to-end.
- `repos/server` → `a8a054a` (v2.56.0).
| Round | Topic | Status |
|---|---|---|
| R1 | Endpoint scaffold + execution projection | ✅ MERGED (v2.51.0) |
| R2 | stages + frames + commands projections | ✅ MERGED + kind-val GREEN (v2.52.0) |
| R3 | loops + business_objects projections | ✅ MERGED + kind-val GREEN (v2.53.0) |
| R4 | typed Checksum + projection_checksums | ✅ MERGED + kind-val GREEN (v2.54.0) |
| R5 | snapshot seed + base_state + upcaster digest | ✅ MERGED + kind-val GREEN (v2.55.0) |
| R6 | payload resolver | ✅ MERGED + kind-val GREEN (v2.56.0) |
| R7 | Cross-server parity harness against Python | ⏳ next — final round |
- noetl/server#148 (umbrella sub-issue; closes after R7).
- noetl/ai-meta#49 Phase D R5.
2026-06-07 (server#155 merged — Phase D R5 R5: snapshot seed + base_state + upcaster digest, v2.55.0, kind-val GREEN)
Agent: Claude · Repos touched: noetl/server (PR merged + image build + kind reload + kind-val), noetl/ai-meta, noetl/ai-meta.wiki
Headline. Phase D R5 fifth slice ships. The replay fold
can now start from a prior fold's output and continue from
there rather than always re-folding from event 1. Mirrors
Python's base_state + snapshot_seed +
upcaster_registry_digest parameters on fold_replay_state.
- noetl/server#155 merged (server@3dc6b66, v2.55.0, Refs server#148).
  - `ReplaySnapshotSeed` struct: `aggregate_id`, `aggregate_type`, `version: i64`, `checksum: Checksum`, `state: ReplayState`, `meta: Map`. Mirrors Python's frozen dataclass from `noetl/server/api/replay/types.py`.
  - `ReplaySnapshotInfo` struct: same shape minus `state` (the full state already went into `base_state`). Surfaced on the output as `ReplayState.replay_snapshot`.
  - `ReplayFoldOptions` struct (`Default` impl): carries the three optional inputs `base_state` / `snapshot_seed` / `upcaster_registry_digest`.
  - New `ReplayState` fields: `upcaster_registry_digest: Option<String>` + `replay_snapshot: Option<ReplaySnapshotInfo>`. Both `skip_serializing_if = "Option::is_none"` so default-options folds produce the exact same JSON as R1–R4; wire-shape back-compat preserved.
  - `fold_replay_state_with_options` new entry point. The existing `fold_replay_state` (5-arg) is now a thin shim that passes `ReplayFoldOptions::default()`.
  - Continuation semantics: `base_state` strips its checksum + projection_checksums (they recompute at the end); counters (`event_count`, `last_event_id`, …) continue from where the base left off; caller's `tenant_id` / `organization_id` / `execution_id` override whatever the base recorded; caller's `upcaster_registry_digest` wins over base's, but `None` from caller preserves base's value.
  - 8 new unit tests covering option propagation, snapshot info surfacing, counter continuation, checksum stripping + recomputation, tenant/org override, upcaster digest precedence rules. Server lib 549/0/0 (was 541/0/0).
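The continuation rules listed above are easy to misread in prose, so here is a minimal Python sketch of how a fold might seed its starting state from `base_state`. The function name `seed_state` and the plain-dict state shape are assumptions made for illustration; the actual Rust entry point is `fold_replay_state_with_options` and works on typed structs.

```python
import copy
from typing import Optional


def seed_state(
    base_state: dict,
    *,
    tenant_id: str,
    organization_id: str,
    execution_id: int,
    upcaster_registry_digest: Optional[str] = None,
) -> dict:
    """Hypothetical helper: derive the starting state for a continued fold."""
    state = copy.deepcopy(base_state)  # never mutate the caller's snapshot

    # Checksums are dropped here and recomputed once the fold finishes.
    state.pop("checksum", None)
    state.pop("projection_checksums", None)

    # The caller's identity always overrides what the base recorded.
    state["tenant_id"] = tenant_id
    state["organization_id"] = organization_id
    state["execution_id"] = execution_id

    # The caller's digest wins; None keeps the value carried by the base.
    if upcaster_registry_digest is not None:
        state["upcaster_registry_digest"] = upcaster_registry_digest

    # Counters such as event_count / last_event_id are copied untouched,
    # so the fold continues from where the base left off.
    return state
```

Copying the base first and then applying overrides keeps the precedence rules in one place: anything not explicitly overridden (the counters in particular) carries over unchanged.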
The HTTP handler stays unchanged for R5; it still folds from
event 1 with ReplayFoldOptions::default(). Wiring up a
snapshot store + load_snapshot_seed shape + deciding when
to seed is a downstream sub-issue against server#148. R5
lands the data-contract round; storage + HTTP opt-in are
the next slice.
Built noetl-server-rust:v2.55.0 → loaded into kind → rolled deployment. Re-probe of the prior fanout_reduce execution (322023958058635264) against v2.55.0:
{
"event_count": 25,
"last_event_type": "playbook.completed",
"exec_status": "COMPLETED",
"commands_n": 4,
"checksum_type": "sha256",
"checksum_value_len": 64,
"projection_checksum_keys": ["business_object", "command",
"execution", "frame", "loop", "stage"],
"replay_snapshot_in_keys": false,
"upcaster_in_keys": false
}
Wire-shape back-compat verified: the new R5 fields don't appear in JSON when their `Option<…>` is `None`, so R1-R4 consumers see identical output. Snapshot-seeded behaviour is covered by the unit-test layer (no snapshot store in kind yet).
- `repos/server` → `3dc6b66` (v2.55.0).
| Round | Topic | Status |
|---|---|---|
| R1 | Endpoint scaffold + execution projection | ✅ MERGED (v2.51.0) |
| R2 | stages + frames + commands projections | ✅ MERGED + kind-val GREEN (v2.52.0) |
| R3 | loops + business_objects projections | ✅ MERGED + kind-val GREEN (v2.53.0) |
| R4 | typed Checksum + projection_checksums | ✅ MERGED + kind-val GREEN (v2.54.0) |
| R5 | snapshot seed + base_state + upcaster digest | ✅ MERGED + kind-val GREEN (v2.55.0) |
| R6 | Payload resolver | ⏳ next |
| R7 | Cross-server parity harness against Python | — |
- noetl/server#148 (umbrella sub-issue; stays open across all 7 rounds).
- noetl/ai-meta#49 Phase D R5.
2026-06-07 (server#154 merged — Phase D R5 R4: typed Checksum + projection_checksums, v2.54.0, kind-val GREEN)
Agent: Claude · Repos touched: noetl/server (PR merged + image build + kind reload + kind-val), noetl/ai-meta, noetl/ai-meta.wiki
Headline. Phase D R5 fourth slice ships. Every replay
fold now produces a typed Checksum over the full state plus
a 6-entry projection_checksums map covering every per-projection
slot. The hash function is the type of the checksum (per
user direction this session) — ChecksumType enum gates future
types (BLAKE3, SHA-512, …) without a wire-format break.
- noetl/server#154 merged (server@adec21c, v2.54.0, Refs server#148). Extends the R5 R3 fold:
  - `ChecksumType` enum with the initial variant `Sha256` (serializes lowercase snake_case `"sha256"`, matching Python's `state["checksum_algorithm"]` wire form). Future variants (`Blake3`, `Sha512`, …) slot in without a schema-level break.
  - `Checksum` struct = `{ type: ChecksumType, value: String }`; the value field carries the lowercase-hex digest. Replaces Python's flat `state["checksum_algorithm"]` + `state["checksum"]` pair with a typed shape.
  - `ReplayState.checksum: Option<Checksum>` (skip_serializing_if when None) + `ReplayState.projection_checksums: BTreeMap<String, Checksum>` (six entries on every fold: `execution`, `stage`, `frame`, `command`, `business_object`, `loop`).
  - `stable_json_bytes` helper: encodes a value as deterministic JSON (sorted keys recursively + compact separators) matching Python's `json.dumps(sort_keys=True, separators=(",", ":"))` byte form. Used as the SHA-256 input.
  - `compute_checksums` runs once at the end of `fold_replay_state`: per-projection SHA-256 over each typed sub-state (`state.execution` / `state.stages` / `state.frames` / `state.commands` / `state.business_objects` / `state.loops`), then the top-level digest over the full state with `projection_checksums` populated and `checksum` field still None (skip_serializing_if handles the self-reference cleanly, so the digest doesn't depend on itself).
  - 9 new unit tests covering type-serialization shape, hex output format, deterministic re-runs, projection isolation, top-level self-non-dependence. Server lib 541/0/0 (was 532/0/0).
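To make the two-pass digest concrete, the sketch below reproduces the described order of operations in Python using only the standard library. It is an approximation under stated assumptions: the state is a plain dict rather than the typed Rust structs, the `PROJECTION_FIELDS` mapping and the `checksum_of` helper are hypothetical names, and the hex values it produces are not expected to match the Rust server's output.

```python
import hashlib
import json

# Projection key -> state field, following the six entries named in the log.
PROJECTION_FIELDS = {
    "execution": "execution",
    "stage": "stages",
    "frame": "frames",
    "command": "commands",
    "business_object": "business_objects",
    "loop": "loops",
}


def stable_json_bytes(value) -> bytes:
    """Deterministic JSON: keys sorted recursively, compact separators."""
    return json.dumps(value, sort_keys=True, separators=(",", ":")).encode("utf-8")


def checksum_of(value) -> dict:
    """Typed checksum shape: the hash function is the type, the hex is the value."""
    return {"type": "sha256", "value": hashlib.sha256(stable_json_bytes(value)).hexdigest()}


def compute_checksums(state: dict) -> dict:
    """Per-projection digests first, then the top-level digest over the rest."""
    out = {key: value for key, value in state.items() if key != "checksum"}
    out["projection_checksums"] = {
        name: checksum_of(out.get(field, {})) for name, field in PROJECTION_FIELDS.items()
    }
    # The top-level digest is taken while "checksum" is still absent,
    # so the digest never depends on itself.
    out["checksum"] = checksum_of(out)
    return out
```

Because the top-level digest is computed before the `checksum` key is attached, re-running `compute_checksums` on its own output yields the same value, which is the self-non-dependence property the unit tests cover.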
The R4 digest hashes over the typed Rust state directly,
not through Python's normalize_replayed_<projection>_projection
flat-row layer. Reasons documented in the PR body:
- The typed `BTreeMap` ordering + `stable_json_bytes` sorted-key recursion give the same determinism guarantee.
- The Rust state IS the source of truth for the server's view; normalizing to a Python-row shape adds a translation layer the Rust server never reads back.
- Cross-Python byte-for-byte parity isn't an R4 requirement — that's explicitly R7's "cross-server parity harness", which can add the normalize functions if/when parity matters.
If R7 finds Python parity needed, it adds the
normalize_replayed_*_projection helpers and recomputes the
projection_checksums over the normalized list — additive work
without touching the R4 wire shape.
Built noetl-server-rust:v2.54.0 → loaded into kind → rolled deployment. Re-probe of the prior fanout_reduce execution (322023958058635264) against v2.54.0:
{
"event_count": 25,
"last_event_type": "playbook.completed",
"exec_status": "COMPLETED",
"commands_n": 4,
"loops_n": 0,
"business_objects_n": 0,
"checksum": {
"type": "sha256",
"value": "41265876487f32350fc60c5039358456ded76598b99e7a0833ac4a17ceaae426"
},
"projection_checksum_keys": [
"business_object", "command", "execution",
"frame", "loop", "stage"
]
}
Sample per-projection entry:
.projection_checksums.command = {
"type": "sha256",
"value": "58d8220005758b7f18e27d9042b3ef5fa8ca86471c9d2ea33a869fd0db31231b"
}
Same event log → same digest hex (verified by the `fold_checksum_deterministic_across_runs` unit test). Every projection's hex differs from every other (different sub-state input → different SHA-256 → distinct identity).
- `repos/server` → `adec21c` (v2.54.0).
| Round | Topic | Status |
|---|---|---|
| R1 | Endpoint scaffold + execution projection | ✅ MERGED (v2.51.0) |
| R2 | stages + frames + commands projections | ✅ MERGED + kind-val GREEN (v2.52.0) |
| R3 | loops + business_objects projections | ✅ MERGED + kind-val GREEN (v2.53.0) |
| R4 | typed Checksum + projection_checksums | ✅ MERGED + kind-val GREEN (v2.54.0) |
| R5 | Snapshot seeds + base_state | ⏳ next |
| R6 | Payload resolver | — |
| R7 | Cross-server parity harness against Python | — |
- noetl/server#148 (umbrella sub-issue; stays open across all 7 rounds).
- noetl/ai-meta#49 Phase D R5.
2026-06-07 (server#153 merged — Phase D R5 R3: loops + business_objects projections, v2.53.0, kind-val GREEN)
Agent: Claude · Repos touched: noetl/server (PR merged + image build + kind reload + kind-val), noetl/ai-meta, noetl/ai-meta.wiki
Headline. Phase D R5 third slice ships. The replay fold
now populates loops + business_objects projections — the
last two per-projection slots. All five map fields (stages /
frames / commands / loops / business_objects) are now
typed BTreeMaps with deterministic key ordering, ready for
R4's typed Checksum + projection_checksums bundle.
- noetl/server#153 merged (server@3174c75, v2.53.0, Refs server#148). Extends the R5 R2 fold:
  - Two new typed state structs: `ReplayLoopState` (`loop_id` + `step_name` + `total` + `done` + `failed` + `completed` + `last_event_id`) and `ReplayBusinessObjectState` (`object_key` + `object_type` + `object_id` + `status` + `version` + `event_count` + `first_event_id` + `last_event_id` + `deleted_event_id` + `last_event_type` + `attributes`). `payload_refs` deliberately deferred to R6 (the payload-resolver round).
  - `ReplayState.loops` + `ReplayState.business_objects` flip from `serde_json::Map` placeholders to `BTreeMap<String, Replay{Loop,BusinessObject}State>` for deterministic key ordering.
  - Two new ID extractors: `extract_loop_id` mirrors Python's `_loop_id` (reads `meta.loop_id`, then `meta.loop_event_id`, then `meta.__loop_epoch_id`); `extract_business_object_identity` mirrors Python's `_business_object_identity` (reads `meta.business_object.{object_type|type}` / `{object_id|id}` first, then flat `meta.business_object_{type,id}` / `meta.object_{type,id}`, then parses `aggregate_type=business_object` + `aggregate_id=business_object/<type>/<id>` as fallback).
  - `business_object_status` helper mirrors Python's `_business_object_status`: explicit non-empty event-row `status` wins; else suffix-match the event type (`.deleted` / `.removed` → `DELETED`, `.created` / `.updated` / `.upserted` → `ACTIVE`); else returns `None` (caller leaves existing status unchanged).
  - Two new populate functions: `populate_loop` increments `done` on `command.completed` / `loop.shard.done`, `failed` on `command.failed` / `loop.shard.failed`, flips `completed=true` on `loop.done` / `loop.fanin.completed`; `populate_business_object` updates `last_event_id` + `last_event_type` always, bumps `event_count`, recomputes `version` (`meta.business_object.version` || `meta.business_object_version` || `event_count`), updates status via the helper above (DELETED also sets `deleted_event_id`), and applies attribute updates (`meta.business_object.state` REPLACES, `meta.business_object.patch` / `attributes` PATCHES).
  - 13 new unit tests covering ID extraction precedence, loop counter aggregation, business-object lifecycle through three stages (created → updated → deleted), status fallbacks, version sources, no-signal skip path. Server lib 532/0/0 (was 518/0/0).
  - Cleanup follow-up (server@a235b60): R3 doc comments dropped the "canonical" qualifier per writing-style banned-word rule + user direction.
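The status helper and the loop counters described above translate into very little code. The Python sketch below follows the documented precedence; it assumes the event row exposes `status` and `event_type` keys, that an explicit status is passed through unchanged, and that loop state is a plain dict. None of these details are confirmed by the log.

```python
from typing import Optional


def business_object_status(event: dict) -> Optional[str]:
    """Explicit status wins, then the event-type suffix, else no signal."""
    status = event.get("status")
    if isinstance(status, str) and status.strip():
        return status  # assumption: returned as-is, no case folding
    event_type = event.get("event_type") or ""
    if event_type.endswith((".deleted", ".removed")):
        return "DELETED"
    if event_type.endswith((".created", ".updated", ".upserted")):
        return "ACTIVE"
    return None  # caller leaves the existing status unchanged


def populate_loop(loop: dict, event_type: str) -> None:
    """Aggregate shard outcomes and completion for one loop."""
    if event_type in ("command.completed", "loop.shard.done"):
        loop["done"] += 1
    elif event_type in ("command.failed", "loop.shard.failed"):
        loop["failed"] += 1
    elif event_type in ("loop.done", "loop.fanin.completed"):
        loop["completed"] = True


# Usage: fold a short event-type sequence into one loop's counters.
loop_state = {"done": 0, "failed": 0, "completed": False}
for kind in ("loop.shard.done", "command.completed", "loop.shard.failed", "loop.done"):
    populate_loop(loop_state, kind)
```

Returning `None` for "no signal" rather than a default status is what lets the populator leave an earlier `ACTIVE` or `DELETED` value in place when an unrelated event touches the same business object.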
Per user direction: Round 4 ships the checksum +
projection_checksums fields with a typed shape, NOT
Python's flat checksum_algorithm + checksum pair. Reasons:
the hash function is the type of the checksum (not a
sibling field), and future checksum types (BLAKE3, SHA-512,
…) slot in via the enum without a wire-format break.
Proposed Rust types (final names land in the R4 PR):
pub enum ChecksumType { Sha256, /* future: Blake3, Sha512, ... */ }
// `type` is a reserved word in Rust, so the field is written as the raw identifier `r#type`.
pub struct Checksum { pub r#type: ChecksumType, pub value: String /* hex */ }
// in ReplayState:
pub checksum: Option<Checksum>,
pub projection_checksums: BTreeMap<String, Checksum>,
Underlying hash stays SHA-256 (same algorithm Python's `_canonical_checksum` uses). Parity test in R7 asserts byte-for-byte hex-value equality. See the server#148 comment for the full design.
Built noetl-server-rust:v2.53.0 → loaded into kind → rolled deployment. Re-probe of the prior fanout_reduce execution (322023958058635264, the R5 R1/R2 reference run):
| Field | R2 (v2.52.0) | R3 (v2.53.0) |
|---|---|---|
| event_count | 25 | 25 |
| last_event_type | playbook.completed | playbook.completed |
| execution.status | COMPLETED | COMPLETED |
| commands (map size) | 4 | 4 |
| stages / frames | 0 / 0 | 0 / 0 |
| loops | (R2: untyped Map) | 0 (typed BTreeMap) |
| business_objects | (R2: untyped Map) | 0 (typed BTreeMap) |
The loops + business_objects maps remain empty for this
fixture — the v10 control-flow shape of fanout_reduce doesn't
emit loop.* events or business-object metadata. R3's fold
correctness for those projections is verified through the
unit-test layer (fold_populates_loop_with_counters_and_completion,
fold_populates_business_object_through_lifecycle). The wire
contract change — typed BTreeMap instead of serde_json::Map
— is the load-bearing shift, ready for R4's projection_checksums
to hash over.
- `repos/server` → `3174c75` (v2.53.0).
| Round | Topic | Status |
|---|---|---|
| R1 | Endpoint scaffold + execution projection | ✅ MERGED (v2.51.0) |
| R2 | stages + frames + commands projections | ✅ MERGED + kind-val GREEN (v2.52.0) |
| R3 | loops + business_objects projections | ✅ MERGED + kind-val GREEN (v2.53.0) |
| R4 | typed Checksum + projection_checksums | ⏳ next — design captured on server#148 |
| R5 | Snapshot seeds + base_state | — |
| R6 | Payload resolver | — |
| R7 | Cross-server parity harness against Python | — |
- noetl/server#148 (umbrella sub-issue; stays open across all 7 rounds).
- noetl/ai-meta#49 Phase D R5.
2026-06-07 (server#152 merged — Phase D R5 R2: stages + frames + commands projections, v2.52.0, kind-val GREEN)
Agent: Claude · Repos touched: noetl/server (PR merged + image build + kind reload + kind-val), noetl/ai-meta, noetl/ai-meta.wiki
Headline. Phase D R5 second slice ships. The replay fold now
populates stages + frames + commands projections from the
event stream — mirrors Python's state["stages"] / state["frames"]
/ state["commands"] per-projection dicts from
noetl/server/api/replay/service.py.
- noetl/server#152 merged (server@266b4c7, v2.52.0, Refs server#148). Extends the R5 R1 fold:
  - `ReplayEventRow` extended with `stage_id` / `frame_id` / `command_id` / `worker_id` / `aggregate_type` / `aggregate_id` / `meta` columns (all `#[sqlx(default)]` for back-compat). SQL queries updated across all three `load_events` variants; R1's `AT TIME ZONE 'UTC' AS created_at` cast preserved.
  - Three new typed state structs (`ReplayStageState` / `ReplayFrameState` / `ReplayCommandState`) replacing R1's `serde_json::Map` placeholders. `ReplayState.{stages,frames,commands}` flip from `serde_json::Map` to `BTreeMap<String, Replay{Stage,Frame,Command}State>` for deterministic key ordering (matters when R4 lands the typed `Checksum` + `projection_checksums`).
  - Three new ID extractors (`extract_stage_id` / `extract_frame_id` / `extract_command_id`) mirror Python's resolution order: top-level column → `aggregate_type` + `aggregate_id` fallback (e.g. `aggregate_type=stage` + `aggregate_id=stage/<id>`) → `meta.<key>` fallback.
  - Three new populate functions with full status-transition coverage: stage `opened → OPEN` / `closed → CLOSED`; frame `dispatched → CLAIMED` / `started → RUNNING` / `committed → COMPLETED` / `failed → FAILED` / `abandoned → ABANDONED`; command full lifecycle (`issued → PENDING` / `claimed → CLAIMED` / `started → RUNNING` / `completed → COMPLETED` (or uppercased event-type suffix) / `failed → FAILED` / `cancelled → CANCELLED`). Each populator is a no-op when the event doesn't carry the relevant identity.
  - 10 new unit tests covering ID extraction precedence, lifecycle population per projection, status-fallback defaults, single-event multi-projection updates, no-identity skip path. Server lib 518/0/0 (was 508/0/0).
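As a rough illustration of the resolution order and the frame transitions listed above, the Python sketch below folds a few events into a `frames` map. Several details are assumptions rather than facts from the log: the generic `extract_id` helper (the port has three separate extractors), the `meta.<kind>_id` key name, the `frame.<suffix>` event-type spelling, and the dict-based state.

```python
from typing import Optional

# Event type -> frame status, following the transitions listed in the log.
FRAME_STATUS = {
    "frame.dispatched": "CLAIMED",
    "frame.started": "RUNNING",
    "frame.committed": "COMPLETED",
    "frame.failed": "FAILED",
    "frame.abandoned": "ABANDONED",
}


def extract_id(event: dict, kind: str) -> Optional[str]:
    """Resolve an identity: column, then aggregate pair, then meta."""
    column = event.get(f"{kind}_id")
    if column not in (None, ""):
        return str(column)
    prefix = f"{kind}/"
    aggregate_id = str(event.get("aggregate_id") or "")
    if event.get("aggregate_type") == kind and aggregate_id.startswith(prefix):
        tail = aggregate_id[len(prefix):]
        if tail:
            return tail
    meta = event.get("meta")
    value = meta.get(f"{kind}_id") if isinstance(meta, dict) else None
    return str(value) if value not in (None, "") else None


def populate_frame(frames: dict, event: dict) -> None:
    """Apply one event to the frames projection; no-op without a frame identity."""
    frame_id = extract_id(event, "frame")
    if frame_id is None:
        return
    frame = frames.setdefault(frame_id, {"frame_id": frame_id, "status": None})
    status = FRAME_STATUS.get(event.get("event_type"))
    if status is not None:
        frame["status"] = status
    frame["last_event_id"] = event.get("event_id")
```

Keeping identity extraction separate from status mapping is what allows a single event to update several projections: each populator asks for its own identity and silently skips the event when none is present.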
Built noetl-server-rust:v2.52.0 → loaded into kind → rolled deployment. Re-probe of the prior fanout_reduce execution (322023958058635264, the R5 R1 reference run):
| Field | R1 (v2.51.0) | R2 (v2.52.0) |
|---|---|---|
| event_count | 25 | 25 |
| last_event_type | playbook.completed | playbook.completed |
| execution.status | COMPLETED | COMPLETED |
| commands (map size) | 0 (placeholder) | 4 populated |
| stages / frames | 0 / 0 | 0 / 0 |
The 4 populated command entries (one per dispatched command in the
fixture: start, normalize_customer, enrich_customer,
reduce_customer) each carry worker_id, issued_event_id, and
last_event_id from the event row; status sits at RUNNING
because the fanout_reduce v10 fixture emits step-level termination
events (step.exit) rather than command.completed — exactly
the shape R2's PR body documented. stages + frames stay
empty for the same reason (no stage.* / frame.* events in the
v10 control-flow shape). Fold logic correctness verified through
the unit-test layer.
- `repos/server` → `266b4c7` (v2.52.0).
| Round | Topic | Status |
|---|---|---|
| R1 | Endpoint scaffold + execution projection | ✅ MERGED (v2.51.0) |
| R2 | stages + frames + commands projections | ✅ MERGED + kind-val GREEN (v2.52.0) |
| R3 | loops + business_objects projections | ⏳ next |
| R4 | typed Checksum + projection_checksums | — |
| R5 | Snapshot seeds + base_state | — |
| R6 | Payload resolver | — |
| R7 | Cross-server parity harness against Python | — |
- noetl/server#148 (umbrella sub-issue; stays open across all 7 rounds).
- noetl/ai-meta#49 Phase D R5.
2026-06-07 (server#149 merged — Phase D R5 R1: Replay endpoint scaffold + execution projection, v2.51.0)
Agent: Claude · Repos touched: noetl/server (PR merged + sub-issue opened), noetl/ai-meta, noetl/ai-meta.wiki
Headline. Opens Phase D R5 — the Replay engine port (Python's
noetl/server/api/replay/service.py ~1236 LoC → Rust). Round 1
ships the endpoint scaffold + the minimal execution projection
on top of the Phase D R4 foundation; subsequent rounds extend
the fold incrementally without touching the surface again.
- noetl/server sub-issue server#148 documenting the 7-round decomposition (R1 scaffold + execution / R2 stages+frames+commands / R3 loops+business_objects / R4 typed `Checksum` + `projection_checksums` / R5 snapshot seeds / R6 payload resolver / R7 cross-server parity harness).
- noetl/server#149 merged (server@f77dead, v2.51.0). New `GET /api/replay/state` route mirroring Python's `endpoint.py` byte-for-byte (query params + defaults + projection enum + mutually-exclusive cutoffs returning 400). New `services::replay` module: `ReplayService`, `ReplayCutoff`, `ReplayProjection`, `ReplayState`, `ReplayExecutionState`, pure deterministic `fold_replay_state`. Round 1 only fills the `execution` projection (`status` + `last_node_name`) using the same terminal-event short-circuit pattern Phase D R4 landed in the orchestrator + status endpoint. 9 new unit tests (8 service + 1 handler). Server lib 508/0/0 (was 499/0/0).
- `repos/server` → `f77dead` (v2.51.0).
GET /api/replay/state?execution_id=<i64>[&tenant_id=...][&organization_id=...]
[&as_of_event_id=N | &as_of_position=N | &as_of_time=<rfc3339>]
[&projection=execution|stage|frame|command|business_object|loop|all]
[&limit=N] [&resolve_payloads=bool]
Returns {tenant_id, organization_id, execution_id, projection, event_count, last_event_id, last_event_type, execution:{status, last_node_name}, stages:{}, frames:{}, commands:{}, business_objects:{}, loops:{}} — the map fields stay empty in
R1 and populate in R2/R3.
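The mutually-exclusive cutoff rule is the one piece of validation the signature above implies but does not show. A minimal Python sketch of that check follows; the function name, the error payload shape, the `all` default for `projection`, and rejecting an unknown projection with 400 are all assumptions for illustration, since the log only states that conflicting cutoffs return 400.

```python
from typing import Tuple

CUTOFF_PARAMS = ("as_of_event_id", "as_of_position", "as_of_time")
PROJECTIONS = ("execution", "stage", "frame", "command", "business_object", "loop", "all")


def validate_replay_query(params: dict) -> Tuple[int, dict]:
    """Return (http_status, body) for a parsed query-string dict."""
    cutoffs = [name for name in CUTOFF_PARAMS if params.get(name) is not None]
    if len(cutoffs) > 1:
        # Documented behaviour: more than one cutoff is a client error.
        return 400, {"error": "cutoffs are mutually exclusive: " + ", ".join(cutoffs)}
    projection = params.get("projection", "all")  # assumed default
    if projection not in PROJECTIONS:
        return 400, {"error": f"unknown projection: {projection}"}  # assumed behaviour
    cutoff = {cutoffs[0]: params[cutoffs[0]]} if cutoffs else None
    return 200, {"projection": projection, "cutoff": cutoff}
```

For example, passing both `as_of_event_id` and `as_of_time` yields a 400, while a single `as_of_position` is accepted and echoed back as the active cutoff.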
| Round | Topic | Status |
|---|---|---|
| R1 | Endpoint scaffold + execution projection | ✅ MERGED (v2.51.0) |
| R2 | stages + frames + commands projections | ⏳ next |
| R3 | loops + business_objects projections | — |
| R4 | typed Checksum + projection_checksums | — |
| R5 | Snapshot seeds + base_state | — |
| R6 | Payload resolver | — |
| R7 | Cross-server parity harness against Python | — |
- noetl/server#148 (umbrella sub-issue; stays open across all 7 rounds).
- noetl/ai-meta#49 Phase D R5 (the orchestrator + read-side status both already green; Replay is the natural next big piece).
Agent: Claude · Repos touched: noetl/server (PR merged), noetl-server-rust image build + kind reload, noetl/ai-meta, noetl/ai-meta.wiki
Headline. Phase D R4 follow-up: the read-side status query
bug surfaced during the fanout_reduce kind-val (filed as
server#146
earlier this session) is fixed + kind-validated. Operators
querying GET /api/executions/{id}/status now correctly see
COMPLETED / FAILED the moment the orchestrator emits a
terminal event.
- noetl/server#147 merged (server@d26abf8, v2.50.1, closes server#146). Two compounding fixes:
  - New `terminal` SQL query in `ExecutionService::get_status` looks up `playbook.completed` / `playbook.failed` events FIRST and returns COMPLETED / FAILED directly; this mirrors the `list()` endpoint's existing `bool_or(playbook.completed)` semantics on the per-execution path.
  - Widened the `completed_steps` SQL filter to accept `status IN ('COMPLETED', 'completed', 'success')` so the progress counter matches what workers actually emit (`command.completed` events carry `status='success'` lowercase).
- 6 new unit tests for the in-memory `determine_status` helper; server lib went 493/0/0 → 499/0/0.
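Both fixes can be pictured as small in-memory functions over an event list. The Python sketch below is only an approximation of the idea: the real fix lives in SQL inside `ExecutionService::get_status`, and the `RUNNING` fallback, the "latest terminal event wins" tie-break, and restricting the counter to `command.completed` events are assumptions made here.

```python
TERMINAL_STATUS = {"playbook.completed": "COMPLETED", "playbook.failed": "FAILED"}
DONE_STATUSES = {"COMPLETED", "completed", "success"}


def determine_status(events: list) -> str:
    """Terminal events are checked first; anything else is still running."""
    for event in reversed(events):  # assumed tie-break: latest terminal event wins
        status = TERMINAL_STATUS.get(event["event_type"])
        if status is not None:
            return status
    return "RUNNING"  # assumed fallback for this sketch


def completed_steps(events: list) -> int:
    """Count finished commands, accepting every status spelling workers emit."""
    return sum(
        1
        for event in events
        if event["event_type"] == "command.completed"
        and event.get("status") in DONE_STATUSES
    )
```

With the narrower pre-fix filter, four `command.completed` events carrying `status='success'` would count as zero completed steps, which is the Pre-fix column in the re-query table for this entry.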
Built noetl-server-rust:v2.50.1 → loaded into kind → rolled deployment.
Re-query of prior execution (322018338286866432, the fanout_reduce_phase6 run from earlier in session):
| Field | Pre-fix | Post-fix |
|---|---|---|
| status | "RUNNING" | "COMPLETED" |
| progress.completed_steps | 0 | 4 |
Fresh fanout_reduce execution (322023958058635264):
- Started 14:48:15.172, COMPLETED returned at 14:48:15.782 — single 2s poll interval was sufficient (previously stuck on RUNNING indefinitely).
- All three Phase D R4 barrier assertions also green on this fresh run.
- `repos/server` → `d26abf8` (v2.50.1).
- noetl/server#146 (closed via server#147 merge).
- noetl/ai-meta#49 Phase D R4 (orchestrator + read-side status both GREEN end-to-end now).
Agent: Claude · Repos touched: noetl-server-rust image build + kind reload, noetl/ai-meta.wiki, noetl/server (status-bug sub-issue), noetl/ai-meta#49 comment
Headline. Phase D R4 evidence captured end-to-end on the Rust-only kind cluster. Built noetl-server-rust:v2.50.0 from server@499b079 via podman + Dockerfile, loaded into kind, rolled the deployment, and ran the fanout_reduce_phase6 fixture from the just-merged e2e@5da36ea rig. All three barrier assertions pass.
- Image built + loaded into kind: localhost/noetl-server-rust:v2.50.0 (podman build → kind load image-archive → kubectl set image). Server pod uptime confirms v2.50.0 via /api/health: `{"status":"ok","version":"2.50.0"}`.
- fanout_reduce_phase6 ran on Rust-only stack: noetl-server-rust + noetl-worker-rust + noetl-worker-system-pool (Python at 0). Execution_id 322018338286866432 completed in 550ms wall.
Direct DB query against noetl.event:
command.completed start @ 14:25:56.704
step.enter normalize_customer @ 14:25:56.708
step.enter enrich_customer @ 14:25:56.710
command.completed normalize_customer @ 14:25:56.954
command.completed enrich_customer @ 14:25:57.043 ← barrier waited for THIS
step.enter reduce_customer @ 14:25:57.051 ← then dispatched ONCE
command.completed reduce_customer @ 14:25:57.249
playbook.completed playbook @ 14:25:57.254
| Assertion | Expected | Observed | Status |
|---|---|---|---|
| 1. playbook.completed event | exists | event_id 322018349171085312 @ 14:25:57.254 | ✅ |
| 2. step.enter for reduce_customer | count = 1 | 1 (event_id 322018348319641600) | ✅ |
| 3. reduce_customer.command.completed AFTER both branches | r > a ∧ r > b | r=14:25:57.248 > a=14:25:56.954 ∧ > b=14:25:57.042 | ✅ |
Server log confirms: Step 'reduce_customer' already dispatched in this pass, skipping (same-pass dedup caught the sibling arc) + Orchestrator marked execution as terminal terminal_event=playbook.completed.
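The three assertions can be replayed offline against the event rows from the DB query above. The Python sketch below does that with the timestamps copied from that query; the tuple layout and the `check_barrier` function are illustrative assumptions, not the rig script's actual implementation.

```python
from datetime import time

# (event_type, node, timestamp) rows copied from the noetl.event query above.
EVENTS = [
    ("command.completed", "start", "14:25:56.704"),
    ("step.enter", "normalize_customer", "14:25:56.708"),
    ("step.enter", "enrich_customer", "14:25:56.710"),
    ("command.completed", "normalize_customer", "14:25:56.954"),
    ("command.completed", "enrich_customer", "14:25:57.043"),
    ("step.enter", "reduce_customer", "14:25:57.051"),
    ("command.completed", "reduce_customer", "14:25:57.249"),
    ("playbook.completed", "playbook", "14:25:57.254"),
]


def check_barrier(events, reduce_step, branches):
    """Evaluate the three fan-in assertions over an ordered event list."""
    parsed = [(etype, node, time.fromisoformat(ts)) for etype, node, ts in events]
    completed = {node: ts for etype, node, ts in parsed if etype == "command.completed"}
    enters = sum(1 for etype, node, _ in parsed if etype == "step.enter" and node == reduce_step)
    return {
        "playbook_completed": any(etype == "playbook.completed" for etype, _, _ in parsed),
        "single_dispatch": enters == 1,
        "reduce_after_branches": all(completed[reduce_step] > completed[b] for b in branches),
    }


RESULT = check_barrier(EVENTS, "reduce_customer", ["normalize_customer", "enrich_customer"])
```

Checking the reduce step against every branch, rather than only the slowest one, keeps the assertion valid even if branch timing changes between runs.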
Separately, the GET /api/executions/{id}/status endpoint
continued to return RUNNING for 90s+ after playbook.completed
landed in the event log. Read-side bug — doesn't affect
orchestrator correctness — filed as noetl/server#146.
| Slice | PR | Status |
|---|---|---|
| 1 — fan-in / reduce barrier | server#143 v2.49.0 | ✅ MERGED + kind-green |
| 2 — apply_event step.skipped | server#145 v2.50.0 | ✅ MERGED + kind-green |
| 3 — fanout_reduce kind-val rig | e2e#32 | ✅ MERGED + ran |
| 4 — context-merging | (TBD) | ⏳ Deferred — kind-val passed without it; playbook templates can already read {{ <upstream>.<field> }} |
The rig script was not run as-is: the v4.8 CLI surface differs from the `noetl playbook register` shape the script assumes, so the manual run used a mix of v4.8 commands + direct DB queries. The rig's three core assertions are valid + green; the script can be updated in a follow-up to match the v4.8 CLI surface.
- noetl/server#146: fix the status-query bug so `GET /api/executions/{id}/status` correctly returns `COMPLETED` after `playbook.completed` lands.
- Optional: update the kind-val rig to match v4.8 CLI shape (`noetl register playbook --file` vs `noetl playbook register --file`).
Agent: Claude · Repos touched: noetl/e2e (PR merged), noetl/ai-meta (pointer bump), noetl/ai-meta.wiki
Headline. Phase D R4 slice 3: the kind-val rig that runs
the canonical fanout_reduce shape end-to-end against the
orchestrator's fan-in / reduce barrier (server v2.49.0 +
v2.50.0). Durable fixture + script land under noetl/e2e;
the actual run-on-kind is gated on a fresh server image rolling
into the cluster.
- noetl/e2e#32 merged (e2e@5da36ea, closes e2e#31). Two artefacts:
  - `fixtures/playbooks/fanout_reduce/fanout_reduce_phase6.yaml`: copy of the Python reference fixture (start → branch_a/branch_b → reduce_customer → end) with a header comment documenting the orchestrator contract being exercised.
  - `scripts/kind_validate_fanout_reduce.sh`: modeled after `kind_validate_container_callback.sh`. Three assertions on the resulting event log: (1) final execution status `COMPLETED`, (2) exactly one `step.enter` for `reduce_customer` (barrier prevented double-dispatch), (3) `reduce_customer.command.completed` arrives AFTER both branches' `command.completed` (orchestrator waited for both upstreams).
- `repos/e2e` → `5da36ea`.
| Slice | PR | Status |
|---|---|---|
| 1 — fan-in / reduce barrier | server#143 v2.49.0 | ✅ MERGED |
| 2 — apply_event step.skipped | server#145 v2.50.0 | ✅ MERGED |
| 3 — fanout_reduce kind-val rig | e2e#32 | ✅ MERGED |
| 4 — context-merging | (TBD) | ⏳ Deferred — playbook templates can read {{ <upstream>.<field> }} individually today; not opening until the kind-val actually surfaces a need |
- Build a fresh noetl-server image carrying v2.50.0 + load into kind. Then run `e2e/scripts/kind_validate_fanout_reduce.sh` to capture the green-state evidence for Phase D R4.
- noetl/e2e#31 (closed via e2e#32 merge).
- noetl/ai-meta#49 Phase D R4 (slices 1-3 shipped; kind-val run-on-kind is the next housekeeping step).
Agent: Claude · Repos touched: noetl/server (PR merged), noetl/ai-meta (pointer bump), noetl/ai-meta.wiki
Headline. Phase D R4 slice 2: apply_event now handles
step.skipped events, closing the gap exposed by the slice 1
PR's #[ignore] test. Fan-in barrier now correctly treats a
guard-skipped upstream as terminal — no more deferred-forever
reduce steps.
- noetl/server#145 merged (server@499b079, v2.50.0, closes server#144). New `"step.skipped" | "step_skipped"` arm in `state::WorkflowState::apply_event` records the step into `state.steps` with `StepState::Skipped` and sets `entered_at` + `completed_at` to the event timestamp. `is_step_done` already treated Skipped as terminal (state.rs:540); the missing piece was the apply_event mapping. Slice 1's `#[ignore]` test (`test_reduce_step_treats_skipped_upstream_as_done`) flipped to active; 2 new state-level tests added. Server lib went from 490/0/+1 ignored to 493/0/0 ignored.
- `repos/server` → `499b079` (v2.50.0).
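To show why the missing mapping caused "deferred-forever" reduce steps, here is a small Python model of the state machine. It is a sketch under assumptions: the `node_name` / `created_at` field names, the `step.exit` arm, and the exact membership of `TERMINAL_STATES` beyond `Skipped` are invented for the example and are not taken from `state.rs`.

```python
TERMINAL_STATES = {"Completed", "Failed", "Skipped"}  # only Skipped is confirmed by the log


def apply_event(steps: dict, event: dict) -> None:
    """Record step state changes; both spellings of the skip event are accepted."""
    kind = event["event_type"]
    name = event["node_name"]
    at = event["created_at"]
    if kind in ("step.skipped", "step_skipped"):
        # A skipped step enters and completes at the same instant.
        steps[name] = {"state": "Skipped", "entered_at": at, "completed_at": at}
    elif kind == "step.exit":  # hypothetical completion arm for the example
        steps[name] = {"state": "Completed", "entered_at": at, "completed_at": at}


def is_step_done(steps: dict, name: str) -> bool:
    step = steps.get(name)
    return step is not None and step["state"] in TERMINAL_STATES


def reduce_ready(steps: dict, upstreams: list) -> bool:
    """Fan-in barrier: dispatch the reduce step only when every upstream is terminal."""
    return all(is_step_done(steps, name) for name in upstreams)
```

Without the skip arm, a guard-skipped upstream never appears in `steps`, so `reduce_ready` stays false no matter how long the orchestrator waits.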
| Slice | PR | Status |
|---|---|---|
| 1 — fan-in / reduce barrier | #143 v2.49.0 | ✅ MERGED |
| 2 — apply_event step.skipped | #145 v2.50.0 | ✅ MERGED |
| 3 — kind-val with fanout_reduce_phase6 fixture | (not yet) | ⏳ pending |
| 4 — context-merging (optional; playbook templates can already read {{ <upstream>.<field> }}) | (TBD) | ⏳ deferred until kind-val signals a need |
- noetl/server#144 (closed via server#145 merge).
- noetl/ai-meta#49 Phase D R4 (slice 1 + 2 shipped at the orchestrator level; remaining work is kind-validation + possibly context-merging).
2026-06-07 (worker#60 merged — Container Tool Callback umbrella #43 Round 4 worker-side adoption complete, v5.14.0)
Agent: Claude · Repos touched: noetl/worker (PR merged), noetl/ai-meta (pointer bump), noetl/ai-meta.wiki
Headline. The last follow-up after the closed Container Tool
Callback umbrella (noetl/ai-meta#43) ships. Worker now recognises
ToolResult.pending_callback = Some(true) and skips its own
call.done emit — terminal call.done arrives via the server's
/api/internal/container-callback/... endpoint (Round 2, v2.48.0)
driven by noetl-k8s-watcher (Round 1, ops@8892043).
- noetl/worker#60 merged (worker@f96da71, v5.14.0, closes worker#59). executor::command checks tool_result.pending_callback after a successful tool execution: when Some(true) it logs INFO (with execution_id), bumps the new noetl_worker_call_done_skipped_pending_callback_total{tool_kind} counter, and skips its own call.done emit. When None (every existing tool, default) the existing emit path is preserved bit-for-bit. Cargo.toml: noetl-tools = "2.18" → "2.21", noetl-executor = "0.3" → "0.4" (Cargo.lock resolves 0.4.1, published earlier this session via cli#56). 126/0 lib tests against published noetl-executor 0.4.1.
- repos/worker → f96da71 (worker v5.14.0).
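The skip decision described above can be sketched in a few lines of Rust. This is an illustrative model only: the ToolResult stand-in, SkipCounter, and should_emit_call_done are hypothetical names, not the worker's real types, and treating Some(false) like None is an assumption the log does not confirm.

```rust
use std::collections::HashMap;

/// Minimal stand-in for the tool result; only the marker field is modelled.
pub struct ToolResult {
    pub pending_callback: Option<bool>,
}

/// Hypothetical in-memory counter keyed by tool_kind, standing in for
/// noetl_worker_call_done_skipped_pending_callback_total{tool_kind}.
#[derive(Default)]
pub struct SkipCounter {
    by_tool_kind: HashMap<String, u64>,
}

impl SkipCounter {
    pub fn get(&self, tool_kind: &str) -> u64 {
        self.by_tool_kind.get(tool_kind).copied().unwrap_or(0)
    }
}

/// Returns true when the worker should emit its own call.done.
/// Some(true) suppresses the emit and bumps the counter; None keeps the
/// existing emit path. Some(false) is treated like None (assumption).
pub fn should_emit_call_done(
    result: &ToolResult,
    tool_kind: &str,
    skipped: &mut SkipCounter,
) -> bool {
    if result.pending_callback == Some(true) {
        *skipped
            .by_tool_kind
            .entry(tool_kind.to_string())
            .or_insert(0) += 1;
        false
    } else {
        true
    }
}
```

Keeping the decision in one pure function makes the "None preserves behaviour bit-for-bit" property easy to pin in a unit test.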
The umbrella itself was closed earlier this session with all four Rust rounds shipped (ops watcher, server callback endpoint, Tool::Container + marker field, e2e kind-val rig). Round 4 worker-side adoption was the LAST coordinated follow-up. Still on the punch list (chip-level, not blocking):
- Cloud Build a fresh worker image carrying v5.14.0 + load into kind.
- Re-run e2e/scripts/kind_validate_container_callback.sh against the new image.
- Expected dashboard fingerprint after Round 4 lands: server's noetl_container_callback_stale_total{state} stops moving (the race window closes because the worker no longer emits early call.done); worker's noetl_worker_call_done_skipped_pending_callback_total{tool_kind="container"} ≈ server's noetl_container_callback_total{state=...} becomes the healthy-steady-state signal.
- noetl/worker#59 (closed via worker#60 merge).
- noetl/ai-meta#43 (umbrella was closed earlier; this trail wraps the last follow-up).
2026-06-07 (cli#56 + server#143 merged — noetl-executor 0.4.1 released, Phase D R4 fan-in / reduce barrier landed v2.49.0)
Agent: Claude · Repos touched: noetl/cli (PR merged), noetl/server (PR merged), noetl/ai-meta (pointer bumps), noetl/ai-meta.wiki
Headline. Two PRs from earlier this session merged. cli#56
shipped noetl-executor 0.4.1 (bridge propagates the new
ToolResult.pending_callback field — unblocks the worker-side
adoption PR that's still draft). server#143 shipped the Phase D
R4 first slice — fan-in / reduce barrier — closing the
cross-pass bug where a reduce step fired on the first completing
upstream instead of waiting for all.
- noetl/cli#56 merged (cli@77be8be, v4.10.0, closes cli#55). noetl-executor 0.4.1 patch: tools_bridge::reshape_duckdb_result propagates result.pending_callback unchanged; HTTP test fixture pins pending_callback: None; noetl-tools dep bumped to ^2.21. 102/0 unit. The cli release pipeline (release-cli workflow) is publishing noetl-executor 0.4.1 to crates.io now.
- noetl/server#143 merged (server@be37e5c, v2.49.0, closes server#142). Phase D R4 first slice: fan-in / reduce barrier (tracks noetl/ai-meta#49). Orchestrator gates dispatch of any step with incoming_arcs(target).len() > 1 until ALL upstream steps reach a terminal state. New module-private build_incoming_arcs(steps) helper mirrors the Python planner's incoming map across all four NextSpec variants. Three active tests + one #[ignore] documenting an apply_event step.skipped follow-up (the barrier code already handles Skipped on the reconstructed-state side via is_step_done; missing piece is the apply_event arm). Server lib 490/0 + 1 ignored.
- repos/cli → 77be8be (cli v4.10.0, noetl-executor 0.4.1).
- repos/server → be37e5c (server v2.49.0).
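A minimal sketch of the barrier gate, for readers picking this up later. It assumes arcs are already flattened to (source, target) pairs; the real helper walks the four NextSpec variants, which are not modelled here. StepState and may_dispatch are hypothetical names, and the exact set of terminal states is an assumption.

```rust
use std::collections::HashMap;

/// Hypothetical step states; only the terminal / non-terminal split matters.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum StepState {
    Pending,
    Running,
    Completed,
    Failed,
    Skipped,
}

impl StepState {
    /// Skipped counts as terminal, matching the is_step_done note above.
    pub fn is_terminal(self) -> bool {
        matches!(
            self,
            StepState::Completed | StepState::Failed | StepState::Skipped
        )
    }
}

/// Build a target -> upstream-sources map from flattened (source, target) arcs.
pub fn build_incoming_arcs<'a>(
    arcs: &[(&'a str, &'a str)],
) -> HashMap<&'a str, Vec<&'a str>> {
    let mut incoming: HashMap<&'a str, Vec<&'a str>> = HashMap::new();
    for &(source, target) in arcs {
        incoming.entry(target).or_default().push(source);
    }
    incoming
}

/// A step with more than one incoming arc is held back until every
/// upstream step is terminal; all other steps dispatch as before.
pub fn may_dispatch(
    target: &str,
    incoming: &HashMap<&str, Vec<&str>>,
    states: &HashMap<&str, StepState>,
) -> bool {
    match incoming.get(target) {
        Some(sources) if sources.len() > 1 => sources
            .iter()
            .all(|s| states.get(*s).map_or(false, |st| st.is_terminal())),
        _ => true,
    }
}
```

An upstream with no recorded state is treated as not done, so the reduce step cannot fire on the first completing branch.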
Still draft. cargo build against the local crates.io cache
resolves noetl-executor 0.4.0 (not 0.4.1 yet) — the release-cli
workflow is in flight. Once it publishes 0.4.1, cargo update -p noetl-executor picks it up and the worker PR can be un-drafted +
merged. Pointer bump for worker waits on that.
- noetl/cli#55 (closed via cli#56 merge).
- noetl/server#142 (closed via server#143 merge).
- noetl/ai-meta#49 Phase D R4 (in progress; this slice closes the barrier-gate piece; context-merging slice + kind-val with fanout_reduce fixture are separate follow-ups).
- Umbrella-Container-Tool-Callback page (worker#60 Round-4-follow-up table stays current — still blocked on the publish).
2026-06-07 (Container Tool Callback umbrella #43 Round 4 worker-side pending_callback adoption — PRs open, blocked on noetl-executor 0.4.1 publish)
Agent: Claude · Repos touched: noetl/cli (PR), noetl/worker (PR), noetl/ai-meta.wiki
Headline. Worker-side adoption of the pending_callback
marker (the last follow-up after the just-closed Container Tool
Callback umbrella #43) coded + PRs opened. Two-PR chain across
noetl/cli (executor bridge propagation) → noetl/worker
(call.done skip + counter). Both ride a fresh ai-task sub-issue
per repo.
- noetl/cli sub-issue noetl/cli#55 + PR #56. noetl-executor 0.4.1 (patch): tools_bridge::reshape_duckdb_result propagates the new pending_callback: Option<bool> field through unchanged; bridge test fixture pins pending_callback: None. noetl-tools dep bumped to ^2.21. Built + tested: 102/0 unit. Closes noetl/cli#55, refs noetl/ai-meta#43.
- noetl/worker sub-issue noetl/worker#59 + PR #60 (draft, blocked). executor::command checks tool_result.pending_callback after successful tool execution. When Some(true): tracing INFO with execution_id + bumps new counter noetl_worker_call_done_skipped_pending_callback_total{tool_kind} + skips its own call.done emit. When None (every existing tool today, default): behaviour preserved bit-for-bit. Cargo.toml: noetl-tools = "2.18" → "2.21", noetl-executor = "0.3" → "0.4". New unit test in metrics.rs. Local validation against the patched cli executor: 126/0 lib tests pass. Closes noetl/worker#59, refs noetl/ai-meta#43, depends on noetl/cli#56.
CI on noetl/worker#60 fails until noetl-executor 0.4.1 publishes to crates.io. Sequence:
- Merge noetl/cli#56 (the bridge propagation patch).
- Tag + publish noetl-executor 0.4.1.
- CI on noetl/worker#60 turns green automatically; reviewers can un-draft.
The full Round 4 also will need a kind-validation pass against
the e2e rig at e2e/scripts/kind_validate_container_callback.sh
once the worker image is rebuilt with this PR — same rig that
landed in umbrella #43 Round 5.
After Round 4 lands + kind-validates, the container-callback
chain runs in the steady-state shape: worker dispatches the K8s
Job, releases its slot WITHOUT emitting call.done, the watcher
detects the terminal Pod state, and the server's
/api/internal/container-callback/... endpoint emits the only
call.done for the step. The server's
noetl_container_callback_stale_total{state} counter goes back
to ~0 (the race window closes), and dashboards can read
worker.skipped_total{tool_kind="container"} ≈
server.container_callback_total as the healthy-steady-state
fingerprint.
- noetl/cli#55 (sub-issue) → noetl/cli#56 (PR).
- noetl/worker#59 (sub-issue) → noetl/worker#60 (PR, draft).
- Container Tool Callback umbrella: noetl/ai-meta#43 (CLOSED).
- Wiki: Umbrella-Container-Tool-Callback (the closed umbrella's "Remaining follow-up" section gets a status note in the same change set).
2026-06-07 (Container Tool Callback umbrella #43 CLOSED — Round 5 e2e kind-val rig landed e2e@17de21d)
Agent: Claude · Repos touched: noetl/e2e (PR), noetl/ai-meta.wiki
Headline. Last round of the Container Tool Callback umbrella ships. All four Rust rounds are in; the umbrella closes with the e2e kind-val rig that proves Rounds 1 + 2 + 3 wire together end-to-end.
- Round 5 of noetl/ai-meta#43 (e2e#30, closed e2e#29; commit 17de21d). 3 new files (412 lines):
  - fixtures/playbooks/container_callback_happy_path/container_callback_happy_path.yaml: 3-step playbook (python init → kind: container alpine echo + sleep + echo → python complete). Expected terminal state: succeeded.
  - fixtures/playbooks/container_callback_oom/container_callback_oom.yaml: same shape; dispatch step runs python:3.12-alpine with a 40 MiB bytes() allocation under a 32Mi memory limit. Expected: failed_oom (the watcher's jq classifier maps pod state.terminated.reason == "OOMKilled"). Python bytes() over dd if=/dev/zero because Python's allocation actually touches physical pages (defeats lazy-allocation optimisations).
  - scripts/kind_validate_container_callback.sh: rig that preflights (kubectl + noetl + curl in PATH; watcher Deployment exists + rolled out), registers + executes each fixture, scrapes the server's /metrics BEFORE + AFTER, asserts the sum of noetl_container_callback_total{state=...} + noetl_container_callback_stale_total{state=...} moved by ≥ 1. The sum-both-counters strategy handles the worker-side pending_callback adoption transition. On failure: dumps watcher logs (tail 50) + server logs filtered to /container-callback/. Returns 0 if both probes pass; 1 otherwise.
| Round | Repo | PR / commit | Status |
|---|---|---|---|
| 1 | noetl/ops | #167 → 8892043 | CLOSED |
| 2 | noetl/server | #141 → v2.48.0 | CLOSED |
| 3 | noetl/tools | #37 → v2.21.0 | CLOSED |
| 4 | noetl/noetl (Python) | — | Parked per Rust-only standing direction |
| 5 | noetl/e2e | #30 → 17de21d | CLOSED |
The umbrella closes. Worker-side pending_callback
adoption (suppressing the worker's own call.done emit when
the marker is set) remains as a follow-up tracked under the
umbrella; harmless during the transition (the watcher's
callback is recorded by noetl_container_callback_stale_total,
which is the migration dashboard signal).
- Bump repos/e2e to commit 17de21d.
- ai-meta wiki: Home (Last refreshed + #43 moved from Active umbrellas to Recently closed; preamble count Three → Two), Sessions-Log (this entry), Umbrella-Container-Tool-Callback (mark CLOSED).
2026-06-07 (noetl/ai-meta#43 Round 3 — Tool::Container + ToolResult.pending_callback landed tools v2.21.0)
Agent: Claude · Repos touched: noetl/tools (PR), noetl/ai-meta.wiki
Headline. Third and last code round of the Container Tool Callback umbrella ships. After this round + a worker bump to the new noetl-tools, only Round 5 (e2e kind-val rig) remains to close the umbrella.
- Round 3 of noetl/ai-meta#43 (tools#37, closed tools#36; v2.21.0).
  - src/result.rs: extend ToolResult with pending_callback: Option<bool> marker. Additive + skip_serializing_if; existing consumers see no change. Set by Tool::Container to Some(true) to signal "I created an external work item; suppress your normal call.done emit". 10 existing struct-literal sites backfilled with pending_callback: None.
  - src/tools/container.rs new module: ContainerTool impl + 17 unit tests (~570 lines).
  - ContainerConfig mirrors the umbrella's catalog YAML shape: image (required) + command + args + env (literal value XOR value_from { secret_name, secret_key }) + resources (requests/limits maps) + timeout_seconds (Job's activeDeadlineSeconds) + service_account + namespace + backoff_limit + restart_policy.
  - build_job translates ContainerConfig + ExecutionContext into a K8s Job:
    - Labels: noetl.execution-id / noetl.step-name / noetl.tool-kind=container on both Job.metadata.labels AND PodTemplateSpec.metadata.labels.
    - generateName: noetl-container-<step-slug>-<eid>-; slug strips chars outside [a-zA-Z0-9-] (DNS-1123-safe), truncates to 20 chars, lowercases. Empty slug → literal "step".
    - Default namespace: noetl.
    - Default backoffLimit: 0. The playbook's own retry: block is the right place to express retry semantics; the Job controller's built-in retry would muddle the terminal-state mapping the watcher does.
    - Default restartPolicy: Never.
    - value ↔ value_from mutually exclusive at build time (returns ToolError::Configuration if both set).
  - execute() builds the kube client via Client::try_default() (reads in-cluster SA token + cluster CA), POSTs the Job via api.create(), returns immediately with pending_callback: Some(true) and the Job handle in data.
  - 17 new unit tests covering label propagation, generateName shape, slug stripping, env literal + secret-ref, value XOR value_from, empty image rejection, resources requests/limits propagation, defaults (backoff = 0; restartPolicy = Never), service_account propagation, timeout + deadline. Lib 258/0 (241 + 17 new).
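The generateName slug rule listed above is compact enough to sketch. The function names step_slug and generate_name are hypothetical, and the order of operations (strip, then truncate, then lowercase) follows the order the bullet lists them in, which is an assumption about the real build_job. Leading or trailing hyphens are not handled here.

```rust
/// Strip characters outside [a-zA-Z0-9-], keep at most 20 of what is left,
/// lowercase, and fall back to the literal "step" when nothing survives.
pub fn step_slug(step_name: &str) -> String {
    let slug = step_name
        .chars()
        .filter(|c| c.is_ascii_alphanumeric() || *c == '-')
        .take(20)
        .collect::<String>()
        .to_ascii_lowercase();
    if slug.is_empty() {
        "step".to_string()
    } else {
        slug
    }
}

/// Compose the generateName prefix; the API server appends the random suffix.
pub fn generate_name(step_name: &str, execution_id: &str) -> String {
    format!("noetl-container-{}-{}-", step_slug(step_name), execution_id)
}
```

For example, a step called my_step under execution 42 would yield the prefix noetl-container-mystep-42- under these assumptions.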
Worker-side adoption of the marker (suppressing the worker's own
call.done emit when set) is a coordinated follow-up tracked
under the same umbrella. Until that lands, the worker will emit
call.done immediately, and the watcher's later callback will be
treated as stale by the server (recorded by
noetl_container_callback_stale_total).
That race is harmless during the transition — playbooks just see early completion; the stale-counter dashboard is the migration signal for when the worker adoption needs to ship.
| Round | Sub-issue | Repo | State |
|---|---|---|---|
| 1 | ops#166 | noetl/ops | CLOSED — ops@8892043 |
| 2 | server#140 | noetl/server | CLOSED — v2.48.0 |
| 3 | tools#36 | noetl/tools | CLOSED — v2.21.0 |
| 5 | e2e#29 | noetl/e2e | Open — kind-val rig |
3 of 4 Rust rounds done. Only Round 5 remains; the
worker-side pending_callback adoption is a coordinated
follow-up (currently tracked as a comment on the umbrella, not
yet a sub-issue).
- Bump repos/tools to v2.21.0 (commit bd8ded8).
- ai-meta wiki: Home (Last refreshed + ecosystem-map tools cell v2.20.0 → v2.21.0), Sessions-Log (this entry), Releases (prepend v2.21.0 row), Umbrella-Container-Tool-Callback (Recent-activity table + Next-concrete-steps update).
Agent: Claude · Repos touched: noetl/ops (PR), noetl/ai-meta.wiki
Headline. Second concrete round of the Container Tool Callback
umbrella ships: the external K8s Job watcher that closes the loop
between Round 3's labeled-Job dispatch and Round 2's server
callback endpoint. With Round 1 + Round 2 in place, the watcher
side can be kind-validated end-to-end against the live endpoint
by manually kubectl apply-ing a labeled Job — Round 3
(Tool::Container) can land independently.
- Round 1 of noetl/ai-meta#43 (ops#167, closed ops#166; commit 8892043).
  - ci/manifests/k8s-watcher/ new directory with 5 files (527 lines):
    - README.md: full design + kind-val recipe + contract spec.
    - rbac.yaml: ServiceAccount + ClusterRole (Jobs/Pods get,list,watch; read-only, cluster-scoped because K8s watch streams filter by namespace at server side) + ClusterRoleBinding.
    - configmap.yaml: two ConfigMaps; one for env (NOETL_SERVER_URL, NOETL_K8S_WATCH_NAMESPACE, NOETL_K8S_WATCH_LABEL_SELECTOR), one for the 147-line watcher.sh body. Shipping the script in a ConfigMap lets us iterate the contract without rebuilding an image; when the watcher graduates to a Rust binary (follow-up), only the env config stays.
    - deployment.yaml: single-replica Deployment, Recreate strategy (server's Round-2 endpoint is idempotent so a brief rollout race is harmless), bitnami/kubectl:1.30.3 image + jq/curl installed via package manager at startup, initContainer waits for noetl-server's /api/health, NOETL_INTERNAL_API_TOKEN mounted from the existing noetl-internal-api-token Secret.
    - kustomization.yaml: kubectl apply -k entry point.
  - MVP shape: shell wrapper around kubectl get jobs --watch -o json piped through jq + curl. Per the sub-issue, shell is acceptable for round 1; the contract (POST body, label selector, terminal-state mapping) is what unblocks the umbrella. A pure-Rust binary is a clean follow-up once the contract proves itself.
  - Contract:
    - Watches Jobs in NOETL_K8S_WATCH_NAMESPACE carrying the NOETL_K8S_WATCH_LABEL_SELECTOR label (default noetl.execution-id).
    - Jobs without both noetl.execution-id + noetl.step-name labels are ignored.
    - On terminal-state transition (Complete / Failed condition), classifies into the six TerminalState variants matching the umbrella's failure-mode taxonomy.
    - POSTs JSON body to {NOETL_SERVER_URL}/api/internal/container-callback/{eid}/{step} with Authorization: Bearer header.
    - Retries 3× with backoff on 5xx / transport errors; never on 4xx (Round 2's handler is idempotent so duplicate POSTs are harmless).
    - In-memory dedup by Job UID to avoid double-posting within the watcher's lifetime; server-side idempotency (Round 2) is the actual contract, and this is just a hot-path optimization.
  - jq classifier maps K8s Job conditions to the six TerminalState enum variants. Finer-grain mapping for failed_image_pull / failed_oom / failed_node_lost (requires reading pod status, which the watcher has RBAC for) is staged at "minimal correctness now, finer-grain in a follow-up"; server treats unknown reasons as failed so no information is lost.
  - Sanity-checked: kubectl kustomize ci/manifests/k8s-watcher/ renders 327 lines of valid YAML; sh -n watcher.sh clean; jq classification dry-run resolves a Complete Job to succeeded.
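The retry and dedup parts of the contract can be modelled in Rust as a rough guide for the eventual binary. This is not the shell watcher's code: should_retry, post_with_retry, and SeenJobs are hypothetical names, "3×" is read here as up to three retries after the first attempt (an assumption), and the backoff sleep is left out so the sketch stays deterministic.

```rust
use std::collections::HashSet;

/// One POST attempt yields Some(status) or None for a transport error.
/// Retry on transport errors and 5xx; never on 4xx or success.
pub fn should_retry(status: Option<u16>) -> bool {
    match status {
        None => true,
        Some(code) => (500..=599).contains(&code),
    }
}

/// Run the first attempt plus up to max_retries retries.
/// Returns the last observed status and the number of attempts made.
pub fn post_with_retry<F>(mut attempt: F, max_retries: u32) -> (Option<u16>, u32)
where
    F: FnMut() -> Option<u16>,
{
    let mut attempts = 0;
    loop {
        let status = attempt();
        attempts += 1;
        if !should_retry(status) || attempts > max_retries {
            return (status, attempts);
        }
        // A real watcher would sleep with backoff before the next attempt.
    }
}

/// In-memory dedup by Job UID; lives only as long as the watcher process.
#[derive(Default)]
pub struct SeenJobs {
    uids: HashSet<String>,
}

impl SeenJobs {
    /// Returns true the first time a UID is seen, false on repeats.
    pub fn first_sighting(&mut self, uid: &str) -> bool {
        self.uids.insert(uid.to_string())
    }
}
```

Because the server handler is idempotent, losing the dedup set on restart only costs a duplicate POST, never a duplicate call.done.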
| Round | Sub-issue | Repo | State |
|---|---|---|---|
| 1 | ops#166 | noetl/ops | CLOSED — ops@8892043 |
| 2 | server#140 | noetl/server | CLOSED — v2.48.0 |
| 3 | tools#36 | noetl/tools | Open — Tool::Container with PendingCallback marker |
| 5 | e2e#29 | noetl/e2e | Open — kind-val rig |
Rounds 1 + 2 are both in. The Round-1 ↔ Round-2 chain can
be kind-validated end-to-end against the live endpoint by
manually kubectl apply-ing a labeled Job before Round 3 lands
the tool side. Round 3 (Tool::Container) is the last code
round in the chain — once it lands, the worker dispatches real
labeled Jobs and the umbrella's only remaining round is
Round 5 (e2e kind-val rig).
- Bump repos/ops to commit 8892043.
- ai-meta wiki: Home (Last refreshed + ecosystem-map ops cell with new headline), Sessions-Log (this entry), Umbrella-Container-Tool-Callback (Recent-activity table + Next-concrete-steps update).
Agent: Claude · Repos touched: noetl/server (PR), noetl/ai-meta.wiki
Headline. First concrete round of the Container Tool Callback
umbrella ships on the server side: a new /api/internal/* handler
that consumes the future K8s watcher's POST and emits a call.done
event on the orchestrator's pipeline. Smallest blast radius —
unblocks the rest of the umbrella's four-round Rust path.
- Round 2 of noetl/ai-meta#43: POST /api/internal/container-callback/{execution_id}/{step} (server#141, closed server#140; v2.48.0).
  - src/handlers/container_callback.rs new module: full handler + 7 unit tests in 400 lines.
  - Six TerminalState variants matching the umbrella's failure-mode taxonomy: succeeded / failed / failed_image_pull / failed_oom / failed_node_lost / failed_timeout. Each survives in meta.terminal_state so the playbook can branch on the specific failure reason.
  - Stale check: single indexed SELECT on noetl.event for the execution_id. Zero rows → bump noetl_container_callback_stale_total{state} + log INFO + return 202 without emitting. Match → emit call.done via the standard insert_event path.
  - Returns 202 unconditionally on path-param validation success (the watcher is idempotent + may race with retries; the server should never 4xx on a stale callback).
  - Auth: existing RequireInternalApiToken extractor (same shape as the rest of /api/internal/*).
  - Observability per observability.md Principle 1: span container_callback carrying execution_id + step + state; counters noetl_container_callback_total{state} + noetl_container_callback_stale_total{state}; structured INFO on emit + stale paths.
  - 7 new unit tests; lib 487/0 (480 + 7 new).
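For orientation, the taxonomy and the stale check reduce to something like the sketch below. The variant names mirror the six wire strings; the parse and handle_callback functions, the CallbackOutcome type, and the fallback of unknown strings to Failed are assumptions for illustration, not the handler's actual signature.

```rust
/// The six terminal states from the umbrella's failure-mode taxonomy.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum TerminalState {
    Succeeded,
    Failed,
    FailedImagePull,
    FailedOom,
    FailedNodeLost,
    FailedTimeout,
}

impl TerminalState {
    /// Wire string, as it would appear in meta.terminal_state.
    pub fn as_str(self) -> &'static str {
        match self {
            TerminalState::Succeeded => "succeeded",
            TerminalState::Failed => "failed",
            TerminalState::FailedImagePull => "failed_image_pull",
            TerminalState::FailedOom => "failed_oom",
            TerminalState::FailedNodeLost => "failed_node_lost",
            TerminalState::FailedTimeout => "failed_timeout",
        }
    }

    /// Parse a wire string; unknown values fall back to Failed (assumption).
    pub fn parse(raw: &str) -> TerminalState {
        match raw {
            "succeeded" => TerminalState::Succeeded,
            "failed_image_pull" => TerminalState::FailedImagePull,
            "failed_oom" => TerminalState::FailedOom,
            "failed_node_lost" => TerminalState::FailedNodeLost,
            "failed_timeout" => TerminalState::FailedTimeout,
            _ => TerminalState::Failed,
        }
    }
}

/// What the handler did with a callback that passed path validation.
#[derive(Debug, PartialEq, Eq)]
pub enum CallbackOutcome {
    Emitted,
    Stale,
}

/// matching_rows is the row count from the indexed lookup on noetl.event.
/// Both branches answer 202; only the non-stale branch emits call.done.
pub fn handle_callback(matching_rows: usize) -> (u16, CallbackOutcome) {
    if matching_rows == 0 {
        (202, CallbackOutcome::Stale)
    } else {
        (202, CallbackOutcome::Emitted)
    }
}
```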
| Round | Sub-issue | Repo | State |
|---|---|---|---|
| 1 | ops#166 | noetl/ops | Open — watcher Deployment + RBAC |
| 2 | server#140 | noetl/server | CLOSED — endpoint shipped today |
| 3 | tools#36 | noetl/tools | Open — Tool::Container with PendingCallback marker |
| 5 | e2e#29 | noetl/e2e | Open — kind-val rig |
Round 2 unblocks Round 1: the watcher Deployment now has somewhere
to POST. Kind-validation of Round 1 can manually kubectl apply
a labeled Job to drive the endpoint end-to-end before Round 3
(Tool::Container) lands the worker side.
- Bump repos/server to v2.48.0 (commit fb898e5).
- ai-meta wiki: Home (Last refreshed + ecosystem-map server cell v2.47.0 → v2.48.0), Sessions-Log (this entry), Releases (prepend v2.48.0 row).
2026-06-07 (noetl/ai-meta#64 closes — artifact tool kind added to Rust noetl-tools registry; tools v2.20.0)
Agent: Claude · Repos touched: noetl/tools (PR), noetl/ai-meta.wiki
Headline. Continuing-in-natural-order after the Secrets Wallet
umbrella closed, picked up the smallest remaining open umbrella
(#64) and shipped it. A thin ArtifactTool adapter in noetl-tools
translates the Python-era YAML shape (action: get +
input.result_ref) into a ResultFetchTool call. Keeps the three
e2e fixtures that use kind: artifact working without modification.
- noetl/tools#35 (landed via tools v2.20.0; closes noetl/tools#34; closes noetl/ai-meta#64):
  - New src/tools/artifact.rs: ArtifactTool impls Tool; name() = "artifact".
  - Holds a delegate ResultFetchTool + a TemplateEngine. execute() template-renders the raw config first (so input.result_ref: "{{ start._ref }}" resolves before deserialisation), translates to a synthetic result_fetch-shaped JSON, wraps in a ToolConfig with kind: "result_fetch", and delegates.
  - Pass-throughs honoured: prefer, flight_endpoint, bearer_token, tls_ca_path, client_cert_path, client_key_path all copy through to the delegate unchanged.
  - action: put returns ToolError::Configuration pointing the operator at the worker's call.done emit path (R-2.1) per agents/rules/execution-model.md. The playbook-side push surface is intentionally absent in the Rust path: a step's result lands in the result store via the worker's emit, not via a tool kind invoked by the playbook author.
  - Unknown actions rejected with the unknown name surfaced.
  - Missing input: block; the typed deserialiser names the missing field for the operator.
  - 8 new unit tests in tools::artifact::tests covering: get translation with ref-only, get with all six pass-throughs, defaults action to "get" (matches Python worker default), put returns error pointing at emit path, unknown action rejected, missing input returns config error, tool name is "artifact", ToolConfig round-trip translation. Lib 241/0 (8 new).
  - Backward compatible: new tool kind; existing tools untouched; existing fixtures using kind: result_fetch keep working.
noetl/ai-meta#64 listed two branches: (a) add artifact to the
Rust registry (or alias to result_fetch); (b) treat as Python-era
and migrate the fixtures to kind: result_fetch. Chose (a)
aliasing because:
- Smaller blast radius — touches noetl/tools only, not noetl/e2e.
- Three fixtures already in production use the shape; migration cost exceeds the adapter cost.
- The adapter is ~80 lines of actual logic (rest is doc + tests);
no behavior drift from the underlying
ResultFetchTool.
- noetl/ai-meta#64 closes: the worker bumps to noetl-tools v2.20.0 and the #54 e2e sweep can re-run test_output_select / test_gcs_storage / test_storage_tiers once the worker pointer bumps (separate ai-task issue if not already underway).
- The Rust noetl-tools registry now matches the Python tool-kind inventory for the surfaces the e2e fixtures exercise.
- Bump repos/tools to v2.20.0 (commit a48da13).
- ai-meta wiki: Home (Last refreshed, ecosystem-map tools cell v2.19.3 → v2.20.0, move #64 from Active umbrellas to Recently closed, preamble count Five → Four), Sessions-Log (this entry), Releases (prepend v2.20.0 row).
2026-06-07 (Secrets Wallet umbrella #61 closes — three 6d.X cloud-specific dynamic providers landed; v2.45.0 + v2.46.0 + v2.47.0)
Agent: Claude · Repos touched: noetl/server (3 PRs), noetl/ai-meta.wiki
Headline. All three cloud-specific dynamic-secret providers on the Secrets Wallet umbrella shipped this session. The umbrella noetl/ai-meta#61 is now feature-complete — every named phase + every queued follow-up has landed in noetl/server. The platform-side wallet has nothing left to ship; future work would be new product surface (e.g. additional providers, additional residency-policy modes) rather than completing the original umbrella scope.
- Phase 6d.1: AWS STS AssumeRoleWithWebIdentity provider (server#137, closed server#132; v2.45.0): exchanges the EKS-projected ServiceAccount JWT (AWS_WEB_IDENTITY_TOKEN_FILE) for short-lived AWS temporary credentials via STS. No SigV4: the WebIdentityToken IS the credential (STS anonymous action), so no static AWS_ACCESS_KEY_ID needed. Response parser accepts both XML (legacy / VPC endpoints) and JSON (modern STS). Reference shape [<region>:]<role-arn>[#<session-name>]. Re-reads the token file on every fetch (kubelet rotates projected tokens every ~hour by default). 15 new unit tests; lib 456/0.
- Phase 6d.3: Azure AAD client-credentials provider (server#139, closed server#134; v2.46.0): off-cluster (non-IMDS) AAD client_credentials flow for deployments running outside AKS that need to call Azure APIs. Reads AZURE_TENANT_ID / AZURE_CLIENT_ID / AZURE_CLIENT_SECRET from env. Sovereign-cloud overrides via NOETL_AZURE_AAD_HOST (Azure Gov / China). Reference shape [<tenant>:]<scope>; parser only treats the :-prefix as a tenant if it doesn't look like a URL scheme (https / http). 14 new unit tests.
- Phase 6d.2: GCP iamcredentials.generateAccessToken provider (server#138, closed server#133; v2.47.0): mints short-lived OAuth2 access tokens for a target service account via workload-identity impersonation. Reads the caller's Workload-Identity token from the GKE metadata server (shares the env override NOETL_GCP_METADATA_TOKEN_URL with GcpSecretManager). Reference shape <target-sa-email>[#<scope>]. 10 new unit tests.
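The Azure and GCP reference shapes above can be illustrated with a small parsing sketch. Both function names are hypothetical and the real parsers may validate more; in particular, this sketch only recognises https and http as URL schemes, so a scope with another scheme would be misread as carrying a tenant prefix.

```rust
/// True when the text before the first ':' looks like a URL scheme.
fn is_url_scheme(prefix: &str) -> bool {
    prefix.eq_ignore_ascii_case("https") || prefix.eq_ignore_ascii_case("http")
}

/// Parse [<tenant>:]<scope> into (tenant, scope).
/// The prefix counts as a tenant only if it is non-empty and not a URL scheme.
pub fn parse_azure_reference(reference: &str) -> (Option<&str>, &str) {
    match reference.split_once(':') {
        Some((prefix, rest)) if !prefix.is_empty() && !is_url_scheme(prefix) => {
            (Some(prefix), rest)
        }
        _ => (None, reference),
    }
}

/// Parse <target-sa-email>[#<scope>] into (service account, scope).
pub fn parse_gcp_reference(reference: &str) -> (&str, Option<&str>) {
    match reference.split_once('#') {
        Some((service_account, scope)) => (service_account, Some(scope)),
        None => (reference, None),
    }
}
```

The AWS shape is left out on purpose: role ARNs contain colons themselves, so a naive split on ':' would not be a faithful illustration.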
All three providers return SecretValue.expires_at populated
from the issuer's response — Phase 6d's cache_decision clamps
cache TTL to min(default_ttl, expires_at - now - safety_margin);
Phase 7c.3's background refresh re-resolves inside the refresh
window. Three discrete merge-conflict resolution cycles in
src/secrets/mod.rs (each new provider added to the same
factory's match arms + supported-providers error message string);
all conflicts were one-line string combinations and the additive
mod / pub use / match-arm lines auto-merged cleanly.
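The TTL clamp is simple arithmetic, sketched here in whole seconds. The function name clamp_cache_ttl is hypothetical; falling back to default_ttl when the issuer gives no expiry, and saturating at zero rather than going negative, are both assumptions about how cache_decision behaves.

```rust
/// Cache TTL = min(default_ttl, expires_at - now - safety_margin), in seconds.
/// With no expires_at the default applies; a value that would go negative
/// saturates at 0, meaning "do not cache".
pub fn clamp_cache_ttl(
    default_ttl: u64,
    expires_at: Option<u64>,
    now: u64,
    safety_margin: u64,
) -> u64 {
    match expires_at {
        None => default_ttl,
        Some(exp) => {
            let remaining = exp.saturating_sub(now).saturating_sub(safety_margin);
            default_ttl.min(remaining)
        }
    }
}
```

With a default of 3600 s, a token expiring 900 s from now, and a 60 s safety margin, the cached entry would live 840 s.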
Secrets Wallet umbrella noetl/ai-meta#61 is feature-complete:
- 1 envelope encryption (v2.21.0)
- 2 GCP Cloud KMS KeyManager (v2.22.0)
- 3 Secret resolution via the auth:/keychain path (v2.23.0 → v2.26.0)
- 3b + 3c Keychain caching + provider:-backed entries (v2.27.0)
- Providers: 5 static (GCP Secret Manager, K8s Secrets, Vault, AWS Secrets Manager, Azure Key Vault) at v2.28.0 → v2.31.0
- 4 Transport mTLS (4a server v2.30.0 + 4b worker v5.12.0 + 4c cert-manager + 4d Helm)
- 5 Sealed payload delivery (5a v2.32.0 + 5b v2.33.0 + 5c worker v5.13.0)
- 6 Residency-aware distributed resolution (6a region tag v2.34.0 + 6b ProviderRegistry v2.35.0 + 6c residency-policy gate v2.36.0 + 6d primitives v2.37.0 + 6e cross-region broker v2.38.0)
- 6d Dynamic-secret providers (6d.1 AWS STS v2.45.0 + 6d.2 GCP iamcredentials v2.47.0 + 6d.3 Azure AAD v2.46.0)
- 7 Rotation + audit + auto-renewal (7a KEK rotation primitives v2.39.0 + 7a.2 endpoints v2.42.0 + 7b audit service v2.40.0 + 7b.2 table + GET endpoint v2.43.0 + 7c should_refresh primitive v2.41.0 + 7c.2 cache companion v2.43.0 + 7c.3 resolver-side wire-up + stampede collapse v2.44.0)
The umbrella issue gets closed as part of this change set.
- Bump repos/server to v2.47.0 (commit 605b8b1).
- ai-meta wiki: Home (Last refreshed, ecosystem-map server cell v2.44.0 → v2.47.0, move #61 from Active umbrellas to Recently closed, drop the count from Six to Five), Sessions-Log (this entry), Releases (prepend v2.45.0 / v2.46.0 / v2.47.0), Umbrella-Secrets-Wallet (mark feature-complete + closing summary at top).
2026-06-07 (Secrets Wallet #61 Phase 7c.3 — resolver-side stampede mutex + background re-resolve, landed v2.44.0)
Agent: Claude · Repos touched: noetl/server (PR), noetl/ai-meta.wiki
Headline. Phase 7c.3 wires the Phase-7c decision primitive +
the Phase-7c.2 cache-side companion into the resolver's cache-hit
path. When CredentialService::try_resolve_keychain hits a
fresh-but-aging row, the cached value returns IMMEDIATELY (worker
fetches stay on the fast path) and a background tokio::spawn
re-resolves via the Phase-3b SecretProvider + updates the cache via
KeychainService::set. The Phase 7c series (primitive + cache
companion + resolver wire-up) is now wire-complete.
- Phase 7c.3: resolver-side stampede mutex + background re-resolve (server#136, closed server#135; v2.44.0):
  - Stampede collapse: new src/services/keychain_refresh.rs RefreshInflight wraps Arc<tokio::sync::Mutex<HashSet<(i64, String)>>> with try_claim (atomic insert returning whether the slot was free) + release. The struct is Clone (cheap Arc clone) so every CredentialService instance derived from the same root shares the same inflight set. N workers crossing the refresh threshold for the same (catalog_id, alias) collapse to one provider call; concurrent callers piggy-back via noetl_secret_refresh_total{outcome="stampede_collapsed"}.
  - Refactor: extracted the cache-miss provider resolution (catalog → playbook → provider → cache write) into a separate resolve_via_provider method. Both the cache-miss inline path AND the background-refresh task call it: identical code path, no behavior drift between cold-miss and proactive-refresh.
  - Background-task lifecycle: cache hit → maybe_spawn_refresh → should_refresh check (single indexed cache read; awaited) → try_claim → tokio::spawn with cloned service state → background: resolve_via_provider → record outcome metric (succeeded | failed) + duration histogram → release slot. On stampede: bump stampede_collapsed + return.
  - Failure modes handled: should_refresh read errors → log + skip (cached value already went out, never fail the credential lookup); provider failure in background → log + bump outcome="failed" + release slot; stampede → bump outcome="stampede_collapsed".
  - Six new unit tests in services::keychain_refresh::tests: single-claim succeeds; second-claim for same key returns false (stampede signal); release allows re-claim; distinct keys don't collide; release is idempotent; clone shares inner state (load-bearing: verifies the stampede-collapse invariant works across CredentialService clones). Lib 441/0 passing.
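The claim/release shape of the stampede guard looks roughly like the sketch below. It deliberately swaps tokio::sync::Mutex for std::sync::Mutex so it compiles without an async runtime; the method signatures are assumptions modelled on the description above, not copied from keychain_refresh.rs.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};

/// Shared set of (catalog_id, alias) keys with a refresh already in flight.
/// Clone is a cheap Arc clone, so every clone sees the same set.
#[derive(Clone, Default)]
pub struct RefreshInflight {
    inner: Arc<Mutex<HashSet<(i64, String)>>>,
}

impl RefreshInflight {
    /// Insert the key and report whether the slot was free.
    /// A false return is the stampede signal: someone else is refreshing.
    pub fn try_claim(&self, catalog_id: i64, alias: &str) -> bool {
        self.inner
            .lock()
            .unwrap()
            .insert((catalog_id, alias.to_string()))
    }

    /// Free the slot; releasing an unclaimed key is a no-op.
    pub fn release(&self, catalog_id: i64, alias: &str) {
        self.inner
            .lock()
            .unwrap()
            .remove(&(catalog_id, alias.to_string()));
    }
}
```

A caller that wins try_claim spawns the background refresh and calls release when it finishes; a caller that loses returns the cached value and bumps the stampede_collapsed outcome.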
Secrets Wallet umbrella noetl/ai-meta#61: all named phases (1–7) plus Phase 7c.3 resolver wire-up are now wire-complete on the platform side. The only remaining work is the three cloud-specific dynamic-secret providers — each its own sub-issue:
- server#132, Phase 6d.1 AWS STS AssumeRoleWithWebIdentity.
- server#133, Phase 6d.2 GCP iamcredentials.generateAccessToken.
- server#134, Phase 6d.3 Azure AAD client-credentials.
Each provider implementation needs the target cloud reachable for integration testing, so they're best scoped as discrete future rounds that pull the actual cloud credentials when picked up.
- Bump repos/server to v2.44.0 (commit 1851f68).
- ai-meta wiki: Home (Last refreshed, ecosystem-map server cell), Sessions-Log (this entry), Releases (v2.44.0), Umbrella-Secrets-Wallet (latest landings).
Agent: Claude · Repos touched: noetl/server (3 PRs), noetl/server.wiki, noetl/ai-meta.wiki
Headline. Three queued follow-up rounds for the Secrets Wallet umbrella shipped this session — operator-facing endpoints and DB storage that wrap the lib-only primitives shipped in v2.39.0 / v2.40.0 / v2.41.0. All three Phase-7 named rounds (rotation + audit + auto-renewal) now have functional endpoints; three cloud-specific dynamic-secret provider sub-issues filed for the remaining 6d.X follow-up work.
- Phase 7a.2 — KEK rotation endpoint + key-status + DB scans (server#127, closed server#126; v2.42.0): `POST /api/internal/wallet/rotate-kek?batch_size=&max_batches=&table=` runs a batched cursor scan across `noetl.credential` + `noetl.keychain`, calls the Phase-7a `rewrap_storage_string` per row, returns `RotateSummary { processed, rewrapped, skipped, failed, last_id }` for progress checkpointing across runs. `GET /api/internal/wallet/key-status` reports per-version row counts so an operator can confirm completion before retiring the old KEK version. `classify_failure` heuristic maps thrown errors to `parse_error | failed_unwrap | failed_wrap | failed` so the `noetl_wallet_rotate_total{table, status}` counter splits clean. Plaintext NEVER reconstructed (Phase 7a invariant preserved).
- Phase 7b.2 — `noetl.secret_audit` table + DbAuditSink + GET endpoint (server#129, closed server#128; v2.43.0): durable storage path for the Phase-7b service. `noetl.secret_audit` table provisioned via `CREATE TABLE IF NOT EXISTS` at server startup (server-owned, no out-of-band migration step) — `audit_id` PK, `credential` + `execution_id` + `occurred_at` indexed. `DbAuditSink` impl of `AuditSink` writes via `db::queries::secret_audit::insert` with `ON CONFLICT DO NOTHING` for idempotency. New `GET /api/internal/secret-audit?credential=&execution_id=&from=&to=&limit=` returns bounded rows ORDER BY occurred_at DESC (hard cap 10_000). `NoopAuditSink` stays the default when `NOETL_SECRET_AUDIT_REQUIRED` is unset. Two merge conflicts with Phase-7a.2 in `db/queries/mod.rs` + `main.rs` resolved as additive (both modules / route blocks coexist).
- Phase 7c.2 — `KeychainService::should_refresh` cache-side primitive (server#131, closed server#130; v2.43.0): cache-layer companion of the Phase-7c decision primitive. `KeychainService::should_refresh(catalog_id, keychain_name, execution_id, scope_type, now)` reads the cache row's `expires_at`, asks `secrets::dynamic::should_refresh_default` (honours `KEYCHAIN_CACHE_REFRESH_WINDOW_SECS`), bumps `noetl_secret_refresh_total{outcome="triggered"}` on a true return. Falls through to false when the row is missing, has no `expires_at`, is already expired (eviction path, not refresh), or is outside the refresh window. Backward compatible — new method, existing call sites unchanged.
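The batched cursor scan with checkpointing described for Phase 7a.2 can be pictured roughly as below. This is a minimal in-memory sketch, not the server's code: `Row`, `RowOutcome`, `rewrap_row`, and `rotate` are hypothetical names, the slice stands in for a table ordered by `id`, and an empty `key_version` is an arbitrary stand-in for a row that fails to parse. Only the `RotateSummary` field names come from the entry above.

```rust
/// Progress counters for one rotation run (field names from the log entry).
#[derive(Debug, Default, PartialEq, Eq)]
struct RotateSummary {
    processed: u64,
    rewrapped: u64,
    skipped: u64,
    failed: u64,
    last_id: Option<i64>,
}

/// Hypothetical stand-in for a stored row; an empty version models a parse error.
struct Row {
    id: i64,
    key_version: String,
}

enum RowOutcome {
    Rewrapped,
    Skipped,
    Failed,
}

/// Stand-in for the per-row rewrap call: skip rows already on the current version.
fn rewrap_row(row: &mut Row, current: &str) -> RowOutcome {
    if row.key_version.is_empty() {
        RowOutcome::Failed
    } else if row.key_version == current {
        RowOutcome::Skipped
    } else {
        row.key_version = current.to_string();
        RowOutcome::Rewrapped
    }
}

/// Keyset-style scan: each batch takes rows with `id > last_id`, in id order.
/// Assumes `rows` is sorted by `id`, as an `ORDER BY id` query would return them.
fn rotate(
    rows: &mut [Row],
    current: &str,
    after_id: Option<i64>,
    batch_size: usize,
    max_batches: usize,
) -> RotateSummary {
    let mut summary = RotateSummary { last_id: after_id, ..Default::default() };
    for _ in 0..max_batches {
        let cursor = summary.last_id;
        let batch: Vec<usize> = rows
            .iter()
            .enumerate()
            .filter(|(_, row)| cursor.map_or(true, |c| row.id > c))
            .map(|(index, _)| index)
            .take(batch_size)
            .collect();
        if batch.is_empty() {
            break; // nothing left past the cursor
        }
        for index in batch {
            match rewrap_row(&mut rows[index], current) {
                RowOutcome::Rewrapped => summary.rewrapped += 1,
                RowOutcome::Skipped => summary.skipped += 1,
                RowOutcome::Failed => summary.failed += 1,
            }
            summary.processed += 1;
            summary.last_id = Some(rows[index].id); // checkpoint for the next run
        }
    }
    summary
}
```

Passing the previous run's `last_id` back in as `after_id` resumes the scan where it stopped. In this sketch a failed row still advances the cursor, so it is not retried on resume; whether the real endpoint behaves that way is not stated in the entry.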
- server#132 — Phase 6d.1: AWS STS `AssumeRoleWithWebIdentity` dynamic-secret provider (EKS IRSA path).
- server#133 — Phase 6d.2: GCP `iamcredentials.generateAccessToken` dynamic-secret provider (workload-identity impersonation).
- server#134 — Phase 6d.3: Azure AAD client-credentials dynamic-secret provider (off-cluster, non-IMDS).
- Phase 7c.3 (next session on this branch): per-`(catalog_id, alias)` `tokio::sync::Mutex` stampede collapse + `tokio::spawn` background re-resolve via the Phase-3b SecretProvider path + write via `KeychainService::set`. Held back from 7c.2 because the resolver-side wire-up wants the cache-side primitive stable first.
Secrets Wallet umbrella noetl/ai-meta#61: all named phases (1–7) shipped. Remaining work is the three cloud-specific dynamic-secret providers + Phase 7c.3 stampede collapse — all discrete sub-issues; umbrella stays open until they close. noetl/server is at v2.43.0.
- Bump `repos/server` to v2.43.0 (commit `ee8cebe`).
- ai-meta wiki: Home (`Last refreshed`, ecosystem-map server cell, Active umbrellas #61 row), Sessions-Log (this entry), Releases (v2.42.0 + v2.43.0), Umbrella-Secrets-Wallet (recent activity + next concrete steps).
2026-06-06 (Secrets Wallet #61 Phase 7c — token auto-renewal, landed v2.41.0 — closes Phase 7; all named rounds 1–7 done)
Agent: Claude · Repos touched: noetl/server (PR), noetl/server.wiki, noetl/ai-meta.wiki
Headline. Phase 7 of the Secrets Wallet umbrella closes — every
named phase (1–7) is now complete. OAuth2 / JWT access tokens with
expires_in are the dominant short-lived credential shape in
production. Phase 6d's cache-TTL plumbing kept dead tokens out;
Phase 7c adds proactive refresh-before-expiry so the cache renews
when the remaining lifetime drops below a threshold. Worker fetches
stay on the cached-fresh-token fast path; the auth playbook runs at
most once per natural token lifetime instead of once per worker
burst. Tail latency stays flat across rotations.
What landed (server#125,
merged → noetl-server v2.41.0 (e1bb4f8), closed server#124):
- `secrets::dynamic::should_refresh(expires_at, refresh_window, now)` decision primitive: returns `true` iff `expires_at` is set, still valid (`expires_at > now`), and inside the refresh window (`expires_at - refresh_window <= now`). Pure function; no side effects.
- `secrets::dynamic::should_refresh_default(expires_at, now)` — convenience wrapper that reads the window from env.
- `KEYCHAIN_CACHE_REFRESH_WINDOW_SECS` env (default 60 s) — how long before `expires_at` to mark a cached row "renewable."
- `noetl_secret_refresh_total{outcome}` counter per `agents/rules/observability.md` Principle 1. `outcome ∈ {triggered, succeeded, failed, stampede_collapsed}`. `failed` at sustained rate is alert-worthy — provider is unreachable AND a cached token is about to expire. Aliases are NOT a label (cardinality blowup); per-alias detail rides the `secret.refresh` tracing span.
- `noetl_secret_refresh_duration_seconds` histogram — buckets `[0.05, 0.1, 0.25, 0.5, 1, 2, 5]` (auth round-trips dominate), observed regardless of outcome so dashboards surface "slow" + "failing" independently.
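The refresh decision in the first bullet reduces to two comparisons. The sketch below assumes timestamps are plain Unix epoch seconds (`u64`) rather than whatever time types the server uses, and `refresh_window_from` is a hypothetical helper; falling back to 60 s on an unparsable value is an assumption, not something the entry specifies.

```rust
/// Default window from the entry above (60 s).
const DEFAULT_REFRESH_WINDOW_SECS: u64 = 60;

/// True iff the token has an expiry, is still valid, and is inside the window.
/// `expires_at` and `now` are Unix epoch seconds (an assumption of this sketch).
fn should_refresh(expires_at: Option<u64>, refresh_window: u64, now: u64) -> bool {
    match expires_at {
        None => false,
        Some(deadline) => deadline > now && deadline.saturating_sub(refresh_window) <= now,
    }
}

/// Hypothetical helper: parse the window, falling back to the default.
fn refresh_window_from(raw: Option<&str>) -> u64 {
    raw.and_then(|value| value.trim().parse::<u64>().ok())
        .unwrap_or(DEFAULT_REFRESH_WINDOW_SECS)
}

/// Convenience wrapper that reads the window from the environment.
fn should_refresh_default(expires_at: Option<u64>, now: u64) -> bool {
    let raw = std::env::var("KEYCHAIN_CACHE_REFRESH_WINDOW_SECS").ok();
    should_refresh(expires_at, refresh_window_from(raw.as_deref()), now)
}
```

With a 60 s window and `now = 1000`, a token expiring at 1030 or exactly at 1060 is renewable, one expiring at 1120 is not yet, and one that expired at 1000 or earlier is left to the eviction path.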
Tests. Five new in secrets::dynamic::tests: returns false when
no expires_at; returns false when already expired (defensive —
that's the eviction path); returns false when outside window; returns
true inside window; boundary case (expires_at = now + window)
returns true. Lib 427 / 0 passing.
Wiki. Server deployment-specification got a new "Token auto-renewal (Phase 7c primitives)" subsection under Secret providers + the new env + both metrics (server-wiki@0ad86de).
Phase 7 architectural shape (now complete):
- Phase 7a: rotation primitives — rewrap stored envelopes under new KEK version without reconstructing plaintext.
- Phase 7b: audit primitives — AuditEvent + AuditSink + record_strict / record_async modes; NEVER stores the value.
- Phase 7c: refresh primitives — should_refresh() decides when to renew a still-valid cached token in the background.
Phase 7 closes. All named rounds of the Secrets Wallet umbrella (1 envelope encryption → 2 KMS providers → 3 secret resolution + 3c keychain cache → 4 transport mTLS → 5 sealed payload delivery → 6 residency + cross-region broker → 7 rotation + audit + auto-renewal) are complete.
Remaining queue (all discrete follow-up sub-issues, each its own bounded round):
- 7a.2 — Wallet KEK rotation endpoint (`POST /api/internal/wallet/rotate-kek`) + DB scans over `noetl.credential` + `noetl.keychain` + diagnostic `GET /api/internal/wallet/key-status`.
- 7b.2 — `noetl.secret_audit` table + `DbAuditSink` impl + `GET /api/internal/secret-audit` query endpoint + wire-up to the four credential surfaces.
- 7c.2 — `KeychainService::should_refresh` + resolver wire-up + per-`(catalog_id, alias)` stampede mutex + refresh path records its own Phase-7b `AuditEvent`.
- 6d.1 / 6d.2 / 6d.3 — AWS STS `AssumeRoleWithWebIdentity` / GCP `iamcredentials.generateAccessToken` / Azure AAD client-credentials dynamic-secret providers.
These plug into the existing primitives without further trait / schema surgery.
2026-06-06 (Secrets Wallet #61 Phase 7b primitives — secret-resolution audit service, landed v2.40.0)
Agent: Claude · Repos touched: noetl/server (PR), noetl/server.wiki, noetl/ai-meta.wiki
Headline. Phase 7 round 2. Today the wallet has no durable record
of "who accessed credential X at time Y, on which execution, with
what outcome." The tracing-span surface evaporates with log
retention; compliance regimes (SOC 2, ISO 27001, FedRAMP, PCI-DSS)
require a queryable audit trail with retention measured in years.
This round ships the in-process service primitives; the actual
noetl.secret_audit table + query endpoint + handler integration
lands in 7b.2.
What landed (server#123,
merged → noetl-server v2.40.0 (eb1840f), closed server#122):
- `services::secret_audit::AuditEvent` struct — `audit_id` (application-side snowflake) + `occurred_at` + `credential` + bounded `operation` + bounded `outcome` + `worker_id` / `execution_id` / `server_region` / `broker_region` / `kek_version` / `notes`. NEVER contains the secret value.
- `Operation` + `Outcome` bounded enums with `as_str()` — drift guard against the strings used in the `noetl_secret_audit_writes_total{operation, outcome, ...}` metric labels.
- `AuditSink` trait + `NoopAuditSink` default impl (audit-disabled deployments + the audit-disabled test path).
- `SecretAuditService` wrapper with three calls:
  - `record_async` — fire-and-forget; spawns a tokio task; never blocks the resolver. Failed writes log + drop + `noetl_secret_audit_writes_total{status="dropped_async"}`.
  - `record_strict` — awaits the result. Used when compliance requires the audit row exist before the value releases. Failed writes propagate the error to the handler.
  - `record` — branches by `strict`. Typical handler call.
- `NOETL_SECRET_AUDIT_REQUIRED` env (default false; `1`/`true`/`TRUE`/`yes`/`YES` enable strict mode).
- `noetl_secret_audit_writes_total{operation, outcome, status}` counter — `status ∈ {written, dropped_async, failed_strict}`. `failed_strict` is alert-worthy — wallet refused to release a credential because the audit couldn't be recorded.
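The strict versus fire-and-forget split can be illustrated with a deliberately simplified, synchronous sketch. The real service is described as spawning a tokio task for the async path; here a failed non-strict write is simply swallowed and reported as `dropped_async`. The `Operation` variants, the reduced `AuditEvent` fields, `FailingSink`, and `strict_from` are hypothetical and exist only for illustration.

```rust
/// Bounded operation labels; the variants here are hypothetical examples.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Operation {
    Resolve,
    Refresh,
}

impl Operation {
    /// Single source of truth for the metric-label string (the drift guard).
    fn as_str(self) -> &'static str {
        match self {
            Operation::Resolve => "resolve",
            Operation::Refresh => "refresh",
        }
    }
}

/// Reduced event shape; note there is no field for the secret value.
#[derive(Debug, Clone)]
struct AuditEvent {
    audit_id: u64,
    credential: String,
    operation: Operation,
}

trait AuditSink {
    fn write(&self, event: &AuditEvent) -> Result<(), String>;
}

/// Default sink for audit-disabled deployments: always succeeds.
struct NoopAuditSink;

impl AuditSink for NoopAuditSink {
    fn write(&self, _event: &AuditEvent) -> Result<(), String> {
        Ok(())
    }
}

/// Hypothetical sink that always fails, to exercise the error paths.
struct FailingSink;

impl AuditSink for FailingSink {
    fn write(&self, event: &AuditEvent) -> Result<(), String> {
        Err(format!(
            "cannot persist audit {} for {} ({})",
            event.audit_id,
            event.credential,
            event.operation.as_str()
        ))
    }
}

struct SecretAuditService {
    sink: Box<dyn AuditSink>,
    strict: bool,
}

impl SecretAuditService {
    /// Returns the status label on success; in strict mode a sink failure
    /// propagates so the caller can refuse to release the credential.
    fn record(&self, event: &AuditEvent) -> Result<&'static str, String> {
        match (self.sink.write(event), self.strict) {
            (Ok(()), _) => Ok("written"),
            (Err(cause), true) => Err(cause),
            (Err(_), false) => Ok("dropped_async"),
        }
    }
}

/// Truthy values listed for `NOETL_SECRET_AUDIT_REQUIRED`.
fn strict_from(raw: Option<&str>) -> bool {
    matches!(raw, Some("1" | "true" | "TRUE" | "yes" | "YES"))
}
```

A handler would call `record` before returning the value: in strict mode an `Err` corresponds to the `failed_strict` status and the credential is withheld, while in non-strict mode the same sink failure only costs the audit row.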
Tests. Eight new in services::secret_audit::tests: builder
fills audit_id + occurred_at; Operation + Outcome as_str
round-trip (drift guard); noop sink always succeeds; record_strict
blocks on sink failure (mock sink with fail=true); record_strict
persists on success (mock sink records the event); record
dispatches async when not strict; noop service records without
blocking; from_env respects truthy values. Lib 422 / 0
passing.
Wiki. Server deployment-specification got a new "Secret-
resolution audit service (Phase 7b primitives)" subsection with the
AuditEvent wire shape + the new env + metric
(server-wiki@d50ec40).
Lib-only. No schema migration. Backward compatible — existing
deployments get NoopAuditSink + non-strict mode; no behavior
change until 7b.2 wires the production sink + the DB table.
Next. Phase 7c — token auto-renewal. OAuth2 / JWT access
tokens with expires_in are the dominant short-lived credential
shape. Phase-3c cache currently evicts when expires_at is past;
on the next worker fetch, the resolver re-runs the auth playbook
(slow path). 7c adds a refresh-before-expiry hook so the cache
proactively renews when the remaining lifetime drops below a
threshold — the worker's next fetch hits the cached fresh token
instead of paying the auth round-trip. Plus Phase 7a.2 + 7b.2 still
queued (rotation endpoint + audit table + endpoint).
Agent: Claude · Repos touched: noetl/server (PR), noetl/server.wiki, noetl/ai-meta.wiki
Headline. Starts Phase 7 of the Secrets Wallet umbrella — rotation, audit, and auto-renewal. Phase 6 closed with full residency coverage; Phase 7 hardens the wallet for the operational lifecycle. This round ships the rotation primitives so the actual rotation endpoint + table scans (7a.2) can land cleanly.
What landed (server#121,
merged → noetl-server v2.39.0 (3510170), closed server#120):
- `KeyManager::current_key_version()` trait accessor with a safe default (`"unknown"`). `LocalDevKms` reports its own version string (defaults `"v1"`; test constructor accepts an explicit label).
- `EnvelopeCipher::rewrap_storage_string(raw)` primitive:
  - Parses `raw` as a stored envelope.
  - If `wrapped.key_version == current_key_version()` → returns `RewrapOutcome::Skipped { key_version }` (no KMS call).
  - Else: unwraps DEK under historical KEK version → re-wraps under current → `RewrapOutcome::Rewrapped { old_key_version, new_key_version, new_storage_string }`.
  - Plaintext payload is NEVER reconstructed. Pure DEK re-wrap — AES-GCM ciphertext bytes stay byte-identical; only the `dek` field of the stored envelope changes.
- `noetl_wallet_rotate_total{table, status}` counter per observability.md Principle 1. `table ∈ {credential, keychain}`; `status ∈ {skipped, rewrapped, failed_unwrap, failed_wrap, parse_error}`. `failed_unwrap` is alert-worthy — it means the KMS deleted the historical key version and the rotation can't complete without operator intervention.
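The rewrap steps above can be sketched with a toy envelope. `ToyKms` is a hypothetical stand-in that "wraps" a DEK by XOR-ing it with a per-version byte; it is not real cryptography and not the project's KMS interface. The sketch also returns a struct (`new_envelope`) where the entry describes a `new_storage_string`, to avoid inventing a serialization format.

```rust
/// Toy stored envelope: the payload ciphertext plus the DEK wrapped under a KEK version.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Envelope {
    key_version: String,
    wrapped_dek: Vec<u8>,
    ciphertext: Vec<u8>,
}

#[derive(Debug, PartialEq, Eq)]
enum RewrapOutcome {
    Skipped { key_version: String },
    Rewrapped { old_key_version: String, new_key_version: String, new_envelope: Envelope },
}

/// Hypothetical KMS stand-in. XOR is NOT real key wrapping; illustration only.
struct ToyKms {
    current: String,
}

impl ToyKms {
    fn kek_byte(version: &str) -> u8 {
        version.bytes().fold(0x5a, |acc, b| acc ^ b)
    }

    /// Wrap a DEK under the current KEK version.
    fn wrap_dek(&self, dek: &[u8]) -> Vec<u8> {
        let key = Self::kek_byte(&self.current);
        dek.iter().map(|b| b ^ key).collect()
    }

    /// Unwrap a DEK that was wrapped under a (possibly historical) version.
    fn unwrap_dek(&self, version: &str, wrapped: &[u8]) -> Vec<u8> {
        let key = Self::kek_byte(version);
        wrapped.iter().map(|b| b ^ key).collect()
    }
}

/// Re-wrap only the DEK; the payload ciphertext is copied through untouched
/// and is never decrypted.
fn rewrap(kms: &ToyKms, envelope: &Envelope) -> RewrapOutcome {
    if envelope.key_version == kms.current {
        return RewrapOutcome::Skipped { key_version: envelope.key_version.clone() };
    }
    let dek = kms.unwrap_dek(&envelope.key_version, &envelope.wrapped_dek);
    RewrapOutcome::Rewrapped {
        old_key_version: envelope.key_version.clone(),
        new_key_version: kms.current.clone(),
        new_envelope: Envelope {
            key_version: kms.current.clone(),
            wrapped_dek: kms.wrap_dek(&dek),
            ciphertext: envelope.ciphertext.clone(),
        },
    }
}
```

Running `rewrap` a second time on the returned envelope yields `Skipped`, which is what makes a rotation pass safe to repeat.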
Tests. Four new in crypto::envelope::tests: rewrap skips records
already on current version (no KMS call); rewrap emits new envelope
under current version when older (new storage string carries kv:"v2"
and decrypts to the original plaintext); rewrap rejects non-envelope
storage value (forward-only contract preserved); LocalDevKms reports
its key version (drift guard). Lib 414 / 0 passing.
Wiki. Server deployment-specification got a new "Wallet KEK
rotation primitives (Phase 7a)" subsection + the new metric + the
planned 7a.2 operator workflow
(server-wiki@36f3cfa).
Lib-only. No schema migration. Backward compatible — existing records resolve unchanged; only an explicit rotation pass touches them.
Next. Phase 7a.2 — the rotation endpoint
(POST /api/internal/wallet/rotate-kek) + DB scans over
noetl.credential + noetl.keychain + diagnostic
GET /api/internal/wallet/key-status reporting per-version row
counts. Phase 7b — secret_audit append-only table with one row per
resolution attempt (alongside in this session). Phase 7c — token
auto-renewal (OAuth2/JWT refresh-before-expiry inside the Phase-3c
cache).
Agent: Claude · Repos touched: noetl/server (PR), noetl/server.wiki, noetl/ai-meta.wiki
Headline. Phase 6 of the Secrets Wallet umbrella closes. Phase 6c's
residency gate is fail-closed: a server in us-east-1 that is asked for a
credential whose home is eu-central-1 returns HTTP 403, and the workflow
step fails. That's correct for hard-isolation use cases. The more
common operational shape is "the credential should be resolved IN the
EU and the cleartext should never leave EU memory, but the worker that
needs it happens to run in US." Phase 6e wires this pattern up by
chaining residency-denied resolutions through a broker server in the
credential's home region that re-seals the result to the requesting
worker via the Phase-5 sealing primitives. Cleartext stays in the home
region; only the sealed envelope crosses the wire; only the addressed
worker can open it.
What landed (server#119,
merged → noetl-server v2.38.0 (2803f05), closed server#118):
- `src/secrets/broker.rs` —
  - `BrokerRegistry` — `region → broker_url` map from `NOETL_SECRET_BROKER_REGISTRY` env (JSON object). Empty by default; deployments without a broker keep the pre-6e fail-closed behaviour.
  - `BrokerClient` — forwards a sealed-credential request to a peer. Maps every failure mode to `AppError::CrossRegionUnreachable`.
  - `CrossRegionResolveRequest` wire-shape struct.
- `src/handlers/cross_region.rs` — `POST /api/internal/cross-region/resolve` peer-server endpoint: validates `expected_entry_region == server_region()` (defensive against misconfigured peer registries — a stale registry can't silently coerce a server into serving credentials for the wrong region; returns 403 on mismatch), resolves locally, seals via Phase-5a primitives to the requesting worker's pubkey, returns the `SealedEnvelope`.
- `KeychainDef.no_broker_fallback: bool` — per-credential opt-out for credentials whose policy says "this data physically cannot leave its home region, full stop."
- `AppError::CrossRegionUnreachable { broker_url, cause }` — new variant → HTTP 502. Distinguishes "policy says no" (403 from `ResidencyViolation`) from "policy says yes via broker, but broker is down" (transient).
- `get_sealed` handler — on `AppError::ResidencyViolation`, look up the entry's region in the `BrokerRegistry`; when configured, forward to the broker via `BrokerClient`; return the broker's envelope directly. Otherwise propagate the violation per Phase-6c semantics.
- `noetl_secret_broker_call_total{broker_region, outcome}` counter — outcomes: `ok` / `unreachable` / `denied_by_broker` / `wrong_region` / `bad_pubkey` / `resolve_error` / `serialize_error` / `seal_error`. `wrong_region` is the alert-worthy combination — it means a peer's broker registry is out of date.
- `noetl_secret_broker_call_duration_seconds{broker_region}` histogram — buckets `[0.05, 0.1, 0.25, 0.5, 1, 2, 5]` s, observed regardless of outcome.
- `NOETL_SECRET_BROKER_TIMEOUT_SECS` env (default 10 s).
Tests. Ten new across secrets::broker::tests and
handlers::cross_region::tests: BrokerRegistry default empty;
from_map lookup; from_env parses JSON; empty / invalid JSON treated
as empty; empty-string values filtered; BrokerClient builds; trailing-
slash URL handling; decode_pubkey round-trips X25519 / rejects wrong
length / non-base64; CrossRegionResolveRequest round-trips JSON
(wire-shape drift guard between requesting server and broker).
Existing secrets::residency::tests constructor updated for the new
no_broker_fallback field. Lib 410 / 0 passing.
Phase 6 architectural shape:
Worker A (us-east-1) → Server-US: GET /api/credentials/eu_token/sealed
Server-US: residency check → Deny (entry_region=eu-central-1)
Server-US: BrokerRegistry["eu-central-1"] → https://broker-eu.example.com
Server-US → POST broker-eu /api/internal/cross-region/resolve
Broker-EU: resolve locally, seal to Worker A's pubkey
Broker-EU → Server-US: SealedEnvelope
Server-US → Worker A: SealedEnvelope (cleartext never left EU)
After Phase 6 — both residency shapes operational:
- Hard isolation — `residency: strict` + no broker → fail-closed HTTP 403.
- Soft federation — `residency: strict` + broker registered → transparent cross-region routing via the broker. Cleartext stays in the home region; the asking worker receives only the sealed envelope it can open.
That covers the original umbrella goal G7 in full.
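The two residency shapes come down to one routing decision on the requesting server. The sketch below assumes a `residency: strict` credential throughout; `Route` and `route` are hypothetical names, and `BrokerRegistry` is reduced to a plain map (the real one is described as loaded from a JSON env var). Treating an empty broker URL as "not registered" follows the "empty-string values filtered" test mentioned earlier in this entry.

```rust
use std::collections::HashMap;

/// Hypothetical outcome of the residency check on the requesting server.
#[derive(Debug, PartialEq, Eq)]
enum Route {
    ResolveLocally,
    ForwardToBroker(String),
    Deny403,
}

/// Reduced registry: region name mapped to broker base URL.
struct BrokerRegistry {
    by_region: HashMap<String, String>,
}

impl BrokerRegistry {
    /// Empty URLs are treated as "no broker registered".
    fn broker_for(&self, region: &str) -> Option<&str> {
        self.by_region
            .get(region)
            .map(String::as_str)
            .filter(|url| !url.is_empty())
    }
}

/// Decide how a strict-residency credential request is served.
fn route(
    server_region: &str,
    entry_region: &str,
    no_broker_fallback: bool,
    registry: &BrokerRegistry,
) -> Route {
    if server_region == entry_region {
        return Route::ResolveLocally;
    }
    if no_broker_fallback {
        return Route::Deny403; // per-credential opt-out: hard isolation
    }
    match registry.broker_for(entry_region) {
        Some(url) => Route::ForwardToBroker(url.to_string()),
        None => Route::Deny403, // no broker registered: fail closed
    }
}
```

A broker that is registered but unreachable is a separate, later failure (HTTP 502 in the entry above) and is outside what this decision function models.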
Wiki. Server deployment-specification got a new "Cross-region
broker (Phase 6e)" subsection under Secret providers + the new envs +
endpoint + both metrics
(server-wiki@51c1dfd).
Lib-only. No schema migration. Broker registry is opt-in via env; deployments without a broker keep pre-6e fail-closed behaviour.
Next. Phase 6 closes. The queue is:
- Phase 6 follow-ups (each its own sub-issue under #61): 6d.1 AWS STS `AssumeRoleWithWebIdentity` · 6d.2 GCP `iamcredentials.generateAccessToken` · 6d.3 Azure AAD client-credentials. These plug into the Phase-6d primitives that already shipped.
- Phase 7 — KEK rotation (per-record `kek_version` + online re-encrypt job) + `secret_audit` append-only table + token auto-renewal (OAuth2/JWT refresh-before-expiry inside the Phase-3c cache).
2026-06-06 (Secrets Wallet #61 Phase 6d primitives — dynamic-secret support + cache honors issuer TTL, landed v2.37.0)
Agent: Claude · Repos touched: noetl/server (PR), noetl/server.wiki, noetl/ai-meta.wiki
Headline. Phase 6 round 4. Some providers return secrets the issuer
expires on a clock — AWS STS bearer tokens (15 min – 12 h), AAD access
tokens (1 h), GCP iamcredentials.generateAccessToken (1 h default),
OAuth2 access tokens with expires_in. The Phase-3c keychain cache
used a fixed 600 s TTL; caching a token past expires_at means the
next worker fetch gets a 401 and the playbook step fails.
This round ships the primitives + cache plumbing so the concrete cloud-specific dynamic providers can land as separate clean rounds (6d.1 / 6d.2 / 6d.3).
What landed (server#117,
merged → noetl-server v2.37.0 (eda5cb3), closed server#116):
- `SecretValue.expires_at: Option<DateTime<Utc>>` — issuer-reported expiry. Existing providers (GCP SM / AWS SM / Azure KV / Vault / K8s) pass `None` since they return long-lived secrets.
- `src/secrets/dynamic.rs` — `cache_decision(expires_at, default_ttl, safety_margin, now)` returns `CacheFor(secs)` for normal-case secrets, or `SkipCacheAlreadyExpired` when the deadline is already past or inside the safety margin. Honours `min(default_ttl, expires_at - now - safety_margin)`, floored at `MIN_EFFECTIVE_TTL_SECS = 5`.
- `KEYCHAIN_CACHE_DYNAMIC_SAFETY_MARGIN_SECS` env (default 60 s) — buffer for clock skew + wall-clock between cache write and next worker fetch.
- `resolve_keychain_entry_with_meta` — companion to the existing `resolve_keychain_entry` that also returns the bundle's `expires_at`. For a `map`-shaped keychain entry that bundles several secrets, the bundle TTL is the earliest of any contributing secre
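
The TTL rule stated for `cache_decision` can be worked through in a few lines. This sketch assumes Unix epoch seconds (`u64`) in place of the `DateTime<Utc>` the entry mentions, and it treats a deadline exactly at the safety margin as "inside the margin"; that boundary choice is an assumption.

```rust
/// Floor from the entry above.
const MIN_EFFECTIVE_TTL_SECS: u64 = 5;

#[derive(Debug, PartialEq, Eq)]
enum CacheDecision {
    CacheFor(u64),
    SkipCacheAlreadyExpired,
}

/// `expires_at` and `now` are Unix epoch seconds (an assumption of this sketch).
fn cache_decision(
    expires_at: Option<u64>,
    default_ttl: u64,
    safety_margin: u64,
    now: u64,
) -> CacheDecision {
    let deadline = match expires_at {
        // Long-lived secret: no issuer expiry, use the fixed default TTL.
        None => return CacheDecision::CacheFor(default_ttl),
        Some(deadline) => deadline,
    };
    let remaining = deadline.saturating_sub(now);
    if remaining <= safety_margin {
        // Already expired, or too close to expiry to be worth caching.
        return CacheDecision::SkipCacheAlreadyExpired;
    }
    let usable = remaining - safety_margin;
    CacheDecision::CacheFor(default_ttl.min(usable).max(MIN_EFFECTIVE_TTL_SECS))
}
```

With the 600 s default TTL and a 60 s margin, a token with 300 s left is cached for 240 s, a token with an hour left is cached for the full 600 s, and a token with 30 s left is not cached at all.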