Umbrella Decoupled Context Event Chain

Kadyapam edited this page Jun 22, 2026 · 21 revisions

Umbrella / RFC: Decoupled Context + One-Level Event Chain (#115)

Issue: noetl/ai-meta#115

Status:

  • Phase 1 shipped 2026-06-19: references-in-state consume side (worker selective render-time ref-resolution + server _ref/_store surfacing + refs_in_state default true; server v3.30.0 + worker v5.36.0). Closed #113 + #114 (all 9 stalled fixtures green gate-ON).
  • Phase 2 MERGED 2026-06-19: one-level prev_event_id chain links (noetl/server#244 → server v3.31.0 f5bd4a8 + noetl/noetl#667 → noetl ecd16a2; ai-meta pointer afdb365; post-merge verified on live kind).
  • Phase 3 MERGED: the chain-walk state builder (server-side, flagged NOETL_STATE_BUILD_MODE, default event_scan; server v3.32.0).
  • Phase 4 DRIVE CUTOVER SHIPPED + gate-ON parity-validated 2026-06-19: the drive now builds WorkflowState off-server from the noetl_events WAL (wasm run/from_events) under NOETL_STATE_BUILDER=offserver (default server, prod unchanged), with a durable noetl_state_builder consumer + a staleness guard (expected_head) that ensures the WAL-built state is never staler than the server's view. Versions: worker v5.38.0 + server v3.34.0 + ops b1da9f1 + e2e b38b6dd (parity rig PASS: offserver==server fingerprint, fan-in barrier exactly-once, served +3 / 0 scans / sole-writer / lag-0).
  • Phase 5 SHIPPED 2026-06-20 (flag-gated): atomic-item context contract (server#250 v3.37.0 + worker#121 v5.40.0 + e2e#69, behind NOETL_ATOMIC_ITEM_CONTEXT, default off).
  • Not yet started: the Phase-4 remainder (remove the residual server-side chain-walk bookkeeping → fully zero server reads on the drive path) and Phase 6.

Original RFC proposed 2026-06-19.

Primary repo: noetl/server (orchestrate-core + drive). Cross-repo: noetl/worker, noetl/tools, noetl/ops.

Reframes: noetl/ai-meta#101 · Subsumes storage tier of: noetl/ai-meta#104 · State-construction design for: noetl/ai-meta#107 steps 2–4 · Unblocks: noetl/ai-meta#111 · Builds on write path of: noetl/ai-meta#103 · #114 stays as a safety cap.

Sibling #117 SHIPPED (2026-06-20) — off-server spine ordering. Phase 4's worker-side off-server from_events builder sorted the spine by event_id, assuming id order == causal (prev_event_id) chain order. Under high-concurrency fan-out two branch completions arrive at the owner reordered vs their producer ids, so emit_events stamps a higher-id event as the predecessor of a lower-id one. The worker tracked the head as max(event_id) but ChainHeads.link_batch advances the watermark to event_ids.last() (the real causal tip), so under the inversion a max-id walk MISSED the inverted tip → from_events never saw that branch's command.completed → the fan-in reduce never fired (wedge ~1/9 on the 2-replica affinity topology). Fixed in worker#122 (v5.40.1, baeae78): build_offserver_input builds from expected_head via build_spine_to/advance_to/chain_walk_from and orders by the prev_event_id walk (head→root, reversed) — SpineOrder::Causal default (NOETL_OFFSERVER_SPINE_ORDER=event_id reverts); staleness guard intrinsic. Byte-identical to the old sort for monotonic chains (single-replica + low-concurrency unchanged). 2-replica affinity gate-ON stress 6/6 iterations / 108 execs COMPLETE; 15 execs with a real prev_event_id > event_id inversion all fired reduce_customer + completed. e2e adds NOETL_COHERENCE_FANOUT_BURST (#72, cdf1768). A residual single-replica terminal-finalize chain-linking race (playbook.completed stamped NULL prev_event_id, non-wedging) is staged as the next write-ordering follow-up.
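The inversion described above can be reproduced in a few lines. This is a minimal sketch, not the worker's build_offserver_input: Ev, order_by_event_id and order_causal are hypothetical names, and only the two fields the ordering argument needs are modelled.

```rust
use std::collections::HashMap;

/// Hypothetical minimal event: only the fields the ordering argument needs.
#[derive(Clone, Debug, PartialEq)]
pub struct Ev {
    pub event_id: u64,
    pub prev_event_id: Option<u64>,
}

/// Order by id. Correct only when id order equals causal order.
pub fn order_by_event_id(events: &[Ev]) -> Vec<u64> {
    let mut ids: Vec<u64> = events.iter().map(|e| e.event_id).collect();
    ids.sort_unstable();
    ids
}

/// Order by walking prev_event_id from the expected head back to the root,
/// then reversing. Returns None if a link is missing or the input cycles.
pub fn order_causal(events: &[Ev], expected_head: u64) -> Option<Vec<u64>> {
    let by_id: HashMap<u64, &Ev> = events.iter().map(|e| (e.event_id, e)).collect();
    let mut spine = Vec::with_capacity(events.len());
    let mut cursor = Some(expected_head);
    while let Some(id) = cursor {
        // A valid chain never has more nodes than the input holds.
        if spine.len() >= events.len() {
            return None;
        }
        let ev = by_id.get(&id)?;
        spine.push(id);
        cursor = ev.prev_event_id;
    }
    spine.reverse();
    Some(spine)
}
```

With the chain 1 ← 2 ← 5 ← 4 (event 4 was stamped with the higher-id event 5 as its predecessor), sorting by id places 4 before 5, and a walk that starts from max(event_id) = 5 never reaches 4. Walking from the real tip, 4, yields 1, 2, 5, 4.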

This is a design deliverable. No feature code ships from this page. The phased plan at the bottom is the implementation contract; each phase is its own PR set with its own kind validation.


0. One-paragraph statement

Today an orchestrator step receives the entire accumulated execution context — every prior step's full result, every loop variable, the whole workload — and the drive rebuilds that context by scanning and replaying the whole noetl.event log. Both grow with the execution: O(n) context per step, O(n²) work over a run, a 17.4 MB drive-state and a 1.32 MB command observed on real fixtures. This RFC removes the growth at the source. Result data is decoupled from context: the noetl.* schema carries references only, never bulk payloads. State is reconstructed not by scanning the event table but by walking a one-level event chain (each event points to its predecessor), pointer by pointer, resolving only the references a step actually needs. The walk-and-build runs off the server on a system worker pool that caches the assembled state. Each worker that runs a tool receives only its atomic working-item context — not the playbook, not the other steps. The invariant that ties it together: nothing in the state/drive read path ever scans noetl.event.
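A back-of-envelope model makes the growth claim concrete. It assumes (as a simplification, not a measurement) a workload of size $w$ and $n$ steps that each add a result of the same size $s$; $c_k$ is the context handed to step $k$.

```latex
\begin{aligned}
c_k &= w + (k-1)\,s \\
\sum_{k=1}^{n} c_k &= n\,w + s\,\frac{n(n-1)}{2} \in O(n^2)
\end{aligned}
```

The per-step term is linear in $k$ and the total over a run is quadratic in $n$, which is the O(n) / O(n²) pair named above. Under the reference-only model a step that binds $m$ upstream results carries $m$ bounded reference blocks plus whatever bulk it explicitly resolves, independent of $n$.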


1. Problem statement — unbounded context growth

1.1 The growth, in code

The accumulated context is assembled and then cloned forward at every arc:

  • repos/server/orchestrate-core/src/state.rs: WorkflowState::build_context (~L953–1060) folds, in order: the workload (top-level + workload namespace), every completed step's full result (both steps.<name> envelope and the flattened user-data), then the durable ctx loop-variable overlay. The result is a HashMap whose size is the sum of all step outputs so far.
  • repos/server/orchestrate-core/src/state.rs: WorkflowState (~L263–315) holds steps: HashMap<String, StepInfo>, and each StepInfo.result (~L107–208) is the full result envelope. Nothing caps it.
  • repos/server/orchestrate-core/src/orchestrator.rs — at ~L1339 let mut step_context = context.clone(); clones the whole accumulated context per transition arc, renders the arc's set: against it, and hands it on.
  • repos/server/orchestrate-core/src/commands.rs: build_command (~L88–142) takes context: &HashMap<…> (the entire map), so every emitted command's render_context carries all upstream results.

1.2 The replay, in code

The drive reconstructs WorkflowState from the event table:

  • repos/server/src/handlers/events.rs: rebuild_state (~L1829) runs SELECT … FROM noetl.event WHERE execution_id = $1 ORDER BY event_id ASC (ORCH_EVENT_COLS, ~L1705), a full scan of the execution's events on the no-snapshot path; even with a projection_snapshot it scans the delta (event_id > $2 OR created_at > $3).
  • repos/server/orchestrate-core/src/state.rs: from_events (~L367–417) requires all events and applies them in a single forward pass: for event in events { state.apply_event(event); }. There is no incremental entry point in the core itself; #101's OrchStateCache works around this by caching the built state and re-applying only newer events.

1.3 The evidence

| Symptom | Measure | Source |
| --- | --- | --- |
| __orchestrate__ drive-state bloat | 17.4 MB (test_storage_tiers) | #114 / Umbrella-Orchestrator-Scaling |
| Next-command context over NATS max_payload | 1,324,800 B (test_output_select; ctx[start] 1,059,444 + ctx[steps] 265,134) | #114 root cause |
| command.issued events seen during CQRS 2d | 5.4 MB (cursor fan-out rendered context) | server#206 note |
| Runaway replay pre-#101 | 20,844 events, work-queue stuck at pending=5 | #101 |

1.4 Why the existing fixes don't reach it

  • #101 made the rebuild incremental (cache + apply-delta) and resolved over-budget refs back into the state (hydrate_result_references, src/handlers/events.rs:1582). It bounded the replay cost but kept data in the state — the context a step receives is still the sum of upstream results.
  • #114 offloads a command.issued context over NOETL_COMMAND_CONTEXT_MAX_BYTES (512 KB) to noetl.result_store with a {__context_ref__} marker (src/handlers/execute.rs maybe_offload_command_context). It caps the event payload but the worker still receives and renders the full context after resolution — the growth is intact, just relocated.
  • refs_in_state=true (src/config/app.rs:53) keeps {reference, extracted} on events instead of resolving inline — the publish side of the answer. But the consume side is unimplemented: the worker's resolve_context_references (repos/worker/src/executor/command.rs ~L1411) resolves refs back to full data at render time unconditionally, so turning the flag on bloats the worker render instead, and bulk-consuming fixtures (storage_tiers, output_select) break. That gap is exactly Phase 1 below.

Conclusion: the cause is architectural — data travels inside context, and state is built by scanning the log. Bounding payloads is necessary plumbing but cannot stop the growth. This RFC changes the data model and the read path.


2. Design tenets (the spec)

  1. Root cause = unbounded context growth. Each step passes the entire accumulated context to the next. Capping oversized payloads (#114) is a safety net, not the fix.
  2. Decouple result DATA from CONTEXT. No result/payload data lives in the noetl.* schema — the schema holds references (pointers) only. Data lives in the result/object store, addressed by reference.
  3. No scan of noetl.event at ANY time in the state/drive read path. True write/read segregation — beyond #103's sole-writer, this removes the event-table reads from the drive entirely.
  4. One-level event chaining. A command refers to its immediately-previous event; each event refers to its previous event by reference. State is followed pointer-by-pointer, one level at a time — not a full-table scan, not a full replay.
  5. Worker-side state construction + cache. A dedicated system worker-pool playbook walks the reference chain, builds the context, and caches it. State assembly leaves the server.
  6. Atomic-working-item context only. A worker running a tool/step receives only the minimal, self-contained item context it needs — not the whole playbook with its steps, loops, and tool list. Large data passed by reference.

3. The reference / pointer data model (tenet 2)

3.1 Where data lives, where pointers live

| Lives in | Holds |
| --- | --- |
| Object store (noetl.result_store today → Feather/Parquet/JSON tier per #104) | the payload bytes: tabular as Arrow IPC (noetl_tools::arrow_codec::try_encode_tabular_json, media application/vnd.apache.arrow.stream), non-tabular as JSON/Parquet |
| noetl.event row | event identity + chain link + a reference to its result + a bounded extracted predicate block. No bulk result/context JSON. |
| noetl.command row | command identity + chain link + a reference to its render-context + the tool kind + the bounded item slice. No full render_context. |
| System-pool cache | the assembled WorkflowState / item context, keyed by chain position (§5) |

This is the strict form of the #101 follow-up ("event.result and command.context carry references only, never inline data") and the #104 "results addressed by derivable URN" model, applied to all noetl.* rows.

3.2 How a result is addressed

Two URN forms already exist and are kept:

  • Physical store ref: noetl://execution/<eid>/result/<name>/<id> (repos/server/src/services/result_store.rs:202). Minted on write; resolved by ResultStoreService::resolve (parse → queries::get_by_ref).
  • Logical locator URN (derivable): noetl://<tenant>/<project>/results/<eid>/<step>/<frame>/<row>/<attempt> (worker stamp_logical_uri, repos/worker/src/executor/command.rs ~L1320; noetl_tools::locator::ResultCoordinates). Because it is derivable from the execution coordinates, a consumer can address a result without the producer carrying the string (the #104 derivability property). attempt is fixed at 1 so retries overwrite the same key (idempotent-overwrite).
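The derivability property can be illustrated with a short sketch. Coords and logical_urn are hypothetical stand-ins: the real coordinate type is noetl_tools::locator::ResultCoordinates, whose fields are not shown on this page, and the numeric execution id is an assumption made for the example.

```rust
/// Hypothetical coordinate type; not the real ResultCoordinates.
pub struct Coords<'a> {
    pub tenant: &'a str,
    pub project: &'a str,
    pub execution_id: u64, // assumed numeric for this sketch
    pub step: &'a str,
    pub frame: i64,
    pub row: i64,
}

/// Attempt is pinned to 1 so a retry writes to the same key.
const ATTEMPT: u32 = 1;

/// Build the logical locator URN purely from coordinates, in the
/// <tenant>/<project>/results/<eid>/<step>/<frame>/<row>/<attempt> layout.
pub fn logical_urn(c: &Coords<'_>) -> String {
    format!(
        "noetl://{}/{}/results/{}/{}/{}/{}/{}",
        c.tenant, c.project, c.execution_id, c.step, c.frame, c.row, ATTEMPT
    )
}
```

Because the function takes only coordinates, a producer and a consumer that agree on them compute the same string independently, and a retried write lands on the key of the first attempt.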

A reference in a row is the small triple { ref: <urn>, extracted: <bounded predicate block>, ipc?: <hint> }. The extracted block (build_extracted/summarise_value, worker ~L1446, bounded to MAX_EXTRACTED_BYTES = 4096) is navigable — it preserves object keys and the first array element so when: / set: / cursor-fan-out predicates like {{ output.data.rows[0].facility_mapping_id }} evaluate without resolving the payload. This is the mechanism that lets the drive make routing decisions against references, never bulk.
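A rough sketch of the "navigable summary" idea follows. It is not the worker's build_extracted/summarise_value: Value, summarise and Reference are hypothetical, the sketch caps string length rather than enforcing the MAX_EXTRACTED_BYTES budget, and it omits the _truncated marker and the optional ipc hint.

```rust
use std::collections::BTreeMap;

/// Hypothetical JSON-like value; stands in for a real JSON type.
#[derive(Clone, Debug, PartialEq)]
pub enum Value {
    Null,
    Num(f64),
    Str(String),
    Arr(Vec<Value>),
    Obj(BTreeMap<String, Value>),
}

/// Keep every object key but only the first array element, and cut long
/// strings, so a predicate such as rows[0].id still evaluates on the summary.
pub fn summarise(v: &Value, max_str: usize) -> Value {
    match v {
        Value::Arr(items) => {
            Value::Arr(items.iter().take(1).map(|i| summarise(i, max_str)).collect())
        }
        Value::Obj(m) => Value::Obj(
            m.iter()
                .map(|(k, x)| (k.clone(), summarise(x, max_str)))
                .collect(),
        ),
        Value::Str(s) => Value::Str(s.chars().take(max_str).collect()),
        other => other.clone(),
    }
}

/// The reference carried in a row: the URN plus its bounded summary.
pub struct Reference {
    pub urn: String,
    pub extracted: Value,
}
```

A three-row result summarised this way keeps one row with all of its keys, which is enough for a rows[0] predicate and stays small however many rows the payload holds.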

3.3 What changes vs today

  • The over-budget path (worker stages to the store + emits {data:{_ref}} + reference, ~L1135) becomes the only path for any non-predicate data, not just over-budget data. Small scalars used directly in predicates can stay inline (they are the extracted block); bulk never does.
  • hydrate_result_references (server) stops splicing payloads back into events on the read path. Its keep_refs=true branch (which already keeps the reference + surfaces extracted as context.data) becomes the default and only behavior.
  • The worker's resolve_context_references becomes lazy + selective (Phase 1): resolve a noetl:// ref only when the tool's actual input binding consumes the bulk (not for predicate-only access), and resolve only the refs that step needs — closing the consume-side gap.

4. The one-level event chain (tenet 4)

4.1 Shape

Each event is a node carrying a pointer to its immediately-previous event; each command carries a pointer to the event that authorized it. The chain is an append-only, immutable singly-linked list along the control spine, with frame-indexed side-links for fan-out (§4.4).

genesis ◄── e1 ◄── e2 ◄── e3 ◄── … ◄── e_head
                                          ▲
                                  command (prev = e_head)

Concretely, each noetl.event gains a prev_event_id (and, in the WAL/object tier, a derivable prev-URN) naming the predecessor in the execution's causal order. Each noetl.command gains prev_event_id = the event whose application issued it (the step.enter / unblocking completion). These are additive columns; populating them is Phase 2.

4.2 What each event node carries

  • event_id, execution_id, event_type, node_name, status
  • prev_event_id — the chain link (tenet 4)
  • result_ref — a reference (§3.2), never bulk
  • extracted — the bounded predicate block (§3.2)
  • meta — small routing (pool segment, W3C trace) — already kept tiny per #103

A node is self-describing: from it you can read the step's status and predicate fields, follow prev_event_id one level back, and resolve result_ref on demand. Nothing about a node requires reading any other node by query.

4.3 Walking the chain (no scan)

To build the context for the next atomic item, the builder starts at the chain head named by the command's prev_event_id and walks backward one level at a time, collecting node summaries until it has covered the upstream steps the target item binds (§6). It resolves a result_ref only when the binding consumes the bulk. The walk is bounded by the chain depth it must cover — typically the few upstream steps an item references, not the whole execution.

Crucially the walk follows pointers in the WAL / object tier, not SELECT … WHERE execution_id. The chain is the index. This is what makes tenet 3 (no noetl.event scan) achievable: the head pointer comes in on the command; every subsequent node is reached by following a link, not by querying a table keyed on execution_id.
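The bounded backward walk can be sketched as below, assuming a store that returns exactly one node per lookup. Node, NodeStore, fetch, walk_for_bindings and the max_depth guard are all hypothetical; the real builder's types and entry points are not shown on this page.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical chain node with the fields listed in 4.2 (meta omitted).
#[derive(Clone, Debug)]
pub struct Node {
    pub event_id: u64,
    pub prev_event_id: Option<u64>,
    pub node_name: String,
    pub result_ref: Option<String>,
}

/// Stand-in for the pointer-addressed WAL / object tier: one node per lookup.
pub trait NodeStore {
    fn fetch(&self, event_id: u64) -> Option<Node>;
}

impl NodeStore for HashMap<u64, Node> {
    fn fetch(&self, event_id: u64) -> Option<Node> {
        self.get(&event_id).cloned()
    }
}

/// Walk back from `head` one link at a time and stop as soon as every step
/// in `needed` has a result_ref, or after `max_depth` lookups. Returns the
/// (step name -> result_ref) map and the number of nodes fetched.
pub fn walk_for_bindings<S: NodeStore>(
    store: &S,
    head: u64,
    needed: &BTreeSet<String>,
    max_depth: usize,
) -> (HashMap<String, String>, usize) {
    let mut found: HashMap<String, String> = HashMap::new();
    let mut fetched = 0usize;
    let mut cursor = Some(head);
    while let Some(id) = cursor {
        if found.len() == needed.len() || fetched >= max_depth {
            break;
        }
        let node = match store.fetch(id) {
            Some(n) => n,
            None => break, // broken link: the caller decides how to recover
        };
        fetched += 1;
        if needed.contains(&node.node_name) {
            if let Some(r) = &node.result_ref {
                // The node nearest the head wins; older ones never overwrite it.
                found.entry(node.node_name.clone()).or_insert_with(|| r.clone());
            }
        }
        cursor = node.prev_event_id;
    }
    (found, fetched)
}
```

On a five-node chain where the item binds only the step just before the head, the walk touches two nodes and stops; no lookup is keyed on execution_id.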

4.4 Fan-out / loops / joins

Linear chains don't directly express N-way cursor fan-out. The design keeps the control spine linear and adds frame-indexed branch links:

  • The cursor claim event is a spine node; each body command is stamped with its {phase:"body", frame, row} coordinate (already done — worker stamp_logical_uri reads __cursor_frame/__cursor_row; core CursorFrame at state.rs:210 tracks claim_completed, body_issued, body_completed, claim_ref).
  • A body event's prev_event_id points to its claim event (its fan-out origin), not to a sibling body. Bodies are independent leaves off the claim node — they never need to see each other.
  • A join/aggregation step references the claim node and resolves the per-frame body results by their derivable logical URNs (…/<step>/<frame>/<row>/1) — a bounded set named by coordinate, not a scan. StepInfo.cursor_frames (BTreeMap<i64, CursorFrame>) already holds the per-frame progress to enumerate them.

This preserves the existing cursor semantics while keeping every individual node reachable by pointer.


5. Worker-side state construction + cache (tenet 5)

5.1 The system-pool state builder

Today rebuild_state + evaluate_state run in the server (or, post-#108, the drive runs off-server but the server still rebuilds WorkflowState from events and resolves cursor refs to bound the input — the gap #111 names). This RFC moves state construction off-server too:

  • A new system/state_builder playbook (WASM, on the system worker pool, behind the #105 capability ring) takes a command's chain head + the target step's declared input bindings, walks the chain (§4.3), resolves the needed references, and assembles the WorkflowState / atomic-item context.
  • It runs as an ordinary system playbook — dispatched off NATS, releasing its slot when done — honoring execution-model.md (workers = atomic compute blocks) and data-access-boundary.md (it reaches noetl.* only via server-API/internal endpoints + the object store, never direct DB).
  • New capability-ring host functions it needs: event_get_by_ref(urn) (one chain node), result_resolve(urn) (object-store payload), state_cache_get/put. These extend the deny-by-default syscall surface in #105.

The orchestrate-core (from_events/evaluate_state) is already referentially-transparent and wasm-compiled (repos/server/plugins/orchestrate, run/run_state ABI) — the builder reuses the same core, fed by a chain walk instead of an event vector.

5.2 The cache and its consistency model

  • Key: (execution_id, chain-head event_id) — i.e. the immutable chain prefix the state summarizes. Because the chain is append-only and immutable, a cached node/state for a given head is valid forever — there is no stale read. Advancing = extend from the cached head by walking only the new tail (the mechanical analogue of #101's last_event_id incremental advance, but the watermark is a chain pointer, not a Postgres COUNT(*)).
  • No consistency COUNT(*). #101's OrchStateCache runs an O(events) COUNT(*) to detect a late straggler before trusting the cache (ExecOrchState.applied_count, last_count_check). The chain removes the need: continuity is verified by pointer linkage (does the new tail's prev_event_id reach the cached head?), not by counting rows.
  • Invalidation: none in the staleness sense — only eviction for memory (terminal execution, or LRU on the pool). A cold miss rebuilds deterministically by re-walking from any durable chain head (§7.3).
  • Composition with the sole-writer gate (#103): the materializer stays the sole writer of noetl.event (audit/query materialization). The builder's cache is built from the WAL chain, not from the materialized table — so there is no read-your-writes lag against Postgres and the #103 freshness gate (2c) is not needed on this path. The materializer and the builder read the same WAL; they don't serialize against each other.
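The pointer-continuity check that replaces the COUNT(*) can be sketched as follows. Link, Cached, Advance and advance are hypothetical names; the cached state is reduced to a counter of applied events so the example stays self-contained.

```rust
use std::collections::HashMap;

/// Hypothetical link record: an event id plus its predecessor.
#[derive(Clone, Copy, Debug)]
pub struct Link {
    pub event_id: u64,
    pub prev_event_id: Option<u64>,
}

/// Hypothetical cached state; `applied` stands in for the assembled state.
#[derive(Clone, Debug, PartialEq)]
pub struct Cached {
    pub head: u64,
    pub applied: usize,
}

#[derive(Debug, PartialEq)]
pub enum Advance {
    /// The new head reaches the cached head; only the tail was applied.
    Extended(Cached),
    /// Linkage never reached the cached head: fall back to a cold rebuild.
    Discontinuous,
}

/// Extend `cached` to `new_head` by walking prev links back until the cached
/// head is met. Continuity is proven by linkage, not by counting rows.
pub fn advance(cached: &Cached, new_head: u64, links: &HashMap<u64, Link>) -> Advance {
    let mut tail = Vec::new();
    let mut cursor = Some(new_head);
    while let Some(id) = cursor {
        if id == cached.head {
            return Advance::Extended(Cached {
                head: new_head,
                applied: cached.applied + tail.len(),
            });
        }
        match links.get(&id) {
            // The length guard stops a malformed (cyclic) input from looping.
            Some(link) if tail.len() < links.len() => {
                tail.push(link.event_id);
                cursor = link.prev_event_id;
            }
            _ => return Advance::Discontinuous,
        }
    }
    Advance::Discontinuous
}
```

The cost of an advance is proportional to the new tail, not to the execution's event count, and a missing link is reported as a discontinuity rather than silently producing a state with a hole.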

5.3 Why this is the right home

State assembly is compute over an immutable log → a pure function → ideal for the ephemeral, horizontally-scaled system pool. Moving it off the server is the remaining half of program step 2 (#107): #108 moved the drive off-server; this moves the state construction the drive depends on off-server, which is the prerequisite for steps 3–4 (Postgres demotion) — the server no longer rebuilds state, so it no longer needs the event table for the hot path.


6. Atomic-working-item context contract (tenet 6)

6.1 What a tool worker receives

Today a command's render_context carries the whole accumulated context and the worker rebuilds ctx/workload shims over all of it (repos/worker/src/executor/command.rs ~L311–342). Under this contract a tool-running worker receives only:

  • the tool kind + the step's render template (the step definition, not the playbook),
  • the resolved minimal inputs for this item — by reference where large,
  • the bounded extracted predicate block for any reference it routes on,
  • its fan-out coordinate (__cursor_frame/__cursor_row) when applicable.

It does not receive the other steps' results, the loop definition, the full workload, or the tool list. It is a true atomic compute block: run tool T on input I (ref R); write result to URN U; emit a boundary event.

6.2 How loops/fan-out work when the worker can't see the playbook

The state builder owns the topology, the worker owns one item:

  • The builder computes each body item's slice from the claim rows and issues a body command carrying that item's coordinate + the row reference for just that slice. The worker resolves its one slice, runs the tool, writes its result to the derivable URN, emits its boundary event. It never enumerates the loop.
  • Aggregation/joins are a separate step the builder schedules once the frame's bodies are complete (it tracks cursor_frames), resolving the per-frame results by coordinate URN.

6.3 Dependency: explicit input binding

For the builder to hand a step exactly the upstream results it needs (and to bound the chain walk), each step must declare what it consumes. This is the existing Umbrella: Explicit Input Binding work — it becomes a prerequisite for Phase 5. Where a step uses an undeclared implicit {{ steps.X.y }}, the builder must conservatively walk further (or the linter rejects it). Tightening implicit access is called out as a risk (§8).

✅ Resolved 2026-06-20. #77 (Explicit Input Binding) is CLOSED: it shipped the declaration surface (input:/args: + set:input:). Phase 5 added the missing piece: orchestrate-core::input_binding::analyze(step) statically extracts the base-context keys a step references and reports whether that set is fully bounded; project_context narrows to exactly those keys. The "conservatively walk further" path is realized as bounded = false → keep the full context (no linter-reject required). So implicit {{ steps.X.y }} access degrades to full-context for that step rather than breaking; the migration cost named in §8.2 is opt-in, not a hard gate. Shipped server#250 (v3.37.0) + worker#121 (v5.40.0) + e2e#69, behind NOETL_ATOMIC_ITEM_CONTEXT (default off).
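The bail-to-full behavior can be sketched as below. The function name project_context comes from the text, but the signature, the Analysis struct and the string-valued context are assumptions made for illustration; the real analysis runs over the step's templates.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical result of the static analysis described above.
pub struct Analysis {
    /// Top-level context keys the step's templates reference.
    pub keys: BTreeSet<String>,
    /// False when the analysis could not prove the key set is complete.
    pub bounded: bool,
}

/// Narrow `ctx` to the referenced keys when the set is bounded; otherwise
/// keep the full context, so implicit access degrades instead of breaking.
pub fn project_context(
    ctx: &HashMap<String, String>,
    analysis: &Analysis,
) -> HashMap<String, String> {
    if !analysis.bounded {
        return ctx.clone();
    }
    ctx.iter()
        .filter(|(k, _)| analysis.keys.contains(*k))
        .map(|(k, v)| (k.clone(), v.clone()))
        .collect()
}
```

The conservative branch is what makes the flag safe to enable per step: a step the analysis cannot bound simply receives what it receives today.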

7. The "never scan noetl.event" invariant + correctness

7.1 How reads are served without a scan

| Read need | Served by |
| --- | --- |
| Hot-path state for the next drive step | system-pool state cache (§5.2) |
| Cache miss / cold rebuild | chain walk from a durable head (§4.3) over WAL + object store |
| Result payload | object store by URN (§3.2) |
| Predicate eval (when:/set:/fan-out) | the inline extracted block, no resolve |
| Audit / external query / replay-for-humans | the materialized noetl.event table (written by the materializer, #103), read by operators/tools, never by the drive |

noetl.event keeps existing as the audit/query projection; the invariant is scoped: the state/drive read path issues zero SELECTs against it. The concrete deletions are rebuild_state's ORCH_EVENT_COLS query and the projection_snapshot read-model dependency on the hot path (Phase 6).

7.2 Ordering & idempotency

  • Ordering: the chain's prev_event_id links define a total order along the spine and a deterministic partial order for fan-out (claim → bodies → join). Per-node sequence is intrinsic to the link, so ordering needs no global sort.
  • Idempotency: commands remain at-least-once; apply is keyed by chain position / event_id. Result writes are idempotent-overwrite via the attempt-fixed logical URN (§3.2), so a re-run lands on the same key. The builder's cache put is keyed by the immutable head, so concurrent builders converge.

7.3 Crash recovery

Rebuild the state cache by re-walking from the last durable chain head — the WAL ack offset (#104) is the resume point. No truth lives in the worker; a crashed builder or tool worker loses only in-flight compute, re-derivable from the chain + object store. This is the #104 "resume from last acked offset, replay the local-only tail" model applied to state construction.

7.4 Cache-vs-chain consistency & replay

  • The cache is a pure function of an immutable chain prefix → a cached entry can never disagree with the chain; it can only be behind (missing tail), which the pointer-continuity check detects and the tail-walk repairs.
  • Replay walks the chain from genesis → deterministic state at any point — same guarantee as today's from_events, but by link-following instead of table scan. Human/audit replay can still use the materialized table.

8. Open questions & risks (red-team)

  1. DAG vs linked-list for complex joins. The frame-indexed branch model (§4.4) covers cursor fan-out. A step that joins multiple distinct upstream steps (not loop frames) needs either multiple prev links or resolution-by-name against the declared bindings. Decide: a prev_event_ids set on join nodes vs. binding-driven multi-walk. (Leaning binding-driven: §6.3 already names the upstream steps.)
  2. Implicit context access. Steps that read {{ steps.X.y }} without an explicit binding break the "hand exactly what's needed" contract and force a conservative full walk. Requires the explicit-input-binding lint to graduate from advisory to enforced before Phase 5 — a migration cost across existing playbooks.
  3. Does #103's read side get superseded? Yes — 2c (orchestrator reads projection_snapshot) and the freshness gate are not needed once the drive reads the chain. The write side (publish-only + materializer sole writer) stays. The already-shipped sole-writer gate (NOETL_EVENT_INGEST_PUBLISH_ONLY, flip-ready, kind-proven) is unaffected — this RFC layers a new read path; it does not touch the gate's write semantics or default. We do not change the kind gate state.
  4. Does #104 get superseded or subsumed? Subsumed as the storage substrate (URN derivation, NATS-as-WAL, Feather tier, side-effect barrier). #104's open questions (attempt semantics in the URN, local tail bound, non-tabular format, side-effecting-tool attribute) become this umbrella's data-tier questions.
  5. Cache memory on the system pool. Even bounded extracted blocks (4 KB × steps) accumulate; cursor fan-out with many frames needs an eviction policy (LRU + terminal eviction). The win is huge (4 KB/step vs the 17.4 MB observed) but not zero — size it.
  6. Walk depth on deep chains. A step binding a very early upstream result forces a long walk. Mitigations: the cache short-circuits most walks; the builder can persist periodic chain checkpoints (a chain node summarizing state-so-far by reference) so a walk stops at the nearest checkpoint — the chain-native analogue of projection_snapshot, but reference-only and built by the system pool, not the server.
  7. Two URN namespaces. Physical …/result/<name>/<id> and logical …/results/<eid>/<step>/<frame>/<row>/<attempt> coexist today. The reference-only model should converge on the logical (derivable) one as the addressing key; sequencing that migration is a sub-question.
  8. Predicate completeness. If a routing predicate needs a field the bounded extracted block truncated (_truncated: true), the drive can't decide without a resolve. Need either a guaranteed-fields declaration per step or a fallback single-resolve path — quantify how often real playbooks hit it.

9. Phased implementation plan

Each phase is independently shippable, reversible, and kind-validated. Phase 1 is the immediate unblock (it is #101's consume side) and can ship before the rest is ratified.

| Phase | What ships | Repos | Unblocks / depends |
| --- | --- | --- | --- |
| 1: schema-holds-refs-only. SHIPPED 2026-06-19 | Events/commands carry {reference, extracted} only (no inline bulk); worker resolves refs lazily + selectively at render time (closed the resolve_context_references consume-side gap); hydrate_result_references surfaces _ref/_store/_uri on kept refs (stops splicing payloads); refs_in_state default flipped to true. Landed: worker#117 (v5.36.0) + server#243 (v3.30.0). | worker, server | Unblocked the 3 stuck #114 fixtures (output_select, storage_tiers, lease_expiry) + closed #113 + #114 (all 9 stalls green). Off-server-drive prod cutover (#107/#111) unblocked on the size axis. This is #101's consume side. |
| 2: prev-event chain links. MERGED 2026-06-19 | prev_event_id added to noetl.event + noetl.command, populated on emit. The chokepoint emit_events stamps each event's link from a per-execution chain-head watermark (ChainHeads): one path covers drive events + command.issued + worker-lifecycle (via handle_event), on both gate-off INSERT and gate-on publish; the materializer persists it. Command link = the real step.enter/unblocking completion (the head before the command.issued batch), so fan-out bodies share their branch origin (§4.4). Self-ensuring + non-fatal event_chain::ensure_columns. Additive; no reader yet. Implemented server-side only (no orchestrate-core change needed; the chokepoint subsumes the "core carries prev" design, since off-server-driven results flow through the same emit path). noetl/server#244 + noetl/noetl#667. | server (+ noetl DDL) | Foundation for §4; no behavior change. Merged; ai-meta pointer bumped (afdb365). |
| 3: chain-walk state builder (server-side, flagged) | New chain-walk builder behind a flag replaces rebuild_state's noetl.event scan; event-scan kept as the =false fallback. Validate the no-event-scan invariant on kind (assert 0 SELECT … FROM noetl.event on the drive path). | server, orchestrate-core | Proves tenet 3/4 in-process before moving off-server. |
| 4: off-server state builder + cache | system/state_builder WASM playbook on the system pool; capability-ring host fns (event_get_by_ref, result_resolve, state_cache_*); pool-side cache keyed by chain head; cold-rebuild path. | server (host fns + dispatch), worker (cache + pool), tools (playbook), ops (system-pool wiring) | Realizes tenet 5; completes the off-server half of #107 step 2; prerequisite for step 4 (Postgres demotion). |
| 5: atomic-item context contract. SHIPPED 2026-06-20 (flag-gated) | The drive hands each worker only the minimal slice of base-context keys the step statically references, not the whole accumulated context. orchestrate-core::input_binding (analyze/project_context over minijinja undeclared_variables, conservative bail-to-full) + CommandBuilder::with_atomic_item_context; plugin input structs carry the flag; NOETL_ATOMIC_ITEM_CONTEXT (default false). Built on Explicit Input Binding (#77, the declaration surface). Landed: server#250 (v3.37.0) + worker#121 (v5.40.0) + e2e#69. | orchestrate-core, worker, server | Realizes tenet 6; builds on Explicit Input Binding. |
| 6: retire the event read path | Drop projection_snapshot from the hot path + the #103 read-side (2c) freshness gate; noetl.event becomes audit-only. End-to-end never-scan validation. | server, ops | Closes the invariant; folds into #107 step 4 (Postgres demotion). |

9.1 Phase 1 in detail (the part to start)

Goal: the worker resolves noetl:// references only for the inputs a step actually consumes as bulk, and only those refs — so over-budget upstream results stay as references in state and a step's render_context never carries foreign bulk. This is precisely the "consume side" that the #114 kind experiment proved missing (turning refs_in_state=true fixed the state bloat but broke storage_tiers/output_select because the worker render couldn't resolve refs selectively).

Touch points:

  • repos/worker/src/executor/command.rs — make resolve_context_references (~L1411) selective: resolve a step's ref only when the rendered template / declared input binds its bulk (not for extracted-only predicate access), and resolve only the named refs — not every steps.* entry.
  • repos/server/src/handlers/events.rs: in hydrate_result_references (~L1582), default to the keep_refs=true behavior on the drive read path (surface extracted as context.data, keep the reference); stop the inline splice.
  • repos/server/src/config/app.rs:53 — once validated, flip refs_in_state default to true (the on-ramp; reversible).
  • repos/tools — ensure the result-store ref + extracted round-trip is the shared contract both sides read.

Acceptance: the 3 fixtures (test_output_select, test_storage_tiers, kind_playbook_lease_expiry) reach playbook.completed under NOETL_ORCHESTRATE_PLUGIN_DRIVE=true on kind gate-ON with no command.issued / drive-state over budget and no per-step foreign bulk in render_context; extend kind_validate_orchestrate_offserver.sh with the bulk-consume assertion.


10. Recent activity

Date What Pointers
2026-06-22 #123 SHIPPED + CLOSED — a non-iterable loop in: no longer silently wedges (commands=0, RUNNING-forever) under the off-server drive; it surfaces a terminal playbook.failed. An observability regression uncovered during the #120 2×2 repro (it produced the false-positive #122): a loop step whose in: rendered to a non-iterable (e.g. loop: { in: '{{ workload.batch_slots }}' } with batch_slots absent → null) wedged silently. Root cause: evaluate_loop already returns CoreError::Validation("… did not evaluate to an iterable"), and the in-process drive already turned that Err into a terminal playbook.failed — but under the prod-default off-server drive (NOETL_ORCHESTRATE_PLUGIN_DRIVE=true) the system/orchestrate wasm plug-in returns a structured {"error":…} envelope instead of an OrchestrationResult; apply_worker_orchestration couldn't decode that, logged a WARN, recorded decode_error, and returned Ok(0) → no terminal event → RUNNING forever (the v3.4.2 explicit validation error was lost when the drive moved off-server). The fix (server#258, squash 275b914 → v3.39.6 7f109a9): server apply_worker_orchestration decodes the drive ERROR envelope (decode_orchestrate_error) and emits a terminal playbook.failed (metric noetl_orchestrate_drive_total{stage="drive_error"}, structured execution_id), matching the in-process drive — a transient decode miss (no envelope) stays on the benign re-drive path; orchestrate-core prefixes the offending step name onto the existing evaluate_loop error (prefix_loop_step_error). An empty iterable ([]/{}) still short-circuits to next — only a non-iterable errors. 
Code-only, server-only (worker stays v5.40.3); 600 server + 135 orchestrate-core tests + clippy clean; kind-validated prod-exact (PLUGIN_DRIVE=true+PUBLISH_ONLY=true+STATE_BUILDER=offserver): an absent workload.batch_slots → FAILED with loop step 'process': Loop expression '{{ workload.batch_slots }}' did not evaluate to an iterable (got null), a valid [1,2,3] loop still COMPLETED 3-way, #120 barrier and #124 binding unaffected, 0 pod restarts. Merged + ai-meta repos/server pointer bump → v3.39.6 + #123 closed (deliberate Closes #123) + board 3 → Done. PROD NOT redeployed — it stays healthy on v3.39.5 + the live off-server cutover; this ships to prod on a future server rollout. #127 stays OPEN (perf follow-up). noetl/server#258 (v3.39.6 7f109a9, squash 275b914) · #123 · #120 · #124 · #115
2026-06-22 #121 SECOND-HALF SHIPPED + CLOSED — off-server system/* WAL-chain wedge fully fixed (server v3.39.5). #256 (v3.39.4) was only the first half: a live-prod re-cutover on v3.39.4 still wedged system/scheduled_cleanup (37+ WAL chain incomplete on the system-pool worker, applied frozen at 12 while offserver_retry climbed in lockstep with dispatched_offserver_stateless), because the system-pool worker drives system executions off-server regardless of the server-side gate — STATE_BUILDER=offserver on the worker drives them off the WAL, and their INSERT-not-publish chain leaves a NULL-prev orphan. So #121 was reopened (auto-closed by #256, re-opened with the live prod repro: tenant cqls_cutover_202606220127, exec 327261539490865152). The fix (server#257, squash 54ac277 → v3.39.5 c421273): gate both off-server-drive decision sites in trigger_orchestrator_inner on should_publish(catalog_id) via a new pure is_system_path helper (+ unit test) so system/* execs (should_publish=false) fall through to server-built run_state, while regular publishable execs keep the off-server path (preserves the #256 win and avoids any #256 regression). Server-only (worker stays v5.40.3); 612 tests + clippy green; kind full-gate before/after (img noetl-server:121-syssrv, prod-exact PUBLISH_ONLY=true+STATE_BUILDER=offserver on server+system-pool): BEFORE unfixed looped 9× WAL chain incomplete on system/scheduled_cleanup; AFTER the wedged exec went RUNNING→COMPLETED on fixed-server takeover, a fresh system exec COMPLETED 0 loops (offserver=false server-built), and regular test/simple_loop COMPLETED still off-server. Invariants clean (roots=1/dangling=0/dup=0, materializer sole-writer projected==acked, worker off-server scans=0; server state_build_event_scans=4 = the system execs' intended server-built reads, by design NOT a regression). Phase A this session: merged + ai-meta pointer bump → v3.39.5 + #121 closed (deliberate Closes #121) + board 3 → Done. 
PROD GKE off-server re-cutover (3rd attempt) is the Phase-B follow-up. noetl/server#257 (v3.39.5 c421273, squash 54ac277) · #121 · #117 · #118 · #119 · #115
2026-06-21 #121 FIRST-HALF SHIPPED (partial) — off-server WAL-chain-incomplete re-drive loop on system/ executions FIXED. Surfaced during the prod off-server-drive cutover (server v3.39.3 + off-server gate), in the #117/#118/#119 chain-completeness family — a new uncovered case. Two distinct defects. (1) Orphaned command.claimed (NULL prev_event_id): the gate-off claim_command raw in-tx INSERT (and the handle_batch_events gate-off batch INSERT) bypassed the event_write::emit_events/ChainHeads link chokepoint, so command.claimed got prev_event_id=NULL AND the per-execution chain head was never advanced — the next event (command.started) linked back to command.issued, skipping the orphaned claim, so the off-server chain_walk_from hit a NULL-prev non-genesis head → build_spine_to Incomplete. Fired for every system/ playbook because should_publish is false for system executions even under a global PUBLISH_ONLY=true. Fixed by stamping prev_event_id via ChainHeads::link_batch on both gate-off INSERT paths (same advance-then-write ordering emit_events uses). (2) The actual loop: the stateless off-server drive builds state from the noetl_events WAL, but system-execution events INSERT to noetl.event and never enter the WAL → the worker's WAL build could never complete → __offserver_retry__ no-op → server reconciler re-drives → repeat. Fixed by gating the off-server WAL drive on should_publish(catalog_id); system/ execs drive server-built. Fix in server#256 (v3.39.4, 77aaa06, squash 28b17cb) — changes confined to src/handlers/events.rs + src/state.rs → server-only rebuild, no worker bump; 598 tests + clippy clean (new unit test chain_head_claim_orphan_vs_linked); kind-validated prod-exact (PUBLISH_ONLY=true+STATE_BUILDER=offserver): system/scheduled_cleanup orphan + 17×+ WAL chain incomplete (wedged 120s+) → linked chain, 0 loop lines, COMPLETED in 6s; non-system off-server unaffected. 
Live prod repro (read-only): the full off-server cutover on prod WEDGED — applied froze while dispatched_offserver_stateless/offserver_retry climbed in lockstep to ~389, the system pool logged WAL chain incomplete; returning no-op for multiple execs (incl. prod's own 327130580493803520), all prev_event_id=NULL; the armed revert (PUBLISH_ONLY=false+STATE_BUILDER=server) recovered cleanly → #121 confirmed the blocker for the prod off-server cutover. PROD GKE default (STATE_BUILDER=server) + all defaults untouched. #123 + #127 stay OPEN (separate). noetl/server#256 (v3.39.4 77aaa06, squash 28b17cb) · #121 · #117 · #118 · #119 · #115
2026-06-21 Siblings #125 + #126 SHIPPED + CLOSED — the two Python→Rust regressions behind the batch pft_flow_test are fixed; the full 10×1000 batch now runs end-to-end, correctness-clean. #126 (tools#72, noetl-tools v3.13.1 8dd0e1f): the Rust http tool nested the parsed response body under data.body; playbooks read output.data.data (Python-era contract) → save_batch's jsonb_to_recordset got a non-array object → Postgres error → 0 rows. Fix: expose body under data; body = back-compat alias. #125 (tools#73, noetl-tools v3.14.0 638c3c6): the task_sequence runner matched only "fail" in the do: branch; do: jump/to:, do: break, do: retry (attempts/backoff + infinite-jump guard) were silently ignored — each slot ran once and returned, leaving batches in progress. Worker adoption (worker#124, noetl-worker v5.40.3 6dd3449): Cargo.lock bumps the noetl-tools pin 3.13 → 3.14 so the worker runs both fixes. Result: 10×1000 = 10,000-patient batch pft_flow_test COMPLETE on kind, zero patient loss, invariants clean (roots=1/dangling=0, materializer sole-writer, never-scan). Perf follow-up #127 filed — throughput plateaus ~60 patients/s (10k in ~166–172 s), ~3× slower than the Python ~54 s kind baseline; throttled ≈ unthrottled so the rate limiter is NOT the bottleneck — capped by the NoETL pipeline (worker concurrency 4×2 + serial per-slot drain + off-server/publish-only per-step overhead). Correctness-clean; perf is a follow-up investigation. #121 (orphaned-spine WAL-chain-incomplete loop) + #123 (loop-non-iterable observability) stay OPEN. PROD GKE + all defaults untouched. noetl/tools#72 (v3.13.1 8dd0e1f) · noetl/tools#73 (v3.14.0 638c3c6) · noetl/worker#124 (v5.40.3 6dd3449) · #125 · #126 · #127 · #115
2026-06-21 Sibling #124 SHIPPED + CLOSED — distributed task_sequence forward set:/sibling bindings no longer render empty (data now propagates between sub-tasks). Surfaced chasing the batch PFT benchmark, but a shared orchestrate-core command-build bug, not an off-server / #115 regression. Inside a multi-tool (task_sequence) step, a later sub-task's templates that reference a value a prior sub-task produces at runtime — a forward set:, a policy-rule set:, or a sibling result — were rendered to empty at command-build time, before the worker's per-sub-task binding ran. orchestrate-core/src/commands.rs::render_pipeline_config preserved set/args/spec/command verbatim but rendered every other sub-task field (url/params/method/…) against the step-entry context under UndefinedBehavior::Chainable, so runtime-only references ({{ iter.* }}, sibling labels) silently collapsed to empty — concretely pft_flow_test's claim_batch writes iter.data_type via its policy set:, then fetch_batch's url: …/api/v1/pft/batch/{{ iter.data_type }} pre-rendered to …/batch/ → 404 → 0 rows. Fix in server#255 (v3.39.3, 365d3be): new TemplateRenderer::render_value_deferring_unresolved renders only templates whose variable paths all resolve in the build-time context; any template referencing an unresolved path is preserved verbatim so the worker re-renders it against the per-sub-task running context (a superset). cargo test 134/134 + clippy clean; kind-verified (fetch_batch now hits the real per-type URLs). The batch benchmark stays blocked behind two further distinct Python→Rust regressions uncovered behind this binding fix — #125 (task_sequence honours only do: fail, ignores do: jump/break/retry) + #126 (http tool nests the body under data.body but fixtures read output.data.data). PROD GKE + all defaults untouched. noetl/server#255 (v3.39.3 365d3be, squash d53e095) · #124 · #125 · #126 · #115
2026-06-21 Sibling #120 SHIPPED + CLOSED — reduce barrier no longer deadlocks (commands=0) on open/asymmetric loop joins. Surfaced during the off-server prod validation but NOT an off-server / #115 / #117 / #118 / #119 regression — the barrier predates them (it lives in shared orchestrate-core, invoked identically by the in-server and off-server drives). A back-edge U → T whose forward return path from T is absent (T does NOT forward-reach U, e.g. a loop the workload now bypasses) was counted by build_incoming_arcs as a genuine fan-in parent of T; the reduce barrier then deferred dispatch of T forever because U never runs on the taken path (pft_flow_test setup_facility_work stalled commands=0 for 4.5 min; setup_facility_work dispatched 0× in 24h → deterministic, not a race). Fix in server#254 (v3.39.2, 28e8950): a runtime liveness filter in the barrier — an upstream blocks dispatch only if it is live on the current path (done/skipped, entered/in-flight, or forward-reachable from a currently-active not-yet-terminal step); a declared predecessor that can never run no longer blocks. build_incoming_arcs left unchanged (an open back-edge is still a genuine static fan-in). Closed loops + genuine reduces unaffected. New unit test test_open_loop_back_edge_does_not_block_dispatch (sibling to test_build_incoming_arcs_excludes_loop_back_edge); cargo test 133/133 + clippy clean; kind-validated (post-fix the 2×2 off-server/gate matrix all COMPLETE, fanout_reduce/pagination/loop spot-checks green). PROD GKE + all defaults untouched. noetl/server#254 (v3.39.2 28e8950, squash fbb855f) · #120 · #115
2026-06-20 #119 + #118 SHIPPED + gate-ON kind-validated + CLOSED — single- AND multi-replica off-server are now blemish-free (single-root incl. finalize, zero fallback) AND restart-robust (the WAL index rehydrates). #119 (off-server WAL-drain index stall on worker restart — the blocker that, the prior session, prevented executions from completing and so hid the #118 symptom): the authoritative WAL state-builder drain (state_builder.rs::run_drain_loop) used a durable noetl_state_builder consumer whose cursor persists across worker pod restarts, while the in-memory WalEventIndex rebuilds empty on each boot → after a restart the cursor sat past the events the fresh index needed (delivered+acked to a prior process, never redelivered) → build_spine_to(expected_head) permanently Incomplete → the off-server drive looped offserver_retry and single-replica executions never reached a terminal event (wal_events_total=0 while the consumer showed delivered+acked). Fix (worker-only, inside NOETL_STATE_BUILDER=offserver; PROD runs the in-server drive so untouched): the authoritative drain now defaults to an ephemeral DeliverPolicy::All consumer — the shadow consumer shape — rebuilding the full in-memory index from the retained noetl_events WAL on every boot, so a persisted cursor can never outrun a freshly-restarted worker's empty index (also correct for >1 worker pod: each holds the complete event set for the executions it may drive, not the load-balanced subset a shared durable would give). Ack-policy keyed on durable-presence, advance-timing on mode (decoupled). Instant revert NOETL_STATE_BUILDER_DURABLE=1 (not restart-safe without an index snapshot). Proof: one-shot index rehydrated from retained noetl_events WAL log + new gauge noetl_worker_state_builder_indexed_executions. Never reintroduces a noetl.event scan (rebuild reads the WAL stream only — tenet 3). noetl-worker (worker#123 → v5.40.2 48b0bde): 224 lib tests incl. fresh_index_rebuilds_from_full_replay_after_restart. 
#118 (single-replica off-server terminal-finalize chain fork): a duplicate finalize (a straggler drive on a materializer-lagged single replica re-drove off the lagged WAL and emitted a 2nd playbook.completed that linked to the now-evicted chain head → NULL-prev_event_id 2nd root + a benign event-scan). Fix = a bounded process-local FinalizedGuard (exactly-one-terminal-per-execution) suppressing the duplicate at emit_events before the chain linker (a suppressed duplicate never advances/consumes the head); first terminal wins; gate-off byte-identical; metric noetl_terminal_dedup_total{suppressed}; rig gains a HARD per-exec terminals==1 assertion. Absent under multi-replica execution-affinity (#116 serializes finalize to the owner). noetl-server (server#253 → v3.39.1 c5f8cb2) + noetl/e2e (e2e#73 fe97d92). Gate-ON kind-validated (server 118-finalize v3.39.1 + worker 119-rehydrate v5.40.2; offserver + publish_only + audit_only + plugin_drive): restart-rehydration — a forced mid-flight kubectl delete pod --force of the system-pool worker → the new pod logged index rehydrated … indexed_executions=17 wal_events=200 (pre-fix this gauge was 0 → the stall); single-replica #118: NOETL_COHERENCE_FANOUT_BURST=12 × 6 consecutive iterations (~126 execs) 6/6 PASS, 0 anomalies — every chain roots=1 (incl. the terminal), dangling=0, walk==rows, terminals=1, orch_events=0, zero state_build_event_scans + zero hot-path noetl.event scans (no genuine duplicate-finalize arose, so terminal_dedup stayed 0 — the fork never happened; the suppression path is unit-proven); multi-replica — the 2-replica execution-affinity StatefulSet (NOETL_COHERENCE_DRIVE_AFFINITY=shipped, burst=12) 21 execs COMPLETE, every chain roots=1/terminals=1, forwarded_ok +202, kv_remote_hit +12, zero scans. Baseline restored; PROD GKE + all defaults untouched. Closes the off-server hardening gap #117 left open. 
noetl/worker#123 (v5.40.2 48b0bde) · noetl/server#253 (v3.39.1 c5f8cb2) · noetl/e2e#73 (fe97d92) · #119 · #118 · #117 · #115 · #107
2026-06-20 Program-scale step 2 SHIPPED + multi-replica gate-ON validated — execution-affinity single-owner WRITE ORDERING; the off-server stack is now multi-replica chain-COHERENT. Step 1 (KV data coherence) made 2+ replicas resolve the same head/descriptor but was necessary-not-sufficient: the command.issued prev read in handlers::execute and the head CAS-advance in event_write::emit_events are two non-atomic steps, so concurrent cross-replica emits forked the chain. Affinity closes it by routing every trigger (POST /api/events, which also fires the drive) to the single replica that sharding::ShardConfig::owns(execution_id) owns (stable XxHash64); a non-owner forwards (transparent reverse-proxy POST, one-hop loop guard, degrade-to-local on failure). On the owner the existing single-process drive lock + in-memory ChainHeads make the read→advance atomic with no distributed lock; KV coherence composes as the genesis-on-other-replica + handoff vehicle (owner resolves LOCAL → kv_remote_hit trends to 0 by design). Chose forwarding over a distributed drive lease (option ii) — solves the chain fork AND the double-drive with one mechanism, reusing src/sharding.rs verbatim. noetl-server (server#252 → v3.39.0 5e00d0a): new src/affinity.rs (ExecutionAffinity + shard_index_from_hostname); flags NOETL_EXECUTION_AFFINITY / NOETL_PEER_URL_TEMPLATE / NOETL_SHARD_INDEX_FROM_HOSTNAME (all default off/unset, prod unchanged); handle_event forwards; reconcile poller skips non-owned; metric noetl_execution_affinity_total{outcome} (forwarded_ok = the write-ordering proof). noetl/e2e (e2e#71 66b6e1b): 2-replica StatefulSet topology (manifests/replica-affinity/ — distinct shard index per pod via hostname ordinal + headless DNS) + deploy_replica_affinity_topology.sh (up/down, flips system-pool worker offserver) + rig flipped HARD (forwarded_ok proof; kv_remote_hit informational under affinity since the owner resolves local). 
Multi-replica gate-ON kind-validated (2-replica StatefulSet, server offserver+audit_only+nats_kv+affinity+publish_only, worker offserver): NOETL_COHERENCE_DRIVE_AFFINITY=shipped PASS — linear/loop/fanout COMPLETE; every chain roots=1/dangling=0/walk==total (NO fork — exactly what forked without affinity); forwarded_ok +9; state_build_event_scans +0 + hotpath scan +0 (never-scan across replicas); sole-writer rows==distinct, __orchestrate__ event=0; degraded +0; single-replica unchanged. 595 server tests + clippy green; baseline restored. Known follow-up #117 (separate, pre-existing): off-server from_events spine is ordered by event_id; under a chain-order≠id-order inversion (affinity's forwarding makes it likelier under high-concurrency fan-out) the fan-in reduce wedged on 1/9 execs in a 9-way concurrent run (chain stayed clean — NOT a fork). PROD GKE untouched; all affinity flags default off; no gate/mode/builder default changed. noetl/server#252 (v3.39.0 5e00d0a) · noetl/e2e#71 (66b6e1b) · #116 · #117 · #115 · #107
2026-06-20 Program-scale step 1 SHIPPED — multi-replica coherence DATA LAYER (NATS-KV-backed ChainHeads + ExecDescriptor); execution-affinity STAGED. The off-server drive keys two facts off in-memory AppState maps — the ChainHeads prev_event_id watermark + the ExecDescriptor (catalog_id + routing + terminal) — both single-replica-local. Behind NOETL_REPLICA_COHERENCE=nats_kv (default local, prod unchanged) both are backed by JetStream KV buckets (noetl_chain_heads, noetl_exec_descriptors): the head advance is a CAS so concurrent emits serialise into one chain; the descriptor is a CAS read-modify-write so seed + terminal merge across replicas. The in-process maps become a write-through cache / degraded-mode fallback (KV down / local → bit-identical to today). noetl-server (server#251 → v3.38.0 8f39a79): new src/coherence.rs (CoherenceKv + KvRead + CAS helpers; lazy buckets); ChainHeads/ExecDescriptors methods now async; ExecDescriptor serde-derived; proof metric noetl_replica_coherence_total{structure,op,outcome} (load-bearing series outcome="kv_remote_hit" = a head/descriptor another replica seeded, resolved coherently — a cold server-built fallback avoided). noetl/e2e (e2e#70 e222877): kind_validate_replica_coherence.sh (new) + un-staled the #113/#114 offload asserts behind NOETL_RIG_EXPECT_OFFLOAD (default false — refs_in_state=true keeps contexts reference-only, so the offload paths legitimately stay flat). Kind-validated: single-replica nats_kv is bit-for-bit parity with local (linear/loop/fan-out ×2 all COMPLETE, chain integrity perfect — roots=1/dangling=0/walk==total, state_build_event_scans +0, hotpath scan +0, sole-writer intact, __orchestrate__ event=0); 2-replica nats_kv proved the coherence resolves work (kv_remote_hit advanced for head + descriptor, no kv_unavailable). 
The data layer is necessary but NOT sufficient: on 2+ replicas concurrent cross-replica emits still fork the chain (validation observed forked chains + a cross-execution prev from the non-atomic issuing_event-read vs head-advance across replicas), so executions don't reliably COMPLETE on 2+ replicas yet. The remaining piece is execution-affinity (one replica owns an execution's drive + chain write) — option (ii), STAGED as program-scale step 2 (substrate already present: src/sharding.rs shard_for/ShardConfig::owns). 588 server tests + clippy green; baseline restored. PROD GKE untouched; default local; no gate/mode/builder default changed. noetl/server#251 (v3.38.0 8f39a79) · noetl/e2e#70 (e222877) · #115 · #107 · #111
2026-06-20 Phase 5 SHIPPED + gate-ON validated — atomic-working-item context (tenet 6): the drive hands a worker only its minimal declared slice. NOETL_ATOMIC_ITEM_CONTEXT=true|false (default false, prod unchanged). #77 dependency resolved: Explicit Input Binding is CLOSED (BREAKING v3.0.0) — it shipped the declaration surface (input:/args: + set:input:); Phase 5 adds the missing extractor + drive narrowing (the server previously attached the full accumulated context to every command regardless of input:). noetl-server (server#250 → v3.37.0 a96ade8): new orchestrate-core::input_binding: analyze(step) statically extracts the base-context keys a step references (minijinja undeclared_variables over the serialized tool def + step input; ctx.X→X, workload→workload, bare step-name→key, injected roots→none), conservative (any unbounded ref → bounded=false → full context); project_context narrows the worker-bound context; CommandBuilder/WorkflowOrchestrator::with_atomic_item_context narrow the persisted context for plain non-loop steps while server-side rendering still runs against the full context; plugin OrchestrateInput/OrchestrateStateInput carry a #[serde(default)] flag; metric noetl_atomic_item_context_total{outcome}. noetl-worker (worker#121 → v5.40.0 2484d17): forwards the flag onto the off-server from_events drive input. noetl/e2e (e2e#69 79505fa): atomic_item_context.yaml (consumer binds only {{ producer_a.tag }}) + kind_validate_atomic_item_context.sh. 
Gate-ON kind-validated (server p5-atomic, flag on, STATE_BUILDER=server + PLUGIN_DRIVE=true + PUBLISH_ONLY=true): flag-ON consumer render_context = [producer_a] ONLY (producer_b + start + steps + workload + execution_id + catalog_id + path all dropped), execution COMPLETED, narrowed metric +1; flag-OFF full context [8 keys incl producer_b], COMPLETED (back-compat); offserver regression COMPLETED / __orchestrate__ event=0 / dispatched+applied advance / system-pool isolation / lag-0; 7 input_binding + 132 orchestrate-core + 584 server + 10 worker tests + clippy green; baseline restored. PROD GKE untouched; default false; no gate/mode/builder default changed. Realizes tenet 6. Remaining #115: program-scale (per-shard WAL, multi-replica descriptor coherence). noetl/server#250 (v3.37.0 a96ade8) · noetl/worker#121 (v5.40.0 2484d17) · noetl/e2e#69 (79505fa) · #115 · #107 · #111
2026-06-20 Phase 6 SHIPPED + gate-ON literal-zero validated — the hot-path noetl.event read class is RETIRED; the table is now AUDIT-ONLY. NOETL_EVENT_READ_PATH=event_scan|audit_only (default event_scan, prod unchanged). Phase 4 removed the drive's state-rebuild scan under offserver; Phase 6 retires the remaining execution-lifecycle readers of noetl.event — the WHERE execution_id replay class that runs outside the drive. noetl-server (server#249 → v3.36.0 b71ca1d): under audit_only get_catalog_id (per-ingest), inherit_parent_trace, the subscription dedup-audit + container-callback catalog/existence reads serve from the in-memory execute-time ExecDescriptor; on a cold descriptor (a post-terminal straggler after the descriptor is evicted on terminal, or a restart mid-execution) catalog_id resolves from noetl.command (the synchronous queue, authoritative under the gate) — never a noetl.event scan. A cold descriptor never re-seeds (re-seeding an evicted terminal exec would re-accumulate the per-execution memory the eviction frees). New proof metric noetl_event_hotpath_reads_total{site,outcome} (served_descriptor|served_command|scan). noetl/ops (ops#199 e5b0737) pins NOETL_EVENT_READ_PATH=event_scan on the prod server manifest (operator-gated flip, taken with offserver). noetl/e2e (e2e#67+#68 0ab3c0a) — kind_validate_event_read_path_phase6.sh asserts the end-to-end never-scan invariant. 
Gate-ON kind-validated (PUBLISH_ONLY + offserver + materializer sole-writer + audit_only): hot-path scan Δ0 (served_descriptor +96 live + served_command +3 terminal-stragglers), drive state_build_total Δ0 + state_build_event_scans Δ0 → ZERO noetl.event scans anywhere on the hot path, end-to-end; linear(13)/loop(62)/fan-out(25)/output_select(31) all COMPLETE; sole-writer rows==distinct, catalog0=0, __orchestrate__ event=0, materializer dup 0, lag-0; audit still works — direct SELECT FROM noetl.event returns the rows, status API COMPLETED, replay GET /api/replay/state folds event_count=25; committed orchestrate-gate rig PASS with audit_only on (no regression); 585 server tests + clippy green; baseline restored. PROD GKE untouched; default event_scan; no gate/mode/builder default changed. The RFC's never-scan end state (tenet 3) is reached under the flag. Remaining: Phase 5 (atomic-item context, needs #77) + program-scale (per-shard WAL, multi-replica descriptor coherence). noetl/server#249 (v3.36.0 b71ca1d) · noetl/ops#199 (e5b0737) · noetl/e2e#67+#68 (0ab3c0a) · #115 · #107 · #111
2026-06-20 Phase 4 REMAINDER SHIPPED + gate-ON validated — the off-server drive edge is now STATELESS. Removed the server's residual chain-walk bookkeeping on the drive path: under NOETL_STATE_BUILDER=offserver the server performs ZERO state rebuild + ZERO noetl.event reads on the drive path — it just routes commands + persists events. noetl-server (server#248 → v3.35.0 6e30fc3) adds a per-execution ExecDescriptor (catalog_id + routing seeded at playbook_started; terminal flag stamped at the emit_events chokepoint for cancel/finalize/playbook-terminal). trigger_orchestrator_inner's stateless branch (dispatch_offserver_stateless_drive) routes system/orchestrate WITHOUT building WorkflowState — catalog_id+routing from the descriptor, expected_head from the in-memory ChainHeads, trigger_event_id passed (worker resolves the trigger type off its WAL), no server-built state (__stateless__). apply_worker_orchestration sources catalog_id+routing from the descriptor (skips the cold-rebuild) + evicts on a terminal worker-built result.state. A cold descriptor (server restart) falls through to the server-built path (re-seeds) → chain_walk + event_scan stay the fallbacks, correctness never below the proven server-built drive. noetl-worker (worker#120 → v5.39.0 8e1f651) resolves trigger_event_type off the WAL index (ExecutionChain::event_type_of) from the server-supplied trigger_event_id; under the stateless edge an incomplete WAL after the bounded retry is a benign {__offserver_retry__:true} no-op the reconcile poller re-drives (OffserverDispatch{Wasm|Noop}) — never a partial state, never a wedge. noetl/e2e (e2e#66 f4bb342) asserts the zero-rebuild invariant (noetl_state_build_total Δ0 + dispatched_offserver_stateless/applied_stateless advance). 
Gate-ON kind-validated: state_build_total Δ0 + event_scans Δ0 across linear(13)/loop(62)/fan-out(25)/output_select(31), all COMPLETE; dispatched_offserver_stateless +3 / applied_stateless +3; offserver==server fingerprint [enrich:1,normalize:1,reduce:1,start:1], fan-in once; worker served +3 / scans 0 / wal_events +25 / cache cold+1/incr+2; sole-writer 25==25 / __orchestrate__ event=0 / materializer dup 0 / lag-0; committed gate rig PASS; 583 server + 218 worker tests + clippy green; baseline restored. PROD GKE untouched; default server; no gate/mode/builder default changed. Completes #107 step 2 server-side (state rebuild + event reads removed from the drive path under the flag). Phase 5 (atomic-item context, needs #77) + Phase 6 (retire the event read path) remain. noetl/server#248 (v3.35.0 6e30fc3) · noetl/worker#120 (v5.39.0 8e1f651) · noetl/e2e#66 (f4bb342) · #115 · #107 · #111
2026-06-19 Phase 4 DRIVE CUTOVER SHIPPED + gate-ON parity-validated — off-server WAL build authoritative. The orchestrator drive now obtains its WorkflowState from the pool-side WAL builder (the wasm run/from_events entry) instead of the server building it and shipping run_state, under NOETL_STATE_BUILDER=offserver (default server, prod unchanged). noetl-worker (worker#119 → v5.38.0 bef13e5) promotes the shadow WalEventIndex to a shared index fed by an authoritative durable consumer (noetl_state_builder, explicit-ack — mirrors the materializer); dispatch_wasm detects an __offserver_build__ command and builds the drive state from the WAL spine (zero noetl.event reads), with a staleness guard — it serves only once the index head ≥ the server's expected_head, else a bounded retry waits for the drain or falls back to the server-built run_state (so the WAL state is never staler than the server's view → no lag-induced fan-in re-issue). noetl-server (server#247 → v3.34.0 f0922bd) marks the offserver command + carries expected_head. noetl/ops (ops#198 b1da9f1) adds the NOETL_STATE_BUILDER knob on the system-pool manifests (server default, operator-gated flip). Gate-ON parity rig (e2e#65 b38b6dd, kind_validate_state_builder_offserver.sh, two-leg offserver-vs-server): PARITY — identical completed-step fingerprint [enrich:1,normalize:1,reduce:1,start:1], both COMPLETED (fan-in barrier fires exactly once); WAL-build authoritative — worker drive_builds{served}=+3 / fallback 0 / event_scans=0 / wal_events=+25; no server scan: state_build_event_scans=0 (chain_walk); cache cold +1 / incremental +2; sole-writer 25==25 / catalog0=0 / __orchestrate__ event=0 cmd=3 / lag-0. Plus manual linear + loop legs COMPLETED off-server (served +13, 0 scans). 10 worker + 580 server tests + clippy green; baseline restored. PROD GKE untouched; no gate/mode/builder default changed. 
Server still keeps its chain-walk bookkeeping for terminal/cancel/catalog/routing — removing that residual server rebuild is the precisely-staged Phase-4 remainder. noetl/worker#119 (v5.38.0 bef13e5) · noetl/server#247 (v3.34.0 f0922bd) · noetl/ops#198 (b1da9f1) · noetl/e2e#65 (b38b6dd) · #115 · #107 · #111
2026-06-19 Phase 4 KERNEL + FLAG SHIPPED + shadow kind-validated — off-server state builder. The pool-side builder reconstructs WorkflowState from the noetl_events WAL (not the materialized noetl.event): noetl-worker src/state_builder.rs (worker#118 → v5.37.0 fef961c) — a per-execution WalEventIndex fed from the WAL, a chain_walk() that walks prev_event_id head→root and returns the spine in event_id order (== the server event-scan input → parity by construction, same from_events), and a cache keyed by the immutable chain head (CacheHit unchanged / Incremental tail-only walk with pointer-continuity instead of COUNT(*) / ColdRebuild miss-restart / terminal eviction). A live WAL shadow loop (NOETL_STATE_BUILDER_SHADOW, default off; ephemeral DeliverAll/AckNone consumer, never the materializer's durable one) + metrics. Server NOETL_STATE_BUILDER=offserver|server flag scaffold (server#246 → v3.33.0 3e6006d, default server). Gate-ON kind-validated (PUBLISH_ONLY + off-server drive + materializer sole-writer): (1) PARITY — shadow spines indexed==spine with sizes matching the Phase-3 topologies (linear 13 / loop 62 / fan-out 25 / output_select 31 / storage_tiers 55), fresh fan-out spine 25 == DB event_rows 25; (2) WAL-read wal_events_total=993 with event_scans_total=0; (3) cache cold_rebuild=28 + incremental=21 (fresh fan-out Incremental(5)); (4) fresh fan-out COMPLETED, event_rows==distinct, 0 __orchestrate__ event rows, materializer pending=0/project_errors=0; 8 worker unit tests + 2 server config tests + clippy green; baseline restored. The offserver drive cutover (drive consumes the builder's state) is staged. PROD GKE untouched; no gate/mode/builder default changed. noetl/worker#118 (v5.37.0) · noetl/server#246 (v3.33.0) · #115 · #107 · #111
2026-06-19 Phase 3 MERGED — the chain-walk state builder. noetl/server#245 self-merged (no auto-mode classifier block) → merge 28c45b3, release CI v3.32.0 (8338417); ai-meta repos/server pointer bumped. Behind NOETL_STATE_BUILD_MODE=chain_walk|event_scan (default event_scan, prod unchanged) the drive rebuilds WorkflowState by walking prev_event_id head→root: head from the in-memory ChainHeads watermark (no DB read), then (execution_id,event_id) PK lookups head→root (never a WHERE execution_id scan), collected events sorted by event_id → SAME from_events (orchestrate-core unchanged; parity by construction). Conservative fallback to event-scan on cold-head / node-not-materialized-under-gate / non-genesis. Adds NOETL_STATE_BUILD_PARITY_CHECK (shadow-build both ways in one REPEATABLE READ tx + assert equality) + metrics (state_build_total{mode,outcome}, …event_scans_total no-scan counter, …chain_hops, …parity_total). Validated gate-ON (PUBLISH_ONLY + off-server drive + materializer sole-writer) across linear/loop(62)/fan-out+reduce/output_select(31)/storage_tiers(55): parity 41/41 MATCH 0-mismatch (tx-isolated + normalized), NO-SCAN (event_scans_total=0 across 40 builds / 1064 PK hops / 0 fallbacks), all COMPLETE in chain_walk mode, sole-writer + lag-0, 577 lib tests + clippy green. PROD untouched; no gate/mode default changed. noetl/server#245 (v3.32.0 8338417) · #115
2026-06-19 Phase 3 IMPLEMENTED + kind-validated — the chain-walk state builder (PR open). Behind NOETL_STATE_BUILD_MODE=chain_walk|event_scan (default event_scan, prod unchanged) the drive rebuilds WorkflowState by walking prev_event_id head→root: head from the in-memory ChainHeads watermark (no DB read), then (execution_id,event_id) PK lookups head→root (never a WHERE execution_id scan), collected events sorted by event_id → SAME from_events (orchestrate-core unchanged; parity by construction). Conservative fallback to event-scan on cold-head / node-not-materialized-under-gate / non-genesis. Adds NOETL_STATE_BUILD_PARITY_CHECK (shadow-build both ways in one REPEATABLE READ tx + assert equality) + metrics (state_build_total{mode,outcome}, …event_scans_total no-scan counter, …chain_hops, …parity_total). Validated gate-ON (PUBLISH_ONLY + off-server drive + materializer sole-writer) across linear/loop(62)/fan-out+reduce/output_select(31)/storage_tiers(55): parity 41/41 MATCH 0-mismatch (tx-isolated + normalized — surfaced + excluded a pre-existing created_at-timestamp-without-tz → Utc::now() non-determinism in started_at/entered_at/completed_at, present in both build paths); structural parity (recursive prev_event_id walk == scan: walk_reached==total, 1 root, 0 dup across all); NO-SCAN (event_scans_total=0 across 40 builds / 1064 PK hops / 0 fallbacks); behavioral parity (all COMPLETE in chain_walk mode); sole-writer (rows==distinct, catalog0=0, __orchestrate__ event rows=0) + lag-0; kind_validate_orchestrate_gate.sh PASS in chain_walk mode; 577 lib tests + clippy green. Left open for review; pointer bump staged. PROD untouched; no gate/mode default changed. noetl/server#245 · #115
2026-06-19 Phase 2 MERGED → ai-meta pointers bumped; Phase 3 started. Both Phase-2 PRs merged: noetl/server#244 → server v3.31.0 (f5bd4a8, merge 29a8d69) + noetl/noetl#667 → noetl ecd16a2. ai-meta pointer bump afdb365. Post-merge verified on live kind: both prev_event_id columns present in the running DB (noetl.event + noetl.command), server image noetl-server:p2-chain reflects the merged code, gate-ON baseline live (PUBLISH_ONLY + off-server drive + materializer sole-writer). #115 → board "In progress". Phase 3 (chain-walk state builder, server-side, flagged) started — branch kadyapam/115-phase3-chain-walk-builder. afdb365 · noetl/server#244 · noetl/noetl#667 · #115
2026-06-19 Phase 2 IMPLEMENTED + kind-validated — one-level prev_event_id chain links. Each noetl.event carries prev_event_id (the immediately-previous event in causal order); each noetl.command carries the issuing-event link. The emit chokepoint emit_events stamps the event link from a per-execution chain-head watermark (ChainHeads in AppState) — covering drive events + command.issued + worker-lifecycle events (via handle_event) on both the gate-off INSERT and the gate-on publish (to_stream_json), with the materializer persisting it (EventEnvelope + project_events). Command link = the real step.enter/unblocking completion (chain head before the batch), so cursor-fan-out bodies share their branch origin (§4.4). Server-only (no orchestrate-core change — the chokepoint subsumes the "core carries prev" design since off-server-driven results re-enter the same emit path). Chain-correctness proven gate-ON across 6 executions (linear 13/13, loop 62/62, fan-out 25/25 with a real shared branch origin, sub-playbook 46/46, + Phase-1 output_select 31/31 & storage_tiers 55/55 bounded): every exec has 1 root, 0 dangling/0 duplicate event prev, 1 head, pointer-walk == full sequence (no gaps, no noetl.event scan), real-step command dangling=0. Gate sole-writer green (kind_validate_orchestrate_gate.sh PASS, lag 0); 573 lib tests + clippy green. PROD untouched; no gate default changed. PRs open, awaiting merge → pointer bump. noetl/server#244 · noetl/noetl#667 · #115
2026-06-19 Phase 1 SHIPPED — references-in-state consume side. Worker resolve_context_references made selective (resolve a noetl:// ref only when this command's tool input binds the step's bulk — a path the bounded extracted summary can't satisfy; predicate / scalar / _ref access reads off the summary, unconsumed refs stay references). Server hydrate_result_references surfaces _ref/_store/_uri on the kept summary (so {{ step._ref }} lazy-load + {{ step._ref is defined }}/{{ step._store }} predicates resolve without bulk); refs_in_state default flipped to true. Validated kind gate-ON (PUBLISH_ONLY + off-server drive + materializer sole-writer): all 9 #113 stalls COMPLETE (max command ctx 412KB, 0 __orchestrate__ event rows, materializer lag 0); the 3 targets bounded (output_select 1.32MB→10KB, storage_tiers 17.4MB→36KB, lease_expiry 201 spinning orch cmds→16). Closed #113 + #114. PROD GKE untouched (pre-#108 drive); flip is a code default. worker#117 (v5.36.0 0a66b41) · server#243 (v3.30.0 3014f6f) · #115
2026-06-19 RFC proposed + filed. Six-tenet reframe of the off-server-drive / event-sourcing model: reference-only noetl.* schema, one-level event chain (no noetl.event scan), off-server system-pool state builder + cache, atomic-working-item context. Reframes #101 (its consume side becomes Phase 1); subsumes #104's storage tier; state-construction design for #107 steps 2–4; unblocks #111; builds on #103's write path; #114 stays a safety cap. 6-phase plan; Phase 1 = the #101 consume side (immediate unblock for the 3 stuck #114 fixtures + the prod cutover). #115 · this page
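The cache outcomes named in the Phase 4 kernel entry above (CacheHit / Incremental / ColdRebuild) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in src/state_builder.rs: the `Event` struct, the `SpineCache` type, the `advance_to` signature, and the in-memory `HashMap` standing in for the WalEventIndex are all hypothetical. It also assumes a monotonic chain, so that causal order and event_id order coincide.

```rust
use std::collections::HashMap;

/// Hypothetical event record: only the two fields this sketch needs.
#[derive(Clone, Debug, PartialEq)]
pub struct Event {
    pub event_id: u64,
    pub prev_event_id: Option<u64>,
}

/// Outcome labels borrowed from the changelog entry; the payload is an assumption.
#[derive(Debug, PartialEq)]
pub enum CacheOutcome {
    /// Requested head equals the cached head: nothing to do.
    CacheHit,
    /// Only the tail between the cached head and the new head was walked.
    Incremental(usize),
    /// The walk never met the cached head, so the spine was rebuilt from the root.
    ColdRebuild,
}

/// Cached spine for one execution, stored root -> head.
#[derive(Default)]
pub struct SpineCache {
    spine: Vec<Event>,
}

impl SpineCache {
    fn head(&self) -> Option<u64> {
        self.spine.last().map(|e| e.event_id)
    }

    pub fn spine_ids(&self) -> Vec<u64> {
        self.spine.iter().map(|e| e.event_id).collect()
    }

    /// Bring the cached spine up to `head`. Returns None when a link is missing
    /// from `index` (or a cycle is detected); the cache is left untouched then.
    pub fn advance_to(&mut self, head: u64, index: &HashMap<u64, Event>) -> Option<CacheOutcome> {
        let cached_head = self.head();
        if cached_head == Some(head) {
            return Some(CacheOutcome::CacheHit);
        }
        // Walk head -> root, stopping early if we reach the cached head
        // (pointer continuity: the new tail attaches to what we already hold).
        let mut tail = Vec::new();
        let mut cursor = Some(head);
        let mut met_cached = false;
        while let Some(id) = cursor {
            if Some(id) == cached_head {
                met_cached = true;
                break;
            }
            let ev = index.get(&id)?;
            tail.push(ev.clone());
            cursor = ev.prev_event_id;
            if tail.len() > index.len() {
                return None; // more hops than events: the links form a cycle
            }
        }
        tail.reverse(); // head -> root becomes root -> head
        if met_cached {
            let appended = tail.len();
            self.spine.extend(tail);
            Some(CacheOutcome::Incremental(appended))
        } else {
            self.spine = tail;
            Some(CacheOutcome::ColdRebuild)
        }
    }
}
```

On a five-event linear chain, asking for head 3 with an empty cache is a ColdRebuild, asking for head 3 again is a CacheHit, and moving to head 5 is Incremental(2). A head that is absent from the index returns None, which a caller could treat as an incomplete WAL and retry later.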

11. Next concrete steps

  1. Phase 1 (#101 consume side) — ✅ DONE (worker#117 v5.36.0 + server#243 v3.30.0). Closed #113 + #114; off-server cutover unblocked on the size axis.
  2. Phase 2 — prev-event chain links — ✅ MERGED 2026-06-19 (noetl/server#244 → server v3.31.0 f5bd4a8 + noetl/noetl#667 → noetl ecd16a2; ai-meta pointer afdb365). Additive linkage proven walkable / no-gap / no-scan across 6 gate-ON executions; sole-writer + Phase-1 bounded sizes intact. Post-merge verified on live kind.
  3. Phase 3 — chain-walk state builder (server-side, flagged) — ✅ MERGED 2026-06-19 (noetl/server#245 → server v3.32.0 8338417; ai-meta pointer bumped). Replaces rebuild_state's noetl.event scan with a chain walk that follows prev_event_id head→root (each hop a (execution_id, event_id) PK lookup), behind NOETL_STATE_BUILD_MODE=chain_walk|event_scan (default event_scan); head from the in-memory ChainHeads watermark, conservative fallback to event-scan on cold-head / lag / non-genesis; orchestrate-core from_events unchanged. No-scan invariant proven gate-ON (event_scans_total=0 across 40 builds / 1064 PK hops / 0 fallbacks) + parity 41/41 match 0-mismatch + structural parity (walk==scan) + all topologies COMPLETE in chain_walk mode + sole-writer / lag-0 / gate rig PASS. (A derivable prev-URN — the WAL/object-tier twin of prev_event_id — and an OrchestrationResult-carried prev were left out of Phase 2 deliberately: the chokepoint watermark already stamps every path, so they're only needed when the off-server builder walks the WAL — fold them into Phase 3/4.)
  4. Phase 4 — off-server state builder + pool-side WAL cache — ✅ KERNEL + FLAG SHIPPED + shadow kind-validated 2026-06-19 (noetl/worker#118 → v5.37.0 fef961c + noetl/server#246 → v3.33.0 3e6006d; ai-meta pointers bumped). The pool-side state_builder reconstructs WorkflowState from the noetl_events WAL (a per-execution chain index walks prev_event_id head→root → spine in event_id order = parity by construction; cache keyed by the immutable chain head with CacheHit / Incremental tail-advance / ColdRebuild). A live WAL shadow loop (NOETL_STATE_BUILDER_SHADOW, default off) proves it on-cluster; the server NOETL_STATE_BUILDER=offserver flag (default server) is in place. Gate-ON kind-validated: WAL-read 993 events / 0 noetl.event scans / cache 28 cold + 21 incremental / shadow spines indexed==spine matching the Phase-3 topologies / fresh fan-out COMPLETED gate-ON with sole-writer + lag-0. DRIVE CUTOVER SHIPPED 2026-06-19 (noetl/worker#119 → v5.38.0 + noetl/server#247 → v3.34.0 + noetl/ops#198 + noetl/e2e#65): the drive obtains state from the pool-side WAL builder under offserver, with the expected_head staleness guard; server-built state kept as the worker's incomplete-WAL fallback. REMAINDER SHIPPED 2026-06-20 — the server edge is now STATELESS (noetl/server#248 → v3.35.0 6e30fc3 + noetl/worker#120 → v5.39.0 8e1f651 + noetl/e2e#66 f4bb342; ai-meta pointers bumped): the residual server chain-walk bookkeeping is removed — under offserver the server does ZERO state rebuild + ZERO noetl.event reads on the drive path (catalog_id+routing from an execute-time ExecDescriptor, expected_head from ChainHeads, terminal from the descriptor flag, trigger-type resolved off the worker WAL; no server-built state rides the stateless command, an incomplete WAL is a benign __offserver_retry__ no-op). chain_walk + event_scan stay the cold-descriptor fallbacks. 
Gate-ON-validated: state_build_total Δ0 + event_scans Δ0 across 4 topologies, dispatched_offserver_stateless/applied_stateless advance, offserver==server parity, sole-writer + lag-0. This fully removes the server-side state rebuild + event read #111 flagged → completes #107 step 2 server-side. Phase 4 ✅ COMPLETE.
  5. Program-scale — multi-replica off-server execution (#107 step 3). Step 1 (KV data coherence) ✅ + Step 2 (execution-affinity write ordering) ✅ SHIPPED + multi-replica gate-ON validated (noetl/server#252 → v3.39.0 5e00d0a + noetl/e2e#71 66b6e1b; ai-meta pointer bumped). Under NOETL_EXECUTION_AFFINITY=true + NOETL_REPLICA_COHERENCE=nats_kv (both default off) 2+ replicas produce one unforked chain and executions COMPLETE — the rig PASSES HARD (roots=1/dangling=0/walk==total, forwarded_ok proof, never-scan + sole-writer across replicas). #117 ✅ SHIPPED 2026-06-20 (worker#122, v5.40.1 baeae78): the off-server from_events spine used to order by event_id, which wedged the fan-in under a chain-order≠id-order inversion (affinity + high-concurrency fan-out); the spine is now ordered by the prev_event_id chain walk (SpineOrder::Causal default), and the 2-replica affinity gate-ON stress run completed 6/6 iterations / 108 execs. Linear/loop were already reliable. Remaining: the single-replica terminal-finalize chain-linking race (playbook.completed stamped with a NULL prev_event_id, non-wedging), staged as the next write-ordering follow-up.
  6. Batch benchmark — the 10×1000 pft_flow_test now runs correctness-clean on kind (zero patient loss; all four prior blockers fixed: #120 open-loop barrier, #124 forward-binding, #126 http data-shape, #125 task_sequence control flow). The open follow-up is #127: throughput plateaus ~60 patients/s (10k in ~166–172 s), ~3× slower than the Python ~54 s kind baseline. Rate-limiter is NOT the bottleneck (throttled ≈ unthrottled); capped by the NoETL pipeline (worker concurrency 4×2 + serial per-slot drain + off-server/publish-only per-step overhead). Investigation is separate from correctness and from the #115 roadmap items.
  7. Owner review of the remaining design — ratify the chain shape (§4.4; the DAG question in §8.1) and the explicit-input-binding dependency (§6.3) before Phase 5.
  8. Refresh the stale kind_validate_orchestrate_offserver.sh #113/#114 must-advance assertions (now dormant under Phase-1 refs_in_state=true) — make them conditional on actually exceeding budget. Tracked under #111/#98.
  9. Resolve the red-team questions §8.1 (join DAG) and §8.8 (predicate completeness) — both gate Phase 5.
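The chain walk in step 3 and the ordering problem behind #117 in step 5 can be illustrated with a small sketch. Everything here is an assumption made for the example: the `Event` struct, the `chain_walk` signature, the `HashMap` standing in for per-hop `(execution_id, event_id)` primary-key lookups, and the two `SpineOrder` variants (the names `Causal` and `event_id` appear on this page, but their real definitions are not shown). The point is only to show why a walk from the maximum event_id can miss the causal tip when id order and chain order disagree.

```rust
use std::collections::HashMap;

/// Hypothetical event record: only the two fields this sketch needs.
#[derive(Clone, Debug, PartialEq)]
pub struct Event {
    pub event_id: u64,
    pub prev_event_id: Option<u64>,
}

/// How the collected spine is ordered before it is handed to a state builder.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum SpineOrder {
    /// Order by the prev_event_id walk (head -> root, reversed).
    Causal,
    /// Sort the collected events by event_id.
    EventId,
}

/// Walk prev_event_id from `head` back to the root, one keyed lookup per hop.
/// Returns None on a missing link or a cycle so a caller can fall back
/// (for example to an event scan) instead of building from a partial spine.
pub fn chain_walk(
    head: u64,
    index: &HashMap<u64, Event>,
    order: SpineOrder,
) -> Option<Vec<u64>> {
    let mut spine = Vec::new();
    let mut cursor = Some(head);
    while let Some(id) = cursor {
        if spine.len() >= index.len() {
            return None; // more hops than events: the links form a cycle
        }
        let ev = index.get(&id)?; // dangling pointer: give up
        spine.push(ev.event_id);
        cursor = ev.prev_event_id;
    }
    match order {
        SpineOrder::Causal => spine.reverse(),
        SpineOrder::EventId => spine.sort_unstable(),
    }
    Some(spine)
}
```

Take a chain where event 4 was stamped as the predecessor of event 3, so the links are 1 ← 2 ← 4 ← 3 and the causal tip is 3. Walking from the tip in `Causal` order yields `[1, 2, 4, 3]`, and `EventId` order yields `[1, 2, 3, 4]`. Walking from the maximum id (4) yields `[1, 2, 4]` and never sees event 3, which is the shape of the missed-tip wedge described for #117. For a monotonic chain the two orders produce the same spine.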

12. Related

