Umbrella Event WAL Storage
Tracking issue: noetl/ai-meta#104
· Board: roadmap 3
· Blueprint: docs/architecture/event_wal_and_derivable_storage.md
· Program roof: Umbrella: Server Dissolution → Global Grid (#107)
Status (2026-06-22): RFC for review. The event-WAL half is LIVE on prod; the result-data half is the remaining work. The original blueprint (2026-06-16) folded #103 (CQRS split) and #101 (references-in-state) into one model. Since then the event side of that model shipped and went live: the materializer is the sole writer of noetl.event, the off-server drive builds state from the noetl_events JetStream WAL by walking the prev_event_id chain (never scanning noetl.event), and the full off-server CQRS cutover is live on prod (server v3.39.5, 2026-06-22).
The result-DATA half has now begun landing: Phase A (server accepts the canonical URI), Phase B (shadow Feather result tier), and Phase C (resolve-by-URN read path: server GCS object backend + cell registry + GET /api/internal/cells, worker resolve-by-URN + fixes B/B1, closes OQ6) are all MERGED (flags default-off → inert in prod until a future rollout), and Phase D (the minting flip: the URN tier becomes the authoritative result store with noetl.result_store as the reversible dual-write fallback) is MERGED 2026-06-23 (server#263 → noetl-server 3.43.0, worker#129 → noetl-worker 5.43.0, ops#204, e2e#78, flag default-off → inert in prod). What remains after Phase D: OQ5, the result_store retirement window (now resolved as a metric-gated decision, see §OQ5; the prod minting cutover is a separate task).
Phase E (the side-effecting-tool durability barrier, tools#78 → noetl-tools v3.17.0 / worker#130 → noetl-worker v5.44.0 / e2e#79) and Phase F (GC + DR: server#264 → noetl-server v3.44.0 / worker#131 → noetl-worker v5.45.0 / ops#205 / e2e#80) are both MERGED to main 2026-06-23 (flags default-off → inert in prod; ai-meta pointers bumped). With Phase F the BUILD phases A–F are complete; what remains on #104 is operational only: prod GCS infra, the OQ5 byte-source re-plumbing prerequisite, and the staged prod enablement/minting cutover. This page is the RFC; review requested on the issue.
Every atomic cycle of an execution publishes its event to the
noetl_events NATS JetStream stream; the publish-ack is the durability
boundary, so the stream is the write-ahead log and the synchronous
noetl.event INSERT is gone from the hot path. Independent system-pool
consumers drain that log: a materializer folds it into noetl.event
(now an audit/query projection, not the source of truth) and the
off-server drive rebuilds WorkflowState from the WAL by walking the
one-level prev_event_id chain. (All of the preceding is shipped and
live on prod.) The remaining design, proposed here, is that result
data stops living inline/Postgres and instead is materialized as Arrow
Feather files in object store, addressed by a derivable logical
URI (noetl://<tenant>/<project>/results/<execution_id>/<step>/<frame>/<row>/<attempt>)
that resolves to a physical cell/shard key with zero central lookup;
state carries only the bounded extracted predicate block plus the
derivable coordinates, never the payload and never an opaque reference
string; and a crashed instance resumes from the last acked offset,
re-running re-runnable cycles and skipping a side-effecting cycle whose
result URN already exists.
The 2026-06-16 blueprint described one model spanning two halves. The event half landed through #103 + #115 and is live on prod:
- NATS-as-WAL for events. The worker publishes events to the noetl_events JetStream stream; the materializer (repos/worker/src/materializer.rs, durable consumer noetl_materializer, AckMode::Defer) drains the stream, POSTs /api/internal/events/project, and acks only on 2xx, making it the sole writer of noetl.event (materializer.rs:8,22,46,50,184,195).
- State off the WAL, no event scan. The off-server drive builds WorkflowState from the WAL via WalEventIndex + ExecutionChain, walking the prev_event_id chain from the server's authoritative tip (expected_head): repos/worker/src/state_builder.rs chain_walk_from, AdvanceOutcome. It never scans noetl.event (that table is audit-only under NOETL_EVENT_READ_PATH=audit_only).
- One-level chaining. ChainHeads.link_batch (repos/server/src/state.rs:230) stamps each event's prev_event_id; multi-replica coherence CAS-advances a shared NATS-KV head.
- Live on prod. The full off-server CQRS cutover (PUBLISH_ONLY=true + STATE_BUILDER=offserver) is live on prod as of server v3.39.5 (2026-06-22): server writes zero noetl.event rows, materializer is the sole writer, materializer-lag alert is the guardrail.
So the question #104 still owns is not "how do events become durable" — it's "how does result data become durable, addressable, and tiered" on top of that now-live WAL.
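The one-level chain walk listed above can be illustrated with a minimal sketch. It is not the code in state_builder.rs: the WalEvent struct, the WalkError enum, and the chain_walk function are hypothetical names, and the index is a plain HashMap standing in for WalEventIndex. The walk starts at the authoritative head and follows prev_event_id back to the root without scanning any table.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified event envelope: only the fields the walk needs.
#[derive(Debug, Clone, PartialEq)]
pub struct WalEvent {
    pub event_id: u64,
    /// One-level back-pointer; `None` marks the root of the execution.
    pub prev_event_id: Option<u64>,
}

#[derive(Debug, PartialEq)]
pub enum WalkError {
    /// A link points at an event the index has not seen yet.
    Dangling(u64),
    /// The chain loops back on itself (corrupt input).
    Cycle(u64),
}

/// Walk from `head` back to the root and return event ids oldest-first.
pub fn chain_walk(index: &HashMap<u64, WalEvent>, head: u64) -> Result<Vec<u64>, WalkError> {
    let mut ordered = Vec::new();
    let mut cursor = Some(head);
    while let Some(id) = cursor {
        // A loop-free chain can never visit more events than the index holds.
        if ordered.len() > index.len() {
            return Err(WalkError::Cycle(id));
        }
        let event = index.get(&id).ok_or(WalkError::Dangling(id))?;
        ordered.push(event.event_id);
        cursor = event.prev_event_id;
    }
    ordered.reverse();
    Ok(ordered)
}
```

A Dangling result would mean the local index is behind the head and needs to catch up from the WAL before state can be built; how the real drive handles that case is not shown here.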
Today an over-budget tool result is not in object store and not Feather. It is a JSONB row in Postgres:
- The inline-vs-reference threshold (default 100 KB, env NOETL_EVENT_RESULT_CONTEXT_MAX_BYTES) gates inline-vs-durable (repos/server/src/handlers/events.rs:3456).
- Over-budget results are written to noetl.result_store: one row per result, the payload in a data JSONB column (repos/server/src/db/queries/result_store.rs).
- They are addressed by a server-minted opaque ref noetl://execution/<eid>/result/<name>/<id> (events.rs:3463).
- The worker additively stamps the canonical logical URI reference.uri = noetl://default/default/results/<eid>/<step>/<frame>/<row>/<attempt> via ResultCoordinates::logical_uri() (command.rs:1547), but nothing consumes it; the server result-store path and the materializer still use the legacy opaque ref.
- A bounded, navigable extracted predicate block (≤ ~4 KB) is spliced inline in all tiers so the drive can evaluate when: / set: / cursor fan-out without a fetch (build_extracted / summarise_value, repos/worker/src/executor/command.rs:1859).
This leaves three concrete problems #104 closes:
- Result bytes bloat Postgres and the connection pool — the same small-synchronous-DB pressure CQRS removed for events, still present for result payloads. They belong in object store, not a JSONB column.
- Addressing is an opaque carried string, not derivable — replay, dedup, and topology-aware routing all want a name computed from identity, not a pointer threaded through state.
- There is no result-durability barrier — a crash mid-execution can re-run a side-effecting tool because nothing checks "is this cycle's result already durable" by URN existence.
Honest separation of what is settled, what this RFC proposes, and what is still open. (Open items are detailed in §7.)
| # | Tenet | Status |
|---|---|---|
| T1 | JetStream publish-ack is the event durability boundary; noetl.event is a derived audit projection. | Decided + shipped (#103/#115, live on prod). |
| T2 | The drive builds state by walking the one-level prev_event_id chain off the WAL; never scans noetl.event. | Decided + shipped (#115). |
| T3 | State carries the bounded extracted predicate block + derivable coordinates only: never payload, never an opaque ref string. | extracted shipped; drop-the-opaque-ref is proposed (this RFC). |
| T4 | Result names are a two-layer derivable scheme: a stable logical URI resolved to a physical cell/shard key. | Naming module shipped (locator); production adoption proposed. |
| T5 | Over-budget tabular results materialize to Arrow Feather in object store; non-tabular to JSON/Parquet; small stays inline. | Shadow writer MERGED (Phase B, flag default-off → inert in prod). Tabular → Feather, non-tabular → JSON (OQ3) at the derived §7 key, alongside result_store; nothing reads it until Phase C. |
| T6 | The durability barrier sits at side-effecting tool boundaries only; resume skips a cycle whose result URN already exists. | MERGED 2026-06-23 (Phase E; flag default-off → inert in prod). Tool-registry side_effecting classifier (tools#78 → noetl-tools v3.17.0) + worker barrier (worker#130 → noetl-worker v5.44.0) + rig (e2e#79); worker depends on the published noetl-tools 3.17. Tier-object existence signal shipped; inline-result ("cycle acked") signal is a follow-up. |
| T7 | All drain/materialise/resolve work runs as system-pool playbooks (plug-in ring), not bespoke Rust services. | Decided (system-pool ADR); materializer already runs this way. |
The "treat JetStream as the WAL" decision is implemented for events:
    atomic cycle (tool result | condition eval)
                │  worker emit path (compiled core)
                ▼
    noetl_events JetStream stream ── the WAL
    publish-ack = durable, replicated
                │
    ┌───────────┴────────────┐
    ▼                        ▼
    materializer pool        off-server drive (state_builder)
    → noetl.event (audit     → WalEventIndex + ExecutionChain;
      projection; sole         walks prev_event_id chain from
      writer; ack-after-2xx)   expected_head; NEVER scans noetl.event
- The local in-process buffer the blueprint describes is realized as the worker's WalEventIndex, a per-process read cache rebuilt from the retained noetl_events WAL on boot (ephemeral DeliverPolicy::All consumer; NOETL_STATE_BUILDER_DURABLE=1 to revert to a persisted cursor). It accelerates a process reading its own recent appends; it is not a shared cache.
- noetl.event is now a projection. Its role is audit + ad-hoc query + replay reconstruction, not the source of truth and not on the drive's read path.
The WAL today carries event envelopes. For the result-data tier, the same
WAL is the trigger source for a result materializer (a second
system-pool consumer of noetl_events, sibling to the event
materializer):
- It reads call.done events whose reference indicates an over-budget result, derives the physical Feather key from the logical URI (§4), and writes the Feather file to object store (§5).
- It is at-least-once and idempotent: dedup on event_id; overwrite at the resolved key (same property the event materializer already relies on via ON CONFLICT).
- Decision (proposed): the result materializer is a separate consumer from the event materializer, not folded into it, so result I/O (object-store latency) never back-pressures the noetl.event audit fold that the lag alert guards.
For events, publish-ack is already the boundary. For results, the boundary is the side-effecting-tool rule (§4.4): the result's publish-ack + URN existence is the commit point a resume checks. This is the one place the local-ahead-of-durable window has correctness bite, and it is scoped to side-effecting tools only — not every cycle.
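The at-least-once + idempotent property of the result materializer can be sketched as follows. This is an illustration only: ResultMaterializer and on_delivery are hypothetical names, the object store is an in-memory HashMap, and the dedup set is held in process memory, whereas a real consumer would need dedup state that survives restarts or would rely on the overwrite alone.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical in-memory stand-in for the consumer plus object store.
#[derive(Default)]
pub struct ResultMaterializer {
    seen_event_ids: HashSet<u64>,
    /// physical key -> payload bytes
    pub objects: HashMap<String, Vec<u8>>,
}

impl ResultMaterializer {
    /// Handle one delivery. Returns `true` if the object store was written.
    /// The caller acks the message in both cases.
    pub fn on_delivery(&mut self, event_id: u64, physical_key: &str, payload: &[u8]) -> bool {
        // Redelivery of an already-handled event: ack without writing.
        if !self.seen_event_ids.insert(event_id) {
            return false;
        }
        // Overwrite at the resolved key, so repeating the write is harmless
        // even if the dedup set was lost in a restart.
        self.objects.insert(physical_key.to_string(), payload.to_vec());
        true
    }
}
```

Because the key is derived from the event rather than allocated, a redelivered event always lands on the same object, which is what makes the overwrite safe.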
Result names reuse the platform's existing Resource Locator (§4/§7/§8 of
the Global Hybrid Supercluster Blueprint),
not an invented urn:noetl:result:... grammar. The module exists and
is unit-tested in repos/tools/src/locator.rs (noetl_tools::locator,
v3.12.0):
Layer 1 — stable logical URI (location-independent; what state carries; dedup/replay key on it; never renamed on migration/failover):
noetl://<tenant>/<project>/results/<execution_id>/<step>/<frame>/<row>/<attempt>@<version>
- tenant/project lead (multitenancy + sharding dimensions).
- frame/row are mandatory for cursor fan-out; without them two rows of one step collide. (ResultCoordinates carries both; command.rs threads cursor.{frame,row} into render_context.)
- attempt/@version disambiguate retries.
Layer 2 — topology resolution (derivable, then one small registry):
shard_key = FNV1a(tenant + project + execution_affinity) % shard_count
→ region + cell + shard (ResultCoordinates::shard_key)
physical = noetl/env=…/region=…/cell=…/shard=…/tenant=…/project=…/
date=…/execution=…/results/<step>/<frame>/<row>/<attempt>.<ext>
(ResultCoordinates::physical_key)
A consumer that knows (tenant, project, execution_id) derives the
home cell/shard and the object key with zero central lookup. The only
registry is the small, slow-changing cell endpoint map (cell →
provider/bucket/endpoint).
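A minimal sketch of the two-layer derivation, under stated assumptions. Coords and Placement are hypothetical types for illustration and are not the real ResultCoordinates API; the sketch assumes the 64-bit FNV-1a variant, takes execution affinity to be the execution id, joins the hash inputs with '/', and receives env, region, cell, and date as inputs rather than deriving them. The actual locator module may differ on every one of these points.

```rust
/// Hypothetical coordinate set for this sketch; the real
/// `ResultCoordinates` in noetl_tools::locator may differ.
pub struct Coords<'a> {
    pub tenant: &'a str,
    pub project: &'a str,
    pub execution_id: &'a str,
    pub step: &'a str,
    pub frame: u32,
    pub row: u32,
    pub attempt: u32,
}

/// Hypothetical topology inputs (assumed to come from config / the cell map).
pub struct Placement<'a> {
    pub env: &'a str,
    pub region: &'a str,
    pub cell: &'a str,
    pub date: &'a str,
    pub shard_count: u64,
}

/// 64-bit FNV-1a over raw bytes.
pub fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for b in bytes {
        hash ^= u64::from(*b);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    hash
}

impl Coords<'_> {
    /// Layer 1: the stable, location-independent name.
    pub fn logical_uri(&self) -> String {
        format!(
            "noetl://{}/{}/results/{}/{}/{}/{}/{}",
            self.tenant, self.project, self.execution_id,
            self.step, self.frame, self.row, self.attempt
        )
    }

    /// Layer 2a: shard from identity. Assumption: affinity = execution id.
    pub fn shard_key(&self, shard_count: u64) -> u64 {
        let input = format!("{}/{}/{}", self.tenant, self.project, self.execution_id);
        fnv1a64(input.as_bytes()) % shard_count.max(1)
    }

    /// Layer 2b: the object key, computed rather than looked up.
    pub fn physical_key(&self, p: &Placement<'_>, ext: &str) -> String {
        format!(
            "noetl/env={}/region={}/cell={}/shard={}/tenant={}/project={}/date={}/execution={}/results/{}/{}/{}/{}.{}",
            p.env, p.region, p.cell, self.shard_key(p.shard_count),
            self.tenant, self.project, p.date, self.execution_id,
            self.step, self.frame, self.row, self.attempt, ext
        )
    }
}
```

The point of the shape is that two processes holding the same coordinates and the same placement inputs compute the same key independently, so nothing has to be carried in state besides the coordinates.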
- Wired: the worker stamps reference.uri (the logical URI) additively on over-budget results (stamp_logical_uri, command.rs:1547).
- Unconsumed: nothing reads reference.uri yet. The server still mints + resolves the legacy opaque noetl://execution/... ref; the result store is still Postgres JSONB.
- Proposed (R02c + materializer): server/orchestrator parse accepts both shapes (back-compat); the materializer writes Feather at physical_key; the read path resolves the logical URI → physical key → object fetch. Flip minting so the logical URI is the stored ref only after the read path proves out.
The derivation gives (region, cell, shard); turning a cell into a
concrete bucket/endpoint needs the cell endpoint map. IMPLEMENTED
(Phase C, 2026-06-23): a small server-served, cacheable registry
(GET /api/internal/cells → cell → {provider, bucket, endpoint, region}), seeded from ops config — dozens of entries, not a per-fetch
SPOF (server#262 →
noetl-server v3.42.0; resolves OQ6). Single-cell deployments (today)
resolve every cell to the one configured bucket; the registry is the seam
that lets the grid grow without changing names. Default-off until a future
rollout enables the read path.
The load-bearing correctness decision. Three atomic kinds, different treatment:
| Atomic kind | Replay safety | Treatment |
|---|---|---|
| Condition evaluation | Pure — re-derives free | Local + async publish; never blocks. |
| Tool, no side effect | Re-runnable | Local + async publish. |
| Tool with side effect | Not replay-safe | Resume must not re-dispatch a cycle whose result URN already exists. |
Before a resumed execution re-dispatches a side-effecting tool for
(execution_id, step, frame, row, attempt), it checks whether the derived
result URN already exists (object HEAD / cycle acked). If it does, the
cycle is skipped and the recorded result adopted. Open (§7): how a
tool declares itself side-effecting — proposed as a tool-registry
attribute (side_effecting: true) so the barrier knows where to block.
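The three-way treatment in the table can be written as one small decision function. This is a sketch under assumptions: AtomicKind, ResumeAction, and resume_action are hypothetical names, and the existence probe is passed in as a closure so that the (potentially remote) object HEAD is only issued for side-effecting cycles.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum AtomicKind {
    Condition,
    PureTool,
    SideEffectingTool,
}

#[derive(Debug, PartialEq)]
pub enum ResumeAction {
    /// Pure condition: recompute locally.
    Rederive,
    /// Re-runnable tool: run it again.
    Rerun,
    /// Side-effecting tool whose result is already durable: adopt it.
    SkipAndAdopt,
    /// Side-effecting tool with no durable result: dispatch it.
    Dispatch,
}

/// Decide what a resumed execution does with one cycle.
/// `result_urn_exists` is only called for side-effecting tools.
pub fn resume_action(kind: AtomicKind, result_urn_exists: impl FnOnce() -> bool) -> ResumeAction {
    match kind {
        AtomicKind::Condition => ResumeAction::Rederive,
        AtomicKind::PureTool => ResumeAction::Rerun,
        AtomicKind::SideEffectingTool => {
            if result_urn_exists() {
                ResumeAction::SkipAndAdopt
            } else {
                ResumeAction::Dispatch
            }
        }
    }
}
```

Misclassifying a pure tool as side-effecting only costs an extra existence check, which is why a conservative default is cheap in this shape.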
- Stored (in object store): the result payload bytes, Arrow Feather for tabular and JSON/Parquet for non-tabular.
- Derived (never carried): the location, computed from the logical URI via physical_key.
- Carried inline (in the event, all tiers): the bounded extracted predicate block, enough to evaluate guards/fan-out without a fetch.
The size threshold that gates inline-vs-reference selects the tier:
- Small result → inline in the event; no object-store write.
- Over-budget tabular → Arrow Feather at the resolved key. Feather is the on-disk form of the Arrow IPC stream the worker already encodes (noetl_tools::arrow_codec::try_encode_tabular_json → TabularEncoding, media type ARROW_STREAM); mmap-able, zero-copy from pyarrow and the Rust arrow crate.
- Over-budget non-tabular (shell stdout, opaque HTTP JSON) → JSON or Parquet at the same resolved key. Open (§7): JSON vs Parquet as the fallback default.
- Idempotent overwrite: a fixed @version rewrites the same key; a bumped version keeps every attempt (§7 open question on the default).
- GC by prefix: per-execution / per-shard prefix delete on TTL or explicit cleanup; the system/scheduled_cleanup playbook gains a Feather-prefix sweep keyed off expires_at (already a result_store column today).
- No hot-path eviction concern: the drive never reads the Feather tier for guard evaluation (the extracted block covers that); only an explicit downstream consume (input: binding that needs the full payload) fetches it.
Three caches/stores, three distinct roles — the RFC keeps them separate:
- shm Arrow IPC cache (noetl-arrow-cache, lease expiry): same-node acceleration for a payload handed between colocated blocks. Volatile; not durable; survives this model unchanged.
- noetl.result_store (Postgres JSONB): today's durable store. The Feather tier replaces it for over-budget payloads; small inline results need no store at all. result_store may remain for a transition window (dual-write behind a flag) then retire.
- Feather in object store: the new durable, cross-node, cross-cell tier, addressed by derived URN, replicated/DR'd via the §8 location descriptor.
A WASM worked-example already proves the host-capability shape: the
reference-materializer plug-in
(repos/worker/plugins/reference-materializer/src/lib.rs) writes an
Arrow/Feather buffer to object store via the granted noetl.object_put
capability ring (it currently writes a hardcoded key
noetl/results/reference/0/0/1.feather). Production path (proposed):
generalize this into the system/result_materializer playbook that (a)
derives the real physical_key from the event's logical URI, (b) calls
object_put, and (c) is driven by the noetl_events WAL consumer — same
plug-in-ring shape as the event materializer, hot-reloadable via catalog
version bump (system-pool ADR Phase 4).
- Restart reads the read model + the WAL up to the last offset its executions were durably acked at (the off-server drive already does this from expected_head; the result materializer from its durable consumer cursor).
- Replay the local-only tail: pure conditions re-derive; re-runnable tools re-run; side-effecting tools are skipped if their result URN already exists (§4.4), else dispatched.
- Both materializers are at-least-once + idempotent: event materializer dedups on event_id (ON CONFLICT); result materializer overwrites at the resolved key. Neither reintroduces a noetl.event scan.
- Replay determinism: the location is computed from the envelope + shard function, never carried, so a replay of the same events resolves to the same keys. This is the property the carried opaque ref could not guarantee across migration/failover.
- OQ1: attempt/version semantics. Overwrite-on-retry (fix @version) vs keep-every-attempt (bump). Overwrite is GC-friendly; keep-every is better for forensic replay. Proposed default: overwrite; keep-every behind a debug flag.
- OQ2: local tail bound before back-pressure. How far local may run ahead of the last ack. Too far widens the replay window + idempotency burden; too tight approaches the synchronous write. Needs a configurable bound + a metric.
- OQ3: non-tabular fallback format. JSON vs Parquet for the over-budget non-tabular tier. Leaning JSON for round-trip simplicity; Parquet if columnar non-tabular shows up.
- OQ4: side-effect classification. RESOLVED static (Phase E, 2026-06-23). A static, per-kind tool-registry attribute (registry::kind_is_side_effecting; conservative default true, only noop/rhai false). A per-invocation predicate (http GET vs POST) is deliberately not done: because the barrier is adopt-only (it can only turn a duplicate side effect into one, never drop work), over-classification is safe, so the static flag is sufficient for correctness; per-invocation is a future optimization, not a requirement. (tools#78 / worker#130.)
- OQ5: result_store retirement. DECIDED 2026-06-23: metric-gated + a hard re-plumbing prerequisite (surfaced live by Phase D). Phase D ships the dual-write window (tier authoritative + result_store written as the reversible fallback); it does NOT itself stop the dual-write. The retirement decision is now settled along two axes:
  - When to drop the dual-write: metric-gated. Retire result_store only once noetl_worker_result_mint_authoritative_total{path="legacy_fallback"} holds 0 across a staging soak (the tier never misses), combined with a retention-period time floor (dual-write must run at least one full result retention period at flag-on so any in-flight resume can still fall back). Both must hold: the metric proves the tier is sufficient, the time floor protects in-flight resumes. (Chosen over pure time-bounding (a) and keep-indefinitely (c): the metric is the real signal; the time floor is the safety margin.)
  - Historical rows: let existing noetl.result_store payloads age out via expires_at while the tier owns all new mints (no bulk back-migration).
  A hard prerequisite that gates the actual retirement (not just a policy choice, and NOT done): the materializer fetches the over-budget payload from result_store today, so retiring it requires re-plumbing how the tier write obtains the bytes (e.g. the producer stages directly to the tier, or the materializer reads the inline over-budget payload off the event). Until that re-plumbing lands, dropping result_store would starve the tier writer of its byte source. So even with the metric gate green, retirement is blocked on this prerequisite, which is itself out of Phase D scope and not yet implemented.
- OQ6: cell endpoint registry ownership. RESOLVED (Phase C, 2026-06-23). Server-served: the registry is owned by the server and exposed at GET /api/internal/cells; the resolve-by-URN read path queries it (with the worker-side resolve path consuming the result), default-off until a future rollout. (server#262 → noetl-server v3.42.0; worker#128 → noetl-worker v5.42.0.)
- OQ7: locator dependency shape (surfaced in Phase A, 2026-06-22). Phase A followed RFC §8's "server adds the noetl-tools dep," but noetl-tools has no feature gate that exposes just the pure-std locator module: duckdb (bundled C++), kube, arrow-flight, tonic, rhai, gcp_auth are non-optional core deps, so the entire tool-registry graph lands in the control-plane server (the musl-static image still builds clean, so it's a cost not a blocker). Proposed: extract locator into a slim, dependency-light noetl-locator crate that both noetl-tools and the server depend on: same single source of truth, none of the weight. Decide before Phase B/C add more locator call sites server-side. (Resolved in the Phase A change set, which extracted noetl-locator.)
- Risk: object-store latency on the result path. Mitigated by keeping the result materializer a separate consumer (§3.2) so it never back-pressures the event audit fold / lag alert.
- Risk: two writers of the same key. Avoided: exactly one result materializer instance per shard owns the write (execution-affinity, the same single-owner property #116 established for the drive).
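The OQ5 retirement rule combines three conditions, which a short predicate makes explicit. This is a sketch, not shipped code: RetirementInputs and may_retire_result_store are hypothetical names, and expressing the time floor in whole days is an assumption made for illustration.

```rust
/// Hypothetical snapshot of the signals OQ5 names.
pub struct RetirementInputs {
    /// Observed value of the legacy_fallback path counter over the soak.
    pub legacy_fallback_total: u64,
    /// How long the dual-write has run with the flag on (assumed unit: days).
    pub dual_write_days: u32,
    /// One full result retention period (assumed unit: days).
    pub retention_days: u32,
    /// Whether the tier writer no longer reads its bytes from result_store.
    pub byte_source_replumbed: bool,
}

/// All three must hold before the dual-write can be dropped.
pub fn may_retire_result_store(i: &RetirementInputs) -> bool {
    let metric_gate = i.legacy_fallback_total == 0; // the tier never missed
    let time_floor = i.dual_write_days >= i.retention_days; // in-flight resumes covered
    i.byte_source_replumbed && metric_gate && time_floor
}
```

With the prerequisite unmet the predicate stays false regardless of the metric, which matches the stated position that retirement is blocked until the byte-source re-plumbing lands.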
Phased + reversible, mirroring the #103 / #115 program. Each phase is behind a flag and kind-validated before the next.
- R-naming (done). noetl_tools::locator module + worker stamps reference.uri additively. ✅
- Phase A: server accepts the canonical URI (R02c). ✅ MERGED (2026-06-22), flag default-off, inert in prod until a future server rollout. Server parses both legacy and canonical shapes via the slim noetl-locator (parse_result_ref + ResultRef); a shadow-accept hook on the normalize_event_to_row chokepoint validates reference.uri and records noetl_result_uri_accept_total{outcome}, never failing the event. Gated behind NOETL_RESULT_URI_ACCEPT (default off / byte-identical no-op); no schema change (the URI is already persisted in the reference JSON). OQ7 resolved in the same change set: the locator was extracted into the lean, dependency-free noetl-locator crate (pure std) so the control plane parses the URI without pulling noetl-tools' heavy graph (duckdb/kube/arrow/tonic/rhai/gcp_auth confirmed absent from the server tree); noetl-tools re-exports it as noetl_tools::locator so the worker stamp path is unchanged. Shipped: noetl/tools#76 (→ noetl-tools 3.15.0 + new noetl-locator 0.1.0 on crates.io) + noetl/server#260 (→ noetl-server 3.40.0) + noetl/e2e#75 (rig). Repos: tools, server, e2e.
- Phase B: result materializer writes Feather (flagged, shadow). ✅ MERGED (2026-06-22), flag default-off → inert in prod until a future rollout enables it. A separate noetl_events consume-loop on the system pool (consumer noetl_result_materializer, own ack cursor) resolves an over-budget result's payload (read-only), tiers it (tabular → Arrow Feather, non-tabular → JSON [OQ3], small → no write), and writes the body to the derived §7 physical_key via the server-mediated PUT /api/internal/objects/{key}: shadow, alongside noetl.result_store; nothing reads it yet (Phase C). Keep-every attempt URN (OQ1); single-cell seed for write-side URN resolution (OQ6's multi-cell/miss part deferred to Phase C). Gated behind NOETL_RESULT_MATERIALIZER_ENABLED (default off / true no-op); never alters the authoritative result, never fails an event. Kind gate-ON: tabular → real Arrow .feather, non-tabular → .json at the derived key, flag-off Δ0, event-materializer sole-writer intact every leg. Merged: worker#127 (→ noetl-worker 5.41.0) + tools#77 (→ noetl-tools 3.16.0 + new noetl-locator 0.1.1 on crates.io) + server#261 (→ noetl-server 3.41.0) + ops#203 + e2e#76. Repos: worker, tools, server, ops, e2e.
- Phase C: resolve-by-URN read path. MERGED 2026-06-23 (all flags default-off → inert in prod until a future rollout). The input:/consume side resolves the logical URI → physical key → object fetch, with the shm Arrow cache in front for same-node hits. The server side adds a GCS object backend + a cell-endpoint registry + GET /api/internal/cells (resolves OQ6); the worker side adds the resolve-by-URN read path (references-in-state behavior, flatten_single_tool_result) + fixes B/B1; 38 worker unit tests. No new crate publish: worker resolves noetl-tools 3.16.0 + adds the published arrow = "53" direct dep, server resolves noetl-locator 0.1.1, both from the registry (no git/branch dep → no repoint). Merged in dependency order: server#262 (→ noetl-server 3.42.0 c2d5ca9) + worker#128 (→ noetl-worker 5.42.0 7971041) + e2e#77 (39dc880, 3-pass rig + fake-gcs). Repos: server, worker, e2e.
- Phase D: minting flip + result_store dual-write window. ✅ MERGED 2026-06-23 (server#263 → noetl-server 3.43.0 6f6b9ef, worker#129 → noetl-worker 5.43.0 be6863a, ops#204 b19b759, e2e#78 07e85aa), flag default-off → inert in prod. One flag, NOETL_RESULT_MINT_AUTHORITATIVE, makes the URN → Feather/GCS tier the authoritative result store: the materializer becomes the authoritative tier writer (implies the Phase B flag), resolve-by-URN becomes the primary consume read path (implies the Phase C flag), and the server keeps writing noetl.result_store as the reversible dual-write fallback leg (counted on noetl_result_store_dual_write_total). A tier miss falls back fail-safe to the dual-written store (rollback safety), recorded on noetl_worker_result_mint_authoritative_total{path}. The tier write stays worker-side because the slim control plane cannot encode Feather (OQ7); the async producer→materializer window is exactly what the dual-write covers. The retirement of result_store (stopping the dual-write) is OQ5: DECIDED metric-gated (§OQ5), gated on a not-yet-done byte-source re-plumbing prerequisite, NOT Phase D; flag-off rolls back cleanly. Repos: server#263 · worker#129 · ops#204 · e2e#78.
- Phase E: side-effect durability barrier. ✅ MERGED 2026-06-23 (flags default-off → inert in prod; ai-meta pointers bumped; tools#78 → noetl-tools v3.17.0, worker#130 → noetl-worker v5.44.0, e2e#79). Tool-registry side_effecting classification + a barrier in the worker: before (re-)dispatching a side-effecting cycle, if the cycle's derived result URN already resolves to a durable result (Phase C read path), the worker SKIPS re-execution and adopts the recorded result, so an external side effect fires exactly once across a crash-resume / re-drive. Non-side-effecting cycles are never blocked. Gated NOETL_SIDE_EFFECT_BARRIER (default off → true no-op). The gate looks through the orchestrator's task_sequence wrapper (side-effecting iff any sub-task is), so noop/rhai steps are exempt. Adopt-only safety: resolve_by_urn returns Some only on a durable hit, so the barrier can only turn a duplicate side effect into one, never drop work, which is why OQ4 resolves to static classification (per-invocation is an optimization, not a correctness need). attempt=1 is fixed, so the barrier keys on durable-success existence at the coordinate, not the attempt number → OQ1 keep-every-attempt + #125 retry compose cleanly (retry-after-failure re-executes; resume-after-success skips). Metric noetl_worker_side_effect_barrier_total{outcome,tool}. Kind gate-ON 3-pass green (prod-exact off-server gate + fake-gcs, deterministic forged re-drive + marker-object side-effect counter): PASS A flag-on re-drive SKIPPED (marker stays 1, barrier{skipped} Δ>0); PASS B flag-off RE-EXECUTES (marker 1→2, metric Δ0); PASS C noop re-drive never checked (Δ0); invariants (sole-writer, roots=1, dangling=0, terminal=1) intact. Server unchanged (worker-only; reuses the Phase C GCS backend + cell registry). Scope follow-up (not a blocker): the implemented existence signal is the tier-object half of §4.4's "object HEAD / cycle acked"; small/inline side-effecting results (not tiered) re-execute today; the event-completion signal for inline results is a clean Phase-E follow-up. tools#78 · worker#130 · e2e#79. Repos: tools, worker, e2e (server not needed).
- Phase F: GC + DR. ✅ MERGED 2026-06-23 (flags default-off → inert in prod; ai-meta pointers bumped; server#264 → noetl-server v3.44.0, worker#131 → noetl-worker v5.45.0, ops#205, e2e#80). The final build phase.
  - GC (server): POST /api/internal/result-tier/gc, gated NOETL_RESULT_TIER_GC (default off → disabled no-op), dry_run default true. A conservative sweeper that reclaims only provably-dead objects: an object whose §7 key parses an execution=<eid> segment, whose execution has no surviving noetl.event row (aged out / orphan), and which is past a grace window (from the eid mint time). It never deletes a live-referenced object: the verdict (decide) skips any object whose execution still has events, unit-tested. Object backend gains list/delete; metric noetl_result_tier_gc_total{outcome}. The system/scheduled_cleanup playbook calls it (double-gated, dry-run by default).
  - DR (worker): the tier is derivable from the WAL, so the result materializer gains a verify-and-repair mode (NOETL_RESULT_TIER_DR, default off): a missing/corrupt object is rebuilt from its source byte-identically (deterministic encode) by re-running the materialization for its URN; healthy objects are left untouched; never alters the authoritative result_store source. Metric noetl_worker_result_tier_dr_total{outcome}.
  - Scoped OUT (reported, not guessed): attempt-version reaping (OQ1, open) and result_store retirement (OQ5); tier GC keys off execution retention, independent of both.
  - Kind gate-ON 5-pass green (off-server gate + fake-gcs): GC-1 dry-run lists the dead orphan + skips the live object (live_candidates=0) + deletes nothing; GC-2 delete reclaims only the orphan, the live object survives + serves; GC-3 flag-off no-op; DR-1 a deleted referenced object re-derived byte-identically (sha256 match) + served by the read path; DR-2 flag-off no-op. Invariants intact every real execution.
  - Merge order (Phase E → F), done: Phase E merged first (tools#78 published noetl-tools 3.17.0; worker#130 repointed onto the published crate, no patch); then worker#131 + e2e#80 were rebased off feat/104-phase-e-barrier onto main and merged. server#264 + ops#205 (Phase-E-independent) merged alongside. worker#130/#131 depend on the published noetl-tools 3.17.
  - noetl/server#264 · noetl/worker#131 · noetl/ops#205 · noetl/e2e#80. Repos: server, worker, ops, e2e.
With Phase F MERGED the BUILD phases A–F are complete. What remains on
#104 is operational only: (1) prod GCS infra provisioning, (2) the OQ5
byte-source re-plumbing prerequisite to retiring result_store, and (3) the
staged prod enablement + minting cutover. No more build phases.
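The conservative GC verdict described under Phase F can be sketched as a pure function. The names below mirror the prose but are hypothetical: Verdict, execution_id_from_key, and this decide signature are illustrations, the event-existence check is injected as a closure, and object age is passed in as seconds since the eid mint time.

```rust
#[derive(Debug, PartialEq)]
pub enum Verdict {
    Keep,
    Reclaim,
}

/// Extract `<eid>` from an `execution=<eid>` path segment, if present.
pub fn execution_id_from_key(key: &str) -> Option<&str> {
    key.split('/')
        .find_map(|segment| segment.strip_prefix("execution="))
        .filter(|eid| !eid.is_empty())
}

/// Reclaim only provably-dead objects; every doubtful case is kept.
pub fn decide(
    key: &str,
    execution_has_events: impl Fn(&str) -> bool,
    age_secs: u64,
    grace_secs: u64,
) -> Verdict {
    match execution_id_from_key(key) {
        // Key does not name an execution: not ours to judge, keep it.
        None => Verdict::Keep,
        // Execution still has event rows: live-referenced, keep it.
        Some(eid) if execution_has_events(eid) => Verdict::Keep,
        // Orphaned but still inside the grace window: keep for now.
        Some(_) if age_secs <= grace_secs => Verdict::Keep,
        // Orphaned and past the grace window: safe to reclaim.
        Some(_) => Verdict::Reclaim,
    }
}
```

Keeping the verdict free of I/O is what makes it straightforward to unit-test and to run in dry-run mode, where the caller lists the Reclaim verdicts without deleting anything.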
Each phase ships its span + metric + execution_id correlation in the
same change set (observability rule). Kind-validation gate per
deployment-validation.md before any GKE rollout.
- #103 (CQRS split): CLOSED, live on prod. Provides the stream, the materializer, the sole-writer guarantee. This RFC's result materializer is a sibling consumer on the same stream. No blocker.
- #101 (references-in-state): CLOSED, reframed by #115. Provides the extracted predicate block (kept). Its carried reference.ref is the thing this RFC supersedes by derivation. No blocker.
- #115 (decoupled context + event chain): in progress, far along. Provides T1–T3 (chain walk, never-scan, reference-only schema). #104's result-data tier is the storage substrate #115 Phase 1's "schema-holds-refs-only" assumes. Sequencing: Phase A (canonical URI acceptance) should land after or with #115's reference-only schema so both agree the event carries {logical_uri, extracted} only. #115's atomic-item context (Phase 5) and #104's resolve-by-URN read path (Phase C) touch the same worker consume path; coordinate to avoid a merge collision.
- #107 (server dissolution → global grid): program roof. #104 is program step 3 ("per-shard NATS-as-WAL; Postgres demoted to a derived projection"). The event half of step 3 is done (Postgres-event is now a projection); #104's result-data tier finishes step 3 by demoting the result store too. Step 4 ("drop Postgres-as-source-of-truth → projection-on-demand") depends on #104 Phase D landing. The cell endpoint registry (§4.3) is the seam step 5 (cross-shard federation) builds on. #104 is on the critical path for #107 steps 4–5.
Prerequisites before starting Phase B: (1) #115 reference-only schema
ratified so the event shape is settled; (2) a default tenant/project
source threaded (single-tenant default/default today is fine for Phase
A–C); (3) the object-store capability (object_put) confirmed available to
the system pool in the target cluster (the worked example proves the kind
path).
- Cross-shard federation / NATS superclusters (#107 step 5): the URN scheme is designed to support it (topology resolves from the name), but the federation routing + consistency design is its own note.
- The shm Arrow cache internals (`noetl-arrow-cache`): unchanged by this RFC; same-node acceleration only.
- Event-WAL mechanics: settled and shipped (#103/#115); this RFC does not revisit publish-ack, the event materializer, or the chain walk.
- Multi-tenant tenant/project provisioning: the URN carries the coordinates; how tenants are created/billed is elsewhere.
- Re-charting the prod cutover: the off-server CQRS cutover is live; this is design-only and changes no prod default.
| Date | What | Pointer |
|---|---|---|
| 2026-06-23 | Phases E + F MERGED to main: the FINAL build phases; repo-only, flags default-off → inert in prod. Phase E: tools#78 → noetl-tools v3.17.0 (`registry::kind_is_side_effecting`; members noetl-directives + noetl-locator re-published first), worker#130 barrier → noetl-worker v5.44.0 (repointed onto published 3.17, no patch), e2e#79. Phase F: server#264 GC sweeper → noetl-server v3.44.0, worker#131 DR re-derive → noetl-worker v5.45.0 (rebased onto main), ops#205, e2e#80. ai-meta pointers + deployment-spec wiki rows bumped. A–F build complete; #104 stays OPEN (operational items remain). | tools 1d49dd5 · server 341b614 · worker dd07016 · ops 26185ff · e2e d7372be · #104 |
| 2026-06-23 | Phase F (GC + DR) implemented + kind-validated (→ MERGED 2026-06-23, see row above) (the final build phase; flags default-off → inert, PRs unmerged, no ai-meta pointer bump). GC (server): a conservative dry-run-first `POST /api/internal/result-tier/gc` sweeper (gated `NOETL_RESULT_TIER_GC`) reclaiming only provably-dead tier objects (execution aged out of `noetl.event`, past a grace window), never a live-referenced one (unit-tested `decide`); object backend list/delete; `system/scheduled_cleanup` GC step (double-gated). DR (worker): result-materializer verify-and-repair mode (`NOETL_RESULT_TIER_DR`) that rebuilds a missing/corrupt object from its WAL-derivable source byte-identically. OQ1 version-reaping + OQ5 retirement deliberately scoped OUT. Kind 5-pass green (GC dry-run/delete safety, DR byte-identical sha256 match, flag-off no-ops, invariants intact). A–F build phases complete; only operational items remain. Worker/e2e stacked on the unmerged Phase E branch (merge Phase E first). | server#264 · worker#131 · ops#205 · e2e#80 · #104 |
| 2026-06-23 | Phase E implemented + kind-validated (→ MERGED 2026-06-23, see row above) (flag default-off → inert in prod; PRs unmerged, no ai-meta pointer bump, server unchanged). The side-effect durability barrier (RFC §4.4 / T6): tool-registry `side_effecting` classifier + a worker barrier that, before re-dispatching a side-effecting cycle whose result URN already resolves to a durable result, SKIPS re-execution and adopts it, so a side effect fires exactly once across a crash-resume / re-drive. Gated `NOETL_SIDE_EFFECT_BARRIER`. OQ4 resolved → static (adopt-only makes over-classification safe). Kind gate-ON 3-pass green (forged re-drive + marker-object counter): flag-on re-drive SKIPPED (marker stays 1), flag-off RE-EXECUTES (1→2), noop re-drive never checked; invariants intact. Inline-result ("cycle acked") signal is a follow-up. | tools#78 · worker#130 · e2e#79 · #104 |
| 2026-06-23 | Phase D MERGED (all 4 PRs squash-merged in dependency order server → worker → ops → e2e; flags default-off → inert in prod). The minting flip ships the dual-write window. No new crate publish: server resolves noetl-locator 0.1.1, worker resolves noetl-locator/noetl-events/noetl-tools from the registry (no Cargo.toml dep change in either PR → no repoint). semantic-release cut noetl-server 3.43.0 + noetl-worker 5.43.0. ai-meta pointers bumped. OQ5 DECIDED metric-gated (drop dual-write once `mint_authoritative_total{path=legacy_fallback}` holds 0 across a staging soak + retention time floor), gated on a not-yet-done byte-source re-plumbing prerequisite (materializer fetches the payload from `result_store` today). Prod minting cutover is a separate next task. #104 stays OPEN (E/F + minting cutover remain). | server#263 (3.43.0 6f6b9ef) · worker#129 (5.43.0 be6863a) · ops#204 (b19b759) · e2e#78 (07e85aa) · #104 |
| 2026-06-23 | Phase D implemented + kind-validated, in review (flag default-off → inert in prod; no ai-meta pointer bump, PRs unmerged). The minting flip: one flag `NOETL_RESULT_MINT_AUTHORITATIVE` makes the URN → Feather/GCS tier the authoritative result store (materializer = authoritative tier writer, resolve-by-URN = primary consume path), with `noetl.result_store` kept as the reversible dual-write fallback (`noetl_result_store_dual_write_total`; consume fallback on `noetl_worker_result_mint_authoritative_total{path}`). Kind gate-ON 3-pass green: PASS 1 tier-authoritative (gcs put Δ4) + dual-write (row + Δ1) + resolve-from-tier (gcs get Δ2, mint{tier} Δ2), 1200 rows; PASS 2 flag-off no-op (all Δ0) + parity; PASS 3 forced tier-miss (object deleted pre-consume) → rollback to result_store (mint{legacy_fallback} Δ1, fallback_object_miss Δ1), 1200 rows. Sole-writer intact every leg. OQ5 (result_store retirement window) surfaced as the open decision before the prod cutover; not decided here. | server#263 · worker#129 · ops#204 · e2e#78 · #104 |
| 2026-06-23 | Phase C MERGED (all flags default-off → inert in prod until a future rollout enables the resolve-by-URN read path). The result-DATA read half: server gains a GCS object backend + a cell-endpoint registry + `GET /api/internal/cells` (resolves OQ6); worker gains the resolve-by-URN read path (references-in-state behavior, `flatten_single_tool_result`) + fixes B/B1 (38 unit tests). All 3 PRs squash-merged in dependency order (server → worker → e2e). No new crate publish: worker resolves noetl-tools 3.16.0 + adds the published `arrow = "53"` direct dep; server resolves noetl-locator 0.1.1; both from the registry (no git/branch dep → no repoint). semantic-release cut noetl-server 3.42.0 + noetl-worker 5.42.0. ai-meta pointers bumped. | server#262 (3.42.0 c2d5ca9) · worker#128 (5.42.0 7971041) · e2e#77 (39dc880) · #104 |
| 2026-06-22 | Phase B MERGED (flag default-off → inert in prod until a future rollout enables `NOETL_RESULT_MATERIALIZER_ENABLED`). All 5 PRs squash-merged in dependency order: noetl-locator member bumped 0.1.0→0.1.1 pre-merge so the additive `ResultCoordinates::parse`/`from_locator` API publishes; semantic-release cut noetl-tools 3.16.0 + published noetl-locator 0.1.1 (member-publish ordering); noetl-server 3.41.0 (sibling `noetl_result_materializer` consumer) + noetl-worker 5.41.0 (shadow consume-loop) cut. No downstream repoint needed: worker resolves noetl-tools ^3.14.2 + a self-contained local inversion; server resolves noetl-locator ^0.1.0; both from the registry, no git/branch dep. ai-meta pointers bumped. | tools#77 (3.16.0/locator 0.1.1) · server#261 (3.41.0) · worker#127 (5.41.0) · ops#203 · e2e#76 · #104 |
| 2026-06-22 | Phase B implemented + kind-validated, in review (flag default-off, inert in prod). Separate `noetl_result_materializer` consume-loop on the system pool writes the over-budget result tier (tabular → Arrow Feather, non-tabular → JSON, small → inline no-op) to the derived §7 key in shadow, alongside `noetl.result_store`; keep-every URN (OQ1), single-cell seed (OQ6 multi-cell deferred to C). Never alters the authoritative result, never fails an event. Kind gate-ON: tabular → real `.feather` (269 KB, Arrow magic), non-tabular → `.json`, flag-off Δ0, event-materializer sole-writer intact every leg. | worker#127 · tools#77 · server#261 · ops#203 · e2e#76 · #104 |
| 2026-06-22 | Phase A MERGED (flag default-off, inert in prod until a future server rollout enables it). Slim dependency-free noetl-locator 0.1.0 extracted + published to crates.io (resolves OQ7: heavy graph stays off the control plane; noetl-tools re-exports it); server accepts the canonical result URI behind `NOETL_RESULT_URI_ACCEPT`. Dep repointed git→0.1.0 pre-merge; 623 server tests green, heavy crates (duckdb/kube/arrow/tonic/rhai/gcp_auth) absent. | tools#76 (noetl-tools 3.15.0) · server#260 (noetl-server 3.40.0) · e2e#75 · #104 |
| 2026-06-22 | Phase A implemented + kind-validated, in review. Server accepts the canonical result URI (`parse_result_ref` accepts both shapes via `noetl_tools::locator`; shadow-accept hook on `normalize_event_to_row` records `noetl_result_uri_accept_total{outcome}`, never fails the event; gated `NOETL_RESULT_URI_ACCEPT`, default off / no-op; no schema change). Kind gate-ON: flag-on {canonical} +1, flag-off Δ0, both COMPLETED, sole-writer intact. Surfaced OQ7 (slim noetl-locator crate: noetl-tools drags its whole graph into the server). | server#260 · e2e#75 · server.wiki@7993b02 · #104 |
| 2026-06-22 | RFC upgraded for review. Reconciled the 2026-06-16 blueprint with the now-live state: event-WAL half shipped (materializer sole writer, off-server chain-walk, `prev_event_id`, off-server CQRS cutover live on prod v3.39.5); RFC re-scoped to the result-DATA half (derivable-URN Feather tier, resolve-by-URN read path, side-effect barrier). Decided/proposed/open separated; phased plan A–F; #107 critical-path sequencing noted. Review requested on #104. | this page · #104 |
| 2026-06-19 | Off-server-drive × gate reconciliation PROVEN. Gate-ON (`PUBLISH_ONLY=true`) with the off-server drive (`PLUGIN_DRIVE=true`) + materializer sole writer green on kind; server rebuilds bounded state from the committed log (read-your-writes via relocated trigger), cold-cache crash-recovery via WAL rebuild. | server#238 (v3.29.2) · e2e#61 |
| 2026-06-16 | R02b: worker stamps `reference.uri` (the stable logical URI `.../<frame>/<row>/<attempt>`), additive; first consumer is the materialiser. | worker#98 · PR worker#99 |
| 2026-06-16 | R02 / R02a: fan-out coordinate (tools). `ResultCoordinates` gains `row`; logical URI + §7 key → `.../<frame>/<row>/<attempt>`. v3.12.0. | tools#69 · PR tools#70 |
| 2026-06-16 | Round 01: naming foundation. Shared `noetl_tools::locator` (`ResourceLocator`, `ResultCoordinates`, `shard_key`, `CellPlacement`, legacy parse). 12 unit tests, not yet wired. v3.11.0. | tools#67 · PR tools#68 |
| 2026-06-16 | Umbrella opened; design blueprint landed + topology aligned with the Resource Locator. | #104 · docs PR docs#180 |
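The Phase E entry above describes the side-effect barrier in prose; a minimal Rust sketch of that decision may help when reading it. It is a sketch under stated assumptions, not the worker's implementation: `Dispatch`, `barrier`, the tool kinds listed in the classifier, the placeholder URN keys, and the in-memory map standing in for the durable tier are all hypothetical. Only the behavior (adopt when a durable result already resolves, execute otherwise, never check non-side-effecting kinds, re-execute with the flag off) is taken from the changelog.

```rust
use std::collections::HashMap;

/// Outcome of the barrier check for one cycle (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum Dispatch {
    /// A durable result already resolves for this URN: adopt it, skip the tool.
    Adopt(Vec<u8>),
    /// Run the tool: first run, non-side-effecting kind, or barrier disabled.
    Execute,
}

/// Stand-in for the registry's static classifier. The kinds listed here are
/// invented for illustration only.
pub fn kind_is_side_effecting(kind: &str) -> bool {
    matches!(kind, "http_post" | "email" | "transfer")
}

/// Decide whether a (re-)dispatched cycle should run or adopt a prior result.
/// `tier` is an in-memory stand-in for the durable result tier.
pub fn barrier(
    flag_on: bool,
    kind: &str,
    result_urn: &str,
    tier: &HashMap<String, Vec<u8>>,
) -> Dispatch {
    // Flag off, or a kind without side effects: the tier is never consulted.
    if !flag_on || !kind_is_side_effecting(kind) {
        return Dispatch::Execute;
    }
    match tier.get(result_urn) {
        Some(bytes) => Dispatch::Adopt(bytes.clone()),
        None => Dispatch::Execute,
    }
}
```

Because the barrier only ever adopts an existing result and never suppresses a first run, classifying a harmless kind as side-effecting costs a lookup rather than correctness, which is consistent with the "over-classification is safe" note on OQ4.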
- Enable `NOETL_RESULT_URI_ACCEPT` on a future server rollout: Phase A is merged and inert (flag default-off). Flipping it on a kind/staging rollout exercises the shadow-accept hook before Phase B builds on it.
- Review the rest of the RFC: platform-owner ratification on #104 (decided vs proposed vs open; the §9 #107 critical-path claim; OQ1/OQ3/OQ4 defaults).
- Coordinate Phase B with #115's reference-only schema so the event shape (`{logical_uri, extracted}` only) is agreed before the result materializer writes Feather.
- Phase B is MERGED (2026-06-22): worker#127 / tools#77 / server#261 / ops#203 / e2e#76 landed, ai-meta pointers bumped. The Feather tier stays inert in prod until a future rollout enables `NOETL_RESULT_MATERIALIZER_ENABLED`.
- Phase C is MERGED (2026-06-23): server#262 / worker#128 / e2e#77 landed, ai-meta pointers bumped. The resolve-by-URN read path (GCS object backend + cell registry + `GET /api/internal/cells` server-side; worker resolve-by-URN + fixes B/B1, closes OQ6) stays inert in prod until a future rollout enables the read path. No new crate publish was needed.
- Phase D (minting flip) is MERGED (2026-06-23): server#263 → `noetl-server 3.43.0` `6f6b9ef`, worker#129 → `noetl-worker 5.43.0` `be6863a`, ops#204 `b19b759`, e2e#78 `07e85aa`; flag default-off → inert in prod, ai-meta pointers bumped. No new crate publish. The tier is authoritative; `result_store` is the reversible dual-write fallback. Prod minting cutover is a separate next task (rolls server 3.43.0 + worker 5.43.0 to GKE).
- OQ5 (`result_store` retirement window) is DECIDED as metric-gated (2026-06-23): drop the dual-write once `noetl_worker_result_mint_authoritative_total{path="legacy_fallback"}` holds 0 across a staging soak (tier never misses) plus a retention-period time floor; historical rows age out by `expires_at` (no back-migration). The actual retirement is blocked on a not-yet-done prerequisite: the materializer fetches the over-budget payload from `result_store` today, so the tier-write byte source must be re-plumbed first. Phase D ships dual-write only; that prerequisite is out of Phase D scope and not yet implemented.
- Phase E (side-effect durability barrier) is MERGED (2026-06-23): tools#78 (`side_effecting` classifier) → noetl-tools v3.17.0, worker#130 (the barrier, `NOETL_SIDE_EFFECT_BARRIER`) → noetl-worker v5.44.0, e2e#79 (the rig); flag default-off → inert in prod, server unchanged, ai-meta pointers bumped. Phase-E follow-up: the inline-result ("cycle acked") existence signal for small side-effecting results (the tier-object signal is shipped).
- Phase F (GC + DR) is MERGED (2026-06-23): server#264 (GC sweeper + endpoint, `NOETL_RESULT_TIER_GC`) → noetl-server v3.44.0, worker#131 (DR verify-and-repair, `NOETL_RESULT_TIER_DR`) → noetl-worker v5.45.0, ops#205 (GC step + flags), e2e#80 (the 5-pass rig); flags default-off → inert in prod. This is the final build phase.
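The Phase F GC rule (reclaim only provably dead objects, dry-run first, never touch a live-referenced one) can be sketched as a pure decision function. The changelog mentions a unit-tested `decide` on the server; the version below is a hypothetical stand-in written for illustration, and `TierObject`, `GcDecision`, and the field names are assumptions rather than the server's types.

```rust
use std::time::Duration;

/// What the sweeper would do with one tier object (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum GcDecision {
    Keep,
    /// Dry-run mode: report the object as reclaimable, delete nothing.
    WouldDelete,
    Delete,
}

/// Facts the sweeper is assumed to know about a tier object.
pub struct TierObject {
    /// True if any live execution still references the object.
    pub live_referenced: bool,
    /// How long ago the owning execution aged out of `noetl.event`;
    /// `None` while the execution is still present there.
    pub aged_out_for: Option<Duration>,
}

/// Conservative rule: delete only when the object is unreferenced AND its
/// execution aged out longer ago than the grace window. Anything else is kept.
pub fn decide(obj: &TierObject, grace: Duration, dry_run: bool) -> GcDecision {
    if obj.live_referenced {
        return GcDecision::Keep;
    }
    match obj.aged_out_for {
        Some(age) if age > grace => {
            if dry_run {
                GcDecision::WouldDelete
            } else {
                GcDecision::Delete
            }
        }
        _ => GcDecision::Keep,
    }
}
```

Every uncertain case falls through to `Keep`, so a sweeper built this way can only err toward retaining storage, never toward deleting data a consumer might still resolve.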
The BUILD phases A–F of #104 are now complete (Phase F merged 2026-06-23). The
remaining open items on #104 are operational, not build: (1) prod GCS infra
provisioning; (2) the OQ5 byte-source re-plumbing prerequisite to retiring
result_store (the materializer still fetches the over-budget payload from
result_store); (3) the staged prod enablement + minting cutover (roll the A–F
flags on in order, each behind its kind gate). #104 stays OPEN for those.
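The OQ5 retirement decision is a conjunction of three conditions, which a short Rust sketch makes explicit. This is one reading of the rule, not shipped code: `SoakWindow`, `may_drop_dual_write`, and the idea of passing scraped counter samples are assumptions, and the retention floor is interpreted here as a minimum soak duration.

```rust
use std::time::Duration;

/// Observations gathered over a staging soak (hypothetical shape).
pub struct SoakWindow {
    /// Scraped values of the mint counter with path="legacy_fallback".
    pub legacy_fallback_samples: Vec<u64>,
    /// How long the soak has run.
    pub elapsed: Duration,
    /// Whether the tier-write byte source no longer reads result_store.
    pub byte_source_replumbed: bool,
}

/// Dual-write may be dropped only when all three conditions hold:
/// the prerequisite is done, the fallback counter stayed at 0 for the
/// whole soak, and the soak ran at least as long as the retention floor.
pub fn may_drop_dual_write(w: &SoakWindow, retention_floor: Duration) -> bool {
    // An empty sample set proves nothing, so it does not count as "held 0".
    let held_zero = !w.legacy_fallback_samples.is_empty()
        && w.legacy_fallback_samples.iter().all(|&v| v == 0);
    w.byte_source_replumbed && held_zero && w.elapsed >= retention_floor
}
```

Treating the re-plumbing prerequisite as a hard input keeps the gate closed today regardless of how clean the metric looks, which matches the stated blocker.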
- Umbrella: Decoupled Context + One-Level Event Chain (#115), the reference-only schema + chain walk this builds on.
- Umbrella: Orchestrator Scaling, the `extracted` predicate block + CQRS lineage.
- Umbrella: System Pool Design, the plug-in-ring system-pool shape both materializers run on.
- Blueprint: `event_wal_and_derivable_storage.md` · predecessor Sink-Driven Data Storage.
- Naming source: Global Hybrid Supercluster Blueprint §4 (Cell + Shard), §7 (Object Store), §8 (Resource Locator).