Umbrella Event WAL Storage
Tracking issue: noetl/ai-meta#104
· Board: roadmap 3
· Blueprint: docs/architecture/event_wal_and_derivable_storage.md
· Program roof: Umbrella: Server Dissolution → Global Grid (#107)
Status (2026-06-23): staged result-tier enablement A–D LIVE on prod; only result_store retirement remains (gated). Phases A–D are now ENABLED on prod GKE: A (canonical URI), B (RESULT_MATERIALIZER_ENABLED, system-pool), C (RESULT_URI_RESOLVE, worker-rust), and D (RESULT_MINT_AUTHORITATIVE on system-pool + server only; the URN→Feather/GCS tier is the authoritative result store, with noetl.result_store kept as the reversible dual-write fail-safe). Phase D enablement was validated live across 4 over-budget execs (tier-only-at-cell=usc1-a, dual_write_total 1:1, resolved_feather correct, sole-writer projected==acked, lag 0, never-scan 0, 0 restarts) and LEFT ENABLED; GC/DR stay OFF. The D flag is placed on system-pool + server only: on worker-rust it would spawn a stray result materializer at the wrong cell (the spawn is gated on materializer_enabled() OR mint_authoritative()), and worker-rust already resolves from the tier via Phase C. Remaining: OQ5 byte-source re-plumb + noetl.result_store retirement + OQ5 soak (still gated). See the 2026-06-23 Sessions-Log entry. The historical RFC/build status follows.
Status (2026-06-22): RFC for review. The event-WAL half is LIVE on prod; the result-data half is the remaining work. The original blueprint (2026-06-16) folded #103 (CQRS split) and #101 (references-in-state) into one model. Since then the event side of that model shipped and went live: the materializer is the sole writer of noetl.event, the off-server drive builds state from the noetl_events JetStream WAL by walking the prev_event_id chain (never scanning noetl.event), and the full off-server CQRS cutover is live on prod (server v3.39.5, 2026-06-22). The result-DATA half has now begun landing: Phase A (server accepts the canonical URI), Phase B (shadow Feather result tier), and Phase C (resolve-by-URN read path: server GCS object backend + cell registry + GET /api/internal/cells, worker resolve-by-URN + fixes B/B1, closes OQ6) are all MERGED (flags default-off → inert in prod until a future rollout), and Phase D (the minting flip: the URN tier becomes the authoritative result store with noetl.result_store as the reversible dual-write fallback) is MERGED 2026-06-23 (server#263 → noetl-server 3.43.0, worker#129 → noetl-worker 5.43.0, ops#204, e2e#78, flag default-off → inert in prod). What remains after Phase D: OQ5, the result_store retirement window (now resolved as a metric-gated decision, see §OQ5; the prod minting cutover is a separate task). Phase E (the side-effecting-tool durability barrier, tools#78 → noetl-tools v3.17.0 / worker#130 → noetl-worker v5.44.0 / e2e#79) and Phase F (GC + DR: server#264 → noetl-server v3.44.0 / worker#131 → noetl-worker v5.45.0 / ops#205 / e2e#80) are both MERGED to main 2026-06-23 (flags default-off → inert in prod; ai-meta pointers bumped). With Phase F the BUILD phases A–F are complete; what remains on #104 is operational only: prod GCS infra, the OQ5 byte-source re-plumbing prerequisite, and the staged prod enablement/minting cutover. This page is the RFC; review requested on the issue.
Every atomic cycle of an execution publishes its event to the
noetl_events NATS JetStream stream; the publish-ack is the durability
boundary, so the stream is the write-ahead log and the synchronous
noetl.event INSERT is gone from the hot path. Independent system-pool
consumers drain that log: a materializer folds it into noetl.event
(now an audit/query projection, not the source of truth) and the
off-server drive rebuilds WorkflowState from the WAL by walking the
one-level prev_event_id chain. (All of the preceding is shipped and
live on prod.) The remaining design, proposed here, is that result
data stops living inline/Postgres and instead is materialized as Arrow
Feather files in object store, addressed by a derivable logical
URI (noetl://<tenant>/<project>/results/<execution_id>/<step>/<frame>/<row>/<attempt>)
that resolves to a physical cell/shard key with zero central lookup;
state carries only the bounded extracted predicate block plus the
derivable coordinates, never the payload and never an opaque reference
string; and a crashed instance resumes from the last acked offset,
re-running re-runnable cycles and skipping a side-effecting cycle whose
result URN already exists.
The 2026-06-16 blueprint described one model spanning two halves. The event half landed through #103 + #115 and is live on prod:
- NATS-as-WAL for events. The worker publishes events to the noetl_events JetStream stream; the materializer (repos/worker/src/materializer.rs, durable consumer noetl_materializer, AckMode::Defer) drains the stream, POSTs /api/internal/events/project, and acks only on 2xx, making it the sole writer of noetl.event (materializer.rs:8,22,46,50,184,195).
- State off the WAL, no event scan. The off-server drive builds WorkflowState from the WAL via WalEventIndex + ExecutionChain, walking the prev_event_id chain from the server's authoritative tip (expected_head): repos/worker/src/state_builder.rs chain_walk_from, AdvanceOutcome. It never scans noetl.event (that table is audit-only under NOETL_EVENT_READ_PATH=audit_only).
- One-level chaining. ChainHeads.link_batch (repos/server/src/state.rs:230) stamps each event's prev_event_id; multi-replica coherence CAS-advances a shared NATS-KV head.
- Live on prod. The full off-server CQRS cutover (PUBLISH_ONLY=true + STATE_BUILDER=offserver) is live on prod as of server v3.39.5 (2026-06-22): server writes zero noetl.event rows, materializer is the sole writer, materializer-lag alert is the guardrail.
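The chain walk described in the list above can be illustrated with a minimal sketch. It is not the code in state_builder.rs: the `WalEvent` shape, the `WalkOutcome` enum, and `walk_chain` are hypothetical stand-ins, and event ids are assumed to be plain integers. It only shows the idea of starting at the authoritative head and following prev_event_id links through a local index, without scanning a table.

```rust
use std::collections::HashMap;

/// Hypothetical, minimal stand-in for a WAL event envelope.
#[derive(Debug, Clone, PartialEq)]
pub struct WalEvent {
    pub event_id: u64,
    /// One-level back-pointer; `None` marks the root of the execution.
    pub prev_event_id: Option<u64>,
}

/// Hypothetical result of a walk (not the real AdvanceOutcome).
#[derive(Debug, PartialEq)]
pub enum WalkOutcome {
    /// Every link was found; ids are in apply order (root first).
    Complete(Vec<u64>),
    /// A link is not in the local index yet, so the WAL must catch up first.
    Gap { missing: u64 },
    /// The links loop back on themselves; the chain is corrupt.
    Cycle,
}

/// Walk backwards from `expected_head` through `prev_event_id` links.
/// Only keyed lookups in the index are used, never a scan.
pub fn walk_chain(index: &HashMap<u64, WalEvent>, expected_head: u64) -> WalkOutcome {
    let mut order = Vec::new();
    let mut cursor = Some(expected_head);
    while let Some(id) = cursor {
        match index.get(&id) {
            Some(ev) => {
                order.push(ev.event_id);
                // More hops than indexed events can only mean a loop.
                if order.len() > index.len() {
                    return WalkOutcome::Cycle;
                }
                cursor = ev.prev_event_id;
            }
            None => return WalkOutcome::Gap { missing: id },
        }
    }
    order.reverse();
    WalkOutcome::Complete(order)
}
```

A caller would treat `Gap` as "retry after the index has consumed more of the stream" rather than as an error, which is what makes the walk safe against a consumer that lags the server's head.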
So the question #104 still owns is not "how do events become durable" — it's "how does result data become durable, addressable, and tiered" on top of that now-live WAL.
Today an over-budget tool result is not in object store and not Feather. It is a JSONB row in Postgres:
- The inline-vs-reference threshold (default 100 KB, env NOETL_EVENT_RESULT_CONTEXT_MAX_BYTES) gates inline-vs-durable (repos/server/src/handlers/events.rs:3456).
- Over-budget results are written to noetl.result_store: one row per result, the payload in a data JSONB column (repos/server/src/db/queries/result_store.rs).
- They are addressed by a server-minted opaque ref noetl://execution/<eid>/result/<name>/<id> (events.rs:3463).
- The worker additively stamps the canonical logical URI reference.uri = noetl://default/default/results/<eid>/<step>/<frame>/<row>/<attempt> via ResultCoordinates::logical_uri() (command.rs:1547), but nothing consumes it; the server result-store path and the materializer still use the legacy opaque ref.
- A bounded, navigable extracted predicate block (≤ ~4 KB) is spliced inline in all tiers so the drive can evaluate when: / set: / cursor fan-out without a fetch (build_extracted / summarise_value, repos/worker/src/executor/command.rs:1859).
This leaves three concrete problems #104 closes:
- Result bytes bloat Postgres and the connection pool — the same small-synchronous-DB pressure CQRS removed for events, still present for result payloads. They belong in object store, not a JSONB column.
- Addressing is an opaque carried string, not derivable — replay, dedup, and topology-aware routing all want a name computed from identity, not a pointer threaded through state.
- There is no result-durability barrier — a crash mid-execution can re-run a side-effecting tool because nothing checks "is this cycle's result already durable" by URN existence.
Honest separation of what is settled, what this RFC proposes, and what is still open. (Open items are detailed in §7.)
| # | Tenet | Status |
|---|---|---|
| T1 | JetStream publish-ack is the event durability boundary; noetl.event is a derived audit projection. | Decided + shipped (#103/#115, live on prod). |
| T2 | The drive builds state by walking the one-level prev_event_id chain off the WAL; never scans noetl.event. | Decided + shipped (#115). |
| T3 | State carries the bounded extracted predicate block + derivable coordinates only; never payload, never an opaque ref string. | extracted shipped; drop-the-opaque-ref is proposed (this RFC). |
| T4 | Result names are a two-layer derivable scheme: a stable logical URI resolved to a physical cell/shard key. | Naming module shipped (locator); production adoption proposed. |
| T5 | Over-budget tabular results materialize to Arrow Feather in object store; non-tabular to JSON/Parquet; small stays inline. | Shadow writer MERGED (Phase B, flag default-off → inert in prod). Tabular → Feather, non-tabular → JSON (OQ3) at the derived §7 key, alongside result_store; nothing reads it until Phase C. |
| T6 | The durability barrier sits at side-effecting tool boundaries only; resume skips a cycle whose result URN already exists. | MERGED 2026-06-23 (Phase E; flag default-off → inert in prod). Tool-registry side_effecting classifier (tools#78 → noetl-tools v3.17.0) + worker barrier (worker#130 → noetl-worker v5.44.0) + rig (e2e#79); worker depends on the published noetl-tools 3.17. Tier-object existence signal shipped; inline-result ("cycle acked") signal is a follow-up. |
| T7 | All drain/materialise/resolve work runs as system-pool playbooks (plug-in ring), not bespoke Rust services. | Decided (system-pool ADR); materializer already runs this way. |
The "treat JetStream as the WAL" decision is implemented for events:
atomic cycle (tool result | condition eval)
│ worker emit path (compiled core)
▼
noetl_events JetStream stream ── the WAL
publish-ack = durable, replicated
│
┌───────────┴────────────┐
▼ ▼
materializer pool off-server drive (state_builder)
→ noetl.event (audit → WalEventIndex + ExecutionChain;
projection; sole walks prev_event_id chain from
writer; ack-after-2xx) expected_head; NEVER scans noetl.event
- The local in-process buffer the blueprint describes is realized as the worker's WalEventIndex: a per-process read cache rebuilt from the retained noetl_events WAL on boot (ephemeral DeliverPolicy::All consumer; NOETL_STATE_BUILDER_DURABLE=1 to revert to a persisted cursor). It accelerates a process reading its own recent appends; it is not a shared cache.
- noetl.event is now a projection. Its role is audit + ad-hoc query + replay reconstruction, not the source of truth and not on the drive's read path.
The WAL today carries event envelopes. For the result-data tier, the same
WAL is the trigger source for a result materializer (a second
system-pool consumer of noetl_events, sibling to the event
materializer):
- It reads call.done events whose reference indicates an over-budget result, derives the physical Feather key from the logical URI (§4), and writes the Feather file to object store (§5).
- It is at-least-once and idempotent: dedup on event_id; overwrite at the resolved key (same property the event materializer already relies on via ON CONFLICT).
- Decision (proposed): the result materializer is a separate consumer from the event materializer, not folded into it, so result I/O (object-store latency) never back-pressures the noetl.event audit fold that the lag alert guards.
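The at-least-once + idempotent property in the list above can be sketched in a few lines. This is an in-memory illustration only: `Delivery`, `Ack`, and `SketchResultMaterializer` are hypothetical names, the object store is modeled as a HashMap, and the real consumer, ack handling, and Feather encoding are out of scope.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical view of one delivery from the stream.
pub struct Delivery {
    pub event_id: u64,
    /// Key already derived from the logical URI.
    pub physical_key: String,
    pub body: Vec<u8>,
}

/// What happened to a delivery; both variants are safe to ack.
#[derive(Debug, PartialEq)]
pub enum Ack {
    Written,
    Duplicate,
}

/// In-memory model of an idempotent result materializer.
#[derive(Default)]
pub struct SketchResultMaterializer {
    seen: HashSet<u64>,
    /// Stand-in for the object store: key -> bytes.
    pub store: HashMap<String, Vec<u8>>,
    pub writes: usize,
}

impl SketchResultMaterializer {
    pub fn handle(&mut self, d: &Delivery) -> Ack {
        // Dedup on event_id: a redelivery does no work.
        if self.seen.contains(&d.event_id) {
            return Ack::Duplicate;
        }
        // Overwrite at the resolved key: even if dedup state were lost,
        // writing the same bytes to the same key converges.
        self.store.insert(d.physical_key.clone(), d.body.clone());
        self.writes += 1;
        self.seen.insert(d.event_id);
        Ack::Written
    }
}
```

The two mechanisms overlap on purpose: dedup avoids redundant I/O, while overwrite-at-key keeps the outcome correct when a redelivery slips past the dedup set (for example after a restart).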
For events, publish-ack is already the boundary. For results, the boundary is the side-effecting-tool rule (§4.4): the result's publish-ack + URN existence is the commit point a resume checks. This is the one place the local-ahead-of-durable window has correctness bite, and it is scoped to side-effecting tools only — not every cycle.
Result names reuse the platform's existing Resource Locator (§4/§7/§8 of
the Global Hybrid Supercluster Blueprint),
not an invented urn:noetl:result:... grammar. The module exists and
is unit-tested in repos/tools/src/locator.rs (noetl_tools::locator,
v3.12.0):
Layer 1 — stable logical URI (location-independent; what state carries; dedup/replay key on it; never renamed on migration/failover):
noetl://<tenant>/<project>/results/<execution_id>/<step>/<frame>/<row>/<attempt>@<version>
- tenant/project lead (multitenancy + sharding dimensions).
- frame/row are mandatory for cursor fan-out: without them two rows of one step collide. (ResultCoordinates carries both; command.rs threads cursor.{frame,row} into render_context.)
- attempt/@version disambiguate retries.
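As an illustration of the Layer 1 grammar, the sketch below formats a logical URI from its coordinates. `SketchCoordinates` is a hypothetical struct, not the real ResultCoordinates in noetl_tools::locator, and treating `@<version>` as optional is an assumption made here because the URI appears both with and without that suffix in this page.

```rust
/// Hypothetical coordinate set for one result (not the real ResultCoordinates).
#[derive(Debug, Clone)]
pub struct SketchCoordinates {
    pub tenant: String,
    pub project: String,
    pub execution_id: String,
    pub step: String,
    pub frame: u32,
    pub row: u32,
    pub attempt: u32,
    /// Assumption: the `@<version>` suffix is optional.
    pub version: Option<u32>,
}

impl SketchCoordinates {
    /// Format the location-independent logical URI.
    pub fn logical_uri(&self) -> String {
        let base = format!(
            "noetl://{}/{}/results/{}/{}/{}/{}/{}",
            self.tenant,
            self.project,
            self.execution_id,
            self.step,
            self.frame,
            self.row,
            self.attempt
        );
        match self.version {
            Some(v) => format!("{}@{}", base, v),
            None => base,
        }
    }
}
```

Because frame and row are part of the name, two rows fanned out from the same step produce different URIs, which is the collision the list above warns about.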
Layer 2 — topology resolution (derivable, then one small registry):
shard_key = FNV1a(tenant + project + execution_affinity) % shard_count
→ region + cell + shard (ResultCoordinates::shard_key)
physical = noetl/env=…/region=…/cell=…/shard=…/tenant=…/project=…/
date=…/execution=…/results/<step>/<frame>/<row>/<attempt>.<ext>
(ResultCoordinates::physical_key)
A consumer that knows (tenant, project, execution_id) derives the
home cell/shard and the object key with zero central lookup. The only
registry is the small, slow-changing cell endpoint map (cell →
provider/bucket/endpoint).
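The Layer 2 derivation can be sketched as follows. Several details are assumptions, since the page does not specify them: the 64-bit FNV-1a variant, plain string concatenation of the three inputs, and taking region/cell/date as already-resolved inputs. `fnv1a64`, `shard_key`, `PhysicalParts`, and `physical_key` are illustrative names here, not the locator module's API.

```rust
/// 64-bit FNV-1a over a byte slice (standard offset basis and prime).
pub fn fnv1a64(bytes: &[u8]) -> u64 {
    const OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
    const PRIME: u64 = 0x0000_0100_0000_01b3;
    let mut hash = OFFSET_BASIS;
    for b in bytes {
        hash ^= u64::from(*b);
        hash = hash.wrapping_mul(PRIME);
    }
    hash
}

/// shard = FNV1a(tenant + project + execution_affinity) % shard_count.
/// Assumption: the inputs are concatenated with no separator.
pub fn shard_key(tenant: &str, project: &str, execution_affinity: &str, shard_count: u64) -> u64 {
    assert!(shard_count > 0, "shard_count must be non-zero");
    let mut input = String::new();
    input.push_str(tenant);
    input.push_str(project);
    input.push_str(execution_affinity);
    fnv1a64(input.as_bytes()) % shard_count
}

/// Hypothetical bundle of everything the physical key template needs.
pub struct PhysicalParts<'a> {
    pub env: &'a str,
    pub region: &'a str,
    pub cell: &'a str,
    pub shard: u64,
    pub tenant: &'a str,
    pub project: &'a str,
    pub date: &'a str,
    pub execution_id: &'a str,
    pub step: &'a str,
    pub frame: u32,
    pub row: u32,
    pub attempt: u32,
    pub ext: &'a str,
}

/// Fill the physical key template shown above.
pub fn physical_key(p: &PhysicalParts<'_>) -> String {
    format!(
        "noetl/env={}/region={}/cell={}/shard={}/tenant={}/project={}/date={}/execution={}/results/{}/{}/{}/{}.{}",
        p.env, p.region, p.cell, p.shard, p.tenant, p.project, p.date,
        p.execution_id, p.step, p.frame, p.row, p.attempt, p.ext
    )
}
```

Every input is either part of the result's identity or a fixed deployment parameter, so any consumer holding the same coordinates computes the same key without asking a central service.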
- Wired: the worker stamps reference.uri (the logical URI) additively on over-budget results (stamp_logical_uri, command.rs:1547).
- Unconsumed: nothing reads reference.uri yet. The server still mints + resolves the legacy opaque noetl://execution/... ref; the result store is still Postgres JSONB.
- Proposed (R02c + materializer): server/orchestrator parse accepts both shapes (back-compat); the materializer writes Feather at physical_key; the read path resolves the logical URI → physical key → object fetch. Flip minting so the logical URI is the stored ref only after the read path proves out.
The derivation gives (region, cell, shard); turning a cell into a
concrete bucket/endpoint needs the cell endpoint map. IMPLEMENTED
(Phase C, 2026-06-23): a small server-served, cacheable registry
(GET /api/internal/cells → cell → {provider, bucket, endpoint, region}), seeded from ops config — dozens of entries, not a per-fetch
SPOF (server#262 →
noetl-server v3.42.0; resolves OQ6). Single-cell deployments (today)
resolve every cell to the one configured bucket; the registry is the seam
that lets the grid grow without changing names. Default-off until a future
rollout enables the read path.
The load-bearing correctness decision. Three atomic kinds, different treatment:
| Atomic kind | Replay safety | Treatment |
|---|---|---|
| Condition evaluation | Pure — re-derives free | Local + async publish; never blocks. |
| Tool, no side effect | Re-runnable | Local + async publish. |
| Tool with side effect | Not replay-safe | Resume must not re-dispatch a cycle whose result URN already exists. |
Before a resumed execution re-dispatches a side-effecting tool for
(execution_id, step, frame, row, attempt), it checks whether the derived
result URN already exists (object HEAD / cycle acked). If it does, the
cycle is skipped and the recorded result adopted. Open (§7): how a
tool declares itself side-effecting — proposed as a tool-registry
attribute (side_effecting: true) so the barrier knows where to block.
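The three-way treatment in the table, plus the existence check described above, reduces to a small decision function. The sketch below is illustrative only: `AtomicKind`, `ResumeAction`, and `resume_action` are hypothetical names, and the existence check is passed in as a closure so the example does not assume any particular object-store API.

```rust
/// The three atomic kinds from the table above.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum AtomicKind {
    Condition,
    ToolNoSideEffect,
    ToolSideEffecting,
}

/// What a resumed execution does with one cycle.
#[derive(Debug, PartialEq)]
pub enum ResumeAction {
    /// Re-derive or re-run; safe because the cycle is replay-safe,
    /// or because no durable result exists yet.
    Run,
    /// A durable result already exists: skip dispatch and adopt it.
    SkipAndAdopt,
}

/// Decide how to treat a cycle on resume.
/// `urn_exists` stands in for the "object HEAD / cycle acked" check.
pub fn resume_action<F>(kind: AtomicKind, result_urn: &str, urn_exists: F) -> ResumeAction
where
    F: Fn(&str) -> bool,
{
    match kind {
        // Replay-safe kinds never pay for the existence check.
        AtomicKind::Condition | AtomicKind::ToolNoSideEffect => ResumeAction::Run,
        AtomicKind::ToolSideEffecting => {
            if urn_exists(result_urn) {
                ResumeAction::SkipAndAdopt
            } else {
                ResumeAction::Run
            }
        }
    }
}
```

Note that the check can only ever suppress a dispatch when a durable result is present; an absent or unreachable result falls through to `Run`, so the barrier never drops work.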
- Stored (in object store): the result payload bytes. Arrow Feather for tabular, JSON/Parquet for non-tabular.
- Derived (never carried): the location, computed from the logical URI via physical_key.
- Carried inline (in the event, all tiers): the bounded extracted predicate block, enough to evaluate guards/fan-out without a fetch.
The size threshold that gates inline-vs-reference selects the tier:
- Small result → inline in the event; no object-store write.
- Over-budget tabular → Arrow Feather at the resolved key. Feather is the on-disk form of the Arrow IPC stream the worker already encodes (noetl_tools::arrow_codec::try_encode_tabular_json → TabularEncoding, media type ARROW_STREAM); mmap-able, zero-copy from pyarrow and the Rust arrow crate.
- Over-budget non-tabular (shell stdout, opaque HTTP JSON) → JSON or Parquet at the same resolved key. Open (§7): JSON vs Parquet as the fallback default.
- Idempotent overwrite: a fixed @version rewrites the same key; a bumped version keeps every attempt (§7 open question on the default).
- GC by prefix: per-execution / per-shard prefix delete on TTL or explicit cleanup. The system/scheduled_cleanup playbook gains a Feather-prefix sweep keyed off expires_at (already a result_store column today).
- No hot-path eviction concern: the drive never reads the Feather tier for guard evaluation (the extracted block covers that); only an explicit downstream consume (input: binding that needs the full payload) fetches it.
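The size-threshold tiering described above can be sketched as a pure function. The names (`Tier`, `select_tier`, `DEFAULT_MAX_INLINE_BYTES`) are hypothetical; reading "100 KB" as 100 × 1024 bytes, treating a result exactly at the limit as inline, and using JSON as the non-tabular fallback (the OQ3 leaning) are assumptions of this sketch.

```rust
/// Assumption: "100 KB" is read as 100 * 1024 bytes.
pub const DEFAULT_MAX_INLINE_BYTES: usize = 100 * 1024;

/// Where a result's payload ends up.
#[derive(Debug, PartialEq)]
pub enum Tier {
    /// Small result: stays in the event, no object-store write.
    Inline,
    /// Over-budget tabular result: Arrow Feather at the resolved key.
    Feather,
    /// Over-budget non-tabular result: JSON at the resolved key.
    Json,
}

impl Tier {
    /// File extension for the physical key, if the tier writes an object.
    pub fn extension(&self) -> Option<&'static str> {
        match self {
            Tier::Inline => None,
            Tier::Feather => Some("feather"),
            Tier::Json => Some("json"),
        }
    }
}

/// Pick the tier from payload size and shape.
/// Assumption: a payload exactly at the limit is still inline.
pub fn select_tier(size_bytes: usize, is_tabular: bool, max_inline_bytes: usize) -> Tier {
    if size_bytes <= max_inline_bytes {
        Tier::Inline
    } else if is_tabular {
        Tier::Feather
    } else {
        Tier::Json
    }
}
```

Keeping the decision a pure function of size and shape means the writer and any later reader can agree on the extension without consulting stored metadata.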
Three caches/stores, three distinct roles — the RFC keeps them separate:
- shm Arrow IPC cache (noetl-arrow-cache, lease expiry): same-node acceleration for a payload handed between colocated blocks. Volatile; not durable; survives this model unchanged.
- noetl.result_store (Postgres JSONB): today's durable store. The Feather tier replaces it for over-budget payloads; small inline results need no store at all. result_store may remain for a transition window (dual-write behind a flag) then retire.
- Feather in object store: the new durable, cross-node, cross-cell tier, addressed by derived URN, replicated/DR'd via the §8 location descriptor.
A WASM worked-example already proves the host-capability shape: the
reference-materializer plug-in
(repos/worker/plugins/reference-materializer/src/lib.rs) writes an
Arrow/Feather buffer to object store via the granted noetl.object_put
capability ring (it currently writes a hardcoded key
noetl/results/reference/0/0/1.feather). Production path (proposed):
generalize this into the system/result_materializer playbook that (a)
derives the real physical_key from the event's logical URI, (b) calls
object_put, and (c) is driven by the noetl_events WAL consumer — same
plug-in-ring shape as the event materializer, hot-reloadable via catalog
version bump (system-pool ADR Phase 4).
- Restart reads the read model + the WAL up to the last offset its executions were durably acked at (the off-server drive already does this from expected_head; the result materializer from its durable consumer cursor).
- Replay the local-only tail: pure conditions re-derive; re-runnable tools re-run; side-effecting tools are skipped if their result URN already exists (§4.4), else dispatched.
- Both materializers are at-least-once + idempotent: event materializer dedups on event_id (ON CONFLICT); result materializer overwrites at the resolved key. Neither reintroduces a noetl.event scan.
- Replay determinism: the location is computed from the envelope + shard function, never carried, so a replay of the same events resolves to the same keys. This is the property the carried opaque ref could not guarantee across migration/failover.
- OQ1: attempt/version semantics. Overwrite-on-retry (fix @version) vs keep-every-attempt (bump). Overwrite is GC-friendly; keep-every is better for forensic replay. Proposed default: overwrite; keep-every behind a debug flag.
- OQ2: local tail bound before back-pressure. How far local may run ahead of the last ack. Too far widens the replay window + idempotency burden; too tight approaches the synchronous write. Needs a configurable bound + a metric.
- OQ3: non-tabular fallback format. JSON vs Parquet for the over-budget non-tabular tier. Leaning JSON for round-trip simplicity; Parquet if columnar non-tabular shows up.
- OQ4: side-effect classification. RESOLVED static (Phase E, 2026-06-23). A static, per-kind tool-registry attribute (registry::kind_is_side_effecting; conservative default true, only noop/rhai false). A per-invocation predicate (http GET vs POST) is deliberately not done: because the barrier is adopt-only (it can only turn a duplicate side effect into one, never drop work), over-classification is safe, so the static flag is sufficient for correctness; per-invocation is a future optimization, not a requirement. (tools#78 / worker#130.)
- OQ5: result_store retirement. DECIDED 2026-06-23: metric-gated + a hard re-plumbing prerequisite (surfaced live by Phase D). Phase D ships the dual-write window (tier authoritative + result_store written as the reversible fallback); it does NOT itself stop the dual-write. The retirement decision is now settled along two axes:
  - When to drop the dual-write: metric-gated. Retire result_store only once noetl_worker_result_mint_authoritative_total{path="legacy_fallback"} holds 0 across a staging soak (the tier never misses), combined with a retention-period time floor (dual-write must run at least one full result retention period at flag-on so any in-flight resume can still fall back). Both must hold: the metric proves the tier is sufficient, the time floor protects in-flight resumes. (Chosen over pure time-bounding (a) and keep-indefinitely (c): the metric is the real signal; the time floor is the safety margin.)
  - Historical rows: let existing noetl.result_store payloads age out via expires_at while the tier owns all new mints (no bulk back-migration).
  A hard prerequisite that gates the actual retirement (not just a policy choice, and NOT done): the materializer fetches the over-budget payload from result_store today, so retiring it requires re-plumbing how the tier write obtains the bytes (e.g. the producer stages directly to the tier, or the materializer reads the inline over-budget payload off the event). Until that re-plumbing lands, dropping result_store would starve the tier writer of its byte source. So even with the metric gate green, retirement is blocked on this prerequisite, which is itself out of Phase D scope and not yet implemented.
- OQ6: cell endpoint registry ownership. RESOLVED (Phase C, 2026-06-23). Server-served: the registry is owned by the server and exposed at GET /api/internal/cells; the resolve-by-URN read path queries it (with the worker-side resolve path consuming the result), default-off until a future rollout. (server#262 → noetl-server v3.42.0; worker#128 → noetl-worker v5.42.0.)
- OQ7: locator dependency shape (surfaced in Phase A, 2026-06-22). Phase A followed RFC §8's "server adds the noetl-tools dep," but noetl-tools has no feature gating just the pure-std locator module: duckdb (bundled C++), kube, arrow-flight, tonic, rhai, gcp_auth are non-optional core deps, so the entire tool-registry graph lands in the control-plane server (the musl-static image still builds clean, so it's a cost not a blocker). Proposed: extract locator into a slim, dependency-light noetl-locator crate that both noetl-tools and the server depend on; same single source of truth, none of the weight. Decide before Phase B/C add more locator call sites server-side.
- Risk: object-store latency on the result path. Mitigated by keeping the result materializer a separate consumer (§3.2) so it never back-pressures the event audit fold / lag alert.
- Risk: two writers of the same key. Avoided: exactly one result materializer instance per shard owns the write (execution-affinity, the same single-owner property #116 established for the drive).
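The OQ5 retirement rule combines three independent conditions: the legacy-fallback metric at zero, the retention-period time floor, and the byte-source re-plumbing prerequisite. A minimal sketch of that conjunction follows; `RetirementInputs`, `RetirementVerdict`, and `retirement_verdict` are hypothetical names, and how the inputs are gathered (metric scrape, soak window bookkeeping) is left out.

```rust
/// Hypothetical inputs to the OQ5 decision.
pub struct RetirementInputs {
    /// Observed value of the `path="legacy_fallback"` counter over the soak.
    pub legacy_fallback_total: u64,
    /// How long the dual-write has run with the flag on, in seconds.
    pub dual_write_elapsed_secs: u64,
    /// One full result retention period, in seconds.
    pub retention_period_secs: u64,
    /// Whether the tier writer no longer reads its bytes from result_store.
    pub byte_source_replumbed: bool,
}

#[derive(Debug, PartialEq)]
pub enum RetirementVerdict {
    Ready,
    /// Every unmet condition, so an operator sees all blockers at once.
    Blocked(Vec<&'static str>),
}

/// All three conditions must hold before the dual-write may stop.
pub fn retirement_verdict(i: &RetirementInputs) -> RetirementVerdict {
    let mut blockers = Vec::new();
    if i.legacy_fallback_total != 0 {
        blockers.push("tier missed at least once (legacy_fallback > 0)");
    }
    if i.dual_write_elapsed_secs < i.retention_period_secs {
        blockers.push("time floor not reached (< one retention period)");
    }
    if !i.byte_source_replumbed {
        blockers.push("byte-source re-plumbing not landed");
    }
    if blockers.is_empty() {
        RetirementVerdict::Ready
    } else {
        RetirementVerdict::Blocked(blockers)
    }
}
```

Collecting every blocker rather than returning on the first one reflects the text above: a green metric gate alone is not enough, and the remaining prerequisites should stay visible.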
Phased + reversible, mirroring the #103 / #115 program. Each phase is behind a flag and kind-validated before the next.
- R-naming (done). noetl_tools::locator module + worker stamps reference.uri additively. ✅
- Phase A: server accepts the canonical URI (R02c). ✅ MERGED (2026-06-22), flag default-off; inert in prod until a future server rollout. Server parses both legacy and canonical shapes via the slim noetl-locator (parse_result_ref + ResultRef); a shadow-accept hook on the normalize_event_to_row chokepoint validates reference.uri and records noetl_result_uri_accept_total{outcome}, never failing the event. Gated behind NOETL_RESULT_URI_ACCEPT (default off / byte-identical no-op); no schema change (the URI is already persisted in the reference JSON). OQ7 resolved in the same change set: the locator was extracted into the lean, dependency-free noetl-locator crate (pure std) so the control plane parses the URI without pulling noetl-tools' heavy graph (duckdb/kube/arrow/tonic/rhai/gcp_auth confirmed absent from the server tree); noetl-tools re-exports it as noetl_tools::locator so the worker stamp path is unchanged. Shipped: noetl/tools#76 (→ noetl-tools 3.15.0 + new noetl-locator 0.1.0 on crates.io) + noetl/server#260 (→ noetl-server 3.40.0) + noetl/e2e#75 (rig). Repos: tools, server, e2e.
- Phase B: result materializer writes Feather (flagged, shadow). ✅ MERGED (2026-06-22), flag default-off → inert in prod until a future rollout enables it. A separate noetl_events consume-loop on the system pool (consumer noetl_result_materializer, own ack cursor) resolves an over-budget result's payload (read-only), tiers it (tabular → Arrow Feather, non-tabular → JSON [OQ3], small → no write), and writes the body to the derived §7 physical_key via the server-mediated PUT /api/internal/objects/{key}: shadow, alongside noetl.result_store; nothing reads it yet (Phase C). Keep-every attempt URN (OQ1); single-cell seed for write-side URN resolution (OQ6's multi-cell/miss part deferred to Phase C). Gated behind NOETL_RESULT_MATERIALIZER_ENABLED (default off / true no-op); never alters the authoritative result, never fails an event. Kind gate-ON: tabular → real Arrow .feather, non-tabular → .json at the derived key, flag-off Δ0, event-materializer sole-writer intact every leg. Merged: worker#127 (→ noetl-worker 5.41.0) + tools#77 (→ noetl-tools 3.16.0 + new noetl-locator 0.1.1 on crates.io) + server#261 (→ noetl-server 3.41.0) + ops#203 + e2e#76. Repos: worker, tools, server, ops, e2e.
- Phase C: resolve-by-URN read path. MERGED 2026-06-23 (all flags default-off → inert in prod until a future rollout). The input:/consume side resolves the logical URI → physical key → object fetch, with the shm Arrow cache in front for same-node hits. The server side adds a GCS object backend + a cell-endpoint registry + GET /api/internal/cells (resolves OQ6); the worker side adds the resolve-by-URN read path (references-in-state behavior, flatten_single_tool_result) + fixes B/B1; 38 worker unit tests. No new crate publish: worker resolves noetl-tools 3.16.0 + adds the published arrow = "53" direct dep, server resolves noetl-locator 0.1.1, both from the registry (no git/branch dep → no repoint). Merged in dependency order: server#262 (→ noetl-server 3.42.0 c2d5ca9) + worker#128 (→ noetl-worker 5.42.0 7971041) + e2e#77 (39dc880, 3-pass rig + fake-gcs). Repos: server, worker, e2e.
- Phase D: minting flip + result_store dual-write window. ✅ MERGED 2026-06-23 (server#263 → noetl-server 3.43.0 6f6b9ef, worker#129 → noetl-worker 5.43.0 be6863a, ops#204 b19b759, e2e#78 07e85aa), flag default-off → inert in prod. One flag, NOETL_RESULT_MINT_AUTHORITATIVE, makes the URN → Feather/GCS tier the authoritative result store: the materializer becomes the authoritative tier writer (implies the Phase B flag), resolve-by-URN becomes the primary consume read path (implies the Phase C flag), and the server keeps writing noetl.result_store as the reversible dual-write fallback leg (counted on noetl_result_store_dual_write_total). A tier miss falls back fail-safe to the dual-written store (rollback safety), recorded on noetl_worker_result_mint_authoritative_total{path}. The tier write stays worker-side because the slim control plane cannot encode Feather (OQ7); the async producer→materializer window is exactly what the dual-write covers. The retirement of result_store (stopping the dual-write) is OQ5: DECIDED metric-gated (§OQ5), gated on a not-yet-done byte-source re-plumbing prerequisite, NOT Phase D; flag-off rolls back cleanly. Repos: server#263 · worker#129 · ops#204 · e2e#78.
- Phase E: side-effect durability barrier. ✅ MERGED 2026-06-23 (flags default-off → inert in prod; ai-meta pointers bumped; tools#78 → noetl-tools v3.17.0, worker#130 → noetl-worker v5.44.0, e2e#79). Tool-registry side_effecting classification + a barrier in the worker: before (re-)dispatching a side-effecting cycle, if the cycle's derived result URN already resolves to a durable result (Phase C read path), the worker SKIPS re-execution and adopts the recorded result, so an external side effect fires exactly once across a crash-resume / re-drive. Non-side-effecting cycles are never blocked. Gated NOETL_SIDE_EFFECT_BARRIER (default off → true no-op). The gate looks through the orchestrator's task_sequence wrapper (side-effecting iff any sub-task is), so noop/rhai steps are exempt. Adopt-only safety: resolve_by_urn returns Some only on a durable hit, so the barrier can only turn a duplicate side effect into one, never drop work, which is why OQ4 resolves to static classification (per-invocation is an optimization, not a correctness need). attempt=1 is fixed, so the barrier keys on durable-success existence at the coordinate, not the attempt number → OQ1 keep-every-attempt + #125 retry compose cleanly (retry-after-failure re-executes; resume-after-success skips). Metric noetl_worker_side_effect_barrier_total{outcome,tool}. Kind gate-ON 3-pass green (prod-exact off-server gate + fake-gcs, deterministic forged re-drive + marker-object side-effect counter): PASS A flag-on re-drive SKIPPED (marker stays 1, barrier{skipped} Δ>0); PASS B flag-off RE-EXECUTES (marker 1→2, metric Δ0); PASS C noop re-drive never checked (Δ0); invariants (sole-writer, roots=1, dangling=0, terminal=1) intact. Server unchanged (worker-only; reuses the Phase C GCS backend + cell registry). Scope follow-up (not a blocker): the implemented existence signal is the tier-object half of §4.4's "object HEAD / cycle acked"; small/inline side-effecting results (not tiered) re-execute today; the event-completion signal for inline results is a clean Phase-E follow-up. tools#78 · worker#130 · e2e#79. Repos: tools, worker, e2e (server not needed).
- Phase F: GC + DR. ✅ MERGED 2026-06-23 (flags default-off → inert in prod; ai-meta pointers bumped; server#264 → noetl-server v3.44.0, worker#131 → noetl-worker v5.45.0, ops#205, e2e#80). The final build phase.
  - GC (server): POST /api/internal/result-tier/gc, gated NOETL_RESULT_TIER_GC (default off → disabled no-op), dry_run default true. A conservative sweeper that reclaims only provably-dead objects: an object whose §7 key parses an execution=<eid> segment, whose execution has no surviving noetl.event row (aged out / orphan), and which is past a grace window (from the eid mint time). It never deletes a live-referenced object: the verdict (decide) skips any object whose execution still has events, unit-tested. Object backend gains list/delete; metric noetl_result_tier_gc_total{outcome}. The system/scheduled_cleanup playbook calls it (double-gated, dry-run by default).
  - DR (worker): the tier is derivable from the WAL, so the result materializer gains a verify-and-repair mode (NOETL_RESULT_TIER_DR, default off): a missing/corrupt object is rebuilt from its source byte-identically (deterministic encode) by re-running the materialization for its URN; healthy objects are left untouched; never alters the authoritative result_store source. Metric noetl_worker_result_tier_dr_total{outcome}.
  - Scoped OUT (reported, not guessed): attempt-version reaping (OQ1, open) and result_store retirement (OQ5); tier GC keys off execution retention, independent of both.
  - Kind gate-ON 5-pass green (off-server gate + fake-gcs): GC-1 dry-run lists the dead orphan + skips the live object (live_candidates=0) + deletes nothing; GC-2 delete reclaims only the orphan, the live object survives + serves; GC-3 flag-off no-op; DR-1 a deleted referenced object re-derived byte-identically (sha256 match) + served by the read path; DR-2 flag-off no-op. Invariants intact every real execution.
  - Merge order (Phase E → F), done: Phase E merged first (tools#78 published noetl-tools 3.17.0; worker#130 repointed onto the published crate, no patch); then worker#131 + e2e#80 were rebased off feat/104-phase-e-barrier onto main and merged. server#264 + ops#205 (Phase-E-independent) merged alongside. worker#130/#131 depend on the published noetl-tools 3.17.
  - noetl/server#264 · noetl/worker#131 · noetl/ops#205 · noetl/e2e#80. Repos: server, worker, ops, e2e.
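The Phase F GC rule (reclaim only provably-dead objects) can be sketched as a pure verdict function. This is not the server's decide implementation: `GcVerdict`, `execution_id_from_key`, and `gc_decide` are hypothetical names, the liveness lookup is passed in as a closure, and the object's age is taken as an input instead of being derived from the eid mint time.

```rust
/// Outcome for one object in the tier.
#[derive(Debug, PartialEq)]
pub enum GcVerdict {
    /// Key has no `execution=<eid>` segment: leave it alone.
    SkipUnparseable,
    /// The execution still has events: never delete.
    SkipLive,
    /// Orphaned, but still inside the grace window.
    SkipInGrace,
    /// Orphaned and past the grace window: safe to reclaim.
    Reclaim,
}

/// Extract the execution id from an `execution=<eid>` path segment.
pub fn execution_id_from_key(key: &str) -> Option<&str> {
    key.split('/')
        .find_map(|segment| segment.strip_prefix("execution="))
        .filter(|eid| !eid.is_empty())
}

/// Decide whether one object may be reclaimed.
/// `has_events` stands in for "a noetl.event row survives for this execution".
pub fn gc_decide<F>(key: &str, has_events: F, age_secs: u64, grace_secs: u64) -> GcVerdict
where
    F: Fn(&str) -> bool,
{
    let eid = match execution_id_from_key(key) {
        Some(eid) => eid,
        None => return GcVerdict::SkipUnparseable,
    };
    if has_events(eid) {
        return GcVerdict::SkipLive;
    }
    if age_secs <= grace_secs {
        return GcVerdict::SkipInGrace;
    }
    GcVerdict::Reclaim
}
```

Every branch other than the last one keeps the object, so any uncertainty (an unexpected key shape, a surviving event, a young object) resolves toward not deleting, which matches the conservative stance described for the sweeper.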
With Phase F MERGED the BUILD phases A–F are complete. What remains on
#104 is operational only: (1) prod GCS infra provisioning, (2) the OQ5
byte-source re-plumbing prerequisite to retiring result_store, and (3) the
staged prod enablement + minting cutover. No more build phases.
Each phase ships its span + metric + execution_id correlation in the
same change set (observability rule). Kind-validation gate per
deployment-validation.md before any GKE rollout.
- #103 (CQRS split): CLOSED, live on prod. Provides the stream, the materializer, the sole-writer guarantee. This RFC's result materializer is a sibling consumer on the same stream. No blocker.
- #101 (references-in-state): CLOSED, reframed by #115. Provides the `extracted` predicate block (kept). Its carried `reference.ref` is the thing this RFC supersedes by derivation. No blocker.
- #115 (decoupled context + event chain): in progress, far along. Provides T1–T3 (chain walk, never-scan, reference-only schema). #104's result-data tier is the storage substrate #115 Phase 1's "schema-holds-refs-only" assumes. Sequencing: Phase A (canonical URI acceptance) should land after or with #115's reference-only schema so both agree the event carries `{logical_uri, extracted}` only. #115's atomic-item context (Phase 5) and #104's resolve-by-URN read path (Phase C) touch the same worker consume path; coordinate to avoid a merge collision.
- #107 (server dissolution → global grid): program roof. #104 is program step 3 ("per-shard NATS-as-WAL; Postgres demoted to a derived projection"). The event half of step 3 is done (Postgres-event is now a projection); #104's result-data tier finishes step 3 by demoting the result store too. Step 4 ("drop Postgres-as-source-of-truth → projection-on-demand") depends on #104 Phase D landing. The cell endpoint registry (§4.3) is the seam step 5 (cross-shard federation) builds on. #104 is on the critical path for #107 steps 4–5.
Prerequisites before starting Phase B: (1) #115 reference-only schema
ratified so the event shape is settled; (2) a default tenant/project
source threaded (single-tenant default/default today is fine for Phase
A–C); (3) the object-store capability (object_put) confirmed available to
the system pool in the target cluster (the worked example proves the kind
path).
- Cross-shard federation / NATS superclusters (#107 step 5): the URN scheme is designed to support it (topology resolves from the name), but the federation routing + consistency design is its own note.
- The shm Arrow cache internals (`noetl-arrow-cache`): unchanged by this RFC; same-node acceleration only.
- Event-WAL mechanics: settled and shipped (#103/#115); this RFC does not revisit publish-ack, the event materializer, or the chain walk.
- Multi-tenant tenant/project provisioning: the URN carries the coordinates; how tenants are created/billed is elsewhere.
- Re-charting the prod cutover: the off-server CQRS cutover is live; this is design-only and changes no prod default.
| Date | What | Pointer |
|---|---|---|
| 2026-06-23 | 🪶 OQ5 Option A SHIPPED: producer-staged result tier. Producing worker stages the over-budget tier object at emit time under NOETL_RESULT_PRODUCER_STAGE (default off → byte-identical no-op), decoupling the tier write from noetl.result_store; materializer skip-on-exists; shared decide_tier → byte-identical. result_store-retirement soak defined (gate = mint_authoritative{legacy_fallback}=0 with producers staging in steady state); NOT started, gated on a prod producer-stage flip (deploy v5.46.0 + NOETL_RESULT_PRODUCER_STAGE=true on noetl-worker-rust). No prod default changed. | worker#132 → v5.46.0 27c7c17; e2e#81; docs#186; ai-meta#128; ops#206 (soak-gate rules, open) |
| 2026-06-23 | 🚀 Operational prerequisite COMPLETE: real-GCS auth (WI/ADC) MERGED + server ROLLED TO PROD as the WI KSA; tier still OFF. server#265 (fad5d8a) merged → noetl-server v3.45.0 (21da3ef); prod image built (Cloud Build, e2-highcpu-8) server-rust@sha256:d3cbf1ad…. Rolled to prod GKE with serviceAccountName: noetl-server-rust (operator-provisioned WI-bound KSA) + result-tier GCS/cell ENV applied on the server, cell-seed ENV on the system-pool. No tier-enable flag set (RESULT_MATERIALIZER_ENABLED/URI_RESOLVE/MINT_AUTHORITATIVE/TIER_GC/TIER_DR all OFF) → tier inert, behavior == prior stack; off-server gate + CPU limits preserved. LIVE validation: server healthy auth=adc (no token minted, lazy), /api/internal/cells reads the config, off-server cutover sole-writer/lag-0/never-scan (smoke + system/scheduled_cleanup COMPLETED, published==acked=13, chain clean), 0 restarts. Prod configured + WI-live, ready for staged B→C→D enablement. | ai-meta repos/server→21da3ef; server#265 |
| 2026-06-23 | Operational prerequisite: real-GCS auth (Workload Identity / ADC) implemented + kind-validated, in review (server-only; no flag, no prod change, ai-meta pointer NOT bumped). The GCS object backend authenticated only via a static NOETL_OBJECT_STORE_GCS_TOKEN (or none, for the kind emulator); it did not mint from WI/ADC, so the tier could not be served in steady-state prod (bucket has public-access-prevention enforced; server runs under a WI-bound KSA). New auth-mode matrix selected by NOETL_OBJECT_STORE_GCS_AUTH (default auto): none (emulator), static (token override), adc (mint + auto-refresh a short-lived OAuth token via gcp_auth, scope devstorage.read_write; the prod path). auto → static token wins, else real-GCS endpoint → adc, else none. Lazy provider, gcp_auth caches+refreshes (token/request, not mint/request). Reuses gcp_auth = "0.12" (already in worker/tools). Metric noetl_object_store_gcs_auth_total{mode,outcome}. 623 tests + clippy green; ADC unit-tested, first exercised live at prod enablement. Kind regression check: Phase C resolve rig re-run against fake-gcs under the prod-exact off-server gate; server logged auth=none for the emulator, served the tier end to end (PASS 1 gcs put Δ4/get Δ2, 1200 rows; PASS 2 fallback; PASS 3 legacy parity; sole-writer intact every leg) → no-auth path not regressed. Closes the "real-GCS auth" prerequisite for prod enablement. GCS bucket provisioning + the operator IAM/WI block + the staged enablement sequence are recorded in §12.1. | server#265 · #104 |
| 2026-06-23 | Phases E + F MERGED to main: the FINAL build phases; repo-only, flags default-off → inert in prod. Phase E: tools#78 → noetl-tools v3.17.0 (registry::kind_is_side_effecting; members noetl-directives + noetl-locator re-published first), worker#130 barrier → noetl-worker v5.44.0 (repointed onto published 3.17, no patch), e2e#79. Phase F: server#264 GC sweeper → noetl-server v3.44.0, worker#131 DR re-derive → noetl-worker v5.45.0 (rebased onto main), ops#205, e2e#80. ai-meta pointers + deployment-spec wiki rows bumped. A–F build complete; #104 stays OPEN (operational items remain). | tools 1d49dd5 · server 341b614 · worker dd07016 · ops 26185ff · e2e d7372be · #104 |
| 2026-06-23 | Phase F (GC + DR) implemented + kind-validated (→ MERGED 2026-06-23, see row above) (the final build phase; flags default-off → inert, PRs unmerged, no ai-meta pointer bump). GC (server): a conservative dry-run-first POST /api/internal/result-tier/gc sweeper (gated NOETL_RESULT_TIER_GC) reclaiming only provably-dead tier objects (execution aged out of noetl.event, past a grace window), never a live-referenced one (unit-tested decide); object backend list/delete; system/scheduled_cleanup GC step (double-gated). DR (worker): result-materializer verify-and-repair mode (NOETL_RESULT_TIER_DR) that rebuilds a missing/corrupt object from its WAL-derivable source byte-identically. OQ1 version-reaping + OQ5 retirement deliberately scoped OUT. Kind 5-pass green (GC dry-run/delete safety, DR byte-identical sha256 match, flag-off no-ops, invariants intact). A–F build phases complete; only operational items remain. Worker/e2e stacked on the unmerged Phase E branch (merge Phase E first). | server#264 · worker#131 · ops#205 · e2e#80 · #104 |
| 2026-06-23 | Phase E implemented + kind-validated (→ MERGED 2026-06-23, see row above) (flag default-off → inert in prod; PRs unmerged, no ai-meta pointer bump, server unchanged). The side-effect durability barrier (RFC §4.4 / T6): tool-registry side_effecting classifier + a worker barrier that, before re-dispatching a side-effecting cycle whose result URN already resolves to a durable result, SKIPS re-execution and adopts it, so a side effect fires exactly once across a crash-resume / re-drive. Gated NOETL_SIDE_EFFECT_BARRIER. OQ4 resolved → static (adopt-only makes over-classification safe). Kind gate-ON 3-pass green (forged re-drive + marker-object counter): flag-on re-drive SKIPPED (marker stays 1), flag-off RE-EXECUTES (1→2), noop re-drive never checked; invariants intact. Inline-result ("cycle acked") signal is a follow-up. | tools#78 · worker#130 · e2e#79 · #104 |
| 2026-06-23 | Phase D MERGED (all 4 PRs squash-merged in dependency order server → worker → ops → e2e; flags default-off → inert in prod). The minting flip ships the dual-write window. No new crate publish: server resolves noetl-locator 0.1.1, worker resolves noetl-locator/noetl-events/noetl-tools from the registry (no Cargo.toml dep change in either PR → no repoint). semantic-release cut noetl-server 3.43.0 + noetl-worker 5.43.0. ai-meta pointers bumped. OQ5 DECIDED metric-gated (drop dual-write once mint_authoritative_total{path=legacy_fallback} holds 0 across a staging soak + retention time floor), gated on a not-yet-done byte-source re-plumbing prerequisite (materializer fetches the payload from result_store today). Prod minting cutover is a separate next task. #104 stays OPEN (E/F + minting cutover remain). | server#263 (3.43.0 6f6b9ef) · worker#129 (5.43.0 be6863a) · ops#204 (b19b759) · e2e#78 (07e85aa) · #104 |
| 2026-06-23 | Phase D implemented + kind-validated, in review (flag default-off → inert in prod; no ai-meta pointer bump, PRs unmerged). The minting flip: one flag NOETL_RESULT_MINT_AUTHORITATIVE makes the URN → Feather/GCS tier the authoritative result store (materializer = authoritative tier writer, resolve-by-URN = primary consume path), with noetl.result_store kept as the reversible dual-write fallback (noetl_result_store_dual_write_total; consume fallback on noetl_worker_result_mint_authoritative_total{path}). Kind gate-ON 3-pass green: PASS 1 tier-authoritative (gcs put Δ4) + dual-write (row + Δ1) + resolve-from-tier (gcs get Δ2, mint{tier} Δ2), 1200 rows; PASS 2 flag-off no-op (all Δ0) + parity; PASS 3 forced tier-miss (object deleted pre-consume) → rollback to result_store (mint{legacy_fallback} Δ1, fallback_object_miss Δ1), 1200 rows. Sole-writer intact every leg. OQ5 (result_store retirement window) surfaced as the open decision before the prod cutover; not decided here. | server#263 · worker#129 · ops#204 · e2e#78 · #104 |
| 2026-06-23 | Phase C MERGED (all flags default-off → inert in prod until a future rollout enables the resolve-by-URN read path). The result-DATA read half: server gains a GCS object backend + a cell-endpoint registry + GET /api/internal/cells (resolves OQ6); worker gains the resolve-by-URN read path (references-in-state behavior, flatten_single_tool_result) + fixes B/B1 (38 unit tests). All 3 PRs squash-merged in dependency order (server → worker → e2e). No new crate publish: worker resolves noetl-tools 3.16.0 + adds the published arrow = "53" direct dep; server resolves noetl-locator 0.1.1; both from the registry (no git/branch dep → no repoint). semantic-release cut noetl-server 3.42.0 + noetl-worker 5.42.0. ai-meta pointers bumped. | server#262 (3.42.0 c2d5ca9) · worker#128 (5.42.0 7971041) · e2e#77 (39dc880) · #104 |
| 2026-06-22 | Phase B MERGED (flag default-off → inert in prod until a future rollout enables NOETL_RESULT_MATERIALIZER_ENABLED). All 5 PRs squash-merged in dependency order: noetl-locator member bumped 0.1.0→0.1.1 pre-merge so the additive ResultCoordinates::parse/from_locator API publishes; semantic-release cut noetl-tools 3.16.0 + published noetl-locator 0.1.1 (member-publish ordering); noetl-server 3.41.0 (sibling noetl_result_materializer consumer) + noetl-worker 5.41.0 (shadow consume-loop) cut. No downstream repoint needed: worker resolves noetl-tools ^3.14.2 + a self-contained local inversion; server resolves noetl-locator ^0.1.0; both from the registry, no git/branch dep. ai-meta pointers bumped. | tools#77 (3.16.0/locator 0.1.1) · server#261 (3.41.0) · worker#127 (5.41.0) · ops#203 · e2e#76 · #104 |
| 2026-06-22 | Phase B implemented + kind-validated, in review (flag default-off, inert in prod). Separate noetl_result_materializer consume-loop on the system pool writes the over-budget result tier (tabular → Arrow Feather, non-tabular → JSON, small → inline no-op) to the derived §7 key in shadow, alongside noetl.result_store; keep-every URN (OQ1), single-cell seed (OQ6 multi-cell deferred to C). Never alters the authoritative result, never fails an event. Kind gate-ON: tabular → real .feather (269 KB, Arrow magic), non-tabular → .json, flag-off Δ0, event-materializer sole-writer intact every leg. | worker#127 · tools#77 · server#261 · ops#203 · e2e#76 · #104 |
| 2026-06-22 | Phase A MERGED (flag default-off, inert in prod until a future server rollout enables it). Slim dependency-free noetl-locator 0.1.0 extracted + published to crates.io (resolves OQ7: heavy graph stays off the control plane; noetl-tools re-exports it); server accepts the canonical result URI behind NOETL_RESULT_URI_ACCEPT. Dep repointed git→0.1.0 pre-merge; 623 server tests green, heavy crates (duckdb/kube/arrow/tonic/rhai/gcp_auth) absent. | tools#76 (noetl-tools 3.15.0) · server#260 (noetl-server 3.40.0) · e2e#75 · #104 |
| 2026-06-22 | Phase A implemented + kind-validated, in review. Server accepts the canonical result URI (parse_result_ref accepts both shapes via noetl_tools::locator; shadow-accept hook on normalize_event_to_row records noetl_result_uri_accept_total{outcome}, never fails the event; gated NOETL_RESULT_URI_ACCEPT, default off / no-op; no schema change). Kind gate-ON: flag-on {canonical} +1, flag-off Δ0, both COMPLETED, sole-writer intact. Surfaced OQ7 (slim noetl-locator crate: noetl-tools drags its whole graph into the server). | server#260 · e2e#75 · server.wiki@7993b02 · #104 |
| 2026-06-22 | RFC upgraded for review. Reconciled the 2026-06-16 blueprint with the now-live state: event-WAL half shipped (materializer sole writer, off-server chain-walk, prev_event_id, off-server CQRS cutover live on prod v3.39.5); RFC re-scoped to the result-DATA half (derivable-URN Feather tier, resolve-by-URN read path, side-effect barrier). Decided/proposed/open separated; phased plan A–F; #107 critical-path sequencing noted. Review requested on #104. | this page · #104 |
| 2026-06-19 | Off-server-drive × gate reconciliation PROVEN. Gate-ON (PUBLISH_ONLY=true) with the off-server drive (PLUGIN_DRIVE=true) + materializer sole writer green on kind; server rebuilds bounded state from the committed log (read-your-writes via relocated trigger), cold-cache crash-recovery via WAL rebuild. | server#238 (v3.29.2) · e2e#61 |
| 2026-06-16 | R02b: worker stamps reference.uri (the stable logical URI .../<frame>/<row>/<attempt>), additive; first consumer is the materializer. | worker#98 · PR worker#99 |
| 2026-06-16 | R02 / R02a: fan-out coordinate (tools). ResultCoordinates gains row; logical URI + §7 key → .../<frame>/<row>/<attempt>. v3.12.0. | tools#69 · PR tools#70 |
| 2026-06-16 | Round 01: naming foundation. Shared noetl_tools::locator (ResourceLocator, ResultCoordinates, shard_key, CellPlacement, legacy parse). 12 unit tests, not yet wired. v3.11.0. | tools#67 · PR tools#68 |
| 2026-06-16 | Umbrella opened; design blueprint landed + topology aligned with the Resource Locator. | #104 · docs PR docs#180 |
Next (current, 2026-06-23): OQ5 result_store retirement, a gated soak. OQ5 Option A is MERGED (noetl-worker v5.46.0): producer-stage decouples the over-budget tier write from result_store (materializer skip-on-exists). The retirement gate is noetl_worker_result_mint_authoritative_total{path="legacy_fallback"} holding at 0 across a 72h window (48h time floor + a volume floor) while producers stage in steady state. Starting the soak requires a gated prod change: deploy v5.46.0 + NOETL_RESULT_PRODUCER_STAGE=true on noetl-worker-rust, then kubectl apply -f the ops#206 alert rules. Until that go-ahead is given, the soak is not running and the result_store dual-write stays in place; dropping the dual-write is a separate explicit-go step gated on the soak passing.
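As a rough illustration of how such a soak gate could be evaluated, the sketch below checks a window of observations against a time floor and a volume floor. Everything named here is an assumption for illustration (`SoakWindow`, `GatePolicy`, `GateVerdict`, `evaluate`, and measuring volume as a count of tier resolves); the real rules live in the ops#206 alert definitions, and how the 72h window and the 48h floor combine is not modeled here.

```rust
use std::time::Duration;

/// Hypothetical summary of what was observed during the soak so far.
pub struct SoakWindow {
    /// How long producers have been staging in steady state.
    pub observed: Duration,
    /// Increase of the `legacy_fallback` counter over the window.
    pub legacy_fallback_delta: u64,
    /// Number of results resolved from the tier over the window (assumed volume signal).
    pub tier_resolves: u64,
}

/// Hypothetical gate thresholds.
pub struct GatePolicy {
    pub time_floor: Duration,
    pub volume_floor: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum GateVerdict {
    /// Zero fallbacks, enough time, enough volume.
    Pass,
    /// At least one consume fell back to the legacy store: the soak fails.
    FailFallbackSeen,
    /// No fallbacks yet, but the time floor has not elapsed.
    PendingTime,
    /// No fallbacks and enough time, but too little traffic to trust the zero.
    PendingVolume,
}

/// A zero only counts once both floors are met; any fallback fails immediately.
pub fn evaluate(window: &SoakWindow, policy: &GatePolicy) -> GateVerdict {
    if window.legacy_fallback_delta > 0 {
        return GateVerdict::FailFallbackSeen;
    }
    if window.observed < policy.time_floor {
        return GateVerdict::PendingTime;
    }
    if window.tier_resolves < policy.volume_floor {
        return GateVerdict::PendingVolume;
    }
    GateVerdict::Pass
}
```

The volume floor is what keeps an idle cluster from passing trivially: a counter that stays at 0 because nothing ran says nothing about whether the tier ever misses.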
- Enable `NOETL_RESULT_URI_ACCEPT` on a future server rollout: Phase A is merged and inert (flag default-off). Flipping it on a kind/staging rollout exercises the shadow-accept hook before Phase B builds on it.
- Review the rest of the RFC: platform-owner ratification on #104 (decided vs proposed vs open; the §9 #107 critical-path claim; OQ1/OQ3/OQ4 defaults).
- Coordinate Phase B with #115's reference-only schema so the event shape (`{logical_uri, extracted}` only) is agreed before the result materializer writes Feather.
- Phase B is MERGED (2026-06-22): worker#127 / tools#77 / server#261 / ops#203 / e2e#76 landed, ai-meta pointers bumped. The Feather tier stays inert in prod until a future rollout enables `NOETL_RESULT_MATERIALIZER_ENABLED`.
- Phase C is MERGED (2026-06-23): server#262 / worker#128 / e2e#77 landed, ai-meta pointers bumped. The resolve-by-URN read path (GCS object backend + cell registry + `GET /api/internal/cells` server-side; worker resolve-by-URN + fixes B/B1, closes OQ6) stays inert in prod until a future rollout enables the read path. No new crate publish was needed.
- Phase D (minting flip) is MERGED (2026-06-23): server#263 → `noetl-server 3.43.0` `6f6b9ef`, worker#129 → `noetl-worker 5.43.0` `be6863a`, ops#204 `b19b759`, e2e#78 `07e85aa`; flag default-off → inert in prod, ai-meta pointers bumped. No new crate publish. The tier is authoritative; `result_store` is the reversible dual-write fallback. Prod minting cutover is a separate next task (rolls server 3.43.0 + worker 5.43.0 to GKE).
- OQ5 (`result_store` retirement window) is DECIDED metric-gated (2026-06-23): drop the dual-write once `noetl_worker_result_mint_authoritative_total{path="legacy_fallback"}` holds 0 across a staging soak (tier never misses) plus a retention-period time floor; historical rows age out by `expires_at` (no back-migration). The actual retirement is blocked on a not-yet-done prerequisite: the materializer fetches the over-budget payload from `result_store` today, so the tier-write byte source must be re-plumbed first. Phase D ships dual-write only; that prerequisite is out of Phase D scope and not yet implemented.
- Phase E (side-effect durability barrier) is MERGED (2026-06-23): tools#78 (`side_effecting` classifier) → noetl-tools v3.17.0, worker#130 (the barrier, `NOETL_SIDE_EFFECT_BARRIER`) → noetl-worker v5.44.0, e2e#79 (the rig); flag default-off → inert in prod, server unchanged. Phase-E follow-up: the inline-result ("cycle acked") existence signal for small side-effecting results (the tier-object signal is shipped).
- Phase F (GC + DR) is MERGED (2026-06-23): server#264 (GC sweeper + endpoint, `NOETL_RESULT_TIER_GC`) → noetl-server v3.44.0, worker#131 (DR verify-and-repair, `NOETL_RESULT_TIER_DR`) → noetl-worker v5.45.0, ops#205 (GC step + flags), e2e#80 (the 5-pass rig); flags default-off → inert. Worker/e2e were stacked on the Phase E branch, so Phase E merged first, then Phase F. This is the final build phase.
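The Phase E barrier in the list above (skip re-execution and adopt the durable result when a side-effecting cycle's result URN already resolves) can be sketched as a small decision function. This is a hedged illustration, not the worker's code: `Dispatch`, `barrier`, the `resolve` callback, and the byte-vector result type are hypothetical names chosen for the example.

```rust
/// Hypothetical outcome of the barrier check for one cycle.
#[derive(Debug, PartialEq, Eq)]
pub enum Dispatch {
    /// Run the tool (first run, non-side-effecting kind, or barrier disabled).
    Execute,
    /// A durable result already exists: adopt it instead of re-running.
    Adopt(Vec<u8>),
}

/// Decide whether a cycle may be (re-)dispatched.
///
/// * `enabled` stands in for the `NOETL_SIDE_EFFECT_BARRIER` flag.
/// * `side_effecting` stands in for the tool-registry classifier verdict.
/// * `resolve` is an assumed lookup that returns the durable result for a
///   URN, or `None` when nothing durable exists yet.
pub fn barrier<F>(enabled: bool, side_effecting: bool, urn: &str, resolve: F) -> Dispatch
where
    F: FnOnce(&str) -> Option<Vec<u8>>,
{
    // Flag off or a no-op kind: the tier is never consulted.
    if !enabled || !side_effecting {
        return Dispatch::Execute;
    }
    match resolve(urn) {
        // The side effect already fired and its result is durable: adopt.
        Some(result) => Dispatch::Adopt(result),
        // Nothing durable yet: this is the first real execution.
        None => Dispatch::Execute,
    }
}
```

Because the barrier only ever adopts an existing result, classifying a harmless tool as side-effecting costs one extra lookup and nothing more, which is consistent with the "adopt-only makes over-classification safe" note recorded for OQ4.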
The BUILD phases A–F of #104 are now complete (Phase F merged 2026-06-23). The
remaining open items on #104 are operational, not build: (1) prod GCS infra
provisioning; (2) the OQ5 byte-source re-plumbing prerequisite to retiring
result_store (the materializer still fetches the over-budget payload from
result_store); (3) the staged prod enablement + minting cutover (roll the A–F
flags on in order, each behind its kind gate). #104 stays OPEN for those.
The result tier cannot be served in steady-state prod until real-GCS auth exists: the prod bucket enforces public-access-prevention and the server runs under a Workload-Identity-bound KSA, so a static/no-auth backend cannot read or write it. Disposition as of 2026-06-23:
Bucket (created). gs://noetl-demo-19700101-results — location
us-central1, uniform bucket-level access, public-access-prevention
enforced, object versioning ON with a noncurrent-version lifecycle
(age-out of noncurrent versions). Single results bucket for the prod
cell (the cell registry resolves every cell to it for now).
Real-GCS auth (WI/ADC) — MERGED + LIVE ON PROD (server#265, fad5d8a → noetl-server v3.45.0; prod server-rust@sha256:d3cbf1ad… runs as the WI-bound KSA noetl-server-rust, auth=adc).
The GCS object backend now mints + auto-refreshes a short-lived OAuth
token from Workload Identity / Application Default Credentials when no
static token is set (scope devstorage.read_write), selected by
NOETL_OBJECT_STORE_GCS_AUTH (auto default → none/static/adc).
Reuses gcp_auth = "0.12"; the no-auth emulator path is preserved
(kind-validated). See the server deployment-spec env catalogue
for the full var docs.
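The `auto` rule recorded in the Sessions Log (static token wins, else a real-GCS endpoint selects `adc`, else `none`) can be sketched as below. This is a minimal illustration, not the server's implementation: `AuthMode`, `resolve_auth_mode`, and `is_real_gcs` are hypothetical names, and detecting "real GCS" by comparing the endpoint with `https://storage.googleapis.com` is an assumption made for the example.

```rust
/// Effective auth mode for the GCS object backend (illustrative names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AuthMode {
    /// No credentials, e.g. the kind emulator.
    NoAuth,
    /// A static token override.
    StaticToken,
    /// Mint and refresh a short-lived token from WI / ADC.
    Adc,
}

/// Assumed heuristic: the endpoint is the public GCS host.
fn is_real_gcs(endpoint: &str) -> bool {
    endpoint.trim_end_matches('/') == "https://storage.googleapis.com"
}

/// Resolve the configured value (`auto`, `none`, `static`, `adc`) to a mode.
/// An explicit mode is taken as-is; only `auto` inspects token and endpoint.
pub fn resolve_auth_mode(
    configured: &str,
    static_token: Option<&str>,
    endpoint: &str,
) -> Result<AuthMode, String> {
    match configured {
        "none" => Ok(AuthMode::NoAuth),
        "static" => Ok(AuthMode::StaticToken),
        "adc" => Ok(AuthMode::Adc),
        // An unset value behaves like the documented default, `auto`.
        "auto" | "" => {
            let has_token = static_token.map_or(false, |t| !t.is_empty());
            if has_token {
                Ok(AuthMode::StaticToken)
            } else if is_real_gcs(endpoint) {
                Ok(AuthMode::Adc)
            } else {
                Ok(AuthMode::NoAuth)
            }
        }
        other => Err(format!("unknown auth mode: {}", other)),
    }
}
```

Under these assumptions the prod settings (`auto`, no static token, the public GCS endpoint) resolve to `Adc`, while an emulator endpoint with no token resolves to `NoAuth`, matching the `auth=adc` and `auth=none` log lines reported for prod and kind.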
Tier env (server + system-pool) — APPLIED on prod 2026-06-23 (tier still OFF):
NOETL_OBJECT_STORE_BACKEND=gcs
NOETL_OBJECT_STORE_GCS_BUCKET=noetl-demo-19700101-results
NOETL_OBJECT_STORE_GCS_ENDPOINT=https://storage.googleapis.com # real GCS → auto-selects adc
NOETL_OBJECT_STORE_GCS_AUTH=auto # or adc explicitly
NOETL_RESULT_CELL=usc1-a
NOETL_RESULT_CELL_ENV=prod
NOETL_RESULT_CELL_REGION=usc1
NOETL_RESULT_SHARD_COUNT=256
Server (cell registry) and system-pool (materializer) must carry identical cell/region/env/shard values or write keys ≠ read keys.
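A pre-rollout parity check for that requirement could look like the sketch below. It is an illustration under stated assumptions: `CellSeed` and `mismatched_fields` are hypothetical names, and the check only compares the four values listed above; it does not derive actual §7 keys.

```rust
/// The cell seed both deployments must agree on (hypothetical struct).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CellSeed {
    pub cell: String,     // NOETL_RESULT_CELL
    pub env: String,      // NOETL_RESULT_CELL_ENV
    pub region: String,   // NOETL_RESULT_CELL_REGION
    pub shard_count: u32, // NOETL_RESULT_SHARD_COUNT
}

/// Return the env var names whose values differ between the server
/// (cell registry, read side) and the system-pool (materializer, write side).
/// An empty result means write keys and read keys are derived from the same seed.
pub fn mismatched_fields(server: &CellSeed, system_pool: &CellSeed) -> Vec<&'static str> {
    let mut out = Vec::new();
    if server.cell != system_pool.cell {
        out.push("NOETL_RESULT_CELL");
    }
    if server.env != system_pool.env {
        out.push("NOETL_RESULT_CELL_ENV");
    }
    if server.region != system_pool.region {
        out.push("NOETL_RESULT_CELL_REGION");
    }
    if server.shard_count != system_pool.shard_count {
        out.push("NOETL_RESULT_SHARD_COUNT");
    }
    out
}
```

Reporting the differing variable names, instead of a bare true/false, makes a drifted deployment easy to correct before any tier flag is turned on.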
Operator IAM / Workload Identity (operator-gated; since completed, per the first item of the enablement sequence below):
PROJECT=noetl-demo-19700101
BUCKET=noetl-demo-19700101-results
GSA=noetl-result-tier
NS=noetl
KSA=noetl-server-rust
gcloud iam service-accounts create "$GSA" --project "$PROJECT" \
--display-name "NoETL result-tier GCS access"
gcloud storage buckets add-iam-policy-binding "gs://$BUCKET" \
--member "serviceAccount:${GSA}@${PROJECT}.iam.gserviceaccount.com" \
--role roles/storage.objectAdmin
gcloud iam service-accounts add-iam-policy-binding \
"${GSA}@${PROJECT}.iam.gserviceaccount.com" \
--role roles/iam.workloadIdentityUser \
--member "serviceAccount:${PROJECT}.svc.id.goog[${NS}/${KSA}]"
kubectl -n "$NS" annotate serviceaccount "$KSA" \
iam.gke.io/gcp-service-account="${GSA}@${PROJECT}.iam.gserviceaccount.com"

Remaining enablement sequence:

- Operator: IAM / WI. ✅ DONE (GSA `noetl-result-tier` + `storage.objectAdmin` on the bucket + WI binding + KSA `noetl-server-rust` annotation; verified live).
- Ship the WI/ADC change. ✅ DONE: server#265 merged (`fad5d8a`), noetl-server v3.45.0 released, ai-meta `repos/server` → `21da3ef` bumped.
- Apply tier env + run server as the WI KSA. ✅ DONE 2026-06-23: server rolled to v3.45.0 with `serviceAccountName: noetl-server-rust` + the GCS/cell ENV; system-pool carries the matching cell seed. Validated healthy `auth=adc` (no token minted, lazy), off-server cutover sole-writer/lag-0/never-scan, 0 restarts. All tier flags stayed OFF at this step.
- Staged flag enablement B → C → D (operator-gated), each behind its kind gate then a prod soak: `NOETL_RESULT_MATERIALIZER_ENABLED` (B) → `NOETL_RESULT_URI_RESOLVE` (C) → `NOETL_RESULT_MINT_AUTHORITATIVE` (D). The 2026-06-23 status at the top of this page records all three as enabled on prod, with D placed on system-pool + server only. OQ5 retirement stays gated on the byte-source re-plumbing prerequisite.
- Umbrella: Decoupled Context + One-Level Event Chain (#115), the reference-only schema + chain walk this builds on.
- Umbrella: Orchestrator Scaling, the `extracted` predicate block + CQRS lineage.
- Umbrella: System Pool Design, the plug-in-ring system-pool shape both materializers run on.
- Blueprint: `event_wal_and_derivable_storage.md` · predecessor Sink-Driven Data Storage.
- Naming source: Global Hybrid Supercluster Blueprint §4 (Cell + Shard), §7 (Object Store), §8 (Resource Locator).