-
Notifications
You must be signed in to change notification settings - Fork 0
Umbrella System Pool Design
ai-task: noetl/ai-meta#46
· Opened: 2026-06-02
· Promoted to primary migration umbrella: 2026-06-02 (afternoon — same day)
· Last update: 2026-06-18 (off-server orchestrate topology now has a committed e2e rig — #111 / e2e#59)
· Status: Design done (ADR v2 landed); Phase 1 + the first system playbook (system/outbox_publisher) ready to pick up
· ADR (v2): System Worker Pool and WASM Plug-in Surface
· Closed predecessors: #30 Appendix H worker migration, #45 compiled Python rewrite
Primary migration umbrella as of 2026-06-02 afternoon. Rest of the Python→Rust migration goes through this umbrella, not via more compiled binaries. The platform extends itself via its own primitives: compilable + pluggable system playbooks on a privileged worker pool.
Build the system worker pool — a privileged Rust worker pool that runs platform-internal logic (auth, RBAC, scheduled cleanups, AND publisher + projector) as NoETL playbooks under a system/ namespace. WASM is the compilation target so playbooks are hot-reloadable AND tenant-overridable.
Model analogy: Oracle's SYS schema (privileged namespace, platform extends itself with its own primitives) plus PostgreSQL extensions (CREATE EXTENSION loads compiled code at runtime via dlopen).
┌──────────────────────┐ ┌──────────────────────────┐
│ noetl/ai-meta #49 │ │ noetl/ai-meta #46 │
│ Rust server port │ │ System pool playbooks │
│ │ │ (THIS UMBRELLA) │
│ │ │ │
│ Phase C: internal ├─────────►│ Phase 2: │
│ endpoints │ unblocks │ system/outbox_publisher│
│ │ │ system/projector │
└──────────┬───────────┘ └──────────┬───────────────┘
│ │
│ produces │ consumes
▼ ▼
┌──────────────────────────────────────────────────────────────┐
│ HTTP API surface (server is the data gatekeeper) │
│ │
│ POST /api/internal/outbox/claim │
│ POST /api/internal/outbox/mark-published │
│ POST /api/internal/outbox/mark-failed │
│ GET /api/internal/outbox/pending-count │
│ POST /api/internal/events/project │
└──────────────────────────────────────────────────────────────┘
+----------------------------------------------------------------+
| |
| Phase 1.a — server /api/internal/* endpoints ✅ shipped |
| Python (v4.10.1) ✅ kind-validated |
| Rust (v2.1.1) ✅ unit-tested (kind-val pending) |
| |
| Phase 1.b — system worker pool deployment ✅ shipped |
| Pool deployed + idle on noetl_worker_pool_system |
| KEDA ScaledObject READY |
| Bearer token mounted, /api/internal call works |
| |
| Phase 2.a.1 — system/outbox_publisher.yaml ✅ shipped |
| Playbook registered, uses tool: http + tool: nats |
| auth: { type: bearer, credential: NOETL_INTERNAL_API_TOKEN }
| |
| Phase 2.a.2 — server-side system/* routing ✅ shipped |
| pool_routing.py POOL_PATH_PREFIX_MAP + catalog_path_for |
| LRU cache. Subject: noetl.commands.system.<eid>. |
| PRs noetl/noetl#661 (v4.11.0) + noetl/ops#144 merged. |
| Kind-validated: consumer-delivery delta = 1 each pool. |
| |
| Phase 2.a.3 — deployment env wiring ✅ shipped |
| NOETL_KEYCHAIN_ENV_VARS=NOETL_INTERNAL_API_TOKEN added |
| PR noetl/ops#143 merged; manifest applied to kind |
| |
| Phase 2.b — system/projector.yaml ⏳ blocked |
| on noetl/ai-meta#52 (js_consume tool op missing) |
| Phase 3 — additional playbooks (auth, RBAC, etc.) ⏳ later |
| Phase 4 — WASM compilation ⏳ later |
| |
+----------------------------------------------------------------+
TODAY (2026-06-02): AFTER #46 PHASE 2: LONG TERM:
PYTHON: PYTHON: PYTHON:
noetl-server (FastAPI) noetl-server (none in platform)
noetl-worker (Py pool) noetl-worker (Py pool)
noetl-outbox-publisher ❌ retired Python lives INSIDE
noetl-projector ❌ retired containers dispatched
Python tool impls Python tool impls by the container tool
DSL parser (Python) DSL parser (Python) kind (#43):
- Agent container
RUST: RUST: - Tenant containers
noetl-worker (user pool) noetl-worker (user pool)
noetl/server (skeleton) noetl/server (Phase C done) Each container is a
noetl/cli noetl/cli one-shot K8s Job;
noetl/tools noetl/tools exits when done.
noetl/executor noetl/executor
noetl-worker-system-pool Platform runtime is
(Rust, runs system playbooks) 100% Rust; Python is
a container payload.
Instead of writing more compiled Rust services, encode services as playbooks. Three properties land for free:
- Audit + replay — every system action emits events the same way user actions do; the event log covers system actions too.
- Versioning — catalog entries are versioned; system playbook changes are tracked in git AND in the catalog.
-
Tenant override — tenants can supply
<tenant>/system/<name>to customise platform behavior for their tenant without forking the platform.
WASM compilation makes the perf argument moot — a YAML playbook compiled to WASM runs at near-native speed inside the worker's wasmtime host.
The bearer-token plumbing that connects a K8s Secret to a system playbook's tool: http call uses only existing NoETL primitives — no new code required. Five hops:
+------------------------------------------+
| K8s Secret noetl-internal-api-token |
| key=token, value=<32-byte hex token> |
+--------------------+---------------------+
|
| valueFrom.secretKeyRef
v
+------------------------------------------+
| Pod env (worker-system-pool) |
| NOETL_INTERNAL_API_TOKEN=<token> |
| NOETL_KEYCHAIN_ENV_VARS=NOETL_INTERNAL_API_TOKEN
+--------------------+---------------------+
|
| worker startup
| load_keychain_env_allowlist()
v
+------------------------------------------+
| ctx.secrets["NOETL_INTERNAL_API_TOKEN"] |
| = "<token>" (verbatim env-var name |
| used as the keychain alias) |
+--------------------+---------------------+
|
| playbook command dispatch
v
+------------------------------------------+
| tool: http |
| auth: |
| type: bearer |
| credential: NOETL_INTERNAL_API_TOKEN|
| |
| AuthResolver.resolve_bearer(): |
| ctx.get_secret("NOETL_INTERNAL_..") |
+--------------------+---------------------+
|
| HTTPS request
v
+------------------------------------------+
| GET/POST /api/internal/outbox/... |
| Authorization: Bearer <token> |
+--------------------+---------------------+
|
| server auth gate
| secrets.compare_digest(provided, expected)
v
+------------------------------------------+
| noetl-server |
| NOETL_INTERNAL_API_TOKEN env match |
| ✅ 200 OK with payload |
+------------------------------------------+
Each hop reuses an existing primitive: K8s Secret, env-var mount, NOETL_KEYCHAIN_ENV_VARS allow-list, AuthResolver, secrets.compare_digest server gate. Phase 2.a.3 turned out to be configuration-only (no Rust code).
System playbooks that manipulate NoETL platform state — system/outbox_publisher, system/projector, system/scheduled_cleanup — use server HTTP API only, not direct postgres access against noetl.* tables. Rationale: connection pool isolation, sharding readiness, single point of consistency. Full rule: agents/rules/data-access-boundary.md.
External-subsystem playbooks — system/auth, system/credential_rotate, system/notify_alert — are exempt; they target Auth0 / Vault / PagerDuty etc. and use tool kinds direct.
This adds a sub-task to Phase 1: the server must expose new /api/internal/* endpoints before any playbook in Phase 2 can run.
| Phase | What | Status |
|---|---|---|
| 1 | System worker pool runtime | Not started; ready to pick up |
| 2 | Migration system playbooks (publisher + projector) | Not started; depends on Phase 1 |
| 3 | Additional system playbooks (auth, RBAC, cleanup) | Not started; depends on Phase 1 |
| 4 | YAML → WASM compilation pipeline + hot reload | Not started; perf optimisation |
The system worker pool is NOT a new compiled binary. It reuses the existing noetl/worker Rust binary with distinct configuration. Sub-tasks:
System playbooks call the server, never the DB directly. These endpoints must land BEFORE any Phase 2 playbook can run.
Python side ✅ landed + kind-validated 2026-06-02 via noetl/noetl#659 (noetl v4.10.0) + noetl/noetl#660 (v4.10.1 with kind-validation fixes). Bumped via ai-meta@cdb597d + 996c9d3.
Rust side ✅ landed 2026-06-02 via noetl/server#12 (server v2.1.0) + noetl/server#13 (v2.1.1 with schema fix). Bumped via ai-meta@c94075e + 996c9d3. Rust kind validation pending (side-by-side deploy with Python).
| Endpoint | Python | Rust |
|---|---|---|
POST /api/internal/outbox/claim?limit=N — replaces Python claim_outbox_batch
|
✅ kind-validated | ✅ unit-tested |
POST /api/internal/outbox/mark-published — replaces Python mark_outbox_published
|
✅ kind-validated | ✅ unit-tested |
POST /api/internal/outbox/mark-failed — UPDATE with exp backoff |
✅ kind-validated (backoff confirmed) | ✅ unit-tested |
GET /api/internal/outbox/pending-count — KEDA scaler trigger source |
✅ kind-validated | ✅ unit-tested |
POST /api/internal/events/project — replaces Python projector batch INSERT |
✅ kind-validated (fresh + idempotent) | ✅ unit-tested |
ServiceAccount bearer-token auth (NOETL_INTERNAL_API_TOKEN) |
✅ kind-validated (403 + 503 paths) | ✅ unit-tested |
Span / tracing::instrument per endpoint |
✅ |
Phase 1.a complete on both servers; Python kind-validated 11/11, Rust validation deferred until side-by-side deploy. System playbooks can deploy against either because the API contract is byte-identical.
Validation harness: noetl/ops automation/development/validate-internal-api.sh (PR awaiting merge).
Phase 1.b (system worker pool Helm deployment) is the next unblocker for Phase 2 playbooks.
Landed via noetl/ops#141 (closes noetl/ops#140). Bumped via ai-meta@8f40fc7.
| Item | Status |
|---|---|
Deployment noetl-worker-system-pool — image localhost/noetl-worker:dev, distinct SA, env NATS_CONSUMER=noetl_worker_pool_system + NATS_FILTER_SUBJECT=noetl.commands.system.>
|
✅ |
| KEDA ScaledObject — generator-emitted NATS-JetStream trigger on consumer lag | ✅ READY |
| RBAC — system pool's SA reads bearer-token Secret + ConfigMaps; NO direct DB / Job / Pod / Secret-create permissions (system playbooks use server API per data-access-boundary) | ✅ |
| Bearer-token Secret stub + recipe to create real token | ✅ |
Token mounted via secretKeyRef env (NOETL_INTERNAL_API_TOKEN) |
✅ |
Pool's token authenticates against server's /api/internal/* end-to-end |
✅ verified: {"pending":0}
|
POOL_FILTER_MAP extension in repos/noetl/noetl/core/runtime/pool_routing.py — system_* tool kinds route to the system pool segment |
⏳ pending Phase 2.a |
Server-side validation — only catalog entries under system/ may declare system_* kinds |
⏳ pending Phase 2.a |
Helm workerPools.system values section |
⏳ deferred to GKE chart update |
| Drift-guard test entry for the new KEDA manifest | ⏳ sibling noetl/noetl PR |
Acceptance for Phase 1.b: met — pool deploys, NATS consumer registered, KEDA scaled, token works end-to-end. The POOL_FILTER_MAP extension lives in Phase 2.a since the playbook ships at the same time as the routing entry.
These playbooks REPLACE today's Python pods. Each one is pure YAML using existing tool kinds.
-
system/outbox_publisher.yaml— ✅ shipped 2026-06-02 via noetl/ops#142. Per the data-access-boundary rule, claim + mark go through the server API; only the NATS publish is a direct tool call:-
tool: http POST /api/internal/outbox/claim— server runs the SELECT FOR UPDATE SKIP LOCKED + IN_FLIGHT marking; returns the batch. -
tool: nats(iterator over batch rows) — publish JSON payload tonoetl.events.<tenant>.<org>.<eid>.<shard>. -
tool: http POST /api/internal/outbox/mark-published— server marks the batch PUBLISHED. - (On per-row publish failure)
tool: http POST /api/internal/outbox/mark-failed— server marks with exponential backoff (deferred; happy-path only in first cut).
Triggered by KEDA on
GET /api/internal/outbox/pending-countORPOST /api/execute path=system/outbox_publishermanually. Auth viaauth: { type: bearer, credential: NOETL_INTERNAL_API_TOKEN }— alias resolved by the worker'sload_keychain_env_allowlistfrom the pod'sNOETL_INTERNAL_API_TOKENenv (wired via noetl/ops#143).End-to-end kind validation pending Phase 2.a.2 (server-side
system/*path routing).Execution flow:
┌──────────────────────────────────┐ │ KEDA HTTP scaler monitors: │ │ GET /api/internal/outbox/ │ │ pending-count │ │ │ │ pending > 0 → scale system pool │ │ 1 → N replicas │ │ │ │ Server publishes: │ │ noetl.commands.system.<eid> │ │ for system/outbox_publisher │ └──────────────┬───────────────────┘ │ ▼ ┌──────────────────────────────────┐ │ noetl-worker-system-pool │ │ (Rust worker, system config: │ │ NATS_CONSUMER= │ │ noetl_worker_pool_system, │ │ filter=noetl.commands.system.>)│ └──────────────┬───────────────────┘ │ │ claims command → dispatches playbook ▼ ┌─────────────────────────────────────────────────────────┐ │ Playbook: system/outbox_publisher │ │ │ │ Step 1: claim_batch │ │ tool: http │ │ POST {server}/api/internal/outbox/claim?limit=100 │ │ auth: system-pool-sa-token │ │ → server: SELECT FOR UPDATE SKIP LOCKED │ │ UPDATE status=IN_FLIGHT, attempts+1 │ │ ← {rows: [{outbox_id, payload, ...}, ...]} │ │ │ │ Step 2: publish_batch (iterator over rows) │ │ tool: nats │ │ subject: noetl.events.{tenant}.{org}.{eid}.{shard} │ │ payload: item.payload (already JSON) │ │ → published to NATS NOETL_EVENTS stream │ │ │ │ Step 3: mark_published │ │ tool: http │ │ POST {server}/api/internal/outbox/mark-published │ │ payload: {outbox_ids: publish_batch.success_ids} │ │ → server: UPDATE status=PUBLISHED, published_at=now()│ │ │ │ Step 4: mark_failed (per-row failure branch) │ │ tool: http │ │ POST {server}/api/internal/outbox/mark-failed │ │ payload: {outbox_id, error, attempts} │ │ → server: UPDATE status=FAILED with backoff in │ │ available_at = now() + 2^attempts seconds │ └─────────────────────────────────────────────────────────┘ │ │ playbook completes ▼ ┌─────────────────────────────────────────────────────────┐ │ Worker emits call.done via POST /api/events │ │ → server writes to event_outbox (PENDING) │ │ → next iteration of this playbook picks it up │ │ (bootstrap-tolerant; one-tick delay; audit closes) │ └─────────────────────────────────────────────────────────┘ -
-
system/projector— replacespython -m noetl.projector. Same boundary rule — write tonoetl.eventgoes through the server:-
tool: nats— pull batch fromnoetl.events.>(consumer name =system-projector-<shard_id>, derived from worker pod name). -
tool: http POST /api/internal/events/project— server runs the batch INSERT withON CONFLICT (event_id) DO NOTHING. - Ack the NATS batch on server 2xx.
Sharded — one playbook execution per shard, shard derived from worker_id.
Blocked on noetl/ai-meta#52 — the noetl-tools
natstool kind today only supportsjs_publish/js_get_msg/js_stream_info; pulling from a durable JetStream consumer (step 1 above) needs a newjs_consumeoperation. Server-sidePOST /api/internal/events/project(step 2) is already kind-validated on both Python (v4.10.1) and Rust (v2.1.2). -
Acceptance: the Python noetl-outbox-publisher Deployment and noetl-projector StatefulSet retire from the kind cluster + GKE. Outbox throughput + projection lag match or exceed today's Python baseline.
The interesting part — once the runtime + first two playbooks land, the rest of the platform's internal logic moves to playbooks too:
-
system/auth— session validation, token lookup, IdP integration. Tenant override:<tenant>/system/auth_with_saml,<tenant>/system/auth_with_oidc, etc. -
system/rbac— per-action authorisation. Tenant override supported. -
system/scheduled_cleanup— TTL enforcement, stale-row reaping. Scheduled via cron-style tool kind. -
system/credential_rotate— refresh long-lived tokens before expiry. - Custom dispatcher rules — tenant-supplied logic for routing decisions, e.g. "route all
mcpcalls for tenant X to pool Y".
The WASM compilation pipeline that closes the perf gap for hot-loop system playbooks (projector, publisher):
- YAML → WASM compiler — server-side at catalog register time (or first execute). Output cached by
(path, version, digest). - wasmtime host inside
noetl/worker— capability-based imports per the ADR. Different capability sets for system vs. tenant-supplied modules. - Hot reload — catalog version bump invalidates cache; next claim recompiles + loads. No process restart.
- WasmPlaybook catalog kind — first-class catalog entry type (per the ADR's Option 1; Option 2 is the future "YAML stays source; WASM is internal" optimisation).
Step 1 = Phase 1 + system/outbox_publisher playbook (Phase 2 first half). Minimum cut that retires the Python outbox publisher pod.
Why this cut: validates the full pipe end-to-end (NATS routing → system pool worker → playbook execution → tool dispatch → Postgres + NATS effects) with the smallest scope. Once it works, adding more system playbooks is incremental.
After Step 1 lands:
- Step 2:
system/projector(retires the second Python pod). - Step 3:
system/auth(the first non-migration system playbook). - Step 4: WASM compilation pipeline.
- HTTP server FastAPI parity in Rust — separate concern, separate future umbrella if needed.
- Rust worker tool-kind parity gaps (#47, #48) — those stay in Umbrella: Rust Worker Parity Gaps.
- Container tool kind callback design (#43) — separate concern.
| Date | Event |
|---|---|
| 2026-06-02 (morning) | Issue filed during the placement discussion under #45 |
| 2026-06-02 (morning) | Hot-reload trade-off matrix (libloading / WASM / sub-process / closure JIT) captured as comment on #46 |
| 2026-06-02 (morning) | ADR v1 merged via noetl/docs#176 — publisher + projector classified as compiled core |
| 2026-06-02 (morning) | Cross-linked wiki pages live: noetl-server runtime-shape (capability surface) and noetl-ops system-worker-pool |
| 2026-06-02 (afternoon) | noetl/server#10 opened as #45 step 1 (Rust publisher binary) |
| 2026-06-02 (afternoon) | Architecture pivot: user decision that publisher + projector belong as system playbooks, not compiled binaries. #30 closed, #45 closed, noetl/server#10 closed, this umbrella promoted to primary. ADR v2 in flight to reflect corrected classification. |
| 2026-06-02 (afternoon) | No implementation started. Phase 1 + system/outbox_publisher ready to pick up. |
| 2026-06-18 |
🎯 The system worker pool now carries the orchestrator drive by default. #108 (c) flipped NOETL_ORCHESTRATE_PLUGIN_DRIVE to default true (server#233, v3.28.0) — every triggering event now dispatches the system/orchestrate WASM plug-in to the system pool (noetl.commands.system.>) instead of driving in-process. Scale-soak-gated on kind: a 694-drive cursor+fan-out run COMPLETED with all drives claimed on the system pool, the default/shared pool getting only the real steps, and __orchestrate__ rows in noetl.event = 0. This is the first production-default consumer of the system pool — the "platform extends itself via its own primitives" thesis, live. #108 CLOSED. |
| 2026-06-18 |
Committed e2e coverage for the off-server topology. #111 / e2e#59 (merged → e2e 977efc2) added kind_validate_orchestrate_offserver.sh — a self-contained kind rig that hard-asserts the drive runs off-server on the system pool (COMPLETED, __orchestrate__ in noetl.event = 0, off-server dispatch + apply via the worker, noetl_orchestrate_shadow_total absent). Live-green against server v3.28.0 (post-#110). Two operator decisions surfaced on #111: (A) retiring the in-process drive fallback is gated on prod adopting a post-#108 image first (prod still runs batch-dispatch-v1, pre-#108); (B) __orchestrate__ PENDING delivery rows accumulate in noetl.command (never reconciled, since the meta-command's events are suppressed) and want a reaping strategy. |
-
Land ADR v2 — corrects the classification (publisher + projector → plug-in ring). PR open on
noetl/docs. -
File sub-issue on
noetl/ops— Phase 1 deployment work (Helm chart + manifests + RBAC). Possibly also onnoetl/noetlfor thePOOL_FILTER_MAPextension. -
Build the system worker pool deployment on kind — image is existing
ghcr.io/noetl/worker, no new code needed. -
Write
system/outbox_publisher.yaml— pure YAML usingpostgres+natstool kinds. Lives inrepos/ops/playbooks/system/(or a newnoetl/system-playbooksrepo if it grows). -
Register the playbook in the kind cluster catalog under
system/outbox_publisher. -
Validate end-to-end — insert a row into
noetl.outbox(status=PENDING); confirm playbook claims it via NATS, publishes toNOETL_EVENTS, marks row PUBLISHED. - Retire the Python outbox publisher pod — flip the Deployment image or just scale to 0.
- Umbrella: Rust Worker Migration — closed predecessor (worker side of migration; #30).
- Umbrella: Python Services to Rust — closed predecessor (compiled-Rust framing dropped; #45).
- Umbrella: Container Tool Callback — sibling, in-flight design (#43).
- Umbrella: Rust Worker Parity Gaps — sibling, tactical gaps (#47, #48).
- ADR: System Worker Pool and WASM Plug-in Surface (v2 in flight).
- noetl-server wiki: Runtime shape — implementation-level companion (will be revised to reflect playbook approach).
- noetl-ops wiki: System worker pool — deploy topology.
- Home — overview
- Repo Map
- Releases
- Sessions Log
- Secrets Wallet (#61) — SECURITY (design)
- Rust Server Port (#49) — PRIMARY
- Decoupled Context + Event Chain (#115) — RFC (design), reframes #101
- Orchestrator Scaling (#101) — reframed by #115; consume side = #115 Phase 1
- Event WAL + Derivable Storage (#104) — Round 01 (locator) PR open
- WASM Plug-in Compilation (#105) — system-pool plug-in hot-reload (ADR Phase 4)
- System Pool Design (#46) — PRIMARY
- Regression Baseline Migration (#98) — e2e
- Subscription / Listener Tool (#90) — RFC
- Container Tool Callback (#43)
- Rust Worker Parity Gaps (#47 · #48)
- Event Envelope Reconciliation (#51 in TaskList)
- Cursor Loop Mode (#100) — server v3.8.0 + tools v3.10.1, 2026-06-15
- Transfer Tool Credentials (#99) — tools v3.10.0 + worker v5.22.0, 2026-06-14
- Explicit Input Binding (#77) — v3.0.0 shipped 2026-06-09
- Rust Worker Migration (#30)
- Python Services → Rust (#45)
- Issue Tracking
- Wiki Convention
- Handoffs
- Deployment Validation
- Execution Model
- Data Access Boundary
- Observability
- noetl/noetl wiki — app + DSL
- noetl/server wiki — Rust control plane
- noetl/worker wiki — Rust pull worker
- noetl/tools wiki — tool registry crate
- noetl/cli wiki — CLI + local mode
- noetl/gateway wiki — gatekeeper
- noetl/ops wiki — Helm + manifests
- noetl/travel wiki — domain SPA reference
- Docs site — engineer-facing architecture