Umbrella System Pool Design

Umbrella — System Worker Pool + Compilable Playbook Plug-ins (PRIMARY)

ai-task: noetl/ai-meta#46 · Opened: 2026-06-02 · Promoted to primary migration umbrella: 2026-06-02 (afternoon — same day) · Last update: 2026-06-02 · Status: Design done (ADR v2 landed); Phase 1 + the first system playbook (system/outbox_publisher) ready to pick up · ADR (v2): System Worker Pool and WASM Plug-in Surface · Closed predecessors: #30 Appendix H worker migration, #45 compiled Python rewrite

Primary migration umbrella as of 2026-06-02 afternoon. The rest of the Python→Rust migration goes through this umbrella, not via more compiled binaries. The platform extends itself via its own primitives: compilable + pluggable system playbooks on a privileged worker pool.

Goal

Build the system worker pool — a privileged Rust worker pool that runs platform-internal logic (auth, RBAC, scheduled cleanups, AND publisher + projector) as NoETL playbooks under a system/ namespace. WASM is the compilation target so playbooks are hot-reloadable AND tenant-overridable.

Model analogy: Oracle's SYS schema (privileged namespace, platform extends itself with its own primitives) plus PostgreSQL extensions (CREATE EXTENSION loads compiled code at runtime via dlopen).

The big idea

Instead of writing more compiled Rust services, encode services as playbooks. Three properties land for free:

Audit + replay — every system action emits events the same way user actions do; the event log covers system actions too.
Versioning — catalog entries are versioned; system playbook changes are tracked in git AND in the catalog.
Tenant override — tenants can supply <tenant>/system/<name> to customise platform behavior for their tenant without forking the platform.
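
The tenant-override property can be pictured as a lookup with a fallback. This is a minimal sketch, assuming the catalog is a flat set of paths; the function name `resolve_system_playbook` and the in-memory catalog are hypothetical and not part of the platform:

```python
from typing import Optional


def resolve_system_playbook(name: str, tenant: Optional[str], catalog: set[str]) -> str:
    """Pick the catalog path to execute for a system playbook.

    Prefers the tenant override <tenant>/system/<name> when one is
    registered, otherwise falls back to the platform's system/<name>.
    """
    platform_path = f"system/{name}"
    if tenant:
        override_path = f"{tenant}/system/{name}"
        if override_path in catalog:
            return override_path
    if platform_path not in catalog:
        raise KeyError(f"no system playbook registered at {platform_path}")
    return platform_path


# Hypothetical catalog contents, for illustration only.
catalog = {"system/auth", "acme/system/auth", "system/rbac"}
print(resolve_system_playbook("auth", "acme", catalog))  # acme/system/auth
print(resolve_system_playbook("rbac", "acme", catalog))  # system/rbac
```

A tenant without an override transparently gets the platform playbook, which is what lets tenants customise behavior without forking.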

WASM compilation is intended to make the perf argument moot: a YAML playbook compiled to WASM is expected to run at near-native speed inside the worker's wasmtime host (Phase 4, not started yet).

Data access boundary (2026-06-02 PM refinement)

System playbooks that manipulate NoETL platform state — system/outbox_publisher, system/projector, system/scheduled_cleanup — use server HTTP API only, not direct postgres access against noetl.* tables. Rationale: connection pool isolation, sharding readiness, single point of consistency. Full rule: agents/rules/data-access-boundary.md.

External-subsystem playbooks — system/auth, system/credential_rotate, system/notify_alert — are exempt; they target Auth0 / Vault / PagerDuty etc. and use tool kinds direct.

This adds a sub-task to Phase 1: the server must expose new /api/internal/* endpoints before any playbook in Phase 2 can run.
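
The boundary rule could be enforced as a lint over a playbook's steps. The sketch below is an assumption about how such a check might look: the playbook names and the postgres tool kind come from the text above, while the step format and the function `boundary_violations` are hypothetical:

```python
# Exempt playbooks and the direct-DB tool kind are named in the text above.
# The lint itself and the step dict format are illustrative assumptions.
EXTERNAL_SUBSYSTEM_PLAYBOOKS = {
    "system/auth",
    "system/credential_rotate",
    "system/notify_alert",
}
DIRECT_DB_TOOL_KINDS = {"postgres"}


def boundary_violations(path: str, steps: list[dict]) -> list[str]:
    """Return the names of steps that bypass the server HTTP API."""
    if path in EXTERNAL_SUBSYSTEM_PLAYBOOKS:
        return []  # exempt: these target Auth0 / Vault / PagerDuty etc.
    return [s["name"] for s in steps if s["tool"] in DIRECT_DB_TOOL_KINDS]


steps = [
    {"name": "claim", "tool": "postgres"},  # direct access to noetl.outbox
    {"name": "publish", "tool": "nats"},
]
print(boundary_violations("system/outbox_publisher", steps))  # ['claim']
```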

Phases

| Phase | What | Status |
|---|---|---|
| 1 | System worker pool runtime | Not started; ready to pick up |
| 2 | Migration system playbooks (publisher + projector) | Not started; depends on Phase 1 |
| 3 | Additional system playbooks (auth, RBAC, cleanup) | Not started; depends on Phase 1 |
| 4 | YAML → WASM compilation pipeline + hot reload | Not started; perf optimisation |

Phase 1 — System worker pool runtime

The system worker pool is NOT a new compiled binary. It reuses the existing noetl/worker Rust binary with distinct configuration. Sub-tasks:

Phase 1.a — Server-side API endpoints (NEW per data-access-boundary rule)

System playbooks call the server, never the DB directly. These endpoints must land BEFORE any Phase 2 playbook can run:

POST /api/internal/outbox/claim?limit=N — replaces Python claim_outbox_batch. Server runs SELECT ... FOR UPDATE SKIP LOCKED, marks rows IN_FLIGHT, returns the batch.
POST /api/internal/outbox/mark-published — replaces Python mark_outbox_published. Batch UPDATE status=PUBLISHED.
POST /api/internal/outbox/mark-failed — replaces Python mark_outbox_failed. Batch UPDATE with exponential backoff in available_at.
GET /api/internal/outbox/pending-count — KEDA scaler trigger source for the system pool.
POST /api/internal/events/project — replaces Python projector batch INSERT. Batch INSERT INTO noetl.event ... ON CONFLICT DO NOTHING.
ServiceAccount auth middleware — /api/internal/* routes require the system pool's K8s ServiceAccount token; user playbooks calling these get 403.
Span + metric + execution_id correlation per endpoint per agents/rules/observability.md Principle 1.

These endpoints land on the Python noetl-server (in repos/noetl/noetl/server/api/) and will eventually carry over to the Rust port.
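
The mark-failed endpoint is described as writing an exponential backoff into available_at. A minimal sketch of that calculation follows; the base delay, the cap, and the attempt counter are assumptions, since the source does not specify them:

```python
from datetime import datetime, timedelta, timezone

BASE_DELAY_S = 5    # assumption: delay after the first failure
MAX_DELAY_S = 900   # assumption: upper bound so retries never stall for hours


def next_available_at(attempts: int, now: datetime) -> datetime:
    """Compute the next available_at for a row that failed `attempts` times."""
    exponent = max(attempts - 1, 0)
    delay = min(BASE_DELAY_S * 2 ** exponent, MAX_DELAY_S)
    return now + timedelta(seconds=delay)


now = datetime(2026, 6, 2, 12, 0, tzinfo=timezone.utc)
for attempts in (1, 2, 3, 10):
    print(attempts, next_available_at(attempts, now) - now)
```

The cap keeps a persistently failing row visible to the claim query at a bounded interval instead of pushing it out indefinitely.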

Phase 1.b — System worker pool deployment

Helm Deployment noetl-worker-system-pool — image ghcr.io/noetl/worker:<v>, distinct ServiceAccount, env NATS_CONSUMER=noetl_worker_pool_system + NATS_FILTER_SUBJECT=noetl.commands.system.>.
KEDA ScaledObject — smaller cap (5) and tighter lag threshold (5) than user pools. Pre-sketched at noetl-ops wiki — System worker pool.
POOL_FILTER_MAP extension in repos/noetl/noetl/core/runtime/pool_routing.py — system_* tool kinds route to the system pool segment.
Server-side validation — only catalog entries under system/ path may declare system_* tool kinds.
RBAC for the system pool's ServiceAccount — read keychain, NO direct DB access (must use /api/internal/* endpoints per the data-access-boundary rule).
Helm workerPools.system values section (opt-in, default disabled).
Kind validation rig — repos/ops/automation/development/system-pool-validation.yaml + validate-system-pool.sh.
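
The routing and validation items above can be sketched together. This is an illustration under assumptions: the real shape of POOL_FILTER_MAP is not shown in this page, so the helpers `pool_segment_for` and `rejected_tool_kinds` and the segment name "default" are hypothetical:

```python
SYSTEM_KIND_PREFIX = "system_"
SYSTEM_PATH_PREFIX = "system/"


def pool_segment_for(tool_kind: str) -> str:
    """Route system_* tool kinds to the system pool segment."""
    # "default" is a placeholder for whatever segment user pools use.
    return "system" if tool_kind.startswith(SYSTEM_KIND_PREFIX) else "default"


def rejected_tool_kinds(catalog_path: str, tool_kinds: list[str]) -> list[str]:
    """Return the system_* kinds this catalog entry is not allowed to declare."""
    if catalog_path.startswith(SYSTEM_PATH_PREFIX):
        return []
    return [k for k in tool_kinds if k.startswith(SYSTEM_KIND_PREFIX)]


print(pool_segment_for("system_http"))                                  # system
print(rejected_tool_kinds("system/echo", ["system_http"]))              # []
print(rejected_tool_kinds("examples/etl", ["http", "system_http"]))     # ['system_http']
```

Read literally, the rule also rejects system_* kinds in tenant overrides at <tenant>/system/<name>, because those paths do not start with system/; whether overrides should be allowed privileged kinds is left open here.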

Acceptance: noetl-worker-system-pool deployment scales 1→N on noetl_worker_pool_system consumer lag; commands published to noetl.commands.system.<eid> get claimed by the system pool and not by user pools. A trivial system/echo playbook executes end-to-end as a smoke test.
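
The acceptance criterion depends on how the filter subject noetl.commands.system.> matches published subjects. The sketch below reimplements NATS wildcard semantics (`*` matches exactly one token, `>` matches one or more trailing tokens) for illustration; it is not the NATS client API, and the user-pool subject in the example is a made-up name:

```python
def subject_matches(filter_subject: str, subject: str) -> bool:
    """Check a NATS-style filter subject against a concrete subject."""
    f_tokens = filter_subject.split(".")
    s_tokens = subject.split(".")
    for i, tok in enumerate(f_tokens):
        if tok == ">":
            # '>' is only valid as the last token and needs at least one token.
            return i == len(f_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False
        if tok != "*" and tok != s_tokens[i]:
            return False
    return len(s_tokens) == len(f_tokens)


SYSTEM_FILTER = "noetl.commands.system.>"
print(subject_matches(SYSTEM_FILTER, "noetl.commands.system.42"))   # True
print(subject_matches(SYSTEM_FILTER, "noetl.commands.system"))      # False
print(subject_matches(SYSTEM_FILTER, "noetl.commands.default.42"))  # False
```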

Phase 2 — Migration system playbooks (retire Python pods)

These playbooks REPLACE today's Python pods. Each one is pure YAML using existing tool kinds.

system/outbox_publisher — replaces python -m noetl.outbox_publisher. Per the data-access-boundary rule, claim + mark go through the server API; only the NATS publish is a direct tool call:
    1. tool: http POST /api/internal/outbox/claim — server runs the SELECT FOR UPDATE SKIP LOCKED + IN_FLIGHT marking; returns the batch.
    2. tool: nats (iterator over batch rows) — publish JSON payload to noetl.events.<tenant>.<org>.<eid>.<shard>.
    3. tool: http POST /api/internal/outbox/mark-published — server marks the batch PUBLISHED.
    4. (On per-row publish failure) tool: http POST /api/internal/outbox/mark-failed — server marks with exponential backoff.
    Triggered by KEDA on GET /api/internal/outbox/pending-count (KEDA's HTTP scaler) OR by a NATS system/outbox.tick notification from a server-side LISTEN/NOTIFY bridge (TBD).
system/projector — replaces python -m noetl.projector. Same boundary rule — write to noetl.event goes through the server:
    1. tool: nats — pull batch from noetl.events.> (consumer name = system-projector-<shard_id>, derived from worker pod name).
    2. tool: http POST /api/internal/events/project — server runs the batch INSERT with ON CONFLICT (event_id) DO NOTHING.
    3. Ack the NATS batch on server 2xx.
    Sharded — one playbook execution per shard, shard derived from worker_id.
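
The four system/outbox_publisher steps can be sketched as one cycle with the HTTP and NATS calls injected as plain callables. This is a control-flow illustration only: the real playbook is YAML, and the row field names (id, tenant, org, eid, shard, payload) are assumptions:

```python
from typing import Callable


def subject_for(row: dict) -> str:
    # noetl.events.<tenant>.<org>.<eid>.<shard>; field names are assumptions.
    return f"noetl.events.{row['tenant']}.{row['org']}.{row['eid']}.{row['shard']}"


def run_publisher_cycle(
    claim: Callable[[int], list],
    publish: Callable[[str, dict], None],
    mark_published: Callable[[list], None],
    mark_failed: Callable[[list], None],
    limit: int = 100,
) -> tuple:
    batch = claim(limit)                  # step 1: POST /api/internal/outbox/claim
    ok, failed = [], []
    for row in batch:                     # step 2: publish each row to NATS
        try:
            publish(subject_for(row), row["payload"])
            ok.append(row["id"])
        except Exception:
            failed.append(row["id"])      # per-row failure feeds step 4
    if ok:
        mark_published(ok)                # step 3: POST .../mark-published
    if failed:
        mark_failed(failed)               # step 4: POST .../mark-failed
    return len(ok), len(failed)


# In-memory fakes standing in for the server API and NATS.
rows = [
    {"id": 1, "tenant": "t1", "org": "o1", "eid": "e1", "shard": 0, "payload": {"n": 1}},
    {"id": 2, "tenant": "t1", "org": "o1", "eid": "e2", "shard": 1, "payload": None},
]
sent, marked = [], {}


def fake_publish(subject, payload):
    if payload is None:
        raise ValueError("empty payload")
    sent.append(subject)


result = run_publisher_cycle(
    claim=lambda limit: rows[:limit],
    publish=fake_publish,
    mark_published=lambda ids: marked.update(published=ids),
    mark_failed=lambda ids: marked.update(failed=ids),
)
print(result, sent, marked)
```

Keeping the two mark calls batched mirrors the batch UPDATE endpoints, so one cycle costs at most three server round trips regardless of batch size.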

Acceptance: the Python noetl-outbox-publisher Deployment and noetl-projector StatefulSet retire from the kind cluster + GKE. Outbox throughput + projection lag match or exceed today's Python baseline.
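
For the projector, the combination of ack-on-2xx and ON CONFLICT (event_id) DO NOTHING is what makes redelivery safe. The sketch below simulates both sides in memory; `FakeEventStore`, `project_batch`, and the message shape are hypothetical stand-ins, not the server or NATS APIs:

```python
class FakeEventStore:
    """Stands in for the server; mimics INSERT ... ON CONFLICT (event_id) DO NOTHING."""

    def __init__(self):
        self.rows = {}
        self.healthy = True

    def project(self, events):
        if not self.healthy:
            return 503
        for event in events:
            self.rows.setdefault(event["event_id"], event)  # duplicates are ignored
        return 200


def project_batch(messages, post_project, ack) -> bool:
    """Post the batch; ack the NATS messages only when the server answers 2xx."""
    status = post_project([m["event"] for m in messages])
    if 200 <= status < 300:
        for m in messages:
            ack(m)
        return True
    return False  # nothing acked, so NATS redelivers the batch later


store, acked = FakeEventStore(), []
batch = [
    {"seq": 1, "event": {"event_id": "a"}},
    {"seq": 2, "event": {"event_id": "b"}},
]

store.healthy = False
first = project_batch(batch, store.project, lambda m: acked.append(m["seq"]))
store.healthy = True
second = project_batch(batch, store.project, lambda m: acked.append(m["seq"]))
# A duplicate delivery after success does not create extra rows.
third = project_batch(batch, store.project, lambda m: None)
print(first, second, third, acked, len(store.rows))
```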

Phase 3 — Additional system playbooks (pluggable surface)

The interesting part — once the runtime + first two playbooks land, the rest of the platform's internal logic moves to playbooks too:

system/auth — session validation, token lookup, IdP integration. Tenant override: <tenant>/system/auth_with_saml, <tenant>/system/auth_with_oidc, etc.
system/rbac — per-action authorisation. Tenant override supported.
system/scheduled_cleanup — TTL enforcement, stale-row reaping. Scheduled via cron-style tool kind.
system/credential_rotate — refresh long-lived tokens before expiry.
Custom dispatcher rules — tenant-supplied logic for routing decisions, e.g. "route all mcp calls for tenant X to pool Y".
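
A dispatcher rule such as "route all mcp calls for tenant X to pool Y" could be modelled as a first-match rule list. The rule shape below is an assumption made for illustration; the source does not define how tenant-supplied routing logic is expressed:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RouteRule:
    tenant: Optional[str]     # None matches any tenant
    tool_kind: Optional[str]  # None matches any tool kind
    pool: str


def route(rules: list, tenant: str, tool_kind: str, default_pool: str = "default") -> str:
    """Return the pool of the first matching rule, else the default pool."""
    for rule in rules:
        if rule.tenant in (None, tenant) and rule.tool_kind in (None, tool_kind):
            return rule.pool
    return default_pool


# "route all mcp calls for tenant X to pool Y"
rules = [RouteRule(tenant="X", tool_kind="mcp", pool="Y")]
print(route(rules, "X", "mcp"))   # Y
print(route(rules, "X", "http"))  # default
print(route(rules, "Z", "mcp"))   # default
```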

Phase 4 — Compilation (perf + plug-in surface)

The WASM compilation pipeline that closes the perf gap for hot-loop system playbooks (projector, publisher):

YAML → WASM compiler — server-side at catalog register time (or first execute). Output cached by (path, version, digest).
wasmtime host inside noetl/worker — capability-based imports per the ADR. Different capability sets for system vs. tenant-supplied modules.
Hot reload — catalog version bump invalidates cache; next claim recompiles + loads. No process restart.
WasmPlaybook catalog kind — first-class catalog entry type (per the ADR's Option 1; Option 2 is the future "YAML stays source; WASM is internal" optimisation).
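
The cache key (path, version, digest) and the hot-reload behaviour can be sketched as follows. The compile step is a stub, since the YAML → WASM compiler does not exist yet; the class name `CompiledPlaybookCache` and the eviction policy (drop older builds of the same path) are assumptions:

```python
import hashlib
from typing import Callable


class CompiledPlaybookCache:
    """Cache compiled modules by (path, version, digest of the source)."""

    def __init__(self, compile_fn: Callable[[bytes], object]):
        self._compile = compile_fn
        self._entries = {}

    def get(self, path: str, version: int, source: bytes) -> object:
        digest = hashlib.sha256(source).hexdigest()
        key = (path, version, digest)
        if key not in self._entries:
            # A version bump (or changed source) misses the cache: drop older
            # builds of this path and compile again, without a process restart.
            self._entries = {k: v for k, v in self._entries.items() if k[0] != path}
            self._entries[key] = self._compile(source)
        return self._entries[key]


compiles = []


def fake_compile(source: bytes) -> str:
    compiles.append(source)
    return f"module<{len(source)} bytes>"


cache = CompiledPlaybookCache(fake_compile)
cache.get("system/projector", 1, b"v1 yaml")
cache.get("system/projector", 1, b"v1 yaml")   # cache hit, no recompile
cache.get("system/projector", 2, b"v2 yaml!")  # version bump, recompiles
print(len(compiles))  # 2
```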

Sequencing — Step 1 cut

Step 1 = Phase 1 + system/outbox_publisher playbook (Phase 2 first half). Minimum cut that retires the Python outbox publisher pod.

Why this cut: validates the full pipe end-to-end (NATS routing → system pool worker → playbook execution → tool dispatch → server API + NATS effects) with the smallest scope. Once it works, adding more system playbooks is incremental.

After Step 1 lands:

Step 2: system/projector (retires the second Python pod).
Step 3: system/auth (the first non-migration system playbook).
Step 4: WASM compilation pipeline.

What's NOT in this umbrella

HTTP server FastAPI parity in Rust — separate concern, separate future umbrella if needed.
Rust worker tool-kind parity gaps (#47, #48) — those stay in Umbrella: Rust Worker Parity Gaps.
Container tool kind callback design (#43) — separate concern.

Recent activity

| Date | Event |
|---|---|
| 2026-06-02 (morning) | Issue filed during the placement discussion under #45 |
| 2026-06-02 (morning) | Hot-reload trade-off matrix (libloading / WASM / sub-process / closure JIT) captured as comment on #46 |
| 2026-06-02 (morning) | ADR v1 merged via noetl/docs#176 — publisher + projector classified as compiled core |
| 2026-06-02 (morning) | Cross-linked wiki pages live: noetl-server runtime-shape (capability surface) and noetl-ops system-worker-pool |
| 2026-06-02 (afternoon) | noetl/server#10 opened as #45 step 1 (Rust publisher binary) |
| 2026-06-02 (afternoon) | Architecture pivot: user decision that publisher + projector belong as system playbooks, not compiled binaries. #30 closed, #45 closed, noetl/server#10 closed, this umbrella promoted to primary. ADR v2 in flight to reflect corrected classification. |
| 2026-06-02 (afternoon) | No implementation started. Phase 1 + system/outbox_publisher ready to pick up. |

Next concrete steps

Land ADR v2 — corrects the classification (publisher + projector → plug-in ring). PR open on noetl/docs.
File sub-issue on noetl/ops — Phase 1 deployment work (Helm chart + manifests + RBAC). Possibly also on noetl/noetl for the POOL_FILTER_MAP extension and the Phase 1.a /api/internal/* endpoints.
Build the system worker pool deployment on kind — the image is the existing ghcr.io/noetl/worker; no new worker code is needed.
Write system/outbox_publisher.yaml — pure YAML using http + nats tool kinds (claim and mark calls go through /api/internal/outbox/* per the data-access-boundary rule). Lives in repos/ops/playbooks/system/ (or a new noetl/system-playbooks repo if it grows).
Register the playbook in the kind cluster catalog under system/outbox_publisher.
Validate end-to-end — insert a row into noetl.outbox (status=PENDING); confirm the playbook claims it via POST /api/internal/outbox/claim, publishes to NOETL_EVENTS, and marks the row PUBLISHED via POST /api/internal/outbox/mark-published.
Retire the Python outbox publisher pod — flip the Deployment image or just scale to 0.

Related

Umbrella: Rust Worker Migration — closed predecessor (worker side of migration; #30).
Umbrella: Python Services to Rust — closed predecessor (compiled-Rust framing dropped; #45).
Umbrella: Container Tool Callback — sibling, in-flight design (#43).
Umbrella: Rust Worker Parity Gaps — sibling, tactical gaps (#47, #48).
ADR: System Worker Pool and WASM Plug-in Surface (v2 in flight).
noetl-server wiki: Runtime shape — implementation-level companion (will be revised to reflect playbook approach).
noetl-ops wiki: System worker pool — deploy topology.

