-
Notifications
You must be signed in to change notification settings - Fork 0
Umbrella System Pool Design
ai-task: noetl/ai-meta#46
· Opened: 2026-06-02
· Promoted to primary migration umbrella: 2026-06-02 (afternoon — same day)
· Last update: 2026-06-02
· Status: Design done (ADR v2 landed); Phase 1 + the first system playbook (system/outbox_publisher) ready to pick up
· ADR (v2): System Worker Pool and WASM Plug-in Surface
· Closed predecessors: #30 Appendix H worker migration, #45 compiled Python rewrite
Primary migration umbrella as of 2026-06-02 afternoon. Rest of the Python→Rust migration goes through this umbrella, not via more compiled binaries. The platform extends itself via its own primitives: compilable + pluggable system playbooks on a privileged worker pool.
Build the system worker pool — a privileged Rust worker pool that runs platform-internal logic (auth, RBAC, scheduled cleanups, AND publisher + projector) as NoETL playbooks under a system/ namespace. WASM is the compilation target so playbooks are hot-reloadable AND tenant-overridable.
Model analogy: Oracle's SYS schema (privileged namespace, platform extends itself with its own primitives) plus PostgreSQL extensions (CREATE EXTENSION loads compiled code at runtime via dlopen).
┌──────────────────────┐ ┌──────────────────────────┐
│ noetl/ai-meta #49 │ │ noetl/ai-meta #46 │
│ Rust server port │ │ System pool playbooks │
│ │ │ (THIS UMBRELLA) │
│ │ │ │
│ Phase C: internal ├─────────►│ Phase 2: │
│ endpoints │ unblocks │ system/outbox_publisher│
│ │ │ system/projector │
└──────────┬───────────┘ └──────────┬───────────────┘
│ │
│ produces │ consumes
▼ ▼
┌──────────────────────────────────────────────────────────────┐
│ HTTP API surface (server is the data gatekeeper) │
│ │
│ POST /api/internal/outbox/claim │
│ POST /api/internal/outbox/mark-published │
│ POST /api/internal/outbox/mark-failed │
│ GET /api/internal/outbox/pending-count │
│ POST /api/internal/events/project │
└──────────────────────────────────────────────────────────────┘
TODAY (2026-06-02): AFTER #46 PHASE 2: LONG TERM:
PYTHON: PYTHON: PYTHON:
noetl-server (FastAPI) noetl-server (none in platform)
noetl-worker (Py pool) noetl-worker (Py pool)
noetl-outbox-publisher ❌ retired Python lives INSIDE
noetl-projector ❌ retired containers dispatched
Python tool impls Python tool impls by the container tool
DSL parser (Python) DSL parser (Python) kind (#43):
- Agent container
RUST: RUST: - Tenant containers
noetl-worker (user pool) noetl-worker (user pool)
noetl/server (skeleton) noetl/server (Phase C done) Each container is a
noetl/cli noetl/cli one-shot K8s Job;
noetl/tools noetl/tools exits when done.
noetl/executor noetl/executor
noetl-worker-system-pool Platform runtime is
(Rust, runs system playbooks) 100% Rust; Python is
a container payload.
Instead of writing more compiled Rust services, encode services as playbooks. Three properties land for free:
- Audit + replay — every system action emits events the same way user actions do; the event log covers system actions too.
- Versioning — catalog entries are versioned; system playbook changes are tracked in git AND in the catalog.
-
Tenant override — tenants can supply
<tenant>/system/<name>to customise platform behavior for their tenant without forking the platform.
WASM compilation makes the perf argument moot — a YAML playbook compiled to WASM runs at near-native speed inside the worker's wasmtime host.
System playbooks that manipulate NoETL platform state — system/outbox_publisher, system/projector, system/scheduled_cleanup — use server HTTP API only, not direct postgres access against noetl.* tables. Rationale: connection pool isolation, sharding readiness, single point of consistency. Full rule: agents/rules/data-access-boundary.md.
External-subsystem playbooks — system/auth, system/credential_rotate, system/notify_alert — are exempt; they target Auth0 / Vault / PagerDuty etc. and use tool kinds direct.
This adds a sub-task to Phase 1: the server must expose new /api/internal/* endpoints before any playbook in Phase 2 can run.
| Phase | What | Status |
|---|---|---|
| 1 | System worker pool runtime | Not started; ready to pick up |
| 2 | Migration system playbooks (publisher + projector) | Not started; depends on Phase 1 |
| 3 | Additional system playbooks (auth, RBAC, cleanup) | Not started; depends on Phase 1 |
| 4 | YAML → WASM compilation pipeline + hot reload | Not started; perf optimisation |
The system worker pool is NOT a new compiled binary. It reuses the existing noetl/worker Rust binary with distinct configuration. Sub-tasks:
System playbooks call the server, never the DB directly. These endpoints must land BEFORE any Phase 2 playbook can run.
Python side ✅ landed + kind-validated 2026-06-02 via noetl/noetl#659 (noetl v4.10.0) + noetl/noetl#660 (v4.10.1 with kind-validation fixes). Bumped via ai-meta@cdb597d + 996c9d3.
Rust side ✅ landed 2026-06-02 via noetl/server#12 (server v2.1.0) + noetl/server#13 (v2.1.1 with schema fix). Bumped via ai-meta@c94075e + 996c9d3. Rust kind validation pending (side-by-side deploy with Python).
| Endpoint | Python | Rust |
|---|---|---|
POST /api/internal/outbox/claim?limit=N — replaces Python claim_outbox_batch
|
✅ kind-validated | ✅ unit-tested |
POST /api/internal/outbox/mark-published — replaces Python mark_outbox_published
|
✅ kind-validated | ✅ unit-tested |
POST /api/internal/outbox/mark-failed — UPDATE with exp backoff |
✅ kind-validated (backoff confirmed) | ✅ unit-tested |
GET /api/internal/outbox/pending-count — KEDA scaler trigger source |
✅ kind-validated | ✅ unit-tested |
POST /api/internal/events/project — replaces Python projector batch INSERT |
✅ kind-validated (fresh + idempotent) | ✅ unit-tested |
ServiceAccount bearer-token auth (NOETL_INTERNAL_API_TOKEN) |
✅ kind-validated (403 + 503 paths) | ✅ unit-tested |
Span / tracing::instrument per endpoint |
✅ |
Phase 1.a complete on both servers; Python kind-validated 11/11, Rust validation deferred until side-by-side deploy. System playbooks can deploy against either because the API contract is byte-identical.
Validation harness: noetl/ops automation/development/validate-internal-api.sh (PR awaiting merge).
Phase 1.b (system worker pool Helm deployment) is the next unblocker for Phase 2 playbooks.
- Helm Deployment
noetl-worker-system-pool— imageghcr.io/noetl/worker:<v>, distinct ServiceAccount, envNATS_CONSUMER=noetl_worker_pool_system+NATS_FILTER_SUBJECT=noetl.commands.system.>. - KEDA ScaledObject — smaller cap (5) and tighter lag threshold (5) than user pools. Pre-sketched at noetl-ops wiki — System worker pool.
-
POOL_FILTER_MAPextension inrepos/noetl/noetl/core/runtime/pool_routing.py—system_*tool kinds route to thesystempool segment. - Server-side validation — only catalog entries under
system/path may declaresystem_*tool kinds. - RBAC for the system pool's ServiceAccount — read keychain, NO direct DB access (must use
/api/internal/*endpoints per the data-access-boundary rule). - Helm
workerPools.systemvalues section (opt-in, default disabled). - Kind validation rig —
repos/ops/automation/development/system-pool-validation.yaml+validate-system-pool.sh.
Acceptance: noetl-worker-system-pool deployment scales 1→N on noetl_worker_pool_system consumer lag; commands published to noetl.commands.system.<eid> get claimed by the system pool and not by user pools. A trivial system/echo playbook executes end-to-end as smoke test.
These playbooks REPLACE today's Python pods. Each one is pure YAML using existing tool kinds.
-
system/outbox_publisher— replacespython -m noetl.outbox_publisher. Per the data-access-boundary rule, claim + mark go through the server API; only the NATS publish is a direct tool call:-
tool: http POST /api/internal/outbox/claim— server runs the SELECT FOR UPDATE SKIP LOCKED + IN_FLIGHT marking; returns the batch. -
tool: nats(iterator over batch rows) — publish JSON payload tonoetl.events.<tenant>.<org>.<eid>.<shard>. -
tool: http POST /api/internal/outbox/mark-published— server marks the batch PUBLISHED. - (On per-row publish failure)
tool: http POST /api/internal/outbox/mark-failed— server marks with exponential backoff.
Triggered by KEDA on
GET /api/internal/outbox/pending-count(KEDA's HTTP scaler) OR by a NATSsystem/outbox.ticknotification from a server-side LISTEN/NOTIFY bridge (TBD).Execution flow:
┌──────────────────────────────────┐ │ KEDA HTTP scaler monitors: │ │ GET /api/internal/outbox/ │ │ pending-count │ │ │ │ pending > 0 → scale system pool │ │ 1 → N replicas │ │ │ │ Server publishes: │ │ noetl.commands.system.<eid> │ │ for system/outbox_publisher │ └──────────────┬───────────────────┘ │ ▼ ┌──────────────────────────────────┐ │ noetl-worker-system-pool │ │ (Rust worker, system config: │ │ NATS_CONSUMER= │ │ noetl_worker_pool_system, │ │ filter=noetl.commands.system.>)│ └──────────────┬───────────────────┘ │ │ claims command → dispatches playbook ▼ ┌─────────────────────────────────────────────────────────┐ │ Playbook: system/outbox_publisher │ │ │ │ Step 1: claim_batch │ │ tool: http │ │ POST {server}/api/internal/outbox/claim?limit=100 │ │ auth: system-pool-sa-token │ │ → server: SELECT FOR UPDATE SKIP LOCKED │ │ UPDATE status=IN_FLIGHT, attempts+1 │ │ ← {rows: [{outbox_id, payload, ...}, ...]} │ │ │ │ Step 2: publish_batch (iterator over rows) │ │ tool: nats │ │ subject: noetl.events.{tenant}.{org}.{eid}.{shard} │ │ payload: item.payload (already JSON) │ │ → published to NATS NOETL_EVENTS stream │ │ │ │ Step 3: mark_published │ │ tool: http │ │ POST {server}/api/internal/outbox/mark-published │ │ payload: {outbox_ids: publish_batch.success_ids} │ │ → server: UPDATE status=PUBLISHED, published_at=now()│ │ │ │ Step 4: mark_failed (per-row failure branch) │ │ tool: http │ │ POST {server}/api/internal/outbox/mark-failed │ │ payload: {outbox_id, error, attempts} │ │ → server: UPDATE status=FAILED with backoff in │ │ available_at = now() + 2^attempts seconds │ └─────────────────────────────────────────────────────────┘ │ │ playbook completes ▼ ┌─────────────────────────────────────────────────────────┐ │ Worker emits call.done via POST /api/events │ │ → server writes to event_outbox (PENDING) │ │ → next iteration of this playbook picks it up │ │ (bootstrap-tolerant; one-tick delay; audit closes) │ └─────────────────────────────────────────────────────────┘ -
-
system/projector— replacespython -m noetl.projector. Same boundary rule — write tonoetl.eventgoes through the server:-
tool: nats— pull batch fromnoetl.events.>(consumer name =system-projector-<shard_id>, derived from worker pod name). -
tool: http POST /api/internal/events/project— server runs the batch INSERT withON CONFLICT (event_id) DO NOTHING. - Ack the NATS batch on server 2xx.
Sharded — one playbook execution per shard, shard derived from worker_id.
-
Acceptance: the Python noetl-outbox-publisher Deployment and noetl-projector StatefulSet retire from the kind cluster + GKE. Outbox throughput + projection lag match or exceed today's Python baseline.
The interesting part — once the runtime + first two playbooks land, the rest of the platform's internal logic moves to playbooks too:
-
system/auth— session validation, token lookup, IdP integration. Tenant override:<tenant>/system/auth_with_saml,<tenant>/system/auth_with_oidc, etc. -
system/rbac— per-action authorisation. Tenant override supported. -
system/scheduled_cleanup— TTL enforcement, stale-row reaping. Scheduled via cron-style tool kind. -
system/credential_rotate— refresh long-lived tokens before expiry. - Custom dispatcher rules — tenant-supplied logic for routing decisions, e.g. "route all
mcpcalls for tenant X to pool Y".
The WASM compilation pipeline that closes the perf gap for hot-loop system playbooks (projector, publisher):
- YAML → WASM compiler — server-side at catalog register time (or first execute). Output cached by
(path, version, digest). - wasmtime host inside
noetl/worker— capability-based imports per the ADR. Different capability sets for system vs. tenant-supplied modules. - Hot reload — catalog version bump invalidates cache; next claim recompiles + loads. No process restart.
- WasmPlaybook catalog kind — first-class catalog entry type (per the ADR's Option 1; Option 2 is the future "YAML stays source; WASM is internal" optimisation).
Step 1 = Phase 1 + system/outbox_publisher playbook (Phase 2 first half). Minimum cut that retires the Python outbox publisher pod.
Why this cut: validates the full pipe end-to-end (NATS routing → system pool worker → playbook execution → tool dispatch → Postgres + NATS effects) with the smallest scope. Once it works, adding more system playbooks is incremental.
After Step 1 lands:
- Step 2:
system/projector(retires the second Python pod). - Step 3:
system/auth(the first non-migration system playbook). - Step 4: WASM compilation pipeline.
- HTTP server FastAPI parity in Rust — separate concern, separate future umbrella if needed.
- Rust worker tool-kind parity gaps (#47, #48) — those stay in Umbrella: Rust Worker Parity Gaps.
- Container tool kind callback design (#43) — separate concern.
| Date | Event |
|---|---|
| 2026-06-02 (morning) | Issue filed during the placement discussion under #45 |
| 2026-06-02 (morning) | Hot-reload trade-off matrix (libloading / WASM / sub-process / closure JIT) captured as comment on #46 |
| 2026-06-02 (morning) | ADR v1 merged via noetl/docs#176 — publisher + projector classified as compiled core |
| 2026-06-02 (morning) | Cross-linked wiki pages live: noetl-server runtime-shape (capability surface) and noetl-ops system-worker-pool |
| 2026-06-02 (afternoon) | noetl/server#10 opened as #45 step 1 (Rust publisher binary) |
| 2026-06-02 (afternoon) | Architecture pivot: user decision that publisher + projector belong as system playbooks, not compiled binaries. #30 closed, #45 closed, noetl/server#10 closed, this umbrella promoted to primary. ADR v2 in flight to reflect corrected classification. |
| 2026-06-02 (afternoon) | No implementation started. Phase 1 + system/outbox_publisher ready to pick up. |
-
Land ADR v2 — corrects the classification (publisher + projector → plug-in ring). PR open on
noetl/docs. -
File sub-issue on
noetl/ops— Phase 1 deployment work (Helm chart + manifests + RBAC). Possibly also onnoetl/noetlfor thePOOL_FILTER_MAPextension. -
Build the system worker pool deployment on kind — image is existing
ghcr.io/noetl/worker, no new code needed. -
Write
system/outbox_publisher.yaml— pure YAML usingpostgres+natstool kinds. Lives inrepos/ops/playbooks/system/(or a newnoetl/system-playbooksrepo if it grows). -
Register the playbook in the kind cluster catalog under
system/outbox_publisher. -
Validate end-to-end — insert a row into
noetl.outbox(status=PENDING); confirm playbook claims it via NATS, publishes toNOETL_EVENTS, marks row PUBLISHED. - Retire the Python outbox publisher pod — flip the Deployment image or just scale to 0.
- Umbrella: Rust Worker Migration — closed predecessor (worker side of migration; #30).
- Umbrella: Python Services to Rust — closed predecessor (compiled-Rust framing dropped; #45).
- Umbrella: Container Tool Callback — sibling, in-flight design (#43).
- Umbrella: Rust Worker Parity Gaps — sibling, tactical gaps (#47, #48).
- ADR: System Worker Pool and WASM Plug-in Surface (v2 in flight).
- noetl-server wiki: Runtime shape — implementation-level companion (will be revised to reflect playbook approach).
- noetl-ops wiki: System worker pool — deploy topology.
- Home — overview
- Repo Map
- Releases
- Sessions Log
- Secrets Wallet (#61) — SECURITY (design)
- Rust Server Port (#49) — PRIMARY
- Decoupled Context + Event Chain (#115) — RFC (design), reframes #101
- Orchestrator Scaling (#101) — reframed by #115; consume side = #115 Phase 1
- Event WAL + Derivable Storage (#104) — Round 01 (locator) PR open
- WASM Plug-in Compilation (#105) — system-pool plug-in hot-reload (ADR Phase 4)
- System Pool Design (#46) — PRIMARY
- Regression Baseline Migration (#98) — e2e
- Subscription / Listener Tool (#90) — RFC
- Container Tool Callback (#43)
- Rust Worker Parity Gaps (#47 · #48)
- Event Envelope Reconciliation (#51 in TaskList)
- Cursor Loop Mode (#100) — server v3.8.0 + tools v3.10.1, 2026-06-15
- Transfer Tool Credentials (#99) — tools v3.10.0 + worker v5.22.0, 2026-06-14
- Explicit Input Binding (#77) — v3.0.0 shipped 2026-06-09
- Rust Worker Migration (#30)
- Python Services → Rust (#45)
- Issue Tracking
- Wiki Convention
- Handoffs
- Deployment Validation
- Execution Model
- Data Access Boundary
- Observability
- noetl/noetl wiki — app + DSL
- noetl/server wiki — Rust control plane
- noetl/worker wiki — Rust pull worker
- noetl/tools wiki — tool registry crate
- noetl/cli wiki — CLI + local mode
- noetl/gateway wiki — gatekeeper
- noetl/ops wiki — Helm + manifests
- noetl/travel wiki — domain SPA reference
- Docs site — engineer-facing architecture