Skip to content

Umbrella System Pool Design

Kadyapam edited this page Jun 18, 2026 · 17 revisions

Umbrella — System Worker Pool + Compilable Playbook Plug-ins (PRIMARY)

ai-task: noetl/ai-meta#46 · Opened: 2026-06-02 · Promoted to primary migration umbrella: 2026-06-02 (afternoon — same day) · Last update: 2026-06-18 (off-server orchestrate topology now has a committed e2e rig — #111 / e2e#59) · Status: Design done (ADR v2 landed); Phase 1 + the first system playbook (system/outbox_publisher) ready to pick up · ADR (v2): System Worker Pool and WASM Plug-in Surface · Closed predecessors: #30 Appendix H worker migration, #45 compiled Python rewrite

Primary migration umbrella as of 2026-06-02 afternoon. Rest of the Python→Rust migration goes through this umbrella, not via more compiled binaries. The platform extends itself via its own primitives: compilable + pluggable system playbooks on a privileged worker pool.

Goal

Build the system worker pool — a privileged Rust worker pool that runs platform-internal logic (auth, RBAC, scheduled cleanups, AND publisher + projector) as NoETL playbooks under a system/ namespace. WASM is the compilation target so playbooks are hot-reloadable AND tenant-overridable.

Model analogy: Oracle's SYS schema (privileged namespace, platform extends itself with its own primitives) plus PostgreSQL extensions (CREATE EXTENSION loads compiled code at runtime via dlopen).

Visual — interlock with #49

   ┌──────────────────────┐          ┌──────────────────────────┐
   │  noetl/ai-meta #49   │          │   noetl/ai-meta #46      │
   │  Rust server port    │          │   System pool playbooks  │
   │                      │          │   (THIS UMBRELLA)        │
   │                      │          │                          │
   │  Phase C: internal   ├─────────►│  Phase 2:                │
   │   endpoints          │ unblocks │   system/outbox_publisher│
   │                      │          │   system/projector       │
   └──────────┬───────────┘          └──────────┬───────────────┘
              │                                 │
              │ produces                        │ consumes
              ▼                                 ▼
   ┌──────────────────────────────────────────────────────────────┐
   │ HTTP API surface (server is the data gatekeeper)             │
   │                                                              │
   │  POST /api/internal/outbox/claim                             │
   │  POST /api/internal/outbox/mark-published                    │
   │  POST /api/internal/outbox/mark-failed                       │
   │  GET  /api/internal/outbox/pending-count                     │
   │  POST /api/internal/events/project                           │
   └──────────────────────────────────────────────────────────────┘

Implementation status (2026-06-02)

+----------------------------------------------------------------+
|                                                                |
|  Phase 1.a — server /api/internal/* endpoints       ✅ shipped |
|     Python (v4.10.1) ✅ kind-validated                         |
|     Rust   (v2.1.1)  ✅ unit-tested (kind-val pending)         |
|                                                                |
|  Phase 1.b — system worker pool deployment          ✅ shipped |
|     Pool deployed + idle on noetl_worker_pool_system           |
|     KEDA ScaledObject READY                                    |
|     Bearer token mounted, /api/internal call works             |
|                                                                |
|  Phase 2.a.1 — system/outbox_publisher.yaml         ✅ shipped |
|     Playbook registered, uses tool: http + tool: nats          |
|     auth: { type: bearer, credential: NOETL_INTERNAL_API_TOKEN }
|                                                                |
|  Phase 2.a.2 — server-side system/* routing         ✅ shipped |
|     pool_routing.py POOL_PATH_PREFIX_MAP + catalog_path_for    |
|     LRU cache.  Subject: noetl.commands.system.<eid>.          |
|     PRs noetl/noetl#661 (v4.11.0) + noetl/ops#144 merged.      |
|     Kind-validated: consumer-delivery delta = 1 each pool.     |
|                                                                |
|  Phase 2.a.3 — deployment env wiring                ✅ shipped |
|     NOETL_KEYCHAIN_ENV_VARS=NOETL_INTERNAL_API_TOKEN added      |
|     PR noetl/ops#143 merged; manifest applied to kind          |
|                                                                |
|  Phase 2.b — system/projector.yaml                  ⏳ blocked |
|     on noetl/ai-meta#52 (js_consume tool op missing)            |
|  Phase 3   — additional playbooks (auth, RBAC, etc.) ⏳ later  |
|  Phase 4   — WASM compilation                       ⏳ later  |
|                                                                |
+----------------------------------------------------------------+

Long-term Python trajectory

TODAY (2026-06-02):           AFTER #46 PHASE 2:           LONG TERM:

PYTHON:                       PYTHON:                       PYTHON:
  noetl-server (FastAPI)        noetl-server                   (none in platform)
  noetl-worker (Py pool)        noetl-worker (Py pool)
  noetl-outbox-publisher        ❌ retired                     Python lives INSIDE
  noetl-projector               ❌ retired                     containers dispatched
  Python tool impls             Python tool impls              by the container tool
  DSL parser (Python)           DSL parser (Python)            kind (#43):
                                                                 - Agent container
RUST:                         RUST:                              - Tenant containers
  noetl-worker (user pool)      noetl-worker (user pool)
  noetl/server (skeleton)       noetl/server (Phase C done)    Each container is a
  noetl/cli                     noetl/cli                      one-shot K8s Job;
  noetl/tools                   noetl/tools                    exits when done.
  noetl/executor                noetl/executor
                                noetl-worker-system-pool      Platform runtime is
                                (Rust, runs system playbooks) 100% Rust; Python is
                                                              a container payload.

The big idea

Instead of writing more compiled Rust services, encode services as playbooks. Three properties land for free:

  1. Audit + replay — every system action emits events the same way user actions do; the event log covers system actions too.
  2. Versioning — catalog entries are versioned; system playbook changes are tracked in git AND in the catalog.
  3. Tenant override — tenants can supply <tenant>/system/<name> to customise platform behavior for their tenant without forking the platform.

WASM compilation makes the perf argument moot — a YAML playbook compiled to WASM runs at near-native speed inside the worker's wasmtime host.

Auth flow — Secret to API call (kind-validated 2026-06-02)

The bearer-token plumbing that connects a K8s Secret to a system playbook's tool: http call uses only existing NoETL primitives — no new code required. Five hops:

+------------------------------------------+
|  K8s Secret  noetl-internal-api-token    |
|    key=token, value=<32-byte hex token>  |
+--------------------+---------------------+
                     |
                     | valueFrom.secretKeyRef
                     v
+------------------------------------------+
|  Pod env (worker-system-pool)            |
|    NOETL_INTERNAL_API_TOKEN=<token>      |
|    NOETL_KEYCHAIN_ENV_VARS=NOETL_INTERNAL_API_TOKEN
+--------------------+---------------------+
                     |
                     | worker startup
                     | load_keychain_env_allowlist()
                     v
+------------------------------------------+
|  ctx.secrets["NOETL_INTERNAL_API_TOKEN"] |
|    = "<token>" (verbatim env-var name    |
|       used as the keychain alias)        |
+--------------------+---------------------+
                     |
                     | playbook command dispatch
                     v
+------------------------------------------+
|  tool: http                              |
|    auth:                                 |
|      type: bearer                        |
|      credential: NOETL_INTERNAL_API_TOKEN|
|                                          |
|  AuthResolver.resolve_bearer():          |
|    ctx.get_secret("NOETL_INTERNAL_..")   |
+--------------------+---------------------+
                     |
                     | HTTPS request
                     v
+------------------------------------------+
|  GET/POST /api/internal/outbox/...       |
|    Authorization: Bearer <token>         |
+--------------------+---------------------+
                     |
                     | server auth gate
                     | secrets.compare_digest(provided, expected)
                     v
+------------------------------------------+
|  noetl-server                            |
|    NOETL_INTERNAL_API_TOKEN env match    |
|  ✅ 200 OK with payload                  |
+------------------------------------------+

Each hop reuses an existing primitive: K8s Secret, env-var mount, NOETL_KEYCHAIN_ENV_VARS allow-list, AuthResolver, secrets.compare_digest server gate. Phase 2.a.3 turned out to be configuration-only (no Rust code).

Data access boundary (2026-06-02 PM refinement)

System playbooks that manipulate NoETL platform state — system/outbox_publisher, system/projector, system/scheduled_cleanup — use server HTTP API only, not direct postgres access against noetl.* tables. Rationale: connection pool isolation, sharding readiness, single point of consistency. Full rule: agents/rules/data-access-boundary.md.

External-subsystem playbooks — system/auth, system/credential_rotate, system/notify_alert — are exempt; they target Auth0 / Vault / PagerDuty etc. and use tool kinds direct.

This adds a sub-task to Phase 1: the server must expose new /api/internal/* endpoints before any playbook in Phase 2 can run.

Phases

Phase What Status
1 System worker pool runtime Not started; ready to pick up
2 Migration system playbooks (publisher + projector) Not started; depends on Phase 1
3 Additional system playbooks (auth, RBAC, cleanup) Not started; depends on Phase 1
4 YAML → WASM compilation pipeline + hot reload Not started; perf optimisation

Phase 1 — System worker pool runtime

The system worker pool is NOT a new compiled binary. It reuses the existing noetl/worker Rust binary with distinct configuration. Sub-tasks:

Phase 1.a — Server-side API endpoints (NEW per data-access-boundary rule)

System playbooks call the server, never the DB directly. These endpoints must land BEFORE any Phase 2 playbook can run.

Python side ✅ landed + kind-validated 2026-06-02 via noetl/noetl#659 (noetl v4.10.0) + noetl/noetl#660 (v4.10.1 with kind-validation fixes). Bumped via ai-meta@cdb597d + 996c9d3.

Rust side ✅ landed 2026-06-02 via noetl/server#12 (server v2.1.0) + noetl/server#13 (v2.1.1 with schema fix). Bumped via ai-meta@c94075e + 996c9d3. Rust kind validation pending (side-by-side deploy with Python).

Endpoint Python Rust
POST /api/internal/outbox/claim?limit=N — replaces Python claim_outbox_batch ✅ kind-validated ✅ unit-tested
POST /api/internal/outbox/mark-published — replaces Python mark_outbox_published ✅ kind-validated ✅ unit-tested
POST /api/internal/outbox/mark-failed — UPDATE with exp backoff ✅ kind-validated (backoff confirmed) ✅ unit-tested
GET /api/internal/outbox/pending-count — KEDA scaler trigger source ✅ kind-validated ✅ unit-tested
POST /api/internal/events/project — replaces Python projector batch INSERT ✅ kind-validated (fresh + idempotent) ✅ unit-tested
ServiceAccount bearer-token auth (NOETL_INTERNAL_API_TOKEN) ✅ kind-validated (403 + 503 paths) ✅ unit-tested
Span / tracing::instrument per endpoint ⚠️ partial

Phase 1.a complete on both servers; Python kind-validated 11/11, Rust validation deferred until side-by-side deploy. System playbooks can deploy against either because the API contract is byte-identical.

Validation harness: noetl/ops automation/development/validate-internal-api.sh (PR awaiting merge).

Phase 1.b (system worker pool Helm deployment) is the next unblocker for Phase 2 playbooks.

Phase 1.b — System worker pool deployment ✅ kind-validated 2026-06-02

Landed via noetl/ops#141 (closes noetl/ops#140). Bumped via ai-meta@8f40fc7.

Item Status
Deployment noetl-worker-system-pool — image localhost/noetl-worker:dev, distinct SA, env NATS_CONSUMER=noetl_worker_pool_system + NATS_FILTER_SUBJECT=noetl.commands.system.>
KEDA ScaledObject — generator-emitted NATS-JetStream trigger on consumer lag ✅ READY
RBAC — system pool's SA reads bearer-token Secret + ConfigMaps; NO direct DB / Job / Pod / Secret-create permissions (system playbooks use server API per data-access-boundary)
Bearer-token Secret stub + recipe to create real token
Token mounted via secretKeyRef env (NOETL_INTERNAL_API_TOKEN)
Pool's token authenticates against server's /api/internal/* end-to-end ✅ verified: {"pending":0}
POOL_FILTER_MAP extension in repos/noetl/noetl/core/runtime/pool_routing.pysystem_* tool kinds route to the system pool segment ⏳ pending Phase 2.a
Server-side validation — only catalog entries under system/ may declare system_* kinds ⏳ pending Phase 2.a
Helm workerPools.system values section ⏳ deferred to GKE chart update
Drift-guard test entry for the new KEDA manifest ⏳ sibling noetl/noetl PR

Acceptance for Phase 1.b: met — pool deploys, NATS consumer registered, KEDA scaled, token works end-to-end. The POOL_FILTER_MAP extension lives in Phase 2.a since the playbook ships at the same time as the routing entry.

Phase 2 — Migration system playbooks (retire Python pods)

These playbooks REPLACE today's Python pods. Each one is pure YAML using existing tool kinds.

  • system/outbox_publisher.yaml — ✅ shipped 2026-06-02 via noetl/ops#142. Per the data-access-boundary rule, claim + mark go through the server API; only the NATS publish is a direct tool call:

    1. tool: http POST /api/internal/outbox/claim — server runs the SELECT FOR UPDATE SKIP LOCKED + IN_FLIGHT marking; returns the batch.
    2. tool: nats (iterator over batch rows) — publish JSON payload to noetl.events.<tenant>.<org>.<eid>.<shard>.
    3. tool: http POST /api/internal/outbox/mark-published — server marks the batch PUBLISHED.
    4. (On per-row publish failure) tool: http POST /api/internal/outbox/mark-failed — server marks with exponential backoff (deferred; happy-path only in first cut).

    Triggered by KEDA on GET /api/internal/outbox/pending-count OR POST /api/execute path=system/outbox_publisher manually. Auth via auth: { type: bearer, credential: NOETL_INTERNAL_API_TOKEN } — alias resolved by the worker's load_keychain_env_allowlist from the pod's NOETL_INTERNAL_API_TOKEN env (wired via noetl/ops#143).

    End-to-end kind validation pending Phase 2.a.2 (server-side system/* path routing).

    Execution flow:

                ┌──────────────────────────────────┐
                │  KEDA HTTP scaler monitors:      │
                │  GET /api/internal/outbox/       │
                │      pending-count               │
                │                                  │
                │  pending > 0 → scale system pool │
                │                  1 → N replicas  │
                │                                  │
                │  Server publishes:               │
                │  noetl.commands.system.<eid>     │
                │  for system/outbox_publisher     │
                └──────────────┬───────────────────┘
                               │
                               ▼
                ┌──────────────────────────────────┐
                │  noetl-worker-system-pool        │
                │  (Rust worker, system config:    │
                │   NATS_CONSUMER=                 │
                │     noetl_worker_pool_system,    │
                │   filter=noetl.commands.system.>)│
                └──────────────┬───────────────────┘
                               │
                               │ claims command → dispatches playbook
                               ▼
     ┌─────────────────────────────────────────────────────────┐
     │  Playbook: system/outbox_publisher                      │
     │                                                         │
     │  Step 1: claim_batch                                    │
     │    tool: http                                           │
     │    POST {server}/api/internal/outbox/claim?limit=100    │
     │    auth: system-pool-sa-token                           │
     │    → server: SELECT FOR UPDATE SKIP LOCKED              │
     │              UPDATE status=IN_FLIGHT, attempts+1        │
     │    ← {rows: [{outbox_id, payload, ...}, ...]}           │
     │                                                         │
     │  Step 2: publish_batch  (iterator over rows)            │
     │    tool: nats                                           │
     │    subject: noetl.events.{tenant}.{org}.{eid}.{shard}   │
     │    payload: item.payload  (already JSON)                │
     │    → published to NATS NOETL_EVENTS stream              │
     │                                                         │
     │  Step 3: mark_published                                 │
     │    tool: http                                           │
     │    POST {server}/api/internal/outbox/mark-published     │
     │    payload: {outbox_ids: publish_batch.success_ids}     │
     │    → server: UPDATE status=PUBLISHED, published_at=now()│
     │                                                         │
     │  Step 4: mark_failed  (per-row failure branch)          │
     │    tool: http                                           │
     │    POST {server}/api/internal/outbox/mark-failed        │
     │    payload: {outbox_id, error, attempts}                │
     │    → server: UPDATE status=FAILED with backoff in       │
     │              available_at = now() + 2^attempts seconds  │
     └─────────────────────────────────────────────────────────┘
                               │
                               │ playbook completes
                               ▼
     ┌─────────────────────────────────────────────────────────┐
     │  Worker emits call.done via POST /api/events            │
     │  → server writes to event_outbox (PENDING)              │
     │  → next iteration of this playbook picks it up          │
     │    (bootstrap-tolerant; one-tick delay; audit closes)   │
     └─────────────────────────────────────────────────────────┘
    
  • system/projector — replaces python -m noetl.projector. Same boundary rule — write to noetl.event goes through the server:

    1. tool: nats — pull batch from noetl.events.> (consumer name = system-projector-<shard_id>, derived from worker pod name).
    2. tool: http POST /api/internal/events/project — server runs the batch INSERT with ON CONFLICT (event_id) DO NOTHING.
    3. Ack the NATS batch on server 2xx.

    Sharded — one playbook execution per shard, shard derived from worker_id.

    Blocked on noetl/ai-meta#52 — the noetl-tools nats tool kind today only supports js_publish / js_get_msg / js_stream_info; pulling from a durable JetStream consumer (step 1 above) needs a new js_consume operation. Server-side POST /api/internal/events/project (step 2) is already kind-validated on both Python (v4.10.1) and Rust (v2.1.2).

Acceptance: the Python noetl-outbox-publisher Deployment and noetl-projector StatefulSet retire from the kind cluster + GKE. Outbox throughput + projection lag match or exceed today's Python baseline.

Phase 3 — Additional system playbooks (pluggable surface)

The interesting part — once the runtime + first two playbooks land, the rest of the platform's internal logic moves to playbooks too:

  • system/auth — session validation, token lookup, IdP integration. Tenant override: <tenant>/system/auth_with_saml, <tenant>/system/auth_with_oidc, etc.
  • system/rbac — per-action authorisation. Tenant override supported.
  • system/scheduled_cleanup — TTL enforcement, stale-row reaping. Scheduled via cron-style tool kind.
  • system/credential_rotate — refresh long-lived tokens before expiry.
  • Custom dispatcher rules — tenant-supplied logic for routing decisions, e.g. "route all mcp calls for tenant X to pool Y".

Phase 4 — Compilation (perf + plug-in surface)

The WASM compilation pipeline that closes the perf gap for hot-loop system playbooks (projector, publisher):

  • YAML → WASM compiler — server-side at catalog register time (or first execute). Output cached by (path, version, digest).
  • wasmtime host inside noetl/worker — capability-based imports per the ADR. Different capability sets for system vs. tenant-supplied modules.
  • Hot reload — catalog version bump invalidates cache; next claim recompiles + loads. No process restart.
  • WasmPlaybook catalog kind — first-class catalog entry type (per the ADR's Option 1; Option 2 is the future "YAML stays source; WASM is internal" optimisation).

Sequencing — Step 1 cut

Step 1 = Phase 1 + system/outbox_publisher playbook (Phase 2 first half). Minimum cut that retires the Python outbox publisher pod.

Why this cut: validates the full pipe end-to-end (NATS routing → system pool worker → playbook execution → tool dispatch → Postgres + NATS effects) with the smallest scope. Once it works, adding more system playbooks is incremental.

After Step 1 lands:

  • Step 2: system/projector (retires the second Python pod).
  • Step 3: system/auth (the first non-migration system playbook).
  • Step 4: WASM compilation pipeline.

What's NOT in this umbrella

  • HTTP server FastAPI parity in Rust — separate concern, separate future umbrella if needed.
  • Rust worker tool-kind parity gaps (#47, #48) — those stay in Umbrella: Rust Worker Parity Gaps.
  • Container tool kind callback design (#43) — separate concern.

Recent activity

Date Event
2026-06-02 (morning) Issue filed during the placement discussion under #45
2026-06-02 (morning) Hot-reload trade-off matrix (libloading / WASM / sub-process / closure JIT) captured as comment on #46
2026-06-02 (morning) ADR v1 merged via noetl/docs#176 — publisher + projector classified as compiled core
2026-06-02 (morning) Cross-linked wiki pages live: noetl-server runtime-shape (capability surface) and noetl-ops system-worker-pool
2026-06-02 (afternoon) noetl/server#10 opened as #45 step 1 (Rust publisher binary)
2026-06-02 (afternoon) Architecture pivot: user decision that publisher + projector belong as system playbooks, not compiled binaries. #30 closed, #45 closed, noetl/server#10 closed, this umbrella promoted to primary. ADR v2 in flight to reflect corrected classification.
2026-06-02 (afternoon) No implementation started. Phase 1 + system/outbox_publisher ready to pick up.
2026-06-18 🎯 The system worker pool now carries the orchestrator drive by default. #108 (c) flipped NOETL_ORCHESTRATE_PLUGIN_DRIVE to default true (server#233, v3.28.0) — every triggering event now dispatches the system/orchestrate WASM plug-in to the system pool (noetl.commands.system.>) instead of driving in-process. Scale-soak-gated on kind: a 694-drive cursor+fan-out run COMPLETED with all drives claimed on the system pool, the default/shared pool getting only the real steps, and __orchestrate__ rows in noetl.event = 0. This is the first production-default consumer of the system pool — the "platform extends itself via its own primitives" thesis, live. #108 CLOSED.
2026-06-18 Committed e2e coverage for the off-server topology. #111 / e2e#59 (merged → e2e 977efc2) added kind_validate_orchestrate_offserver.sh — a self-contained kind rig that hard-asserts the drive runs off-server on the system pool (COMPLETED, __orchestrate__ in noetl.event = 0, off-server dispatch + apply via the worker, noetl_orchestrate_shadow_total absent). Live-green against server v3.28.0 (post-#110). Two operator decisions surfaced on #111: (A) retiring the in-process drive fallback is gated on prod adopting a post-#108 image first (prod still runs batch-dispatch-v1, pre-#108); (B) __orchestrate__ PENDING delivery rows accumulate in noetl.command (never reconciled, since the meta-command's events are suppressed) and want a reaping strategy.

Next concrete steps

  1. Land ADR v2 — corrects the classification (publisher + projector → plug-in ring). PR open on noetl/docs.
  2. File sub-issue on noetl/ops — Phase 1 deployment work (Helm chart + manifests + RBAC). Possibly also on noetl/noetl for the POOL_FILTER_MAP extension.
  3. Build the system worker pool deployment on kind — image is existing ghcr.io/noetl/worker, no new code needed.
  4. Write system/outbox_publisher.yaml — pure YAML using postgres + nats tool kinds. Lives in repos/ops/playbooks/system/ (or a new noetl/system-playbooks repo if it grows).
  5. Register the playbook in the kind cluster catalog under system/outbox_publisher.
  6. Validate end-to-end — insert a row into noetl.outbox (status=PENDING); confirm playbook claims it via NATS, publishes to NOETL_EVENTS, marks row PUBLISHED.
  7. Retire the Python outbox publisher pod — flip the Deployment image or just scale to 0.

Related

NoETL Dashboard

Active Umbrellas

Closed Umbrellas

Conventions

Per-repo wikis

Clone this wiki locally