system worker pool

Kadyapam edited this page Jun 2, 2026 · 1 revision

System worker pool — deploy topology

Status: Design proposal — tracked under noetl/ai-meta#45 and noetl/ai-meta#46. Not yet implemented. This page reserves the operational shape so when the work lands, the manifests have a known home.

For the architectural rationale, see the docs site: System Worker Pool and WASM Plug-in Surface.

For the implementation-level Rust crate layout, see the noetl-server wiki — Runtime shape page.

What lives in this deploy

After the full Rust migration plus the system-pool design, the NoETL cluster has:

| Workload | Image | Role | Pool | Replicas |
|---|---|---|---|---|
| noetl-server | ghcr.io/noetl/server:<v> | HTTP control plane | n/a | 1-3 |
| noetl-outbox-publisher | ghcr.io/noetl/server:<v> | Postgres outbox → NATS | n/a | 1 |
| noetl-projector | ghcr.io/noetl/server:<v> | NATS → event log | n/a | 1-N (sharded) |
| noetl-worker-rust | ghcr.io/noetl/worker:<v> | User-playbook compute | worker-rust-pool | 1-20 (KEDA) |
| noetl-worker-cpu | ghcr.io/noetl/worker:<v> | User-playbook compute, Python tools fallback | worker-cpu-01 | 1-20 (KEDA) |
| noetl-worker-system-pool | ghcr.io/noetl/server:<v> | System playbook compute (WASM) | worker-system-pool | 1-5 (KEDA) |

The system worker pool is new. It runs the same image as the server (because it shares the wasmtime host code and the capability surface), but it starts with --mode=system and uses a NATS consumer that filters on noetl.commands.system.>.

NATS routing for system traffic

The per-pool routing scheme from noetl/ai-meta#42 extends naturally:

noetl.commands              (legacy bare subject, drained post-cutover)
noetl.commands.shared.<eid>   (default — Rust + Python pools race)
noetl.commands.python.<eid>   (Python-only kinds, e.g. agent)
noetl.commands.system.<eid>   (NEW — system pool only)
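To make the filter semantics concrete, here is a minimal sketch of NATS subject matching, in which `*` matches exactly one token and `>` matches one or more trailing tokens. The helper name `subject_matches` and the example id `42` are hypothetical; this is an illustration, not NoETL or NATS client code.

```python
def subject_matches(filter_subject: str, subject: str) -> bool:
    """Return True if `subject` matches a NATS-style filter subject.

    `*` matches exactly one token; `>` matches one or more trailing
    tokens and is only valid as the last token of the filter.
    """
    f_tokens = filter_subject.split(".")
    s_tokens = subject.split(".")
    for i, tok in enumerate(f_tokens):
        if tok == ">":
            # `>` must be the last filter token and consume at least one token.
            return i == len(f_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False
        if tok != "*" and tok != s_tokens[i]:
            return False
    return len(f_tokens) == len(s_tokens)


SYSTEM_FILTER = "noetl.commands.system.>"
print(subject_matches(SYSTEM_FILTER, "noetl.commands.system.42"))  # True
print(subject_matches(SYSTEM_FILTER, "noetl.commands.shared.42"))  # False
print(subject_matches(SYSTEM_FILTER, "noetl.commands"))            # False
```

Under these rules a consumer filtering on noetl.commands.system.> sees neither shared or Python traffic nor the legacy bare subject.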

POOL_FILTER_MAP in the server gains the system family:

POOL_FILTER_MAP = {
    "agent": "python",
    "system_auth": "system",
    "system_rbac": "system",
    "system_cleanup": "system",
    "system_credential_rotate": "system",
    # ... default → "shared"
}
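As a rough illustration of how the map could drive subject selection, the sketch below resolves a tool kind to a pool family and builds the dispatch subject. The function name `command_subject`, the constant `DEFAULT_FAMILY`, and the `http` kind are assumptions made for this example; `eid` is treated as an opaque identifier.

```python
POOL_FILTER_MAP = {
    "agent": "python",
    "system_auth": "system",
    "system_rbac": "system",
    "system_cleanup": "system",
    "system_credential_rotate": "system",
}

# Assumption: any kind missing from the map routes to the shared family.
DEFAULT_FAMILY = "shared"


def command_subject(kind: str, eid: str) -> str:
    """Build the NATS subject a command for `kind` would be published on."""
    family = POOL_FILTER_MAP.get(kind, DEFAULT_FAMILY)
    return f"noetl.commands.{family}.{eid}"


print(command_subject("system_auth", "42"))  # noetl.commands.system.42
print(command_subject("agent", "42"))        # noetl.commands.python.42
print(command_subject("http", "42"))         # noetl.commands.shared.42
```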

Server-side validation: only catalog entries under the system/ path may declare system_* tool kinds. A user playbook that declares kind: system_auth is rejected at registration time.
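The registration-time check could look roughly like the sketch below. `RegistrationError`, `validate_tool_kinds`, and the example catalog paths are hypothetical names chosen for illustration; the real server may structure this check differently.

```python
class RegistrationError(ValueError):
    """Raised when a playbook declares a tool kind it is not allowed to use."""


def validate_tool_kinds(catalog_path: str, tool_kinds: list[str]) -> None:
    """Reject system_* kinds unless the entry lives under the system/ path."""
    is_system_entry = catalog_path.startswith("system/")
    for kind in tool_kinds:
        if kind.startswith("system_") and not is_system_entry:
            raise RegistrationError(
                f"{catalog_path}: kind '{kind}' is reserved for system/ entries"
            )


# A system playbook may use system kinds.
validate_tool_kinds("system/auth/check", ["system_auth"])

# A user playbook may not.
try:
    validate_tool_kinds("tenants/acme/etl", ["system_auth"])
except RegistrationError as err:
    print(f"rejected: {err}")
```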

KEDA scaler manifest (reserved shape)

To live at ci/manifests/keda/scaledobject-worker-system-pool.yaml once the work lands:

# NoETL system worker pool autoscaler.
#
# Scales `noetl-worker-system-pool` based on backlog of the
# `noetl_worker_pool_system` JetStream consumer.  System playbooks
# are typically low-frequency (auth checks, scheduled cleanups,
# credential rotation), so the pool defaults to 1 replica with
# room to scale on bursts.
#
# Generated by:
#   from noetl.core.runtime.keda import (
#       ScaledObjectSpec, build_worker_scaledobject, dump_scaledobject_yaml,
#   )
#   spec = ScaledObjectSpec(
#       worker_pool_urn="noetl://tenant/default/org/default/worker/worker-system-pool",
#       deployment="noetl-worker-system-pool",
#       nats_consumer="noetl_worker_pool_system",
#   )
#   print(dump_scaledobject_yaml(build_worker_scaledobject(spec)))
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: noetl-worker-system-pool-scaler
  namespace: noetl
  labels:
    app: noetl-worker-system-pool
    worker-pool: worker-system-pool
    managed-by: noetl
spec:
  scaleTargetRef:
    name: noetl-worker-system-pool
  minReplicaCount: 1
  maxReplicaCount: 5     # smaller cap than user pools
  pollingInterval: 10
  cooldownPeriod: 30
  triggers:
  - type: nats-jetstream
    metadata:
      natsServerMonitoringEndpoint: nats.nats.svc.cluster.local:8222
      account: NOETL
      stream: NOETL_COMMANDS
      consumer: noetl_worker_pool_system
      lagThreshold: '5'           # tighter than user pools (10)
      activationLagThreshold: '1'
      useHttps: 'false'

The smaller maxReplicaCount (5 vs 20 for user pools) reflects the expected workload: system playbooks are not high-throughput. The tighter lagThreshold (5 vs 10) keeps auth and RBAC latency low during bursts.
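To see how the two thresholds interact, the following back-of-the-envelope sketch approximates the replica count as the consumer lag divided by lagThreshold, rounded up and clamped to the min/max bounds. This is a simplification: it ignores HPA tolerance, stabilisation windows, and the polling and cooldown intervals. The function name is hypothetical.

```python
import math


def approx_desired_replicas(
    lag: int, lag_threshold: int, min_replicas: int, max_replicas: int
) -> int:
    """Approximate replica count for a given consumer lag (simplified model)."""
    wanted = math.ceil(lag / lag_threshold)
    return max(min_replicas, min(max_replicas, wanted))


for lag in (0, 5, 12, 40):
    system = approx_desired_replicas(lag, lag_threshold=5, min_replicas=1, max_replicas=5)
    user = approx_desired_replicas(lag, lag_threshold=10, min_replicas=1, max_replicas=20)
    print(f"lag={lag:>2}  system pool={system}  user pool={user}")
# lag= 0  system pool=1  user pool=1
# lag= 5  system pool=1  user pool=1
# lag=12  system pool=3  user pool=2
# lag=40  system pool=5  user pool=4
```

For the same backlog the system pool scales out sooner, but it stops at its lower cap.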

Deployment manifest (reserved shape)

To live at ci/manifests/noetl/worker-system-pool-deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: noetl-worker-system-pool
  namespace: noetl
  labels:
    app: noetl-worker-system-pool
    component: system-worker
    runtime: rust
    worker-pool: worker-system-pool
spec:
  replicas: 1
  selector:
    matchLabels:
      app: noetl-worker-system-pool
      worker-pool: worker-system-pool
  template:
    metadata:
      labels:
        app: noetl-worker-system-pool
        component: system-worker
        runtime: rust
        worker-pool: worker-system-pool
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/path: "/metrics"
        prometheus.io/port: "9090"
    spec:
      serviceAccountName: noetl-worker-system-pool   # distinct RBAC
      initContainers:
        - name: wait-for-api
          image: curlimages/curl:8.7.1
          command: ["sh", "-c", "until curl -sf http://noetl.noetl.svc.cluster.local:8082/api/health; do sleep 3; done"]
      containers:
        - name: system-pool
          image: ghcr.io/noetl/server:<v>
          imagePullPolicy: IfNotPresent
          args: ["--mode=system"]
          ports:
            - name: metrics
              containerPort: 9090
          env:
            - name: WORKER_POOL_NAME
              value: worker-system-pool
            - name: WORKER_ID
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: NATS_URL
              value: nats://noetl:noetl@nats.nats.svc.cluster.local:4222
            - name: NATS_STREAM
              value: NOETL_COMMANDS
            - name: NATS_CONSUMER
              value: noetl_worker_pool_system
            - name: NATS_FILTER_SUBJECT
              value: noetl.commands.system.>
            - name: NOETL_SERVER_URL
              value: http://noetl.noetl.svc.cluster.local:8082
            - name: WASM_MODULE_CACHE_DIR
              value: /var/cache/noetl/wasm
            - name: RUST_LOG
              value: "info,noetl_server_system_pool=debug"
          resources:
            requests:
              cpu: "100m"
              memory: "256Mi"      # WASM modules + cache
            limits:
              cpu: "1000m"
              memory: "1Gi"
          volumeMounts:
            - name: wasm-cache
              mountPath: /var/cache/noetl/wasm
      volumes:
        - name: wasm-cache
          emptyDir:
            sizeLimit: 512Mi

Key differences from worker-rust-deployment.yaml:

  • Image: ghcr.io/noetl/server (not worker) — the system pool ships in the same crate as the server because it shares the wasmtime host code and the capability surface.
  • Service account: noetl-worker-system-pool — distinct from the user-pool service account. RBAC grants:
    • read access to the catalog (for fetching system playbook YAML)
    • read and write access to the keychain (for host_get_credential / host_credential_rotate)
    • write access to noetl.event (for host_put_event)
    • read access to noetl.event (for host_read_event_log)
  • Memory request: 256Mi (vs user pool's 128Mi) for the WASM module cache.
  • Volume: wasm-cache emptyDir for compiled WASM module artefacts. A catalog version bump invalidates cached entries, which are keyed by (path, version, digest).
  • No WORKER_MAX_CONCURRENT: system pool serialises by default (one WASM execution per worker pod at a time) for determinism. Scale horizontally via KEDA instead.
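One way to picture the (path, version, digest) invalidation of the wasm-cache volume is a cache file name derived from all three values, so that a version bump or a content change simply misses the old entry. The sketch below is an assumption-heavy illustration: the function name, the use of SHA-256 as the digest, the file-name layout, and the `.cwasm` suffix are all hypothetical.

```python
import hashlib
from pathlib import Path


def wasm_cache_path(
    cache_dir: str, catalog_path: str, version: int, module_bytes: bytes
) -> Path:
    """Derive a cache file path from (path, version, digest)."""
    # Assumption: the digest is SHA-256 over the raw module bytes.
    digest = hashlib.sha256(module_bytes).hexdigest()
    safe_path = catalog_path.strip("/").replace("/", "__")
    return Path(cache_dir) / f"{safe_path}@{version}-{digest[:16]}.cwasm"


cache_dir = "/var/cache/noetl/wasm"
v1 = wasm_cache_path(cache_dir, "system/echo", 1, b"module-v1")
v2 = wasm_cache_path(cache_dir, "system/echo", 2, b"module-v2")
print(v1.name != v2.name)  # True: the bumped version misses the old entry
```

Because stale entries are never looked up again, the emptyDir sizeLimit acts as the backstop for files left behind by old versions.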

Service account + RBAC (reserved shape)

To live at ci/manifests/noetl/serviceaccount-system-pool.yaml:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: noetl-worker-system-pool
  namespace: noetl
  labels:
    app: noetl-worker-system-pool
---
# K8s-side RBAC.  Application-side RBAC (which keychain
# credentials, which catalog paths) lives in the keychain ACL
# and the catalog ACL, not here.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: noetl
  name: noetl-worker-system-pool-role
rules:
  # Read configmap with system pool config
  - apiGroups: [""]
    resources: ["configmaps"]
    resourceNames: ["noetl-worker-system-pool-config"]
    verbs: ["get", "list", "watch"]
  # Read secret with NATS + DB credentials
  - apiGroups: [""]
    resources: ["secrets"]
    resourceNames: ["noetl-worker-system-pool-secrets"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: noetl
  name: noetl-worker-system-pool-rolebinding
subjects:
  - kind: ServiceAccount
    name: noetl-worker-system-pool
    namespace: noetl
roleRef:
  kind: Role
  name: noetl-worker-system-pool-role
  apiGroup: rbac.authorization.k8s.io

Helm chart integration

The system pool extends the chart's worker-pool template (noetl/ops/helm/noetl/templates/worker-pool.yaml — present today for the Rust and Python pools). Add a third values.yaml section:

# noetl/ops/helm/noetl/values.yaml
workerPools:
  cpu-01:                       # existing — Python pool
    enabled: true
    image: ghcr.io/noetl/worker
    replicas: 1
    natsConsumer: noetl_worker_pool
  rust-pool:                    # existing — Rust user pool
    enabled: true
    image: ghcr.io/noetl/worker
    replicas: 1
    natsConsumer: noetl_worker_pool_shared
    natsFilterSubject: noetl.commands.shared.>
  system-pool:                  # NEW
    enabled: false              # default off until plug-in ring lands
    image: ghcr.io/noetl/server # NOT noetl/worker
    args: ["--mode=system"]
    replicas: 1
    natsConsumer: noetl_worker_pool_system
    natsFilterSubject: noetl.commands.system.>
    wasmCacheSize: 512Mi

When the system pool is disabled, no Deployment is rendered, no KEDA scaler is rendered, and no service account is created. The pool is opt-in per cluster.
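The opt-in behaviour can be summarised with a small sketch that models the chart's conditional rendering in Python. The function `system_pool_resources` is hypothetical and only mirrors the resource names used in the reserved manifests on this page; the actual gating would live in the Helm template.

```python
def system_pool_resources(values: dict) -> list[str]:
    """List the resources rendered for the system pool, given chart values."""
    pool = values.get("workerPools", {}).get("system-pool", {})
    if not pool.get("enabled", False):
        return []  # disabled: nothing is rendered
    return [
        "Deployment/noetl-worker-system-pool",
        "ScaledObject/noetl-worker-system-pool-scaler",
        "ServiceAccount/noetl-worker-system-pool",
    ]


print(system_pool_resources({"workerPools": {"system-pool": {"enabled": False}}}))
# []
print(len(system_pool_resources({"workerPools": {"system-pool": {"enabled": True}}})))
# 3
```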

kind validation

Per agents/rules/deployment-validation.md, every operational manifest validates on the local kind cluster before GKE. The system pool's validation rig will live at:

  • repos/ops/automation/development/system-pool-validation.yaml
  • repos/ops/automation/development/validate-system-pool.sh

Smoke-test playbook: a tiny system/echo WASM module that takes an input string and echoes it back as the result. Exercises:

  1. Catalog can store a WasmPlaybook entry
  2. Server publishes the dispatch to noetl.commands.system.<eid>
  3. System pool worker claims, fetches the WASM, executes
  4. Result lands back via POST /api/events (the same boundary the Rust user pool uses)
  5. Catalog version bump invalidates the cached module — re-claim compiles the new version

Sequencing — when each manifest lands

Per the implementation sequencing in the server wiki:

| Step | New manifests | When |
|---|---|---|
| 1 | none (publisher replaces existing Python pod's command:) | After --mode=publisher ships |
| 2 | none (projector replaces existing Python pod's command:) | After --mode=projector ships |
| 3 | none (server replaces existing Python pod's command:) | After --mode=server ships |
| 4 | All four reserved manifests above | After --mode=system ships |

The first three steps change image references in existing manifests but don't add new ones. The system pool is the only new operational surface.

Related