system worker pool

Kadyapam edited this page Jun 2, 2026 · 1 revision

System worker pool — deploy topology

Status: Design proposal — tracked under noetl/ai-meta#45 and noetl/ai-meta#46. Not yet implemented. This page reserves the operational shape so when the work lands, the manifests have a known home.

For the architectural rationale, see the docs site: System Worker Pool and WASM Plug-in Surface.

For the implementation-level Rust crate layout, see the noetl-server wiki — Runtime shape page.

What lives in this deploy

After the full Rust migration plus the system-pool design, the NoETL cluster has:

| Workload | Image | Role | Pool | Replicas |
| --- | --- | --- | --- | --- |
| noetl-server | ghcr.io/noetl/server:<v> | HTTP control plane | n/a | 1-3 |
| noetl-outbox-publisher | ghcr.io/noetl/server:<v> | Postgres outbox → NATS | n/a | 1 |
| noetl-projector | ghcr.io/noetl/server:<v> | NATS → event log | n/a | 1-N (sharded) |
| noetl-worker-rust | ghcr.io/noetl/worker:<v> | User-playbook compute | worker-rust-pool | 1-20 (KEDA) |
| noetl-worker-cpu | ghcr.io/noetl/worker:<v> | User-playbook compute, Python tools fallback | worker-cpu-01 | 1-20 (KEDA) |
| noetl-worker-system-pool | ghcr.io/noetl/server:<v> | System playbook compute (WASM) | worker-system-pool | 1-5 (KEDA) |

The system worker pool is new. It runs the same image as the server (because it shares the wasmtime host code + the capability surface), but with --mode=system and a NATS consumer that filters on noetl.commands.system.>.

NATS routing for system traffic

The per-pool routing scheme from noetl/ai-meta#42 extends naturally:

noetl.commands                (legacy bare subject, drained post-cutover)
noetl.commands.shared.<eid>   (default: Rust + Python pools race)
noetl.commands.python.<eid>   (Python-only kinds, e.g. agent)
noetl.commands.system.<eid>   (NEW: system pool only)
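
The system pool's consumer filter, noetl.commands.system.>, relies on standard NATS wildcard semantics: `*` matches exactly one token and `>` matches one or more trailing tokens. The following is a minimal Python sketch of that rule, for illustration only. The function name `subject_matches` and the sample `<eid>` value `42` are hypothetical and are not part of NoETL.

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """Return True if a NATS subject matches a filter pattern.

    `*` matches exactly one token; `>` matches one or more trailing tokens.
    """
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            # `>` must be the last pattern token and must consume at least one token.
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False
        if p != "*" and p != s_tokens[i]:
            return False
    return len(p_tokens) == len(s_tokens)


SYSTEM_FILTER = "noetl.commands.system.>"

# Hypothetical subjects; "42" stands in for an <eid> value.
for subject in (
    "noetl.commands.system.42",
    "noetl.commands.shared.42",
    "noetl.commands",
):
    print(subject, subject_matches(SYSTEM_FILTER, subject))
```

Under these semantics the filter selects only subjects in the system family, so the legacy bare subject and the shared and python families never reach the system pool's consumer.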

POOL_FILTER_MAP in the server gains the system family:

POOL_FILTER_MAP = {
    "agent": "python",
    "system_auth": "system",
    "system_rbac": "system",
    "system_cleanup": "system",
    "system_credential_rotate": "system",
    # ... default → "shared"
}
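
To make the routing concrete, here is a small sketch of how a tool kind could be resolved to a publish subject from this map. The helper `subject_for`, the constant `DEFAULT_POOL`, the sample kind `http`, and the sample `<eid>` value are all assumptions made for illustration; the server's real lookup may differ.

```python
POOL_FILTER_MAP = {
    "agent": "python",
    "system_auth": "system",
    "system_rbac": "system",
    "system_cleanup": "system",
    "system_credential_rotate": "system",
}

DEFAULT_POOL = "shared"  # hypothetical name for the "default -> shared" rule


def subject_for(kind: str, eid: str) -> str:
    """Build the NATS subject for a command of the given tool kind.

    `eid` fills the <eid> token of the subject scheme.
    """
    pool = POOL_FILTER_MAP.get(kind, DEFAULT_POOL)
    return f"noetl.commands.{pool}.{eid}"


print(subject_for("system_auth", "42"))
print(subject_for("agent", "42"))
print(subject_for("http", "42"))  # "http" is a made-up kind with no map entry
```

Any kind without an entry falls through to the shared family, which is why only the exceptions need to be listed.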

Server-side validation: only catalog entries under the system/ path may declare system_* tool kinds. A user playbook that declares kind: system_auth is rejected at registration time.
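
The validation rule can be sketched as a single check at registration. This is a hedged illustration, not the server's code: `RegistrationError`, `validate_tool_kinds`, and the example catalog paths are hypothetical names, and the sketch assumes catalog paths are plain strings with a `system/` prefix.

```python
from typing import Iterable


class RegistrationError(ValueError):
    """Raised when a catalog entry declares tool kinds it may not use."""


def validate_tool_kinds(catalog_path: str, tool_kinds: Iterable[str]) -> None:
    """Reject system_* tool kinds declared outside the system/ catalog path."""
    if catalog_path.startswith("system/"):
        return  # system entries may use any kind, including system_*
    offending = sorted({k for k in tool_kinds if k.startswith("system_")})
    if offending:
        raise RegistrationError(
            f"{catalog_path}: kinds {offending} are reserved for system/ entries"
        )


validate_tool_kinds("system/auth", ["system_auth"])  # accepted
try:
    validate_tool_kinds("examples/etl", ["http", "system_auth"])
except RegistrationError as exc:
    print("rejected:", exc)
```

Checking at registration rather than at dispatch means a user playbook can never place a command on the system subject family in the first place.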

KEDA scaler manifest (reserved shape)

To live at ci/manifests/keda/scaledobject-worker-system-pool.yaml once the work lands:

# NoETL system worker pool autoscaler.
#
# Scales `noetl-worker-system-pool` based on backlog of the
# `noetl_worker_pool_system` JetStream consumer.  System playbooks
# are typically low-frequency (auth checks, scheduled cleanups,
# credential rotation), so the pool defaults to 1 replica with
# room to scale on bursts.
#
# Generated by:
#   from noetl.core.runtime.keda import (
#       ScaledObjectSpec, build_worker_scaledobject, dump_scaledobject_yaml,
#   )
#   spec = ScaledObjectSpec(
#       worker_pool_urn="noetl://tenant/default/org/default/worker/worker-system-pool",
#       deployment="noetl-worker-system-pool",
#       nats_consumer="noetl_worker_pool_system",
#   )
#   print(dump_scaledobject_yaml(build_worker_scaledobject(spec)))
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: noetl-worker-system-pool-scaler
  namespace: noetl
  labels:
    app: noetl-worker-system-pool
    worker-pool: worker-system-pool
    managed-by: noetl
spec:
  scaleTargetRef:
    name: noetl-worker-system-pool
  minReplicaCount: 1
  maxReplicaCount: 5     # smaller cap than user pools
  pollingInterval: 10
  cooldownPeriod: 30
  triggers:
  - type: nats-jetstream
    metadata:
      natsServerMonitoringEndpoint: nats.nats.svc.cluster.local:8222
      account: NOETL
      stream: NOETL_COMMANDS
      consumer: noetl_worker_pool_system
      lagThreshold: '5'           # tighter than user pools (10)
      activationLagThreshold: '1'
      useHttps: 'false'

The smaller maxReplicaCount (5, versus 20 for user pools) reflects the expected workload: system playbooks are not high-throughput. The tighter lagThreshold (5 versus 10) makes KEDA add replicas at a smaller backlog, which keeps auth and RBAC latency low during bursts.
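
As a rough illustration of how these two settings interact, the sketch below estimates the replica count for a given consumer backlog. It assumes the usual average-value calculation (ceiling of lag divided by lagThreshold, clamped to the min and max replica counts) and ignores HPA tolerance, stabilisation windows, and cooldownPeriod, so treat the numbers as approximate. The function name `estimated_replicas` is hypothetical.

```python
import math


def estimated_replicas(lag: int, lag_threshold: int,
                       min_replicas: int, max_replicas: int) -> int:
    """Approximate replica count for a given consumer backlog."""
    wanted = math.ceil(lag / lag_threshold)
    return max(min_replicas, min(max_replicas, wanted))


# System pool settings from the manifest above: threshold 5, 1 to 5 replicas.
for lag in (0, 4, 12, 100):
    print(lag, estimated_replicas(lag, lag_threshold=5, min_replicas=1, max_replicas=5))
```

With a backlog of 12 pending commands, this estimate gives three system-pool replicas, whereas a user pool with a threshold of 10 would sit at two; a very large backlog is capped at five.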

Deployment manifest (reserved shape)

To live at ci/manifests/noetl/worker-system-pool-deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: noetl-worker-system-pool
  namespace: noetl
  labels:
    app: noetl-worker-system-pool
    component: system-worker
    runtime: rust
    worker-pool: worker-system-pool
spec:
  replicas: 1
  selector:
    matchLabels:
      app: noetl-worker-system-pool
      worker-pool: worker-system-pool
  template:
    metadata:
      labels:
        app: noetl-worker-system-pool
        component: system-worker
        runtime: rust
        worker-pool: worker-system-pool
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/path: "/metrics"
        prometheus.io/port: "9090"
    spec:
      serviceAccountName: noetl-worker-system-pool   # distinct RBAC
      initContainers:
        - name: wait-for-api
          image: curlimages/curl:8.7.1
          command: ["sh", "-c", "until curl -sf http://noetl.noetl.svc.cluster.local:8082/api/health; do sleep 3; done"]
      containers:
        - name: system-pool
          image: ghcr.io/noetl/server:<v>
          imagePullPolicy: IfNotPresent
          args: ["--mode=system"]
          ports:
            - name: metrics
              containerPort: 9090
          env:
            - name: WORKER_POOL_NAME
              value: worker-system-pool
            - name: WORKER_ID
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: NATS_URL
              value: nats://noetl:noetl@nats.nats.svc.cluster.local:4222
            - name: NATS_STREAM
              value: NOETL_COMMANDS
            - name: NATS_CONSUMER
              value: noetl_worker_pool_system
            - name: NATS_FILTER_SUBJECT
              value: noetl.commands.system.>
            - name: NOETL_SERVER_URL
              value: http://noetl.noetl.svc.cluster.local:8082
            - name: WASM_MODULE_CACHE_DIR
              value: /var/cache/noetl/wasm
            - name: RUST_LOG
              value: "info,noetl_server_system_pool=debug"
          resources:
            requests:
              cpu: "100m"
              memory: "256Mi"      # WASM modules + cache
            limits:
              cpu: "1000m"
              memory: "1Gi"
          volumeMounts:
            - name: wasm-cache
              mountPath: /var/cache/noetl/wasm
      volumes:
        - name: wasm-cache
          emptyDir:
            sizeLimit: 512Mi

Key differences from worker-rust-deployment.yaml:

  • Image: ghcr.io/noetl/server (not worker) — the system pool ships in the same crate as the server because it shares the wasmtime host code and the capability surface.
  • Service account: noetl-worker-system-pool — distinct from the user-pool service account. RBAC grants:
    • read access to the catalog (for fetching system playbook YAML)
    • read and write access to the keychain (for host_get_credential / host_credential_rotate)
    • write access to noetl.event (for host_put_event)
    • read access to noetl.event (for host_read_event_log)
  • Memory request: 256Mi (vs user pool's 128Mi) for the WASM module cache.
  • Volume: wasm-cache emptyDir for compiled WASM module artefacts. Cache entries are keyed by (path, version, digest), so a catalog version bump invalidates the old entry.
  • No WORKER_MAX_CONCURRENT: the system pool serialises by default (one WASM execution per worker pod at a time) for determinism. Scale horizontally via KEDA instead.
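
The (path, version, digest) cache key mentioned in the list above can be sketched as follows. This is an illustrative Python sketch only (the pool itself is Rust). The function name `cache_entry_path`, the integer version type, the sample digests, and the `.cwasm` file extension are assumptions, not a description of the planned implementation.

```python
import hashlib
from pathlib import PurePosixPath

# Matches WASM_MODULE_CACHE_DIR in the Deployment manifest.
CACHE_DIR = PurePosixPath("/var/cache/noetl/wasm")


def cache_entry_path(catalog_path: str, version: int, digest: str) -> PurePosixPath:
    """Derive a cache file path from the (path, version, digest) triple."""
    # NUL separators keep ("a", 12, ...) and ("a1", 2, ...) from colliding.
    raw = "\0".join((catalog_path, str(version), digest))
    key = hashlib.sha256(raw.encode("utf-8")).hexdigest()
    return CACHE_DIR / f"{key}.cwasm"  # file extension is an assumption


old = cache_entry_path("system/echo", 1, "sha256:aaa")
new = cache_entry_path("system/echo", 2, "sha256:bbb")
print(old != new)  # a version bump maps to a different cache entry
```

Because the key covers all three fields, a version bump never overwrites a compiled module in place; the old entry simply stops being looked up, and the emptyDir sizeLimit bounds how much stale data can accumulate.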

Service account + RBAC (reserved shape)

To live at ci/manifests/noetl/serviceaccount-system-pool.yaml:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: noetl-worker-system-pool
  namespace: noetl
  labels:
    app: noetl-worker-system-pool
---
# K8s-side RBAC.  Application-side RBAC (which keychain
# credentials, which catalog paths) lives in the keychain ACL
# and the catalog ACL, not here.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: noetl
  name: noetl-worker-system-pool-role
rules:
  # Read configmap with system pool config
  - apiGroups: [""]
    resources: ["configmaps"]
    resourceNames: ["noetl-worker-system-pool-config"]
    verbs: ["get", "list", "watch"]
  # Read secret with NATS + DB credentials
  - apiGroups: [""]
    resources: ["secrets"]
    resourceNames: ["noetl-worker-system-pool-secrets"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: noetl
  name: noetl-worker-system-pool-rolebinding
subjects:
  - kind: ServiceAccount
    name: noetl-worker-system-pool
    namespace: noetl
roleRef:
  kind: Role
  name: noetl-worker-system-pool-role
  apiGroup: rbac.authorization.k8s.io

Helm chart integration

The system pool extends the chart's worker-pool template (noetl/ops/helm/noetl/templates/worker-pool.yaml, present today for the Rust and Python pools). Add a third entry under workerPools in values.yaml:

# noetl/ops/helm/noetl/values.yaml
workerPools:
  cpu-01:                       # existing — Python pool
    enabled: true
    image: ghcr.io/noetl/worker
    replicas: 1
    natsConsumer: noetl_worker_pool
  rust-pool:                    # existing — Rust user pool
    enabled: true
    image: ghcr.io/noetl/worker
    replicas: 1
    natsConsumer: noetl_worker_pool_shared
    natsFilterSubject: noetl.commands.shared.>
  system-pool:                  # NEW
    enabled: false              # default off until plug-in ring lands
    image: ghcr.io/noetl/server # NOT noetl/worker
    args: ["--mode=system"]
    replicas: 1
    natsConsumer: noetl_worker_pool_system
    natsFilterSubject: noetl.commands.system.>
    wasmCacheSize: 512Mi

When the system pool is disabled, no Deployment or KEDA scaler is rendered and no service account is created. The pool is opt-in per cluster.
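
The opt-in behaviour amounts to a single gate on `workerPools.system-pool.enabled`. The sketch below models that gate in Python for illustration; the real logic would live in the Helm template, and the function name `system_pool_objects` is hypothetical. The object names are taken from the reserved manifests on this page.

```python
def system_pool_objects(values: dict) -> list:
    """List the objects rendered for the system pool given chart values."""
    pool = values.get("workerPools", {}).get("system-pool", {})
    if not pool.get("enabled", False):
        return []  # disabled: nothing is rendered or created
    return [
        "Deployment/noetl-worker-system-pool",
        "ScaledObject/noetl-worker-system-pool-scaler",
        "ServiceAccount/noetl-worker-system-pool",
    ]


print(system_pool_objects({"workerPools": {"system-pool": {"enabled": False}}}))
print(system_pool_objects({"workerPools": {"system-pool": {"enabled": True}}}))
```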

kind validation

Per agents/rules/deployment-validation.md, every operational manifest validates on the local kind cluster before GKE. The system pool's validation rig will live at:

  • repos/ops/automation/development/system-pool-validation.yaml
  • repos/ops/automation/development/validate-system-pool.sh

The smoke-test playbook is a tiny system/echo WASM module that takes an input string and echoes it back as the result. It exercises:

  1. Catalog can store a WasmPlaybook entry
  2. Server publishes the dispatch to noetl.commands.system.<eid>
  3. System pool worker claims, fetches the WASM, executes
  4. Result lands back via POST /api/events (the same boundary the Rust user pool uses)
  5. Catalog version bump invalidates the cached module — re-claim compiles the new version

Sequencing — when each manifest lands

Per the implementation sequencing in the server wiki:

| Step | New manifests | When |
| --- | --- | --- |
| 1 | (none; publisher replaces existing Python pod's command:) | After --mode=publisher ships |
| 2 | (none; projector replaces existing Python pod's command:) | After --mode=projector ships |
| 3 | (none; server replaces existing Python pod's command:) | After --mode=server ships |
| 4 | All four reserved manifests above | After --mode=system ships |

The first three steps change the image reference and command: in existing manifests but add no new ones. The system pool is the only new operational surface.
