deployment specification
This page is the durable reference for deploying noetl-worker into any environment. It covers the runtime contract the binary expects, the resources it consumes, the network surface it exposes, and — critically — every environment variable it reads, with the why behind each one.
This page is the single source of truth for the deployment shape.
Any code change that adds, renames, removes, or shifts the
meaning of an env var MUST update the
Environment Variables section in the
same change set. Same rule for ports, dependencies, and
runtime requirements. See
agents/rules/wiki-maintenance.md.
The matching deployment manifests live in noetl/ops (Helm chart + kind overlays). This wiki page describes what the manifests need to provide; the manifests are the implementation.
| Field | Value |
|---|---|
| Repo | noetl/worker |
| Binary | noetl-worker |
| Container image | `noetl-worker` (built from the repo's Dockerfile) |
| Image versioning | crates.io version pinned in `Cargo.toml`; semver releases tagged `vX.Y.Z` |
| Current version | see `Cargo.toml` `package.version` |
| Language / runtime | Rust 1.91+; Tokio multi-threaded |
| Process model | Single binary, single process per pod |
| Role | NATS pull consumer + tool dispatch. Stateless atomic-compute block per agents/rules/execution-model.md. |
What the binary expects from its environment to start cleanly:
- NATS reachable at `${NATS_URL}` with the JetStream consumer + stream provisioned (see NATS layout). Hard requirement; without it the worker exits.
- noetl-server reachable at `${NOETL_SERVER_URL}` (HTTP). The worker's per-command `server_url` override (set by the publishing server on every NATS notification per noetl/ai-meta#53) takes precedence at runtime; the env var is the initial default for one-server deployments.
- Worker pool slot: `WORKER_POOL_NAME` identifies which pool this worker belongs to. Server-side runtime registration is keyed by `(kind, name)`; multiple workers in one pool share work.
- Worker identity: `WORKER_ID` distinguishes this individual worker. When unset, the worker generates a uuid (fine for ephemeral pods; for StatefulSets prefer setting it explicitly from the pod ordinal so logs + metrics stay readable across restarts).
- KEDA-managed scaling: the worker is designed to be scaled by KEDA on NATS consumer lag. See Scaling.
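The startup contract above can be sketched as a small preflight check. This is an illustrative Python sketch, not the worker's Rust code: the function names are hypothetical, and the `NOETL_SERVER_URL` fallback is only described on this page as "typically" that value, so treat it as an assumption.

```python
import uuid

# Defaults as listed on this page. The NOETL_SERVER_URL fallback is an
# assumption ("typically" this value); check the binary for the real one.
DEFAULTS = {
    "NATS_URL": "nats://localhost:4222",
    "NOETL_SERVER_URL": "http://noetl.noetl.svc:8082",
    "WORKER_POOL_NAME": "default",
}

# Variables this page marks as required in the env var tables.
REQUIRED = ("NATS_URL", "NOETL_SERVER_URL")


def unset_required(env):
    """Return the required variable names that the environment leaves unset."""
    return [name for name in REQUIRED if not env.get(name)]


def resolve_worker_config(env):
    """Return the effective startup settings for an environment mapping."""
    config = {name: env.get(name) or default for name, default in DEFAULTS.items()}
    # WORKER_ID falls back to a freshly generated uuid v4 when unset.
    config["WORKER_ID"] = env.get("WORKER_ID") or str(uuid.uuid4())
    return config
```

Calling `resolve_worker_config(os.environ)` in a deploy-time check would show which values a pod ends up with, and `unset_required` flags manifests that silently rely on built-in fallbacks.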
| Port | Protocol | Purpose | Bind |
|---|---|---|---|
| `9090` | HTTP | Prometheus scrape endpoint at `/metrics` + `/healthz` + `/readyz` | `${WORKER_METRICS_BIND}` (default `0.0.0.0:9090`) |
The worker doesn't expose an API. All traffic flows out (NATS pull, HTTP to server) except the scrape port.
| Target | Protocol | Why |
|---|---|---|
| NATS JetStream | TCP 4222 (default) | Pull commands from noetl.commands.* subjects. Receive notifications with the publishing server's URL. |
| noetl-server | HTTP (default port 8082) | Fetch command details (GET /api/commands/{event_id}), claim atomically (POST /api/commands/{event_id}/claim), emit lifecycle events (POST /api/events), put result blobs (POST /api/result_store/...). Per-command server_url from the NATS notification overrides this default. |
| External APIs (Auth0 / Duffel / OpenAI / ...) | HTTPS | Whatever the executing playbook tool calls. Credentials resolved via NoETL keychain or NOETL_KEYCHAIN_ENV_VARS allowlist (see Keychain credentials). |
The worker does NOT call Postgres directly. Per
agents/rules/data-access-boundary.md,
NoETL platform data is accessible via server API only.
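To make the server routing concrete, here is a minimal Python sketch of the URL precedence and the per-command endpoints listed in the table. It assumes the NATS notification is available as a dict with a `server_url` key; the function names are hypothetical, and the result-store path is left out because this page elides it.

```python
def effective_server_url(env_default, notification):
    """Prefer the per-command server_url; fall back to the env default."""
    override = notification.get("server_url")
    return (override or env_default).rstrip("/")


def command_endpoints(server_url, event_id):
    """Return (method, url) pairs for the server calls made per command."""
    base = server_url.rstrip("/")
    return {
        "fetch": ("GET", f"{base}/api/commands/{event_id}"),
        "claim": ("POST", f"{base}/api/commands/{event_id}/claim"),
        "events": ("POST", f"{base}/api/events"),
    }
```

With this shape, a command published by a different server replica is fetched, claimed, and reported against that replica, while single-server deployments simply never see an override.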
Recommended starting point for production. Workers scale on NATS backlog (KEDA), so each replica should be sized for one concurrent playbook dispatch.
| Resource | Request | Limit | Notes |
|---|---|---|---|
| CPU | 100m | 500m | Per-replica; CPU-bound only during tool execution (Python eval, JSON transforms). Idle workers consume <5m. |
| Memory | 128Mi | 256Mi | Dominated by the Arrow IPC cache budget (NOETL_IPC_CACHE_BUDGET_BYTES, default 256 MB → cap memory at 384Mi if you raise that). |
| Ephemeral storage | 100Mi | 500Mi | Tool execution scratch (Python eval temp dirs). |
WORKER_MAX_CONCURRENT (default 1) governs how many commands a
single worker pod processes in parallel. Stay at 1 unless the
pool's tools are I/O-bound (HTTP fetches dominated by external
latency); CPU-bound tools should scale out via more replicas, not
more concurrency per pod.
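The effect of `WORKER_MAX_CONCURRENT` can be illustrated with a bounded-concurrency sketch. This is Python `asyncio` standing in for the worker's Tokio runtime, with hypothetical names; it shows the general semaphore pattern, not the worker's actual dispatch loop.

```python
import asyncio


async def run_bounded(commands, handler, max_concurrent=1):
    """Run handler(command) for each command, at most max_concurrent at once."""
    slots = asyncio.Semaphore(max_concurrent)

    async def run_one(command):
        async with slots:
            return await handler(command)

    return await asyncio.gather(*(run_one(c) for c in commands))


async def demo(max_concurrent):
    """Return (peak concurrent handlers, results) for six fake commands."""
    running = 0
    peak = 0

    async def handler(command):
        nonlocal running, peak
        running += 1
        peak = max(peak, running)
        await asyncio.sleep(0.01)  # stand-in for an I/O-bound tool call
        running -= 1
        return command

    results = await run_bounded(range(6), handler, max_concurrent)
    return peak, results
```

Raising the limit only helps when the handler spends its time waiting, as in the `sleep` above; a CPU-bound handler would still contend for the pod's 500m CPU limit, which is why more replicas are the better lever there.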
| Probe | Path | Initial delay | Period | Failure threshold | Effect |
|---|---|---|---|---|---|
| Liveness | `/healthz` | 10s | 10s | 3 | Pod restart |
| Readiness | `/readyz` | 5s | 5s | 3 | Removed from Service endpoints (only relevant if the metrics port is fronted by a Service) |
/readyz confirms NATS connection + heartbeat path is live.
- Stream: `noetl_commands` (default; override via `NATS_STREAM`). Subjects: `noetl.commands.>`.
- Consumer: pull-based, durable. Default name `worker-pool` (override via `NATS_CONSUMER`). Filter subject: `noetl.commands` (override via `NATS_FILTER_SUBJECT`).
- Subjects the worker reads: per-pool filter shapes vary; the server publishes commands to `noetl.commands.{system|shared}.<execution_id>` per the Phase F sharding-design. A "shared" pool worker filters on `noetl.commands.shared.*`; a "system" pool worker filters on `noetl.commands.system.*`.
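The pool filters rely on standard NATS subject wildcards: `*` matches exactly one token and `>` matches one or more trailing tokens. The following Python sketch reimplements that matching rule for illustration only (the function name is hypothetical; the real matching happens inside NATS).

```python
def subject_matches(filter_subject, subject):
    """Return True if subject matches a NATS filter using * and > wildcards."""
    filter_tokens = filter_subject.split(".")
    subject_tokens = subject.split(".")
    for index, token in enumerate(filter_tokens):
        if token == ">":
            # '>' is only valid as the last token and needs at least one
            # remaining subject token to match.
            return index == len(filter_tokens) - 1 and len(subject_tokens) > index
        if index >= len(subject_tokens):
            return False
        if token != "*" and token != subject_tokens[index]:
            return False
    # Without a trailing '>', token counts must be equal.
    return len(filter_tokens) == len(subject_tokens)
```

For example, `subject_matches("noetl.commands.shared.*", "noetl.commands.shared.42")` is true, while the same subject does not match the system pool's filter, so each pool only sees its own commands.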
The worker is the scale-out unit per
agents/rules/execution-model.md.
KEDA reads NATS consumer lag and scales the Deployment between
minReplicas and maxReplicas.
Standard KEDA ScaledObject for the worker pool:
```yaml
triggers:
  - type: nats-jetstream
    metadata:
      account: $G
      natsServerMonitoringEndpoint: "nats.noetl.svc:8222"
      stream: noetl_commands
      consumer: worker-pool
      lagThreshold: "5"
```

Workers can be added, removed, or restarted freely. State lives in the server's event log + cache; the worker is stateless.
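As a rough guide to how `lagThreshold` translates into replica counts, the sketch below applies the usual average-value scaling arithmetic: lag divided by the threshold, rounded up, then clamped to the replica bounds. It is a simplification in Python with a hypothetical function name and example bounds; it ignores HPA tolerance, stabilization windows, and KEDA cooldown.

```python
import math


def desired_replicas(consumer_lag, lag_threshold, min_replicas, max_replicas):
    """Approximate the replica count targeted for a given consumer lag."""
    wanted = math.ceil(consumer_lag / lag_threshold)
    return max(min_replicas, min(max_replicas, wanted))
```

With `lagThreshold: "5"`, a backlog of 23 pending commands targets 5 replicas, assuming the bounds allow it; an empty backlog settles at `minReplicas`.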
The worker mints its own snowflake IDs via
src/snowflake.rs
for client-side event IDs (per
agents/rules/observability.md
Principle 3). The 10-bit node id comes from either:
- `NOETL_SNOWFLAKE_NODE_ID` (preferred; set per pod by the deployment manifest).
- `NOETL_SHARD_ID` (back-compat alias for the same idea).
- Derived from `NOETL_NODE_ID` / `NODE_NAME` / pod hostname via FNV-1a hash to 10 bits.
For multi-replica deployments, set NOETL_SNOWFLAKE_NODE_ID
explicitly to avoid hash collisions producing duplicate IDs.
The epoch comes from NOETL_SNOWFLAKE_EPOCH_MS if set, otherwise
defaults to a build-time constant. Match the server's epoch
(2024-01-01T00:00:00Z UTC) to keep IDs mutually orderable across
producers.
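The node-id fallback and the role of the epoch can be sketched as follows. This Python sketch makes several assumptions that `src/snowflake.rs` may not share: 32-bit FNV-1a, reduction to 10 bits by masking the low bits, and the conventional layout of timestamp, 10-bit node id, and 12-bit sequence. Only the 10-bit node id and the 2024-01-01 epoch come from this page.

```python
FNV32_OFFSET = 0x811C9DC5
FNV32_PRIME = 0x01000193
EPOCH_MS = 1_704_067_200_000  # 2024-01-01T00:00:00Z in Unix milliseconds


def fnv1a_32(data):
    """Standard 32-bit FNV-1a over a bytes object."""
    value = FNV32_OFFSET
    for byte in data:
        value ^= byte
        value = (value * FNV32_PRIME) & 0xFFFFFFFF
    return value


def node_id_from_name(name):
    """Assumed fold: keep the low 10 bits of the hash (0..1023)."""
    return fnv1a_32(name.encode("utf-8")) & 0x3FF


def compose_id(timestamp_ms, node_id, sequence):
    """Assumed layout: (ms since epoch) << 22 | node_id << 12 | sequence."""
    if not 0 <= node_id < 1024:
        raise ValueError("node_id must fit in 10 bits")
    if not 0 <= sequence < 4096:
        raise ValueError("sequence must fit in 12 bits")
    return ((timestamp_ms - EPOCH_MS) << 22) | (node_id << 12) | sequence
```

Because the timestamp occupies the high bits, IDs from different producers sort by time only when they share an epoch. With just 1024 possible node ids, hashed hostnames become likely to collide at a few dozen replicas (the birthday bound puts even odds near 38), which is the case for setting `NOETL_SNOWFLAKE_NODE_ID` explicitly.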
All env vars the binary reads at startup or runtime, with the why behind each one.
| Variable | Default | Required | Why |
|---|---|---|---|
| `WORKER_ID` | (uuid v4 generated at startup) | recommended | Unique identifier for this worker pod. Embedded in event metadata + runtime registration so the server can track which worker handled which command. For StatefulSet pods, derive from the pod ordinal so logs/metrics stay stable across restarts. |
| `WORKER_POOL_NAME` | `default` | yes for non-default pools | Pool the worker registers under. Drives the NATS subject filter (system pool reads `noetl.commands.system.*`, shared pool reads `noetl.commands.shared.*`). Must match the pool's runtime row name on the server side. |
| `WORKER_HEARTBEAT_INTERVAL` | (see code; ~10s) | no | Seconds between heartbeat POSTs to `/api/worker/pool/heartbeat`. Lower = faster failover detection; higher = less HTTP load. Tune to match the server's `NOETL_RUNTIME_OFFLINE_SECONDS`. |
| `WORKER_MAX_CONCURRENT` | `1` | no | Number of commands this pod processes concurrently. See Resources. |
| `WORKER_METRICS_BIND` | `0.0.0.0:9090` | no | Bind address for the metrics + health HTTP server. Override only when you don't want to accept scrape traffic on all interfaces. |
| `WORKER_NATS_LAG_POLL_INTERVAL` | (see code) | no | Seconds between consumer-lag polls used by the worker's own lag-aware logic (independent of KEDA's external poll). Tune only when the default is shown to be too lazy or too aggressive. |
| Variable | Default | Required | Why |
|---|---|---|---|
| `NOETL_SERVER_URL` | (built-in fallback, typically `http://noetl.noetl.svc:8082`) | yes | Initial default URL for the noetl-server HTTP API. Note: as of noetl/worker#41 the per-command `server_url` from the NATS notification overrides this at runtime, so a worker that picks up a command published by a different server replica (or by the Rust server vs Python server) will correctly route lifecycle events back to the publishing server. |
| Variable | Default | Required | Why |
|---|---|---|---|
| `NATS_URL` | `nats://localhost:4222` | yes | NATS server URL. In kind: `nats://nats.noetl.svc:4222`; in GKE: matches the NATS Helm release's Service. |
| `NATS_USER` | - | when NATS auth | NATS user. |
| `NATS_PASSWORD` | - | when NATS auth | NATS password. Should come from a K8s Secret. |
| `NATS_STREAM` | `noetl_commands` | no | JetStream stream name. Override only when running multiple NoETL deployments on one NATS cluster. |
| `NATS_CONSUMER` | `worker-pool` | no | Durable consumer name. All workers in one pool share the same consumer name. |
| `NATS_SUBJECT` | `noetl.commands` | no | Base subject prefix. Pool-specific filter (`noetl.commands.system.*` vs `noetl.commands.shared.*`) appends below this. |
| `NATS_FILTER_SUBJECT` | (derived from pool) | no | Explicit override for the consumer's filter subject. Use only when the pool routing scheme needs a custom filter. |
| Variable | Default | Required | Why |
|---|---|---|---|
| `NOETL_SNOWFLAKE_NODE_ID` | - | per-replica in prod | 10-bit node id (0–1023) for the worker's snowflake generator. Each pod in a deployment MUST set a distinct value to avoid id collisions producing duplicate event IDs. Same shape as `NOETL_SERVER_MACHINE_ID` on the server side. |
| `NOETL_SHARD_ID` | - | (alias) | Back-compat alias for `NOETL_SNOWFLAKE_NODE_ID`; same semantics. |
| `NOETL_NODE_ID` | (`HOSTNAME`) | - | Fallback identifier hashed to derive the 10-bit node id when neither `NOETL_SNOWFLAKE_NODE_ID` nor `NOETL_SHARD_ID` is set. Read at startup. |
| `NODE_NAME` | (set by container runtime via downward API) | - | Same fallback chain as `NOETL_NODE_ID`; standard K8s downward-API value. |
| `NOETL_SNOWFLAKE_EPOCH_MS` | (build-time default: 2024-01-01T00Z) | no | Epoch the snowflake timestamps count from. Override only to match an alternate epoch elsewhere in the system. The noetl-server uses 2024-01-01, so changing this on the worker breaks cross-component ordering. |
| Variable | Default | Required | Why |
|---|---|---|---|
| `NOETL_KEYCHAIN_ENV_VARS` | - | when env-aliased credentials used | Comma-separated allowlist of env var names that hold credentials playbook steps can reference by alias. Example: `NOETL_KEYCHAIN_ENV_VARS=NOETL_FLIGHT_BEARER_TOKEN,OPENAI_API_KEY` permits a playbook to write `bearer_token: NOETL_FLIGHT_BEARER_TOKEN` and the worker resolves it from the env at dispatch time. The values themselves are loaded from the same env (the variable name is the alias and the var holds the secret). See `src/executor/command.rs::KEYCHAIN_ENV_ALLOWLIST_VAR`. When unset, env aliasing is disabled and all credentials must go through the NoETL keychain API. |
The mechanism is a bridge for credentials that already exist
as env vars in the worker's environment (because they come from
GKE Workload Identity, an existing K8s Secret mount, or similar
already-in-place trust per
agents/rules/execution-model.md).
Net-new business-logic credentials belong in the NoETL keychain,
not in this allowlist.
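The allowlist behaviour described above can be sketched in a few lines. This is an illustrative Python sketch with hypothetical function names, not the logic in `src/executor/command.rs`; details such as whitespace handling around names are assumptions.

```python
def parse_allowlist(raw):
    """Split a comma-separated allowlist into a set of env var names."""
    return {name.strip() for name in (raw or "").split(",") if name.strip()}


def resolve_env_alias(alias, env):
    """Return the secret for an allowlisted alias, or None otherwise.

    None signals that the credential must come from the NoETL keychain
    API instead, including when the allowlist itself is unset.
    """
    allowed = parse_allowlist(env.get("NOETL_KEYCHAIN_ENV_VARS"))
    if alias not in allowed:
        return None
    return env.get(alias)
```

The key property is that a playbook cannot read arbitrary worker env vars: only names the operator has listed resolve, and everything else falls through to the keychain.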
| Variable | Default | Required | Why |
|---|---|---|---|
| `HOSTNAME` | (set by container runtime) | - | Fallback for snowflake node-id derivation. |
| `RUST_LOG` | (build default) | no | Standard `tracing-subscriber` filter. Set to `noetl_worker=debug` for targeted debugging. |
| `NOETL_IPC_CACHE_BUDGET_BYTES` | `268435456` (256 MB) | no | Arrow IPC shared-memory cache budget for same-node zero-copy reads. Tune up only when working sets exceed 256 MB; cap pod memory limit accordingly. |
Same shape as the server:
| Secret | Storage | Mount as |
|---|---|---|
| `NATS_PASSWORD` | K8s Secret | `valueFrom.secretKeyRef` |
| Any value listed in `NOETL_KEYCHAIN_ENV_VARS` | K8s Secret (when used) | `valueFrom.secretKeyRef` per allowlisted name |
Per agents/rules/execution-model.md
"Secrets and credentials rule": business-logic credentials
(third-party API tokens, tenant database DSNs) belong in the
NoETL keychain via the server's /api/keychain/* routes. The
allowlist mechanism above is a bridge only for credentials
that already live in the pod env via existing trust (GKE Workload
Identity, etc.).
- Metrics: Prometheus surface at `:9090/metrics` per `agents/rules/observability.md`. Surface includes NATS pull rate, command claim outcomes, tool dispatch durations per tool kind, result-store PUT latency.
- Tracing: `tracing` spans on every NATS message, every tool dispatch. `execution_id` is a span field.
- Logs: structured JSON via `tracing-subscriber`.
When changing any of the above, validate per
agents/rules/deployment-validation.md:
- `cargo build --release --bins`
- `cargo test --quiet`
- Build image locally + load into kind: `kind load docker-image …`
- Apply the manifests against `kubectl --context kind-noetl`
- Smoke-test:
  - Submit a playbook execution via the noetl CLI.
  - Confirm the worker claims the command, runs the tool, and POSTs lifecycle events back.
  - For env-var changes: verify `kubectl exec … env | grep <VAR>` shows the value AND the worker's startup log line confirms it was honored.
Only after kind passes does the change roll forward to Cloud Build + GKE.
- noetl-executor adoption: the shared execution core the worker pulls from crates.io.
- release-pipeline: Cargo + Docker image publish flow.
- worker-credentials: credential resolution in tool dispatch.
- nats-mcp-tool-kinds: tool kinds the worker dispatches.
- noetl/server `sharding-design`: the cross-component routing context.
- noetl/server `deployment-specification`: the server-side counterpart of this page.
- noetl/ops Helm chart + manifests: the deployment-time implementation.
- `agents/rules/wiki-maintenance.md`: the rule that requires this page to update in lockstep with code.