Deployment Specification

This page is the durable reference for deploying noetl-worker into any environment. It covers the runtime contract the binary expects, the resources it consumes, the network surface it exposes, and — critically — every environment variable it reads, with the why behind each one.

This page is the single source of truth for the deployment shape. Any code change that adds, renames, removes, or shifts the meaning of an env var MUST update the Environment Variables section in the same change set. Same rule for ports, dependencies, and runtime requirements. See agents/rules/wiki-maintenance.md.

The matching deployment manifests live in noetl/ops (Helm chart + kind overlays). This wiki page describes what the manifests need to provide; the manifests are the implementation.

Component summary

Field	Value
Repo	noetl/worker
Binary	`noetl-worker`
Container image	`noetl-worker` (built from the repo's Dockerfile)
Image versioning	crates.io version pinned in `Cargo.toml`; semver releases tagged `vX.Y.Z`
Current version	see `Cargo.toml` `package.version`
Language / runtime	Rust 1.91+; Tokio multi-threaded
Process model	Single binary, single process per pod
Role	NATS pull consumer + tool dispatch. Stateless atomic-compute block per `agents/rules/execution-model.md`.

Runtime contract

What the binary expects from its environment to start cleanly:

NATS reachable at ${NATS_URL} with the JetStream consumer + stream provisioned (see NATS layout). Hard requirement; without it the worker exits.
noetl-server reachable at ${NOETL_SERVER_URL} (HTTP). The worker's per-command server_url override (set by the publishing server on every NATS notification per noetl/ai-meta#53) takes precedence at runtime; the env var is the initial default for one-server deployments.
Worker pool slot: WORKER_POOL_NAME identifies which pool this worker belongs to. Server-side runtime registration is keyed by (kind, name); multiple workers in one pool share work.
Worker identity: WORKER_ID distinguishes this individual worker. When unset, the worker generates a UUID v4 at startup (fine for ephemeral pods; for StatefulSets prefer setting it explicitly from the pod ordinal so logs + metrics stay readable across restarts).
KEDA-managed scaling: the worker is designed to be scaled by KEDA on NATS consumer lag. See Scaling.
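
The contract above can be read as a small settings loader. The following is a minimal sketch, not the worker's actual startup code: `WorkerSettings`, `load_settings`, and `effective_server_url` are hypothetical names, the environment is injected as a lookup closure so the logic can be exercised without touching the process env, and the server URL fallback uses the "typical" value quoted later on this page.

```rust
/// Startup settings named in the runtime contract (hypothetical struct).
#[derive(Debug, PartialEq)]
struct WorkerSettings {
    nats_url: String,
    server_url: String,
    pool_name: String,
    /// None means the caller should generate a UUID at startup.
    worker_id: Option<String>,
}

/// Build settings from an env-like lookup. Blank values are treated as unset
/// (an assumption of this sketch).
fn load_settings(get: impl Fn(&str) -> Option<String>) -> WorkerSettings {
    let non_empty = |name: &str| get(name).filter(|v| !v.trim().is_empty());
    WorkerSettings {
        nats_url: non_empty("NATS_URL")
            .unwrap_or_else(|| "nats://localhost:4222".to_string()),
        server_url: non_empty("NOETL_SERVER_URL")
            .unwrap_or_else(|| "http://noetl.noetl.svc:8082".to_string()),
        pool_name: non_empty("WORKER_POOL_NAME")
            .unwrap_or_else(|| "default".to_string()),
        worker_id: non_empty("WORKER_ID"),
    }
}

/// The per-command `server_url` from the NATS notification wins over the
/// env default when it is present.
fn effective_server_url<'a>(
    settings: &'a WorkerSettings,
    per_command: Option<&'a str>,
) -> &'a str {
    per_command.unwrap_or(&settings.server_url)
}
```

In a real binary the lookup would be something like `|k| std::env::var(k).ok()`; passing it in keeps the precedence rules easy to test.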

Network surface

Ports

Port	Protocol	Purpose	Bind
`9090`	HTTP	Prometheus scrape endpoint at `/metrics` + `/healthz` + `/readyz`	`${WORKER_METRICS_BIND}` (default `0.0.0.0:9090`)

The worker doesn't expose an API. All traffic flows out (NATS pull, HTTP to server) except the scrape port.

Dependencies (outbound)

Target	Protocol	Why
NATS JetStream	TCP 4222 (default)	Pull commands from `noetl.commands.*` subjects. Receive notifications with the publishing server's URL.
noetl-server	HTTP (default port 8082)	Fetch command details (`GET /api/commands/{event_id}`), claim atomically (`POST /api/commands/{event_id}/claim`), emit lifecycle events (`POST /api/events`), put result blobs (`POST /api/result_store/...`). Per-command `server_url` from the NATS notification overrides this default.
External APIs (Auth0 / Duffel / OpenAI / ...)	HTTPS	Whatever the executing playbook tool calls. Credentials resolved via NoETL keychain or `NOETL_KEYCHAIN_ENV_VARS` allowlist (see Keychain credentials).

The worker does NOT call Postgres directly. Per agents/rules/data-access-boundary.md, NoETL platform data is accessible via server API only.

Resources

Recommended starting point for production. Workers scale on NATS backlog (KEDA), so each replica should be sized for one concurrent playbook dispatch.

Resource	Request	Limit	Notes
CPU	100m	500m	Per-replica; CPU-bound only during tool execution (Python eval, JSON transforms). Idle workers consume <5m.
Memory	128Mi	768Mi	The limit must cover the memory-backed `/dev/shm` tmpfs (charged to the pod cgroup and sized to the Arrow IPC cache budget; see Shared memory below) plus the worker RSS. With the default 256 MiB cache budget the `/dev/shm` tmpfs is 320Mi, so the limit is 768Mi (320Mi shm + worker RSS headroom).
Ephemeral storage	100Mi	500Mi	Tool execution scratch (Python eval temp dirs).

WORKER_MAX_CONCURRENT (default 1) governs how many commands a single worker pod processes in parallel. Stay at 1 unless the pool's tools are I/O-bound (HTTP fetches dominated by external latency); CPU-bound tools should scale out via more replicas, not more concurrency per pod.

Shared memory (`/dev/shm`)

Required: a memory-backed /dev/shm sized to exceed the Arrow IPC cache budget.

The worker allocates one Arrow IPC shared-memory cache per process at init (ArrowIpcSharedMemoryCache::new(), budget NOETL_IPC_CACHE_BUDGET_BYTES, default 268435456 bytes = 256 MiB). The cache stages call.done results that exceed the broker's 100 KiB (102400-byte) inline budget into POSIX shared-memory regions backed by /dev/shm (via the shared_memory crate's shm_open+ftruncate+mmap).

The Kubernetes container-runtime default for /dev/shm is a 64 MiB tmpfs. When the cache writes past 64 MiB the store page-faults against the full tmpfs and the kernel delivers SIGBUS — the worker dies with exit code 135 and crash-loops. Mount a memory-backed emptyDir at /dev/shm sized above the cache budget:

spec:
  containers:
    - name: noetl-worker
      env:
        - name: NOETL_IPC_CACHE_BUDGET_BYTES
          value: "268435456"   # 256 MiB — pin next to sizeLimit below
      volumeMounts:
        - name: dshm
          mountPath: /dev/shm
      resources:
        limits:
          memory: "768Mi"      # >= sizeLimit + worker RSS
  volumes:
    - name: dshm
      emptyDir:
        medium: Memory
        sizeLimit: 320Mi        # >= NOETL_IPC_CACHE_BUDGET_BYTES + headroom

Three values move together and must stay coherent:

Value	Constraint
`NOETL_IPC_CACHE_BUDGET_BYTES`	the cache budget (256 MiB default)
`/dev/shm` `emptyDir` `sizeLimit`	`>` the budget (320Mi = budget + 64Mi headroom for tmpfs page rounding)
container memory limit	`>` `sizeLimit` + worker RSS (768Mi); the tmpfs is charged to the pod cgroup

If you raise the cache budget, raise the sizeLimit and the memory limit in lockstep. The committed manifests wiring this up live in noetl/ops ci/manifests/noetl/worker-*.yaml (fixed in ops#193).
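
The lockstep rule can be expressed as a pre-deploy sanity check. This is a sketch under stated assumptions: `check_shm_sizing` is a hypothetical helper (not part of the worker or the ops tooling), and the worker RSS figure is an estimate the operator supplies.

```rust
const MIB: u64 = 1024 * 1024;

/// Check that the three values from the table are coherent.
/// Returns the first violated constraint as an error message.
fn check_shm_sizing(
    cache_budget: u64,   // NOETL_IPC_CACHE_BUDGET_BYTES
    shm_size_limit: u64, // emptyDir sizeLimit for /dev/shm, in bytes
    memory_limit: u64,   // container memory limit, in bytes
    worker_rss: u64,     // operator's RSS estimate, in bytes (assumption)
) -> Result<(), String> {
    // The tmpfs must be strictly larger than the cache budget.
    if shm_size_limit <= cache_budget {
        return Err(format!(
            "sizeLimit {} MiB must exceed cache budget {} MiB",
            shm_size_limit / MIB,
            cache_budget / MIB
        ));
    }
    // The tmpfs is charged to the pod cgroup, so the limit must cover both.
    let needed = shm_size_limit.saturating_add(worker_rss);
    if memory_limit < needed {
        return Err(format!(
            "memory limit {} MiB is below sizeLimit + RSS = {} MiB",
            memory_limit / MIB,
            needed / MIB
        ));
    }
    Ok(())
}
```

With the defaults on this page, `check_shm_sizing(256 * MIB, 320 * MIB, 768 * MIB, 128 * MIB)` passes, while leaving `/dev/shm` at the 64 MiB runtime default fails the first check.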

Health probes

Probe	Path	Initial delay	Period	Failure threshold	Effect
Liveness	`/healthz`	10s	10s	3	Pod restart
Readiness	`/readyz`	5s	5s	3	Removed from Service endpoints (only relevant if the metrics port is fronted by a Service)

/readyz confirms NATS connection + heartbeat path is live.

NATS layout

Stream: noetl_commands (default; override via NATS_STREAM). Subjects: noetl.commands.>.
Consumer: pull-based, durable. Default name worker-pool (override via NATS_CONSUMER). Filter subject: noetl.commands (override via NATS_FILTER_SUBJECT).
Subjects the worker reads: per-pool filter shapes vary; the server publishes commands to noetl.commands.{system|shared}.<execution_id> per the Phase F sharding-design. A "shared" pool worker filters on noetl.commands.shared.*; a "system" pool worker filters on noetl.commands.system.*.
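
To make the per-pool filter shapes concrete, here is a small sketch. `filter_subject` and `subject_matches` are hypothetical helpers; the matcher follows standard NATS wildcard semantics (`*` matches exactly one token, `>` matches one or more trailing tokens) and assumes `>` only appears as the last token. How the real worker handles pools other than `system` and `shared` is not stated on this page, so the sketch returns `None` for them.

```rust
/// Derive the consumer filter for a pool, honouring an explicit override
/// (the role NATS_FILTER_SUBJECT plays).
fn filter_subject(pool: &str, base: &str, explicit: Option<&str>) -> Option<String> {
    if let Some(f) = explicit {
        return Some(f.to_string());
    }
    match pool {
        "system" | "shared" => Some(format!("{base}.{pool}.*")),
        _ => None, // behaviour for other pools is not specified here
    }
}

/// Token-wise NATS subject match: `*` = one token, `>` = one or more
/// trailing tokens.
fn subject_matches(filter: &str, subject: &str) -> bool {
    let mut f = filter.split('.');
    let mut s = subject.split('.');
    loop {
        match (f.next(), s.next()) {
            (Some(">"), Some(_)) => return true,
            (Some("*"), Some(_)) => {}
            (Some(a), Some(b)) if a == b => {}
            (None, None) => return true,
            _ => return false,
        }
    }
}
```

A shared-pool filter therefore accepts `noetl.commands.shared.<execution_id>` and rejects anything published under `noetl.commands.system`.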

Scaling

The worker is the scale-out unit per agents/rules/execution-model.md. KEDA reads NATS consumer lag and scales the Deployment between minReplicas and maxReplicas.

Standard KEDA ScaledObject for the worker pool:

triggers:
  - type: nats-jetstream
    metadata:
      account: $G
      natsServerMonitoringEndpoint: "nats.noetl.svc:8222"
      stream: noetl_commands
      consumer: worker-pool
      lagThreshold: "5"

Workers can be added, removed, or restarted freely — state lives in the server's event log + cache; the worker is stateless.
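
As a rough guide to what `lagThreshold: "5"` means in practice, the sketch below approximates the replica count as the consumer lag divided by the threshold, rounded up and clamped to the replica bounds. It is an approximation of HPA-style average-value scaling, not KEDA's exact algorithm (activation thresholds, stabilization windows, and cooldowns are ignored), and `desired_replicas` is a hypothetical name.

```rust
/// Approximate replica target for a given consumer lag.
/// Panics if the threshold is zero or the bounds are inverted.
fn desired_replicas(lag: u64, lag_threshold: u64, min: u64, max: u64) -> u64 {
    assert!(lag_threshold > 0 && min <= max);
    // Ceiling division without overflow.
    let raw = lag / lag_threshold + u64::from(lag % lag_threshold != 0);
    raw.clamp(min, max)
}
```

For example, a backlog of 23 pending commands with a threshold of 5 suggests 5 replicas, provided `maxReplicas` allows it.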

Snowflake ID generation

The worker mints its own snowflake IDs via src/snowflake.rs for client-side event IDs (per agents/rules/observability.md Principle 3). The 10-bit node id is taken from the first of these sources that is available:

NOETL_SNOWFLAKE_NODE_ID (preferred — set per pod by the deployment manifest).
NOETL_SHARD_ID (back-compat alias for the same idea).
Derived from NOETL_NODE_ID / NODE_NAME / pod hostname via FNV-1a hash to 10 bits.

For multi-replica deployments, set NOETL_SNOWFLAKE_NODE_ID explicitly to avoid hash collisions producing duplicate IDs.
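
The fallback chain can be sketched as follows. This is illustrative only: the page says the derivation uses FNV-1a reduced to 10 bits but does not say which FNV width or reduction `src/snowflake.rs` uses, so the 32-bit variant and the low-bits mask here are assumptions, as is skipping an explicit value that fails to parse or falls outside 0..=1023. `resolve_node_id` is a hypothetical name.

```rust
/// 32-bit FNV-1a (standard offset basis and prime).
fn fnv1a_32(bytes: &[u8]) -> u32 {
    let mut h: u32 = 0x811c_9dc5;
    for b in bytes {
        h ^= u32::from(*b);
        h = h.wrapping_mul(0x0100_0193);
    }
    h
}

/// Resolve the 10-bit node id from an env-like lookup.
fn resolve_node_id(get: impl Fn(&str) -> Option<String>) -> u16 {
    // 1. Explicit values win: the preferred variable, then the alias.
    for name in ["NOETL_SNOWFLAKE_NODE_ID", "NOETL_SHARD_ID"] {
        if let Some(id) = get(name).and_then(|v| v.trim().parse::<u16>().ok()) {
            if id <= 1023 {
                return id;
            }
        }
    }
    // 2. Otherwise hash the first available identifier down to 10 bits.
    for name in ["NOETL_NODE_ID", "NODE_NAME", "HOSTNAME"] {
        if let Some(v) = get(name).filter(|v| !v.is_empty()) {
            return (fnv1a_32(v.as_bytes()) & 0x3ff) as u16;
        }
    }
    0 // nothing to derive from (fallback value is an assumption)
}
```

Because only 1024 buckets exist, two hostnames can hash to the same node id, which is the collision risk the explicit per-pod setting avoids.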

The epoch comes from NOETL_SNOWFLAKE_EPOCH_MS if set, otherwise defaults to a build-time constant. Match the server's epoch (2024-01-01T00:00:00Z) to keep IDs mutually orderable across producers.
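
To show why the epoch must match across producers, here is a sketch of how a snowflake ID is typically composed. Only the 10-bit node id and the 2024-01-01 epoch come from this page; the 41-bit timestamp and 12-bit sequence widths are the conventional layout and are an assumption about `src/snowflake.rs`. `compose_id` is a hypothetical name.

```rust
/// 2024-01-01T00:00:00Z in Unix milliseconds.
const DEFAULT_EPOCH_MS: u64 = 1_704_067_200_000;
const NODE_BITS: u32 = 10;
const SEQ_BITS: u32 = 12; // assumption: conventional snowflake layout

/// Pack (elapsed ms since epoch | node id | sequence) into one u64.
/// Returns None when any field does not fit its bit width.
fn compose_id(now_ms: u64, epoch_ms: u64, node_id: u16, seq: u16) -> Option<u64> {
    let elapsed = now_ms.checked_sub(epoch_ms)?;
    if elapsed >= (1u64 << 41)
        || node_id >= (1u16 << NODE_BITS)
        || seq >= (1u16 << SEQ_BITS)
    {
        return None;
    }
    Some(
        (elapsed << (NODE_BITS + SEQ_BITS))
            | (u64::from(node_id) << SEQ_BITS)
            | u64::from(seq),
    )
}
```

The timestamp occupies the high bits, so IDs from two producers sort by time only when both subtract the same epoch; a worker with a different epoch would shift all of its IDs relative to the server's.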

Environment variables

All env vars the binary reads at startup or runtime, with the why behind each one.

Worker identity + pool

Variable	Default	Required	Why
`WORKER_ID`	(uuid v4 generated at startup)	recommended	Unique identifier for this worker pod. Embedded in event metadata + runtime registration so the server can track which worker handled which command. For StatefulSet pods, derive from the pod ordinal so logs/metrics stay stable across restarts.
`WORKER_POOL_NAME`	`default`	yes for non-default pools	Pool the worker registers under. Drives the NATS subject filter (system pool reads `noetl.commands.system.*`, shared pool reads `noetl.commands.shared.*`). Must match the pool's `runtime` row name on the server side.
`WORKER_HEARTBEAT_INTERVAL`	(see code; ~10s)	no	Seconds between heartbeat POSTs to `/api/worker/pool/heartbeat`. Lower = faster failover detection; higher = less HTTP load. Tune to match the server's `NOETL_RUNTIME_OFFLINE_SECONDS`.
`WORKER_MAX_CONCURRENT`	`1`	no	Number of commands this pod processes concurrently. See Resources.
`WORKER_METRICS_BIND`	`0.0.0.0:9090`	no	Bind address for the metrics + health HTTP server. Override only when you don't want to accept scrape traffic on all interfaces.
`WORKER_NATS_LAG_POLL_INTERVAL`	(see code)	no	Seconds between consumer-lag polls used by the worker's own lag-aware logic (independent of KEDA's external poll). Tune only when the default is shown to be too lazy or too aggressive.

Server endpoint

Variable	Default	Required	Why
`NOETL_SERVER_URL`	(built-in fallback, typically `http://noetl.noetl.svc:8082`)	yes	Initial default URL for the noetl-server HTTP API. Note: as of noetl/worker#41 the per-command `server_url` from the NATS notification overrides this at runtime, so a worker that picks up a command published by a different server replica (or by the Rust server vs Python server) will correctly route lifecycle events back to the publishing server. Use an `https://` URL when the server's TLS listener is on (see Transport security below).

Transport security (mTLS client)

Secrets Wallet Phase 4b (noetl/ai-meta#61, noetl/worker#56) — the worker half of the mTLS transport the server opts into in Phase 4a (noetl/server#103, NOETL_TLS_*). The worker's control-plane HTTP client (credential fetch, command claims, event posts) speaks plain HTTP by default; the env below makes it present a client certificate so the worker→server credential channel is authenticated + encrypted (the resolved secret no longer travels plaintext on the wire).

Variable	Default	Required	Why
`NOETL_TLS_CLIENT_CERT`	—	with `NOETL_TLS_CLIENT_KEY`	PEM client-certificate-chain path (the mTLS identity the worker presents). Both cert + key together, or neither (setting one is a fail-fast misconfig).
`NOETL_TLS_CLIENT_KEY`	—	with `NOETL_TLS_CLIENT_CERT`	PEM private-key path for the identity cert. Never logged.
`NOETL_TLS_CA`	—	for a private-CA server	PEM CA-bundle path the worker adds as a trust root when verifying the server's certificate. Needed when the server cert is signed by an internal CA rather than a public root. Independent of the identity (a publicly-trusted server needs none).

Built on the rustls-tls reqwest backend the worker already uses (reqwest::Identity::from_pem + Certificate::from_pem_bundle). Set NOETL_SERVER_URL to the server's https:// URL when these are on. The cert/key/CA are mounted from a K8s Secret; the manifests live in noetl/ops.
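
The "both or neither" rule for the identity pair can be sketched as a fail-fast check at startup. `ClientIdentity` and `client_identity` are hypothetical names; the sketch only validates the combination of paths and does not load PEM files or build a `reqwest` client.

```rust
/// Outcome of reading NOETL_TLS_CLIENT_CERT / NOETL_TLS_CLIENT_KEY.
#[derive(Debug, PartialEq)]
enum ClientIdentity {
    /// Neither variable set: no client certificate is presented.
    Plain,
    /// Both set: present this identity for mTLS.
    Pair { cert_path: String, key_path: String },
}

/// Setting exactly one of the two variables is a misconfiguration.
/// Error messages name the variables only, never file contents.
fn client_identity(
    cert: Option<String>,
    key: Option<String>,
) -> Result<ClientIdentity, String> {
    match (cert, key) {
        (None, None) => Ok(ClientIdentity::Plain),
        (Some(cert_path), Some(key_path)) => {
            Ok(ClientIdentity::Pair { cert_path, key_path })
        }
        (Some(_), None) => {
            Err("NOETL_TLS_CLIENT_CERT is set but NOETL_TLS_CLIENT_KEY is not".to_string())
        }
        (None, Some(_)) => {
            Err("NOETL_TLS_CLIENT_KEY is set but NOETL_TLS_CLIENT_CERT is not".to_string())
        }
    }
}
```

`NOETL_TLS_CA` is deliberately absent from the check because the trust root is independent of the identity.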

Init-container caveat (Phase 4c): the wait-for-api init container curls /api/health over plain HTTP — against an mTLS server it can't complete and blocks the pod in Init. It needs the client cert (mTLS curl) or a non-mTLS health path (parallels the server's mTLS httpGet-probe caveat).

Sealed credential delivery (Phase 5c)

Secrets Wallet Phase 5c (noetl/worker#58, tracks noetl/ai-meta#61) — defense in depth on top of the Phase-4 mTLS transport. mTLS encrypts the wire; sealing encrypts the credential payload to a key only this worker holds. The cleartext exists only briefly inside the worker process after unseal; never on the wire, never in the server's HTTP response body.

Variable	Default	Required	Why
`NOETL_SEALED_CREDENTIALS`	`false`	when sealing	Master toggle. When `true` / `1` / `yes` and `WORKER_ID` is set, the worker calls `GET /api/credentials/{alias}/sealed?worker_id=<WORKER_ID>` instead of the plaintext path. The server (v2.32.0 + the Phase-5b endpoint, server#109) seals the response to the X25519 public key this worker registered on `POST /api/worker/pool/register`.

Wire shape. The worker generates a long-lived X25519 keypair once at startup and includes the base64-encoded public half in the runtime JSON blob of its register payload:

{
  "name": "<WORKER_ID>",
  "component_type": "worker_pool",
  "runtime": { "kind": "rust", "worker_public_key": "<base64-32-byte-x25519-pub>" },
  "status": "ready",
  ...
}

The private half stays in-process for the worker's lifetime — only the public half ever leaves. The server-side deployment-specification § Sealed credential delivery describes the endpoint shape; the wire envelope is {alg, v, eph_pub (b64), ciphertext (b64)} carried by the same noetl_sealed algorithm constant on both sides (x25519-hkdf-sha256-chacha20-poly1305, v=1).

Zeroize. After the auth-alias resolver has injected the resolved fields into the tool config, the worker zeroizes the string values in the intermediate credential.data map so an OOM core dump or a debugger snoop on the worker pod can't recover the secret from the post-dispatch heap.

Sealing requires a worker name on the server side. The server looks up worker_public_key by the name (kind=worker_pool) of the runtime row. Workers identify themselves at register time with WORKER_ID, so that env var is required for sealing to work; absent → the wrapper logs a WARN and falls back to the plaintext path.
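
The toggle and its fallback can be sketched as a pure decision function. `CredentialFetch`, `is_truthy`, and `credential_fetch` are hypothetical names; the sealed path is the one quoted in the table above, the plaintext route is left unnamed because this page does not spell it out, and case-insensitive matching of the truthy values is an assumption. The sketch does not URL-encode the alias or worker id.

```rust
#[derive(Debug, PartialEq)]
enum CredentialFetch {
    /// Use the sealed endpoint at this path.
    Sealed { path: String },
    /// Use the plaintext path; `warn` is true when sealing was requested
    /// but WORKER_ID was missing.
    Plaintext { warn: bool },
}

/// Accepts `true`, `1`, or `yes`.
fn is_truthy(raw: Option<&str>) -> bool {
    matches!(
        raw.map(|v| v.trim().to_ascii_lowercase()).as_deref(),
        Some("true") | Some("1") | Some("yes")
    )
}

fn credential_fetch(
    sealed_flag: Option<&str>, // NOETL_SEALED_CREDENTIALS
    worker_id: Option<&str>,   // WORKER_ID
    alias: &str,
) -> CredentialFetch {
    if !is_truthy(sealed_flag) {
        return CredentialFetch::Plaintext { warn: false };
    }
    match worker_id.filter(|id| !id.is_empty()) {
        Some(id) => CredentialFetch::Sealed {
            path: format!("/api/credentials/{alias}/sealed?worker_id={id}"),
        },
        // Sealing requested but no identity: warn and fall back.
        None => CredentialFetch::Plaintext { warn: true },
    }
}
```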

NATS

Variable	Default	Required	Why
`NATS_URL`	`nats://localhost:4222`	yes	NATS server URL. In kind: `nats://nats.noetl.svc:4222`; in GKE: matches the NATS Helm release's Service.
`NATS_USER`	—	when NATS auth	NATS user.
`NATS_PASSWORD`	—	when NATS auth	NATS password. Should come from a K8s Secret.
`NATS_STREAM`	`noetl_commands`	no	JetStream stream name. Override only when running multiple NoETL deployments on one NATS cluster.
`NATS_CONSUMER`	`worker-pool`	no	Durable consumer name. All workers in one pool share the same consumer name.
`NATS_SUBJECT`	`noetl.commands`	no	Base subject prefix. Pool-specific filter (`noetl.commands.system.*` vs `noetl.commands.shared.*`) appends below this.
`NATS_FILTER_SUBJECT`	(derived from pool)	no	Explicit override for the consumer's filter subject. Use only when the pool routing scheme needs a custom filter.

Snowflake ID generation

Variable	Default	Required	Why
`NOETL_SNOWFLAKE_NODE_ID`	—	per-replica in prod	10-bit node id (0–1023) for the worker's snowflake generator. Each pod in a deployment MUST set a distinct value to avoid id collisions producing duplicate event IDs. Same shape as `NOETL_SERVER_MACHINE_ID` on the server side.
`NOETL_SHARD_ID`	—	(alias)	Back-compat alias for `NOETL_SNOWFLAKE_NODE_ID`; same semantics.
`NOETL_NODE_ID`	(HOSTNAME)	—	Fallback identifier hashed to derive the 10-bit node id when neither `NOETL_SNOWFLAKE_NODE_ID` nor `NOETL_SHARD_ID` is set. Read at startup.
`NODE_NAME`	(set by container runtime via downward API)	—	Same fallback chain as `NOETL_NODE_ID`; standard K8s downward-API value.
`NOETL_SNOWFLAKE_EPOCH_MS`	(build-time default: 2024-01-01T00Z)	no	Epoch the snowflake timestamps count from. Override only to match an alternate epoch elsewhere in the system — but the noetl-server uses 2024-01-01, so changing this on the worker breaks cross-component ordering.

Keychain credentials (`NOETL_KEYCHAIN_ENV_VARS`)

Variable	Default	Required	Why
`NOETL_KEYCHAIN_ENV_VARS`	—	when env-aliased credentials used	Comma-separated allowlist of env var names that hold credentials playbook steps can reference by alias. Example: `NOETL_KEYCHAIN_ENV_VARS=NOETL_FLIGHT_BEARER_TOKEN,OPENAI_API_KEY` permits a playbook to write `bearer_token: NOETL_FLIGHT_BEARER_TOKEN` and the worker resolves it from the env at dispatch time. The values themselves are loaded from the same env (the variable name is the alias and the var holds the secret). See `src/executor/command.rs::KEYCHAIN_ENV_ALLOWLIST_VAR`. When unset, env aliasing is disabled and all credentials must go through the NoETL keychain API.

The mechanism is a bridge for credentials that already exist as env vars in the worker's environment (because they come from GKE Workload Identity, an existing K8s Secret mount, or similar already-in-place trust per agents/rules/execution-model.md). Net-new business-logic credentials belong in the NoETL keychain, not in this allowlist.
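
The allowlist check can be sketched as below. This is not the code in `src/executor/command.rs`; `resolve_env_alias` is a hypothetical helper, and trimming whitespace around the comma-separated names is an assumption. The environment is passed in as a lookup closure so the gate can be tested in isolation.

```rust
/// Resolve a credential alias from the environment, but only when the
/// alias is named in the allowlist (the value of NOETL_KEYCHAIN_ENV_VARS).
fn resolve_env_alias(
    allowlist: Option<&str>,
    alias: &str,
    get: impl Fn(&str) -> Option<String>,
) -> Option<String> {
    // Unset allowlist: env aliasing is disabled entirely.
    let allowed = allowlist?
        .split(',')
        .map(str::trim)
        .any(|name| !name.is_empty() && name == alias);
    if allowed {
        get(alias)
    } else {
        None
    }
}
```

The important property is that a variable present in the pod env but absent from the allowlist is never readable by a playbook, so unrelated secrets in the environment stay out of reach.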

Misc / standard tooling

Variable	Default	Required	Why
`HOSTNAME`	(set by container runtime)	—	Fallback for snowflake node-id derivation.
`RUST_LOG`	(build default)	no	Standard `tracing-subscriber` filter. Set to `noetl_worker=debug` for targeted debugging.
`NOETL_IPC_CACHE_BUDGET_BYTES`	`268435456` (256 MiB)	no	Arrow IPC shared-memory cache budget for same-node zero-copy reads. The pod's `/dev/shm` must be a memory-backed tmpfs sized above this value, or the worker SIGBUSes (exit 135) when the cache fills past the 64 MiB k8s default (see Shared memory). Tune up only when working sets exceed 256 MiB; raise the `/dev/shm` `sizeLimit` and the pod memory limit in lockstep.

Event result staging (results-by-reference)

Variable	Default	Required	Why
`NOETL_EVENT_RESULT_CONTEXT_MAX_BYTES`	`102400` (100 KB)	no	Inline budget for a tool result's serialised context on the `call.done` event. A result within budget goes inline (`{status, context}`); a result over budget is staged in the durable result store (`noetl.result_store`) + the shm cache, and the event carries a `{data:{_ref}}` placeholder + a `reference` block instead — keeping the event log lean. The server orchestrator resolves those references when driving state (`hydrate_result_references`, noetl/server#197). Lower it to push more results through the reference path (ops tuning, or forcing the path in kind validation — `=256` was used to validate every PFT result by reference); raise it to keep more results inline. Read per-result in `src/executor/command.rs::inline_context_max_bytes`.
`NOETL_RESULT_MATERIALIZER_ENABLED`	`false`	no	Result materializer — shadow Feather tier (noetl/ai-meta#104 Phase B, system pool only). Truthy spawns a SEPARATE `noetl_events` consume-loop (consumer `noetl_result_materializer`, own ack cursor) that, for an over-budget result, fetches the authoritative payload (read-only), tiers it (tabular → Arrow Feather, non-tabular → JSON, small → no write) and writes the body to the derived §7 `physical_key` via the server's `PUT /api/internal/objects/{key}` — shadow, alongside `noetl.result_store`; nothing reads it until the resolve path is enabled. Never alters the authoritative result, never fails an event. Default off → not spawned (true no-op).
`NOETL_RESULT_URI_RESOLVE`	`false`	no	Resolve-by-URN read path (noetl/ai-meta#104 Phase C, consume pool). Truthy makes a step that binds the bulk of an over-budget upstream result resolve it by canonical logical URI → cell placement (server registry) → §7 key → object bytes (`GET /api/internal/objects/{key}`) instead of the legacy `noetl.result_store` fetch. On any registry miss / object miss / decode error it falls back fail-safe to the authoritative `resolve_ref` (`noetl_worker_result_resolve_total{outcome}` records `resolved_` vs `fallback_`). Default off → byte-identical legacy read path.
`NOETL_RESULT_MINT_AUTHORITATIVE`	`false`	no	Phase D minting flip (noetl/ai-meta#104 Phase D). ONE flag that makes the URN → Feather/GCS tier the authoritative result store: on the system pool it makes the result materializer the authoritative tier writer (implies `NOETL_RESULT_MATERIALIZER_ENABLED`); on the consume pool it makes resolve-by-URN the primary read path (implies `NOETL_RESULT_URI_RESOLVE`). A tier miss / parse failure still falls back fail-safe to the dual-written `noetl.result_store` (rollback safety), recorded on `noetl_worker_result_mint_authoritative_total{path}` (`tier` = authoritative tier served; `legacy_fallback` = reversible fallback served). Default off → byte-identical to Phase A–C (true no-op). The dual-write to `result_store` continues until the OQ5-gated retirement decision (NOT Phase D), so flag-off rolls back cleanly.
`NOETL_SIDE_EFFECT_BARRIER`	`false`	no	Side-effect durability barrier (noetl/ai-meta#104 Phase E, consume pool). Before the worker (re-)dispatches a side-effecting tool (per the registry classifier `noetl_tools::registry::kind_is_side_effecting`, noetl-tools 3.17.0), it checks whether the cycle's derived result URN already resolves to a durable result (the Phase C resolve-by-URN read path). If it does — the cycle already ran to a durable completion on a prior drive — the worker skips re-execution and adopts the recorded result, so an external side effect (HTTP `POST`, DB write, payment) fires exactly once across a crash-resume / re-drive. Non-side-effecting cycles are never blocked (idempotent recompute is fine); a side-effecting cycle whose result is not durable re-executes normally. Adopt-only: `resolve_by_urn` returns `Some` only on a durable tier hit, so the barrier can only ever turn a duplicate side effect into a single one — never drop work. Outcomes on `noetl_worker_side_effect_barrier_total{outcome}` (`skipped` = adopted durable result; `executed` = re-executed). Default off → byte-identical to today (the barrier block is short-circuited by the cheap flag read; dispatch unchanged).
`NOETL_RESULT_TIER_DR`	`false`	no	Result-tier DR re-derive (noetl/ai-meta#104 Phase F, system pool only). The Feather/JSON result tier is derivable from the WAL — an object's bytes are the deterministic encode of the authoritative payload and its location is computed from the logical URI — so a missing or corrupt tier object can be rebuilt from its source by re-running the materialization for its URN. Truthy puts the result materializer in verify-and-repair mode: for each over-budget result it re-derives the object + §7 key and rewrites it only when the durable object is missing or byte-divergent (corrupt); a healthy object is left untouched. The flag alone spawns the materializer (DR-only mode — no need for `NOETL_RESULT_MATERIALIZER_ENABLED`). Never alters the authoritative `result_store` source; the rewrite is byte-identical to a fresh materialization. A WAL event re-delivery is therefore a targeted DR repair. Outcomes on `noetl_worker_result_tier_dr_total{outcome}` (`present` / `rederived` / `source_gone` / `error`). Default off → byte-identical to Phase B/D (normal write path; this branch never runs).
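
Two of the decisions in the table reduce to small predicates: the inline budget check from the `NOETL_EVENT_RESULT_CONTEXT_MAX_BYTES` row and the adopt-or-execute choice from the `NOETL_SIDE_EFFECT_BARRIER` row. The sketch below is illustrative only. All names are hypothetical, falling back to the default on an unparsable budget is an assumption, and `durable_result_exists` stands in for whatever the resolve-by-URN lookup returns.

```rust
const DEFAULT_INLINE_MAX_BYTES: usize = 102_400;

/// Parse the inline budget, using the default when unset or unparsable.
fn inline_budget(raw: Option<&str>) -> usize {
    raw.and_then(|v| v.trim().parse().ok())
        .unwrap_or(DEFAULT_INLINE_MAX_BYTES)
}

#[derive(Debug, PartialEq)]
enum Placement {
    Inline,      // event carries {status, context}
    ByReference, // event carries a {data:{_ref}} placeholder + reference
}

/// A result within budget goes inline; anything larger is staged.
fn place_result(serialized_len: usize, budget: usize) -> Placement {
    if serialized_len <= budget {
        Placement::Inline
    } else {
        Placement::ByReference
    }
}

#[derive(Debug, PartialEq)]
enum BarrierOutcome {
    Executed, // tool runs normally
    Skipped,  // durable result adopted, tool not re-run
}

/// Skip only when the barrier is on, the tool is side-effecting, and a
/// durable result already exists. Every other combination executes.
fn barrier_outcome(
    barrier_on: bool,
    side_effecting: bool,
    durable_result_exists: bool,
) -> BarrierOutcome {
    if barrier_on && side_effecting && durable_result_exists {
        BarrierOutcome::Skipped
    } else {
        BarrierOutcome::Executed
    }
}
```

Because the skip requires all three conditions, the barrier can only suppress a duplicate side effect; it never drops work whose result is not yet durable.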

Off-server state builder shadow (RFC #115 Phase 4)

System worker pool only. The off-server state builder reconstructs orchestrator WorkflowState from the noetl_events WAL (not the materialized noetl.event table), walking the one-level prev_event_id chain and caching the built spine keyed by the immutable chain head (noetl/ai-meta#115 Phase 4). The shadow loop is observation-only — it proves on the running cluster that the builder reads the WAL with zero noetl.event scans and that the chain-walk + pool-side cache (hit / incremental tail-advance / cold-rebuild) behave, without touching the drive. The drive cutover that makes the drive consume this builder's state is staged behind the server's NOETL_STATE_BUILDER=offserver.

Variable	Default	Required	Why
`NOETL_STATE_BUILDER_SHADOW`	unset (off)	no	Truthy (`1`/`true`/`yes`/`on`) spawns the off-server state-builder shadow loop (system pool). It opens its own ephemeral `DeliverAll`/`AckNone` pull consumer on `noetl_events` (never the materializer's durable `noetl_materializer` consumer), replays the retained WAL into a pool-side per-execution chain index, and exercises the chain walk + cache, emitting the `state_builder` metrics below. Observation-only — no drive impact; safe to leave on for monitoring. Default off → every other worker unaffected.
`NOETL_STATE_BUILDER_STREAM`	`noetl_events`	no	The JetStream WAL stream the shadow drains. Mirror of the materializer's `NOETL_MATERIALIZER_STREAM`; only override when the stream is renamed.
`NOETL_STATE_BUILDER_BATCH`	`200`	no	Bounded pull batch (clamped 1–1000).
`NOETL_STATE_BUILDER_TIMEOUT_MS`	`2000`	no	Pull-batch expiry wait.
`NOETL_STATE_BUILDER_IDLE_SLEEP_MS`	`500`	no	Sleep when a drain comes back empty (keeps the idle loop off the CPU).
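
The truthy values and the batch clamp from the table can be sketched as two small parsers. `shadow_enabled` and `batch_size` are hypothetical names; case-insensitive matching and falling back to the default of 200 on an unparsable value are assumptions of this sketch.

```rust
/// NOETL_STATE_BUILDER_SHADOW: on for `1`, `true`, `yes`, or `on`.
fn shadow_enabled(raw: Option<&str>) -> bool {
    matches!(
        raw.map(|v| v.trim().to_ascii_lowercase()).as_deref(),
        Some("1") | Some("true") | Some("yes") | Some("on")
    )
}

/// NOETL_STATE_BUILDER_BATCH: default 200, clamped to 1..=1000.
fn batch_size(raw: Option<&str>) -> u32 {
    raw.and_then(|v| v.trim().parse::<u32>().ok())
        .unwrap_or(200)
        .clamp(1, 1000)
}
```

Clamping rather than rejecting means an out-of-range setting such as `5000` degrades to the nearest bound (1000) instead of stopping the shadow loop.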

State-builder metrics (:9090/metrics): noetl_worker_state_builder_wal_events_total (events consumed from the WAL — the WAL-read proof, RFC tenet 5) · noetl_worker_state_builder_event_scans_total (noetl.event scans the builder issued — the no-scan proof, RFC tenet 3; stays 0) · noetl_worker_state_builder_builds_total{outcome=cache_hit|incremental|cold_rebuild|incomplete} (cache effectiveness + correctness) · noetl_worker_state_builder_chain_hops (chain-walk depth per cold rebuild).

Secrets handling

Same shape as the server:

Secret	Storage	Mount as
`NATS_PASSWORD`	K8s Secret	`valueFrom.secretKeyRef`
Any value listed in `NOETL_KEYCHAIN_ENV_VARS`	K8s Secret (when used)	`valueFrom.secretKeyRef` per allowlisted name

Per agents/rules/execution-model.md "Secrets and credentials rule": business-logic credentials (third-party API tokens, tenant database DSNs) belong in the NoETL keychain via the server's /api/keychain/* routes. The allowlist mechanism above is a bridge only for credentials that already live in the pod env via existing trust (GKE Workload Identity, etc.).

Observability

Metrics: Prometheus surface at :9090/metrics per agents/rules/observability.md. Surface includes NATS pull rate, command claim outcomes, tool dispatch durations per tool kind, result-store PUT latency.
JetStream consumer lag (noetl_worker_nats_consumer_pending + noetl_worker_nats_consumer_ack_pending, labelled {stream, consumer}): a periodic lag poller queries JetStream consumer info on an independent task and sets these gauges. It covers the command consumer always, and the materializer consumer ({stream="noetl_events", consumer="noetl_materializer"}) whenever NOETL_MATERIALIZER_ENABLED=true. The latter is the earliest signal that, under the server's NOETL_EVENT_INGEST_PUBLISH_ONLY gate, published events are piling up un-materialized: it climbs even when the materializer loop itself has stalled or died (the loop can't report its own lag; the independent poller can). It is the metric the CQRS flip guardrail alerts read (see noetl/ai-meta#103 and the ops runbook noetl-cqrs-publish-only-flip.md).
Materializer counters: noetl_worker_materializer_drained_total / _projected_total / _duplicates_total / _acked_total / _project_errors_total + _cycle_duration_seconds — the ack-after-materialize loop's throughput + the no-loss redelivery surface (project_errors = a batch left un-acked to redeliver).
Scrape path (the metrics above only guard the flip if something scrapes them): the :9090/metrics port is exposed by a headless *-metrics Service per pool. On the kind dev cluster a VictoriaMetrics VMServiceScrape selects them; on prod (GKE) the cluster runs Google Managed Prometheus, which does NOT honor prometheus.io/scrape annotations — the worker pods are scraped by a GMP PodMonitoring (ci/manifests/noetl/gmp/podmonitoring-noetl.yaml, ops wiki Production monitoring (GMP)). Without that object the lag gauge + materializer counters never reach Managed Prometheus and the CQRS flip guardrail is blind.
Tracing: tracing spans on every NATS message, every tool dispatch. execution_id is a span field.
Logs: structured JSON via tracing-subscriber.

Validation procedure

When changing any of the above, validate per agents/rules/deployment-validation.md:

cargo build --release --bins
cargo test --quiet
Build image locally + load into kind: kind load docker-image …
Apply the manifests against kubectl --context kind-noetl
Smoke-test:
- Submit a playbook execution via the noetl CLI.
- Confirm the worker claims the command, runs the tool, and POSTs lifecycle events back.
- For env-var changes: verify kubectl exec … env | grep <VAR> shows the value AND the worker's startup log line confirms it was honored.

Only after kind passes does the change roll forward to Cloud Build + GKE.
