deployment specification
This page is the durable reference for deploying noetl-worker into any environment. It covers the runtime contract the binary expects, the resources it consumes, the network surface it exposes, and — critically — every environment variable it reads, with the why behind each one.
This page is the single source of truth for the deployment shape.
Any code change that adds, renames, removes, or shifts the
meaning of an env var MUST update the
Environment Variables section in the
same change set. Same rule for ports, dependencies, and
runtime requirements. See
agents/rules/wiki-maintenance.md.
The matching deployment manifests live in noetl/ops (Helm chart + kind overlays). This wiki page describes what the manifests need to provide; the manifests are the implementation.
| Field | Value |
|---|---|
| Repo | noetl/worker |
| Binary | noetl-worker |
| Container image | noetl-worker (built from the repo's Dockerfile) |
| Image versioning | crates.io version pinned in Cargo.toml; semver releases tagged vX.Y.Z |
| Current version | see Cargo.toml package.version |
| Language / runtime | Rust 1.91+; Tokio multi-threaded |
| Process model | Single binary, single process per pod |
| Role | NATS pull consumer + tool dispatch. Stateless atomic-compute block per agents/rules/execution-model.md. |
What the binary expects from its environment to start cleanly:
- NATS reachable at ${NATS_URL} with the JetStream consumer + stream provisioned (see NATS layout). Hard requirement; without it the worker exits.
- noetl-server reachable at ${NOETL_SERVER_URL} (HTTP). The worker's per-command server_url override (set by the publishing server on every NATS notification per noetl/ai-meta#53) takes precedence at runtime; the env var is the initial default for one-server deployments.
- Worker pool slot: WORKER_POOL_NAME identifies which pool this worker belongs to. Server-side runtime registration is keyed by (kind, name); multiple workers in one pool share work.
- Worker identity: WORKER_ID distinguishes this individual worker. When unset, the worker generates a uuid (fine for ephemeral pods; for StatefulSets prefer setting it explicitly from the pod ordinal so logs + metrics stay readable across restarts).
- KEDA-managed scaling: the worker is designed to be scaled by KEDA on NATS consumer lag. See Scaling.
| Port | Protocol | Purpose | Bind |
|---|---|---|---|
| 9090 | HTTP | Prometheus scrape endpoint at /metrics + /healthz + /readyz | ${WORKER_METRICS_BIND} (default 0.0.0.0:9090) |
The worker doesn't expose an API. All traffic flows out (NATS pull, HTTP to server) except the scrape port.
| Target | Protocol | Why |
|---|---|---|
| NATS JetStream | TCP 4222 (default) | Pull commands from noetl.commands.* subjects. Receive notifications with the publishing server's URL. |
| noetl-server | HTTP (default port 8082) | Fetch command details (GET /api/commands/{event_id}), claim atomically (POST /api/commands/{event_id}/claim), emit lifecycle events (POST /api/events), put result blobs (POST /api/result_store/...). Per-command server_url from the NATS notification overrides this default. |
| External APIs (Auth0 / Duffel / OpenAI / ...) | HTTPS | Whatever the executing playbook tool calls. Credentials resolved via NoETL keychain or NOETL_KEYCHAIN_ENV_VARS allowlist (see Keychain credentials). |
The worker does NOT call Postgres directly. Per
agents/rules/data-access-boundary.md,
NoETL platform data is accessible via server API only.
Recommended starting point for production. Workers scale on NATS backlog (KEDA), so each replica should be sized for one concurrent playbook dispatch.
| Resource | Request | Limit | Notes |
|---|---|---|---|
| CPU | 100m | 500m | Per-replica; CPU-bound only during tool execution (Python eval, JSON transforms). Idle workers consume <5m. |
| Memory | 128Mi | 768Mi | The limit must cover the memory-backed /dev/shm tmpfs (charged to the pod cgroup, sized to the Arrow IPC cache budget — see Shared memory below) plus the worker RSS. With the 256 MB cache budget the /dev/shm tmpfs is 320Mi, so the limit is 768Mi (320Mi shm + worker RSS headroom). |
| Ephemeral storage | 100Mi | 500Mi | Tool execution scratch (Python eval temp dirs). |
WORKER_MAX_CONCURRENT (default 1) governs how many commands a
single worker pod processes in parallel. Stay at 1 unless the
pool's tools are I/O-bound (HTTP fetches dominated by external
latency); CPU-bound tools should scale out via more replicas, not
more concurrency per pod.
Required: a memory-backed /dev/shm sized to exceed the Arrow IPC
cache budget.
The worker allocates one Arrow IPC shared-memory cache per process at
init (ArrowIpcSharedMemoryCache::new(), budget
NOETL_IPC_CACHE_BUDGET_BYTES, default 256 MB). The cache stages
call.done results that exceed the broker's 100 KB inline budget into
POSIX shared-memory regions backed by /dev/shm (via the
shared_memory crate's shm_open+ftruncate+mmap).
The Kubernetes container-runtime default for /dev/shm is a 64 MiB
tmpfs. When the cache writes past 64 MiB the store page-faults against
the full tmpfs and the kernel delivers SIGBUS — the worker dies
with exit code 135 and crash-loops. Mount a memory-backed
emptyDir at /dev/shm sized above the cache budget:
```yaml
spec:
  containers:
    - name: noetl-worker
      env:
        - name: NOETL_IPC_CACHE_BUDGET_BYTES
          value: "268435456"   # 256 MiB; pin next to sizeLimit below
      volumeMounts:
        - name: dshm
          mountPath: /dev/shm
      resources:
        limits:
          memory: "768Mi"      # >= sizeLimit + worker RSS
  volumes:
    - name: dshm
      emptyDir:
        medium: Memory
        sizeLimit: 320Mi       # >= NOETL_IPC_CACHE_BUDGET_BYTES + headroom
```

Three values move together and must stay coherent:
| Value | Constraint |
|---|---|
| NOETL_IPC_CACHE_BUDGET_BYTES | the cache budget (256 MiB default) |
| /dev/shm emptyDir sizeLimit | > the budget (320Mi = budget + 64Mi headroom for tmpfs page rounding) |
| container memory limit | > sizeLimit + worker RSS (768Mi); the tmpfs is charged to the pod cgroup |
If you raise the cache budget, raise the sizeLimit and the memory
limit in lockstep. The committed manifests wiring this up live in
noetl/ops ci/manifests/noetl/worker-*.yaml
(fixed in ops#193).
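
The coherence rule for these three values can be checked mechanically before a manifest is applied. The following is a minimal sketch, not part of the worker or the ops repo: the function name `check_shm_sizing` and the 128 MiB worker RSS figure are illustrative assumptions, and it treats "memory limit equal to sizeLimit + RSS" as acceptable, following the manifest comment.

```python
MIB = 1024 * 1024


def check_shm_sizing(cache_budget, shm_size_limit, memory_limit, worker_rss):
    """Return the list of violated sizing constraints (empty means coherent).

    All arguments are byte counts. worker_rss is an operator estimate; the
    page does not state a measured value.
    """
    problems = []
    # The tmpfs must be strictly larger than the Arrow IPC cache budget.
    if shm_size_limit <= cache_budget:
        problems.append("/dev/shm sizeLimit must exceed the cache budget")
    # The tmpfs is charged to the pod cgroup, so the limit must cover both.
    if memory_limit < shm_size_limit + worker_rss:
        problems.append("memory limit must cover sizeLimit + worker RSS")
    return problems


# Values from the manifest above, with an assumed 128 MiB of worker RSS.
print(check_shm_sizing(256 * MIB, 320 * MIB, 768 * MIB, 128 * MIB))
# The 64 MiB runtime default against the 256 MiB budget fails the first rule.
print(check_shm_sizing(256 * MIB, 64 * MIB, 768 * MIB, 128 * MIB))
```

Running a check like this in CI against rendered manifests is one way to catch a raised cache budget that was not matched by a larger sizeLimit.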
| Probe | Path | Initial delay | Period | Failure threshold | Effect |
|---|---|---|---|---|---|
| Liveness | /healthz | 10s | 10s | 3 | Pod restart |
| Readiness | /readyz | 5s | 5s | 3 | Removed from Service endpoints (only relevant if the metrics port is fronted by a Service) |
/readyz confirms NATS connection + heartbeat path is live.
- Stream: noetl_commands (default; override via NATS_STREAM). Subjects: noetl.commands.>.
- Consumer: pull-based, durable. Default name worker-pool (override via NATS_CONSUMER). Filter subject: noetl.commands (override via NATS_FILTER_SUBJECT).
- Subjects the worker reads: per-pool filter shapes vary; the server publishes commands to noetl.commands.{system|shared}.<execution_id> per the Phase F sharding-design. A "shared" pool worker filters on noetl.commands.shared.*; a "system" pool worker filters on noetl.commands.system.*.
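
To make the per-pool filters concrete, here is a small sketch of standard NATS subject matching, where `*` matches exactly one token and `>` matches one or more trailing tokens. It is a standalone illustration, not the worker's code; `subject_matches` and the execution id `12345` are hypothetical.

```python
def subject_matches(filter_subject, subject):
    """Return True if `subject` is selected by the NATS-style `filter_subject`."""
    f_tokens = filter_subject.split(".")
    s_tokens = subject.split(".")
    for i, tok in enumerate(f_tokens):
        if tok == ">":
            # '>' is only valid as the last token and needs at least one token to match.
            return i == len(f_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False
        if tok != "*" and tok != s_tokens[i]:
            return False
    # Without a trailing '>', both sides must have the same number of tokens.
    return len(f_tokens) == len(s_tokens)


published = "noetl.commands.shared.12345"  # hypothetical execution id
print(subject_matches("noetl.commands.shared.*", published))  # shared pool filter
print(subject_matches("noetl.commands.system.*", published))  # system pool filter
print(subject_matches("noetl.commands.>", published))         # stream subjects
```

The first and third checks select the subject; the system-pool filter does not, which is what keeps the two pools from consuming each other's commands.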
The worker is the scale-out unit per
agents/rules/execution-model.md.
KEDA reads NATS consumer lag and scales the Deployment between
minReplicas and maxReplicas.
Standard KEDA ScaledObject for the worker pool:
```yaml
triggers:
  - type: nats-jetstream
    metadata:
      account: $G
      natsServerMonitoringEndpoint: "nats.noetl.svc:8222"
      stream: noetl_commands
      consumer: worker-pool
      lagThreshold: "5"
```

Workers can be added, removed, or restarted freely: state lives in the server's event log + cache, and the worker is stateless.
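
As a rough intuition for how the lag threshold translates into replicas, the sketch below applies the usual average-value rule (replicas ≈ lag / threshold, rounded up, clamped to the configured bounds). It is an approximation under stated assumptions: KEDA and the HPA also apply activation thresholds, tolerances, and stabilization windows that are not modeled here, and the minReplicas / maxReplicas values used are illustrative, not taken from the ops manifests.

```python
import math


def desired_replicas(lag, lag_threshold, min_replicas, max_replicas):
    """Approximate replica count for a given consumer lag."""
    raw = math.ceil(lag / lag_threshold) if lag > 0 else 0
    # Clamp to the bounds configured on the ScaledObject.
    return max(min_replicas, min(max_replicas, raw))


# lagThreshold "5" from the trigger above; bounds of 1 and 10 are assumed.
for lag in (0, 12, 500):
    print(lag, desired_replicas(lag, 5, 1, 10))
```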
The worker mints its own snowflake IDs via src/snowflake.rs for client-side event IDs (per agents/rules/observability.md Principle 3). The 10-bit node id comes from one of the following, in order of preference:
- NOETL_SNOWFLAKE_NODE_ID (preferred; set per pod by the deployment manifest).
- NOETL_SHARD_ID (back-compat alias for the same idea).
- Derived from NOETL_NODE_ID / NODE_NAME / pod hostname via FNV-1a hash to 10 bits.
For multi-replica deployments, set NOETL_SNOWFLAKE_NODE_ID
explicitly to avoid hash collisions producing duplicate IDs.
The epoch comes from NOETL_SNOWFLAKE_EPOCH_MS if set, otherwise
defaults to a build-time constant. Match the server's epoch
(2024-01-01T00:00:00Z UTC) to keep IDs mutually orderable across
producers.
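
The node-id fallback chain and the collision risk are easier to see in code. The sketch below is illustrative only; src/snowflake.rs is authoritative. It assumes the 32-bit FNV-1a variant, assumes the hash is folded to 10 bits by masking the low bits, and assumes NOETL_SNOWFLAKE_NODE_ID is checked before NOETL_SHARD_ID. The helper names are hypothetical.

```python
from datetime import datetime, timezone

FNV32_OFFSET = 0x811C9DC5
FNV32_PRIME = 0x01000193


def fnv1a_32(data):
    """Standard 32-bit FNV-1a: XOR each byte in, then multiply by the prime."""
    h = FNV32_OFFSET
    for b in data:
        h ^= b
        h = (h * FNV32_PRIME) & 0xFFFFFFFF
    return h


def derive_node_id(env):
    """Resolve a 10-bit node id (0..1023) from an environment mapping."""
    for key in ("NOETL_SNOWFLAKE_NODE_ID", "NOETL_SHARD_ID"):
        if key in env:
            node_id = int(env[key])
            if not 0 <= node_id <= 1023:
                raise ValueError(f"{key} must be in 0..1023, got {node_id}")
            return node_id
    for key in ("NOETL_NODE_ID", "NODE_NAME", "HOSTNAME"):
        if env.get(key):
            # Assumed folding: keep the low 10 bits of the hash.
            return fnv1a_32(env[key].encode("utf-8")) & 0x3FF
    raise ValueError("no node identity available")


# The server epoch named above, expressed in milliseconds.
SERVER_EPOCH_MS = int(datetime(2024, 1, 1, tzinfo=timezone.utc).timestamp() * 1000)

print(derive_node_id({"NOETL_SNOWFLAKE_NODE_ID": "7"}))
print(derive_node_id({"HOSTNAME": "noetl-worker-0"}))
print(SERVER_EPOCH_MS)
```

With only 1024 possible node ids, hashed hostnames can collide once a deployment has more than a few dozen replicas, which is why setting the id explicitly is the recommended path.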
All env vars the binary reads at startup or runtime, with the why behind each one.
| Variable | Default | Required | Why |
|---|---|---|---|
| WORKER_ID | (uuid v4 generated at startup) | recommended | Unique identifier for this worker pod. Embedded in event metadata + runtime registration so the server can track which worker handled which command. For StatefulSet pods, derive from the pod ordinal so logs/metrics stay stable across restarts. |
| WORKER_POOL_NAME | default | yes for non-default pools | Pool the worker registers under. Drives the NATS subject filter (system pool reads noetl.commands.system.*, shared pool reads noetl.commands.shared.*). Must match the pool's runtime row name on the server side. |
| WORKER_HEARTBEAT_INTERVAL | (see code; ~10s) | no | Seconds between heartbeat POSTs to /api/worker/pool/heartbeat. Lower = faster failover detection; higher = less HTTP load. Tune to match the server's NOETL_RUNTIME_OFFLINE_SECONDS. |
| WORKER_MAX_CONCURRENT | 1 | no | Number of commands this pod processes concurrently. See Resources. |
| WORKER_METRICS_BIND | 0.0.0.0:9090 | no | Bind address for the metrics + health HTTP server. Override only when you don't want to accept scrape traffic on all interfaces. |
| WORKER_NATS_LAG_POLL_INTERVAL | (see code) | no | Seconds between consumer-lag polls used by the worker's own lag-aware logic (independent of KEDA's external poll). Tune only when the default is shown to be too lazy or too aggressive. |
| Variable | Default | Required | Why |
|---|---|---|---|
| NOETL_SERVER_URL | (built-in fallback, typically http://noetl.noetl.svc:8082) | yes | Initial default URL for the noetl-server HTTP API. Note: as of noetl/worker#41 the per-command server_url from the NATS notification overrides this at runtime, so a worker that picks up a command published by a different server replica (or by the Rust server vs Python server) will correctly route lifecycle events back to the publishing server. Use an https:// URL when the server's TLS listener is on (see Transport security below). |
Secrets Wallet Phase 4b (noetl/ai-meta#61,
noetl/worker#56) — the worker half of
the mTLS transport the server opts into in Phase 4a
(noetl/server#103, NOETL_TLS_*).
The worker's control-plane HTTP client (credential fetch, command claims, event
posts) speaks plain HTTP by default; the env below makes it present a client
certificate so the worker→server credential channel is authenticated +
encrypted (the resolved secret no longer travels plaintext on the wire).
| Variable | Default | Required | Why |
|---|---|---|---|
| NOETL_TLS_CLIENT_CERT | - | with NOETL_TLS_CLIENT_KEY | PEM client-certificate-chain path (the mTLS identity the worker presents). Both cert + key together, or neither (setting one is a fail-fast misconfig). |
| NOETL_TLS_CLIENT_KEY | - | with NOETL_TLS_CLIENT_CERT | PEM private-key path for the identity cert. Never logged. |
| NOETL_TLS_CA | - | for a private-CA server | PEM CA-bundle path the worker adds as a trust root when verifying the server's certificate. Needed when the server cert is signed by an internal CA rather than a public root. Independent of the identity (a publicly-trusted server needs none). |
Built on the rustls-tls reqwest backend the worker already uses
(reqwest::Identity::from_pem + Certificate::from_pem_bundle). Set
NOETL_SERVER_URL to the server's https:// URL when these are on. The
cert/key/CA are mounted from a K8s Secret; the manifests live in
noetl/ops.
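
The "both or neither" rule for the client identity, and the independence of the CA bundle, can be sketched as a startup check. This is an illustration of the rule stated in the table, not the worker's actual validation; `validate_tls_env` and the file paths are hypothetical.

```python
def validate_tls_env(env):
    """Validate the NOETL_TLS_* variables from an environment mapping.

    Raises ValueError when only one half of the client identity is set.
    """
    cert = env.get("NOETL_TLS_CLIENT_CERT")
    key = env.get("NOETL_TLS_CLIENT_KEY")
    if bool(cert) != bool(key):
        raise ValueError(
            "NOETL_TLS_CLIENT_CERT and NOETL_TLS_CLIENT_KEY must be set together"
        )
    return {
        "mtls_identity": bool(cert),                 # present a client certificate
        "custom_ca": bool(env.get("NOETL_TLS_CA")),  # extra trust root for a private CA
    }


# Hypothetical mount paths for a K8s Secret volume.
print(validate_tls_env({
    "NOETL_TLS_CLIENT_CERT": "/etc/noetl/tls/tls.crt",
    "NOETL_TLS_CLIENT_KEY": "/etc/noetl/tls/tls.key",
    "NOETL_TLS_CA": "/etc/noetl/tls/ca.crt",
}))
```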
Init-container caveat (Phase 4c): the wait-for-api init container curls
/api/health over plain HTTP — against an mTLS server it can't complete and
blocks the pod in Init. It needs the client cert (mTLS curl) or a non-mTLS
health path (parallels the server's mTLS httpGet-probe caveat).
Secrets Wallet Phase 5c (noetl/worker#58, tracks noetl/ai-meta#61) — defense in depth on top of the Phase-4 mTLS transport. mTLS encrypts the wire; sealing encrypts the credential payload to a key only this worker holds. The cleartext exists only briefly inside the worker process after unseal; never on the wire, never in the server's HTTP response body.
| Variable | Default | Required | Why |
|---|---|---|---|
| NOETL_SEALED_CREDENTIALS | false | when sealing | Master toggle. When true / 1 / yes and WORKER_ID is set, the worker calls GET /api/credentials/{alias}/sealed?worker_id=<WORKER_ID> instead of the plaintext path. The server (v2.32.0 + the Phase-5b endpoint, server#109) seals the response to the X25519 public key this worker registered on POST /api/worker/pool/register. |
Wire shape. The worker generates a long-lived X25519 keypair once at
startup and includes the base64 public half in its register payload's runtime
JSON blob:

```json
{
  "name": "<WORKER_ID>",
  "component_type": "worker_pool",
  "runtime": { "kind": "rust", "worker_public_key": "<base64-32-byte-x25519-pub>" },
  "status": "ready",
  ...
}
```

The private half stays in-process for the worker's lifetime; only the public
half ever leaves. The server-side
deployment-specification § Sealed credential delivery
describes the endpoint shape; the wire envelope is
{alg, v, eph_pub (b64), ciphertext (b64)} carried by the same noetl_sealed
algorithm constant on both sides
(x25519-hkdf-sha256-chacha20-poly1305, v=1).
Zeroize. After the auth-alias resolver has injected the resolved fields
into the tool config, the worker zeroizes the string values in the
intermediate credential.data map so an OOM core dump or a debugger snoop
on the worker pod can't recover the secret from the post-dispatch heap.
Sealing requires a worker name on the server side. The server looks up
worker_public_key by the name (kind=worker_pool) of the runtime row.
Workers identify themselves at register time with WORKER_ID, so that env
var is required for sealing to work; absent → the wrapper logs a WARN and
falls back to the plaintext path.
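
The toggle-plus-identity rule and the envelope shape can be sketched as follows. This is a hedged illustration, not the worker's implementation: the function names are hypothetical, case-insensitive truthy matching and standard (non-URL-safe) base64 are assumptions, and the plaintext route is deliberately left out because this page does not state it. No decryption is attempted.

```python
import base64
from typing import Optional
from urllib.parse import quote

SEALED_ALG = "x25519-hkdf-sha256-chacha20-poly1305"  # constant quoted above
TRUTHY = {"true", "1", "yes"}  # case-insensitive matching is an assumption


def sealed_request_path(alias: str, env: dict) -> Optional[str]:
    """Return the sealed-credential path, or None to signal the plaintext fallback."""
    enabled = env.get("NOETL_SEALED_CREDENTIALS", "false").strip().lower() in TRUTHY
    worker_id = env.get("WORKER_ID")
    if not (enabled and worker_id):
        return None  # sealing off, or no WORKER_ID for the server to look up
    return (
        f"/api/credentials/{quote(alias, safe='')}/sealed"
        f"?worker_id={quote(worker_id, safe='')}"
    )


def envelope_is_well_formed(envelope: dict) -> bool:
    """Structural check of {alg, v, eph_pub, ciphertext} before any unseal."""
    if envelope.get("alg") != SEALED_ALG or envelope.get("v") != 1:
        return False
    try:
        eph_pub = base64.b64decode(envelope["eph_pub"], validate=True)
        ciphertext = base64.b64decode(envelope["ciphertext"], validate=True)
    except (KeyError, ValueError, TypeError):
        return False
    # X25519 public keys are 32 bytes.
    return len(eph_pub) == 32 and len(ciphertext) > 0


# "pg_main" and "worker-0" are hypothetical names.
env = {"NOETL_SEALED_CREDENTIALS": "true", "WORKER_ID": "worker-0"}
print(sealed_request_path("pg_main", env))
print(sealed_request_path("pg_main", {"NOETL_SEALED_CREDENTIALS": "true"}))
```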
| Variable | Default | Required | Why |
|---|---|---|---|
| NATS_URL | nats://localhost:4222 | yes | NATS server URL. In kind: nats://nats.noetl.svc:4222; in GKE: matches the NATS Helm release's Service. |
| NATS_USER | - | when NATS auth | NATS user. |
| NATS_PASSWORD | - | when NATS auth | NATS password. Should come from a K8s Secret. |
| NATS_STREAM | noetl_commands | no | JetStream stream name. Override only when running multiple NoETL deployments on one NATS cluster. |
| NATS_CONSUMER | worker-pool | no | Durable consumer name. All workers in one pool share the same consumer name. |
| NATS_SUBJECT | noetl.commands | no | Base subject prefix. Pool-specific filter (noetl.commands.system.* vs noetl.commands.shared.*) appends below this. |
| NATS_FILTER_SUBJECT | (derived from pool) | no | Explicit override for the consumer's filter subject. Use only when the pool routing scheme needs a custom filter. |
| Variable | Default | Required | Why |
|---|---|---|---|
| NOETL_SNOWFLAKE_NODE_ID | - | per-replica in prod | 10-bit node id (0–1023) for the worker's snowflake generator. Each pod in a deployment MUST set a distinct value to avoid id collisions producing duplicate event IDs. Same shape as NOETL_SERVER_MACHINE_ID on the server side. |
| NOETL_SHARD_ID | - | (alias) | Back-compat alias for NOETL_SNOWFLAKE_NODE_ID; same semantics. |
| NOETL_NODE_ID | (HOSTNAME) | - | Fallback identifier hashed to derive the 10-bit node id when neither NOETL_SNOWFLAKE_NODE_ID nor NOETL_SHARD_ID is set. Read at startup. |
| NODE_NAME | (set by container runtime via downward API) | - | Same fallback chain as NOETL_NODE_ID; standard K8s downward-API value. |
| NOETL_SNOWFLAKE_EPOCH_MS | (build-time default: 2024-01-01T00Z) | no | Epoch the snowflake timestamps count from. Override only to match an alternate epoch elsewhere in the system; the noetl-server uses 2024-01-01, so changing this on the worker breaks cross-component ordering. |
| Variable | Default | Required | Why |
|---|---|---|---|
| NOETL_KEYCHAIN_ENV_VARS | - | when env-aliased credentials used | Comma-separated allowlist of env var names that hold credentials playbook steps can reference by alias. Example: NOETL_KEYCHAIN_ENV_VARS=NOETL_FLIGHT_BEARER_TOKEN,OPENAI_API_KEY permits a playbook to write bearer_token: NOETL_FLIGHT_BEARER_TOKEN and the worker resolves it from the env at dispatch time. The values themselves are loaded from the same env (the variable name is the alias and the var holds the secret). See src/executor/command.rs::KEYCHAIN_ENV_ALLOWLIST_VAR. When unset, env aliasing is disabled and all credentials must go through the NoETL keychain API. |
The mechanism is a bridge for credentials that already exist
as env vars in the worker's environment (because they come from
GKE Workload Identity, an existing K8s Secret mount, or similar
already-in-place trust per
agents/rules/execution-model.md).
Net-new business-logic credentials belong in the NoETL keychain,
not in this allowlist.
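
The allowlist semantics can be sketched in a few lines. This is an illustration of the behaviour described in the table, not the code in src/executor/command.rs; `resolve_env_alias` is a hypothetical name, whitespace trimming around list entries is an assumption, and the token value is a placeholder.

```python
def resolve_env_alias(alias, env):
    """Return the secret for `alias` if it is allowlisted, else None.

    None means env aliasing does not apply and the credential would have to
    come from the NoETL keychain API instead.
    """
    raw = env.get("NOETL_KEYCHAIN_ENV_VARS", "")
    allowlist = {name.strip() for name in raw.split(",") if name.strip()}
    if alias not in allowlist:
        return None
    # The variable name is the alias; the variable's value is the secret.
    return env.get(alias)


env = {
    "NOETL_KEYCHAIN_ENV_VARS": "NOETL_FLIGHT_BEARER_TOKEN,OPENAI_API_KEY",
    "NOETL_FLIGHT_BEARER_TOKEN": "placeholder-token",
    "PATH": "/usr/bin",
}
print(resolve_env_alias("NOETL_FLIGHT_BEARER_TOKEN", env) is not None)
print(resolve_env_alias("PATH", env))  # not allowlisted, so never exposed
```

The point of the allowlist is visible in the second call: an env var that exists in the pod but is not listed cannot be reached from a playbook by name.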
| Variable | Default | Required | Why |
|---|---|---|---|
| HOSTNAME | (set by container runtime) | - | Fallback for snowflake node-id derivation. |
| RUST_LOG | (build default) | no | Standard tracing-subscriber filter. Set to noetl_worker=debug for targeted debugging. |
| NOETL_IPC_CACHE_BUDGET_BYTES | 268435456 (256 MB) | no | Arrow IPC shared-memory cache budget for same-node zero-copy reads. The pod's /dev/shm must be a memory-backed tmpfs sized above this value, or the worker SIGBUSes (exit 135) when the cache fills past the 64 MiB k8s default (see Shared memory). Tune up only when working sets exceed 256 MB; raise the /dev/shm sizeLimit and the pod memory limit in lockstep. |
| Variable | Default | Required | Why |
|---|---|---|---|
| NOETL_EVENT_RESULT_CONTEXT_MAX_BYTES | 102400 (100 KB) | no | Inline budget for a tool result's serialised context on the call.done event. A result within budget goes inline ({status, context}); a result over budget is staged in the durable result store (noetl.result_store) + the shm cache, and the event carries a {data:{_ref}} placeholder + a reference block instead, keeping the event log lean. The server orchestrator resolves those references when driving state (hydrate_result_references, noetl/server#197). Lower it to push more results through the reference path (ops tuning, or forcing the path in kind validation; =256 was used to validate every PFT result by reference); raise it to keep more results inline. Read per-result in src/executor/command.rs::inline_context_max_bytes. |
| NOETL_RESULT_MATERIALIZER_ENABLED | false | no | Result materializer: shadow Feather tier (noetl/ai-meta#104 Phase B, system pool only). Truthy spawns a SEPARATE noetl_events consume-loop (consumer noetl_result_materializer, own ack cursor) that, for an over-budget result, fetches the authoritative payload (read-only), tiers it (tabular → Arrow Feather, non-tabular → JSON, small → no write) and writes the body to the derived §7 physical_key via the server's PUT /api/internal/objects/{key} (shadow, alongside noetl.result_store); nothing reads it until the resolve path is enabled. Never alters the authoritative result, never fails an event. Default off → not spawned (true no-op). |
| NOETL_RESULT_URI_RESOLVE | false | no | Resolve-by-URN read path (noetl/ai-meta#104 Phase C, consume pool). Truthy makes a step that binds the bulk of an over-budget upstream result resolve it by canonical logical URI → cell placement (server registry) → §7 key → object bytes (GET /api/internal/objects/{key}) instead of the legacy noetl.result_store fetch. On any registry miss / object miss / decode error it falls back fail-safe to the authoritative resolve_ref (noetl_worker_result_resolve_total{outcome} records resolved_* vs fallback_*). Default off → byte-identical legacy read path. |
| NOETL_RESULT_MINT_AUTHORITATIVE | false | no | Phase D minting flip (noetl/ai-meta#104 Phase D). ONE flag that makes the URN → Feather/GCS tier the authoritative result store: on the system pool it makes the result materializer the authoritative tier writer (implies NOETL_RESULT_MATERIALIZER_ENABLED); on the consume pool it makes resolve-by-URN the primary read path (implies NOETL_RESULT_URI_RESOLVE). A tier miss / parse failure still falls back fail-safe to the dual-written noetl.result_store (rollback safety), recorded on noetl_worker_result_mint_authoritative_total{path} (tier = authoritative tier served; legacy_fallback = reversible fallback served). Default off → byte-identical to Phase A–C (true no-op). The dual-write to result_store continues until the OQ5-gated retirement decision (NOT Phase D), so flag-off rolls back cleanly. |
| NOETL_SIDE_EFFECT_BARRIER | false | no | Side-effect durability barrier (noetl/ai-meta#104 Phase E, consume pool). Before the worker (re-)dispatches a side-effecting tool (per the registry classifier noetl_tools::registry::kind_is_side_effecting, noetl-tools 3.17.0), it checks whether the cycle's derived result URN already resolves to a durable result (the Phase C resolve-by-URN read path). If it does (the cycle already ran to a durable completion on a prior drive), the worker skips re-execution and adopts the recorded result, so an external side effect (HTTP POST, DB write, payment) fires exactly once across a crash-resume / re-drive. Non-side-effecting cycles are never blocked (idempotent recompute is fine); a side-effecting cycle whose result is not durable re-executes normally. Adopt-only: resolve_by_urn returns Some only on a durable tier hit, so the barrier can only ever turn a duplicate side effect into a single one, never drop work. Outcomes on noetl_worker_side_effect_barrier_total{outcome} (skipped = adopted durable result; executed = re-executed). Default off → byte-identical to today (the barrier block is short-circuited by the cheap flag read; dispatch unchanged). |
| NOETL_RESULT_TIER_DR | false | no | Result-tier DR re-derive (noetl/ai-meta#104 Phase F, system pool only). The Feather/JSON result tier is derivable from the WAL (an object's bytes are the deterministic encode of the authoritative payload and its location is computed from the logical URI), so a missing or corrupt tier object can be rebuilt from its source by re-running the materialization for its URN. Truthy puts the result materializer in verify-and-repair mode: for each over-budget result it re-derives the object + §7 key and rewrites it only when the durable object is missing or byte-divergent (corrupt); a healthy object is left untouched. The flag alone spawns the materializer (DR-only mode; no need for NOETL_RESULT_MATERIALIZER_ENABLED). Never alters the authoritative result_store source; the rewrite is byte-identical to a fresh materialization. A WAL event re-delivery is therefore a targeted DR repair. Outcomes on noetl_worker_result_tier_dr_total{outcome} (present / rederived / source_gone / error). Default off → byte-identical to Phase B/D (normal write path; this branch never runs). |
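
The inline-versus-reference decision driven by NOETL_EVENT_RESULT_CONTEXT_MAX_BYTES can be sketched as a size comparison. This is a simplified illustration, not the logic in src/executor/command.rs: `result_path` is a hypothetical name, and measuring the budget against the UTF-8 length of a JSON serialisation is an assumption about how "serialised context" is sized.

```python
import json

DEFAULT_INLINE_MAX_BYTES = 102400  # 100 KB default from the table above


def result_path(context, env):
    """Return "inline" or "reference" for a tool result's context."""
    max_bytes = int(env.get("NOETL_EVENT_RESULT_CONTEXT_MAX_BYTES", DEFAULT_INLINE_MAX_BYTES))
    size = len(json.dumps(context).encode("utf-8"))
    # Within budget: carried inline on call.done. Over budget: staged, event carries a _ref.
    return "inline" if size <= max_bytes else "reference"


small = {"status": "ok", "rows": 3}
large = {"status": "ok", "rows": ["x" * 100 for _ in range(2000)]}

print(result_path(small, {}))
print(result_path(large, {}))
# Lowering the budget to 256 bytes, as in the kind validation mentioned above,
# pushes even modest results through the reference path.
print(result_path({"payload": "y" * 500}, {"NOETL_EVENT_RESULT_CONTEXT_MAX_BYTES": "256"}))
```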
System worker pool only. The off-server state builder reconstructs orchestrator
WorkflowState from the noetl_events WAL (not the materialized
noetl.event table), walking the one-level prev_event_id chain and caching the
built spine keyed by the immutable chain head
(noetl/ai-meta#115 Phase 4). The
shadow loop is observation-only — it proves on the running cluster that the
builder reads the WAL with zero noetl.event scans and that the chain-walk +
pool-side cache (hit / incremental tail-advance / cold-rebuild) behave, without
touching the drive. The drive cutover that makes the drive consume this
builder's state is staged behind the server's NOETL_STATE_BUILDER=offserver.
| Variable | Default | Required | Why |
|---|---|---|---|
| NOETL_STATE_BUILDER_SHADOW | unset (off) | no | Truthy (1/true/yes/on) spawns the off-server state-builder shadow loop (system pool). It opens its own ephemeral DeliverAll/AckNone pull consumer on noetl_events (never the materializer's durable noetl_materializer consumer), replays the retained WAL into a pool-side per-execution chain index, and exercises the chain walk + cache, emitting the state_builder metrics below. Observation-only: no drive impact; safe to leave on for monitoring. Default off → every other worker unaffected. |
| NOETL_STATE_BUILDER_STREAM | noetl_events | no | The JetStream WAL stream the shadow drains. Mirror of the materializer's NOETL_MATERIALIZER_STREAM; only override when the stream is renamed. |
| NOETL_STATE_BUILDER_BATCH | 200 | no | Bounded pull batch (clamped 1–1000). |
| NOETL_STATE_BUILDER_TIMEOUT_MS | 2000 | no | Pull-batch expiry wait. |
| NOETL_STATE_BUILDER_IDLE_SLEEP_MS | 500 | no | Sleep when a drain comes back empty (keeps the idle loop off the CPU). |
State-builder metrics (:9090/metrics):
noetl_worker_state_builder_wal_events_total (events consumed from the WAL — the
WAL-read proof, RFC tenet 5) · noetl_worker_state_builder_event_scans_total
(noetl.event scans the builder issued — the no-scan proof, RFC tenet 3;
stays 0) · noetl_worker_state_builder_builds_total{outcome=cache_hit|incremental|cold_rebuild|incomplete}
(cache effectiveness + correctness) · noetl_worker_state_builder_chain_hops
(chain-walk depth per cold rebuild).
Same shape as the server:
| Secret | Storage | Mount as |
|---|---|---|
| NATS_PASSWORD | K8s Secret | valueFrom.secretKeyRef |
| Any value listed in NOETL_KEYCHAIN_ENV_VARS | K8s Secret (when used) | valueFrom.secretKeyRef per allowlisted name |
Per agents/rules/execution-model.md
"Secrets and credentials rule": business-logic credentials
(third-party API tokens, tenant database DSNs) belong in the
NoETL keychain via the server's /api/keychain/* routes. The
allowlist mechanism above is a bridge only for credentials
that already live in the pod env via existing trust (GKE Workload
Identity, etc.).
- Metrics: Prometheus surface at :9090/metrics per agents/rules/observability.md. Surface includes NATS pull rate, command claim outcomes, tool dispatch durations per tool kind, result-store PUT latency.
- JetStream consumer lag (noetl_worker_nats_consumer_pending and noetl_worker_nats_consumer_ack_pending, labelled {stream, consumer}): a periodic lag poller queries JetStream consumer info on an independent task and sets these gauges. It covers the command consumer always, and the materializer consumer ({stream="noetl_events", consumer="noetl_materializer"}) whenever NOETL_MATERIALIZER_ENABLED=true. The latter is the earliest signal that, under the server's NOETL_EVENT_INGEST_PUBLISH_ONLY gate, published events are piling up un-materialized: it climbs even when the materializer loop itself has stalled or died (the loop can't report its own lag; the independent poller can). It is the metric the CQRS flip guardrail alerts read (see noetl/ai-meta#103 and the ops runbook noetl-cqrs-publish-only-flip.md).
- Materializer counters: noetl_worker_materializer_drained_total / _projected_total / _duplicates_total / _acked_total / _project_errors_total + _cycle_duration_seconds: the ack-after-materialize loop's throughput + the no-loss redelivery surface (project_errors = a batch left un-acked to redeliver).
- Scrape path (the metrics above only guard the flip if something scrapes them): the :9090/metrics port is exposed by a headless *-metrics Service per pool. On the kind dev cluster a VictoriaMetrics VMServiceScrape selects them; on prod (GKE) the cluster runs Google Managed Prometheus, which does NOT honor prometheus.io/scrape annotations, so the worker pods are scraped by a GMP PodMonitoring (ci/manifests/noetl/gmp/podmonitoring-noetl.yaml, ops wiki Production monitoring (GMP)). Without that object the lag gauge + materializer counters never reach Managed Prometheus and the CQRS flip guardrail is blind.
- Tracing: tracing spans on every NATS message, every tool dispatch. execution_id is a span field.
- Logs: structured JSON via tracing-subscriber.
When changing any of the above, validate per agents/rules/deployment-validation.md:
- cargo build --release --bins
- cargo test --quiet
- Build image locally + load into kind: kind load docker-image …
- Apply the manifests against kubectl --context kind-noetl
- Smoke-test:
  - Submit a playbook execution via the noetl CLI.
  - Confirm the worker claims the command, runs the tool, and POSTs lifecycle events back.
  - For env-var changes: verify kubectl exec … env | grep <VAR> shows the value AND the worker's startup log line confirms it was honored.

Only after kind passes does the change roll forward to Cloud Build + GKE.
- noetl-executor adoption: the shared execution core the worker pulls from crates.io.
- release-pipeline: Cargo + Docker image publish flow.
- worker-credentials: credential resolution in tool dispatch.
- nats-mcp-tool-kinds: tool kinds the worker dispatches.
- noetl/server sharding-design: the cross-component routing context.
- noetl/server deployment-specification: the server-side counterpart of this page.
- noetl/ops Helm chart + manifests: the deployment-time implementation.
- agents/rules/wiki-maintenance.md: the rule that requires this page to update in lockstep with code.