Kadyapam edited this page Jun 23, 2026 · 11 revisions

Deployment Specification

This page is the durable reference for deploying noetl-worker into any environment. It covers the runtime contract the binary expects, the resources it consumes, the network surface it exposes, and — critically — every environment variable it reads, with the why behind each one.

This page is the single source of truth for the deployment shape. Any code change that adds, renames, removes, or shifts the meaning of an env var MUST update the Environment Variables section in the same change set. Same rule for ports, dependencies, and runtime requirements. See agents/rules/wiki-maintenance.md.

The matching deployment manifests live in noetl/ops (Helm chart + kind overlays). This wiki page describes what the manifests need to provide; the manifests are the implementation.

Component summary

| Field | Value |
| --- | --- |
| Repo | noetl/worker |
| Binary | noetl-worker |
| Container image | noetl-worker (built from the repo's Dockerfile) |
| Image versioning | crates.io version pinned in Cargo.toml; semver releases tagged vX.Y.Z |
| Current version | see Cargo.toml package.version |
| Language / runtime | Rust 1.91+; Tokio multi-threaded |
| Process model | Single binary, single process per pod |
| Role | NATS pull consumer + tool dispatch. Stateless atomic-compute block per agents/rules/execution-model.md. |

Runtime contract

What the binary expects from its environment to start cleanly:

  1. NATS reachable at ${NATS_URL} with the JetStream consumer + stream provisioned (see NATS layout). Hard requirement; without it the worker exits.
  2. noetl-server reachable at ${NOETL_SERVER_URL} (HTTP). The worker's per-command server_url override (set by the publishing server on every NATS notification per noetl/ai-meta#53) takes precedence at runtime; the env var is the initial default for one-server deployments.
  3. Worker pool slot: WORKER_POOL_NAME identifies which pool this worker belongs to. Server-side runtime registration is keyed by (kind, name); multiple workers in one pool share work.
  4. Worker identity: WORKER_ID distinguishes this individual worker. When unset, the worker generates a uuid (fine for ephemeral pods; for StatefulSets prefer setting it explicitly from the pod ordinal so logs + metrics stay readable across restarts).
  5. KEDA-managed scaling: the worker is designed to be scaled by KEDA on NATS consumer lag. See Scaling.
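
The contract above can be sketched as a container env block. This is a minimal illustration, not the committed manifest: the URLs are the kind/in-cluster values quoted elsewhere on this page, the pool name "shared" is only an example, and taking WORKER_ID from the pod name is one assumed way to get a stable, readable identity (for a StatefulSet the pod name already carries the ordinal).

```yaml
# Fragment of a pod template's container spec (illustrative values).
env:
  - name: NATS_URL
    value: "nats://nats.noetl.svc:4222"      # kind value from the NATS table
  - name: NOETL_SERVER_URL
    value: "http://noetl.noetl.svc:8082"     # initial default; per-command server_url overrides it
  - name: WORKER_POOL_NAME
    value: "shared"                          # example pool name
  - name: WORKER_ID
    valueFrom:
      fieldRef:
        fieldPath: metadata.name             # pod name via the downward API (assumed convention)
```

With a Deployment the pod name changes on every restart, so this gives readable but not restart-stable IDs; only a StatefulSet keeps the same name per ordinal.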

Network surface

Ports

| Port | Protocol | Purpose | Bind |
| --- | --- | --- | --- |
| 9090 | HTTP | Prometheus scrape endpoint at /metrics + /healthz + /readyz | ${WORKER_METRICS_BIND} (default 0.0.0.0:9090) |

The worker doesn't expose an API. All traffic flows out (NATS pull, HTTP to server) except the scrape port.

Dependencies (outbound)

| Target | Protocol | Why |
| --- | --- | --- |
| NATS JetStream | TCP 4222 (default) | Pull commands from noetl.commands.* subjects. Receive notifications with the publishing server's URL. |
| noetl-server | HTTP (default port 8082) | Fetch command details (GET /api/commands/{event_id}), claim atomically (POST /api/commands/{event_id}/claim), emit lifecycle events (POST /api/events), put result blobs (POST /api/result_store/...). Per-command server_url from the NATS notification overrides this default. |
| External APIs (Auth0 / Duffel / OpenAI / ...) | HTTPS | Whatever the executing playbook tool calls. Credentials resolved via NoETL keychain or NOETL_KEYCHAIN_ENV_VARS allowlist (see Keychain credentials). |

The worker does NOT call Postgres directly. Per agents/rules/data-access-boundary.md, NoETL platform data is accessible via server API only.

Resources

Recommended starting point for production. Workers scale on NATS backlog (KEDA), so each replica should be sized for one concurrent playbook dispatch.

| Resource | Request | Limit | Notes |
| --- | --- | --- | --- |
| CPU | 100m | 500m | Per-replica; CPU-bound only during tool execution (Python eval, JSON transforms). Idle workers consume <5m. |
| Memory | 128Mi | 768Mi | The limit must cover the memory-backed /dev/shm tmpfs (charged to the pod cgroup and sized to the Arrow IPC cache budget; see Shared memory below) plus the worker RSS. With the 256 MiB cache budget the /dev/shm tmpfs is 320Mi, so the limit is 768Mi (320Mi shm + worker RSS headroom). |
| Ephemeral storage | 100Mi | 500Mi | Tool execution scratch (Python eval temp dirs). |

WORKER_MAX_CONCURRENT (default 1) governs how many commands a single worker pod processes in parallel. Stay at 1 unless the pool's tools are I/O-bound (HTTP fetches dominated by external latency); CPU-bound tools should scale out via more replicas, not more concurrency per pod.

Shared memory (/dev/shm)

Required: a memory-backed /dev/shm sized to exceed the Arrow IPC cache budget.

The worker allocates one Arrow IPC shared-memory cache per process at init (ArrowIpcSharedMemoryCache::new(), budget NOETL_IPC_CACHE_BUDGET_BYTES, default 256 MiB). The cache stages call.done results that exceed the broker's 100 KiB inline budget into POSIX shared-memory regions backed by /dev/shm (via the shared_memory crate's shm_open + ftruncate + mmap).

The Kubernetes container-runtime default for /dev/shm is a 64 MiB tmpfs. Once the cache writes past 64 MiB, a store into the mapped region page-faults against the full tmpfs and the kernel delivers SIGBUS: the worker dies with exit code 135 (128 + 7, the Linux signal number for SIGBUS) and crash-loops. Mount a memory-backed emptyDir at /dev/shm sized above the cache budget:

spec:
  containers:
    - name: noetl-worker
      env:
        - name: NOETL_IPC_CACHE_BUDGET_BYTES
          value: "268435456"   # 256 MiB — pin next to sizeLimit below
      volumeMounts:
        - name: dshm
          mountPath: /dev/shm
      resources:
        limits:
          memory: "768Mi"      # >= sizeLimit + worker RSS
  volumes:
    - name: dshm
      emptyDir:
        medium: Memory
        sizeLimit: 320Mi        # >= NOETL_IPC_CACHE_BUDGET_BYTES + headroom

Three values move together and must stay coherent:

| Value | Constraint |
| --- | --- |
| NOETL_IPC_CACHE_BUDGET_BYTES | the cache budget (256 MiB default) |
| /dev/shm emptyDir sizeLimit | > the budget (320Mi = budget + 64Mi headroom for tmpfs page rounding) |
| container memory limit | > sizeLimit + worker RSS (768Mi); the tmpfs is charged to the pod cgroup |

If you raise the cache budget, raise the sizeLimit and the memory limit in lockstep. The committed manifests wiring this up live in noetl/ops ci/manifests/noetl/worker-*.yaml (fixed in ops#193).
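
As a worked instance of the lockstep rule, the sketch below doubles the cache budget to 512 MiB. It assumes the same 64Mi tmpfs headroom and the same 448Mi of worker RSS headroom (768Mi minus 320Mi) as the defaults above. Both are carried over as assumptions rather than measured values, so size the memory limit against the RSS actually observed in your pool.

```yaml
spec:
  containers:
    - name: noetl-worker
      env:
        - name: NOETL_IPC_CACHE_BUDGET_BYTES
          value: "536870912"   # 512 MiB = 512 * 1024 * 1024
      volumeMounts:
        - name: dshm
          mountPath: /dev/shm
      resources:
        limits:
          memory: "1024Mi"     # 576Mi shm + 448Mi RSS headroom (assumed, same as the default shape)
  volumes:
    - name: dshm
      emptyDir:
        medium: Memory
        sizeLimit: 576Mi       # 512Mi budget + 64Mi headroom
```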

Health probes

| Probe | Path | Initial delay | Period | Failure threshold | Effect |
| --- | --- | --- | --- | --- | --- |
| Liveness | /healthz | 10s | 10s | 3 | Pod restart |
| Readiness | /readyz | 5s | 5s | 3 | Removed from Service endpoints (only relevant if the metrics port is fronted by a Service) |

/readyz confirms that the NATS connection and the heartbeat path are live.
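
The probe table translates to the container-level fragment below. It is a sketch that assumes the default WORKER_METRICS_BIND port (9090); if the bind address is overridden, the probe port has to follow it.

```yaml
# Goes under the noetl-worker container in the pod template.
livenessProbe:
  httpGet:
    path: /healthz
    port: 9090
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 3     # 3 consecutive failures restart the pod
readinessProbe:
  httpGet:
    path: /readyz
    port: 9090
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 3     # 3 consecutive failures remove the pod from Service endpoints
```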

NATS layout

  • Stream: noetl_commands (default; override via NATS_STREAM). Subjects: noetl.commands.>.
  • Consumer: pull-based, durable. Default name worker-pool (override via NATS_CONSUMER). Filter subject: noetl.commands (override via NATS_FILTER_SUBJECT).
  • Subjects the worker reads: per-pool filter shapes vary; the server publishes commands to noetl.commands.{system|shared}.<execution_id> per the Phase F sharding-design. A "shared" pool worker filters on noetl.commands.shared.*; a "system" pool worker filters on noetl.commands.system.*.

Scaling

The worker is the scale-out unit per agents/rules/execution-model.md. KEDA reads NATS consumer lag and scales the Deployment between the ScaledObject's minimum and maximum replica counts (minReplicaCount and maxReplicaCount in KEDA's schema).

Standard KEDA ScaledObject for the worker pool:

triggers:
  - type: nats-jetstream
    metadata:
      account: $G
      natsServerMonitoringEndpoint: "nats.noetl.svc:8222"
      stream: noetl_commands
      consumer: worker-pool
      lagThreshold: "5"

Workers can be added, removed, or restarted freely — state lives in the server's event log + cache; the worker is stateless.
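
For orientation, the trigger above sits inside a ScaledObject roughly as follows. This is a hedged sketch: the object name, the Deployment name, and the replica bounds are hypothetical placeholders, and the namespace is inferred from the nats.noetl.svc endpoint. The committed object lives in noetl/ops.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: noetl-worker              # hypothetical
  namespace: noetl                # inferred from nats.noetl.svc
spec:
  scaleTargetRef:
    name: noetl-worker            # hypothetical Deployment name
  minReplicaCount: 1              # placeholder bound
  maxReplicaCount: 10             # placeholder bound
  triggers:
    - type: nats-jetstream
      metadata:
        account: "$G"
        natsServerMonitoringEndpoint: "nats.noetl.svc:8222"
        stream: noetl_commands
        consumer: worker-pool
        lagThreshold: "5"
```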

Snowflake ID generation

The worker mints its own snowflake IDs via src/snowflake.rs for client-side event IDs (per agents/rules/observability.md Principle 3). The 10-bit node id comes from the first of these sources that is set:

  1. NOETL_SNOWFLAKE_NODE_ID (preferred — set per pod by the deployment manifest).
  2. NOETL_SHARD_ID (back-compat alias for the same idea).
  3. Derived from NOETL_NODE_ID / NODE_NAME / pod hostname via FNV-1a hash to 10 bits.

For multi-replica deployments, set NOETL_SNOWFLAKE_NODE_ID explicitly to avoid hash collisions producing duplicate IDs.

The epoch comes from NOETL_SNOWFLAKE_EPOCH_MS if set; otherwise it defaults to a build-time constant. Match the server's epoch (2024-01-01T00:00:00Z) to keep IDs mutually orderable across producers.
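
One way to give each replica a distinct node id is sketched below. It rests on two assumptions: the pool runs as a StatefulSet, and the cluster is recent enough that StatefulSet pods carry the apps.kubernetes.io/pod-index label. That NOETL_SNOWFLAKE_EPOCH_MS takes integer milliseconds is inferred from the variable name.

```yaml
# Fragment of a StatefulSet pod template's container spec.
env:
  - name: NOETL_SNOWFLAKE_NODE_ID
    valueFrom:
      fieldRef:
        # Pod ordinal (0, 1, 2, ...) exposed through the downward API.
        fieldPath: metadata.labels['apps.kubernetes.io/pod-index']
  # Optional: the build-time default already matches the server's epoch.
  # 1704067200000 ms since the Unix epoch = 2024-01-01T00:00:00Z.
  - name: NOETL_SNOWFLAKE_EPOCH_MS
    value: "1704067200000"
```

The ordinal is unique only within one StatefulSet, so two pools that both start at ordinal 0 would still collide; separate pools need disjoint ranges within 0–1023.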

Environment variables

All env vars the binary reads at startup or runtime, with the why behind each one.

Worker identity + pool

| Variable | Default | Required | Why |
| --- | --- | --- | --- |
| WORKER_ID | (uuid v4 generated at startup) | recommended | Unique identifier for this worker pod. Embedded in event metadata + runtime registration so the server can track which worker handled which command. For StatefulSet pods, derive from the pod ordinal so logs/metrics stay stable across restarts. |
| WORKER_POOL_NAME | default | yes for non-default pools | Pool the worker registers under. Drives the NATS subject filter (system pool reads noetl.commands.system.*, shared pool reads noetl.commands.shared.*). Must match the pool's runtime row name on the server side. |
| WORKER_HEARTBEAT_INTERVAL | (see code; ~10s) | no | Seconds between heartbeat POSTs to /api/worker/pool/heartbeat. Lower = faster failover detection; higher = less HTTP load. Tune to match the server's NOETL_RUNTIME_OFFLINE_SECONDS. |
| WORKER_MAX_CONCURRENT | 1 | no | Number of commands this pod processes concurrently. See Resources. |
| WORKER_METRICS_BIND | 0.0.0.0:9090 | no | Bind address for the metrics + health HTTP server. Override only when you don't want to accept scrape traffic on all interfaces. |
| WORKER_NATS_LAG_POLL_INTERVAL | (see code) | no | Seconds between consumer-lag polls used by the worker's own lag-aware logic (independent of KEDA's external poll). Tune only when the default is shown to be too lazy or too aggressive. |

Server endpoint

| Variable | Default | Required | Why |
| --- | --- | --- | --- |
| NOETL_SERVER_URL | (built-in fallback, typically http://noetl.noetl.svc:8082) | yes | Initial default URL for the noetl-server HTTP API. Note: as of noetl/worker#41 the per-command server_url from the NATS notification overrides this at runtime, so a worker that picks up a command published by a different server replica (or by the Rust server vs Python server) will correctly route lifecycle events back to the publishing server. Use an https:// URL when the server's TLS listener is on (see Transport security below). |

Transport security (mTLS client)

Secrets Wallet Phase 4b (noetl/ai-meta#61, noetl/worker#56) — the worker half of the mTLS transport the server opts into in Phase 4a (noetl/server#103, NOETL_TLS_*). The worker's control-plane HTTP client (credential fetch, command claims, event posts) speaks plain HTTP by default; the env below makes it present a client certificate so the worker→server credential channel is authenticated + encrypted (the resolved secret no longer travels plaintext on the wire).

| Variable | Default | Required | Why |
| --- | --- | --- | --- |
| NOETL_TLS_CLIENT_CERT |  | with NOETL_TLS_CLIENT_KEY | PEM client-certificate-chain path (the mTLS identity the worker presents). Both cert + key together, or neither (setting one is a fail-fast misconfig). |
| NOETL_TLS_CLIENT_KEY |  | with NOETL_TLS_CLIENT_CERT | PEM private-key path for the identity cert. Never logged. |
| NOETL_TLS_CA |  | for a private-CA server | PEM CA-bundle path the worker adds as a trust root when verifying the server's certificate. Needed when the server cert is signed by an internal CA rather than a public root. Independent of the identity (a publicly-trusted server needs none). |

Built on the rustls-tls reqwest backend the worker already uses (reqwest::Identity::from_pem + Certificate::from_pem_bundle). Set NOETL_SERVER_URL to the server's https:// URL when these are on. The cert/key/CA are mounted from a K8s Secret; the manifests live in noetl/ops.
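
A minimal sketch of that Secret mount follows. The Secret name, the mount path, and the tls.crt / tls.key / ca.crt key names are assumptions for illustration (the key names follow the common kubernetes.io/tls and cert-manager convention); the real wiring is in noetl/ops.

```yaml
spec:
  containers:
    - name: noetl-worker
      env:
        - name: NOETL_SERVER_URL
          value: "https://noetl.noetl.svc:8082"   # assumes the TLS listener is on the default port
        - name: NOETL_TLS_CLIENT_CERT
          value: /etc/noetl/tls/tls.crt
        - name: NOETL_TLS_CLIENT_KEY
          value: /etc/noetl/tls/tls.key
        - name: NOETL_TLS_CA
          value: /etc/noetl/tls/ca.crt            # only needed for a private-CA server
      volumeMounts:
        - name: worker-mtls
          mountPath: /etc/noetl/tls               # hypothetical mount path
          readOnly: true
  volumes:
    - name: worker-mtls
      secret:
        secretName: noetl-worker-mtls             # hypothetical Secret name
```

Setting only one of the cert/key pair is a fail-fast misconfiguration, so both env entries should be added or removed together.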

Init-container caveat (Phase 4c): the wait-for-api init container curls /api/health over plain HTTP — against an mTLS server it can't complete and blocks the pod in Init. It needs the client cert (mTLS curl) or a non-mTLS health path (parallels the server's mTLS httpGet-probe caveat).

Sealed credential delivery (Phase 5c)

Secrets Wallet Phase 5c (noetl/worker#58, tracks noetl/ai-meta#61) — defense in depth on top of the Phase-4 mTLS transport. mTLS encrypts the wire; sealing encrypts the credential payload to a key only this worker holds. The cleartext exists only briefly inside the worker process after unseal; never on the wire, never in the server's HTTP response body.

| Variable | Default | Required | Why |
| --- | --- | --- | --- |
| NOETL_SEALED_CREDENTIALS | false | when sealing | Master toggle. When true / 1 / yes and WORKER_ID is set, the worker calls GET /api/credentials/{alias}/sealed?worker_id=<WORKER_ID> instead of the plaintext path. The server (v2.32.0 + the Phase-5b endpoint, server#109) seals the response to the X25519 public key this worker registered on POST /api/worker/pool/register. |

Wire shape. The worker generates a long-lived X25519 keypair once at startup and includes the base64 public half in the runtime JSON blob of its register payload:

{
  "name": "<WORKER_ID>",
  "component_type": "worker_pool",
  "runtime": { "kind": "rust", "worker_public_key": "<base64-32-byte-x25519-pub>" },
  "status": "ready",
  ...
}

The private half stays in-process for the worker's lifetime — only the public half ever leaves. The server-side deployment-specification § Sealed credential delivery describes the endpoint shape; the wire envelope is {alg, v, eph_pub (b64), ciphertext (b64)} carried by the same noetl_sealed algorithm constant on both sides (x25519-hkdf-sha256-chacha20-poly1305, v=1).

Zeroize. After the auth-alias resolver has injected the resolved fields into the tool config, the worker zeroizes the string values in the intermediate credential.data map so an OOM core dump or a debugger snoop on the worker pod can't recover the secret from the post-dispatch heap.

Sealing requires a worker name on the server side. The server looks up worker_public_key by the name (kind=worker_pool) of the runtime row. Workers identify themselves at register time with WORKER_ID, so that env var is required for sealing to work; absent → the wrapper logs a WARN and falls back to the plaintext path.

NATS

| Variable | Default | Required | Why |
| --- | --- | --- | --- |
| NATS_URL | nats://localhost:4222 | yes | NATS server URL. In kind: nats://nats.noetl.svc:4222; in GKE: matches the NATS Helm release's Service. |
| NATS_USER |  | when NATS auth | NATS user. |
| NATS_PASSWORD |  | when NATS auth | NATS password. Should come from a K8s Secret. |
| NATS_STREAM | noetl_commands | no | JetStream stream name. Override only when running multiple NoETL deployments on one NATS cluster. |
| NATS_CONSUMER | worker-pool | no | Durable consumer name. All workers in one pool share the same consumer name. |
| NATS_SUBJECT | noetl.commands | no | Base subject prefix. Pool-specific filter (noetl.commands.system.* vs noetl.commands.shared.*) appends below this. |
| NATS_FILTER_SUBJECT | (derived from pool) | no | Explicit override for the consumer's filter subject. Use only when the pool routing scheme needs a custom filter. |

Snowflake ID generation

| Variable | Default | Required | Why |
| --- | --- | --- | --- |
| NOETL_SNOWFLAKE_NODE_ID |  | per-replica in prod | 10-bit node id (0–1023) for the worker's snowflake generator. Each pod in a deployment MUST set a distinct value to avoid id collisions producing duplicate event IDs. Same shape as NOETL_SERVER_MACHINE_ID on the server side. |
| NOETL_SHARD_ID | (alias) |  | Back-compat alias for NOETL_SNOWFLAKE_NODE_ID; same semantics. |
| NOETL_NODE_ID | (HOSTNAME) |  | Fallback identifier hashed to derive the 10-bit node id when neither NOETL_SNOWFLAKE_NODE_ID nor NOETL_SHARD_ID is set. Read at startup. |
| NODE_NAME | (injected by the manifest via the downward API) |  | Same fallback chain as NOETL_NODE_ID; standard K8s downward-API value. |
| NOETL_SNOWFLAKE_EPOCH_MS | (build-time default: 2024-01-01T00Z) | no | Epoch the snowflake timestamps count from. Override only to match an alternate epoch elsewhere in the system. The noetl-server uses 2024-01-01, so changing this on the worker breaks cross-component ordering. |

Keychain credentials (NOETL_KEYCHAIN_ENV_VARS)

| Variable | Default | Required | Why |
| --- | --- | --- | --- |
| NOETL_KEYCHAIN_ENV_VARS |  | when env-aliased credentials used | Comma-separated allowlist of env var names that hold credentials playbook steps can reference by alias. Example: NOETL_KEYCHAIN_ENV_VARS=NOETL_FLIGHT_BEARER_TOKEN,OPENAI_API_KEY permits a playbook to write bearer_token: NOETL_FLIGHT_BEARER_TOKEN and the worker resolves it from the env at dispatch time. The values themselves are loaded from the same env (the variable name is the alias and the var holds the secret). See src/executor/command.rs::KEYCHAIN_ENV_ALLOWLIST_VAR. When unset, env aliasing is disabled and all credentials must go through the NoETL keychain API. |

The mechanism is a bridge for credentials that already exist as env vars in the worker's environment (because they come from GKE Workload Identity, an existing K8s Secret mount, or similar already-in-place trust per agents/rules/execution-model.md). Net-new business-logic credentials belong in the NoETL keychain, not in this allowlist.

Misc / standard tooling

| Variable | Default | Required | Why |
| --- | --- | --- | --- |
| HOSTNAME | (set by container runtime) |  | Fallback for snowflake node-id derivation. |
| RUST_LOG | (build default) | no | Standard tracing-subscriber filter. Set to noetl_worker=debug for targeted debugging. |
| NOETL_IPC_CACHE_BUDGET_BYTES | 268435456 (256 MiB) | no | Arrow IPC shared-memory cache budget for same-node zero-copy reads. The pod's /dev/shm must be a memory-backed tmpfs sized above this value, or the worker SIGBUSes (exit 135) when the cache fills past the 64 MiB k8s default (see Shared memory). Tune up only when working sets exceed 256 MiB; raise the /dev/shm sizeLimit and the pod memory limit in lockstep. |

Event result staging (results-by-reference)

| Variable | Default | Required | Why |
| --- | --- | --- | --- |
| NOETL_EVENT_RESULT_CONTEXT_MAX_BYTES | 102400 (100 KiB) | no | Inline budget for a tool result's serialised context on the call.done event. A result within budget goes inline ({status, context}); a result over budget is staged in the durable result store (noetl.result_store) + the shm cache, and the event carries a {data:{_ref}} placeholder + a reference block instead, keeping the event log lean. The server orchestrator resolves those references when driving state (hydrate_result_references, noetl/server#197). Lower it to push more results through the reference path (ops tuning, or forcing the path in kind validation; =256 was used to validate every PFT result by reference); raise it to keep more results inline. Read per-result in src/executor/command.rs::inline_context_max_bytes. |
| NOETL_RESULT_MATERIALIZER_ENABLED | false | no | Result materializer, the shadow Feather tier (noetl/ai-meta#104 Phase B, system pool only). Truthy spawns a SEPARATE noetl_events consume-loop (consumer noetl_result_materializer, own ack cursor) that, for an over-budget result, fetches the authoritative payload (read-only), tiers it (tabular → Arrow Feather, non-tabular → JSON, small → no write) and writes the body to the derived §7 physical_key via the server's PUT /api/internal/objects/{key}. The write is a shadow alongside noetl.result_store; nothing reads it until the resolve path is enabled. Never alters the authoritative result, never fails an event. Default off → not spawned (true no-op). |
| NOETL_RESULT_URI_RESOLVE | false | no | Resolve-by-URN read path (noetl/ai-meta#104 Phase C, consume pool). Truthy makes a step that binds the bulk of an over-budget upstream result resolve it by canonical logical URI → cell placement (server registry) → §7 key → object bytes (GET /api/internal/objects/{key}) instead of the legacy noetl.result_store fetch. On any registry miss / object miss / decode error it falls back fail-safe to the authoritative resolve_ref (noetl_worker_result_resolve_total{outcome} records resolved_* vs fallback_*). Default off → byte-identical legacy read path. |
| NOETL_RESULT_MINT_AUTHORITATIVE | false | no | Phase D minting flip (noetl/ai-meta#104 Phase D). ONE flag that makes the URN → Feather/GCS tier the authoritative result store: on the system pool it makes the result materializer the authoritative tier writer (implies NOETL_RESULT_MATERIALIZER_ENABLED); on the consume pool it makes resolve-by-URN the primary read path (implies NOETL_RESULT_URI_RESOLVE). A tier miss / parse failure still falls back fail-safe to the dual-written noetl.result_store (rollback safety), recorded on noetl_worker_result_mint_authoritative_total{path} (tier = authoritative tier served; legacy_fallback = reversible fallback served). Default off → byte-identical to Phase A–C (true no-op). The dual-write to result_store continues until the OQ5-gated retirement decision (NOT Phase D), so flag-off rolls back cleanly. |
| NOETL_SIDE_EFFECT_BARRIER | false | no | Side-effect durability barrier (noetl/ai-meta#104 Phase E, consume pool). Before the worker (re-)dispatches a side-effecting tool (per the registry classifier noetl_tools::registry::kind_is_side_effecting, noetl-tools 3.17.0), it checks whether the cycle's derived result URN already resolves to a durable result (the Phase C resolve-by-URN read path). If it does (the cycle already ran to a durable completion on a prior drive), the worker skips re-execution and adopts the recorded result, so an external side effect (HTTP POST, DB write, payment) fires exactly once across a crash-resume / re-drive. Non-side-effecting cycles are never blocked (idempotent recompute is fine); a side-effecting cycle whose result is not durable re-executes normally. Adopt-only: resolve_by_urn returns Some only on a durable tier hit, so the barrier can only ever turn a duplicate side effect into a single one, never drop work. Outcomes on noetl_worker_side_effect_barrier_total{outcome} (skipped = adopted durable result; executed = re-executed). Default off → byte-identical to today (the barrier block is short-circuited by the cheap flag read; dispatch unchanged). |
| NOETL_RESULT_TIER_DR | false | no | Result-tier DR re-derive (noetl/ai-meta#104 Phase F, system pool only). The Feather/JSON result tier is derivable from the WAL (an object's bytes are the deterministic encode of the authoritative payload and its location is computed from the logical URI), so a missing or corrupt tier object can be rebuilt from its source by re-running the materialization for its URN. Truthy puts the result materializer in verify-and-repair mode: for each over-budget result it re-derives the object + §7 key and rewrites it only when the durable object is missing or byte-divergent (corrupt); a healthy object is left untouched. The flag alone spawns the materializer (DR-only mode; no need for NOETL_RESULT_MATERIALIZER_ENABLED). Never alters the authoritative result_store source; the rewrite is byte-identical to a fresh materialization. A WAL event re-delivery is therefore a targeted DR repair. Outcomes on noetl_worker_result_tier_dr_total{outcome} (present / rederived / source_gone / error). Default off → byte-identical to Phase B/D (normal write path; this branch never runs). |

Off-server state builder shadow (RFC #115 Phase 4)

System worker pool only. The off-server state builder reconstructs orchestrator WorkflowState from the noetl_events WAL (not the materialized noetl.event table), walking the one-level prev_event_id chain and caching the built spine keyed by the immutable chain head (noetl/ai-meta#115 Phase 4). The shadow loop is observation-only — it proves on the running cluster that the builder reads the WAL with zero noetl.event scans and that the chain-walk + pool-side cache (hit / incremental tail-advance / cold-rebuild) behave, without touching the drive. The drive cutover that makes the drive consume this builder's state is staged behind the server's NOETL_STATE_BUILDER=offserver.

| Variable | Default | Required | Why |
| --- | --- | --- | --- |
| NOETL_STATE_BUILDER_SHADOW | unset (off) | no | Truthy (1/true/yes/on) spawns the off-server state-builder shadow loop (system pool). It opens its own ephemeral DeliverAll/AckNone pull consumer on noetl_events (never the materializer's durable noetl_materializer consumer), replays the retained WAL into a pool-side per-execution chain index, and exercises the chain walk + cache, emitting the state_builder metrics below. Observation-only: no drive impact; safe to leave on for monitoring. Default off → every other worker unaffected. |
| NOETL_STATE_BUILDER_STREAM | noetl_events | no | The JetStream WAL stream the shadow drains. Mirror of the materializer's NOETL_MATERIALIZER_STREAM; only override when the stream is renamed. |
| NOETL_STATE_BUILDER_BATCH | 200 | no | Bounded pull batch (clamped 1–1000). |
| NOETL_STATE_BUILDER_TIMEOUT_MS | 2000 | no | Pull-batch expiry wait. |
| NOETL_STATE_BUILDER_IDLE_SLEEP_MS | 500 | no | Sleep when a drain comes back empty (keeps the idle loop off the CPU). |

State-builder metrics (:9090/metrics): noetl_worker_state_builder_wal_events_total (events consumed from the WAL — the WAL-read proof, RFC tenet 5) · noetl_worker_state_builder_event_scans_total (noetl.event scans the builder issued — the no-scan proof, RFC tenet 3; stays 0) · noetl_worker_state_builder_builds_total{outcome=cache_hit|incremental|cold_rebuild|incomplete} (cache effectiveness + correctness) · noetl_worker_state_builder_chain_hops (chain-walk depth per cold rebuild).
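
Because the no-scan proof is "this counter stays 0", it lends itself to an alert. The rule below is an illustrative sketch in the standard Prometheus rule-file format; the group name, alert name, 10-minute window, and severity label are all hypothetical, and on GMP or VictoriaMetrics the same expression would be wrapped in that stack's own rule object.

```yaml
groups:
  - name: noetl-worker-state-builder            # hypothetical group name
    rules:
      - alert: StateBuilderScannedEventTable    # hypothetical alert name
        # Fires if the builder issued any noetl.event scan in the window.
        expr: increase(noetl_worker_state_builder_event_scans_total[10m]) > 0
        labels:
          severity: warning                     # placeholder
        annotations:
          summary: "Off-server state builder issued a noetl.event scan (expected 0)"
```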

Secrets handling

Same shape as the server:

| Secret | Storage | Mount as |
| --- | --- | --- |
| NATS_PASSWORD | K8s Secret | valueFrom.secretKeyRef |
| Any value listed in NOETL_KEYCHAIN_ENV_VARS | K8s Secret (when used) | valueFrom.secretKeyRef per allowlisted name |

Per agents/rules/execution-model.md "Secrets and credentials rule": business-logic credentials (third-party API tokens, tenant database DSNs) belong in the NoETL keychain via the server's /api/keychain/* routes. The allowlist mechanism above is a bridge only for credentials that already live in the pod env via existing trust (GKE Workload Identity, etc.).
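
The table above can be sketched as the following env fragment. The Secret names and keys are hypothetical; the allowlisted variable names are the ones used in the Keychain credentials example.

```yaml
env:
  - name: NATS_PASSWORD
    valueFrom:
      secretKeyRef:
        name: noetl-nats-auth                    # hypothetical Secret
        key: password                            # hypothetical key
  # Allowlist: only names listed here can be referenced by alias from a playbook.
  - name: NOETL_KEYCHAIN_ENV_VARS
    value: "NOETL_FLIGHT_BEARER_TOKEN,OPENAI_API_KEY"
  - name: NOETL_FLIGHT_BEARER_TOKEN
    valueFrom:
      secretKeyRef:
        name: noetl-worker-bridge-credentials    # hypothetical Secret
        key: flight-bearer-token
  - name: OPENAI_API_KEY
    valueFrom:
      secretKeyRef:
        name: noetl-worker-bridge-credentials    # hypothetical Secret
        key: openai-api-key
```

Each allowlisted name needs its own secretKeyRef entry: the allowlist only permits the alias, it does not load the value.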

Observability

  • Metrics: Prometheus surface at :9090/metrics per agents/rules/observability.md. Surface includes NATS pull rate, command claim outcomes, tool dispatch durations per tool kind, result-store PUT latency.
  • JetStream consumer lag (noetl_worker_nats_consumer_pending + noetl_worker_nats_consumer_ack_pending, labelled {stream, consumer}): a periodic lag poller queries JetStream consumer info on an independent task and sets these gauges. It covers the command consumer always, and the materializer consumer ({stream="noetl_events", consumer="noetl_materializer"}) whenever NOETL_MATERIALIZER_ENABLED=true. The latter is the earliest signal that, under the server's NOETL_EVENT_INGEST_PUBLISH_ONLY gate, published events are piling up un-materialized; it climbs even when the materializer loop itself has stalled or died (the loop can't report its own lag; the independent poller can). It is the metric the CQRS flip guardrail alerts read (see noetl/ai-meta#103 and the ops runbook noetl-cqrs-publish-only-flip.md).
  • Materializer counters: noetl_worker_materializer_drained_total / _projected_total / _duplicates_total / _acked_total / _project_errors_total + _cycle_duration_seconds — the ack-after-materialize loop's throughput + the no-loss redelivery surface (project_errors = a batch left un-acked to redeliver).
  • Scrape path (the metrics above only guard the flip if something scrapes them): the :9090/metrics port is exposed by a headless *-metrics Service per pool. On the kind dev cluster a VictoriaMetrics VMServiceScrape selects them; on prod (GKE) the cluster runs Google Managed Prometheus, which does NOT honor prometheus.io/scrape annotations — the worker pods are scraped by a GMP PodMonitoring (ci/manifests/noetl/gmp/podmonitoring-noetl.yaml, ops wiki Production monitoring (GMP)). Without that object the lag gauge + materializer counters never reach Managed Prometheus and the CQRS flip guardrail is blind.
  • Tracing: tracing spans on every NATS message, every tool dispatch. execution_id is a span field.
  • Logs: structured JSON via tracing-subscriber.
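
For the GKE scrape path described above, a GMP PodMonitoring has roughly the shape below. This is an illustrative sketch, not a copy of ci/manifests/noetl/gmp/podmonitoring-noetl.yaml: the object name, namespace, pod label, and scrape interval are assumptions.

```yaml
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: noetl-worker              # hypothetical
  namespace: noetl                # assumed namespace
spec:
  selector:
    matchLabels:
      app: noetl-worker           # hypothetical pod label
  endpoints:
    - port: 9090                  # the worker's metrics port
      path: /metrics
      interval: 30s               # placeholder scrape interval
```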

Validation procedure

When changing any of the above, validate per agents/rules/deployment-validation.md:

  1. cargo build --release --bins
  2. cargo test --quiet
  3. Build image locally + load into kind: kind load docker-image …
  4. Apply the manifests against kubectl --context kind-noetl
  5. Smoke-test:
    • Submit a playbook execution via the noetl CLI.
    • Confirm the worker claims the command, runs the tool, and POSTs lifecycle events back.
    • For env-var changes: verify kubectl exec … env | grep <VAR> shows the value AND the worker's startup log line confirms it was honored.

Only after kind passes does the change roll forward to Cloud Build + GKE.
