Kadyapam edited this page Jun 6, 2026 · 33 revisions

Deployment Specification

This page is the durable reference for deploying noetl-server into any environment. It covers the runtime contract the binary expects, the resources it consumes, the network surface it exposes, and — critically — every environment variable it reads, with the why behind each one.

This page is the single source of truth for the deployment shape. Any code change that adds, renames, removes, or shifts the meaning of an env var MUST update the Environment Variables section in the same change set. Same rule for ports, dependencies, and runtime requirements. See agents/rules/wiki-maintenance.md.

The matching deployment manifests live in noetl/ops (Helm chart + kind overlays). This wiki page describes what the manifests need to provide; the manifests are the implementation.

Component summary

| Field | Value |
| --- | --- |
| Repo | noetl/server |
| Binary | noetl-control-plane |
| Container image | noetl-server (built from the repo's Dockerfile) |
| Image versioning | crates.io version pinned in Cargo.toml; semver releases tagged vX.Y.Z |
| Current version | see Cargo.toml package.version |
| Language / runtime | Rust 1.91+; Tokio multi-threaded |
| Process model | Single binary, single process per pod |

Runtime contract

What the binary expects from its environment to start cleanly:

  1. Postgres reachable at ${POSTGRES_HOST}:${POSTGRES_PORT} with the noetl schema migrated. See Database.
  2. NATS reachable at ${NATS_URL} (optional but strongly recommended; without it the server runs in degraded mode and doesn't publish command notifications).
  3. Encryption key present in NOETL_ENCRYPTION_KEY — required to decrypt credentials at rest. Absent → credential routes return 500.
  4. Machine ID in NOETL_SERVER_MACHINE_ID per replica (Phase F R1.5; see Snowflake ID generation). Absent → derives from HOSTNAME; fine for single-replica dev, NOT fine for production replicas (hash collisions produce duplicate snowflakes).

Network surface

Ports

| Port | Protocol | Purpose | Bind |
| --- | --- | --- | --- |
| 8082 | HTTP | Main API surface: /api/events, /api/commands/*, /api/executions/*, /api/catalog/*, /api/credentials/*, /api/keychain/*, /api/worker/pool/*, /api/runtime/* (incl. /api/runtime/shard-info for the Phase F R3b drift-guard), /api/vars/*, /api/internal/* | ${NOETL_HOST}:${NOETL_PORT} (default 0.0.0.0:8082) |
| 8082/metrics | HTTP | Prometheus scrape endpoint. Gated by NOETL_DISABLE_METRICS | same |
| 8082/healthz | HTTP | Liveness probe (returns 200 once Axum is serving) | same |
| 8082/readyz | HTTP | Readiness probe (200 only when Postgres + NATS are reachable) | same |

The metrics + health endpoints share the same port as the API intentionally so the LB only routes one port per replica.

Dependencies (outbound)

| Target | Protocol | Why |
| --- | --- | --- |
| PostgreSQL | TCP 5432 (default) | Event log, command queue, catalog, credentials, runtime registration. See Data Access Boundary. |
| NATS JetStream | TCP 4222 (default) | Publish command notifications to noetl.commands.system.<execution_id> or noetl.commands.shared.<execution_id>. Workers consume from these subjects. |

The server does not call out to gateway, worker, or any third-party API. All third-party traffic flows through workers.

Resources

Recommended starting point for production. Scale on RAM (Postgres connection pool) and CPU (axum request handling). Requests/sec scale linearly with replica count, so add replicas once each replica's Postgres pool is saturated.

| Resource | Request | Limit | Notes |
| --- | --- | --- | --- |
| CPU | 250m | 1000m | One replica handles ~1k req/s sustained at p95 < 100ms on the Phase B R4 load smoke. |
| Memory | 256Mi | 512Mi | Dominated by sqlx pool + Tokio runtime; no large in-memory state. |

Phase F (sharding) scales this horizontally — see sharding-design.

Health probes

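| Probe | Path | Initial delay | Period | Failure threshold | Effect |
| --- | --- | --- | --- | --- | --- |
| Liveness | /healthz | 10s | 10s | 3 | Pod restart |
| Readiness | /readyz | 5s | 5s | 3 | Removed from Service endpoints |
| Startup | /healthz |  | 5s | 30 | Gives the binary 150s (30 failures × 5s period) to come up before liveness kicks in |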

/readyz checks Postgres + NATS reachability; the LB takes the replica out of rotation while it's not ready instead of restarting the pod (NATS hiccups don't need pod restarts).

Snowflake ID generation

Phase F R1.5 of noetl/ai-meta#49 moved event_id / command_id / execution_id generation out of the DB-side noetl.snowflake_id() function into an app-side generator. See sharding-design and src/snowflake.rs.

Each replica MUST set NOETL_SERVER_MACHINE_ID to a distinct 10-bit value. Without that:

  • Local dev / single-node deployments: the server derives a machine_id by hashing HOSTNAME. Fine.
  • Multi-replica deployments: two pods with hostname-derived ids can collide on the same 10-bit value. Collisions produce duplicate snowflakes (same execution_id minted by two replicas in the same ms). Set the env var explicitly.

Typical patterns:

  • StatefulSet: derive from the pod ordinal via valueFrom: { fieldRef: { fieldPath: metadata.labels['apps.kubernetes.io/pod-index'] } }, piped through an initContainer that emits the integer. With up to 1024 replicas (ordinals 0–1023), every id is distinct.
  • Deployment: assign a fixed integer per replica via Helm values (replicaIndex), or generate from a CRD operator.
  • Last resort: leave unset and let HOSTNAME hashing do its job; accept the collision risk if replica count stays small.

The id is not security-sensitive; it's a deployment coordination knob. Leaking it does not weaken anything.
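The resolution order described above (explicit value first, hostname hash as fallback) can be sketched as follows. This is a minimal illustration, not the code in src/snowflake.rs: the names `resolve_machine_id` and `MachineIdSource` are hypothetical, and folding the 64-bit FNV-1a hash to 10 bits by masking the low bits is an assumption about how the reduction is done.

```rust
/// FNV-1a (64-bit) over the UTF-8 bytes of `s`.
fn fnv1a_64(s: &str) -> u64 {
    const OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
    const PRIME: u64 = 0x0000_0100_0000_01b3;
    let mut hash = OFFSET_BASIS;
    for byte in s.as_bytes() {
        hash ^= u64::from(*byte);
        hash = hash.wrapping_mul(PRIME);
    }
    hash
}

/// Where the machine id came from (hypothetical type, useful for the startup log line).
#[derive(Debug, PartialEq)]
enum MachineIdSource {
    Explicit,
    Hostname,
}

/// Resolve a 10-bit machine id (0..=1023).
/// `explicit` stands in for the value of NOETL_SERVER_MACHINE_ID, if set.
/// `hostname` stands in for HOSTNAME and is only used as a fallback.
fn resolve_machine_id(
    explicit: Option<&str>,
    hostname: &str,
) -> Result<(u16, MachineIdSource), String> {
    match explicit {
        Some(raw) => {
            let id: u16 = raw
                .trim()
                .parse()
                .map_err(|e| format!("invalid NOETL_SERVER_MACHINE_ID {raw:?}: {e}"))?;
            if id > 1023 {
                return Err(format!("NOETL_SERVER_MACHINE_ID {id} is outside 0..=1023"));
            }
            Ok((id, MachineIdSource::Explicit))
        }
        // Assumption: keep the low 10 bits of the hash.
        None => Ok(((fnv1a_64(hostname) & 0x3ff) as u16, MachineIdSource::Hostname)),
    }
}
```

A caller would typically pass `std::env::var("NOETL_SERVER_MACHINE_ID").ok().as_deref()` as the first argument. With only 1024 possible values, two hostname-derived ids can coincide even at small replica counts, which is why the explicit path is the one to rely on in production.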

Sharding

Phase F R2 of noetl/ai-meta#49 added the ShardConfig + shard_for(execution_id, N) helper in src/sharding.rs. R2 is infrastructure only — the helper is available to handlers but no live request path enforces shard membership yet.

Default behavior (current deployments):

With both variables unset, the defaults apply: NOETL_SHARD_COUNT=1 and NOETL_SHARD_INDEX=0. ShardConfig::owns(execution_id) short-circuits to true for every execution, so routing is unchanged from pre-R2 behavior.

Enabling sharding (Phase F R4+ cutover):

  1. Decide N for the cluster (typically a small power of two: 2, 4, 8, 16).
  2. Set NOETL_SHARD_COUNT=N cluster-wide (Helm value or ConfigMap reference).
  3. Set NOETL_SHARD_INDEX={0..N-1} per-pod — distinct for each replica. See the patterns in Snowflake ID generation (StatefulSet ordinal / Deployment Helm value).
  4. Update the gateway / ingress LB to route by hash(execution_id) % N (Phase F R3 — not yet implemented).
  5. Partition the per-execution DB tables (Phase F R4).

Routing key derivation:

The shard for a given execution_id is hash(execution_id) % shard_count, where hash is twox_hash::XxHash64 with fixed seed 0. The seed and the hash crate version pin the assignment permanently: changing either invalidates every existing shard mapping, so neither may change once a deployment has started sharding.

src/sharding.rs documents the hash-function choice (xxhash for good avalanche on sequential snowflake i64s; alternatives rejected: DefaultHasher is release-unstable; ahash default is seed-randomized; FNV-1a has weak distribution on sequential i64s).
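The ownership rule and the startup validation can be illustrated with a small std-only sketch. It is not the code in src/sharding.rs: `owns_hash` and `shard_for_hash` are hypothetical names that take an already-computed 64-bit hash, so the example stays free of external crates. The real helper hashes execution_id with twox_hash::XxHash64 (seed 0) first. The sketch also returns a Result where the server is documented to panic at startup.

```rust
/// Per-replica shard settings (mirrors NOETL_SHARD_INDEX / NOETL_SHARD_COUNT).
#[derive(Debug, Clone, Copy, PartialEq)]
struct ShardConfig {
    shard_index: u32,
    shard_count: u32,
}

/// Map a precomputed hash of execution_id onto a shard in 0..shard_count.
/// Hypothetical helper: the hash itself is assumed to come from XxHash64 with seed 0.
fn shard_for_hash(execution_hash: u64, shard_count: u32) -> u32 {
    (execution_hash % u64::from(shard_count)) as u32
}

impl ShardConfig {
    /// Validate shard_index < shard_count, failing fast on a config bug.
    fn new(shard_index: u32, shard_count: u32) -> Result<Self, String> {
        if shard_count == 0 {
            return Err("shard_count must be at least 1".to_string());
        }
        if shard_index >= shard_count {
            return Err(format!(
                "shard_index {shard_index} must be < shard_count {shard_count}"
            ));
        }
        Ok(Self { shard_index, shard_count })
    }

    /// True when this replica owns the execution whose hash is given.
    /// With shard_count == 1 every execution is owned (sharding disabled).
    fn owns_hash(&self, execution_hash: u64) -> bool {
        if self.shard_count == 1 {
            return true;
        }
        shard_for_hash(execution_hash, self.shard_count) == self.shard_index
    }
}
```

For any given hash, exactly one of the N valid configurations reports ownership, which is the property the ingress routing in the later phases depends on.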

Diagnostic endpoint (Phase F R3b-1)

Added in noetl-server v2.12.0 (noetl/server#46). Public, deterministic, no auth gate.

GET /api/runtime/shard-info?execution_id=<i64>&shard_count=<u32>

Returns the shard the server's shard_for() selects for the given pair, plus the server's own configured shard for diagnostic completeness:

{
  "execution_id": 320816801799737344,
  "shard_count": 4,
  "shard_index": 2,
  "source": "noetl-server",
  "hash_function": "twox_hash::XxHash64",
  "seed": 0,
  "server_config": {
    "shard_index": 0,
    "shard_count": 1
  }
}

Validation:

  • execution_id parses as i64 (decimal); non-numeric → 400.
  • shard_count is required, range 1..=1024; 0 or > 1024 → 400.
  • No default for shard_count — explicit param to avoid silently mixing the math output with the server's own deployment topology.
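The validation rules above can be expressed as a small parsing function. This is a sketch under assumptions: `ShardInfoQuery` and `parse_shard_info_query` are hypothetical names, the raw query values are passed in as optional strings, and every Err is assumed to map to an HTTP 400 in the handler (including a missing execution_id, which the rules above do not spell out).

```rust
/// Validated query parameters for the shard-info endpoint (hypothetical type).
#[derive(Debug, PartialEq)]
struct ShardInfoQuery {
    execution_id: i64,
    shard_count: u32,
}

/// Parse and validate the two query parameters.
/// An Err is assumed to be turned into a 400 response by the caller.
fn parse_shard_info_query(
    execution_id: Option<&str>,
    shard_count: Option<&str>,
) -> Result<ShardInfoQuery, String> {
    // execution_id must be a decimal i64.
    let raw_id = execution_id.ok_or_else(|| "execution_id is required".to_string())?;
    let execution_id: i64 = raw_id
        .parse()
        .map_err(|_| format!("execution_id {raw_id:?} is not a decimal i64"))?;

    // shard_count has no default and must lie in 1..=1024.
    let raw_count = shard_count.ok_or_else(|| "shard_count is required".to_string())?;
    let shard_count: u32 = raw_count
        .parse()
        .map_err(|_| format!("shard_count {raw_count:?} is not a u32"))?;
    if !(1..=1024).contains(&shard_count) {
        return Err(format!("shard_count {shard_count} is outside 1..=1024"));
    }

    Ok(ShardInfoQuery { execution_id, shard_count })
}
```

Keeping shard_count mandatory in the signature, rather than falling back to the server's own configuration, matches the rule that the math output must not be silently mixed with the deployment topology.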

Companion endpoints:

  • noetl-gateway's twin endpoint (Phase F R3b-2) returns the same shape with source: "noetl-gateway".
  • The integration test in noetl/ops (Phase F R3b-3) queries both endpoints and asserts they agree across a battery of (execution_id, N) pairs, catching end-to-end drift in the deployed twox-hash versions, seed constants, or byte-encoding choices.

The endpoint is intentionally cheap: pure math, no DB access, no NATS publish. Safe to leave reachable from internal networks; the shard math is not sensitive.

Environment variables

All env vars the binary reads at startup, with the why behind each one.

The envy prefix convention varies by config section:

  • NOETL_* — main app config (src/config/app.rs).
  • POSTGRES_* — DB config (src/config/database.rs).
  • NATS_* — NATS connection (used directly in src/main.rs).
  • A few unprefixed vars (DATABASE_URL, HOSTNAME) for compatibility with standard tooling.

Main app (NOETL_*)

| Variable | Default | Required | Why |
| --- | --- | --- | --- |
| NOETL_HOST | 0.0.0.0 | no | Bind address for the HTTP server. Override only when you don't want to accept traffic on all interfaces (rare in containers). |
| NOETL_PORT | 8082 | no | Port the HTTP server listens on. Match this in the Service spec + readiness probe. |
| NOETL_WORKERS | (CPU count) | no | Tokio worker thread count. Default of CPU count is correct for most workloads; reduce only for memory-constrained pods. |
| NOETL_DEBUG | false | no | Enable debug-level tracing + verbose error responses. NEVER true in production: it exposes internal error details to clients. |
| NOETL_SERVER_NAME | noetl-control-plane | no | Identifier the server uses in self-reporting (health endpoints, log fields). Useful when running multiple servers behind one LB. |
| NOETL_ENCRYPTION_KEY |  | yes | Base64-encoded 32-byte AES-256-GCM key for encrypting credentials at rest. Read at startup; never logged. Rotation is a multi-step procedure (re-encrypt all existing credential rows under the new key before retiring the old). Absent → credential routes return 500. |
| NOETL_INTERNAL_API_TOKEN |  | when /api/internal/* used | Constant-time-compared bearer token gating /api/internal/* routes (outbox claim, event projection, etc.). Only the system worker pool's K8s ServiceAccount should hold this token; user playbooks must NOT see it. Absent → internal routes return 401. |
| NOETL_PUBLIC_SERVER_URL | (localhost fallback) | yes for kind/GKE | URL workers should call back on when they receive a NATS command notification. Embedded verbatim in the server_url field of each NATS message. When unset, a http://localhost:<port> fallback is used; this won't work cross-pod in kind or GKE, so the deployment manifest must override. |
| NOETL_DISABLE_METRICS | false | no | If true, the /metrics route is not registered. Use only when running behind a separate metrics-export sidecar. |
| NOETL_AUTO_RECREATE_RUNTIME | true | no | If true, the server re-registers the noetl.runtime row for itself at startup if missing. Set false in environments where runtime rows are managed externally (operator). |
| NOETL_RUNTIME_SWEEP_INTERVAL | 30 | no | Seconds between runtime-pool offline-detection sweeps. Larger value = workers take longer to be marked offline after a crash. Default is appropriate for production. |
| NOETL_RUNTIME_OFFLINE_SECONDS | (see code) | no | Heartbeat staleness threshold before a runtime row is marked offline. Increase only when worker heartbeats are unreliable (poor network). |
| NOETL_SERVER_MACHINE_ID | (derived from HOSTNAME) | per-replica in prod | 10-bit machine id (0–1023) for the application-side snowflake generator. Each pod in a deployment MUST set a distinct value to avoid id collisions. See Snowflake ID generation. Added in Phase F R1.5 (noetl/server#42). |
| NOETL_SHARD_INDEX | 0 | per-replica when sharded | Phase F R2 (noetl/server#44). Shard index 0..N-1 this replica owns. Single-replica / pre-sharding deployments leave it unset (defaults to 0 with NOETL_SHARD_COUNT=1, no enforcement). When NOETL_SHARD_COUNT > 1 each replica MUST set a distinct value. Startup validates shard_index < shard_count and panics otherwise (fail fast on config bug rather than silently mis-route). See Sharding + sharding-design for the cross-cluster routing context. |
| NOETL_SHARD_COUNT | 1 | cluster-wide when sharded | Phase F R2. Total shard count for the cluster. 1 (the default) disables sharding: every replica owns every execution_id and ShardConfig::owns short-circuits to true. Every replica MUST agree on this value or routing diverges. When operators raise it to N > 1 (Phase F R4 cutover), the NOETL_SHARD_INDEX per replica must be set in lockstep. |
| NOETL_SCHEMA | noetl | no | Postgres schema the server reads from / writes to. Override only when running multiple NoETL deployments in one DB. |

Secret providers (keychain resolution)

The keychain resolver fetches a provider:-backed credential alias from an external secret manager on a credential-store miss (Secrets Wallet Phase 3, noetl/ai-meta#61). build_secret_provider(provider) selects the backend by the keychain entry's provider id; each reads its config from the environment.

GCP Secret Manager (provider: gcp) — REST :access + ambient GKE Workload-Identity token.

| Variable | Default | Required | Why |
| --- | --- | --- | --- |
| GOOGLE_CLOUD_PROJECT / GCP_PROJECT |  | when a gcp entry omits its project | Default project for a gcp secret ref that isn't a fully-qualified projects/.../secrets/... path. |
| NOETL_GCP_SM_ENDPOINT | https://secretmanager.googleapis.com/v1 | no | Secret Manager base URL. Override for a mock in tests. |
| NOETL_GCP_METADATA_TOKEN_URL | (GKE metadata server) | no | Workload-Identity token URL. Override for tests. |

Kubernetes Secrets (provider: k8s / kubernetes) — reads a Secret object from the in-cluster API server with the pod's ServiceAccount token, trusting the cluster CA. No cloud credentials; the only dependency is the API server, so this is the one secret backend fully kind-validatable end-to-end. A keychain entry references a value as [<namespace>/]<secret>/<key> (a bare <secret> requires the Secret to hold exactly one data key).

| Variable | Default | Required | Why |
| --- | --- | --- | --- |
| NOETL_K8S_API_URL | https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT, else https://kubernetes.default.svc | no | API server URL. In-cluster the KUBERNETES_SERVICE_* vars are injected automatically; override only for an out-of-cluster / mock endpoint. |
| NOETL_K8S_NAMESPACE | projected <sa>/namespace, else default | no | Namespace used when a k8s ref omits its own (<secret>/<key> form). In-cluster the projected file supplies the pod's own namespace. |
| NOETL_K8S_CA_FILE | /var/run/secrets/kubernetes.io/serviceaccount/ca.crt | no | Cluster CA bundle, added as a trust root. Absent (e.g. an http:// mock) → system roots. |
| NOETL_K8S_TOKEN_FILE | /var/run/secrets/kubernetes.io/serviceaccount/token | no | ServiceAccount bearer token path; re-read per fetch so projected-token rotation is honored. |
| NOETL_K8S_TOKEN |  | no | Inline bearer token override (tests / mock API servers). Takes precedence over NOETL_K8S_TOKEN_FILE. |

RBAC. The server's ServiceAccount needs get (and list) on secrets in each namespace it resolves from — the default ServiceAccount has none. The deployment manifests in noetl/ops must bind a Role granting secrets: [get, list] to the server SA before a provider: k8s keychain entry can resolve. (Kind validation uses a manually-applied noetl-server-secrets-reader Role/RoleBinding.)
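The k8s reference form [<namespace>/]<secret>/<key> described above splits cleanly on "/". The sketch below is illustrative only: `K8sSecretRef` and `parse_k8s_ref` are hypothetical names, and rejecting empty segments is an assumption. A missing namespace is left as None so the caller can apply NOETL_K8S_NAMESPACE, and a missing key is left as None so the caller can enforce the "exactly one data key" rule after fetching the Secret.

```rust
/// Parsed form of a provider: k8s keychain reference (hypothetical type).
#[derive(Debug, PartialEq)]
struct K8sSecretRef {
    /// None means "use the default namespace" (NOETL_K8S_NAMESPACE or the projected one).
    namespace: Option<String>,
    secret: String,
    /// None means the Secret must hold exactly one data key.
    key: Option<String>,
}

/// Parse "[<namespace>/]<secret>/<key>" or a bare "<secret>".
fn parse_k8s_ref(reference: &str) -> Result<K8sSecretRef, String> {
    let parts: Vec<&str> = reference.split('/').collect();
    // Assumption: empty segments ("a//b", "", "/a") are treated as malformed.
    if parts.iter().any(|p| p.is_empty()) {
        return Err(format!("malformed k8s secret ref {reference:?}: empty segment"));
    }
    match parts.as_slice() {
        &[secret] => Ok(K8sSecretRef {
            namespace: None,
            secret: secret.to_string(),
            key: None,
        }),
        &[secret, key] => Ok(K8sSecretRef {
            namespace: None,
            secret: secret.to_string(),
            key: Some(key.to_string()),
        }),
        &[namespace, secret, key] => Ok(K8sSecretRef {
            namespace: Some(namespace.to_string()),
            secret: secret.to_string(),
            key: Some(key.to_string()),
        }),
        _ => Err(format!("malformed k8s secret ref {reference:?}: too many segments")),
    }
}
```

Each namespace that appears in a parsed reference (or the default one) is a namespace where the server's ServiceAccount needs the secrets Role binding described above.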

HashiCorp Vault (provider: vault) — reads a KV v2 secret via the Vault REST API (GET <addr>/v1/<mount>/data/<path>), authenticating with X-Vault-Token. Like Kubernetes Secrets, Vault can run in-cluster, so this backend is fully kind-validatable. A keychain entry references a value as [<mount>/]<path>#<key> (a bare [<mount>/]<path> requires the secret to hold exactly one key); the metadata.version is carried as the value version.

| Variable | Default | Required | Why |
| --- | --- | --- | --- |
| VAULT_ADDR | http://127.0.0.1:8200 | yes (real Vault) | Vault server address. In kind: http://vault.<ns>.svc:8200. |
| VAULT_TOKEN |  | one of token sources | Vault token for the X-Vault-Token header. A platform credential (the server's own auth to Vault), so env/Secret is acceptable per the secrets rule. Production should prefer the Vault Kubernetes auth method (SA JWT → Vault token) over a static token (a follow-up). |
| NOETL_VAULT_TOKEN_FILE |  | alt to VAULT_TOKEN | Token file path; re-read per fetch (rotating tokens). |
| VAULT_NAMESPACE |  | Enterprise only | Sent as X-Vault-Namespace. |
| NOETL_VAULT_KV_MOUNT | secret | no | Default KV v2 mount when a ref omits <mount>/. |
| NOETL_VAULT_CA_FILE |  | for https:// | Vault CA bundle, added as a trust root. |

Database (POSTGRES_* + DATABASE_URL)

| Variable | Default | Required | Why |
| --- | --- | --- | --- |
| DATABASE_URL |  | when set, overrides individual POSTGRES_* | Full Postgres connection URL. Standard ecosystem env var; honored as an override for cases where the connection comes from a managed Postgres service that emits a URL. |
| POSTGRES_HOST | localhost | yes | Postgres host. In kind: postgres.noetl.svc; in GKE: the Cloud SQL Proxy sidecar address. |
| POSTGRES_PORT | 5432 | no | Postgres port. |
| POSTGRES_USER | noetl | yes | Postgres user; needs SELECT/INSERT/UPDATE/DELETE on noetl.* plus EXECUTE on noetl.snowflake_id() (the latter is the DB-side fallback per observability.md). |
| POSTGRES_PASSWORD | (empty) | yes for non-trust auth | Postgres password. Should come from a K8s Secret, not the Deployment env directly. |
| POSTGRES_DATABASE | noetl | no | Postgres database name. |
| POSTGRES_MAX_CONNECTIONS | 10 | no | sqlx pool max. Phase F R4 sharding will partition this across shards; until then, the math is replicas × 10 ≤ Postgres max_connections − headroom. See agents/rules/data-access-boundary.md. |
| POSTGRES_MIN_CONNECTIONS | 1 | no | sqlx pool min; keep at least one connection warm. |
| POSTGRES_ACQUIRE_TIMEOUT | 30 | no | Seconds before a pool.acquire() call fails. Increase only when Postgres is known to be slow under load. |
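The connection budget stated for POSTGRES_MAX_CONNECTIONS (replicas × pool max ≤ Postgres max_connections − headroom) is easy to check before scaling out. The helpers below are a hypothetical sketch: the function names are made up for illustration, and any headroom figure passed in is an operator's choice, not a value from this page.

```rust
/// True when `replicas` pools of `pool_max` connections fit inside the
/// Postgres limit after reserving `headroom` connections for other clients.
fn fits_connection_budget(
    replicas: u32,
    pool_max: u32,
    pg_max_connections: u32,
    headroom: u32,
) -> bool {
    // Widen to u64 so the product cannot overflow.
    let needed = u64::from(replicas) * u64::from(pool_max);
    let available = u64::from(pg_max_connections.saturating_sub(headroom));
    needed <= available
}

/// Largest replica count that still satisfies the budget.
fn max_replicas(pool_max: u32, pg_max_connections: u32, headroom: u32) -> u32 {
    if pool_max == 0 {
        return 0;
    }
    pg_max_connections.saturating_sub(headroom) / pool_max
}
```

As an illustrative calculation, a Postgres limit of 100 connections with 20 reserved and the default pool max of 10 leaves room for 8 replicas. A ninth replica would need either a smaller pool or a higher server-side limit.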

NATS

| Variable | Default | Required | Why |
| --- | --- | --- | --- |
| NATS_URL |  | recommended | NATS server URL (nats://<host>:4222). Without it, the server runs in degraded mode and doesn't publish command notifications. Workers see no work; useful for read-only deployments. |
| NATS_USER |  | when NATS auth | NATS user. |
| NATS_PASSWORD |  | when NATS auth | NATS password. Should come from a K8s Secret. |

The NATS connection accepts a user:pass@ segment embedded in NATS_URL; the NATS_USER / NATS_PASSWORD form is the preferred shape per async_nats::ConnectOptions::with_user_and_password.

Misc / standard tooling

| Variable | Default | Required | Why |
| --- | --- | --- | --- |
| HOSTNAME | (set by container runtime) |  | Fallback source for the snowflake machine_id when NOETL_SERVER_MACHINE_ID is unset. Read at startup, hashed via FNV-1a to 10 bits. |
| COMPUTERNAME | (Windows) |  | Same fallback as HOSTNAME on Windows hosts (rare for production but useful for local dev). |
| RUST_LOG | (build default) | no | Standard tracing-subscriber filter. Default sufficient; set to noetl_server=debug,axum=info for targeted debugging. |

Secrets handling

Secrets must NOT be passed via plain Deployment.spec.template.spec.containers[].env[].value.

| Secret | Storage | Mount as |
| --- | --- | --- |
| NOETL_ENCRYPTION_KEY | K8s Secret (production) / sealed-secret (kind) | valueFrom.secretKeyRef |
| POSTGRES_PASSWORD | K8s Secret | valueFrom.secretKeyRef |
| NATS_PASSWORD | K8s Secret | valueFrom.secretKeyRef |
| NOETL_INTERNAL_API_TOKEN | K8s Secret | valueFrom.secretKeyRef |

In Cloud Build / GKE the K8s Secrets are projected from GCP Secret Manager via the standard CSI driver.

Per agents/rules/execution-model.md "Secrets and credentials rule": business-logic secrets (third-party API tokens, tenant database DSNs) belong in the NoETL keychain, NOT in env vars. The env vars above are the platform-runtime secrets: the server's own keys for talking to its own infrastructure.

Observability

  • Metrics: Prometheus surface at /metrics per agents/rules/observability.md. Cardinality discipline: execution_id is a span attribute, NOT a metric label (would blow up the registry).
  • Tracing: tracing spans on every boundary; execution_id is a span field.
  • Logs: structured JSON via tracing-subscriber's json layer. Default filter suppresses health endpoint noise per agents/rules/logging.md.
  • The snowflake machine id is logged once at startup with its derivation source (NOETL_SERVER_MACHINE_ID vs. derived from HOSTNAME). Useful for confirming the per-replica configuration is correct.

Validation procedure

When changing any of the above, validate per agents/rules/deployment-validation.md:

  1. cargo build --release --bins
  2. cargo test --quiet
  3. Build image locally + load into kind: kind load docker-image …
  4. Apply the manifests against kubectl --context kind-noetl
  5. Smoke-test the changed surface:
    • For env-var changes: confirm the var lands in kubectl exec … env | grep <VAR> AND that the binary behaves as documented when the value is present, absent, or invalid.
    • For port / probe changes: confirm kubectl get svc shows the right port + kubectl describe pod shows the probes passing.

Only after kind passes does the change roll forward to Cloud Build + GKE.
