deployment specification
This page is the durable reference for deploying noetl-server into any environment. It covers the runtime contract the binary expects, the resources it consumes, the network surface it exposes, and — critically — every environment variable it reads, with the why behind each one.
This page is the single source of truth for the deployment shape.
Any code change that adds, renames, removes, or shifts the
meaning of an env var MUST update the
Environment Variables section in the
same change set. Same rule for ports, dependencies, and runtime
requirements. See agents/rules/wiki-maintenance.md.
The matching deployment manifests live in noetl/ops (Helm chart + kind overlays). This wiki page describes what the manifests need to provide; the manifests are the implementation.
| Field | Value |
|---|---|
| Repo | noetl/server |
| Binary | noetl-control-plane |
| Container image | noetl-server (built from the repo's Dockerfile) |
| Image versioning | crates.io version pinned in Cargo.toml; semver releases tagged vX.Y.Z |
| Current version | see Cargo.toml package.version |
| Language / runtime | Rust 1.91+; Tokio multi-threaded |
| Process model | Single binary, single process per pod |
What the binary expects from its environment to start cleanly:
- Postgres reachable at ${POSTGRES_HOST}:${POSTGRES_PORT} with the noetl schema migrated. See Database.
- NATS reachable at ${NATS_URL} (optional but strongly recommended; without it the server runs in degraded mode and doesn't publish command notifications).
- Encryption key present in NOETL_ENCRYPTION_KEY — required to decrypt credentials at rest. Absent → credential routes return 500.
- Machine ID in NOETL_SERVER_MACHINE_ID per replica (Phase F R1.5; see Snowflake ID generation). Absent → derives from HOSTNAME; fine for single-replica dev, NOT fine for production replicas (hash collisions produce duplicate snowflakes).
| Port | Protocol | Purpose | Bind |
|---|---|---|---|
| 8082 | HTTP | Main API surface — /api/events, /api/commands/*, /api/executions/*, /api/catalog/*, /api/credentials/*, /api/keychain/*, /api/worker/pool/*, /api/runtime/* (incl. /api/runtime/shard-info for the Phase F R3b drift-guard), /api/vars/*, /api/internal/* | ${NOETL_HOST}:${NOETL_PORT} (default 0.0.0.0:8082) |
| 8082/metrics | HTTP | Prometheus scrape endpoint. Gated by NOETL_DISABLE_METRICS | same |
| 8082/healthz | HTTP | Liveness probe (returns 200 once Axum is serving) | same |
| 8082/readyz | HTTP | Readiness probe (200 only when Postgres + NATS are reachable) | same |
The metrics + health endpoints share the same port as the API intentionally so the LB only routes one port per replica.
| Target | Protocol | Why |
|---|---|---|
| PostgreSQL | TCP 5432 (default) | Event log, command queue, catalog, credentials, runtime registration. See Data Access Boundary. |
| NATS JetStream | TCP 4222 (default) | Publish command notifications to noetl.commands.{system\|shared}.<execution_id>. Workers consume from these subjects. |
The server does not call out to gateway, worker, or any third-party API. All third-party traffic flows through workers.
Recommended starting point for production. Scale on RAM (Postgres connection pool) and CPU (axum request handling); requests/sec scale linearly with replica count once each replica's Postgres pool is saturated.
| Resource | Request | Limit | Notes |
|---|---|---|---|
| CPU | 250m | 1000m | One replica handles ~1k req/s sustained at p95 < 100ms on the Phase B R4 load smoke. |
| Memory | 256Mi | 512Mi | Dominated by sqlx pool + Tokio runtime; no large in-memory state. |
Phase F (sharding) scales this horizontally — see sharding-design.
| Probe | Path | Initial delay | Period | Failure threshold | Effect |
|---|---|---|---|---|---|
| Liveness | /healthz | 10s | 10s | 3 | Pod restart |
| Readiness | /readyz | 5s | 5s | 3 | Removed from Service endpoints |
| Startup | /healthz | — | 5s | 30 | Gives the binary 150s (30 × 5s) to come up before liveness kicks in |
/readyz checks Postgres + NATS reachability; the LB takes the
replica out of rotation while it's not ready instead of restarting
the pod (NATS hiccups don't need pod restarts).
mTLS caveat. When NOETL_TLS_CLIENT_CA is set (mTLS), an
HTTP(S) httpGet probe cannot pass — Kubernetes probes don't
present a client certificate, so the TLS handshake is rejected
before the path is ever evaluated. Switch the liveness/readiness
probes to tcpSocket (a port-open check), or terminate mTLS at a
sidecar / serve health on a separate non-mTLS port. Server-only
TLS (no client CA) works with an httpGet probe at scheme: HTTPS.
See Transport security.
Phase F R1.5 of noetl/ai-meta#49
moved event_id / command_id / execution_id generation out of
the DB-side noetl.snowflake_id() function into an app-side
generator. See sharding-design and
src/snowflake.rs.
Each replica MUST set NOETL_SERVER_MACHINE_ID to a distinct
10-bit value. Without that:
- Local dev / single-node deployments: the server derives a machine_id by hashing HOSTNAME. Fine.
- Multi-replica deployments: two pods with hostname-derived ids can collide on the same 10-bit value. Collisions produce duplicate snowflakes (same execution_id minted by two replicas in the same ms). Set the env var explicitly.
Typical patterns:
- StatefulSet: derive from the pod ordinal — valueFrom: { fieldRef: { fieldPath: metadata.labels['apps.kubernetes.io/pod-index'] } }, pipe through an initContainer that emits the integer. Sub-1024 replicas all distinct.
- Deployment: assign a fixed integer per replica via Helm values (replicaIndex), or generate from a CRD operator.
- Last resort: leave unset and let HOSTNAME hashing do its job; accept the collision risk if replica count stays small.
The id is not security-sensitive; it's a deployment coordination knob. Leaking it does not weaken anything.
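To make the fallback concrete, here is a minimal sketch of how a machine id could be resolved. The FNV-1a constants are the standard 64-bit ones; the function names and the way the 64-bit hash is folded to 10 bits (masking the low bits) are assumptions for illustration, and the actual logic lives in src/snowflake.rs.

```rust
/// Standard 64-bit FNV-1a over raw bytes.
fn fnv1a_64(bytes: &[u8]) -> u64 {
    const OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
    const PRIME: u64 = 0x0000_0100_0000_01b3;
    bytes
        .iter()
        .fold(OFFSET_BASIS, |hash, &b| (hash ^ u64::from(b)).wrapping_mul(PRIME))
}

/// Hypothetical helper: an explicit NOETL_SERVER_MACHINE_ID value wins;
/// otherwise the hostname hash is folded to 10 bits by masking.
/// The masking step is an assumption, not taken from the server source.
fn resolve_machine_id(explicit: Option<&str>, hostname: &str) -> Result<u16, String> {
    match explicit {
        Some(raw) => {
            let id: u16 = raw
                .trim()
                .parse()
                .map_err(|e| format!("invalid machine id {raw:?}: {e}"))?;
            if id > 0x3ff {
                return Err(format!("machine id {id} exceeds the 10-bit range 0..=1023"));
            }
            Ok(id)
        }
        // Only 1024 possible values: two hostnames can easily share one.
        None => Ok((fnv1a_64(hostname.as_bytes()) & 0x3ff) as u16),
    }
}
```

With only 1024 possible outputs, hostname-derived ids collide quickly as the replica count grows, which is why an explicit per-replica value is required in production.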
Phase F R2 of noetl/ai-meta#49
added the ShardConfig + shard_for(execution_id, N) helper in
src/sharding.rs. R2 is infrastructure only — the helper is
available to handlers but no live request path enforces shard
membership yet.
Default behavior (current deployments):
NOETL_SHARD_COUNT=1, NOETL_SHARD_INDEX=0 (both unset).
ShardConfig::owns(execution_id) short-circuits to true for
every execution — no routing change vs. pre-R2 behavior.
Enabling sharding (Phase F R4+ cutover):
- Decide N for the cluster (typically a small power of two: 2, 4, 8, 16).
- Set NOETL_SHARD_COUNT=N cluster-wide (Helm value or ConfigMap reference).
- Set NOETL_SHARD_INDEX={0..N-1} per-pod — distinct for each replica. See the patterns in Snowflake ID generation (StatefulSet ordinal / Deployment Helm value).
- Update the gateway / ingress LB to route by hash(execution_id) % N (Phase F R3 — not yet implemented).
- Partition the per-execution DB tables (Phase F R4).
Routing key derivation:
The shard for a given execution_id is
hash(execution_id) % shard_count where hash is
twox_hash::XxHash64 with
fixed seed 0. The seed + hash crate version pin the
assignment forever — changing either invalidates every existing
shard mapping, so neither must change once a deployment has
started sharding.
src/sharding.rs documents the hash-function choice (xxhash for
good avalanche on sequential snowflake i64s; alternatives
rejected: DefaultHasher is release-unstable; ahash default is
seed-randomized; FNV-1a has weak distribution on sequential
i64s).
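The routing rule and the startup validation can be sketched as follows. This is an illustration under stated assumptions, not the contents of src/sharding.rs: the hasher is generic so the block compiles with the standard library alone, StandInHasher is a placeholder that is NOT xxHash, the little-endian byte encoding of execution_id is assumed, and new returns an error where the server is documented to panic.

```rust
use std::hash::Hasher;

/// Stand-in hasher so the sketch needs no external crates.
/// It is NOT the production hash: the text above names twox_hash::XxHash64
/// with seed 0 (for example `twox_hash::XxHash64::with_seed(0)`).
#[derive(Default)]
struct StandInHasher(u64);

impl Hasher for StandInHasher {
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 = (self.0 ^ u64::from(b)).wrapping_mul(0x0000_0100_0000_01b3);
        }
    }
    fn finish(&self) -> u64 {
        self.0
    }
}

/// hash(execution_id) % shard_count.
/// Feeding the id as little-endian bytes is an assumption of this sketch.
fn shard_for<H: Hasher>(mut hasher: H, execution_id: i64, shard_count: u32) -> u32 {
    assert!(shard_count > 0, "shard_count must be at least 1");
    hasher.write(&execution_id.to_le_bytes());
    (hasher.finish() % u64::from(shard_count)) as u32
}

/// Hypothetical mirror of the documented ShardConfig behavior.
struct ShardConfig {
    shard_index: u32,
    shard_count: u32,
}

impl ShardConfig {
    /// Rejects shard_index >= shard_count (the server fails fast at startup).
    fn new(shard_index: u32, shard_count: u32) -> Result<Self, String> {
        if shard_count == 0 || shard_index >= shard_count {
            return Err(format!(
                "shard_index {shard_index} must be < shard_count {shard_count}"
            ));
        }
        Ok(Self { shard_index, shard_count })
    }

    /// With shard_count == 1 every execution is owned (no hashing needed).
    fn owns<H: Hasher>(&self, hasher: H, execution_id: i64) -> bool {
        self.shard_count == 1
            || shard_for(hasher, execution_id, self.shard_count) == self.shard_index
    }
}
```

Because the result depends on the hash function, the seed, and the byte encoding together, swapping any one of them reassigns executions to different shards.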
Added in noetl-server v2.12.0 (noetl/server#46). Public, deterministic, no auth gate.
GET /api/runtime/shard-info?execution_id=<i64>&shard_count=<u32>
Returns the shard the server's shard_for() selects for the given pair, plus the server's own configured shard for diagnostic completeness:
{
"execution_id": 320816801799737344,
"shard_count": 4,
"shard_index": 2,
"source": "noetl-server",
"hash_function": "twox_hash::XxHash64",
"seed": 0,
"server_config": {
"shard_index": 0,
"shard_count": 1
}
}
Validation:
- execution_id parses as i64 (decimal); non-numeric → 400.
- shard_count is required, range 1..=1024; 0 or > 1024 → 400.
- No default for shard_count — explicit param to avoid silently mixing the math output with the server's own deployment topology.
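The validation rules above could be expressed roughly like this. The function name and error strings are hypothetical; in the real handler each Err would map to an HTTP 400.

```rust
/// Hypothetical validator for the shard-info query parameters.
/// Returns the parsed (execution_id, shard_count) pair or a 400-style message.
fn parse_shard_info_params(
    execution_id: &str,
    shard_count: Option<&str>,
) -> Result<(i64, u32), String> {
    let id: i64 = execution_id
        .parse()
        .map_err(|_| format!("execution_id must be a decimal i64, got {execution_id:?}"))?;

    // No default: the caller must state the shard count explicitly.
    let raw = shard_count.ok_or_else(|| "shard_count is required".to_string())?;
    let count: u32 = raw
        .parse()
        .map_err(|_| format!("shard_count must be an integer, got {raw:?}"))?;

    if !(1..=1024).contains(&count) {
        return Err(format!("shard_count {count} is outside 1..=1024"));
    }
    Ok((id, count))
}
```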
Companion endpoints:
- noetl-gateway's twin endpoint (Phase F R3b-2) returns the same shape with source: "noetl-gateway".
- The integration test in noetl/ops (Phase F R3b-3) queries both endpoints and asserts they agree across a battery of (execution_id, N) pairs — catches end-to-end drift in the deployed twox-hash versions, seed constants, or byte-encoding choices.
The endpoint is intentionally cheap: pure math, no DB access, no NATS publish. Safe to leave reachable from internal networks; the shard math is not sensitive.
All env vars the binary reads at startup, with the why behind each one.
The envy prefix convention varies by config section:
- NOETL_* — main app config (src/config/app.rs).
- POSTGRES_* — DB config (src/config/database.rs).
- NATS_* — NATS connection (used directly in src/main.rs).
- A few unprefixed vars (DATABASE_URL, HOSTNAME) for compatibility with standard tooling.
| Variable | Default | Required | Why |
|---|---|---|---|
| NOETL_HOST | 0.0.0.0 | no | Bind address for the HTTP server. Override only when you don't want to accept traffic on all interfaces (rare in containers). |
| NOETL_PORT | 8082 | no | Port the HTTP server listens on. Match this in the Service spec + readiness probe. |
| NOETL_WORKERS | (CPU count) | no | Tokio worker thread count. Default of CPU count is correct for most workloads; reduce only for memory-constrained pods. |
| NOETL_DEBUG | false | no | Enable debug-level tracing + verbose error responses. NEVER true in production — exposes internal error details to clients. |
| NOETL_SERVER_NAME | noetl-control-plane | no | Identifier the server uses in self-reporting (health endpoints, log fields). Useful when running multiple servers behind one LB. |
| NOETL_ENCRYPTION_KEY | — | yes | Base64-encoded 32-byte AES-256-GCM key for encrypting credentials at rest. Read at startup; never logged. Rotation is a multi-step procedure (re-encrypt all existing credential rows under the new key before retiring the old). Absent → credential routes return 500. |
| NOETL_INTERNAL_API_TOKEN | — | when /api/internal/* used | Constant-time-compared bearer token gating /api/internal/* routes (outbox claim, event projection, etc.). Only the system worker pool's K8s ServiceAccount should hold this token; user playbooks must NOT see it. Absent → internal routes return 401. |
| NOETL_PUBLIC_SERVER_URL | — (localhost fallback) | yes for kind/GKE | URL workers should call back on when they receive a NATS command notification. Embedded verbatim in the server_url field of each NATS message. When unset, a http://localhost:<port> fallback is used; this won't work cross-pod in kind or GKE, so the deployment manifest must override. |
| NOETL_DISABLE_METRICS | false | no | If true, the /metrics route is not registered. Use only when running behind a separate metrics-export sidecar. |
| NOETL_AUTO_RECREATE_RUNTIME | true | no | If true, the server re-registers the noetl.runtime row for itself at startup if missing. Set false in environments where runtime rows are managed externally (operator). |
| NOETL_RUNTIME_SWEEP_INTERVAL | 30 | no | Seconds between runtime-pool offline-detection sweeps. Larger value = workers take longer to be marked offline after a crash. Default is appropriate for production. |
| NOETL_RUNTIME_OFFLINE_SECONDS | (see code) | no | Heartbeat staleness threshold before a runtime row is marked offline. Increase only when worker heartbeats are unreliable (poor network). |
| NOETL_SERVER_MACHINE_ID | (derived from HOSTNAME) | per-replica in prod | 10-bit machine id (0–1023) for the application-side snowflake generator. Each pod in a deployment MUST set a distinct value to avoid id collisions. See Snowflake ID generation. Added in Phase F R1.5 (noetl/server#42). |
| NOETL_SHARD_INDEX | 0 | per-replica when sharded | Phase F R2 (noetl/server#44). Shard index 0..N-1 this replica owns. Single-replica / pre-sharding deployments leave it unset (defaults to 0 with NOETL_SHARD_COUNT=1 — no enforcement). When NOETL_SHARD_COUNT > 1 each replica MUST set a distinct value. Startup validates shard_index < shard_count and panics otherwise (fail fast on config bug rather than silently mis-route). See Sharding above + sharding-design for the cross-cluster routing context. |
| NOETL_SHARD_COUNT | 1 | cluster-wide when sharded | Phase F R2. Total shard count for the cluster. 1 (the default) disables sharding — every replica owns every execution_id and ShardConfig::owns short-circuits to true. Every replica MUST agree on this value or routing diverges. When operators raise it to N > 1 (Phase F R4 cutover), the NOETL_SHARD_INDEX per replica must be set in lockstep. |
| NOETL_SCHEMA | noetl | no | Postgres schema the server reads from / writes to. Override only when running multiple NoETL deployments in one DB. |
| NOETL_TLS_CERT | — | with NOETL_TLS_KEY | PEM server-certificate-chain path. Set together with NOETL_TLS_KEY to serve HTTPS instead of plain HTTP. Setting exactly one of cert/key is a fail-fast misconfiguration (startup errors). See Transport security. |
| NOETL_TLS_KEY | — | with NOETL_TLS_CERT | PEM private-key path for NOETL_TLS_CERT (PKCS#8 / PKCS#1 / SEC1). Never logged. |
| NOETL_TLS_CLIENT_CA | — | for mTLS | PEM CA-bundle path. When set (cert+key already on), the listener requires + verifies client certs (mTLS) against this CA. Absent → server-only TLS (any client). Enables the authenticated worker↔server credential channel (Secrets Wallet Phase 4). |
The keychain resolver fetches a provider:-backed credential alias from an
external secret manager on a credential-store miss (Secrets Wallet Phase 3,
noetl/ai-meta#61).
build_secret_provider(provider) selects the backend by the keychain entry's
provider id; each reads its config from the environment.
GCP Secret Manager (provider: gcp) — REST :access + ambient GKE
Workload-Identity token.
| Variable | Default | Required | Why |
|---|---|---|---|
| GOOGLE_CLOUD_PROJECT / GCP_PROJECT | — | when a gcp entry omits its project | Default project for a gcp secret ref that isn't a fully-qualified projects/.../secrets/... path. |
| NOETL_GCP_SM_ENDPOINT | https://secretmanager.googleapis.com/v1 | no | Secret Manager base URL. Override for a mock in tests. |
| NOETL_GCP_METADATA_TOKEN_URL | (GKE metadata server) | no | Workload-Identity token URL. Override for tests. |
Kubernetes Secrets (provider: k8s / kubernetes) — reads a Secret
object from the in-cluster API server with the pod's ServiceAccount token,
trusting the cluster CA. No cloud credentials; the only dependency is the API
server, so this is the one secret backend fully kind-validatable end-to-end. A
keychain entry references a value as [<namespace>/]<secret>/<key> (a bare
<secret> requires the Secret to hold exactly one data key).
| Variable | Default | Required | Why |
|---|---|---|---|
| NOETL_K8S_API_URL | https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT, else https://kubernetes.default.svc | no | API server URL. In-cluster the KUBERNETES_SERVICE_* vars are injected automatically; override only for an out-of-cluster / mock endpoint. |
| NOETL_K8S_NAMESPACE | projected <sa>/namespace, else default | no | Namespace used when a k8s ref omits its own (<secret>/<key> form). In-cluster the projected file supplies the pod's own namespace. |
| NOETL_K8S_CA_FILE | /var/run/secrets/kubernetes.io/serviceaccount/ca.crt | no | Cluster CA bundle, added as a trust root. Absent (e.g. an http:// mock) → system roots. |
| NOETL_K8S_TOKEN_FILE | /var/run/secrets/kubernetes.io/serviceaccount/token | no | ServiceAccount bearer token path; re-read per fetch so projected-token rotation is honored. |
| NOETL_K8S_TOKEN | — | no | Inline bearer token override (tests / mock API servers). Takes precedence over NOETL_K8S_TOKEN_FILE. |
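The k8s reference grammar [<namespace>/]<secret>/<key> (with the bare <secret> form) can be parsed along these lines. The type and function names are hypothetical, and the sketch only splits the string; resolving the namespace default and enforcing the "exactly one data key" rule for bare references would happen later in the provider.

```rust
/// Hypothetical parsed form of a k8s keychain reference.
#[derive(Debug, PartialEq, Eq)]
struct K8sSecretRef<'a> {
    /// None means "use NOETL_K8S_NAMESPACE or the pod's own namespace".
    namespace: Option<&'a str>,
    secret: &'a str,
    /// None means the Secret must hold exactly one data key.
    key: Option<&'a str>,
}

/// Splits `[<namespace>/]<secret>/<key>` or a bare `<secret>`.
fn parse_k8s_ref(reference: &str) -> Result<K8sSecretRef<'_>, String> {
    let parts: Vec<&str> = reference.split('/').collect();
    if parts.iter().any(|p| p.is_empty()) {
        return Err(format!("empty segment in k8s secret ref {reference:?}"));
    }
    match parts.as_slice() {
        [secret] => Ok(K8sSecretRef { namespace: None, secret: *secret, key: None }),
        [secret, key] => Ok(K8sSecretRef {
            namespace: None,
            secret: *secret,
            key: Some(*key),
        }),
        [namespace, secret, key] => Ok(K8sSecretRef {
            namespace: Some(*namespace),
            secret: *secret,
            key: Some(*key),
        }),
        _ => Err(format!("too many segments in k8s secret ref {reference:?}")),
    }
}
```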
RBAC. The server's ServiceAccount needs get (and list) on secrets in
each namespace it resolves from — the default ServiceAccount has none. The
deployment manifests in noetl/ops must bind a
Role granting secrets: [get, list] to the server SA before a provider: k8s
keychain entry can resolve. (Kind validation uses a manually-applied
noetl-server-secrets-reader Role/RoleBinding.)
HashiCorp Vault (provider: vault) — reads a KV v2 secret via the
Vault REST API (GET <addr>/v1/<mount>/data/<path>), authenticating with
X-Vault-Token. Like Kubernetes Secrets, Vault can run in-cluster, so this
backend is fully kind-validatable. A keychain entry references a value as
[<mount>/]<path>#<key> (a bare [<mount>/]<path> requires the secret to hold
exactly one key); the metadata.version is carried as the value version.
| Variable | Default | Required | Why |
|---|---|---|---|
| VAULT_ADDR | http://127.0.0.1:8200 | yes (real Vault) | Vault server address. In kind: http://vault.<ns>.svc:8200. |
| VAULT_TOKEN | — | one of token sources | Vault token for the X-Vault-Token header. A platform credential (the server's own auth to Vault), so env/Secret is acceptable per the secrets rule. Production should prefer the Vault Kubernetes auth method (SA JWT → Vault token) over a static token — a follow-up. |
| NOETL_VAULT_TOKEN_FILE | — | alt to VAULT_TOKEN | Token file path; re-read per fetch (rotating tokens). |
| VAULT_NAMESPACE | — | Enterprise only | Sent as X-Vault-Namespace. |
| NOETL_VAULT_KV_MOUNT | secret | no | Default KV v2 mount when a ref omits <mount>/. |
| NOETL_VAULT_CA_FILE | — | for https:// Vault | CA bundle, added as a trust root. |
AWS Secrets Manager (provider: aws / aws_sm) — JSON-over-POST against
https://secretsmanager.<region>.amazonaws.com/ action
secretsmanager.GetSecretValue, authenticated with hand-rolled AWS Signature
Version 4 signing (no aws-sdk dependency). A keychain entry references a
value as [<region>:]<secret-id>[#<json-key>] — bare <secret-id> returns the
entire SecretString; #<json-key> picks a key out of a JSON-encoded secret
(the common AWS convention for multi-field credentials). Cloud-only backend
(like GCP); kind validation is at the unit-test layer.
| Variable | Default | Required | Why |
|---|---|---|---|
| AWS_ACCESS_KEY_ID | — | yes | Access key id for SigV4 signing. In production this comes from EKS IRSA's temporary credentials (which set all three together). |
| AWS_SECRET_ACCESS_KEY | — | yes | Secret access key; never logged. |
| AWS_SESSION_TOKEN | — | with temp creds | Session token sent as X-Amz-Security-Token. Set by IRSA / aws sts assume-role; omit for long-lived IAM-user credentials. |
| AWS_REGION (or AWS_DEFAULT_REGION) | — | yes | Default region for the regional endpoint; a <region>: prefix on a single ref overrides this. |
| NOETL_AWS_SM_ENDPOINT | — | tests | Override the endpoint host (mock / VPC endpoint). Production uses the regional default. |
IRSA follow-up. This round consumes the static env triple — which IRSA
already populates in the pod via its injected credentials. A direct
web-identity-token → STS AssumeRoleWithWebIdentity exchange (no env-injection
step) is a clearly-scoped follow-up.
Azure Key Vault (provider: azure / azure_kv) — REST GET https://<vault>.vault.azure.net/secrets/<name>[/<version>]?api-version=7.4,
authenticating with an OAuth2 bearer token from the Azure Instance Metadata
Service (IMDS, used by Managed Identity on AKS / VMs) at
http://169.254.169.254/metadata/identity/oauth2/token. A keychain entry
references a value as [<vault>/]<secret-name>[#<version>] (a bare
<secret-name> uses the default vault; the .vault.azure.net suffix is
appended automatically). Cloud-only backend.
| Variable | Default | Required | Why |
|---|---|---|---|
| AZURE_KEYVAULT_VAULT | — | when refs omit <vault>/ | Default vault short name (e.g. prod-eu). The .vault.azure.net suffix is added by the provider. |
| AZURE_KEYVAULT_TOKEN | — | tests / mocks | Pre-fetched bearer token used in lieu of IMDS. Production should leave this unset and rely on Managed Identity. |
| NOETL_AZURE_KEYVAULT_DNS_SUFFIX | vault.azure.net | sovereign clouds | DNS suffix for the vault host (vault.azure.cn China, vault.usgovcloudapi.net US gov). |
| NOETL_AZURE_KEYVAULT_API_VERSION | 7.4 | no | Key Vault REST API version. |
| NOETL_AZURE_IMDS_TOKEN_URL | http://169.254.169.254/metadata/identity/oauth2/token | tests | Override the IMDS endpoint for mocks. |
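As a sketch of how an Azure reference of the form [<vault>/]<secret-name>[#<version>] turns into the request URL described above: the function name and error strings are hypothetical, and the arguments stand in for AZURE_KEYVAULT_VAULT, NOETL_AZURE_KEYVAULT_DNS_SUFFIX, and NOETL_AZURE_KEYVAULT_API_VERSION.

```rust
/// Hypothetical URL builder for an Azure Key Vault secret reference.
fn azure_secret_url(
    reference: &str,
    default_vault: Option<&str>,
    dns_suffix: &str,
    api_version: &str,
) -> Result<String, String> {
    // Optional "#<version>" suffix.
    let (path, version) = match reference.split_once('#') {
        Some((p, v)) => (p, Some(v)),
        None => (reference, None),
    };
    // Optional "<vault>/" prefix; otherwise fall back to the default vault.
    let (vault, name) = match path.split_once('/') {
        Some((v, n)) => (v, n),
        None => (
            default_vault.ok_or_else(|| {
                format!("ref {reference:?} omits the vault and no default vault is set")
            })?,
            path,
        ),
    };
    if vault.is_empty() || name.is_empty() || version == Some("") {
        return Err(format!("malformed Azure Key Vault ref {reference:?}"));
    }

    let mut url = format!("https://{vault}.{dns_suffix}/secrets/{name}");
    if let Some(v) = version {
        url.push('/');
        url.push_str(v);
    }
    url.push_str("?api-version=");
    url.push_str(api_version);
    Ok(url)
}
```

Passing vault.azure.cn or vault.usgovcloudapi.net as dns_suffix covers the sovereign-cloud cases listed in the table.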
AAD client-credentials follow-up. This round supports IMDS (Managed Identity) + a pre-fetched token. The AAD client-credentials flow (tenant id + client id + client secret) is a follow-up — needed only when a workload is deployed somewhere IMDS isn't available.
Secrets Wallet Phase 6a (residency-aware distributed resolution) plumbs the keychain entry's home region through the resolver into the provider, so each fetch hits the right regional endpoint / vault / cluster instead of the server's default.
- KeychainDef.region — optional field on each keychain entry (region: us-east-1, region: europe-west4). When set, it propagates into the [SecretRef::region] the provider receives.
- AWS consumes it as the regional endpoint host (secretsmanager.<region>.amazonaws.com) — same shape as the existing <region>: ref-prefix override; the prefix wins when both are set.
- Azure / Vault use it for vault / cluster routing (a per-region vault or Vault cluster lives behind a region-shaped DNS name).
- GCP includes it in the resource id where the project is region-bound.
When KeychainDef.region is unset, the resolver falls back to
NOETL_SERVER_REGION (server-side env, see below). When that is also unset
(legacy mode), the provider falls back to its own default (e.g. AWS_REGION)
and the metric records region="-".
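The fallback order described above can be sketched as a pair of small functions. The names are hypothetical, and treating an empty string as "unset" is an assumption based on NOETL_SERVER_REGION defaulting to empty.

```rust
/// Treats missing, empty, or whitespace-only values as unset (assumption).
fn non_empty(value: Option<&str>) -> Option<&str> {
    value.map(str::trim).filter(|s| !s.is_empty())
}

/// Entry-level region first, then the server's home region.
/// None means legacy mode: the provider uses its own default (e.g. AWS_REGION).
fn effective_region<'a>(
    entry_region: Option<&'a str>,
    server_region: Option<&'a str>,
) -> Option<&'a str> {
    non_empty(entry_region).or_else(|| non_empty(server_region))
}

/// Metric label value: "-" when no region was resolved.
fn region_label(region: Option<&str>) -> &str {
    region.unwrap_or("-")
}
```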
Per agents/rules/observability.md Principle 1 every resolution increments
noetl_secret_resolve_total{provider, region, status} — status is ok /
provider_fetch_error / template_error. Region is a low-cardinality label
(operators deploy into low-tens of regions); resolution latency lives on the
matching secret.resolve tracing span (with execution_id per Principle 4),
not on the metric.
| Variable | Default | Required | Why |
|---|---|---|---|
| NOETL_SERVER_REGION | (empty) | yes for residency-aware deployments | Server's home region (e.g. us-east-1). Used as the fallback when a keychain entry doesn't declare its own region:. Phase 6c (residency enforcement) will additionally compare this against an entry's region to fail-closed on cross-region fetches when residency: strict. |
| NOETL_SECRET_PROVIDER_TTL_SECONDS | 0 (no TTL) | no | Phase 6b — TTL for the [ProviderRegistry] cache of (provider_id, region) → Arc<dyn SecretProvider>. When set, an entry older than the TTL is rebuilt on next access. Useful as an operator escape hatch before Phase 6d's dynamic-secret refresh path lands (short-lived AWS STS / Azure IMDS creds expire faster than the registry's process-lifetime default). Unset or 0 ⇒ cache for process lifetime. |
The resolver's cache-miss path (resolve_keychain_entry) used to call
build_secret_provider(provider) on every fetch — re-reading env vars,
rebuilding the reqwest::Client (TLS bundle reparse on the rustls path),
reparsing IMDS / token state. Phase 6b adds a server-side
ProviderRegistry keyed by (provider_id, region) so the per-region
instance is built once and reused.
Two new metrics per agents/rules/observability.md Principle 1:
- noetl_secret_provider_build_total{provider, region, status} — counter. status is cache_hit / ok / error. Together with the Phase 6a noetl_secret_resolve_total this answers two operator questions: "is the cache effective?" (cache_hit / (ok + cache_hit) ratio) and "is a region's provider down?" (error per-region rate).
- noetl_secret_resolve_duration_seconds{provider, region} — histogram of resolve wall-clock latency, bucketed [5 ms, 10 ms, 25 ms, 50 ms, 100 ms, 250 ms, 500 ms, 1 s, 2 s, 5 s] to span the range where cloud secret managers and Vault clusters actually live. Observed regardless of outcome so a dashboard surfaces "everything's slow" + "everything's failing" independently (timeouts dominate failure-mode wall-clock).
execution_id is NOT a label on either — it lives on the matching
secret.resolve tracing span per Principle 4.
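A registry with the described semantics (keyed by (provider_id, region), optional TTL, 0 meaning process lifetime) might look like the sketch below. It is not the server's ProviderRegistry: the type is generic over the cached value, the clock is passed in to keep the example deterministic, and the boolean return value stands in for the cache_hit / ok status label.

```rust
use std::collections::HashMap;
use std::sync::Arc;
use std::time::{Duration, Instant};

/// Hypothetical registry; P stands in for a built secret provider.
struct Registry<P> {
    ttl: Option<Duration>,
    entries: HashMap<(String, String), (Instant, Arc<P>)>,
}

impl<P> Registry<P> {
    /// ttl_seconds == 0 means "cache for the process lifetime".
    fn new(ttl_seconds: u64) -> Self {
        let ttl = (ttl_seconds > 0).then(|| Duration::from_secs(ttl_seconds));
        Self { ttl, entries: HashMap::new() }
    }

    /// Returns the provider and `true` when it was served from the cache.
    fn get_or_build(
        &mut self,
        provider_id: &str,
        region: &str,
        now: Instant,
        build: impl FnOnce() -> P,
    ) -> (Arc<P>, bool) {
        let key = (provider_id.to_string(), region.to_string());
        if let Some((built_at, provider)) = self.entries.get(&key) {
            let fresh = match self.ttl {
                None => true,
                Some(ttl) => now.saturating_duration_since(*built_at) < ttl,
            };
            if fresh {
                return (Arc::clone(provider), true);
            }
        }
        // Miss or expired entry: build once and remember the build time.
        let provider = Arc::new(build());
        self.entries.insert(key, (now, Arc::clone(&provider)));
        (provider, false)
    }
}

/// cache_hit / (ok + cache_hit); None when nothing has been recorded yet.
fn cache_hit_ratio(cache_hit: u64, ok: u64) -> Option<f64> {
    let total = cache_hit + ok;
    (total > 0).then(|| cache_hit as f64 / total as f64)
}
```

A real implementation would also need interior locking for concurrent handlers and an error path for failed builds; both are left out here to keep the TTL logic visible.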
Secrets Wallet Phase 4a (noetl/ai-meta#61,
noetl/server#103) — the
transport half of sealed secret delivery. The control-plane API serves
plain HTTP by default (unchanged); the listener opts in to TLS purely
through the three NOETL_TLS_* env vars above:
| NOETL_TLS_CERT | NOETL_TLS_KEY | NOETL_TLS_CLIENT_CA | Mode |
|---|---|---|---|
| unset | unset | — | Plain HTTP (default) |
| set | set | unset | HTTPS, any client |
| set | set | set | mTLS — client cert required + verified against the CA |
| exactly one of cert/key | | — | Startup error (fail fast) |
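The mode table can be read as a single match over which of the three variables are set. The enum and function names below are hypothetical; the row where the client CA is set without cert/key follows the table (plain HTTP), which is an interpretation of the "—" cell.

```rust
#[derive(Debug, PartialEq, Eq)]
enum ListenerMode {
    PlainHttp,
    Https,
    MutualTls,
}

/// Maps the presence of NOETL_TLS_CERT / NOETL_TLS_KEY / NOETL_TLS_CLIENT_CA
/// to a listener mode, erroring when exactly one of cert/key is set.
fn listener_mode(
    cert: Option<&str>,
    key: Option<&str>,
    client_ca: Option<&str>,
) -> Result<ListenerMode, String> {
    match (cert, key, client_ca) {
        (None, None, _) => Ok(ListenerMode::PlainHttp),
        (Some(_), Some(_), None) => Ok(ListenerMode::Https),
        (Some(_), Some(_), Some(_)) => Ok(ListenerMode::MutualTls),
        _ => Err("NOETL_TLS_CERT and NOETL_TLS_KEY must be set together".to_string()),
    }
}
```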
TLS is built on the ring rustls provider the rest of the stack already
uses (no aws-lc-rs clash); axum-server bind_rustls serves the encrypted
listener with graceful shutdown. mTLS uses a WebPkiClientVerifier over the
NOETL_TLS_CLIENT_CA bundle.
The cert/key/CA are mounted into the pod from a K8s Secret; the manifests
that wire the volume + env live in noetl/ops.
mTLS is what authenticates + encrypts the worker→server credential fetch
(GET /api/credentials/<alias>) so the resolved secret no longer travels
plaintext on the wire. The worker-side mTLS client (ControlPlaneClient)
is Phase 4b; encrypting the payload to the worker's key is Phase 5.
Probe interaction: see the mTLS caveat — with a client
CA set, switch K8s probes to tcpSocket.
Secrets Wallet Phase 5 (noetl/ai-meta#61)
adds defense-in-depth on top of the Phase-4 mTLS transport: mTLS encrypts the
wire, sealing encrypts the payload to a key only the recipient worker
holds. The cleartext exists only briefly inside the server process at seal
time; an operator with kubectl exec on the server pod sees only ciphertext
in the response body.
- Phase 5a (noetl/server#107, v2.32.0) — src/crypto/sealed.rs primitives.
- Phase 5b (noetl/server#108) — wire format + endpoint (this section).
- Phase 5c — worker side (ephemeral keypair + unseal + zeroize).
Worker registration: the worker opts in by including a base64-encoded
32-byte X25519 public key in the runtime JSONB blob of POST /api/worker/pool/register:
{
"name": "worker-rust-pool",
"kind": "worker_pool",
"status": "ready",
"runtime": {
"worker_public_key": "<base64-32-byte-x25519-pub>"
}
}
No schema migration is needed — the runtime column already accepts
arbitrary metadata. Workers that don't send a key keep working unchanged
(only the sealing endpoint reads the field).
Endpoint: GET /api/credentials/{identifier}/sealed?worker_id=<name>
Returns a SealedEnvelope JSON addressed to the worker named by
worker_id (matched against the kind=worker_pool row in
noetl.runtime):
{
"alg": "x25519-hkdf-sha256-chacha20-poly1305",
"v": 1,
"eph_pub": "<32 bytes b64>",
"ciphertext": "<n+16 bytes b64>"
}
The plaintext that the AEAD ciphertext encrypts is the same JSON
CredentialResponse shape GET /api/credentials/{identifier} returns with
include_data=true. The worker recovers it by running X25519 ECDH against
its long-lived secret + the envelope's eph_pub, deriving the AEAD key + a
12-byte nonce via HKDF-SHA256 with info "noetl-sealed-v1", and ChaCha20-
Poly1305 decrypting with AAD <alg>|v=<v>.
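Before any cryptography runs, a receiver can check the envelope's shape and build the associated data. The sketch below covers only those non-cryptographic steps with the standard library; the struct and function names are hypothetical, the fields are assumed to be base64-decoded already, and the ECDH, HKDF, and AEAD steps themselves are out of scope here.

```rust
const ALG: &str = "x25519-hkdf-sha256-chacha20-poly1305";
const X25519_KEY_LEN: usize = 32;
const POLY1305_TAG_LEN: usize = 16;

/// Envelope fields after base64-decoding `eph_pub` and `ciphertext`.
struct DecodedEnvelope {
    alg: String,
    v: u32,
    eph_pub: Vec<u8>,
    ciphertext: Vec<u8>,
}

/// AAD in the documented `<alg>|v=<v>` form.
fn associated_data(alg: &str, v: u32) -> Vec<u8> {
    format!("{alg}|v={v}").into_bytes()
}

/// Structural checks; returns the plaintext length (ciphertext minus the tag).
fn expected_plaintext_len(env: &DecodedEnvelope) -> Result<usize, String> {
    if env.alg != ALG {
        return Err(format!("unsupported alg {:?}", env.alg));
    }
    if env.v != 1 {
        return Err(format!("unsupported envelope version {}", env.v));
    }
    if env.eph_pub.len() != X25519_KEY_LEN {
        return Err(format!("eph_pub must be 32 bytes, got {}", env.eph_pub.len()));
    }
    env.ciphertext
        .len()
        .checked_sub(POLY1305_TAG_LEN)
        .ok_or_else(|| "ciphertext shorter than the 16-byte tag".to_string())
}
```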
400 BadRequest is returned when the worker_pool row exists but didn't
register a worker_public_key (or the row is missing entirely) — the error
message points at the runtime-row condition so the operator can fix
registration.
Observability: noetl_credentials_sealed_total{status} counter where
status ∈ {ok, no_pubkey, worker_not_found, seal_error,
credential_error}; pairs with a credential.seal span that carries
worker_id, identifier, and (when provided) execution_id.
| Variable | Default | Required | Why |
|---|---|---|---|
| DATABASE_URL | — | when set, overrides individual POSTGRES_* | Full Postgres connection URL. Standard ecosystem env var; honored as an override for cases where the connection comes from a managed Postgres service that emits a URL. |
| POSTGRES_HOST | localhost | yes | Postgres host. In kind: postgres.noetl.svc; in GKE: the Cloud SQL Proxy sidecar address. |
| POSTGRES_PORT | 5432 | no | Postgres port. |
| POSTGRES_USER | noetl | yes | Postgres user; needs SELECT/INSERT/UPDATE/DELETE on noetl.* plus EXECUTE on noetl.snowflake_id() (the latter is the DB-side fallback per observability.md). |
| POSTGRES_PASSWORD | (empty) | yes for non-trust auth | Postgres password. Should come from a K8s Secret, not the Deployment env directly. |
| POSTGRES_DATABASE | noetl | no | Postgres database name. |
| POSTGRES_MAX_CONNECTIONS | 10 | no | sqlx pool max. Phase F R4 sharding will partition this across shards; until then, the math is replicas × 10 ≤ Postgres max_connections − headroom. See agents/rules/data-access-boundary.md. |
| POSTGRES_MIN_CONNECTIONS | 1 | no | sqlx pool min — keep at least one connection warm. |
| POSTGRES_ACQUIRE_TIMEOUT | 30 | no | Seconds before a pool.acquire() call fails. Increase only when Postgres is known to be slow under load. |
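The pool-sizing rule for POSTGRES_MAX_CONNECTIONS (replicas × pool max ≤ Postgres max_connections − headroom) is simple arithmetic; a small sketch with hypothetical function names makes it checkable. The numbers in the usage note are illustrative, not recommendations.

```rust
/// Largest replica count that keeps replicas * pool_max within
/// pg_max_connections - headroom.
fn max_replicas(pg_max_connections: u32, headroom: u32, pool_max_per_replica: u32) -> u32 {
    if pool_max_per_replica == 0 {
        return 0;
    }
    pg_max_connections.saturating_sub(headroom) / pool_max_per_replica
}

/// True when the planned replica count respects the budget.
fn pool_budget_fits(
    replicas: u32,
    pool_max_per_replica: u32,
    pg_max_connections: u32,
    headroom: u32,
) -> bool {
    u64::from(replicas) * u64::from(pool_max_per_replica)
        <= u64::from(pg_max_connections.saturating_sub(headroom))
}
```

For example, with a Postgres max_connections of 100, a headroom of 20, and the default pool max of 10, the budget allows 8 replicas.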
| Variable | Default | Required | Why |
|---|---|---|---|
| NATS_URL | — | recommended | NATS server URL (nats://<host>:4222). Without it, the server runs in degraded mode and doesn't publish command notifications. Workers see no work; useful for read-only deployments. |
| NATS_USER | — | when NATS auth | NATS user. |
| NATS_PASSWORD | — | when NATS auth | NATS password. Should come from a K8s Secret. |
The NATS connection accepts a user:pass@ segment embedded in
NATS_URL; the NATS_USER / NATS_PASSWORD form is the
preferred shape per async_nats::ConnectOptions::with_user_and_password.
| Variable | Default | Required | Why |
|---|---|---|---|
| HOSTNAME | (set by container runtime) | — | Fallback source for the snowflake machine_id when NOETL_SERVER_MACHINE_ID is unset. Read at startup, hashed via FNV-1a to 10 bits. |
| COMPUTERNAME | (Windows) | — | Same fallback as HOSTNAME on Windows hosts (rare for production but useful for local dev). |
| RUST_LOG | (build default) | no | Standard tracing-subscriber filter. Default sufficient; set to noetl_server=debug,axum=info for targeted debugging. |
Secrets must NOT be passed via plain Deployment.spec.template.spec.containers[].env[].value.
| Secret | Storage | Mount as |
|---|---|---|
| NOETL_ENCRYPTION_KEY | K8s Secret (production) / sealed-secret (kind) | valueFrom.secretKeyRef |
| POSTGRES_PASSWORD | K8s Secret | valueFrom.secretKeyRef |
| NATS_PASSWORD | K8s Secret | valueFrom.secretKeyRef |
| NOETL_INTERNAL_API_TOKEN | K8s Secret | valueFrom.secretKeyRef |
In Cloud Build / GKE the K8s Secrets are projected from GCP Secret Manager via the standard CSI driver.
Per agents/rules/execution-model.md
"Secrets and credentials rule": business-logic secrets (third-
party API tokens, tenant database DSNs) belong in the NoETL
keychain, NOT in env vars. The env vars above are the
platform-runtime secrets — the server's own keys for talking
to its own infrastructure.
- Metrics: Prometheus surface at /metrics per agents/rules/observability.md. Cardinality discipline: execution_id is a span attribute, NOT a metric label (would blow up the registry).
- Tracing: tracing spans on every boundary; execution_id is a span field.
- Logs: structured JSON via tracing-subscriber's json layer. Default filter suppresses health endpoint noise per agents/rules/logging.md.
- Snowflake ID is logged once at startup with its derivation source (NOETL_SERVER_MACHINE_ID vs derived from HOSTNAME). Useful for confirming the per-replica configuration is correct.
When changing any of the above, validate per
agents/rules/deployment-validation.md:
- cargo build --release --bins
- cargo test --quiet
- Build image locally + load into kind: kind load docker-image …
- Apply the manifests against kubectl --context kind-noetl
- Smoke-test the changed surface:
  - For env-var changes: confirm the var lands in kubectl exec … env | grep <VAR> AND that the binary behaves as documented when the value is present, absent, or invalid.
  - For port / probe changes: confirm kubectl get svc shows the right port + kubectl describe pod shows the probes passing.
Only after kind passes does the change roll forward to Cloud Build + GKE.
- sharding-design — Phase F design that introduced NOETL_SERVER_MACHINE_ID.
- event-envelope — the wire format the server's HTTP API uses.
- runtime-shape — the 4-binary deployment blueprint (server + publisher + projector + system_pool).
- noetl/ops Helm chart + manifests — the deployment-time implementation.
- agents/rules/wiki-maintenance.md — the rule that requires this page to update in lockstep with code.