
Umbrella Secrets Wallet

Kadyapam edited this page Jun 10, 2026 · 36 revisions

Umbrella — Secrets Wallet (Rust)

Status: CLOSED 2026-06-07 — umbrella is feature-complete. Every named phase + every queued follow-up + every cloud-specific dynamic-secret provider has shipped. The platform-side wallet has nothing left to ship; future work would be new product surface (additional providers, additional residency-policy modes) rather than completing the original umbrella scope.

Downstream follow-ups (post-close, client-side):

  • 2026-06-10 — GUI credential View/Edit recovery for pre-wallet records (noetl/gui#36; closes noetl/ai-meta#82). A consequence of Phase 1's forward-only storage (no legacy single-master-key path): credentials written before the wallet migration can't be decrypted by the new KEK, so GET /api/credentials/{id}?include_data=true returns 500 Decryption failed: aead::Error. The server is behaving correctly — this is a client fix. The noetl-gui credential page now (a) explains the cause on View instead of showing a generic toast, and (b) on Edit reopens the modal with the list-row metadata (name/type/description/tags), a warning banner, and an empty-but-required data field, so re-entering the secret and saving re-seals the record under the current wallet. Re-registering is the supported recovery path for a pre-wallet record. The response shape is unchanged; View/Edit of wallet-era credentials is untouched.

Feature inventory (all shipped):

  • 1 Envelope encryption — credentials + keychain store per-record-DEK-wrapped-by-KEK self-describing blobs, fail-closed key (v2.19.8 → v2.21.0).
  • 2 GCP Cloud KMS KeyManager — the KEK can leave the process (v2.22.0; runtime NOETL_KMS_PROVIDER).
  • 3 Secret resolution via the auth:/keychain path — auth: "{{ alias }}" against a provider:-backed keychain entry resolves on a credential-store miss, masked at the response boundary; standalone leak-prone secrets tool was removed (tools v2.19.2). R1–R3b across v2.23.0 → v2.26.0; Phase 3c keychain caching (v2.27.0).
  • 3.x providers (5) — GCP Secret Manager (v2.23.0), Kubernetes Secrets (v2.28.0), HashiCorp Vault KV v2 (v2.29.0), AWS Secrets Manager (v2.31.0), Azure Key Vault (v2.31.0).
  • 4 Transport mTLS — 4a server opt-in TLS/mTLS listener (v2.30.0); 4b worker mTLS client (worker v5.12.0); 4c cert-manager mTLS overlay (ops@37d4d6c); 4d Helm values-gated mTLS for GKE (ops@0fc0dc8).
  • 5 Sealed payload delivery — 5a crypto primitives (v2.32.0 — X25519 ECDH + HKDF-SHA256 + ChaCha20-Poly1305); 5b wire format + /sealed endpoint (v2.33.0); 5c worker integration (worker v5.13.0 — long-lived X25519 keypair, zeroize on cleartext).
  • 6 Residency-aware distributed resolution — 6a region tag + per-region routing (v2.34.0); 6b ProviderRegistry + per-(provider, region) metrics (v2.35.0); 6c residency-policy gate (v2.36.0); 6d primitives — SecretValue.expires_at + cache_decision honouring issuer TTL (v2.37.0); 6e cross-region broker (v2.38.0).
  • 6d cloud-specific dynamic-secret providers (3) — 6d.1 AWS STS AssumeRoleWithWebIdentity (v2.45.0, server#137); 6d.2 GCP iamcredentials.generateAccessToken (v2.47.0, server#138); 6d.3 Azure AAD client-credentials (v2.46.0, server#139).
  • 7 Rotation + audit + auto-renewal — 7a KEK rotation primitives (v2.39.0); 7a.2 rotation endpoint + key-status + DB scans (v2.42.0); 7b secret-resolution audit service (v2.40.0); 7b.2 noetl.secret_audit table + DbAuditSink + GET endpoint (v2.43.0); 7c should_refresh decision primitive (v2.41.0); 7c.2 KeychainService::should_refresh cache-side companion (v2.43.0); 7c.3 resolver-side stampede mutex + background re-resolve (v2.44.0).

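Final landing — three cloud-provider rounds (2026-06-07, server v2.45.0 → v2.47.0):

  • 6d.1 AWS STS AssumeRoleWithWebIdentity (v2.45.0, server#137); 6d.2 GCP iamcredentials.generateAccessToken (v2.47.0, server#138); 6d.3 Azure AAD client-credentials (v2.46.0, server#139).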

Latest landings (2026-06-07, server v2.42.0 → v2.44.0):

  • 7c.3 — resolver-side stampede mutex + background re-resolve (server#136, closed server#135; v2.44.0): wires the Phase-7c decision primitive + the Phase-7c.2 cache-side companion into the resolver's cache-hit path. Cached value returns IMMEDIATELY to the caller (worker fetches stay on the fast path); a background tokio::spawn re-resolves via the Phase-3b SecretProvider + updates the cache. Stampede collapse via new src/services/keychain_refresh.rs RefreshInflight — N workers crossing the refresh threshold for the same (catalog_id, alias) collapse to one provider call; concurrent callers piggy-back via noetl_secret_refresh_total{outcome="stampede_collapsed"}. Refactor: extracted resolve_via_provider from try_resolve_keychain so cache-miss inline + background refresh share identical code (no behavior drift). Phase 7c series is now wire-complete (7c primitive + 7c.2 cache companion + 7c.3 resolver integration).

  • 7a.2 — KEK rotation endpoint + key-status + DB scans (server#127, closed server#126; v2.42.0): operator-facing wrap of the Phase-7a rewrap_storage_string primitive. POST /api/internal/wallet/rotate-kek?batch_size=&max_batches=&table= runs a batched cursor scan across noetl.credential + noetl.keychain, returns RotateSummary { processed, rewrapped, skipped, failed, last_id } for progress checkpointing across runs. GET /api/internal/wallet/key-status reports per-version row counts so an operator can confirm completion before retiring the old KEK version. Plaintext NEVER reconstructed.

  • 7b.2 — noetl.secret_audit table + DbAuditSink + GET endpoint (server#129, closed server#128; v2.43.0): durable storage path for the Phase-7b service. CREATE TABLE IF NOT EXISTS at server startup (server-owned, no out-of-band migration step); DbAuditSink impl + new GET /api/internal/secret-audit?credential=&execution_id=&from=&to=&limit= (bounded; hard cap 10_000).

  • 7c.2 — KeychainService::should_refresh cache-side primitive (server#131, closed server#130; v2.43.0): reads the cache row's expires_at, asks secrets::dynamic::should_refresh_default, bumps noetl_secret_refresh_total{outcome="triggered"} on a true return. Resolver-side wire-up (stampede mutex + background re-resolve) deferred to Phase 7c.3.
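The stampede collapse described for 7c.3 can be illustrated with a small in-flight registry. This is a minimal std-only sketch, not the contents of src/services/keychain_refresh.rs: the real code runs on tokio, and the method name `try_begin`, the `RefreshGuard` type, and the `i64` catalog id are assumptions made for illustration.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};

/// Tracks which (catalog_id, alias) keys currently have a refresh running.
/// Hypothetical shape; only the name `RefreshInflight` comes from the text.
#[derive(Clone, Default)]
pub struct RefreshInflight {
    keys: Arc<Mutex<HashSet<(i64, String)>>>,
}

/// Held by the single caller that owns the refresh; releases the key on drop.
pub struct RefreshGuard {
    keys: Arc<Mutex<HashSet<(i64, String)>>>,
    key: (i64, String),
}

impl RefreshInflight {
    /// Returns Some(guard) for the first caller of a key and None for every
    /// caller that arrives while that refresh is still in flight. A None is
    /// the case the text counts as outcome="stampede_collapsed".
    pub fn try_begin(&self, catalog_id: i64, alias: &str) -> Option<RefreshGuard> {
        let key = (catalog_id, alias.to_string());
        let mut set = self.keys.lock().unwrap();
        if set.insert(key.clone()) {
            Some(RefreshGuard { keys: Arc::clone(&self.keys), key })
        } else {
            None
        }
    }
}

impl Drop for RefreshGuard {
    fn drop(&mut self) {
        // Free the key so a later refresh for the same alias can start.
        if let Ok(mut set) = self.keys.lock() {
            set.remove(&self.key);
        }
    }
}
```

In this shape the caller that receives the guard would spawn the background re-resolve and keep the guard alive until the cache is updated; everyone else returns the cached value immediately.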

Remaining work: none. The platform-side wallet is feature-complete; the three cloud-specific dynamic-secret providers shipped this session as Phase 6d.1 / 6d.2 / 6d.3. Future work (additional providers, additional residency-policy modes, kind-validation against real cloud test rigs) would be new product surface tracked under a fresh umbrella, not a continuation of this one.

Earlier landings (Phases 1–4c):

  • 1 envelope encryption (v2.21.0).
  • 2 GCP Cloud KMS for the KEK (v2.22.0).
  • 3 secret resolution via the auth:/keychain path (the standalone leak-prone secrets tool was removed, tools v2.19.2): server-side GCP SM client (v2.23.0) → keychain-def model (v2.24.0) → resolver logic (v2.25.0) → R3b wiring (v2.26.0, server#89). auth: "{{ alias }}" against a provider: gcp keychain entry resolves from GCP Secret Manager on a credential miss, end-to-end kind-validated.
  • Phase 3c keychain caching done (v2.27.0, server#91).
  • Providers 3.x — Kubernetes Secrets landed (v2.28.0, server#97, closed server#96): a provider: k8s keychain alias resolves from an in-cluster Secret via the API server + ServiceAccount token + cluster CA — the first backend kind-validated end-to-end with a real value (GCP needs GKE's metadata server). Reference shape [<namespace>/]<secret>/<key>; config from NOETL_K8S_* env; requires secrets: [get, list] RBAC on the server SA (ops follow-up).
  • HashiCorp Vault provider landed (v2.29.0, server#101; closed server#100): a provider: vault keychain alias resolves from a Vault KV v2 secret (X-Vault-Token; ref [<mount>/]<path>#<key>), kind-validated end-to-end against an in-cluster Vault — the second backend validatable on kind.
  • Phase 4a (transport mTLS) — server opt-in TLS/mTLS listener: landed v2.30.0 (server#103, closed server#102): NOETL_TLS_CERT+NOETL_TLS_KEY ⇒ HTTPS, +NOETL_TLS_CLIENT_CA ⇒ mTLS (ring rustls provider, axum-server bind_rustls); curl with a client cert → 200, without → TLS-rejected, plain HTTP → refused.
  • Phase 4b (transport mTLS) — worker mTLS client: landed v5.12.0 (worker#56, closed worker#55): the worker presents a client cert (NOETL_TLS_CLIENT_CERT/KEY + NOETL_TLS_CA); cross-repo kind-val ran a hello_world playbook to COMPLETED over https+mTLS (worker registered, 0 heartbeat failures).
  • Phase 4c (transport mTLS) — cert-manager mTLS overlay: merged (ops@37d4d6c) (ops#163, closed ops#162): ci/manifests/noetl/tls/ issues the server+worker certs in-cluster via cert-manager + patches the rust deployments; fixes the two findings (server probes → tcpSocket, worker init → mTLS curl). Declaratively kind-validated (cert-manager v1.16.2, zero manual cert gen) — a hello_world playbook COMPLETED over full mTLS. Phase 4 (transport mTLS) is now functionally complete across server + worker + ops. (Phases 4/5 reordered: transport mTLS first, payload sealing second.)

Next: Helm/GKE mTLS-default flip (follow-up); AWS SM / Azure KV providers; sealed payload (5), residency (6), rotation+audit (7).

Tracking issue: noetl/ai-meta#61

Scope: noetl/server, noetl/worker, noetl/tools, noetl/ops. Rust only. Do not touch Python (repos/noetl).

Codified: 2026-06-05 from the standing instruction: "we need to create a true wallet … secrets in postgres unencrypted won't pass any security validation … keep keychain unencrypted … pass credentials to workers unencrypted … add Azure secret manager, all token types, Kubernetes secrets … design how to handle secret references in a very distributed environment where tasks run in different regional / cloud / data-center zones."


1. Current state (grounded survey, 2026-06-05)

Secrets are AES-256-GCM encrypted at rest today — but the way the key is managed makes it fail any real security review.

| Area | Today | File |
| --- | --- | --- |
| Cipher | AES-256-GCM, random 96-bit nonce prepended, 16-byte tag | server/src/crypto/encryption.rs |
| Key source | single static key from NOETL_ENCRYPTION_KEY; falls back to a hardcoded all-zeros default with only a WARN | server/src/main.rs:26,375 |
| noetl.credential.data_encrypted | TEXT, base64-armored ciphertext (server#71) | server/src/db/{models,queries}/credential.rs |
| noetl.keychain.data | BYTEA, raw AES-GCM ciphertext | server/src/db/{models,queries}/keychain.rs |
| Key rotation / versioning | none — one key, no key_version column | |
| Envelope encryption (DEK/KEK) | none — every record under the one key | |
| KMS integration | none — env var only | |
| Worker transit | plaintext credential JSON over plain HTTP (GET /api/credentials/{alias}?include_data=true); no mTLS | worker/src/client/control_plane.rs:376, server/src/main.rs listener |
| Worker memory | secrets held in HashMap<String,String>, no zeroization | worker/src/executor/auth_alias.rs |
| External secret providers | env only; GCP / AWS / Azure / Vault / K8s all return "not implemented" | tools/src/tools/secrets.rs |
| Audit | log/event sanitization exists, but no credential-access audit table | server/src/sanitize.rs |

Threat model gaps (why it fails validation):

  1. Key custody. An all-zeros default key + an env-var key with no KMS means the encryption key is recoverable by anyone with pod/env access, and identical across every deployment that didn't set it. Effectively "obfuscated, not encrypted."
  2. Blast radius. One key encrypts every secret; compromise = full wallet compromise; no rotation to recover.
  3. In transit + on worker. Plaintext over HTTP, plaintext in worker RAM, no mTLS — a network or memory observer reads every secret.
  4. No residency control. Secrets can be resolved/transited anywhere; no region/cloud boundary enforcement for a distributed fleet.
  5. Provider lock-in. Only env; no path to the secret managers real deployments use (GCP SM, AWS SM, Azure KV, Vault, K8s).

2. Goals & non-goals

Goals

  • G1 — No recoverable plaintext at rest: envelope encryption, DEK per record, KEK in an external KMS; fail-closed if no real key manager is configured (kill the all-zeros default).
  • G2 — Both noetl.credential and noetl.keychain use the same wallet primitives.
  • G3 — Secrets never plaintext in transit or at rest on the worker: sealed delivery (per-worker ephemeral key) + mTLS transport.
  • G4 — Pluggable KMS providers (KEK): GCP KMS, AWS KMS, Azure Key Vault keys, HashiCorp Vault Transit, + a loudly-insecure local dev one.
  • G5 — Pluggable secret providers (external references): GCP Secret Manager, AWS Secrets Manager, Azure Key Vault, HashiCorp Vault, Kubernetes Secrets, env (dev).
  • G6 — A uniform secret-reference model (secret://…) usable from v10 playbooks, with version + residency + field selectors.
  • G7 — Distributed / multi-cloud / multi-region: resolve secrets region-locally, honor data-residency, prefer short-lived dynamic secrets; never cross a residency boundary in plaintext.
  • G8 — Key rotation without downtime (key versioning per record) + an append-only secret-access audit.
  • G9 — All token types: static/opaque, structured (DSN/basic), OAuth2/OIDC with refresh, cloud workload-identity / STS short-lived, mTLS keypairs, SSH keys, API keys.

Non-goals (for now)

  • Touching the Python server/keychain (Rust-only deployment is the target).
  • A full HSM/FIPS module integration (KMS gives the managed-key property; HSM-backed KMS keys are a config choice, not new code).
  • Per-field client-side encryption in the browser (gateway/SPA stays as is).

3. Core abstractions

3.1 Secret reference (secret://)

Playbooks reference secrets by opaque URI, never inline. The resolver parses the URI into {provider, locator, version?, field?, residency?}.

secret://wallet/<alias>[@<version>]                  # NoETL-managed wallet (keychain)
secret://gcp-sm/<project>/<name>[@<version>]         # GCP Secret Manager
secret://aws-sm/<region>/<name>                      # AWS Secrets Manager
secret://azure-kv/<vault>/<name>[@<version>]         # Azure Key Vault
secret://vault/<mount>/<path>#<field>                # HashiCorp Vault (KV / dynamic)
secret://k8s/<namespace>/<name>#<key>                # Kubernetes Secret
secret://env/<VAR>                                   # dev only, gated

Back-compat: today's auth: <alias> / credential: <alias> map to secret://wallet/<alias>.
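To make the reference grammar concrete, here is a minimal sketch of a parser for the URI shapes listed above. The names `SecretRef`, `ParseError`, and `parse_secret_ref` are hypothetical, and the residency selector is left out because the text does not give its URI syntax.

```rust
/// Parsed form of a `secret://` reference (illustrative field names).
#[derive(Debug, PartialEq, Eq)]
pub struct SecretRef {
    pub provider: String,
    pub locator: String,
    pub version: Option<String>,
    pub field: Option<String>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum ParseError {
    MissingScheme,
    UnknownProvider(String),
    EmptyLocator,
}

/// Provider identifiers taken from the URI list in section 3.1.
const PROVIDERS: [&str; 7] = ["wallet", "gcp-sm", "aws-sm", "azure-kv", "vault", "k8s", "env"];

pub fn parse_secret_ref(uri: &str) -> Result<SecretRef, ParseError> {
    let rest = uri.strip_prefix("secret://").ok_or(ParseError::MissingScheme)?;
    // The first path segment names the provider; the remainder is provider-specific.
    let (provider, tail) = rest.split_once('/').ok_or(ParseError::EmptyLocator)?;
    if !PROVIDERS.contains(&provider) {
        return Err(ParseError::UnknownProvider(provider.to_string()));
    }
    // `#<field>` selects one key inside a structured secret.
    let (tail, field) = match tail.split_once('#') {
        Some((head, f)) => (head, Some(f.to_string())),
        None => (tail, None),
    };
    // `@<version>` pins a specific version.
    let (locator, version) = match tail.rsplit_once('@') {
        Some((head, v)) => (head, Some(v.to_string())),
        None => (tail, None),
    };
    if locator.is_empty() {
        return Err(ParseError::EmptyLocator);
    }
    Ok(SecretRef {
        provider: provider.to_string(),
        locator: locator.to_string(),
        version,
        field,
    })
}
```

Under this sketch, the back-compat rule would amount to rewriting a bare alias to `secret://wallet/<alias>` before parsing.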

3.2 SecretMaterial (typed result)

use std::collections::BTreeMap;
use chrono::{DateTime, Utc};

enum SecretMaterial {
    Opaque(SecretString),                       // password / api_key / token
    Structured(BTreeMap<String, SecretString>), // postgres DSN parts, basic auth
    OAuth2 { access: SecretString, refresh: Option<SecretString>, expires_at: Option<DateTime<Utc>> },
    CloudIdentity { token: SecretString, expires_at: DateTime<Utc> }, // STS / GCP access token (short-lived)
    Keypair { cert: Vec<u8>, key: SecretBytes }, // mTLS / SSH
}

SecretString / SecretBytes wrap zeroize::Zeroizing — overwritten on drop, never Debug/Serialize in the clear.
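The two properties named here (wipe on drop, no cleartext in Debug) can be shown with a std-only stand-in. This is a sketch under the assumption that the real type delegates to the zeroize crate; the `new` and `expose` methods and the redaction text are hypothetical.

```rust
use std::fmt;

/// Std-only stand-in for a zeroize-backed secret wrapper (illustrative).
pub struct SecretString(Vec<u8>);

impl SecretString {
    pub fn new(value: &str) -> Self {
        SecretString(value.as_bytes().to_vec())
    }

    /// Explicit, greppable access to the cleartext.
    pub fn expose(&self) -> &str {
        // The bytes always come from a &str, so this conversion cannot fail.
        std::str::from_utf8(&self.0).unwrap_or("")
    }
}

impl fmt::Debug for SecretString {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Never print the value, even in {:?} logs or panic messages.
        f.write_str("SecretString(<redacted>)")
    }
}

impl Drop for SecretString {
    fn drop(&mut self) {
        for byte in self.0.iter_mut() {
            // Volatile write so the compiler does not elide the wipe.
            unsafe { std::ptr::write_volatile(byte, 0) };
        }
    }
}
```

Leaving out a Serialize impl entirely is the simplest way to keep such a type out of JSON responses and event payloads.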

3.3 KeyManager (KEK — wraps/unwraps DEKs)

#[async_trait]
trait KeyManager: Send + Sync {
    async fn wrap_dek(&self, key_ref: &KekRef, dek: &[u8]) -> Result<WrappedDek>;   // KMS Encrypt
    async fn unwrap_dek(&self, wrapped: &WrappedDek) -> Result<SecretBytes>;         // KMS Decrypt
    async fn current_version(&self, key_ref: &KekRef) -> Result<KeyVersion>;
}

Impls: GcpKms, AwsKms, AzureKeyVaultKeys, VaultTransit, LocalDevKms (file-backed, prints a loud insecure-mode warning and is refused when NOETL_ENV=production).

3.4 SecretProvider (external secret managers)

#[async_trait]
trait SecretProvider: Send + Sync {
    async fn fetch(&self, loc: &SecretLocator) -> Result<SecretMaterial>;
    fn supports_dynamic(&self) -> bool { false } // Vault dynamic DB creds, STS, etc.
}

Impls: GcpSecretManager, AwsSecretsManager, AzureKeyVault, HashiCorpVault, KubernetesSecrets, Env (dev). Authentication to each provider uses ambient workload identity where available (GKE WI, AWS IRSA, Azure Workload Identity, K8s ServiceAccount token, Vault K8s auth) — per execution-model.md "already-in-place trust" rule — so no bootstrap secret is itself stored in the wallet.

4. Envelope encryption

Each record gets its own DEK; the KEK lives in the KMS. The wrapped DEK and the key metadata are stored beside the ciphertext.

write(secret):
  dek         = random 32 bytes
  ciphertext  = AES-256-GCM(dek, nonce, plaintext)        # as today
  wrapped_dek = KMS.wrap_dek(kek_ref, dek)                # KMS Encrypt
  store { ciphertext, nonce, wrapped_dek, kek_provider, kek_key_id,
          kek_key_version, enc_alg = "AES-256-GCM", enc_version }

read(record):
  dek        = KMS.unwrap_dek(record.wrapped_dek)         # KMS Decrypt (region-local)
  plaintext  = AES-256-GCM_decrypt(dek, record.nonce, record.ciphertext)
  zeroize(dek)

Rotation: rotating the KEK only re-wraps DEKs (cheap, no record re-encryption). Rotating a DEK re-encrypts that one record. enc_version + kek_key_version make rotation incremental and auditable. A background re-wrap job walks records on the old KEK version and re-wraps to the new one.
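The rotation rule above (re-wrap the DEK, leave the ciphertext alone) can be sketched as follows. Everything here is illustrative: `ToyKeyManager` uses single-byte XOR purely as a placeholder for the KMS wrap/unwrap calls and is NOT encryption, and the `Envelope` and `RewrapOutcome` shapes are assumptions rather than the server's types.

```rust
/// Subset of the stored envelope fields relevant to rotation.
#[derive(Debug, Clone, PartialEq)]
pub struct Envelope {
    pub ciphertext: Vec<u8>,
    pub wrapped_dek: Vec<u8>,
    pub kek_key_version: u32,
}

#[derive(Debug, PartialEq)]
pub enum RewrapOutcome {
    Skipped,
    Rewrapped { old_key_version: u32, new_key_version: u32 },
}

/// Toy key manager: one byte per KEK version, XOR as a stand-in for KMS
/// Encrypt/Decrypt. Placeholder only, never use as real cryptography.
pub struct ToyKeyManager {
    keks: Vec<u8>,
}

impl ToyKeyManager {
    pub fn new(first_kek: u8) -> Self {
        ToyKeyManager { keks: vec![first_kek] }
    }
    pub fn rotate(&mut self, new_kek: u8) {
        self.keks.push(new_kek);
    }
    pub fn current_version(&self) -> u32 {
        self.keks.len() as u32
    }
    pub fn wrap_dek(&self, dek: &[u8]) -> Vec<u8> {
        let kek = self.keks[self.keks.len() - 1];
        dek.iter().map(|b| b ^ kek).collect()
    }
    pub fn unwrap_dek(&self, wrapped: &[u8], version: u32) -> Option<Vec<u8>> {
        // Versions are 1-based; unknown versions fail instead of guessing.
        let kek = *self.keks.get((version as usize).checked_sub(1)?)?;
        Some(wrapped.iter().map(|b| b ^ kek).collect())
    }
}

/// Re-wrap one record's DEK under the current KEK version.
/// The payload ciphertext is never read or modified.
pub fn rewrap(km: &ToyKeyManager, record: &mut Envelope) -> Option<RewrapOutcome> {
    let current = km.current_version();
    if record.kek_key_version == current {
        return Some(RewrapOutcome::Skipped);
    }
    let old = record.kek_key_version;
    let dek = km.unwrap_dek(&record.wrapped_dek, old)?;
    record.wrapped_dek = km.wrap_dek(&dek);
    record.kek_key_version = current;
    Some(RewrapOutcome::Rewrapped { old_key_version: old, new_key_version: current })
}
```

A background job would call `rewrap` over each record still on an old `kek_key_version`; records already on the current version come back as `Skipped`, which is what makes the job safe to re-run.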

Migration off the static key: a one-shot job reads each existing record with the legacy static key, generates a DEK, envelope-encrypts, writes the new columns. The legacy column is dropped once enc_version is uniform. The all-zeros default is removed — startup fails closed if no KMS/key manager is configured (except explicit dev mode).

5. Distributed / multi-cloud / multi-region model

The load-bearing part. Worker pools run in different regions/clouds/DCs; secrets must resolve locally and honor residency.

  1. Residency policy on the SecretRef. Optional residency=<region|cloud|"in-region">. The control plane refuses to resolve or transit a secret outside its residency boundary.

  2. Region-local secret brokers. Resolution is a system-pool playbook (system/secret_resolve, per data-access-boundary.md) running in the worker's own region, with KMS + provider endpoints local to that region. The dispatching server routes the resolve to the broker in the target region (shard/region routing reuses Phase F's shard map). Plaintext DEKs and secrets never leave the region.

  3. KMS topology. Each region/cloud has its own KMS (GCP KMS us-central1, AWS KMS eu-west-1, Azure KV westeurope…). Two options for a wallet record needed in multiple regions:

    • (a) Multi-region KEK (GCP multi-region keys, AWS multi-Region keys, Vault replicated transit) — one wrapped DEK valid in every region. Simplest; pick where the KMS supports it.
    • (b) Per-region wrap — store N wrapped DEKs, one per region's KEK; re-wrap on region add. Use where the KMS is single-region. The record carries the list of {region, kek_ref, wrapped_dek}; the broker picks its own region's entry.
  4. Prefer short-lived dynamic secrets. Where the provider supports it (Vault dynamic DB creds, cloud STS / workload-identity tokens, GCP IAM access tokens), resolve a short-TTL secret scoped to the execution at dispatch — auto-expiring, nothing long-lived stored or transited. This is the strongest posture for a distributed fleet.

  5. Sealed delivery to the worker (defense in depth over mTLS).

    • Worker generates an ephemeral X25519 keypair at startup (rotated periodically); registers the public key via noetl.runtime (worker registration).
    • The broker/server resolves the secret then seals it to the worker's ephemeral public key (libsodium sealed box / HPKE).
    • The worker unseals with its ephemeral private key, uses it, zeroizes. The sealed blob is useless to a MITM, to the event log, or to a co-tenant — independent of TLS.
    • Plus mTLS (SPIFFE/SPIRE or cert-manager issued certs) for the transport channel.
  6. Keychain = the execution-scoped resolved-secret / token cache (see §5a). Cached entries are envelope-encrypted (same wallet primitives), execution_id-scoped, lineage-inheritable, TTL-bounded, region-local, and never replicated across a residency boundary.
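Item 1 (refuse to resolve outside the residency boundary) reduces to a small pure decision function that runs before any provider call. The sketch below borrows the policy modes none / advisory / strict and the allowlist from the Phase 6c description, but the signature is hypothetical, the `Allow` variant drops the label the real gate carries, and treating an entry with no region tag and no allowlist as unconstrained is an assumption.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Policy {
    None,
    Advisory,
    Strict,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Decision {
    Allow,
    AllowWithViolationLogged,
    Deny,
}

/// Decide whether this server may resolve an entry, before any provider call.
pub fn evaluate(
    policy: Policy,
    entry_region: Option<&str>,
    allowed_regions: &[&str],
    server_region: &str,
) -> Decision {
    if policy == Policy::None {
        return Decision::Allow;
    }
    // Assumption: an entry with no region tag and no allowlist is unconstrained.
    let unconstrained = entry_region.is_none() && allowed_regions.is_empty();
    // An empty server region never matches anything, including "" in the allowlist.
    let in_region = !server_region.is_empty()
        && (entry_region == Some(server_region) || allowed_regions.contains(&server_region));
    if unconstrained || in_region {
        return Decision::Allow;
    }
    match policy {
        Policy::Advisory => Decision::AllowWithViolationLogged,
        Policy::Strict => Decision::Deny,
        Policy::None => Decision::Allow,
    }
}
```

Keeping the gate free of I/O means the deny path can return before any KMS or provider endpoint is contacted, so a refused request never touches the secret.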

5a. Keychain — execution-scoped cache + sub-playbook inheritance

noetl.keychain is not a second credential store — the wallet (noetl.credential + external providers) is the source of truth. The keychain is the per-execution-instance cache of resolved secrets and minted tokens for a running playbook:

  • When a step resolves a secret://… (wallet or external provider) or mints an OAuth / STS / cloud-access token, the resolved material is cached in the keychain keyed by (name, execution_id, scope), envelope-encrypted (same DEK/KEK primitives — the cache is not a plaintext hole), with expires_at + auto_renew.
  • Later steps in the same execution read the cache instead of re-resolving — one provider call / one OAuth refresh per execution, not per step — and a single renewer keeps a shared token fresh (no thundering-herd refresh).

Scope semantics (the existing scope_type column, made precise):

| scope | visible to | use |
| --- | --- | --- |
| local | the one execution_id only | per-execution secrets that must not leak to children |
| shared | the execution lineage (this execution + its sub-playbook descendants) | the default for inherited creds/tokens |
| global | all executions for the catalog entry | long-lived shared service tokens |

Sub-playbook inheritance. A kind: playbook step starts a child execution with its own execution_id and a parent_execution_id link (already recorded; the worker threads parent_execution_id on get_credential, see worker/src/client/control_plane.rs). A keychain lookup for a child resolves by walking the lineage chain (execution_id → parent_execution_id → … → root) and returns the nearest shared / global entry:

  • A token the parent resolved / refreshed is inherited by its sub-playbooks — no redundant provider call, no duplicate OAuth refresh, one refresh authority per token across the whole execution tree.
  • local-scope entries stay private to their execution (not inherited) — the isolation knob for secrets a sub-playbook must not see.
  • Server-side the resolver walks noetl.execution's parent links; inheritance is a server concern (workers never see the chain, only the sealed result), keeping it consistent with the data-access boundary.

Distributed caveat. Inheritance is lineage + region-local: a sub-playbook dispatched to a different region re-resolves in that region rather than inheriting plaintext across a residency boundary — residency wins over cache reuse. The cached blob is sealed/at-rest-encrypted in its origin region only.
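The lineage walk and the scope rules can be sketched together. This is a minimal in-memory illustration, not the server's resolver: the `Entry` struct, the `lookup` signature, and the `u64` execution ids are hypothetical, values are plain strings rather than sealed blobs, a single catalog entry is assumed for the global fallback, and the region check from the distributed caveat is omitted.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Scope {
    Local,
    Shared,
    Global,
}

pub struct Entry {
    pub name: String,
    pub execution_id: u64,
    pub scope: Scope,
    pub value: String,
}

/// Walk execution_id -> parent_execution_id -> ... -> root and return the
/// nearest visible entry. `parents` maps a child execution to its parent.
pub fn lookup<'a>(
    entries: &'a [Entry],
    parents: &HashMap<u64, u64>,
    execution_id: u64,
    name: &str,
) -> Option<&'a Entry> {
    let mut current = Some(execution_id);
    let mut hops = 0;
    while let Some(id) = current {
        let hit = entries.iter().find(|e| {
            e.name == name
                && e.execution_id == id
                // local entries are visible only to the execution that owns them
                && (id == execution_id || e.scope != Scope::Local)
        });
        if hit.is_some() {
            return hit;
        }
        hops += 1;
        if hops > parents.len() {
            break; // guard against a malformed, cyclic parent chain
        }
        current = parents.get(&id).copied();
    }
    // global entries are visible outside the lineage as well
    entries.iter().find(|e| e.name == name && e.scope == Scope::Global)
}
```

Because the nearest hit wins, a child that re-resolves a token under its own execution_id shadows the parent's copy without disturbing it.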

playbook step: auth: secret://wallet/pg_eu  (residency=eu)
        │
        ▼  (control plane routes to EU broker; refuses non-EU)
  EU secret broker (system pool, EU region)
        │  KMS.unwrap_dek (EU KMS)  +  AES-GCM decrypt   (plaintext stays in EU)
        ▼  seal(secret, worker_eu.ephemeral_pubkey)
  EU worker  ── mTLS ──▶ receives sealed blob ──▶ unseal ──▶ use ──▶ zeroize

6. Data model + API changes (Rust-only, new migrations)

  • noetl.credential + noetl.keychain: add wrapped_dek BYTEA, kek_provider TEXT, kek_key_id TEXT, kek_key_version TEXT, enc_alg TEXT, enc_version SMALLINT, residency TEXT NULL, and a wrap_regions JSONB NULL (per-region wrapped DEKs for option (b) in §5, item 3).
  • noetl.keychain: keep scope_type (local/shared/global) + execution_id + expires_at + auto_renew; inheritance walks noetl.execution.parent_execution_id (already recorded for sub-playbook child executions — add/confirm the column + an index on it). A keychain GET for a child resolves (name, scope) by walking the lineage chain and returning the nearest shared/global hit.
  • noetl.runtime: add ephemeral_pubkey BYTEA, pubkey_expires_at.
  • New noetl.secret_audit (append-only): id, ts, principal, alias, provider, region, execution_id, action, outcome.
  • POST /api/credentials / keychain write: envelope-encrypt (gen DEK → KMS.wrap → store).
  • GET /api/credentials/{alias}?seal_to=<worker_pubkey>&execution_id=…: returns a sealed blob (not plaintext). The legacy plaintext path is retained only behind admin RBAC + audit for break-glass/dev.
  • New internal POST /api/internal/secrets/resolve (system-pool/broker, residency-aware) per data-access-boundary.md.
  • noetl-tools secrets / secret_manager tool dispatches to the SecretProvider registry and understands secret://….

7. Provider matrix

| Provider | KMS (KEK) | Secret manager | Workload-identity auth |
| --- | --- | --- | --- |
| GCP | Cloud KMS | Secret Manager | GKE Workload Identity |
| AWS | KMS | Secrets Manager | IRSA |
| Azure | Key Vault keys | Key Vault secrets (new) | Azure Workload Identity |
| HashiCorp Vault | Transit | KV v2 + dynamic | Vault K8s auth |
| Kubernetes | — (use cloud KMS) | Secrets (new) | ServiceAccount token |
| Local/dev | file (insecure, gated) | env | |

8. Phased plan (each phase = its own sub-issues + PRs, Rust-only)

Phase Status Deliverable Repos
0 This design + threat model + decisions sign-off ai-meta
1 ✅ v2.19.8–v2.21.0 Envelope-encryption core: KeyManager trait + LocalDevKms; self-describing storage blob (no migration) for both noetl.credential and noetl.keychain; fail-closed key (1a, server#75) + envelope core (1b, server#77) + live wiring (1c/1d, server#79) server
2 ✅ v2.22.0 KMS providers: GcpKms (server#81) — Cloud KMS :encrypt/:decrypt + Workload Identity; runtime NOETL_KMS_PROVIDER. AwsKms / AzureKeyVaultKeys / VaultTransit follow behind the same trait server
3 ✅ server v2.23.0–v2.26.0 Secret resolution via the auth:/keychain path (not a workflow tool — the standalone secrets tool was removed, tools v2.19.2, because it leaked the value into the data flow). Redesigned server-side: R1 GCP SM client (v2.23.0) → R2 keychain-def model provider/map + find_keychain (v2.24.0) → R3a resolve_keychain_entry + build_secret_provider (v2.25.0) → R3b wire into the get_credential cache-miss (v2.26.0, server#89). GCP SM live; AWS SM / Azure KV / Vault / K8s slot into the same SecretProvider trait server (+ tools)
3c ✅ server v2.27.0 Keychain as execution-scoped cache: resolved secrets/tokens envelope-encrypted with scope + TTL (avoid re-fetching the provider per step) + keychain storage-layer repair (server#91). Sub-playbook parent_execution_id inheritance is a later follow-up server
3.x ✅ all five providers landed GCP SM (Phase 3) · Kubernetes Secrets (v2.28.0, server#97) · HashiCorp Vault (v2.29.0, server#101) · AWS Secrets Manager (v2.31.0, server#105, hand-rolled SigV4, no aws-sdk dep tree) · Azure Key Vault (v2.31.0, server#105, IMDS Managed Identity) — all behind the one SecretProvider trait server
4 ✅ transport security (TLS / mTLS) 4a ✅ landed v2.30.0 — server opt-in TLS/mTLS listener (server#103, closed server#102). 4b ✅ landed v5.12.0 — worker ControlPlaneClient mTLS client (worker#56, closed worker#55). 4c ✅ merged (ops@37d4d6c) — cert-manager mTLS overlay for kind (ops#163, closed ops#162): ci/manifests/noetl/tls/. 4d ✅ merged (ops@0fc0dc8) — Helm chart values-gated mTLS for GKE (ops#165, closed ops#164): automation/helm/noetl/ exposes tls.* values; off-mode renders byte-identical to main; on-mode produces 2 Issuers + 3 Certificates + the server/worker mTLS env contract; kind-validated (cert-manager materialised the Secrets with the right keys). Phase 4 is now fully merged across all four rounds. Production GKE points tls.certManager.issuerRef at a ClusterIssuer backed by GCP CAS or SPIRE/SPIFFE server, worker, ops
5 ✅ sealed payload delivery 5a ✅ landed v2.32.0 — server-side crypto primitives (server#107, closed server#106): src/crypto/sealed.rs X25519 ECDH + HKDF-SHA256 + ChaCha20-Poly1305 sealed-box (nonce derived from the shared secret, AAD pins alg+v); 12 unit tests, lib 369/0. 5b ✅ landed v2.33.0 — wire format + sealing endpoint (server#109, closed server#108): workers opt in by including worker_public_key (b64 X25519 pubkey) in their register payload's runtime JSON blob (no schema migration); GET /api/credentials/{id}/sealed?worker_id=<name> returns a SealedEnvelope; 400 when the worker_pool row exists but didn't register a key. Kind-validated end-to-end (Python cryptography opens the envelope → recovers the bearer token + scope round-trip). noetl_credentials_sealed_total{status} counter + credential.seal span per observability.md. 5c ✅ landed v5.13.0 — worker integration (worker#58, closed worker#57): long-lived X25519 keypair generated once at startup, pubkey registered in the runtime JSON blob, get_sealed_credential calls /sealed endpoint, unseals via the same primitives (drift-guard test against server constants), zeroizes the cleartext after the auth-alias resolver consumes it. Env-gated (`NOETL_SEALED_CREDENTIALS=true 1
6 🚧 residency-aware distributed resolution 6a ✅ landed v2.34.0 — region tag on keychain entries + per-region routing (server#111, closed server#110): KeychainDef.region (no schema migration — lives in the existing JSON blob), SecretRef.region provider-agnostic, AWS provider consumes it with explicit precedence (<region>: ref prefix > field > legacy project overload > AWS_REGION env); NOETL_SERVER_REGION env + server_region() / effective_region() fallback helpers; noetl_secret_resolve_total{provider,region,status} counter per observability.md Principle 1. 5 new unit tests; lib 376/0. Lib-only — backward compatible. 6b ✅ landed v2.35.0ProviderRegistry + per-(provider, region) metrics (server#113, closed server#112): src/secrets/registry.rs ProviderRegistry keyed by (provider_id, region), RwLock-protected with double-checked locking on the build path so concurrent get_or_build for the same key only builds once. Optional TTL via NOETL_SECRET_PROVIDER_TTL_SECONDS env (default 0 = process lifetime). New noetl_secret_provider_build_total{provider,region,status="cache_hit|ok|error"} counter + noetl_secret_resolve_duration_seconds{provider,region} histogram (bucketed 5 ms – 5 s, observed regardless of outcome so dashboards surface "slow" + "failing" independently). 7 new unit tests; lib 383/0. Lib-only. 6c ✅ landed v2.36.0 — residency-policy gate (server#115, closed server#114): KeychainDef.residency enum (none|advisory|strict, default none) + KeychainDef.allowed_regions allowlist; src/secrets/residency.rs evaluate() returns Allow(label) / AllowWithViolationLogged / Deny(AppError::ResidencyViolation); resolver runs the gate at the top of resolve_keychain_entry BEFORE any provider call. AppError::ResidencyViolation { credential, entry_region, server_region } → HTTP 403 with clear "credential X is region-locked to Y; this server is in Z" message that NEVER includes the value. 
noetl_secret_residency_check_total{policy, decision} counter — strict + violation_blocked is alert-worthy, advisory + violation_allowed is migration-window signal. Defensive: empty string in allowlist never matches empty server region. 8 new unit tests; lib 391/0. Lib-only — no schema migration (residency + allowed_regions ride the existing JSON blob). 6d ✅ landed v2.37.0 (primitives) — dynamic-secret primitives + cache plumbing (server#117, closed server#116): SecretValue.expires_at: Option<DateTime<Utc>> field; src/secrets/dynamic.rs cache_decision() honors min(default_ttl, expires_at - now - safety_margin) and returns SkipCacheAlreadyExpired when the deadline is already past or inside the operator's safety margin; KEYCHAIN_CACHE_DYNAMIC_SAFETY_MARGIN_SECS env (default 60); resolve_keychain_entry_with_meta returns the bundle's earliest expires_at; CredentialService::resolve_via_provider consumes the helper. Two new metrics: noetl_secret_dynamic_ttl_seconds histogram (1m / 5m / 15m / 1h / 4h / 12h buckets, observed when issuer reports TTL) + noetl_secret_cache_skip_total{reason="already_expired"} counter. 7 new unit tests; lib 398/0. Backward compatible (providers without expires_at keep the 600 s default). Follow-ups (each its own sub-issue): 6d.1 AWS STS AssumeRoleWithWebIdentity provider · 6d.2 GCP iamcredentials.generateAccessToken · 6d.3 Azure AAD client-credentials. 
6e ✅ landed v2.38.0 — cross-region broker (server#119, closed server#118): BrokerRegistry (region → broker_url from NOETL_SECRET_BROKER_REGISTRY env, empty default = pre-6e fail-closed); POST /api/internal/cross-region/resolve peer endpoint validates expected_entry_region == server_region() (defensive against stale peer registries), resolves locally, seals via Phase-5a primitives to the requesting worker's pubkey; get_sealed handler falls back to broker on AppError::ResidencyViolation; KeychainDef.no_broker_fallback: bool per-credential opt-out for hard-isolation credentials; AppError::CrossRegionUnreachable → HTTP 502. Two new metrics: noetl_secret_broker_call_total{broker_region, outcome} counter + noetl_secret_broker_call_duration_seconds{broker_region} histogram (50ms – 5s buckets). 10 new unit tests; lib 410/0. Lib-only — opt-in via env, no schema migration. Phase 6 closes. Both residency shapes operational: hard isolation (residency: strict + no broker → fail-closed HTTP 403) + soft federation (residency: strict + broker registered → transparent cross-region routing). Covers G7 in full. server, ops
7 ✅ rotation + audit + auto-renewal

7a ✅ landed v2.39.0, KEK rotation primitives (server#121, closed server#120): KeyManager::current_key_version() trait accessor; EnvelopeCipher::rewrap_storage_string primitive (parse → if wrapped.key_version == current_key_version → Skipped; else unwrap with historical KEK version → re-wrap with current → return Rewrapped { old_key_version, new_key_version, new_storage_string }). Plaintext payload NEVER reconstructed: pure DEK re-wrap, AES-GCM ciphertext bytes stay byte-identical. noetl_wallet_rotate_total{table, status} counter (skipped|rewrapped|failed_unwrap|failed_wrap|parse_error; failed_unwrap alert-worthy). 4 new unit tests; lib 414/0.

7a.2 ⏳ rotation endpoint (POST /api/internal/wallet/rotate-kek) + DB scans over noetl.credential + noetl.keychain + diagnostic GET /api/internal/wallet/key-status.

7b ✅ landed v2.40.0 (primitives), secret-resolution audit service (server#123, closed server#122): services::secret_audit::AuditEvent struct (NEVER contains the secret value); Operation + Outcome bounded enums; AuditSink trait + NoopAuditSink default + SecretAuditService wrapper with record_async (fire-and-forget) + record_strict (await) + record (dispatches by strict-mode); NOETL_SECRET_AUDIT_REQUIRED env (default false; 1/true/TRUE/yes/YES enable strict); noetl_secret_audit_writes_total{operation, outcome, status} counter (failed_strict alert-worthy). 8 new unit tests; lib 422/0. Lib-only.

7b.2 ⏳ noetl.secret_audit table + DbAuditSink + GET /api/internal/secret-audit query endpoint + wire into the four credential surfaces.

7c ✅ landed v2.41.0, token auto-renewal primitives (server#125, closed server#124): secrets::dynamic::should_refresh(expires_at, refresh_window, now) decision primitive (true iff expires_at set + still valid + inside refresh window) + should_refresh_default reading env; KEYCHAIN_CACHE_REFRESH_WINDOW_SECS env (default 60). Two new metrics: noetl_secret_refresh_total{outcome} counter (triggered|succeeded|failed|stampede_collapsed; failed alert-worthy) + noetl_secret_refresh_duration_seconds histogram (50ms–5s buckets, observed regardless of outcome). 5 new unit tests; lib 427/0. Lib-only.

7c.2 ⏳ cache + resolver wire-up (KeychainService::should_refresh + on-cache-hit spawn-background-refresh path + per-(catalog_id, alias) tokio::sync::Mutex stampede collapse + refresh path records its own Phase-7b AuditEvent).

Phase 7 closes. Remaining queue: 7a.2 / 7b.2 / 7c.2 (.2 endpoint+DB rounds), 6d.1 / 6d.2 / 6d.3 (cloud-specific dynamic providers), all discrete follow-up sub-issues, each its own bounded round.

server
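
The two dynamic-secret decisions described in 6d and 7c (how long to cache, and when to refresh) can be illustrated with a minimal, std-only sketch. It is not the code in src/secrets/dynamic.rs: timestamps are plain Unix seconds rather than DateTime<Utc>, the CacheFor variant and both function signatures are assumptions made for illustration, and treating a deadline that sits exactly on the safety margin as already expired is one reading of "inside the operator's safety margin". Only the names cache_decision, SkipCacheAlreadyExpired, and should_refresh(expires_at, refresh_window, now) come from the notes above.

```rust
use std::time::Duration;

/// Outcome of the cache decision for a freshly resolved secret.
/// `CacheFor` is a hypothetical variant name; `SkipCacheAlreadyExpired` is from the notes.
#[derive(Debug, PartialEq, Eq)]
pub enum CacheDecision {
    /// Cache the value for this long.
    CacheFor(Duration),
    /// Deadline already past, or inside the safety margin: do not cache.
    SkipCacheAlreadyExpired,
}

/// Computes min(default_ttl, expires_at - now - safety_margin).
/// `expires_at` and `now` are Unix seconds (an assumption of this sketch).
pub fn cache_decision(
    default_ttl: Duration,
    expires_at: Option<u64>,
    now: u64,
    safety_margin: Duration,
) -> CacheDecision {
    match expires_at {
        // Providers that report no expiry keep the default TTL.
        None => CacheDecision::CacheFor(default_ttl),
        Some(deadline) => {
            // Saturating arithmetic: a past deadline or one inside the margin yields 0.
            let usable = deadline
                .saturating_sub(now)
                .saturating_sub(safety_margin.as_secs());
            if usable == 0 {
                CacheDecision::SkipCacheAlreadyExpired
            } else {
                CacheDecision::CacheFor(default_ttl.min(Duration::from_secs(usable)))
            }
        }
    }
}

/// True iff `expires_at` is set, still in the future, and within `refresh_window` of `now`.
pub fn should_refresh(expires_at: Option<u64>, refresh_window: Duration, now: u64) -> bool {
    match expires_at {
        Some(deadline) if deadline > now => deadline - now <= refresh_window.as_secs(),
        _ => false,
    }
}
```

With a 600 s default TTL and a 60 s safety margin, a token that expires 300 s from now would be cached for 240 s under this sketch, and one that expires 50 s from now would not be cached at all. With a 60 s refresh window, a cached token 30 s from expiry would trigger a refresh, while an already expired token would not (it has to be re-resolved instead).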

Phases 1–4 are the security-validation must-haves (managed keys, no plaintext to workers). 5–7 harden transit, distribution, and rotation. (Reordered from the original plan: transport mTLS is now Phase 4 and payload sealing Phase 5 — mTLS is the foundation, sealing is the defense-in-depth layer on top.)
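
As an illustration of the two Phase 6 residency shapes (hard isolation and soft federation), the routing decision for a residency: strict entry might look roughly like the sketch below. The Route enum, the route_strict_entry function, and the single entry_region parameter are hypothetical simplifications; the real handler works from AppError::ResidencyViolation, BrokerRegistry, and KeychainDef.no_broker_fallback, and the on-disk format of NOETL_SECRET_BROKER_REGISTRY is not assumed here (the registry is passed in as an already-parsed map).

```rust
use std::collections::HashMap;

/// Where a request for a `residency: strict` entry ends up (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum Route {
    /// Entry lives in this server's region: resolve locally.
    Local,
    /// Soft federation: forward to the peer broker at this URL.
    Broker(String),
    /// Hard isolation: fail closed (surfaced as HTTP 403 in the notes above).
    Forbidden,
}

/// `brokers` maps region -> broker_url; an empty map reproduces the
/// pre-6e fail-closed behaviour.
pub fn route_strict_entry(
    entry_region: &str,
    server_region: &str,
    brokers: &HashMap<String, String>,
    no_broker_fallback: bool,
) -> Route {
    // Defensive: an empty region never counts as a match.
    if !entry_region.is_empty() && entry_region == server_region {
        return Route::Local;
    }
    // Per-credential opt-out: never leave the region, even if a broker exists.
    if no_broker_fallback {
        return Route::Forbidden;
    }
    match brokers.get(entry_region) {
        Some(url) => Route::Broker(url.clone()),
        None => Route::Forbidden,
    }
}
```

In this sketch an unreachable broker is not modelled; per the notes above, that case surfaces separately as AppError::CrossRegionUnreachable (HTTP 502).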

9. Open decisions (need user input before Phase 1)

  1. Primary cloud / KMS first. Which KMS does Phase 2 implement first — GCP Cloud KMS, AWS KMS, Azure Key Vault, or Vault Transit? (Drives the reference implementation; others follow the same trait.)
  2. Multi-region key strategy. Multi-region KEK (5a) vs per-region wrap (5b) — depends on whether the chosen KMS offers multi-region keys.
  3. Sealed delivery vs mTLS-only. Do both (recommended, defense in depth) or start with mTLS-only and add sealing later?
  4. Residency requirements. Are there hard data-residency boundaries to enforce now (e.g., EU-only secrets), or is that future-proofing?
  5. Dynamic vs static secrets. How aggressively to push short-lived dynamic creds (Vault/STS) vs encrypted-at-rest static creds?
  6. Break-glass plaintext read. Keep an admin-RBAC + audited plaintext include_data=true path, or remove it entirely (sealed-only)?

10. Compliance mapping (why this passes review)

| Requirement (SOC2 / ISO 27001 / PCI-DSS) | Met by |
|---|---|
| Encryption at rest with managed keys | Phase 1–2 (KMS-backed envelope) |
| No hardcoded / static keys | Phase 1 (kill all-zeros default, fail-closed) |
| Key rotation | Phase 7 (versioned re-wrap) |
| Encryption in transit | Phase 4–5 (mTLS + sealed delivery) |
| Least privilege + access logging | Phase 6–7 (secret_audit, RBAC) |
| Data residency | Phase 6 (region brokers + residency policy) |
| Secret sprawl / external managers | Phase 3 (GCP/AWS/Azure/Vault/K8s) |

11. Related

