Umbrella Secrets Wallet
Status: CLOSED 2026-06-07 — umbrella is feature-complete. Every named phase + every queued follow-up + every cloud-specific dynamic-secret provider has shipped. The platform-side wallet has nothing left to ship; future work would be new product surface (additional providers, additional residency-policy modes) rather than completing the original umbrella scope.
Downstream follow-ups (post-close, client-side):
- 2026-06-10 — GUI credential View/Edit recovery for pre-wallet records (noetl/gui#36; closes noetl/ai-meta#82). A consequence of Phase 1's forward-only storage (no legacy single-master-key path): credentials written before the wallet migration can't be decrypted by the new KEK, so `GET /api/credentials/{id}?include_data=true` returns `500 Decryption failed: aead::Error`. The server is behaving correctly; this is a client fix. The noetl-gui credential page now (a) explains the cause on View instead of showing a generic toast, and (b) on Edit reopens the modal with the list-row metadata (name/type/description/tags) + a warning banner and an empty-but-required data field, so re-entering the secret and saving re-seals the record under the current wallet. This is the supported recovery path: a pre-wallet record must be re-registered. Response shape is unchanged; wallet-era credentials View/Edit untouched.
Feature inventory (all shipped):
- 1 Envelope encryption — credentials + keychain store per-record-DEK-wrapped-by-KEK self-describing blobs, fail-closed key (v2.19.8 → v2.21.0).
- 2 GCP Cloud KMS `KeyManager` — the KEK can leave the process (v2.22.0; runtime `NOETL_KMS_PROVIDER`).
- 3 Secret resolution via the `auth:`/keychain path — `auth: "{{ alias }}"` against a `provider:`-backed keychain entry resolves on a credential-store miss, masked at the response boundary; standalone leak-prone `secrets` tool was removed (tools v2.19.2). R1–R3b across v2.23.0 → v2.26.0; Phase 3c keychain caching (v2.27.0).
- 3.x providers (5) — GCP Secret Manager (v2.23.0), Kubernetes Secrets (v2.28.0), HashiCorp Vault KV v2 (v2.29.0), AWS Secrets Manager (v2.31.0), Azure Key Vault (v2.31.0).
- 4 Transport mTLS — 4a server opt-in TLS/mTLS listener (v2.30.0); 4b worker mTLS client (worker v5.12.0); 4c cert-manager mTLS overlay (ops@37d4d6c); 4d Helm values-gated mTLS for GKE (ops@0fc0dc8).
- 5 Sealed payload delivery — 5a crypto primitives (v2.32.0 — X25519 ECDH + HKDF-SHA256 + ChaCha20-Poly1305); 5b wire format + `/sealed` endpoint (v2.33.0); 5c worker integration (worker v5.13.0 — long-lived X25519 keypair, zeroize on cleartext).
- 6 Residency-aware distributed resolution — 6a region tag + per-region routing (v2.34.0); 6b ProviderRegistry + per-(provider, region) metrics (v2.35.0); 6c residency-policy gate (v2.36.0); 6d primitives — `SecretValue.expires_at` + `cache_decision` honouring issuer TTL (v2.37.0); 6e cross-region broker (v2.38.0).
- 6d cloud-specific dynamic-secret providers (3) — 6d.1 AWS STS `AssumeRoleWithWebIdentity` (v2.45.0, server#137); 6d.2 GCP `iamcredentials.generateAccessToken` (v2.47.0, server#138); 6d.3 Azure AAD client-credentials (v2.46.0, server#139).
- 7 Rotation + audit + auto-renewal — 7a KEK rotation primitives (v2.39.0); 7a.2 rotation endpoint + key-status + DB scans (v2.42.0); 7b secret-resolution audit service (v2.40.0); 7b.2 `noetl.secret_audit` table + DbAuditSink + GET endpoint (v2.43.0); 7c `should_refresh` decision primitive (v2.41.0); 7c.2 `KeychainService::should_refresh` cache-side companion (v2.43.0); 7c.3 resolver-side stampede mutex + background re-resolve (v2.44.0).
Final landing — three cloud-provider rounds (2026-06-07, server v2.45.0 → v2.47.0):
Latest landings (2026-06-07, server v2.42.0 → v2.44.0):
- 7c.3 — resolver-side stampede mutex + background re-resolve (server#136, closed server#135; v2.44.0): wires the Phase-7c decision primitive + the Phase-7c.2 cache-side companion into the resolver's cache-hit path. Cached value returns IMMEDIATELY to the caller (worker fetches stay on the fast path); a background `tokio::spawn` re-resolves via the Phase-3b SecretProvider + updates the cache. Stampede collapse via new `src/services/keychain_refresh.rs` `RefreshInflight` — N workers crossing the refresh threshold for the same `(catalog_id, alias)` collapse to one provider call; concurrent callers piggy-back via `noetl_secret_refresh_total{outcome="stampede_collapsed"}`. Refactor: extracted `resolve_via_provider` from `try_resolve_keychain` so cache-miss inline + background refresh share identical code (no behavior drift). Phase 7c series is now wire-complete (7c primitive + 7c.2 cache companion + 7c.3 resolver integration).
- 7a.2 — KEK rotation endpoint + key-status + DB scans (server#127, closed server#126; v2.42.0): operator-facing wrap of the Phase-7a `rewrap_storage_string` primitive. `POST /api/internal/wallet/rotate-kek?batch_size=&max_batches=&table=` runs a batched cursor scan across `noetl.credential` + `noetl.keychain`, returns `RotateSummary { processed, rewrapped, skipped, failed, last_id }` for progress checkpointing across runs. `GET /api/internal/wallet/key-status` reports per-version row counts so an operator can confirm completion before retiring the old KEK version. Plaintext NEVER reconstructed.
- 7b.2 — `noetl.secret_audit` table + DbAuditSink + GET endpoint (server#129, closed server#128; v2.43.0): durable storage path for the Phase-7b service. `CREATE TABLE IF NOT EXISTS` at server startup (server-owned, no out-of-band migration step); `DbAuditSink` impl + new `GET /api/internal/secret-audit?credential=&execution_id=&from=&to=&limit=` (bounded; hard cap 10_000).
- 7c.2 — `KeychainService::should_refresh` cache-side primitive (server#131, closed server#130; v2.43.0): reads the cache row's `expires_at`, asks `secrets::dynamic::should_refresh_default`, bumps `noetl_secret_refresh_total{outcome="triggered"}` on a true return. Resolver-side wire-up (stampede mutex + background re-resolve) deferred to Phase 7c.3.
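The stampede collapse described for 7c.3 can be illustrated with a small standard-library sketch. This is not the code in `src/services/keychain_refresh.rs`: the type below borrows the name `RefreshInflight` only for readability, and `try_begin`, `RefreshGuard`, and the `(i64, String)` key type are assumptions made for this example. The idea shown is that the first caller to cross the refresh threshold for a `(catalog_id, alias)` key claims the refresh, while every concurrent caller for the same key learns that one is already in flight.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};

/// Hypothetical key type: (catalog_id, alias).
type RefreshKey = (i64, String);

/// Tracks which keys currently have a background refresh running.
#[derive(Clone, Default)]
struct RefreshInflight {
    keys: Arc<Mutex<HashSet<RefreshKey>>>,
}

/// Held by the single caller that owns the refresh; dropping it releases the key.
struct RefreshGuard {
    keys: Arc<Mutex<HashSet<RefreshKey>>>,
    key: RefreshKey,
}

impl RefreshInflight {
    /// Returns `Some(guard)` for the first caller and `None` for callers
    /// that should piggy-back on the refresh already in flight.
    fn try_begin(&self, key: RefreshKey) -> Option<RefreshGuard> {
        let mut set = self.keys.lock().unwrap();
        if set.insert(key.clone()) {
            Some(RefreshGuard { keys: Arc::clone(&self.keys), key })
        } else {
            None
        }
    }
}

impl Drop for RefreshGuard {
    fn drop(&mut self) {
        // Release the key so a later refresh window can start a new refresh.
        if let Ok(mut set) = self.keys.lock() {
            set.remove(&self.key);
        }
    }
}
```

In a resolver, the caller that receives a guard would spawn the background re-resolve and keep the guard alive until the cache is updated; callers that receive `None` would return the cached value and bump the collapsed-outcome counter.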
Remaining work: none. The platform-side wallet is feature-complete; the three cloud-specific dynamic-secret providers shipped this session as Phase 6d.1 / 6d.2 / 6d.3. Future work (additional providers, additional residency-policy modes, kind-validation against real cloud test rigs) would be new product surface tracked under a fresh umbrella, not a continuation of this one.
1 envelope encryption (v2.21.0) · 2 GCP Cloud KMS for the KEK (v2.22.0) · 3
secret resolution via the auth:/keychain path (the standalone leak-prone
secrets tool was removed, tools v2.19.2): server-side GCP SM client (v2.23.0)
→ keychain-def model (v2.24.0) → resolver logic (v2.25.0) → R3b wiring
(v2.26.0, server#89) — auth: "{{ alias }}" against a provider: gcp
keychain entry resolves from GCP Secret Manager on a credential miss,
end-to-end kind-validated. Phase 3c keychain caching done (v2.27.0,
server#91). Providers 3.x — Kubernetes Secrets landed (v2.28.0,
server#97, closed server#96): a provider: k8s keychain alias resolves from an
in-cluster Secret via the API server + ServiceAccount token + cluster CA —
the first backend kind-validated end-to-end with a real value (GCP needs
GKE's metadata server). Reference shape [<namespace>/]<secret>/<key>; config
from NOETL_K8S_* env; requires secrets: [get, list] RBAC on the server SA
(ops follow-up). HashiCorp Vault provider landed (v2.29.0, server#101;
closed server#100): a provider: vault keychain alias resolves from a Vault
KV v2 secret (X-Vault-Token; ref [<mount>/]<path>#<key>), kind-validated
end-to-end against an in-cluster Vault — the second backend validatable on
kind. Phase 4a (transport mTLS) — server opt-in TLS/mTLS listener: landed
v2.30.0 (server#103,
closed server#102): NOETL_TLS_CERT+NOETL_TLS_KEY ⇒ HTTPS,
+NOETL_TLS_CLIENT_CA ⇒ mTLS (ring rustls provider, axum-server bind_rustls); curl with a client cert → 200, without → TLS-rejected, plain
HTTP → refused. Phase 4b (transport mTLS) — worker mTLS client: landed v5.12.0
(worker#56, closed
worker#55): the worker presents a client cert (NOETL_TLS_CLIENT_CERT/KEY +
NOETL_TLS_CA); cross-repo kind-val ran a hello_world playbook to COMPLETED
over https+mTLS (worker registered, 0 heartbeat failures). Phase 4c
(transport mTLS) — cert-manager mTLS overlay: merged (ops@37d4d6c)
(ops#163, closed ops#162):
ci/manifests/noetl/tls/ issues the server+worker certs in-cluster via
cert-manager + patches the rust deployments; fixes the two findings (server
probes → tcpSocket, worker init → mTLS curl). Declaratively kind-validated
(cert-manager v1.16.2, zero manual cert gen) — a hello_world playbook
COMPLETED over full mTLS. Phase 4 (transport mTLS) is now functionally
complete across server + worker + ops. (Phases 4/5 reordered: transport mTLS
first, payload sealing second.) Next: Helm/GKE mTLS-default flip (follow-up);
AWS SM / Azure KV providers; sealed payload (5), residency (6),
rotation+audit (7).
Tracking issue: noetl/ai-meta#61
Scope: noetl/server, noetl/worker, noetl/tools, noetl/ops —
Rust only. Do not touch Python (repos/noetl).
Codified: 2026-06-05 from the standing instruction: "we need to
create a true wallet … secrets in postgres unencrypted won't pass any
security validation … keep keychain unencrypted … pass credentials to
workers unencrypted … add Azure secret manager, all token types,
Kubernetes secrets … design how to handle secret references in a very
distributed environment where tasks run in different regional / cloud /
data-center zones."
Secrets are AES-256-GCM encrypted at rest today — but the way the key is managed makes it fail any real security review.
| Area | Today | File |
|---|---|---|
| Cipher | AES-256-GCM, random 96-bit nonce prepended, 16-byte tag | `server/src/crypto/encryption.rs` |
| Key source | single static key from `NOETL_ENCRYPTION_KEY`; falls back to a hardcoded all-zeros default with only a WARN | `server/src/main.rs:26,375` |
| `noetl.credential.data_encrypted` | `TEXT`, base64-armored ciphertext (server#71) | `server/src/db/{models,queries}/credential.rs` |
| `noetl.keychain.data` | `BYTEA`, raw AES-GCM ciphertext | `server/src/db/{models,queries}/keychain.rs` |
| Key rotation / versioning | none — one key, no `key_version` column | — |
| Envelope encryption (DEK/KEK) | none — every record under the one key | — |
| KMS integration | none — env var only | — |
| Worker transit | plaintext credential JSON over plain HTTP (`GET /api/credentials/{alias}?include_data=true`); no mTLS | `worker/src/client/control_plane.rs:376`, `server/src/main.rs` listener |
| Worker memory | secrets held in `HashMap<String,String>`, no zeroization | `worker/src/executor/auth_alias.rs` |
| External secret providers | env only; GCP / AWS / Azure / Vault / K8s all return "not implemented" | `tools/src/tools/secrets.rs` |
| Audit | log/event sanitization exists, but no credential-access audit table | `server/src/sanitize.rs` |
Threat model gaps (why it fails validation):
- Key custody. An all-zeros default key + an env-var key with no KMS means the encryption key is recoverable by anyone with pod/env access, and identical across every deployment that didn't set it. Effectively "obfuscated, not encrypted."
- Blast radius. One key encrypts every secret; compromise = full wallet compromise; no rotation to recover.
- In transit + on worker. Plaintext over HTTP, plaintext in worker RAM, no mTLS — a network or memory observer reads every secret.
- No residency control. Secrets can be resolved/transited anywhere; no region/cloud boundary enforcement for a distributed fleet.
- Provider lock-in. Only env; no path to the secret managers real deployments use (GCP SM, AWS SM, Azure KV, Vault, K8s).
Goals
- G1 — No recoverable plaintext at rest: envelope encryption, DEK per record, KEK in an external KMS; fail-closed if no real key manager is configured (kill the all-zeros default).
- G2 — Both `noetl.credential` and `noetl.keychain` use the same wallet primitives.
- G3 — Secrets never plaintext in transit or at rest on the worker: sealed delivery (per-worker ephemeral key) + mTLS transport.
- G4 — Pluggable KMS providers (KEK): GCP KMS, AWS KMS, Azure Key Vault keys, HashiCorp Vault Transit, + a loudly-insecure local dev one.
- G5 — Pluggable secret providers (external references): GCP Secret Manager, AWS Secrets Manager, Azure Key Vault, HashiCorp Vault, Kubernetes Secrets, env (dev).
- G6 — A uniform secret-reference model (`secret://…`) usable from v10 playbooks, with version + residency + field selectors.
- G7 — Distributed / multi-cloud / multi-region: resolve secrets region-locally, honor data-residency, prefer short-lived dynamic secrets; never cross a residency boundary in plaintext.
- G8 — Key rotation without downtime (key versioning per record) + an append-only secret-access audit.
- G9 — All token types: static/opaque, structured (DSN/basic), OAuth2/OIDC with refresh, cloud workload-identity / STS short-lived, mTLS keypairs, SSH keys, API keys.
Non-goals (for now)
- Touching the Python server/keychain (Rust-only deployment is the target).
- A full HSM/FIPS module integration (KMS gives the managed-key property; HSM-backed KMS keys are a config choice, not new code).
- Per-field client-side encryption in the browser (gateway/SPA stays as is).
Playbooks reference secrets by opaque URI, never inline. The resolver
parses the URI into {provider, locator, version?, field?, residency?}.
```
secret://wallet/<alias>[@<version>]             # NoETL-managed wallet (keychain)
secret://gcp-sm/<project>/<name>[@<version>]    # GCP Secret Manager
secret://aws-sm/<region>/<name>                 # AWS Secrets Manager
secret://azure-kv/<vault>/<name>[@<version>]    # Azure Key Vault
secret://vault/<mount>/<path>#<field>           # HashiCorp Vault (KV / dynamic)
secret://k8s/<namespace>/<name>#<key>           # Kubernetes Secret
secret://env/<VAR>                              # dev only, gated
```
Back-compat: today's auth: <alias> / credential: <alias> map to
secret://wallet/<alias>.
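As a rough illustration of how a resolver might split such a URI into its parts, here is a minimal standard-library sketch. The names `SecretRef` fields and `parse_secret_ref` are hypothetical and chosen for this example; the residency selector is left out because the text above does not fix its URI syntax, and provider-specific validation of the locator (for example, how many path segments GCP or AWS expects) is not attempted.

```rust
/// Parsed form of a `secret://` reference (residency is left out of this sketch).
#[derive(Debug, PartialEq)]
struct SecretRef {
    provider: String,
    locator: String,
    version: Option<String>,
    field: Option<String>,
}

/// Hypothetical parser: `secret://<provider>/<locator>[@<version>][#<field>]`.
fn parse_secret_ref(uri: &str) -> Result<SecretRef, String> {
    let rest = uri
        .strip_prefix("secret://")
        .ok_or_else(|| format!("not a secret:// URI: {}", uri))?;
    // The first path segment names the provider; the remainder is provider-specific.
    let (provider, tail) = rest
        .split_once('/')
        .ok_or_else(|| format!("missing locator in: {}", uri))?;
    // An optional `#field` selector comes last.
    let (tail, field) = match tail.split_once('#') {
        Some((t, f)) => (t, Some(f.to_string())),
        None => (tail, None),
    };
    // An optional `@version` suffix follows the locator.
    let (locator, version) = match tail.rsplit_once('@') {
        Some((l, v)) => (l, Some(v.to_string())),
        None => (tail, None),
    };
    if provider.is_empty() || locator.is_empty() {
        return Err(format!("empty provider or locator in: {}", uri));
    }
    Ok(SecretRef {
        provider: provider.to_string(),
        locator: locator.to_string(),
        version,
        field,
    })
}
```

Under the back-compat rule, a bare `auth: <alias>` would be rewritten to `secret://wallet/<alias>` before reaching a parser like this one.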
```rust
enum SecretMaterial {
    Opaque(SecretString),                       // password / api_key / token
    Structured(BTreeMap<String, SecretString>), // postgres DSN parts, basic auth
    OAuth2 { access: SecretString, refresh: Option<SecretString>, expires_at: Option<DateTime> },
    CloudIdentity { token: SecretString, expires_at: DateTime }, // STS / GCP access token (short-lived)
    Keypair { cert: Vec<u8>, key: SecretBytes }, // mTLS / SSH
}
```

SecretString / SecretBytes wrap zeroize::Zeroizing — overwritten on
drop, never Debug/Serialize in the clear.
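To make the "overwritten on drop, never Debug in the clear" property concrete, the following is a standard-library stand-in for such a wrapper. It is only a sketch: the design names `zeroize::Zeroizing` as the real mechanism, whereas this example hand-rolls the wipe with volatile writes, and the `new` and `expose` methods are hypothetical names.

```rust
use std::fmt;
use std::ptr;
use std::sync::atomic::{compiler_fence, Ordering};

/// Std-only stand-in for a zeroizing secret wrapper (assumption: the real
/// type wraps `zeroize::Zeroizing` instead of wiping by hand).
struct SecretBytes(Vec<u8>);

impl SecretBytes {
    fn new(bytes: Vec<u8>) -> Self {
        SecretBytes(bytes)
    }

    /// Explicit accessor, so every read of the cleartext is visible at the call site.
    fn expose(&self) -> &[u8] {
        &self.0
    }
}

impl fmt::Debug for SecretBytes {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Never print the contents, not even their length.
        f.write_str("SecretBytes(<redacted>)")
    }
}

impl Drop for SecretBytes {
    fn drop(&mut self) {
        for byte in self.0.iter_mut() {
            // Volatile write so the compiler cannot optimize the wipe away.
            unsafe { ptr::write_volatile(byte as *mut u8, 0u8) };
        }
        // Keep the wipe ordered before the buffer is deallocated.
        compiler_fence(Ordering::SeqCst);
    }
}
```

A hand-rolled wipe like this only clears the final buffer; copies left behind by earlier reallocations or moves are not covered, which is one reason to prefer a maintained crate in real code.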
```rust
#[async_trait]
trait KeyManager: Send + Sync {
    async fn wrap_dek(&self, key_ref: &KekRef, dek: &[u8]) -> Result<WrappedDek>; // KMS Encrypt
    async fn unwrap_dek(&self, wrapped: &WrappedDek) -> Result<SecretBytes>;      // KMS Decrypt
    async fn current_version(&self, key_ref: &KekRef) -> Result<KeyVersion>;
}
```

Impls: GcpKms, AwsKms, AzureKeyVaultKeys, VaultTransit,
LocalDevKms (file-backed, prints a loud insecure-mode warning and is
refused when NOETL_ENV=production).
```rust
#[async_trait]
trait SecretProvider: Send + Sync {
    async fn fetch(&self, loc: &SecretLocator) -> Result<SecretMaterial>;
    fn supports_dynamic(&self) -> bool { false } // Vault dynamic DB creds, STS, etc.
}
```

Impls: GcpSecretManager, AwsSecretsManager, AzureKeyVault,
HashiCorpVault, KubernetesSecrets, Env (dev). Authentication to each
provider uses ambient workload identity where available (GKE WI, AWS
IRSA, Azure Workload Identity, K8s ServiceAccount token, Vault K8s auth) —
per execution-model.md "already-in-place trust" rule — so no bootstrap
secret is itself stored in the wallet.
Per-record DEK; KEK in KMS. Stored beside the ciphertext.
```
write(secret):
    dek         = random 32 bytes
    ciphertext  = AES-256-GCM(dek, nonce, plaintext)        # as today
    wrapped_dek = KMS.wrap_dek(kek_ref, dek)                # KMS Encrypt
    store { ciphertext, nonce, wrapped_dek, kek_provider, kek_key_id,
            kek_key_version, enc_alg = "AES-256-GCM", enc_version }

read(record):
    dek       = KMS.unwrap_dek(record.wrapped_dek)          # KMS Decrypt (region-local)
    plaintext = AES-256-GCM_decrypt(dek, record.nonce, record.ciphertext)
    zeroize(dek)
```
Rotation: rotating the KEK only re-wraps DEKs (cheap, no record
re-encryption). Rotating a DEK re-encrypts that one record. `enc_version`
+ `kek_key_version` make rotation incremental and auditable. A background
re-wrap job walks records on the old KEK version and re-wraps to the new one.
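The claim that KEK rotation touches only the wrapped DEK can be shown with a toy model. Everything here is hypothetical: `ToyKms` stands in for a real KMS, and its XOR "wrapping" is deliberately not encryption; it exists only so the example runs with the standard library. The point is the control flow: skip records already on the current KEK version, otherwise unwrap with the old version, re-wrap with the current one, and leave the ciphertext bytes untouched.

```rust
use std::collections::HashMap;

/// Toy stand-in for a KMS: maps a KEK version to a one-byte "key".
/// XOR is NOT encryption; it only makes the control flow runnable.
struct ToyKms {
    keks: HashMap<u32, u8>,
    current: u32,
}

impl ToyKms {
    fn wrap_dek(&self, version: u32, dek: &[u8]) -> Vec<u8> {
        let k = self.keks[&version];
        dek.iter().map(|b| b ^ k).collect()
    }

    fn unwrap_dek(&self, version: u32, wrapped: &[u8]) -> Vec<u8> {
        // XOR is its own inverse, so unwrapping reuses the same operation.
        self.wrap_dek(version, wrapped)
    }
}

#[derive(Clone, Debug, PartialEq)]
struct Record {
    ciphertext: Vec<u8>,
    wrapped_dek: Vec<u8>,
    kek_key_version: u32,
}

#[derive(Debug, PartialEq)]
enum RewrapOutcome {
    Skipped,
    Rewrapped { old: u32, new: u32 },
}

/// Re-wraps the record's DEK under the current KEK version.
/// The payload ciphertext is never read or modified.
fn rewrap(kms: &ToyKms, rec: &mut Record) -> RewrapOutcome {
    if rec.kek_key_version == kms.current {
        return RewrapOutcome::Skipped;
    }
    let old = rec.kek_key_version;
    let dek = kms.unwrap_dek(old, &rec.wrapped_dek);
    rec.wrapped_dek = kms.wrap_dek(kms.current, &dek);
    rec.kek_key_version = kms.current;
    RewrapOutcome::Rewrapped { old, new: kms.current }
}
```

A background job would call something like `rewrap` for each record still on the old version and count the outcomes, which is what makes the rotation incremental and resumable.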
Migration off the static key: a one-shot job reads each existing
record with the legacy static key, generates a DEK, envelope-encrypts,
writes the new columns. The legacy column is dropped once enc_version
is uniform. The all-zeros default is removed — startup fails closed
if no KMS/key manager is configured (except explicit dev mode).
The load-bearing part. Worker pools run in different regions/clouds/DCs; secrets must resolve locally and honor residency.
- Residency policy on the SecretRef. Optional `residency=<region|cloud|"in-region">`. The control plane refuses to resolve or transit a secret outside its residency boundary.
- Region-local secret brokers. Resolution is a system-pool playbook (`system/secret_resolve`, per `data-access-boundary.md`) running in the worker's own region, with KMS + provider endpoints local to that region. The dispatching server routes the resolve to the broker in the target region (shard/region routing reuses Phase F's shard map). Plaintext DEKs and secrets never leave the region.
- KMS topology. Each region/cloud has its own KMS (GCP KMS us-central1, AWS KMS eu-west-1, Azure KV westeurope…). Two options for a wallet record needed in multiple regions:
  - (a) Multi-region KEK (GCP multi-region keys, AWS multi-Region keys, Vault replicated transit) — one wrapped DEK valid in every region. Simplest; pick where the KMS supports it.
  - (b) Per-region wrap — store N wrapped DEKs, one per region's KEK; re-wrap on region add. Use where the KMS is single-region. The record carries the list of `{region, kek_ref, wrapped_dek}`; the broker picks its own region's entry.
- Prefer short-lived dynamic secrets. Where the provider supports it (Vault dynamic DB creds, cloud STS / workload-identity tokens, GCP IAM access tokens), resolve a short-TTL secret scoped to the execution at dispatch — auto-expiring, nothing long-lived stored or transited. This is the strongest posture for a distributed fleet.
- Sealed delivery to the worker (defense in depth over mTLS).
  - Worker generates an ephemeral X25519 keypair at startup (rotated periodically); registers the public key via `noetl.runtime` (worker registration).
  - The broker/server resolves the secret then seals it to the worker's ephemeral public key (libsodium sealed box / HPKE).
  - The worker unseals with its ephemeral private key, uses it, zeroizes. The sealed blob is useless to a MITM, to the event log, or to a co-tenant — independent of TLS.
  - Plus mTLS (SPIFFE/SPIRE or cert-manager issued certs) for the transport channel.
- Keychain = the execution-scoped resolved-secret / token cache (see §5a). Cached entries are envelope-encrypted (same wallet primitives), `execution_id`-scoped, lineage-inheritable, TTL-bounded, region-local, and never replicated across a residency boundary.
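For the per-region wrap option, a broker has to enforce residency first and then pick the wrapped DEK for its own region. The sketch below shows one way that selection could look. All names (`RegionWrap`, `SelectError`, `select_wrap`) are hypothetical, and residency is simplified to an exact region match; the cloud-level and `"in-region"` forms mentioned above are not modeled.

```rust
/// One wrapped copy of a record's DEK, usable only with `region`'s KEK.
#[derive(Debug, Clone, PartialEq)]
struct RegionWrap {
    region: String,
    kek_ref: String,
    wrapped_dek: Vec<u8>,
}

#[derive(Debug, PartialEq)]
enum SelectError {
    /// The record is pinned to a region other than the broker's.
    ResidencyViolation { required: String, broker_region: String },
    /// No DEK has been wrapped for the broker's region yet.
    NoWrapForRegion(String),
}

/// Picks the wrapped DEK for the broker's own region, refusing first
/// if the residency pin (simplified here to an exact region) does not match.
fn select_wrap<'a>(
    wraps: &'a [RegionWrap],
    residency: Option<&str>,
    broker_region: &str,
) -> Result<&'a RegionWrap, SelectError> {
    if let Some(required) = residency {
        if required != broker_region {
            return Err(SelectError::ResidencyViolation {
                required: required.to_string(),
                broker_region: broker_region.to_string(),
            });
        }
    }
    wraps
        .iter()
        .find(|w| w.region == broker_region)
        .ok_or_else(|| SelectError::NoWrapForRegion(broker_region.to_string()))
}
```

Checking residency before looking at any key material keeps the refusal path free of KMS calls, which matches the fail-closed intent of the list above.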
noetl.keychain is not a second credential store — the wallet
(noetl.credential + external providers) is the source of truth. The
keychain is the per-execution-instance cache of resolved secrets and
minted tokens for a running playbook:
- When a step resolves a `secret://…` (wallet or external provider) or mints an OAuth / STS / cloud-access token, the resolved material is cached in the keychain keyed by `(name, execution_id, scope)`, envelope-encrypted (same DEK/KEK primitives — the cache is not a plaintext hole), with `expires_at` + `auto_renew`.
- Later steps in the same execution read the cache instead of re-resolving — one provider call / one OAuth refresh per execution, not per step — and a single renewer keeps a shared token fresh (no thundering-herd refresh).
Scope semantics (the existing scope_type column, made precise):
| scope | visible to | use |
|---|---|---|
| `local` | the one `execution_id` only | per-execution secrets that must not leak to children |
| `shared` | the execution lineage (this execution + its sub-playbook descendants) | the default for inherited creds/tokens |
| `global` | all executions for the catalog entry | long-lived shared service tokens |
Sub-playbook inheritance. A kind: playbook step starts a child
execution with its own execution_id and a parent_execution_id link
(already recorded; the worker threads parent_execution_id on
get_credential — worker/src/client/control_plane.rs). A keychain
lookup for a child resolves by walking the lineage chain
(execution_id → parent_execution_id → … → root) and returns the nearest
shared / global entry:
- A token the parent resolved / refreshed is inherited by its sub-playbooks — no redundant provider call, no duplicate OAuth refresh, one refresh authority per token across the whole execution tree.
- `local`-scope entries stay private to their execution (not inherited) — the isolation knob for secrets a sub-playbook must not see.
- Server-side the resolver walks `noetl.execution`'s parent links; inheritance is a server concern (workers never see the chain, only the sealed result), keeping it consistent with the data-access boundary.
Distributed caveat. Inheritance is lineage + region-local: a sub-playbook dispatched to a different region re-resolves in that region rather than inheriting plaintext across a residency boundary — residency wins over cache reuse. The cached blob is sealed/at-rest-encrypted in its origin region only.
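The lineage walk described above can be sketched with in-memory data. This is an illustration under assumptions, not the server's query: `Entry`, `Scope`, and `lookup` are hypothetical names, `value` is a plain string where the real cache holds an envelope-encrypted blob, and the catalog-wide visibility of `global` entries outside the lineage, the region check from the distributed caveat, and TTL handling are all left out.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq)]
enum Scope {
    Local,
    Shared,
    Global,
}

/// Simplified cache row; the real entry would hold a sealed blob, not a String.
struct Entry {
    name: String,
    execution_id: u64,
    scope: Scope,
    value: String,
}

/// Walks execution_id -> parent_execution_id -> ... -> root and returns the
/// nearest visible entry. `parents` maps a child execution to its parent.
fn lookup<'a>(
    entries: &'a [Entry],
    parents: &HashMap<u64, u64>,
    name: &str,
    execution_id: u64,
) -> Option<&'a Entry> {
    let mut id = execution_id;
    // The loop bound guards against a malformed (cyclic) parent map.
    for _ in 0..=parents.len() {
        let hit = entries.iter().find(|e| {
            e.name == name
                && e.execution_id == id
                // Local entries are visible only to the execution that owns them.
                && (id == execution_id || e.scope != Scope::Local)
        });
        if hit.is_some() {
            return hit;
        }
        match parents.get(&id) {
            Some(&parent) => id = parent,
            None => break,
        }
    }
    None
}
```

Because the walk stops at the first visible hit, a child that caches its own `shared` entry shadows the parent's entry of the same name for its own descendants.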
playbook step: auth: secret://wallet/pg_eu (residency=eu)
│
▼ (control plane routes to EU broker; refuses non-EU)
EU secret broker (system pool, EU region)
│ KMS.unwrap_dek (EU KMS) + AES-GCM decrypt (plaintext stays in EU)
▼ seal(secret, worker_eu.ephemeral_pubkey)
EU worker ── mTLS ──▶ receives sealed blob ──▶ unseal ──▶ use ──▶ zeroize
- `noetl.credential` + `noetl.keychain`: add `wrapped_dek BYTEA`, `kek_provider TEXT`, `kek_key_id TEXT`, `kek_key_version TEXT`, `enc_alg TEXT`, `enc_version SMALLINT`, `residency TEXT NULL`, and a `wrap_regions JSONB NULL` (per-region wrapped DEKs for option 5b).
- `noetl.keychain`: keep `scope_type` (local/shared/global) + `execution_id` + `expires_at` + `auto_renew`; inheritance walks `noetl.execution.parent_execution_id` (already recorded for sub-playbook child executions — add/confirm the column + an index on it). A keychain GET for a child resolves `(name, scope)` by walking the lineage chain and returning the nearest `shared`/`global` hit.
- `noetl.runtime`: add `ephemeral_pubkey BYTEA`, `pubkey_expires_at`.
- New `noetl.secret_audit` (append-only): `id, ts, principal, alias, provider, region, execution_id, action, outcome`.
- `POST /api/credentials` / keychain write: envelope-encrypt (gen DEK → KMS.wrap → store).
- `GET /api/credentials/{alias}?seal_to=<worker_pubkey>&execution_id=…`: returns a sealed blob (not plaintext). The legacy plaintext path is retained only behind admin RBAC + audit for break-glass/dev.
- New internal `POST /api/internal/secrets/resolve` (system-pool/broker, residency-aware) per `data-access-boundary.md`.
- noetl-tools `secrets` / `secret_manager` tool dispatches to the `SecretProvider` registry and understands `secret://…`.
| Provider | KMS (KEK) | Secret manager | Workload-identity auth |
|---|---|---|---|
| GCP | Cloud KMS | Secret Manager | GKE Workload Identity |
| AWS | KMS | Secrets Manager | IRSA |
| Azure | Key Vault keys | Key Vault secrets (new) | Azure Workload Identity |
| HashiCorp Vault | Transit | KV v2 + dynamic | Vault K8s auth |
| Kubernetes | — (use cloud KMS) | Secrets (new) | ServiceAccount token |
| Local/dev | file (insecure, gated) | env | — |
| Phase | Status | Deliverable | Repos |
|---|---|---|---|
| 0 | ✅ | This design + threat model + decisions sign-off | ai-meta |
| 1 | ✅ v2.19.8–v2.21.0 | Envelope-encryption core: KeyManager trait + LocalDevKms; self-describing storage blob (no migration) for both noetl.credential and noetl.keychain; fail-closed key (1a, server#75) + envelope core (1b, server#77) + live wiring (1c/1d, server#79) | server |
| 2 | ✅ v2.22.0 | KMS providers: GcpKms (server#81) — Cloud KMS :encrypt/:decrypt + Workload Identity; runtime NOETL_KMS_PROVIDER. AwsKms / AzureKeyVaultKeys / VaultTransit follow behind the same trait | server |
| 3 | ✅ server v2.23.0–v2.26.0 | Secret resolution via the auth:/keychain path (not a workflow tool — the standalone secrets tool was removed, tools v2.19.2, because it leaked the value into the data flow). Redesigned server-side: R1 GCP SM client (v2.23.0) → R2 keychain-def model provider/map + find_keychain (v2.24.0) → R3a resolve_keychain_entry + build_secret_provider (v2.25.0) → R3b wire into the get_credential cache-miss (v2.26.0, server#89). GCP SM live; AWS SM / Azure KV / Vault / K8s slot into the same SecretProvider trait | server (+ tools) |
| 3c | ✅ server v2.27.0 | Keychain as execution-scoped cache: resolved secrets/tokens envelope-encrypted with scope + TTL (avoid re-fetching the provider per step) + keychain storage-layer repair (server#91). Sub-playbook parent_execution_id inheritance is a later follow-up | server |
| 3.x | ✅ all five providers landed | GCP SM (Phase 3) · Kubernetes Secrets (v2.28.0, server#97) · HashiCorp Vault (v2.29.0, server#101) · AWS Secrets Manager (v2.31.0, server#105, hand-rolled SigV4, no aws-sdk dep tree) · Azure Key Vault (v2.31.0, server#105, IMDS Managed Identity) — all behind the one SecretProvider trait | server |
| 4 | ✅ transport security (TLS / mTLS) | 4a ✅ landed v2.30.0 — server opt-in TLS/mTLS listener (server#103, closed server#102). 4b ✅ landed v5.12.0 — worker ControlPlaneClient mTLS client (worker#56, closed worker#55). 4c ✅ merged (ops@37d4d6c) — cert-manager mTLS overlay for kind (ops#163, closed ops#162): ci/manifests/noetl/tls/. 4d ✅ merged (ops@0fc0dc8) — Helm chart values-gated mTLS for GKE (ops#165, closed ops#164): automation/helm/noetl/ exposes tls.* values; off-mode renders byte-identical to main; on-mode produces 2 Issuers + 3 Certificates + the server/worker mTLS env contract; kind-validated (cert-manager materialised the Secrets with the right keys). Phase 4 is now fully merged across all four rounds. Production GKE points tls.certManager.issuerRef at a ClusterIssuer backed by GCP CAS or SPIRE/SPIFFE | server, worker, ops |
| 5 | ✅ sealed payload delivery | 5a ✅ landed v2.32.0 — server-side crypto primitives (server#107, closed server#106): src/crypto/sealed.rs X25519 ECDH + HKDF-SHA256 + ChaCha20-Poly1305 sealed-box (nonce derived from the shared secret, AAD pins alg+v); 12 unit tests, lib 369/0. 5b ✅ landed v2.33.0 — wire format + sealing endpoint (server#109, closed server#108): workers opt in by including worker_public_key (b64 X25519 pubkey) in their register payload's runtime JSON blob (no schema migration); GET /api/credentials/{id}/sealed?worker_id=<name> returns a SealedEnvelope; 400 when the worker_pool row exists but didn't register a key. Kind-validated end-to-end (Python cryptography opens the envelope → recovers the bearer token + scope round-trip). noetl_credentials_sealed_total{status} counter + credential.seal span per observability.md. 5c ✅ landed v5.13.0 — worker integration (worker#58, closed worker#57): long-lived X25519 keypair generated once at startup, pubkey registered in the runtime JSON blob, get_sealed_credential calls /sealed endpoint, unseals via the same primitives (drift-guard test against server constants), zeroizes the cleartext after the auth-alias resolver consumes it. Env-gated (`NOETL_SEALED_CREDENTIALS=true\|1` … | |
| 6 | ✅ residency-aware distributed resolution | 6a ✅ landed v2.34.0 — region tag on keychain entries + per-region routing (server#111, closed server#110): KeychainDef.region (no schema migration — lives in the existing JSON blob), SecretRef.region provider-agnostic, AWS provider consumes it with explicit precedence (`<region>:` ref prefix > field > legacy project overload > AWS_REGION env); NOETL_SERVER_REGION env + server_region() / effective_region() fallback helpers; noetl_secret_resolve_total{provider,region,status} counter per observability.md Principle 1. 5 new unit tests; lib 376/0. Lib-only — backward compatible. 6b ✅ landed v2.35.0 — ProviderRegistry + per-(provider, region) metrics (server#113, closed server#112): src/secrets/registry.rs ProviderRegistry keyed by (provider_id, region), RwLock-protected with double-checked locking on the build path so concurrent get_or_build for the same key only builds once. Optional TTL via NOETL_SECRET_PROVIDER_TTL_SECONDS env (default 0 = process lifetime). New noetl_secret_provider_build_total{provider,region,status="cache_hit\|ok\|error"} counter + noetl_secret_resolve_duration_seconds{provider,region} histogram (bucketed 5 ms – 5 s, observed regardless of outcome so dashboards surface "slow" + "failing" independently). 7 new unit tests; lib 383/0. Lib-only. 6c ✅ landed v2.36.0 — residency-policy gate (server#115, closed server#114): KeychainDef.residency enum (none\|advisory\|strict, default none) + KeychainDef.allowed_regions allowlist; src/secrets/residency.rs evaluate() returns Allow(label) / AllowWithViolationLogged / Deny(AppError::ResidencyViolation); resolver runs the gate at the top of resolve_keychain_entry BEFORE any provider call. AppError::ResidencyViolation { credential, entry_region, server_region } → HTTP 403 with clear "credential X is region-locked to Y; this server is in Z" message that NEVER includes the value. noetl_secret_residency_check_total{policy, decision} counter — strict + violation_blocked is alert-worthy, advisory + violation_allowed is migration-window signal. Defensive: empty string in allowlist never matches empty server region. 8 new unit tests; lib 391/0. Lib-only — no schema migration (residency + allowed_regions ride the existing JSON blob). 6d ✅ landed v2.37.0 (primitives) — dynamic-secret primitives + cache plumbing (server#117, closed server#116): SecretValue.expires_at: `Option<DateTime<Utc>>` field; src/secrets/dynamic.rs cache_decision() honors min(default_ttl, expires_at - now - safety_margin) and returns SkipCacheAlreadyExpired when the deadline is already past or inside the operator's safety margin; KEYCHAIN_CACHE_DYNAMIC_SAFETY_MARGIN_SECS env (default 60); resolve_keychain_entry_with_meta returns the bundle's earliest expires_at; CredentialService::resolve_via_provider consumes the helper. Two new metrics: noetl_secret_dynamic_ttl_seconds histogram (1m / 5m / 15m / 1h / 4h / 12h buckets, observed when issuer reports TTL) + noetl_secret_cache_skip_total{reason="already_expired"} counter. 7 new unit tests; lib 398/0. Backward compatible (providers without expires_at keep the 600 s default). Follow-ups (each its own sub-issue): 6d.1 AWS STS AssumeRoleWithWebIdentity provider · 6d.2 GCP iamcredentials.generateAccessToken · 6d.3 Azure AAD client-credentials. 6e ✅ landed v2.38.0 — cross-region broker (server#119, closed server#118): BrokerRegistry (region → broker_url from NOETL_SECRET_BROKER_REGISTRY env, empty default = pre-6e fail-closed); POST /api/internal/cross-region/resolve peer endpoint validates expected_entry_region == server_region() (defensive against stale peer registries), resolves locally, seals via Phase-5a primitives to the requesting worker's pubkey; get_sealed handler falls back to broker on AppError::ResidencyViolation; KeychainDef.no_broker_fallback: bool per-credential opt-out for hard-isolation credentials; AppError::CrossRegionUnreachable → HTTP 502. Two new metrics: noetl_secret_broker_call_total{broker_region, outcome} counter + noetl_secret_broker_call_duration_seconds{broker_region} histogram (50ms – 5s buckets). 10 new unit tests; lib 410/0. Lib-only — opt-in via env, no schema migration. Phase 6 closes. Both residency shapes operational: hard isolation (residency: strict + no broker → fail-closed HTTP 403) + soft federation (residency: strict + broker registered → transparent cross-region routing). Covers G7 in full. | server, ops |
| 7 | ✅ rotation + audit + auto-renewal |
7a ✅ landed v2.39.0 — KEK rotation primitives (server#121, closed server#120): KeyManager::current_key_version() trait accessor; EnvelopeCipher::rewrap_storage_string primitive (parse → if wrapped.key_version == current_key_version → Skipped; else unwrap with historical KEK version → re-wrap with current → return Rewrapped { old_key_version, new_key_version, new_storage_string }). Plaintext payload NEVER reconstructed — pure DEK re-wrap, AES-GCM ciphertext bytes stay byte-identical. noetl_wallet_rotate_total{table, status} counter (skipped|rewrapped|failed_unwrap|failed_wrap|parse_error; failed_unwrap alert-worthy). 4 new unit tests; lib 414/0. 7a.2 ⏳ rotation endpoint (POST /api/internal/wallet/rotate-kek) + DB scans over noetl.credential+noetl.keychain + diagnostic GET /api/internal/wallet/key-status. 7b ✅ landed v2.40.0 (primitives) — secret-resolution audit service (server#123, closed server#122): services::secret_audit::AuditEvent struct (NEVER contains the secret value); Operation + Outcome bounded enums; AuditSink trait + NoopAuditSink default + SecretAuditService wrapper with record_async (fire-and-forget) + record_strict (await) + record (dispatches by strict-mode); NOETL_SECRET_AUDIT_REQUIRED env (default false; 1/true/TRUE/yes/YES enable strict); noetl_secret_audit_writes_total{operation, outcome, status} counter (failed_strict alert-worthy). 8 new unit tests; lib 422/0. Lib-only. 7b.2 ⏳ noetl.secret_audit table + DbAuditSink + GET /api/internal/secret-audit query endpoint + wire into the four credential surfaces. 7c ✅ landed v2.41.0 — token auto-renewal primitives (server#125, closed server#124): secrets::dynamic::should_refresh(expires_at, refresh_window, now) decision primitive (true iff expires_at set + still valid + inside refresh window) + should_refresh_default reading env; KEYCHAIN_CACHE_REFRESH_WINDOW_SECS env (default 60). 
Two new metrics: noetl_secret_refresh_total{outcome} counter (triggered|succeeded|failed|stampede_collapsed; failed alert-worthy) + noetl_secret_refresh_duration_seconds histogram (50ms–5s buckets, observed regardless of outcome). 5 new unit tests; lib 427/0. Lib-only. 7c.2 ⏳ cache + resolver wire-up (KeychainService::should_refresh + on-cache-hit spawn-background-refresh path + per-(catalog_id, alias) tokio::sync::Mutex stampede collapse + refresh path records its own Phase-7b AuditEvent). Phase 7 closes. Remaining queue: 7a.2 / 7b.2 / 7c.2 (.2 endpoint+DB rounds), 6d.1 / 6d.2 / 6d.3 (cloud-specific dynamic providers) — all discrete follow-up sub-issues, each its own bounded round. |
server |
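The 7a re-wrap flow in the row above (skip when the record is already on the current KEK version, otherwise unwrap the DEK with the historical version and re-wrap it with the current one) can be sketched roughly as follows. This is a minimal illustration, not the server's code: `WrappedRecord`, `RewrapOutcome`, `ToyKeyManager`, and `rewrap` are hypothetical names, the XOR "wrapping" is a stand-in for a real KMS call, and the sketch starts from an already-parsed record because the storage-string layout is not given here.

```rust
use std::collections::HashMap;

/// Hypothetical parsed envelope: payload ciphertext plus the DEK wrapped
/// under one KEK version. The real storage-string layout is not shown here.
#[derive(Debug, Clone, PartialEq)]
pub struct WrappedRecord {
    pub key_version: u32,
    pub wrapped_dek: Vec<u8>,
    pub ciphertext: Vec<u8>,
}

/// Mirrors the status labels named for noetl_wallet_rotate_total
/// (parse_error is omitted because this sketch starts from a parsed record).
#[derive(Debug, PartialEq)]
pub enum RewrapOutcome {
    Skipped,
    Rewrapped { old_key_version: u32, new_key_version: u32, record: WrappedRecord },
    FailedUnwrap,
    FailedWrap,
}

/// Toy stand-in for a KMS-backed key manager. XOR with a per-version byte
/// is NOT encryption; it only makes the control flow testable.
pub struct ToyKeyManager {
    keks: HashMap<u32, u8>,
    current: u32,
}

impl ToyKeyManager {
    pub fn new(keks: HashMap<u32, u8>, current: u32) -> Self {
        Self { keks, current }
    }

    pub fn current_key_version(&self) -> u32 {
        self.current
    }

    /// Returns None when the requested KEK version is unknown.
    pub fn wrap_dek(&self, version: u32, dek: &[u8]) -> Option<Vec<u8>> {
        let kek = *self.keks.get(&version)?;
        Some(dek.iter().map(|b| b ^ kek).collect())
    }

    /// XOR is its own inverse, so unwrapping reuses the same transform.
    pub fn unwrap_dek(&self, version: u32, wrapped: &[u8]) -> Option<Vec<u8>> {
        self.wrap_dek(version, wrapped)
    }
}

/// Re-wrap only the DEK; the payload ciphertext is copied through untouched,
/// so the plaintext payload is never reconstructed.
pub fn rewrap(km: &ToyKeyManager, rec: &WrappedRecord) -> RewrapOutcome {
    let current = km.current_key_version();
    if rec.key_version == current {
        return RewrapOutcome::Skipped;
    }
    let dek = match km.unwrap_dek(rec.key_version, &rec.wrapped_dek) {
        Some(dek) => dek,
        None => return RewrapOutcome::FailedUnwrap,
    };
    let wrapped_dek = match km.wrap_dek(current, &dek) {
        Some(wrapped) => wrapped,
        None => return RewrapOutcome::FailedWrap,
    };
    RewrapOutcome::Rewrapped {
        old_key_version: rec.key_version,
        new_key_version: current,
        record: WrappedRecord {
            key_version: current,
            wrapped_dek,
            ciphertext: rec.ciphertext.clone(),
        },
    }
}
```

Running `rewrap` a second time on its own output returns `Skipped`, which is what makes a table scan safe to repeat after a partial failure.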
Phases 1–4 are the security-validation must-haves (managed keys, no plaintext to workers). 5–7 harden transit, distribution, and rotation. (Reordered from the original plan: transport mTLS is now Phase 4 and payload sealing Phase 5 — mTLS is the foundation, sealing is the defense-in-depth layer on top.)
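On the rotation and renewal side, the Phase 7c decision primitive is described as "true iff expires_at set + still valid + inside refresh window". A minimal sketch of that rule is below. The parameter types (`Option<SystemTime>`, `Duration`), the inclusive treatment of the window boundary, and the helper `refresh_window_from` are assumptions made for illustration; only the function name `should_refresh`, the `KEYCHAIN_CACHE_REFRESH_WINDOW_SECS` variable, and its default of 60 come from the status table.

```rust
use std::time::{Duration, SystemTime};

/// Default refresh window in seconds, per KEYCHAIN_CACHE_REFRESH_WINDOW_SECS.
pub const DEFAULT_REFRESH_WINDOW_SECS: u64 = 60;

/// True only when an expiry is set, the entry has not yet expired, and the
/// remaining lifetime is within the refresh window.
pub fn should_refresh(
    expires_at: Option<SystemTime>,
    refresh_window: Duration,
    now: SystemTime,
) -> bool {
    match expires_at {
        // Static secrets carry no expiry and are never refreshed.
        None => false,
        Some(expiry) => match expiry.duration_since(now) {
            // expiry >= now: compare the remaining lifetime to the window.
            // Assumption: a remaining lifetime equal to the window counts as inside.
            Ok(remaining) => remaining > Duration::ZERO && remaining <= refresh_window,
            // expiry < now: already expired, so a background refresh is too late.
            Err(_) => false,
        },
    }
}

/// Hypothetical helper: turn the raw env value into a window.
/// Assumption: a missing or unparsable value falls back to the default.
pub fn refresh_window_from(raw: Option<&str>) -> Duration {
    let secs = raw
        .and_then(|value| value.trim().parse::<u64>().ok())
        .unwrap_or(DEFAULT_REFRESH_WINDOW_SECS);
    Duration::from_secs(secs)
}
```

Passing `now` in explicitly, as the documented signature does, keeps the primitive deterministic and easy to unit test; a caller would supply `SystemTime::now()` and `refresh_window_from(std::env::var("KEYCHAIN_CACHE_REFRESH_WINDOW_SECS").ok().as_deref())`.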
- Primary cloud / KMS first. Which KMS does Phase 2 implement first — GCP Cloud KMS, AWS KMS, Azure Key Vault, or Vault Transit? (Drives the reference implementation; others follow the same trait.)
- Multi-region key strategy. Multi-region KEK (5a) vs per-region wrap (5b) — depends on whether the chosen KMS offers multi-region keys.
- Sealed delivery vs mTLS-only. Do both (recommended, defense in depth) or start with mTLS-only and add sealing later?
- Residency requirements. Are there hard data-residency boundaries to enforce now (e.g., EU-only secrets), or is that future-proofing?
- Dynamic vs static secrets. How aggressively to push short-lived dynamic creds (Vault/STS) vs encrypted-at-rest static creds?
- Break-glass plaintext read. Keep an admin-RBAC + audited plaintext include_data=true path, or remove it entirely (sealed-only)?
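For the residency question, the two shapes that Phase 6e reports as operational (hard isolation and soft federation) reduce to a small routing decision. The sketch below assumes `residency: strict` throughout and uses hypothetical names (`Route`, `route`, `ResolveError`, `http_status`, `BrokerRegistry::from_pairs`). The encoding of `NOETL_SECRET_BROKER_REGISTRY` is not given in this page, so the registry is built from explicit pairs rather than parsed from the env value, and the region names and URL are made up.

```rust
use std::collections::HashMap;

/// Hypothetical region -> broker_url map. An empty registry reproduces the
/// fail-closed behaviour described as the default.
#[derive(Default)]
pub struct BrokerRegistry {
    by_region: HashMap<String, String>,
}

impl BrokerRegistry {
    pub fn from_pairs(pairs: &[(&str, &str)]) -> Self {
        let by_region = pairs
            .iter()
            .map(|(region, url)| (region.to_string(), url.to_string()))
            .collect();
        Self { by_region }
    }

    pub fn broker_for(&self, region: &str) -> Option<&str> {
        self.by_region.get(region).map(String::as_str)
    }
}

#[derive(Debug, PartialEq)]
pub enum Route {
    /// Entry lives in this server's region: resolve locally.
    Local,
    /// Soft federation: forward to the peer broker at this URL.
    Broker(String),
    /// Hard isolation: fail closed.
    ResidencyViolation,
}

/// Decide how a strict-residency entry is served.
pub fn route(
    server_region: &str,
    entry_region: &str,
    no_broker_fallback: bool,
    registry: &BrokerRegistry,
) -> Route {
    if entry_region == server_region {
        return Route::Local;
    }
    // Per-credential opt-out wins even when a broker is registered.
    if no_broker_fallback {
        return Route::ResidencyViolation;
    }
    match registry.broker_for(entry_region) {
        Some(url) => Route::Broker(url.to_string()),
        None => Route::ResidencyViolation,
    }
}

/// Error-to-status mapping as stated in the 6e notes.
#[derive(Debug, PartialEq)]
pub enum ResolveError {
    ResidencyViolation,
    CrossRegionUnreachable,
}

pub fn http_status(err: &ResolveError) -> u16 {
    match err {
        ResolveError::ResidencyViolation => 403,
        ResolveError::CrossRegionUnreachable => 502,
    }
}
```

With an empty registry every cross-region lookup ends in `ResidencyViolation`, so enforcing a hard boundary such as EU-only secrets needs no extra configuration; registering a broker for a region is what opts it into federation.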
| Requirement (SOC2 / ISO 27001 / PCI-DSS) | Met by |
|---|---|
| Encryption at rest with managed keys | Phase 1–2 (KMS-backed envelope) |
| No hardcoded / static keys | Phase 1 (kill all-zeros default, fail-closed) |
| Key rotation | Phase 7 (versioned re-wrap) |
| Encryption in transit | Phase 4–5 (mTLS + sealed delivery) |
| Least privilege + access logging | Phase 6–7 (secret_audit, RBAC) |
| Data residency | Phase 6 (region brokers + residency policy) |
| Secret sprawl / external managers | Phase 3 (GCP/AWS/Azure/Vault/K8s) |
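For the access-logging row, the Phase 7b strict-mode switch (`NOETL_SECRET_AUDIT_REQUIRED`, enabled by `1`/`true`/`TRUE`/`yes`/`YES`, default off) can be sketched as below. The type names echo those listed for 7b, but their shapes here are assumptions: the event fields and enum variants are invented, the sink is synchronous rather than async, `FailingSink` exists only to exercise the failure path, and the rule that strict mode surfaces a sink failure while non-strict mode swallows it is this sketch's reading of "fire-and-forget" versus "await".

```rust
/// Strict mode is on only for the exact spellings listed for the env var.
/// Assumption: any other value, or an unset variable, leaves it off.
pub fn audit_required(raw: Option<&str>) -> bool {
    matches!(raw, Some("1" | "true" | "TRUE" | "yes" | "YES"))
}

/// Hypothetical bounded enums; the real variants are not listed in this page.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Operation {
    Resolve,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Outcome {
    Allowed,
    Denied,
}

/// Hypothetical event shape. It deliberately has no field for the secret value.
#[derive(Debug, Clone, PartialEq)]
pub struct AuditEvent {
    pub operation: Operation,
    pub outcome: Outcome,
    pub alias: String,
}

pub trait AuditSink {
    fn write(&self, event: &AuditEvent) -> Result<(), String>;
}

/// Default sink: accepts every event and stores nothing.
pub struct NoopAuditSink;

impl AuditSink for NoopAuditSink {
    fn write(&self, _event: &AuditEvent) -> Result<(), String> {
        Ok(())
    }
}

/// Illustration-only sink that always fails.
pub struct FailingSink;

impl AuditSink for FailingSink {
    fn write(&self, _event: &AuditEvent) -> Result<(), String> {
        Err("audit store unavailable".to_string())
    }
}

pub struct SecretAuditService<S: AuditSink> {
    sink: S,
    strict: bool,
}

impl<S: AuditSink> SecretAuditService<S> {
    pub fn new(sink: S, strict: bool) -> Self {
        Self { sink, strict }
    }

    /// Strict: a failed write is returned so the caller can refuse the secret.
    /// Non-strict: the write is best effort and its failure is dropped.
    pub fn record(&self, event: &AuditEvent) -> Result<(), String> {
        let result = self.sink.write(event);
        if self.strict {
            result
        } else {
            Ok(())
        }
    }
}
```

A caller would build the service with `audit_required(std::env::var("NOETL_SECRET_AUDIT_REQUIRED").ok().as_deref())` and, in strict mode, treat an `Err` from `record` as a reason to fail the resolution.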
- agents/rules/execution-model.md: "Secrets and credentials rule" (keychain by alias; runtime vs business-logic secrets; already-in-place trust).
- agents/rules/data-access-boundary.md: system-pool playbooks for noetl.* access; the secret broker is a system playbook.
- agents/rules/observability.md: every new /api/internal/* resolve endpoint ships span + metric + execution_id; secret_audit is the access trail.
- Phase F sharding (Umbrella: Rust Server Port): region/shard routing is reused for region-local brokers.