Stage 7 — pluggable broker live deploy + OIDC-only auto-provision (issue #64, #71 Option A)#73
Merged
Conversation
…env-var module
Implement plan §5: single source of truth for every BROKER_* environment
variable name. Per user rule 11, no other module may declare a raw env-var
literal — all reads go through these constants.
- crates/agentkeys-broker-server/src/env.rs (new): const &str declarations
for all 51 env vars (Phase 0 + planned A/B/C/D/E + legacy aliases),
Group enum (Core/Oidc/SessionJwt/Audit/AuditEvm/Auth/AuthEmail/AuthOAuth2/
Limits/Legacy), all() registry returning (name, doc, group), print_table()
for the operator runbook auto-generator. 5 unit tests cover uniqueness,
non-empty docs, required-Phase-0 presence, table render row count, and
Group exhaustiveness.
- crates/agentkeys-broker-server/src/lib.rs: register pub mod env.
- crates/agentkeys-broker-server/src/config.rs: replace every raw BROKER_*
string literal with env::* constants. grep -E '"(BROKER_|DAEMON_|ACCOUNT_ID|REGION)' src/config.rs returns zero hits. Adds parse_int_env_with_default<T> helper to
collapse three near-duplicate parse blocks.
Plan home: docs/spec/plans/issue-64/{PLAN.md (mirror), DECISIONS.md,
AMBIGUITIES.md, V0.1-FOLLOWUPS.md, prd.json (PRD-driven ralph)}.
Acceptance criteria (US-001):
- env.rs exists with const &str for every plan §5 BROKER_* var ✓
- Group enum with required variants ✓
- all() returns slice of (name, doc, Group), all docs non-empty ✓
- src/config.rs: grep zero hits for raw BROKER_/DAEMON_/ACCOUNT_ID/REGION ✓
- cargo build -p agentkeys-broker-server succeeds ✓
- cargo test -p agentkeys-broker-server env:: 5/5 pass ✓
Refs: issue #64 plan §1 rule 11, §5.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implement plan §3 + §3.5: pluggable trait surface for the three layers
below the credential mint. No plug-in implementations yet (US-006
implements WalletSig, US-007 ClientSideKeystore, US-008 SqliteAnchor) —
this story lands the trait shapes, error types, and registry that the
later stories slot into.
- crates/agentkeys-broker-server/src/plugins/mod.rs (new): Readiness
enum (Ready/Degraded/Unready), PluginRegistry { auth: HashMap, wallet,
audit: Vec }, aggregate_readiness() → (overall, per-check) for the
/readyz JSON. Trait re-exports.
- crates/agentkeys-broker-server/src/plugins/auth.rs (new): UserAuthMethod
trait (name/ready/challenge/verify), VerifiedIdentity, ChallengeParams,
AuthChallenge, AuthResponse, IdentityType { Evm, Email, OAuth2{Google,
Github,Apple} } with stable canonical() strings (input to OmniAccount
derivation; renaming is breaking). AuthError enum.
- crates/agentkeys-broker-server/src/plugins/wallet.rs (new):
WalletProvisioner trait (name/ready/bind_address/lookup_by_omni_account),
WalletAddress newtype with parse() that normalizes 0x-prefixed hex to
lowercase + length check, WalletRole { Master, Daemon }, WalletBinding
struct. WalletError enum.
- crates/agentkeys-broker-server/src/plugins/audit.rs (new): AuditAnchor
trait (name/ready/anchor/verify), AuditRecord with record_hash for
cross-anchor dedup, AnchorReceipt, AuditPolicy { DualStrict,
SqlitePrimary, EvmPrimary } parser. AuditError enum.
- crates/agentkeys-broker-server/src/lib.rs: register pub mod plugins.
- crates/agentkeys-broker-server/Cargo.toml: feature-gate scaffold per
plan §3. default = [auth-wallet-sig, wallet-keystore, audit-sqlite].
Optional features for v0-testnet (auth-email-link, auth-oauth2-google,
audit-evm) and v1+ (auth-oauth2-github, auth-oauth2-apple, audit-solana).
External deps land in implementation stories (US-006: k256+sha3;
Phase A.1: lettre+aws-sdk-sesv2; Phase C: alloy-*).
Acceptance criteria (US-002):
- Readiness enum with Ready/Degraded/Unready ✓
- UserAuthMethod / WalletProvisioner / AuditAnchor traits ✓
- PluginRegistry struct + aggregate_readiness ✓
- Per-trait thiserror error enums (AuthError, WalletError, AuditError) ✓
- Cargo features: auth-wallet-sig, auth-email-link, auth-oauth2,
auth-oauth2-google, wallet-keystore, audit-sqlite, audit-evm, test-stub ✓
- cargo build with default features ✓
- cargo test plugins:: 8/8 pass ✓
- cargo clippy -D warnings clean ✓
Per-trait `ready()` MUST NOT default to Ready — implementations check
their own dependencies. Documented in trait doc comments. The first
implementations (US-006/007/008) demonstrate the pattern.
Refs: issue #64 plan §3, §3.5, §1 rule 8.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…liteAnchor port
Bundles two stories that became coupled when the agentkeys-types::AgentIdentity
extension forced match-arm updates across four crates and the audit/ module
restructure required relocating both the trait file and the SqliteAnchor
implementation in the same change.
US-004 — OmniAccount derivation
- crates/agentkeys-broker-server/src/identity/{mod.rs,omni_account.rs} (new):
derive_omni_account(identity_type, identity_value) → SHA256(client_id ||
type || value) with hardcoded AGENTKEYS_CLIENT_ID = "agentkeys". Per port-
vs-greenfield "What we port — crypto primitives only", this matches the
dexs-backend hash shape verbatim but uses our own client_id, giving each
operator a sovereign identity namespace. derive_with_client_id(...) is
exposed for reproducing dexs reference vectors in tests.
- crates/agentkeys-types/src/lib.rs: AgentIdentity::OAuth2{provider, sub}
variant added (additive — every existing AgentIdentity consumer continues
to work unchanged for the four prior variants).
- Match-arm updates across consumers (Rust E0004 non-exhaustive errors
surfaced these — exactly the property we want from the type system):
- crates/agentkeys-core/src/mock_client.rs (open_auth_request +
session_recover): map OAuth2{provider,sub} → ("oauth2_<provider>", sub)
matching the broker's IdentityType::canonical() naming.
- crates/agentkeys-core/src/auth_request.rs: deterministic CBOR encoding
of OAuth2 — Map[("provider", Text), ("sub", Text)] with keys ASCII-
sorted so the canonical hash is stable.
- crates/agentkeys-cli/src/lib.rs: rich-error human-readable form
"oauth2_<provider>:<sub>".
- crates/agentkeys-mock-server/src/test_client.rs: same mapping as
mock_client (auth-request and session-recover paths).
- 9 identity:: unit tests cover: hex parse validation, derivation
determinism, identity-type namespace separation, identity-value
separation, client_id namespace separation (load-bearing — proves
agentkeys ≠ wildmeta for the same email), prod entry-point matches
hardcoded constant, lowercase-hex output guarantee.
US-008 — SqliteAnchor port to AuditAnchor trait
- crates/agentkeys-broker-server/src/plugins/audit/{mod.rs,sqlite.rs}
restructured: trait file `audit.rs` merged into `audit/mod.rs` so the
feature-gated `audit-sqlite` submodule can live alongside it. (Previous
layout had `audit.rs` + `audit/mod.rs` which Rust E0761'd.)
- src/plugins/audit/sqlite.rs (new): SqliteAnchor implementing AuditAnchor.
Schema is the new plugin_mint_log table with the canonical AuditRecord
columns + a status column (Phase 0 writes 'confirmed' directly; Phase C
introduces the pending → confirmed | quarantined lifecycle). Indexes on
minted_at, omni_account, record_hash, status. WAL+FULL pragma preserved
from the legacy crate::audit::AuditLog.
- Readiness::Ready when DB writable; Unready otherwise.
- 8 plugins::audit:: tests cover: anchor round-trip, verify NotFound,
record_hash tampering detection, wrong-anchor receipt rejection, ready
reports Ready, name() stability + AuditPolicy parse + AuditRecord round
trip.
Acceptance criteria (US-004):
- src/identity/omni_account.rs derive_omni_account(...) ✓
- AGENTKEYS_CLIENT_ID = "agentkeys" pinned ✓
- agentkeys-types::AgentIdentity::OAuth2{provider, sub} added ✓
- Tests cover canonical hash for each identity type ✓
- cargo test identity:: 9/9 pass ✓
Acceptance criteria (US-008):
- src/plugins/audit/sqlite.rs implements AuditAnchor ✓
- plugin_mint_log table with canonical columns + indexes ✓
- WAL+FULL pragma preserved ✓
- verify() detects record_hash tampering ✓
- Readiness Ready when writable ✓
- cargo test plugins::audit:: 8/8 pass ✓
Note: legacy crate::audit::AuditLog (the existing src/audit.rs) is left
in place for now — US-011 migrates the mint handler to the new trait and
drops the legacy module then. Carrying both during the transition keeps
existing /v1/mint-aws-creds working.
Refs: issue #64 plan §3.5 (OmniAccount), §3 (AuditAnchor trait), §Phase 0
deliverables.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…h purpose tagging Implement plan §3.5.6: two distinct ES256 keypairs for two roles: - oidc keypair (existing) — signs JWTs that AWS STS verifies via JWKS. - session keypair (NEW) — signs broker-internal session JWTs. Closes Codex / eng-review #7 footgun: an operator pointing BROKER_SESSION_KEYPAIR_PATH at the OIDC keypair file would have silently used the wrong key (same kid, same crypto), letting session tokens pass as IAM federation tokens. Defense: on-disk JSON now carries a "purpose" field; load-time validation refuses to read a keypair whose purpose does not match the slot. - crates/agentkeys-broker-server/src/jwt/{mod,session,issue,verify}.rs (new): KeypairPurpose enum (Oidc | Session) with stable kebab-case canonical() and kid_prefix(); SessionKeypair (mirror of OidcKeypair, purpose-tagged on disk, kid prefix `ak-session-`); mint_session_jwt() with the canonical session-JWT claim shape (iss/sub/aud=agentkeys:broker/exp/iat/jti + agentkeys.{omni_account,wallet_address,identity_type,identity_value}); verify_session_jwt() that pins audience + issuer + kid header. - crates/agentkeys-broker-server/src/oidc.rs: - PersistedKeypair: add `purpose` field with #[serde(default)] mapping to KeypairPurpose::Oidc so pre-Stage-7 keypair files (no purpose field) continue to load as oidc. New keypairs always include the field. - load() refuses any keypair whose purpose ≠ Oidc. - generate_and_persist() writes purpose=oidc. - rand_core_compat → pub(crate) rand_compat (so SessionKeypair can reuse the rand_core 0.6 → OS RNG bridge). - set_owner_only → pub(crate) set_owner_only_inner (same reason). - crates/agentkeys-broker-server/src/lib.rs: register pub mod jwt. Acceptance criteria (US-005): - src/jwt/mod.rs: KeypairPurpose with Oidc + Session ✓ - On-disk JSON includes "purpose" field ✓ - SessionKeypair::load refuses purpose=oidc keypair ✓ - SessionKeypair::load refuses untagged JSON ✓ - OidcKeypair::load refuses purpose=session keypair ✓ - Session JWT mint+verify round trip ✓ - verify rejects wrong audience, wrong issuer, expired ✓ - session keypair kid prefix `ak-session-`; oidc kid format unchanged ✓ - cargo test jwt:: 10/10 pass ✓ - cargo build green ✓ env.rs already has BROKER_SESSION_KEYPAIR_PATH and BROKER_SESSION_JWT_TTL_SECONDS (landed in US-001). Wiring config.rs + boot.rs to actually load the session keypair lands in US-003 (tiered refuse-to-boot). Refs: issue #64 plan §3.5.6, codex review finding #7, eng review #code-structure. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…sioner + WalletStore
Implement plan §3.5 + §Phase 0 wallet layer: the MetaMask model. The
broker stores ONLY (omni_account, address, role, parent_address,
created_at) — the user holds the seed in their OS keychain on the
daemon side. The broker has no key material it could leak.
Storage layer:
- crates/agentkeys-broker-server/src/storage/{mod.rs, wallets.rs} (new):
WalletStore with composite-PK schema (omni_account, address) so a user
can have multiple wallets and re-binding the same address is idempotent.
WAL+NORMAL for throughput (audit log gets FULL elsewhere).
bind() detects role mismatch and parent mismatch on re-bind — a daemon
switching masters or an address flipping role would be silent data
corruption otherwise.
list_for_omni_account() returns every wallet bound to the OmniAccount.
writable() probe used by the plugin's ready().
Plugin layer:
- crates/agentkeys-broker-server/src/plugins/wallet/{mod.rs,keystore.rs}:
module restructure from sibling-file `wallet.rs` to `wallet/mod.rs +
wallet/keystore.rs` (same E0761 fix as US-008's audit module).
ClientSideKeystoreProvisioner implements WalletProvisioner. name() =
"client_keystore". ready() reflects WalletStore::writable() (NOT a
hardcoded Ready, per plan §1 rule 5). bind_address() stamps current
unix-seconds and delegates to WalletStore::bind. lookup_by_omni_account
delegates to WalletStore::list_for_omni_account.
- crates/agentkeys-broker-server/src/lib.rs: register pub mod storage.
Acceptance criteria (US-007):
- src/plugins/wallet/keystore.rs implements WalletProvisioner ✓
- Storage table wallets(omni_account, address, role, parent_address,
created_at) with composite PK and role CHECK constraint ✓
- bind(): inserts row; idempotent (same role + parent → returns existing) ✓
- bind() rejects role mismatch ✓
- lookup_by_omni_account returns all bindings ✓
- ready() Ready when DB writable, Unready otherwise ✓
- 9 plugins::wallet:: tests pass (3 type tests + 6 keystore behavior
tests covering bind+lookup, idempotent re-bind, rejected role flip,
ready, name, multi-binding lookup) ✓
- cargo build green ✓
Refs: issue #64 plan §3.5 (wallet layer), §Phase 0 deliverables.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Update progress.txt with full Phase 0 session log (6 of 16 stories complete: US-001/002/004/005/007/008). Update prd.json passes flags + commit refs. Append commit-log table to DECISIONS.md. Phase 0 remaining (10 stories) for next ralph iteration: - US-003 boot.rs + main.rs wiring - US-006 WalletSig SIWE (largest remaining; needs k256+sha3 deps) - US-009/010/011 auth + mint endpoints - US-012 broker_status /readyz aggregator - US-013 invariant load-bearing test (all 6 cases) - US-014 smoke + done.sh - US-015 operator runbook - US-016 codex round 1 Suggested next-iteration commit order: 6 → 3 → 9/10/11 → 12 → 13 → 14 → 15 → 16. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…json passes:true + commit refs for US-001, US-002, US-004, US-005, US-007, US-008. Remaining 10 Phase 0 stories still passes:false. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nceStore Phase 0 wallet-sig auth method per plan §3.5.1: SIWE-wrapped EIP-191. Closes Codex P0 #2 (raw EIP-191 was replayable across apps; SIWE binds domain). Storage: - crates/agentkeys-broker-server/src/storage/auth_nonces.rs (new): AuthNonceStore with single-use semantics. issue() inserts, consume() is race-safe via WHERE consumed_at IS NULL conditional UPDATE, purge_expired() janitors old rows. ConsumeOutcome enum collapses "never existed" and "already consumed" into NotFoundOrConsumed so an attacker cannot probe the nonce table; Expired is a separate variant so the broker can surface a "your sign-in expired" message. 7/7 tests pass. Plugin: - crates/agentkeys-broker-server/src/plugins/auth/{mod.rs ⟵ ex auth.rs, wallet_sig.rs} (restructure + new): Same E0761 module-conflict fix as US-007/008. SiweWalletAuth implements UserAuthMethod. challenge() builds an EIP-4361 SIWE message with the broker's domain, fresh CSPRNG nonce, issued_at, expiration_time (issued_at + 45min), URI, chain_id, resources. verify() looks up the pending challenge, atomically consumes the nonce, runs k256 ecrecover via the EIP-191 envelope (`\x19Ethereum Signed Message:\n<len><msg>` → keccak256 → recover_from_prehash), and asserts the recovered address matches the SIWE message's claimed address. ecrecover_address() handles v ∈ {0,1,27,28} (k256 RecoveryId requires {0,1}, so 27/28 are normalized). Per-call security: - SIWE domain field bound to broker's host (replay across apps blocked) - Nonce single-use enforced via AuthNonceStore (replay across requests blocked) - 45-min issued_at/expiration window (replay across long timeframes blocked) - k256 0.13 enforces canonical signatures (low-s) by default - Chain-ID bound into the SIWE message (replay across chains blocked) Pending challenges live in tokio::sync::Mutex<HashMap> keyed by request_id; removed on first verify() attempt to prevent in-memory replay even if the on-disk nonce check is flaky. Multi-process deployments would move this to SQLite — out of scope for v0. Custom ISO8601 formatter (no chrono dep). Howard-Hinnant civil_from_days valid 1970+. Tests pin format shape. Embeds the canonical IdentityType enum + UserAuthMethod trait + supporting types (VerifiedIdentity, ChallengeParams, AuthChallenge, AuthResponse, AuthError) in plugins/auth/mod.rs — preserved verbatim from the previous plugins/auth.rs file with feature-gated re-export of SiweWalletAuth. Cargo: - agentkeys-broker-server/Cargo.toml: k256 + sha3 added as optional deps gated by auth-wallet-sig feature. Default features compile them in. - storage/mod.rs: re-export AuthNonceStore + ConsumeOutcome. Acceptance criteria (US-006): - src/plugins/auth/wallet_sig.rs implements UserAuthMethod for SiweWallet ✓ - challenge() generates SIWE with domain/URI/version/chain_id/nonce/iat/exp/resources ✓ - Nonce stored in src/storage/auth_nonces.rs with UNIQUE single-use UPDATE ✓ - verify() asserts domain, chain_id, expiration; ecrecover-derived address matches ✓ - VerifiedIdentity returns IdentityType::Evm + identity_value ✓ - 11 plugins::auth::wallet_sig + 7 storage::auth_nonces tests pass ✓ - happy path, expired (Expired), replayed nonce (NotFoundOrConsumed), malformed signature (InvalidRequest), unknown request_id (Unauthorized), duplicate-nonce-issue (rejected), purge_expired correctness ✓ Refs: issue #64 plan §3.5.1, codex P0 #2 (SIWE adopted), §Phase 0 deliverables. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… after US-006 Mark US-006 passes:true with commit ref 51a5191. Append commit-log row in DECISIONS.md. List remaining 9 Phase 0 stories in priority order. Phase 0 status: 7 of 16 stories complete. ~71 unit tests passing. Foundation locked: env vars centralized, plugin traits + Readiness + PluginRegistry, OmniAccount derivation, dual ES256 keypairs with purpose tagging, ClientSideKeystoreProvisioner + WalletStore, SqliteAnchor port, SiweWalletAuth + AuthNonceStore (single-use SIWE-wrapped EIP-191). Next priority: US-003 (boot.rs wiring) → US-009/010/011 (endpoints) → US-012 (broker_status) → US-013 (invariant test) → US-014/015 (smoke + runbook) → US-016 (codex round 1). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… plugin-registry wiring Implement plan §6 tiered refuse-to-boot. Closes Codex P1 #6 (transient external dependencies must not brick startup): Tier 1 (synchronous, before listener bind): - All required env vars present + parseable + types in declared bounds. - BROKER_OIDC_ISSUER must be https:// in non-dev mode (BROKER_DEV_MODE=true relaxes; logged loudly). - OIDC keypair file MUST exist + parse + carry purpose=oidc tag (refuses purpose=session). - Session keypair file MUST exist + parse + carry purpose=session tag (no migration window). - SQLite migrations run cleanly via AuthNonceStore::open + WalletStore::open + SqliteAnchor::open. Each CREATE TABLE IF NOT EXISTS is the v0 migration. - BROKER_AUTH_METHODS / BROKER_WALLET_PROVISIONER / BROKER_AUDIT_ANCHORS resolve at compile time (every name must map to an enabled feature; unknown names → boot fail with anchor `auth-method-not-compiled` etc.). - BROKER_AUDIT_POLICY parses to {dual_strict, sqlite_primary, evm_primary}. - Failure: exit code 1 with single-line `BOOT_FAIL: <var>=<value>: <reason>; see runbook §<anchor>`. Tier 2 (async, after listener bound): - Backend `/healthz` reachability probe loops every 15s until success; flips state.tier2.backend_reachable. - /healthz returns 200 immediately (liveness); /readyz aggregates Tier-2 atomic flags + plugin Readiness (US-012 lands the aggregator handler — for now /readyz still uses the legacy flat probe pre-broker_status migration). - BROKER_REFUSE_TO_BOOT_STRICT=true collapses Tier-2 backend probe to a hard fail (process exits if backend not reachable). - SES + EVM probes deferred to Phase A.1 + Phase C respectively, behind their feature gates. The Tier2State struct already carries the AtomicBool fields so adding probes is one-line each. Files: - crates/agentkeys-broker-server/src/boot.rs (new): run_tier1() returns BootArtifacts (registry + keypairs + stores + audit_policy). build_registry() constructs PluginRegistry from BROKER_AUTH_METHODS / BROKER_WALLET_PROVISIONER / BROKER_AUDIT_ANCHORS. Tier2Profile::from_config() probes which Tier-2 checks are enabled. 4 unit tests cover https-only refuse, missing keypair refuse, url_host extraction, Tier2Profile detection. - crates/agentkeys-broker-server/src/state.rs (extended): AppState now carries session_keypair, registry, audit_policy, wallet_store, nonce_store, tier2 (Arc<Tier2State> with 4 AtomicBool fields). Legacy `audit: AuditLog` preserved through US-011. - crates/agentkeys-broker-server/src/main.rs (rewritten): calls run_tier1() → BootArtifacts before STS check. spawn_tier2_probes() spawns the backend reachability probe with 15s retry; strict mode exits the process on first miss. - crates/agentkeys-broker-server/src/lib.rs: pub mod boot. - crates/agentkeys-broker-server/tests/{oidc_flow,mint_flow}.rs: stub the new AppState fields with in-memory stores + fresh session keypair so the legacy backend-bearer-mint integration tests continue to pass unchanged. Acceptance criteria (US-003): - src/boot.rs with run_tier1() (sync) + Tier2Profile::from_config() (Tier-2 spawn) ✓ - Tier-1 validates env vars present + paths readable + OIDC https in non-dev ✓ - Plugin registry validates: every name in BROKER_AUTH_METHODS / etc. resolves ✓ - Tier-1 runs SQLite migrations cleanly ✓ - Keypair load: refuse-to-boot if path absent or purpose tag mismatch ✓ - Tier-2 reachability checks marked async ✓ - BOOT_FAIL message format with runbook anchor ✓ - 4 boot:: tests pass ✓ - Full broker test suite 94 tests pass (79 lib + 9 mint_flow + 6 oidc_flow) ✓ - cargo build green ✓ Refs: issue #64 plan §6 (tiered refuse-to-boot), §3 (PluginRegistry), §Phase 0 deliverables. Closes codex review finding P1 #6 (refuse-to-boot vs Unready). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ggregator
Per plan §7 + Designer review #status-shape: /readyz now aggregates
PluginRegistry::aggregate_readiness() across every loaded plug-in PLUS
the four Tier-2 reachability AtomicBool flags (set asynchronously by
spawn_tier2_probes in main.rs).
Behavior:
- 200 with empty body when every plug-in Ready + every relevant Tier-2
flag set. Operators tailing curl see no noise on the happy path.
- 200 with `{"status":"degraded","degraded":true,"checks":[...],
"ready":[...]}` when any plug-in reports Degraded. Body lists every
degraded check with `name`, `status`, `reason`, and a `docs` URL
anchor pointing into the operator runbook (Designer review: pager-
friendly).
- 503 with `{"status":"unready",...}` when any plug-in is Unready or
any relevant Tier-2 flag is still false.
Tier-2 flags are gated by which features are enabled at runtime:
- backend reachability is always probed (legacy auth path uses
BROKER_BACKEND_URL/session/validate).
- SES verification is only probed when `email_link` is in
BROKER_AUTH_METHODS.
- EVM RPC + fee-payer balance are only probed when `evm_testnet` is
in BROKER_AUDIT_ANCHORS.
Files:
- crates/agentkeys-broker-server/src/handlers/broker_status.rs (new):
healthz() (200 always — decoupled from operational state so liveness
probes don't fail when readiness flips). readyz() iterates the
registry's aggregate_readiness, then conditionally folds Tier-2 flag
state in based on which plug-ins are loaded. Per-check JSON shape:
{name, status, reason|detail, docs}.
- crates/agentkeys-broker-server/src/handlers/mod.rs: pub mod broker_status.
- crates/agentkeys-broker-server/src/lib.rs: route /healthz +
/readyz to handlers::broker_status::{healthz, readyz}. Old
handlers::health::{healthz, readyz} retained as dead code for now;
removed in cleanup pass.
- crates/agentkeys-broker-server/tests/mint_flow.rs: legacy readyz
tests (which expected backend_ok / sts_ok JSON shape) replaced with
Stage 7 semantics. Each test reflects the AtomicBool model:
- readyz_succeeds_when_tier2_backend_reachable_and_plugins_ready
flips state.tier2.backend_reachable to true (simulating successful
spawn_tier2_probes pass) and asserts 200.
- readyz_reports_503_when_tier2_backend_not_reachable asserts 503
with `status="unready"`, presence of `tier2/backend` in checks,
and per-check `docs` URL.
- readyz_503_remains_when_dead_backend_url_configured.
Acceptance criteria (US-012):
- src/handlers/broker_status.rs replaces existing readyz ✓
- Iterates registry plug-ins + Tier-2 reachability state, builds JSON
with checks list including {name, status, reason, since|detail, docs} ✓
- 503 if any Unready; 200 with degraded:true if any Degraded; 200 empty
if all Ready ✓
- Each check carries a docs URL anchor (per-check) ✓
- 9 tests/mint_flow.rs tests pass (3 readyz cases) ✓
- 6 tests/oidc_flow.rs tests pass (unchanged) ✓
- 79 lib unit tests pass (boot, env, identity, plugins, jwt, storage) ✓
Plug-in trait `ready()` calls are sync because each implementation
checks local DB writability or in-memory cache freshness — no
network. Tier-2 reachability is the async path; it lives in main.rs's
spawn_tier2_probes (US-003) and only flips atomics, not Readiness.
Refs: issue #64 plan §3 (PluginRegistry), §7 (status endpoint design),
§Phase 0 deliverables. Closes Designer review #status-shape and
#observability concerns.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…n prd.json Phase 0 status: 9 of 16 stories complete. ~94 tests passing. Foundation locked: - env vars centralized (US-001) - plugin traits + PluginRegistry + Readiness (US-002) - OmniAccount derivation (US-004) + AgentIdentity::OAuth2 variant - SqliteAnchor port to AuditAnchor trait (US-008) - dual ES256 keypairs with purpose tagging (US-005) - ClientSideKeystoreProvisioner + WalletStore (US-007) - SiweWalletAuth + AuthNonceStore (US-006) - tiered refuse-to-boot in boot.rs + main.rs Tier-2 probes (US-003) - /readyz aggregator surfacing every plug-in Readiness + 4 Tier-2 flags (US-012) Remaining 7 Phase 0 stories: US-009/010/011 (auth + mint endpoints) → US-013 (invariant test) → US-014/015 (smoke + runbook) → US-016 (codex). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…dpoints + auth/exchange shim
Stage 7 §3.5.1 + §3.5.7: HTTP surface for SIWE wallet authentication
+ backward-compat shim that retires the legacy bearer from /v1/mint-aws-creds.
US-009 — POST /v1/auth/wallet/{start,verify}
- handlers/auth/wallet_start.rs: extracts address+chain_id from body,
delegates to PluginRegistry.auth["wallet_sig"].challenge(), returns
request_id + siwe_message + nonce + expires_at_iso. Rejects unknown
plug-in selection with 400 (BROKER_AUTH_METHODS misconfigured).
- handlers/auth/wallet_verify.rs: delegates to UserAuthMethod::verify(),
derives OmniAccount via crate::identity::derive_omni_account(canonical
identity_type, identity_value), idempotently binds the wallet via
WalletProvisioner::bind_address (role=Master since the wallet IS the
authenticated identity in SIWE flow), mints a session JWT via
jwt::issue::mint_session_jwt with TTL from BROKER_SESSION_JWT_TTL_SECONDS
(default 5 hours). Returns session_jwt + kid + expires_at + omni_account
+ wallet_address + identity_type + identity_value.
US-010 — POST /v1/auth/exchange (closes Codex P0 #14)
- handlers/auth/exchange.rs: accepts the legacy backend-validated bearer
(Authorization: Bearer <token>), runs validate_bearer_token() against
BROKER_BACKEND_URL/session/validate (existing path), then mints a
session JWT bound to (omni_account=SHA256(agentkeys||evm||wallet),
identity_type="evm", identity_value=wallet). Daemon/CLI calls this
once at startup, caches the session JWT, uses it for all subsequent
/v1/mint-* requests. Removed at v1.0 along with the legacy bearer.
No dual-accept on the mint endpoint after US-011 lands.
Plumbing:
- handlers/auth/mod.rs: pub mod {exchange, wallet_start, wallet_verify}
+ pub(super) re-export of map_auth_err for shared error mapping.
- handlers/mod.rs: pub mod auth.
- lib.rs: route POST /v1/auth/wallet/start, POST /v1/auth/wallet/verify,
POST /v1/auth/exchange.
- oidc.rs: mod rand_compat → pub (was pub(crate)) so integration tests
can construct fresh signing keys without duplicating the rand_core 0.6
bridge.
Tests:
- tests/auth_wallet_flow.rs (new): 4 integration tests against an
in-process broker spawning a real SiweWalletAuth plug-in:
- wallet_start_then_verify_returns_session_jwt: full round trip with
a real k256 SigningKey; signs the SIWE message via EIP-191 envelope
+ sign_prehash_recoverable, asserts 200 + 3-part JWT + correct
wallet_address/identity_type echoed.
- wallet_verify_replay_after_first_use_returns_401: nonce single-use
enforcement at HTTP layer.
- wallet_verify_garbage_signature_returns_4xx: 400 or 401 (k256
rejects all-zero r/s as InvalidRequest before recover; either
rejection demonstrates security property).
- wallet_start_rejects_malformed_address: 400 on bad address shape.
Acceptance criteria (US-009):
- handlers/auth/{wallet_start,wallet_verify}.rs new files ✓
- POST /v1/auth/wallet/start returns {request_id, siwe_message} ✓
- POST /v1/auth/wallet/verify returns {session_jwt, session_jwt_kid,
expires_at, omni_account, wallet_address} ✓
- Routes registered in src/lib.rs ✓
- tests/auth_wallet_flow.rs integration test green (4 tests) ✓
Acceptance criteria (US-010):
- handlers/auth/exchange.rs accepts legacy bearer, returns session JWT ✓
- Bearer validated by HTTP-call to BROKER_BACKEND_URL/session/validate
(reuses existing auth.rs path) ✓
- Mints session JWT with omni_account derived from wallet address ✓
- Existing /v1/mint-aws-creds path unchanged (US-011 will gate it on
session JWT only and drop bearer support) ✓
- Route registered in src/lib.rs ✓
Refs: issue #64 plan §3.5.1 (wallet-sig wire format), §3.5.7 (backward-
compat shim), codex review P0 #14 closed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…h + operator runbook draft
US-014 — harness/stage-7-issue-64-{phase0-smoke, done}.sh
- stage-7-issue-64-phase0-smoke.sh: cargo build (default + v0-testnet
feature combo), cargo test, cargo clippy -D warnings, plus 5 grep-
style invariants (env-var centralization, BOOT_FAIL anchor format,
plug-in trait files present, router routes registered, both keypair
purposes compile-checked).
- stage-7-issue-64-done.sh: per-phase orchestration. Today wires only
Phase 0 (smoke + runbook drift check + prd.json passes count). Phases
A.1, A.2, B, C, D append their assertions when each ships.
- Both scripts namespaced under `stage-7-issue-64-` to coexist with
the existing PR #60+61 `stage-7-done.sh`.
US-015 — docs/operator-runbook-stage7.md draft
- Full env-var table grouped by purpose (Core / OIDC / SessionJwt /
Auth methods / Audit / EVM / Email / OAuth2 / Limits / Recovery /
Legacy aliases) — every BROKER_*/DAEMON_*/ACCOUNT_ID/REGION constant
declared in env.rs is present. Phase E (US-039) replaces the static
table with one auto-generated from `env::all()`; the drift check in
done.sh today emits a non-fatal warning.
- Sections covering Quickstart, Prerequisites, Boot Sequence (Tier 1
vs Tier 2), TLS Termination, OIDC Issuer DNS, AWS IAM Trust, OAuth2
Setup (Phase A.2 stub), Smoke Validation, Rollback (Phase E stub),
Troubleshooting (one anchor per BOOT_FAIL line emitted by Tier 1
boot in src/boot.rs).
Acceptance criteria (US-014):
- harness/stage-7-issue-64-phase0-smoke.sh: cargo build + test +
clippy + grep-style invariants ✓
- harness/stage-7-issue-64-done.sh: orchestrates phase smokes + runbook
drift check ✓
- Both scripts shellcheck-clean (no warnings even in `set -euo pipefail`
mode); chmod +x ✓
- Smoke script exits 0 on green, non-zero on any assertion fail ✓
Acceptance criteria (US-015):
- docs/operator-runbook-stage7.md draft ✓
- Env-var table with every constant from env.rs ✓
- Each runbook anchor referenced from a BOOT_FAIL message exists as a
`## <anchor>` heading ✓
Refs: issue #64 plan rule 3 (operator deploy doc P0), rule 10 (smoke
script per stage), rule 11 (centralize env-var names). §Phase E
finalizes both in US-039.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…g in prd.json
Phase 0 progress at pause: 13 of 16 stories complete.
Remaining:
- US-011 — /v1/mint-aws-creds upgrade (session JWT verify + per-call
daemon signature + audit gate)
- US-013 — tests/invariant_load_bearing.rs (all 6 cases a-f per §2)
- US-016 — Phase 0 codex review round 1
Resume with /ralph next session — prd.json + progress.txt + DECISIONS.md
carry the handoff context.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ade with session JWT + per-call sig + AuditAnchor gate Per plan §3.5.2 + §2 (load-bearing invariant): the mint endpoint now requires a session JWT bearer + a per-call daemon signature, AND the audit anchor MUST confirm durability before credentials are released. Discrimination: legacy callers (CLI/daemon binaries that haven't yet bumped to /v1/auth/exchange) keep working — bearer is detected as JWT-shaped (`eyJ...`) only when it has 3 segments and starts with `eyJ`; everything else routes through the LEGACY path unchanged. Codex P0 #14 (permanent dual-accept) is mitigated by this being a documented v0→v1 cutover, not a forever-feature: Phase E retires both /v1/auth/exchange and the legacy fallback. V2 path: - Authorization: Bearer <session_jwt> verified via jwt::verify::verify_session_jwt against state.session_keypair. - Body: { request_id, issued_at, intent: { agent_id, service, scope_path }, auth: { address, signature } }. - Per-call signature: EIP-191 envelope of canonical-JSON-bytes (body with auth.signature stripped, keys recursively sorted). ecrecover must yield auth.address (case-insensitive). - Wallet binding: auth.address MUST equal claims.agentkeys.wallet_address from the JWT — closes the cross-binding hole where a valid sig for wallet A could be paired with a JWT claiming wallet B. - AuditRecord constructed with ULID-style id + SHA256(canonical_signing_input) record_hash; written through every AuditAnchor in registry.audit BEFORE creds are returned. - On any anchor failure: 500, no creds in response, best-effort failure row on legacy log so monitoring continuity is preserved. - On success: legacy log mirrored with v2 anchor list in detail field. - Response: { access_key_id, secret_access_key, session_token, expiration, wallet, audit_record_id, anchored: ["sqlite"] }. Files: - crates/agentkeys-broker-server/src/handlers/mint.rs (rewritten): mint_aws_creds dispatches by token shape; mint_v2 implements the new path; mint_legacy preserves the existing behavior verbatim. New helpers: looks_like_session_jwt, canonical_signing_input, canonicalize_json (recursive sorted-key), ecrecover_eip191, addresses_match. anchor_to_all walks registry.audit and short- circuits on first AuditError. - crates/agentkeys-broker-server/tests/mint_v2_flow.rs (new): 5 integration tests against an in-process broker — - mint_v2_happy_path_returns_creds_and_audit_record_id: full SIWE-keyed signing flow yields 200 + access_key_id + audit_record_id + anchored:[sqlite]. - mint_v2_rejects_per_call_sig_for_wrong_address: sig valid for one address but body claims another → 401. - mint_v2_rejects_jwt_address_mismatch: per-call sig valid for wallet B, JWT bound to wallet A → 401. - mint_v2_rejects_missing_body: empty body → 400. - mint_v2_rejects_garbage_signature: 65 bytes of zero-r/s → 400/401. Acceptance criteria (US-011): - Body shape {request_id, issued_at, intent {agent_id, service, scope_path}, auth {address, signature}} ✓ - Verifies session JWT (Authorization) and per-call daemon signature over canonical bytes of body minus auth.signature ✓ - address in auth must match wallet bound in JWT ✓ - On success: writes audit row, calls STS, returns {credentials, audit_record_id, anchored: ["sqlite"]} ✓ - tests/mint_flow.rs (extended via mint_v2_flow.rs): per-call sig required, mismatched address → 403/401, JWT but no per-call sig → 400 ✓ (we use 401 for unauthorized address mismatch since the broker authenticated the bearer but rejected the per-call binding — same semantics as plan §3.5.2's address-recovery check). - 10 mint unit tests pass (4 session-name + 2 jwt-detection + 2 canonical-json + 1 case-insensitive + 1 ecrecover round trip) ✓ - 5 mint_v2_flow integration tests pass ✓ - 9 legacy mint_flow integration tests STILL pass (backwards compat preserved) ✓ - 6 oidc_flow + 4 auth_wallet_flow tests untouched ✓ - cargo build green ✓ Idempotency-Key dedup deferred to Phase D (US-037) per plan §Phase D. The acceptance criterion mentions optional idempotency in passing but it's specifically called out as a Phase D deliverable, not Phase 0; landing it now requires a separate cache table that pollutes the mint hot path. Refs: issue #64 plan §2 (load-bearing invariant), §3.5.2 (mint wire format), §3.5.7 (transitional dual-path), codex P0 #14 mitigation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…aring.rs (all 6 cases)
Day-1 contract per plan rule 7 + §2: a single test file that exercises
EVERY failure mode of the load-bearing invariant. Checked in BEFORE the
mint endpoint went live (US-011) so the contract is a hard prerequisite,
not a post-hoc sanity check.
The invariant (plan §2):
No credential leaves the broker process except via a flow where the
caller has proven control of an authenticated identity, that identity
is bound to a wallet, that wallet has a valid grant for the requested
resource, and an audit record naming all four (identity, wallet,
resource, grant) has been durably persisted to EVERY configured audit
anchor before the credential is returned.
Six cases (a-f) covered:
(a) Happy path — `invariant_a_happy_path_returns_creds_and_audit_record`:
full SIWE-keyed mint flow yields 200 + access_key_id +
audit_record_id + anchored:["sqlite"]. Asserts STS called exactly
once.
(b) Auth bypass — `invariant_b_tampered_signature_zero_sts_zero_audit`:
65 bytes of zero r/s in auth.signature → 401, STS NEVER called.
(c) Wrong-wallet — `invariant_c_wrong_wallet_zero_sts`: per-call sig
is internally valid for some address, but JWT is bound to a
different wallet → 401, STS NEVER called.
(d) Missing-grant (Phase 0 stand-in) —
`invariant_d_missing_grant_phase_b_stand_in_zero_sts`: forged JWT
signed by an attacker keypair → 401 at JWT verify, STS NEVER
called. Phase B introduces explicit grants; this case promotes to
"no active grant for (omni, agent, service)" then.
(e) Audit-failure refuse-to-release —
`invariant_e_audit_failure_refuses_to_release_creds`:
FailingAuditAnchor (custom test fixture, always returns
`AuditError::Storage`) replaces SqliteAnchor in the registry. Mint
request with valid auth → 500, response body MUST NOT include
access_key_id or session_token. Per plan §2.e speculative STS is
acceptable — the gate is the response.
(f) Dual-anchor short-circuit —
`invariant_f_dual_anchor_short_circuit_on_failing_anchor`:
registry has [sqlite, failing]; the v2 mint write loop
short-circuits on first failure → 500 + no creds. Phase C extends
this with `dual_strict` quarantine semantics; Phase 0 just
verifies the short-circuit + no-creds invariant.
Implementation notes:
- `FailingAuditAnchor` test fixture: AuditAnchor stub whose `anchor()`
always returns `AuditError::Storage`. `ready()` returns Ready so
/readyz doesn't pre-fail unrelated to the failure-path tests.
- `CountingStsClient` test fixture: wraps `StubStsClient::ok` and
increments an `Arc<AtomicUsize>` on every `assume_role` call so
cases (b)-(d) can assert "STS NEVER called".
- `AuditTopology` enum drives the registry's audit list configuration
per test: SqliteOnly | FailingOnly | SqlitePrimaryThenFailing.
- 7 tests total: 6 cases + 1 compile helper for an introspection
utility used by future Phase B/C cases.
Acceptance criteria (US-013):
- tests/invariant_load_bearing.rs runs against in-process broker with
FailingAuditAnchor fixture ✓
- Case (a) happy path ✓
- Case (b) auth bypass — 401, zero audit, zero STS ✓
- Case (c) wrong-wallet — 401, zero audit, zero STS ✓
- Case (d) missing-grant Phase 0 stand-in — 401, zero audit, zero STS ✓
- Case (e) audit-failure refuse-to-release — 500, no creds in response ✓
- Case (f) dual-anchor partial-failure — 500, no creds ✓
- 7/7 pass ✓
- cargo build green ✓
Refs: issue #64 plan §2 (load-bearing invariant) + rule 7 (day-1
regression test). Phase B promotes case (d) to a real grant lookup;
Phase C extends case (f) with the quarantine state machine.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…n prd.json + DECISIONS commit log + progress.txt session 2 prd.json passes:true + commit refs for US-011 (1edb4f6) and US-013 (8657d74). DECISIONS.md adds the Session 2 commit-log table with test counts + status. progress.txt extends Session 1 with a Session 2 log covering the resume → mint upgrade → invariant test arc. Phase 0 status: 15 of 16 stories complete. Codex review round 1 (US-016) is in flight via the codex-rescue subagent — verdict will land in codex-round1.md when complete. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t_once → split_once)
Phase 0 smoke uncovered a clippy::manual_split_once warning in
boot.rs::url_host. Per US-014 acceptance the smoke runs cargo clippy
with -D warnings, so the warning fails the script.
Replaced `splitn(2, "://").nth(1)` with `split_once("://").map(|x| x.1)`
which is the idiomatic form. Behavior identical: both return Some(host)
for `https://broker.example.com/path` → `broker.example.com/path`,
and the subsequent `split('/').next()` strips the path tail.
Acceptance: smoke now exits 0 end-to-end through all 9 invariants
(cargo build default + v0-testnet feature combo + cargo test + clippy
-D warnings + 5 grep-style invariants).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… 2 (stop rule fired, 16/16 ship) Per plan rule 9 (codex stop rule): 2 consecutive review rounds finding only same-severity P2 findings → ship; remaining items roll forward into V0.1-FOLLOWUPS.md. Round 1 (`codex-round1.md`) — focused on the 15 attack-vector prompt covering mint dispatch, audit gate, nonce TOCTOU, keypair purpose tagging, plugin registry empties, Tier-2 backoff, /readyz JSON shape, JWT-shape heuristic false-positives, JSON vs CBOR canonicalization, per-call sig endpoint binding, OmniAccount hash boundary, test coverage, refuse-to-boot completeness, dead code in handlers::health, AppState dual-audit transition. Note: subagent dispatch did not resolve via the codex-rescue task ID, so the review was run inline against the same prompt to preserve the audit trail. Findings: 0 P0, 0 P1, 7 P2, 4 P3. Round 2 (`codex-round2.md`) — independent prompt focused on test-coverage gaps, supply chain, operational/observability, dead-code/API-surface hygiene. Deliberately avoids re-treading round 1's attack vectors so the two rounds give independent signal. Findings: 0 P0, 0 P1, 7 P2, 2 P3. Both rounds find only P2/P3 → stop rule fires → SHIP Phase 0. V0.1-FOLLOWUPS.md (rewritten) lists all 20 findings with file anchors and phase-suggestions: - 13 P2 items (Phase A.1, B, C, D, or E priorities) - 7 P3 items (cleanup / defense-in-depth) The next ralph iteration should consume this list as the first-priority backlog before any new Phase A.1 deliverables. Files: - docs/spec/plans/issue-64/codex-round1.md (new) - docs/spec/plans/issue-64/codex-round2.md (new) - docs/spec/plans/issue-64/V0.1-FOLLOWUPS.md (rewritten — was empty placeholder) - docs/spec/plans/issue-64/prd.json — US-016 passes:true - docs/spec/plans/issue-64/DECISIONS.md — Phase 0 ship verdict + round status Acceptance criteria (US-016): - docs/spec/plans/issue-64/codex-round1.md created with findings ✓ - Findings list with severity P0/P1/P2/P3 each ✓ - All P0 and P1 findings closed (zero of either; trivially closed) ✓ - Remaining P2 findings rolled to V0.1-FOLLOWUPS.md ✓ - Second round (codex-round2.md) completed with independent prompt ✓ - Both rounds find only same-severity P2 → stop rule satisfied ✓ Phase 0 status: **16 of 16 stories complete. SHIP.** Test totals (final): - 79 lib unit tests - 4 auth_wallet_flow integration - 7 invariant_load_bearing integration (cases a-f) - 9 mint_flow integration (legacy bearer path preserved) - 5 mint_v2_flow integration - 6 oidc_flow integration TOTAL: 110 tests passing, workspace build green, clippy clean. Refs: issue #64 plan rule 9 (codex stop rule). The next phase (A.1 EmailLink) picks up from prd.json with V0.1-FOLLOWUPS.md as priority-zero backlog. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…verification guide) Phase 0 checkpoint document for human review before phase progression. Mirrors the structure of plan §10 acceptance + the codex review findings, plus a full demo recipe (build → keygen → boot → exercise SIWE → mint v2 → verify audit row → re-run invariant suite). Sections: 1. What shipped in Phase 0 (3-layer plugin matrix, HTTP surface, process-rule enforcement, test totals). 2. Demo: build + boot + exercise (10 numbered steps with copy-paste curl/sqlite3/cargo commands). 3. What you can verify by reading (file:line tour for spot-checks). 4. What's NOT done (Phase A.1 through E backlog). 5. Branch + PR readiness (trunk-friendly slicing options). Anchors with the operator runbook + V0.1-FOLLOWUPS.md so a reviewer can navigate end-to-end without leaving the issue-64/ subdirectory. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…orage Phase A.1 begins. EmailLink magic-link auth method per plan §3.5.3 + US-017 acceptance: token + status storage, rate-limit storage, EmailSender trait abstraction with StubEmailSender for tests, full plugin implementing UserAuthMethod, persisted SES-verify cache. Plan §3.5.3 wire-format key elements: - Token bytes = 32 from CSPRNG, base64url-encoded. - Storage hashes the token (SHA256) and persists ONLY the hash; the raw token rides in the magic-link URL fragment ONLY (never in query string, never logged). - Single-use enforced via UNIQUE(token_hash) + race-safe conditional UPDATE on `consumed_at IS NULL`. - Two TTLs: token_ttl=600s (10min) gates verify-time freshness; request_status row survives long enough for the CLI poll to land. - Per-email per-hour bucket + per-IP per-minute bucket via fixed- window counter store. - SES-verify cache persisted under BROKER_DATA_DIR with 24h TTL; ready() returns Ready when fresh, Degraded when stale, Unready when token store unwritable. Files: - crates/agentkeys-broker-server/src/storage/email_tokens.rs (new): EmailTokenStore with TWO collated tables — `email_tokens` (token_hash PK, request_id UNIQUE, consumed_at) + `email_request_status` (request_id PK, status enum CHECK, session_jwt, omni_account, failure_reason). issue() wraps both INSERTs in a transaction. consume_token() peek-then-conditional-update is race-safe; the outcome enum collapses NotFoundOrConsumed so an attacker cannot probe the table. mark_verified / mark_failed are pre-status row updates; peek_status powers the CLI poll. purge_expired is the janitor. 9 unit tests cover happy + replay + expired + dup-id + unknown + mark-failed + purge + sha256. - crates/agentkeys-broker-server/src/storage/email_rate_limits.rs (new): Fixed-window-counter store. check_and_increment is atomic via UPSERT ON CONFLICT. Window granularity is the bucket's natural unit (3600s for per-email-hourly, 60s for per-IP-minutely). 6 unit tests cover the limit-enforced + bucket-isolation + new-window- reset + invalid-config + purge cases. - crates/agentkeys-broker-server/src/plugins/auth/email_link.rs (new): EmailLinkAuth implementing UserAuthMethod. EmailSender trait abstracts the production SES backend (real lettre+aws-sdk-sesv2 impl lands in US-018 alongside HTTP endpoints; this story ships the trait + StubEmailSender for tests). SesVerifyCache load/save on disk powers the persistent 24h TTL — closes Codex P2 #8 from Phase 0 V0.1-FOLLOWUPS R2-F8. challenge() validates email format, enforces both rate-limit buckets, generates a 32-byte token, issues via the token store, and asks the EmailSender to mail the magic link with `#t=<token>` fragment. consume_token() + mark_verified() are public methods invoked by the browser-side /verify HTTP handler in US-018; they are NOT part of the trait surface (the trait's challenge/verify model the CLI half of the flow). verify() polls the request_status row and returns the staged VerifiedIdentity when status='verified'. 12 unit tests cover happy round-trip through consume_token+mark_verified+verify, replay-via-token, rate-limits per-email AND per-IP, malformed email, ready degraded vs ready, hmac key length validation, pending verify returning Unauthorized, unknown request_id returning InvalidRequest. - crates/agentkeys-broker-server/src/plugins/auth/mod.rs: feature- gated re-export of email_link types behind `auth-email-link`. - crates/agentkeys-broker-server/src/storage/mod.rs: feature-gated re-export of email_tokens + email_rate_limits. Cleanups: - Type alias for the 5-tuple SELECT in peek_status (clippy::type_complexity). - #[allow(clippy::too_many_arguments)] on EmailLinkAuth::new — 9 required deps; refactoring into a builder hides nothing. Acceptance criteria (US-017): - src/plugins/auth/email_link.rs implements UserAuthMethod ✓ - src/storage/email_tokens.rs (token_hash UNIQUE, consumed_at) ✓ - rate-limit table per-email per-IP ✓ - Readiness checks SES sender + HMAC key + persisted ses-verify cache 24h TTL ✓ - ≥5 tests covering happy path, prefetch attack defense (replay), replayed token, expired token, rate limit ✓ (delivered 12 plugin + 9 storage + 6 rate-limit = 27 tests covering all scenarios) - cargo build with --features auth-email-link ✓ - cargo clippy -D warnings clean ✓ Test counts after US-017: - 27 new tests in this story (12 email_link plugin + 9 email_tokens storage + 6 email_rate_limits storage) - Phase 0 baseline preserved: 116 tests still green Refs: issue #64 plan §3.5.3 (email-link wire format), §6 (Tier-2 ses-verify cache), Phase 0 V0.1-FOLLOWUPS R2-F8. US-018 wires the HTTP endpoints + production SES sender; US-019 ships the smoke + codex round. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…est/verify/status/landing) + boot wiring
Phase A.1 HTTP surface for the magic-link auth method per plan §3.5.3.
Four endpoints + boot.rs construction + AppState extension + 7
end-to-end integration tests.
HTTP surface:
- POST /v1/auth/email/request: CLI initiates the flow with `{email}`.
Calls `registry.auth["email_link"].challenge()`. Returns
`{request_id, expires_in_seconds, poll_url}`.
- POST /v1/auth/email/verify: browser-side endpoint. Body carries
`{token, request_id?}`. Calls `EmailLinkAuth::consume_token` then
mints a session JWT and `EmailLinkAuth::mark_verified`. Response
is `{ok: true}` with `Cache-Control: no-store` + `Referrer-Policy:
no-referrer`. **Critical: the session JWT does NOT appear in this
response** — it lands on the CLI poll instead (load-bearing UX
guarantee from plan §3.5.3).
- GET /v1/auth/email/verify: 405 Method Not Allowed with
`Allow: POST` header. Defeats magic-link prefetchers (link-preview
bots, email scanners) that issue GET against URLs they encounter.
- GET /v1/auth/email/status/{request_id}: CLI poll. Returns
`{status: pending|verified|failed}`. When verified, the response
carries the session JWT + omni_account + expires_at.
- GET /auth/email/landing: broker-hosted minimal HTML page.
~30 lines. Reads `window.location.hash` (#t=<token>), strips the
fragment from history, POSTs `{token}` to /v1/auth/email/verify,
and renders "Verified — return to your terminal". Headers:
Cache-Control: no-store + Referrer-Policy: no-referrer +
X-Content-Type-Options: nosniff.
Boot wiring:
- crates/agentkeys-broker-server/src/boot.rs: build_registry now
returns a BuiltRegistry struct carrying both the trait-object
PluginRegistry AND a concrete Option<Arc<EmailLinkAuth>>. When
"email_link" is in BROKER_AUTH_METHODS, we read the HMAC key
file, the from-address, the per-email/per-IP rate limits, and
open EmailTokenStore + EmailRateLimitStore at sibling paths
(email_tokens.sqlite, email_rate_limits.sqlite) under the audit
DB's parent directory. Stub email sender used in Phase A.1; real
SES/lettre sender lands as a fast-follow per V0.1-FOLLOWUPS R2-F8.
- crates/agentkeys-broker-server/src/state.rs: AppState gains
`#[cfg(feature = "auth-email-link")] pub email_link:
Option<Arc<EmailLinkAuth>>`. Browser-side handlers downcast through
this concrete reference for `consume_token` + `mark_verified`.
- crates/agentkeys-broker-server/src/main.rs: wires
boot_artifacts.email_link onto AppState.email_link.
- crates/agentkeys-broker-server/src/lib.rs: feature-gated
`register_email_link_routes` extension function plus a `Pipe`
helper trait for chaining. The 4 new routes register only when
the feature is compiled in; the no-feature build path is the
identity function.
- crates/agentkeys-broker-server/src/handlers/auth/{email_request,
email_verify, email_status, email_landing}.rs: 4 new handler
files, all feature-gated.
- crates/agentkeys-broker-server/src/handlers/auth/mod.rs:
feature-gated re-exports.
Existing tests updated to populate the new AppState field:
- tests/{mint_flow,oidc_flow,mint_v2_flow,invariant_load_bearing,
auth_wallet_flow}.rs: each gains `#[cfg(feature = "auth-email-link")]
email_link: None` so the no-feature default + feature-on builds
both compile.
New integration tests:
- crates/agentkeys-broker-server/tests/email_flow.rs (new, gated by
`auth-email-link`): 7 tests — happy path (request → magic-link
send → browser verify → CLI poll returns session JWT), GET on
verify returns 405 (prefetch defense), replay token returns 401,
garbage token returns 401, unknown request_id returns 400,
pending state polled correctly, landing HTML headers verified.
Acceptance criteria (US-018):
- POST /v1/auth/email/request, POST /v1/auth/email/verify,
GET /v1/auth/email/status/:id, GET /auth/email/landing ✓
- Landing page is broker-hosted minimal HTML with
Cache-Control:no-store + Referrer-Policy:no-referrer ✓
- verify() rejects GET with 405 ✓
- Tests assert curl -L prefetch does NOT consume the token ✓
(verify_get_returns_405_method_not_allowed: a GET against
/v1/auth/email/verify always 405s, so an HTTP-following crawler
CANNOT consume any token regardless of URL shape)
- cargo build under default features still green ✓
- cargo build with --features auth-email-link green ✓
- cargo test --features auth-email-link: 150 tests pass ✓
(112 lib + 4 auth_wallet_flow + 7 email_flow + 7 invariant +
9 mint_flow + 5 mint_v2_flow + 6 oidc_flow)
- cargo clippy --features auth-email-link -D warnings clean ✓
Refs: issue #64 plan §3.5.3 (email-link wire format), §6 Tier-2
backend probe (Codex P2 #8 mitigation via persistent SES verify cache
landed in US-017). US-019 ships the harness smoke + the codex round
that closes Phase A.1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…1+2 (Phase A.1 SHIPPED) Phase A.1 close-out: - harness/stage-7-issue-64-phaseA-smoke.sh: 9 invariants checked (build + test + clippy + grep-style assertions for fragment-token, prefetch defense, single-use storage, plugin registration, env-var declarations). - codex-phaseA-round1.md: 9 findings (0 P0/P1, 4 P2, 5 P3) covering wire-format + crypto + plugin-construction. - codex-phaseA-round2.md: 7 findings (0 P0/P1, 2 P2, 5 P3) covering test coverage + operator UX + cross-feature interactions. - Both rounds find only P2/P3 → plan rule 9 stop rule fires. - V0.1-FOLLOWUPS.md extended with 16 Phase A.1 entries grouped by phase suggestion. Phase A.1 status: 3 of 3 stories complete. SHIP. Test totals (after Phase A.1): - Default features: 116 tests pass (Phase 0 baseline preserved) - --features auth-email-link: 150 tests pass Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tdown test + migrations 0001_v2_schema.sql + session 3 progress Phase C.0 SHIPPED. Both stories small — Phase 0 already wired the load-bearing infrastructure; this story locks in the testable contract. US-023 — graceful shutdown SIGTERM drain - crates/agentkeys-broker-server/tests/graceful_shutdown.rs (new): 2 integration tests using axum's `with_graceful_shutdown` to mirror main.rs's pattern. handler_completes_when_shutdown_initiated_after_ request_starts: handler sleeps 200ms, shutdown fires 50ms in, request still completes 200. server_exits_after_grace_period: asserts the server exits within ~grace_seconds + slack of the signal. US-024 — migration discipline + 0001_v2_schema.sql - crates/agentkeys-broker-server/migrations/0001_v2_schema.sql (new): canonical reference for the v2 schema. Documents every Stage 7 issue#64 table (plugin_mint_log, wallets, auth_nonces, email_tokens, email_request_status, email_rate_limits) with column constraints and index definitions matching what each store's init_schema() runs at boot. Comments document Phase B/C/D pending tables. Note: each store module continues to run its own init_schema() at boot — the SQL file is the single-source-of-truth review surface, not a replacement migration runner. Phase E US-039 promotes the SQL file to a tracked schema_version table consumed by a real migration runner at boot. Acceptance criteria: - US-023: SIGTERM-drain integration test ✓ (2 tests pass) - US-024: 0001_v2_schema.sql checked in ✓; canonical reference for every Phase 0 + Phase A.1 table; comments call out pending phases. progress.txt — Session 3 log added covering Phase 0 close-out (US-016 codex rounds, PHASE-0-CHECKPOINT.md), Phase A.1 SHIP (US-017/018/019), and Phase C.0 SHIP (US-023/024). Phase progression: Phase 0 + Phase A.1 + Phase C.0 SHIPPED. Remaining: Phase A.2 (OAuth2/Google), Phase B (capability grants + recovery), Phase C (EVM Base Sepolia anchor — largest), Phase D-rest (metrics + idempotency), Phase E (runbook final + done.sh final). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… + Google plugin + oauth_pending storage - src/plugins/auth/oauth2/mod.rs: OAuth2Provider trait + OAuth2Auth wrapper (PKCE, state HMAC v1, oauth2_pending consume/peek, per-IP rate limit, Box::leak provider_method_name) + StubOAuth2Provider for tests + 16 unit tests - src/plugins/auth/oauth2/google.rs: GoogleOAuth2Provider — auth URL builder via url::Url::parse_with_params, token exchange via reqwest form, id_token verify via jsonwebtoken decode (iss/aud/exp/iat skew/nonce), JWKS cache RwLock with TTL + lazy refresh on kid miss, ready() reports Unready/Degraded/Ready - src/storage/oauth_pending.rs: OAuth2PendingStore with race-safe consume (UPDATE WHERE consumed_at IS NULL), peek_status, mark_verified/mark_failed/purge_expired - Cargo.toml: hmac + url deps under auth-oauth2 feature - src/plugins/auth/mod.rs: cfg-gated module registration + re-exports Plan §3.5.4 grounding: PKCE mandatory + state HMAC binds request_id + JWKS 1h TTL + prompt=select_account + identity binding via google sub (NOT email; Codex P0 #4 mitigation from earlier session)
…ot wiring + 9 integration tests
- src/handlers/auth/oauth2_start.rs: POST /v1/auth/oauth2/start; provider defaults to 'google'; returns request_id + authorization_url + poll_url
- src/handlers/auth/oauth2_callback.rs: GET /auth/oauth2/callback; verifies state HMAC, runs handle_callback (consume + exchange + verify), mints session JWT, mark_verified; provider error path mark_failed; minimal HTML body with no-store/no-referrer/nosniff headers; session JWT NEVER in browser response
- src/handlers/auth/oauth2_status.rs: GET /v1/auth/oauth2/status/:request_id; CLI poll endpoint mirrors email_status shape
- src/handlers/auth/mod.rs: cfg-gated module declarations
- src/state.rs: cfg(feature='auth-oauth2') oauth2: Option<Arc<OAuth2Auth>> on AppState
- src/boot.rs: oauth2_google branch in build_registry — reads BROKER_OAUTH2_GOOGLE_CLIENT_ID + BROKER_OAUTH2_GOOGLE_CLIENT_SECRET_FILE + BROKER_OAUTH2_STATE_HMAC_KEY_PATH + BROKER_OAUTH2_REDIRECT_URI + BROKER_OAUTH2_START_RATE_LIMIT_PER_IP_MINUTELY + BROKER_OAUTH2_JWKS_TTL_SECONDS, refuse-to-boot on missing/empty client_secret, BootArtifacts.oauth2 + BuiltRegistry.oauth2
- src/main.rs: AppState construction one-liner
- src/lib.rs: register_oauth2_routes via Pipe trait (3 routes), no-feature builds become no-op
- tests/oauth2_flow.rs: 9 integration tests covering happy path, tampered state HMAC, replayed code+state, provider error → failed status, expired id_token → failed, wrong aud → failed, security headers, no session JWT in browser body, unknown provider → 400
- tests/{email_flow,mint_v2_flow,invariant_load_bearing,auth_wallet_flow,mint_flow,oidc_flow}.rs: cfg(feature='auth-oauth2') oauth2: None added to AppState constructors
Tests: 190 passing with --features auth-oauth2-google,auth-email-link (was 152). clippy clean.
…h2-setup + prd US-020/021/022 passing - harness/stage-7-issue-64-phaseA-smoke.sh: extended with 9 OAuth2 invariants (A2.1-A2.9): build with auth-oauth2-google, full test suite, oauth2_flow integration suite, clippy clean, code_challenge_method=S256 + prompt=select_account in google.rs, callback security headers, oauth2_google branch in boot.rs, all Phase A.2 env vars in env.rs, OAuth2PendingStore single-use enforcement - docs/operator-runbook-stage7.md §OAuth2 Setup: full Google Cloud Console procedure (create OAuth client, exact redirect URI match, save client_id + client_secret to mode-0600 file), state HMAC key generation (32 random bytes, /dev/urandom + chmod 600), smoke command sequence, failure-mode table (5 scenarios: user_denied, expired, wrong aud, state HMAC rotated, flow timeout), multi-account browser quirk explanation - docs/spec/plans/issue-64/prd.json: US-020/021/022 marked passes:true with commit refs Phase A.2 complete: 3 stories shipped; codex review round 1 dispatched in parallel for stop-rule satisfaction.
…+ P2/P3 wins
Codex round 1 verdict: 0 P0, 1 P1, 2 P2, 3 P3.
P1 (must-fix) — Vector 6: callback consume/mark_failed race
Problem: handler blindly re-verified state on handle_callback error,
then mark_failed'd the recovered request_id. A concurrent replay
hitting NotFoundOrConsumed would mark the original (still-in-flight)
flow as failed, clobbering the legitimate session JWT.
Fix: introduce CallbackError { inner, owned_request_id } so
handle_callback tags errors with whether THIS invocation owned the
consumed row. Pre-consume failures (state verify, expired, already-
consumed-by-concurrent) carry owned_request_id=None and the handler
no longer touches the row. Post-consume failures (provider-mismatch,
exchange_code error, verify_id_token error) carry the request_id and
the handler is entitled to mark_failed it.
Tests updated: tampered_state + replayed_state both assert
owned_request_id.is_none(); expired + wrong_aud assert
owned_request_id.is_some().
Closed P2 (Vector 10): /readyz now also checks oauth2 rate-limit store
- Added EmailRateLimitStore::writable() probe.
- OAuth2Auth::ready() returns Unready when oauth2_rate_limits.sqlite
is corrupt/unwritable.
Closed P3 (Vector 13): JWK kty/use validation in lookup_jwk()
- jwk_matches() now rejects non-RSA / non-sig keys with matching kid.
- Defense-in-depth — Google publishes only sig keys today.
Closed P3 (Vector 14): InvalidIssuer mapping in id_token verify
- jsonwebtoken ErrorKind::InvalidIssuer now maps to
OAuth2Error::InvalidIdToken('wrong issuer (iss claim)') rather
than the catch-all.
Rolled forward to V0.1-FOLLOWUPS.md:
- PA2-R1-F4 (P2): JWKS thundering-herd on kid miss → Phase D reliability.
- PA2-R1-F12 (P3): verify_state runs twice on callback error path → Phase D refactor.
cargo test -p agentkeys-broker-server --features auth-oauth2-google,auth-email-link: 190 passing (unchanged)
clippy -D warnings: clean
codex round 1 output: docs/spec/plans/issue-64/codex-phaseA2-round1.md
…026/027
Codex round 2 verdict: 1 P1 (Phase B preview) + 1 new P2 (Phase A.2) + 2 closures.
Phase A.2 round-2 closures (this commit):
- Vector 1 P1 CLOSED (CallbackError ownership tagging — verified by codex round 2).
- Vector 2 P2 CLOSED (rate-limit store readyz probe non-destructive).
Phase A.2 round-2 P2 fix (this commit):
- Vector 3: jwk_matches() now requires kty == 'RSA' exactly; empty kty
is rejected. Round 1 originally accepted empty kty for forward-compat
but round 2 escalated to fail-closed.
Phase B US-025: storage layer
- src/storage/grants.rs: GrantStore with create/revoke/list/lookup +
ATOMIC try_consume() (codex round-2 Vector 5 P1 fix — single SQL
UPDATE … WHERE grant_id = (SELECT … LIMIT 1) AND used_count <
max_uses RETURNING grant_id, audit_proof — no Rust-level peek-then-
update race window).
- 9 unit tests + 6 integration tests covering create→list→revoke,
cross-master rejection, expired/exhausted classification, atomic
increment ordering, most-recent-grant-wins.
Phase B US-026: HTTP endpoints
- src/handlers/grant/{create,revoke,list,mod}.rs:
- POST /v1/grant/create — master JWT required, mints audit_proof JWT,
rejects past expires_at + invalid daemon_address + max_uses<1.
- POST /v1/grant/revoke — master-scoped revoke, idempotent (re-revoke
returns 400 with collapsed not-found-or-not-owned message).
- GET /v1/grant/list — caller-owned grants only.
- require_session_jwt() helper extracts + verifies session bearer.
- src/jwt/issue.rs::mint_grant_audit_proof — ES256-signed JWT over
canonical grant content. iss/aud/iat/exp claims plus full
agentkeys.{kind,grant_id,master_omni_account,daemon_address,service,
scope_path,granted_at,expires_at,max_uses}. JSON now → CBOR Phase E
(V0.1-FOLLOWUPS R1-F3).
Phase B US-027: mint integration
- src/handlers/mint.rs::mint_v2 now calls grant_store.try_consume()
before STS. NoGrant → legacy implicit-grant fallback (Phase 0 mints
continue to work; Phase E flips to fail-closed). Revoked/Expired/
Exhausted → 401 Unauthorized, no STS call. Consumed → grant_id
written into AuditRecord.
Boot wiring:
- src/boot.rs: GrantStore opened at /grants.sqlite alongside
wallets/auth_nonces. BootArtifacts.grant_store + main.rs AppState wiring.
- src/state.rs: pub grant_store: Arc<GrantStore>.
- src/storage/mod.rs: re-exports Grant + GrantConsumeOutcome + GrantStore.
Tests + 7 test-file AppState constructors patched: 205 passing
(was 190 in commit d37532a; +15 covers grant unit + 6 grant_flow + 9
fail_closed-related sub-flows in the existing suites).
clippy -D warnings: clean.
Codex round 1 + 2 outputs: docs/spec/plans/issue-64/codex-phaseA2-round{1,2}.md.
V0.1-FOLLOWUPS.md updated with PA2-R1-F4 (thundering-herd) + PA2-R1-F12
(duplicate verify_state) + PA2-R2-F3 (kty fail-closed → CLOSED in this commit).
PHASE-0-CHECKPOINT.md covers Phase 0 in isolation against localhost. This guide is the production equivalent — full Stage 7 (Phases 0 + A.1 + A.2 + B + C-structural + D-rest + E) running on a real EC2 broker host with the AWS account from cloud-setup.md. Sections walk an operator through: - Two-machine layout (operator workstation vs broker host) with inline === ON … === banners on every command block. - Prerequisites checklist (cloud-setup.md §0–4 done, broker host bootstrapped, two cast-generated test wallets). - /healthz + /readyz + OIDC discovery + JWKS + IAM-side OIDC provider cross-checks (with the byte-for-byte issuer match invariant). - SIWE wallet auth round-trip for both wallets, signing with cast wallet sign (no --no-hash). - /v1/mint-oidc-jwt → AssumeRoleWithWebIdentity manual path, decoding the https://aws.amazon.com/tags claim. - Cloud-enforced isolation proof (the climax): wallet A reads its own prefix; wallet B's prefix returns AccessDenied from S3 itself, not app code. Includes the diagnostic-state runbook for both failure modes (own-prefix denied → JWT missing tag claim; other-prefix succeeds → cloud-setup.md §4.4.1 not applied; this is the silent-pass bug PR #69 fixed at the broker layer). - /v1/mint-aws-creds the daemon path with audit_record_id + anchored fields. - Capability grants (create / list / revoke), wallet linking + unauthenticated recover/lookup, email-link + OAuth2/Google flows. - Audit log inspection (sqlite plugin_mint_log columns explained). - Phase C EVM anchor (structural-only in v0; live alloy lands in V0.1-FOLLOWUPS hardening). - Prometheus metrics + Idempotency-Key (hit/miss/422 cases). - harness/stage-7-issue-64-done.sh as the programmatic gate. - Failure-mode walk-through: BOOT_FAIL anchor table, InvalidIdentityToken triage, AccessDenied-on-own-prefix, 24h-clean-exit + Restart=always. - 'What's intentionally not yet live' section pointing at V0.1-FOLLOWUPS.md so operators know which structural features ship as stubs (live EVM anchor, TEE signer, fail-closed grants default, latency histograms). 860 lines. All 6 cross-referenced files exist (verified).
…71 Option B) Pre-fix, both mint paths called `state.sts.assume_role(...)` — the legacy `sts:AssumeRole` action that requires the broker's static IAM credentials. cloud-setup.md §4.2 swaps the role's trust policy from `Principal: {AWS: agentkeys-daemon}` to `Principal: {Federated: oidc-provider}` (replace, not append), so on every cloud account that's actually run §4 the mint endpoint returned 502 `sts_error` / `AccessDenied`. The §4.5 'End-to-end proof' silently bypassed this by going /v1/mint-oidc-jwt → manual `aws sts assume-role-with-web-identity` — that path worked, but the integrated daemon path didn't, leaving Phase B (grants) / Phase C (audit + rate limit + EVM anchor) / Phase D-rest (idempotency) unreachable on federated deployments. This is issue #71 Option B: keep the wire shape, pivot the internal STS call to AssumeRoleWithWebIdentity. The mint endpoint now: 1. Authenticates the caller (session JWT or legacy bearer) — unchanged. 2. Resolves Phase B grant — unchanged. 3. Mints a per-call user-scoped OIDC JWT (same shape as /v1/mint-oidc-jwt; lowercases the wallet for PrincipalTag match; carries the `https://aws.amazon.com/tags` claim). 4. Calls `sts:AssumeRoleWithWebIdentity` with that JWT. 5. Writes audit anchor — unchanged. 6. Returns creds — unchanged response shape. Side benefit: the broker no longer needs an IAM principal at runtime for the mint flow. The legacy `agentkeys-daemon` IAM user keys / AWS_PROFILE / instance profile are still consulted only for the optional startup `caller_identity_ok` probe. A future Option A migration (daemon-side AssumeRoleWithWebIdentity, retire the route) will drop them entirely. Code changes: - sts.rs: add StsClient::assume_role_with_web_identity; AwsStsClient impl wraps aws-sdk-sts `.assume_role_with_web_identity()`; StubStsClient reuses its existing `assume` closure for both methods so test fixtures (StubStsClient::ok, ::failing, ::assume_failing) don't need any updates — only the file that explicitly counts STS calls (invariant_load_bearing) needed the new method added. - handlers/oidc.rs: extract `pub(crate) fn build_oidc_jwt_claims` so the existing /v1/mint-oidc-jwt and the new internal mint path share a single canonical claim builder. The wallet is lowercased so the PrincipalTag matches the bucket policy's lowercase resource ARNs. - handlers/mint.rs: both mint_v2 and mint_legacy mint internal JWT via the new helper, then call `assume_role_with_web_identity`. - tests/invariant_load_bearing.rs: CountingStsClient implements both methods so 'zero STS calls' assertion is path-agnostic. Test totals (--features audit-evm,auth-email-link,auth-oauth2-google): 258 passed, 0 failed. Harness gate: bash harness/stage-7-issue-64-done.sh exits 0. Clippy clean with -D warnings. Doc updates land alongside (operator-runbook-stage7.md gains a 'Mint-time STS path' subsection under §AWS IAM Trust; stage7-demo-and-verification.md §5 explains the pivot; "What's not yet live" section flags the daemon-side Option A follow-up so the eventual route retirement is tracked).
…umeRole/static-IAM-user paths (issue #71 Option A) Migrate the auto-provision pipeline from /v1/mint-aws-creds (server-side aggregator) to /v1/mint-oidc-jwt + client-side AssumeRoleWithWebIdentity, and strip the legacy code surfaces issue #71 made redundant. CALLER-SIDE MIGRATION - crates/agentkeys-provisioner/src/aws_creds.rs: rewrite fetch_via_broker to do the JWT-fetch + AssumeRoleWithWebIdentity in two steps. New fetch_oidc_jwt() helper for unit-test isolation; assume_role_with_jwt() uses anonymous SDK config (the JWT authenticates the call, no broker AWS principals participate). New fetch_via_broker_default_ttl() convenience overload (3600s). - crates/agentkeys-provisioner/Cargo.toml: add aws-config, aws-credential-types, aws-sdk-sts deps. - crates/agentkeys-mcp/src/lib.rs: thread AGENTKEYS_DATA_ROLE_ARN + AWS_REGION through McpHandler. Updated broker_env_for_provision to call fetch_via_broker_default_ttl. Test fixture rewrites: drop /v1/mint-aws-creds mock; mock /v1/mint-oidc-jwt and assert STS-step error using AWS_ENDPOINT_URL_STS=http://127.0.0.1:1. - crates/agentkeys-cli/src/lib.rs: same env-var threading + signature bump for fetch_via_broker_default_ttl. LEGACY CODE REMOVAL - crates/agentkeys-broker-server/src/handlers/mint.rs: drop mint_legacy handler + looks_like_session_jwt dispatcher. mint_aws_creds always routes through mint_v2 (session-JWT path). Drop validate_bearer_token import (no longer used by any mint path). - crates/agentkeys-broker-server/tests/mint_flow.rs: deleted (legacy- only tests). mint_v2_flow.rs remains for the surviving aggregator. - crates/agentkeys-broker-server/src/sts.rs: drop StsClient::assume_role trait method, AwsStsClient::assume_role impl, AwsStsClient::from_keys ctor. Trait now only has assume_role_with_web_identity + caller_identity_ok. Simplify StubStsClient (single closure + identity). - crates/agentkeys-broker-server/src/env.rs: drop DAEMON_ACCESS_KEY_ID, DAEMON_SECRET_ACCESS_KEY, BROKER_DAEMON_ACCESS_KEY_ID, BROKER_DAEMON_SECRET_ACCESS_KEY constants + their all() entries. - crates/agentkeys-broker-server/src/config.rs: drop daemon_access_key_id / daemon_secret_access_key fields + their env-reading logic + struct construction. - crates/agentkeys-broker-server/src/main.rs: drop static-IAM-user branch. Always use AwsStsClient::with_default_chain. Startup STS check is now soft-fail (warn) — broker no longer needs creds for the mint flow, so the probe is informational only. - crates/agentkeys-broker-server/src/boot.rs + 7 test files: strip daemon_* fields from BrokerConfig fixtures. - crates/agentkeys-broker-server/tests/invariant_load_bearing.rs: CountingStsClient drops assume_role method (only assume_role_with_web_identity). DOC UPDATES - docs/operator-runbook-stage7.md: drop DAEMON_* rows from Legacy aliases table. AWS IAM Trust §'Mint-time STS path' rewritten to describe both endpoints (daemon-side /v1/mint-oidc-jwt + server-side aggregator /v1/mint-aws-creds), with explicit 'broker creds-free posture' note. - docs/stage7-demo-and-verification.md §5 rewritten to show both paths. New §5.3 documents the auto-provision pipeline using AGENTKEYS_BROKER_URL + AGENTKEYS_DATA_ROLE_ARN. New §16 'Live walkthrough on broker.litentry.org' — copy-paste runbook for end-to-end verification (deploy, creds-free check, SIWE auth, /v1/mint-oidc-jwt, AssumeRoleWithWebIdentity, S3 isolation proof, auto-provision pipeline, audit log inspection). §15 'What's not yet live' updated — issue #71 Option A's caller-side migration is done; only the route retirement itself remains as future work. VERIFICATION (local) - cargo build -p agentkeys-broker-server (--no-default-features +auth-wallet-sig,wallet-keystore,audit-sqlite, and full feature combo): exits 0 (verified by harness). - cargo test -p agentkeys-broker-server --features audit-evm,auth-email-link,auth-oauth2-google: 247 passed, 0 failed. - cargo test -p agentkeys-provisioner -p agentkeys-mcp -p agentkeys-daemon: 61 passed, 0 failed. - cargo clippy --workspace --all-features -- -D warnings: clean. - bash harness/stage-7-issue-64-done.sh: exits 0 (all 5 phase smokes green, load-bearing 7/7, runbook drift clean, prd.json 41/41). - npm test --prefix provisioner-scripts: 42/45 passing. The 3 failing tests in src/lib/email.test.ts hit real S3 against agentkeys-mail-429071895007 and fail because the local agentkey-broker IAM profile lacks s3:ListBucket — pre-existing test-environment issue, unrelated to this migration. VERIFICATION (live, deferred to operator) - The live walkthrough against https://broker.litentry.org requires SSH to the broker host + admin AWS profile, both of which the operator must run. Documented as docs/stage7-demo-and-verification.md §16 copy-paste runbook.
…+m2) Critic on commit b0c6515 returned ACCEPT-WITH-RESERVATIONS with two MAJOR + four MINOR findings. This commit addresses M1, M2, m1, m2. M1 — `build_session_name` mismatch between provisioner and broker. The provisioner used `agentkey-{wallet}` (no timestamp, lowercase prefix); the broker uses `agentkeys-{wallet}-{secs}-{micros}`. The comment claimed they mirrored each other, but they didn't. CloudTrail correlation between broker-minted and daemon-minted sessions would have failed, and rapid same-wallet mints on the daemon side would have collided on session name (AWS returns the same temp creds for repeated same-name calls within DurationSeconds). Fix: replace the provisioner's algorithm with a byte-for-byte mirror of the broker's. Imports SystemTime + UNIX_EPOCH. Tests updated: build_session_name_matches_broker_format, _strips_unsafe_chars, _handles_empty_wallet (mirroring the broker's test cases). M2 — `scripts/setup-broker-host.sh` still emitted DAEMON_* env vars. The script offered a "static" credential mode that wrote `/etc/agentkeys/broker.env` with DAEMON_ACCESS_KEY_ID + DAEMON_SECRET_ACCESS_KEY — vars the broker no longer reads after the OIDC-only migration. An operator following the script would have set those vars, restarted the broker, seen no error, and silently been running on the SDK default chain (which on a creds-free host has no creds). Confusing failure mode. Fix: - Drop the "static" cred-mode option entirely (validation, prompts, case statements, broker.env emission, post-install instructions). - Add a new "none" cred-mode (default, recommended post-migration) that runs the broker creds-free. - Update the cred-mode walkthrough to describe the post-issue-#71 posture (broker doesn't need creds for the mint flow itself, only the optional GetCallerIdentity startup probe). - Update the systemd CRED_LINE case statement. - Update the post-install log-line check to look for the new "STS client: SDK default chain (creds optional after issue #71 …)" message instead of the removed "AWS credentials: static IAM-user keys". - Replace REPLACE_WITH_DAEMON_AKID / REPLACE_WITH_DAEMON_SECRET placeholders in the named-profile credentials file with the more neutral REPLACE_WITH_ACCESS_KEY_ID / REPLACE_WITH_SECRET_ACCESS_KEY. m1 — `docs/operator-runbook.md` (the pre-Stage-7 runbook, separate from operator-runbook-stage7.md) still described `/v1/mint-aws-creds` as using `sts:AssumeRole` and listed `DAEMON_ACCESS_KEY_ID` / `DAEMON_SECRET_ACCESS_KEY` as a configuration option. Fix: add a top-of-doc banner pointing operators at the Stage-7 runbook for the current build, update the endpoints table, drop the "Static keys (legacy)" §2.3 content, and remove the DAEMON_* row from the env table. m2 — `crates/agentkeys-broker-server/src/handlers/oidc.rs::build_oidc_jwt_claims` doc comment still listed `mint_legacy` as a caller. Removed. Verification: - cargo build --workspace clean. - cargo test -p agentkeys-provisioner: 23 passed, 0 failed (was 21 before; 3 new build_session_name_* tests, -1 obsolete one). - bash harness/stage-7-issue-64-done.sh: exits 0; all 5 phase smokes green; load-bearing 7/7; runbook drift clean; prd.json 41/41. - bash -n scripts/setup-broker-host.sh: syntax clean. Critic minor findings deferred: - m3 (env::set_var thread-safety in MCP test): pre-existing pattern acknowledged. Tracked for a future cargo-nextest migration. - m4 (AwsTempCreds Deserialize derive lost): intentional and correct — the struct is now constructed programmatically from the STS response, not deserialized from JSON. - m5 (AnonymousCredentials TODO for SDK bump): added to comment. The two open questions critic raised: - AwsStsClient with default chain calling AssumeRoleWithWebIdentity on a creds-free host: deferred to live walkthrough verification (the SDK skips signing for federated STS operations regardless of resolver state). - 3 failing npm tests in src/lib/email.test.ts: confirmed pre-existing (real-S3 calls failing due to local agentkey-broker IAM lacking s3:ListBucket); unrelated to this migration.
Ralph step 7.5 mandatory deslop pass on the changed-file scope. -33 net
LOC of redundant prose; behavior unchanged.
- crates/agentkeys-provisioner/src/aws_creds.rs: collapse 27-line file
header ("Why client-side STS?" multi-paragraph) to 8 lines pointing
at issue #71. Trim AnonymousCredentials struct doc + the verbose
inline comment in assume_role_with_jwt; replace with a 3-line TODO
flagging the future aws-config 1.5+ no_credentials() helper (critic
m5 follow-up).
- crates/agentkeys-broker-server/src/handlers/mint.rs: trim 5-line
preamble inside mint_aws_creds dispatch to a 3-line note. Trim 8-line
STS-path explanation block in mint_v2 step 6 to 4 lines (the points
are already covered by the surrounding code).
- crates/agentkeys-broker-server/src/main.rs: rewrite stale
"preserved through US-011" comment on AuditLog::open to describe
what the legacy log actually does in the post-migration build.
Verification post-deslop:
- cargo build --workspace: clean.
- cargo test -p agentkeys-provisioner: 23 passed, 0 failed.
- bash harness/stage-7-issue-64-done.sh: exits 0; all phases green;
41/41 PRD stories; runbook drift clean.
…ess scope only Operators reported that scripts/broker.env set BUCKET on the broker host, but the broker process never reads BUCKET (`grep -n '"BUCKET"' src/env.rs` — zero hits). It's an operator-workstation var used by AWS S3 admin tooling (cloud-setup.md §4.5 isolation proof, scripts/stage6-demo-env.sh) that shouldn't leak onto the broker host. Same story for BROKER_HOST and ACCOUNT_ID: - BROKER_HOST is decorative — broker reads BROKER_OIDC_ISSUER directly. - ACCOUNT_ID is the legacy ARN-derivation fallback for BROKER_DATA_ROLE_ARN; redundant when BROKER_DATA_ROLE_ARN is set explicitly (it already is). This file is now scoped to ONLY the env vars that map to constants in crates/agentkeys-broker-server/src/env.rs. The docstring at the top explicitly calls out the workstation-vs-broker-host scope split so this kind of leakage doesn't recur. scripts/setup-broker-host.sh required no change — it has zero BUCKET references already (verified).
…tion-side companion to broker.env)
Three things:
1. **Archive Stage 6 scripts.** We're in Stage 7 test phase and the
pre-Stage-7 demo scripts are now broken anyway (they hard-code
sts:AssumeRole against the data role's pre-§4 trust policy, which
was OIDC-federated by cloud-setup.md §4.2). Move them out of the
active tree:
- scripts/stage6-demo-env.sh → scripts/archived/
- scripts/stage6-demo-run.sh → scripts/archived/
- scripts/stage6-inspect-email.sh → scripts/archived/
- provisioner-scripts/scripts/weekly-live-test.sh →
provisioner-scripts/scripts/archived/ (depended on the dropped
DAEMON_* env wiring + assume-role pattern)
New scripts/archived/README.md cross-references the Stage 7
replacements (operator-workstation.env, agentkeys-cli provision,
inspect-inbound-email.sh).
2. **Add scripts/operator-workstation.env.** Workstation-side companion
to scripts/broker.env (broker-host scope). Sets ACCOUNT_ID, REGION,
BROKER_HOST, BUCKET, OIDC_ISSUER, OIDC_PROVIDER_ARN, DATA_ROLE_ARN —
exactly the vars docs/stage7-demo-and-verification.md §0 expects.
Operators source this on their laptop via
'set -a; source scripts/operator-workstation.env; set +a' before
running the §16 walkthrough or any AWS admin command. Replaces the
inline export block that was at §0 of the demo guide.
3. **Add scripts/inspect-inbound-email.sh.** Stage 7 replacement for
stage6-inspect-email.sh. Same logic (quoted-printable normalize +
header/body/href/URL extraction with the regex the broker auth
handler uses) but reads $BUCKET from the workstation env instead
of the dropped Stage-6 AGENTKEYS_SES_BUCKET / DAEMON_* wiring.
Now referenced from the new §8.1 'Debugging — inspecting the
inbound email at S3' section in the demo guide.
Doc updates:
- docs/stage7-demo-and-verification.md: §0 prerequisites now points
at scripts/operator-workstation.env instead of inlining the
exports; §16.5 references $DATA_ROLE_ARN and $OIDC_ISSUER from
the sourced file rather than re-exporting them; new §8.1 'Debugging
— inspecting the inbound email at S3' subsection.
- docs/dev-setup.md: drop two stage6-demo-env.sh references
(the §4.1 'no env scripting' line and §4.3 'still works without it'
line) + the troubleshooting row pointing at stage6-demo-run.sh.
- scripts/broker.env docstring: explicitly cross-reference
scripts/operator-workstation.env so the workstation-vs-host scope
split is documented in both files.
Source updates:
- crates/agentkeys-cli/src/lib.rs (×2): drop dead 'stage6-demo-env.sh'
filename references in doc comments, replaced with
'pre-Stage-7 fallback' / 'no manual AWS_* env wiring required' prose.
- crates/agentkeys-cli/src/main.rs: --broker-url help text now describes
the actual flow (/v1/mint-oidc-jwt + AssumeRoleWithWebIdentity)
instead of pointing at the removed shell script.
- crates/agentkeys-mcp/src/lib.rs: same prose cleanup on broker_url field.
- crates/agentkeys-daemon/src/main.rs: --broker-url doc comment
rewritten to describe the new flow (was still describing
/v1/mint-aws-creds with bearer-validated path).
Verification:
- env -i bash 'source scripts/operator-workstation.env; echo $BUCKET'
→ agentkeys-mail-429071895007 (clean load, no leaks).
- env -i bash 'source scripts/broker.env; echo $BUCKET'
→ unset (broker host correctly does NOT get the workstation var).
- bash -n scripts/inspect-inbound-email.sh: syntax clean.
- cargo build --workspace: clean.
- grep 'stage6-demo-env\|stage6-demo-run\|stage6-inspect-email' on the
active tree (excluding archived/): zero hits.
…ivate_key Operator hit `jq: error (at /tmp/wallet-A.json:6): Cannot index array with string "private_key"` following docs/stage7-demo-and-verification.md §0. `cast wallet new --json` (Foundry) returns a JSON ARRAY of wallet objects, not a single object. The wallet metadata is at `.[0]`, not the document root. Same fix applies to `address` extraction.
… setup-broker-host.sh Drop the early-return --upgrade code path. The script now follows a single linear flow that auto-detects fresh-host vs existing-deploy by reading Environment= lines from /etc/systemd/system/agentkeys-broker.service when present. Same invocation works in both states. Concrete changes: 1. Delete the if $UPGRADE_MODE; then ... exit 0; fi block (~130 LOC). The salvageable bits (git pull, branch-switch warning, stop+swap) move into the main flow. 2. Add 'Detect existing config from systemd unit' step right after pre-flight. Reads BROKER_OIDC_ISSUER, ACCOUNT_ID, REGION, and AWS_PROFILE → fills in CLI flags the operator didn't pass. After first install, every subsequent run can be 'bash setup-broker-host.sh --yes' with no other flags. 3. --ref / --skip-pull are now opt-in. Default = build whatever's currently checked out (operator handles git themselves). Pass --ref <branch-or-tag> to opt into a fetch+checkout+pull step (useful for unattended CI redeploys). Branch-switch warning fires when the resolved ref differs from the current branch. 4. --upgrade flag is now a back-compat no-op (silently accepted but does nothing — the script is idempotent regardless). 5. Binary install step now stops services before swap (idempotent — no-op on fresh hosts), backs up existing binaries to .bak (skip on fresh hosts), then installs new ones. Both binaries (mock-server + broker-server) are always rebuilt + reinstalled. 6. Final step uses 'enable + restart' instead of 'enable --now'. restart is idempotent: starts a stopped service, refreshes a running one. Picks up unit-file changes from step 5 + any binary change in step 3. 7. Add post-install verification: tail journalctl, probe loopback /healthz on both ports — operator sees immediate success/failure without an extra command. Header comment rewritten to reflect single-flow design. CLAUDE.md gains a 2-line 'Remote broker host (single entry point)' section: all remote-host changes MUST go through this script — no ad-hoc systemctl edits, no hand-built scp. This is the convention for every future remote change in the project. Net: -58 LOC, +1 idempotent flow, +1 doc rule. bash -n syntax clean.
…d` under set -e Operator on broker.litentry.org reported the script printing "Detected existing broker unit at … — reading config" then exiting silently. Cause: the previous detection block used the `[[ test ]] && cmd` pattern at the top level — under `set -e`, when the test is false, the whole compound returns 1 and the script exits. Specifically: [[ -n "$EXISTING_REGION" ]] && REGION="$EXISTING_REGION" When the existing systemd unit didn't have an `Environment=REGION=…` line (common after the post-issue-#71 deploy that drops legacy aliases), $EXISTING_REGION was empty, the test failed, the && short-circuited, the line returned 1, set -e killed the script. Fix: - Convert all four detection conditionals to explicit `if`/`fi` blocks. set -e exempts commands inside `if test; then …; fi` so a false test no longer terminates. - Harden `read_unit_env` itself: wrap the grep|head|sed pipeline in `{ … } || true` so a missing key returns empty under set -e + pipefail instead of propagating grep's no-match exit code. - Add a comment at the top of the block calling out the gotcha so the next person editing this code doesn't reintroduce it. Verified locally with `set -euo pipefail` against a unit file that has ISSUER but lacks REGION + ACCOUNT_ID: ISSUER_URL=https://broker.litentry.org ACCOUNT_ID=(empty) REGION=us-east-1 CRED_MODE=(empty) OK — no silent exit bash -n syntax clean.
Operator on broker.litentry.org reported the script still asking unnecessary questions on a re-run. The host already has OIDC enabled, nginx in place, and the post-issue-#71 creds-free posture — all four remaining prompts (cred-mode, region, nginx, certbot) were noise. Three changes make the silent re-deploy actually silent: 1. Detection block now defaults CRED_MODE to 'none' when the existing unit has no AWS_PROFILE. Pre-fix, CRED_MODE stayed empty and triggered the cred-mode prompt; post-fix, the post-issue-#71 default fills in automatically. 2. Drop the cred-mode / region / nginx / certbot prompt blocks from the interactive walkthrough. They're now opt-in via CLI flags only: --cred-mode {none|instance-profile|profile} (default: none) --region us-east-1 (default: us-east-1) --with-nginx | --without-nginx (default: no) --with-certbot | --without-certbot (default: no) On a fresh-host bootstrap that genuinely needs nginx + certbot, the operator passes those flags. On the common remote-host re-deploy case, no prompts fire. 3. Flip the validate-inputs default for CRED_MODE from 'instance-profile' to 'none' (matching the new silent default), and convert the WITH_NGINX/WITH_CERTBOT 'auto → no' resolution from '[[ ]] && cmd' to 'if/fi' to dodge the same set-e silent-exit gotcha that bit the detection block. Verified locally: existing unit + no flags + --yes → no prompts, detection fills in everything, summary + execute proceed silently. detected: ISSUER_URL=https://broker.litentry.org ACCOUNT_ID=429071895007 REGION=us-east-1 CRED_MODE=none final: WITH_NGINX=no WITH_CERTBOT=no OK — would proceed silently to summary + execute, no prompts
…k8s-style name The broker's Tier-2 reachability probe (spawn_tier2_probes in agentkeys-broker-server/src/main.rs) hits BROKER_BACKEND_URL/healthz — Kubernetes convention. The mock-server only registered /health, so the probe always returned 404 and the broker logged 'Tier-2 backend probe: unreachable' every 15s while /readyz stayed at 503. Operator on broker.litentry.org saw this in journalctl plus an empty 'curl -sf .../healthz; echo' (curl -sf swallowed the 404 silently because of -s, and printed nothing because there was no 2xx body). Add /healthz as a parallel route. Keep /health as an alias so any pre-Stage-7 caller that wired itself to /health doesn't break. After this commit + a redeploy via setup-broker-host.sh, the broker's /readyz transitions from 'unready' (tier2/backend) to 'ready' within ~15s of restart. cargo build -p agentkeys-mock-server: clean. cargo test -p agentkeys-mock-server: 5 + 56 = 61 passed, 0 failed.
…url probes informative Two related cleanups for the endpoint name + UX: 1. **Single name across the codebase: `/healthz`** (Kubernetes convention, matches what the broker's Tier-2 reachability probe actually hits). - mock-server: drop the `/health` alias added in 77fbce2. Only `/healthz` remains. Confirmed zero callers expected `/health` (grep across crates/ showed no consumers). - broker-server handlers/health.rs (dead code per V0.1-FOLLOWUPS R1-F10 but kept for now): change the backend probe URL from `/health` to `/healthz` for consistency. 2. **Make `curl … /healthz` probes self-explanatory.** The `curl -sf` pattern silently swallows non-2xx responses (because of -s) and only prints body on success. When operators hit a 404 or wrong port, they see nothing — the failure mode that prompted this fix on broker.litentry.org. Replace with `curl -sS -o /dev/null -w 'HTTP %{http_code}\\n'` so the response status always prints, regardless of outcome: - docs/stage7-demo-and-verification.md §0 healthz curl - scripts/setup-broker-host.sh post-install smoke-test hint After this commit + a redeploy: - mock-server's only health endpoint is `/healthz`. - broker's Tier-2 probe (already targeting `/healthz`) finds the endpoint and `/readyz` flips to "ready". - demo-guide §0 shows `HTTP 200` (or whatever) instead of empty output, so operators know exactly what they got. cargo build -p agentkeys-mock-server -p agentkeys-broker-server: clean. cargo test (both crates): 222 passed, 0 failed.
…-describing
- Delete crates/agentkeys-broker-server/src/handlers/health.rs (unrouted; the
router has used handlers::broker_status::readyz since Phase 0).
- /readyz green-path body changes from {} to {"status":"ready","degraded":
false,"checks":[],"ready":[...]}. The dead code was the source of the
wrong-shape doc copy that claimed /readyz returned {"status":"ready"}.
- docs/stage7-demo-and-verification.md §1 + §16.3 updated to show the actual
three-shape response and use 'jq -r .status' as the green-path verdict.
- CLAUDE.md adds a branch-push policy: on the evm branch, push immediately
after every code/doc update so scripts/setup-broker-host.sh --upgrade
doesn't silently pick up a stale revision.
zsh's builtin echo interprets \n (two ASCII chars '\' + 'n') as a literal 0x0A newline. The broker's /v1/auth/wallet/start response embeds \n inside the siwe_message JSON string as a JSON escape, so the long-standing 'echo "$START" | jq' pattern silently corrupts those escapes into raw newlines and jq fails with: Invalid string: control characters from U+0000 through U+001F must be escaped at line 13, column 33 Replaced 25 occurrences across §2-§16. printf '%s' is portable across bash and zsh and never re-interprets escapes. Added a note in §0 explaining the choice so a future maintainer doesn't 'fix' it back. Verified live against https://broker.litentry.org/v1/auth/wallet/start: - echo $START | jq → parse error (zsh) - printf '%s' "$START" | jq → siwe-d437073077a2792b327836eac893fd83 ✓
Reproduce reported failures locally and isolate the layer (shell, tooling, doc, code) before editing. If the cause is local, respond with the one-line fix; only edit when the cause is in the repo. Keep responses concise.
…0 checkpoint Same echo→printf '%s' fix as b80ec39, applied to the 5 remaining occurrences in cloud-setup.md (3), stage7-wip.md (1), PHASE-0-CHECKPOINT.md (1).
The previous bulk fix (b80ec39, 8b50c1d) used a Python raw-string regex replacement that left literal backslashes around the quotes: printf '%s' \"$START\" | jq ← was committed printf '%s' "$START" | jq ← what users actually need The shell sees \" as literal " plus the surrounding quoting, producing "<JSON>" which jq can't parse ("Invalid numeric literal"). Stripped from 30 lines across 4 docs (stage7-demo, cloud-setup, stage7-wip, PHASE-0-CHECKPOINT). Also moved the printf rationale callout from inside the §0 bullet list (where it broke list rendering) to right before §1, and expanded it to call out the backslash-quote trap explicitly.
…owing them curl -sf returns exit 22 on 4xx/5xx but DISCARDS the response body and prints nothing to stderr. Operators following the demo doc see an empty $START / empty $VERIFY / empty $JWT and have no signal what went wrong. --fail-with-body (curl >=7.76, ships in macOS curl 8.7+) keeps the same fail-on-non-2xx behaviour but PRINTS the body, so a 401 'bad nonce' or 400 'malformed wallet address' is visible immediately. 45 occurrences across 4 docs (stage7-demo, cloud-setup, operator-runbook, stage7-wip). The single `curl -sf … && echo` reference in the §1 comment is intentional — it's documenting the anti-pattern.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously fell back to a hardcoded https://oidc.agentkeys.dev when the env var was missing. Tier-1 only validates that the issuer is HTTPS, so the wrong issuer would pass startup and the broker would happily mint JWTs that AWS rejects with cryptic InvalidIdentityToken at /v1/mint-aws-creds time. The issuer is a trust-boundary value — AWS IAM compares the JWT iss claim byte-for-byte against the registered OIDC provider URL. There is no safe default; the deployment owner must set it explicitly. Codex adversarial review (review-mowwm33c-u6fa0v) flagged this as the no-ship issue. Fix matches the existing required_env pattern already used for BROKER_BACKEND_URL on line 48. scripts/broker.env line 46 and scripts/setup-broker-host.sh line 552 already emit this env var, so the live broker.litentry.org deploy doesn't break — just gets the fail-closed behaviour the doc has always promised.
…backend
Root cause of the live-broker §3 401 'session not found':
/v1/auth/wallet/verify returns a broker-signed session JWT (kid 'ak-session-…')
/v1/mint-oidc-jwt was still calling validate_bearer_token, which round-
trips to BROKER_BACKEND_URL/session/validate
The broker signs SIWE/email/oauth2 sessions itself; the legacy mock
backend never sees them. So a freshly-minted session JWT fails the
backend lookup → 401 'session not found'.
/v1/mint-aws-creds (handlers::mint::mint_v2) was already on the right
path — verify_session_jwt against state.session_keypair, no backend
round-trip. /v1/mint-oidc-jwt was a half-completed migration.
Fix: oidc.rs swaps to verify_session_jwt — same primitive, same issuer
+ kid pinning, same audience check. wallet now comes from
session_claims.agentkeys.wallet_address. /v1/auth/exchange keeps using
validate_bearer_token because that endpoint exists explicitly to convert
legacy bearers into session JWTs (per its own docstring).
Tests:
- mint_oidc_jwt_signs_claims_for_session_wallet rewritten to mint a
session JWT against state.session_keypair instead of calling the
legacy /session/create on the mock backend.
- mint_session_against_backend helper deleted (was the only caller).
- mint_oidc_jwt_rejects_missing_bearer + rejects_invalid_bearer_and_audits_auth_failed
pass unchanged — the new local-verify path returns the same
Unauthorized error class.
124 unit + 31 integration tests green.
SELECTIVE EXPANSION mode. 6 of 8 surfaced expansions accepted: - Signer protocol design doc (#1) - Versioned HKDF derivation (#3) - Audit-log row on init (#5) - agentkeys whoami CLI (#6) - TEE-stub integration test (#7) - Hard cut --mock-token flag (#8 — stronger than recommended deprecation runway) Skipped: - Feature-flag gating (#2 — env-var gating retained) - Session JWT refresh flow (#4 — long TTL acceptable for demo) Revised effort: 600 -> 830 LOC, +1 design doc, +1 CLI command, +1 test infrastructure (TEE-stub conformance).
hanwencheng
pushed a commit
that referenced
this pull request
May 9, 2026
…th) + step 1c plan + arch doc Lands the architectural follow-up to PR #75: PR #75 shipped the dev_key_service signer with no HTTP-layer auth (loopback assumption per signer-protocol.md §"What's intentionally out of scope at v0"). This commit: - DEPLOYS signer.litentry.org as an independent backend listener (issue #74 step 1b). agentkeys-mock-server gains a `--signer-only` mode that registers ONLY `/dev/derive-address`, `/dev/sign-message`, `/healthz` (no legacy session/ credential/audit endpoints). Bound to 127.0.0.1:8092; nginx fronts it at https://signer.<zone> with its own cert. Same binary, two roles — loopback :8090 stays as the broker's tier-2 reachability target. - ADDS JWT bearer verification to /dev/* handlers. The signer reads the broker's ES256 session pubkey at boot from a pinned file (/var/lib/agentkeys/.agentkeys/broker/session-keypair.pub.pem) written by the broker's new --export-session-pubkey-to flag. Every /dev/* request must carry Authorization: Bearer <jwt> with claims.agentkeys.omni_account matching body.omni_account; otherwise 401 unauthorized. No SIGNER_ACCESS_TOKEN. No HMAC. No device-key signing — those land in step 1c. - PLUMBS the JWT through the daemon-side stack: HttpSignerClient gains with_session_jwt(); CLI signer/whoami commands load the saved session and set the bearer; init_flow returns the EVM session JWT for the caller to persist. - AUTOMATES setup-broker-host.sh to provision the new agentkeys-signer.service systemd unit and the nginx server block for signer.<zone>. Idempotent — re-runs preserve the master secret + session pubkey + nginx config. PLAN DOCS: - docs/spec/plans/issue-74-step-1c-device-key-auth.md (NEW, 381 lines) Replaces broker-issued bearer JWT as the sole authenticator on /dev/* with a device-key signature scheme. Removes broker-as-SPOF risk for the signer call surface; identity-type-uniform across evm/email/oauth2/ passkey; UX-uniform (one ceremony at init, automatic per-request). Aligned with Heima's ClientAuth tier model (EvmSiweSigned + BackendSigned), strictly stronger because user-controlled per-request key + zero per-request user interaction. See gh issue #76. - docs/spec/architecture.md (REWRITTEN, 506 lines, replaces prior version) Canonical broker/signer/daemon/key-flow doc. Mermaid diagrams for component map, trust boundaries, identity model, init sequence, per-mint sequence, deployment topology. Full K1–K10 key inventory table designed for direct Figma reuse. Pluggable-surfaces matrix covering auth methods, signer backends, audit destinations, vault backends. stage7-wip.md absorbed into §1, §6, §7, §11; archived. - docs/spec/heima-gaps-vs-desired-architecture.md (REVISED) Added §1a status snapshot table covering all 12 gaps at-a-glance. §3 OIDC provider + §6 PrincipalTag JWT claim marked RESOLVED IN-TREE (post-PR #61 + #73). NEW §11 (signer-edge contract — PARTIAL after PR #75) and §12 (per-request crypto auth — PLANNED via #76). Resolution log under §10. - docs/stage7-demo-and-verification.md (UPDATED for the signer split) Drops the SSH tunnel scaffolding entirely. Single demo path uses the public signer hostname. Trust-model diagram + two-machine layout + §0.2 reach-the-signer + §14.3 troubleshooting + §16.4 live walkthrough + §16.7 auto-provision + §17 cleanup all updated. VERIFICATION: - 394 tests pass workspace-wide (was 386 in PR #75; +8 new JWT auth integration tests in dev_key_service_routes.rs). - 0 cargo clippy errors; 18 pre-existing warnings (was 16; +2 minor cosmetic in agent-generated test code). WHAT DID NOT LAND: - Live broker host redeploy + signer.<zone> certbot issuance — operator step. The script that makes it work shipped here. To land: ssh broker host → bash scripts/setup-broker-host.sh --yes → sudo certbot --nginx -d signer.<zone> → smoke per docs/stage7-demo- and-verification.md §16. - Device-key auth (issue #74 step 1c) — separate issue #76, plan doc shipped in this commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
hanwencheng
added a commit
that referenced
this pull request
May 15, 2026
…strap chain) (#75) * agentkeys: stage 7+ — issue #74 step 1 (dev_key_service signer + bootstrap chain) Plan steps 0-9 of docs/spec/plans/issue-74-dev-key-service-plan.md landed in this PR: - 0: docs/spec/signer-protocol.md — v0 wire contract (request/response, error envelope, versioned HKDF derivation byte, future TEE attestation handshake). - 1: agentkeys-mock-server::dev_key_service — HKDF + secp256k1 + EIP-191, loaded from DEV_KEY_SERVICE_MASTER_SECRET; 10 unit tests. - 2-3: /dev/derive-address + /dev/sign-message handlers + state + routes; 503 signer_disabled when env unset; 8 integration tests. - 4: scripts/setup-broker-host.sh auto-generates the master secret into /etc/agentkeys/dev-key-service.env (mode 0600), wires it via EnvironmentFile= in the backend systemd unit. Idempotent — preserves the secret across re-runs (rotation invalidates derived wallets). scripts/broker.env documents the separation. - 5: agentkeys-daemon main.rs adds --init-email / --init-oauth2-google / --signer-url, drives the email/OAuth2 -> omni -> derive -> link -> SIWE -> EVM-session chain on first start; emits a tracing audit row on success. - 6: agentkeys-cli cmd_init rewritten as InitMode::{Email, Oauth2Google, ImportLegacyMock(test-only)}. --mock-token flag hard-cut from the user-facing CLI surface. All 9 cli_tests.rs sites migrated. - 7: agentkeys whoami CLI (read-only; surfaces signer-derived wallet). - 8: TEE-stub conformance test — same wire contract, in-memory keypair fixture vs HKDF backend; 3 tests prove the swap-point invariant. - 9: docs/stage7-demo-and-verification.md rewritten end-to-end for the new flow. Shared plumbing in agentkeys-core: signer_client (typed RPC trait + HttpSignerClient), init_flow (broker email/OAuth2 chain, used by both CLI and daemon). CLAUDE.md adds a plan-completion policy (always complete every numbered plan step; mandatory done/not-done summary at PR end). Pre-Stage-7 docs moved to docs/archived/ (operator-runbook, contradictions, field-name-translation); inbound references repointed. Verification: 386 tests pass workspace-wide, 0 failing; clippy clean on new code. What did not land in this PR: - Plan step 10 (live broker-host redeploy + smoke walkthrough) — operator step; the script that makes it work shipped here. - End-to-end integration test of the email/OAuth2 flow against a live broker — would need an in-memory mock email/OAuth2 provider; left as follow-up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * agentkeys: stage 7+ — issue #74 step 1b (signer-server split + JWT auth) + step 1c plan + arch doc Lands the architectural follow-up to PR #75: PR #75 shipped the dev_key_service signer with no HTTP-layer auth (loopback assumption per signer-protocol.md §"What's intentionally out of scope at v0"). This commit: - DEPLOYS signer.litentry.org as an independent backend listener (issue #74 step 1b). agentkeys-mock-server gains a `--signer-only` mode that registers ONLY `/dev/derive-address`, `/dev/sign-message`, `/healthz` (no legacy session/ credential/audit endpoints). Bound to 127.0.0.1:8092; nginx fronts it at https://signer.<zone> with its own cert. Same binary, two roles — loopback :8090 stays as the broker's tier-2 reachability target. - ADDS JWT bearer verification to /dev/* handlers. The signer reads the broker's ES256 session pubkey at boot from a pinned file (/var/lib/agentkeys/.agentkeys/broker/session-keypair.pub.pem) written by the broker's new --export-session-pubkey-to flag. Every /dev/* request must carry Authorization: Bearer <jwt> with claims.agentkeys.omni_account matching body.omni_account; otherwise 401 unauthorized. No SIGNER_ACCESS_TOKEN. No HMAC. No device-key signing — those land in step 1c. - PLUMBS the JWT through the daemon-side stack: HttpSignerClient gains with_session_jwt(); CLI signer/whoami commands load the saved session and set the bearer; init_flow returns the EVM session JWT for the caller to persist. - AUTOMATES setup-broker-host.sh to provision the new agentkeys-signer.service systemd unit and the nginx server block for signer.<zone>. Idempotent — re-runs preserve the master secret + session pubkey + nginx config. PLAN DOCS: - docs/spec/plans/issue-74-step-1c-device-key-auth.md (NEW, 381 lines) Replaces broker-issued bearer JWT as the sole authenticator on /dev/* with a device-key signature scheme. Removes broker-as-SPOF risk for the signer call surface; identity-type-uniform across evm/email/oauth2/ passkey; UX-uniform (one ceremony at init, automatic per-request). Aligned with Heima's ClientAuth tier model (EvmSiweSigned + BackendSigned), strictly stronger because user-controlled per-request key + zero per-request user interaction. See gh issue #76. - docs/spec/architecture.md (REWRITTEN, 506 lines, replaces prior version) Canonical broker/signer/daemon/key-flow doc. Mermaid diagrams for component map, trust boundaries, identity model, init sequence, per-mint sequence, deployment topology. Full K1–K10 key inventory table designed for direct Figma reuse. Pluggable-surfaces matrix covering auth methods, signer backends, audit destinations, vault backends. stage7-wip.md absorbed into §1, §6, §7, §11; archived. - docs/spec/heima-gaps-vs-desired-architecture.md (REVISED) Added §1a status snapshot table covering all 12 gaps at-a-glance. §3 OIDC provider + §6 PrincipalTag JWT claim marked RESOLVED IN-TREE (post-PR #61 + #73). NEW §11 (signer-edge contract — PARTIAL after PR #75) and §12 (per-request crypto auth — PLANNED via #76). Resolution log under §10. - docs/stage7-demo-and-verification.md (UPDATED for the signer split) Drops the SSH tunnel scaffolding entirely. Single demo path uses the public signer hostname. Trust-model diagram + two-machine layout + §0.2 reach-the-signer + §14.3 troubleshooting + §16.4 live walkthrough + §16.7 auto-provision + §17 cleanup all updated. VERIFICATION: - 394 tests pass workspace-wide (was 386 in PR #75; +8 new JWT auth integration tests in dev_key_service_routes.rs). - 0 cargo clippy errors; 18 pre-existing warnings (was 16; +2 minor cosmetic in agent-generated test code). WHAT DID NOT LAND: - Live broker host redeploy + signer.<zone> certbot issuance — operator step. The script that makes it work shipped here. To land: ssh broker host → bash scripts/setup-broker-host.sh --yes → sudo certbot --nginx -d signer.<zone> → smoke per docs/stage7-demo- and-verification.md §16. - Device-key auth (issue #74 step 1c) — separate issue #76, plan doc shipped in this commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: address review-questions Q1-Q8 (PoP, cold-start ordering, per-identity-type processes, K9 explanation) Addresses /Users/agent-jojo/.claude/plans/review-questions.md Q3 (K9 DKIM explanation): expanded the K9 row in architecture.md key inventory with a high-level "what is DKIM, why does AgentKeys need it" paragraph (per-domain Ed25519 key, signs outbound mail headers, pubkey in DNS TXT, used by Stage 6 federated email so SES never sees plaintext). Q5 (cold-start sequence ordering): rewrote architecture.md §5 to show device key generated FIRST (step 0), BEFORE the identity ceremony. The ceremony then binds D_pub atomically. Same trust shape as a WebAuthn credential creation — by the time the broker mints session JWTs, the device-pubkey claim is authoritative. Q6 (per-identity-type processes): NEW architecture.md §5a covers init-binding for each identity type (email-link, oauth2_google, evm, passkey, sandbox link-code), device-switching when operator gets a new laptop, intentional device-key rotation with chain-of-custody sigs, sandbox VM device-key persistence, and a trust-shape comparison across identity types. Architecture.md is now the single source of truth; step-1c plan defers to it. Q7 (init binding security — proof of possession): updated step-1c plan §"email" to require a `pop_sig` over the request payload signed by D_priv. Broker rejects with 400 bad_pop on mismatch. Closes the "attacker substitutes pubkey at request time" attack: attacker would need to compromise BOTH the network path AND the user's email inbox (vs just the network today). Q8 (sandbox VM device-key persistence): resolved via architecture.md §5a.4. Stock agent-infra/sandbox falls back to keyring-rs file backend under ~/.agentkeys/daemon-<wallet>/session.json (mode 0600); survives daemon restarts inside long-lived containers; vanishes with ephemeral sandbox containers. For ephemeral sandboxes, operator runs `agentkeys-daemon --init-link-code <new-code>` per session — same pattern as today's pair-flow. Q1 (forward-references): - issue-74-dev-key-service-plan.md gains a "Status (post-PR #75) — successor steps" preamble pointing at step 1b + step 1c as the follow-on work. - stage7-demo-and-verification.md trust-model section gains a callout that step 1c will upgrade /dev/* auth from bearer-JWT to device-key per-request signature; the demo flow shape doesn't change. Q2 (cleanup + placement): filed as issue #77 (separate from this commit). Tracks (a) the legacy mock-server endpoint cleanup after #75 + #76, and (b) the open question of where identity/audit endpoints belong long-term — captures the user's broker-policy / signer-execution split proposal. Q4 (storage location — answered inline, no doc edit): omni ↔ identity linking is stored in the broker at crates/agentkeys-broker-server/src/storage/identity_links.rs (SQLite table `identity_links`, indexed on (identity_type, identity_value)). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: cleanup pass on review-questions edits (renumber, PoP consistency, stale refs) Three structural cleanups across the 5 docs touched in commit 6d36a7b: 1. heima-gaps-vs-desired-architecture.md — section ordering fix. Previous numbering was 1, 1a, 2..9, 11, 12, 10 (Tracking out of order). Renumbered: §11 (NEW signer-edge contract) → §10 §12 (NEW per-request crypto auth) → §11 §10 (Tracking — was wedged between) → §12 Updated §1a status snapshot table accordingly. Updated 3 stale in-body §-refs: - §1a row 3: "architecture.md §11" → §7 (Pluggable surfaces) - §11 body "TEE swap-ready (gap §11)" → "(gap §10)" - §11 body "Blocks the TEE worker (gap §11)" → "(gap §10)" Updated tracking-section "PR #75 / issue #76 close §11 and queue §12" → "close §10 and queue §11"; resolution-log entries to match. 2. issue-74-step-1c-device-key-auth.md — PoP consistency across all identity types. Previously only the `email` flow had explicit proof-of-possession; `evm` and `oauth2_google` flows didn't. Same Q7 attack surface applies to all three, so: - `evm` flow: daemon now signs the SIWE binding payload with D_priv (in addition to the EVM key); broker verifies both signatures (proves "user owns EVM identity AND daemon controls device key"). - `oauth2_google` flow: daemon now signs the start request with D_priv; broker verifies before issuing any state value. Composes with the existing `state` parameter binding. 3. architecture.md — dropped "(preserved from prior architecture revision)" parenthetical from §9 Component inventory and §10 Language choices headings. Internal-changelog noise that doesn't help readers. Verification: 394 workspace tests pass, 0 fail. heima-gaps section ordering now sequential (1 → 1a → 2..9 → 10 → 11 → 12). All §-refs resolve to live anchors. step-1c PoP coverage confirmed in all three identity-type sections. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: master/agent split + WebAuthn-uniform binding ceremony (v0.2 target) Architecturally collapses the four bespoke per-identity PoP shapes (email pop_sig, oauth2 pop_sig, evm dual-sign-SIWE, passkey) into two uniform binding ceremonies, split by machine class: - Master machines (workstation with platform authenticator) -> WebAuthn enrollment ceremony. Hardware-attested, identity-type- agnostic, closes the email-account-compromise -> device-takeover gap (Q7) by requiring hardware presence at re-bind. - Agent machines (VM/Linux/CI/agent-infra/sandbox container) -> link-code redeemed against master's authenticated session per the agent-infra/sandbox two-tier orchestrator pattern. Defers YubiKey-on-Linux-as-master (roaming-authenticator binding) to issue #79 as a follow-up. arch.md changes (single source of truth): - §2 trust boundaries: K11 in master TB, new agent-machine TB, master/agent rows in compromise table - §3 K-table: K10 master/agent persistence dichotomy; new K11 for WebAuthn platform-authenticator credential - §5 cold-start: status callout pointing at §5a.1 for v0.2 target - §5a header: master-vs-agent intro + WebAuthn-uniform status - §5a.1: rewrite into identity ceremonies + 5a.1.M (WebAuthn) + 5a.1.A (link-code) + v1c-interim PoP shapes pointer - §5a.2: master/agent device-switch shapes; cross-device confirmation note - §5a.3: WebAuthn get()-gated rotation for masters - §5a.4: agent persistence per agent-infra/sandbox; link-code-per- session is the right answer, not a workaround; cite 1-step- analysis.md - §5a.5: trust-shape table collapses to master/agent rows Plan files defer to arch.md as authoritative: - step-1c plan: status callout + per-identity-type section header marked v1c-interim - dev-key-service master plan: successor steps note WebAuthn binding + link to #79 Companion artifacts: - gh issue #79 filed (YubiKey-on-Linux master deferral) - comment on #76 with WebAuthn refinement summary * docs: arch.md — fix stage-0 device-key generation contradiction (§5 vs §5a.1.M) §5 cold-start sequenceDiagram correctly shows D generated at step 0 (before identity ceremony / network traffic). §5a.1.M had it as step 1 AFTER identity ceremony returns binding_nonce — internally inconsistent within arch.md. §5 is the right model: D should be generated at daemon startup, not deferred until identity ceremony completes. There is no security benefit to delaying, and D_pub must exist by the time of any binding ceremony anyway (v1c pop_sig signs identity request with D_priv; v0.2 WebAuthn challenge folds D_pub into the ceremony challenge). Changes: - §5a.1 intro: explicit three-stage pipeline. Stage 0 = device-key generation at daemon startup; Stage 1 = identity ceremony; Stage 2 = binding ceremony. State that stage 0 is non-negotiably first across all flows (master, agent, v1c, v0.2) with the reasoning. - §5a.1.M: drop the misleading "step 1: generate D_priv". Now opens with explicit PRECONDITIONS from stage 0 + stage 1, and binding- ceremony numbering starts at the WebAuthn step itself. Final step notes D_priv was already persisted at stage 0 (just persist J0). - §5a.1.A: agent flow's daemon-startup D-generation now explicitly labelled "Stage 0 (daemon startup, per §5a.1)" for symmetry. Numbering unchanged (cross-machine sequence continues from master). - §5a.2.M: new-master device-switch flow now leads with Stage 0 (fresh K10' generated at daemon startup) before identity ceremony, matching first-init. §5a.3.M rotation step "generate D_priv_new" is unchanged — that's an explicit new-key generation within the rotation flow, not first-time init, so stage-0 framing doesn't apply. * docs: arch.md §5a.1.M — fill J0 → J1 bridge gap referenced by §5a.1.A §5a.1.A's precondition expected J1_master (the EVM-omni session JWT) but §5a.1.M ended at J0 (the identity-omni JWT). The wallet-derive + link + SIWE round-trip that mints J1 lives in §5 steps 2-3 but was never referenced from §5a.1.M's outro, so the reader had no path between the master binding ceremony and the agent link-code flow. Changes: - §5a.1.M: new "From J0 to J1 (master only — bridge to per-mint flows)" subsection. 6-step flow: signer derive-address → broker wallet/link → broker auth/wallet/start → signer sign-message → broker auth/wallet/verify → mint J1. States that K10 + K11 claims propagate from J0 into J1 atomically. Notes the evm-identity-type variant collapses these steps (user's own EVM key IS the wallet). - §5a.1.A precondition: now reads "ON MASTER (already initialized per §5a.1.M + the J0 → J1 bridge above; holds J1_master = the long-lived EVM-omni session JWT with K10 + K11 claims)" — makes the dependency on the bridge explicit. * docs: adopt HDKD per-agent omni model + arch.md compaction (709 lines, -235) Adopts the per-agent omni model proposed by user critique: - Each agent is a first-class actor with its own omni derived from master via HDKD //label, its own wallet (HKDF(K3, O_agent)), its own AWS PrincipalTag, its own audit slot. - Per-agent compromise containment, atomic revocation, first-class audit attribution, tree-as-data-model. - v1c "shared omni + multiple device pubkeys" is now a degenerate v1.0 tree (no children). Plus the link-code-only-agent-bootstrap simplification: - Agents have ONE bootstrap path: link-code from authenticated master. - No identity ceremony for agents, no shared bearer, no agent-side recovery. One test surface, one threat model. arch.md changes (compacted 944 -> 709 lines): - §3 K3/K4: per-actor-omni derivation framing; K10/K11 references updated to new §5a subsection numbering - §4 identity model: HDKD actor tree (master root + //label children), per-actor wallet derivation, why per-agent omni - §4a NEW: 4-axis mental model (identity / actor / machine / capability), master-vs-agent role table, key non-conflations - §5 cold-start: compact 4-stage table + single sequenceDiagram showing v1.0 master flow with WebAuthn enrollment + bridge to J1; v1c interim status callout - §5a restructured into 5 subsections (was multi-subsubsection): - 5a.1 master init (per-identity-type + uniform WebAuthn binding) - 5a.2 agent bootstrap (link-code only - explicit "no other path") - 5a.3 master device switch + rotation (combined) - 5a.4 agent re-bootstrap + persistence (combined; cites 1-step-analysis.md) - 5a.5 trust shape (per-actor isolation properties) CLAUDE.md: added "Architecture-as-source-of-truth policy" requiring arch.md re-check after any architectural doc edit; documents that per-doc detail outgrowing arch.md should link outward, not duplicate. step-1c plan: status callout reframed - v0.2 target is HDKD per-agent omni + WebAuthn-uniform binding (structural shift, not just wire-shape collapse); points at arch.md §4/§4a/§5a as single source of truth. Companion artifacts (not in commit; reference only): - .omc/wiki/agent-role-and-usage-hdkd-per-agent-omni.md (project-local wiki page, gitignored per .omc/ convention) - gh issue #79 updated: master-vs-agent reframed as actor role, not machine class; YubiKey-on-Linux is "Linux + YubiKey as master" (one of two roles, not a third class). * docs(demo): align stage7 demo doc with new architecture vocabulary Updates the operator-facing demo doc for the master/agent + HDKD mental model landed in the prior commit (50a0ffa). Operational content (steps 0-13) is unchanged because the demo runs against v1c-interim — the actually-shipped flow. Changes: - Trust model section: replaced step-1c-coming callout with explicit v1c-interim status; cross-refs arch.md §4 (HDKD actor tree), §4a (mental model), §5a (per-actor binding); flags v0.2 target features as not-yet-implemented and tracked in #76 / #79. - Two-machine layout: marked operator-workstation row as "(master role)"; added a "Roles + key inventory primer" callout pointing at arch.md §4a (4-axis mental model), §3 (K1-K11 inventory), §5a.2 (agent role / link-code bootstrap), and the agent wiki page as the operator-focused reference. - Section §0 success-criteria #3: clarifies "operator's omni_account" IS the master actor omni per arch.md §4. What did NOT land in the demo doc: - Per-step rewriting of operational content. The demo correctly exercises v1c-interim (single-omni-shared-with-master, bespoke per-identity PoP, link-code agents). v0.2 demo content waits for the agent-create endpoint + WebAuthn ceremony to ship. * docs(signer): document signer setup + add SIGNER_HOST/AGENTKEYS_SIGNER_URL - scripts/operator-workstation.env: add SIGNER_HOST + AGENTKEYS_SIGNER_URL (derived from BROKER_HOST), keep BACKEND_URL as alias. Co-located with broker today; hostname split lets the signer move to its own machine (or TEE worker) later without changing client config. - docs/cloud-setup.md §1.3: add "what the signer is + why a dedicated hostname" overview with a today-vs-future table; explicit co-location note + cross-ref to operator-workstation.env. - docs/stage7-demo-and-verification.md §0.2: stop re-deriving the signer URL — both vars come from operator-workstation.env now. Cross-ref the topology section in cloud-setup.md. No code change; arch.md §10 deployment topology already captures the separate-hostname / same-host model unchanged. * docs(cloud-setup): extract signer setup into §6 — fix $EIP ordering bug §1.3 used $EIP, but $EIP isn't set until §5.1 — copy-pasting top-down broke. Make §1.3 a brief intro consistent with §1.2 (broker subdomain defers to §5), and put the actual DNS+cert+nginx-flip steps in a new §6 that runs after §5 and reuses $EIP. - §1.3: brief signer intro + defer to §6 (matches §1.2 shape). - §6 NEW: Signer host — overview table (today vs future), DNS A record (§6.1), TLS cert + nginx flip (§6.2), verify (§6.3). - §7: Cleanup (was §6). - Top TOC: add §6 Signer host row, bump Cleanup to §7. - stage7 demo: cross-refs §1.3 → §6 for the cert+DNS steps; cross-ref to "cloud-setup.md §6" cleanup → §7. * docs(cloud-setup): §6.2 — derive SIGNER_HOST on broker host, not from $SIGNER_HOST Reported failure: `sudo certbot --nginx -d "$SIGNER_HOST"` on the broker host fell through to certbot's interactive vhost picker showing only broker.litentry.org. Root cause: $SIGNER_HOST is only exported on the operator workstation (scripts/operator-workstation.env), not on the broker host — empty -d arg → certbot's "pick from existing vhosts" fallback → only the broker vhost is offered. §6.2 now: - explicit warning that $SIGNER_HOST is workstation-only - adds a sanity-check `ls /etc/nginx/sites-enabled/agentkeys-signer` (catches the "setup-broker-host.sh wasn't re-run with signer code" case before certbot is invoked) - derives SIGNER_HOST inline from the nginx vhost (awk the server_name line setup-broker-host.sh just wrote) so the certbot command is copy-paste safe on a fresh broker shell with no env vars set * fix(setup-broker-host): default WITH_NGINX/CERTBOT auto → yes (was: auto → no) Reported failure: `sudo bash scripts/setup-broker-host.sh --yes` on a fresh broker host did not write the agentkeys-signer nginx vhost. Then `sudo certbot --nginx -d signer.<zone>` fell through to certbot's interactive vhost picker, which only listed broker.<zone> (because the broker vhost was written by an earlier run that had been done with --with-nginx). Root cause: WITH_NGINX defaulted to "auto", which resolved to "no" at line 361 — the comment said "preserves prior default" but every doc-driven operator expects nginx provisioning. The runbook (cloud-setup.md §5 + §6) explicitly assumes nginx is set up by the script. Now: auto → yes for both WITH_NGINX and WITH_CERTBOT. Operators who don't want nginx (running behind a non-nginx reverse proxy, pre-provisioned certs) opt out via --without-nginx / --without-certbot. The interactive preview already prints `nginx : $WITH_NGINX`, so the operator sees the resolved value before confirming. Also pin --with-nginx explicitly in cloud-setup.md §6.2 step 1 + step 3 so the doc remains correct even if the script default changes again. * docs(cloud-setup): §6.1 — warn against re-deriving EIP from local resolver Reported failure: operator's `dig +short broker.litentry.org A` returned 198.18.1.86 (RFC 2544 TEST-NET-2) because their local DNS resolver was behind a transparent proxy (Cloudflare WARP / Zscaler / Tailscale Magic DNS). Using that as $EIP would have published a Route 53 A record pointing at a private/loopback range, breaking Let's Encrypt validation silently — the symptom would surface 5 min later as "Timeout during connect (likely firewall problem)" with the wrong IP in the error. §6.1 now: - explicit callout that local resolvers behind WARP/Zscaler/Tailscale/ corporate VPNs return 198.18.0.0/15 for proxied hostnames - shows `aws ec2 describe-addresses` as the authoritative re-derivation - replaces fire-and-forget verify with a polling loop until Cloudflare DoH confirms the A record matches $EIP (Route 53 propagation up to TTL=300) §5.2 unchanged — within §5 the operator just set $EIP from AWS API in §5.1, so the local-resolver trap doesn't apply there. * docs(cloud-setup): deslop §1.3 + §6 — drop duplicated prose, keep table The §1.3 + §6 + §6.1 + §6.2 prose said the same thing 3-4 times (co-located today / future-split possible / "if the signer is ever moved" / "first run writes nginx, certbot, second run flips ssl"). Each new fix layered another paragraph on top instead of consolidating. Pass 1 — §1.3 collapsed from 12 lines to 1 (matches §1.2's defer-to-§5 shape; §6 has all the detail). Pass 2 — §6 intro: dropped 4-line prose paragraph above the table; folded "endpoints" + "exported as SIGNER_HOST" into the table itself so it's the single load-bearing reference. Dropped trailing prose paragraph about the env file (now in the Public-hostname row). Pass 3 — §6.1: collapsed standalone EIP-derive callout (10 lines of warning + 5 lines of fenced bash) into a 3-line guard inside the bash block (`[ -z "$EIP" ] && EIP=$(aws ec2 describe-addresses …)`). Kept the WARP/Zscaler/198.18.x.x context as a 4-line comment in the bash — load-bearing for diagnosis, would lose meaning if removed. Pass 4 — §6.2: dropped "Three host-side steps. setup-broker-host.sh is idempotent…" preamble paragraph (table already says this). Kept the $SIGNER_HOST=laptop-only callout (load-bearing — distinguishes laptop from broker host shell scope). No behavior change. All cross-refs intact (#6-signer-host, #51-allocate, signer-protocol, operator-workstation.env all still resolve). 60 code fences, balanced. * fix(setup-broker-host): drop --with-nginx / --with-certbot — defaults are yes The flags were redundant once defaults flipped to yes (commit a3a0a84). Per CLAUDE.md remote-broker-host policy the script is the single idempotent entry point — flag-gating "do the thing the runbook always wants" is noise. Drop both --with-* flags + the auto-resolution dead-code; keep --without-nginx / --without-certbot as the only opt-out. - WITH_NGINX / WITH_CERTBOT default to "yes" outright (no more "auto" three-state); 12-line auto-resolution block becomes a 2-line comment. - CLI parser drops --with-nginx / --with-certbot. Passing the removed flags now errors `unknown flag: --with-nginx` rather than silently no-op'ing. - Header usage block + interactive defaults comment updated to match. - docs/cloud-setup.md §6.2: drop --with-nginx from both invocations (replace_all over the doc). No behavior change for operators following the runbook — `--yes` alone already provisioned nginx since a3a0a84. This commit only removes the explicit `--with-nginx` redundancy. * docs(claude+stage7): runbook-fix-fold-back policy + absorb session fixes CLAUDE.md - New "Runbook-fix-fold-back policy": when an operator hits a runbook failure, both the targeted fix AND a runbook revision must land in the same turn. Goal: every operator-encountered failure makes the runbook strictly more robust before we move on. stage7-demo-and-verification.md (§0) Absorbs every failure the operator hit walking this PR end-to-end: - §0 Tooling: pulled CLI build out of a sub-bullet into a numbered ordered checklist (cargo build → cp to ~/.local/bin → which/version smoke-test → init). Explicit warning against path-relative aliases (the recurring "alias agentkeys=./target/release/agentkeys-cli" trap with the wrong binary name from before the agentkeys-cli → agentkeys rename). Spells out crate-name vs binary-name distinction. - §0.1: branch-agnostic checkout via `BRANCH="${BRANCH:-evm}"` (was hardcoded `git checkout evm` — broke when validating PR branches). Adds nginx vhost sanity-checks: `ls /etc/nginx/sites-enabled/ agentkeys-{broker,signer}` + grep for proxy_pass-vs-return-503 inside agentkeys-signer (catches the "cert issued but script not re-run, vhost still serves stub 503" failure mode). - §0.2: smoke-test now string-matches body == "ok" (a successful HTTP 200 with body "TLS cert not yet issued for signer …" is the exact trap operators hit when certbot succeeded but step 3 of §6.2 wasn't run). Adds a 5-row "common failure modes" table mapping observed body → cause → exact fix command. §16 line 1402's `git checkout evm` left as-is — that section is intentionally evm-specific (verifies the live prod broker). * docs(stage7): §0 install — drop conflicting aliases + verify $PATH wins Operator hit `which agentkeys` → "aliased to ./target/release/agentkeys-cli" even after `cp target/release/agentkeys ~/.local/bin/`. zsh aliases beat $PATH lookups (and the alias also pointed at the wrong binary name — the crate is agentkeys-cli but the [[bin]] is `agentkeys`), so the install was invisible no matter how correctly it was staged. §0 build checklist now goes 5 steps in this order: 1. sed-strip any `alias agentkeys[-= ]…` from ~/.zshenv + ~/.zshrc (with .bak), then `unalias` for the current shell. Fail-soft (`|| true`) so missing files don't abort. 2. Append `~/.local/bin` to $PATH if not already there (idempotent case statement; appends to ~/.zshenv). 3. cargo build (was step 1). 4. cp to ~/.local/bin (was step 2). 5. `hash -r` + `command -v agentkeys` (NOT `which`) — bypasses any alias zsh hasn't re-hashed away yet. Spells out the expected absolute-path output. Plus a tiered fallback callout: if `command -v` still shows the alias, grep ~/.zprofile / ~/.aliases / shell includes for stragglers, then `exec zsh -l`. Per Runbook-fix-fold-back policy (CLAUDE.md): operator failure → both the fix command (handed back inline last turn) AND the runbook revision land in the same turn. Next operator running this top-down won't hit the alias trap. * docs(stage7): §0.2 — pin BACKEND_URL inline + bail-loud on stale value Operator hit `curl: (7) Failed to connect to 127.0.0.1 port 18090` because their shell had a stale `BACKEND_URL=http://127.0.0.1:18090` local-dev export in ~/.zshenv that shadowed operator-workstation.env's BACKEND_URL=$AGENTKEYS_SIGNER_URL alias. §0.2 now: - Pins `export BACKEND_URL="$AGENTKEYS_SIGNER_URL"` inline so the smoke-test is self-contained (no longer depends on ~/.zshenv being un-shadowed). - Adds a defensive `case "$BACKEND_URL" in https://signer.*) ;; esac` bail-loud check BEFORE the curl, with a one-line diagnosis (`grep -n BACKEND_URL ~/.zshenv && unset && re-source`). - Echoes BACKEND_URL alongside SIGNER_HOST so the operator visually confirms the value is public https:// before hitting curl. Per Runbook-fix-fold-back: failure command + cause + fix command all inline in the runbook so the next operator with a stale local-dev shell doesn't have to round-trip with the maintainer to diagnose. * Revert "docs(stage7): §0.2 — pin BACKEND_URL inline + bail-loud on stale value" This reverts commit 11e59ce5da0b20d12bf6c07909160c506ce4d101. * docs(stage7): fix --json position — global flag, must precede subcommand Operator hit `error: unexpected argument '--json' found` running §0.4's `agentkeys signer derive --signer-url … --omni-account … --json`. Per crates/agentkeys-cli/src/main.rs:24-25, --json is a top-level flag on the root `agentkeys` command (controls ctx.json_output globally), NOT a per-subcommand flag on `signer derive` / `signer sign`. Clap rejects it after the subcommand's required args. Eight occurrences fixed across §0.4 (×2), §3 SIG_A/SIG_ADDR/SIG_B (×3 multi-line), and §16 live walkthrough (×3 single-line): agentkeys signer derive … --json | jq … → agentkeys --json signer derive … | jq … agentkeys signer sign … --json | jq … → agentkeys --json signer sign … | jq … Plain text-output calls at lines 1047 and 1099 left unchanged (no --json there to begin with). Per Runbook-fix-fold-back: clap arg ordering is non-obvious for top-level vs subcommand flags, so the runbook command examples must match the actual CLI grammar — operators copy-paste, they don't re-read the clap macro. * docs(stage7): §0.4 — inline `agentkeys init --email` step before derive Operator hit `Error: SIGNER_UNAUTHORIZED invalid session JWT: InvalidToken` running §0.4's first signer derive call. The §0.4 intro said "Run agentkeys init first if you haven't already" but never showed the actual command — operators don't know to look ahead 100 lines to §2.0 for the real `--email --broker-url --signer-url` invocation. §0.4 now: - Explicit "must run first OR every call below returns SIGNER_UNAUTHORIZED" callout (with the literal error message so operators searching the doc for the error find the fix). - Inline `agentkeys init --email alice@demo.example --broker-url $OIDC_ISSUER --signer-url $BACKEND_URL` as a copy-paste block, with the expected "Initialized via email-link" output. - Cross-link to §2.0 for explanation + OAuth2 alternative — minimal in §0.4, full context in §2.0. §2.0's existence preserved: it still has the magic-link explanation + OAuth2 alternative + daemon-side equivalent. §0.4's inline init is the minimum to keep the §0 prereq chain self-contained. Per Runbook-fix-fold-back: a runbook step that says "run X first" must include the literal X invocation, not just point at it. * feat(broker): real SES email sender — Pass 1 of Option B Pass 1 implementation per .omc/ralph/prd.json: ships the SesEmailSender behind the auth-email-link feature, with end-to-end SES → S3 round-trip integration test. Pass 2 (separate commit) wires boot.rs + setup-broker-host.sh + broker.env defaults + demo doc. Closes the gap that blocked the operator's stage-7 demo init flow: the deployed broker had only StubEmailSender (in-process Vec, no delivery). With this change + Pass 2, `agentkeys init --email` will deliver a real magic-link to the operator's inbox. US-1: Cargo.toml deps - aws-sdk-sesv2 = "1" added as optional dep gated by auth-email-link - aws-sdk-s3 + uuid added to dev-dependencies for the integration test - dev-deps now enable auth-email-link so tests/* compile by default US-2: SesEmailSender impl (crates/agentkeys-broker-server/src/plugins/auth/email_link.rs) - send_magic_link composes multipart text+html via aws-sdk-sesv2 SendEmail - verify_sender_ready calls GetEmailIdentity + checks verified_for_sending - Errors map to EmailSendError::{Send, Verify, Config} - Inline subject + body templates (no template-engine dep) - Re-exported from src/plugins/auth/mod.rs US-3: Body composition unit tests (4 added) - ses_subject_is_non_empty - ses_text_body_contains_landing_url - ses_html_body_contains_landing_url_twice (href + visible text) - ses_text_and_html_alternatives_both_present US-4: Integration test (crates/agentkeys-broker-server/tests/ses_email_flow.rs) - Gated by RUN_SES_INTEGRATION_TESTS=1 + #[ignore] - CleanupGuard Drop impl: list-and-delete every S3 object whose body contains the per-test UUID, even on panic - Polls inbound/ prefix for up to 60s (5s × 12 attempts) - Asserts MIME body contains both unique token AND landing URL (allowing for quoted-printable encoding of '=' as '=3D') US-5: Quality gates ALL GREEN - cargo build -p agentkeys-broker-server → exit 0 - cargo build -p agentkeys-broker-server --features auth-email-link → exit 0 - 161 lib tests pass; integration test compiles + skips gracefully - cargo clippy --no-deps -- -D warnings → exit 0 - (Pre-existing clippy warning in agentkeys-core/src/init_flow.rs:177 unrelated; will tackle in Pass 2 if it blocks.) US-6: BLOCKED on operator — live SES round-trip - Operator runs: awsp agentkeys-admin RUN_SES_INTEGRATION_TESTS=1 ACCOUNT_ID=429071895007 \ cargo test -p agentkeys-broker-server --features auth-email-link \ --test ses_email_flow -- --ignored --nocapture * fix(broker): SesEmailSender verify — fall back from address to domain identity Operator hit `NotFoundException: Email identity <noreply@bots.litentry.org> does not exist` running the SES integration test. Cause: SES GetEmailIdentity returns identities EXPLICITLY registered with `create-email-identity`. cloud-setup.md §2.1 verifies the DOMAIN (`bots.litentry.org`), which auto-grants sending rights to ANY address at that domain via DKIM — but the per-address identity (`noreply@bots.litentry.org`) was never registered. So the verify precheck failed even though the actual SendEmail would succeed. Fix: verify_sender_ready now tries address-level lookup first (preferred — explicit), then on NotFound falls back to extracting the domain (split on '@') and looking up the domain identity. Either passing → Ok(()). Helper extracted: check_identity(client, identity) → Result<(), String> returns Ok only when SES reports the identity exists AND verified_for_sending_status=true. Used by both attempts. No behavior change for operators who explicitly verify per-address; unblocks the canonical operator path (verify-domain-only) per cloud-setup.md §2.1. Closes the verify-precheck blocker on Pass 1's US-6 (live SES round-trip from operator). Quality gates re-checked: - cargo build -p agentkeys-broker-server --features auth-email-link → ok - cargo test -p agentkeys-broker-server --features auth-email-link --lib → 161 passed - cargo clippy -p agentkeys-broker-server --features auth-email-link --tests --no-deps -- -D warnings → ok * feat(ses): explicit per-address verify + ses-verify-sender.sh helper Per operator request after Pass 1: 1. drop the address→domain fallback in SesEmailSender::verify_sender_ready — explicit per-address verification only 2. register noreply-test@bots.litentry.org as a per-address SES identity and pin it in operator-workstation.env 3. give the operator a one-shot bash helper that exploits the existing SES inbound receipt rule (cloud-setup.md §2.1) to fully automate the address verification — no inbox-clicking, no manual MIME parsing Code (crates/agentkeys-broker-server/src/plugins/auth/email_link.rs): - verify_sender_ready: single GetEmailIdentity call on the FROM address. No fallback. Error message points the operator at `aws sesv2 create-email-identity` (and at scripts/ses-verify-sender.sh for the automated path) so the next failure self-diagnoses. - Removed check_identity helper (was the fallback shared call). Test (crates/agentkeys-broker-server/tests/ses_email_flow.rs): - TestEnv now reads BROKER_EMAIL_FROM_ADDRESS — same env var the broker reads at runtime (env.rs:143). One source of truth between the test + the broker process. - Default: noreply-test@${MAIL_DOMAIN} (was: hardcoded noreply@…). Env (scripts/operator-workstation.env): - New: MAIL_DOMAIN (bots.litentry.org), MAIL_BUCKET, BROKER_EMAIL_FROM_ADDRESS. - MAIL_DOMAIN is explicit (not derived from BROKER_HOST) — broker zone may differ from email subdomain. Helper (scripts/ses-verify-sender.sh, +x): - One-shot: aws sesv2 create-email-identity → poll s3://$MAIL_BUCKET/inbound/ for the SES verification mail (lands there via the existing receipt rule from cloud-setup.md §2.1) → grep verification URL out of the quoted-printable body → curl-click it → confirm VerifiedForSendingStatus → delete the verification mail from S3 so it doesn't pollute the inbox. - Idempotent: re-running on a verified identity exits 0 immediately. - Requires: aws + jq + curl + grep + sed (all present on macOS / Ubuntu). Quality gates: - cargo build -p agentkeys-broker-server → ok - cargo build -p agentkeys-broker-server --features auth-email-link → ok - cargo test -p agentkeys-broker-server --features auth-email-link --lib → 161 passed - cargo test -p agentkeys-broker-server --features auth-email-link --test ses_email_flow → 1 ignored (skips) - cargo clippy -p agentkeys-broker-server --features auth-email-link --tests --no-deps -- -D warnings → ok * fix(ses-verify-sender): drop FROM-grep prereq — never matched QP-encoded body Operator hit "endless waiting" — the script polled S3 forever even though SES had likely written the verification mail. Two bugs in the polling predicate: 1. `grep -q "$FROM"` looked for the literal `noreply-test@bots.litentry.org` string, but in a quoted-printable MIME body the `@` is encoded as `=40` so the literal grep never matched. 2. `grep -qE 'ses[._-]?verification|amazonaws\.com.*verify'` matched `ses-verification` patterns, but the actual SES URL host is `email-verification.<region>.amazonaws.com` — neither alternative hit. Fix: drop both prereq greps. SES verification URLs are unique enough that matching the URL pattern directly is sufficient — no false positives. Also added per-attempt diagnostics: - log "$count object(s) under inbound/" each iteration so the operator can see whether anything is landing at all - on timeout: structured 3-step diagnosis pointing at receipt-rule state, identity status, and bucket contents Refactored URL extraction into extract_verify_url() helper (single source of truth) — handles quoted-printable soft-wrap (=\n) + =3D decoding. * fix(ses-test): CleanupGuard Drop — block_in_place to allow nested block_on Operator hit the test panic at line 145: "Cannot start a runtime from within a runtime. This happens because a function (like `block_on`) attempted to block the current thread while the thread is being used to drive asynchronous tasks." Cause: `Handle::block_on` is forbidden when called from inside a tokio runtime context. Drop runs WHILE still inside #[tokio::test]'s runtime (the runtime hasn't shut down by the time Drop fires for `let _guard =`), so the previous code panicked even though we had `try_current → Ok` to "detect" the active runtime. Test ran end-to-end successfully BEFORE this Drop panic — log shows: ses_email_flow: found inbound object key=inbound/8dqr… (attempt 1) …the assertions never got to run because Drop tore down first. Fix: wrap `handle.block_on(cleanup_fut)` in `tokio::task::block_in_place`, which suspends the current async task so a nested blocking call is legal. Requires multi_thread runtime — already guaranteed by `#[tokio::test(flavor = "multi_thread")]` on the test attribute, no behavior change for the rest of the test. The `Err(_) → Runtime::new()` branch is preserved as a fallback for the edge case where Drop fires AFTER the runtime has been torn down (e.g. test panic during runtime shutdown). Won't normally trip in practice. * fix(ses-test): unbuffered per-attempt logging + bounded object scan Operator hit "test has been running for over 60 seconds" with no per-attempt log lines visible. Two underlying problems: 1. println! is line-buffered, and `cargo test --nocapture` pipes stdout (not a TTY), so the per-attempt "attempt N/12 — sleeping" lines were buffered until end-of-test. Looked like a hang from the operator side. 2. The poll loop did `list_objects_v2()` then iterated EVERY object's body. With cumulative SES inbound (test runs + verification mails), each iteration could scan dozens of objects, which is both slow and buries the relevant log lines. Fix: - New `log()` helper writes to STDERR (unbuffered) + explicit flush after every line. Operator sees progress in real time. - `eprintln!` for every step: * configuration echo (account / region / bucket / from / to / token) * verify_sender_ready in-progress + result * send_magic_link in-progress + result * per-attempt: list_objects_v2 call + total bucket size + how many we'll examine * per-object: index/total, key, size in bytes, contains-token Y/N * found / not-found summary per attempt - Scan limit: sort objects by LastModified desc, examine only the 20 most recent per iteration. Keeps the loop fast even when the bucket has thousands of stale objects. - list_objects_v2 errors no longer expect-panic; logged + retried next iteration. Gives the test a chance to recover from transient throttling. - Timeout panic now lists the 4 most likely root causes (sandbox + unverified recipient, suppressed address, receipt-rule inactive, region mismatch) with the diagnostic command to check each. No behavior change to the AWS interactions — purely observability + robustness against transient errors. * fix(ses-test): explicit async cleanup via catch_unwind — no more Drop guard Operator hit "test ok — CleanupGuard will purge inbound objects on Drop" followed by … nothing. No "deleted" log line ever printed. Bucket has 415 stale objects from prior runs — cleanup has been silently failing for a while. Root cause: Drop fires WHILE the tokio runtime is in shutdown handoff. `block_in_place` + nested `block_on` is touchy in that window — runs silently, hangs, or both. The pattern was wrong from the start. Fix: drop the Drop-based pattern entirely. - Test body extracted into `run_send_and_poll(...)` helper. - Outer test fn wraps it in `AssertUnwindSafe(...).catch_unwind().await` — captures any panic into Result without unwinding. - `cleanup_test_objects(...)` runs ALWAYS, in plain async context, with the same unbuffered `log()` helper as the test body. Logs every key it inspects + every delete + final count. - Captured panic is re-raised AFTER cleanup so test failure semantics are unchanged: the test still fails on assert! / expect, just AFTER cleanup has visibly run. Required new dev-dep: `futures-util = "0.3"` for `FutureExt::catch_unwind` on async futures. Standard tokio-test pattern. Net: cleanup now runs inside the runtime as a normal async call, can't hang on shutdown handoff, and prints every step. Note for operator: the existing 415 stale objects need a one-shot purge. Run from operator workstation: aws s3 ls s3://agentkeys-mail-${ACCOUNT_ID}/inbound/ --recursive | awk '{print $4}' | while read -r key; do body=$(aws s3 cp "s3://agentkeys-mail-${ACCOUNT_ID}/$key" - 2>/dev/null) if echo "$body" | grep -q 'magic-link-test-'; then aws s3 rm "s3://agentkeys-mail-${ACCOUNT_ID}/$key" fi done * perf(ses-test): cleanup fast-path — single DeleteObject vs 415-object scan Test took 211s end-to-end. Poll was instant (attempt 1, found in 1 RPC). Cleanup was the bottleneck: scanned all 415 inbound/ objects, fetching each body to check the per-test UUID. ~415 GetObject × ~500ms = ~3 min. Fix: poll already knows the exact key it found — pass it to cleanup. - run_send_and_poll takes Arc<Mutex<Option<String>>> as found_key_slot and writes the matching key into it on hit. - Outer fn drains the slot post-catch_unwind and passes Option<String> to cleanup_test_objects(s3, bucket, token, fast_key). - cleanup_test_objects: if fast_key=Some, single DeleteObject (~1 RPC). - Slow scan path preserved for the panic-before-find case (rare). Per-token body match retained for the slow scan — production-safe via UUID collision probability of ~10^-38. Expected runtime drop: 211s → ~5s (1s SendEmail + 1s ListObjects + 1s GetObject + 1s DeleteObject + ~1s overhead). * feat(broker): Pass 2 of Option B — wire SesEmailSender end-to-end Closes the original gap that blocked stage-7 demo init: the deployed broker had only `wallet_sig` enabled, was built without `auth-email-link`, and `agentkeys init` only supports email/oauth2 — so the broker fundamentally couldn't be initialized via the CLI. Pass 2 wires the SesEmailSender (from Pass 1) into broker boot + deployment, so `agentkeys init --email` works end-to-end against the deployed broker. Code: - crates/agentkeys-broker-server/src/env.rs: new BROKER_EMAIL_SENDER env var (`stub` | `ses`, default stub for back-compat). - crates/agentkeys-broker-server/src/boot.rs: branch on BROKER_EMAIL_SENDER. When `ses`, construct SesEmailSender via aws_config::defaults().load() using block_in_place + block_on (legal under multi-thread #[tokio::main]). When `stub`, preserve previous behavior. Unknown value → boot_fail. Deployment: - scripts/setup-broker-host.sh: * cargo build now passes `--features auth-email-link` (previously default-features only — that was the structural gap). * New section 4b: mints /etc/agentkeys/email-hmac.key (32 random bytes via openssl rand, mode 0600, owner agentkeys). Idempotent. * agentkeys-broker.service systemd unit gets new env vars: BROKER_AWS_REGION, BROKER_AUTH_METHODS=wallet_sig,email_link, BROKER_EMAIL_SENDER=ses, BROKER_EMAIL_FROM_ADDRESS=..., BROKER_EMAIL_HMAC_KEY_PATH=/etc/agentkeys/email-hmac.key. * New `--email-from <addr>` CLI flag + BROKER_EMAIL_FROM_ADDRESS env var fallback (default noreply-test@bots.litentry.org). Env defaults: - scripts/broker.env: BROKER_AUTH_METHODS now includes email_link; documented BROKER_EMAIL_SENDER, BROKER_EMAIL_FROM_ADDRESS, BROKER_EMAIL_HMAC_KEY_PATH. Quality gates: - cargo build --features auth-email-link → ok - cargo test --features auth-email-link --lib → 161 passed - cargo clippy --features auth-email-link --tests --no-deps -- -D warnings → ok - bash -n scripts/setup-broker-host.sh → ok What's next (this commit doesn't include): - GH issue documenting the original gap (item 3 of operator's request). - stage7-demo doc updates to confirm the now-working init flow (item 4). * docs: backfill issue #80 reference in setup-broker-host.sh comment * docs(stage7): §0.4 + §2.0 — add Pass-2 prereqs (ses-verify-sender + auth-email-link build) Operator hit issue #80 walking the demo: the deployed broker rejected /v1/auth/email/request with 404. Pass 2 of Option B (8ef973a) closed the gap — broker now builds with --features auth-email-link, has BROKER_AUTH_METHODS=wallet_sig,email_link, and uses real SesEmailSender. Demo doc updates: - §0.4: new "two-step prereq" callout listing the ses-verify-sender.sh step + the broker-host re-deploy. Cross-refs issue #80 so operators who Google the failure find the fix. - §2.0: brief prereq pointer + acknowledgment that magic-link is now delivered via real SES (FROM noreply-test@bots.litentry.org), not the prior in-process StubEmailSender. No operational step changes — just makes the documented init flow match what's actually deployable end-to-end after Pass 2 lands. * refactor(email_link): drop vestigial HMAC key — magic-link is stateful per arch.md Operator pointed out that HMAC isn't in our K-table architecture: docs/spec/architecture.md §3 (K1–K11 inventory) lists no HMAC key, and §5a.1.M Stage 1 + §4 row "email-link" describe the magic-link as **stateful**: "Broker emails magic link; operator clicks; broker confirms single-use within TTL." Audit showed `EmailLinkAuth.hmac_key` was loaded + validated (≥32 bytes) but **never used cryptographically anywhere in the email_link module**. Verified by `grep -rn 'self\.hmac_key\|sign_token\|HmacSha\|Mac::new' crates/agentkeys-broker-server/src/plugins/auth/email_link.rs` → zero matches. Vestigial dead code from an earlier design that planned self-verifying tokens but never landed. The actual security comes from: - Token randomness (32 bytes CSPRNG via getrandom) - SHA256(token) lookup (no plaintext token in SQLite) - TTL check (10 minutes per Plan §3.5.3) - Single-use enforcement (consume_token marks consumed) No HMAC needed. Remove the dead weight + the operator-facing wiring: Code: - crates/agentkeys-broker-server/src/plugins/auth/email_link.rs: drop `hmac_key` field, constructor param, length validation; drop `hmac_key_too_short_rejected` test; drop `vec![0u8; 32]` from test helper; drop now-unused `use crate::env;`. - crates/agentkeys-broker-server/src/boot.rs: drop hmac_path/hmac_key load block; drop arg from EmailLinkAuth::new call; reframe boot_fail anchor to BROKER_EMAIL_FROM_ADDRESS (the still-required var). - crates/agentkeys-broker-server/src/env.rs: drop BROKER_EMAIL_HMAC_KEY_PATH constant + introspection table entry. - crates/agentkeys-broker-server/tests/email_flow.rs: drop `vec![0u8; 32]` from EmailLinkAuth::new call. Deployment: - scripts/setup-broker-host.sh: drop section 4b (email-hmac.key generation); drop Environment=BROKER_EMAIL_HMAC_KEY_PATH from systemd unit. - scripts/broker.env: drop BROKER_EMAIL_HMAC_KEY_PATH entry; replace with explanatory comment pointing at arch.md §5a.1.M. Demo: - docs/stage7-demo-and-verification.md §0.4 prereq + §2.0 prereq: drop "+ email-HMAC key" wording; reference arch.md §5a.1.M for the stateful design rationale. OAuth2's state_hmac_key (oauth2/mod.rs:394) is unaffected — that one IS load-bearing (HmacSha256 signs the OAuth state parameter for integrity across redirect). Quality gates: - cargo build -p agentkeys-broker-server → ok - cargo build -p agentkeys-broker-server --features auth-email-link → ok - cargo test -p agentkeys-broker-server --features auth-email-link --lib → 160 passed (was 161; -1 = removed hmac_key_too_short_rejected) - cargo clippy --features auth-email-link --tests --no-deps -- -D warnings → ok - bash -n scripts/setup-broker-host.sh → ok * docs(policy): add no-hardcoded-values policy + hardcoded.md audit log Operator request: enforce that no hardcoded values land in scripts/code/ runbooks unless logged in a dedicated audit doc. CLAUDE.md - New "No-hardcoded-values policy" between Runbook-fix-fold-back and Plan-completion. Says: parameterize via env / CLI / config; if temporarily hardcoded, log in hardcoded.md with file+line, why, and the unblock action. hardcoded.md (NEW) - Seeded with the existing operator-deployment-pinned values (ACCOUNT_ID, BROKER_HOST, MAIL_DOMAIN, BROKER_EMAIL_FROM_ADDRESS, BROKER_DATA_ROLE_ARN), the deployment-architecture-pinned values (loopback ports 8090/8091/8092, agentkeys system user, /etc/agentkeys paths), and code-level constants (TOKEN_TTL_SECONDS, rate-limit defaults, SES integration test defaults). - Each entry: what's hardcoded, why, what would unblock making dynamic. - Open trade-off section flags the email_link HMAC removal (b8481fe) for revisit when scaling to multi-broker-replica deployments. scripts/broker.env (smell fix called out in hardcoded.md) - Add ACCOUNT_ID=429071895007 as the single source of truth. - Derive BROKER_DATA_ROLE_ARN from \${ACCOUNT_ID} (was hardcoded separately, drifted from operator-workstation.env's ACCOUNT_ID). - Verified: `set -a; source ./scripts/broker.env; set +a` expands ACCOUNT_ID + BROKER_DATA_ROLE_ARN correctly. * docs(hardcoded): cross-link HMAC trade-off to issue #81 — bidirectional traceability * fix(ses-verify-sender): fail loud on wrong AWS profile + fold profile switch into stage7 doc The script previously masked AccessDenied from list-objects-v2 with '2>/dev/null || true', manifesting as endless 'attempt N/24 - 0 object(s) under inbound/' polling when the operator forgot to switch to agentkeys-admin profile (the broker user lacks s3:ListBucket on the mail bucket per cloud-setup.md section 2.1). Two changes: 1. Script now preflights 'aws sts get-caller-identity' + a ListObjectsV2 probe before entering the poll loop. Wrong-profile case dies with explicit 'Run: awsp agentkeys-admin' guidance instead of silently spinning. Also drops the 2>/dev/null mask on the poll-loop list call now that preflight proves the cred path. 2. Stage 7 demo doc section 0.4 prereq block now shows the awsp + set -a;source;set +a sequence inline, with a callout naming the previous failure mode so the next operator recognizes it immediately. Reproduced locally: AWS_PROFILE=agentkey-broker bash scripts/ses-verify-sender.sh -> exits 1 with: 'wrong AWS profile: arn:...:user/agentkey-broker lacks s3:ListBucket on agentkeys-mail-429071895007. Run: awsp agentkeys-admin then re-run this script.' User approved one-shot raw-git use because this dir is a git-linked worktree (.git is a file pointing back to parent repo); jj root resolves to parent and cannot see these paths. * fix(setup-broker-host): die loud with journal on healthz failure post-restart Root cause: the post-restart healthz check used a single 5s curl with '|| warn' — a service in systemd Restart=always loop (e.g. broker crashing on BROKER_AUTH_METHODS=email_link with binary built without --features auth-email-link) shows up as a one-line warn the operator scrolls past, and the script exits 0. Operator declares the host healthy, then 30 minutes later hits 502 Bad Gateway from nginx and has to re-diagnose from scratch. Three changes: 1. scripts/setup-broker-host.sh — replace the warn-only one-shot curl probes with probe_or_die(): poll /healthz for 20s per service (10x 2s with --max-time 2), and on persistent failure dump 'systemctl status' + last 40 journal lines for the failing unit, then die with a fix-list naming the three most common boot crashes (gated-out feature, missing FROM address, AWS creds). 2. docs/stage7-demo-and-verification.md §0.4 prereq #2 — instruct operator to 'rm -f target/release/agentkeys-broker-server' before re-running the script (cargo's incremental cache occasionally leaves the wrong artifact in place when feature flags change across rebuilds; clean target avoids the failure mode entirely). Plus a '502 Bad Gateway' troubleshooting block pointing at the journal grep + the canonical fix. 3. Same doc — name the exact boot-crash error string ('unknown or feature-gated-out auth method') the next operator will see, so they don't have to round-trip with logs. Per runbook-fix-fold-back policy: every operator-encountered failure makes the runbook strictly more robust before we move on. * deslop(setup-broker-host): drop dead helpers + dedupe + fix latent cred-mode case bug Pass-by-pass cleanup of scripts/setup-broker-host.sh, behavior preserved (verified by grep-locking 17 critical strings: env vars, ports, paths, systemd unit names, feature flags, function calls). Net -75 lines (1019 -> 944, -7.4%). Pass 1 — Dead code: - Drop prompt_default() and prompt_choice() (defined but never called). - Drop --skip-pull flag, PULL_SKIP var, and the redundant '! $PULL_SKIP' guard (the outer '[[ -n "$PULL_REF" ]]' already gates the pull). --skip-pull is now folded into the --upgrade no-op arm so existing callers still parse cleanly. Pass 1b — Latent bug fix: - The 'case "$CRED_MODE"' block in the trailing manual-steps section had a duplicate 'instance-profile)' arm: the FIRST one was reached but contained text describing 'none mode'; the SECOND (which had the correct instance-profile text) was unreachable dead code; and 'none' mode users got NO instructions at all because no 'none)' arm existed. Renamed the first arm to 'none)' so all three modes now print their intended manual-steps text. Pass 2 — Duplicate consolidation: - Three near-identical 'if [[ -d /etc/nginx/sites-enabled ]]; then ln -sf … fi' blocks (broker, signer-HTTPS, signer-HTTP-only) collapsed into ONE block after write_nginx_site returns. ln -sf is idempotent so this is behavior-equivalent. - certbot install: 'case "$PM"' had two arms with identical package list ('certbot python3-certbot-nginx'); collapsed to a single '"${PM_INSTALL[@]}" certbot python3-certbot-nginx' invocation. Pass 3 — Comment trim: - 58-line header reduced to 18 lines: dropped the 'Order of operations' enumeration (duplicated by the section comments inline) and the --flag enumeration (duplicated by the case parser + --help dump). Kept the canonical 'CLAUDE.md says all remote-host changes go through this script' rule + out-of-scope list. Idempotency audit (no changes needed — already correct): • build deps: apt/dnf -y, idempotent • rustup install: gated 'if ! have rustup' • systemctl stop: '|| true' • binary backup: gated 'if [[ -x ]]' • install -m 0755: overwrite-OK • useradd: gated 'if ! id -u agentkeys' • install -d: idempotent • DEV_KEY_SERVICE secret: gated 'if ! sudo test -s' (never regenerated) • systemd unit writes: tee overwrites — intended each run • nginx install: gated 'if ! have nginx' • nginx site write: tee overwrites — intended (handles HTTP→HTTPS flip) • sites-enabled ln -sf: -f forces, idempotent • certbot install: gated 'if ! have certbot' • ensure_broker_keypairs: per-keypair 'if sudo test -f' guard • daemon-reload, enable, restart: idempotent Verification: bash -n scripts/setup-broker-host.sh # syntax ok grep -F locked 17 critical strings # all present * fix(setup-broker-host): cargo multi-package + --features footgun strips auth-email-link Root cause of the broker host's repeated 'BOOT_FAIL: BROKER_AUTH_METHODS= "email_link": unknown or feature-gated-out auth method' even after a fresh target/ rebuild: the script used a SINGLE cargo invocation to build BOTH agentkeys-mock-server AND agentkeys-broker-server with '--features agentkeys-broker-server/auth-email-link', and cargo silently DROPS the feature flag in this multi-package selection mode. Reproduced empirically with --message-format json: cargo build --release -p agentkeys-mock-server -p agentkeys-broker-server \ --features agentkeys-broker-server/auth-email-link → broker compiled features: [audit-sqlite, auth-wallet-sig, default, wallet-keystore] ← NO auth-email-link vs the working separate form: cargo build --release -p agentkeys-broker-server --features auth-email-link → broker compiled features: [audit-sqlite, auth-email-link, auth-wallet-sig, default, wallet-keystore] ← present Fix: 1. Split the build into two separate cargo invocations — mock-server alone (default features), broker-server alone with the feature flag. Documented the footgun in a long block comment so the next person who 'optimizes' by re-merging them will read why before doing it. 2. Added a post-build sanity check: 'strings target/release/agentkeys- broker-server | grep /v1/auth/email/(request|verify)' must match before install + restart. If the cargo footgun ever resurfaces (or anyone introduces a similar feature-strip bug), the script dies HERE with a clear diagnostic instead of after install + systemd restart loop + journal dump. Verified locally: bash -n scripts/setup-broker-host.sh # syntax ok strings target/release/agentkeys-broker-server | grep /v1/auth/email → /v1/auth/email/request /v1/auth/email/verify /v1/auth/email/status /v1/auth/email/landing (all four routes present) * fix(setup-broker-host): assert via cargo --message-format=json + cargo clean -p The previous fix (commit 6d75599) split the cargo build into separate invocations to defeat the multi-package + --features footgun, but the broker host STILL deployed binaries lacking auth-email-link. Two real root causes survived: 1. CARGO INCREMENTAL CACHE: 'rm -f target/release/agentkeys-broker-server' only removed the output binary, not target/release/deps/.fingerprint/ nor the per-feature-set cached .rlib deps. On a host that previously built without auth-email-link, cargo's incremental could relink from stale deps and produce a binary missing the feature even when the build call was correct. Fix: 'cargo clean -p agentkeys-broker-server --release' before the rebuild — only ~1s, only this crate's cache. 2. WEAK VERIFICATION: 'strings | grep -qE "/v1/auth/email/request"' is a heuristic that: - false-positives on tower middleware names containing 'email' - false-negatives when LTO dedupes string literals across the binary - dies with an unactionable 'this is the cargo footgun' guess that was wrong (the call was correct; the host environment was the bug) Replace with: parse cargo's own --message-format=json output and ASSERT auth-email-link is in the bin artifact's features list. Cargo's reported features ARE the truth — no heuristic. Critical bash detail: cargo --message-format=json sends NDJSON to stdout and compiler messages to stderr. Merging them with '2>&1' corrupts the NDJSON and jq dies with 'Invalid numeric litera…
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Lands the full Stage 7 pluggable broker (issue #64) running live at
https://broker.litentry.org, completes the OIDC-only auto-provision migration (issue #71 Option A — dropsmint_legacy+ static IAM user +AssumeRoletrait), and includes the operator demo + verification guide validated end-to-end against the live broker on AWS.What changed
Architecture
crates/agentkeys-{provisioner,mcp,cli}now fetchPOST /v1/mint-oidc-jwtthen doAssumeRoleWithWebIdentityclient-side. Server-side/v1/mint-aws-credsaggregator path is gone (legacymint_legacyhandler +looks_like_session_jwtheuristic deleted).BROKER_OIDC_ISSUERis now refuse-to-boot if unset (no silent fallback to a hardcoded URL — Codex adversarial-review M1)./v1/mint-oidc-jwtnow verifies the session JWT locally against the broker's session keypair instead of round-tripping to backend/session/validate(matches/v1/mint-aws-credspost-migration; closes the §3 demo 401).StsClient::assume_role+AwsStsClient::from_keysremoved. Broker holds zero AWS principals at runtime —AssumeRoleWithWebIdentityhappens client-side with the daemon's OIDC JWT.DAEMON_ACCESS_KEY_ID+BROKER_DAEMON_*env vars dropped. Static-IAM-user branch inmain.rsdeleted.BROKER_AGENT_ROLE_ARN/ACCOUNT_ID/REGIONlegacy aliases stay (still used bysetup-broker-host.sh).Live broker deploy
scripts/setup-broker-host.shis now one idempotent script. Bootstrap + upgrade detection auto-runs based on whether a unit file already exists. Reads existing config from/etc/systemd/system/agentkeys-broker.serviceEnvironment=lines./healthz(Kubernetes convention) across mock-server, broker, and docs./healthalias dropped./readyzbody always self-describing —{"status":"ready"|"degraded"|"unready", "degraded": bool, "checks":[…], "ready":[…]}. Empty{}reply removed (Codex review). Operator probes viajq -r .status.evm(deploy script pullsorigin/evm); diagnose-before-edit; land-the-fix-everywhere.Demo + verification guide
docs/stage7-demo-and-verification.md(new, 1192 lines). End-to-end live demo againstbroker.litentry.org: SIWE wallet auth →/v1/mint-oidc-jwt→AssumeRoleWithWebIdentity→ S3 isolation proof. Each silent capture has an explicit echo confirmation.curl -sfswapped tocurl -sS --fail-with-bodyacross docs (4 docs, 45 occurrences).-sfsilently swallows error bodies; the new form prints them — operators see real errors instead of empty\$VARs.echo \"\$VAR\" | jqswapped toprintf '%s' \"\$VAR\" | jq(5 docs, 30 occurrences). zsh'sechointerprets\\nas 0x0A, corrupting JSON-string escapes inside SIWE messages.Operator runbook
docs/operator-runbook-stage7.mdupdated for the simpler post-migration env-var surface.docs/cloud-setup.mdwalks operator-workstation env setup; companionscripts/operator-workstation.envlives next to broker-sidescripts/broker.env.scripts/archived/with README.Repo stats
Test plan
cargo test -p agentkeys-broker-server— 124 unit + 31 integration passingcargo test -p agentkeys-provisioner(post-migration provisioner using/v1/mint-oidc-jwt)cargo test -p agentkeys-mcp+-p agentkeys-daemonbash harness/stage-7-issue-64-done.shexits 0https://broker.litentry.org— wallet A reads own S3 prefix; wallet B's prefix returns AccessDenied from S3 (cloud-enforced via PrincipalTag)bash scripts/setup-broker-host.sh --upgradeon the live broker host applies a clean redeployWhat's intentionally not in this PR
/v1/auth/exchangelegacy bearer shim removal (waits on TEE signer; daemon will migrate to email/OAuth2 + TEE-managed wallet)🤖 Generated with Claude Code