feat(stage7): broker-server vertical slice + three-role docs#60
Merged
hanwencheng merged 6 commits intomainfrom Apr 27, 2026
Merged
feat(stage7): broker-server vertical slice + three-role docs#60hanwencheng merged 6 commits intomainfrom
hanwencheng merged 6 commits intomainfrom
Conversation
Resolves #58 phase 1 — the credential broker that lets app developers run daemons against operator infrastructure without holding any AWS keys. Doc reframe (front-loaded per CEO-review Q3): - docs/dev-setup.md rewritten around three roles (app developer / operator / end user). Each role's setup is its own section. - docs/operator-runbook.md (new) — start, supervise, rotate, audit. Calls out v0.1 scope vs Stage 7 phase 2 (OIDC) vs Stage 8 (vault). New crate crates/agentkeys-broker-server/ (vertical slice per CEO-review Q1): - POST /v1/mint-aws-creds — bearer auth via backend's new /session/validate, sts:AssumeRole on operator's daemon key, returns 1h temp creds. Static-IAM path; assume-role-with-web-identity deferred to phase 2. - GET /healthz, /readyz — supervisor probes; readyz exercises backend reachability + sts:GetCallerIdentity. - SQLite audit log on every mint (sha256-hashed bearer tokens, wallet, outcome, sts session name) at $HOME/.agentkeys/broker/audit.sqlite. - Trait-abstracted StsClient with AwsStsClient + StubStsClient (test-stub feature) — testable without live AWS. Env-var config only. mock-server adds GET /session/validate so the broker validates tokens through the backend instead of duplicating session state. Broker stays stateless w.r.t. sessions; backend is single source of truth. agentkeys-daemon gains --broker-url / AGENTKEYS_BROKER_URL flag (consumer wiring lands in phase 2 alongside provisioner-script integration). Tests: 3 unit + 5 broker integration (mock-backend + stub STS) — full workspace cargo test passes 194/194, no regressions. Out of scope (explicit, deferred): - OIDC discovery / JWKS / AssumeRoleWithWebIdentity — phase 2 (gated on public-hosting prereq, docs/stage7-wip.md §1). - TS oidc-stub retirement — phase 2. - Provisioner-scripts AWS-cred consumer rewiring — phase 2. Refs #58. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address /plan-eng-review findings on PR #60 phase 1. Critical (silent-failure trio): - audit.rs: replace lock().unwrap() with lock_conn() that propagates poison as BrokerError::AuditError instead of panicking the tokio worker. - mint.rs: failure-path audit writes were silently swallowed (let _ = ...); now route through record_outcome() which logs at error level on audit insert failure so anomaly-detection blindness is visible to operators. - main.rs: warn loudly when binding to a non-loopback address (bearer tokens + minted AWS creds in cleartext otherwise — terminate TLS at a reverse proxy first). Reliability: - main.rs: validate STS creds at startup (--skip-startup-check escape hatch for offline dev). Misconfigured creds now fail to bind, not on first mint. - main.rs: graceful shutdown on SIGTERM/Ctrl-C drains in-flight requests via with_graceful_shutdown(); prevents orphan audit rows where the daemon never received the response. - mint.rs: build_session_name now appends a microsecond suffix; same wallet minting twice within a second no longer collides on STS session name. Observability: - mint.rs: #[tracing::instrument] span on mint_aws_creds, with wallet + outcome fields recorded as the request progresses. DRY + tests: - mint.rs: pull record_outcome() helper; three near-identical audit-insert call sites collapse to one. - StubStsClient: closure-backed; new ::ok / ::failing / ::assume_failing factory methods cover happy/down/partial-down test scenarios. - audit.rs: new AuditLog::last_row() + hash_token exported for test introspection. - 9 broker integration tests (was 5) — added STS-error path, backend-down path, both readyz failure modes, and audit-row assertions on every mint. - 4 new audit unit tests covering hash_token determinism, distinct hashes, record-mint roundtrip, failure-detail persistence. Test count workspace-wide: 203 / 203 passing (was 194). No regressions. Refs #58, addresses /plan-eng-review findings #1, #2, #3, #4, #6, #10, #12, #13. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address 7 issues from the codex review on top of /plan-eng-review. Critical: - audit.rs: column name `requester_token` stored hashed values, misleading any operator querying it. Renamed to `requester_token_hash` to match what's actually written. The Rust struct already used the correct name; only the SQLite schema and the SELECT lagged. - audit.rs: enable WAL + synchronous=FULL on the audit DB. Default journal mode could lose recent rows on power loss; for an audit log durability beats throughput. Reliability: - audit.rs: new MintOutcome::BackendError variant. Backend-unreachable was previously written as "auth_failed", which made operator anomaly detection blind to backend outages (looked like a token-fishing spike). - config.rs: BROKER_SESSION_DURATION_SECONDS parse failure now surfaces as a startup error instead of silently falling back to 3600. - config.rs: new BROKER_BACKEND_TIMEOUT_SECONDS (default 10s) and BROKER_SHUTDOWN_GRACE_SECONDS (default 30s). - main.rs: reqwest client gets the configured timeout + a 5s connect timeout. Previously a hung backend would pin a tokio task forever. - main.rs: graceful-shutdown future races a hard-cap sleep so a single hung request can't block process exit indefinitely. - main.rs: SIGTERM handler now expect()s on registration. Failing loud is better than the prior `if let Ok(...)` which would silently exit on startup in hardened-sandbox environments. Audit perf nit: - audit.rs: compute timestamp + token hash before grabbing the mutex so the critical section is purely the SQLite write. Tests updated: - mint_flow.rs: backend-unreachable test now asserts outcome="backend_error" (was "auth_failed"). - mint_flow.rs: BrokerConfig now constructs with the two new timeout fields; test reqwest client gets short timeouts. Test count workspace-wide: 203 / 203 passing. No regressions. Refs #58, addresses codex review findings on PR #60. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
stage7-wip.md previously described Stage 7 as one undifferentiated "not running yet" surface. With PR #60 phase 1 (broker server) shipped, the doc was misleading: readers couldn't tell what's live, what isn't, or where the operator runbook had moved to. Restructured around the two halves: - Phase 1 (shipped) — points at crates/agentkeys-broker-server/, the three-role dev-setup.md, and the operator-runbook. Includes the three-terminal e2e proof (mock backend + broker + curl mint). - Phase 2 (deferred) — preserves the existing OIDC federation test recipe (IAM provider registration, federated trust policy, PrincipalTag bucket policy, JWT mint via TS stub, cross-prefix AccessDenied proof). Reframed as "still blocked on public hosting + TEE-derived ES256 key per heima-gaps §3." Refs #58. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The team persists BROKER_DAEMON_* in ~/.zshenv (mode 0600), not in a 1Password vault accessed via `op read`. Update the three Stage 7 docs to match actual operator workflow: - docs/operator-runbook.md §3.1 now describes ~/.zshenv (or supervisor env-injection) instead of recommending 1Password CLI. Adds the "shared/untrusted host" caveat for systemd LoadCredential / launchd EnvironmentVariables fallback. - docs/operator-runbook.md §5 (rotation): updates step 2 from "update your secret store (1Password)" to "update ~/.zshenv". - docs/operator-runbook.md §9 (out-of-scope): retitles "1Password CLI integration" to "secret-manager integration" generally. - docs/dev-setup.md §1 (optional tools): removes 1Password CLI bullet. - docs/dev-setup.md §3 (role table): "1Password" → "~/.zshenv or supervisor-managed env" in the operator row. - docs/dev-setup.md §5.1: replaces "stash in 1Password" with the ~/.zshenv persistence pattern. - docs/dev-setup.md §5.2 + §5.4: removes inline `op read` calls from the broker-startup snippets; comments now state BROKER_DAEMON_* are inherited from the shell. - docs/stage7-wip.md phase-1 e2e proof: same op-read removal. No code changes. The broker still reads BROKER_DAEMON_* from std::env exactly as before; only the operator-facing instructions changed. Refs #58. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Operator's ~/.zshenv already defines: DAEMON_ACCESS_KEY_ID DAEMON_SECRET_ACCESS_KEY ACCOUNT_ID REGION BUCKET DOMAIN scripts/stage6-demo-env.sh has read DAEMON_ACCESS_KEY_ID + DAEMON_SECRET_ACCESS_KEY since Stage 6. Introducing a second naming scheme (BROKER_DAEMON_*) for the same long-lived keys forces operators to either duplicate exports or rewrite ~/.zshenv. Align instead. Code (config.rs): - BROKER_DAEMON_ACCESS_KEY_ID env var renamed to DAEMON_ACCESS_KEY_ID, with BROKER_DAEMON_ACCESS_KEY_ID kept as a fallback for explicit callers. Same for DAEMON_SECRET_ACCESS_KEY. - BROKER_AGENT_ROLE_ARN now optional: if unset, derived from ACCOUNT_ID as arn:aws:iam::$ACCOUNT_ID:role/agentkeys-agent (the Stage 6 canonical role name). Operator can still override. - BROKER_AWS_REGION now falls back to REGION (the rest-of-agentKeys convention) before defaulting to us-east-1. - New first_env() helper picks the first non-empty match from a list of candidate env-var names. Docs: - docs/operator-runbook.md §3.1: env-var schema table updated; ~/.zshenv example shows REGION + ACCOUNT_ID + DAEMON_* (matches actual zshenv layout). Two new vars from prior commit (BROKER_BACKEND_TIMEOUT_SECONDS, BROKER_SHUTDOWN_GRACE_SECONDS) added to the table. - docs/operator-runbook.md §5: rotation step references DAEMON_*. - docs/dev-setup.md §5.2 + §5.4: the explicit `export BROKER_AGENT_ROLE_ARN=...` line drops out — broker derives from ACCOUNT_ID. Now the only per-run var is BROKER_BACKEND_URL. - docs/stage7-wip.md phase-1 e2e: same simplification. Tests: 17 / 17 broker tests passing (BrokerConfig is constructed literally in tests, so the env-var rename doesn't affect them). Refs #58. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
6 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Resolves #58 phase 1 — the credential broker that lets app developers run daemons against operator infrastructure without holding any AWS keys. Phased per
/plan-ceo-review:POST /v1/mint-aws-credsend-to-end with real auth + audit. OIDC primitives (AssumeRoleWithWebIdentity+ JWKS) deferred to phase 2 because they're independently blocked on the public-hosting prereq indocs/stage7-wip.md.crates/agentkeys-broker-server/rather than extendingagentkeys-mock-server. Concerns differ; coupling chain-mock to AWS-broker would create an awkward operational surface.docs/dev-setup.mdrewrite + newdocs/operator-runbook.mdlead the change so reviewers see the framing before the code.✅ Phase-1 e2e proven on real AWS
Operator ran the live three-terminal flow from
docs/stage7-wip.mdend-to-end against the production AWS account. Result ofPOST /v1/mint-aws-creds:{ "access_key_id": "ASIAWHZVNRHP52WKVY63", "expiration": 1777268187, "wallet": "0xcd6e718c072917b5468157766ad2860944d0120d" }ASIA…prefix + future expiration confirm real STS-issued temp credentials (not the long-lived daemon AKIA key, which never leaves the broker process). Audit row written to~/.agentkeys/broker/audit.sqlitewithoutcome="ok"and the wallet attribution flowed through correctly.The broker minted creds are drop-in compatible with the legacy
scripts/stage6-demo-env.shenv-var shape, which means the existing OpenRouter scraper consumes them unchanged. Full provisioner-scripts rewiring (so the scraper calls the broker itself instead of a manual export) is the deferred phase-2 item.Commits in this PR
f4990f4agentkeys-broker-servercrate + three-role docs + daemon--broker-urlflagaba0dc3/plan-eng-reviewfollow-ups: mutex poison fix, silent-audit fix, plain-HTTP warn, microsecond suffix, tracing, startup STS check, graceful shutdown7b0b6f5requester_token→requester_token_hash), WAL+FULL pragmas,BackendErroroutcome variant, config parse strictness, reqwest + drain timeouts, SIGTERMexpect()f0960f6docs/stage7-wip.mdsplit into phase 1 (shipped) / phase 2 (deferred)4e974dd~/.zshenvpatternef892b8~/.zshenvconvention: readDAEMON_ACCESS_KEY_ID/DAEMON_SECRET_ACCESS_KEY(matchingscripts/stage6-demo-env.sh); deriveBROKER_AGENT_ROLE_ARNfromACCOUNT_ID; fall back toREGIONfor AWS regionWhat's in the diff
Tests: 8 broker unit + 9 broker integration + 186 existing = 203 / 203 passing, no regressions.
Architecture (v0.1)
The broker is stateless w.r.t. sessions — backend (
mock-serverin dev, chain in v0.2+) is the single source of truth for which bearer tokens are valid. The newGET /session/validateendpoint on mock-server is the join point. Trade-off: backend outage is transitive to broker (no cache); fine for v0.1 dev loop.STS is trait-abstracted (
StsClient) with anAwsStsClientfor production and aStubStsClient(gated behind atest-stubfeature) for integration tests. CI never hits AWS. The live test above is what validated the production path.Operator UX — env var alignment
The operator's existing
~/.zshenvalready hadDAEMON_ACCESS_KEY_ID,DAEMON_SECRET_ACCESS_KEY,ACCOUNT_ID, andREGIONfrom the Stage 6 setup. Phase-1 makes the broker read those same names so no~/.zshenvedits are required to start using it. The only per-run env var the operator now needs isBROKER_BACKEND_URL. Seedocs/operator-runbook.md§3.1.Acceptance criteria progress (issue #58)
AGENTKEYS_BROKER_URLpointing at the operator's broker. The flag is wired; the consumer of the temp creds (provisioner-scripts) lands in phase 2.docs/dev-setup.mdhas three top-level role sections. ✓docs/stage6-aws-setup.mdno longer asks anyone except the operator to handle AWS keys. ✓ The dev-setup rewrite makes this implicit by routing developers to §4.bash harness/stage-7-done.shexits 0. Deferred to phase 4 (the harness file currently has Stage 0 + 5a only; cleaning it up before Stage 7 sign-off is its own piece of work).Reviews run
/plan-ceo-review— HOLD SCOPE; identified 3 forks (vertical slice / separate crate / doc-first), all addressed in commitf4990f4./plan-eng-review— 11 findings; load-bearing 4 fixed in commitaba0dc3.7b0b6f5, 3 deferred (cachedcaller_identity_okfor k8s probes, test-broker join-on-teardown, infalliblefrom_keysconstructor).Test plan
cargo build --workspace— clean.cargo test --workspace— 203 / 203 passing.cargo test -p agentkeys-broker-server --features test-stub— 17 broker tests (8 unit + 9 integration) pass against mock backend + stub STS.DAEMON_*env-var alignment +ACCOUNT_ID-derived role ARN match the team's~/.zshenvconvention before merge.Out of scope (deliberately deferred)
/.well-known/openid-configuration,/.well-known/jwks.json,POST /v1/mint-oidc-jwt,sts:AssumeRoleWithWebIdentity). Independently blocked on the public-hosting prereq fromdocs/stage7-wip.md. Phase 2 PR.services/oidc-stub/retirement. Phase 2, once the OIDC half is in Rust.harness/features.jsonneeds Stages 1-4+6 entries first.Security notes
sha256(bearer_token), never the raw token. Reasoning indocs/operator-runbook.md.docs/spec/threat-model-key-custody.md.[900, 43200]seconds at config load time.Related
🤖 Generated with Claude Code