feat(stage7): broker-server vertical slice + three-role docs by hanwencheng · Pull Request #60 · litentry/agentKeys

hanwencheng · 2026-04-26T17:35:38Z

Summary

Resolves #58 phase 1 — the credential broker that lets app developers run daemons against operator infrastructure without holding any AWS keys. Phased per /plan-ceo-review:

Q1 vertical slice — phase 1 ships POST /v1/mint-aws-creds end-to-end with real auth + audit. OIDC primitives (AssumeRoleWithWebIdentity + JWKS) deferred to phase 2 because they're independently blocked on the public-hosting prereq in docs/stage7-wip.md.
Q2 separate crate — crates/agentkeys-broker-server/ rather than extending agentkeys-mock-server. Concerns differ; coupling chain-mock to AWS-broker would create an awkward operational surface.
Q3 doc-first — docs/dev-setup.md rewrite + new docs/operator-runbook.md lead the change so reviewers see the framing before the code.

✅ Phase-1 e2e proven on real AWS

Operator ran the live three-terminal flow from docs/stage7-wip.md end-to-end against the production AWS account. Result of POST /v1/mint-aws-creds:

{
  "access_key_id": "ASIAWHZVNRHP52WKVY63",
  "expiration": 1777268187,
  "wallet": "0xcd6e718c072917b5468157766ad2860944d0120d"
}

ASIA… prefix + future expiration confirm real STS-issued temp credentials (not the long-lived daemon AKIA key, which never leaves the broker process). Audit row written to ~/.agentkeys/broker/audit.sqlite with outcome="ok" and the wallet attribution flowed through correctly.
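The two signals cited above (key-ID prefix and a future expiration) can be checked mechanically. The following is a minimal sketch, not code from this PR: the `MintedCreds` struct and both function names are hypothetical, and only the `access_key_id` and `expiration` fields are taken from the response shown above.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Hypothetical mirror of two fields from the mint response shown above.
pub struct MintedCreds<'a> {
    pub access_key_id: &'a str,
    /// Expiration as Unix seconds, as in the JSON response.
    pub expiration: u64,
}

/// Returns true when the key ID carries the temporary-credential prefix
/// ("ASIA") and the expiration lies strictly after `now_unix`.
/// Long-lived IAM user keys start with "AKIA" and fail the first check.
pub fn looks_like_live_temp_creds(creds: &MintedCreds<'_>, now_unix: u64) -> bool {
    creds.access_key_id.starts_with("ASIA") && creds.expiration > now_unix
}

/// Current wall-clock time as Unix seconds (0 if the clock is before the epoch).
pub fn unix_now() -> u64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_secs())
        .unwrap_or(0)
}
```

A caller would pass `unix_now()` as the second argument; taking the clock as a parameter keeps the check deterministic in tests.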

The broker-minted creds are drop-in compatible with the legacy scripts/stage6-demo-env.sh env-var shape, which means the existing OpenRouter scraper consumes them unchanged. Full provisioner-scripts rewiring (so the scraper calls the broker itself instead of relying on a manual export) is the deferred phase-2 item.

Commits in this PR

| Commit | Purpose |
| --- | --- |
| `f4990f4` | Initial vertical slice: new `agentkeys-broker-server` crate + three-role docs + daemon `--broker-url` flag |
| `aba0dc3` | `/plan-eng-review` follow-ups: mutex poison fix, silent-audit fix, plain-HTTP warn, microsecond suffix, tracing, startup STS check, graceful shutdown |
| `7b0b6f5` | Codex review follow-ups: audit column rename (`requester_token` → `requester_token_hash`), WAL+FULL pragmas, `BackendError` outcome variant, config parse strictness, reqwest + drain timeouts, SIGTERM `expect()` |
| `f0960f6` | `docs/stage7-wip.md` split into phase 1 (shipped) / phase 2 (deferred) |
| `4e974dd` | Remove 1Password CLI references from operator-facing docs in favor of `~/.zshenv` pattern |
| `ef892b8` | Align broker env vars with existing `~/.zshenv` convention: read `DAEMON_ACCESS_KEY_ID` / `DAEMON_SECRET_ACCESS_KEY` (matching `scripts/stage6-demo-env.sh`); derive `BROKER_AGENT_ROLE_ARN` from `ACCOUNT_ID`; fall back to `REGION` for AWS region |

What's in the diff

docs/dev-setup.md                         | rewrite around 3 roles
docs/operator-runbook.md                  | NEW (start/supervise/rotate/audit, v0.1 scope)
docs/stage7-wip.md                        | restructured: phase 1 (shipped) + phase 2 (deferred)
crates/agentkeys-broker-server/           | NEW crate (axum)
  Cargo.toml                              |   axum 0.7, aws-sdk-sts 1, rusqlite (WAL+FULL), reqwest
  src/lib.rs                              |   create_router(state) → 4 routes
  src/main.rs                             |   --port / --bind / --skip-startup-check, graceful shutdown, non-loopback warn
  src/config.rs                           |   BrokerConfig::from_env() — DAEMON_* + ACCOUNT_ID derivation, REGION fallback
  src/state.rs                            |   AppState { config, http (with timeouts), audit, sts }
  src/error.rs                            |   BrokerError → IntoResponse (5 kinds)
  src/audit.rs                            |   AuditLog (SQLite WAL+FULL, sha256 token hash, last_row inspect API)
  src/auth.rs                             |   validate_bearer_token() → backend /session/validate
  src/sts.rs                              |   StsClient trait + AwsStsClient + StubStsClient (closure-backed, 3 factories)
  src/handlers/health.rs                  |   /healthz, /readyz (backend + STS probes)
  src/handlers/mint.rs                    |   POST /v1/mint-aws-creds (instrumented; record_outcome helper)
  tests/mint_flow.rs                      |   9 broker integration tests (mock-backend + stub STS)
crates/agentkeys-mock-server/             | + GET /session/validate endpoint
crates/agentkeys-daemon/src/main.rs       | + --broker-url / AGENTKEYS_BROKER_URL flag
Cargo.toml                                | + workspace member entry

Tests: 8 broker unit + 9 broker integration + 186 existing = 203 / 203 passing, no regressions.

Architecture (v0.1)

   developer                 operator host
   ─────────                 ──────────────
                           ┌─ agentkeys-broker-server
                           │    │
   agentkeys-daemon ──────────► POST /v1/mint-aws-creds
       (bearer token)      │    │
                           │    ├──► GET backend/session/validate
                           │    │       (validates bearer, returns wallet)
                           │    │
                           │    ├──► sts:AssumeRole
                           │    │     (uses DAEMON_ACCESS_KEY_ID +
                           │    │      DAEMON_SECRET_ACCESS_KEY from
                           │    │      operator's ~/.zshenv)
                           │    │       │
                           │    │       ▼
                           │    │   1h scoped temp creds
                           │    │
                           │    └──► AuditLog::record_mint() → SQLite (WAL+FULL)
                           │
                           └─ daemon AWS key — never leaves this process

The broker is stateless w.r.t. sessions — backend (mock-server in dev, chain in v0.2+) is the single source of truth for which bearer tokens are valid. The new GET /session/validate endpoint on mock-server is the join point. Trade-off: backend outage is transitive to broker (no cache); fine for v0.1 dev loop.
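The request flow in the diagram (validate the bearer token with the backend, assume the role, record an audit row on every path) can be sketched with plain closures standing in for the HTTP and STS calls. This is an illustrative simplification, not the handler in src/handlers/mint.rs: `mint_flow`, `ValidateError`, `TempCreds`, and the `"sts_error"` label are assumptions, while `"ok"`, `"auth_failed"`, and `"backend_error"` are the outcome strings mentioned in this PR.

```rust
/// Audit outcomes. "ok", "auth_failed" and "backend_error" appear in the PR
/// text; "sts_error" is an assumed label for the STS failure path.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MintOutcome {
    Ok,
    AuthFailed,
    BackendError,
    StsError,
}

impl MintOutcome {
    pub fn as_str(self) -> &'static str {
        match self {
            MintOutcome::Ok => "ok",
            MintOutcome::AuthFailed => "auth_failed",
            MintOutcome::BackendError => "backend_error",
            MintOutcome::StsError => "sts_error",
        }
    }
}

/// Why the backend could not vouch for a bearer token (hypothetical type).
pub enum ValidateError {
    /// The backend answered and rejected the token.
    Rejected,
    /// The backend could not be reached at all.
    Unreachable,
}

#[derive(Debug)]
pub struct TempCreds {
    pub access_key_id: String,
    pub expiration: u64,
}

/// Runs one mint: the broker keeps no session state, so every call asks the
/// backend first, and every exit path writes exactly one audit record.
pub fn mint_flow<V, S, A>(
    token: &str,
    validate: V,
    assume_role: S,
    mut audit: A,
) -> Result<TempCreds, MintOutcome>
where
    V: Fn(&str) -> Result<String, ValidateError>, // token -> wallet
    S: Fn(&str) -> Result<TempCreds, String>,     // wallet -> temp creds
    A: FnMut(Option<&str>, MintOutcome),          // (wallet, outcome)
{
    let wallet = match validate(token) {
        Ok(w) => w,
        Err(ValidateError::Rejected) => {
            audit(None, MintOutcome::AuthFailed);
            return Err(MintOutcome::AuthFailed);
        }
        Err(ValidateError::Unreachable) => {
            audit(None, MintOutcome::BackendError);
            return Err(MintOutcome::BackendError);
        }
    };
    match assume_role(&wallet) {
        Ok(creds) => {
            audit(Some(wallet.as_str()), MintOutcome::Ok);
            Ok(creds)
        }
        Err(_) => {
            audit(Some(wallet.as_str()), MintOutcome::StsError);
            Err(MintOutcome::StsError)
        }
    }
}
```

Keeping "rejected" and "unreachable" as separate error cases is what lets a backend outage show up in the audit log as its own outcome rather than as a burst of failed logins.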

STS is trait-abstracted (StsClient) with an AwsStsClient for production and a StubStsClient (gated behind a test-stub feature) for integration tests. CI never hits AWS. The live test above is what validated the production path.
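To illustrate the trait split, here is a minimal synchronous sketch of a closure-backed stub. The real `StsClient` in src/sts.rs is presumably async and its method names and signatures are not shown in this PR, so everything below except the type names `StsClient` / `StubStsClient` and the factory names `ok` / `failing` / `assume_failing` is an assumption. In particular, the mapping of the three factories to "happy / down / partial-down" is one plausible reading.

```rust
#[derive(Debug)]
pub struct TempCreds {
    pub access_key_id: String,
    pub expiration: u64,
}

/// Simplified, synchronous stand-in for the broker's STS abstraction.
pub trait StsClient {
    /// Mirrors the sts:GetCallerIdentity probe used by readiness checks.
    fn caller_identity_ok(&self) -> bool;
    /// Mirrors sts:AssumeRole for a given session name.
    fn assume_role(&self, session_name: &str) -> Result<TempCreds, String>;
}

type AssumeFn = Box<dyn Fn(&str) -> Result<TempCreds, String> + Send + Sync>;

/// Test double whose AssumeRole behavior is supplied as a closure.
pub struct StubStsClient {
    identity_ok: bool,
    assume: AssumeFn,
}

impl StubStsClient {
    /// Happy path: identity probe passes and AssumeRole returns fake creds.
    pub fn ok() -> Self {
        Self {
            identity_ok: true,
            assume: Box::new(|_name: &str| {
                Ok(TempCreds {
                    access_key_id: "ASIASTUBSTUBSTUB".to_string(),
                    expiration: 4_102_444_800, // fixed far-future timestamp
                })
            }),
        }
    }

    /// STS fully down: both the probe and AssumeRole fail.
    pub fn failing() -> Self {
        Self {
            identity_ok: false,
            assume: Box::new(|_name: &str| Err("stub: STS unavailable".to_string())),
        }
    }

    /// Partial outage: the probe passes but AssumeRole fails.
    pub fn assume_failing() -> Self {
        Self {
            identity_ok: true,
            assume: Box::new(|_name: &str| Err("stub: AssumeRole denied".to_string())),
        }
    }
}

impl StsClient for StubStsClient {
    fn caller_identity_ok(&self) -> bool {
        self.identity_ok
    }

    fn assume_role(&self, session_name: &str) -> Result<TempCreds, String> {
        (self.assume)(session_name)
    }
}
```

Handlers written against `StsClient` can then be exercised in CI with any of the three stubs, with no AWS credentials present.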

Operator UX — env var alignment

The operator's existing ~/.zshenv already had DAEMON_ACCESS_KEY_ID, DAEMON_SECRET_ACCESS_KEY, ACCOUNT_ID, and REGION from the Stage 6 setup. Phase-1 makes the broker read those same names so no ~/.zshenv edits are required to start using it. The only per-run env var the operator now needs is BROKER_BACKEND_URL. See docs/operator-runbook.md §3.1.
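The fallback order described for commit `ef892b8` (prefer `DAEMON_*`, accept `BROKER_DAEMON_*`, derive the role ARN from `ACCOUNT_ID`, fall back from `BROKER_AWS_REGION` to `REGION` to `us-east-1`) can be sketched as below. The commit mentions a `first_env()` helper, but its real signature is not shown; the lookup-closure parameter, the `AwsSettings` struct, and `resolve_aws_settings` are assumptions made so the sketch is testable without touching the process environment.

```rust
/// Returns the first non-empty value among `names`, using `lookup` as the
/// environment (pass `|k| std::env::var(k).ok()` for the real one).
pub fn first_env<F>(lookup: F, names: &[&str]) -> Option<String>
where
    F: Fn(&str) -> Option<String>,
{
    names
        .iter()
        .copied()
        .filter_map(|name| lookup(name))
        .find(|value| !value.is_empty())
}

/// Hypothetical subset of the broker configuration.
#[derive(Debug)]
pub struct AwsSettings {
    pub access_key_id: String,
    pub secret_access_key: String,
    pub role_arn: String,
    pub region: String,
}

pub fn resolve_aws_settings<F>(lookup: F) -> Result<AwsSettings, String>
where
    F: Fn(&str) -> Option<String>,
{
    let access_key_id = first_env(
        &lookup,
        &["DAEMON_ACCESS_KEY_ID", "BROKER_DAEMON_ACCESS_KEY_ID"],
    )
    .ok_or_else(|| "DAEMON_ACCESS_KEY_ID is not set".to_string())?;

    let secret_access_key = first_env(
        &lookup,
        &["DAEMON_SECRET_ACCESS_KEY", "BROKER_DAEMON_SECRET_ACCESS_KEY"],
    )
    .ok_or_else(|| "DAEMON_SECRET_ACCESS_KEY is not set".to_string())?;

    // An explicit override wins; otherwise derive the Stage 6 role name.
    let role_arn = match first_env(&lookup, &["BROKER_AGENT_ROLE_ARN"]) {
        Some(arn) => arn,
        None => {
            let account = first_env(&lookup, &["ACCOUNT_ID"]).ok_or_else(|| {
                "set BROKER_AGENT_ROLE_ARN or ACCOUNT_ID".to_string()
            })?;
            format!("arn:aws:iam::{account}:role/agentkeys-agent")
        }
    };

    let region = first_env(&lookup, &["BROKER_AWS_REGION", "REGION"])
        .unwrap_or_else(|| "us-east-1".to_string());

    Ok(AwsSettings { access_key_id, secret_access_key, role_arn, region })
}
```

Treating an empty string the same as an unset variable means a stray `export REGION=` in a shell profile falls through to the next candidate instead of producing an empty region.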

Acceptance criteria progress (issue #58)

AC1 — Operator starts broker on a fresh laptop with daemon AWS keys in env. ✓ Verified live.
AC2 — App developer runs the daemon with AGENTKEYS_BROKER_URL pointing at the operator's broker. The flag is wired; the consumer of the temp creds (provisioner-scripts) lands in phase 2.
AC3 — End user flow unchanged. ✓ No CLI surface changes.
AC4 — docs/dev-setup.md has three top-level role sections. ✓
AC5 — docs/stage6-aws-setup.md no longer asks anyone except the operator to handle AWS keys. ✓ The dev-setup rewrite makes this implicit by routing developers to §4.
AC6 — bash harness/stage-7-done.sh exits 0. Deferred to phase 4 (the harness file currently has Stage 0 + 5a only; cleaning it up before Stage 7 sign-off is its own piece of work).

Reviews run

/plan-ceo-review — HOLD SCOPE; identified 3 forks (vertical slice / separate crate / doc-first), all addressed in commit f4990f4.
/plan-eng-review — 11 findings; load-bearing 4 fixed in commit aba0dc3.
Codex review — 9 findings; 6 fixed in commit 7b0b6f5, 3 deferred (cached caller_identity_ok for k8s probes, test-broker join-on-teardown, infallible from_keys constructor).
Codex adversarial review — 9 architectural challenges (separate crate cost, phase-split logic, stateless-broker as SPOF, SQLite as audit store, premature trait abstraction, three-role doc framing, microsecond suffix necessity, FULL sync over-spec, phase-2 hosting drift). Verdict: phase-1 carries its own weight even if phase-2 slips, primary unaddressed concern is the backend-as-SPOF coupling (call out as known gap rather than fix in v0.1).

Test plan

cargo build --workspace — clean.
cargo test --workspace — 203 / 203 passing.
cargo test -p agentkeys-broker-server --features test-stub — 17 broker tests (8 unit + 9 integration) pass against mock backend + stub STS.
Live three-terminal e2e on real AWS — broker mints temp creds; ASIA prefix + future expiration confirmed; audit row written; wallet attribution correct.
Operator review: confirm the DAEMON_* env-var alignment + ACCOUNT_ID-derived role ARN match the team's ~/.zshenv convention before merge.

Out of scope (deliberately deferred)

OIDC half of Stage 7 (/.well-known/openid-configuration, /.well-known/jwks.json, POST /v1/mint-oidc-jwt, sts:AssumeRoleWithWebIdentity). Independently blocked on the public-hosting prereq from docs/stage7-wip.md. Phase 2 PR.
TS services/oidc-stub/ retirement. Phase 2, once the OIDC half is in Rust.
Provisioner-scripts integration. The daemon flag is wired; the consumer code that asks the broker for creds before reading from the operator's S3 bucket lands in phase 2 alongside the OIDC work.
KMS-sealed config source for hosted shape. Interface-only; full implementation is hosted-deploy work.
Bearer-token validation cache. Backend round-trip on every mint; acceptable for v0.1 dev workload (mints every ~55 min per daemon).
Harness Stage 7 sign-off. Phase 4 — and the existing harness/features.json needs Stages 1-4+6 entries first.

Security notes

Audit log stores sha256(bearer_token), never the raw token. Reasoning in docs/operator-runbook.md.
Threat model — broker holds the long-lived AWS key, so a compromised broker host = unbounded AWS access for that role. v0.1 scope is "run the broker on a host you trust." TEE-backed hosting is the v0.2+ evolution per docs/spec/threat-model-key-custody.md.
Session duration clamped to [900, 43200] seconds at config load time.
STS session names sanitized — non-alphanumeric chars stripped, capped at 64 bytes, microsecond-suffixed for CloudTrail readability.
Plain-HTTP bind to non-loopback emits a startup warning. Operators are expected to terminate TLS at a reverse proxy before exposing the broker beyond the host.
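Two of the notes above (the duration clamp and the session-name sanitization) are simple enough to sketch. This is an assumption-laden illustration rather than the code in this PR: the function names are hypothetical, the exact suffix format is a guess, and the real `build_session_name` may order or truncate differently. The bounds and the 64-byte cap come from the notes above.

```rust
pub const MIN_SESSION_SECS: u64 = 900;
pub const MAX_SESSION_SECS: u64 = 43_200;
pub const MAX_SESSION_NAME_BYTES: usize = 64;

/// Clamps a requested session duration into [900, 43200] seconds.
pub fn clamp_session_duration(requested: u64) -> u64 {
    requested.clamp(MIN_SESSION_SECS, MAX_SESSION_SECS)
}

/// Builds a session name from a wallet and a microsecond timestamp:
/// strips non-alphanumeric characters, appends "-<micros>", and trims the
/// wallet part so the whole name fits in 64 bytes without losing the suffix.
pub fn session_name_sketch(wallet: &str, micros: u128) -> String {
    let suffix = format!("-{micros}");
    // After filtering, every char is ASCII, so chars and bytes coincide.
    let budget = MAX_SESSION_NAME_BYTES.saturating_sub(suffix.len());
    let mut name: String = wallet
        .chars()
        .filter(|c| c.is_ascii_alphanumeric())
        .take(budget)
        .collect();
    name.push_str(&suffix);
    name
}
```

Truncating the wallet portion rather than the suffix preserves uniqueness, which is the reason for the suffix in the first place: two mints by the same wallet within one second would otherwise produce identical names.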

Resolves #58 phase 1: the credential broker that lets app developers run daemons against operator infrastructure without holding any AWS keys.

Doc reframe (front-loaded per CEO-review Q3):
- docs/dev-setup.md rewritten around three roles (app developer / operator / end user). Each role's setup is its own section.
- docs/operator-runbook.md (new): start, supervise, rotate, audit. Calls out v0.1 scope vs Stage 7 phase 2 (OIDC) vs Stage 8 (vault).

New crate crates/agentkeys-broker-server/ (vertical slice per CEO-review Q1):
- POST /v1/mint-aws-creds: bearer auth via backend's new /session/validate, sts:AssumeRole on operator's daemon key, returns 1h temp creds. Static-IAM path; assume-role-with-web-identity deferred to phase 2.
- GET /healthz, /readyz: supervisor probes; readyz exercises backend reachability + sts:GetCallerIdentity.
- SQLite audit log on every mint (sha256-hashed bearer tokens, wallet, outcome, sts session name) at $HOME/.agentkeys/broker/audit.sqlite.
- Trait-abstracted StsClient with AwsStsClient + StubStsClient (test-stub feature), testable without live AWS. Env-var config only.

mock-server adds GET /session/validate so the broker validates tokens through the backend instead of duplicating session state. Broker stays stateless w.r.t. sessions; backend is single source of truth.

agentkeys-daemon gains --broker-url / AGENTKEYS_BROKER_URL flag (consumer wiring lands in phase 2 alongside provisioner-script integration).

Tests: 3 unit + 5 broker integration (mock-backend + stub STS); full workspace cargo test passes 194/194, no regressions.

Out of scope (explicit, deferred):
- OIDC discovery / JWKS / AssumeRoleWithWebIdentity: phase 2 (gated on public-hosting prereq, docs/stage7-wip.md §1).
- TS oidc-stub retirement: phase 2.
- Provisioner-scripts AWS-cred consumer rewiring: phase 2.

Refs #58.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Address /plan-eng-review findings on PR #60 phase 1.

Critical (silent-failure trio):
- audit.rs: replace lock().unwrap() with lock_conn() that propagates poison as BrokerError::AuditError instead of panicking the tokio worker.
- mint.rs: failure-path audit writes were silently swallowed (let _ = ...); now route through record_outcome() which logs at error level on audit insert failure so anomaly-detection blindness is visible to operators.
- main.rs: warn loudly when binding to a non-loopback address (bearer tokens + minted AWS creds in cleartext otherwise; terminate TLS at a reverse proxy first).

Reliability:
- main.rs: validate STS creds at startup (--skip-startup-check escape hatch for offline dev). Misconfigured creds now fail to bind, not on first mint.
- main.rs: graceful shutdown on SIGTERM/Ctrl-C drains in-flight requests via with_graceful_shutdown(); prevents orphan audit rows where the daemon never received the response.
- mint.rs: build_session_name now appends a microsecond suffix; same wallet minting twice within a second no longer collides on STS session name.

Observability:
- mint.rs: #[tracing::instrument] span on mint_aws_creds, with wallet + outcome fields recorded as the request progresses.

DRY + tests:
- mint.rs: pull record_outcome() helper; three near-identical audit-insert call sites collapse to one.
- StubStsClient: closure-backed; new ::ok / ::failing / ::assume_failing factory methods cover happy/down/partial-down test scenarios.
- audit.rs: new AuditLog::last_row() + hash_token exported for test introspection.
- 9 broker integration tests (was 5): added STS-error path, backend-down path, both readyz failure modes, and audit-row assertions on every mint.
- 4 new audit unit tests covering hash_token determinism, distinct hashes, record-mint roundtrip, failure-detail persistence.

Test count workspace-wide: 203 / 203 passing (was 194). No regressions.

Refs #58, addresses /plan-eng-review findings #1, #2, #3, #4, #6, #10, #12, #13.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
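The mutex-poison fix named in this commit (`lock_conn()` returning `BrokerError::AuditError` instead of panicking) follows a common pattern that can be shown with the standard library alone. The sketch below is generic over the guarded value so it runs without SQLite; the real `AuditLog` presumably wraps a rusqlite connection, and the error payload and message text here are assumptions.

```rust
use std::sync::{Mutex, MutexGuard};

#[derive(Debug, PartialEq)]
pub enum BrokerError {
    /// The audit store could not be used; the string is a diagnostic.
    AuditError(String),
}

/// Stand-in for the audit log; `T` would be the database connection.
pub struct AuditLog<T> {
    conn: Mutex<T>,
}

impl<T> AuditLog<T> {
    pub fn new(conn: T) -> Self {
        Self { conn: Mutex::new(conn) }
    }

    /// Acquires the connection, turning a poisoned mutex into an error value
    /// the handler can report, rather than unwrapping and panicking.
    pub fn lock_conn(&self) -> Result<MutexGuard<'_, T>, BrokerError> {
        self.conn
            .lock()
            .map_err(|e| BrokerError::AuditError(format!("audit mutex poisoned: {e}")))
    }
}
```

With `lock().unwrap()`, one panic while holding the lock would make every later audit write panic as well; returning an error keeps the worker alive and lets the failure surface as a normal error response.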

Address 7 issues from the codex review on top of /plan-eng-review.

Critical:
- audit.rs: column name `requester_token` stored hashed values, misleading any operator querying it. Renamed to `requester_token_hash` to match what's actually written. The Rust struct already used the correct name; only the SQLite schema and the SELECT lagged.
- audit.rs: enable WAL + synchronous=FULL on the audit DB. Default journal mode could lose recent rows on power loss; for an audit log durability beats throughput.

Reliability:
- audit.rs: new MintOutcome::BackendError variant. Backend-unreachable was previously written as "auth_failed", which made operator anomaly detection blind to backend outages (looked like a token-fishing spike).
- config.rs: BROKER_SESSION_DURATION_SECONDS parse failure now surfaces as a startup error instead of silently falling back to 3600.
- config.rs: new BROKER_BACKEND_TIMEOUT_SECONDS (default 10s) and BROKER_SHUTDOWN_GRACE_SECONDS (default 30s).
- main.rs: reqwest client gets the configured timeout + a 5s connect timeout. Previously a hung backend would pin a tokio task forever.
- main.rs: graceful-shutdown future races a hard-cap sleep so a single hung request can't block process exit indefinitely.
- main.rs: SIGTERM handler now expect()s on registration. Failing loud is better than the prior `if let Ok(...)` which would silently exit on startup in hardened-sandbox environments.

Audit perf nit:
- audit.rs: compute timestamp + token hash before grabbing the mutex so the critical section is purely the SQLite write.

Tests updated:
- mint_flow.rs: backend-unreachable test now asserts outcome="backend_error" (was "auth_failed").
- mint_flow.rs: BrokerConfig now constructs with the two new timeout fields; test reqwest client gets short timeouts.

Test count workspace-wide: 203 / 203 passing. No regressions.

Refs #58, addresses codex review findings on PR #60.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
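The config-strictness change in this commit (a malformed `BROKER_SESSION_DURATION_SECONDS` becomes a startup error rather than a silent fallback) might look roughly like the following. The function name and error text are hypothetical, and treating an unset variable as 3600 is an assumption inferred from the fallback value the commit message mentions.

```rust
/// Assumed default when the variable is not set at all.
pub const DEFAULT_SESSION_SECS: u64 = 3600;

/// Parses the raw value of BROKER_SESSION_DURATION_SECONDS.
/// Unset -> default; set but unparsable -> error for the caller to surface
/// at startup, never a silent fallback.
pub fn parse_session_duration(raw: Option<&str>) -> Result<u64, String> {
    match raw {
        None => Ok(DEFAULT_SESSION_SECS),
        Some(s) => s.trim().parse::<u64>().map_err(|e| {
            format!("BROKER_SESSION_DURATION_SECONDS={s:?} is not a valid number of seconds: {e}")
        }),
    }
}
```

A caller would pass `std::env::var("BROKER_SESSION_DURATION_SECONDS").ok().as_deref()` and abort startup on `Err`, so a typo such as `1h` is caught before the first mint rather than quietly yielding one-hour sessions.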

stage7-wip.md previously described Stage 7 as one undifferentiated "not running yet" surface. With PR #60 phase 1 (broker server) shipped, the doc was misleading: readers couldn't tell what's live, what isn't, or where the operator runbook had moved to.

Restructured around the two halves:
- Phase 1 (shipped): points at crates/agentkeys-broker-server/, the three-role dev-setup.md, and the operator-runbook. Includes the three-terminal e2e proof (mock backend + broker + curl mint).
- Phase 2 (deferred): preserves the existing OIDC federation test recipe (IAM provider registration, federated trust policy, PrincipalTag bucket policy, JWT mint via TS stub, cross-prefix AccessDenied proof). Reframed as "still blocked on public hosting + TEE-derived ES256 key per heima-gaps §3."

Refs #58.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The team persists BROKER_DAEMON_* in ~/.zshenv (mode 0600), not in a 1Password vault accessed via `op read`. Update the three Stage 7 docs to match actual operator workflow:
- docs/operator-runbook.md §3.1 now describes ~/.zshenv (or supervisor env-injection) instead of recommending 1Password CLI. Adds the "shared/untrusted host" caveat for systemd LoadCredential / launchd EnvironmentVariables fallback.
- docs/operator-runbook.md §5 (rotation): updates step 2 from "update your secret store (1Password)" to "update ~/.zshenv".
- docs/operator-runbook.md §9 (out-of-scope): retitles "1Password CLI integration" to "secret-manager integration" generally.
- docs/dev-setup.md §1 (optional tools): removes 1Password CLI bullet.
- docs/dev-setup.md §3 (role table): "1Password" → "~/.zshenv or supervisor-managed env" in the operator row.
- docs/dev-setup.md §5.1: replaces "stash in 1Password" with the ~/.zshenv persistence pattern.
- docs/dev-setup.md §5.2 + §5.4: removes inline `op read` calls from the broker-startup snippets; comments now state BROKER_DAEMON_* are inherited from the shell.
- docs/stage7-wip.md phase-1 e2e proof: same op-read removal.

No code changes. The broker still reads BROKER_DAEMON_* from std::env exactly as before; only the operator-facing instructions changed.

Refs #58.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Operator's ~/.zshenv already defines:

  DAEMON_ACCESS_KEY_ID
  DAEMON_SECRET_ACCESS_KEY
  ACCOUNT_ID
  REGION
  BUCKET
  DOMAIN

scripts/stage6-demo-env.sh has read DAEMON_ACCESS_KEY_ID + DAEMON_SECRET_ACCESS_KEY since Stage 6. Introducing a second naming scheme (BROKER_DAEMON_*) for the same long-lived keys forces operators to either duplicate exports or rewrite ~/.zshenv. Align instead.

Code (config.rs):
- BROKER_DAEMON_ACCESS_KEY_ID env var renamed to DAEMON_ACCESS_KEY_ID, with BROKER_DAEMON_ACCESS_KEY_ID kept as a fallback for explicit callers. Same for DAEMON_SECRET_ACCESS_KEY.
- BROKER_AGENT_ROLE_ARN now optional: if unset, derived from ACCOUNT_ID as arn:aws:iam::$ACCOUNT_ID:role/agentkeys-agent (the Stage 6 canonical role name). Operator can still override.
- BROKER_AWS_REGION now falls back to REGION (the rest-of-agentKeys convention) before defaulting to us-east-1.
- New first_env() helper picks the first non-empty match from a list of candidate env-var names.

Docs:
- docs/operator-runbook.md §3.1: env-var schema table updated; ~/.zshenv example shows REGION + ACCOUNT_ID + DAEMON_* (matches actual zshenv layout). Two new vars from prior commit (BROKER_BACKEND_TIMEOUT_SECONDS, BROKER_SHUTDOWN_GRACE_SECONDS) added to the table.
- docs/operator-runbook.md §5: rotation step references DAEMON_*.
- docs/dev-setup.md §5.2 + §5.4: the explicit `export BROKER_AGENT_ROLE_ARN=...` line drops out; broker derives from ACCOUNT_ID. Now the only per-run var is BROKER_BACKEND_URL.
- docs/stage7-wip.md phase-1 e2e: same simplification.

Tests: 17 / 17 broker tests passing (BrokerConfig is constructed literally in tests, so the env-var rename doesn't affect them).

Refs #58.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

WildmetaAgent and others added 6 commits April 27, 2026 01:34

hanwencheng merged commit 5d4e652 into main Apr 27, 2026
1 check passed

hanwencheng mentioned this pull request Apr 27, 2026

feat(stage7): phase 2 — OIDC issuer in Rust broker + provisioner-scripts AWS-cred wiring #61

Open

6 tasks

hanwencheng merged 6 commits into main from claude/sad-babbage-29a157

hanwencheng commented Apr 26, 2026 • edited by WildmetaAgent
