Skip to content

feat(stage7): broker-server vertical slice + three-role docs#60

Merged
hanwencheng merged 6 commits intomainfrom
claude/sad-babbage-29a157
Apr 27, 2026
Merged

feat(stage7): broker-server vertical slice + three-role docs#60
hanwencheng merged 6 commits intomainfrom
claude/sad-babbage-29a157

Conversation

@hanwencheng
Copy link
Copy Markdown
Member

@hanwencheng hanwencheng commented Apr 26, 2026

Summary

Resolves #58 phase 1 — the credential broker that lets app developers run daemons against operator infrastructure without holding any AWS keys. Phased per /plan-ceo-review:

  • Q1 vertical slice — phase 1 ships POST /v1/mint-aws-creds end-to-end with real auth + audit. OIDC primitives (AssumeRoleWithWebIdentity + JWKS) deferred to phase 2 because they're independently blocked on the public-hosting prereq in docs/stage7-wip.md.
  • Q2 separate cratecrates/agentkeys-broker-server/ rather than extending agentkeys-mock-server. Concerns differ; coupling chain-mock to AWS-broker would create an awkward operational surface.
  • Q3 doc-firstdocs/dev-setup.md rewrite + new docs/operator-runbook.md lead the change so reviewers see the framing before the code.

✅ Phase-1 e2e proven on real AWS

Operator ran the live three-terminal flow from docs/stage7-wip.md end-to-end against the production AWS account. Result of POST /v1/mint-aws-creds:

{
  "access_key_id": "ASIAWHZVNRHP52WKVY63",
  "expiration": 1777268187,
  "wallet": "0xcd6e718c072917b5468157766ad2860944d0120d"
}

ASIA… prefix + future expiration confirm real STS-issued temp credentials (not the long-lived daemon AKIA key, which never leaves the broker process). Audit row written to ~/.agentkeys/broker/audit.sqlite with outcome="ok" and the wallet attribution flowed through correctly.

The broker minted creds are drop-in compatible with the legacy scripts/stage6-demo-env.sh env-var shape, which means the existing OpenRouter scraper consumes them unchanged. Full provisioner-scripts rewiring (so the scraper calls the broker itself instead of a manual export) is the deferred phase-2 item.

Commits in this PR

Commit Purpose
f4990f4 Initial vertical slice: new agentkeys-broker-server crate + three-role docs + daemon --broker-url flag
aba0dc3 /plan-eng-review follow-ups: mutex poison fix, silent-audit fix, plain-HTTP warn, microsecond suffix, tracing, startup STS check, graceful shutdown
7b0b6f5 Codex review follow-ups: audit column rename (requester_tokenrequester_token_hash), WAL+FULL pragmas, BackendError outcome variant, config parse strictness, reqwest + drain timeouts, SIGTERM expect()
f0960f6 docs/stage7-wip.md split into phase 1 (shipped) / phase 2 (deferred)
4e974dd Remove 1Password CLI references from operator-facing docs in favor of ~/.zshenv pattern
ef892b8 Align broker env vars with existing ~/.zshenv convention: read DAEMON_ACCESS_KEY_ID / DAEMON_SECRET_ACCESS_KEY (matching scripts/stage6-demo-env.sh); derive BROKER_AGENT_ROLE_ARN from ACCOUNT_ID; fall back to REGION for AWS region

What's in the diff

docs/dev-setup.md                         | rewrite around 3 roles
docs/operator-runbook.md                  | NEW (start/supervise/rotate/audit, v0.1 scope)
docs/stage7-wip.md                        | restructured: phase 1 (shipped) + phase 2 (deferred)
crates/agentkeys-broker-server/           | NEW crate (axum)
  Cargo.toml                              |   axum 0.7, aws-sdk-sts 1, rusqlite (WAL+FULL), reqwest
  src/lib.rs                              |   create_router(state) → 4 routes
  src/main.rs                             |   --port / --bind / --skip-startup-check, graceful shutdown, non-loopback warn
  src/config.rs                           |   BrokerConfig::from_env() — DAEMON_* + ACCOUNT_ID derivation, REGION fallback
  src/state.rs                            |   AppState { config, http (with timeouts), audit, sts }
  src/error.rs                            |   BrokerError → IntoResponse (5 kinds)
  src/audit.rs                            |   AuditLog (SQLite WAL+FULL, sha256 token hash, last_row inspect API)
  src/auth.rs                             |   validate_bearer_token() → backend /session/validate
  src/sts.rs                              |   StsClient trait + AwsStsClient + StubStsClient (closure-backed, 3 factories)
  src/handlers/health.rs                  |   /healthz, /readyz (backend + STS probes)
  src/handlers/mint.rs                    |   POST /v1/mint-aws-creds (instrumented; record_outcome helper)
  tests/mint_flow.rs                      |   9 broker integration tests (mock-backend + stub STS)
crates/agentkeys-mock-server/             | + GET /session/validate endpoint
crates/agentkeys-daemon/src/main.rs       | + --broker-url / AGENTKEYS_BROKER_URL flag
Cargo.toml                                | + workspace member entry

Tests: 8 broker unit + 9 broker integration + 186 existing = 203 / 203 passing, no regressions.

Architecture (v0.1)

   developer                 operator host
   ─────────                 ──────────────
                           ┌─ agentkeys-broker-server
                           │    │
   agentkeys-daemon ──────────► POST /v1/mint-aws-creds
       (bearer token)      │    │
                           │    ├──► GET backend/session/validate
                           │    │       (validates bearer, returns wallet)
                           │    │
                           │    ├──► sts:AssumeRole
                           │    │     (uses DAEMON_ACCESS_KEY_ID +
                           │    │      DAEMON_SECRET_ACCESS_KEY from
                           │    │      operator's ~/.zshenv)
                           │    │       │
                           │    │       ▼
                           │    │   1h scoped temp creds
                           │    │
                           │    └──► AuditLog::record_mint() → SQLite (WAL+FULL)
                           │
                           └─ daemon AWS key — never leaves this process

The broker is stateless w.r.t. sessions — backend (mock-server in dev, chain in v0.2+) is the single source of truth for which bearer tokens are valid. The new GET /session/validate endpoint on mock-server is the join point. Trade-off: backend outage is transitive to broker (no cache); fine for v0.1 dev loop.

STS is trait-abstracted (StsClient) with an AwsStsClient for production and a StubStsClient (gated behind a test-stub feature) for integration tests. CI never hits AWS. The live test above is what validated the production path.

Operator UX — env var alignment

The operator's existing ~/.zshenv already had DAEMON_ACCESS_KEY_ID, DAEMON_SECRET_ACCESS_KEY, ACCOUNT_ID, and REGION from the Stage 6 setup. Phase-1 makes the broker read those same names so no ~/.zshenv edits are required to start using it. The only per-run env var the operator now needs is BROKER_BACKEND_URL. See docs/operator-runbook.md §3.1.

Acceptance criteria progress (issue #58)

  • AC1 — Operator starts broker on a fresh laptop with daemon AWS keys in env. ✓ Verified live.
  • AC2 — App developer runs the daemon with AGENTKEYS_BROKER_URL pointing at the operator's broker. The flag is wired; the consumer of the temp creds (provisioner-scripts) lands in phase 2.
  • AC3 — End user flow unchanged. ✓ No CLI surface changes.
  • AC4docs/dev-setup.md has three top-level role sections. ✓
  • AC5docs/stage6-aws-setup.md no longer asks anyone except the operator to handle AWS keys. ✓ The dev-setup rewrite makes this implicit by routing developers to §4.
  • AC6bash harness/stage-7-done.sh exits 0. Deferred to phase 4 (the harness file currently has Stage 0 + 5a only; cleaning it up before Stage 7 sign-off is its own piece of work).

Reviews run

  • /plan-ceo-review — HOLD SCOPE; identified 3 forks (vertical slice / separate crate / doc-first), all addressed in commit f4990f4.
  • /plan-eng-review — 11 findings; load-bearing 4 fixed in commit aba0dc3.
  • Codex review — 9 findings; 6 fixed in commit 7b0b6f5, 3 deferred (cached caller_identity_ok for k8s probes, test-broker join-on-teardown, infallible from_keys constructor).
  • Codex adversarial review — 9 architectural challenges (separate crate cost, phase-split logic, stateless-broker as SPOF, SQLite as audit store, premature trait abstraction, three-role doc framing, microsecond suffix necessity, FULL sync over-spec, phase-2 hosting drift). Verdict: phase-1 carries its own weight even if phase-2 slips, primary unaddressed concern is the backend-as-SPOF coupling (call out as known gap rather than fix in v0.1).

Test plan

  • cargo build --workspace — clean.
  • cargo test --workspace — 203 / 203 passing.
  • cargo test -p agentkeys-broker-server --features test-stub — 17 broker tests (8 unit + 9 integration) pass against mock backend + stub STS.
  • Live three-terminal e2e on real AWS — broker mints temp creds; ASIA prefix + future expiration confirmed; audit row written; wallet attribution correct.
  • Operator review: confirm the DAEMON_* env-var alignment + ACCOUNT_ID-derived role ARN match the team's ~/.zshenv convention before merge.

Out of scope (deliberately deferred)

  • OIDC half of Stage 7 (/.well-known/openid-configuration, /.well-known/jwks.json, POST /v1/mint-oidc-jwt, sts:AssumeRoleWithWebIdentity). Independently blocked on the public-hosting prereq from docs/stage7-wip.md. Phase 2 PR.
  • TS services/oidc-stub/ retirement. Phase 2, once the OIDC half is in Rust.
  • Provisioner-scripts integration. The daemon flag is wired; the consumer code that asks the broker for creds before reading from the operator's S3 bucket lands in phase 2 alongside the OIDC work.
  • KMS-sealed config source for hosted shape. Interface-only; full implementation is hosted-deploy work.
  • Bearer-token validation cache. Backend round-trip on every mint; acceptable for v0.1 dev workload (mints every ~55 min per daemon).
  • Harness Stage 7 sign-off. Phase 4 — and the existing harness/features.json needs Stages 1-4+6 entries first.

Security notes

  • Audit log stores sha256(bearer_token), never the raw token. Reasoning in docs/operator-runbook.md.
  • Threat model — broker holds the long-lived AWS key, so a compromised broker host = unbounded AWS access for that role. v0.1 scope is "run the broker on a host you trust." TEE-backed hosting is the v0.2+ evolution per docs/spec/threat-model-key-custody.md.
  • Session duration clamped to [900, 43200] seconds at config load time.
  • STS session names sanitized — non-alphanumeric chars stripped, capped at 64 bytes, microsecond-suffixed for CloudTrail readability.
  • Plain-HTTP bind to non-loopback emits a startup warning. Operators are expected to terminate TLS at a reverse proxy before exposing the broker beyond the host.

Related

  • Issue #58 — Stage 7 broker server (this PR is phase 1).
  • Issue #57 — Stage 8 off-chain vault (separate; pairs with this work post-Stage-7).
  • PR #59 — threat model + dev-env bootstrap (the doc work this PR builds on).

🤖 Generated with Claude Code

WildmetaAgent and others added 6 commits April 27, 2026 01:34
Resolves #58 phase 1 — the credential broker that lets app developers run
daemons against operator infrastructure without holding any AWS keys.

Doc reframe (front-loaded per CEO-review Q3):
- docs/dev-setup.md rewritten around three roles (app developer / operator /
  end user). Each role's setup is its own section.
- docs/operator-runbook.md (new) — start, supervise, rotate, audit. Calls
  out v0.1 scope vs Stage 7 phase 2 (OIDC) vs Stage 8 (vault).

New crate crates/agentkeys-broker-server/ (vertical slice per CEO-review Q1):
- POST /v1/mint-aws-creds — bearer auth via backend's new /session/validate,
  sts:AssumeRole on operator's daemon key, returns 1h temp creds. Static-IAM
  path; assume-role-with-web-identity deferred to phase 2.
- GET /healthz, /readyz — supervisor probes; readyz exercises backend
  reachability + sts:GetCallerIdentity.
- SQLite audit log on every mint (sha256-hashed bearer tokens, wallet,
  outcome, sts session name) at $HOME/.agentkeys/broker/audit.sqlite.
- Trait-abstracted StsClient with AwsStsClient + StubStsClient (test-stub
  feature) — testable without live AWS. Env-var config only.

mock-server adds GET /session/validate so the broker validates tokens
through the backend instead of duplicating session state. Broker stays
stateless w.r.t. sessions; backend is single source of truth.

agentkeys-daemon gains --broker-url / AGENTKEYS_BROKER_URL flag (consumer
wiring lands in phase 2 alongside provisioner-script integration).

Tests: 3 unit + 5 broker integration (mock-backend + stub STS) — full
workspace cargo test passes 194/194, no regressions.

Out of scope (explicit, deferred):
- OIDC discovery / JWKS / AssumeRoleWithWebIdentity — phase 2 (gated on
  public-hosting prereq, docs/stage7-wip.md §1).
- TS oidc-stub retirement — phase 2.
- Provisioner-scripts AWS-cred consumer rewiring — phase 2.

Refs #58.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address /plan-eng-review findings on PR #60 phase 1.

Critical (silent-failure trio):
- audit.rs: replace lock().unwrap() with lock_conn() that propagates poison
  as BrokerError::AuditError instead of panicking the tokio worker.
- mint.rs: failure-path audit writes were silently swallowed (let _ = ...);
  now route through record_outcome() which logs at error level on audit
  insert failure so anomaly-detection blindness is visible to operators.
- main.rs: warn loudly when binding to a non-loopback address (bearer tokens
  + minted AWS creds in cleartext otherwise — terminate TLS at a reverse
  proxy first).

Reliability:
- main.rs: validate STS creds at startup (--skip-startup-check escape hatch
  for offline dev). Misconfigured creds now fail to bind, not on first mint.
- main.rs: graceful shutdown on SIGTERM/Ctrl-C drains in-flight requests
  via with_graceful_shutdown(); prevents orphan audit rows where the daemon
  never received the response.
- mint.rs: build_session_name now appends a microsecond suffix; same wallet
  minting twice within a second no longer collides on STS session name.

Observability:
- mint.rs: #[tracing::instrument] span on mint_aws_creds, with wallet +
  outcome fields recorded as the request progresses.

DRY + tests:
- mint.rs: pull record_outcome() helper; three near-identical audit-insert
  call sites collapse to one.
- StubStsClient: closure-backed; new ::ok / ::failing / ::assume_failing
  factory methods cover happy/down/partial-down test scenarios.
- audit.rs: new AuditLog::last_row() + hash_token exported for test
  introspection.
- 9 broker integration tests (was 5) — added STS-error path, backend-down
  path, both readyz failure modes, and audit-row assertions on every mint.
- 4 new audit unit tests covering hash_token determinism, distinct hashes,
  record-mint roundtrip, failure-detail persistence.

Test count workspace-wide: 203 / 203 passing (was 194). No regressions.

Refs #58, addresses /plan-eng-review findings #1, #2, #3, #4, #6, #10, #12, #13.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address 7 issues from the codex review on top of /plan-eng-review.

Critical:
- audit.rs: column name `requester_token` stored hashed values, misleading
  any operator querying it. Renamed to `requester_token_hash` to match
  what's actually written. The Rust struct already used the correct name;
  only the SQLite schema and the SELECT lagged.
- audit.rs: enable WAL + synchronous=FULL on the audit DB. Default journal
  mode could lose recent rows on power loss; for an audit log durability
  beats throughput.

Reliability:
- audit.rs: new MintOutcome::BackendError variant. Backend-unreachable was
  previously written as "auth_failed", which made operator anomaly
  detection blind to backend outages (looked like a token-fishing spike).
- config.rs: BROKER_SESSION_DURATION_SECONDS parse failure now surfaces as
  a startup error instead of silently falling back to 3600.
- config.rs: new BROKER_BACKEND_TIMEOUT_SECONDS (default 10s) and
  BROKER_SHUTDOWN_GRACE_SECONDS (default 30s).
- main.rs: reqwest client gets the configured timeout + a 5s connect timeout.
  Previously a hung backend would pin a tokio task forever.
- main.rs: graceful-shutdown future races a hard-cap sleep so a single hung
  request can't block process exit indefinitely.
- main.rs: SIGTERM handler now expect()s on registration. Failing loud is
  better than the prior `if let Ok(...)` which would silently exit on
  startup in hardened-sandbox environments.

Audit perf nit:
- audit.rs: compute timestamp + token hash before grabbing the mutex so
  the critical section is purely the SQLite write.

Tests updated:
- mint_flow.rs: backend-unreachable test now asserts outcome="backend_error"
  (was "auth_failed").
- mint_flow.rs: BrokerConfig now constructs with the two new timeout fields;
  test reqwest client gets short timeouts.

Test count workspace-wide: 203 / 203 passing. No regressions.

Refs #58, addresses codex review findings on PR #60.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
stage7-wip.md previously described Stage 7 as one undifferentiated
"not running yet" surface. With PR #60 phase 1 (broker server) shipped,
the doc was misleading: readers couldn't tell what's live, what isn't,
or where the operator runbook had moved to.

Restructured around the two halves:
- Phase 1 (shipped) — points at crates/agentkeys-broker-server/, the
  three-role dev-setup.md, and the operator-runbook. Includes the
  three-terminal e2e proof (mock backend + broker + curl mint).
- Phase 2 (deferred) — preserves the existing OIDC federation test
  recipe (IAM provider registration, federated trust policy,
  PrincipalTag bucket policy, JWT mint via TS stub, cross-prefix
  AccessDenied proof). Reframed as "still blocked on public hosting +
  TEE-derived ES256 key per heima-gaps §3."

Refs #58.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The team persists BROKER_DAEMON_* in ~/.zshenv (mode 0600), not in a
1Password vault accessed via `op read`. Update the three Stage 7 docs
to match actual operator workflow:

- docs/operator-runbook.md §3.1 now describes ~/.zshenv (or supervisor
  env-injection) instead of recommending 1Password CLI. Adds the
  "shared/untrusted host" caveat for systemd LoadCredential / launchd
  EnvironmentVariables fallback.
- docs/operator-runbook.md §5 (rotation): updates step 2 from
  "update your secret store (1Password)" to "update ~/.zshenv".
- docs/operator-runbook.md §9 (out-of-scope): retitles "1Password CLI
  integration" to "secret-manager integration" generally.
- docs/dev-setup.md §1 (optional tools): removes 1Password CLI bullet.
- docs/dev-setup.md §3 (role table): "1Password" → "~/.zshenv or
  supervisor-managed env" in the operator row.
- docs/dev-setup.md §5.1: replaces "stash in 1Password" with the
  ~/.zshenv persistence pattern.
- docs/dev-setup.md §5.2 + §5.4: removes inline `op read` calls from
  the broker-startup snippets; comments now state BROKER_DAEMON_* are
  inherited from the shell.
- docs/stage7-wip.md phase-1 e2e proof: same op-read removal.

No code changes. The broker still reads BROKER_DAEMON_* from std::env
exactly as before; only the operator-facing instructions changed.

Refs #58.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Operator's ~/.zshenv already defines:
  DAEMON_ACCESS_KEY_ID
  DAEMON_SECRET_ACCESS_KEY
  ACCOUNT_ID
  REGION
  BUCKET
  DOMAIN

scripts/stage6-demo-env.sh has read DAEMON_ACCESS_KEY_ID +
DAEMON_SECRET_ACCESS_KEY since Stage 6. Introducing a second naming
scheme (BROKER_DAEMON_*) for the same long-lived keys forces operators
to either duplicate exports or rewrite ~/.zshenv. Align instead.

Code (config.rs):
- BROKER_DAEMON_ACCESS_KEY_ID env var renamed to DAEMON_ACCESS_KEY_ID,
  with BROKER_DAEMON_ACCESS_KEY_ID kept as a fallback for explicit
  callers. Same for DAEMON_SECRET_ACCESS_KEY.
- BROKER_AGENT_ROLE_ARN now optional: if unset, derived from ACCOUNT_ID
  as arn:aws:iam::$ACCOUNT_ID:role/agentkeys-agent (the Stage 6 canonical
  role name). Operator can still override.
- BROKER_AWS_REGION now falls back to REGION (the rest-of-agentKeys
  convention) before defaulting to us-east-1.
- New first_env() helper picks the first non-empty match from a list
  of candidate env-var names.

Docs:
- docs/operator-runbook.md §3.1: env-var schema table updated;
  ~/.zshenv example shows REGION + ACCOUNT_ID + DAEMON_* (matches actual
  zshenv layout). Two new vars from prior commit (BROKER_BACKEND_TIMEOUT_SECONDS,
  BROKER_SHUTDOWN_GRACE_SECONDS) added to the table.
- docs/operator-runbook.md §5: rotation step references DAEMON_*.
- docs/dev-setup.md §5.2 + §5.4: the explicit `export BROKER_AGENT_ROLE_ARN=...`
  line drops out — broker derives from ACCOUNT_ID. Now the only per-run
  var is BROKER_BACKEND_URL.
- docs/stage7-wip.md phase-1 e2e: same simplification.

Tests: 17 / 17 broker tests passing (BrokerConfig is constructed
literally in tests, so the env-var rename doesn't affect them).

Refs #58.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@hanwencheng hanwencheng merged commit 5d4e652 into main Apr 27, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Stage 7 — Broker server: implement the missing 'broker' so developers stop holding AWS daemon keys

2 participants