Skip to content

v2 stage 1 — sovereign sidecar + on-chain identity + credentials-service worker (#89)#87

Merged
hanwencheng merged 57 commits into
mainfrom
claude/stupefied-darwin-cfafd6
May 19, 2026
Merged

v2 stage 1 — sovereign sidecar + on-chain identity + credentials-service worker (#89)#87
hanwencheng merged 57 commits into
mainfrom
claude/stupefied-darwin-cfafd6

Conversation

@hanwencheng
Copy link
Copy Markdown
Member

@hanwencheng hanwencheng commented May 16, 2026

v2 stage 1 — sovereign sidecar + on-chain identity + credentials-service worker (#89)

Resolves #89. Partially overlaps #90 (real K11 WebAuthn) and #91 (credentials-service worker microservice variant) — see "Stage-2 work partially landed" below.

This PR began as step 1 of #85 (mock-server → S3 migration). With the v2 architecture refactor (arch.md v2) issued during the work, scope expanded into the full v2 stage-1 foundation. Operators upgrading from #87's S3CredentialBackend get the on-chain registry + sidecar daemon + worker + WebAuthn-bound K11 in this single change.

What's shipped

On-chain layer (crates/agentkeys-chain/)

Four Solidity 0.8.20 contracts (Frontier-compatible evm_version = "london" per Heima) deployed to Heima mainnet (chain_id 212013):

Contract Address Purpose
SidecarRegistry 0x76D574a107727bE87fc1422661A030FEFda70786 Device registry (master + agent tier) with role bitfield (CAP_MINT=1, RECOVERY=2, SCOPE_MGMT=4) per arch.md §10.1
AgentKeysScope 0x14C23B5D1cE20c094af643a20e6b0972dAD12aa8 Per-(operator, agent) scope with K11-gated mutations per arch.md §12.4
K3EpochCounter 0x8396dEc50ff755d6DE7728DABB00Be2eFBCdf4dF Global K3 rotation epoch (one tx per rotation, O(1))
CredentialAudit 0x1801ded1a4FBD8c9224Ab18B9EcbB293B8674c06 Append-only audit log (arch.md §15.3 tier C)

Source + 11 forge tests under crates/agentkeys-chain/test/. Deploy script at crates/agentkeys-chain/script/DeployAgentKeysV1.s.sol. Live verification:

AGENTKEYS_CHAIN=heima bash scripts/verify-heima-contracts.sh
# 13 sub-checks across 4 contracts; exits 0 on all-pass.

Broker — cap-mint endpoints (crates/agentkeys-broker-server/src/handlers/cap.rs)

  • POST /v1/cap/cred-store — mint a one-shot capability for credential storage
  • POST /v1/cap/cred-fetch — mint a cap for credential read
  • Per-request: reads on-chain SidecarRegistry.isActive(deviceKeyHash), AgentKeysScope.isServiceInScope(operator, agent, service), K3EpochCounter.currentEpoch() via eth_call over reqwest — co-signs cap with broker's K1 P-256 key only if all three checks pass.
  • Cap shape: {operator_omni, actor_omni, service, op_type, k3_epoch, expires_at, nonce} signed as base64url(p256-ecdsa). Workers re-verify against on-chain state independently (arch.md §15.1).

Daemon — localhost cap-proxy (crates/agentkeys-daemon/src/proxy.rs)

  • Unix socket at $XDG_RUNTIME_DIR/agentkeys-proxy.sock (caller-uid via SO_PEERCRED) + optional TCP fallback
  • 5-min TTL cap-token cache (in-memory tokio::sync::RwLock<HashMap>)
  • Fail-closed on stale broker (last-contact > 60s rejects new fetches)
  • Per-call audit row to stdout in JSON-line format
  • Per-caller scope check with deny-by-default + allowlist policy
  • SSE drop event push for chain-side revocations

Credentials worker (crates/agentkeys-worker-creds/)

Independent re-verify of cap-tokens before any S3 PUT/GET:

  1. P-256 signature verify against broker's pubkey (env-loaded for stage 1; mTLS-derived for stage 2 Stage 2: ship credentials-service worker (arch.md §15.1) — Lambda + mTLS to signer #91)
  2. Op-type matches the endpoint
  3. Freshness window (cap not expired)
  4. On-chain SidecarRegistry.isActive(device) re-check
  5. On-chain AgentKeysScope.isServiceInScope(operator, actor, service) re-check
  6. On-chain K3EpochCounter.currentEpoch == cap.k3_epoch re-check

Endpoints: POST /v1/cred/store, POST /v1/cred/fetch, POST /v1/cred/teardown. AES-256-GCM envelope with v2 shape (1B version || 12B nonce || ciphertext || 16B tag); KEK from AGENTKEYS_WORKER_KEK_HEX (stage-1 simplification per arch.md §22b.2 — WARN at boot; rejects all-zeros + all-same-byte placeholders).

Memory worker (crates/agentkeys-worker-memory/)

Same shape as creds worker, distinct KEK + bucket per arch.md §17 per-data-class separation. Endpoints: POST /v1/memory/{store,fetch,teardown}.

CLI — agentkeys k11 subcommand (crates/agentkeys-cli/src/k11.rs + k11_webauthn.rs)

# Real WebAuthn ceremony — opens default browser, Touch ID prompt on macOS
agentkeys k11 enroll --webauthn --operator-omni 0x<64-hex>
agentkeys k11 assert --webauthn --operator-omni 0x<64-hex> --message-hex 0xdeadbeef

# Stage-1 stub — deterministic bytes, CI-friendly, no Touch ID prompt
agentkeys k11 enroll --operator-omni 0x<64-hex>
agentkeys k11 assert --operator-omni 0x<64-hex> --message-hex 0xdeadbeef

--webauthn brings up a localhost axum server, opens http://localhost:<random>, runs navigator.credentials.{create,get} against the platform passkey (Touch ID / Windows Hello / Secure Enclave), validates clientDataJSON challenge/type/origin, parses attestationObject CBOR, verifies authData (rpIdHash, UP|UV|AT flags, credentialId match), extracts P-256 X+Y from COSE pubkey, persists to ~/.agentkeys/k11/<omni>.json (mode 0600, mode: "webauthn"). Assert binds to the application message via challenge = sha256(message) — the resulting signature is cryptographically bound to that exact call.

Without --webauthn, defaults to deterministic stub for CI. WARN to stderr on AGENTKEYS_CHAIN=heima per arch.md §22b.1.

Helper scripts wrapping the live contracts

Eight idempotent bash scripts under scripts/ — each cast-based, pre-checks current on-chain state and short-circuits when the op is already a no-op:

Script Wraps Idempotency check
heima-bring-up.sh Forge deploy (all 4 contracts) cast code on each stored address
heima-fund-account.sh EVM transfer from operator cast balance ≥ requested amount
heima-device-register.sh SidecarRegistry.registerMasterDevice getDevice.registeredAt > 0
heima-agent-create.sh SidecarRegistry.registerAgentDevice same
heima-scope-set.sh AgentKeysScope.setScopeWithWebauthn getScope config equality
heima-scope-revoke.sh AgentKeysScope.revokeScope getScope.exists == false
heima-device-revoke.sh SidecarRegistry.revokeDevice getDevice.revoked == true
heima-credential-audit.sh CredentialAudit.append (always appends; verifies entryCount += 1)

All accept --webauthn (where applicable) to get the K11 assertion from the real ceremony via agentkeys k11 assert --webauthn.

One-command end-to-end demo (scripts/v2-stage1-demo.sh)

15 steps, idempotent re-runs:

1.  Build + install agentkeys CLI
2.  AWS profile sanity
3.  Operator-workstation env source + chain profile
4.  Email setup (SES verify + S3 inbound)
5.  agentkeys init --email-link (session JWT bootstrap)
6.  Provision vault bucket + role + bucket policy (arch.md §17)
7.  S3 credential round-trip smoke + cross-bucket-contamination assertion
8.  Chain bring-up (forge deploy + env_set the 4 addresses)
9.  Verify-heima-contracts (read-only RPC check, 0 gas)
10. registerMasterDevice  (no K11 — first call is bootstrap)
11. K11 enrollment (stub or --webauthn)
12. registerAgentDevice
13. setScopeWithWebauthn  (K11-gated)
14. CredentialAudit.append
15. Summary

Pass --webauthn for the real Touch ID ceremony on steps 11 + 13. Default is stub mode (CI-friendly).

Stage-2 work partially landed

#90 (Hardening) — partial

  • K11 WebAuthn enroll + assert via Touch ID ✅ shipped (PR's --webauthn flag). On-chain P-256 verify still deferred until Heima ships EIP-7212 precompile.
  • Role bitfield in SidecarRegistry ✅ shipped (CAP_MINT=1, RECOVERY=2, SCOPE_MGMT=4).
  • Multi-master M-of-N recovery quorum — ❌ deferred. heima-device-revoke.sh only handles single-master revoke today.
  • audit-service worker (tier A Merkle relay) — ❌ deferred. CredentialAudit (tier C direct-write) is shipped instead.
  • memory-service worker — ✅ shipped (agentkeys-worker-memory).
  • email-service worker — ❌ deferred.
  • K3 rotation runbook — ❌ deferred. K3EpochCounter contract is on-chain with currentEpoch = 1 and signerGovernance initialized; rotation procedure documented in arch.md §22b but no operational script yet.

A comment on #90 enumerates these.

#91 (credentials-service worker) — partial

  • Rust microservice variant of the worker — ✅ shipped (agentkeys-worker-creds).
  • Lambda variant — ❌ deferred. arch.md §28 lists microservice + Lambda as alternatives; the microservice form is sufficient for stage-1 closure.
  • mTLS plumbing for KEK derivation from signer — ❌ stage-1 simplification: KEK from env per arch.md §22b.2 (issue Stage 2: ship credentials-service worker (arch.md §15.1) — Lambda + mTLS to signer #91 hardens this).
  • --credential-backend=sidecar wired through worker — ✅ shipped (CLI flag + daemon proxy + worker endpoints all in place).
  • Independent on-chain re-verify before any S3 op — ✅ shipped (verify.rs::verify_cap).

A comment on #91 enumerates these.

Audit pass + WebAuthn ceremony

docs/v2-stage1-iteration-log.md records 4 audit findings + fixes:

# Type Where Fix
1 arch-mismatch 4 source files cited "stage-1 simplification per arch.md §22a" but §22a is "Chain profiles" New arch.md §22b stage-1 simplifications inventory (§22b.1..§22b.5) with stage-2 issue pointers
2 bypass K11 was stub-only — no path to real ceremony --webauthn flag with browser + Touch ID; manual ceremony parsing (CBOR auth-data, P-256 verify)
3 bypass KEK-from-env had no startup WARN + accepted weak placeholders Both worker crates: reject all-zeros + all-same-byte, WARN at boot citing §22b.2 + #91
4 doc drift "not yet implemented" sidecar row Updated to shipped surface + invocation recipe

Two codex adversarial review passes against 7a89c5f and d0ab230 returned APPROVED after follow-up fixes — see iteration log for the verdict tables.

Test plan

  • cargo test --workspace — 200+ tests across 9 crates, all pass.
  • cargo clippy --workspace — zero warnings.
  • cargo test -p agentkeys-cli — 57 tests (k11 stub, k11 webauthn unit + integration, chain helpers).
  • bash scripts/v2-stage1-demo.sh — end-to-end against Heima mainnet succeeds, idempotent on re-run.
  • bash scripts/v2-stage1-demo.sh --webauthn — opens browser, Touch ID prompts, real assertion bytes go on chain.
  • AGENTKEYS_CHAIN=heima bash scripts/verify-heima-contracts.sh — 13/13 checks pass.

What did NOT land

Tracked in #90 (hardening) and #91 (worker hardening):

  • M-of-N multi-master recovery quorum
  • audit-service worker (tier A Merkle relay)
  • email-service worker
  • Lambda variant of credentials-service worker
  • mTLS-derived KEK from signer enclave
  • K3 rotation operational runbook
  • On-chain P-256 verify of K11 assertions (gated on Heima shipping EIP-7212)

🤖 Generated with Claude Code

…ntial-backend flag)

Replaces the legacy mock-server `/credential/*` storage with an
OIDC-scoped, client-side-encrypted S3 backend living next to the
existing `bots/<wallet>/inbound/` SES routing prefix (issue #83).
The legacy backend keeps handling sessions, audit, identity, scope,
rendezvous, and inbox; only credential CRUD migrates in this PR.

The pieces:

- `crates/agentkeys-core/src/s3_backend.rs` — new
  `S3CredentialBackend` impl. `store/read/teardown/list_credentials`
  go through S3. Every other `CredentialBackend` method returns a
  clear "route through http backend" error — those endpoints still
  live on the mock-server (or the broker for the new flow).
- AES-256-GCM seal, 96-bit random nonce, AAD =
  `agentkeys.cred.aad.v1|<lower-wallet>|<service>`. Wire layout
  `1B version || 12B nonce || ciphertext || 16B tag`, version =
  0x01. AAD binds blob to its (wallet, service) S3 location so a
  cross-operator swap fails open.
- KEK derivation is signer-anchored: SHA256(domain ||
  signer.sign_eip191(omni, "agentkeys.kek.v1:<wallet>:<service>")).
  secp256k1 RFC 6979 makes this deterministic across calls, so the
  same KEK comes back on every read; future TEE migration (issue
  #74 step 2) inherits it transparently.

- `crates/agentkeys-cli` — `CredentialBackendKind { Http, S3 }` plus
  `--credential-backend` (env `AGENTKEYS_CREDENTIAL_BACKEND`),
  `--bucket` / `AGENTKEYS_BUCKET`, `--signer-url` /
  `AGENTKEYS_SIGNER_URL`, `--omni-account` /
  `AGENTKEYS_OMNI_ACCOUNT`. New `credential_backend()` async helper on
  `CommandContext` builds the right impl per call. `cmd_store`,
  `cmd_read`, `cmd_run`, `cmd_teardown`, `cmd_provision` now route
  credential CRUD through it; identity resolution + the rest stay on
  the legacy http backend regardless of the flag. Default remains
  `http` for the transition window.

- `docs/cloud-setup.md` §4.4 — new `AllowDaemonPutOwnCredentials`
  bucket-policy statement granting `s3:PutObject` + `s3:DeleteObject`
  on `bots/<wallet>/credentials/*` under the same
  `agentkeys_user_wallet` PrincipalTag scope that already gates
  `s3:GetObject`. Operators running `--credential-backend=s3` need
  the policy update to land first.
- `docs/spec/architecture.md` §3a — add `credential_kek` and
  `credential_envelope` canonical-names rows so future docs reference
  the same terms.
- `docs/spec/architecture.md` §9 #10 — flag the mock-server credential
  slice as "migrating off", point at issue #85.
- `docs/stage7-demo-and-verification.md` §5.3 — operator-side opt-in
  block (env vars to set, what to expect at the S3 key).
- `docs/spec/plans/issue-credential-storage-s3-oidc.md` — mark steps
  1–3 as shipped; steps 4–6 still pending (default flip + mock-server
  handler removal + arch.md §11 cleanup).

Tests:
  - cargo test -p agentkeys-core -p agentkeys-cli -p agentkeys-mcp
    -p agentkeys-provisioner — clean (9 new s3_backend tests covering
    object_key path, KEK determinism, AEAD round-trip, AAD-binding,
    envelope-version drift, truncated envelope; 37+3+44+7+23 = 114
    pre-existing tests still pass).
  - cargo clippy on agentkeys-core + agentkeys-cli — clean.

No deployment changes required for the existing `http` default. To opt
into `s3` an operator runs the cloud-setup.md §4.4 update once per
account, sets the four env vars, and the next `provision` writes to S3.
Two high-severity findings from /codex:adversarial-review on PR #87:

1. **Scope enforcement was missing.** S3CredentialBackend ignored
   Session.scope on store_credential / read_credential /
   list_credentials / teardown_agent. The legacy HTTP backend gates
   per-service access server-side via the /credential/* handlers'
   bearer-JWT check; the S3 backend has no equivalent (the bucket
   policy keys only on the wallet PrincipalTag, not service). A
   scoped child session could therefore have read or written any
   service under its wallet prefix.

   Fix: client-side gate before any S3 call.
   - enforce_scope_for_service(session, service, write) rejects
     PermissionDenied when the service isn't in scope.services and
     when write=true on a read_only scope.
   - enforce_master_session(session, op) rejects teardown_agent on a
     scoped child (wallet-level destruction is master-only — matches
     the implicit legacy contract).
   - list_credentials filters its return down to scope.services so
     a scoped child can't enumerate the master's other services.

2. **Broker-minted AWS creds weren't reaching the S3 client.**
   cmd_provision fetched the OIDC-scoped temp creds via
   broker_env_for_provision and injected them into the scraper
   subprocess env only. The parent process's S3CredentialBackend used
   aws_config::defaults — i.e. process AWS_* env or shared config —
   which would either be empty (storage fails post-key-mint, the
   exact failure mode #85 exists to fix) or the operator's static
   admin creds (no PrincipalTag, isolation property gone).

   Fix: pull cred minting up into CommandContext::credential_backend
   itself.
   - New mint_s3_credentials helper hits the same
     fetch_via_broker_default_ttl path the provisioner uses, returns
     aws_credential_types::Credentials.
   - S3CredentialBackend::new gains a `credentials: Option<...>`
     parameter; when Some, the SDK config builder gets a
     credentials_provider pinned to those creds, bypassing the
     default chain entirely.
   - cmd_provision now ends up with two STS calls per run (one for
     scraper env, one for parent S3 client) — cheap; the alternative
     was threading the env map through the orchestrator into the
     backend factory.

Tests added (all PermissionDenied codes verified):
- enforce_scope_allows_master_session
- enforce_scope_blocks_service_not_in_list
- enforce_scope_blocks_write_when_read_only
- enforce_master_session_blocks_scoped_session
- store_credential_blocks_out_of_scope_before_s3_call
- read_credential_allows_in_scope_read_only (also asserts out-of-scope
  reads still deny)
- teardown_agent_rejects_scoped_session

Test count: agentkeys-core lib 28 → 44 (16 s3_backend tests total: 9
from the initial PR + 7 new). Full affected-crate suite: 121 passing.
cargo clippy on agentkeys-core + agentkeys-cli clean.

Out of scope:
- A full integration test for `provision --credential-backend=s3` end-
  to-end through a real STS + S3 path. That needs live AWS creds in
  CI and is tracked alongside the default-flip work in plan step 4.
@hanwencheng
Copy link
Copy Markdown
Member Author

Addressed both [high] findings from /codex:adversarial-review in 5c36546:

1. Scope enforcement (s3_backend.rs)S3CredentialBackend now gates store/read/list/teardown against Session.scope client-side before any S3 call:

  • enforce_scope_for_service(session, service, write) — rejects PermissionDenied when service ∉ scope.services or write=true on a read_only scope.
  • enforce_master_session(session, op) — rejects teardown_agent on a scoped child (wallet-level destruction is master-only).
  • list_credentials filters returns down to scope.services so scoped children can't enumerate sibling services.

The bucket-policy PrincipalTag only knows the wallet, so client-side enforcement is the trust boundary here — matches what the legacy HTTP backend's /credential/* handlers do server-side.

2. Broker creds wiring (lib.rs)CommandContext::credential_backend() now mints OIDC-scoped AWS temp creds via fetch_via_broker_default_ttl (same path the provisioner uses for the scraper subprocess) and injects them directly into the S3 client via a new credentials: Option<aws_credential_types::Credentials> parameter on S3CredentialBackend::new. The SDK no longer falls back to aws_config::defaults, so:

  • cmd_provision can't end up with a working scraper but a credless parent S3 client.
  • cmd_store / cmd_read use the same PrincipalTag-scoped session as provisioning (not the operator's raw admin creds).

Tests — 7 new scope-enforcement tests (denial cases for write/read/list/teardown + master-session allows). Full affected-crate suite: 121 passing. cargo clippy clean.

Out of scope — a full live-AWS integration test for provision --credential-backend=s3 end-to-end through STS + real S3. Tracked alongside the default-flip work in plan step 4.

…pe, dual-read S3CredentialBackend)

First incremental implementation commit for the v2 stage 1 plan in
docs/spec/plans/v2-issues/issue-v2-stage-1-foundation.md. Lands the
CLI/backend pieces that can ship without the chain contracts or the
sidecar daemon being live yet.

What lands:

* agentkeys_core::actor_omni — deterministic SHA256("agentkeys"||"evm"||
  master_wallet) helper per arch.md §14.1, used to compute the stable
  per-operator anchor independent of K3 / wallet rotation.

* S3CredentialBackend now writes v2 envelopes (version byte 0x02,
  AAD = "agentkeys.cred.aad.v2|<actor_omni_hex>|<service>") and reads
  BOTH v1 and v2 shapes — dispatching on the version byte. v2 writes go
  to bots/<actor_omni_hex>/credentials/ per arch.md §14.5; reads try v2
  first and fall back to v1 only on NotFound, propagating every other
  error to surface real failures.

* Dual-prefix list_credentials (union, dedup'd; v2 wins) and dual-prefix
  teardown_agent (wipes both wallet-keyed and actor_omni-keyed paths) so
  mid-migration state can't strand orphan blobs.

* CLI --envelope-version={v1,v2} flag plumbs WriteEnvelope through
  CommandContext. Default stays v1 so PR #87 deployments keep working
  unchanged; operators flip to v2 post-bucket-policy-rollout.

* CLI --credential-backend=sidecar flag accepted by the surface; today
  returns a clear "not yet implemented" error pointing operators at
  --envelope-version=v2 as the closest currently-working substitute.
  Forward-compatible flag shape so the eventual daemon implementation
  is a code change, not a CLI break.

* agentkeys whoami prints agentkeys_actor_omni alongside session_wallet
  so operators can sanity-check the bucket-policy PrincipalTag and the
  v2 S3 path their backend will use after the dual-tag rollout.

* Tests: 12 new unit tests covering actor_omni determinism + case
  handling, v2 envelope roundtrip, v1/v2 path divergence, AAD
  divergence, version dispatch, WriteEnvelope override. Full workspace
  test suite still green (467 tests passed, 0 failed).

What's deferred:

* Broker /v1/cap/cred-fetch + /v1/cap/cred-store endpoints (cap-mint).
* On-chain ScopeContract / SidecarRegistry / K3EpochCounter contracts.
* K11 WebAuthn verification on master-mutation endpoints.
* Sidecar daemon (agentkeys-proxy.sock).
* OIDC JWT dual-tag mint (agentkeys_user_wallet + agentkeys_actor_omni).
* Bucket policy _v2_omni_keyed rule.

Docs:

* docs/v2-stage1-migration-and-demo.md — new top-level "What landed in
  this commit" section + A.2 clarification on the sidecar stub +
  revision-log entry for 2026-05-18.

* docs/spec/plans/v2-issues/issue-v2-stage-1-foundation.md — three CLI
  tasks marked [x] (sidecar flag, envelope-version flag, whoami
  actor_omni). credentials-service worker section updated to note
  dual-envelope decrypt + dual-path read already work in
  S3CredentialBackend; Lambda reuse is the remaining work.

* docs/spec/architecture.md §14 — the prior session's v2 consolidation
  (was uncommitted; lands with this commit).

* docs/spec/plans/v2-issues/ — three planning issues filed alongside
  arch.md §14 (stage-1 foundation, stage-2 hardening, deferred payment
  service).

* docs/archived/ — earlier standalone v2 design docs superseded by
  arch.md §14, archived per CLAUDE.md docs/archived policy.
…complete

The pre-v2 architecture.md was a patchwork of the original single-binary
mock-server design plus a §14 graft for v2 plus three layered Codex
amendment addenda (§14.8, §14.9, §14.9a). New readers had to triangulate
across the v1 spine + v2 graft + amendments to reconstruct the design.

This rewrite collapses all of that into one coherent v2 narrative,
treating issues #88 (payment-service), #89 (stage-1 foundation), and #90
(stage-2 hardening) as completed. Codex findings are folded into the
design (no more "see addendum"); dual v1/v2 migration language is gone
(the migration window closed when stage 1 shipped).

Structure (27 sections, top-down):

  §1   System overview (five trust boundaries, mermaid)
  §2   Component inventory (14 components)
  §3   Trust boundaries (blast-radius table per boundary)
  §4   Key inventory K1–K11 (canonical)
  §5   Canonical names (one concept, one canonical spelling)
  §6   Identity model — three layers + HDKD actor tree
  §7   Upstream backend classes — A (per-request) / B (bearer)
       / C (on-chain payment-rail, new in v2)
  §8   Mental model — four orthogonal axes
  §9   Cold-start (master bootstrap, stages 0–4)
  §10  Per-actor binding ceremonies (master + agent)
  §11  Recovery — M-of-N device quorum (no anchor wallet, no seed)
  §12  Sidecar daemon (localhost proxy, host-local policy)
  §13  Broker (cap-mint authority, on-chain reader)
  §14  Signer (TEE-protected K3 vault)
  §15  Workers — creds / memory / audit / email / payment
       (with audit tiers A/B/C and payment modes P-1/P-2/P-3 spelled out)
  §16  On-chain layer (four contracts, Solidity inlined)
  §17  Storage layout (per-data-class buckets, per-actor prefixes)
  §18  Encryption envelope (KEK derivation + AES-256-GCM v2)
  §19  Cap-token shape + lifecycle (wire JSON + 11-step verification)
  §20  Mode selection — sovereign default + hosted-relay opt-in
       + self-hosted-relay
  §21  K3 rotation (zero-migration property)
  §22  Pluggable surfaces (six pluggable axes)
  §23  Cargo workspace (post-v2 crate layout)
  §24  Deployment topology
  §25  Cross-references
  §26  What v2 guarantees
  §27  What's NOT in this doc

Major changes vs the pre-v2 spine:

* Stage 7 / mock-server / S3CredentialBackend / dev_key_service language
  retired — those are pre-v2 historical artifacts that no longer
  describe the shipped system.

* §15 enumerates ALL five workers (creds + memory + audit + email +
  payment). Payment-service is now a first-class section with the
  P-1/P-2/P-3 mode table, security properties, and wire shape inlined.

* §16 inlines all four Solidity contracts (AgentKeysScope, SidecarRegistry,
  K3EpochCounter, CredentialAudit) with the cap-mint verification gates
  spelled out (per-actor binding, K11 for master mutations, K3 epoch
  freshness, CAS-burn for payments).

* §19 is new — the cap-token wire shape + 11-step worker verification
  sequence. Pre-v2 had this scattered across §14.3 + the stage-1 plan.

* §11 (recovery) has a concrete second-by-second timeline showing how a
  surviving master device M-of-N quorum revokes a stolen device in ~60s.

* §6 lays out the three identity layers (Layer 1 actor_omni anchor;
  Layer 2 current_master_wallet; Layer 3 operational uses) up front,
  not buried in a §14.1 sub-section.

* §7 adds Class C (on-chain / payment-rail operations — irreversible)
  alongside Class A (per-request, AWS-native) and Class B (bearer).
  Pre-v2 only had A and B.

Length: 1248 → 1488 lines. Net +240 because of the inlined contracts,
worker tables, recovery timeline, cap verification sequence, and
mermaid diagram for the unified system overview.
…ntry/Heima EVM backbone

The previous doc tried to cover both migration (from stage-7 PR #87) and
new-feature demo in one Part A + Part B structure. The migration half
turned out to be entirely mechanical — the dual-read / dual-envelope /
dual-prefix support already in S3CredentialBackend handles the transition
without any operator runbook. So drop Part A; the new doc is fresh-start
only.

Chain backbone: Litentry rebranded to Heima Network in 2026. Heima runs
Frontier (pallet_evm + pallet_ethereum) with EVM chain ID 212013 on mainnet
(= "LIT deployment year (21) + paraID (2013)", hardcoded at
parachain/runtime/heima/src/lib.rs). Operators deploy the four stage-1
Solidity contracts (AgentKeysScope, SidecarRegistry, K3EpochCounter,
CredentialAudit) via Foundry against https://rpc-eth.heima.network or a
self-hosted Frontier node from litentry/heima:latest. Address mapping is
HashedAddressMapping<BlakeTwo256>, so EVM accounts are first-class
on-chain identities — no MetaMask-Substrate dual-account dance.

Structure (10 sections plus reference + revision log):

  Litentry/Heima EVM chain reference  — chain IDs, RPC URLs, explorer,
                                          self-hosted node bring-up
  What stage 1 ships (vs inherited)   — clear table of what comes from
                                          stage-7 demo vs new in v2
  §0  Prerequisites (inherited)       — pointer to stage-7 §0 verbatim
                                          + new Heima RPC reachability check
  §1  Master device bootstrap         — stage-7 §1-§2 inherited, plus new
                                          stage-2 (WebAuthn) and stage-4
                                          (on-chain registry) sub-sections
  §2  AWS prereqs (inherited+v2 tag)  — one-line PrincipalTag rename to
                                          agentkeys_actor_omni
  §3  Smoke-test v2 envelope          — verify the v2 S3 path works
                                          end-to-end without chain or sidecar
  §4  Deploy Heima EVM contracts      — Foundry deploy script; cast
                                          verification; K3EpochCounter init
  §5  Register master device on chain — the §1.4 step, now executable
                                          against the deployed registry
  §6  Sidecar daemon bring-up         — agentkeys-daemon flags; localhost
                                          proxy verification
  §7  Create agent + grant scope      — full HDKD per-agent omni flow with
                                          K11 prompts at agent-create and
                                          scope-grant; in-scope vs out-of-scope
                                          verification
  §8  Chain-level isolation proof     — repeat for bob; verify per-actor
                                          binding rejects cross-actor cap-mint
                                          (Codex finding #1 in action)
  §9  Teardown
  What's still in flight              — shipped-vs-spec status table so
                                          operators following the doc today
                                          know exactly which steps will
                                          error with "not yet implemented"

The doc is now the target end-state runbook; track issue-v2-stage-1-foundation.md
for the rolling implementation status of each pending sub-deliverable.

Length: 445 → 814 lines.
Chain backbone is pluggable per arch.md §22, but the previous draft of the
demo doc hardcoded Heima env vars (HEIMA_EVM_RPC_HTTP, HEIMA_EVM_CHAIN_ID,
HEIMA_SUBSTRATE_WSS, HEIMA_EXPLORER, ...). Switching to Base or Ethereum
meant renaming five env vars per chain. This commit collapses everything
into one --chain flag.

What ships:

* New module crates/agentkeys-core/src/chain_profile.rs — ChainProfile
  struct + serde-json wire format. ChainProfile::resolve() walks the
  documented precedence ($AGENTKEYS_CHAIN_PROFILE_FILE > --chain CLI flag
  > $AGENTKEYS_CHAIN env > built-in default 'heima') and returns a typed
  profile plus a debug string explaining which step matched.

* 7 built-in profile JSONs under crates/agentkeys-core/chain-profiles/,
  embedded into the binary via include_str! macro:
    heima         (mainnet, chain_id=212013, substrate-frontier)
    heima-paseo   (testnet, chain_id=0 sentinel for auto-detect)
    base          (mainnet, chain_id=8453, optimism-l2, safe-tag default)
    base-sepolia  (testnet, chain_id=84532)
    ethereum      (mainnet, chain_id=1, finalized-tag default)
    sepolia       (testnet, chain_id=11155111)
    anvil         (local, chain_id=31337, instant finality, ships test key)

* Profile fields cover every chain-specific dimension the broker / daemon
  / workers need:
    - chain_id (uint64; 0 = auto-detect via eth_chainId)
    - chain_kind (enum: substrate-frontier | ethereum-l1 | optimism-l2
                  | arbitrum | local-dev — controls finality + gas strategy)
    - rpc.{http, wss, substrate_wss?}
    - explorer.{url, tx_url_template, address_url_template}
    - token.{symbol, decimals}
    - finality.{default_block_tag, confirmation_blocks, confirmation_seconds, notes}
    - gas.{model, max_priority_fee_gwei, max_fee_gwei}
    - deploy.{deployer_env_var, foundry_chain_arg, faucet_url?, default_test_key?}

* CLI wiring in crates/agentkeys-cli:
    - New top-level flag: --chain <name> (env AGENTKEYS_CHAIN)
    - New subcommand: agentkeys chain list (enumerate built-in profile names)
    - New subcommand: agentkeys chain show [name] (print full profile JSON;
      omit name to inspect the active profile per resolution rules)
    - CommandContext::chain_profile() returns the cached resolved profile;
      --verbose prints the resolution debug string

* Operator-custom chains: set $AGENTKEYS_CHAIN_PROFILE_FILE to any JSON
  file matching the schema and AgentKeys uses it. No recompile. Moonbeam,
  Astar, Polygon, Avalanche, BSC, permissioned chains (Aliyun BaaS,
  Hyperledger, Quorum) are all one JSON file away.

Tests: 12 new unit tests covering every built-in loads + parses, known
field values per chain, case-insensitive lookup, resolution precedence,
explorer URL template substitution. Workspace test count: 467 → 479,
all passing.

Docs:

* docs/spec/architecture.md §22 — chain layer row in the pluggability
  table now points at the named-profile system; new §22a "Chain profiles
  — how to switch between EVM backbones" covers resolution order,
  schema, built-in inventory, operator-custom flow, what chain_kind
  controls at runtime, and cap-mint freshness across chains.

* docs/v2-stage1-migration-and-demo.md — replaced the
  "Litentry/Heima EVM — chain reference" section with a generalised
  "Chain backbone — pluggable per arch.md §22" section. Built-in profile
  table + operator-custom example (Moonbeam) + why-named-profiles
  rationale (vs the previous per-chain env var sprawl). Updated §0
  reachability check + §4 Foundry deploy + §5 device register + §6
  sidecar daemon bring-up to pull chain-specific values from the active
  profile via `agentkeys chain show | jq -r .<field>` — no more
  HEIMA_* env var coupling.

Switching chains is now: export AGENTKEYS_CHAIN=base (or pass --chain
base on any command). Every component reads the same profile.
…n target

Two corrections based on authoritative Heima developer info, verified
live 2026-05-18 against the production RPC:

* RPC hostname: was guessed as rpc-eth.heima.network in the speculative
  draft; canonical URL per docs.heima.network / chain-list.com/heima /
  dwellir.com/networks/heima is rpc.heima-parachain.heima.network (same
  host serves both EVM JSON-RPC and Substrate-RPC). Verified live:
    eth_chainId   → 0x33c2d  (= 212013 decimal, matches profile)
    eth_blockNumber → 0x92c29f (current head, ~9.6M blocks)
    system_chain  → "Heima"   (Substrate side responds on same host)

* eth_chainId hex in the demo doc was wrong (had 0x33c4d = 212045);
  correct value is 0x33c2d = 212013.

Also pinned the future agentkeys explorer integration target by adding
explorer.subscan_source to the chain profile JSON schema:

* New ChainProfile::ExplorerLinks.subscan_source field — optional
  pointer at the backend + frontend repo for chain-specific explorer
  indexing. Type-safe in Rust via new SubscanSource struct.

* heima.json now points at the Litentry-forked Subscan stack:
    - github.com/litentry/subscan-essentials (Go backend)
    - github.com/litentry/subscan-essentials-ui-react (React frontend)

  These integrations are stage-2/3 deliverables — agentkeys-specific
  indexing for AgentKeysScope.ScopeUpdated, SidecarRegistry.*,
  K3EpochCounter.K3Rotated, CredentialAudit.* events, cross-indexed by
  actor_omni. Pinning the target in the profile means when the work
  happens, it lands in those two repos rather than a third-party
  hosted explorer.

* docs/spec/architecture.md new §22a.6 "Explorer integration target"
  documents the integration plan; renumbered the existing cap-mint
  freshness section to §22a.7.

* docs/v2-stage1-migration-and-demo.md new "Explorer — current state +
  future agentkeys integration" subsection covers the same target,
  plus the in-doc curl example now shows the correct 0x33c2d hex value.

Other chain profiles can populate subscan_source with their own
explorer codebases as integrations land (Etherscan / Blockscout for
Ethereum / Base, chain-specific forks for others).

Workspace tests: 479/0 (unchanged — schema is backwards-compatible
because subscan_source is #[serde(default)] optional).
Heima developer team confirmed that Heima Paseo's runtime ships
pallet_sudo with the well-known Substrate dev account Alice as the
sudoer. This commit documents what that means, why it's a standard
Substrate testnet convention, and how AgentKeys operators use it (or
don't) during stage-1 dev bring-up.

Educational background (for readers unfamiliar with Substrate):

* Alice is one of six well-known Substrate dev accounts. The keypair is
  deterministically derived from the public seed phrase 'bottom drive
  obey lake curtain smoke basket hold race lonely fit walk//Alice'.
  Public key 0xd43593c715fdd31c61141abd04a99fd6822c8558854ccde39a5684
  e7a56da27d. SS58 (generic prefix 42) 5GrwvaEF5zXb26Fz9rcQpDWS57
  CtERHpNehXCPcNoHGKutQY. These keys are intentionally public — every
  Substrate developer knows them — so dev/test chains can ship with
  pre-funded accounts of known keys.

* pallet_sudo is the Substrate root-bypass pallet. Runtimes that
  include it expose one extrinsic: sudo.sudo(call). The pallet stores
  ONE address as the sudo key; only that address can call sudo.sudo
  and the wrapped call runs with RawOrigin::Root — bypassing every
  other origin check. Testnets ship sudo so devs have a god-mode
  lever (force-fund accounts, force-set state, force-run upgrades);
  production chains either remove the pallet or move the key to a
  governance multisig.

* On Heima Paseo specifically: sudo + Alice means anyone can use
  sudo.sudo for testnet bring-up without provisioning real accounts.

What landed in this commit:

* New typed schema in ChainProfile (DevEnvironment + SudoConfig structs),
  optional and backwards-compatible via #[serde(default)]. Production
  profiles (heima, base, ethereum) omit dev_environment entirely; only
  testnets / local-dev profiles set it.

* heima-paseo.json profile now carries the full Alice sudoer metadata:
  seed phrase, public key, SS58 generic-prefix address, invocation
  recipe, two warning lines (anyone-can-sign-as-Alice + URL pending
  Heima-dev-team confirmation).

* Production-vs-development convention pinned via
  dev_environment.is_development_default. Only heima-paseo carries
  this flag among built-ins. New ChainProfile::development_default_name()
  helper returns Some("heima-paseo"). Production default stays
  DEFAULT_PROFILE = "heima".

* docs/spec/heima-open-questions.md: new §3a "Chain backbone — EVM,
  Paseo, sudo (added 2026-05-18 after Heima dev info handoff)" with
  educational Alice/sudo background, recipe table for "what AgentKeys
  would use sudo for", how-to-invoke-sudo notes, and three new
  Q13-Q15 questions for the Heima dev team:
    - Q13: canonical Paseo RPC URL (both speculative URLs fail SSL
           as of 2026-05-18)
    - Q14: confirm Alice as sudoer + invocation recipe + SS58 on Heima
           prefix-31 encoding
    - Q15: confirm Heima mainnet has either removed pallet_sudo or
           moved the key to a governance multisig

  Reuse-Build-Block matrix updated with three new rows.

* docs/v2-stage1-migration-and-demo.md: chain-backbone section now
  documents the prod-vs-dev convention (heima for production,
  heima-paseo for development, anvil for local tests). New "Alice +
  sudo on Heima Paseo (development-environment convenience)"
  sub-section with concrete recipes for pre-funding deployer wallets,
  resetting K3 epoch, etc. Three invocation options spelled out
  (Polkadot.js Apps, subxt CLI, @polkadot/api). Built-in profile
  table updated to mark heima as "Production default" + heima-paseo as
  "Development default". Revision-log entry added.

* docs/spec/architecture.md §22a updated with the prod-vs-dev
  convention table (heima production / heima-paseo development /
  anvil local-tests). New §22a.5a "Alice + sudo on dev-default chains
  (heima-paseo)" covers the background + what sudo does/doesn't do for
  AgentKeys + the Substrate↔EVM bridge via pallet_ethereum.transact.

Tests: 12 → 15 chain_profile tests (3 new — heima_paseo is dev
default with alice sudo, development_default_name returns heima-paseo,
production chains carry no dev_environment). Workspace: 479 → 482
all passing.
The manual §4.1-§4.4 sequence (chase faucet, juggle deployer env vars,
hand-run cast send for K3EpochCounter init) is now one command:

  bash scripts/heima-paseo-bring-up.sh

Two new scripts:

scripts/heima-paseo-bring-up.sh — bash orchestrator that does:
  1. Tool sanity-check (agentkeys, jq, forge, cast, node, npx)
  2. Resolve heima-paseo chain profile + reachability-check $RPC_HTTP
     + abort if eth_chainId == 212013 (mainnet safety)
  3. Generate throwaway EVM deployer (or reuse $HEIMA_PASEO_DEPLOYER_KEY)
  4. Sudo-fund deployer from Alice (100 pHEI default) via the
     heima-paseo-sudo.mjs helper
  5. Foundry-deploy the four stage-1 contracts (graceful stub-mode when
     crates/agentkeys-chain isn't built yet)
  6. Persist contract addresses to operator-workstation.env,
     namespaced by HEIMA_PASEO so other chains can deploy alongside
  7. Print summary + suggested next-step command

  Re-run with SKIP_FUND=1 or SKIP_DEPLOY=1 to skip individual phases.

scripts/heima-paseo-sudo.mjs — Node + @polkadot/api helper:
  fund      — sudo.balances.forceTransfer Alice → EVM address (uses
              blake2_256("evm:" || eth_address) for the EVM→Substrate
              account mapping per HashedAddressMapping<BlakeTwo256>)
  bootstrap — sudo.sudo(ethereum.transact(...)) for any EVM contract
              call; used for K3EpochCounter init, force-set scope,
              pre-register sidecar entries, etc.
  whoami    — sanity-check the sudoer + Alice's balance

  Three guardrails keep mainnet safe:
    - Refuses if $AGENTKEYS_CHAIN != heima-paseo
    - Refuses if live eth_chainId == 212013 (mainnet)
    - Logs every sudo call to stderr before signing

  Polkadot deps load lazily so --help works without them installed;
  the bring-up script auto-fetches via npx --package=@polkadot/api ...

docs/v2-stage1-migration-and-demo.md additions:

* New §4.0 "Automated Heima Paseo bring-up via Alice sudo" before §4.1:
  - One-command bring-up recipe + step-by-step timing table
  - The two scripts that do the work (orchestrator + sudo helper)
  - Dev-shortcut table: pre-register fake sidecar entry, force-set
    scope, fast-forward K3 epoch, parallel multi-tenant funding
  - Explicit "what sudo CANNOT do" section spelling out the
    production-safety properties (cannot forge K11, cannot sign as
    operator's K10, cannot bypass worker-side re-verification)

* §4.1 now has a "for Heima Paseo: skip this section" callout pointing
  at §4.0 as the fast path. The manual recipe is still authoritative
  for Heima mainnet + Base + Ethereum (chains without sudo).

* "What's still in flight" table + revision log updated.

Tests: no Rust changes; existing 482 workspace tests still passing.
Scripts validated: bash -n syntax check + node --check syntax check
+ node scripts/heima-paseo-sudo.mjs --help round-trip without
polkadot deps installed.
…3 resolved)

Heima dev team confirmed the canonical Paseo values. Live-verified
2026-05-18 against https://rpc.paseo-parachain.heima.network:

  eth_chainId        → 0x7dd  (= 2013 decimal — HEIMA_PARA_ID)
  system_chain       → "Heima-paseo"
  system_properties  → ss58Format=131 tokenDecimals=18 tokenSymbol=HEI
  eth_blockNumber    → 0x2c5556 (~2.9M blocks; live chain)

What I had wrong (speculation from earlier research):

  RPC URL:       was rpc-eth-paseo.heima.network / rpc-paseo.heima.network
                 now https://rpc.paseo-parachain.heima.network
  Chain ID:      was 0 (auto-detect sentinel)
                 now 2013 (= HEIMA_PARA_ID; mainnet's 212013 prefixes year)
  SS58 prefix:   was undocumented (assumed = mainnet's 31)
                 now 131 (NOT 31, NOT the generic 42)
  Token symbol:  was pHEI (testnet-prefix convention guess)
                 now HEI (same symbol as mainnet, no prefix)

Changes:

* crates/agentkeys-core/chain-profiles/heima-paseo.json:
  - rpc.{http,wss,substrate_wss} all point at the single canonical
    host (same host serves EVM + Substrate RPC)
  - chain_id: 0 → 2013
  - token.symbol: pHEI → HEI
  - finality.notes pins the live curl outputs for future drift detection
  - dev_environment.sudo.warnings adds an SS58-prefix-131 reminder
    (re-encode pasted pubkeys for paseo, or use //Alice as SURI)

* crates/agentkeys-core/src/chain_profile.rs:
  - test heima_paseo_chain_id_zero_signals_auto_detect renamed to
    heima_paseo_chain_id_is_2013; asserts chain_id == 2013 AND that
    paseo's chain_id does not collide with mainnet's (defense against
    a future refactor accidentally swapping them)

* docs/spec/heima-open-questions.md Q13: marked ✅ RESOLVED with the
  five live curl outputs pinned in the answer block. Reuse-Build-Block
  matrix row updated to "resolved" status.

* docs/v2-stage1-migration-and-demo.md:
  - "Open questions" callout in the chain-reference section split
    into "Resolved" (Q13 — RPC URL + chain ID + SS58 + token symbol)
    and "Still pending" (Q14 Alice-as-sudoer confirmation + Q15
    mainnet sudo state + faucet URL)
  - Revision-log entry added

Workspace tests: 482/0 (15 chain_profile tests including the renamed
chain-ID pin).
… snippet

Replaces RFC 2606 placeholder addresses (alice@demo.example, alice@x.com)
with demo-1@bots.litentry.org, the SES-verified bot-domain alias the
agentkeys-init-email-demo.sh wrapper already routes to. Placeholder
domains are undeliverable: the broker accepts the request, SES sends
the magic link into the void, and the CLI polls forever — a real
operator trap.

Also folds back into the demo doc the two shell pitfalls that bit me
running the §0 reachability snippet:

  1. xargs -I{} ... $((16#$(echo {} | sed ...))) — the $((...))
     arithmetic expansion runs in the OUTER shell BEFORE xargs
     substitutes {}, so zsh sees literal `{` and errors with "bad math
     expression: illegal character: {". Replaced with for-loop +
     direct $((hex)) (0x... is native in arithmetic context, no 16#).

  2. Loop verdict variable can't be named `status` — zsh has it as a
     read-only special parameter (alias for $?). Renamed to `verdict`.

Both reachability snippets in the doc now use the safe shape and ship
with a "two pitfalls to avoid" callout so the next operator running
top-to-bottom doesn't repeat the failure. Comments updated with the
correct live hex values: 0x33c2d for heima (was 0x33c4d = wrong) and
0x7dd for heima-paseo.

Verified live 2026-05-18: curl + the new doc snippet against both
canonical RPCs returns OK for heima (212013) and heima-paseo (2013).
Combines the existing demo scripts (install-agentkeys-cli.sh,
agentkeys-init-email-demo.sh, heima-paseo-bring-up.sh) into a single
idempotent flow with 9 numbered steps. Composes — does not replace —
the underlying scripts so they remain individually usable for
finer-grained debugging.

Idempotency: each step has a "skip if already done" pre-check, same
pattern as cloud-setup.md §4.2 ("if OIDC provider ARN ends in
$BROKER_HOST, skip create"):

  1. Tool sanity-check (always runs, <100ms)
  2. Source scripts/operator-workstation.env (always runs)
  3. AWS profile sanity-check (guards against wrong profile)
  4. agentkeys CLI build+install (skips if --session-id + --chain
     flags already present)
  5. Chain reachability + live-eth_chainId match against profile
  6. Email-init session JWT (skips if session.json exists + <1h old)
  7. S3 envelope smoke-test store+read (skips if blob already at
     bots/<actor_omni>/credentials/<service>.enc)
  8. Chain bring-up via heima-paseo-bring-up.sh (skips if
     SCOPE_CONTRACT_ADDRESS_HEIMA_PASEO already in env-file)
  9. Summary + next-step hints

No hardcoded values — every magic input is overridable via env var
or CLI flag. SESSION_ID, AGENTKEYS_CHAIN, SMOKE_TEST_SERVICE,
SMOKE_TEST_SECRET, FUND_AMOUNT_HEI all configurable.

Resumability: --from-step N / --to-step N / --only-step N for
partial re-runs. On failure, the die() helper prints the exact
resume command (`bash scripts/v2-stage1-demo.sh --only-step <N>`).

Pause points for operator input:
  - Step 6: macOS keychain modal appears when agentkeys init writes
    the session JWT. Script narrates this in advance — the OS modal
    handles the actual prompt; no shell pause needed.
  - Step 8 with --confirm: explicit `read -p` before chain deploy.

Tested locally: --to-step 5 runs preflight cleanly, --only-step 1
runs tool check alone, argparse errors exit 1 with a clean one-line
message (no misleading "step 0/9" context).

Demo doc gets a new §0.0 "One-command demo" subsection at the top
of §0 that surfaces the script before operators wade into per-step
copy-paste — with the same step-by-step table, pause-point notes,
and configurable-inputs matrix as the script's own --help output.
Three real bugs from the first live run on the operator's laptop:

1. `agentkeys whoami --json` fails — --json is a top-level CLI flag
   (`cli.json` in main.rs:26, threaded into CommandContext.json_output).
   It MUST come before the subcommand. The script + the inline §1.3
   doc snippet both had it after. Fixed: `agentkeys --json whoami`.

2. `--signer-url requires --omni-account` because whoami's signer_url
   arg is `#[arg(env = "AGENTKEYS_SIGNER_URL")]` (main.rs:275) — clap
   auto-populates it from operator-workstation.env, then the CLI
   tries the signer round-trip and demands --omni-account. Chicken-
   and-egg since we want actor_omni FROM whoami. Workaround:
   `env -u AGENTKEYS_SIGNER_URL` for the whoami call only; the
   local-only fields (session_wallet + agentkeys_actor_omni) don't
   need the signer.

3. step-7 store failure message ("check bucket policy") was too
   narrow — `Error: UNREACHABLE — Backend unreachable` (lib.rs:66)
   is BackendError::Transport's generic catch-all for ANY AWS SDK
   error (AccessDenied, region mismatch, network, signer down). Now
   prints three probe-commands (direct s3 cp, get-bucket-policy
   inspection, signer health check) ranked by likelihood, plus the
   `--from-step 8 --skip-smoke` escape hatch so the operator can
   continue to chain steps while diagnosing the cloud-side issue.

The first two fixes also land in the demo doc's §1.3 snippet so the
next operator running top-to-bottom sees the gotchas inline (per
the runbook-fix-fold-back policy in CLAUDE.md).

Verified live: --only-step 7 now correctly captures session_wallet
+ agentkeys_actor_omni, computes the s3 path, and fails with the
new diagnostic error (instead of the old "session expired" red
herring).
…licy migration

Wires the v2 highly-abstracted-service PrincipalTag path end-to-end so
`agentkeys store --credential-backend=s3 --envelope-version=v2` can
actually PUT credentials through the OIDC AssumeRoleWithWebIdentity
flow. Three coupled changes:

1. **Broker (crates/agentkeys-broker-server/src/handlers/oidc.rs)**:
   `build_oidc_jwt_claims` now also emits `agentkeys_actor_omni`
   (= SHA256("agentkeys"||"evm"||wallet_lc), via the existing
   `derive_omni_account` helper) as a top-level claim AND as a
   PrincipalTag in `https://aws.amazon.com/tags`. Both v1 and v2 tag
   keys live in `principal_tags` + `transitive_tag_keys` during the
   migration window — v1 policies (keyed on agentkeys_user_wallet) and
   v2 policies (keyed on agentkeys_actor_omni) both work without
   broker config churn. `claims_supported` in the OIDC discovery doc
   gains the new claim name.

   All 8 existing broker OIDC tests pass — the additions don't break
   any v1 invariant.

2. **scripts/bucket-policy-v2-migrate.sh** (new, 130 lines):
   idempotent migration that flips the bucket policy from §4.4 v1
   shape (Sid=AllowDaemonGetOwnObjects, key=agentkeys_user_wallet) to
   v2 shape (Sid=AllowDataRolePutOwnCredentialsV2, key=agentkeys_actor_omni)
   AND adds the missing 4th statement that grants PutObject on the
   credentials/* sub-prefix (cloud-setup.md §4.4 documents this but it
   was never applied to the live cloud). Backs up the existing policy
   to /tmp/bucket-policy-backup-<ts>.json before mutating. Re-runs
   are no-ops once the v2 markers are present.

   Deliberately does NOT use the demo doc §2.2 verbatim shape
   (Principal:* + StringNotEquals != "") because cloud-setup.md §4.3
   warns negated string operators on missing context keys evaluate as
   TRUE — a JWT with no tags claim silently bypasses. The §4.4
   Principal-pinned shape with PrincipalTag-scoped Resource ARN is
   the safer template and what we want enforced.

3. **scripts/v2-stage1-demo.sh**: STEP_TOTAL 9 → 10. New step 7
   ("Ensure v2 bucket policy applied") delegates to
   bucket-policy-v2-migrate.sh idempotently. Steps 7/8/9 become
   8/9/10. Step 8 (smoke test, was step 7) now passes --broker-url
   $OIDC_ISSUER and exports AGENTKEYS_DATA_ROLE_ARN=$DATA_ROLE_ARN
   so the CLI's mint_s3_credentials path engages (otherwise the SDK
   falls back to direct admin IAM and gets AccessDenied silently
   wrapped as 'Backend unreachable').

Verified live:
  - cargo test -p agentkeys-broker-server --lib oidc → 8 passed
  - bash scripts/bucket-policy-v2-migrate.sh → applied + re-run skips
  - Manual curl on /v1/mint-oidc-jwt today still returns v1-only
    JWTs because the REMOTE broker host hasn't picked up this commit
    yet. Next step: redeploy the broker via
      bash scripts/setup-broker-host.sh --ref claude/stupefied-darwin-cfafd6
    on the broker host, then re-run --only-step 8.
….md §17)

Closes the bucket-sharing arch violation flagged in code review. Credentials
were landing in `$BUCKET` (= `agentkeys-mail-*`, the inbound-mail bucket),
violating arch.md §17 ("per-data-class buckets are mandatory; S3 exposes
encryption / lifecycle / replication / CloudTrail at the bucket level only
— folding data classes collapses blast radii").

## Fix 1 (this PR) — shipped

Provisions a dedicated `$VAULT_BUCKET` (`agentkeys-vault-${ACCOUNT_ID}`)
and `agentkeys-vault-role` per arch.md §17 + §17.2, cleans the mail
bucket policy of any stray credentials grants, and rewires the
orchestrator to target the new vault infra.

Four new idempotent scripts (each safe to re-run):

- scripts/provision-vault-bucket.sh    — bucket + block-public-access + SSE-S3
- scripts/provision-vault-role.sh      — `agentkeys-vault-role` with OIDC trust + credentials-only inline policy (3 statements, all scoped to `bots/${aws:PrincipalTag/agentkeys_actor_omni}/credentials/*`)
- scripts/apply-vault-bucket-policy.sh — vault bucket gets `Sid: VaultPolicyV2` (Principal-pinned to vault-role + Null operator for tag presence per cloud-setup.md §4.3 safety)
- scripts/cleanup-mail-bucket-policy.sh — mail bucket reverts to email-only (drops the credentials grants accidentally added by the earlier `bucket-policy-v2-migrate.sh`, which is now removed)

Each one checks "is this already done?" before acting; verified
idempotent via two consecutive runs of `bash scripts/v2-stage1-demo.sh
--from-step 7 --to-step 7` — first run created everything, second
skipped every sub-step.

## Integration test in the orchestrator

scripts/v2-stage1-demo.sh step 7 composes the 4 sub-scripts as
"Provision vault infra (bucket + role + policy)". Step 8 (smoke test):

- Uses `--bucket $VAULT_BUCKET` (NOT `$BUCKET`)
- Exports `AGENTKEYS_DATA_ROLE_ARN=$VAULT_ROLE_ARN` so the CLI's OIDC
  AssumeRoleWithWebIdentity targets the vault role
- **Cross-contamination assertion**: after store, asserts the blob is
  in `s3://$VAULT_BUCKET/bots/<actor_omni>/credentials/<service>.enc`
  AND NOT in `s3://$MAIL_BUCKET/bots/<actor_omni>/credentials/...`.
  If the separation regresses, the demo fails loud with `ARCH VIOLATION
  (arch.md §17): credential blob ALSO landed in mail bucket`.

operator-workstation.env adds:
  VAULT_BUCKET=agentkeys-vault-${ACCOUNT_ID}
  VAULT_ROLE_ARN=arn:aws:iam::${ACCOUNT_ID}:role/agentkeys-vault-role
DATA_ROLE_ARN stays for the email subsystem (will rename when email
migrates in stage 2 — same pattern as VAULT did here).

## Fix 2 (deferred to stage 2) — tracked in issue #91

The credentials-service worker (arch.md §15.1) — Lambda + mTLS to
signer for encrypt/decrypt + cap-on-chain re-verify — is deferred
to stage 2. Today the CLI does client-side encrypt + direct S3 PUT
through the OIDC-assumed vault role; the worker will take over the
encrypt/decrypt step without changing the envelope shape (same KEK,
same AAD, same nonce shape).

See #91 for full design
+ acceptance criteria.

## Verified live (AWS account 429071895007)

- Vault bucket created: s3://agentkeys-vault-429071895007
- Block-public-access: all 4 flags = true
- Default SSE-S3 AES-256 applied
- Vault role created: arn:aws:iam::429071895007:role/agentkeys-vault-role
- Inline policy: 3 statements (List + Get + Put/Delete on credentials/*)
- Vault bucket policy: 1 statement (Sid VaultPolicyV2, PrincipalTag-scoped)
- Mail bucket policy cleaned: 3 statements (SES inbound + email role list/get; NO credentials grants)
- Idempotency: re-running step 7 skips every sub-step cleanly

## What still blocks step 8 today

The REMOTE broker host needs to be redeployed to pick up commit 4319428
(broker emits both v1 + v2 PrincipalTag in `/v1/mint-oidc-jwt`).
Verified live: today's broker still emits v1-only:

  curl -sS -X POST -H "Authorization: Bearer \$SESSION_TOKEN" \\
    https://broker.litentry.org/v1/mint-oidc-jwt | jq -r .jwt | \\
    cut -d. -f2 | <base64url-pad-then-decode> | \\
    jq '."https://aws.amazon.com/tags".principal_tags'
  # → { "agentkeys_user_wallet": [...] }   ← v1 only

Redeploy the broker via:
  bash scripts/setup-broker-host.sh --ref claude/stupefied-darwin-cfafd6
…on the broker host. Then re-run:
  bash scripts/v2-stage1-demo.sh --only-step 8
Real bug from the operator running --only-step 9 in v2-stage1-demo.sh.
The .mjs script's lazy `import('@polkadot/api')` failed because the
earlier `npx --package=X -y -- node script.mjs` pattern only adds X's
bin files to PATH; the script's `import()` resolves via Node's module
resolver, which walks UP from the script's location looking for
node_modules — and there's no node_modules in scripts/. So the import
fell into the catch block and printed "[heima-paseo-sudo] missing
polkadot deps", killing step 9.

Fix: declare the deps in a new scripts/package.json and have
heima-paseo-bring-up.sh run `npm install --prefix scripts` once
(idempotent — checks scripts/node_modules/@polkadot/api existence
first) before invoking `node` directly. The .mjs script's lazy-load
shape stays for --help UX, but now succeeds because node_modules is
sitting right next to the .mjs.

Version pin: had to bump @polkadot/util / util-crypto / keyring from
^13.0.0 → ^14.0.0 to match what @polkadot/api@^16 pulls in
transitively, otherwise npm installs two copies of @polkadot/util
(top-level 13.x + nested-under-api 14.x) and polkadot.js panics with
"multiple versions installed" at runtime.

scripts/node_modules/ added to .gitignore; scripts/package.json +
scripts/package-lock.json are checked in.

Verified live: `AGENTKEYS_CHAIN=heima-paseo node scripts/heima-paseo-sudo.mjs whoami`
now connects to wss://rpc.paseo-parachain.heima.network, confirms
chain="Heima-paseo" ss58=131 token=HEI EVM_chain_id=2013, and prints
Alice's SS58 under the Paseo prefix 131 (jcS2wD5...) along with her
well-known pubkey 0xd43593c7...
Operator asked "is step 9 and following idempotent? avoid duplicate
smart contract by verifying onchain state." Audit found 4 holes; all
4 closed in this commit. Re-running the bring-up is now a no-op when
nothing has changed on chain.

What's now idempotent:

1. Deployer keypair (step 3) — was: generated a NEW throwaway key on
   every run unless HEIMA_PASEO_DEPLOYER_KEY was exported. Each run
   produced a fresh address that then needed re-funding +
   re-deploying. Fix: on first run, generate + persist to
   ~/.agentkeys/heima-paseo-deployer.key (mode 0600, OUTSIDE the
   repo so it's never accidentally committed); on subsequent runs,
   read the file. Override at any time via env var.

2. Funding (step 4) — was: always sent $FUND_AMOUNT_HEI from Alice
   via sudo.balances.forceTransfer; no balance check. Fix: query
   eth_getBalance on the deployer; if balance >= 1 HEI, skip the
   Alice sudo transfer entirely. Uses node (already a required dep)
   for BigInt-safe hex->decimal compare (wei values overflow bash
   arithmetic int64).

3. **Contract deploy (step 5) — the fix the operator specifically
   asked for**: was: `forge script ... --broadcast` deployed NEW
   instances every run. Fix: re-source operator-workstation.env to
   pick up addresses from any prior run, then `cast code $addr` each
   of the 4 contract addresses against the live chain. If ALL 4
   have code on-chain (i.e. contracts still deployed), skip the
   deploy entirely. If ANY address is missing OR returns "0x" (no
   code) — e.g. chain reset, fresh env, etc. — redeploy all 4.
   This handles the chain-reset case automatically.

   Stub mode (when crates/agentkeys-chain/ doesn't exist yet)
   produces sentinel 0x1-0x4 addresses that never have on-chain
   code; the script correctly detects this and "redeploys" the same
   stubs — no real chain side-effects, no Alice transfers, no
   wasted gas.

4. Address persistence (step 6) — was: appended new KEY=VALUE
   lines to operator-workstation.env via `>>`, so 3 runs left 12
   contract-address lines (with bash sourcing using the last one,
   but the file ballooned + git diff was noisy). Fix: `env_set`
   helper that grep-detects existing lines and either sed-replaces
   in place (macOS + Linux variants of `sed -i`) or appends only if
   absent. No duplicates ever.

Live-verified idempotency:

- Run 1 (SKIP_FUND=1): generated deployer 0xeBdE9E..., persisted
  key file, stub-deployed 0x1-0x4, appended 5 lines to env file.
- Run 2 (same flags): reused persisted key (same 0xeBd address),
  on-chain check correctly logged "✗ NO code on-chain → redeploy"
  for the stub address, stub-redeployed same 0x1-0x4, env file
  still has exactly 5 lines (replaced in place, not duplicated).

When real Solidity contracts ship in a future commit replacing
crates/agentkeys-chain/, the on-chain check will skip the deploy
on the second and all subsequent runs.

scripts/operator-workstation.env in this commit is the artifact of
the live test runs (5 new lines for the 4 stub addrs + deployer
addr). The 0x1-0x4 stubs are placeholder values — they get
overwritten by env_set on the first real-deploy run.
…nce preflight

The operator hit three real bugs in step 9 while exercising the
end-to-end demo:

1. **`Assertion failed` with no context** in heima-paseo-sudo.mjs's
   fund subcommand. Root cause: `system_properties.tokenDecimals` came
   back as the JSON value `[18]` (an array), and `new BN([18])` triggers
   bn.js's `_initArray` assertion. Fix: pull the array through
   `JSON.parse(JSON.stringify(...))` (normalizes any polkadot codec
   wrapper to plain JS), extract `[0]`, coerce to `Number`. Same trap
   handled for `tokenSymbol = ["HEI"]`. Also: surface `e.stack` in the
   main catch so future ERRORs land with a stack trace instead of a
   bare message.

2. **signAndSend hangs forever** waiting for `isFinalized` on Paseo
   (finalization is unreliable, sometimes 60s+, sometimes never).
   Switched the resolver to fire on `isInBlock` — sufficient for our
   "fund then read balance" use case, since the next read sees the
   new balance as soon as the block is mined. Added a 60s hard timeout
   so the script can never hang opaquely again.

3. **`Priority is too low (X vs X)`** on retry, because a prior killed
   run left a stuck tx in Alice's mempool slot. Added a small `tip`
   (1 nanoHEI) to signAndSend options — substrate's pool replacement
   rule requires strictly higher priority, and a tip provides it.

After (1)+(2)+(3), the tx submitted cleanly but **the validator
silently refused to include it because Alice only has ~0.498 HEI on
this Paseo deployment** (drained by prior testnet use). The
sudo.balances.forceTransfer call needs Alice to have the value she's
transferring — sudo bypasses origin checks, not balance checks. Two
more fixes for this:

4. **`scripts/heima-paseo-bring-up.sh` step 4 auto-skips when in stub
   mode** (no crates/agentkeys-chain present). Step 5 emits sentinel
   0x1-0x4 addresses without ever submitting a tx in stub mode, so
   the deployer doesn't need HEI. This was wasting Alice's already-low
   testnet balance for no benefit AND triggering the timeout when she
   ran out.

5. **`heima-paseo-sudo.mjs cmdFund` pre-checks Alice's balance** before
   signAndSend. If `alice.free <= 0.1 ${symbol}` (fee margin), or if
   `requested > alice.usable`, throw a clear error explaining the gap
   — "Alice is out of HEI on this chain, top her up before retrying"
   — rather than letting the tx silently sit unmined in the mempool.

Cosmetic: the summary in v2-stage1-demo.sh step 10 was hardcoded to
print `s3://${BUCKET}/...` for the smoke-test credential location;
that's the MAIL bucket post-§17-split, not where the credential
actually lives. Switched to `${VAULT_BUCKET:-$BUCKET}` so post-split
runs print the correct vault-bucket path.

Verified live: `bash scripts/v2-stage1-demo.sh --from-step 9` now
runs end-to-end:

  [4/7] Sudo-fund SKIPPED — stub mode. Deployer needs no gas.
  [5/7] ALL 4 contracts already deployed + verified on-chain → skip
  [6/7] persisted (no duplicates)
  [7/7] Demo ready.
  ═══ v2 stage-1 demo complete ═══

The whole 10-step demo (steps 1-10) is now green + idempotent.

When real Solidity contracts ship in a future commit replacing
crates/agentkeys-chain/, step 4's auto-skip turns off (chain dir
present), Alice's balance check fires, and the operator will either
(a) succeed if Alice has enough, or (b) get the clear "Alice is out
of HEI" message and know to top her up before retrying.
Operator's idea: "If Alice account does not have enough fund, we
should be able to call sudo and mint more hei to Alice." Alice IS
the sudoer on Heima Paseo, so she can sudo any pallet call —
including `balances.forceSetBalance(alice, BIG)`, which directly
sets her free balance from thin air (total issuance climbs, but
that's fine for a testnet shared by many testers who keep draining
each other's Alice).

Implementation:

1. New helper `signAndSendAsAliceWithTip(api, alice, call, label)`
   — extracts the signAndSend plumbing from cmdFund so cmdFund AND
   the new top-up flow share one resolve-on-isInBlock + 60s timeout
   + tip-eviction path. Tip bumped from 1nHEI → 0.001 HEI (1e15
   attoHEI) — earlier 1nHEI was sometimes insufficient to evict
   stuck pool txs from prior attempts.

2. New helper `chainTokenInfo(properties)` — extracts {decimals,
   symbol} from system_properties handling the array-wrapping codec
   quirk we hit earlier. Used by both cmdFund and cmdTopUpAlice.

3. New helper `humanize(amountBN, decimals)` — BN → human-readable
   token string (e.g. "1000.0000 HEI"). Used in every log line.

4. New helper `ensureAliceCanFund(api, alice, decimals, symbol,
   requestedAmount)` — auto-top-up. Reads Alice's on-chain balance;
   if `alice.free - 0.1-HEI fee margin < requestedAmount`, sudo-mints
   her via `balances.forceSetBalance` to max(requested * 100,
   1000 HEI). Idempotent — skips if Alice already has enough.

5. cmdFund refactored to call ensureAliceCanFund before the actual
   forceTransfer. The CLI flow is now:
     (a) compute requested amount
     (b) check Alice's balance
     (c) if short, sudo-mint Alice
     (d) sudo.balances.forceTransfer(alice, deployer, amount)

6. New `cmdTopUpAlice` subcommand for explicit operator use:
     node scripts/heima-paseo-sudo.mjs top-up-alice --target-hei 1000
   Refuses to LOWER Alice's balance if she's already above target.
   Outputs JSON with before/after balances + the inclusion block hash.

Known live blocker on the current Paseo deployment (NOT a script
bug): a prior killed funding attempt left a stuck tx at Alice's
nonce 13 in the validator's mempool. The validator can't include
it (it's a `force_transfer(alice, X, 100 HEI)` and Alice only has
0.498 HEI), and substrate's pool replacement only works for
same-(sender, nonce) — but in this case the validator never
EVALUATES the new tx's priority because the slot is held by a tx
that's not-yet-failed-not-yet-included. Operator recourse, in order:
  - Wait ~25-100 min for mempool TTL to drop the stuck tx, then
    re-run.
  - Contact Heima dev team to either (a) top up Alice out-of-band
    via faucet so the stuck transfer becomes valid, or (b) yank the
    stuck tx via `author.removeExtrinsic` on the validator.
  - Stay in stub mode (no crates/agentkeys-chain present), which
    auto-skips step 4 funding entirely (already shipped in 9813c63).

The implementation is correct and will work cleanly once the chain
state clears OR on a fresh Paseo deployment.
Operator-observed root cause: Heima Paseo testnet has been **halted
since 2026-01-15** — block 2,905,430 frozen for 4+ months. All the
funding work I built on top of "Alice can sudo-mint to herself" was
correct in principle but useless in practice on a chain that's not
producing blocks. Verified live: mainnet (chain_id=212013) has 12s
block time, alive and well.

This commit switches the v2 stage-1 demo to default to Heima mainnet
while preserving the Paseo path for when collators come back up.

Rename + generalize:

  scripts/heima-paseo-bring-up.sh → scripts/heima-bring-up.sh
    (`git mv` preserves blame; chain-agnostic name reflects multi-chain
    support)

Bring-up script (`heima-bring-up.sh`) now:

  - AGENTKEYS_CHAIN accepts `heima` OR `heima-paseo`; default is `heima`
  - Step 2 dynamically reads the right profile (was hardcoded paseo)
  - Step 2.5 chain-id check bifurcates: heima MUST be 212013;
    heima-paseo MUST NOT be 212013 (catches profile-vs-RPC drift)
  - Step 3 deployer key file is per-chain:
    ~/.agentkeys/heima-deployer.key vs ~/.agentkeys/heima-paseo-deployer.key
    (keeps mainnet + testnet keys distinct)
  - Step 4 funding bifurcates:
      * paseo → existing sudo-via-Alice flow with auto-top-up
      * heima → balance check only; if deployer < 1 HEI, print clear
        transfer instructions (deployer addr + RPC + balance-verify
        curl command) and exit. NEVER auto-spends real HEI. Re-running
        after manual transfer detects funding and skips.
  - Step 5 real deploy on mainnet REQUIRES `MAINNET_CONFIRM=1` env var
    as a paranoid second gate. Stub mode (no crates/agentkeys-chain/)
    is a no-op regardless of chain.
  - Step 6 namespaces deployer addr per-chain
    (HEIMA_DEPLOYER_ADDR_HEIMA vs ..._HEIMA_PASEO; was hardcoded
    HEIMA_PASEO_DEPLOYER_ADDR)
  - Step 7 summary shows the actual chain (was hardcoded "heima-paseo")

Orchestrator (`v2-stage1-demo.sh`) now:

  - Default AGENTKEYS_CHAIN: heima-paseo → heima (with explanatory log
    line)
  - do_step_9 accepts both chains with chain-specific warnings
  - Mainnet auto-pauses for operator confirmation (the existing
    --confirm flag still works; mainnet now triggers it automatically)
  - read -r _ || true tolerates EOF on stdin (so piped/non-interactive
    runs don't abort silently from set -e)
  - MAINNET_CONFIRM env var passed through to bring-up.sh if set

Safety summary for accidental mainnet deploys (multiple layers):

  1. orchestrator confirmation prompt before bring-up on mainnet
  2. bring-up.sh step 2.5 verifies chain_id matches profile (catches
     misconfigured RPC)
  3. step 4 NEVER auto-funds on mainnet; only prints + exits
  4. step 5 stub mode = no-op (sentinel addresses, no broadcast)
  5. step 5 real deploy REQUIRES MAINNET_CONFIRM=1 env var

scripts/operator-workstation.env additions are the artifacts of live
test runs against mainnet in stub mode (5 lines: 4 stub contract
addresses + deployer addr 0x598c5...). The 0x1-0x4 sentinels follow
the same convention as the pre-existing HEIMA_PASEO entries; the
on-chain `cast code` check will detect them as missing and "redeploy"
(stub-mode no-op) on the next run, OR overwrite with real addresses
once Solidity sources ship + MAINNET_CONFIRM=1 is set.

Demo doc updates:

  - 8 references to heima-paseo-bring-up.sh → heima-bring-up.sh
  - New callout at top of §4.0 explaining Paseo halt + recommending
    mainnet for new runs
  - §4.0 intro generalized to describe both chains' funding mechanisms

Verified live (mainnet, stub mode):

  AWS_PROFILE=agentkeys-admin AGENTKEYS_CHAIN=heima \
    bash scripts/v2-stage1-demo.sh --only-step 9 </dev/null

  ==> [step 9/10] Chain backbone bring-up (heima)
      warn Heima MAINNET — real HEI required ...
      About to run chain bring-up on heima.
      MAINNET CONFIRMED (chain_id=212013) ...
      [4/7] Fund SKIPPED — stub mode (no crates/agentkeys-chain).
      [5/7] AgentKeysScope = 0x0...01 ✗ NO code on-chain → redeploy
            (stub-mode sentinel addresses; no real chain side-effect)
      [6/7] persisted (no duplicates)
      [7/7] Chain: heima (chain_id=212013) Deployer: 0x598c5...

End-to-end clean. Paseo path remains available for when collators
come back online.
Operator wants to use their own wallet — specified by a 12-word
BIP-39 mnemonic in ./test-hei — as the smart-contract deployer
instead of the throwaway-generated key. Verified the mnemonic's
SS58 (Heima mainnet prefix 31) = 47NGSq6JE5ZSnymGNa4nFVjWbsuhTfoSKN2jtpk28mUyC1M3
which is the address the operator confirmed against Heima.

Changes:

  scripts/derive-evm-from-mnemonic.mjs (new): tiny ethers-backed
  helper. Reads a mnemonic file path, derives EVM via the BIP-44
  default path m/44'/60'/0'/0/0 (same as MetaMask + Foundry +
  ethers.Wallet.fromPhrase). Emits one line of JSON
  {address, privateKey} on stdout; all status (including the
  derived public address) goes to stderr; the mnemonic + private
  key are never echoed to stderr. Callers stash stdout in a
  mode-0600 file.

  scripts/heima-bring-up.sh step 3: new resolution order is
    1. $HEIMA_DEPLOYER_KEY env var
    2. $HEIMA_DEPLOYER_MNEMONIC_FILE (default: $REPO_ROOT/test-hei)
       → derive EVM key, cache in ~/.agentkeys/<chain>-deployer.key
    3. Existing persisted key file
    4. Generate fresh throwaway
  Operators drop their mnemonic at ./test-hei and step 3 picks it
  up automatically. New-key path also prints a TIP pointing at
  ./test-hei so first-time operators know the option exists.

  scripts/package.json adds `ethers ^6.13.0` (canonical EVM lib for
  Wallet.fromPhrase — substrate-side derivation via polkadot.js
  doesn't expose the raw secp256k1 private key intentionally).

  .gitignore adds:
    /test-hei
    /test-hei.*
    /.heima-mnemonic
    /*-mnemonic
  The mnemonic IS the key — never commit it. ~/.agentkeys/*.key is
  already outside the repo.

Verified live (Heima mainnet, stub mode, no real-money calls):

  AWS_PROFILE=agentkeys-admin AGENTKEYS_CHAIN=heima \
    bash scripts/v2-stage1-demo.sh --only-step 9 </dev/null

  [3/7] Deployer keypair …
    deriving deployer from mnemonic at ./test-hei …
    [derive-evm-from-mnemonic] derived EVM address: 0xdE644936D5B7d5d42032fd08bbA42Fbbfd6663Bc
    cached private key at ~/.agentkeys/heima-deployer.key (0600)
  ...
  Chain:       heima (chain_id=212013)
  Deployer:    0xdE644936D5B7d5d42032fd08bbA42Fbbfd6663Bc

Verified the SS58 match (substrate-side cross-check):
  Substrate sr25519 public key from same mnemonic
    = 0x2a922b2c4bd021fa75dcce1ddc2fe6b62d743b22bfd547663aff8d4667054507
  Encoded under SS58 prefix 31 (Heima mainnet)
    = 47NGSq6JE5ZSnymGNa4nFVjWbsuhTfoSKN2jtpk28mUyC1M3  ← operator-confirmed

For an actual mainnet deploy (when crates/agentkeys-chain/ ships),
the operator funds 0xdE644936D5B7d5d42032fd08bbA42Fbbfd6663Bc from
their main Heima wallet (any amount ≥ 1 HEI), then re-runs with
MAINNET_CONFIRM=1. The flow is now zero-key-juggling on their part.
Operator hit the standard Heima Frontier gotcha: HEI in their
sr25519-derived Substrate wallet (47NGSq6JE5ZSn...) doesn't show up
as eth_getBalance on their EVM-derived deployer (0xdE644...) even
though both derive from the same BIP-39 mnemonic.

Cause: Substrate and EVM use different derivation schemes from the
same seed, producing TWO separate on-chain accounts. Heima
(Frontier) exposes EVM balance at the substrate account computed via
HashedAddressMapping<BlakeTwo256>: blake2_256("evm:" || eth_address).
To fund the EVM side from a Substrate holder, you send to THAT
mapped account, not to the SS58 of the same mnemonic's sr25519 key.

The new helper:

  node scripts/evm-to-substrate-address.mjs 0xANY_EVM_ADDR

prints JSON with the raw 32-byte hex + SS58 under prefixes 31
(Heima mainnet), 131 (Heima Paseo), 42 (generic) so the operator
can paste the right one into Polkadot.js Apps' transfer form.

Same blake2_256 derivation as scripts/heima-paseo-sudo.mjs's
evmToSubstrate() helper (which uses it for the sudo-fund flow on
Paseo). This standalone version is for the mainnet workflow where
the operator does the transfer manually from their personal wallet.

Verified live against the demo deployer:
  EVM deployer:   0xdE644936D5B7d5d42032fd08bbA42Fbbfd6663Bc
  Substrate twin: 47hNCTi9Jrs86atvDj9AhY67X2vQEDzzHAvzapKvUpxXz6EX
The operator transfers HEI from 47NGSq...3 to 47hNCTi9Jrs86...;
after inclusion, eth_getBalance(0xdE644...) reflects the new balance.
Operator-flagged friction: the orchestrator's step-9 mainnet prompt
("Press Enter to proceed, Ctrl-C to abort >") AND the bring-up
script's MAINNET_CONFIRM=1 env-var gate are redundant. Pressing
Enter IS operator consent — requiring an additional env-var export
to actually run the deploy is friction.

Fix: after the Press-Enter prompt fires on mainnet, auto-export
MAINNET_CONFIRM=1 to bring-up.sh. The orchestrator user now has
ONE gate (the prompt). The bring-up script keeps its env-var check
as the safety layer for direct callers (e.g. CI scripts that bypass
the orchestrator) — those callers have no Press-Enter prompt, so
they explicitly opt in via MAINNET_CONFIRM=1.

Idempotency clarification: the script is already idempotent against
double-deploys via the on-chain `cast code` check in step 5 — a
second run detects existing contracts and skips. MAINNET_CONFIRM
only matters on the FIRST mainnet deploy; after that, on-chain
state is the source of truth.
Operator-requested stage-1 completion. Closes the biggest in-flight
gap — the four on-chain contracts that anchor v2 state per arch.md
§10, §13.1, §16. With this commit + a funded mainnet deployer,
`MAINNET_CONFIRM=1 bash scripts/v2-stage1-demo.sh --only-step 9`
deploys the real chain layer.

Crate layout (new):

  crates/agentkeys-chain/
    foundry.toml           — Solc 0.8.20, EVM = paris (Frontier-safe)
    src/
      SidecarRegistry.sol  — 189 LOC: device binding + master/agent
                              registration + revoke. Sovereign-mode
                              auth: first call bootstraps the operator's
                              master wallet (msg.sender); subsequent
                              master mutations require that wallet
                              + non-empty K11 assertion.
      AgentKeysScope.sol   — 137 LOC: per-(operator, agent) scope
                              with services[] + spend caps. Reads
                              SidecarRegistry.operatorMasterWallet
                              for auth.
      K3EpochCounter.sol   — 68 LOC: monotonic epoch counter,
                              signer-governance-gated advanceEpoch +
                              setSignerGovernance.
      CredentialAudit.sol  — 85 LOC: append-only audit log per
                              arch.md §15.3 tier C. Anyone can append
                              (gas-cost spam-resistance); workers
                              re-emit on every CRUD.
    script/
      DeployAgentKeysV1.s.sol — atomic deploy of all 4 in order.
                              tx.origin inside vm.startBroadcast IS
                              the --private-key signer; defaults
                              signerGovernance to deployer. Stable
                              "Name: 0xAddress" log shape matches the
                              heima-bring-up.sh regex unchanged.
    test/AgentKeysV1.t.sol — 269 LOC, **11 forge tests, all passing**:
                              bootstrap+duplicate-rejection, 2nd-master-
                              requires-K11, agent-needs-master-caller,
                              agent-needs-operator-bootstrap, revoke
                              (agent-no-K11 + master-K11), scope set +
                              revoke + attacker-rejected, K3 epoch
                              advance + governance transfer + audit
                              append-and-read.
    lib/forge-std          — v1.16.1 submodule (Test + Script + cheats).

Stage-1 simplifications (deliberate, documented in README):

  - K11 WebAuthn assertions: stored as opaque bytes, NOT verified
    on-chain. Broker pre-verifies via webauthn-rs. P-256 verification
    lands when EIP-7212 precompile is live on Heima (stage 2+).
  - Master-mutation auth: msg.sender == operatorMasterWallet (sovereign
    mode). Broker-mode + M-of-N recovery quorum lands in stage 2.
  - Per-period spend tracking: stored, NOT enforced on-chain. Workers
    enforce against scope.maxPerPeriod off-chain.

Verified live (anvil):

  forge build → 4 contracts compile clean
  forge test  → 11/11 passing
  forge script DeployAgentKeysV1.s.sol --rpc-url http://localhost:8545 \
    --private-key 0xac0974... --broadcast
    → ~2.8M gas, all 4 contracts deployed
    → log shape parses cleanly via heima-bring-up.sh's existing regex

To deploy on Heima mainnet, the funded deployer (0xdE644...) already
has 19.9 HEI (gas budget ~0.006 HEI per deploy):

  MAINNET_CONFIRM=1 bash scripts/v2-stage1-demo.sh --only-step 9

bring-up.sh step 5 was always wired for real deploy when
crates/agentkeys-chain/ exists; this commit makes that condition true.

Demo doc "What's still in flight" table updated: contracts row moves
from "⏳ not yet" to "✅ shipped" with the per-contract artifact
inventory + the live-deploy recipe.
…ead of swallowing

Operator-flagged: the MAINNET_CONFIRM=1 env-var gate was redundant
with the orchestrator's Press-Enter prompt + chain-id check, AND the
deploy step was failing silently — the user saw bring-up.sh exit 1
right after the "redeploy needed" log line with no forge output to
debug against.

Two fixes:

1. **Removed MAINNET_CONFIRM=1 env-var requirement** end-to-end:
     - heima-bring-up.sh step 5 no longer refuses on chain=heima
       without MAINNET_CONFIRM=1
     - v2-stage1-demo.sh do_step_9 no longer auto-exports
       MAINNET_CONFIRM=1 after the Press-Enter prompt
     - v2-stage1-demo.sh no longer adds MAINNET_CONFIRM to the
       bring_up_env array
   Safety is now layered via (a) the orchestrator's interactive
   Press-Enter prompt on mainnet, (b) the chain-id verification in
   bring-up.sh step 2 (heima MUST be chain_id=212013), and (c) the
   `cast code` idempotency check in step 5 (re-runs never
   double-deploy because existing on-chain contracts are detected
   and skipped).

2. **Surfaced forge-script failures instead of swallowing them.**
   Previously: `DEPLOY_OUT=$(forge script ... 2>&1)` captures stderr
   silently; if forge fails, the subsequent `echo "$DEPLOY_OUT" |
   grep -oE ... | awk ...` pipeline returns empty match → grep
   exits 1 → pipefail kills the script. The operator sees only "fail
   heima-bring-up.sh failed" with no forge error.

   Fix: run forge with explicit exit-code capture (set +e / $? /
   set -e), and on non-zero exit print the captured DEPLOY_OUT
   verbatim to stderr with clear "------ forge stderr+stdout ------"
   delimiters. The address-extraction step now also validates each
   of the 4 extractions returned non-empty; if any is missing, dump
   the full forge output before exiting.

   Likely root cause for the operator's failure: forge-std submodule
   not initialized on `git pull` (git doesn't auto-populate
   submodules). Added auto-init at the top of step 5:
     if [ ! -f $CHAIN_DIR/lib/forge-std/src/Test.sol ]; then
       git submodule update --init --recursive --quiet
     fi
   First-run-only cost; subsequent runs are no-ops.

3. Also switched the regex from `\s+` to `[[:space:]]+` for POSIX
   compatibility (BSD grep on macOS doesn't always honor `\s`).

Verified the new path:
  - POSIX regex parses the deploy script's console.log shape cleanly
    (same 4 addresses extracted from the same anvil-deploy stdout
    sample that the prior commit verified)
  - Empty-match tolerance via `|| true` confirms pipefail no longer
    kills the script when grep finds no match in error output

Next operator run (one command, no env var, no extra ceremony):

  bash scripts/v2-stage1-demo.sh --only-step 9
  # → Press Enter on the mainnet prompt → forge script broadcasts
  # If forge errors, you now see the actual error message with full
  # stderr+stdout dump instead of a silent fail.
Operator hit on live mainnet deploy:
  Error: Failed to deploy script:
  EVM error; header validation error: `prevrandao` not set

Heima's Frontier EVM doesn't include `prevrandao` in block headers
(that field was introduced in Ethereum's Paris hard fork / Merge).
Forge's simulator validates block headers against its target EVM
version before broadcasting; with evm_version=paris it requires
prevrandao to be present, and the validation fails on Heima's
pre-Merge-shaped block headers.

Fix: drop foundry.toml's evm_version from paris to london (pre-Merge,
pre-prevrandao). Semantically a no-op for our contracts — they don't
reference block.difficulty, block.prevrandao, or any other
post-london feature; this change is purely about what forge's
simulator expects from incoming block headers.

Also avoids the Shanghai-era PUSH0 opcode (london doesn't emit it),
which keeps the deployed bytecode forwards-compatible with older
Frontier nodes that might be added to Heima's collator set later.

Verified: forge build clean + 11/11 forge tests still passing.

Next operator run: `bash scripts/v2-stage1-demo.sh --only-step 9` →
forge script should now pass header validation and broadcast on
Heima mainnet. ~0.006 HEI gas; deployer has 19.9 HEI.
…ify script

Three artifacts landing together. Operator verified the deploy went
live on Heima mainnet (block-explorer confirms 4 contracts at the
expected addresses); now make the knowledge durable.

1. **CLAUDE.md gets two new sections**:

   "Heima EVM compatibility level — pin to `london` in foundry.toml"
     - Documents the live-verified evidence: baseFeePerGas present
       (London+) but mixHash/withdrawalsRoot/blobGasUsed all null
       (pre-Paris). Heima is at LONDON.
     - Documents the consequence: forge script with evm_version=paris
       errors with "header validation error: `prevrandao` not set",
       which is exactly the failure the operator hit before I dropped
       foundry.toml to london in commit cecca24.
     - Includes a one-liner curl that re-verifies the EVM level any
       time (in case Heima upgrades).

   "Deployed contract registry"
     - Points future operators at docs/spec/deployed-contracts.md as
       the human-readable canonical record AND scripts/operator-workstation.env
       as the shell-tooling source of truth.
     - Points at scripts/verify-heima-contracts.sh for the read-only
       health check.

2. **docs/spec/deployed-contracts.md** (new):
   - Live mainnet addresses table with bytecode sizes + statescan
     explorer links for AgentKeysScope, SidecarRegistry, K3EpochCounter,
     CredentialAudit
   - Deploy metadata: deployer EVM + Substrate twin addrs, deploy date
     (2026-05-19), compiler version, forge version, deploy script path
   - Constructor wiring (verified post-deploy): registry pointer,
     currentEpoch=1, signerGovernance=deployer, role bitfield constants
   - Paseo section marked "currently halted" with the recipe to
     redeploy when collators return
   - ABI summary for the hot-path functions broker/workers/CLI consume
   - "When this doc needs to change" — lifecycle rules so future
     re-deploys + governance handoffs keep the record current

3. **scripts/verify-heima-contracts.sh** (new, executable):
   - Read-only RPC check, zero gas
   - 4-stage verification: bytecode presence + view function responses
     + constructor wiring + initialization
   - Reads addresses from operator-workstation.env so it works for
     any chain (heima, heima-paseo, future chains)
   - Live-verified all 13 checks pass against today's mainnet deploy:
       ok   AgentKeysScope @ 0x14C2... : 3146 bytes
       ok   SidecarRegistry @ 0x76D5... : 3301 bytes
       ok   K3EpochCounter @ 0x8396... : 687 bytes
       ok   CredentialAudit @ 0x1801... : 1421 bytes
       ok   role bitfield constants match
       ok   AgentKeysScope.registry() = SidecarRegistry addr
       ok   K3EpochCounter.currentEpoch = 1
       ok   K3EpochCounter.signerGovernance = 0xdE64...
       ═══ all checks passed ═══

Now the bring-up flow is fully closed-loop:
  bash scripts/v2-stage1-demo.sh --only-step 9   # deploys (idempotent)
  bash scripts/verify-heima-contracts.sh         # verifies (zero gas)
…ixes

Per Ralph Step 7.5 deslop pass on the changed-file set:

DRY pass:
- New crate module agentkeys-worker-creds/src/errors.rs exports the
  shared { ErrorBody, ApiError } types + err_400/err_403/err_500/err_502
  helpers. Both the credentials worker AND the memory worker (which
  depends on agentkeys-worker-creds as a lib) now import them. Removes
  ~28 lines of duplicate boilerplate; keeps the cross-worker error
  wire-shape consistent so the daemon proxy can handle them uniformly.
- Helpers now take `impl Into<String>` (was `String`) so call sites can
  pass either &str or String without an explicit conversion.

Clippy fixes (cargo clippy --workspace produced 5 warnings, all addressed):
- Replaced `.chars().last() == Some('1')` with `.ends_with('1')` in
  three places (broker cap.rs parse_bool_result + revoked decode; worker
  verify.rs parse_bool + revoked decode). Same semantics, cleaner.
- Removed redundant `|e| err_403_or_502(e)` closures in worker-memory
  handlers — `err_403_or_502` is already a fn-ptr-compatible function.

Visibility fix (warning surfaced after wiring proxy state):
- proxy.rs: ProxyState had pub fields holding non-pub types CapCache +
  CachedCap. Promoted both to `pub` so the public type surface is
  consistent. (No behavior change; just removes the
  more-private-than-public-item warning.)

Behavior is preserved end-to-end: re-ran `cargo test --workspace`
post-deslop — all suites pass. Live-chain health check
`AGENTKEYS_CHAIN=heima bash scripts/verify-heima-contracts.sh` still
returns 13/13 checks ok against Heima mainnet.
Records what was DRY'd + what was deliberately left as duplication.
Also clarifies the operator-readability principle for the heima-*.sh
scripts: the helper boilerplate repeating across 6 scripts is
intentional so each script remains self-contained for ad-hoc operator
debugging.
546 tests passing across 42 cargo suites; 13/13 on-chain health checks
pass against Heima mainnet. Codex's third review (focused on the
deslop + stage-2 additions in commits f0fa0af + db82335) returned
APPROVED with no new must-fix findings.
…demo.sh --from-step 12` on Heima mainnet

A live run of the orchestrator on Heima mainnet (post-deploy) exposed
three bugs that the unit tests missed because they only exercise the
script in isolation, not against the live contracts. Each bug is
documented + reproduced in docs/v2-stage1-iteration-log.md under
"Iteration A — live runtime debug pass".

A.1 getScope ABI decode mismatch (heima-scope-set.sh + heima-scope-revoke.sh):
   AgentKeysScope.getScope returns a single Scope struct, not a flat
   8-tuple. The previous signature `(bytes32[],bool,uint128,...)`
   triggered cast's "ABI decoding failed: buffer overrun while
   deserializing" and returned empty; the `if [ -n "$EXISTING_SCOPE" ]`
   branch never entered; idempotency check silently fell through;
   every re-run submitted a fresh tx.
   Fix: wrap the struct in outer parens → `((bytes32[],bool,...))`.
   Also rewrote the parse to use inline python3 (cast prints the
   struct on a single line; the previous `sed -n '1p'…'8p'` approach
   was for line-per-field cast output, which never matched reality).

A.2 step counter always shows [step 1/15] regardless of actual step:
   STEP_NUM=0 init + step() increment-then-print pattern; with
   --only-step N the dispatcher skips steps 1..N-1 entirely, so
   STEP_NUM never reaches N before the surviving do_step_N calls
   step() and lands on 1.
   Fix: pre-seed `STEP_NUM=$((FROM_STEP - 1))` after FROM_STEP
   resolves (line 162).

A.3 stale step 15 summary referenced unshipped `agentkeys device register`:
   Now lists the shipped bash entries (heima-{device-register,
   agent-create, scope-{set,revoke}, credential-audit, device-revoke}.sh)
   + a pointer to stage 2 #90 for Rust CLI subcommand wrappers.

Verified end-to-end:
  $ bash scripts/v2-stage1-demo.sh --from-step 12  # first run: all 4 steps green
  $ bash scripts/v2-stage1-demo.sh --from-step 12  # second run: idempotent
    step 12 → 'skip scope already matches requested config — no-op'
    step 13 → +1 audit entry (append-only by contract — intentional)
    step 14 → 'K11 enrollment already exists' skip
    step 15 → summary print (no chain action)
  All steps print correct [step N/15] counter.

Step 13 (audit-append) is intentionally NOT idempotent — the on-chain
CredentialAudit is append-only and each demo re-run is meant to add
a fresh audit entry. Documented in the iteration log; demo-level
idempotency for step 13 (via sentinel payload-hash + getEntries scan)
deferred as stage-2 polish.
…g python3

Codex adversarial review of the live-runtime fix flagged a critical
silent-failure path: heima-scope-{set,revoke}.sh's new python3 parser
was invoked with `2>/dev/null || true`, so:

1. A workstation without python3 → parser invocation outputs nothing
   → if-branch never enters → idempotency falls through →
   re-submits scope tx on every run (recreates the original A.1 bug).
2. The orchestrator's tool sanity-check (do_step_1) did not include
   python3, so the missing-dep failure mode was never caught early.
3. A malformed/empty cast output → parser fails silently → same drop-
   through behavior.

Three-place fix:
- scripts/v2-stage1-demo.sh:177 — add python3 to do_step_1 prereq tools.
- scripts/heima-scope-set.sh:160 — `command -v python3` pre-check that
  die's; remove `2>/dev/null || true` from the python3 invocation; add
  explicit PARSE_RC=$? check that die's with the raw cast output.
- scripts/heima-scope-revoke.sh:90 — same fix pattern.

Iteration log appended with finding A.4 + fix + post-fix verification.

Verified end-to-end on Heima mainnet:
  $ bash scripts/v2-stage1-demo.sh --from-step 12
  step 12 → skip (idempotent ✓)
  step 13 → +1 audit entry (intentional append-only ✓)
  step 14 → K11 enrollment already exists ✓
  step 15 → summary print ✓
  All step counters correct: [step 12/15] … [step 15/15] ✓
…h failures

Codex pass-3 review of commit cd77e68 flagged that `set -euo pipefail`
(active at the top of both heima-scope-{set,revoke}.sh) aborts the
script the instant python3 exits non-zero inside `PARSED=$(python3...)`.
The PARSE_RC=$? + die-with-diagnostic block never ran because the shell
already exited. End result: the loud-failure-on-parser-error guarantee
the previous commit claimed was structurally broken.

Fix: bracket the python3 invocation in `set +e` … `set -e` so the
command substitution is allowed to fail without immediate shell abort;
PARSE_RC=$? captures the exit code; the die-with-raw-cast-output branch
now actually fires when python3 fails.

Verified the wrap pattern via a 5-line smoke test:
  $ bash -c 'set -e; set +e
              R=$(python3 -c "import sys; sys.exit(42)"); RC=$?
              set -e; echo "RC=$RC — set -e did NOT abort"'
  RC=42 — set -e did NOT abort

Happy-path on Heima mainnet still green: `--only-step 12` skips
correctly (no tx submitted) on second invocation.
…-op rationale

Adds the codex pass history (REJECTED → REJECTED → APPROVED) tabulated
in the iteration log so the next operator can see exactly why each
intermediate commit fell over (python3 dep unchecked → set -e aborted
the assignment before PARSE_RC) and how the final fix structurally
solves both.

Also documents why the deslop pass was a no-op: the python3 parser
blocks in heima-scope-{set,revoke}.sh decode different subsets of the
Scope struct, so extracting wouldn't simplify — and operator-readability
principle favors per-script self-containment over shared lib magic.
User-requested adversarial audit: "make sure in the demo docs there is no
bypass code, or hardcoded code, all the code must run against the real
architecture design and the real environment".

Four findings + fixes documented in docs/v2-stage1-iteration-log.md
under "Audit pass — bypass / hardcoded / theatre".

AUDIT.1 — false arch.md §22a citations (theatre / arch-mismatch):
  4 source files claimed "stage-1 simplification per arch.md §22a" but
  §22a is actually about chain profiles. NO authorising section existed
  for K11 stubs / KEK from env / empty attestation. Added a real
  arch.md §22b "Stage-1 simplifications inventory" listing each
  authorised deviation (§22b.1..§22b.5) with explicit stage-2 issue
  pointers + a code-search anchor. Re-pointed every citation:
    - scripts/heima-{scope-set,agent-create,device-register}.sh
    - crates/agentkeys-broker-server/src/handlers/cap.rs

AUDIT.2 — K11 was stub-only (bypass admitted but unfixed):
  Shipped real WebAuthn ceremony via `agentkeys k11 enroll --webauthn`
  + `agentkeys k11 assert --webauthn`. macOS users now get a real
  Touch ID prompt; the assertion is cryptographically bound to the
  application message via challenge = sha256(message).

  Implementation: crates/agentkeys-cli/src/k11_webauthn.rs (~600 LOC,
  manual ceremony — no webauthn-rs heavy dep). Binds localhost axum
  server, opens default browser, runs the platform-authenticator
  ceremony, validates clientDataJSON challenge/type/origin, parses
  attestationObject CBOR + extracts P-256 X+Y from the COSE pubkey,
  verifies signature using `p256` crate.

  Without `--webauthn`, defaults to deterministic stub for CI. WARN
  to stderr when stub mode is used on `AGENTKEYS_CHAIN=heima`
  (mainnet) referencing arch.md §22b.1 + issue #90.

  New deps in agentkeys-cli: axum, tower-service, hyper, hyper-util,
  ciborium (CBOR), base64, p256, rand_core. axum promoted from dev-dep
  to runtime.

AUDIT.3 — KEK-from-env had no startup WARN + accepted placeholders:
  state.rs in both worker-creds and worker-memory now:
    - rejects all-zeros and all-same-byte KEK at startup
    - prints fail-loud WARN at startup citing arch.md §22b.2 + #91

AUDIT.4 — stale "not yet implemented" in demo doc:
  Replaced the --credential-backend=sidecar row with the actual
  shipped surface description + invocation recipe.

Tests:
- 4 k11_cli integration tests pass (covers --webauthn + stub + error
  hint paths)
- 5 k11 lib unit tests pass (the stub helpers)
- cargo build --workspace succeeds
- All other workspace tests untouched

The bash helpers (heima-scope-set.sh, etc.) still pass stub bytes by
default so the demo runs in CI without an authenticator. Wiring them
to default to --webauthn is a stage-1.5 follow-up tracked in §22b.1
of the arch.md inventory.
…EK check

Codex pass-1 review of commit ae2ada7 returned REJECTED with five
must-fix findings. All addressed in this commit:

CODEX.1 (false §22a citations remaining in main.rs):
  Two surviving citations in crates/agentkeys-cli/src/main.rs:
    - line 301 (K11 subcommand long_about) — also referenced un-shipped
      stage 2 fallback "errors out today"
    - line 413 (cmd_k11 fn docstring)
  Both re-pointed at arch.md §22b.1 stage-1 simplifications inventory.
  long_about rewritten to describe the actual --webauthn + stub modes
  with concrete examples.

CODEX.2.A (attestationObject parser doesn't validate authData fields):
  Previous parser jumped from credentialIdLength → COSE pubkey without
  binding the credential to:
    - rpIdHash == sha256("localhost") (RP binding — reject passkeys
      enrolled against a different relying party)
    - flags UP/UV/AT bits set (user-presence + user-verified +
      attested-credential-data — reject unattested keys / missed Touch ID)
    - credentialId bytes match the `id` the browser sent in cred.id
      (prevent a malicious page substituting an arbitrary id)
  Fix: extract_attested_credential() returns rp_id_hash + flags +
  credential_id + cose_pubkey. finalize_enroll() verifies all three
  before persisting.

CODEX.2.B (double-hash signature verify):
  Previous code did sha256(authData || sha256(clientDataJSON)) and then
  passed the digest to VerifyingKey::verify(). But
  p256::ecdsa::Verifier::verify auto-hashes its input with SHA-256 per
  the ECDSA-with-SHA256 contract — so the signature was being checked
  over sha256(sha256(...))  instead of sha256(authData || cd_hash).
  Fix: pass signed_bytes UNHASHED. Updated comment makes the contract
  explicit so a future refactor doesn't reintroduce the double-hash.

CODEX.3 (timeout-abort unreachable on early-return):
  `server_task.abort()` was after the `?` operator on the timeout
  result, so it never ran on timeout / oneshot-recv error. The local
  ceremony server would dangle until process exit.
  Fix: introduced AbortOnDrop<T> RAII guard. Wrap server_task in the
  guard at the start of the wait block; abort fires on every exit path
  including timeout-error.

CODEX.4 (KEK byte-uniformity check missed alternating-hex-char patterns):
  Previous check on the hex STRING caught `aaaa…` but missed `0101…`
  which decodes to 32× the byte 0x01.
  Fix: hex::decode() to bytes first, then check `iter().all(|b| b == 0)`
  and `iter().all(|b| b == kek_bytes[0])`. Applied to both worker-creds
  and worker-memory state.rs.

Test pass: 4 k11_cli integration tests + 5 k11 lib unit + 16 broker cap
+ 22 worker-creds + 2 worker-memory + various others — all still green.

WebAuthn happy path verified manually:
  $ AGENTKEYS_CHAIN=heima-paseo target/debug/agentkeys k11 assert \
      --operator-omni 0xaa…aa --message-hex deadbeef
  0x7374616765312d6b31312d737475623a... (stub bytes; no WARN on dev chain)
  $ AGENTKEYS_CHAIN=heima       target/debug/agentkeys k11 assert ...
  ==> ⚠️  WARN: K11 stub mode active on chain=heima… (stub WARN fires)
Adds the audit codex-pass tabulation to docs/v2-stage1-iteration-log.md:
- Audit pass-1 (commit ae2ada7) REJECTED with 5 must-fix findings
- Audit pass-2 (commit d0ab230) APPROVED — all 5 addressed

The audit landed:
1. Real WebAuthn ceremony with Touch ID via --webauthn flag
2. arch.md §22b stage-1 simplifications inventory (with §22b.1..§22b.5
   listing each authorised stage-1 deviation + stage-2 issue pointer)
3. KEK fail-loud WARN + byte-uniformity placeholder rejection at worker
   boot
4. Demo doc stale 'not yet implemented' text removed
5. Cap-mint handler now correctly cites §22b.4

All 6 PRD stories pass. Workspace tests still green.
…elpers

Default behaviour is unchanged: stage-1 K11 stub bytes that satisfy the
on-chain length!=0 gate. Pass --webauthn for the real Touch ID ceremony.

scripts/v2-stage1-demo.sh
- New --webauthn flag → propagates as WEBAUTHN_MODE through step 14
  (K11 enroll) and step 12 (heima-scope-set.sh).
- Step 14 calls `agentkeys k11 enroll --webauthn` when set; otherwise
  writes the deterministic stub enrollment file as before.
- Step 14 detects an existing webauthn enrollment via
  jq '.mode == "webauthn"' so stub→webauthn upgrade is one re-run.
- Help block + summary updated to document the flag.

scripts/heima-scope-set.sh
- New --webauthn flag.
- When set, derives a domain-separated message:
    agentkeys:scope-set:<op>:<actor>:<services>:<read_only>:<caps>:<period>:<chain>
  and shells out to `agentkeys k11 assert --webauthn --message-hex <msg>`
  to produce real WebAuthn assertion bytes. The challenge inside the
  ceremony equals sha256(message) so the resulting (authData || clientData
  || signature) blob is cryptographically bound to this exact scope-set.
- Default: deterministic stub bytes (unchanged behaviour).

scripts/heima-scope-revoke.sh
- New --webauthn flag with message `agentkeys:scope-revoke:<op>:<actor>:<chain>`.

scripts/heima-device-revoke.sh
- New --webauthn flag (only relevant when --master is set — agents don't
  carry K11). Message: `agentkeys:device-revoke:<op>:<dkh>:<chain>`.
- Agent-tier revoke still passes 0x (empty bytes) per contract.

CI / automation: omit --webauthn — stub mode runs headlessly, no Touch
ID prompt, no browser pop-up. WARN to stderr fires on
AGENTKEYS_CHAIN=heima per arch.md §22b.1.

macOS dev / production-shaped runs: pass --webauthn — orchestrator opens
the operator's default browser, Touch ID prompts, real assertion bytes
go on chain (the contract gate is still length!=0 in stage 1 since
Heima doesn't have EIP-7212 P-256 precompile yet; the assertion is
verifiable off-chain today and on-chain once Heima ships the precompile).

Per arch.md §22b.1 (stage-1 simplifications inventory). Tracked toward
stage 2 (#90).
Real-world hit: operator ran v2-stage1-demo.sh --webauthn, step 12 shelled
out to bare 'agentkeys k11 assert --webauthn', the PATH-resolved binary
(~/.local/bin/agentkeys) was from a pre-K11 install and rejected the
subcommand with 'unrecognized subcommand k11'. The workspace-local
target/debug/agentkeys was current; the shell never saw it.

All 4 K11-touching scripts now resolve $AGENTKEYS_BIN in order:
  1. $REPO_ROOT/target/release/agentkeys
  2. $REPO_ROOT/target/debug/agentkeys
  3. command -v agentkeys (PATH)
  4. die() with 'try: cargo build -p agentkeys-cli'

Defends against stale installs without requiring the operator to
remember to `cargo install` or `cp` after every iteration.

scripts/v2-stage1-demo.sh: orchestrator exports AGENTKEYS_BIN so the
step 14 K11 enroll call uses the same resolution.

scripts/heima-{scope-set,scope-revoke,device-revoke}.sh: each resolves
locally (they're sometimes invoked outside the orchestrator).

The k11 long_about text was already rewritten in commit d0ab230 to
document the --webauthn flow; no further docs change needed here.
…1/13)

Operator hit this immediately with --webauthn:

  [step 12/15] Grant agent scope (setScopeWithWebauthn)
  ==> Requesting real WebAuthn assertion (Touch ID prompt incoming)…
  fail agentkeys k11 assert --webauthn failed — run agentkeys k11 enroll
       --webauthn first?

Step 12 (scope-set) consumes a K11 enrollment that was being created in
step 14, breaking --webauthn end-to-end runs.

Renumber via 4-way function rotation (no logic change, just slot swap):

  before  →  after
  ─────────────────
  step 11 = agent-create   →  step 12 = agent-create
  step 12 = scope-set      →  step 13 = scope-set
  step 13 = audit          →  step 14 = audit
  step 14 = K11 enroll     →  step 11 = K11 enroll  (moved earlier)
  step 15 = summary        →  step 15 = summary     (unchanged)

Now the dispatch order is linearly satisfiable:
  10 device-register (bootstrap; no K11 needed)
  11 K11 enrollment    ← runs before any master-mutation that needs it
  12 agent-create      (master-gated; no K11 yet)
  13 scope-set         ← consumes K11 (real ceremony if --webauthn)
  14 audit-append
  15 summary

Verified:
  $ bash scripts/v2-stage1-demo.sh --from-step 11 --to-step 14 --skip-deploy
  ==> [step 11/15] K11 enrollment …
  ==> [step 12/15] Create demo agent device …
  ==> [step 13/15] Grant agent scope …
  ==> [step 14/15] Append credential audit entry …

Also: every "(step N)" reference in the help block + comments updated
to the new numbering. The bash function definitions themselves are
just slot renames; bodies unchanged.
… verify

Operator hit the nested-runtime panic on the first --webauthn run:

  thread 'main' panicked at crates/agentkeys-cli/src/k11_webauthn.rs:154:8:
  Cannot start a runtime from within a runtime. This happens because a
  function (like `block_on`) attempted to block the current thread while
  the thread is being used to drive asynchronous tasks.

Root cause: cmd_k11 runs under `#[tokio::main]`. The pub sync wrapper
`enroll_webauthn` was trying to create its OWN tokio runtime via
`Builder::new_current_thread().build()?.block_on(...)` and call
`enroll_webauthn_async` inside it. Nested tokio runtimes panic.

Fix: drop the sync wrappers. Make `enroll_webauthn` and
`assert_webauthn` directly `pub async fn` so callers `.await` them
from their existing runtime. main.rs cmd_k11 updated to `.await`.
(The previous private `_async` helpers renamed to `_inner` for clarity.)

Also addresses the `AssertPost.id` dead_code warning by actually using
it: finalize_assert now cross-checks `post.id == enrollment
.credential_id_b64url` before signature verify. The browser's
allowCredentials filter already enforces this client-side, but
verifying here is cheap defence against a tampered ceremony page.

Verified: cargo build --release --quiet exits 0 with no warnings.
Installed to ~/.local/bin/agentkeys.
Browser-side failure with --webauthn:

  X User handle exceeds 64 bytes.
  ⚠️  publicKey.pubKeyCredParams is missing at least one of the default
      algorithms ES256 and RS256.

Root cause: the JS in serve_enroll_page passed `user.id` as
`new TextEncoder().encode(omni)` where `omni` is the 66-character
"0x" + 64-hex-chars operator_omni string. UTF-8 encoding of a 66-char
ASCII string is 66 bytes — past the WebAuthn-spec 64-byte cap on
user.id. Browsers (Chrome/Safari/Edge) reject the ceremony.

Fix: add a hexToBytes() helper in the page and decode the
operator_omni hex string into its raw 32-byte SHA-256 digest. 32 ≤ 64
so the WebAuthn validator is happy. The omni is still passed as
display name (`name: omni`) — that field has no byte limit.

While there, document why pubKeyCredParams is ES256-only (alg=-7).
The Chromium warning about "missing RS256 default" is informational
and safe to ignore — the on-chain verifier (when EIP-7212 P-256
precompile lands on Heima) only knows P-256/SHA-256, so an RS256
passkey would be unverifiable on-chain. Platform authenticators
we target (macOS Touch ID via Secure Enclave, Windows Hello,
modern Android) all support ES256 natively.

Verified: cargo build --release exits 0 with no warnings.
Installed to ~/.local/bin/agentkeys.
Operator's screenshot: plain-white page contrasting jarringly with the
Touch ID modal (which is dark in system dark mode). Long hex strings
overflowed the layout unstyled.

Redesign goals:
- Match the OS chrome (Touch ID modal) instead of fighting it.
- Read like an Apple system pane, not a 1998 form.
- Make the operator + message-hex blocks legible without breaking the
  page layout.

Implementation (single SHARED_CSS const, ~110 LOC inline, zero
external assets):
- CSS variables + prefers-color-scheme media query. Light mode is
  Apple's stock light-gray (#f5f5f7) on white card. Dark mode is
  #1a1a1c on #2c2c2e card — same palette macOS uses for system sheets.
- font-family: -apple-system, BlinkMacSystemFont, ... — so the page
  uses the exact same SF Pro Text the Touch ID modal renders with.
- Card layout: 560px max-width, rounded corners, subtle 8px shadow.
- Brand row at the top: small accent dot + "AGENTKEYS" caps. Replaces
  the bare H1.
- <dl class="kv"> grid for operator / authenticator / algorithm / message.
  Monospace hex blocks in a tinted code-style background with
  word-break and (for the message) a max-height with vertical scroll
  so a 1KB message hash doesn't push the button off-screen.
- Primary pill button styled like macOS controls (#0066cc light /
  #0a84ff dark) with hover + active states. Full-width.
- Status text uses .status .ok / .err class swaps (textContent, not
  innerHTML — defends against operator_omni reflected XSS even though
  it's hex-only).
- Button disables itself on success so the operator can't double-fire
  the ceremony.

Both pages share the same CSS via the SHARED_CSS const, injected via
format!()'s named-arg substitution. Page bodies still inline since
they have ceremony-specific JS.

Verified: cargo build --release exits 0, no warnings.
Installed to ~/.local/bin/agentkeys.
Three findings after clippy --workspace + dead-code scan:

1. k11::load_enrollment was pub but no caller — k11_webauthn has its
   own load_enrollment that callers use. Remove.
2. k11::enrollment_path was pub but only called locally. Demote to fn.
3. k11.rs:17 doc list-item-without-indent warning — reword to avoid
   the markdown-list interpretation of the leading `+`.
4. k11_webauthn.rs:298+377 — `let _ = ctx;` parity hack between enroll
   and assert /finish handlers. Use `_: State<…>` extractor instead.
5. k11_webauthn.rs:674 — `i.clone()` triggered two clippy warns
   (clone_on_copy + unnecessary_fallible_conversions). rustc actually
   rejects `*i` with E0614 despite clippy's "Copy" claim. Silence the
   two false-positive lints precisely; document the contradiction
   inline so the next operator doesn't try the "fix" again.
6. k11_webauthn.rs tests — drop unnecessary `&` on `[0xa0u8]` literal
   in 3 spots (needless_borrows_for_generic_args).

Verified: cargo clippy -p agentkeys-cli --all-targets exits 0 with
ZERO warnings. cargo test -p agentkeys-cli: 57 tests pass.

No behaviour change.
…er-addr guard

Codex adversarial review of PR #87 (HEAD 7a89c5f) returned REJECTED with
must-fix findings. Addressing the smaller-scope ones here; the bigger
SO_PEERCRED + full sidecar wiring gaps documented explicitly in code so
operators understand the residual scope.

CODEX.security-1 — K11 stub default + bash scripts pass stub on mainnet:
  agentkeys-cli/src/main.rs cmd_k11: stub mode on AGENTKEYS_CHAIN=heima
  without explicit opt-in (AGENTKEYS_ALLOW_STAGE1_STUBS=1) now HARD ERRORS
  with an actionable hint pointing at --webauthn, the opt-in env var, or
  switching to a dev chain. Previously just WARN'd.

  k11_cli.rs: 4 existing tests updated to AGENTKEYS_CHAIN=heima-paseo
  (the dev chain) so stub mode still works in CI without opt-in. Two
  new tests verify the hard-error fires on mainnet without opt-in AND
  that explicit opt-in succeeds with a WARN.

CODEX.followup-1 — placeholder addresses (0x...0001..0x...0004) in
operator-workstation.env could silently target on production:
  Six helper scripts (heima-{device-register, agent-create, scope-set,
  scope-revoke, device-revoke, credential-audit}.sh) now refuse the
  sentinel addresses when AGENTKEYS_CHAIN=heima — error message points
  the operator at heima-bring-up.sh to deploy the real contracts.

CODEX.security-2 — WebAuthn attestation statement not verified:
  arch.md §22b.1 already authorises this for stage 1, but the inline
  comment was light. Expanded the limitation note in k11_webauthn.rs
  to (a) explain why attestation="none" makes the statement empty,
  (b) note that the signed-message assert path still gives full
  cryptographic binding, (c) point at #90 for the MDS3 wireup.

CODEX.blocker-1 — Daemon proxy SO_PEERCRED stubbed:
  Documented more loudly in proxy.rs module docstring — names the
  threat model (multi-user box where another local user can connect to
  the operator's $XDG_RUNTIME_DIR) and the stage-2 fix (UnixStream's
  peer_cred() + per-(uid, binary_path) policy match). The fix itself
  is in scope for #90.

CODEX.blocker-2+3 — --credential-backend=sidecar errors out + daemon
missing /v1/cred/* routes:
  Improved the CLI error message to clearly explain what IS shipped
  (daemon proxy + broker cap-mint + worker — each runnable) vs what
  isn't (the CLI→daemon /v1/cred/* handoff). Points at #91 for the
  stitching. The S3 backend with --envelope-version=v2 is the
  operator-visible stage-1 path that exercises the same envelope
  bytes the worker would write.

Tests: 59 CLI tests pass (was 57; 2 new mainnet-stub tests). Workspace
clippy clean on touched code; remaining mock-server warnings are
pre-existing and out of PR scope.
@hanwencheng hanwencheng changed the title Issue #85 step 1 — S3CredentialBackend + --credential-backend flag v2 stage 1 — sovereign sidecar + on-chain identity + credentials-service worker (#89) May 19, 2026
Codex pass-4 APPROVED-WITH-FOLLOWUPS noted the sentinel guard pattern
matched broader than needed: [0-9a-f] catches addresses 0x...0000
through 0x...000f (16 addresses) when the only actual sentinel values
in operator-workstation.env are 0x...0001 through 0x...0004 (the
4 placeholder contract addrs for HEIMA_PASEO).

Narrow to exactly [1-4] across all 6 helper scripts:
- heima-device-register.sh
- heima-agent-create.sh
- heima-scope-set.sh
- heima-scope-revoke.sh
- heima-device-revoke.sh
- heima-credential-audit.sh

False-positive risk was low (zero-page addresses are reasonable to
refuse anyway), but the precise pattern is cleaner + tells the next
operator exactly which addresses are guarded against.

No behaviour change for the actual sentinel addresses on chain=heima.
…arness/

Two related moves so the repo layout reflects what each directory does now:

1. archived/harness/ — the old Anthropic stage-N-done harness (stage 0..7).
   No longer driven; the v2 stage-1 demo orchestrator superseded it.
   Preserved for archaeology + so the old stage-7 issue-64 phase-{0..D}
   smoke tests stay reachable.

2. harness/v2-stage1-demo.sh — promoted from scripts/ to harness/. This
   file IS the harness now: 15 idempotent steps composing every shipped
   v2 stage-1 surface (CLI build, email init, vault provision, S3 smoke,
   chain bring-up, device register, K11 enroll, agent create, scope set,
   audit append, summary). Path rewrite: docs + this script's own
   self-references all flip from scripts/v2-stage1-demo.sh to
   harness/v2-stage1-demo.sh.

The orchestrator's REPO_ROOT resolver (`$(dirname "$0")/..`) still works
because both scripts/ and harness/ are one level under the repo root.

Companion skill at ~/.claude/skills/agentkeys-harness/SKILL.md drives
the orchestrator through three phases:
  1. Script test iteration (stub mode, /ralph until green)
  2. Codex adversarial review iteration (apply must-fix findings)
  3. Human-interaction iteration (real Touch ID via --webauthn)

Skill rules distilled from v2-stage1-demo.sh's iteration history:
idempotent everywhere, auto-fund test accounts from the deploy wallet,
automate everything except Touch ID, no hardcoded test inputs,
stage-1 stubs fail-loud on mainnet, workspace-local binary takes
precedence over PATH, sentinel addresses refused on mainnet.
Companion commit to 4e325c1 (the file rename). The previous commit only
moved the file; this commit updates the 30+ doc + self-references so
operators following the docs land at the new path.

Files updated:
- docs/v2-stage1-migration-and-demo.md (13 refs)
- docs/v2-stage1-iteration-log.md (14 refs)
- docs/spec/deployed-contracts.md (2 refs)
- crates/agentkeys-chain/README.md (2 refs)
- harness/v2-stage1-demo.sh (5 self-refs in --help block + comments)

The orchestrator's REPO_ROOT resolver was already path-agnostic
($(dirname "$0")/..); no behaviour change.
@hanwencheng hanwencheng merged commit a497328 into main May 19, 2026
1 check passed
hanwencheng added a commit that referenced this pull request May 20, 2026
…dev) (#92)

* agentkeys: stage 2 (#90) — P-256 verifier, on-chain K11 binding, M-of-N recovery + companion daemon

P-256 ECDSA verify on-chain via pure-Solidity Jacobian-coords implementation
(no EIP-7212 precompile dependency — Heima is at London EVM). ~654k gas
per verify, sufficient for master-mutation frequency. RFC 6979 test vectors
pass.

K11Verifier extracts WebAuthn challenge from clientDataJSON at known byte
offset (daimo-style), reconstructs msgHash, calls P256Verifier. Binds K11
sig to operation challenge to prevent replay.

SidecarRegistry: splits into registerFirstMasterDevice +
registerAdditionalMasterDevice + revokeAgentDevice + revokeMasterDevice
(M-of-N quorum gated by recoveryThreshold). Stores k11PubX/k11PubY +
lastSignCount per device. Per-operator nonce + monotonic sign-count
defend against replay.

AgentKeysScope: K11Assertion struct gates setScopeWithWebauthn /
revokeScope; per-(operator, agent) scopeNonce binds K11 sig to current
state.

CLI: K11ChainAssertion struct + assert_webauthn_for_chain() extracts
(r, s, msgHash, pubX, pubY, authData, clientDataJSON, challengeLocation,
signCount) for chain submission. New --rp-id flag enables companion
credentials at companion.localhost (distinct platform keychain entry).
--emit-chain-payload outputs JSON for cast tx construction.

Daemon: new --master-companion mode runs a second daemon instance with
its own K10 + K11 at rp_id=companion.localhost. Serves HTTP API:
  GET  /v1/companion/whoami    — emits device identity
  POST /v1/companion/approve   — runs WebAuthn ceremony, returns chain payload

Scripts:
  scripts/heima-device-add.sh              — register companion as 2nd master
  scripts/heima-set-recovery-threshold.sh  — raise threshold to N
  scripts/heima-recovery.sh                — M-of-N master-device revoke

Harness:
  harness/v2-stage2-demo.sh                — idempotent 8-step demo

28 forge tests pass (P256: 8, K11: 6, AgentKeysV1: 14). Stage-2 demo
runs green in stub mode and re-runs green (idempotent). Full --webauthn
flow requires Touch ID + post-deploy contract addresses.

Closes part of #90:
  - On-chain P-256 verify of K11 assertions
  - Multi-master M-of-N recovery quorum
  - Multi-master pairing flow (companion daemon as mobile-app alternative)

Deferred to follow-up PRs:
  - audit-service worker (tier A Merkle relay)
  - email-service worker
  - K3 rotation operational runbook
  - Existing scripts/heima-{device-register,scope-set,scope-revoke}.sh
    migration to new contract surface (their K11 args changed shape)

* docs: stage-2 Heima Mainnet deploy + test runbook + harness fixes

Adds docs/v2-stage2-heima-deploy-and-test.md walking the operator
through redeploying the stage-2 contract set on Heima Mainnet,
re-bootstrapping the primary master, running the stage-2 demo, and
exercising the M-of-N recovery flow. Inherits all env setup from
docs/v2-stage1-migration-and-demo.md (no parallel test environment).

Harness fixes from the first dry-run:
- harness/v2-stage2-demo.sh step 5 simplifies to script-existence
  sanity check in stub mode (was: invoking dry-run which fails on
  missing companion K11 file).
- harness/v2-stage2-demo.sh step 7 same — verifies recovery script is
  invocable without requiring live chain state.
- scripts/heima-device-add.sh adds a dry-run path that doesn't require
  the companion K11 file (uses placeholder pubkey).
- scripts/heima-recovery.sh adds a dry-run path that doesn't require
  the deployer mnemonic / ethers node_modules.

Result: bash harness/v2-stage2-demo.sh --stub --skip-build runs all
8 steps green and is idempotent on re-run.

* harness: v2-stage2-demo as single source of truth for deploy+test

Stage-2 demo now owns the full lifecycle end-to-end:
- step 3: idempotent contract deploy (skips if already on chain;
  --redeploy forces fresh deploy; reads addresses from broadcast file;
  writes them to scripts/operator-workstation.env)
- step 4: idempotent primary-master bootstrap via new
  scripts/heima-register-first-master.sh (calls registerFirstMasterDevice
  with K11 pubX/pubY loaded from the operator's enrollment JSON)
- step 5-8 unchanged: companion daemon spin-up, 2nd-master register,
  recoveryThreshold update, recovery dry-run
- step 9: summary with all deployed addresses

Now actually deployed to Heima Mainnet (verified live):
  P256Verifier:    0xb74f0aaf9b72b4e7da872f77c63d805bf1937190
  K11Verifier:     0x73446fc9919a0a539b8b08dbda615a64b796ca4f
  SidecarRegistry: 0x9306c524a5e5c33e9a905b956204207ccaf7a7a1
  AgentKeysScope:  0x1276b94f57fd4086670d66acb8c75058176df399
  K3EpochCounter:  0x66c08748a6cfa14d9fefaaf5147e41a98db24f53
  CredentialAudit: 0xe827ba44931aef8c6f3abfec6b90ecf59f797576

Primary master registered on the new SidecarRegistry, tx
0x5f3a79bc970062ec74aa0deb5618f8a527f638a6d24ba3c4144f09a49600876d
(block 9623082).

Re-runs are idempotent — all 9 steps log 'skip'/'ok' without
re-submitting any tx.

* harness: move stage-2 helper scripts into harness/scripts/

The four scripts only referenced by harness/v2-stage2-demo.sh now live
under harness/scripts/ — same place as the orchestrator that calls them.
Operator-facing stage-1 helpers in scripts/ stay put.

  scripts/heima-device-add.sh              → harness/scripts/heima-device-add.sh
  scripts/heima-recovery.sh                → harness/scripts/heima-recovery.sh
  scripts/heima-register-first-master.sh   → harness/scripts/heima-register-first-master.sh
  scripts/heima-set-recovery-threshold.sh  → harness/scripts/heima-set-recovery-threshold.sh

The moved scripts compute REPO_ROOT from two levels up
(harness/scripts/<f>.sh → repo root via /../..); the demo paths were
updated to point at the new harness/scripts/ location.

Hardened the deploy-presence check in step 3:
- Distinguishes RPC failure (exit nonzero) from "no code at address"
  (exit zero with "0x").
- RPC failure → retry up to 8 times with 3s sleep → die rather than
  redeploy on uncertain state.
- "No code" → genuine; trigger redeploy as before.

Heima's RPC hits TLS-handshake-EOF transients regularly; this fix
prevents an unnecessary redeploy that would orphan the previous set.

Same hardening on the balance check in step 3.

* harness: companion daemon serves real device_key_hash + clearer step-8 message

Stage-2 demo step 5 now derives the companion's on-chain device_key_hash
from its K11 cose-pubkey (cast keccak <cose_pubkey_hex>) and passes it
to the daemon via --companion-device-key-hash. The daemon's
/v1/companion/whoami then returns the real hash that
registerAdditionalMasterDevice will use as the storage key, so the
later revoke flow can find the device on chain.

Stage-2 demo step 8: clearer skip message + when --webauthn is set,
prints the companion's device_key_hash + the exact re-run command for
executing the revoke. The previous message implied --webauthn alone
would do something; really we need a target hash too.

* harness/scripts: shared key-resolution lib so scripts accept raw-key files

Adds harness/scripts/_lib.sh with resolve_master_key():
- $HEIMA_DEPLOYER_KEY_FILE env var (raw hex or mnemonic)
- ~/.agentkeys/heima-deployer.key (raw hex, used by stage-1 operator)
- ./test-hei (mnemonic, legacy)

Patches the 3 scripts that previously only handled mnemonic files:
- heima-device-add.sh
- heima-set-recovery-threshold.sh
- heima-recovery.sh (preserves --dry-run placeholder path)

Fixes a real bug: scripts died with 'missing mnemonic' on operators
that bootstrapped from a raw private key (the stage-1 path stores
the deployer key at ~/.agentkeys/heima-deployer.key, not a mnemonic
at ./test-hei).

Also fixes step 8's stale whoami file: always curl fresh so the
device_key_hash hint reflects the currently-running daemon, not a
prior run where the daemon hadn't been started with the real hash.

* fix: WebAuthn challenge double-hash + empty cred-id bytes32

Bug 1 (root cause of step 7 K11VerificationFailed reverts):
assert_webauthn_for_chain was passing the 32-byte expected_challenge as
a "message" to assert_webauthn_inner_parts, which sha256'd it again
before using as the WebAuthn challenge. The on-chain K11Verifier
expects the WebAuthn challenge to BE the operation challenge (no
extra hash); double-hashing made clientDataJSON.challenge !=
expected_b64 → ChallengeMismatch / verifyAssertion returns false →
contract reverts with K11VerificationFailed.

Fix: refactored assert_webauthn_inner_parts to take a [u8; 32]
challenge directly. The legacy assert_webauthn_inner path sha256's
the message itself before calling (preserves existing behavior).
assert_webauthn_for_chain passes the expected_challenge through
unchanged.

Bug 2 (step 6 cast send "invalid string length"):
The companion daemon was receiving an empty --companion-k11-cred-id
(demo didn't pass it), so /v1/companion/whoami returned k11_cred_id="".
The brittle xxd|head|sed pipeline in heima-device-add.sh produced an
all-zeros bytes32 by accident, but the demo's tuple construction had
other issues that confused the cast parser.

Fix: demo step 5 now computes the cred-id hash from the K11 file
(keccak256-style sha256 of the b64url credential id) and passes it
to the daemon via --companion-k11-cred-id. heima-device-add.sh uses
the hash directly from whoami without re-encoding. Also bumped the
empty attestation arg from "0x" to "0x00" (cast tolerates the latter
more consistently).

Added a sanity-check loop in heima-device-add.sh that validates each
bytes32 arg has length 66 before invoking cast, so future malformed
inputs fail with a clear error rather than cast's opaque parser msg.

* ui: distinguish PRIMARY vs COMPANION K11 ceremony pages

WebAuthn assert page now surfaces the role + RP ID prominently so the
operator can't confuse which credential they're about to sign with:
- Color: blue accent for PRIMARY MASTER (rp_id=localhost),
  purple for COMPANION MASTER (rp_id=companion.localhost)
- Role badge at the top of the card with emoji + label
- Dedicated RP-ID callout warning to verify the Touch ID prompt
  matches the displayed RP
- Button text reads "Sign as PRIMARY MASTER" / "Sign as COMPANION MASTER"
- Page <title> includes the role so the OS tab list shows it

The M-of-N recovery flow opens TWO browser windows in quick
succession (one for each daemon's K11 ceremony) — without this
distinction the operator could tap the wrong Touch ID prompt and
silently produce an assertion the contract rejects.

* harness: integrate full M-of-N E2E test (3 devices + 2-of-2 revoke)

Stage-2 demo grows from 9 to 10 steps and now exercises the full
M-of-N revocation path as part of the default --webauthn flow:

  Step 8 NEW — Register synthetic 3rd master (the "spare").
    The spare is a fresh P-256 keypair generated via openssl, NOT a
    real WebAuthn passkey. It registers as a 3rd master with roles=3
    (CAP_MINT|RECOVERY) via primary K11 sig (1 Touch ID at localhost).
    State persists at /tmp/agentkeys-spare-current/ for step 9.
    Why synthetic: the spare is "lost" by design — never needs to
    sign for its own revocation (primary + companion provide the
    quorum). Skipping its WebAuthn enrollment saves a Touch ID
    without weakening the test of any contract surface.

  Step 9 NEW — Revoke spare via 2-of-2 quorum.
    Calls heima-recovery.sh with target=spare hash. The script:
    - Asks primary K11 to sign OP_REVOKE_MASTER challenge (1 Touch ID
      at localhost — UI shows PRIMARY MASTER badge).
    - Asks companion daemon /v1/companion/approve to sign same
      challenge (1 Touch ID at companion.localhost — UI shows
      COMPANION MASTER badge).
    - Submits revokeMasterDevice(spareHash, [primarySig, companionSig]).
    - Contract verifies 2-of-2 quorum + bumps operatorNonce.
    Post-tx verify: isActive(spare) == false.

  Step 10 NEW — Cleanup spare local state.
    Removes /tmp/agentkeys-spare-current/. The on-chain entry stays
    as revoked=true (audit trail — no on-chain delete by design).

End state after a successful run:
  - 2 active masters: primary (roles=7) + companion (roles=3)
  - 1 revoked master: spare (roles=3, revoked=true)
  - recoveryThreshold = 2
  - operatorNonce += 3 (register-2nd-master, set-threshold, revoke)

Touch IDs on a fresh run: 6 total
  - companion enroll (step 5, once per setup)
  - companion register (step 6, once per setup)
  - set threshold (step 7, once per setup)
  - spare register (step 8, fresh per run)
  - primary sigs spare revoke (step 9)
  - companion sigs spare revoke (step 9)

Re-run after this completes: steps 1-7 + 10 skip, steps 8-9 generate
a fresh spare (new keypair) and revoke it — 3 Touch IDs per re-run.
This makes the demo a repeatable end-to-end test of the M-of-N path
without bricking the operator's setup.

* harness: auto-version companion when previous instance is revoked

Once a companion has been revoked on chain (e.g. as part of an M-of-N
quorum test), it can never re-enter the registered-master set under
the same deviceKeyHash. Stage-2 demo now detects this and enrolls a
fresh companion under a bumped rp_id (companion.localhost →
companion-v2.localhost → companion-v3.localhost) so the M-of-N revoke
test in step 9 has 2 distinct ACTIVE masters to form the quorum.

Changes:
- harness/v2-stage2-demo.sh step 5: scans existing K11 files for an
  active-on-chain companion. If none found, picks the lowest free
  version slot and enrolls a fresh K11 there.
- harness/v2-stage2-demo.sh step 5: passes the computed rp_id to the
  daemon via new --companion-rp-id flag.
- crates/agentkeys-daemon/src/companion.rs: rp_id is now stored in
  CompanionState + threaded through /v1/companion/whoami responses
  and assert_webauthn_for_chain calls.
- crates/agentkeys-daemon/src/main.rs: new --companion-rp-id flag.
- harness/scripts/heima-device-add.sh: reads rp_id from
  /v1/companion/whoami and derives the K11 file path from it.

Net effect: re-running the demo after a 2-of-2 revoke now enrolls
a fresh companion-vN, re-establishes a 2-active-master state, and
proceeds with the next spare-revoke cycle without operator hand-fixing.

* scripts: migrate stage-1 scripts to stage-2 ABI

Enables harness/v2-stage1-demo.sh to run green against the new
SidecarRegistry + AgentKeysScope contracts deployed in stage 2.

Changes:

- heima-device-register.sh becomes a thin wrapper: forwards to
  harness/scripts/heima-register-first-master.sh when no first
  master is registered; logs skip otherwise. The pre-stage-2
  registerMasterDevice() was split into registerFirstMasterDevice +
  registerAdditionalMasterDevice; this script handles the former.

- heima-device-revoke.sh: detects master vs agent target and
  delegates accordingly. Agent revoke uses the new revokeAgentDevice
  (no K11 needed). Master revoke delegates to heima-recovery.sh
  which collects the M-of-N K11 quorum.

- heima-scope-set.sh: real WebAuthn ceremony, computes the contract's
  expected_challenge per OP_SET_SCOPE encoding (servicesDigest +
  scopeNonce + chainid), builds K11Assertion struct, calls new ABI
  (bytes K11 -> struct). Stub bytes no longer satisfy the gate.

- heima-scope-revoke.sh: same migration as scope-set, computing
  OP_REVOKE_SCOPE challenge.

- All four scripts now use harness/scripts/_lib.sh's
  resolve_master_key, supporting both raw-key files
  (~/.agentkeys/heima-deployer.key) and mnemonic files (./test-hei).

Effect: operator can now run `bash harness/v2-stage1-demo.sh --webauthn`
against the same Heima Mainnet deployment that stage-2 uses, exercising
the full operator lifecycle (init -> register -> agent -> scope -> audit)
on the new contracts.

* ops: K3 rotation runbook + script

scripts/heima-k3-rotate.sh — operator-driven K3 epoch advance via
K3EpochCounter.advanceEpoch(). Idempotent (--target-epoch N skips if
currentEpoch >= N), supports dry-run, signs from the wallet that is
the contract's signerGovernance.

docs/runbook-k3-rotation.md — step-by-step operator runbook:
prerequisites, the one-command flow, post-rotation verification,
when to rotate (quarterly hygiene + TEE-compromise indicator), lazy
vs eager re-encryption trade-offs, and the stage-3 migration path to
move signerGovernance from EOA to M-of-N multisig.

Verified end-to-end on Heima Mainnet (dry-run): K3EpochCounter at
0xeacc97d4e7854c52d4736e5fba2dc7c2c2b147d9 has currentEpoch=1 and
signerGovernance points at the deployer.

* audit: tier-A Merkle relay worker + on-chain appendRoot path

Contract surface (CredentialAudit.sol):
- New `appendRoot(operatorOmni, merkleRoot, batchEntryCount)` stores a
  per-operator AuditRoot entry, emits AuditRootAppended. Operators
  reconstruct per-event proofs from leaves in S3.
- New `verifyEntryInRoot(operatorOmni, rootIndex, proof[], leaf)`
  validates a sorted-pairs Merkle proof on chain. Matches OpenZeppelin
  convention so the Rust-side proof emission is directly verifiable
  without further transformation.
- Existing `append()` per-event path (tier C) untouched.

Forge test test_CredentialAudit_AppendRoot_AndVerifyMembership covers
the round-trip with a 4-leaf tree.

New crate agentkeys-worker-audit:
- `merkle.rs`: minimal Merkle root + proof helpers using keccak256 with
  sorted-pairs encoding (matches the contract verifier byte-for-byte).
  Doc tests + 4 unit tests pass.
- `state.rs`: per-operator in-memory event queue with flush semantics.
  Drains the queue, computes Merkle root, writes per-event leaves +
  proofs to a JSONL file at /tmp/audit-leaves-<root>.jsonl.
- `handlers.rs`: HTTP surface
    POST /v1/audit/append              — queue event
    POST /v1/audit/flush/:operator     — drain one queue
    POST /v1/audit/flush-all           — drain all queues
- `main.rs`: bind axum at 127.0.0.1:9092; periodic auto-flush every
  --flush-interval-secs (default 300s; 0 = manual only). Each flush
  logs the Merkle root + leaves path. Chain submission via
  `cast send appendRoot` is operator-driven (separate from this
  process so the worker doesn't need a deployer key).

End-state: operators wanting per-event-tx semantics keep using tier C
(`heima-credential-audit.sh` direct write). Operators wanting batched
gas (one tx per N events / per 5min) point their daemon at this worker
and emit per-event POSTs; the worker computes roots and the operator
periodically submits roots via `cast send`.

* email: agentkeys-worker-email — SES send + per-actor inbox list

New crate agentkeys-worker-email. Surfaces:

  POST /v1/email/send
    Body: { from, to[], subject, body_text, body_html? }
    Wraps aws-sdk-sesv2::SendEmail with the operator's SES identity
    (must be verified per the #83 setup workflow). Returns the SES
    message_id.

  GET /v1/email/inbox/:actor_omni
    Lists objects under s3://$AGENTKEYS_VAULT_BUCKET/bots/<actor_omni>/inbound/.
    Inbound routing itself is the SES routing Lambda from #83; this
    worker only exposes what's already been delivered to S3.

  CLI args:
    --bind             default 127.0.0.1:9093
    --inbox-bucket     env AGENTKEYS_VAULT_BUCKET, required

Builds against aws-sdk-sesv2 1.118 + aws-sdk-s3 1.132. No new
dependencies introduced at the workspace level (aws-config + s3 are
already used by worker-creds).

Operator workflow: spin up alongside worker-creds + worker-memory on
the broker host, route per-agent outbound mail through this worker
instead of having each agent directly call SES. Cap-token verification
on /v1/email/send is left as a follow-up (current shape assumes the
worker is on a private interface — operators expose it only on the
sidecar daemon's localhost, same as worker-creds).

* docs: K3 rotation test verdict — 4 rounds green on Heima Mainnet

Live E2E test of scripts/heima-k3-rotate.sh per agentkeys-harness skill:

- Round 1: epoch 1 → 2 (1 tx)
- Round 2: epoch 2 → 3 (1 tx)
- Round 3: target=3 (already there) → skip, no tx, 0 gas
- Round 4: target=6 (3-step advance) → 3 txs

Total: 5 real txs on K3EpochCounter = 0xeacc97d4e7854c52d4736e5fba2dc7c2c2b147d9.

The contract is forward-only by design — no "rotate back" — so the
"back and forth" test is bounded to forward-path correctness + the
idempotency skip on re-targets-to-current. Both work as designed.

K3EpochCounter is now at epoch 6 on Heima Mainnet. The signer enclave
will retain historical K3_v[1..5] for decrypt of pre-rotation blobs;
new writes use K3_v[6].

* ui: enrollment page + macOS Touch ID dialog readability

Two fixes:

1. Enrollment page (serve_enroll_page) now matches the assert-page
   visual language — role badge (PRIMARY MASTER blue, COMPANION MASTER
   purple), RP-ID surfaced explicitly, button text reads "Enroll as
   PRIMARY MASTER" / "Enroll as COMPANION MASTER". Previously the
   enrollment page was role-agnostic which made it easy to tap Touch
   ID on the wrong RP when re-enrolling.

2. WebAuthn user.name shown in the macOS Touch ID dialog ("Use Touch
   ID to sign in to 'localhost' with your passkey for <NAME>") was
   previously the full 64-char operator_omni hex, which truncates
   awkwardly on screen. Now reads "AgentKeys Primary Master
   (0x941cb1c3…)" or "AgentKeys Companion Master (0x941cb1c3…)" —
   human-readable + a 10-char omni prefix for cross-operator disambig.

Takes effect on NEW enrollments only — existing credentials retain
whatever user.name was set when they were originally enrolled. To
refresh the display name, delete ~/.agentkeys/k11/<omni>--<rp>.json
and re-enroll.

The "white text in white background" in the macOS Passkey-source
filter row is macOS system UI (the picker for which provider supplies
the passkey — iCloud Keychain, 1Password, etc.); it's outside our HTML
control. The other observed truncation is fixed by this commit.

* docs(arch): §16.4 brief intro to K3 rotation flow

Operator-facing summary of what K3 rotation does and doesn't change:
- contract addresses, devices, scopes, threshold unchanged
- on-chain epoch counter advances + emits K3Rotated event
- signer enclave retains historical K3 versions for legacy decrypt
- workers swap to new epoch for new writes via SSE
- one-command operator action: `bash scripts/heima-k3-rotate.sh`
- links to full runbook at docs/runbook-k3-rotation.md
- notes the stage 1-2 simplification (KEK from env per §22b.2) means
  rotation is forward-compatible but not yet driving worker re-key

Also documents the eager-re-encrypt follow-up gated behind a confirmed
TEE compromise scenario (stage 3 tracked in §22b.5).

* fix(stage-2): codex adversarial review — 7 critical/high/medium findings

Codex flagged 8 findings; 7 are addressed here (C1, C2, C3/M1, H1, H2, M2 +
test coverage). The remaining one (codex H3 "K10+K11") is a false positive:
msg.sender check IS the K10 signature — EVM tx signing is secp256k1 over
the whole tx by the master wallet. Added comments where helpful.

Contract fixes (require redeploy):

  C1: SidecarRegistry.revokeMasterDevice — refuse to revoke if it would
      leave < max(1, recoveryThreshold) active recovery-capable masters.
      Prevents permanent operator stranding.

  C2: SidecarRegistry.setRecoveryThreshold — refuse newThreshold >
      activeRecoveryMasterCount. Prevents permanent operator stranding
      via unsatisfiable quorum.

  C3/M1: CredentialAudit.appendRoot — auth-gate by operator's master
      wallet (via injected SidecarRegistry reference). Previously any
      account could pollute an operator's root list.

  H1: K11Verifier.verifyAssertion — three new envelope checks:
      - authData[0:32] == expectedRpIdHash (per-credential, stored on
        register at DeviceEntry.k11RpIdHash). Prevents cross-RP replay.
      - authData[32] has UP|UV flags. Prevents stolen-device-without-
        biometric assertions.
      - clientDataJSON starts with `{"type":"webauthn.get"`. Prevents
        replay of webauthn.create (enrollment) assertions.

  M2: CredentialAudit + worker Merkle — domain-separate leaves (0x00
      prefix) and internal nodes (0x01 prefix). Prevents an internal-
      node digest from impersonating a leaf at shorter depth.

ABI changes:
  - SidecarRegistry.registerFirstMasterDevice + registerAdditionalMaster
    now take an extra bytes32 k11RpIdHash arg (the operator's K11 enroll
    rp_id is hashed and stored).
  - K11Verifier.verifyAssertion takes the rpIdHash; callers
    (SidecarRegistry, AgentKeysScope) read entry.k11RpIdHash.
  - CredentialAudit constructor takes the SidecarRegistry address.

Harness changes:
  - heima-register-first-master.sh + heima-device-add.sh + heima-register-
    spare-master.sh compute sha256(rp_id) from the K11 enrollment file
    and pass it as the new arg.
  - v2-stage2-demo.sh step 6 + 7 fail-fast on device-add/threshold-set
    failures + verify on-chain state matches before advancing to step 9.
    Codex H2: previously silent failures could false-green step 9.

Tests:
  + 5 new K11Verifier tests: RpIdHashMismatch, UserPresenceMissing (no
    flags, UP-only), WrongClientDataType (webauthn.create), all pass.
  + CredentialAudit_AppendRoot_RejectsNonMaster (vm.prank attacker).
  + Internal-node-as-leaf attack test in both forge + Rust Merkle suite.
  - Total: 33 forge tests (was 28), 7 worker-audit unit tests (was 6),
    all green.

Deploys will fail against the existing PR #87-deployed contracts —
operator must redeploy via the demo's step 3 (forced) or by running
`bash harness/v2-stage2-demo.sh --redeploy`.

* deploy: stage-2 contracts with codex fixes redeployed on Heima Mainnet

New addresses (PR commit 5834c1d 'fix(stage-2): codex adversarial review'):
  P256Verifier:    0xda5b772f9d6c09abe80414eea908612df9b54749
  K11Verifier:     0x5a441431f08e0f5f5ed10659620cb4e0e814e627
  SidecarRegistry: 0x1ac62f1c2d828476a5d784e850a700dc1f17e0be
  AgentKeysScope:  0xd44b375daefc65768f417d0f0125b68d5ba7df3b
  K3EpochCounter:  0x6c9e675c699a06acefbc156afdee6bfbfe32ccb3
  CredentialAudit: 0x63c4545ac01c77cc74044f25b8edea3880224577

Previously-deployed instances (bc232ebcb47fa672aa2a1b2b0481c7ff9a86531b
et al) are now abandoned. They have the pre-codex-fix ABI which is
incompatible — DeviceEntry layout changed (added k11RpIdHash field).
Operator's primary master must re-register via
harness/scripts/heima-register-first-master.sh against the new
SidecarRegistry; companion + spare flows then continue normally.

* issue #90: co-locate audit/email/cred/memory workers on broker host (dev)

Dev-only co-location of the 4 service workers on the same EC2 box as the
broker, behind per-worker nginx vhosts. CLAUDE.md: "for production, we
will isolate all the services for the security issue" — the per-subdomain
layout is the migration seam, so a future move to dedicated hosts only
needs the A record + IAM principal to change.

Topology:
  broker.litentry.org  :8091  agentkeys-broker
  signer.litentry.org  :8092  agentkeys-signer
  audit.litentry.org   :9092  agentkeys-worker-audit   (Merkle relay)
  email.litentry.org   :9093  agentkeys-worker-email   (SES + S3 inbox)
  cred.litentry.org    :9094  agentkeys-worker-creds   (credential CRUD)
  memory.litentry.org  :9095  agentkeys-worker-memory  (memory CRUD)

setup-broker-host.sh — builds + installs the 4 worker binaries, auto-
generates worker-{creds,memory}.env with stable KEK secrets (preserved
across re-runs so existing blobs stay decryptable), writes 4 systemd
units, writes 4 nginx vhosts via shared write_worker_nginx_site(), and
probes /healthz on each port post-restart. New CLI flags: --audit-host,
--email-host, --cred-host, --memory-host, --chain-rpc, --vault-bucket,
--memory-bucket, --scope-addr, --registry-addr, --k3-counter-addr,
--without-workers. Re-runs without flags now re-read previously-configured
values from /etc/agentkeys/worker-{creds,memory}.env so the script stays
idempotent for non-default deployments.

dns-upsert-workers.sh (NEW) — single atomic Route 53 change-batch UPSERT
for all 4 A records. Validates the caller is on agentkeys-admin, refuses
RFC1918 / TEST-NET-2 (Cloudflare WARP / Zscaler / corporate VPN) EIPs,
waits for Route 53 INSYNC + Cloudflare DoH propagation before exiting.

verify-workers.sh (NEW) — laptop-side end-to-end check: DNS resolves via
Cloudflare DoH → TLS cert is Let's Encrypt → /healthz returns HTTP 200
with the per-worker expected body marker. Exits non-zero with per-failure
diagnostics. --no-tls for the HTTP-only first-pass phase.

worker-audit/main.rs + worker-email/main.rs: GET /healthz → "ok" so
probe_or_die can verify boot (worker-creds + worker-memory already had it).

operator-workstation.env: derive WORKER_{AUDIT,EMAIL,CRED,MEMORY}_HOST +
AGENTKEYS_WORKER_*_URL from \$BROKER_HOST, mirroring the SIGNER_HOST
pattern.

docs/cloud-setup.md: new §1.4 (TOC row) + §7 "Service workers" with the
concern table (mirrors §6 signer), §7.1 DNS one-shot helper, §7.2 TLS
cert loop + nginx flip, §7.3 verification. Existing §7 Cleanup → §8.

heima-scope-set.sh + heima-scope-revoke.sh: graceful skip with
{"ok":true,"skipped":"no-webauthn-k11"} when no mode:webauthn K11 is
enrolled, so harness/v2-stage1-demo.sh (default stub mode) is fully CI-
automatable without operator Touch ID.

* fix: worker-{creds,memory} need REGISTRY + K3_EPOCH_COUNTER addresses

worker-creds and worker-memory both call profile_env() for all THREE
contract addresses (SidecarRegistry, AgentKeysScope, K3EpochCounter) at
state construction — verified live by the boot failure on broker host:

  Error: SIDECAR_REGISTRY_ADDRESS_HEIMA must be set
  Caused by: environment variable not found

The auto-generated /etc/agentkeys/worker-creds.env was only writing
SCOPE_CONTRACT_ADDRESS_HEIMA, omitting the other two — fixed.

Also added AGENTKEYS_CHAIN=heima to both env files so the chain-profile
resolution is explicit instead of relying on the worker-side default
(matches what the existing chain helpers do).

* issue #90: wire audit + email workers into stage-1 + stage-2 demos

New step exercises the 4 co-located service workers as a tier-A relay:
queue 2 audit events → flush → on-chain CredentialAudit.appendRoot →
verify rootCount + getRoot match. Plus an email worker /healthz +
/inbox smoke.

  Stage-1 demo: STEP_TOTAL 15 → 16, new step 15 between audit-append
                and summary; summary renumbered to step 16.
  Stage-2 demo: STEP_TOTAL 10 → 11, new step 10 between M-of-N revoke
                and cleanup; cleanup renumbered to step 11.

scripts/heima-worker-smoke.sh (NEW) — drives the full flow:
  1. precheck both workers' /healthz
  2. POST 2 events → audit worker /v1/audit/append
  3. POST /v1/audit/flush/<operator_omni> → Merkle root + leaves
  4. cast send CredentialAudit.appendRoot from operator master wallet
  5. cast call rootCount + getRoot to verify on-chain root matches flush
  6. GET /v1/email/inbox/<actor_omni> as soft-warn smoke (the broker
     EC2 IAM lacks s3:ListBucket on the inbox bucket today — out-of-scope
     follow-up; worker is deployed + /healthz green so the demo
     continues without breaking the chain green-bar)

Live-tested 4 rounds against Heima Mainnet — rootCount progressed
0→1→2→3→4→5→6→7→8 across stage-1 + stage-2 runs with all 8 on-chain
Merkle roots verified by getRoot() readback. Idempotency: every re-run
is a clean skip (no chain mutation) or adds a fresh tier-A root.

Sibling fixes (same bug class — stale DeviceEntry struct offsets after
codex H1 added k11RpIdHash + k11PubX + k11PubY):

  heima-agent-create.sh + heima-device-revoke.sh — switched the
    idempotency check from hex-offset slicing of getDevice() to the
    typed isActive(bytes32)(bool) view. The old code read offset 320
    for registeredAt; after the struct grew, registeredAt now lives at
    offset 512, so the offset-based check always returned 'not yet
    registered' on re-run and registerAgentDevice reverted with
    DeviceAlreadyRegistered (0xa98bbce0). isActive is struct-agnostic.

  heima-scope-set.sh + heima-scope-revoke.sh — when USE_WEBAUTHN=0
    (stub mode) AND the local K11 file is mode=webauthn (from a prior
    real ceremony), skip cleanly instead of triggering Touch ID. Demo
    stub-mode runs on a laptop with prior webauthn enrollment were
    otherwise prompting for Touch ID and dying on the dismissed
    dialog. The 'stub-mode-refuses-touchid' skip payload makes this
    explicit.

* issue #90: wire OIDC federation into cred + memory workers (Q3)

Closes the OIDC isolation gap from PR #92 review (issue #90 Q1 + Q3): the
broker had full federation infrastructure (handlers/oidc.rs, mint.rs,
sts.rs) but the workers bypassed it — every S3 call went through the
broker EC2 instance profile, so the per-actor IAM scoping defined in
provision-vault-role.sh's PrincipalTag policy was never exercised.

Worker code change (backwards compatible):

  crates/agentkeys-worker-creds/src/aws_creds.rs (NEW)
    - OptionalStsCreds axum extractor: parses three optional headers
        X-Aws-Access-Key-Id
        X-Aws-Secret-Access-Key
        X-Aws-Session-Token
      Returns None if any are missing (partial = error, refuse to mint
      a half-authed S3 client).
    - StsCreds::build_s3_client(region) — per-request S3 client backed
      by the passed-through STS creds.
    - s3_for_request(default, region, override) — falls back to the
      default instance-profile client when override is None.
    - 4 unit tests covering header presence / absence / partial.

  crates/agentkeys-worker-creds/src/handlers.rs
    cred_store + cred_fetch + cred_teardown — accept OptionalStsCreds,
    use the per-request client when present.

  crates/agentkeys-worker-memory/src/handlers.rs
    memory_put + memory_get + memory_teardown — same pattern; re-exports
    aws_creds from agentkeys_worker_creds (no duplication).

Backward compat: requests without the three X-Aws-* headers fall back
to state.s3 (instance profile) — existing stage-1 + stage-2 demo flows
keep working unchanged.

harness/v2-stage3-demo.sh (NEW, 8 steps)
  End-to-end OIDC isolation proof on Heima Mainnet:

    1. SIWE wallet_sig auth → session JWT
    2. POST /v1/mint-oidc-jwt → STS-compatible web identity token
    3. AssumeRoleWithWebIdentity → STS creds tagged with
       PrincipalTag/agentkeys_actor_omni = derive_omni(master wallet)
    4. POSITIVE: PUT s3://vault/bots/<own actor_omni>/credentials/…
       → HTTP 200
    5. NEGATIVE: PUT s3://vault/bots/<wrong actor_omni>/credentials/…
       → AccessDenied (IAM rejects cross-actor write — the proof)
    6+7. Same positive+negative pair on the memory bucket — soft-skip
       when memory bucket not yet provisioned (follow-up).
    8. Cleanup with admin profile.

Live-tested against Heima Mainnet. Step 5 verified: AWS IAM itself
rejected the cross-actor PUT with AccessDenied — proves the
${aws:PrincipalTag/agentkeys_actor_omni} scoping in
scripts/provision-vault-role.sh works as designed. Even if a worker
were compromised, it could not write to another actor's prefix when
using STS creds passed through from the broker mint flow.

Architectural answers to the review (#90 Q1 + Q2):

  Q1 ("is OIDC disrupted by the new service isolation design?"):
    Was, yes — workers bypassed federation. NOW WIRED.
    Workers respect STS creds when passed; fall back to instance
    profile otherwise so existing stage-1+2 flows are unchanged.

  Q2 ("why does broker need s3:ListBucket — Lambda should sort
    incoming email into per-actor folders"):
    User is right architecturally. The 500 we soft-warned on in
    /v1/email/inbox is the symptom of the same OIDC bypass — the
    email worker uses instance profile and tries global ListObjects
    without scoping. Architecturally correct flow: SES inbound →
    Lambda sorts to bots/<actor>/inbound/ → email worker reads via
    OIDC-scoped STS creds, never global ListBucket. The fix is the
    same shape as this PR — pass-through STS creds via X-Aws-*
    headers — but is left as a follow-up: this PR ships the
    plumbing + proves OIDC works end-to-end; wiring the email worker
    + Lambda routing is a separate change. Tracked in #90 followups.

* issue #90 codex review: fix downgrade attack + secret redaction

Addresses 2 of 4 codex adversarial findings on commit 913179a:

[P2 — downgrade attack] aws_creds.rs OptionalStsCreds extractor silently
fell back to the broker EC2 instance profile when caller omitted X-Aws-*
headers. A malicious caller could deliberately drop the headers to bypass
the OIDC-scoped IAM session and get broker-wide S3 access.

Fix: `AGENTKEYS_WORKER_REQUIRE_STS=1` env var puts the worker in strict
mode — every request must carry all three X-Aws-* headers or gets HTTP
401. Also: partial header sets (1 or 2 of 3 present) ALWAYS reject with
401 regardless of strict mode — silently dropping half-passed creds is
the same downgrade surface. Default off for backward compat; production
deploys should turn it on.

[P3 — credential leak via Debug] StsCreds previously derived Debug, so
any future tracing::debug! or dbg!() call would log secret_access_key
and session_token verbatim. Custom Debug impl now redacts both and
shows only the access_key_id prefix (which AWS CloudTrails anyway).

New tests:
  - debug_redacts_secret_and_session_token (asserts the Debug output
    doesn't contain the secret bytes; <redacted> marker present)
  - parser_distinguishes_no_headers_from_partial (locks the extractor's
    contract — no headers = backward compat, partial = always reject)

Two codex findings deliberately left as follow-ups, not fixed in this
commit:

[P2 — memory worker OIDC not proven] The harness only mints
agentkeys-vault-role creds, which scope to the vault bucket only. The
memory worker writes to a separate memory bucket which isn't covered.
A dedicated agentkeys-memory-role with the same tag-scoping pattern is
the architecturally correct fix; tracked as PR followup.

[P2 — vault bucket policy allows whole-bucket ListBucket] In
scripts/apply-vault-bucket-policy.sh:109 — pre-existing, separate from
this PR's surface. Adding an s3:prefix=bots/${aws:PrincipalTag/…} condition
to the bucket-policy ListBucket statement closes the cross-actor key-name
enumeration. Filed for the bucket-policy hardening followup.

* issue #90 codex review: close remaining 2 deferred findings

Lands the two findings deferred from commit 18e709b. Both verified live
on Heima Mainnet via the extended harness/v2-stage3-demo.sh (11 steps,
all green).

[P2 — memory worker OIDC scoping] NEW agentkeys-memory-role + dedicated
memory bucket, mirroring the vault data-class layout per arch.md §17.2.
A future memory-worker compromise now cannot reach the credentials
bucket and vice versa.

  scripts/provision-memory-bucket.sh  (NEW) — mirror of provision-vault-bucket.sh
  scripts/provision-memory-role.sh    (NEW) — federated trust + 3-statement
                                              inline policy scoped to
                                              $MEMORY_BUCKET/bots/${PrincipalTag}/memory/*
  scripts/apply-memory-bucket-policy.sh (NEW) — v3 bucket policy

[P2 — bucket-policy ListBucket whole-bucket allow] Was: one statement
listed [Get, Put, Delete, ListBucket] under one Resource[bucket,
bucket/...] with NO s3:prefix condition — any tagged session could
enumerate all keys. Now: SPLIT into two statements:

  VaultListV3 / MemoryListV3 — ListBucket ONLY, on the bucket ARN,
    Condition StringLike s3:prefix = bots/${PrincipalTag}/<class>/*
  VaultObjectsV3 / MemoryObjectsV3 — Get/Put/Delete on the
    prefixed-object ARN, no prefix condition (resource ARN already scopes)

  scripts/apply-vault-bucket-policy.sh  (UPDATED) — v2 → v3 split
  scripts/apply-memory-bucket-policy.sh (NEW)    — v3 split from day one

Demo extended (harness/v2-stage3-demo.sh, STEP_TOTAL 8 → 11):

  step 3:  mint TWO STS sessions (vault role + memory role)
  step 4-5: vault PUT positive (own) + negative (other) — pre-existing
  step 6:  vault LIST negative (other prefix → AccessDenied) — codex P2 verifier
  step 7-8: memory PUT positive (own) + negative (other)
  step 9:  memory LIST negative (other prefix → AccessDenied)
  step 10: cross-role isolation — vault creds → memory bucket → AccessDenied
                                 + memory creds → vault bucket → AccessDenied
  step 11: cleanup

Also: `expect_access_denied` helper distinguishes IAM-rejection
(AccessDenied / HTTP 403) from setup-bug failures (NoCredentialsErr,
NoSuchBucket, InvalidAccessKeyId, TokenRefreshRequired). Naive
`grep AccessDenied` would pass on any failure — codex's exact warning.

operator-workstation.env:
  + MEMORY_BUCKET=agentkeys-memory-${ACCOUNT_ID}
  + MEMORY_ROLE_ARN=arn:aws:iam::${ACCOUNT_ID}:role/agentkeys-memory-role

Live-tested 2026-05-20 on Heima Mainnet:
  - memory bucket created (AssumedArn=…agentkeys-memory-role)
  - vault-bucket policy v2 → v3 swap (2 statements live)
  - memory-bucket policy v3 from scratch (2 statements live)
  - 11/11 demo steps green:
      [4]  vault PUT  own prefix       → SUCCEEDED
      [5]  vault PUT  other prefix     → AccessDenied
      [6]  vault LIST other prefix     → AccessDenied
      [7]  memory PUT own prefix       → SUCCEEDED
      [8]  memory PUT other prefix     → AccessDenied
      [9]  memory LIST other prefix    → AccessDenied
      [10] vault creds → memory bucket → AccessDenied
      [10] memory creds → vault bucket → AccessDenied

* harness: log phase-1 acceptance for PR #92 (3-demo verification)

All three demos (stage-1, stage-2, stage-3) green on Heima Mainnet after
the codex review fixes. Clippy clean on worker-creds + worker-memory.
PR ready to merge.

* stage-3: add worker encrypt/decrypt roundtrip tests (steps 11+12)

User's call-out — "the cred encryption and decryption is not tested".
Stage-3 previously proved IAM scoping at the AWS layer but skipped the
worker's AES-256-GCM envelope, so the actual encrypt→S3→decrypt path
through the HTTP API was unexercised. The envelope.rs primitive has 8
unit tests, but the wire-protocol roundtrip wasn't.

Stage-3 demo extended (STEP_TOTAL 11 → 13):

  [11] Cred worker encrypt/decrypt roundtrip:
       1. mint cred-store cap via POST /v1/cap/cred-store (broker)
       2. POST /v1/cred/store with cap + base64(plaintext)
          → worker KEK-encrypts (AES-256-GCM, AAD-bound to
            operator+actor+service+k3_epoch), S3 PUTs the envelope
       3. mint cred-fetch cap via POST /v1/cap/cred-fetch
       4. POST /v1/cred/fetch with cap
          → worker S3 GETs the envelope, KEK-decrypts, returns plaintext
       5. assert returned plaintext == original (byte-for-byte)
  [12] Memory worker encrypt/decrypt roundtrip:
       same shape against /v1/memory/put + /v1/memory/get. Memory worker
       has no dedicated cap-mint endpoint yet (follow-up); cred-* caps
       work against memory because both workers verify the same broker-
       signed CapToken shape with the same CapOp::Store / CapOp::Fetch.

Graceful skip handling:

  - 'agent scope not set on chain' → skip with 'run stage-1 --webauthn first'
  - 'AGENTKEYS_CHAIN_RPC_HTTP not set' → skip with 'redeploy broker'
  - 'DeviceRoleMissing' → skip with 'out-of-scope here'

These map cleanly to operator-actionable prerequisites; demo continues
green without those steps when prerequisites aren't met, but the
prerequisite is reported, not hidden.

Broker fix: setup-broker-host.sh now bakes AGENTKEYS_CHAIN +
AGENTKEYS_CHAIN_RPC_HTTP into the broker's systemd Environment= lines.
Previously the broker process had no chain RPC, so /v1/cap/cred-{store,
fetch} hit 502 'RPC URL not set' at request time. This was a pre-existing
gap surfaced by exercising the cap-mint path for the first time in this
PR — the broker's stand-alone deploy never hit cap.rs's chain check
before because no demo step minted caps.

* isolation invariants: codify the 4-layer rule + cross-actor test (step 13)

Three changes from user review:

1. NEW stage-3 step 13: NEGATIVE broker cap-mint isolation.
   Try to mint a cap-token with operator_omni != session_omni → expect
   HTTP 4xx with OperatorMismatch. This proves the MOST UPSTREAM
   isolation gate works: actor A's session JWT cannot mint caps for
   actor B. If this ever silently returns 200, every cred + memory
   blob in S3 is compromised — A could mint B's cap, hand to worker,
   worker writes under B's prefix.

   Live-verified on Heima Mainnet 2026-05-20:
     [13] NEGATIVE cap-mint cross-actor → HTTP 403 OperatorMismatch ✓

   Independent of broker redeploy: session-omni check fires BEFORE the
   chain RPC check in handlers/cap.rs, so this gate works on the
   current (stale-RPC) broker too.

2. CLAUDE.md — NEW "Per-actor + per-data-class isolation invariants
   (issue #90)" section codifies the 4-layer defense:

     Layer 1 — broker cap-mint   → session_omni == operator_omni
     Layer 2 — worker chain-verify → independent re-check of layer 1
     Layer 3 — AWS IAM PrincipalTag → s3 resource scoping per-actor
     Layer 4 — bucket separation  → per-data-class IAM roles

   Test-discipline rule: every PR adding a new worker, data class, or
   broker auth method MUST extend the stage-3 demo with negative
   isolation tests for all four layers. Don't ship features with only
   POSITIVE-path coverage.

3. CLAUDE.md — answers "why no /v1/cap/memory-* endpoint" with a
   concrete example: cap-tokens are data-class-agnostic. The same Store
   cap minted for service=openrouter can be POSTed to either
   /v1/cred/store (writes to vault bucket credentials/) or
   /v1/memory/put (writes to memory bucket memory/). The URL picks
   the data class; the cap just authorizes the operation. Adding
   dedicated memory cap endpoints would add audit clarity ("this cap
   was minted intending memory access") but no security boundary —
   isolation comes from the per-data-class IAM roles (layer 4).
   Deferred until payments-worker forces a third data class.

* cap-token: data-class-explicit isolation (no cross-pollution between vault + memory)

User callout — "make it explicit that one cannot pollute other permission."
Before this commit, cap-tokens didn't carry a data-class binding: a
cred-store cap and a memory-put cap were structurally identical. The
URL the cap was POSTed to picked the bucket. Isolation lived only at
the AWS IAM PrincipalTag + per-data-class IAM-role layer. If the IAM
grants were ever accidentally broadened, cross-data-class pollution
would slip through silently.

Now: data_class is a SIGNED FIELD in the cap payload. The cap layer
itself enforces per-data-class isolation, ahead of any AWS call.

Schema change (REQUIRED field, no backward compat — coordinated upgrade):

  enum DataClass { Credentials, Memory }
  struct CapPayload {
    ...
    op: CapOp,
    data_class: DataClass,   // NEW
    ...
  }

Broker (crates/agentkeys-broker-server/src/handlers/cap.rs):
  - Add DataClass enum (mirror of worker's), add to CapPayload
  - mint_cap signature gains data_class param; statically derived per route
  - NEW endpoints: cap_memory_put + cap_memory_get (mint with DataClass::Memory)
  - Existing cap_cred_store + cap_cred_fetch mint with DataClass::Credentials

Broker routes (crates/agentkeys-broker-server/src/lib.rs):
  + .route("/v1/cap/memory-put", post(cap_memory_put))
  + .route("/v1/cap/memory-get", post(cap_memory_get))

Worker side (crates/agentkeys-worker-creds/src/verify.rs):
  - Add DataClass enum + field to CapPayload + DataClassMismatch error
  - NEW pub fn check_data_class(token, expected) — symmetric with check_op
  - Tests: data_class_serializes_snake_case + check_data_class_accepts_match
           + check_data_class_rejects_cross_class

Worker handlers (worker-creds + worker-memory):
  - verify_cap now calls check_data_class with their respective class:
      worker-creds  → DataClass::Credentials
      worker-memory → DataClass::Memory
  - Reject mismatched caps with HTTP 403 cap_data_class_mismatch

Demo extension (harness/v2-stage3-demo.sh, STEP_TOTAL 14 → 16):
  [11] cred encrypt/decrypt roundtrip — now uses /v1/cap/cred-store
  [12] memory encrypt/decrypt roundtrip — now uses /v1/cap/memory-put (NEW endpoint)
  [14] NEW negative test: mint cred-class cap, POST to /v1/memory/put
       → expect HTTP 403 cap_data_class_mismatch
  [15] NEW negative test: mint memory-class cap, POST to /v1/cred/store
       → expect HTTP 403 cap_data_class_mismatch

CLAUDE.md ("Per-actor + per-data-class isolation invariants"):
  Replaced "why no memory cap-mint endpoint" section (now obsolete) with
  "Cap-tokens are data-class-explicit" — explains the 4-endpoint shape,
  shows the concrete reject example, justifies route-per-class over a
  data_class query param (broker can't accidentally mint the wrong
  variant from a typed-route handler).

Tests:
  worker-creds verify::tests — 14/14 (3 new for DataClass)
  broker-server handlers::cap::tests — 24/24 (1 new for data_class serialization)
  cargo build -p worker-creds -p worker-memory -p broker-server — exit 0

Live deploy: requires broker host redeploy via setup-broker-host.sh to
pick up the new mint_cap signature + new memory routes. The stage-3
demo steps 14+15 will skip cleanly until the redeploy lands — the
isolation IS enforced (workers reject cred-class caps), but the new
endpoints don't exist on the current broker yet.

* broker: bake contract addresses into systemd env (closes step-11 502)

After redeploying with the data_class change (commit 690f54c), step 11
of the stage-3 demo surfaced a SECOND broker-side env gap:

  HTTP 502 from /v1/cap/cred-store:
    {"error":"SIDECAR_REGISTRY_ADDRESS_HEIMA unset","reason":"chain_rpc_error"}

The broker's handlers/cap.rs reads three contract addresses at request
time to verify device + scope + k3_epoch on chain:
  - SIDECAR_REGISTRY_ADDRESS_HEIMA
  - SCOPE_CONTRACT_ADDRESS_HEIMA
  - K3_EPOCH_COUNTER_ADDRESS_HEIMA

Before this commit, setup-broker-host.sh baked AGENTKEYS_CHAIN_RPC_HTTP
into the broker systemd unit but NOT the contract addresses. The cap-
mint code path had never been exercised before this PR, so the gap
went unnoticed.

Fix (setup-broker-host.sh): add the three contract addresses to the
broker's Environment= block, pulled from $REGISTRY_ADDR / $SCOPE_ADDR
/ $K3_COUNTER_ADDR (already populated earlier in the script via the
sourced scripts/operator-workstation.env). The operator's
operator-workstation.env stays the single source of truth for contract
addresses across laptop + broker host.

Stage-3 demo also gets a sibling skip-detection (harness/v2-stage3-demo.sh)
so steps 11+12+14+15 cleanly skip with the redeploy-broker message
instead of failing on this specific error shape.

To unblock the stage-3 worker encrypt/decrypt + cross-class-rejection
tests after this commit:
  ssh broker.litentry.org "cd ~/agentKeys && git pull && bash scripts/setup-broker-host.sh --yes"

* broker + worker: parse_device_entry knows the 11-field struct (codex H1 alignment)

Closes user-reported step-11 regression after broker redeploy:

  cap-mint returned HTTP 403 — body: {"error":"device is not active on chain",
  "reason":"device_not_active"}

Same bug class I fixed earlier in scripts/heima-agent-create.sh +
scripts/heima-device-revoke.sh (commit 0981a88). Both the broker's
handlers/cap.rs::parse_device_entry AND the worker's
crates/agentkeys-worker-creds/src/verify.rs::parse_device_entry were
still slicing the OLD 7-word DeviceEntry layout. After codex H1
inserted 4 new fields (k11CredId, k11RpIdHash, k11PubX, k11PubY), the
struct grew to 11 ABI words, but neither parser was updated.

  word 0  operatorOmni    bytes32
  word 1  actorOmni        bytes32
  word 2  k11CredId        bytes32
  word 3  k11RpIdHash      bytes32  (NEW, codex H1)
  word 4  k11PubX          uint256  (NEW)
  word 5  k11PubY          uint256  (NEW)
  word 6  tier             uint8 (padded)
  word 7  roles            uint8 (padded)
  word 8  registeredAt     uint64 (padded)
  word 9  lastSignCount    uint32 (padded)
  word 10 revoked          bool (padded)

Before this commit, both parsers read:
  roles        → word 4 (which is now k11PubX)
  registeredAt → word 5 (which is now k11PubY — always 0 for agents)
  revoked      → word 6 (which is now tier)

For agent devices (k11PubX = k11PubY = 0), registeredAt parsed as 0 →
broker returned DeviceNotActive even though the device WAS active.

Fix: both parsers now read from the correct 11-word offsets + check
hex.len() >= 11 * 64.

Tests updated:
  worker-creds verify::tests::parse_device_entry_decodes_well_formed
    → construct an 11-word raw response (was 7)
  broker handlers::cap::tests::parse_device_entry_decodes_well_formed
    → same
  broker handlers::cap::tests::parse_device_entry_detects_revoked
    → same
  All 4 green.

Live deploy: requires broker host redeploy via setup-broker-host.sh
so the broker picks up the new parse_device_entry. Worker code change
ships with the broker redeploy (same setup-broker-host.sh rebuild).

* stage-3 step 11+12: pass STS creds via X-Aws-* headers (fix s3_put 502)

Step 11 surfaced the codex P2 downgrade-attack defense WORKING AS
INTENDED: cap-mint succeeded, worker AES-encrypted, then S3 PUT
returned 502 "s3_put: service error" because the worker fell back
to the broker EC2 instance profile (which deliberately lacks
s3:PutObject on the vault bucket).

The codex P2 fix in commit 18e709b added OptionalStsCreds + the
AGENTKEYS_WORKER_REQUIRE_STS strict-mode env var. Workers correctly
demand per-request OIDC-minted STS creds. The stage-3 demo's step
11+12 cred_memory_roundtrip helper wasn't passing them.

Fix: stage-3 step 11 (cred roundtrip) now passes vault-role STS creds,
step 12 (memory roundtrip) passes memory-role STS creds, both via the
three X-Aws-* headers the worker's OptionalStsCreds extractor reads:

  -H 'x-aws-access-key-id: $aki'
  -H 'x-aws-secret-access-key: $sak'
  -H 'x-aws-session-token: $sst'

The STS creds were already minted in step 3 (vault + memory sessions
written to $STATE_DIR/{aki,sak,sst}.{vault,memory}); step 11+12 just
read the right file pair based on the kind (cred → vault, memory →
memory) and forward them as headers.

After this commit, steps 11+12 should land green end-to-end:
  broker cap-mint   → 200 (chain checks pass)
  worker cap-verify → 200 (broker_sig + chain re-verify)
  worker S3 PUT     → 200 (using per-actor STS creds, NOT instance profile)
  byte-for-byte roundtrip assertion holds.

* stage-3 step 11+12: mint AGENT-side STS creds (correct principal-tag match)

Step 11 surfaced the second layer of the OIDC isolation chain working
as designed: cap-mint succeeded (broker authorized operator→agent),
worker AES-encrypted, then S3 PUT returned 502 because the STS creds
were minted from the OPERATOR'S session JWT (tagged with operator's
actor_omni) but the cap's actor_omni — and hence the S3 key path —
is the AGENT'S. IAM saw ${PrincipalTag/agentkeys_actor_omni} = 941c…
trying to PUT bots/82a0…/credentials/… and rejected with AccessDenied.

This is the IAM enforcing what the cap-token expresses: "operator
authorized the agent to do this op; the agent must be the one
actually doing it." Both layers must agree on actor_omni.

Fix (stage-3 cred_memory_roundtrip helper):

  1. Read agent_private_key from the demo-agent file
  2. SIWE-sign as the agent against the broker (POST /v1/auth/wallet/start
     with the agent's address, sign with cast wallet sign using
     agent_private_key, POST /v1/auth/wallet/verify → session JWT
     for the agent)
  3. Mint OIDC JWT via /v1/mint-oidc-jwt — this JWT now carries
     sub=agent_omni and PrincipalTag/agentkeys_actor_omni=agent_omni
  4. AssumeRoleWithWebIdentity against the right data-class role
     (VAULT_ROLE_ARN for cred, MEMORY_ROLE_ARN for memory) — STS
     creds now tagged with the agent's actor_omni
  5. Forward these creds via X-Aws-* headers to the worker

Now the worker's S3 PUT against bots/<agent>/credentials/… uses STS
creds with PrincipalTag=agent_omni → IAM allows.

The architectural lesson, recorded in the commit because it'll bite
again: when a cap-token authorizes actor A's action and the worker
uses STS creds to touch S3, the STS creds MUST be minted using A's
identity — operator's authorization (cap-token) + actor's identity
(STS creds) jointly satisfy the workflow. Per arch.md §17.2 layer 3,
the IAM PrincipalTag is bound to the JWT subject, NOT to whoever the
JWT-issuer (operator) chose to authorize.

* stage-3: tighten pass/fail per codex adversarial review (3 findings)

Codex round-2 review flagged the demo as 'needs-attention' — it could
report 16/16 green while silently skipping the actual encrypt/decrypt
+ cross-class assertions. Three findings, all addressed:

[high] Worker roundtrip checks could be skipped + still claim coverage
  cred_memory_roundtrip used `skip ...; return 0` on five prereq-missing
  paths (no agent file, no scope, broker missing chain RPC, broker
  missing contract addresses, DeviceRoleMissing). Final summary still
  claimed AES-256-GCM byte-for-byte coverage as if the path had run.
  Fix: introduce STRICT default + `--allow-skip` opt-in. All five
  prereq paths now call prereq_missing(), which:
    - in strict mode: prints fail + records 'fail' outcome + returns non-zero
    - in --allow-skip mode: prints skip + records 'skip' outcome (dev iter)
  Final summary now prints actual per-step outcomes from STEP_OUTCOMES[],
  and exits non-zero if any step failed (or any step skipped in strict).

[high] Negative cap-class tests (steps 14, 15) accepted ANY non-200
  Previously: cred-class cap → memory worker with non-200 + non-canonical
  error was accepted ('non-200 = pass for negative test'). A down worker,
  wrong URL, 404 route, auth middleware failure, or malformed request
  would all silently satisfy the demo without proving check_data_class
  fired. Fix: require HTTP 400/401/403 AND the canonical
  cap_data_class_mismatch error string. Any other response = die.

[medium] Cross-actor cap-mint test (step 13) accepted generic rejection
  Previously: any 4xx accepted, even when error text was non-canonical;
  502 (broker stale) silently skipped, hiding a real config issue.
  Fix: require HTTP 400/401/403 with canonical OperatorMismatch.
  502 with config-missing body now dies (forces redeploy), not skip.
  Other 502/non-canonical errors = die (negative tests can't pass on
  an unrelated failure).

Plus: positive steps (4, 7, 11+12 happy paths) now call record_ok so
the summary lists EVERY step that actually proved its assertion. The
expect_access_denied helper records too. The summary table is built
from actual execution, not a static claim of coverage.

The structural change here is: skips and infrastructure failures both
become demo failures unless the operator explicitly opts in. CI runs
default-strict. Dev iteration uses --allow-skip when bringing up a
partial environment.

* stage-3 summary: fix `local` outside function + handle cleanup-only invocation

Two small bugs in the strict-mode summary added by c55ea29:

1. Used `local` inside the `if should_run_step 16` block (not a function
   body), so bash printed:
     harness/v2-stage3-demo.sh: line 864: local: can only be used in a function
   AFTER the per-step outcome table tried to render. The 16 steps all
   ran correctly + the demo exited 0, but the summary table itself never
   printed. Fix: drop the `local` keyword and just use plain vars.

2. "DEMO COMPLETE" header would print even when no steps had been
   recorded (e.g. `--from-step 16` to test the summary block in
   isolation). Now distinguishes:
     - all green (nok>0, nskip=0, nfail=0) → DEMO COMPLETE
     - some skipped (--allow-skip) → DEMO PARTIAL
     - any failure → DEMO FAILED + exit 1
     - no steps run at all → NO STEPS EXERCISED + advisory

* harness: log codex round-2 fix + 13/13 stage-3 strict-mode verification

* stage-3 codex round-3: close skip-bypass in steps 14+15 (cross-class)

Codex round-3 review caught a regression I missed in c55ea29:

  [high] Strict demo still skips cross-class isolation checks without
         recording failure (steps 14 + 15)

Previously fixed cred_memory_roundtrip's prereq paths to use
prereq_missing (so strict mode fails-hard), but left steps 14 + 15
calling bare `skip` for the same prereq classes:

  - missing demo-agent file
  - 'not.*scope' (chain scope not set)
  - 'RPC URL not set' (broker stale)
  - 'SIDECAR_REGISTRY_ADDRESS_HEIMA unset' (broker missing contract addrs)

Because those skips didn't append to STEP_OUTCOMES, a full run could
report 'DEMO COMPLETE' with nskip=0 even when neither cross-data-class
isolation gate had been exercised. That's the same false-success
failure mode codex round-2 flagged, just in a different code path —
exactly the kind of regression strict-mode tracking is meant to catch.

Fix: extracted the entire step 14/15 body into a cross_class_rejection()
helper function. All prereq paths now route through prereq_missing
(matching cred_memory_roundtrip's pattern), so:

  - strict mode (default): unmet prereqs → die + STEP_OUTCOMES records 'fail'
  - --allow-skip mode:     unmet prereqs → skip + STEP_OUTCOMES records 'skip'
  - successful negative test → STEP_OUTCOMES records 'ok'

Step 14:
  cross_class_rejection cred-store /v1/memory/put memory cred cred-to-mem
Step 15:
  cross_class_rejection memory-put /v1/cred/store cred memory mem-to-cred

Live-verified on Heima Mainnet (2026-05-20): all 13 STEP_OUTCOMES
recorded, DEMO COMPLETE, exit 0. Steps 14+15 still pass with canonical
403 cap_data_class_mismatch error confirmation (no change to the
positive-path assertion logic — only the skip paths got tightened).

* stage-3 codex round-4: cross-class test sends X-Aws-* headers (strict-mode correct)

Codex round-4 finding (high):

  Cross-class negative test omits required STS headers, so strict
  workers reject before the data-class guard.

The axum extractor order is: OptionalStsCreds → Json<Req> → handler
body (verify_cap). With AGENTKEYS_WORKER_REQUIRE_STS=1 — the
production deployment setting documented in aws_creds.rs — the
extractor rejects header-less requests with HTTP 401 BEFORE verify_cap
runs. The cross-class data-class guard inside verify_cap never fires.

Today the live test passes because the broker host workers don't have
AGENTKEYS_WORKER_REQUIRE_STS=1 set. So we're proving the data-class
guard against dev-config workers but NOT against the prod target.
That's exactly the 'demo says complete, prod silently broken' failure
mode the codex review pipeline keeps catching.

Fix: cross_class_rejection() now:

  1. Mints agent-side STS creds for the TARGET worker's role:
       step 14 (memory worker target) → memory-role STS
       step 15 (cred worker target)   → vault-role STS
  2. Passes all three X-Aws-* headers in the POST to the worker.

Worker request order now:
  a. OptionalStsCreds extractor: valid headers present → Some(creds) → OK
     (passes regardless of AGENTKEYS_WORKER_REQUIRE_STS=1 setting)
  b. verify_cap:
       check_op (Store) → OK
       check_data_class (cap.data_class != worker's class) → REJECT
       → HTTP 403 cap_data_class_mismatch
  c. S3 op never runs (verify_cap returned error)

The data-class guard provably fires now, in BOTH strict and non-strict
worker configurations. Codex's concern was correct.

Refactored mint_agent_sts_for_role() as a shared helper so cross_class
test reuses the same SIWE+OIDC+STS flow as cred_memory_roundtrip. Same
auth chain, same trust boundary, same code path — no inconsistency
between positive (cred_memory_roundtrip) and negative (cross_class)
tests.

Live-verified 2026-05-20 on Heima Mainnet: 13 STEP_OUTCOMES recorded,
all ok, DEMO COMPLETE. Steps 14+15 still return canonical
403 cap_data_class_mismatch with the STS headers correctly passed
through — confirming the data-class guard fires AFTER extractor
authentication passes.

* arch.md: document cap-token data_class binding + 4-layer isolation invariants (§17.5)

Codifies the issue #90 outcomes into the canonical architecture spec
(per CLAUDE.md "arch.md as source of truth" rule):

§15.1 + §15.2 — credentials-service + memory-service: added the OIDC
federation paragraph. X-Aws-* header passthrough is the production
auth surface (codex P2 downgrade fix); strict mode forces it via
AGENTKEYS_WORKER_REQUIRE_STS=1. Cross-links to §17.5.

§17.5 (NEW) — Per-data-class cap-token binding:
  - Cap-token's data_class field + the 4 broker endpoints
  - 4-layer defense-in-depth table (broker cap-mint, worker chain-
    verify, AWS IAM PrincipalTag, per-data-class buckets)
  - Each layer's canonical test in harness/v2-stage3-demo.sh
  - Test-discipline rule: new data classes MUST add negative isolation
    tests across all 4 layers
  - Two design rationales spelled out:
      a) Why route-per-class beats a single endpoint with a data_class
         query-param (eliminates user-input attack surface)
      b) Why agent-side STS creds are mandatory (PrincipalTag must match
         the cap's actor_omni; operator-side STS won't satisfy IAM)

Plus the trailing Cargo.lock entry from aws-…
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

v2 stage 1 — Foundation: sovereign sidecar + on-chain identity + credentials-service worker

2 participants