Phase 3: production hardening + cross-cloud parity (follow-up to #79)

## Context

Follow-up to #79 (Phase 2: cloud-native auth providers). Through the
real-AWS testing and the cross-cloud usage discussion on #79, five
concrete gaps surfaced that should ship as Phase 3 (`v0.12.0`).

**Original Phase 3 plan** (per `PHASE2_CLOUD_NATIVE_PROVIDERS.md` §9.6)
was an **Okta vendor provider** — a vendor-specific composition layer
on top of the Phase 1 `oidc` provider, covering Okta's non-standard
group endpoint and app-mode token quirks. That work isn't urgent: the
generic `oidc` provider already handles Okta-as-OIDC for basic JWT
flow; the vendor extensions only matter once a customer signal
materializes.

**Revised Phase 3** prioritizes the five items below — driven by what
production usage and multi-cloud customer onboarding actually need.
**Originally-planned Okta work moves to Phase 3.5 / `v0.13.0`** to be
revisited when an Okta-specific customer signal appears.

---

## Scope (5 items)

### 1. `oidc.allowed_subjects` — per-identity allowlist

`forge-core/auth/providers/oidc/`

- Add `AllowedSubjects []string` to `oidc.Config`
- Shell-style glob matching against the JWT `sub` claim (and `email`
  claim as a fallback when present)
- Empty list = current behavior (admit any token from issuer with
  matching audience)
- Validated at Factory time via `path.Match(pattern, "")`, which
  returns `path.ErrBadPattern` for a malformed glob
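
The bullets above could be sketched roughly as follows. This is a minimal illustration, not the planned implementation: `SubjectAllowlist`, `NewSubjectAllowlist`, and `Allows` are hypothetical names, and the real `oidc.Config` wiring may look different. It assumes Go 1.16 or later, where `path.Match` reports malformed patterns even when the name does not match.

```go
package main

import (
	"fmt"
	"path"
)

// SubjectAllowlist is a hypothetical helper holding validated glob patterns.
type SubjectAllowlist struct {
	patterns []string
}

// NewSubjectAllowlist validates every pattern up front, the way a Factory
// would, so a typo in forge.yaml fails at startup instead of at request time.
func NewSubjectAllowlist(patterns []string) (*SubjectAllowlist, error) {
	for _, p := range patterns {
		// Matching against the empty name is only a syntax check.
		if _, err := path.Match(p, ""); err != nil {
			return nil, fmt.Errorf("allowed_subjects: invalid pattern %q: %w", p, err)
		}
	}
	return &SubjectAllowlist{patterns: patterns}, nil
}

// Allows reports whether a token is admitted. An empty list admits every
// token (current behavior). The sub claim is tried first, then email.
func (a *SubjectAllowlist) Allows(sub, email string) bool {
	if len(a.patterns) == 0 {
		return true
	}
	for _, candidate := range []string{sub, email} {
		if candidate == "" {
			continue // claim absent from the token
		}
		for _, p := range a.patterns {
			if ok, _ := path.Match(p, candidate); ok {
				return true
			}
		}
	}
	return false
}

func main() {
	al, err := NewSubjectAllowlist([]string{"*@agent-fleet-prod.iam.gserviceaccount.com"})
	if err != nil {
		panic(err)
	}
	// The sub value below is made up for illustration.
	fmt.Println(al.Allows("1234567890", "orders@agent-fleet-prod.iam.gserviceaccount.com")) // true
	fmt.Println(al.Allows("1234567890", "orders@other-project.iam.gserviceaccount.com"))    // false
}
```

One property of `path.Match` worth keeping in mind: `*` does not match `/`, so patterns meant to cover subjects containing slashes need an explicit segment per slash.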

**Why:** The generic `oidc` provider currently admits any token from
the configured issuer with a matching audience. For Google Cloud
service-account flows, the issuer is `https://accounts.google.com`
(globally shared), so without per-subject filtering, any Google
service account in the world with the right `aud` claim is admitted.
This blocks per-identity GCP customer onboarding.

**Recipe after fix:**
```yaml
auth:
  providers:
    - type: oidc
      settings:
        issuer:   https://accounts.google.com
        audience: api://forge
        allowed_subjects:
          - "*@agent-fleet-prod.iam.gserviceaccount.com"
```

**Effort:** ~1 day (~80 LOC source + ~120 LOC tests).

---

### 2. `azure_ad.allowed_apps` — per-identity allowlist

`forge-core/auth/providers/azure_ad/`

- Add `AllowedApps []string` to `azure_ad.Config`
- Glob matching against the JWT `appid` claim (preferred) or `oid`
  claim (fallback for legacy tokens)
- Empty list = current behavior (admit any token matching `tid` +
  `audience`)
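
A rough sketch of the claim selection described above, under stated assumptions: `Claims`, `callerID`, and `AppAllowed` are hypothetical names, and the sketch treats `oid` strictly as a fallback (used only when `appid` is absent) rather than matching either claim. The recipe's `agent-orders` style values are reused as opaque strings; real `appid` values are typically GUIDs.

```go
package main

import (
	"fmt"
	"path"
)

// Claims holds the two JWT claims this sketch inspects.
type Claims struct {
	AppID string // "appid" claim (preferred)
	OID   string // "oid" claim (fallback)
}

// callerID picks the identity to match: appid when present, otherwise oid.
func callerID(c Claims) string {
	if c.AppID != "" {
		return c.AppID
	}
	return c.OID
}

// AppAllowed reports whether the caller passes the allowlist. An empty list
// keeps current behavior and admits any token that already passed the
// tenant and audience checks.
func AppAllowed(allowed []string, c Claims) bool {
	if len(allowed) == 0 {
		return true
	}
	id := callerID(c)
	if id == "" {
		return false // neither claim present: nothing to match against
	}
	for _, p := range allowed {
		if ok, err := path.Match(p, id); err == nil && ok {
			return true
		}
	}
	return false
}

func main() {
	allowed := []string{"agent-orders", "agent-fraud"}
	fmt.Println(AppAllowed(allowed, Claims{AppID: "agent-orders"}))                    // true
	fmt.Println(AppAllowed(allowed, Claims{AppID: "agent-billing", OID: "agent-fraud"})) // false
	fmt.Println(AppAllowed(allowed, Claims{OID: "agent-fraud"}))                       // true
}
```

Using `oid` only when `appid` is missing means a token whose `appid` is not on the list cannot slip through on its `oid`; whether that is the desired semantics is a design decision for the implementation.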

**Why:** Azure tenant gating works today, but only at tenant
granularity. A customer with many managed identities in a single Entra
tenant has no way to say "only `agent-orders` and `agent-fraud` may
call me, not every workload in the tenant." This is the Azure
equivalent of `aws_sigv4`'s `allowed_principals`.

**Recipe after fix:**
```yaml
auth:
  providers:
    - type: azure_ad
      settings:
        tenant_id: <T>
        audience:  api://forge
        allowed_apps:
          - "agent-orders"
          - "agent-fraud"
```

**Effort:** ~1 day (~80 LOC source + ~120 LOC tests).

---

### 3. Hot-reload of auth chain

`forge-cli/runtime/`

- `forge.yaml` hot-reload currently picks up content changes for most
  fields, but the auth chain is constructed once at startup. Editing
  `auth.providers` (incl. `allowed_principals`, `allowed_accounts`,
  the soon-to-land `allowed_subjects`/`allowed_apps`) requires a hard
  `Ctrl-C` + restart for the change to take effect.
- Rebuild and atomically swap the chain on `auth:` block changes.
- Don't drain in-flight requests — they use the chain captured at
  request entry; new requests use the new chain.
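
The swap semantics above can be sketched with `sync/atomic`. This is an illustration under assumptions, not the runtime's actual design: `Chain` and `Holder` are hypothetical stand-ins for the real auth chain and whatever owns it, and `atomic.Pointer` requires Go 1.19 or later.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Chain is a stand-in for the real auth chain type.
type Chain struct {
	AllowedPrincipals []string
}

// Holder publishes the current chain. Loads and stores are lock-free, so
// request handling never blocks on a reload.
type Holder struct {
	current atomic.Pointer[Chain]
}

// NewHolder installs the chain built at startup.
func NewHolder(initial *Chain) *Holder {
	h := &Holder{}
	h.current.Store(initial)
	return h
}

// Swap installs a freshly built chain. The caller should only swap after the
// new chain was constructed without error, so a bad edit to forge.yaml never
// replaces a working chain.
func (h *Holder) Swap(next *Chain) {
	h.current.Store(next)
}

// Acquire is called once at request entry. The request keeps using the
// returned chain even if a swap happens while it is in flight.
func (h *Holder) Acquire() *Chain {
	return h.current.Load()
}

func main() {
	h := NewHolder(&Chain{AllowedPrincipals: []string{"role/old"}})

	inFlight := h.Acquire() // request A enters before the reload
	h.Swap(&Chain{AllowedPrincipals: []string{"role/new"}})

	fmt.Println(inFlight.AllowedPrincipals[0])    // role/old
	fmt.Println(h.Acquire().AllowedPrincipals[0]) // role/new
}
```

Chains are treated as immutable once published: a reload builds a new value and swaps the pointer instead of mutating the chain that in-flight requests still hold.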

**Why:** Surfaced during PR #79 real-AWS testing: when an operator
tightens `allowed_principals` to revoke a compromised credential, the
revocation doesn't take effect until restart. Deferring this was
acceptable for v0.11.0 (the limitation is documented in the CHANGELOG),
but v0.12.0 should fix it because security-relevant config edits
shouldn't require a restart.

**Effort:** ~1 day (~60 LOC source + ~80 LOC tests).

---

### 4. AWS Org-via-trust-policy onboarding recipe

`docs/auth/recipes/aws-org-wide.md` (new file, once `docs/` is no
longer gitignored or an external doc-site path is decided)

- Worked example of the AWS IAM Identity Center (SSO) recipe
- Walks through four steps: the customer creates an entry IAM role in
  one account; its trust policy uses the `aws:PrincipalOrgID`
  condition; every user in their AWS Org can then `sts:AssumeRole` it
  transparently via SSO; Forge's allowlist contains just that one
  assumed-role ARN
- Pairs with `aws_sigv4`'s existing `allowed_principals` + the new
  `allowed_accounts` shortcut
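
To make the trust-policy step concrete, the sketch below renders the kind of policy document the recipe would show. It is illustrative only: the Go type and function names are hypothetical, and `o-exampleorgid` is a placeholder organization ID, not a real value. The policy shape (wildcard principal narrowed by an `aws:PrincipalOrgID` condition) follows the standard IAM pattern for organization-wide access.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// statement and policy model just enough of the IAM policy grammar for
// this example.
type statement struct {
	Effect    string                       `json:"Effect"`
	Principal map[string]string            `json:"Principal"`
	Action    string                       `json:"Action"`
	Condition map[string]map[string]string `json:"Condition"`
}

type policy struct {
	Version   string      `json:"Version"`
	Statement []statement `json:"Statement"`
}

// BuildOrgTrustPolicy returns a trust policy that lets any principal in the
// given AWS Organization assume the role the policy is attached to.
func BuildOrgTrustPolicy(orgID string) policy {
	return policy{
		Version: "2012-10-17",
		Statement: []statement{{
			Effect:    "Allow",
			Principal: map[string]string{"AWS": "*"},
			Action:    "sts:AssumeRole",
			Condition: map[string]map[string]string{
				// Without this condition the wildcard principal would
				// admit callers from any AWS account.
				"StringEquals": {"aws:PrincipalOrgID": orgID},
			},
		}},
	}
}

// Render serializes the policy as indented JSON, ready to paste into IAM.
func Render(p policy) (string, error) {
	b, err := json.MarshalIndent(p, "", "  ")
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	out, err := Render(BuildOrgTrustPolicy("o-exampleorgid")) // placeholder org ID
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

On the Forge side nothing changes: `allowed_principals` lists the single assumed-role ARN for the entry role, and the organization-wide admission lives entirely in IAM.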

**Why:** Pattern came up during the multi-cloud planning discussion.
Customers asked "how do I admit anyone in my AWS Org without
enumerating accounts?" The answer is the IAM trust-policy condition,
NOT a Forge-side feature — but the recipe needs to be documented
prominently so operators don't try to build it the hard way.

**Effort:** ~0.5 day (docs only).

---

### 5. Audit-log dashboard guide

`docs/auth/recipes/audit-log-mesh-observability.md` (new file)

- How to grep `user_id` across Forge audit logs to answer:
  - Which agents called which agents (mesh map)
  - What reason codes are spiking (incident response)
  - Which IAM principals are getting rejected (allowlist mis-config)
- Sample queries for CloudWatch Insights, Datadog, jq, Loki
- Pairs with the per-agent IAM role convention from the fleet-account
  pattern
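
As a language-neutral reference for the sample queries, here is a small Go sketch of the "which principals are getting rejected" aggregation. It rests on assumptions that the guide itself would need to confirm: that audit entries are one JSON object per line, and that a rejection carries a non-empty `reason` field. Only `user_id` comes from the audit-log shape discussed on #79; `reason`, the sample ARNs, and the reason values are hypothetical.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// entry models only the fields this sketch needs.
// "reason" is an ASSUMED field name for the rejection reason code.
type entry struct {
	UserID string `json:"user_id"`
	Reason string `json:"reason"`
}

// RejectionsByCaller counts entries with a non-empty reason, keyed by
// user_id. Lines that are not valid JSON are skipped, not treated as fatal.
func RejectionsByCaller(r io.Reader) (map[string]int, error) {
	counts := make(map[string]int)
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		line := sc.Bytes()
		if len(line) == 0 {
			continue
		}
		var e entry
		if err := json.Unmarshal(line, &e); err != nil {
			continue
		}
		if e.UserID != "" && e.Reason != "" {
			counts[e.UserID]++
		}
	}
	return counts, sc.Err()
}

// CountFromString is a convenience wrapper for examples and tests.
func CountFromString(log string) (map[string]int, error) {
	return RejectionsByCaller(strings.NewReader(log))
}

func main() {
	// Hypothetical log lines; real entries will carry more fields.
	sample := `{"user_id":"arn:aws:sts::111122223333:assumed-role/agent-orders/s1","reason":"principal_not_allowed"}
{"user_id":"arn:aws:sts::111122223333:assumed-role/agent-fraud/s1","reason":""}
{"user_id":"arn:aws:sts::111122223333:assumed-role/agent-orders/s1","reason":"principal_not_allowed"}`

	counts, err := CountFromString(sample)
	if err != nil {
		panic(err)
	}
	for caller, n := range counts {
		fmt.Printf("%d rejections for %s\n", n, caller)
	}
}
```

The same grouping (filter on rejection, group by `user_id`, count) is what the CloudWatch Insights, Datadog, jq, and Loki queries in the guide would express in their own syntax.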

**Why:** Customer-facing operations doc. Phase 1 review #4 + #6
pinned the audit-log shape; PR #79 confirmed via live test that
`user_id` carries the canonical caller ARN. Operators need a guide
to actually use this for mesh observability — without it they can't
debug "why is agent-X failing to call agent-Y" or "what credential
was leaked."

**Effort:** ~0.5 day (docs only).

---

## Total estimate

**~4 days of focused work.** Two ~1-day code drops + one ~1-day
runtime change + two ~0.5-day doc pages.

## Out of scope for Phase 3 (revised)

- **Okta vendor provider** → moves to Phase 3.5 / v0.13.0 (revisit on
  customer signal)
- **Authorization layer (RBAC/ABAC)** → Phase 4
- **mTLS client auth** → Phase 4
- **SAML 2.0** → Phase 5+
- **OBO / token exchange (RFC 8693)** → Phase 5+ (pulls earlier if
  end-user attribution becomes a hard customer requirement)

## Acceptance criteria

- [ ] `allowed_subjects` works against Google ID tokens; pinned by integration test using a fake JWKS
- [ ] `allowed_apps` works against AAD JWT `appid` claim; pinned by integration test using a fake AAD
- [ ] Hot-reload changes auth chain without dropping in-flight requests; integration test boots Forge, edits forge.yaml, confirms new allowlist takes effect within ~1 second
- [ ] AWS-Org-wide recipe doc reviewed by someone unfamiliar with the design (catches jargon)
- [ ] Audit-log doc has sample queries that actually run against `~/.forge/<agent>/.forge/*.log`
- [ ] All Phase 1 + Phase 2 tests pass unchanged (regression)
- [ ] `golangci-lint` 0 issues, `gofmt` clean
- [ ] Target release tag: `v0.12.0`

cc @ — anyone scheduled for Phase 3 review.

