Context
Follow-up to #79 (Phase 2: cloud-native auth providers). Through the
real-AWS testing and the cross-cloud usage discussion on #79, five
concrete gaps surfaced that should ship as Phase 3 (v0.12.0).
The original Phase 3 plan (per PHASE2_CLOUD_NATIVE_PROVIDERS.md §9.6)
was an Okta vendor provider: a vendor-specific composition layer
on top of the Phase 1 oidc provider, covering Okta's non-standard
group endpoint and app-mode token quirks. That work isn't urgent: the
generic oidc provider already handles Okta-as-OIDC for the basic JWT
flow, and the vendor extensions only matter once a customer signal
materializes.
Revised Phase 3 prioritizes the five items below — driven by what
production usage and multi-cloud customer onboarding actually need.
Originally-planned Okta work moves to Phase 3.5 / v0.13.0 to be
revisited when an Okta-specific customer signal appears.
Scope (5 items)
1. oidc.allowed_subjects — per-identity allowlist
forge-core/auth/providers/oidc/
- Add AllowedSubjects []string to oidc.Config
- Shell-style glob matching against the JWT sub claim (and email
claim as a fallback when present)
- Empty list = current behavior (admit any token from issuer with
matching audience)
- Validated at Factory time via path.Match(pattern, "") (pattern is
the first argument; a malformed glob returns path.ErrBadPattern)
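A minimal sketch of the matching and Factory-time validation described in the bullets above. The function names are hypothetical, and "email as a fallback" is interpreted here as "also tried when sub does not match"; the real provider may wire this differently.

```go
package main

import (
	"fmt"
	"path"
)

// validatePatterns (hypothetical helper) rejects malformed globs up front.
// path.Match takes the pattern first; matching against an empty name still
// surfaces path.ErrBadPattern for syntactically invalid patterns.
func validatePatterns(patterns []string) error {
	for _, p := range patterns {
		if _, err := path.Match(p, ""); err != nil {
			return fmt.Errorf("allowed_subjects: bad pattern %q: %w", p, err)
		}
	}
	return nil
}

// subjectAllowed (hypothetical helper) reports whether sub, or email as a
// fallback, matches any pattern. An empty allowlist keeps the current
// admit-all behavior.
func subjectAllowed(patterns []string, sub, email string) bool {
	if len(patterns) == 0 {
		return true
	}
	for _, p := range patterns {
		for _, candidate := range []string{sub, email} {
			if candidate == "" {
				continue
			}
			// Patterns are assumed pre-validated, so the error is ignored.
			if ok, _ := path.Match(p, candidate); ok {
				return true
			}
		}
	}
	return false
}

func main() {
	allow := []string{"*@agent-fleet-prod.iam.gserviceaccount.com"}
	if err := validatePatterns(allow); err != nil {
		panic(err)
	}
	fmt.Println(subjectAllowed(allow, "1234567890", "orders@agent-fleet-prod.iam.gserviceaccount.com")) // true
	fmt.Println(subjectAllowed(allow, "1234567890", "orders@other-project.iam.gserviceaccount.com"))    // false
	fmt.Println(validatePatterns([]string{"[unclosed"}) != nil)                                         // true
}
```

One design point to keep in mind: path.Match's * does not match "/", so patterns for subjects that contain slashes would need one * per path segment.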
Why: The generic oidc provider currently admits any token from
the configured issuer with a matching audience. For Google Cloud
service-account flows, the issuer is https://accounts.google.com
(globally shared), so without per-subject filtering, any Google
service account in the world with the right aud claim is admitted.
This blocks per-identity GCP customer onboarding.
Recipe after fix:
auth:
  providers:
    - type: oidc
      settings:
        issuer: https://accounts.google.com
        audience: api://forge
        allowed_subjects:
          - "*@agent-fleet-prod.iam.gserviceaccount.com"
Effort: ~1 day (~80 LOC source + ~120 LOC tests).
2. azure_ad.allowed_apps — per-identity allowlist
forge-core/auth/providers/azure_ad/
- Add AllowedApps []string to azure_ad.Config
- Glob matching against the JWT appid claim (preferred) or oid
claim (fallback for legacy tokens)
- Empty list = current behavior (admit any token matching tid +
audience)
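The claim selection above could look roughly like this. It is a sketch only: the helper names and the map-based claims type are assumptions, not the provider's actual types.

```go
package main

import (
	"fmt"
	"path"
)

// appIdentity (hypothetical helper) picks the claim to match against:
// appid when present, otherwise oid.
func appIdentity(claims map[string]any) (string, bool) {
	for _, key := range []string{"appid", "oid"} {
		if v, ok := claims[key].(string); ok && v != "" {
			return v, true
		}
	}
	return "", false
}

// appAllowed (hypothetical helper) applies the allowlist. An empty list
// keeps today's tenant-plus-audience behavior; a token carrying neither
// claim is rejected once an allowlist is configured.
func appAllowed(allowed []string, claims map[string]any) bool {
	if len(allowed) == 0 {
		return true
	}
	id, ok := appIdentity(claims)
	if !ok {
		return false
	}
	for _, p := range allowed {
		if m, _ := path.Match(p, id); m {
			return true
		}
	}
	return false
}

func main() {
	// Placeholder IDs, not real applications.
	allowed := []string{"11111111-1111-1111-1111-111111111111"}
	fmt.Println(appAllowed(allowed, map[string]any{"appid": "11111111-1111-1111-1111-111111111111"}))
	fmt.Println(appAllowed(allowed, map[string]any{"oid": "22222222-2222-2222-2222-222222222222"}))
}
```

In Entra-issued tokens the appid claim normally carries the application (client) ID as a GUID, so it is worth confirming whether allowlist entries should be IDs rather than display names before finalizing the config shape.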
Why: Azure tenant gating works today, but at tenant granularity
only. A customer with many managed identities in a single Entra
tenant cannot today say "only agent-orders and agent-fraud may
call me; not every workload in the tenant." This is the equivalent
of aws_sigv4's allowed_principals for Azure.
Recipe after fix:
auth:
  providers:
    - type: azure_ad
      settings:
        tenant_id: <T>
        audience: api://forge
        allowed_apps:
          - "agent-orders"
          - "agent-fraud"
Effort: ~1 day (~80 LOC source + ~120 LOC tests).
3. Hot-reload of auth chain
forge-cli/runtime/
forge.yaml hot-reload currently picks up content changes for most
fields, but the auth chain is constructed once at startup. Editing
auth.providers (incl. allowed_principals, allowed_accounts,
the soon-to-land allowed_subjects/allowed_apps) requires a hard
Ctrl-C + restart for the change to take effect.
- Rebuild and atomically swap the chain on auth: block changes.
- Don't drain in-flight requests: they use the chain captured at
request entry; new requests use the new chain.
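One way the swap could be structured, as a sketch: Chain and chainHolder are stand-in names, not the runtime's actual types. It relies on sync/atomic's generic Pointer type, which requires Go 1.19 or later.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Chain is a stand-in for the real auth chain type (hypothetical).
type Chain struct {
	AllowedPrincipals []string
}

// chainHolder lets the config watcher publish a rebuilt chain while
// request handlers keep whichever chain they loaded at request entry.
type chainHolder struct {
	current atomic.Pointer[Chain]
}

// Load returns the chain that is current at the moment of the call.
func (h *chainHolder) Load() *Chain { return h.current.Load() }

// Swap publishes a fully built chain; readers never observe a partial one.
func (h *chainHolder) Swap(next *Chain) { h.current.Store(next) }

func main() {
	h := &chainHolder{}
	h.Swap(&Chain{AllowedPrincipals: []string{"arn:a", "arn:b"}})

	inFlight := h.Load() // captured at request entry

	// Operator tightens the allowlist; the reload path swaps the chain.
	h.Swap(&Chain{AllowedPrincipals: []string{"arn:a"}})

	fmt.Println(len(inFlight.AllowedPrincipals), len(h.Load().AllowedPrincipals)) // 2 1
}
```

Because the new chain is built completely before the pointer is stored, a failed rebuild can simply leave the old chain in place.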
Why: Surfaced during PR #79 real-AWS testing: when an operator
tightens allowed_principals to revoke a compromised credential, the
revocation doesn't take effect until restart. Deferring this was
reasonable for v0.11.0 (documented in CHANGELOG), but it is worth
fixing in v0.12.0 because security-relevant config edits shouldn't
require a restart.
Effort: ~1 day (~60 LOC source + ~80 LOC tests).
4. AWS Org-via-trust-policy onboarding recipe
docs/auth/recipes/aws-org-wide.md (new file, once docs are
gitignore-restored or external doc-site path is decided)
- Worked example of the AWS IAM Identity Center / SSO recipe
- Walks through the flow: the customer creates an entry IAM role in
one account; its trust policy uses the aws:PrincipalOrgID condition;
every user in their AWS Org can sts:AssumeRole it transparently via
SSO; Forge's allowlist contains just that one assumed-role ARN
- Pairs with aws_sigv4's existing allowed_principals + the new
allowed_accounts shortcut
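For reference, the trust policy at the center of this recipe can be sketched as follows. It is rendered from Go only to keep the org ID parameter explicit; the type names and the o-exampleorgid value are placeholders, and the recipe itself would presumably show the JSON directly.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// statement and policy model just enough of an IAM policy document
// for this example (hypothetical types).
type statement struct {
	Effect    string                       `json:"Effect"`
	Principal map[string]string            `json:"Principal"`
	Action    string                       `json:"Action"`
	Condition map[string]map[string]string `json:"Condition"`
}

type policy struct {
	Version   string      `json:"Version"`
	Statement []statement `json:"Statement"`
}

// orgTrustPolicy builds a role trust policy that lets any principal in
// the given AWS Organization assume the entry role.
func orgTrustPolicy(orgID string) policy {
	return policy{
		Version: "2012-10-17",
		Statement: []statement{{
			Effect:    "Allow",
			Principal: map[string]string{"AWS": "*"},
			Action:    "sts:AssumeRole",
			Condition: map[string]map[string]string{
				"StringEquals": {"aws:PrincipalOrgID": orgID},
			},
		}},
	}
}

// render pretty-prints the policy as JSON.
func render(p policy) ([]byte, error) {
	return json.MarshalIndent(p, "", "  ")
}

func main() {
	out, err := render(orgTrustPolicy("o-exampleorgid")) // placeholder org ID
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

The wildcard principal is only safe because of the aws:PrincipalOrgID condition; on the Forge side, allowed_principals then lists just the assumed-role ARN of this one entry role.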
Why: Pattern came up during the multi-cloud planning discussion.
Customers asked "how do I admit anyone in my AWS Org without
enumerating accounts?" The answer is the IAM trust-policy condition,
NOT a Forge-side feature — but the recipe needs to be documented
prominently so operators don't try to build it the hard way.
Effort: ~0.5 day (docs only).
5. Audit-log dashboard guide
docs/auth/recipes/audit-log-mesh-observability.md (new file)
- How to grep user_id across Forge audit logs to answer:
  - Which agents called which agents (mesh map)
  - What reason codes are spiking (incident response)
  - Which IAM principals are getting rejected (allowlist mis-config)
- Sample queries for CloudWatch Insights, Datadog, jq, Loki
- Pairs with the per-agent IAM role convention from the fleet-account
pattern
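To illustrate the kind of question the guide should answer, here is a rough Go sketch that tallies audit events per caller and reason code. It assumes the audit log is newline-delimited JSON; user_id is the field named above, while the reason field name and the principal_not_allowed value are purely hypothetical and would need to be replaced with the pinned audit-log shape.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// auditEvent models only the fields this sketch needs. "reason" is an
// ASSUMED field name for the reason code, not the confirmed schema.
type auditEvent struct {
	UserID string `json:"user_id"`
	Reason string `json:"reason"`
}

// countByCallerAndReason tallies events per (user_id, reason) pair,
// skipping lines that do not parse as JSON.
func countByCallerAndReason(r io.Reader) map[[2]string]int {
	counts := make(map[[2]string]int)
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		var ev auditEvent
		if err := json.Unmarshal(sc.Bytes(), &ev); err != nil {
			continue
		}
		counts[[2]string{ev.UserID, ev.Reason}]++
	}
	return counts
}

// countFromString is a convenience wrapper for in-memory input.
func countFromString(s string) map[[2]string]int {
	return countByCallerAndReason(strings.NewReader(s))
}

func main() {
	// Hypothetical log lines for illustration only.
	logs := `{"user_id":"arn:aws:sts::111122223333:assumed-role/agent-orders/s1","reason":"principal_not_allowed"}
{"user_id":"arn:aws:sts::111122223333:assumed-role/agent-orders/s1","reason":"principal_not_allowed"}
not json`
	for key, n := range countFromString(logs) {
		fmt.Printf("%s %s %d\n", key[0], key[1], n)
	}
}
```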
Why: Customer-facing operations doc. Phase 1 review #4 + #6
pinned the audit-log shape; PR #79 confirmed via live test that
user_id carries the canonical caller ARN. Operators need a guide
to actually use this for mesh observability — without it they can't
debug "why is agent-X failing to call agent-Y" or "what credential
was leaked."
Effort: ~0.5 day (docs only).
Total estimate
~4 days of focused work. Two ~1-day code drops + one ~1-day
runtime change + two ~0.5-day doc pages.
Out of scope for Phase 3 (revised)
- Okta vendor provider → moves to Phase 3.5 / v0.13.0 (revisit on
customer signal)
- Authorization layer (RBAC/ABAC) → Phase 4
- mTLS client auth → Phase 4
- SAML 2.0 → Phase 5+
- OBO / token exchange (RFC 8693) → Phase 5+ (pulls earlier if
end-user attribution becomes a hard customer requirement)
Acceptance criteria
- allowed_subjects works against Google ID tokens; pinned by integration test using a fake JWKS
- allowed_apps works against AAD JWT appid claim; pinned by integration test using a fake AAD
- [...] ~/.forge/<agent>/.forge/*.log
- golangci-lint 0 issues, gofmt clean
- [...] v0.12.0
cc @ (anyone scheduled for Phase 3 review)