Phase 3: production hardening + cross-cloud parity (follow-up to #79) #80

@naveen-kurra

Context

Follow-up to #79 (Phase 2: cloud-native auth providers). Through the
real-AWS testing and the cross-cloud usage discussion on #79, five
concrete gaps surfaced that should ship as Phase 3 (v0.12.0).

Original Phase 3 plan (per PHASE2_CLOUD_NATIVE_PROVIDERS.md §9.6)
was an Okta vendor provider — a vendor-specific composition layer
on top of the Phase 1 oidc provider, covering Okta's non-standard
group endpoint and app-mode token quirks. That work isn't urgent: the
generic oidc provider already handles Okta-as-OIDC for basic JWT
flow; the vendor extensions only matter once a customer signal
materializes.

Revised Phase 3 prioritizes the five items below — driven by what
production usage and multi-cloud customer onboarding actually need.
Originally-planned Okta work moves to Phase 3.5 / v0.13.0 to be
revisited when an Okta-specific customer signal appears.


Scope (5 items)

1. oidc.allowed_subjects — per-identity allowlist

forge-core/auth/providers/oidc/

  • Add AllowedSubjects []string to oidc.Config
  • Shell-style glob matching against the JWT sub claim (and email
    claim as a fallback when present)
  • Empty list = current behavior (admit any token from issuer with
    matching audience)
  • Validated at Factory time via path.Match(pattern, ""), which
    returns path.ErrBadPattern for a malformed glob
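
A minimal sketch of the matching described above, assuming path.Match is the glob engine for both validation and matching. The helper names validatePatterns and subjectAllowed and the sample sub value are hypothetical, not the actual oidc package API. Note that path.Match's * does not cross a "/" character, which is fine for service-account emails but worth keeping in mind for other subject formats.

```go
package main

import (
	"fmt"
	"path"
)

// validatePatterns rejects malformed globs up front, as a factory might.
// path.Match takes (pattern, name); matching against the empty name still
// surfaces path.ErrBadPattern for a malformed pattern (Go 1.16 and later).
func validatePatterns(patterns []string) error {
	for _, p := range patterns {
		if _, err := path.Match(p, ""); err != nil {
			return fmt.Errorf("allowed_subjects: bad pattern %q: %w", p, err)
		}
	}
	return nil
}

// subjectAllowed tries the sub claim first, then the email claim if present.
// An empty allowlist admits every subject (the current behavior).
func subjectAllowed(patterns []string, sub, email string) bool {
	if len(patterns) == 0 {
		return true
	}
	for _, candidate := range []string{sub, email} {
		if candidate == "" {
			continue
		}
		for _, p := range patterns {
			// Errors are ignored here because patterns were validated earlier.
			if ok, _ := path.Match(p, candidate); ok {
				return true
			}
		}
	}
	return false
}

func main() {
	allow := []string{"*@agent-fleet-prod.iam.gserviceaccount.com"}
	if err := validatePatterns(allow); err != nil {
		panic(err)
	}
	// Hypothetical claims: a numeric sub plus a service-account email.
	fmt.Println(subjectAllowed(allow, "1234567890", "orders@agent-fleet-prod.iam.gserviceaccount.com"))
	fmt.Println(subjectAllowed(allow, "1234567890", "orders@other-project.iam.gserviceaccount.com"))
}
```
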

Why: The generic oidc provider currently admits any token from
the configured issuer with a matching audience. For Google Cloud
service-account flows, the issuer is https://accounts.google.com
(globally shared), so without per-subject filtering, any Google
service account in the world with the right aud claim is admitted.
This blocks per-identity GCP customer onboarding.

Recipe after fix:

auth:
  providers:
    - type: oidc
      settings:
        issuer:   https://accounts.google.com
        audience: api://forge
        allowed_subjects:
          - "*@agent-fleet-prod.iam.gserviceaccount.com"

Effort: ~1 day (~80 LOC source + ~120 LOC tests).


2. azure_ad.allowed_apps — per-identity allowlist

forge-core/auth/providers/azure_ad/

  • Add AllowedApps []string to azure_ad.Config
  • Glob matching against the JWT appid claim (preferred) or oid
    claim (fallback for legacy tokens)
  • Empty list = current behavior (admit any token matching tid +
    audience)
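
One way to read "appid preferred, oid as fallback" is sketched below: use appid when the token carries it, otherwise match against oid. That reading, the claims struct, and the use of path.Match for globbing are all assumptions for illustration, not the azure_ad provider's actual code.

```go
package main

import (
	"fmt"
	"path"
)

// claims is a hypothetical, trimmed view of an already-validated AAD token.
type claims struct {
	AppID string // JWT "appid" claim
	OID   string // JWT "oid" claim
}

// appAllowed matches the appid claim when present and falls back to oid.
// An empty allowlist keeps the current tenant-plus-audience behavior.
func appAllowed(allowed []string, c claims) bool {
	if len(allowed) == 0 {
		return true
	}
	id := c.AppID
	if id == "" {
		id = c.OID
	}
	if id == "" {
		return false // nothing to match against, so reject
	}
	for _, p := range allowed {
		if ok, _ := path.Match(p, id); ok {
			return true
		}
	}
	return false
}

func main() {
	allowed := []string{"agent-orders", "agent-fraud"}
	fmt.Println(appAllowed(allowed, claims{AppID: "agent-orders"}))
	fmt.Println(appAllowed(allowed, claims{AppID: "agent-billing"}))
	fmt.Println(appAllowed(allowed, claims{OID: "agent-fraud"}))
}
```
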

Why: Azure tenant gating works today, but at tenant granularity
only. A customer with many managed identities in a single Entra
tenant cannot today say "only agent-orders and agent-fraud may
call me; not every workload in the tenant." This is the equivalent
of aws_sigv4's allowed_principals for Azure.

Recipe after fix:

auth:
  providers:
    - type: azure_ad
      settings:
        tenant_id: <T>
        audience:  api://forge
        allowed_apps:
          - "agent-orders"
          - "agent-fraud"

Effort: ~1 day (~80 LOC source + ~120 LOC tests).


3. Hot-reload of auth chain

forge-cli/runtime/

  • forge.yaml hot-reload currently picks up content changes for most
    fields, but the auth chain is constructed once at startup. Editing
    auth.providers (incl. allowed_principals, allowed_accounts,
    the soon-to-land allowed_subjects/allowed_apps) requires a hard
    Ctrl-C + restart for the change to take effect.
  • Rebuild and atomically swap the chain on auth: block changes.
  • Don't drain in-flight requests — they use the chain captured at
    request entry; new requests use the new chain.
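
The swap semantics above can be sketched with sync/atomic's atomic.Pointer (Go 1.19+). The chain type, reload, and handle are hypothetical stand-ins for the runtime's real types; the point is only that a request holding the old pointer keeps its old chain while new requests load the new one.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// chain is a hypothetical stand-in for the built auth chain.
type chain struct {
	allowedPrincipals map[string]bool
}

func (c *chain) admit(principal string) bool { return c.allowedPrincipals[principal] }

// current holds the active chain; Load and Store are atomic.
var current atomic.Pointer[chain]

// reload builds a fresh chain from the edited config and swaps it in.
// The old chain is never mutated, so in-flight requests stay consistent.
func reload(principals []string) {
	next := &chain{allowedPrincipals: make(map[string]bool, len(principals))}
	for _, p := range principals {
		next.allowedPrincipals[p] = true
	}
	current.Store(next)
}

// handle captures the chain once at request entry and uses only that copy.
func handle(principal string) bool {
	c := current.Load()
	return c.admit(principal)
}

func main() {
	reload([]string{"agent-orders", "agent-fraud"})
	inFlight := current.Load() // a request that started before the edit

	reload([]string{"agent-orders"}) // operator revokes agent-fraud

	fmt.Println("in-flight sees agent-fraud:", inFlight.admit("agent-fraud"))
	fmt.Println("new request sees agent-fraud:", handle("agent-fraud"))
}
```
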

Why: Surfaced during PR #79 real-AWS testing: when an operator
tightens allowed_principals to revoke a compromised credential, the
revocation doesn't take effect until restart. Deferring this was
reasonable for v0.11.0 (the limitation is documented in the
CHANGELOG), but it is worth fixing in v0.12.0 because
security-relevant config edits shouldn't require a restart.

Effort: ~1 day (~60 LOC source + ~80 LOC tests).


4. AWS Org-via-trust-policy onboarding recipe

docs/auth/recipes/aws-org-wide.md (new file, pending either docs
being restored from .gitignore or a decision on an external doc-site
path)

  • Worked example of the AWS-IAM-Identity-Center / SSO recipe
  • Walks through: customer creates an entry IAM role in one account,
    trust policy uses aws:PrincipalOrgID condition, every user in
    their AWS Org can sts:AssumeRole it transparently via SSO, Forge's
    allowlist contains just that one assumed-role ARN
  • Pairs with aws_sigv4's existing allowed_principals + the new
    allowed_accounts shortcut

Why: Pattern came up during the multi-cloud planning discussion.
Customers asked "how do I admit anyone in my AWS Org without
enumerating accounts?" The answer is the IAM trust-policy condition,
NOT a Forge-side feature — but the recipe needs to be documented
prominently so operators don't try to build it the hard way.

Effort: ~0.5 day (docs only).


5. Audit-log dashboard guide

docs/auth/recipes/audit-log-mesh-observability.md (new file)

  • How to grep user_id across Forge audit logs to answer:
    • Which agents called which agents (mesh map)
    • What reason codes are spiking (incident response)
    • Which IAM principals are getting rejected (allowlist mis-config)
  • Sample queries for CloudWatch Insights, Datadog, jq, Loki
  • Pairs with the per-agent IAM role convention from the fleet-account
    pattern

Why: Customer-facing operations doc. Phase 1 review #4 + #6
pinned the audit-log shape; PR #79 confirmed via live test that
user_id carries the canonical caller ARN. Operators need a guide
to actually use this for mesh observability — without it they can't
debug "why is agent-X failing to call agent-Y" or "what credential
was leaked."

Effort: ~0.5 day (docs only).


Total estimate

~4 days of focused work. Two ~1-day code drops + one ~1-day
runtime change + two ~0.5-day doc pages.

Out of scope for Phase 3 (revised)

  • Okta vendor provider → moves to Phase 3.5 / v0.13.0 (revisit on
    customer signal)
  • Authorization layer (RBAC/ABAC) → Phase 4
  • mTLS client auth → Phase 4
  • SAML 2.0 → Phase 5+
  • OBO / token exchange (RFC 8693) → Phase 5+ (pulls earlier if
    end-user attribution becomes a hard customer requirement)

Acceptance criteria

  • allowed_subjects works against Google ID tokens; pinned by integration test using a fake JWKS
  • allowed_apps works against AAD JWT appid claim; pinned by integration test using a fake AAD
  • Hot-reload changes auth chain without dropping in-flight requests; integration test boots Forge, edits forge.yaml, confirms new allowlist takes effect within ~1 second
  • AWS-Org-wide recipe doc reviewed by someone unfamiliar with the design (catches jargon)
  • Audit-log doc has sample queries that actually run against ~/.forge/<agent>/.forge/*.log
  • All Phase 1 + Phase 2 tests pass unchanged (regression)
  • golangci-lint 0 issues, gofmt clean
  • Target release tag: v0.12.0

cc @ — anyone scheduled for Phase 3 review.

Metadata

Assignees

No one assigned

Labels

documentation, enhancement
