feat(agent): Workload Identity for Azure, with Service Principal fallback #367

Closed
gdrojas wants to merge 4 commits into main from feat/agent-azure-wi-support

gdrojas commented May 20, 2026

Summary

Adds opt-in Workload Identity support to nullplatform/agent. Mirrors the WI / SP toggle pattern from PR #361 (cert_manager, external_dns), but defaults to Service Principal so existing Azure callers keep working without any code change.

Behavior

| azure_workload_identity_enabled | What's required | What the agent's SA / pod get |
| --- | --- | --- |
| false (default, current behavior) | azure_client_secret | No WI annotations or pod labels |
| true (opt-in) | azure_federated_credential_id (pass module.iam_*.id) | azure.workload.identity/client-id on the SA + azure.workload.identity/use=true on the pod |
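
The per-mode requirements can be enforced at plan time. The following is a minimal sketch, not the module's actual code: the variable names come from this PR, but attaching the checks to a `terraform_data` resource and the resource name `azure_auth_validation` are assumptions made for illustration.

```hcl
variable "cloud_provider" {
  type = string
}

variable "azure_workload_identity_enabled" {
  type        = bool
  default     = false # Service Principal stays the default
  description = "Opt in to Azure Workload Identity for the agent."
}

variable "azure_federated_credential_id" {
  type    = string
  default = null
}

variable "azure_client_secret" {
  type      = string
  default   = null
  sensitive = true
}

# Hypothetical anchor resource: it exists only to carry the checks.
resource "terraform_data" "azure_auth_validation" {
  lifecycle {
    # WI mode needs the federated credential id.
    precondition {
      condition = (
        var.cloud_provider != "azure" ||
        !var.azure_workload_identity_enabled ||
        var.azure_federated_credential_id != null
      )
      error_message = "azure_federated_credential_id is required when azure_workload_identity_enabled = true."
    }

    # SP mode needs the client secret.
    precondition {
      condition = (
        var.cloud_provider != "azure" ||
        var.azure_workload_identity_enabled ||
        var.azure_client_secret != null
      )
      error_message = "azure_client_secret is required when azure_workload_identity_enabled = false."
    }
  }
}
```

Both conditions short-circuit for non-Azure providers, so AWS, GCP, and OCI callers would not be affected by either check.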

In WI mode, AZURE_CLIENT_SECRET is dropped from the agent's Helm Secret entirely. The Azure WI webhook injects AZURE_FEDERATED_TOKEN_FILE, AZURE_CLIENT_ID, AZURE_TENANT_ID, and AZURE_AUTHORITY_HOST into the pod at runtime, so any future script that uses the Azure SDK with DefaultAzureCredential authenticates without further configuration.
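
One way to drop the secret only in WI mode is a conditional `merge`. This is a sketch under assumptions: the local name `azure_workload_identity_active` and the `cloud_config.azure` map are named in this PR's commit notes, but their exact shape here is a guess, and the module's other Azure fields are omitted.

```hcl
variable "cloud_provider" {
  type = string
}

variable "azure_workload_identity_enabled" {
  type    = bool
  default = false
}

variable "azure_client_id" {
  type = string
}

variable "azure_client_secret" {
  type      = string
  default   = null
  sensitive = true
}

locals {
  # WI is active only for Azure callers that explicitly opted in.
  azure_workload_identity_active = (
    var.cloud_provider == "azure" && var.azure_workload_identity_enabled
  )

  cloud_config = {
    azure = merge(
      {
        # Always present; in WI mode this is the UAMI client_id.
        AZURE_CLIENT_ID = var.azure_client_id
      },
      # The secret key is added only on the Service Principal path.
      local.azure_workload_identity_active ? {} : {
        AZURE_CLIENT_SECRET = var.azure_client_secret
      }
    )
  }
}
```

With this shape the AZURE_CLIENT_SECRET key is absent from the map in WI mode rather than set to an empty value, so nothing secret-shaped would be rendered into the Helm Secret.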

Why opt-in (different from #361)

cert_manager and external_dns (#361) defaulted to WI = true because they had fewer deployed callers and the WI annotation is functionally required there (those services actually call the Azure SDK).

nullplatform/agent has more existing callers and the SP credentials are currently dead weight at runtime: endpoint-exposer uses kubectl only, and the azure-cosmos-db service script that would need Azure auth has the SP block commented out (it expects ARM_* env vars, not AZURE_*). Defaulting WI = true would have broken every existing caller at plan time without unblocking any active code path.

Migration path

To migrate to WI (recommended when you set up a new cluster):

 module "agent" {
   # ...
-  azure_client_secret           = var.azure_client_secret
+  azure_workload_identity_enabled = true
+  azure_federated_credential_id   = module.iam_agent.id
+  azure_client_id                 = module.iam_agent.client_id  # was var.azure_client_id (SP); now UAMI client_id
 }

You also need an infrastructure/azure/iam module instance to create the federated identity credential for the agent's SA:

module "iam_agent" {
  source               = "git::https://github.com/nullplatform/tofu-modules.git//infrastructure/azure/iam?ref=v1.56.1"
  name                 = "agent-${local.cluster_name}"
  resource_group_name  = module.resource_group.resource_group_name
  location             = var.location
  oidc_issuer_url      = module.aks.oidc_issuer_url
  namespace            = "nullplatform-tools"
  service_account_name = "nullplatform-agent"
  role_definition_name = "Reader"   # adjust to whatever Azure perms the agent's services need
  scope                = "/subscriptions/${var.azure_subscription_id}/resourceGroups/${module.resource_group.resource_group_name}"
  depends_on           = [module.aks]
}
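
For orientation, a module with these inputs could plausibly wrap the resources below. This is an illustrative sketch only: the internals of infrastructure/azure/iam are not shown in this PR, the mapping of the `id` and `client_id` outputs is inferred from how they are used above, and the argument names follow the azurerm provider documentation and should be checked against the provider version in use.

```hcl
# Illustrative only: the real infrastructure/azure/iam module may differ.
variable "name" { type = string }
variable "resource_group_name" { type = string }
variable "location" { type = string }
variable "oidc_issuer_url" { type = string }
variable "namespace" { type = string }
variable "service_account_name" { type = string }
variable "role_definition_name" { type = string }
variable "scope" { type = string }

# User-assigned managed identity (UAMI) that the agent pod will assume.
resource "azurerm_user_assigned_identity" "this" {
  name                = var.name
  resource_group_name = var.resource_group_name
  location            = var.location
}

# Trusts tokens issued by the AKS OIDC issuer for one specific ServiceAccount.
resource "azurerm_federated_identity_credential" "this" {
  name                = var.name
  resource_group_name = var.resource_group_name
  parent_id           = azurerm_user_assigned_identity.this.id
  issuer              = var.oidc_issuer_url
  audience            = ["api://AzureADTokenExchange"]
  subject             = "system:serviceaccount:${var.namespace}:${var.service_account_name}"
}

# Grants the identity its Azure permissions on the given scope.
resource "azurerm_role_assignment" "this" {
  scope                = var.scope
  role_definition_name = var.role_definition_name
  principal_id         = azurerm_user_assigned_identity.this.principal_id
}

# Assumed outputs, matching module.iam_agent.id and module.iam_agent.client_id.
output "id" { value = azurerm_federated_identity_credential.this.id }
output "client_id" { value = azurerm_user_assigned_identity.this.client_id }
```

The `subject` must match the namespace and ServiceAccount name exactly (here nullplatform-tools and nullplatform-agent), otherwise the token exchange against Microsoft Entra is rejected.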

gdrojas added 4 commits May 21, 2026 10:16
…back

The agent module unconditionally required azure_client_secret and injected
AZURE_CLIENT_SECRET into the Helm Secret — even when callers wanted to
use Workload Identity. The downstream service scripts in nullplatform/services
don't actually consume the secret today (the SP auth path in
azure-cosmos-db/scripts/azure/resolve_azure_context is commented out and
the azurerm provider relies on ARM_USE_OIDC / ARM_USE_MSI env wiring),
so the over-validation was a strict regression for WI users.

Mirrors the PR #361 pattern applied to cert_manager and external_dns:

- azure_workload_identity_enabled (default true) gates the auth mode
- azure_federated_credential_id is required when WI is enabled — pass the
  id output of an infrastructure/azure/iam module instance to enforce
  ordering between the federated identity credential and the agent SA
- azure_client_secret stays optional unless the caller opts out of WI
- ServiceAccount gets azure.workload.identity/client-id annotation and
  the pod gets azure.workload.identity/use=true label only in WI mode,
  so the Azure Workload Identity webhook injects the federated token
  env vars at runtime
- AZURE_CLIENT_SECRET is dropped from the Secret in WI mode

Locals are null-tolerant so the cross_variable_validation preconditions
fire with a clear message instead of templatefile blowing up on missing
inputs at graph-construction time.

Tests cover both auth modes plus AWS / GCP / OCI to prevent the WI
wiring from leaking into non-Azure paths.

BREAKING CHANGE: callers passing cloud_provider = "azure" now hit a plan
error unless they either pass azure_federated_credential_id (preferred,
recommended path) or set azure_workload_identity_enabled = false and
keep passing azure_client_secret.

Tests will land in a separate change.

…ure WI

Keep the original aws_iam_role_arn-gated serviceAccount block in the
template untouched and add the Azure WI annotations + podLabel as a
separate, mutually exclusive block. Drop the service_account_annotations
and pod_labels locals — they were a refactor that mixed AWS and Azure
paths to support tests we no longer carry.

Also drop the null-tolerance added to existing Azure config fields:
those were only there for tests that fed null inputs to trigger
preconditions; without those tests, defensive coercion is dead code
and the AZURE_* fields stay at their original `= var.x` style.

Net diff vs main is now limited to:
  - new `azure_workload_identity_active` local
  - conditional AZURE_CLIENT_SECRET injection (via merge in cloud_config.azure)
  - new templatefile inputs: azure_workload_identity_active, azure_client_id
  - new Azure WI block in the YAML template

Make the change backwards-compatible: existing Azure callers passing
azure_client_secret keep working without any code change. Workload
Identity is now opt-in by setting azure_workload_identity_enabled = true
(which then requires azure_federated_credential_id).

This diverges from the cert_manager / external_dns default in #361
(those default to WI = true) on purpose — the agent module has more
deployed callers, so the conservative default avoids breaking them
at plan time on the first apply after the bump.

gdrojas force-pushed the feat/agent-azure-wi-support branch from df39afc to c7a89f1 May 21, 2026 13:16

@davidf-null

The agent doesn't have an SDK or CLI, so what would we use Workload Identity for?

gdrojas commented May 21, 2026

Closing without merging.

Discussion with the team (credit: David Fernandez) clarified that the agent's Azure client_secret requirement is purely vestigial: it was originally consumed by an in-agent curl against the Azure DNS API, but that code path has already moved out to external-dns (which has WI auth via #361). The agent no longer makes any Azure API call in its current workflows, so there is no runtime consumer for the WI plumbing this PR would add.

Net result: the PR would be infrastructure without a consumer. Closing now to keep main lean; if a future workflow brings Azure auth back into the agent (e.g., active use of nullplatform/services/databases/azure-cosmos-db with the azurerm provider configured for WI), a more targeted PR can be opened then with concrete usage in scope.

End-to-end validation done during this work (WI annotation on SA, federated token exchange against Microsoft Entra, ARM API call via the token, opt-in/opt-out toggle) is preserved in this PR's history for reference.

Branch feat/agent-azure-wi-support left intact in case someone wants to resurrect this.

gdrojas closed this May 21, 2026