feat(agent): Workload Identity for Azure, with Service Principal fallback #367
gdrojas wants to merge 4 commits into
…back

The agent module unconditionally required `azure_client_secret` and injected `AZURE_CLIENT_SECRET` into the Helm Secret, even when callers wanted to use Workload Identity. The downstream service scripts in nullplatform/services don't actually consume the secret today (the SP auth path in azure-cosmos-db/scripts/azure/resolve_azure_context is commented out and the azurerm provider relies on ARM_USE_OIDC / ARM_USE_MSI env wiring), so the over-validation was a strict regression for WI users.

Mirrors the PR #361 pattern applied to cert_manager and external_dns:
- `azure_workload_identity_enabled` (default true) gates the auth mode
- `azure_federated_credential_id` is required when WI is enabled: pass the `id` output of an infrastructure/azure/iam module instance to enforce ordering between the federated identity credential and the agent SA
- `azure_client_secret` stays optional unless the caller opts out of WI
- the ServiceAccount gets the `azure.workload.identity/client-id` annotation and the pod gets the `azure.workload.identity/use=true` label only in WI mode, so the Azure Workload Identity webhook injects the federated token env vars at runtime
- `AZURE_CLIENT_SECRET` is dropped from the Secret in WI mode

Locals are null-tolerant so the cross_variable_validation preconditions fire with a clear message instead of `templatefile` blowing up on missing inputs at graph-construction time.

Tests cover both auth modes plus AWS / GCP / OCI to prevent the WI wiring from leaking into non-Azure paths.

BREAKING CHANGE: callers passing `cloud_provider = "azure"` now hit a plan error unless they either pass `azure_federated_credential_id` (the recommended path) or set `azure_workload_identity_enabled = false` and keep passing `azure_client_secret`.
Tests will land in a separate change.
…ure WI

Keep the original `aws_iam_role_arn`-gated serviceAccount block in the template untouched and add the Azure WI annotations + podLabel as a separate, mutually exclusive block. Drop the `service_account_annotations` and `pod_labels` locals: they were a refactor that mixed AWS and Azure paths to support tests we no longer carry.

Also drop the null-tolerance added to existing Azure config fields: those were only there for tests that fed null inputs to trigger preconditions; without those tests, defensive coercion is dead code and the `AZURE_*` fields stay at their original `= var.x` style.

Net diff vs main is now limited to:
- new `azure_workload_identity_active` local
- conditional `AZURE_CLIENT_SECRET` injection (via `merge` in `cloud_config.azure`)
- new `templatefile` inputs: `azure_workload_identity_active`, `azure_client_id`
- new Azure WI block in the YAML template
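The first two items of that net diff could look roughly like the fragment below. This is a minimal sketch under stated assumptions, not the module's actual code: the variable declarations are trimmed to what the example needs, and the only key shown next to `AZURE_CLIENT_SECRET` in `cloud_config.azure` (`AZURE_CLIENT_ID`) is illustrative.

```hcl
# Hypothetical, trimmed-down inputs; the real module declares more variables.
variable "cloud_provider" {
  type = string
}

variable "azure_workload_identity_enabled" {
  type    = bool
  default = false
}

variable "azure_client_id" {
  type    = string
  default = null
}

variable "azure_client_secret" {
  type      = string
  default   = null
  sensitive = true
}

locals {
  # WI is only "active" on Azure, so the toggle cannot leak into AWS / GCP / OCI paths.
  azure_workload_identity_active = var.cloud_provider == "azure" && var.azure_workload_identity_enabled

  cloud_config = {
    azure = merge(
      {
        # Illustrative key; kept in the plain `= var.x` style.
        AZURE_CLIENT_ID = var.azure_client_id
      },
      # The secret is merged in only for Service Principal mode.
      local.azure_workload_identity_active ? {} : {
        AZURE_CLIENT_SECRET = var.azure_client_secret
      }
    )
  }
}
```

Keeping the conditional inside `merge` leaves the existing fields untouched, which matches the stated goal of a small diff against main.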
Make the change backwards-compatible: existing Azure callers passing `azure_client_secret` keep working without any code change. Workload Identity is now opt-in by setting `azure_workload_identity_enabled = true` (which then requires `azure_federated_credential_id`). This deliberately diverges from the cert_manager / external_dns default in #361 (those default to WI = true): the agent module has more deployed callers, so the conservative default avoids breaking them at plan time on the first apply after the bump.
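The opt-in contract described here can be sketched as follows. It is an illustration, not the module's real validation: it assumes Terraform 1.4 or later (for `terraform_data`), the resource name `cross_variable_validation` is borrowed from the earlier commit message without knowing how the module actually implements it, and the error messages are invented.

```hcl
variable "cloud_provider" {
  type = string
}

variable "azure_workload_identity_enabled" {
  description = "Opt in to Azure Workload Identity; false keeps the Service Principal path."
  type        = bool
  default     = false
}

variable "azure_federated_credential_id" {
  description = "The id output of an infrastructure/azure/iam module instance."
  type        = string
  default     = null
}

variable "azure_client_secret" {
  type      = string
  default   = null
  sensitive = true
}

# Hypothetical carrier resource for the cross-variable checks.
resource "terraform_data" "cross_variable_validation" {
  lifecycle {
    # WI mode on Azure needs the federated credential id.
    precondition {
      condition = (
        !(var.cloud_provider == "azure" && var.azure_workload_identity_enabled)
        || var.azure_federated_credential_id != null
      )
      error_message = "azure_federated_credential_id is required when azure_workload_identity_enabled = true."
    }

    # Service Principal mode on Azure still needs the client secret.
    precondition {
      condition = (
        var.cloud_provider != "azure"
        || var.azure_workload_identity_enabled
        || var.azure_client_secret != null
      )
      error_message = "azure_client_secret is required when azure_workload_identity_enabled = false."
    }
  }
}
```

With `default = false`, an existing caller that already passes `azure_client_secret` satisfies both preconditions without changing any input.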
df39afc to c7a89f1
The agent doesn't have an SDK or CLI, so what would we use Workload Identity for?
Closing without merging. Discussion with the team (cred. David Fernandez) clarified that the agent's Azure
Net result: the PR would be infrastructure without a consumer. Closing now to keep main lean; if a future workflow brings Azure auth back into the agent (e.g., active use of
End-to-end validation done during this work (WI annotation on SA, federated token exchange against Microsoft Entra, ARM API call via the token, opt-in/opt-out toggle) is preserved in this PR's history for reference. Branch
Summary
Adds opt-in Workload Identity support to `nullplatform/agent`. Mirrors the WI / SP toggle pattern from PR #361 (cert_manager, external_dns), but defaults to Service Principal so existing Azure callers keep working without any code change.

Behavior
- `azure_workload_identity_enabled = false` (default, current behavior): pass `azure_client_secret`
- `azure_workload_identity_enabled = true` (opt-in): pass `azure_federated_credential_id` (`module.iam_*.id`); adds `azure.workload.identity/client-id` on the SA + `azure.workload.identity/use=true` on the pod

In WI mode, `AZURE_CLIENT_SECRET` is dropped from the agent's Helm Secret entirely. The Azure WI webhook injects `AZURE_FEDERATED_TOKEN_FILE`, `AZURE_CLIENT_ID`, `AZURE_TENANT_ID`, `AZURE_AUTHORITY_HOST` into the pod at runtime; any future script that uses the Azure SDK with `DefaultAzureCredential` authenticates automatically.

Why opt-in (different from #361)
`cert_manager` and `external_dns` (#361) defaulted to WI = true because they were less deeply deployed and the WI annotation is functionally required (those services actually call the Azure SDK). `nullplatform/agent` has more existing callers and the SP credentials are currently dead weight at runtime: `endpoint-exposer` uses `kubectl` only, and the `azure-cosmos-db` service script that would need Azure auth has the SP block commented out (it expects `ARM_*` env vars, not `AZURE_*`). Defaulting WI = true would have broken every existing caller at plan time without unblocking any active code path.

Migration path
To migrate to WI (recommended when you set up a new cluster):
    module "agent" {
      # ...
    - azure_client_secret = var.azure_client_secret
    + azure_workload_identity_enabled = true
    + azure_federated_credential_id = module.iam_agent.id
    + azure_client_id = module.iam_agent.client_id # was var.azure_client_id (SP); now UAMI client_id
    }

You also need an `infrastructure/azure/iam` module instance to create the federated identity credential for the agent's SA: