feat(commons/azure): Workload Identity for cert-manager and external-dns, with Service Principal fallback #361

Merged
gdrojas merged 12 commits into main from feat/azure-commons-wi-sp-auth on May 20, 2026

@gdrojas gdrojas commented May 19, 2026

Collaborator

Summary

  • cert-manager: adds Azure DNS01 solver support with Workload Identity (enabled by default) and Service Principal fallback. azure_federated_credential_id is enforced at plan time when WI is enabled — caller must pass module.iam_cert_manager.id, which also creates the correct Tofu dependency ordering.
  • external-dns: same pattern — WI default, SP fallback via azure_workload_identity_enabled = false. azure.json secret is built conditionally to include only the fields required by the selected auth method.
  • Validation: terraform_data preconditions reject misconfigured inputs at tofu plan time (missing client ID, missing federated credential when WI=true, missing client secret when WI=false).
  • Tests: updated Azure fixture files to include azure_federated_credential_id required by the new validation.
  • fix(istio): default istiod to 2 replicas to avoid PDB blocking node drains on single-replica setup.
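The plan-time validation described above could look roughly like the following. This is a minimal sketch, not the module's actual code: the resource name cross_variable_validation is borrowed from a later commit message, the cloud_provider gate is an assumption about how the module detects Azure, and the error messages are illustrative wording only.

```hcl
variable "cloud_provider" {
  type    = string
  default = "azure" # assumed gate; the real module may detect Azure differently
}

variable "azure_workload_identity_enabled" {
  type    = bool
  default = true
}

variable "azure_client_id" {
  type    = string
  default = null
}

variable "azure_federated_credential_id" {
  type    = string
  default = null
}

variable "azure_client_secret" {
  type      = string
  default   = null
  sensitive = true
}

locals {
  is_azure = var.cloud_provider == "azure"
  wi_mode  = local.is_azure && var.azure_workload_identity_enabled
  sp_mode  = local.is_azure && !var.azure_workload_identity_enabled
}

# Creates no infrastructure; it exists only to carry the preconditions,
# which are evaluated during `tofu plan`.
resource "terraform_data" "cross_variable_validation" {
  lifecycle {
    precondition {
      condition     = !local.is_azure || var.azure_client_id != null
      error_message = "Illustrative: azure_client_id must be set for Azure."
    }

    precondition {
      condition     = !local.wi_mode || var.azure_federated_credential_id != null
      error_message = "Illustrative: pass azure_federated_credential_id when Workload Identity is enabled."
    }

    precondition {
      condition     = !local.sp_mode || var.azure_client_secret != null
      error_message = "Illustrative: pass azure_client_secret when Workload Identity is disabled."
    }
  }
}
```

Each condition is written as "not in this mode, or the input is present", so non-Azure callers and callers in the other auth mode pass without supplying inputs they do not need.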

Auth modes

| Variable | WI (default) | SP fallback |
| --- | --- | --- |
| azure_workload_identity_enabled | true | false |
| azure_client_id | required | required |
| azure_federated_credential_id | required | not used |
| azure_client_secret | not used | required |
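
As a rough illustration of the two modes listed above, a caller might wire the modules as follows. Only the azure_* input names and the module.iam_cert_manager.id reference come from this PR; the module source paths, the var.* names, and the pairing of each module with a mode are placeholders, and other required inputs are omitted.

```hcl
# Workload Identity (default mode): no client secret is passed.
module "cert_manager" {
  source = "./modules/cert_manager" # hypothetical path

  azure_client_id = var.azure_client_id # hypothetical caller variable

  # Referencing the IAM module output also orders the federated
  # credential before the Helm release.
  azure_federated_credential_id = module.iam_cert_manager.id
}

# Service Principal fallback: opt out of WI and pass the secret instead.
module "external_dns" {
  source = "./modules/external_dns" # hypothetical path

  azure_workload_identity_enabled = false
  azure_client_id                 = var.azure_client_id
  azure_client_secret             = var.azure_client_secret # hypothetical caller variable
}
```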

Breaking changes

  • Azure consumers of cert_manager and external_dns must now pass azure_federated_credential_id (default WI mode). Failure mode is a clear tofu plan error pointing to the fix:

    azure_federated_credential_id is required when ... and azure_workload_identity_enabled is true. Use module.iam to create the federated identity credential and pass its id output.

    Consumers using Service Principal can opt out with azure_workload_identity_enabled = false and pass azure_client_secret instead.

  • All consumers of istio: the istiod_replicas default goes from 1 to 2. Existing clusters will redeploy istiod with an extra replica. Override with istiod_replicas = 1 to preserve the old behavior (not recommended: single-replica istiod blocks node drains due to its PDB).

gdrojas added 8 commits May 18, 2026 20:21

Adds azure_client_secret variable for environments using service principal
auth instead of workload identity, and wires it into the Azure DNS01 solver
configuration in the ClusterIssuer.

Adds azure_* variables and wires Azure credentials into the external-dns
helm values. Supports both service principal and workload identity flows.

…rains

The istiod chart installs a PDB with minAvailable=1. A single replica istiod
blocks node drains (e.g. during EKS AMI upgrades). Setting both replicaCount
and autoscaleMin to 2 prevents the HPA from scaling back to 1.
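
A sketch of how the replica default might be passed to the chart, assuming the values sit under a pilot key (the nesting varies by chart version) and that the release is managed with helm_release. The release name, namespace, and repository URL are assumptions rather than details from this PR.

```hcl
variable "istiod_replicas" {
  type    = number
  default = 2
}

resource "helm_release" "istiod" {
  name       = "istiod"                                              # assumed release name
  namespace  = "istio-system"                                        # assumed namespace
  repository = "https://istio-release.storage.googleapis.com/charts" # assumed chart source
  chart      = "istiod"

  values = [yamlencode({
    pilot = {
      # Both are set so the HPA floor matches the initial replica count.
      replicaCount = var.istiod_replicas
      autoscaleMin = var.istiod_replicas
    }
  })]
}
```

With minAvailable=1 on the PDB, two replicas leave one pod that can be evicted during a drain while the other keeps serving.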
…configurable

Add azure_workload_identity_enabled (default: true) to cert_manager and external_dns.
When false, WI annotations and pod labels are omitted, azure_client_id is not
required, and useWorkloadIdentityExtension is set to false in the azure.json secret.
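
One way the conditional azure.json could be assembled is sketched below. The field names follow the external-dns Azure configuration file format; the tenant, subscription, and resource group variables are hypothetical, and the real module may build the secret differently.

```hcl
variable "azure_workload_identity_enabled" {
  type    = bool
  default = true
}

variable "azure_tenant_id" { # hypothetical input
  type = string
}

variable "azure_subscription_id" { # hypothetical input
  type = string
}

variable "azure_resource_group" { # hypothetical input
  type = string
}

variable "azure_client_id" {
  type    = string
  default = null
}

variable "azure_client_secret" {
  type      = string
  default   = null
  sensitive = true
}

locals {
  # Fields present in both auth modes.
  azure_json_base = {
    tenantId                     = var.azure_tenant_id
    subscriptionId               = var.azure_subscription_id
    resourceGroup                = var.azure_resource_group
    useWorkloadIdentityExtension = var.azure_workload_identity_enabled
  }

  # Service Principal fields, kept only when WI is disabled.
  azure_json_sp = {
    for k, v in {
      aadClientId     = var.azure_client_id
      aadClientSecret = var.azure_client_secret
    } : k => v if !var.azure_workload_identity_enabled
  }

  # Sensitive because it may embed the client secret.
  azure_json = jsonencode(merge(local.azure_json_base, local.azure_json_sp))
}
```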
… is disabled

cert_manager and external_dns now support both auth modes for Azure:
- WI (default): azure_workload_identity_enabled=true, uses SA annotation + pod label
- SP (opt-out): azure_workload_identity_enabled=false, requires azure_client_secret

azure_client_id is now always required for Azure regardless of auth mode.
…y is enabled

When azure_workload_identity_enabled=true (default), callers must pass
azure_federated_credential_id from module.iam. This enforces that the
Azure AD federation exists before the Helm release runs, and creates the
correct apply-time dependency ordering automatically via Tofu references.
gdrojas added 4 commits May 20, 2026 10:27
cert_manager only manages helm_release resources and a terraform_data for
validation — no kubernetes_* resources. Removing the unused provider
keeps the dependency graph minimal and drops the kubernetes provider lock
hash from the module.
@gdrojas gdrojas merged commit f11896e into main May 20, 2026
43 checks passed
@gdrojas gdrojas deleted the feat/azure-commons-wi-sp-auth branch May 20, 2026 13:57
gdrojas added a commit that referenced this pull request May 20, 2026
Make the change backwards-compatible: existing Azure callers passing
azure_client_secret keep working without any code change. Workload
Identity is now opt-in by setting azure_workload_identity_enabled = true
(which then requires azure_federated_credential_id).

This diverges from the cert_manager / external_dns default in #361
(those default to WI = true) on purpose — the agent module has more
deployed callers, so the conservative default avoids breaking them
at plan time on the first apply after the bump.
gdrojas added a commit that referenced this pull request May 21, 2026
…back

The agent module unconditionally required azure_client_secret and injected
AZURE_CLIENT_SECRET into the Helm Secret — even when callers wanted to
use Workload Identity. The downstream service scripts in nullplatform/services
don't actually consume the secret today (the SP auth path in
azure-cosmos-db/scripts/azure/resolve_azure_context is commented out and
the azurerm provider relies on ARM_USE_OIDC / ARM_USE_MSI env wiring),
so the over-validation was a strict regression for WI users.

Mirrors the PR #361 pattern applied to cert_manager and external_dns:

- azure_workload_identity_enabled (default true) gates the auth mode
- azure_federated_credential_id is required when WI is enabled — pass the
  id output of an infrastructure/azure/iam module instance to enforce
  ordering between the federated identity credential and the agent SA
- azure_client_secret stays optional unless the caller opts out of WI
- ServiceAccount gets azure.workload.identity/client-id annotation and
  the pod gets azure.workload.identity/use=true label only in WI mode,
  so the Azure Workload Identity webhook injects the federated token
  env vars at runtime
- AZURE_CLIENT_SECRET is dropped from the Secret in WI mode
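
The mode-dependent annotation and label from the list above can be sketched as conditional maps. The annotation and label keys are the ones named in this commit message; the local names and the cloud_provider gate are illustrative assumptions.

```hcl
variable "cloud_provider" {
  type    = string
  default = "azure"
}

variable "azure_workload_identity_enabled" {
  type    = bool
  default = true
}

variable "azure_client_id" {
  type    = string
  default = null
}

locals {
  wi_mode = var.cloud_provider == "azure" && var.azure_workload_identity_enabled

  # Empty outside WI mode, so non-Azure and SP deployments get no WI wiring.
  service_account_annotations = {
    for k, v in {
      "azure.workload.identity/client-id" = var.azure_client_id
    } : k => v if local.wi_mode
  }

  pod_labels = {
    for k, v in {
      "azure.workload.identity/use" = "true"
    } : k => v if local.wi_mode
  }
}
```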

Locals are null-tolerant so the cross_variable_validation preconditions
fire with a clear message instead of templatefile blowing up on missing
inputs at graph-construction time.
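
A minimal sketch of what null-tolerant locals might look like, assuming the templates interpolate these values as strings. Interpolating null is an error, so each local substitutes an empty string and leaves the "missing input" report to the preconditions. The local names here are hypothetical.

```hcl
variable "azure_client_id" {
  type    = string
  default = null
}

variable "azure_client_secret" {
  type      = string
  default   = null
  sensitive = true
}

locals {
  # A plain conditional is used instead of coalesce(), which raises an
  # error when every argument is null or an empty string.
  azure_client_id     = var.azure_client_id == null ? "" : var.azure_client_id
  azure_client_secret = var.azure_client_secret == null ? "" : var.azure_client_secret

  # Safe to render even when the caller passed nothing.
  client_id_annotation = "azure.workload.identity/client-id: \"${local.azure_client_id}\""
}
```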

Tests cover both auth modes plus AWS / GCP / OCI to prevent the WI
wiring from leaking into non-Azure paths.

BREAKING CHANGE: callers passing cloud_provider = "azure" now hit a plan
error unless they either pass azure_federated_credential_id (the
recommended path) or set azure_workload_identity_enabled = false and
keep passing azure_client_secret.