Skip to content

feat(azure): enable ACR image pulls via managed identity on worker nodes#8205

Open
twolff-gh wants to merge 5 commits intoopenshift:mainfrom
twolff-gh:azure-acr-mi-api
Open

feat(azure): enable ACR image pulls via managed identity on worker nodes#8205
twolff-gh wants to merge 5 commits intoopenshift:mainfrom
twolff-gh:azure-acr-mi-api

Conversation

@twolff-gh
Copy link
Copy Markdown

@twolff-gh twolff-gh commented Apr 10, 2026

Followup to #7793

Add support for configuring kubelet's image credential provider to
authenticate to Azure Container Registry (ACR) using a user-assigned
managed identity. This eliminates the need for static pull secrets
by leveraging IMDS-based token acquisition on worker nodes.

Summary by CodeRabbit

  • New Features

    • Azure node pools can include image registry credentials (managed identity + registries) to enable ACR authentication on worker nodes.
    • When provided, worker VMs are assigned the configured identity and a MachineConfig is generated to install the ACR credential provider and supporting config files.
  • Bug Fixes

    • Serialization compatibility between versions preserved for Azure node pool platform and registry credential data.
  • Tests

    • Added comprehensive tests covering config generation, ignition payloads, validation, and JSON/YAML serialization.

@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 10, 2026
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci bot commented Apr 10, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Apr 10, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds an optional ImageRegistryCredentials field to AzureNodePoolPlatform and a new AzureImageRegistryCredentials type with validation (ManagedIdentity, Registries). The operator sets VMIdentityUserAssigned and UserAssignedIdentities on Azure VM specs when credentials are present. It generates a worker MachineConfig (Ignition v3.2.0) embedding a kubelet credential-provider YAML and an acr-azure.json payload derived from Azure identifiers. Tests cover API JSON compatibility, credential-provider and acr JSON generation, Ignition contents, and MachineConfig creation. Kubeapilinter exclusions were added for the new field and test-only structs.

Sequence Diagram(s)

sequenceDiagram
    actor Admin as Admin/User
    participant NP as NodePool Spec
    participant Op as Operator
    participant AzureSpec as Azure VM Template
    participant MCO as MachineConfig Generator
    participant Ign as Ignition Config
    participant Node as Worker Node

    Admin->>NP: Create/Update NodePool with ImageRegistryCredentials
    Op->>NP: Reconcile NodePool
    Op->>AzureSpec: Build azureMachineTemplateSpec()
    alt ImageRegistryCredentials set
        AzureSpec->>AzureSpec: Set Identity = VMIdentityUserAssigned
        AzureSpec->>AzureSpec: Add UserAssignedIdentities (ManagedIdentity)
    end
    Op->>MCO: generateMCORawConfig()
    MCO->>MCO: getACRCredentialProviderConfig()
    alt Azure + ImageRegistryCredentials
        MCO->>MCO: generateACRCredentialProviderMachineConfig()
        MCO->>Ign: Create Ignition v3.2.0 with:
        Ign->>Ign: /etc/kubernetes/credential-providers/acr-credential-provider.yaml
        Ign->>Ign: /etc/kubernetes/acr-azure.json
        MCO->>Op: Return MachineConfig (wrapped in ConfigMap)
    end
    Op->>Node: Apply MachineConfig / Ignition
    Node->>Node: Kubelet uses credential provider (ManagedIdentity)
Loading
🚥 Pre-merge checks | ✅ 8 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Topology-Aware Scheduling Compatibility ⚠️ Warning PR unconditionally labels MachineConfig as worker role without topology-aware validation, causing silent failures on SNO/TNF/TNA clusters with no dedicated worker nodes. Add topology validation to detect non-HyperShift topologies (SNO, TNF, TNA) and return explicit errors, or document the limitation prominently.
✅ Passed checks (8 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main feature: enabling ACR image pulls via managed identity on worker nodes, which aligns with the primary changes adding ImageRegistryCredentials support to AzureNodePoolPlatform and generating ACR credential provider machine configs.
Stable And Deterministic Test Names ✅ Passed All test function names are static and descriptive following Go conventions; table-driven test cases use stable names without dynamic values like timestamps or UUIDs.
Test Structure And Quality ✅ Passed Tests follow established quality patterns: single responsibility per test, table-driven organization, no problematic setup/cleanup, synchronous execution without timeouts, and descriptive assertion messages consistent with codebase.
Microshift Test Compatibility ✅ Passed The pull request does not add any Ginkgo e2e tests. All three test files added are standard Go unit tests using the testing package rather than Ginkgo framework tests.
Single Node Openshift (Sno) Test Compatibility ✅ Passed This pull request does not add any Ginkgo e2e tests. All newly added test files use standard Go unit testing with the func TestXXX(t *testing.T) pattern. No Ginkgo patterns like It(), Describe(), Context(), or When() are present.
Ote Binary Stdout Contract ✅ Passed The pull request contains no violations of the OTE Binary Stdout Contract. No process-level code writes to stdout.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed The custom check for IPv6 and disconnected network compatibility targets Ginkgo e2e tests, but this PR adds only standard Go unit tests using table-driven patterns with no Ginkgo imports or BDD-style declarations.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci bot added do-not-merge/needs-area needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Apr 10, 2026
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci bot commented Apr 10, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: twolff-gh
Once this PR has been reviewed and has the lgtm label, please assign enxebre for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added area/api Indicates the PR includes changes for the API area/cli Indicates the PR includes changes for CLI area/documentation Indicates the PR includes changes for documentation area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release area/platform/azure PR/issue for Azure (AzurePlatform) platform and removed do-not-merge/needs-area labels Apr 10, 2026
short-lived tokens for ACR image pulls instead of relying on static pull secrets.
Changing this field will trigger a node rollout.
properties:
managedIdentity:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for standalone or other cases, do we want to support system-assigned identity? If so, we wouldn't have a customer pass these in at creation time.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the scope of this feature, I believe it was only for user-assigned identities. I'll confirm.
If we needed to support system-assigned, I wonder if it would need to be populated differently / new field?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, user-assigned was the main ask, but system-assigned would likely follow in the future.
Do they make sense as separate work - user now and system as another PR after?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if system-assigned identities are added per-VM, then this doesn't make sense from a feature perspective. Maybe try and see what happens, then ignore it or implement it as you see fit.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I see, a cluster needs to be created before getting a principal id which some process would need permissions to access customer resources.

Does the customer trust the platform to write IAM policies on their container registry? (Don't know)

Seems like a followup item. I'll gladly look into

twolff-gh added a commit to twolff-gh/hypershift that referenced this pull request Apr 13, 2026
…/R3)

Address bennerv review comments R2 and R3 on PR openshift#8205: make the
registries field optional (remove +required and MinItems=1), allow empty
registries to fall through to default wildcard patterns (*.azurecr.io,
etc.), and update tests accordingly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@hypershift-operator/controllers/nodepool/config.go`:
- Around line 242-248: The code dereferences
cg.hostedCluster.Spec.Platform.Azure directly which can panic; before calling
generateACRCredentialProviderMachineConfig, add a guard that
cg.hostedCluster.Spec.Platform != nil and cg.hostedCluster.Spec.Platform.Azure
!= nil, and handle the nil case (return an appropriate error or skip registry
credential config) so you don't call generateACRCredentialProviderMachineConfig
with a nil azureSpec; keep references to
cg.nodePool.Spec.Platform.Azure.ImageRegistryCredentials and the
generateACRCredentialProviderMachineConfig call when adding the guard and
subsequent handling.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 51b240dc-cd08-46f4-8ba4-b589a0fd067b

📥 Commits

Reviewing files that changed from the base of the PR and between 4402b11 and 6578833.

⛔ Files ignored due to path filters (14)
  • api/hypershift/v1beta1/zz_generated.deepcopy.go is excluded by !**/zz_generated*.go, !**/zz_generated*
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/nodepools.hypershift.openshift.io/AAA_ungated.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/nodepools.hypershift.openshift.io/GCPPlatform.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/nodepools.hypershift.openshift.io/OpenStack.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • client/applyconfiguration/hypershift/v1beta1/azureimageregistrycredentials.go is excluded by !client/**
  • client/applyconfiguration/hypershift/v1beta1/azurenodepoolplatform.go is excluded by !client/**
  • client/applyconfiguration/utils.go is excluded by !client/**
  • cmd/install/assets/hypershift-operator/zz_generated.crd-manifests/nodepools-CustomNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • cmd/install/assets/hypershift-operator/zz_generated.crd-manifests/nodepools-Default.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • cmd/install/assets/hypershift-operator/zz_generated.crd-manifests/nodepools-TechPreviewNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • docs/content/reference/aggregated-docs.md is excluded by !docs/content/reference/aggregated-docs.md
  • docs/content/reference/api.md is excluded by !docs/content/reference/api.md
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/azure.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/zz_generated.deepcopy.go is excluded by !vendor/**, !**/vendor/**, !**/zz_generated*.go, !**/zz_generated*
📒 Files selected for processing (8)
  • api/.golangci.yml
  • api/hypershift/v1beta1/azure.go
  • api/hypershift/v1beta1/azure_test.go
  • hypershift-operator/controllers/nodepool/azure.go
  • hypershift-operator/controllers/nodepool/azure_acr.go
  • hypershift-operator/controllers/nodepool/azure_acr_test.go
  • hypershift-operator/controllers/nodepool/azure_test.go
  • hypershift-operator/controllers/nodepool/config.go

Comment thread hypershift-operator/controllers/nodepool/config.go
@twolff-gh twolff-gh changed the title WIP: feat(azure): enable ACR image pulls via managed identity on worker nodes feat(azure): enable ACR image pulls via managed identity on worker nodes Apr 14, 2026
@twolff-gh twolff-gh marked this pull request as ready for review April 14, 2026 22:34
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 14, 2026
@openshift-ci openshift-ci bot requested review from cblecker and jparrill April 14, 2026 22:34
twolff-gh added a commit to twolff-gh/hypershift that referenced this pull request Apr 14, 2026
…/R3)

Address bennerv review comments R2 and R3 on PR openshift#8205: make the
registries field optional (remove +required and MinItems=1), allow empty
registries to fall through to default wildcard patterns (*.azurecr.io,
etc.), and update tests accordingly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@openshift-ci openshift-ci bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 14, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
api/hypershift/v1beta1/azure.go (1)

139-153: Consider validating registries entries against the documented formats.

The docs say values should be ACR hostnames or wildcard patterns, but current validation accepts any non-empty string.

Suggested schema refinement
 	// +kubebuilder:validation:items:MinLength=1
 	// +kubebuilder:validation:items:MaxLength=256
+	// +kubebuilder:validation:items:Pattern=`^((\*\.)?[a-z0-9]([a-z0-9-]{0,251}[a-z0-9])?\.(azurecr\.io|azurecr\.cn|azurecr\.de|azurecr\.us))$`
 	// +listType=set
 	Registries []string `json:"registries,omitempty"`
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/hypershift/v1beta1/azure.go` around lines 139 - 153, The Registries slice
accepts any non-empty string but should only allow ACR hostnames or wildcard
patterns; update the Registries field validation by adding an items-level
kubebuilder pattern (e.g.
+kubebuilder:validation:items:Pattern="^(\*\.)?[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)+$")
so each entry matches either a hostname or a leading-*. wildcard hostname (apply
the pattern to the Registries field declaration and adjust the comment to
reflect the stricter schema), ensuring MinLength/MaxLength and MinItems/MaxItems
remain unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@api/hypershift/v1beta1/azure.go`:
- Around line 134-137: The current XValidation on the ManagedIdentity field only
enforces path shape and allows non-UUID subscription IDs; update the XValidation
rule for ManagedIdentity to validate the subscription segment is a GUID and
tighten other segments (resource group and identity name) with appropriate regex
constraints so malformed IDs are rejected at admission time; locate the
ManagedIdentity string tag in azure.go and replace the existing rule with a
stricter regex that enforces subscription GUID format (e.g. 8-4-4-4-12 hex
bytes) and valid resourceGroup/identity name characters and lengths.

---

Nitpick comments:
In `@api/hypershift/v1beta1/azure.go`:
- Around line 139-153: The Registries slice accepts any non-empty string but
should only allow ACR hostnames or wildcard patterns; update the Registries
field validation by adding an items-level kubebuilder pattern (e.g.
+kubebuilder:validation:items:Pattern="^(\*\.)?[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)+$")
so each entry matches either a hostname or a leading-*. wildcard hostname (apply
the pattern to the Registries field declaration and adjust the comment to
reflect the stricter schema), ensuring MinLength/MaxLength and MinItems/MaxItems
remain unchanged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 3b163967-081b-4fb7-bdb9-0fee70cf5f36

📥 Commits

Reviewing files that changed from the base of the PR and between 7d83d3e and 6ef165b.

⛔ Files ignored due to path filters (14)
  • api/hypershift/v1beta1/zz_generated.deepcopy.go is excluded by !**/zz_generated*.go, !**/zz_generated*
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/nodepools.hypershift.openshift.io/AAA_ungated.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/nodepools.hypershift.openshift.io/GCPPlatform.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/nodepools.hypershift.openshift.io/OpenStack.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • client/applyconfiguration/hypershift/v1beta1/azureimageregistrycredentials.go is excluded by !client/**
  • client/applyconfiguration/hypershift/v1beta1/azurenodepoolplatform.go is excluded by !client/**
  • client/applyconfiguration/utils.go is excluded by !client/**
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/nodepools-CustomNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/nodepools-Default.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/nodepools-TechPreviewNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • docs/content/reference/aggregated-docs.md is excluded by !docs/content/reference/aggregated-docs.md
  • docs/content/reference/api.md is excluded by !docs/content/reference/api.md
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/azure.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/zz_generated.deepcopy.go is excluded by !vendor/**, !**/vendor/**, !**/zz_generated*.go, !**/zz_generated*
📒 Files selected for processing (8)
  • api/.golangci.yml
  • api/hypershift/v1beta1/azure.go
  • api/hypershift/v1beta1/azure_test.go
  • hypershift-operator/controllers/nodepool/azure.go
  • hypershift-operator/controllers/nodepool/azure_acr.go
  • hypershift-operator/controllers/nodepool/azure_acr_test.go
  • hypershift-operator/controllers/nodepool/azure_test.go
  • hypershift-operator/controllers/nodepool/config.go
✅ Files skipped from review due to trivial changes (3)
  • hypershift-operator/controllers/nodepool/azure.go
  • api/.golangci.yml
  • hypershift-operator/controllers/nodepool/azure_acr.go
🚧 Files skipped from review as they are similar to previous changes (4)
  • hypershift-operator/controllers/nodepool/azure_test.go
  • hypershift-operator/controllers/nodepool/config.go
  • api/hypershift/v1beta1/azure_test.go
  • hypershift-operator/controllers/nodepool/azure_acr_test.go

Comment thread api/hypershift/v1beta1/azure.go Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
hypershift-operator/controllers/nodepool/azure_acr_test.go (1)

503-534: ⚠️ Potential issue | 🟡 Minor

Align test name with actual assertions in the “default wildcards only” case.

This case name claims wildcard-only verification, but the assertions only check generic MachineConfig markers. Either assert wildcard content explicitly or rename the case to match what it currently validates.

♻️ Minimal fix (rename to match current assertions)
-			name: "When registries is empty it should return a ConfigMap with default wildcards only",
+			name: "When registries is empty it should return a ConfigMap with credential provider MachineConfig",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hypershift-operator/controllers/nodepool/azure_acr_test.go` around lines 503
- 534, The test case name "When registries is empty it should return a ConfigMap
with default wildcards only" is misleading given the assertions in the validate
closure (they check for "50-acr-credential-provider" and "MachineConfig" in
mcYAML). Rename the test case title string to reflect what is actually asserted
(e.g., mention MachineConfig and 50-acr-credential-provider) OR modify the
validate function (the block referencing mcYAML and TokenSecretConfigKey) to
assert wildcard content explicitly; the minimal fix is to update the test case
name string to match the validate function's expectations.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@hypershift-operator/controllers/nodepool/azure_acr_test.go`:
- Around line 503-534: The test case name "When registries is empty it should
return a ConfigMap with default wildcards only" is misleading given the
assertions in the validate closure (they check for "50-acr-credential-provider"
and "MachineConfig" in mcYAML). Rename the test case title string to reflect
what is actually asserted (e.g., mention MachineConfig and
50-acr-credential-provider) OR modify the validate function (the block
referencing mcYAML and TokenSecretConfigKey) to assert wildcard content
explicitly; the minimal fix is to update the test case name string to match the
validate function's expectations.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 8b8b43b0-d345-448e-877c-8c1e0b186d55

📥 Commits

Reviewing files that changed from the base of the PR and between 6ef165b and 1419528.

⛔ Files ignored due to path filters (15)
  • api/hypershift/v1beta1/zz_generated.deepcopy.go is excluded by !**/zz_generated*.go, !**/zz_generated*
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/nodepools.hypershift.openshift.io/AAA_ungated.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/nodepools.hypershift.openshift.io/GCPPlatform.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/nodepools.hypershift.openshift.io/OpenStack.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • client/applyconfiguration/hypershift/v1beta1/azureimageregistrycredentials.go is excluded by !client/**
  • client/applyconfiguration/hypershift/v1beta1/azurenodepoolplatform.go is excluded by !client/**
  • client/applyconfiguration/utils.go is excluded by !client/**
  • cmd/install/assets/crds/hypershift-operator/tests/nodepools.hypershift.openshift.io/stable.nodepools.azure.testsuite.yaml is excluded by !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/nodepools-CustomNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/nodepools-Default.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/nodepools-TechPreviewNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • docs/content/reference/aggregated-docs.md is excluded by !docs/content/reference/aggregated-docs.md
  • docs/content/reference/api.md is excluded by !docs/content/reference/api.md
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/azure.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/zz_generated.deepcopy.go is excluded by !vendor/**, !**/vendor/**, !**/zz_generated*.go, !**/zz_generated*
📒 Files selected for processing (8)
  • api/.golangci.yml
  • api/hypershift/v1beta1/azure.go
  • api/hypershift/v1beta1/azure_test.go
  • hypershift-operator/controllers/nodepool/azure.go
  • hypershift-operator/controllers/nodepool/azure_acr.go
  • hypershift-operator/controllers/nodepool/azure_acr_test.go
  • hypershift-operator/controllers/nodepool/azure_test.go
  • hypershift-operator/controllers/nodepool/config.go
✅ Files skipped from review due to trivial changes (2)
  • hypershift-operator/controllers/nodepool/azure.go
  • hypershift-operator/controllers/nodepool/azure_acr.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • api/.golangci.yml
  • api/hypershift/v1beta1/azure.go

Comment thread api/.golangci.yml Outdated
Comment thread api/hypershift/v1beta1/azure.go Outdated
Comment thread api/hypershift/v1beta1/azure_test.go Outdated
// acrDefaultMatchImages lists the default wildcard patterns for Azure Container Registry
// endpoints across all Azure clouds. These patterns are always appended to the user-specified
// registries to ensure the credential provider covers all standard ACR endpoints.
var acrDefaultMatchImages = []string{
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can these not be defaulted in the api for the registries?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I found, these come from MCO credential provider config, if changed upstream, API default could go stale. Does that sound accurate?

Can or should we change this at the MCO config?

Comment thread api/hypershift/v1beta1/azure.go Outdated
Comment thread api/hypershift/v1beta1/azure.go Outdated
// multiple registries. When specified, these registries are added to the credential
// provider's matchImages list alongside the default wildcard patterns (*.azurecr.io,
// *.azurecr.cn, *.azurecr.de, *.azurecr.us). When omitted, only the default wildcard
// patterns are used, which cover all standard Azure Container Registry endpoints.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we call out a customer only gets 16 of the 20 spots since 4 are used for the wildcards?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It sounds like this is additive. If the hard limit is 20 including the defaults, you should limit this to only what the user can actually provide.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MaxItems=16 added. user gets 16 slots, 4 defaults added by controller.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the registries field. It was in the original WIP, but doesnt make sense in this final version.
Adds more complexity than necessary
The default patterns copied in plus the RBAC for ids should be enough

Comment thread api/hypershift/v1beta1/azure.go Outdated
Comment thread hypershift-operator/controllers/nodepool/azure.go Outdated

azureMachineTemplate.Template.Spec.SSHPublicKey = dummySSHKey

if nodePool.Spec.Platform.Azure.ImageRegistryCredentials != nil {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add comment documenting that identity assignment currently replaces (not appends), and should be refactored to append if additional identities are needed in the future.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

Comment thread hypershift-operator/controllers/nodepool/azure_acr.go Outdated
Comment thread hypershift-operator/controllers/nodepool/azure_acr_test.go Outdated
Comment thread api/hypershift/v1beta1/azure.go Outdated
// When configured, worker nodes will use the acr-credential-provider plugin to obtain
// short-lived tokens for ACR image pulls instead of relying on static pull secrets.
// Changing this field will trigger a node rollout.
// +optional
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if I don't configure this field?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added: When not configured, no additional image credential provider is set up and worker nodes use the default pull secret for image authentication.

Comment thread api/hypershift/v1beta1/azure.go Outdated
// multiple registries. When specified, these registries are added to the credential
// provider's matchImages list alongside the default wildcard patterns (*.azurecr.io,
// *.azurecr.cn, *.azurecr.de, *.azurecr.us). When omitted, only the default wildcard
// patterns are used, which cover all standard Azure Container Registry endpoints.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It sounds like this is additive. If the hard limit is 20 including the defaults, you should limit this to only what the user can actually provide.

Comment thread api/hypershift/v1beta1/azure.go Outdated
Comment thread api/hypershift/v1beta1/azure.go Outdated
Comment thread api/hypershift/v1beta1/azure.go Outdated
Comment on lines +142 to +144
// multiple registries. When specified, these registries are added to the credential
// provider's matchImages list alongside the default wildcard patterns (*.azurecr.io,
// *.azurecr.cn, *.azurecr.de, *.azurecr.us). When omitted, only the default wildcard
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I allowed to specify an entry that matches one of these default patterns?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Controller duplicates silently. Duplicate is dropped.
Does this work?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

De-duplication is fine - should we prevent users from specifying the wildcard patterns that are applied by default though? It doesn't look like there would be any value to the users to be able to specify those aside from unnecessarily consuming some of their 16 slots.

Do we anticipate these defaults potentially changing much in the future (i.e removing one of them)? If so, it could make sense to allow end-users to explicitly provide them to be more resilient to changes in the defaults.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the registries field. It was in the original WIP, but doesnt make sense in this final version.
Adds more complexity than necessary
The default patterns copied in plus the RBAC for ids should be enough

Comment thread api/hypershift/v1beta1/azure.go Outdated
@twolff-gh
Copy link
Copy Markdown
Author

/retest

Comment thread api/hypershift/v1beta1/azure.go Outdated
// +kubebuilder:validation:MaxItems=16
// +kubebuilder:validation:items:MinLength=1
// +kubebuilder:validation:items:MaxLength=253
// +kubebuilder:validation:items:Pattern=`^(\*\.)?[a-zA-Z0-9]([a-zA-Z0-9\-]*[a-zA-Z0-9])?(\.[a-zA-Z0-9]([a-zA-Z0-9\-]*[a-zA-Z0-9])?)+$`
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of using the Pattern marker, we generally recommend using the XValidation marker to enforce the pattern via a CEL expression and supplying a human-readable error message on validation failure.

Using the pattern marker means that the end user is provided the literal regex as part of the validation failure instead of a user-friendly message that describes the constraints.

Comment thread api/hypershift/v1beta1/azure.go Outdated
// Format: /subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.ManagedIdentity/userAssignedIdentities/{identityName}
//
// +required
ManagedIdentity AzureManagedIdentityResourceID `json:"managedIdentity,omitempty"`
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Include the length constraints for this field (enforced by the type) in the GoDoc here as well please

@cblecker
Copy link
Copy Markdown
Member

/uncc

@openshift-ci openshift-ci bot removed the request for review from cblecker April 16, 2026 16:06
Comment thread hypershift-operator/controllers/nodepool/azure_acr.go
@hypershift-jira-solve-ci
Copy link
Copy Markdown

Now I have all the evidence needed for the complete analysis. Here is the report:

Test Failure Analysis Complete

Job Information

  • Prow Job: Envtest OCP API Validation (GitHub Actions workflow)
  • Build ID: 24539978803
  • Run: #265
  • PR: #8205feat(azure): enable ACR image pulls via managed identity on worker nodes
  • Branch: azure-acr-mi-api
  • Failed Job: Envtest OCP (K8s 1.35.1)job/71743528982
  • Conclusion Job: Conclusionjob/71744135153 (gate — failed because envtest failed)
  • Test Results: 571 Passed | 1 Failed | 0 Pending | 10 Skipped

Test Failure Analysis

Error

[hypershift.openshift.io/v1beta1, Resource=hcpetcdbackups][FeatureSet="CustomNoUpgrade"]
[File=hcpetcdbackups-CustomNoUpgrade.crd.yaml] HCPEtcdBackup immutability
On Update — When S3 spec storage is changed it should fail

[FAILED] Timed out after 5.000s.
initial object should create successfully
Expected success, but got an error:
    <*errors.StatusError>:
    the server could not find the requested resource (post hcpetcdbackups.hypershift.openshift.io)

Summary

This failure is a pre-existing flaky test bug in the envtest framework, not caused by PR #8205's changes. The failing test (HCPEtcdBackup immutability) was introduced on 2026-04-15 by PR #8199 and tests CRD validation rules for hcpetcdbackups. PR #8205 does not modify any hcpetcdbackups code — it only changes nodepools CRD manifests and Azure ACR logic. The same test fails identically on an unrelated PR (autoscale-558-kubeletconfig-overflow, run 24534594723). A known fix is already open: PR #8261 (OCPBUGS-83585: Wait for CRD removal in GenerateCRDInstallTest to fix flaky envtest).

Root Cause

The root cause is a CRD lifecycle race condition in test/envtest/generator.go between GenerateCRDInstallTest and GenerateTestSuite.

Execution sequence in suite_test.go:

  1. GenerateCRDInstallTest("Default") — installs all Default-featureset CRDs, validates, uninstalls
  2. GenerateCRDInstallTest("TechPreviewNoUpgrade") — installs all TechPreviewNoUpgrade CRDs (including hcpetcdbackups.hypershift.openshift.io), validates, then calls UninstallCRDs without waiting for removal to complete
  3. Individual GenerateTestSuite tests start — the hcpetcdbackups-CustomNoUpgrade variant tries to install the CRD

The race:

  • GenerateCRDInstallTest (line ~254 in generator.go) calls envtest.UninstallCRDs() which only issues a DELETE API call — it does not wait for the CRD to be fully removed from the API server. The CRD enters Terminating state asynchronously.
  • When GenerateTestSuite's BeforeEach runs for hcpetcdbackups-CustomNoUpgrade.crd.yaml, it calls envtest.InstallCRDs(). The controller-runtime code checks if the CRD exists with a Get — it finds the CRD still present (in Terminating state), logs "CRD already exists, updating", and attempts an Update.
  • The Update to a terminating CRD may succeed from the API perspective, but the CRD will never serve resources — it is being torn down. The subsequent WaitForCRDs polls discovery for the resource, never finds it, and times out.
  • The test's Eventually wrapper retries InstallCRDs at 1-second intervals for 30 seconds, but each attempt hits the same stale CRD until deletion finally completes. The first test entry (When S3 spec storage is changed it should fail) then tries to Create an HCPEtcdBackup resource via a 5-second Eventually, but the CRD was never properly installed, resulting in the 404: the server could not find the requested resource error.

Why GenerateTestSuite doesn't have this bug:

The AfterEach(OncePerOrdered) in GenerateTestSuite (line ~218) correctly waits for CRD removal:

Eventually(func() error {
    return k8sClient.Get(ctx, client.ObjectKeyFromObject(crd), crd)
}, "30s").ShouldNot(Succeed())

GenerateCRDInstallTest is missing this wait, creating the race.

Why it's flaky (not deterministic):

The failure depends on how fast the API server finalizes CRD deletion, which varies by Kubernetes version and CI node load. Under light load, the CRD may be fully deleted before the next test starts. Under heavier load (or with slower K8s versions), the CRD lingers in Terminating state.

Recommendations
  1. Rebase PR feat(azure): enable ACR image pulls via managed identity on worker nodes #8205 onto main after PR #8261 merges. PR OCPBUGS-83585: Wait for CRD removal in GenerateCRDInstallTest to fix flaky envtest #8261 adds the missing wait-for-removal loop in GenerateCRDInstallTest, matching the pattern in GenerateTestSuite. This is the correct fix and is already under review.

  2. Re-run the failed workflow (/retest or re-trigger via GitHub Actions). Since this is a race condition, the test may pass on retry depending on CI timing. However, this is not a reliable long-term solution.

  3. No changes needed in PR feat(azure): enable ACR image pulls via managed identity on worker nodes #8205. The failure is entirely unrelated to the Azure ACR managed identity changes in this PR.

Evidence
Evidence Detail
Failing test HCPEtcdBackup immutability > On Update > When S3 spec storage is changed it should fail
Error the server could not find the requested resource (post hcpetcdbackups.hypershift.openshift.io) — CRD not installed
CRD state at install "CRD already exists, updating" — stale CRD from prior GenerateCRDInstallTest still in Terminating state
Timeout 5 seconds between install attempt (23:51:40) and failure (23:51:45)
PR #8205 files changed azure.go, nodepools CRDs, azure_acr.go — zero overlap with hcpetcdbackups
Same failure on unrelated PR run 24534594723 on branch autoscale-558-kubeletconfig-overflow — identical error, identical test
hcpetcdbackups tests introduced PR #8199, merged 2026-04-15 — 1 day before this failure
Known fix PR PR #8261 (OCPBUGS-83585) — adds wait-for-removal to GenerateCRDInstallTest
Bug location test/envtest/generator.go, GenerateCRDInstallTest() — calls UninstallCRDs without waiting for full removal
Correct pattern (already in codebase) GenerateTestSuite's AfterEach waits: Eventually(func() error { return k8sClient.Get(...) }, "30s").ShouldNot(Succeed())

twolff-gh and others added 5 commits April 17, 2026 10:25
…Identity

Add UserAssignedManagedIdentity struct with resourceID (required, ARM
format for CAPZ VMSS attachment) and clientID (optional, for credential
provider config). Azure IMDS docs describe userAssignedIdentityID as a
client ID; cloud-provider-azure auth_func.go accepts both formats. As
a first-party Azure service, we support both.

Add AzureImageRegistryCredentials on AzureNodePoolPlatform using the
new struct. No Registries field — controller uses hardcoded wildcard
patterns covering all standard ACR endpoints.

Signed-off-by: Todd Wolff <twolff@redhat.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Todd Wolff <twolff@redhat.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ntroller

When imageRegistryCredentials.managedIdentity is set on a NodePool:
- CAPI: AzureMachineTemplate gets Identity=UserAssigned with the MI
  resource ID on the worker VMSS
- MachineConfig: 50-acr-credential-provider generates kubelet
  CredentialProviderConfig with wildcard ACR patterns plus azure.json
  with IMDS-based managed identity authentication
- If clientID is provided, it is used in azure.json instead of
  resourceID (cloud-provider-azure accepts both formats)

Signed-off-by: Todd Wolff <twolff@redhat.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Envtest YAML suite with 3 cases: no imageRegistryCredentials (pass),
valid managedIdentity with resourceID (pass), invalid resourceID
format (fail with ARM resource ID error message).

Signed-off-by: Todd Wolff <twolff@redhat.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Todd Wolff <twolff@redhat.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci bot commented Apr 17, 2026

@twolff-gh: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/api Indicates the PR includes changes for the API area/cli Indicates the PR includes changes for CLI area/documentation Indicates the PR includes changes for documentation area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release area/platform/azure PR/issue for Azure (AzurePlatform) platform ok-to-test Indicates a non-member PR verified by an org member that is safe to test.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants