OCPBUGS-83790: change Azure workload identity webhook FailurePolicy from Fail to Ignore by bryan-cox · Pull Request #8288 · openshift/hypershift

bryan-cox · 2026-04-20T13:12:04Z

What this PR does / why we need it:

Changes the Azure workload identity MutatingWebhookConfiguration FailurePolicy from Fail to Ignore, matching the pattern already used by the AWS pod identity webhook.

The e2e-azure-self-managed presubmit has had a ~43% pass rate over the past 24 hours (3/7). Two of the three failures show the same pattern: EnsureNoCrashingPods detecting single restarts of openshift-oauth-apiserver and router during bootstrap across most hosted clusters. This is Azure-only — no other platform is seeing EnsureNoCrashingPods failures.

The most likely cause is a race condition during bootstrap. The Azure workload identity webhook is deployed as a sidecar in the KAS pod and waits for KAS /version before starting. Meanwhile, the HCCO registers the MutatingWebhookConfiguration in the guest cluster as soon as it can talk to KAS. With FailurePolicy: Fail, any pod creation matching azure.workload.identity/use: "true" during this window would be rejected, potentially causing downstream component restarts.

We were unable to confirm this definitively — the test framework only logs restartCount > 0 without capturing exit codes or termination reasons. However, the code structure supports this hypothesis:

The webhook sidecar intentionally delays startup until KAS is serving (deployment.go:41-54)
The webhook uses FailurePolicy: Fail while the equivalent AWS webhook uses Ignore
The ObjectSelector targets pods with the azure.workload.identity/use label
Failures are Azure-specific, consistent with an Azure-only component in the admission path

Pods that miss the mutation during the startup window will be re-created by their controllers once the webhook is ready. Note that there is a brief window where pods could run without Azure workload identity tokens injected — operators managing these pods should naturally retry on authentication failures.

Which issue(s) this PR fixes:

Fixes https://issues.redhat.com/browse/OCPBUGS-83790

Special notes for your reviewer:

The AWS pod identity webhook already uses FailurePolicy: Ignore here: resources.go:2534

A follow-up PR to convert the webhook to a native sidecar init container (using the existing NativeSidecarContainersEnabled pattern from token-minter-container.go) would provide a more architecturally correct solution by guaranteeing the webhook is ready before KAS starts serving.

Checklist:

Subject and description added to both, commit and PR.
Relevant issues have been referenced.
This change includes docs.
This change includes unit tests.

…to Ignore During hosted cluster bootstrap on Azure, there is a race condition between the MutatingWebhookConfiguration being registered and the webhook sidecar being ready to serve. The webhook sidecar waits for KAS to be available before starting, but the HCCO registers the webhook configuration as soon as it can talk to guest KAS. With FailurePolicy: Fail, any pod creation matching the azure.workload.identity/use label during this window is rejected, causing components like oauth-apiserver and router to restart. This matches the pattern used by the AWS pod identity webhook, which already uses FailurePolicy: Ignore. Pods that miss the mutation will be re-created by their controllers once the webhook is ready. Signed-off-by: Bryan Cox <brcox@redhat.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

openshift-merge-bot · 2026-04-20T13:12:06Z

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

coderabbitai · 2026-04-20T13:12:19Z

📝 Walkthrough

Walkthrough

The changes update the Azure workload identity mutating webhook's failure policy from Fail to Ignore in both the implementation and its corresponding test. The FailurePolicy is changed in the reconcileAzureIdentityWebhook function, with the supporting variable renamed from failFailurePolicy to ignoreFailurePolicy. The test assertion is updated to expect admissionregistrationv1.Ignore instead of admissionregistrationv1.Fail. No other webhook field expectations are modified.

🚥 Pre-merge checks | ✅ 8 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Structure And Quality	⚠️ Warning	Test violates single responsibility by combining ClusterRole, ClusterRoleBinding, and MutatingWebhookConfiguration verification in one test.	Split TestReconcileAzureIdentityWebhook into separate tests for each resource type to follow single responsibility principle.

✅ Passed checks (8 passed)

Check name	Status	Explanation
Stable And Deterministic Test Names	✅ Passed	Test file contains stable, deterministic test names with no dynamic values. PR modification only updates line 74 assertion, leaving all test names unchanged.
Microshift Test Compatibility	✅ Passed	This PR does not add any new Ginkgo e2e tests. The modified test file contains only standard Go unit tests using testing.T and Gomega matchers, not Ginkgo framework tests.
Single Node Openshift (Sno) Test Compatibility	✅ Passed	This pull request does not add any new Ginkgo e2e tests. The changes are limited to modifying existing unit tests that use Go's standard testing package and a fake Kubernetes client, specifically TestReconcileAzureIdentityWebhook and TestReconcileAzureIdentityWebhookIdempotent. Since the SNO compatibility check applies only to new Ginkgo e2e tests (using patterns like It(), Describe(), Context(), When()), and none are introduced in this PR, the check is not applicable and passes.
Topology-Aware Scheduling Compatibility	✅ Passed	The PR modifies only the FailurePolicy field of a MutatingWebhookConfiguration from Fail to Ignore, which is an admission control setting unrelated to pod scheduling, deployment topology, or topology-aware constraints.
Ote Binary Stdout Contract	✅ Passed	Changes are to standard Go tests and controller code, not OTE binaries. No process-level code or stdout writes present.
Ipv6 And Disconnected Network Test Compatibility	✅ Passed	The custom check for IPv6 and disconnected network compatibility is not applicable to this pull request. The PR modifies only unit tests (using standard Go testing.T), not Ginkgo e2e tests. The test file contains TestReconcileAzureIdentityWebhook and TestReconcileAzureIdentityWebhookIdempotent, which are standard Go unit tests that use Gomega for assertions but lack the defining Ginkgo patterns (It(), Describe(), Context(), When()). Since the custom check specifically applies to new Ginkgo e2e tests and this PR only updates existing unit test assertions, the check does not apply.
Title check	✅ Passed	The title accurately describes the main change: updating the Azure workload identity webhook's FailurePolicy from Fail to Ignore, which is the core modification across both the implementation and test files.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

openshift-ci · 2026-04-20T13:19:19Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bryan-cox

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [bryan-cox]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot · 2026-04-20T13:20:57Z

@bryan-cox: This pull request references Jira Issue OCPBUGS-83790, which is invalid:

expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

What this PR does / why we need it:

Changes the Azure workload identity MutatingWebhookConfiguration FailurePolicy from Fail to Ignore, matching the pattern already used by the AWS pod identity webhook.

The e2e-azure-self-managed presubmit has had a ~43% pass rate over the past 24 hours (3/7). Two of the three failures show the same pattern: EnsureNoCrashingPods detecting single restarts of openshift-oauth-apiserver and router during bootstrap across most hosted clusters. This is Azure-only — no other platform is seeing EnsureNoCrashingPods failures.

The most likely cause is a race condition during bootstrap. The Azure workload identity webhook is deployed as a sidecar in the KAS pod and waits for KAS /version before starting. Meanwhile, the HCCO registers the MutatingWebhookConfiguration in the guest cluster as soon as it can talk to KAS. With FailurePolicy: Fail, any pod creation matching azure.workload.identity/use: "true" during this window would be rejected, potentially causing downstream component restarts.

We were unable to confirm this definitively — the test framework only logs restartCount > 0 without capturing exit codes or termination reasons. However, the code structure supports this hypothesis:

The webhook sidecar intentionally delays startup until KAS is serving (deployment.go:41-54)

The webhook uses FailurePolicy: Fail while the equivalent AWS webhook uses Ignore

The ObjectSelector targets pods with the azure.workload.identity/use label

Failures are Azure-specific, consistent with an Azure-only component in the admission path

Pods that miss the mutation during the startup window will be re-created by their controllers once the webhook is ready. Note that there is a brief window where pods could run without Azure workload identity tokens injected — operators managing these pods should naturally retry on authentication failures.

Which issue(s) this PR fixes:

Fixes intermittent e2e-azure-self-managed presubmit failures caused by EnsureNoCrashingPods detecting single restarts of openshift-oauth-apiserver and router during bootstrap.

Special notes for your reviewer:

The AWS pod identity webhook already uses FailurePolicy: Ignore here: resources.go:2534

A follow-up PR to convert the webhook to a native sidecar init container (using the existing NativeSidecarContainersEnabled pattern from token-minter-container.go) would provide a more architecturally correct solution by guaranteeing the webhook is ready before KAS starts serving.

Checklist:

Subject and description added to both, commit and PR.

Relevant issues have been referenced.

This change includes docs.

This change includes unit tests.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

csrwng · 2026-04-20T13:26:18Z

/lgtm

bryan-cox · 2026-04-20T13:26:53Z

/jira refresh

openshift-ci-robot · 2026-04-20T13:26:58Z

@bryan-cox: This pull request references Jira Issue OCPBUGS-83790, which is invalid:

expected the bug to target either version "5.0." or "openshift-5.0.", but it targets "4.23" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

codecov · 2026-04-20T13:27:07Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 35.70%. Comparing base (5c06422) to head (6ca11ce).
⚠️ Report is 12 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #8288   +/-   ##
=======================================
  Coverage   35.70%   35.70%           
=======================================
  Files         767      767           
  Lines       93401    93401           
=======================================
  Hits        33353    33353           
  Misses      57346    57346           
  Partials     2702     2702

Files with missing lines	Coverage Δ
...rconfigoperator/controllers/resources/resources.go	`50.52% <100.00%> (ø)`

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

bryan-cox · 2026-04-20T13:27:40Z

/jira refresh

openshift-ci-robot · 2026-04-20T13:27:48Z

@bryan-cox: This pull request references Jira Issue OCPBUGS-83790, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (5.0.0) matches configured target version for branch (5.0.0)
bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-merge-bot · 2026-04-20T13:30:40Z

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws

bryan-cox · 2026-04-20T13:32:12Z

/verified by e2e

openshift-ci-robot · 2026-04-20T13:32:25Z

@bryan-cox: This PR has been marked as verified by e2e.

Details

In response to this:

/verified by e2e

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

bryan-cox · 2026-04-20T13:37:05Z

/retest

cwbotbot · 2026-04-20T15:20:03Z

Test Results

e2e-aws

Status: ✅ PASS
Started: 2026-04-20T13:33:00Z
View Job
View Job History

e2e-aks

Status: ✅ PASS
Started: 2026-04-20T13:33:01Z
View Job
View Job History

bryan-cox · 2026-04-20T15:20:58Z

/retest

hypershift-jira-solve-ci · 2026-04-20T15:32:55Z

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-azure-self-managed | Build: 2046220469312425984 | Cost: $3.0805913999999994 | Failed step: hypershift-azure-run-e2e-self-managed

View full analysis report

_{Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6}

openshift-merge-bot · 2026-04-20T16:31:52Z

/retest-required

Remaining retests: 0 against base HEAD 3097f65 and 2 for PR HEAD 6ca11ce in total

hypershift-jira-solve-ci · 2026-04-20T18:17:59Z

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-azure-self-managed | Build: 2046266470790336512 | Cost: $1.4363416500000001 | Failed step: hypershift-azure-run-e2e-self-managed

View full analysis report

_{Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6}

openshift-merge-bot · 2026-04-20T18:31:39Z

/retest-required

Remaining retests: 0 against base HEAD f61ce8f and 1 for PR HEAD 6ca11ce in total

hypershift-jira-solve-ci · 2026-04-20T20:37:58Z

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-azure-self-managed | Build: 2046295946198781952 | Cost: $2.621218949999999 | Failed step: hypershift-azure-run-e2e-self-managed

View full analysis report

_{Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6}

openshift-merge-bot · 2026-04-20T22:29:36Z

/retest-required

Remaining retests: 0 against base HEAD 9a6b857 and 0 for PR HEAD 6ca11ce in total

openshift-merge-bot · 2026-04-20T23:30:08Z

/hold

Revision 6ca11ce was retested 3 times: holding

bryan-cox · 2026-04-21T16:29:46Z

/retest

openshift-ci · 2026-04-21T20:21:56Z

@bryan-cox: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/e2e-kubevirt-aws-ovn-reduced	`6ca11ce`	link	true	`/test e2e-kubevirt-aws-ovn-reduced`

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

hypershift-jira-solve-ci · 2026-04-22T07:46:21Z

The analysis is complete. The background task that just finished was a supplementary search for ODF/Ceph diagnostics that I already covered through other artifact paths.

The report above is the final output — the root cause is an ODF/Ceph infrastructure flake on the management cluster where the Ceph RBD CSI provisioner failed to provision boot disk PVCs for the KubeVirt VMs, completely unrelated to the PR's Azure webhook FailurePolicy change. A /retest should resolve it.

openshift-ci Bot added the do-not-merge/needs-area label Apr 20, 2026

openshift-ci Bot requested review from devguyio and muraee April 20, 2026 13:18

openshift-ci Bot added the area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release label Apr 20, 2026

openshift-ci Bot added area/platform/azure PR/issue for Azure (AzurePlatform) platform approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed do-not-merge/needs-area labels Apr 20, 2026

bryan-cox changed the title ~~fix(azure): change workload identity webhook FailurePolicy from Fail to Ignore~~ OCPBUGS-83790: change Azure workload identity webhook FailurePolicy from Fail to Ignore Apr 20, 2026

openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Apr 20, 2026

openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Apr 20, 2026

openshift-ci Bot assigned csrwng Apr 20, 2026

openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Apr 20, 2026

openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Apr 20, 2026

bryan-cox mentioned this pull request Apr 20, 2026

fix: set CNV_SUBSCRIPTION_SOURCE for kubevirt AWS e2e jobs openshift/release#78056

Closed

openshift-ci Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 20, 2026

Conversation

bryan-cox commented Apr 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What this PR does / why we need it:

Which issue(s) this PR fixes:

Special notes for your reviewer:

Checklist:

Uh oh!

openshift-merge-bot Bot commented Apr 20, 2026

Uh oh!

coderabbitai Bot commented Apr 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

❌ Failed checks (2 warnings)

Uh oh!

openshift-ci Bot commented Apr 20, 2026

Uh oh!

openshift-ci-robot commented Apr 20, 2026

What this PR does / why we need it:

Which issue(s) this PR fixes:

Special notes for your reviewer:

Checklist:

Uh oh!

csrwng commented Apr 20, 2026

Uh oh!

bryan-cox commented Apr 20, 2026

Uh oh!

openshift-ci-robot commented Apr 20, 2026

Uh oh!

codecov Bot commented Apr 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

bryan-cox commented Apr 20, 2026

Uh oh!

openshift-ci-robot commented Apr 20, 2026

Uh oh!

openshift-merge-bot Bot commented Apr 20, 2026

Uh oh!

bryan-cox commented Apr 20, 2026

Uh oh!

openshift-ci-robot commented Apr 20, 2026

Uh oh!

bryan-cox commented Apr 20, 2026

Uh oh!

cwbotbot commented Apr 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Test Results

e2e-aws

e2e-aks

Uh oh!

bryan-cox commented Apr 20, 2026

Uh oh!

hypershift-jira-solve-ci Bot commented Apr 20, 2026

AI Test Failure Analysis

Uh oh!

openshift-merge-bot Bot commented Apr 20, 2026

Uh oh!

hypershift-jira-solve-ci Bot commented Apr 20, 2026

AI Test Failure Analysis

Uh oh!

openshift-merge-bot Bot commented Apr 20, 2026

Uh oh!

hypershift-jira-solve-ci Bot commented Apr 20, 2026

AI Test Failure Analysis

Uh oh!

openshift-merge-bot Bot commented Apr 20, 2026

Uh oh!

openshift-merge-bot Bot commented Apr 20, 2026

Uh oh!

bryan-cox commented Apr 21, 2026

Uh oh!

openshift-ci Bot commented Apr 21, 2026

Uh oh!

hypershift-jira-solve-ci Bot commented Apr 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

bryan-cox commented Apr 20, 2026 •

edited

Loading

coderabbitai Bot commented Apr 20, 2026 •

edited

Loading

codecov Bot commented Apr 20, 2026 •

edited

Loading

cwbotbot commented Apr 20, 2026 •

edited

Loading