Skip to content

OCPBUGS-83790: change Azure workload identity webhook FailurePolicy from Fail to Ignore#8288

Open
bryan-cox wants to merge 1 commit intoopenshift:mainfrom
bryan-cox:azure-webhook-failure-policy
Open

OCPBUGS-83790: change Azure workload identity webhook FailurePolicy from Fail to Ignore#8288
bryan-cox wants to merge 1 commit intoopenshift:mainfrom
bryan-cox:azure-webhook-failure-policy

Conversation

@bryan-cox
Copy link
Copy Markdown
Member

@bryan-cox bryan-cox commented Apr 20, 2026

What this PR does / why we need it:

Changes the Azure workload identity MutatingWebhookConfiguration FailurePolicy from Fail to Ignore, matching the pattern already used by the AWS pod identity webhook.

The e2e-azure-self-managed presubmit has had a ~43% pass rate over the past 24 hours (3/7). Two of the three failures show the same pattern: EnsureNoCrashingPods detecting single restarts of openshift-oauth-apiserver and router during bootstrap across most hosted clusters. This is Azure-only — no other platform is seeing EnsureNoCrashingPods failures.

The most likely cause is a race condition during bootstrap. The Azure workload identity webhook is deployed as a sidecar in the KAS pod and waits for KAS /version before starting. Meanwhile, the HCCO registers the MutatingWebhookConfiguration in the guest cluster as soon as it can talk to KAS. With FailurePolicy: Fail, any pod creation matching azure.workload.identity/use: "true" during this window would be rejected, potentially causing downstream component restarts.

We were unable to confirm this definitively — the test framework only logs restartCount > 0 without capturing exit codes or termination reasons. However, the code structure supports this hypothesis:

  • The webhook sidecar intentionally delays startup until KAS is serving (deployment.go:41-54)
  • The webhook uses FailurePolicy: Fail while the equivalent AWS webhook uses Ignore
  • The ObjectSelector targets pods with the azure.workload.identity/use label
  • Failures are Azure-specific, consistent with an Azure-only component in the admission path

Pods that miss the mutation during the startup window will be re-created by their controllers once the webhook is ready. Note that there is a brief window where pods could run without Azure workload identity tokens injected — operators managing these pods should naturally retry on authentication failures.

Which issue(s) this PR fixes:

Fixes https://issues.redhat.com/browse/OCPBUGS-83790

Special notes for your reviewer:

The AWS pod identity webhook already uses FailurePolicy: Ignore here: resources.go:2534

A follow-up PR to convert the webhook to a native sidecar init container (using the existing NativeSidecarContainersEnabled pattern from token-minter-container.go) would provide a more architecturally correct solution by guaranteeing the webhook is ready before KAS starts serving.

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

…to Ignore

During hosted cluster bootstrap on Azure, there is a race condition between
the MutatingWebhookConfiguration being registered and the webhook sidecar
being ready to serve. The webhook sidecar waits for KAS to be available
before starting, but the HCCO registers the webhook configuration as soon
as it can talk to guest KAS. With FailurePolicy: Fail, any pod creation
matching the azure.workload.identity/use label during this window is
rejected, causing components like oauth-apiserver and router to restart.

This matches the pattern used by the AWS pod identity webhook, which
already uses FailurePolicy: Ignore. Pods that miss the mutation will be
re-created by their controllers once the webhook is ready.

Signed-off-by: Bryan Cox <brcox@redhat.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 20, 2026

📝 Walkthrough

Walkthrough

The changes update the Azure workload identity mutating webhook's failure policy from Fail to Ignore in both the implementation and its corresponding test. The FailurePolicy is changed in the reconcileAzureIdentityWebhook function, with the supporting variable renamed from failFailurePolicy to ignoreFailurePolicy. The test assertion is updated to expect admissionregistrationv1.Ignore instead of admissionregistrationv1.Fail. No other webhook field expectations are modified.

🚥 Pre-merge checks | ✅ 8 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Structure And Quality ⚠️ Warning Test violates single responsibility by combining ClusterRole, ClusterRoleBinding, and MutatingWebhookConfiguration verification in one test. Split TestReconcileAzureIdentityWebhook into separate tests for each resource type to follow single responsibility principle.
✅ Passed checks (8 passed)
Check name Status Explanation
Stable And Deterministic Test Names ✅ Passed Test file contains stable, deterministic test names with no dynamic values. PR modification only updates line 74 assertion, leaving all test names unchanged.
Microshift Test Compatibility ✅ Passed This PR does not add any new Ginkgo e2e tests. The modified test file contains only standard Go unit tests using testing.T and Gomega matchers, not Ginkgo framework tests.
Single Node Openshift (Sno) Test Compatibility ✅ Passed This pull request does not add any new Ginkgo e2e tests. The changes are limited to modifying existing unit tests that use Go's standard testing package and a fake Kubernetes client, specifically TestReconcileAzureIdentityWebhook and TestReconcileAzureIdentityWebhookIdempotent. Since the SNO compatibility check applies only to new Ginkgo e2e tests (using patterns like It(), Describe(), Context(), When()), and none are introduced in this PR, the check is not applicable and passes.
Topology-Aware Scheduling Compatibility ✅ Passed The PR modifies only the FailurePolicy field of a MutatingWebhookConfiguration from Fail to Ignore, which is an admission control setting unrelated to pod scheduling, deployment topology, or topology-aware constraints.
Ote Binary Stdout Contract ✅ Passed Changes are to standard Go tests and controller code, not OTE binaries. No process-level code or stdout writes present.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed The custom check for IPv6 and disconnected network compatibility is not applicable to this pull request. The PR modifies only unit tests (using standard Go testing.T), not Ginkgo e2e tests. The test file contains TestReconcileAzureIdentityWebhook and TestReconcileAzureIdentityWebhookIdempotent, which are standard Go unit tests that use Gomega for assertions but lack the defining Ginkgo patterns (It(), Describe(), Context(), When()). Since the custom check specifically applies to new Ginkgo e2e tests and this PR only updates existing unit test assertions, the check does not apply.
Title check ✅ Passed The title accurately describes the main change: updating the Azure workload identity webhook's FailurePolicy from Fail to Ignore, which is the core modification across both the implementation and test files.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot requested review from devguyio and muraee April 20, 2026 13:18
@openshift-ci openshift-ci Bot added the area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release label Apr 20, 2026
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Apr 20, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bryan-cox

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added area/platform/azure PR/issue for Azure (AzurePlatform) platform approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed do-not-merge/needs-area labels Apr 20, 2026
@bryan-cox bryan-cox changed the title fix(azure): change workload identity webhook FailurePolicy from Fail to Ignore OCPBUGS-83790: change Azure workload identity webhook FailurePolicy from Fail to Ignore Apr 20, 2026
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Apr 20, 2026
@openshift-ci-robot
Copy link
Copy Markdown

@bryan-cox: This pull request references Jira Issue OCPBUGS-83790, which is invalid:

  • expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

What this PR does / why we need it:

Changes the Azure workload identity MutatingWebhookConfiguration FailurePolicy from Fail to Ignore, matching the pattern already used by the AWS pod identity webhook.

The e2e-azure-self-managed presubmit has had a ~43% pass rate over the past 24 hours (3/7). Two of the three failures show the same pattern: EnsureNoCrashingPods detecting single restarts of openshift-oauth-apiserver and router during bootstrap across most hosted clusters. This is Azure-only — no other platform is seeing EnsureNoCrashingPods failures.

The most likely cause is a race condition during bootstrap. The Azure workload identity webhook is deployed as a sidecar in the KAS pod and waits for KAS /version before starting. Meanwhile, the HCCO registers the MutatingWebhookConfiguration in the guest cluster as soon as it can talk to KAS. With FailurePolicy: Fail, any pod creation matching azure.workload.identity/use: "true" during this window would be rejected, potentially causing downstream component restarts.

We were unable to confirm this definitively — the test framework only logs restartCount > 0 without capturing exit codes or termination reasons. However, the code structure supports this hypothesis:

  • The webhook sidecar intentionally delays startup until KAS is serving (deployment.go:41-54)
  • The webhook uses FailurePolicy: Fail while the equivalent AWS webhook uses Ignore
  • The ObjectSelector targets pods with the azure.workload.identity/use label
  • Failures are Azure-specific, consistent with an Azure-only component in the admission path

Pods that miss the mutation during the startup window will be re-created by their controllers once the webhook is ready. Note that there is a brief window where pods could run without Azure workload identity tokens injected — operators managing these pods should naturally retry on authentication failures.

Which issue(s) this PR fixes:

Fixes intermittent e2e-azure-self-managed presubmit failures caused by EnsureNoCrashingPods detecting single restarts of openshift-oauth-apiserver and router during bootstrap.

Special notes for your reviewer:

The AWS pod identity webhook already uses FailurePolicy: Ignore here: resources.go:2534

A follow-up PR to convert the webhook to a native sidecar init container (using the existing NativeSidecarContainersEnabled pattern from token-minter-container.go) would provide a more architecturally correct solution by guaranteeing the webhook is ready before KAS starts serving.

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@csrwng
Copy link
Copy Markdown
Contributor

csrwng commented Apr 20, 2026

/lgtm

@bryan-cox
Copy link
Copy Markdown
Member Author

/jira refresh

@openshift-ci-robot
Copy link
Copy Markdown

@bryan-cox: This pull request references Jira Issue OCPBUGS-83790, which is invalid:

  • expected the bug to target either version "5.0." or "openshift-5.0.", but it targets "4.23" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 20, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 35.70%. Comparing base (5c06422) to head (6ca11ce).
⚠️ Report is 12 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #8288   +/-   ##
=======================================
  Coverage   35.70%   35.70%           
=======================================
  Files         767      767           
  Lines       93401    93401           
=======================================
  Hits        33353    33353           
  Misses      57346    57346           
  Partials     2702     2702           
Files with missing lines Coverage Δ
...rconfigoperator/controllers/resources/resources.go 50.52% <100.00%> (ø)
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@bryan-cox
Copy link
Copy Markdown
Member Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Apr 20, 2026
@openshift-ci-robot
Copy link
Copy Markdown

@bryan-cox: This pull request references Jira Issue OCPBUGS-83790, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Apr 20, 2026
@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws

@bryan-cox
Copy link
Copy Markdown
Member Author

/verified by e2e

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Apr 20, 2026
@openshift-ci-robot
Copy link
Copy Markdown

@bryan-cox: This PR has been marked as verified by e2e.

Details

In response to this:

/verified by e2e

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@bryan-cox
Copy link
Copy Markdown
Member Author

/retest

@cwbotbot
Copy link
Copy Markdown

cwbotbot commented Apr 20, 2026

Test Results

e2e-aws

e2e-aks

@bryan-cox
Copy link
Copy Markdown
Member Author

/retest

@hypershift-jira-solve-ci
Copy link
Copy Markdown

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-azure-self-managed | Build: 2046220469312425984 | Cost: $3.0805913999999994 | Failed step: hypershift-azure-run-e2e-self-managed

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@openshift-merge-bot
Copy link
Copy Markdown
Contributor

/retest-required

Remaining retests: 0 against base HEAD 3097f65 and 2 for PR HEAD 6ca11ce in total

@hypershift-jira-solve-ci
Copy link
Copy Markdown

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-azure-self-managed | Build: 2046266470790336512 | Cost: $1.4363416500000001 | Failed step: hypershift-azure-run-e2e-self-managed

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@openshift-merge-bot
Copy link
Copy Markdown
Contributor

/retest-required

Remaining retests: 0 against base HEAD f61ce8f and 1 for PR HEAD 6ca11ce in total

@hypershift-jira-solve-ci
Copy link
Copy Markdown

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-azure-self-managed | Build: 2046295946198781952 | Cost: $2.621218949999999 | Failed step: hypershift-azure-run-e2e-self-managed

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@openshift-merge-bot
Copy link
Copy Markdown
Contributor

/retest-required

Remaining retests: 0 against base HEAD 9a6b857 and 0 for PR HEAD 6ca11ce in total

@openshift-merge-bot
Copy link
Copy Markdown
Contributor

/hold

Revision 6ca11ce was retested 3 times: holding

@openshift-ci openshift-ci Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 20, 2026
@bryan-cox
Copy link
Copy Markdown
Member Author

/retest

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Apr 21, 2026

@bryan-cox: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-kubevirt-aws-ovn-reduced 6ca11ce link true /test e2e-kubevirt-aws-ovn-reduced

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@hypershift-jira-solve-ci
Copy link
Copy Markdown

The analysis is complete. The background task that just finished was a supplementary search for ODF/Ceph diagnostics that I already covered through other artifact paths.

The report above is the final output — the root cause is an ODF/Ceph infrastructure flake on the management cluster where the Ceph RBD CSI provisioner failed to provision boot disk PVCs for the KubeVirt VMs, completely unrelated to the PR's Azure webhook FailurePolicy change. A /retest should resolve it.


Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release area/platform/azure PR/issue for Azure (AzurePlatform) platform do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. verified Signifies that the PR passed pre-merge verification criteria

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants