OCPBUGS-83790: change Azure workload identity webhook FailurePolicy from Fail to Ignore#8288
bryan-cox wants to merge 1 commit into openshift:main
…to Ignore

During hosted cluster bootstrap on Azure, there is a race condition between the MutatingWebhookConfiguration being registered and the webhook sidecar being ready to serve. The webhook sidecar waits for KAS to be available before starting, but the HCCO registers the webhook configuration as soon as it can talk to guest KAS. With FailurePolicy: Fail, any pod creation matching the azure.workload.identity/use label during this window is rejected, causing components like oauth-apiserver and router to restart.

This matches the pattern used by the AWS pod identity webhook, which already uses FailurePolicy: Ignore. Pods that miss the mutation will be re-created by their controllers once the webhook is ready.

Signed-off-by: Bryan Cox <brcox@redhat.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bryan-cox
@bryan-cox: This pull request references Jira Issue OCPBUGS-83790, which is invalid:
The bug has been updated to refer to the pull request using the external bug tracker.

/lgtm

/jira refresh
Codecov Report
✅ All modified and coverable lines are covered by tests.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #8288   +/-   ##
=======================================
  Coverage   35.70%   35.70%
=======================================
  Files         767      767
  Lines       93401    93401
=======================================
  Hits        33353    33353
  Misses      57346    57346
  Partials     2702     2702
/jira refresh
@bryan-cox: This pull request references Jira Issue OCPBUGS-83790, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
/verified by e2e

@bryan-cox: This PR has been marked as verified by
/retest

/retest
AI Test Failure Analysis
Job:
Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6
/hold
Revision 6ca11ce was retested 3 times: holding

/retest
@bryan-cox: The following test failed, say
The analysis is complete. The background task that just finished was a supplementary search for ODF/Ceph diagnostics that I already covered through other artifact paths. The report above is the final output: the root cause is an ODF/Ceph infrastructure flake on the management cluster, where the Ceph RBD CSI provisioner failed to provision boot disk PVCs for the KubeVirt VMs. It is completely unrelated to the PR's Azure webhook FailurePolicy change.
What this PR does / why we need it:
Changes the Azure workload identity MutatingWebhookConfiguration FailurePolicy from Fail to Ignore, matching the pattern already used by the AWS pod identity webhook.

The e2e-azure-self-managed presubmit has had a ~43% pass rate over the past 24 hours (3/7). Two of the three failures show the same pattern: EnsureNoCrashingPods detecting single restarts of openshift-oauth-apiserver and router during bootstrap across most hosted clusters. This is Azure-only: no other platform is seeing EnsureNoCrashingPods failures.

The most likely cause is a race condition during bootstrap. The Azure workload identity webhook is deployed as a sidecar in the KAS pod and waits for KAS /version before starting. Meanwhile, the HCCO registers the MutatingWebhookConfiguration in the guest cluster as soon as it can talk to KAS. With FailurePolicy: Fail, any pod creation matching azure.workload.identity/use: "true" during this window would be rejected, potentially causing downstream component restarts.

We were unable to confirm this definitively: the test framework only logs restartCount > 0 without capturing exit codes or termination reasons. However, the code structure supports this hypothesis:

- … (deployment.go:41-54)
- FailurePolicy: Fail while the equivalent AWS webhook uses Ignore
- ObjectSelector targets pods with the azure.workload.identity/use label

Pods that miss the mutation during the startup window will be re-created by their controllers once the webhook is ready. Note that there is a brief window where pods could run without Azure workload identity tokens injected; operators managing these pods should naturally retry on authentication failures.
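To make the hypothesis concrete, the following toy model shows how the two failure policies differ when the webhook cannot be reached. It is a sketch under stated assumptions, not HyperShift or kube-apiserver code: the types Pod and Decision and the functions callWebhook and admit are hypothetical, and only the label key and the Fail/Ignore values come from this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// FailurePolicy mirrors the two values Kubernetes accepts for a webhook's failurePolicy.
type FailurePolicy string

const (
	Fail   FailurePolicy = "Fail"
	Ignore FailurePolicy = "Ignore"
)

// useLabel is the label key the ObjectSelector matches, per the PR description.
const useLabel = "azure.workload.identity/use"

// Pod is a hypothetical stand-in for a pod creation request.
type Pod struct {
	Name   string
	Labels map[string]string
}

// Decision is the outcome of the toy admission step.
type Decision struct {
	Allowed bool // the pod is created
	Mutated bool // the webhook injected the workload identity settings
}

var errWebhookUnavailable = errors.New("webhook sidecar not serving yet")

// callWebhook simulates the API server calling the sidecar.
func callWebhook(ready bool) error {
	if !ready {
		return errWebhookUnavailable
	}
	return nil
}

// admit models one pod creation passing through the mutating webhook.
func admit(policy FailurePolicy, pod Pod, webhookReady bool) Decision {
	if pod.Labels[useLabel] != "true" {
		// Selector does not match: the webhook is never called.
		return Decision{Allowed: true}
	}
	if err := callWebhook(webhookReady); err != nil {
		// A call error is where the failure policy decides the outcome:
		// Fail rejects the pod, Ignore admits it unmutated.
		return Decision{Allowed: policy == Ignore}
	}
	return Decision{Allowed: true, Mutated: true}
}

func main() {
	pod := Pod{Name: "router", Labels: map[string]string{useLabel: "true"}}
	for _, p := range []FailurePolicy{Fail, Ignore} {
		d := admit(p, pod, false) // webhook registered but sidecar not ready
		fmt.Printf("policy=%s allowed=%t mutated=%t\n", p, d.Allowed, d.Mutated)
	}
}
```

With the sidecar not ready, the model rejects the labeled pod under Fail and admits it unmutated under Ignore, which is the trade-off described above: no rejected pod creations during bootstrap, at the cost of a short window in which a pod can start without injected tokens.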
Which issue(s) this PR fixes:
Fixes https://issues.redhat.com/browse/OCPBUGS-83790
Special notes for your reviewer:
The AWS pod identity webhook already uses FailurePolicy: Ignore here: resources.go:2534

A follow-up PR to convert the webhook to a native sidecar init container (using the existing NativeSidecarContainersEnabled pattern from token-minter-container.go) would provide a more architecturally correct solution by guaranteeing the webhook is ready before KAS starts serving.