SREP-4577: Fix webhook drain deadlock with preStop hook and failurePolicy: Ignore by dustman9000 · Pull Request #717 · openshift/addon-operator

dustman9000 · 2026-04-18T23:31:36Z

Summary

Adds a 5-second preStop sleep to the webhook deployment to allow endpoint deregistration before shutdown
Changes failurePolicy from Fail to Ignore on the ValidatingWebhookConfiguration to prevent API server hangs during drains

Problem

The addon-operator-webhooks pod gets stuck in Terminating during MCO node drains, blocking the drain for 1+ hour. This causes cascading alerts (ControlPlaneNodeUnschedulableSRE, ClusterMonitoringErrorBudgetBurnSRE) and requires SRE force-deletion to resolve.

Observed twice in 2 hours on the same cluster (rosaprod.gxp9.p1, af-south-1), on two different infra nodes.

Root Cause

Race condition during eviction: the pod stops serving requests but remains registered as a webhook backend. With failurePolicy: Fail, API calls to the dying pod hang rather than failing open. This prevents clean termination and blocks the MCO drain indefinitely.

Fix

Two changes that provide defense-in-depth:

preStop sleep (5s): Gives the endpoints controller time to remove the pod from the Service before it stops serving. Standard Kubernetes pattern for webhook pods.
failurePolicy: Ignore: If the webhook pod is unavailable during drain, the API server skips validation instead of hanging. The second replica continues serving most requests, so the validation gap is minimal (a few seconds during node drain).

Jira: https://redhat.atlassian.net/browse/SREP-4577

Test plan

Deploy to a test cluster with the webhook running on infra nodes
Cordon and drain an infra node hosting a webhook pod
Verify the pod terminates cleanly without blocking the drain
Verify webhook availability during the drain (second replica serves requests)
Verify addon CRD validation still works under normal conditions

Summary by CodeRabbit

Chores
- Webhook admission behavior changed to be tolerant of webhook failures: when the admission webhook is unavailable or errors, API requests are allowed to proceed instead of being blocked.
- Webhook service now delays shutdown briefly (small pre-stop pause) to reduce connection interruptions and improve stability during updates or rolling restarts.

The addon-operator-webhooks pod gets stuck in Terminating during MCO node drains, blocking the drain for 1+ hour and causing cascading alerts (ControlPlaneNodeUnschedulableSRE, ClusterMonitoringErrorBudgetBurnSRE). The root cause is a race condition: when the pod is evicted, it stops serving but remains registered as a webhook backend. With failurePolicy: Fail, API calls hang waiting for the dying pod. Adding a 5s preStop sleep gives the endpoints controller time to remove the pod from the Service before it stops serving, breaking the deadlock. Observed twice in 2 hours on rosaprod.gxp9.p1 (af-south-1). Both times required force-deleting the stuck pod to unblock the drain. Ref: https://redhat.atlassian.net/browse/SREP-4577

openshift-ci-robot · 2026-04-18T23:31:40Z

@dustman9000: This pull request references SREP-4577 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Summary

Adds a 5-second preStop sleep to the webhook deployment

Prevents webhook pods from getting stuck in Terminating during MCO node drains

Problem

The addon-operator-webhooks pod gets stuck in Terminating during MCO node drains, blocking the drain for 1+ hour. This causes cascading alerts (ControlPlaneNodeUnschedulableSRE, ClusterMonitoringErrorBudgetBurnSRE) and requires SRE force-deletion to resolve.

Observed twice in 2 hours on the same cluster (rosaprod.gxp9.p1, af-south-1), on two different infra nodes.

Root Cause

Race condition during eviction: the pod stops serving requests but remains registered as a webhook backend. With failurePolicy: Fail on the validating webhook, API calls hang on the dying pod, preventing clean termination. The kubelet cannot complete shutdown because connections stay open.

Fix

A 5-second preStop sleep gives the endpoints controller time to remove the pod from the Service before it stops serving. This is a well-established Kubernetes pattern for webhook and API server pods.

Jira: https://redhat.atlassian.net/browse/SREP-4577

Test plan

Deploy to a test cluster with the webhook running on infra nodes

Cordon and drain an infra node hosting a webhook pod

Verify the pod terminates cleanly without blocking the drain

Verify webhook availability during the drain (second replica serves requests)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

coderabbitai · 2026-04-18T23:31:49Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 026a0e05-6f96-4fec-bb6d-97fd827c7ddd

📥 Commits

Reviewing files that changed from the base of the PR and between 0ac2115 and 3d968da.

📒 Files selected for processing (4)

config/templates/csv-template.yaml
deploy-extras/development/webhook/validatingwebhookconfig.yaml
deploy_pko/ValidatingWebhookConfiguration-addons.yaml
hack/webhookdefinition.yaml

✅ Files skipped from review due to trivial changes (1)

deploy_pko/ValidatingWebhookConfiguration-addons.yaml

🚧 Files skipped from review as they are similar to previous changes (2)

config/templates/csv-template.yaml
hack/webhookdefinition.yaml

Walkthrough

Added a preStop lifecycle hook (sleep 5) to the webhook Deployment and changed multiple validating webhook failurePolicy values from Fail to Ignore, altering admission behavior when webhooks are unavailable.

Changes

Cohort / File(s)	Summary
Deployment lifecycle change `deploy/70_webhook-deployment.yaml`	Added a `lifecycle.preStop.exec` hook to the `webhook` container to run `sleep 5` before termination.
ValidatingWebhook failurePolicy updates `bundle/manifests/addon-operator.clusterserviceversion.yaml`, `config/templates/csv-template.yaml`, `deploy-extras/development/webhook/validatingwebhookconfig.yaml`, `deploy_pko/ValidatingWebhookConfiguration-addons.yaml`, `hack/webhookdefinition.yaml`	Changed `failurePolicy` for validating webhook entries from `Fail` → `Ignore` across CSV, template, and multiple webhook manifests; no other webhook fields were modified.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant API_Server as API Server
    participant Webhook as Webhook Server
    participant Kubelet
    participant Pod as Webhook Pod

    Client->>API_Server: submit object (create/update)
    API_Server->>Webhook: call validating webhook
    alt Old (failurePolicy: Fail)
        Webhook--x API_Server: timeout/error
        API_Server->>Client: reject request (admission failed)
    else New (failurePolicy: Ignore)
        Webhook--x API_Server: timeout/error
        API_Server->>Client: accept request (proceed despite webhook failure)
    end

    Note over Kubelet,Pod: Pod termination sequence
    Kubelet->>Pod: SIGTERM
    Pod->>Pod: preStop hook (sleep 5)
    Pod-->>Kubelet: container exits after preStop completes

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 10

✅ Passed checks (10 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the main changes: adding a preStop hook and changing failurePolicy to Ignore to fix a webhook drain deadlock issue.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Stable And Deterministic Test Names	✅ Passed	PR modifies only Kubernetes YAML manifest files without introducing any Ginkgo test files, and the repository uses Go's standard testing.T framework, making this check not applicable.
Test Structure And Quality	✅ Passed	Webhook test files use standard Go testing and testify frameworks, not Ginkgo patterns, making Ginkgo check requirements inapplicable.
Microshift Test Compatibility	✅ Passed	This PR does not add any new Ginkgo e2e tests. Only Kubernetes manifest files for webhook deployments and configurations were modified.
Single Node Openshift (Sno) Test Compatibility	✅ Passed	PR contains only YAML configuration changes to webhooks and manifests. No new Ginkgo e2e tests were added, so SNO compatibility check is not applicable.
Topology-Aware Scheduling Compatibility	✅ Passed	PR does not introduce topology-incompatible scheduling constraints; only adds preStop hook and modifies webhook failurePolicy.
Ote Binary Stdout Contract	✅ Passed	PR modifies only Kubernetes YAML manifests with infrastructure configuration changes; no Go source code or binary entry points modified, so no OTE Binary Stdout Contract risk.
Ipv6 And Disconnected Network Test Compatibility	✅ Passed	PR modifies only Kubernetes manifest configuration files with no new test code added, making the IPv6 and disconnected network test compatibility check not applicable.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

openshift-ci · 2026-04-18T23:31:53Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: dustman9000
Once this PR has been reviewed and has the lgtm label, please assign eqrx for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

With failurePolicy: Fail, the API server hangs when the webhook pod is being evicted during node drains. The dying pod cannot serve requests but is still registered as a backend, creating a deadlock that blocks MCO drains for 1+ hour. Changing to failurePolicy: Ignore means the API server skips validation during the brief window when the webhook is unavailable, rather than blocking the entire node drain. The second replica continues serving, so most requests still get validated. Combined with the preStop sleep from the previous commit, this provides defense-in-depth: the preStop gives time for endpoint deregistration, and failurePolicy: Ignore prevents hangs if that race is still hit. Ref: https://redhat.atlassian.net/browse/SREP-4577

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@config/templates/csv-template.yaml`:
- Line 56: The CSV webhook configuration currently sets failurePolicy: Ignore in
config/templates/csv-template.yaml which makes the webhook fail-open; either
explicitly document that this tradeoff is intentional in the manifest (add a
clear comment near failurePolicy explaining why failing open is acceptable and
list compensating controls) or change the webhook failurePolicy to Fail to
enforce validation on webhook unavailability; locate the webhook spec
(failurePolicy field in the CSV/webhook configuration) and either replace Ignore
with Fail or add the explicit justification and link to compensating controls
(replicas, preStop, monitoring/alerting) in the template.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: ccee0ca4-d87c-43db-8c1c-ea0afcd92857

📥 Commits

Reviewing files that changed from the base of the PR and between 4588aea and 0ac2115.

📒 Files selected for processing (5)

bundle/manifests/addon-operator.clusterserviceversion.yaml
config/templates/csv-template.yaml
deploy-extras/development/webhook/validatingwebhookconfig.yaml
deploy_pko/ValidatingWebhookConfiguration-addons.yaml
hack/webhookdefinition.yaml

✅ Files skipped from review due to trivial changes (2)

hack/webhookdefinition.yaml
deploy_pko/ValidatingWebhookConfiguration-addons.yaml

codecov-commenter · 2026-04-19T00:00:35Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 59.24%. Comparing base (dfaf819) to head (3d968da).

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #717   +/-   ##
=======================================
  Coverage   59.24%   59.24%           
=======================================
  Files          62       62           
  Lines        4125     4125           
=======================================
  Hits         2444     2444           
  Misses       1532     1532           
  Partials      149      149

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

openshift-ci · 2026-04-19T00:24:59Z

@dustman9000: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

erdii · 2026-04-20T08:27:28Z

Hey Dustin - thanks for the submission!

Are you sure that this PR actually fixes the node drain issue you've described?
Switching the webhook failure policy to Ignore should not change the drain behaviour because it just unblocks CREATE/UPDATE calls to the Addon API and the pre stop exec sleep makes endpoint deregistration cleaner but does it really help to unblock pod eviction?

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 18, 2026

openshift-ci Bot requested review from ashishmax31 and venkateshsredhat April 18, 2026 23:31

dustman9000 changed the title ~~SREP-4577: Add preStop hook to webhook pod to prevent drain deadlock~~ SREP-4577: Fix webhook drain deadlock with preStop hook and failurePolicy: Ignore Apr 18, 2026

coderabbitai Bot reviewed Apr 18, 2026

View reviewed changes

Comment thread config/templates/csv-template.yaml

Add justification comments for failurePolicy: Ignore

3d968da

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SREP-4577: Fix webhook drain deadlock with preStop hook and failurePolicy: Ignore#717

SREP-4577: Fix webhook drain deadlock with preStop hook and failurePolicy: Ignore#717
dustman9000 wants to merge 3 commits intoopenshift:mainfrom
dustman9000:drow/fix-webhook-drain-deadlock

dustman9000 commented Apr 18, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

openshift-ci-robot commented Apr 18, 2026 •

edited by openshift-ci Bot

Loading

Summary

Problem

Root Cause

Fix

Test plan

Uh oh!

coderabbitai Bot commented Apr 18, 2026 •

edited

Loading

Uh oh!

openshift-ci Bot commented Apr 18, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

codecov-commenter commented Apr 19, 2026 •

edited

Loading

Uh oh!

openshift-ci Bot commented Apr 19, 2026

Uh oh!

erdii commented Apr 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

dustman9000 commented Apr 18, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Problem

Root Cause

Fix

Test plan

Summary by CodeRabbit

Uh oh!

openshift-ci-robot commented Apr 18, 2026 • edited by openshift-ci Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Problem

Root Cause

Fix

Test plan

Uh oh!

coderabbitai Bot commented Apr 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Uh oh!

openshift-ci Bot commented Apr 18, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

codecov-commenter commented Apr 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

openshift-ci Bot commented Apr 19, 2026

Uh oh!

erdii commented Apr 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

dustman9000 commented Apr 18, 2026 •

edited by coderabbitai Bot

Loading

openshift-ci-robot commented Apr 18, 2026 •

edited by openshift-ci Bot

Loading

coderabbitai Bot commented Apr 18, 2026 •

edited

Loading

codecov-commenter commented Apr 19, 2026 •

edited

Loading