Skip to content

SREP-4577: Fix webhook drain deadlock with preStop hook and failurePolicy: Ignore#717

Open
dustman9000 wants to merge 3 commits intoopenshift:mainfrom
dustman9000:drow/fix-webhook-drain-deadlock
Open

SREP-4577: Fix webhook drain deadlock with preStop hook and failurePolicy: Ignore#717
dustman9000 wants to merge 3 commits intoopenshift:mainfrom
dustman9000:drow/fix-webhook-drain-deadlock

Conversation

@dustman9000
Copy link
Copy Markdown
Member

@dustman9000 dustman9000 commented Apr 18, 2026

Summary

  • Adds a 5-second preStop sleep to the webhook deployment to allow endpoint deregistration before shutdown
  • Changes failurePolicy from Fail to Ignore on the ValidatingWebhookConfiguration to prevent API server hangs during drains

Problem

The addon-operator-webhooks pod gets stuck in Terminating during MCO node drains, blocking the drain for 1+ hour. This causes cascading alerts (ControlPlaneNodeUnschedulableSRE, ClusterMonitoringErrorBudgetBurnSRE) and requires SRE force-deletion to resolve.

Observed twice in 2 hours on the same cluster (rosaprod.gxp9.p1, af-south-1), on two different infra nodes.

Root Cause

Race condition during eviction: the pod stops serving requests but remains registered as a webhook backend. With failurePolicy: Fail, API calls to the dying pod hang rather than failing open. This prevents clean termination and blocks the MCO drain indefinitely.

Fix

Two changes that provide defense-in-depth:

  1. preStop sleep (5s): Gives the endpoints controller time to remove the pod from the Service before it stops serving. Standard Kubernetes pattern for webhook pods.

  2. failurePolicy: Ignore: If the webhook pod is unavailable during drain, the API server skips validation instead of hanging. The second replica continues serving most requests, so the validation gap is minimal (a few seconds during node drain).

Jira: https://redhat.atlassian.net/browse/SREP-4577

Test plan

  • Deploy to a test cluster with the webhook running on infra nodes
  • Cordon and drain an infra node hosting a webhook pod
  • Verify the pod terminates cleanly without blocking the drain
  • Verify webhook availability during the drain (second replica serves requests)
  • Verify addon CRD validation still works under normal conditions

Summary by CodeRabbit

  • Chores
    • Webhook admission behavior changed to be tolerant of webhook failures: when the admission webhook is unavailable or errors, API requests are allowed to proceed instead of being blocked.
    • Webhook service now delays shutdown briefly (small pre-stop pause) to reduce connection interruptions and improve stability during updates or rolling restarts.

The addon-operator-webhooks pod gets stuck in Terminating during MCO
node drains, blocking the drain for 1+ hour and causing cascading
alerts (ControlPlaneNodeUnschedulableSRE, ClusterMonitoringErrorBudgetBurnSRE).

The root cause is a race condition: when the pod is evicted, it stops
serving but remains registered as a webhook backend. With failurePolicy:
Fail, API calls hang waiting for the dying pod. Adding a 5s preStop
sleep gives the endpoints controller time to remove the pod from the
Service before it stops serving, breaking the deadlock.

Observed twice in 2 hours on rosaprod.gxp9.p1 (af-south-1). Both times
required force-deleting the stuck pod to unblock the drain.

Ref: https://redhat.atlassian.net/browse/SREP-4577
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 18, 2026
@openshift-ci-robot
Copy link
Copy Markdown

openshift-ci-robot commented Apr 18, 2026

@dustman9000: This pull request references SREP-4577 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Summary

  • Adds a 5-second preStop sleep to the webhook deployment
  • Prevents webhook pods from getting stuck in Terminating during MCO node drains

Problem

The addon-operator-webhooks pod gets stuck in Terminating during MCO node drains, blocking the drain for 1+ hour. This causes cascading alerts (ControlPlaneNodeUnschedulableSRE, ClusterMonitoringErrorBudgetBurnSRE) and requires SRE force-deletion to resolve.

Observed twice in 2 hours on the same cluster (rosaprod.gxp9.p1, af-south-1), on two different infra nodes.

Root Cause

Race condition during eviction: the pod stops serving requests but remains registered as a webhook backend. With failurePolicy: Fail on the validating webhook, API calls hang on the dying pod, preventing clean termination. The kubelet cannot complete shutdown because connections stay open.

Fix

A 5-second preStop sleep gives the endpoints controller time to remove the pod from the Service before it stops serving. This is a well-established Kubernetes pattern for webhook and API server pods.

Jira: https://redhat.atlassian.net/browse/SREP-4577

Test plan

  • Deploy to a test cluster with the webhook running on infra nodes
  • Cordon and drain an infra node hosting a webhook pod
  • Verify the pod terminates cleanly without blocking the drain
  • Verify webhook availability during the drain (second replica serves requests)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 18, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 026a0e05-6f96-4fec-bb6d-97fd827c7ddd

📥 Commits

Reviewing files that changed from the base of the PR and between 0ac2115 and 3d968da.

📒 Files selected for processing (4)
  • config/templates/csv-template.yaml
  • deploy-extras/development/webhook/validatingwebhookconfig.yaml
  • deploy_pko/ValidatingWebhookConfiguration-addons.yaml
  • hack/webhookdefinition.yaml
✅ Files skipped from review due to trivial changes (1)
  • deploy_pko/ValidatingWebhookConfiguration-addons.yaml
🚧 Files skipped from review as they are similar to previous changes (2)
  • config/templates/csv-template.yaml
  • hack/webhookdefinition.yaml

Walkthrough

Added a preStop lifecycle hook (sleep 5) to the webhook Deployment and changed multiple validating webhook failurePolicy values from Fail to Ignore, altering admission behavior when webhooks are unavailable.

Changes

Cohort / File(s) Summary
Deployment lifecycle change
deploy/70_webhook-deployment.yaml
Added a lifecycle.preStop.exec hook to the webhook container to run sleep 5 before termination.
ValidatingWebhook failurePolicy updates
bundle/manifests/addon-operator.clusterserviceversion.yaml, config/templates/csv-template.yaml, deploy-extras/development/webhook/validatingwebhookconfig.yaml, deploy_pko/ValidatingWebhookConfiguration-addons.yaml, hack/webhookdefinition.yaml
Changed failurePolicy for validating webhook entries from FailIgnore across CSV, template, and multiple webhook manifests; no other webhook fields were modified.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant API_Server as API Server
    participant Webhook as Webhook Server
    participant Kubelet
    participant Pod as Webhook Pod

    Client->>API_Server: submit object (create/update)
    API_Server->>Webhook: call validating webhook
    alt Old (failurePolicy: Fail)
        Webhook--x API_Server: timeout/error
        API_Server->>Client: reject request (admission failed)
    else New (failurePolicy: Ignore)
        Webhook--x API_Server: timeout/error
        API_Server->>Client: accept request (proceed despite webhook failure)
    end

    Note over Kubelet,Pod: Pod termination sequence
    Kubelet->>Pod: SIGTERM
    Pod->>Pod: preStop hook (sleep 5)
    Pod-->>Kubelet: container exits after preStop completes
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 10
✅ Passed checks (10 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main changes: adding a preStop hook and changing failurePolicy to Ignore to fix a webhook drain deadlock issue.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Stable And Deterministic Test Names ✅ Passed PR modifies only Kubernetes YAML manifest files without introducing any Ginkgo test files, and the repository uses Go's standard testing.T framework, making this check not applicable.
Test Structure And Quality ✅ Passed Webhook test files use standard Go testing and testify frameworks, not Ginkgo patterns, making Ginkgo check requirements inapplicable.
Microshift Test Compatibility ✅ Passed This PR does not add any new Ginkgo e2e tests. Only Kubernetes manifest files for webhook deployments and configurations were modified.
Single Node Openshift (Sno) Test Compatibility ✅ Passed PR contains only YAML configuration changes to webhooks and manifests. No new Ginkgo e2e tests were added, so SNO compatibility check is not applicable.
Topology-Aware Scheduling Compatibility ✅ Passed PR does not introduce topology-incompatible scheduling constraints; only adds preStop hook and modifies webhook failurePolicy.
Ote Binary Stdout Contract ✅ Passed PR modifies only Kubernetes YAML manifests with infrastructure configuration changes; no Go source code or binary entry points modified, so no OTE Binary Stdout Contract risk.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed PR modifies only Kubernetes manifest configuration files with no new test code added, making the IPv6 and disconnected network test compatibility check not applicable.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Apr 18, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: dustman9000
Once this PR has been reviewed and has the lgtm label, please assign eqrx for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

With failurePolicy: Fail, the API server hangs when the webhook pod is
being evicted during node drains. The dying pod cannot serve requests but
is still registered as a backend, creating a deadlock that blocks MCO
drains for 1+ hour.

Changing to failurePolicy: Ignore means the API server skips validation
during the brief window when the webhook is unavailable, rather than
blocking the entire node drain. The second replica continues serving,
so most requests still get validated.

Combined with the preStop sleep from the previous commit, this provides
defense-in-depth: the preStop gives time for endpoint deregistration,
and failurePolicy: Ignore prevents hangs if that race is still hit.

Ref: https://redhat.atlassian.net/browse/SREP-4577
@dustman9000 dustman9000 changed the title SREP-4577: Add preStop hook to webhook pod to prevent drain deadlock SREP-4577: Fix webhook drain deadlock with preStop hook and failurePolicy: Ignore Apr 18, 2026
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@config/templates/csv-template.yaml`:
- Line 56: The CSV webhook configuration currently sets failurePolicy: Ignore in
config/templates/csv-template.yaml which makes the webhook fail-open; either
explicitly document that this tradeoff is intentional in the manifest (add a
clear comment near failurePolicy explaining why failing open is acceptable and
list compensating controls) or change the webhook failurePolicy to Fail to
enforce validation on webhook unavailability; locate the webhook spec
(failurePolicy field in the CSV/webhook configuration) and either replace Ignore
with Fail or add the explicit justification and link to compensating controls
(replicas, preStop, monitoring/alerting) in the template.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: ccee0ca4-d87c-43db-8c1c-ea0afcd92857

📥 Commits

Reviewing files that changed from the base of the PR and between 4588aea and 0ac2115.

📒 Files selected for processing (5)
  • bundle/manifests/addon-operator.clusterserviceversion.yaml
  • config/templates/csv-template.yaml
  • deploy-extras/development/webhook/validatingwebhookconfig.yaml
  • deploy_pko/ValidatingWebhookConfiguration-addons.yaml
  • hack/webhookdefinition.yaml
✅ Files skipped from review due to trivial changes (2)
  • hack/webhookdefinition.yaml
  • deploy_pko/ValidatingWebhookConfiguration-addons.yaml

Comment thread config/templates/csv-template.yaml
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Apr 19, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 59.24%. Comparing base (dfaf819) to head (3d968da).

Additional details and impacted files

Impacted file tree graph

@@           Coverage Diff           @@
##             main     #717   +/-   ##
=======================================
  Coverage   59.24%   59.24%           
=======================================
  Files          62       62           
  Lines        4125     4125           
=======================================
  Hits         2444     2444           
  Misses       1532     1532           
  Partials      149      149           
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Apr 19, 2026

@dustman9000: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@erdii
Copy link
Copy Markdown
Member

erdii commented Apr 20, 2026

Hey Dustin - thanks for the submission!

Are you sure that this PR actually fixes the node drain issue you've described?
Switching the webhook failure policy to Ignore should not change the drain behaviour because it just unblocks CREATE/UPDATE calls to the Addon API and the pre stop exec sleep makes endpoint deregistration cleaner but does it really help to unblock pod eviction?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants