Add a flag to disable force detach behavior in kube-controller-manager #120344

Merged
merged 1 commit into from Feb 22, 2024

Conversation

rohitssingh
Contributor

What type of PR is this?

/kind bug

What this PR does / why we need it:

This PR adds a flag to allow the 6-minute force detach window to be disabled.

With this flag enabled, ControllerUnpublishVolume RPCs are not dispatched when a pod has been stuck unmounting for 6 minutes and the node is deemed unhealthy in that window.

Which issue(s) this PR fixes:

Fixes #120328

Special notes for your reviewer:

Does this PR introduce a user-facing change?

This change adds the following CLI option for `kube-controller-manager`:
* `disable-force-detach` (defaults to `false`): Prevents force detaching volumes based on maximum unmount time and node status. If enabled, the non-graceful node shutdown feature must be used to recover from node failure (see https://kubernetes.io/blog/2023/08/16/kubernetes-1-28-non-graceful-node-shutdown-ga/). If enabled and a pod must be forcibly terminated at the risk of corruption, the corresponding VolumeAttachment object (see https://kubernetes.io/docs/reference/kubernetes-api/config-and-storage-resources/volume-attachment-v1/) must be deleted.
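
As an illustration of how a boolean option like this is typically wired up, here is a minimal sketch using Go's standard `flag` package. It is an assumption-laden example: the real component registers its flags differently, `attachDetachOptions` and `parseOptions` are hypothetical names, and the flag name is copied from the description above, so the name in a released version may differ.

```go
package main

import (
	"flag"
	"fmt"
)

// attachDetachOptions is a hypothetical options struct for this sketch.
type attachDetachOptions struct {
	DisableForceDetach bool
}

// parseOptions registers the option on a fresh FlagSet and parses args.
// The default is false, matching the description above.
func parseOptions(args []string) (attachDetachOptions, error) {
	var opts attachDetachOptions
	fs := flag.NewFlagSet("kube-controller-manager", flag.ContinueOnError)
	fs.BoolVar(&opts.DisableForceDetach, "disable-force-detach", false,
		"Prevent force detaching volumes based on maximum unmount time and node status.")
	err := fs.Parse(args)
	return opts, err
}

func main() {
	opts, err := parseOptions([]string{"--disable-force-detach=true"})
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println("force detach disabled:", opts.DisableForceDetach)
}
```

Leaving the option unset keeps the existing behavior, so clusters that do not opt in are unaffected.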

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 1, 2023
@k8s-ci-robot
Contributor

Hi @rohitssingh. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Sep 1, 2023
@rohitssingh
Contributor Author

@k8s-ci-robot k8s-ci-robot added area/code-generation area/test kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/storage Categorizes an issue or PR as relevant to SIG Storage. sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Sep 1, 2023
@alexzielenski
Contributor

/remove-sig api-machinery
/sig apps
/sig storage

@k8s-ci-robot k8s-ci-robot removed the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Sep 5, 2023
@carlory
Member

carlory commented Sep 6, 2023

/kind feature
/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. kind/feature Categorizes issue or PR as related to a new feature. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 6, 2023
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 21, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 10e35089f92ad5267d73ba3a9c4ce3f520fa3773

@msau42
Member

msau42 commented Feb 22, 2024

/assign @jpbetz

@bswartz
Contributor

bswartz commented Feb 22, 2024

/lgtm

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 22, 2024
@jpbetz
Contributor

jpbetz commented Feb 22, 2024

/approve

@rohitssingh
Contributor Author

@jpbetz: Thanks for the feedback! I think I've addressed your comments. Please feel free to take another look.
@msau42 & @bswartz: It looks like the lgtm label was removed because I uploaded a new patch. Please feel free to take another look.

@jpbetz
Contributor

jpbetz commented Feb 22, 2024

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 22, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: a069353479a4fdd71897bdb60a40777182ba61c2

@jpbetz
Contributor

jpbetz commented Feb 22, 2024

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jpbetz, msau42, rohitssingh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 22, 2024
@k8s-ci-robot k8s-ci-robot merged commit 31a482a into kubernetes:master Feb 22, 2024
19 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.30 milestone Feb 22, 2024
@rohitssingh
Contributor Author

> Thank you for your kind, detailed explanation. I agree with your change.
> Just out of interest: on your team, does unmounting a pod's volumes take too long even though the node is broken? Approximately how many minutes does an unmount take?

Hi @YuikoTakada,

I think we've encountered a number of different scenarios with force detach:

  • In some cases, misbehaving pods on otherwise healthy nodes were being migrated and simply ignored shutdown signals. These pods would never shut down on their own, which often resulted in zombie LUNs and data corruption.
  • In other cases, the CSI driver could not process NodeUnpublishVolume or NodeUnstageVolume in a timely fashion. When kubelet was restarted in the wrong window, or the load on a node was too high, the node would fall into an unhealthy state, leading to zombie LUNs and corruption.

I don't think we would have been able to unmount within any fixed time window in any of these cases. IMHO, we would much prefer shutdown to hang indefinitely so that we can easily find and fix these problems (as opposed to discovering data corruption days or weeks later).

@rohitssingh rohitssingh deleted the disable_force_detach branch February 23, 2024 01:01
serathius pushed a commit to serathius/kubernetes that referenced this pull request Mar 14, 2024
This is an abbreviated version of kubernetes#120344

It changes the boolean plumbing to use a global package variable to
avoid conflicts.

The unit test is added in an isolated file, named starting with z_ so it
runs after other OSS unit tests which are not resilient to metrics
starting at any value other than 0.

Bug: b/272460654
Change-Id: I5437f3d2dc73f4f1a3782ba3afe5fdc8f93287e4
@sftim
Contributor

sftim commented Mar 26, 2024

BTW, we can use Markdown in the changelog, including actual hyperlinks.

Labels
api-review Categorizes an issue or PR as actively needing an API review. approved Indicates a PR has been approved by an approver from all required OWNERS files. area/code-generation area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/bug Categorizes issue or PR as related to a bug. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/storage Categorizes an issue or PR as relevant to SIG Storage. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
Status: API review completed, 1.30
Archived in project
Development

Successfully merging this pull request may close these issues.

Flag to Disable the 6-minute Force Detach Window