
DRA: handle non graceful node shutdowns #4260

Merged

Conversation

@bart0sh (Contributor) commented Oct 2, 2023

120421: interaction with unexpected node shutdown KEP

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Oct 2, 2023
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Oct 2, 2023
@bart0sh (Contributor, Author) commented Oct 2, 2023

/assign @pohly

@bart0sh bart0sh force-pushed the PR013-DRA-non-graceful-node-shutdowns branch 2 times, most recently from 6d43d94 to 51670a6 on October 2, 2023 12:03
@bart0sh bart0sh added this to Triage in SIG Node PR Triage Oct 2, 2023
@bart0sh bart0sh moved this from Triage to Needs Reviewer in SIG Node PR Triage Oct 2, 2023
@pohly (Contributor) left a comment

Looks good to me, some minor spelling suggestions.

@@ -1162,6 +1163,18 @@ Once all of those steps are complete, kubelet will notice that the claims are
ready and run the pod. Until then it will keep checking periodically, just as
it does for other reasons that prevent a pod from running.

### Handling non graceful node shutdowns
@pohly (Contributor):

Suggested change
### Handling non graceful node shutdowns
### Handling non-graceful node shutdowns

@bart0sh (Contributor, Author) · Oct 2, 2023:

'non graceful' is used in the KEP, that's why I decided to use it here and in the e2e test code.

@pohly (Contributor):

The KEP issue uses "non-graceful", as does the KEP README in one place - looks like the original authors weren't sure about the right spelling.

"non-graceful" feels more right to me, but I'm not a native speaker and English is creative... Let's leave it in this PR as you have it now ("non graceful").

@@ -1162,6 +1163,18 @@ Once all of those steps are complete, kubelet will notice that the claims are
ready and run the pod. Until then it will keep checking periodically, just as
it does for other reasons that prevent a pod from running.

### Handling non graceful node shutdowns

When a node is shutdown unexpectedly and is tained with an `out-of-service`
@pohly (Contributor):

Suggested change
When a node is shutdown unexpectedly and is tained with an `out-of-service`
When a node is shut down unexpectedly and is tainted with an `out-of-service`

@bart0sh (Contributor, Author):

done, thank you for the review!

### Handling non graceful node shutdowns

When a node is shutdown unexpectedly and is tained with an `out-of-service`
taint with NoExecute effect as explained in the [Non graceful node shutdown KEP](https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/2268-non-graceful-shutdown),
@pohly (Contributor):

Suggested change
taint with NoExecute effect as explained in the [Non graceful node shutdown KEP](https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/2268-non-graceful-shutdown),
taint with NoExecute effect as explained in the [Non-graceful node shutdown KEP](https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/2268-non-graceful-shutdown),

@bart0sh (Contributor, Author):

Same thing here.
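(For reference: the out-of-service taint that this paragraph and the linked KEP rely on is applied manually by a cluster administrator after confirming the node is down, typically with `kubectl taint nodes <node-name> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute`. The `node.kubernetes.io/out-of-service` key and the NoExecute effect are what trigger the forced cleanup; `nodeshutdown` is just the value used in the upstream examples.)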

resources used by the pods will be deallocated. However, they will not be
un-prepared as the node is down and Kubelet is not running on it.

Resource drivers should be able to handle this situation correctly and
@SergeyKanzhelev (Member):

If Deallocate is called without UnprepareNodeResources, does it automatically mean a broken node, or are there other cases like this?

non-blocking: are there any recommendations that can be shared here on best practices for implementing these? If there is no guarantee, what logic should be implemented in UnprepareNodeResources?

@bart0sh (Contributor, Author) · Oct 2, 2023:

If Deallocate is called without UnprepareNodeResources, does it automatically mean a broken node, or are there other cases like this?

Currently this is the only case I'm aware of.

are there any recommendations that can be shared here on best practices for implementing these? If there is no guarantee, what logic should be implemented in UnprepareNodeResources?

It depends on the resource type. For local resources, not much can be done if the node is powered off, but something can be done if it's just a kubelet crash. For network-attached resources, the DRA controller can theoretically detach them from the node. However, these cases are not generic enough to give recommendations. Plugin authors should know how to handle this kind of case, and in most cases it depends on the particular hardware setup, I think.
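As a rough illustration of that advice, a control-plane driver's Deallocate handler can release the resource on the backend and treat "already released" as success, so the call works whether or not NodeUnprepareResources ever ran on the node. The types and helpers below are hypothetical, not the real dynamic-resource-allocation controller interface:

```go
// Hypothetical sketch: a Deallocate handler that does not assume
// UnprepareNodeResources ran first (e.g. after a non graceful node shutdown).
package main

import (
	"context"
	"fmt"
)

type driver struct {
	// detach releases a network-attached resource on the backend,
	// regardless of whether the node ever cleaned up its local state.
	detach func(ctx context.Context, claimUID string) error
}

func (d *driver) Deallocate(ctx context.Context, claimUID string) error {
	// Release the resource on the backend; the backend call is expected to
	// be idempotent so that an already-released resource is not an error.
	if err := d.detach(ctx, claimUID); err != nil {
		return fmt.Errorf("deallocate claim %s: %w", claimUID, err)
	}
	return nil
}

func main() {
	d := &driver{detach: func(ctx context.Context, claimUID string) error { return nil }}
	_ = d.Deallocate(context.Background(), "example-claim-uid")
}
```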

@SergeyKanzhelev (Member):

The reason I am asking is to understand whether we need to guarantee that in the normal case all callbacks will be called. And if so, are there any guarantees on timing? Can they be called in quick succession, and how much synchronization is needed there? Can they somehow be called in the opposite order? (Sorry, I haven't looked at the implementation, so my questions may be completely out of context.)

@bart0sh (Contributor, Author):

It's already guaranteed that PrepareResources is called when a container is created and UnprepareResources when a container is terminated. That's part of the initial DRA implementation. The description of the non graceful node shutdown case is just a way to explain how DRA plugins are expected to behave when UnprepareResources is not called due to an unexpected node shutdown.

@bart0sh (Contributor, Author):

Let me provide a bit more info on this.
The NodePrepareResources and NodeUnprepareResources calls are expected to do as little work as possible, as specified in the KEP:

This operation SHALL do as little work as possible as it’s called after a pod is scheduled to a node. All potentially failing operations SHALL be done during allocation phase.

In most cases this means that NodePrepareResources only creates a CDI file and returns its fully qualified device name to the kubelet, and NodeUnprepareResources removes the file. When the node is rebooted, CDI files are removed anyway because they're usually placed under /var/run/cdi/. In this particular case it means that even if NodeUnprepareResources is not called because of an unexpected node shutdown, the file will be removed on node reboot.
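A minimal sketch of that CDI-file pattern, with made-up names and a hand-written spec instead of the real CDI library and DRA gRPC types:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// cdiDir is typically on a tmpfs, so its contents disappear on reboot.
const cdiDir = "/var/run/cdi"

// prepare writes a minimal CDI spec for the claim and returns the fully
// qualified device name that kubelet passes on to the container runtime.
func prepare(claimUID string) (string, error) {
	spec := fmt.Sprintf(`{
  "cdiVersion": "0.5.0",
  "kind": "vendor.example.com/gpu",
  "devices": [{"name": %q, "containerEdits": {"env": ["CLAIM_UID=%s"]}}]
}`, claimUID, claimUID)
	path := filepath.Join(cdiDir, "example-"+claimUID+".json")
	if err := os.WriteFile(path, []byte(spec), 0o644); err != nil {
		return "", err
	}
	return "vendor.example.com/gpu=" + claimUID, nil
}

// unprepare removes the spec file. After a reboot the file is already gone
// because /var/run is volatile, so a missing file is treated as success.
func unprepare(claimUID string) error {
	err := os.Remove(filepath.Join(cdiDir, "example-"+claimUID+".json"))
	if err != nil && !os.IsNotExist(err) {
		return err
	}
	return nil
}

func main() {
	if dev, err := prepare("claim-1234"); err == nil {
		fmt.Println("prepared:", dev)
	}
	_ = unprepare("claim-1234")
}
```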

@bart0sh bart0sh force-pushed the PR013-DRA-non-graceful-node-shutdowns branch from 51670a6 to 4240c3b on October 2, 2023 19:07

Resource drivers should be able to handle this situation correctly and
should not expect `UnprepareNodeResources` to be always called before
`Deallocate`.
@SergeyKanzhelev (Member):

I think this sentence is saying that UnprepareNodeResources may be omitted, but it can be read as "they can be called in the opposite order". Please make it more explicit what the sentence means.

@bart0sh (Contributor, Author):

Thank you for pointing this out. Rephrased, PTAL.

@bart0sh (Contributor, Author) commented Oct 3, 2023

@SergeyKanzhelev any other questions/concerns? If not, can you lgtm/approve?

@SergeyKanzhelev (Member) left a comment

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 3, 2023
@bart0sh bart0sh moved this from Needs Reviewer to Needs Approver in SIG Node PR Triage Oct 4, 2023
@bart0sh (Contributor, Author) commented Oct 4, 2023

/assign @mrunalp @dchen1107
for final approval

@bart0sh (Contributor, Author) commented Oct 11, 2023

@mrunalp @dchen1107 @derekwaynecarr Can you approve this? It's a minor change and it doesn't require any code changes.

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bart0sh, mrunalp, SergeyKanzhelev

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 18, 2023
@mrunalp mrunalp moved this from Needs Approver to Done in SIG Node PR Triage Oct 19, 2023
@k8s-ci-robot k8s-ci-robot merged commit 0d9f0c4 into kubernetes:master Oct 19, 2023
4 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.29 milestone Oct 19, 2023