
grace-period=0 --force Forced deletion failed #113717

Open
Seaiii opened this issue Nov 8, 2022 · 18 comments · May be fixed by #113883
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. sig/node Categorizes an issue or PR as relevant to SIG Node. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments


Seaiii commented Nov 8, 2022

What happened?


There is a case where forced deletion does not stop the pod (example: the container is running sleep 9999). I first ran kubectl delete pod --grace-period=30, and then kubectl delete pod --grace-period=0 --force. The second, forced delete has no effect on the kubelet.

According to the latest documentation, a grace period of 0 combined with --force should trigger the kubelet to start cleaning up immediately. Looking at the source code, I found that when gracePeriod=0 and force are set, the kubelet triggers handler.HandlePodRemoves and then calls kl.deletePod, which is correct. In the actual processing, however, the request goes through UpdatePod (in every version I checked before 1.22, the deletePod path is used). Inside the update there is no check for gracePeriod=0 with force; everything is handled as a normal pod update. As a result, the second request with gracePeriod=0 is treated as an ordinary update and stored in the lastUndeliveredWorkUpdate cache. Meanwhile, the first delete has to interact with the CRI and wait for its result, so the forced deletion is only processed after the first 30-second termination completes. This matches neither the documented behavior nor the intent of the code; there should be dedicated handling for gracePeriod=0 with force. Here is the source code flow:

1. kubelet.go: handler.HandlePodRemoves(u.Pods)


2. kubelet.go: kl.deletePod


3. kubelet.go: kl.podWorkers.UpdatePod()


4. pod_workers.go


The pod_workers.go code in step 4 is where the grace period is waited out when killing the container. Looking at the 1.19 code, after HandlePodRemoves is triggered there are two steps: 1. close the pod update channel and remove the entry from lastUndeliveredWorkUpdate; 2. start a goroutine that calls the CRI to remove the container. The code from 1.22 onward no longer performs these steps, which effectively abandons forced deletion with gracePeriod=0. If that is intended, it should be noted in the documentation, but I did not see it there. I changed the source code locally and tested that forced deletion works again. Is this a bug, and should I open a pull request?
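
To make the sequencing concrete, here is a minimal, self-contained Go sketch of the behavior described above. It is not the kubelet source, just a toy model assuming one synchronous worker per pod whose in-flight termination cannot be shortened by a later update:

package main

import (
	"fmt"
	"time"
)

// update stands in for a pod update delivered to the per-pod worker.
type update struct {
	gracePeriod time.Duration
	force       bool
}

func main() {
	// One pending slot, playing the role of the lastUndeliveredWorkUpdate cache.
	updates := make(chan update, 1)
	done := make(chan struct{})

	// The worker drains updates strictly in order; a later update cannot
	// shorten a termination that is already in flight.
	go func() {
		for u := range updates {
			fmt.Printf("terminating with grace period %v (force=%v)\n", u.gracePeriod, u.force)
			time.Sleep(u.gracePeriod) // stands in for the synchronous CRI call
			fmt.Println("termination finished")
		}
		close(done)
	}()

	updates <- update{gracePeriod: 3 * time.Second} // like --grace-period=30, scaled down
	time.Sleep(500 * time.Millisecond)
	updates <- update{gracePeriod: 0, force: true} // like --grace-period=0 --force
	// The forced request just sits in the pending slot until the first
	// termination returns, so it has no effect on how long the container runs.
	close(updates)
	<-done
}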

What did you expect to happen?

After the first delete with grace-period=30, the second delete with grace-period=0 should terminate the pod immediately.
What happens instead is that the apiserver deletes the pod object, but the kubelet does not delete the container until the original grace period has elapsed.

How can we reproduce it (as minimally and precisely as possible)?

Start a container:
1. docker run name sh -c "sleep 99999" (a pod whose container just sleeps)
2. kubectl delete pod name --grace-period=30
3. kubectl delete pod name --grace-period=0 --force
Step 3 will not take effect on the kubelet; the container keeps running until the grace period from step 2 expires.

Anything else we need to know?

no

Kubernetes version

$ kubectl version 1.22.2
# paste output here

Cloud provider

null

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@Seaiii Seaiii added the kind/bug Categorizes issue or PR as related to a bug. label Nov 8, 2022
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 8, 2022

aojea commented Nov 8, 2022

/sig node

I have doubts that these commands are preemptive, i.e. that the second call will stop the process initiated by the first:

2.kubectl delete pod name --grace-period=30
3.kubectl delete pod name --grace-period=0 --force

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 8, 2022

Seaiii commented Nov 8, 2022

I read the source code; it won't. The requests are executed sequentially: when the second request comes in, the kubelet checks whether the first is still being processed, and if it is, the second request is simply stored in a local cache. Only after the first request finishes does the kubelet fetch the second one from the cache and execute it, but by then the pod and container have already been killed, so it is no longer needed.
According to the code in older versions, --grace-period=0 --force was handled specially and the forced stop was executed immediately, but that is no longer the case after 1.22.
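
As a rough sketch of that older special case (the function names here are invented for illustration and are not the kubelet's API):

package main

import "fmt"

// Invented helpers for illustration only; they are not kubelet functions.
func stopImmediately() {
	fmt.Println("drop the pending update and kill the container via the CRI right away")
}

func enqueueUpdate(gracePeriodSeconds int64) {
	fmt.Printf("queue an update with a %ds grace period behind the in-flight termination\n", gracePeriodSeconds)
}

// handleDelete sketches the branch the issue says existed before 1.22:
// a forced delete with grace period 0 bypasses the normal update queue
// instead of waiting for the current termination to finish.
func handleDelete(gracePeriodSeconds int64, force bool) {
	if force && gracePeriodSeconds == 0 {
		stopImmediately()
		return
	}
	enqueueUpdate(gracePeriodSeconds)
}

func main() {
	handleDelete(30, false) // first delete: normal graceful termination
	handleDelete(0, true)   // second delete: handled immediately in the pre-1.22 logic
}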


aojea commented Nov 8, 2022

yeah, sorry I didn't read it in detail :).

Indeed, in 1.22 there was a big refactor and cleanup of the kubelet, and some release notes were amended regarding the pod lifecycle: #107348

This has some consequences like the ones you are highlighting. SIG Node owns this area, and I know they have been addressing some issues related to your question lately:

#113145


Seaiii commented Nov 9, 2022

It is true that there are many similar problems, but so far I have not seen the development team address them, and I have not seen a PR that fixes this.


aojea commented Nov 9, 2022

It is true that there are many similar problems, but so far I have not seen the development team address them, and I have not seen a PR that fixes this.

yeah, maybe this is not supported. I was just giving you context so you can check whether this is a duplicate or a known issue; after that you have to wait for SIG Node to triage this issue ...


Seaiii commented Nov 9, 2022

It is true that there are many similar problems, but so far I have not seen the development team address them, and I have not seen a PR that fixes this.

Okay, thank you very much

@SergeyKanzhelev
Member

I changed the source code locally and tested that forced deletion works again. Is this a bug, and should I open a pull request?

On what version? 1.22 is almost out of support, and earlier versions are already out of support.

@SergeyKanzhelev
Member

/triage needs-information

@k8s-ci-robot k8s-ci-robot added the triage/needs-information Indicates an issue needs more information in order to work on it. label Nov 9, 2022

Seaiii commented Nov 10, 2022

I changed the source code locally and tested that forced deletion works again. Is this a bug, and should I open a pull request?

On what version? 1.22 is almost out of support, and earlier versions are already out of support.

I made the change on version 1.22. This behavior is not supported from 1.22 onward; 1.19, 1.20, and 1.21 support it. I also looked at the code for 1.23, and it is not supported there either.

@SergeyKanzhelev
Member

/remove-triage needs-information
/triage accepted
/priority important-longterm

I understand the problem now. This seems to be a change of behavior starting in 1.22. The documentation at https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination-forced warns that force deletion on the node may take some time.

Are there any implications of this behavior beyond the prolonged life of a deleted pod? Would it be possible for you to try to schedule something on that node during the termination? Does it cause pod admission failure?

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. triage/needs-information Indicates an issue needs more information in order to work on it. labels Dec 9, 2022

Seaiii commented Dec 10, 2022

/remove-triage needs-information /triage accepted /priority important-longterm

I understand the problem now. This seems to be a change of behavior starting in 1.22. The documentation at https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination-forced warns that force deletion on the node may take some time.

Are there any implications of this behavior beyond the prolonged life of a deleted pod? Would it be possible for you to try to schedule something on that node during the termination? Does it cause pod admission failure?

Yes, a lot of changes were made to pod_workers.go after 1.22. I compared the code from 1.19 through 1.25. When forcing deletion (grace period 0), the apiserver immediately deletes the pod object from etcd, but the kubelet does not accept a grace period of 0: if the requested grace period is 0, the default value (usually 30s) is used as the time allowed for the CRI call. That is why the forced deletion fails.
There are no other effects; I read the code in detail. Because the delete has already been issued to the CRI and the pod's cached information has been removed from the kubelet, new pods will be admitted for creation again, but they have to wait for the previous pod to finish being deleted.
This still does not follow the expected logic, though, because the apiserver and the kubelet are no longer in sync at that point.
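
A hypothetical simplification of that grace-period resolution in Go (the function name and parameters are invented for this sketch and are not the kubelet's actual API):

package main

import "fmt"

// effectiveGracePeriodSeconds models the behavior described above: a requested
// value of 0 is not treated as "kill immediately"; instead the pod's own
// terminationGracePeriodSeconds (typically 30) is used for the CRI call.
func effectiveGracePeriodSeconds(requested *int64, specTerminationGracePeriod int64) int64 {
	if requested == nil || *requested == 0 {
		return specTerminationGracePeriod
	}
	return *requested
}

func main() {
	zero := int64(0)
	fmt.Println(effectiveGracePeriodSeconds(&zero, 30)) // prints 30, not 0
}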

@SergeyKanzhelev
Member

/assign @Seaiii since you are working on it


Seaiii commented Dec 15, 2022

/assign @Seaiii since you are working on it
Yes, I can fix it, but it has to wait for PR #113883 to finish, because I don't have permission to open two PRs at the same time.
Please help move that PR forward, thank you very much!

@smarterclayton
Contributor

There is some confusion in this bug:

Terminating a pod with grace period 0 does not mean "stop immediately" - it means "force delete the pod, and expect that all other parts of the system will try to clean the pod up as quickly as possible". However, because you remove the pod, you have nowhere to wait to see when the pod is deleted. Based on your description, if you want the pod to be deleted as quickly as possible and to know when the pod is done, you should:

kubectl delete pod name --grace-period=30
kubectl delete pod name --grace-period=1

Asking for a grace period of one means "delete as fast as possible" - the kubelet reserves the right to not send SIGKILL immediately (we send TERM and then KILL a second later). The Kubelet has never issued SIGKILL to a process without some wait (even during eviction).

When we first implemented grace period deletion, we sent requests asynchronously to docker, so about 7 years ago the logic was correct that if you sent 30, then 1, it would shorten and try to kill right away. However, in the transition to CRI we lost that behavior because we moved to a synchronous event loop (for good reasons).

At the current time, the Kubelet is unable to shorten the duration of a grace period, for a number of reasons. We're currently working on improving that, but it is dependent on:

#107826 - to allow us to cancel wait loops cleanly with a context
#113606 - so that we can cancel existing synchronous logic using the contexts
TODO - the kubelet must track when the first termination request was received and shorten others based on that
TODO - clean up context tracking in the pod worker
TODO - clean up the container runtime to properly handle "immediate" shutdown requests (bypass hooks, probably)

#107826 requires some additional review but is the key blocker. Once we have that, we'll gradually work through the blockers until grace periods behave like expected. But for now, use grace-period=1 rather than 0 - 0 is the same as 1 but you have no idea when the pod is done. No one should force delete pods unless they meet the criteria of "the node is dead and not coming back".
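
For reference, a standalone sketch of the TERM-then-KILL escalation described above (illustrative only, not kubelet code; it assumes a Unix-like system with a sleep binary):

package main

import (
	"fmt"
	"os/exec"
	"syscall"
	"time"
)

func main() {
	cmd := exec.Command("sleep", "99999")
	if err := cmd.Start(); err != nil {
		panic(err)
	}

	grace := 1 * time.Second // "--grace-period=1": as fast as possible, but never instant

	_ = cmd.Process.Signal(syscall.SIGTERM) // polite shutdown request first

	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()

	select {
	case err := <-done:
		fmt.Println("process exited within the grace period:", err)
	case <-time.After(grace):
		_ = cmd.Process.Kill() // escalate to SIGKILL only after the wait
		fmt.Println("grace period expired, sent SIGKILL:", <-done)
	}
}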

@smarterclayton
Contributor

/assign


Seaiii commented Mar 1, 2023

There is some confusion in this bug:

You are correct in this interpretation.
The --grace-period=1 approach is much more elegant.
It's just that we can't shorten the grace period right now. As soon as that problem is solved, this issue can be closed.


pacoxu commented Apr 6, 2023

As far as I know, this is a potential security issue (not sure whether it needs a CVE number): a user sets a very large grace period and then force-deletes with --grace-period=0, which removes the pod on the apiserver side. In that case the pod disappears from the cluster admin's view, but the kubelet will not remove it until the grace period ends. If I do this multiple times, there will be many containers still running in the background in the cluster with no corresponding pods in my namespace.


Seaiii commented Jan 12, 2024

Are there any implications of this behavior beyond the prolonged life of a deleted pod? Would it be possible for you to try to schedule something on that node during the termination? Does it cause pod admission failure?

@SergeyKanzhelev Hi, the current 1.29 release still doesn't address this bug. Should we move forward with it?
