Revert: Wait for container cleanup before deletion #50350 #53210
Conversation
As soon as this has LGTM, can you go ahead and open the cherry-pick to release-1.8 as well to get it running through CI?
/lgtm
I'm testing this fix on a 2k-node cluster.
/lgtm
Hold on, is this dangerous? I thought we added this because otherwise we had problems with volume attach issues.
/hold
@smarterclayton I am not sure what volume attach issue you are referring to here. The kubelet does wait for volume deletion to finish before deleting the pod. This change just restores the previous release's logic for deleting a pod. I personally don't think the original issue is a release blocker, but I'm also OK with reverting the entire PR, since it is an existing issue in the disk eviction manager from a couple of releases back. Either way, I don't think this revert should cause volume-related issues.
I think in general the measurement we're using here is somewhat arbitrary - I'd prefer to ship with this, document that 5000-node clusters have an issue on DELETE EVERYTHING IN MY CLUSTER AAAAIIIIEEEE, and make the kubelet not thundering-herd the masters in a 1.8.x. Is the code in master correct in terms of what an administrator would care about for the cluster? Is the measurement we're guiding a revert on a valid measurement of what an administrator cares about? If we shipped 1.8.0 with this, noted that massive deletion causes a thundering herd and that the nodes shouldn't thundering-herd, and fixed that in 1.8.x, would that be a better experience for our users?
Re: #53210 (comment) @smarterclayton I believe you preferred shipping 1.8 without this revert PR, right? That is what I preferred in the first place, and I don't think that should cause a real regression in production. In my mind, waiting for volume deletion, which was introduced a couple of releases ago, should have had a bigger impact in the first place. But our performance measurements never caught that issue, and not a single related production issue has been reported.
@smarterclayton I'm not sure I agree with this. The pod-deletion latency shot up from 20ms to greater than 3s (which is orders of magnitude higher and violates the SLOs we promise), and this was seen at a scale of 2k nodes, not even 5k. The pod deletion we're doing in the density test is not abnormal imo - we're deleting about 3000 pods within a couple of minutes, which is something I believe could happen if a user brought down a big deployment. Though I don't have too strong an opinion here and could be wrong.
Stepping back a bit on the underlying user scenario, the POINT of the SLO is probably:
Do we agree? If so, from the data that we're seeing are we violating that user scenario? I.e. is the 3s latency impacting turning over the cluster? |
/test pull-kubernetes-e2e-gce-bazel |
@shyamjvs on
Do we think this is a reasonable SLO for deleting about 3000 pods in a large cluster in the first place?
Discussing further in the burndown meeting happening right now. |
Just fyi - the scale test is passing with this change. It still hasn't gone back to the latencies from before that commit, but it's well below our 1s threshold.
So what's the plan here? Do we want to get this in for 1.9 and then fix the thundering herd issue before re-reverting or just wait till the issue is fixed? |
I am confused - can we not protect this behind the feature gate flag for local disk scheduling? That seems like the place where the need to enforce this cleanup makes more immediate sense to me.
If the only impact of deleting the pod prior to container cleanup is on disk-based scheduling, that makes sense to me as well: it lets us get back the SLO and test signal immediately in 1.9, gives us a minimal pick to 1.8.1, and makes resolving the thundering herd a requirement for enabling that feature gate by default (either by smearing deletes or rethinking periodic GC).
/lgtm cancel //PR changed after LGTM, removing LGTM. @dashpole @dchen1107 @yujuhong |
I added the local storage feature gate, as that seems preferable to removing that portion of code entirely. I don't have an opinion on whether or not this should go in. |
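For anyone following along without the diff, the gated check would look roughly like the sketch below. This is a hypothetical simplification, not the kubelet's actual code: the function and parameter names are illustrative, and only the use of the LocalStorageCapacityIsolation feature gate comes from this thread. The idea is that the wait for on-disk container removal applies only when local-storage-aware scheduling is enabled, while the volume cleanup wait stays unconditional.

```go
// Sketch only: illustrative names, not the kubelet's actual implementation.
package podcleanup

import (
	utilfeature "k8s.io/apiserver/pkg/util/feature"
	"k8s.io/kubernetes/pkg/features"
)

// podResourcesAreReclaimed (hypothetical) reports whether a terminated pod
// may be removed from the API server.
func podResourcesAreReclaimed(volumesCleanedUp, containersRemovedFromDisk bool) bool {
	// Volumes must always be unmounted/detached before the pod is deleted.
	if !volumesCleanedUp {
		return false
	}
	// Waiting for containers to be garbage-collected from disk only matters
	// when the scheduler accounts for local disk usage, so gate that wait on
	// the LocalStorageCapacityIsolation feature.
	if utilfeature.DefaultFeatureGate.Enabled(features.LocalStorageCapacityIsolation) &&
		!containersRemovedFromDisk {
		return false
	}
	return true
}
```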
re: #53210 (comment) I did propose having this feature gate in an earlier burndown meeting. The decision was to release 1.8.0 as is and document the known issue, and we are working on a solution to address the issue in 1.8.1.
I didn't realize there was an existing feature gate this cleanup was related to. If this cleanup only matters in the case where local disk scheduling is being taken into account, it makes sense to me to use the existing feature flag to expedite fixing the thundering herd issue for 1.8.x and give us more time to figure out how to resolve the issue when local disk scheduling is enabled.
Spoke to @dashpole offline. The feature gate sounds like a good compromise. On the other hand, I'd rather see the feature gate change included in 1.8.0, as opposed to a change of behavior from 1.8.0 to 1.8.1 with no good reason.
So what's the status of that? In my opinion, the feature-gate solution seems acceptable, and it would be great if we could merge it and backport it to 1.8.1.
Closing as this PR is no longer needed. The underlying issue is solved by #53233. See #53233 (comment) |
Removing 1.8 milestone since this is no longer needed. |
Issue: #51899
This PR changes the kubelet so that it no longer waits for container deletion to remove the pod.
#50350 had originally added this to prevent deletion of a pod until its containers had been removed from disk. However, because containers are garbage collected only every 30 seconds, this resulted in a large batch of pod deletions all at once. See #51899 (comment) for details.
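To make the batching effect concrete, here is a toy calculation rather than a measurement; the numbers (about 3000 pods deleted over roughly two minutes, a 30-second container GC period) are assumptions taken from the discussion above.

```go
// Toy model of why a periodic container GC batches pod deletions.
// The figures below are illustrative assumptions, not measured values.
package main

import (
	"fmt"
	"time"
)

func main() {
	const (
		gcPeriod     = 30 * time.Second // assumed kubelet container GC interval
		deleteWindow = 2 * time.Minute  // assumed window over which pods are deleted
		podsDeleted  = 3000             // assumed number of pods deleted in that window
	)

	// With #50350, a pod's API object is only removed after its containers are
	// gone from disk, and containers are only removed on a GC pass. Deletions
	// therefore complete in bursts aligned with GC ticks instead of trickling
	// out as each pod finishes terminating.
	gcPasses := int(deleteWindow / gcPeriod)
	fmt.Printf("~%d DELETE calls released per GC pass (thundering herd)\n",
		podsDeleted/gcPasses)
}
```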
cc @kubernetes/sig-node-bugs @kubernetes/sig-scalability-bugs @kubernetes/kubernetes-release-managers