
Revert: Wait for container cleanup before deletion #50350 #53210

Closed

Conversation

dashpole
Contributor

Issue: #51899

This PR changes the kubelet so that it no longer waits for container deletion to remove the pod.
#50350 had originally added this to prevent deletion of a pod until its containers had been removed from disk. However, because containers are garbage collected only every 30 seconds, this resulted in a large batch of pod deletions all at once. See #51899 (comment) for details.
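
To make the interaction concrete, here is a minimal, hypothetical Go sketch (the types, fields, and helper names are illustrative assumptions, not the kubelet's actual API) of why gating pod removal on a periodic container GC batches the deletions:

```go
// Illustrative only: a terminated pod's API object may not be deleted until
// its containers have been cleaned up from disk. Because the cleanup runs on
// a fixed 30-second period, many pods become deletable at the same instant,
// which produces one large burst of DELETE calls against the API server.
package main

import (
	"fmt"
	"time"
)

type terminatedPod struct {
	name             string
	containersOnDisk int // containers not yet garbage-collected
}

// canDeletePod mirrors the gating logic this PR reverts.
func canDeletePod(p terminatedPod, waitForCleanup bool) bool {
	if !waitForCleanup {
		return true // reverted behavior: delete the pod object immediately
	}
	return p.containersOnDisk == 0 // original #50350 behavior
}

func main() {
	gcPeriod := 30 * time.Second
	pods := []terminatedPod{{"web-0", 2}, {"web-1", 1}, {"web-2", 3}}

	// With waitForCleanup=true none of these pods can be deleted yet; they all
	// become deletable together right after the next GC pass, roughly gcPeriod
	// from now.
	for _, p := range pods {
		fmt.Printf("%s deletable now: %v (next GC in ~%v)\n",
			p.name, canDeletePod(p, true), gcPeriod)
	}
}
```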

cc @kubernetes/sig-node-bugs @kubernetes/sig-scalability-bugs @kubernetes/kubernetes-release-managers

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. kind/bug Categorizes issue or PR as related to a bug. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Sep 28, 2017
@k8s-github-robot k8s-github-robot added the do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. label Sep 28, 2017
@ericchiang ericchiang added release-note-none Denotes a PR that doesn't merit a release note. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Sep 28, 2017
@dchen1107 dchen1107 assigned dchen1107 and yujuhong and unassigned mtaufen and resouer Sep 28, 2017
@liggitt
Member

liggitt commented Sep 28, 2017

as soon as this has LGTM, can you go ahead and open the cherrypick to release-1.8 as well to get it running through CI?

@yujuhong
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 28, 2017
@k8s-github-robot k8s-github-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 28, 2017
@shyamjvs
Member

I'm testing this fix on a 2k-node cluster.

@liggitt liggitt added this to the v1.8 milestone Sep 28, 2017
@dchen1107
Member

/lgtm

@smarterclayton
Contributor

Hold on, is this dangerous? I thought we added this because otherwise we had problems with volume attachment.

@smarterclayton
Contributor

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 28, 2017
@dchen1107
Member

dchen1107 commented Sep 28, 2017

@smarterclayton I am not sure what volume attach issue you are referring to here.

The kubelet does wait for volume deletion to finish before deleting the pod. This change just restores the pod-deletion logic from the previous release.

I personally don't think the original issue is a release blocker, but I'm also OK with reverting the entire PR, since it is an issue that has existed in the disk eviction manager for a couple of releases. Either way, I don't think this revert should cause volume-related issues.

@smarterclayton
Contributor

smarterclayton commented Sep 28, 2017

I think in general the measurement we're measuring is somewhat arbitrary. I'd prefer to ship with this, document that 5000-node clusters have an issue on "DELETE EVERYTHING IN MY CLUSTER" events, and make the kubelet not thundering-herd the masters in a 1.8.x release.

Is the code in master correct from the perspective of an administrator who cares about the cluster? Is the measurement we're basing this revert on a valid measurement of what an administrator cares about?

If we shipped 1.8.0 with this, noted that massive deletions cause a thundering herd and that the nodes shouldn't thundering-herd, and fixed that in 1.8.x, would that be a better experience for our users?

@dchen1107
Member

Re: #53210 (comment)

@smarterclayton I believe you preferred shipping 1.8 without this revert PR, right? That is what I preferred in the first place; I don't think that should cause a real regression in production. In my mind, waiting for volume deletion should have had a bigger impact in the first place, and that was introduced a couple of releases ago. But our performance measurements never caught that issue, and not a single production issue related to it has been reported.

@shyamjvs
Member

shyamjvs commented Sep 28, 2017

I think in general the measurement we're measuring is somewhat arbitrary - I'd prefer to ship with this, document the 5000-node cluster issue, and make the kubelet not thundering-herd the masters on MASSIVE deletion.

@smarterclayton I'm not sure I agree with this. The delete-pods latency shot up from 20ms to greater than 3s, which is orders of magnitude higher and violates the SLOs we promise. And this was seen at a scale of 2k nodes, not even 5k. The pod deletion we're doing in the density test is not abnormal, in my opinion: we're deleting about 3000 pods within a couple of minutes, which is something I believe could happen if a user brought down a big deployment.

Though I don't have a strong opinion here and could be wrong.

@smarterclayton
Contributor

smarterclayton commented Sep 28, 2017

Stepping back a bit on the underlying user scenario, the POINT of the SLO is probably:

  1. I can turn over all the pods in my clusters safely and quickly at a very large cluster size

Do we agree? If so, from the data that we're seeing, are we violating that user scenario?

I.e., is the 3s latency impacting turning over the cluster?

@spiffxp
Member

spiffxp commented Sep 28, 2017

/test pull-kubernetes-e2e-gce-bazel

@dchen1107
Member

@shyamjvs on:

The delete pods latency shot up from 20ms to greater than 3s (which is orders of magnitude higher and violating the SLOs we promise).

Do we think this is a reasonable SLO for deleting about 3000 pods in a large cluster in the first place?

@shyamjvs
Member

Discussing further in the burndown meeting happening right now.

@shyamjvs
Member

Just FYI: the scale test is passing with this change. The latency still hasn't gone back to what it was before that commit, but it's well below our 1s threshold:

Sep 28 20:23:57.699: INFO: Top latency metric: {Resource:pods Subresource: Verb:DELETE Scope:namespace Latency:{Perc50:4.486ms Perc90:7.763ms Perc99:344.744ms Perc100:0s} Count:124000}

@shyamjvs
Member

So what's the plan here? Do we want to get this in for 1.9 and then fix the thundering herd issue before re-reverting, or just wait until the issue is fixed?
I'd suggest the former, in order to have a green scalability signal for 1.9 and be able to catch new regressions (note that these tests are release-blocking for 1.9 too).

@derekwaynecarr
Member

I am confused: can we not protect this behind the feature gate flag for local disk scheduling? That seems like the place where the need to enforce this cleanup makes more immediate sense to me.

@liggitt
Member

liggitt commented Sep 28, 2017

can we not protect this behind the feature gate flag for local disk scheduling? That seems like the place where the need to enforce this cleanup makes more immediate sense to me.

If the only impact of deleting the pod prior to container cleanup is on disk-based scheduling, that makes sense to me as well: it lets us get back the SLO and test signal immediately in 1.9, gives us a minimal pick to 1.8.1, and makes resolving the thundering herd a requirement for enabling that feature gate by default (either by smearing deletes or rethinking the periodic GC).
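
As a rough illustration of the "smearing deletes" option mentioned above (all names and durations here are assumptions for the sake of the example, not real kubelet code), adding a random per-pod delay before the DELETE call would spread the post-GC burst over a window instead of hitting the API server all at once:

```go
// Hypothetical sketch of smearing deletes: rather than issuing every pod
// DELETE immediately after the periodic container GC pass, each call waits a
// random fraction of a smear window so the API server sees a spread-out
// trickle instead of a synchronized burst.
package main

import (
	"fmt"
	"math/rand"
	"time"
)

func smearedDelay(window time.Duration) time.Duration {
	return time.Duration(rand.Int63n(int64(window)))
}

func main() {
	window := 10 * time.Second // assumed smear window
	for _, pod := range []string{"web-0", "web-1", "web-2"} {
		// In a real client, the DELETE request would be issued after this delay.
		fmt.Printf("would delete %s after %v\n", pod, smearedDelay(window))
	}
}
```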

@k8s-github-robot

/lgtm cancel //PR changed after LGTM, removing LGTM. @dashpole @dchen1107 @yujuhong

@k8s-github-robot k8s-github-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 28, 2017
@dashpole
Contributor Author

I added the local storage feature gate, as that seems preferable to removing that portion of code entirely. I don't have an opinion on whether or not this should go in.
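
For context, a hedged sketch of what putting this behind the gate could look like — the identifiers below (`podResourcesAreReclaimed`, `localStorageGateEnabled`, and the status fields) are illustrative assumptions, not the actual kubelet code:

```go
// Hypothetical sketch: volume cleanup is always required before deleting a
// terminated pod, but container cleanup only blocks deletion when the local
// storage feature gate is enabled, since that is the only case where leftover
// container data on disk affects scheduling decisions.
package main

import "fmt"

type reclaimStatus struct {
	volumesInUse     int
	containersOnDisk int
}

func podResourcesAreReclaimed(s reclaimStatus, localStorageGateEnabled bool) bool {
	if s.volumesInUse > 0 {
		return false // always wait for volumes
	}
	if localStorageGateEnabled && s.containersOnDisk > 0 {
		return false // only wait for container GC when the gate is on
	}
	return true
}

func main() {
	s := reclaimStatus{volumesInUse: 0, containersOnDisk: 2}
	fmt.Println("gate off:", podResourcesAreReclaimed(s, false)) // true: delete immediately
	fmt.Println("gate on: ", podResourcesAreReclaimed(s, true))  // false: wait for container GC
}
```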

@dchen1107
Member

re: #53210 (comment)

I did propose using this feature gate in an earlier burndown meeting. The decision was to release 1.8.0 as is and document the known issue, and we are working on a solution addressing the issue in 1.8.1.

@liggitt
Member

liggitt commented Sep 28, 2017

I didn't realize there was an existing feature gate this cleanup was related to. If this cleanup only matters in the case where local disk scheduling is taken into account, it makes sense to me to use the existing feature flag to expedite fixing the thundering herd issue for 1.8.x, and give us more time to figure out how to resolve the issue when local disk scheduling is enabled.

@yujuhong
Contributor

Spoke to @dashpole offline. The feature gate sounds like a good compromise.
The only downside is that we'd have to live with (overly) aggressive pod eviction for longer before this is fixed properly. That seems tolerable, since the bug existed in 1.7 as well.

On the other hand, I'd rather see the feature gate change included in 1.8.0, as opposed to changing behavior from 1.8.0 to 1.8.1 with no good reason.

@wojtek-t
Member

wojtek-t commented Oct 2, 2017

So what's the status of this? In my opinion, the feature-gate solution seems acceptable, and it would be great if we could merge it and backport it to 1.8.1.

@dashpole
Contributor Author

dashpole commented Oct 3, 2017

Closing as this PR is no longer needed. The underlying issue is solved by #53233. See #53233 (comment)

@dashpole dashpole closed this Oct 3, 2017
@dashpole dashpole deleted the revert_container_cleanup branch October 3, 2017 16:34
@jpbetz
Contributor

jpbetz commented Oct 25, 2017

Removing 1.8 milestone since this is no longer needed.

@jpbetz jpbetz removed this from the v1.8 milestone Oct 25, 2017
@k8s-cherrypick-bot

Removing label cherrypick-candidate because no release milestone was set. This is an invalid state and thus this PR is not being considered for cherry-pick to any release branch. Please add an appropriate release milestone and then re-add the label.
