
Revert: Wait for container cleanup before deletion #50350 #53210

Closed

Conversation

dashpole
Contributor

Issue: #51899

This PR changes the kubelet so that it no longer waits for container deletion to remove the pod.
#50350 had originally added this to prevent deletion of a pod until its containers had been removed from disk. However, because containers are garbage collected only every 30 seconds, this resulted in a large batch of pod deletions all at once. See #51899 (comment) for details.
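
To make the interaction concrete, here is a minimal, hypothetical Go sketch (the types, fields, and helper names are illustrative assumptions, not the kubelet's actual API) of why gating pod removal on a periodic container GC batches the deletions:

```go
// Illustrative only: a terminated pod's API object may not be deleted until
// its containers have been cleaned up from disk. Because the cleanup runs on
// a fixed 30-second period, many pods become deletable at the same instant,
// which produces one large burst of DELETE calls against the API server.
package main

import (
	"fmt"
	"time"
)

type terminatedPod struct {
	name             string
	containersOnDisk int // containers not yet garbage-collected
}

// canDeletePod mirrors the gating logic this PR reverts.
func canDeletePod(p terminatedPod, waitForCleanup bool) bool {
	if !waitForCleanup {
		return true // reverted behavior: delete the pod object immediately
	}
	return p.containersOnDisk == 0 // original #50350 behavior
}

func main() {
	gcPeriod := 30 * time.Second
	pods := []terminatedPod{{"web-0", 2}, {"web-1", 1}, {"web-2", 3}}

	// With waitForCleanup=true none of these pods can be deleted yet; they all
	// become deletable together right after the next GC pass, roughly gcPeriod
	// from now.
	for _, p := range pods {
		fmt.Printf("%s deletable now: %v (next GC in ~%v)\n",
			p.name, canDeletePod(p, true), gcPeriod)
	}
}
```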

cc @kubernetes/sig-node-bugs @kubernetes/sig-scalability-bugs @kubernetes/kubernetes-release-managers

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. kind/bug Categorizes issue or PR as related to a bug. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Sep 28, 2017
@k8s-github-robot k8s-github-robot added the do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. label Sep 28, 2017
@ericchiang ericchiang added release-note-none Denotes a PR that doesn't merit a release note. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Sep 28, 2017
@dchen1107 dchen1107 assigned dchen1107 and yujuhong and unassigned mtaufen and resouer Sep 28, 2017
@liggitt
Member

liggitt commented Sep 28, 2017

as soon as this has LGTM, can you go ahead and open the cherrypick to release-1.8 as well to get it running through CI?

@yujuhong
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 28, 2017
@k8s-github-robot k8s-github-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 28, 2017
@shyamjvs
Member

I'm testing this fix on a 2k-node cluster.

@liggitt liggitt added this to the v1.8 milestone Sep 28, 2017
@dchen1107
Member

/lgtm

@smarterclayton
Contributor

Hold on, is this dangerous? I thought we added this because otherwise we had problems with volume attachment.

@smarterclayton
Contributor

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 28, 2017
@dchen1107
Member

dchen1107 commented Sep 28, 2017

@smarterclayton I am not sure what volume attach issue you are referring to here.

The kubelet does wait for volume deletion to finish before deleting the pod. This change just restores the pod-deletion logic from the previous release.

I personally don't think the original issue is a release blocker, but I'm also OK with reverting the entire PR, since it is an issue that has existed in the disk eviction manager for a couple of releases. Either way, I don't think this revert should cause volume-related issues.

@smarterclayton
Contributor

smarterclayton commented Sep 28, 2017

I think in general the measurement we're measuring is somewhat arbitrary. I'd prefer to ship with this, document that 5000-node clusters have an issue on "DELETE EVERYTHING IN MY CLUSTER" events, and make the kubelet not thundering-herd the masters in a 1.8.x release.

Is the code in master correct from the perspective of an administrator who cares about the cluster? Is the measurement we're basing this revert on a valid measurement of what an administrator cares about?

If we shipped 1.8.0 with this, noted that massive deletions cause a thundering herd and that the nodes shouldn't thundering-herd, and fixed that in 1.8.x, would that be a better experience for our users?

@dchen1107
Member

Re: #53210 (comment)

@smarterclayton I believe you preferred shipping 1.8 without this revert PR, right? That is what I preferred in the first place; I don't think that should cause a real regression in production. In my mind, waiting for volume deletion should have had a bigger impact in the first place, and that was introduced a couple of releases ago. But our performance measurements never caught that issue, and not a single production issue related to it has been reported.

@shyamjvs
Member

shyamjvs commented Sep 28, 2017

I think in general the measurement we're measuring is somewhat arbitrary - I'd prefer to ship with this, document the 5000-node cluster issue, and make the kubelet not thundering-herd the masters on MASSIVE deletion.

@smarterclayton I'm not sure I agree with this. The delete-pods latency shot up from 20ms to greater than 3s, which is orders of magnitude higher and violates the SLOs we promise. And this was seen at a scale of 2k nodes, not even 5k. The pod deletion we're doing in the density test is not abnormal, in my opinion: we're deleting about 3000 pods within a couple of minutes, which is something I believe could happen if a user brought down a big deployment.

Though I don't have a strong opinion here and could be wrong.

@smarterclayton
Contributor

smarterclayton commented Sep 28, 2017

Stepping back a bit on the underlying user scenario, the POINT of the SLO is probably:

  1. I can turn over all the pods in my clusters safely and quickly at a very large cluster size

Do we agree? If so, from the data that we're seeing, are we violating that user scenario?

I.e., is the 3s latency impacting turning over the cluster?

@spiffxp
Member

spiffxp commented Sep 28, 2017

/test pull-kubernetes-e2e-gce-bazel

@dchen1107
Member

@shyamjvs on:

The delete pods latency shot up from 20ms to greater than 3s (which is orders of magnitude higher and violating the SLOs we promise).

Do we think this is a reasonable SLO for deleting about 3000 pods in a large cluster in the first place?

@shyamjvs
Member

Discussing further in the burndown meeting happening right now.

@shyamjvs
Member

Just FYI: the scale test is passing with this change. The latency still hasn't gone back to what it was before that commit, but it's well below our 1s threshold:

Sep 28 20:23:57.699: INFO: Top latency metric: {Resource:pods Subresource: Verb:DELETE Scope:namespace Latency:{Perc50:4.486ms Perc90:7.763ms Perc99:344.744ms Perc100:0s} Count:124000}

@shyamjvs
Member

So what's the plan here? Do we want to get this in for 1.9 and then fix the thundering herd issue before re-reverting, or just wait until the issue is fixed?
I'd suggest the former, in order to have a green scalability signal for 1.9 and be able to catch new regressions (note that these tests are release-blocking for 1.9 too).

@derekwaynecarr
Member

I am confused: can we not protect this behind the feature gate flag for local disk scheduling? That seems like the place where the need to enforce this cleanup makes more immediate sense to me.

@liggitt
Member

liggitt commented Sep 28, 2017

can we not protect this behind the feature gate flag for local disk scheduling? That seems like the place where the need to enforce this cleanup makes more immediate sense to me.

If the only impact of deleting the pod prior to container cleanup is on disk-based scheduling, that makes sense to me as well: it lets us get back the SLO and test signal immediately in 1.9, gives us a minimal pick to 1.8.1, and makes resolving the thundering herd a requirement for enabling that feature gate by default (either by smearing deletes or rethinking the periodic GC).
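
As a rough illustration of the "smearing deletes" option mentioned above (all names and durations here are assumptions for the sake of the example, not real kubelet code), adding a random per-pod delay before the DELETE call would spread the post-GC burst over a window instead of hitting the API server all at once:

```go
// Hypothetical sketch of smearing deletes: rather than issuing every pod
// DELETE immediately after the periodic container GC pass, each call waits a
// random fraction of a smear window so the API server sees a spread-out
// trickle instead of a synchronized burst.
package main

import (
	"fmt"
	"math/rand"
	"time"
)

func smearedDelay(window time.Duration) time.Duration {
	return time.Duration(rand.Int63n(int64(window)))
}

func main() {
	window := 10 * time.Second // assumed smear window
	for _, pod := range []string{"web-0", "web-1", "web-2"} {
		// In a real client, the DELETE request would be issued after this delay.
		fmt.Printf("would delete %s after %v\n", pod, smearedDelay(window))
	}
}
```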

@k8s-github-robot

/lgtm cancel //PR changed after LGTM, removing LGTM. @dashpole @dchen1107 @yujuhong

@k8s-github-robot k8s-github-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 28, 2017
@dashpole
Contributor Author

I added the local storage feature gate, as that seems preferable to removing that portion of code entirely. I don't have an opinion on whether or not this should go in.
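
For context, a hedged sketch of what putting this behind the gate could look like — the identifiers below (`podResourcesAreReclaimed`, `localStorageGateEnabled`, and the status fields) are illustrative assumptions, not the actual kubelet code:

```go
// Hypothetical sketch: volume cleanup is always required before deleting a
// terminated pod, but container cleanup only blocks deletion when the local
// storage feature gate is enabled, since that is the only case where leftover
// container data on disk affects scheduling decisions.
package main

import "fmt"

type reclaimStatus struct {
	volumesInUse     int
	containersOnDisk int
}

func podResourcesAreReclaimed(s reclaimStatus, localStorageGateEnabled bool) bool {
	if s.volumesInUse > 0 {
		return false // always wait for volumes
	}
	if localStorageGateEnabled && s.containersOnDisk > 0 {
		return false // only wait for container GC when the gate is on
	}
	return true
}

func main() {
	s := reclaimStatus{volumesInUse: 0, containersOnDisk: 2}
	fmt.Println("gate off:", podResourcesAreReclaimed(s, false)) // true: delete immediately
	fmt.Println("gate on: ", podResourcesAreReclaimed(s, true))  // false: wait for container GC
}
```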

@dchen1107
Member

re: #53210 (comment)

I did propose using this feature gate in an earlier burndown meeting. The decision was to release 1.8.0 as is and document the known issue, and we are working on a solution addressing the issue in 1.8.1.

@liggitt
Member

liggitt commented Sep 28, 2017

I didn't realize there was an existing feature gate this cleanup was related to. If this cleanup only matters in the case where local disk scheduling is taken into account, it makes sense to me to use the existing feature flag to expedite fixing the thundering herd issue for 1.8.x, and give us more time to figure out how to resolve the issue when local disk scheduling is enabled.

@yujuhong
Contributor

Spoke to @dashpole offline. The feature gate sounds like a good compromise.
The only downside is that we'd have to live with (overly) aggressive pod eviction for longer before this is fixed properly. That seems tolerable, since the bug existed in 1.7 as well.

On the other hand, I'd rather see the feature gate change included in 1.8.0, as opposed to changing behavior from 1.8.0 to 1.8.1 with no good reason.

@wojtek-t
Member

wojtek-t commented Oct 2, 2017

So what's the status of this? In my opinion, the feature-gate solution seems acceptable, and it would be great if we could merge it and backport it to 1.8.1.

@dashpole
Contributor Author

dashpole commented Oct 3, 2017

Closing as this PR is no longer needed. The underlying issue is solved by #53233. See #53233 (comment)

@dashpole dashpole closed this Oct 3, 2017
@dashpole dashpole deleted the revert_container_cleanup branch October 3, 2017 16:34
@jpbetz
Contributor

jpbetz commented Oct 25, 2017

Removing 1.8 milestone since this is no longer needed.

@jpbetz jpbetz removed this from the v1.8 milestone Oct 25, 2017
@k8s-cherrypick-bot

Removing label cherrypick-candidate because no release milestone was set. This is an invalid state and thus this PR is not being considered for cherry-pick to any release branch. Please add an appropriate release milestone and then re-add the label.
