Enable graceful deletion of pods (7/7) #12963
Conversation
GCE e2e build/test failed for commit 8fbc0d92d618e5f107007af08d63b95d940b409d.
Force-pushed from 5baa6c0 to e344088.
GCE e2e build/test failed for commit 5baa6c042bc92b296367b86273ee4e9985727f83.
GCE e2e build/test passed for commit e3440882bd7de90b1a109f4f48901d172d1c7265.
Force-pushed from e344088 to cfa410f.
GCE e2e build/test passed for commit cfa410fbf22f81e8ad02b56912c6e047c52c50c5.
Force-pushed from cfa410f to 59bd360.
GCE e2e build/test failed for commit 59bd360f3f1d2b2e0fbca1c2cce2ea0607100ad5.
Force-pushed from 59bd360 to 43df21d.
GCE e2e build/test failed for commit 43df21d2b122bdbcc56fb1d7a9019f27f326b62e.
@quinton-hoole in https://storage.cloud.google.com/kubernetes-jenkins/pr-logs/43df21d2b122bdbcc56fb1d7a9019f27f326b62e/kubernetes-pull-build-test-e2e-gce/5901/build-log.txt I'm seeing something that looks impossible - pods on the nodes created by a test (pre_stop) that hasn't run yet. The failure seems to be caused by the nodes running out of CPU capacity (I think), which basically means the test is going to fail. Are tests running in parallel? That would explain a lot of the services flakes, if other tests are already running when one of them starts.
@kubernetes/goog-testing on my previous comment
Yup, they run in parallel. We have a blacklist available to exclude non-parallel-friendly tests: https://github.com/kubernetes/kubernetes/blob/master/hack/jenkins/e2e.sh#L103 Q
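(Editorial sketch, not code from this thread: the blacklist referenced above is maintained as a shell list in hack/jenkins/e2e.sh; conceptually it is just a set of test-name patterns OR-ed into a single regex that is handed to Ginkgo as --ginkgo.skip so those specs never run in the parallel suite. The entries below are illustrative placeholders, not the real list.)

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

func main() {
	// Illustrative blacklist entries; the real list lives in hack/jenkins/e2e.sh.
	blacklist := []string{
		"Services.*up and down", // example pattern, not the actual entry
		"Nodes.*Resize",         // example pattern, not the actual entry
	}
	// OR the patterns into one regex and pass it to Ginkgo as --ginkgo.skip.
	skip := strings.Join(blacklist, "|")
	if _, err := regexp.Compile(skip); err != nil {
		panic(fmt.Sprintf("blacklist does not form a valid regex: %v", err))
	}
	fmt.Printf("--ginkgo.skip=%s\n", skip)
}
```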
@smarterclayton Of course if it's purely an out-of-capacity problem, we can increase the number of nodes in the test cluster. @ixdy should be able to help with that if necessary.
I think in this case it's capacity pressure caused by the services "up and down" and multiport tests - they create 3-6 pods each, so while other tests may be relevant, I suspect those two are the culprits. They should probably only run when the cluster is reasonably empty.
146 e2e tests ran in parallel, and all of them create a few pods. What makes you think that the 3-6 pods created by each of those few services tests are the problem? @ixdy, how do you currently capacity plan the per-PR e2e test cluster? Do you have enough nodes to run all of the current tests in parallel, or do you need to increase the cluster size?
@smarterclayton #13032 might be related.
The "pending" for longer than a minute on a fast cluster is either gcr being slow (which I very much doubt) or overcapacity. Maybe we need to wire the e2e tests to immediately output and warn whenever a pod can't be scheduled and find a way to intersperse that with the logs. |
There could also be flakes with CPU capacity on the nodes - I've seen even the standard 2-node cluster fail to schedule the infrastructure pods. Do we have a general discussion forum to talk about e2e that is better than this issue?
Force-pushed from 43df21d to 449c619.
@smarterclayton Aah yes! Thanks for reminding me. I've just created: https://groups.google.com/forum/#!forum/kubernetes-sig-testing ... which is probably the right place to have those discussions.
GCE e2e build/test failed for commit 449c61974be20ddd7b7462d2a4489627f0b406f3.
"how do you currently capacity plan the per-PR e2e test cluster?" There isn't really any planning involved right now. It's set at 2 minions because (I think) at the time we were quota-limited, and it never got bumped back up to the default (which is now 3). When I first tried testing parallel Ginkgo runs, there wasn't much improvement in wall clock times by increasing the number of minions. I'm not sure whether this is still the case or not, though I think there are some tests which intentionally try to run something on every minion. |
The PR builder also uses an n1-standard-1 master and minions; I wonder if that's starting to cause issues.
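(Rough editorial arithmetic, not figures from this thread: an n1-standard-1 machine has a single vCPU, so a 2-minion cluster offers roughly 2 schedulable cores before system pods are counted. If even a few dozen of the ~146 parallel specs each keep 2-3 pods running and each pod requests on the order of 100m CPU - an illustrative assumption - the requests alone add up to several cores, well beyond the cluster's capacity, which would leave pods stuck in Pending exactly as described above.)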
GCE e2e build/test passed for commit e5600f7.
ok to test
GCE e2e build/test passed for commit e5600f7.
ok to test
GCE e2e build/test passed for commit e5600f7.
ok to test
GCE e2e build/test failed for commit e5600f7.
ok to test
GCE e2e build/test failed for commit e5600f7.
These are all flakes - I can't reproduce them in a local environment. They appear to be related to pods not being able to start on a node within a given period of time - delayed either by image pulling or by scheduling.
ok to test
GCE e2e build/test passed for commit e5600f7.
I'm convinced every e2e failure listed in the issues above is a flake. I believe that this is ready to try merging again.
Clayton Coleman | Lead Engineer, OpenShift
LGTM - e2e has been flaky; agreed, I don't see an issue.
Ack, LGTM. Work is ongoing in #13032 to fix the flaky services tests.
ok to test
GCE e2e build/test failed for commit e5600f7.
ok to test
GCE e2e build/test failed for commit e5600f7.
ok to test
GCE e2e build/test passed for commit e5600f7.
@k8s-oncall when this is merged, we'll be looking (as in the past) for an increase in failures due to increased latency. We don't have any concrete examples at this point, though, so this is ready to merge.
All things look green. Let's do this.
Enable graceful deletion of pods (7/7)
This commit actually enables graceful deletion of pods. It should not be merged before
all other prerequisites are merged and the e2e tests are stable.
Only the last commit is part of this pull request.
Extracted from #9165
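(For readers landing here later, a minimal sketch of what graceful deletion means from a client's point of view: the caller supplies a grace period, and the pod is only removed after the kubelet has had that long to stop its containers. It is written against today's client-go rather than the 2015 client this PR actually changed; the namespace, pod name, and kubeconfig handling are placeholders.)

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the local kubeconfig; placeholder setup for the example.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Ask for graceful deletion: the kubelet gets this many seconds to stop
	// the pod's containers before they are force-killed.
	grace := int64(30)
	err = client.CoreV1().Pods("default").Delete(context.TODO(), "example-pod",
		metav1.DeleteOptions{GracePeriodSeconds: &grace})
	if err != nil {
		panic(err)
	}
	fmt.Println("delete accepted; pod will terminate gracefully")
}
```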