
Enable graceful deletion of pods (7/7) #12963

Merged
2 commits merged into kubernetes:master from the enable_graceful branch on Sep 1, 2015

Conversation

smarterclayton
Contributor

This commit actually enables graceful deletion of pods. It should not be merged before
all the other prerequisites are merged and the e2e tests are stable.

Only the last commit is part of this pull request

Extracted from #9165
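
For readers coming to this later: "graceful deletion" means a delete request that carries a grace period, so the kubelet can send SIGTERM and wait before force-killing containers, rather than the pod being removed immediately. Below is a minimal sketch of what such a request looks like from a client, assuming the modern client-go API (which postdates this PR); the namespace and pod name are placeholders.

```go
// A minimal sketch, assuming the modern client-go API (not the 2015 client):
// delete a pod with a 30-second grace period rather than an immediate kill.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (placeholder environment).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// GracePeriodSeconds is what graceful deletion adds over a plain delete:
	// the kubelet gets this long after SIGTERM before it force-kills the pod.
	grace := int64(30)
	err = client.CoreV1().Pods("default").Delete(context.Background(), "my-pod", metav1.DeleteOptions{
		GracePeriodSeconds: &grace,
	})
	if err != nil {
		panic(err)
	}
	fmt.Println("graceful deletion of default/my-pod requested")
}
```

The equivalent from the command line is roughly `kubectl delete pod my-pod --grace-period=30`.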

@smarterclayton smarterclayton changed the title Enable graceful deletion of pods WIP - Enable graceful deletion of pods Aug 20, 2015
@smarterclayton smarterclayton changed the title WIP - Enable graceful deletion of pods WIP - Enable graceful deletion of pods (7/7) Aug 20, 2015
@k8s-bot

k8s-bot commented Aug 20, 2015

GCE e2e build/test failed for commit 8fbc0d92d618e5f107007af08d63b95d940b409d.

@smarterclayton smarterclayton force-pushed the enable_graceful branch 2 times, most recently from 5baa6c0 to e344088 on August 20, 2015 at 15:32
@k8s-bot

k8s-bot commented Aug 20, 2015

GCE e2e build/test failed for commit 5baa6c042bc92b296367b86273ee4e9985727f83.

@k8s-bot

k8s-bot commented Aug 20, 2015

GCE e2e build/test passed for commit e3440882bd7de90b1a109f4f48901d172d1c7265.

@k8s-bot

k8s-bot commented Aug 21, 2015

GCE e2e build/test passed for commit cfa410fbf22f81e8ad02b56912c6e047c52c50c5.

@k8s-bot

k8s-bot commented Aug 21, 2015

GCE e2e build/test failed for commit 59bd360f3f1d2b2e0fbca1c2cce2ea0607100ad5.

@k8s-bot

k8s-bot commented Aug 21, 2015

GCE e2e build/test failed for commit 43df21d2b122bdbcc56fb1d7a9019f27f326b62e.

@smarterclayton
Contributor Author

@quinton-hoole in https://storage.cloud.google.com/kubernetes-jenkins/pr-logs/43df21d2b122bdbcc56fb1d7a9019f27f326b62e/kubernetes-pull-build-test-e2e-gce/5901/build-log.txt I'm seeing something that looks impossible - pods on the nodes created by a test, but that particular test hasn't run yet (pre_stop). The failure is caused by the nodes running out of CPU capacity (I think), which basically means the test is going to fail. Are tests running in parallel? That would explain a lot of the services flakes, if another test is already running when one of them starts.

@smarterclayton
Contributor Author

@kubernetes/goog-testing on my previous comment

@ghost

ghost commented Aug 21, 2015

Yup, they run in parallel. We have a blacklist available to exclude non-parallel-friendly tests:

https://github.com/kubernetes/kubernetes/blob/master/hack/jenkins/e2e.sh#L103

Q
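
For illustration of how that exclusion works mechanically: the Jenkins job drives Ginkgo with a skip regex, so a test can opt out of parallel runs by carrying a recognizable tag in its spec name. A minimal sketch, assuming Ginkgo v2 and the "[Serial]" naming convention Kubernetes e2e adopted later; the exact regex and mechanism in e2e.sh at the time may have differed.

```go
// A minimal sketch, assuming Ginkgo v2 and the later "[Serial]" tag convention;
// specs tagged this way can be excluded from parallel jobs with a runner
// argument such as --ginkgo.skip='\[Serial\]'.
package e2e

import (
	"testing"

	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
)

func TestE2E(t *testing.T) {
	RegisterFailHandler(Fail)
	RunSpecs(t, "example e2e suite")
}

// The "[Serial]" marker in the description is the hook the skip regex matches,
// keeping this capacity-hungry spec out of the parallel run.
var _ = Describe("Services: up and down [Serial]", func() {
	It("assumes it has the cluster's spare capacity to itself", func() {
		// Placeholder body standing in for a test that creates several pods.
		Expect(true).To(BeTrue())
	})
})
```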

@ghost

ghost commented Aug 21, 2015

@smarterclayton Of course if it's purely an out-of-capacity problem, we can increase the number of nodes in the test cluster. @ixdy should be able to help with that if necessary.

@smarterclayton
Contributor Author

I think in this case it's a capacity problem caused by the "services up and down" and "multiport" tests - they each create 3-6 pods, so while other tests may be relevant, I suspect those two are the culprits. They should probably only run when the cluster is reasonably empty.

@ghost

ghost commented Aug 21, 2015

146 e2e tests ran in parallel, and all of them create a few pods. What makes you think that the 3-6 pods created by each of those few services tests are the problem? @ixdy, how do you currently capacity plan the per-PR e2e test cluster? Do you have enough nodes to run all of the current tests in parallel, or do you need to increase the cluster size?

@ghost

ghost commented Aug 21, 2015

@smarterclayton ^^

@ghost

ghost commented Aug 21, 2015

@smarterclayton #13032 might be related.

@smarterclayton
Contributor Author

Pods staying "pending" for longer than a minute on a fast cluster means either GCR being slow (which I very much doubt) or the cluster being over capacity. Maybe we need to wire the e2e tests to immediately emit a warning whenever a pod can't be scheduled, and find a way to intersperse that with the logs.
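
A minimal sketch of that idea, assuming the modern client-go API rather than the 2015 e2e framework: it lists Pending pods and prints their FailedScheduling events so a capacity problem shows up in the log the moment it happens. The namespace is a placeholder.

```go
// A minimal sketch, assuming the modern client-go API (not what the 2015 e2e
// framework used): surface unschedulable pods immediately instead of waiting
// for a test timeout.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// warnUnschedulable prints a warning for every Pending pod in the namespace
// that has a FailedScheduling event recorded against it.
func warnUnschedulable(ctx context.Context, client kubernetes.Interface, ns string) error {
	pods, err := client.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{
		FieldSelector: "status.phase=Pending",
	})
	if err != nil {
		return err
	}
	for _, pod := range pods.Items {
		events, err := client.CoreV1().Events(ns).List(ctx, metav1.ListOptions{
			FieldSelector: fmt.Sprintf("involvedObject.kind=Pod,involvedObject.name=%s", pod.Name),
		})
		if err != nil {
			return err
		}
		for _, e := range events.Items {
			if e.Reason == "FailedScheduling" {
				fmt.Printf("WARNING: pod %s/%s cannot be scheduled: %s\n", ns, pod.Name, e.Message)
			}
		}
	}
	return nil
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	// "e2e-tests" is a placeholder namespace; a real hook would run this in a
	// loop or from a watch while the suite executes.
	if err := warnUnschedulable(context.Background(), client, "e2e-tests"); err != nil {
		panic(err)
	}
}
```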

@smarterclayton
Contributor Author

There could also be flakes from CPU capacity on the nodes - I've seen even the standard 2-node cluster fail to schedule the infrastructure pods. Do we have a general discussion forum for e2e that is better than this issue?

@ghost

ghost commented Aug 22, 2015

@smarterclayton Aah yes! Thanks for reminding me. I've just created:

https://groups.google.com/forum/#!forum/kubernetes-sig-testing

... which is probably the right place to have those discussions.

@k8s-bot

k8s-bot commented Aug 22, 2015

GCE e2e build/test failed for commit 449c61974be20ddd7b7462d2a4489627f0b406f3.

@ixdy
Member

ixdy commented Aug 24, 2015

"how do you currently capacity plan the per-PR e2e test cluster?"

There isn't really any planning involved right now. It's set at 2 minions because (I think) at the time we were quota-limited, and it never got bumped back up to the default (which is now 3).

When I first tried testing parallel Ginkgo runs, there wasn't much improvement in wall clock times by increasing the number of minions. I'm not sure whether this is still the case or not, though I think there are some tests which intentionally try to run something on every minion.

@ixdy
Member

ixdy commented Aug 24, 2015

The PR builder also uses an n1-standard-1 master and minions; I wonder if that's starting to cause issues.

@k8s-bot

k8s-bot commented Aug 30, 2015

GCE e2e build/test passed for commit e5600f7.

@smarterclayton
Contributor Author

ok to test

@k8s-bot

k8s-bot commented Aug 31, 2015

GCE e2e build/test passed for commit e5600f7.

@smarterclayton
Contributor Author

ok to test

@k8s-bot

k8s-bot commented Aug 31, 2015

GCE e2e build/test passed for commit e5600f7.

@smarterclayton
Contributor Author

smarterclayton commented Aug 31, 2015 via email

@k8s-bot

k8s-bot commented Aug 31, 2015

GCE e2e build/test failed for commit e5600f7.

@smarterclayton
Contributor Author

ok to test

@k8s-bot

k8s-bot commented Aug 31, 2015

GCE e2e build/test failed for commit e5600f7.

@smarterclayton
Contributor Author

These are all flakes - I can't reproduce them in a local environment. They appear to be related to pods not being able to start on a node within a given period of time - delayed either by image pulling or by scheduling.

@smarterclayton
Contributor Author

ok to test

@k8s-bot

k8s-bot commented Aug 31, 2015

GCE e2e build/test passed for commit e5600f7.

@smarterclayton
Contributor Author

I'm convinced every e2e failure listed in the issues above is a flake here. With graceful deletion enabled, deletion takes a bit longer, but the failures I see seem to be failures to schedule, lag on the nodes, or failures in unrelated cases. I don't see pods lying around. In my own e2e runs I still see some failures, but in general they are all transient. It's possible that with deletion enabled, the extra calls from the node to the master during deletion push some of the changes over a rate limit (when they are heavily loaded).

I believe that this is ready to try merging again, while looking for statistically significant increases in flakiness. No e2e tests fail consistently for me locally.


@derekwaynecarr
Member

LGTM - e2e has been flaky; I agree, I don't see an issue.

@derekwaynecarr derekwaynecarr added the lgtm label Aug 31, 2015
@ghost

ghost commented Aug 31, 2015

Ack, LGTM. Work is ongoing in #13032 to fix the flaky services tests.

@smarterclayton
Contributor Author

ok to test

@k8s-bot

k8s-bot commented Sep 1, 2015

GCE e2e build/test failed for commit e5600f7.

@smarterclayton
Contributor Author

ok to test

@k8s-bot

k8s-bot commented Sep 1, 2015

GCE e2e build/test failed for commit e5600f7.

@smarterclayton
Contributor Author

ok to test

@k8s-bot

k8s-bot commented Sep 1, 2015

GCE e2e build/test passed for commit e5600f7.

@smarterclayton
Contributor Author

@k8s-oncall when this is merged, we're looking (as in the past) for an increase in failures due to the increased deletion latency. We don't have any concrete examples at this point though, so this is ready to merge.

@derekwaynecarr
Member

All things look green. Let's do this.

derekwaynecarr added a commit that referenced this pull request Sep 1, 2015
Enable graceful deletion of pods (7/7)
@derekwaynecarr derekwaynecarr merged commit a7e47ca into kubernetes:master Sep 1, 2015
@ghost ghost added the release-note label Sep 11, 2015
Labels: lgtm, release-note, size/M