Enable graceful deletion of pods (7/7) #12963
Conversation
GCE e2e build/test failed for commit 8fbc0d92d618e5f107007af08d63b95d940b409d.
Force-pushed from 5baa6c0 to e344088.
GCE e2e build/test failed for commit 5baa6c042bc92b296367b86273ee4e9985727f83.
GCE e2e build/test passed for commit e3440882bd7de90b1a109f4f48901d172d1c7265.
Force-pushed from e344088 to cfa410f.
GCE e2e build/test passed for commit cfa410fbf22f81e8ad02b56912c6e047c52c50c5.
Force-pushed from cfa410f to 59bd360.
GCE e2e build/test failed for commit 59bd360f3f1d2b2e0fbca1c2cce2ea0607100ad5.
Force-pushed from 59bd360 to 43df21d.
GCE e2e build/test failed for commit 43df21d2b122bdbcc56fb1d7a9019f27f326b62e.
@quinton-hoole in https://storage.cloud.google.com/kubernetes-jenkins/pr-logs/43df21d2b122bdbcc56fb1d7a9019f27f326b62e/kubernetes-pull-build-test-e2e-gce/5901/build-log.txt I'm seeing something that looks impossible - pods on the nodes created by a test (pre_stop) that hasn't run yet. The failure seems to be caused by the nodes running out of CPU capacity (I think), which basically means the test is going to fail. Are tests running in parallel? That would explain a lot of the services flakes, if other tests are already running when one of them starts.
@kubernetes/goog-testing on my previous comment
Yup, they run in parallel. We have a blacklist available to exclude non-parallel-friendly tests: https://github.com/kubernetes/kubernetes/blob/master/hack/jenkins/e2e.sh#L103 Q
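(Editorial sketch, not code from this thread: the blacklist referenced above is maintained as a shell list in hack/jenkins/e2e.sh; conceptually it is just a set of test-name patterns OR-ed into a single regex that is handed to Ginkgo as --ginkgo.skip so those specs never run in the parallel suite. The entries below are illustrative placeholders, not the real list.)

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

func main() {
	// Illustrative blacklist entries; the real list lives in hack/jenkins/e2e.sh.
	blacklist := []string{
		"Services.*up and down", // example pattern, not the actual entry
		"Nodes.*Resize",         // example pattern, not the actual entry
	}
	// OR the patterns into one regex and pass it to Ginkgo as --ginkgo.skip.
	skip := strings.Join(blacklist, "|")
	if _, err := regexp.Compile(skip); err != nil {
		panic(fmt.Sprintf("blacklist does not form a valid regex: %v", err))
	}
	fmt.Printf("--ginkgo.skip=%s\n", skip)
}
```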
@smarterclayton Of course if it's purely an out-of-capacity problem, we can increase the number of nodes in the test cluster. @ixdy should be able to help with that if necessary.
I think in this case it's capacity pressure caused by the services "up and down" and multiport tests - they create 3-6 pods each, so while other tests may be relevant, I suspect those two are the culprits. They should probably only run when the cluster is reasonably empty.
146 e2e tests ran in parallel, and all of them create a few pods. What makes you think that the 3-6 pods created by each of those few services tests are the problem? @ixdy, how do you currently capacity plan the per-PR e2e test cluster? Do you have enough nodes to run all of the current tests in parallel, or do you need to increase the cluster size?
@smarterclayton #13032 might be related.
The "pending" for longer than a minute on a fast cluster is either gcr being slow (which I very much doubt) or overcapacity. Maybe we need to wire the e2e tests to immediately output and warn whenever a pod can't be scheduled and find a way to intersperse that with the logs. |
There could also be flakes with CPU capacity on the nodes - I've seen even the standard 2-node cluster fail to schedule the infrastructure pods. Do we have a general discussion forum to talk about e2e that is better than this issue?
Force-pushed from 43df21d to 449c619.
@smarterclayton Aah yes! Thanks for reminding me. I've just created: https://groups.google.com/forum/#!forum/kubernetes-sig-testing ... which is probably the right place to have those discussions.
GCE e2e build/test failed for commit 449c61974be20ddd7b7462d2a4489627f0b406f3.
"how do you currently capacity plan the per-PR e2e test cluster?" There isn't really any planning involved right now. It's set at 2 minions because (I think) at the time we were quota-limited, and it never got bumped back up to the default (which is now 3). When I first tried testing parallel Ginkgo runs, there wasn't much improvement in wall clock times by increasing the number of minions. I'm not sure whether this is still the case or not, though I think there are some tests which intentionally try to run something on every minion. |
The PR builder also uses an n1-standard-1 master and minions; I wonder if that's starting to cause issues.
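(Rough editorial arithmetic, not figures from this thread: an n1-standard-1 machine has a single vCPU, so a 2-minion cluster offers roughly 2 schedulable cores before system pods are counted. If even a few dozen of the ~146 parallel specs each keep 2-3 pods running and each pod requests on the order of 100m CPU - an illustrative assumption - the requests alone add up to several cores, well beyond the cluster's capacity, which would leave pods stuck in Pending exactly as described above.)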
GCE e2e build/test passed for commit e5600f7.
ok to test
GCE e2e build/test passed for commit e5600f7.
ok to test
GCE e2e build/test passed for commit e5600f7.
ok to test
GCE e2e build/test failed for commit e5600f7.
ok to test
GCE e2e build/test failed for commit e5600f7.
These are all flakes - I can't reproduce them in a local environment. They appear to be related to pods not being able to start on a node within a given period of time - delayed either by image pulling or by scheduling.
ok to test
GCE e2e build/test passed for commit e5600f7.
I'm convinced every e2e failure listed in the issues above is a flake. I believe that this is ready to try merging again.
Clayton Coleman | Lead Engineer, OpenShift
LGTM - e2e has been flaky; agreed, I don't see an issue.
Ack, LGTM. Work is ongoing in #13032 to fix the flaky services tests.
ok to test
GCE e2e build/test failed for commit e5600f7.
ok to test
GCE e2e build/test failed for commit e5600f7.
ok to test
GCE e2e build/test passed for commit e5600f7.
@k8s-oncall when this is merged, we'll be looking (as in the past) for an increase in failures due to increased latency. We don't have any concrete examples at this point, though, so this is ready to merge.
All things look green. Let's do this.
Enable graceful deletion of pods (7/7)
This commit actually enables graceful deletion of pods. It should not be merged before
all other prerequisites are merged and the e2e tests are stable.
Only the last commit is part of this pull request.
Extracted from #9165
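(For readers landing here later, a minimal sketch of what graceful deletion means from a client's point of view: the caller supplies a grace period, and the pod is only removed after the kubelet has had that long to stop its containers. It is written against today's client-go rather than the 2015 client this PR actually changed; the namespace, pod name, and kubeconfig handling are placeholders.)

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the local kubeconfig; placeholder setup for the example.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Ask for graceful deletion: the kubelet gets this many seconds to stop
	// the pod's containers before they are force-killed.
	grace := int64(30)
	err = client.CoreV1().Pods("default").Delete(context.TODO(), "example-pod",
		metav1.DeleteOptions{GracePeriodSeconds: &grace})
	if err != nil {
		panic(err)
	}
	fmt.Println("delete accepted; pod will terminate gracefully")
}
```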