
Failing Test: timeout in deployment on gci-gke-serial #69597

Closed
jberkus opened this Issue Oct 9, 2018 · 13 comments

jberkus commented Oct 9, 2018

Testgrid: https://k8s-testgrid.appspot.com/sig-release-master-blocking#gci-gke-serial

Sample failure log: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gke-serial/7185?log#log

This test job has always been flaky, but it is now failing more often than it passes. Diagnosing it is difficult, though, because even the "successful" runs have multiple failures in the log, and the reported test failures don't appear to be independent failures; they look related to repeated failures to deploy pods and storage in the test setup.

Does anyone have more insight into what's going on here? This feels more like a Google Cloud or test config failure than anything else.

/kind failing-test
/kind flake
/sig test-infrastructure
/sig gcp
/sig apps

jberkus commented Oct 9, 2018

/sig testing

krzyzacy commented Oct 9, 2018

Also, https://k8s-testgrid.appspot.com/sig-release-master-blocking#gci-gce-serial looks like it has similar issues, so this is actually not a GKE-specific problem.

krzyzacy commented Oct 9, 2018

One of the kubelet logs has:

curl: (7) Failed to connect to 127.0.0.1 port 10255: Connection refused
Kubelet is unhealthy!
Failed to kill unit kubelet.service: Unit kubelet.service is not loaded.

not sure if that's helpful at all...

/assign @saad-ali
(redirect to get some help from dev-oncall to see if can find any leads here...)
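As a possible next debugging step (not something pulled from the failure logs, just a sketch): the health monitor is clearly probing the kubelet's read-only port, so SSHing to an affected node and hitting the same endpoint by hand can show whether the kubelet actually died or the monitor is misfiring. The node name and zone below are placeholders.

# Placeholder node name/zone; probe the kubelet the same way the health monitor in
# the log above does, then check whether the kubelet unit is even loaded.
gcloud compute ssh NODE_NAME --zone ZONE -- \
  'curl -sS -m 5 http://127.0.0.1:10255/healthz; echo; systemctl status kubelet.service --no-pager | head -n 5'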

AishSundar commented Oct 12, 2018

/assign @artemvmin
/cc @msau42

k8s-ci-robot commented Oct 12, 2018

@AishSundar: GitHub didn't allow me to assign the following users: artemvmin.

Note that only kubernetes members and repo collaborators can be assigned.
For more information please see the contributor guide

In response to this:

/assign @artemvmin
/cc @msau42

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

msau42 commented Oct 12, 2018

The suites are failing on both GCE and GKE, so I don't think it's a GKE-specific issue.

Looking at one occurrence: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-serial/4196

The first test that started failing was:

I1010 21:40:43.864] • Failure [306.621 seconds]
I1010 21:40:43.864] [sig-storage] In-tree Volumes
I1010 21:40:43.864] /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/storage/utils/framework.go:22
I1010 21:40:43.864]   [Driver: emptydir]
I1010 21:40:43.864]   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/storage/in_tree_volumes.go:74
I1010 21:40:43.865]     [Testpattern: Inline-volume (default fs)] subPath
I1010 21:40:43.865]     /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/storage/testsuites/subpath.go:99
I1010 21:40:43.865]       should unmount if pod is gracefully deleted while kubelet is down [Disruptive][Slow] [It]
I1010 21:40:43.865]       /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/storage/testsuites/subpath.go:313
I1010 21:40:43.865] 
I1010 21:40:43.865]       while waiting for pod to be running
I1010 21:40:43.865]       Expected error:
I1010 21:40:43.866]           <*errors.errorString | 0xc00009b860>: {
I1010 21:40:43.866]               s: "timed out waiting for the condition",
I1010 21:40:43.866]           }
I1010 21:40:43.866]           timed out waiting for the condition
I1010 21:40:43.866]       not to have occurred
I1010 21:40:43.866] 
I1010 21:40:43.866]       /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/storage/testsuites/subpath.go:717

It failed because the pod couldn't be scheduled. Looking at the scheduler logs, I only see these two lines referring to that pod and nothing else:

I1010 21:35:37.378227       1 factory.go:1236] About to try and schedule pod e2e-tests-volumes-cmvdp/pod-subpath-test-emptydir-9tlr
I1010 21:35:37.378259       1 scheduler.go:410] Attempting to schedule pod: e2e-tests-volumes-cmvdp/pod-subpath-test-emptydir-9tlr

This is odd because there is usually at least one more log message if scheduling failed or succeeded. This implies that the pod is stuck somewhere in the scheduler and there could potentially be a deadlock. There are no other scheduling logs about other pods after this.

An odd pattern I noticed: in all of the serial runs where a ton of tests fail, the equivalence cache tests are always the test case that runs right before the subsequent failing tests. And there's always a line in scheduler.log like:

I1010 21:35:18.629752       1 factory.go:1236] About to try and schedule pod e2e-tests-equivalence-cache-zg5mb/with-pod-affinity-5b75a235-ccd4-11e8-8c79-0a580a3c0bf3-58rpx
I1010 21:35:18.629805       1 scheduler.go:406] Skip schedule deleting pod: e2e-tests-equivalence-cache-zg5mb/with-pod-affinity-5b75a235-ccd4-11e8-8c79-0a580a3c0bf3-58rpx

I wonder if there is some race condition regarding the pod deletion case.

I have seen this pattern in the most recent runs where a bunch of tests all fail:
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-serial/4177
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-serial/4189
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-serial/4195
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-serial/4196

The first failing test case is different across all these runs, but the commonality between them all is that the equivalence cache tests run right before it.
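(For anyone checking other runs: a quick, unofficial way to confirm the same pattern is to download the scheduler log from a run's artifacts and grep for the stuck pod and the "Skip schedule deleting pod" line. The log file name below is a guess based on the usual artifacts layout.)

# Rough sketch, assuming the scheduler log has been downloaded from the run's artifacts.
# The stuck pod should show only the two "About to try"/"Attempting to schedule" lines,
# and the deletion-skip line should appear shortly before the failures start.
grep -n 'pod-subpath-test-emptydir-9tlr' kube-scheduler.log
grep -n 'Skip schedule deleting pod' kube-scheduler.log | head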

msau42 commented Oct 12, 2018

AishSundar commented Oct 12, 2018

@k82cn @bsalamat can you please take a look at this and help triage within scheduling?

msau42 commented Oct 12, 2018

FYI, this is pretty easy to repro on a dev cluster at head. I just run the EquivalenceCache tests and then create any Pod afterwards. The Pod fails to schedule about 50% of the time.
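In case it helps anyone else reproduce this, the steps look roughly like the following; the focus regex, provider flag, and pause image are illustrative, not necessarily the exact commands @msau42 used.

# Run only the equivalence cache e2e tests against an existing dev cluster
# (adjust --provider to match your setup; the focus regex is illustrative).
go run hack/e2e.go -- --provider=gce --test --test_args='--ginkgo.focus=EquivalenceCache'

# Then create any ordinary pod and watch whether it ever leaves Pending.
kubectl run repro --image=k8s.gcr.io/pause:3.1 --restart=Never
kubectl get pod repro -w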

bsalamat commented Oct 12, 2018

I am investigating it.

bsalamat commented Oct 13, 2018

I identified the issue and sent a PR to address it. We need to cherry-pick the fix to 1.12 as well.
Thanks to @msau42 for the initial investigation and finding reproduction steps!

AishSundar commented Oct 13, 2018

Thanks a lot @msau42 and @bsalamat for looking into this and providing a fix PR.

/cc @feiskyer @tpepper @jberkus

@feiskyer #69758 needs to be CP'ed into 1.12 as well.

feiskyer commented Oct 14, 2018

@feiskyer #69758 needs to be CP'ed into 1.12 as well.

It has already been approved. Please create a cherry-pick PR for it.
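(For reference, a sketch of the usual cherry-pick workflow, assuming a kubernetes/kubernetes checkout with an "upstream" remote and the hack/cherry_pick_pull.sh helper; the GitHub username is a placeholder.)

# From a kubernetes/kubernetes checkout; GITHUB_USER and the "upstream" remote name
# are assumptions about the local setup.
export GITHUB_USER=your-github-username
hack/cherry_pick_pull.sh upstream/release-1.12 69758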
