
GKE tests timeout #69891

Closed
mortent opened this issue Oct 16, 2018 · 10 comments
Assignees
Labels
kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

@mortent
Member

mortent commented Oct 16, 2018

Tests are timing out on GKE

Dashboards: https://k8s-testgrid.appspot.com/sig-release-master-blocking#gke-cos-master-serial
https://k8s-testgrid.appspot.com/sig-release-master-upgrade#gke-gci-new-gci-master-upgrade-master
https://k8s-testgrid.appspot.com/sig-release-master-upgrade#gke-gci-new-gci-master-upgrade-cluster

Sample failure: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gke-serial/7208

This might be related to #69597, but it seems to fail in a different way, and more tests seem to be affected.

/sig test-infrastructure
/sig gcp
/kind failing-test
/priority important-soon

/cc @msau42
/cc @justinsb
/cc @bsalamat

@k8s-ci-robot k8s-ci-robot added sig/gcp kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Oct 16, 2018
@msau42
Member

msau42 commented Oct 16, 2018

Never mind, ignore this.

@msau42
Member

msau42 commented Oct 16, 2018

I think I found the issue.

This test case is stopping kubelet:

I1016 07:04:10.561] [sig-storage] In-tree Volumes [Driver: gcepd] [Testpattern: Pre-provisioned PV (default fs)] subPath
I1016 07:04:10.561]   should unmount if pod is gracefully deleted while kubelet is down [Disruptive][Slow]
I1016 07:04:10.561]   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/storage/testsuites/subpath.go:313

I1016 07:04:42.313] Oct 16 07:04:42.313: INFO: ssh prow@130.211.223.164:22: command:   sudo systemctl stop kubelet
I1016 07:04:42.313] Oct 16 07:04:42.313: INFO: ssh prow@130.211.223.164:22: stdout:    ""
I1016 07:04:42.313] Oct 16 07:04:42.313: INFO: ssh prow@130.211.223.164:22: stderr:    ""
I1016 07:04:42.313] Oct 16 07:04:42.313: INFO: ssh prow@130.211.223.164:22: exit code: 0
I1016 07:04:42.314] Oct 16 07:04:42.313: INFO: Waiting up to 1m0s for node gke-e2e-7208-4a777-default-pool-8048d34c-tg31 condition Ready to be false
I1016 07:04:42.316] Oct 16 07:04:42.315: INFO: Condition Ready of node gke-e2e-7208-4a777-default-pool-8048d34c-tg31 is true instead of false. Reason: KubeletReady, message: kubelet is posting ready status. AppArmor enabled

But because of #69786, the node lifecycle controller never updated the node Ready status, so the test timed out and failed. However, the test does not seem to have run its recovery procedure to restart kubelet.

So all the subsequent tests that have pods scheduled to this node will fail.

@msau42
Member

msau42 commented Oct 16, 2018

/assign @jingxu97

@AishSundar
Contributor

@jingxu97 any update on this issue?

@AishSundar
Contributor

@wangzhen127 @msau42 now that #69786 is fixed, will this help this timeout as well?

@msau42
Member

msau42 commented Oct 17, 2018

Yes, with #69786 fixed, the condition that caused this test to fail without cleaning up properly will no longer be triggered.

#69944 fixes the test cleanup.

@AishSundar
Contributor

#69944 is merged, and from the looks of it the GKE jobs seem to be passing (at least on master, which had a recent run). Thanks @jingxu97 and @msau42 for the investigation and fix.

https://k8s-testgrid.appspot.com/sig-release-master-blocking#gke-cos-master-serial

@jberkus or @mortent can close this issue once the upgrade jobs turn green as well.

@jberkus

jberkus commented Oct 18, 2018

Now that we're not timing out, we're seeing some other failures. I'll wait for one more consistent run, and then close this and open a new issue for the new failures.

@jberkus

jberkus commented Oct 23, 2018

/close

@k8s-ci-robot
Contributor

@jberkus: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
