
kubernetes-e2e-gke-container_vm-1.4-container_vm-1.5-upgrade-cluster-new: broken test run #37981

Closed
k8s-github-robot opened this issue Dec 2, 2016 · 18 comments

@k8s-github-robot commented Dec 2, 2016

Failed: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gke-container_vm-1.4-container_vm-1.5-upgrade-cluster-new/172/

Run so broken it didn't make JUnit output!

@k8s-github-robot added the kind/flake, priority/backlog, and area/test-infra labels Dec 2, 2016
@krousey assigned krousey and unassigned ixdy Dec 2, 2016
@krousey (Contributor) commented Dec 2, 2016

cc @dchen1107

For some reason, the kubelet stops updating its status. I've verified that the node controller sets the node to "not ready" and that the API server sees no further updates from the kubelet. Since this is a GKE test, I've pulled the relevant master logs into this spreadsheet.

The API server log shows that no more node status updates arrive after the controller marks the node not ready. The controller manager logs "Recording status change NodeNotReady event message for node gke-jenkins-e2e-default-pool-e95f9b13-dlae".
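
For illustration, a minimal client-go sketch of the check described above: it lists each node's Ready condition and its lastHeartbeatTime, which is where a kubelet that has stopped posting status shows up. This is not part of the e2e suite; the kubeconfig path is an assumption, and List signatures differ across client-go versions.

```go
package main

import (
	"context"
	"fmt"
	"path/filepath"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

func main() {
	// Assumes a kubeconfig at the default location; adjust as needed.
	kubeconfig := filepath.Join(homedir.HomeDir(), ".kube", "config")
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, node := range nodes.Items {
		for _, cond := range node.Status.Conditions {
			if cond.Type == corev1.NodeReady {
				// A stale LastHeartbeatTime means the kubelet stopped
				// posting status; the node controller then flips the
				// node to NotReady.
				fmt.Printf("%s Ready=%s lastHeartbeat=%v\n",
					node.Name, cond.Status, cond.LastHeartbeatTime)
			}
		}
	}
}
```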

@krousey added this to the v1.5 milestone Dec 2, 2016
@krousey added the priority/important-soon and release-blocker labels and removed the kind/flake and priority/backlog labels Dec 2, 2016
@krousey added the priority/critical-urgent label and removed the priority/important-soon label Dec 2, 2016
@krousey (Contributor) commented Dec 2, 2016

This is blocking us from getting useful signals for GKE CVM upgrades.

@dchen1107 (Member) commented:

@krousey I checked the cluster you mentioned above at: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gke-container_vm-1.4-container_vm-1.5-upgrade-cluster-new/172/. The problem is not that the node is not ready; it is that the pod is not ready because the image pull failed. You are hitting a Docker issue that should be fixed by moby/moby#25523 in Docker 1.13.X. cc @thockin

Dec  2 02:23:32.659: INFO: At 2016-12-02 02:18:11 -0800 PST - event for execpod-t3nwq: {kubelet gke-jenkins-e2e-default-pool-e95f9b13-i6tx} Failed: Failed to pull image "gcr.io/google_containers/busybox:1.24": failed to register layer: rename /var/lib/docker/image/aufs/layerdb/tmp/layer-100454157 /var/lib/docker/image/aufs/layerdb/sha256/2c84284818d186d88a16ac7fa731d4b71ba69ecfe11b4ce00413366833cb2403: directory not empty

As for why you run into this problem so consistently: I believe the node is being shut down in an unclean way.

@roberthbailey (Contributor) commented:

@dchen1107 - is there a workaround other than upgrading the version of docker? What is the recommended docker version that we are planning to use for 1.5?

@krousey (Contributor) commented Dec 5, 2016

@dchen1107 Where did you find that message? Also, how does failing to pull a docker image cause the node to stop reporting its status?

@dchen1107 (Member) commented:

@roberthbailey Each Docker version has different bugs that need to be worked around. In the 1.5 timeframe, we didn't plan to upgrade the Docker version.

@krousey I see what you're saying. The node first hits the Docker issue I reported above; I saw that message in the build log, and some tests failed because of it. Later I saw the error message you reported. I took a deeper look at the nodes from run 172, and I believe something is broken in the integration with the upgrade process, since kern.log shows that none of the nodes rebooted even once. Unfortunately there are no logs related to the upgrade process.

@dchen1107 added the sig/cluster-lifecycle label Dec 5, 2016
@krousey (Contributor) commented Dec 5, 2016

@dchen1107 I think the upgrade process for GKE is to replace the node entirely with a new one. That's why you won't see a reboot. @roberthbailey correct me if I'm wrong.

@dchen1107 (Member) commented:

I don't think so. But even if that is true, my statement above still stands, since there are only the 3 nodes from when the cluster was created and no other nodes ever come up.

@krousey (Contributor) commented Dec 5, 2016

And unfortunately, current runs are timing out since test-infra changed the upgrade jobs to bootstrap in kubernetes/test-infra#1263; some had timeouts of only 2 hours. See kubernetes/test-infra#1291.

@krousey (Contributor) commented Dec 5, 2016

@dchen1107 I believe the VMs are recreated with the same name. Activity logs in the cluster show deletes followed by creates of the same name.

@roberthbailey (Contributor) commented:

> @dchen1107 I think the upgrade process for GKE is to replace the node entirely with a new one. That's why you won't see a reboot. @roberthbailey correct me if I'm wrong.

That is correct. It's similar to the way that upgrade.sh works.

@krousey (Contributor) commented Dec 6, 2016

This test suite passed here https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gke-container_vm-1.4-container_vm-1.5-upgrade-cluster-new/74 and timed out in the two following runs. The job name now has ci- prepended to it.

@krousey (Contributor) commented Dec 7, 2016

The current run, 77, is hung with the soft lockup issue during the "Deployment should support rollover" test. I rebooted the gke-bootstrap-e2e-default-pool-13881f05-ovd4 node so that the suite can hopefully recover and run some more tests.

@krousey (Contributor) commented Dec 7, 2016

It was hung at

I1206 19:55:04.245] [k8s.io] Deployment
I1206 19:55:04.245]   deployment should support rollover
I1206 19:55:04.245]   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/deployment.go:77
I1206 19:55:04.245] [BeforeEach] [k8s.io] Deployment
I1206 19:55:04.245]   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:141
I1206 19:55:04.246] STEP: Creating a kubernetes client
I1206 19:55:04.246] Dec  6 19:55:04.233: INFO: >>> kubeConfig: /workspace/.kube/config
I1206 19:55:04.246] 
I1206 19:55:04.246] STEP: Building a namespace api object
I1206 19:55:04.451] STEP: Waiting for a default service account to be provisioned in namespace
I1206 19:55:04.454] [It] deployment should support rollover
I1206 19:55:04.454]   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/deployment.go:77
I1206 19:55:04.590] Dec  6 19:55:04.590: INFO: Pod name rollover-pod: Found 4 pods out of 4
I1206 19:55:04.591] STEP: ensuring each pod is running
I1206 19:55:06.668] STEP: trying to dial each unique pod

As soon as I rebooted the node, it continued.

@krousey (Contributor) commented Dec 7, 2016

It appears we don't time out pod proxy requests.
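
For illustration, a minimal sketch of the kind of guard that seems to be missing: a plain net/http client with a hard timeout, so a hung proxied pod can't block the caller forever. The URL is a placeholder; the real e2e framework builds these requests through its REST client.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	// A client-wide timeout bounds the whole request, including reading
	// the response body, so a hung proxied pod can't stall the caller.
	client := &http.Client{Timeout: 2 * time.Minute}

	// Placeholder URL standing in for an apiserver pod-proxy request.
	resp, err := client.Get("https://apiserver.example.com/api/v1/namespaces/default/pods/mypod/proxy/")
	if err != nil {
		fmt.Println("proxy request failed or timed out:", err)
		return
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body))
}
```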

@krousey (Contributor) commented Dec 7, 2016

I wrote a program to analyze build logs (since individual test results aren't recorded on timeouts) to see which test cases were taking the longest. It's not that the runs are uniformly slower; a particular test simply hangs.
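
The tool itself isn't attached here; a rough sketch of this style of analysis (assuming glog-style "IMMDD HH:MM:SS.mmm]" prefixes in a local build-log.txt and hard-coding the year) might look like:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
	"time"
)

// Matches the glog-style prefix, e.g. "I1205 21:54:06.352]".
var tsRe = regexp.MustCompile(`^[IWE](\d{4}) (\d{2}:\d{2}:\d{2}\.\d{3})\]`)

func main() {
	f, err := os.Open("build-log.txt") // placeholder path
	if err != nil {
		panic(err)
	}
	defer f.Close()

	var prev time.Time
	var prevLine string
	var maxGap time.Duration
	var gapAt string

	scanner := bufio.NewScanner(f)
	scanner.Buffer(make([]byte, 1024*1024), 1024*1024)
	for scanner.Scan() {
		line := scanner.Text()
		m := tsRe.FindStringSubmatch(line)
		if m == nil {
			continue
		}
		// The year is not in the log prefix; assume 2016 for these runs.
		t, err := time.Parse("20060102 15:04:05.000", "2016"+m[1]+" "+m[2])
		if err != nil {
			continue
		}
		// Track the largest gap between consecutive log lines as a hint
		// at where the run hung.
		if !prev.IsZero() && t.Sub(prev) > maxGap {
			maxGap = t.Sub(prev)
			gapAt = prevLine
		}
		prev, prevLine = t, line
	}
	fmt.Printf("largest gap: %v, after line:\n%s\n", maxGap, gapAt)
}
```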

I ran it over https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gke-container_vm-1.4-container_vm-1.5-upgrade-cluster-new/75 and got

"[k8s.io] Kubectl client [k8s.io] Guestbook application " took 6h53m59.741s. Lines 7464-7747

The interesting part is

I1205 21:54:06.352] STEP: validating guestbook app
I1205 21:54:06.352] Dec  5 21:54:06.351: INFO: Waiting for all frontend pods to be Running.
I1205 21:54:31.353] Dec  5 21:54:31.353: INFO: Waiting for frontend to serve content.
I1205 21:54:36.536] Dec  5 21:54:36.536: INFO: Failed to get response from guestbook. err: <nil>, response: <br />
I1205 21:54:36.537] <b>Fatal error</b>:  Uncaught exception 'Predis\Connection\ConnectionException' with message 'Connection timed out [tcp://redis-slave:6379]' in /usr/local/lib/php/Predis/Connection/AbstractConnection.php:168
I1205 21:54:36.537] Stack trace:
I1205 21:54:36.537] #0 /usr/local/lib/php/Predis/Connection/StreamConnection.php(97): Predis\Connection\AbstractConnection-&gt;onConnectionError('Connection time...', 110)
I1205 21:54:36.538] #1 /usr/local/lib/php/Predis/Connection/StreamConnection.php(58): Predis\Connection\StreamConnection-&gt;tcpStreamInitializer(Object(Predis\Connection\Parameters))
I1205 21:54:36.538] #2 /usr/local/lib/php/Predis/Connection/AbstractConnection.php(94): Predis\Connection\StreamConnection-&gt;createResource()
I1205 21:54:36.538] #3 /usr/local/lib/php/Predis/Connection/StreamConnection.php(158): Predis\Connection\AbstractConnection-&gt;connect()
I1205 21:54:36.538] #4 /usr/local/lib/php/Predis/Connection/AbstractConnection.php(193): Predis\Connection\StreamConnection-&gt;connect()
I1205 21:54:36.538] #5 /usr/local/lib/php/Predis/Connection/StreamConnection.php(184): Predis\Connection\AbstractConnection-&gt;getResource()
I1205 21:54:36.538] #6 /usr/local/lib/php/Predis/Connection/Stre in <b>/usr/local/lib/php/Predis/Connection/AbstractConnection.php</b> on line <b>168</b><br />
I1205 21:54:36.539] 
W1206 04:31:58.366] 2016/12/06 04:31:57 e2e.go:956: Interrupt testing after 10h0m0s timeout. Will terminate in another 15m

It made one request, got a Redis error, and then hung for over 6 hours. It should be retrying, but it doesn't. If you look at how it makes requests, it goes through the apiserver proxy, so as with my previous comment, this might be a proxy timeout issue.
See https://github.com/kubernetes/kubernetes/blob/master/test/e2e/kubectl.go#L1575
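
For illustration, a hedged sketch of the fix being suggested: bound each guestbook probe with its own deadline and retry until an overall timeout, instead of letting a single proxied request hang the suite. probeGuestbook and the URL are hypothetical stand-ins for the framework call in kubectl.go.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
	"net/http"
	"strings"
	"time"
)

// probeGuestbook is a hypothetical stand-in for the framework's proxied
// "fetch guestbook page" call.
func probeGuestbook(ctx context.Context, url string) (string, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return "", err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

// waitForGuestbook retries the probe, bounding each attempt with its own
// deadline so a single hung proxy request cannot eat the suite timeout.
func waitForGuestbook(url string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
		body, err := probeGuestbook(ctx, url)
		cancel()
		if err == nil && !strings.Contains(body, "Fatal error") {
			return nil // guestbook served real content
		}
		time.Sleep(5 * time.Second)
	}
	return errors.New("guestbook did not serve content before the timeout")
}

func main() {
	// Placeholder URL; the real test goes through the apiserver proxy.
	if err := waitForGuestbook("http://frontend.example/", 10*time.Minute); err != nil {
		fmt.Println(err)
	}
}
```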

@krousey (Contributor) commented Dec 7, 2016

I ran the same tool over https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gke-container_vm-1.4-container_vm-1.5-upgrade-cluster-new/76 and got

"[k8s.io] [HPA] Horizontal pod autoscaling (scale resource: CPU) [k8s.io] [Serial] [Slow] ReplicaSet " took 9h17m49.08s. Lines 3734-3870

The interesting part is

I1206 05:49:33.555] [k8s.io] [HPA] Horizontal pod autoscaling (scale resource: CPU) [k8s.io] [Serial] [Slow] ReplicaSet 
I1206 05:49:33.555]   Should scale from 5 pods to 3 pods and from 3 to 1
I1206 05:49:33.555]   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/horizontal_pod_autoscaling.go:62
I1206 05:49:33.556] [BeforeEach] [k8s.io] [HPA] Horizontal pod autoscaling (scale resource: CPU)
I1206 05:49:33.556]   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:141
I1206 05:49:33.556] STEP: Creating a kubernetes client
I1206 05:49:33.556] Dec  6 05:49:33.554: INFO: >>> kubeConfig: /workspace/.kube/config
I1206 05:49:33.556] 
I1206 05:49:33.559] STEP: Building a namespace api object
I1206 05:49:33.644] STEP: Waiting for a default service account to be provisioned in namespace
I1206 05:49:33.648] [It] Should scale from 5 pods to 3 pods and from 3 to 1
I1206 05:49:33.648]   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/horizontal_pod_autoscaling.go:62
I1206 05:49:33.648] STEP: Running consuming RC rs via replicaset with 5 replicas
I1206 05:49:33.678] STEP: creating replicaset rs in namespace e2e-tests-horizontal-pod-autoscaling-lvvbf
I1206 05:49:33.678] STEP: creating replicaset rs in namespace e2e-tests-horizontal-pod-autoscaling-lvvbf
I1206 05:49:33.713] I1206 05:49:33.713479    3051 runners.go:103] Created replica set with name: rs, namespace: e2e-tests-horizontal-pod-autoscaling-lvvbf, replica count: 5
I1206 05:49:43.714] I1206 05:49:43.713850    3051 runners.go:103] rs Pods: 5 out of 5 created, 5 running, 0 pending, 0 waiting, 0 inactive, 0 terminating, 0 unknown, 0 runningButNotReady 
I1206 05:49:43.714] STEP: Running controller
I1206 05:49:43.722] STEP: creating replication controller rs-ctrl in namespace e2e-tests-horizontal-pod-autoscaling-lvvbf
I1206 05:49:43.763] I1206 05:49:43.762667    3051 runners.go:103] Created replication controller with name: rs-ctrl, namespace: e2e-tests-horizontal-pod-autoscaling-lvvbf, replica count: 1
I1206 05:49:53.763] I1206 05:49:53.763029    3051 runners.go:103] rs-ctrl Pods: 1 out of 1 created, 1 running, 0 pending, 0 waiting, 0 inactive, 0 terminating, 0 unknown, 0 runningButNotReady 
I1206 05:50:03.763] Dec  6 05:50:03.763: INFO: RC rs: consume 375 millicores in total
I1206 05:50:03.764] Dec  6 05:50:03.763: INFO: RC rs: setting consumption to 375 millicores in total
I1206 05:50:03.764] Dec  6 05:50:03.763: INFO: RC rs: consume 0 MB in total
I1206 05:50:03.764] Dec  6 05:50:03.763: INFO: RC rs: sending request to consume 375 millicores
I1206 05:50:03.764] Dec  6 05:50:03.763: INFO: RC rs: consume custom metric 0 in total
I1206 05:50:03.764] Dec  6 05:50:03.763: INFO: RC rs: setting bump of metric QPS to 0 in total
I1206 05:50:03.764] Dec  6 05:50:03.763: INFO: RC rs: setting consumption to 0 MB in total
I1206 05:50:03.764] Dec  6 05:50:03.763: INFO: RC rs: sending request to consume 0 of custom metric QPS
I1206 05:50:03.764] Dec  6 05:50:03.763: INFO: RC rs: sending request to consume 0 MB
I1206 05:50:03.769] Dec  6 05:50:03.768: INFO: URL: {https  <nil> 130.211.200.26 /api/v1/namespaces/e2e-tests-horizontal-pod-autoscaling-lvvbf/services/rs-ctrl/proxy/ConsumeCPU  false durationSec=30&millicores=375&requestSizeMillicores=20 }
I1206 05:50:03.769] Dec  6 05:50:03.768: INFO: URL: {https  <nil> 130.211.200.26 /api/v1/namespaces/e2e-tests-horizontal-pod-autoscaling-lvvbf/services/rs-ctrl/proxy/ConsumeMem  false durationSec=30&megabytes=0&requestSizeMegabytes=100 }
I1206 05:50:03.769] Dec  6 05:50:03.769: INFO: URL: {https  <nil> 130.211.200.26 /api/v1/namespaces/e2e-tests-horizontal-pod-autoscaling-lvvbf/services/rs-ctrl/proxy/BumpMetric  false delta=0&durationSec=30&metric=QPS&requestSizeMetrics=10 }
I1206 05:50:03.838] Dec  6 05:50:03.838: INFO: replicaset: current replicas number 5 waiting to be 3
... snip ...
I1206 06:04:45.146] Dec  6 06:04:45.146: INFO: replicaset: current replicas number 5 waiting to be 3
I1206 06:05:05.147] Dec  6 06:05:05.146: INFO: timeout waiting 15m0s for pods size to be 3
I1206 06:05:05.147] STEP: Removing consuming RC rs
W1206 14:51:20.408] 2016/12/06 14:51:20 e2e.go:956: Interrupt testing after 10h0m0s timeout. Will terminate in another 15m

It seems to hit ResourceConsumer.CleanUp() and hang. I think it's blocked on a channel write, because we never see any of the cleanup messages from the framework functions that would be called right after the channel writes.
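
For illustration, a sketch of the suspected pattern (not the actual ResourceConsumer code): an unbuffered send blocks forever once the goroutine draining the channel is gone, and selecting on a stop channel that the cleanup closes is one way to make shutdown non-blocking.

```go
package main

import (
	"fmt"
	"time"
)

type consumer struct {
	requests chan int      // consumption requests
	stop     chan struct{} // closed on cleanUp
}

func newConsumer() *consumer {
	c := &consumer{requests: make(chan int), stop: make(chan struct{})}
	go func() {
		for {
			select {
			case r := <-c.requests:
				fmt.Println("consuming", r)
			case <-c.stop:
				return // worker exits on cleanup
			}
		}
	}()
	return c
}

// setConsumption would hang forever if the worker goroutine were gone;
// selecting on stop makes it return instead of blocking.
func (c *consumer) setConsumption(millicores int) {
	select {
	case c.requests <- millicores:
	case <-c.stop:
	}
}

func (c *consumer) cleanUp() {
	close(c.stop) // unblocks any pending sends and stops the worker
}

func main() {
	c := newConsumer()
	c.setConsumption(375)
	time.Sleep(100 * time.Millisecond)
	c.cleanUp()
	c.setConsumption(0) // returns immediately instead of blocking
	fmt.Println("clean shutdown")
}
```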

@krousey (Contributor) commented Dec 7, 2016

So far I've boiled this down to two non-blocking test issues for 1.5: #38298 and #38305. Additionally, the last run is completely green. The run before that had a single test failure due to the soft lockup issue; I manually intervened in that suite by rebooting the node so the test could finish.
