
kubernetes-e2e-gke-container_vm-1.4-container_vm-1.5-upgrade-cluster-new: broken test run #37981

Closed
k8s-github-robot opened this issue Dec 2, 2016 · 18 comments

@k8s-github-robot commented Dec 2, 2016

Failed: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gke-container_vm-1.4-container_vm-1.5-upgrade-cluster-new/172/

Run so broken it didn't make JUnit output!

@k8s-github-robot added the kind/flake, priority/backlog, and area/test-infra labels Dec 2, 2016
@krousey assigned krousey and unassigned ixdy Dec 2, 2016
@krousey (Contributor) commented Dec 2, 2016

cc @dchen1107

For some reason, the kubelet stops updating its status. I've verified that the node controller sets the node to "not ready" and that the API server sees no further updates from the kubelet. Since this is a GKE test, I've pulled the relevant master logs into this spreadsheet.

The API server log shows that no more node status updates arrive after the controller marks the node not ready. The controller manager logs "Recording status change NodeNotReady event message for node gke-jenkins-e2e-default-pool-e95f9b13-dlae".
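
For illustration, a minimal client-go sketch of the check described above: it lists each node's Ready condition and its lastHeartbeatTime, which is where a kubelet that has stopped posting status shows up. This is not part of the e2e suite; the kubeconfig path is an assumption, and List signatures differ across client-go versions.

```go
package main

import (
	"context"
	"fmt"
	"path/filepath"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

func main() {
	// Assumes a kubeconfig at the default location; adjust as needed.
	kubeconfig := filepath.Join(homedir.HomeDir(), ".kube", "config")
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, node := range nodes.Items {
		for _, cond := range node.Status.Conditions {
			if cond.Type == corev1.NodeReady {
				// A stale LastHeartbeatTime means the kubelet stopped
				// posting status; the node controller then flips the
				// node to NotReady.
				fmt.Printf("%s Ready=%s lastHeartbeat=%v\n",
					node.Name, cond.Status, cond.LastHeartbeatTime)
			}
		}
	}
}
```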

@krousey added this to the v1.5 milestone Dec 2, 2016
@krousey added the priority/important-soon and release-blocker labels and removed the kind/flake and priority/backlog labels Dec 2, 2016
@krousey added the priority/critical-urgent label and removed the priority/important-soon label Dec 2, 2016
@krousey (Contributor) commented Dec 2, 2016

This is blocking us from getting useful signals for GKE CVM upgrades.

@dchen1107 (Member) commented:

@krousey I checked the cluster you mentioned above at: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gke-container_vm-1.4-container_vm-1.5-upgrade-cluster-new/172/. The problem is not that the node is not ready; it is that the pod is not ready because the image pull failed. You are hitting a Docker issue that should be fixed by moby/moby#25523 in Docker 1.13.X. cc @thockin

Dec  2 02:23:32.659: INFO: At 2016-12-02 02:18:11 -0800 PST - event for execpod-t3nwq: {kubelet gke-jenkins-e2e-default-pool-e95f9b13-i6tx} Failed: Failed to pull image "gcr.io/google_containers/busybox:1.24": failed to register layer: rename /var/lib/docker/image/aufs/layerdb/tmp/layer-100454157 /var/lib/docker/image/aufs/layerdb/sha256/2c84284818d186d88a16ac7fa731d4b71ba69ecfe11b4ce00413366833cb2403: directory not empty

As for why you run into this problem so consistently: I believe the node is being shut down in an unclean way.

@roberthbailey (Contributor) commented:

@dchen1107 - is there a workaround other than upgrading the version of docker? What is the recommended docker version that we are planning to use for 1.5?

@krousey (Contributor) commented Dec 5, 2016

@dchen1107 Where did you find that message? Also, how does failing to pull a docker image cause the node to stop reporting its status?

@dchen1107 (Member) commented:

@roberthbailey Each Docker version has different bugs that need to be worked around. In the 1.5 timeframe, we didn't plan to upgrade the Docker version.

@krousey I see what you're saying. The node first hits the Docker issue I reported above; I saw that message in the build log, and some tests failed because of it. Later I saw the error message you reported. I took a deeper look at the nodes from run 172, and I believe something is broken in the integration with the upgrade process, since kern.log shows that none of the nodes rebooted even once. Unfortunately there are no logs related to the upgrade process.

@dchen1107 added the sig/cluster-lifecycle label Dec 5, 2016
@krousey (Contributor) commented Dec 5, 2016

@dchen1107 I think the upgrade process for GKE is to replace the node entirely with a new one. That's why you won't see a reboot. @roberthbailey correct me if I'm wrong.

@dchen1107 (Member) commented:

I don't think so. But even if that is true, my statement above still stands, since there are only the 3 nodes from when the cluster was created and no other nodes ever come up.

@krousey (Contributor) commented Dec 5, 2016

And unfortunately, current runs are timing out since test-infra changed the upgrade jobs to bootstrap in kubernetes/test-infra#1263; some had timeouts of only 2 hours. See kubernetes/test-infra#1291.

@krousey (Contributor) commented Dec 5, 2016

@dchen1107 I believe the VMs are recreated with the same name. Activity logs in the cluster show deletes followed by creates of the same name.

@roberthbailey (Contributor) commented:

> @dchen1107 I think the upgrade process for GKE is to replace the node entirely with a new one. That's why you won't see a reboot. @roberthbailey correct me if I'm wrong.

That is correct. It's similar to the way that upgrade.sh works.

@krousey (Contributor) commented Dec 6, 2016

This test suite passed here https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gke-container_vm-1.4-container_vm-1.5-upgrade-cluster-new/74 and timed out in the two following runs. The job name now has ci- prepended to it.

@krousey (Contributor) commented Dec 7, 2016

The current run, 77, is hung with the soft lockup issue during the "Deployment should support rollover" test. I rebooted the gke-bootstrap-e2e-default-pool-13881f05-ovd4 node so that the suite can hopefully recover and run some more tests.

@krousey (Contributor) commented Dec 7, 2016

It was hung at

I1206 19:55:04.245] [k8s.io] Deployment
I1206 19:55:04.245]   deployment should support rollover
I1206 19:55:04.245]   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/deployment.go:77
I1206 19:55:04.245] [BeforeEach] [k8s.io] Deployment
I1206 19:55:04.245]   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:141
I1206 19:55:04.246] STEP: Creating a kubernetes client
I1206 19:55:04.246] Dec  6 19:55:04.233: INFO: >>> kubeConfig: /workspace/.kube/config
I1206 19:55:04.246] 
I1206 19:55:04.246] STEP: Building a namespace api object
I1206 19:55:04.451] STEP: Waiting for a default service account to be provisioned in namespace
I1206 19:55:04.454] [It] deployment should support rollover
I1206 19:55:04.454]   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/deployment.go:77
I1206 19:55:04.590] Dec  6 19:55:04.590: INFO: Pod name rollover-pod: Found 4 pods out of 4
I1206 19:55:04.591] STEP: ensuring each pod is running
I1206 19:55:06.668] STEP: trying to dial each unique pod

As soon as I rebooted the node, it continued.

@krousey (Contributor) commented Dec 7, 2016

It appears we don't time out pod proxy requests.
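
For illustration, a minimal sketch of the kind of guard that seems to be missing: a plain net/http client with a hard timeout, so a hung proxied pod can't block the caller forever. The URL is a placeholder; the real e2e framework builds these requests through its REST client.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	// A client-wide timeout bounds the whole request, including reading
	// the response body, so a hung proxied pod can't stall the caller.
	client := &http.Client{Timeout: 2 * time.Minute}

	// Placeholder URL standing in for an apiserver pod-proxy request.
	resp, err := client.Get("https://apiserver.example.com/api/v1/namespaces/default/pods/mypod/proxy/")
	if err != nil {
		fmt.Println("proxy request failed or timed out:", err)
		return
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body))
}
```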

@krousey (Contributor) commented Dec 7, 2016

I wrote a program to analyze build logs (since individual test results aren't recorded on timeouts) to see which test cases were taking the longest. It's not that the runs are uniformly slower; a particular test simply hangs.
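
The tool itself isn't attached here; a rough sketch of this style of analysis (assuming glog-style "IMMDD HH:MM:SS.mmm]" prefixes in a local build-log.txt and hard-coding the year) might look like:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
	"time"
)

// Matches the glog-style prefix, e.g. "I1205 21:54:06.352]".
var tsRe = regexp.MustCompile(`^[IWE](\d{4}) (\d{2}:\d{2}:\d{2}\.\d{3})\]`)

func main() {
	f, err := os.Open("build-log.txt") // placeholder path
	if err != nil {
		panic(err)
	}
	defer f.Close()

	var prev time.Time
	var prevLine string
	var maxGap time.Duration
	var gapAt string

	scanner := bufio.NewScanner(f)
	scanner.Buffer(make([]byte, 1024*1024), 1024*1024)
	for scanner.Scan() {
		line := scanner.Text()
		m := tsRe.FindStringSubmatch(line)
		if m == nil {
			continue
		}
		// The year is not in the log prefix; assume 2016 for these runs.
		t, err := time.Parse("20060102 15:04:05.000", "2016"+m[1]+" "+m[2])
		if err != nil {
			continue
		}
		// Track the largest gap between consecutive log lines as a hint
		// at where the run hung.
		if !prev.IsZero() && t.Sub(prev) > maxGap {
			maxGap = t.Sub(prev)
			gapAt = prevLine
		}
		prev, prevLine = t, line
	}
	fmt.Printf("largest gap: %v, after line:\n%s\n", maxGap, gapAt)
}
```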

I ran it over https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gke-container_vm-1.4-container_vm-1.5-upgrade-cluster-new/75 and got

"[k8s.io] Kubectl client [k8s.io] Guestbook application " took 6h53m59.741s. Lines 7464-7747

The interesting part is

I1205 21:54:06.352] STEP: validating guestbook app
I1205 21:54:06.352] Dec  5 21:54:06.351: INFO: Waiting for all frontend pods to be Running.
I1205 21:54:31.353] Dec  5 21:54:31.353: INFO: Waiting for frontend to serve content.
I1205 21:54:36.536] Dec  5 21:54:36.536: INFO: Failed to get response from guestbook. err: <nil>, response: <br />
I1205 21:54:36.537] <b>Fatal error</b>:  Uncaught exception 'Predis\Connection\ConnectionException' with message 'Connection timed out [tcp://redis-slave:6379]' in /usr/local/lib/php/Predis/Connection/AbstractConnection.php:168
I1205 21:54:36.537] Stack trace:
I1205 21:54:36.537] #0 /usr/local/lib/php/Predis/Connection/StreamConnection.php(97): Predis\Connection\AbstractConnection-&gt;onConnectionError('Connection time...', 110)
I1205 21:54:36.538] #1 /usr/local/lib/php/Predis/Connection/StreamConnection.php(58): Predis\Connection\StreamConnection-&gt;tcpStreamInitializer(Object(Predis\Connection\Parameters))
I1205 21:54:36.538] #2 /usr/local/lib/php/Predis/Connection/AbstractConnection.php(94): Predis\Connection\StreamConnection-&gt;createResource()
I1205 21:54:36.538] #3 /usr/local/lib/php/Predis/Connection/StreamConnection.php(158): Predis\Connection\AbstractConnection-&gt;connect()
I1205 21:54:36.538] #4 /usr/local/lib/php/Predis/Connection/AbstractConnection.php(193): Predis\Connection\StreamConnection-&gt;connect()
I1205 21:54:36.538] #5 /usr/local/lib/php/Predis/Connection/StreamConnection.php(184): Predis\Connection\AbstractConnection-&gt;getResource()
I1205 21:54:36.538] #6 /usr/local/lib/php/Predis/Connection/Stre in <b>/usr/local/lib/php/Predis/Connection/AbstractConnection.php</b> on line <b>168</b><br />
I1205 21:54:36.539] 
W1206 04:31:58.366] 2016/12/06 04:31:57 e2e.go:956: Interrupt testing after 10h0m0s timeout. Will terminate in another 15m

It made one request, got a Redis error, and then hung for over 6 hours. It should be retrying, but it doesn't. If you look at how it makes requests, it goes through the apiserver proxy, so as with my previous comment, this might be a proxy timeout issue.
See https://github.com/kubernetes/kubernetes/blob/master/test/e2e/kubectl.go#L1575
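
For illustration, a hedged sketch of the fix being suggested: bound each guestbook probe with its own deadline and retry until an overall timeout, instead of letting a single proxied request hang the suite. probeGuestbook and the URL are hypothetical stand-ins for the framework call in kubectl.go.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
	"net/http"
	"strings"
	"time"
)

// probeGuestbook is a hypothetical stand-in for the framework's proxied
// "fetch guestbook page" call.
func probeGuestbook(ctx context.Context, url string) (string, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return "", err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

// waitForGuestbook retries the probe, bounding each attempt with its own
// deadline so a single hung proxy request cannot eat the suite timeout.
func waitForGuestbook(url string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
		body, err := probeGuestbook(ctx, url)
		cancel()
		if err == nil && !strings.Contains(body, "Fatal error") {
			return nil // guestbook served real content
		}
		time.Sleep(5 * time.Second)
	}
	return errors.New("guestbook did not serve content before the timeout")
}

func main() {
	// Placeholder URL; the real test goes through the apiserver proxy.
	if err := waitForGuestbook("http://frontend.example/", 10*time.Minute); err != nil {
		fmt.Println(err)
	}
}
```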

@krousey (Contributor) commented Dec 7, 2016

I ran the same tool over https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gke-container_vm-1.4-container_vm-1.5-upgrade-cluster-new/76 and got

"[k8s.io] [HPA] Horizontal pod autoscaling (scale resource: CPU) [k8s.io] [Serial] [Slow] ReplicaSet " took 9h17m49.08s. Lines 3734-3870

The interesting part is

I1206 05:49:33.555] [k8s.io] [HPA] Horizontal pod autoscaling (scale resource: CPU) [k8s.io] [Serial] [Slow] ReplicaSet 
I1206 05:49:33.555]   Should scale from 5 pods to 3 pods and from 3 to 1
I1206 05:49:33.555]   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/horizontal_pod_autoscaling.go:62
I1206 05:49:33.556] [BeforeEach] [k8s.io] [HPA] Horizontal pod autoscaling (scale resource: CPU)
I1206 05:49:33.556]   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:141
I1206 05:49:33.556] STEP: Creating a kubernetes client
I1206 05:49:33.556] Dec  6 05:49:33.554: INFO: >>> kubeConfig: /workspace/.kube/config
I1206 05:49:33.556] 
I1206 05:49:33.559] STEP: Building a namespace api object
I1206 05:49:33.644] STEP: Waiting for a default service account to be provisioned in namespace
I1206 05:49:33.648] [It] Should scale from 5 pods to 3 pods and from 3 to 1
I1206 05:49:33.648]   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/horizontal_pod_autoscaling.go:62
I1206 05:49:33.648] STEP: Running consuming RC rs via replicaset with 5 replicas
I1206 05:49:33.678] STEP: creating replicaset rs in namespace e2e-tests-horizontal-pod-autoscaling-lvvbf
I1206 05:49:33.678] STEP: creating replicaset rs in namespace e2e-tests-horizontal-pod-autoscaling-lvvbf
I1206 05:49:33.713] I1206 05:49:33.713479    3051 runners.go:103] Created replica set with name: rs, namespace: e2e-tests-horizontal-pod-autoscaling-lvvbf, replica count: 5
I1206 05:49:43.714] I1206 05:49:43.713850    3051 runners.go:103] rs Pods: 5 out of 5 created, 5 running, 0 pending, 0 waiting, 0 inactive, 0 terminating, 0 unknown, 0 runningButNotReady 
I1206 05:49:43.714] STEP: Running controller
I1206 05:49:43.722] STEP: creating replication controller rs-ctrl in namespace e2e-tests-horizontal-pod-autoscaling-lvvbf
I1206 05:49:43.763] I1206 05:49:43.762667    3051 runners.go:103] Created replication controller with name: rs-ctrl, namespace: e2e-tests-horizontal-pod-autoscaling-lvvbf, replica count: 1
I1206 05:49:53.763] I1206 05:49:53.763029    3051 runners.go:103] rs-ctrl Pods: 1 out of 1 created, 1 running, 0 pending, 0 waiting, 0 inactive, 0 terminating, 0 unknown, 0 runningButNotReady 
I1206 05:50:03.763] Dec  6 05:50:03.763: INFO: RC rs: consume 375 millicores in total
I1206 05:50:03.764] Dec  6 05:50:03.763: INFO: RC rs: setting consumption to 375 millicores in total
I1206 05:50:03.764] Dec  6 05:50:03.763: INFO: RC rs: consume 0 MB in total
I1206 05:50:03.764] Dec  6 05:50:03.763: INFO: RC rs: sending request to consume 375 millicores
I1206 05:50:03.764] Dec  6 05:50:03.763: INFO: RC rs: consume custom metric 0 in total
I1206 05:50:03.764] Dec  6 05:50:03.763: INFO: RC rs: setting bump of metric QPS to 0 in total
I1206 05:50:03.764] Dec  6 05:50:03.763: INFO: RC rs: setting consumption to 0 MB in total
I1206 05:50:03.764] Dec  6 05:50:03.763: INFO: RC rs: sending request to consume 0 of custom metric QPS
I1206 05:50:03.764] Dec  6 05:50:03.763: INFO: RC rs: sending request to consume 0 MB
I1206 05:50:03.769] Dec  6 05:50:03.768: INFO: URL: {https  <nil> 130.211.200.26 /api/v1/namespaces/e2e-tests-horizontal-pod-autoscaling-lvvbf/services/rs-ctrl/proxy/ConsumeCPU  false durationSec=30&millicores=375&requestSizeMillicores=20 }
I1206 05:50:03.769] Dec  6 05:50:03.768: INFO: URL: {https  <nil> 130.211.200.26 /api/v1/namespaces/e2e-tests-horizontal-pod-autoscaling-lvvbf/services/rs-ctrl/proxy/ConsumeMem  false durationSec=30&megabytes=0&requestSizeMegabytes=100 }
I1206 05:50:03.769] Dec  6 05:50:03.769: INFO: URL: {https  <nil> 130.211.200.26 /api/v1/namespaces/e2e-tests-horizontal-pod-autoscaling-lvvbf/services/rs-ctrl/proxy/BumpMetric  false delta=0&durationSec=30&metric=QPS&requestSizeMetrics=10 }
I1206 05:50:03.838] Dec  6 05:50:03.838: INFO: replicaset: current replicas number 5 waiting to be 3
... snip ...
I1206 06:04:45.146] Dec  6 06:04:45.146: INFO: replicaset: current replicas number 5 waiting to be 3
I1206 06:05:05.147] Dec  6 06:05:05.146: INFO: timeout waiting 15m0s for pods size to be 3
I1206 06:05:05.147] STEP: Removing consuming RC rs
W1206 14:51:20.408] 2016/12/06 14:51:20 e2e.go:956: Interrupt testing after 10h0m0s timeout. Will terminate in another 15m

It seems to hit ResourceConsumer.CleanUp() and hang. I think it's blocked on a channel write, because we never see any of the cleanup messages from the framework functions that would be called right after the channel writes.
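
For illustration, a sketch of the suspected pattern (not the actual ResourceConsumer code): an unbuffered send blocks forever once the goroutine draining the channel is gone, and selecting on a stop channel that the cleanup closes is one way to make shutdown non-blocking.

```go
package main

import (
	"fmt"
	"time"
)

type consumer struct {
	requests chan int      // consumption requests
	stop     chan struct{} // closed on cleanUp
}

func newConsumer() *consumer {
	c := &consumer{requests: make(chan int), stop: make(chan struct{})}
	go func() {
		for {
			select {
			case r := <-c.requests:
				fmt.Println("consuming", r)
			case <-c.stop:
				return // worker exits on cleanup
			}
		}
	}()
	return c
}

// setConsumption would hang forever if the worker goroutine were gone;
// selecting on stop makes it return instead of blocking.
func (c *consumer) setConsumption(millicores int) {
	select {
	case c.requests <- millicores:
	case <-c.stop:
	}
}

func (c *consumer) cleanUp() {
	close(c.stop) // unblocks any pending sends and stops the worker
}

func main() {
	c := newConsumer()
	c.setConsumption(375)
	time.Sleep(100 * time.Millisecond)
	c.cleanUp()
	c.setConsumption(0) // returns immediately instead of blocking
	fmt.Println("clean shutdown")
}
```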

@krousey (Contributor) commented Dec 7, 2016

So far I've boiled this down to two non-blocking test issues for 1.5: #38298 and #38305. Additionally, the last run is completely green. The run before that had a single test failure due to the soft lockup issue; I manually intervened in that suite by rebooting the node so the test could finish.
