HPA test can hang indefinitely #38298

Closed
krousey opened this issue Dec 7, 2016 · 1 comment · Fixed by #38377

Labels: priority/important-soon, sig/autoscaling
Milestone: v1.5


krousey commented Dec 7, 2016

I was diagnosing the test timeout in https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gke-container_vm-1.4-container_vm-1.5-upgrade-cluster-new/76 and saw that

"[k8s.io] [HPA] Horizontal pod autoscaling (scale resource: CPU) [k8s.io] [Serial] [Slow] ReplicaSet " took 9h17m49.08s. Lines 3734-3870

The interesting part is:

```
I1206 05:49:33.555] [k8s.io] [HPA] Horizontal pod autoscaling (scale resource: CPU) [k8s.io] [Serial] [Slow] ReplicaSet 
I1206 05:49:33.555]   Should scale from 5 pods to 3 pods and from 3 to 1
I1206 05:49:33.555]   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/horizontal_pod_autoscaling.go:62
I1206 05:49:33.556] [BeforeEach] [k8s.io] [HPA] Horizontal pod autoscaling (scale resource: CPU)
I1206 05:49:33.556]   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:141
I1206 05:49:33.556] STEP: Creating a kubernetes client
I1206 05:49:33.556] Dec  6 05:49:33.554: INFO: >>> kubeConfig: /workspace/.kube/config
I1206 05:49:33.556] 
I1206 05:49:33.559] STEP: Building a namespace api object
I1206 05:49:33.644] STEP: Waiting for a default service account to be provisioned in namespace
I1206 05:49:33.648] [It] Should scale from 5 pods to 3 pods and from 3 to 1
I1206 05:49:33.648]   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/horizontal_pod_autoscaling.go:62
I1206 05:49:33.648] STEP: Running consuming RC rs via replicaset with 5 replicas
I1206 05:49:33.678] STEP: creating replicaset rs in namespace e2e-tests-horizontal-pod-autoscaling-lvvbf
I1206 05:49:33.678] STEP: creating replicaset rs in namespace e2e-tests-horizontal-pod-autoscaling-lvvbf
I1206 05:49:33.713] I1206 05:49:33.713479    3051 runners.go:103] Created replica set with name: rs, namespace: e2e-tests-horizontal-pod-autoscaling-lvvbf, replica count: 5
I1206 05:49:43.714] I1206 05:49:43.713850    3051 runners.go:103] rs Pods: 5 out of 5 created, 5 running, 0 pending, 0 waiting, 0 inactive, 0 terminating, 0 unknown, 0 runningButNotReady 
I1206 05:49:43.714] STEP: Running controller
I1206 05:49:43.722] STEP: creating replication controller rs-ctrl in namespace e2e-tests-horizontal-pod-autoscaling-lvvbf
I1206 05:49:43.763] I1206 05:49:43.762667    3051 runners.go:103] Created replication controller with name: rs-ctrl, namespace: e2e-tests-horizontal-pod-autoscaling-lvvbf, replica count: 1
I1206 05:49:53.763] I1206 05:49:53.763029    3051 runners.go:103] rs-ctrl Pods: 1 out of 1 created, 1 running, 0 pending, 0 waiting, 0 inactive, 0 terminating, 0 unknown, 0 runningButNotReady 
I1206 05:50:03.763] Dec  6 05:50:03.763: INFO: RC rs: consume 375 millicores in total
I1206 05:50:03.764] Dec  6 05:50:03.763: INFO: RC rs: setting consumption to 375 millicores in total
I1206 05:50:03.764] Dec  6 05:50:03.763: INFO: RC rs: consume 0 MB in total
I1206 05:50:03.764] Dec  6 05:50:03.763: INFO: RC rs: sending request to consume 375 millicores
I1206 05:50:03.764] Dec  6 05:50:03.763: INFO: RC rs: consume custom metric 0 in total
I1206 05:50:03.764] Dec  6 05:50:03.763: INFO: RC rs: setting bump of metric QPS to 0 in total
I1206 05:50:03.764] Dec  6 05:50:03.763: INFO: RC rs: setting consumption to 0 MB in total
I1206 05:50:03.764] Dec  6 05:50:03.763: INFO: RC rs: sending request to consume 0 of custom metric QPS
I1206 05:50:03.764] Dec  6 05:50:03.763: INFO: RC rs: sending request to consume 0 MB
I1206 05:50:03.769] Dec  6 05:50:03.768: INFO: URL: {https  <nil> 130.211.200.26 /api/v1/namespaces/e2e-tests-horizontal-pod-autoscaling-lvvbf/services/rs-ctrl/proxy/ConsumeCPU  false durationSec=30&millicores=375&requestSizeMillicores=20 }
I1206 05:50:03.769] Dec  6 05:50:03.768: INFO: URL: {https  <nil> 130.211.200.26 /api/v1/namespaces/e2e-tests-horizontal-pod-autoscaling-lvvbf/services/rs-ctrl/proxy/ConsumeMem  false durationSec=30&megabytes=0&requestSizeMegabytes=100 }
I1206 05:50:03.769] Dec  6 05:50:03.769: INFO: URL: {https  <nil> 130.211.200.26 /api/v1/namespaces/e2e-tests-horizontal-pod-autoscaling-lvvbf/services/rs-ctrl/proxy/BumpMetric  false delta=0&durationSec=30&metric=QPS&requestSizeMetrics=10 }
I1206 05:50:03.838] Dec  6 05:50:03.838: INFO: replicaset: current replicas number 5 waiting to be 3
... snip ...
I1206 06:04:45.146] Dec  6 06:04:45.146: INFO: replicaset: current replicas number 5 waiting to be 3
I1206 06:05:05.147] Dec  6 06:05:05.146: INFO: timeout waiting 15m0s for pods size to be 3
I1206 06:05:05.147] STEP: Removing consuming RC rs
W1206 14:51:20.408] 2016/12/06 14:51:20 e2e.go:956: Interrupt testing after 10h0m0s timeout. Will terminate in another 15m
```

It seems to hit ResourceConsumer.CleanUp() and hang. I think it's attempting to write to a blocking channel because we never see any cleanup messages from the framework functions that would be called right after the channel writes.
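
To make the suspected failure mode concrete, here is a minimal, self-contained sketch (not the actual test code; the real channel lives inside `ResourceConsumer`): the consumer goroutine is the only receiver on an unbuffered channel, so once it exits early, the cleanup-side send blocks forever.

```go
package main

import (
	"fmt"
	"time"
)

// consumeCPULoop stands in for ResourceConsumer.makeConsumeCPURequests().
// It is supposed to run until it receives on stop, but here it returns
// early, simulating a panic that was recovered and swallowed elsewhere.
func consumeCPULoop(stop chan int) {
	fmt.Println("consumer loop exiting early (simulated swallowed panic)")
}

func main() {
	stop := make(chan int) // unbuffered, like the suspected stop channel
	go consumeCPULoop(stop)
	time.Sleep(100 * time.Millisecond) // let the loop "die"

	// CleanUp() signals the loop by sending on the channel. With the only
	// receiver gone, this send blocks forever. (In this toy the Go runtime
	// reports "all goroutines are asleep - deadlock!"; in the real suite
	// other goroutines keep running, so the test just hangs.)
	stop <- 0
	fmt.Println("never reached")
}
```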

We no longer see messages about millicores, which makes me think the ResourceConsumer.makeConsumeCPURequests() loop terminated. This was probably a Gomega failure that propagated as a panic and then got swallowed by GinkgoRecover().
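
For context on why the loop can die silently, a hedged sketch of the standard Ginkgo goroutine pattern (the helper `sendConsumeCPURequest` is a hypothetical stand-in, not the real framework function): Gomega assertions panic on failure, and GinkgoRecover() recovers that panic so the suite does not crash, which also means the goroutine simply returns and stops draining its channel.

```go
package hpa_test

import (
	"time"

	. "github.com/onsi/ginkgo"
	. "github.com/onsi/gomega"
)

// sendConsumeCPURequest is a hypothetical stand-in for the framework's
// proxy request helper seen in the log above.
func sendConsumeCPURequest(millicores int) error { return nil }

// startCPUConsumer sketches the consume-CPU loop. If the assertion fails,
// Expect panics, GinkgoRecover recovers it, and the goroutine exits
// without anyone noticing -- leaving stop with no receiver.
func startCPUConsumer(stop chan int, millicores int) {
	go func() {
		defer GinkgoRecover()
		for {
			select {
			case <-stop:
				return
			case <-time.After(30 * time.Second):
				err := sendConsumeCPURequest(millicores)
				Expect(err).NotTo(HaveOccurred()) // panics on failure
			}
		}
	}()
}
```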

The early loop exit might be the cause of a lot of HPA flakes, but the hang on a channel write should be fixed regardless. You could make the channel buffered, but it might be better to just close it. I would also like this cherry-picked at least to 1.5 for test stability.
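
For completeness, a sketch of the close-based fix suggested above (field names assumed for illustration; see #38377 for the actual change): a close never blocks, and a receive on a closed channel returns immediately, so cleanup cannot hang even if a loop already died.

```go
type ResourceConsumer struct {
	stopCPU          chan int
	stopMem          chan int
	stopCustomMetric chan int
}

func (rc *ResourceConsumer) CleanUp() {
	// Closing instead of sending (rc.stopCPU <- 0) never blocks, and any
	// live loop's receive on a closed channel fires immediately.
	close(rc.stopCPU)
	close(rc.stopMem)
	close(rc.stopCustomMetric)
	// ... then delete the RC, service, and controller as before ...
}
```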

Found while investigating #37981

@krousey added the sig/autoscaling, non-release-blocker, and priority/important-soon labels on Dec 7, 2016
@krousey added this to the v1.5 milestone on Dec 7, 2016

wojtek-t commented Dec 7, 2016

@krousey - great investigation. I had a hypothesis in the past that HPA tests can hang forever, but I didn't have any strong evidence that this was really the case. It seems you found it.

jszczepkowski added a commit to jszczepkowski/kubernetes that referenced this issue Dec 8, 2016
HPA e2e tests: fixed problem w/blocking channel. Resolves kubernetes#38298.
jszczepkowski added a commit to jszczepkowski/kubernetes that referenced this issue Dec 9, 2016
HPA e2e tests: fixed problem w/blocking channel. Resolves kubernetes#38298.
k8s-github-robot pushed a commit that referenced this issue Dec 9, 2016
…-#38377-upstream-release-1.5

Automatic merge from submit-queue

Automated cherry-pick of #38377: HPA e2e tests: fixed problem w/blocking channel. Resolves #38298.