HPA test can hang indefinitely #38298

Closed
krousey opened this issue Dec 7, 2016 · 1 comment · Fixed by #38377

Labels: priority/important-soon, sig/autoscaling
Milestone: v1.5


krousey commented Dec 7, 2016

I was diagnosing the test timeout in https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gke-container_vm-1.4-container_vm-1.5-upgrade-cluster-new/76 and saw that

"[k8s.io] [HPA] Horizontal pod autoscaling (scale resource: CPU) [k8s.io] [Serial] [Slow] ReplicaSet " took 9h17m49.08s. Lines 3734-3870

The interesting part is:

```
I1206 05:49:33.555] [k8s.io] [HPA] Horizontal pod autoscaling (scale resource: CPU) [k8s.io] [Serial] [Slow] ReplicaSet 
I1206 05:49:33.555]   Should scale from 5 pods to 3 pods and from 3 to 1
I1206 05:49:33.555]   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/horizontal_pod_autoscaling.go:62
I1206 05:49:33.556] [BeforeEach] [k8s.io] [HPA] Horizontal pod autoscaling (scale resource: CPU)
I1206 05:49:33.556]   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:141
I1206 05:49:33.556] STEP: Creating a kubernetes client
I1206 05:49:33.556] Dec  6 05:49:33.554: INFO: >>> kubeConfig: /workspace/.kube/config
I1206 05:49:33.556] 
I1206 05:49:33.559] STEP: Building a namespace api object
I1206 05:49:33.644] STEP: Waiting for a default service account to be provisioned in namespace
I1206 05:49:33.648] [It] Should scale from 5 pods to 3 pods and from 3 to 1
I1206 05:49:33.648]   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/horizontal_pod_autoscaling.go:62
I1206 05:49:33.648] STEP: Running consuming RC rs via replicaset with 5 replicas
I1206 05:49:33.678] STEP: creating replicaset rs in namespace e2e-tests-horizontal-pod-autoscaling-lvvbf
I1206 05:49:33.678] STEP: creating replicaset rs in namespace e2e-tests-horizontal-pod-autoscaling-lvvbf
I1206 05:49:33.713] I1206 05:49:33.713479    3051 runners.go:103] Created replica set with name: rs, namespace: e2e-tests-horizontal-pod-autoscaling-lvvbf, replica count: 5
I1206 05:49:43.714] I1206 05:49:43.713850    3051 runners.go:103] rs Pods: 5 out of 5 created, 5 running, 0 pending, 0 waiting, 0 inactive, 0 terminating, 0 unknown, 0 runningButNotReady 
I1206 05:49:43.714] STEP: Running controller
I1206 05:49:43.722] STEP: creating replication controller rs-ctrl in namespace e2e-tests-horizontal-pod-autoscaling-lvvbf
I1206 05:49:43.763] I1206 05:49:43.762667    3051 runners.go:103] Created replication controller with name: rs-ctrl, namespace: e2e-tests-horizontal-pod-autoscaling-lvvbf, replica count: 1
I1206 05:49:53.763] I1206 05:49:53.763029    3051 runners.go:103] rs-ctrl Pods: 1 out of 1 created, 1 running, 0 pending, 0 waiting, 0 inactive, 0 terminating, 0 unknown, 0 runningButNotReady 
I1206 05:50:03.763] Dec  6 05:50:03.763: INFO: RC rs: consume 375 millicores in total
I1206 05:50:03.764] Dec  6 05:50:03.763: INFO: RC rs: setting consumption to 375 millicores in total
I1206 05:50:03.764] Dec  6 05:50:03.763: INFO: RC rs: consume 0 MB in total
I1206 05:50:03.764] Dec  6 05:50:03.763: INFO: RC rs: sending request to consume 375 millicores
I1206 05:50:03.764] Dec  6 05:50:03.763: INFO: RC rs: consume custom metric 0 in total
I1206 05:50:03.764] Dec  6 05:50:03.763: INFO: RC rs: setting bump of metric QPS to 0 in total
I1206 05:50:03.764] Dec  6 05:50:03.763: INFO: RC rs: setting consumption to 0 MB in total
I1206 05:50:03.764] Dec  6 05:50:03.763: INFO: RC rs: sending request to consume 0 of custom metric QPS
I1206 05:50:03.764] Dec  6 05:50:03.763: INFO: RC rs: sending request to consume 0 MB
I1206 05:50:03.769] Dec  6 05:50:03.768: INFO: URL: {https  <nil> 130.211.200.26 /api/v1/namespaces/e2e-tests-horizontal-pod-autoscaling-lvvbf/services/rs-ctrl/proxy/ConsumeCPU  false durationSec=30&millicores=375&requestSizeMillicores=20 }
I1206 05:50:03.769] Dec  6 05:50:03.768: INFO: URL: {https  <nil> 130.211.200.26 /api/v1/namespaces/e2e-tests-horizontal-pod-autoscaling-lvvbf/services/rs-ctrl/proxy/ConsumeMem  false durationSec=30&megabytes=0&requestSizeMegabytes=100 }
I1206 05:50:03.769] Dec  6 05:50:03.769: INFO: URL: {https  <nil> 130.211.200.26 /api/v1/namespaces/e2e-tests-horizontal-pod-autoscaling-lvvbf/services/rs-ctrl/proxy/BumpMetric  false delta=0&durationSec=30&metric=QPS&requestSizeMetrics=10 }
I1206 05:50:03.838] Dec  6 05:50:03.838: INFO: replicaset: current replicas number 5 waiting to be 3
... snip ...
I1206 06:04:45.146] Dec  6 06:04:45.146: INFO: replicaset: current replicas number 5 waiting to be 3
I1206 06:05:05.147] Dec  6 06:05:05.146: INFO: timeout waiting 15m0s for pods size to be 3
I1206 06:05:05.147] STEP: Removing consuming RC rs
W1206 14:51:20.408] 2016/12/06 14:51:20 e2e.go:956: Interrupt testing after 10h0m0s timeout. Will terminate in another 15m
```

It seems to hit ResourceConsumer.CleanUp() and hang. I think it's attempting to write to a blocking channel because we never see any cleanup messages from the framework functions that would be called right after the channel writes.
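
To make the suspected failure mode concrete, here is a minimal, self-contained sketch (not the actual test code; the real channel lives inside `ResourceConsumer`): the consumer goroutine is the only receiver on an unbuffered channel, so once it exits early, the cleanup-side send blocks forever.

```go
package main

import (
	"fmt"
	"time"
)

// consumeCPULoop stands in for ResourceConsumer.makeConsumeCPURequests().
// It is supposed to run until it receives on stop, but here it returns
// early, simulating a panic that was recovered and swallowed elsewhere.
func consumeCPULoop(stop chan int) {
	fmt.Println("consumer loop exiting early (simulated swallowed panic)")
}

func main() {
	stop := make(chan int) // unbuffered, like the suspected stop channel
	go consumeCPULoop(stop)
	time.Sleep(100 * time.Millisecond) // let the loop "die"

	// CleanUp() signals the loop by sending on the channel. With the only
	// receiver gone, this send blocks forever. (In this toy the Go runtime
	// reports "all goroutines are asleep - deadlock!"; in the real suite
	// other goroutines keep running, so the test just hangs.)
	stop <- 0
	fmt.Println("never reached")
}
```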

We no longer see messages about millicores, which makes me think the ResourceConsumer.makeConsumeCPURequests() loop terminated. This was probably a Gomega failure that propagated as a panic and then got swallowed by GinkgoRecover().
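
For context on why the loop can die silently, a hedged sketch of the standard Ginkgo goroutine pattern (the helper `sendConsumeCPURequest` is a hypothetical stand-in, not the real framework function): Gomega assertions panic on failure, and GinkgoRecover() recovers that panic so the suite does not crash, which also means the goroutine simply returns and stops draining its channel.

```go
package hpa_test

import (
	"time"

	. "github.com/onsi/ginkgo"
	. "github.com/onsi/gomega"
)

// sendConsumeCPURequest is a hypothetical stand-in for the framework's
// proxy request helper seen in the log above.
func sendConsumeCPURequest(millicores int) error { return nil }

// startCPUConsumer sketches the consume-CPU loop. If the assertion fails,
// Expect panics, GinkgoRecover recovers it, and the goroutine exits
// without anyone noticing -- leaving stop with no receiver.
func startCPUConsumer(stop chan int, millicores int) {
	go func() {
		defer GinkgoRecover()
		for {
			select {
			case <-stop:
				return
			case <-time.After(30 * time.Second):
				err := sendConsumeCPURequest(millicores)
				Expect(err).NotTo(HaveOccurred()) // panics on failure
			}
		}
	}()
}
```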

The early loop exit might be the cause of a lot of HPA flakes, but the hang on a channel write should be fixed regardless. You could make the channel buffered, but it might be better to just close it. I would also like this cherry-picked at least to 1.5 for test stability.
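
For completeness, a sketch of the close-based fix suggested above (field names assumed for illustration; see #38377 for the actual change): a close never blocks, and a receive on a closed channel returns immediately, so cleanup cannot hang even if a loop already died.

```go
type ResourceConsumer struct {
	stopCPU          chan int
	stopMem          chan int
	stopCustomMetric chan int
}

func (rc *ResourceConsumer) CleanUp() {
	// Closing instead of sending (rc.stopCPU <- 0) never blocks, and any
	// live loop's receive on a closed channel fires immediately.
	close(rc.stopCPU)
	close(rc.stopMem)
	close(rc.stopCustomMetric)
	// ... then delete the RC, service, and controller as before ...
}
```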

Found while investigating #37981

@krousey added the sig/autoscaling, non-release-blocker, and priority/important-soon labels on Dec 7, 2016
@krousey added this to the v1.5 milestone on Dec 7, 2016

wojtek-t commented Dec 7, 2016

@krousey - great investigation. I had a hypothesis in the past that HPA tests can hang forever, but I didn't have any strong evidence that this was really the case. It seems you found it.

jszczepkowski added a commit to jszczepkowski/kubernetes that referenced this issue Dec 8, 2016
HPA e2e tests: fixed problem w/blocking channel. Resolves kubernetes#38298.
jszczepkowski added a commit to jszczepkowski/kubernetes that referenced this issue Dec 9, 2016
HPA e2e tests: fixed problem w/blocking channel. Resolves kubernetes#38298.
k8s-github-robot pushed a commit that referenced this issue Dec 9, 2016
…-#38377-upstream-release-1.5

Automatic merge from submit-queue

Automated cherry-pick of #38377: HPA e2e tests: fixed problem w/blocking channel. Resolves #38298.