HPA test can hang indefinitely #38298
Labels: priority/important-soon, sig/autoscaling
I was diagnosing the test timeout in https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gke-container_vm-1.4-container_vm-1.5-upgrade-cluster-new/76 and saw that
"[k8s.io] [HPA] Horizontal pod autoscaling (scale resource: CPU) [k8s.io] [Serial] [Slow] ReplicaSet " took 9h17m49.08s. Lines 3734-3870
The interesting part: it seems to hit ResourceConsumer.CleanUp() and hang. I think it's blocking on a channel write, because we never see any cleanup messages from the framework functions that would be called right after the channel writes.
We no longer see messages about millicores, which makes me think the ResourceConsumer.makeConsumeCPURequests() loop terminated. This was probably some sort of Gomega error that propagated as a panic and was then swallowed by GinkgoRecover(). The early loop exit may be the cause of many HPA flakes, but the hang on the channel write should be fixed regardless. Making the channel buffered would work, but it might be better to simply close the channel. I would also like this cherry-picked at least to 1.5 for test stability.
Found while investigating #37981