[cluster-autoscaler] doesn't scale down unneeded nodes #4342
Comments
Also noticed this a few days ago. I was on 1.17.3, upgraded to 1.21.0, and am trying 1.22.0 now. I haven't pinpointed whether it started occurring around the same time or not; I'm also using the AWS cloud provider.
After leaving it running for 30m from boot, scaleDownForbidden=false and scaleDownInCoolDown=false finally occurred, even though everything is configured to 10m. It then gets into a loop: it finds all the nodes that can be scaled down, marks them for ~1-2m, then removes the mark and re-adds them on the next scan, so no scale-down ever completes. I have modified the following flags to work around this for now (a sketch follows below), and am seeing scale-in.
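(The commenter's actual flag changes were not preserved above. As an illustration only, here is a minimal sketch of the cluster-autoscaler scale-down flags typically tuned in this situation; the flag names are real cluster-autoscaler flags, but the values are assumptions, not the commenter's settings:)

```yaml
# Illustrative cluster-autoscaler Deployment args. Flag names are real
# cluster-autoscaler flags; the values here are assumptions for the
# sketch, not the commenter's actual configuration.
spec:
  containers:
    - name: cluster-autoscaler
      image: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.22.0
      command:
        - ./cluster-autoscaler
        - --cloud-provider=aws
        - --scan-interval=10s                 # how often the cluster is re-evaluated
        - --scale-down-unneeded-time=10m      # how long a node must be unneeded before removal
        - --scale-down-delay-after-add=10m    # global cooldown after any scale-up
        - --scale-down-delay-after-delete=0s  # cooldown after a node deletion
```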
I am getting scale-up and scale-down loops with the following setup: the deployment is set to 3 replicas, with a node selector and a toleration for the node group being accessed. It also has an anti-affinity policy (the original YAML wasn't preserved; an illustrative sketch follows).
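(A minimal sketch of the kind of pod anti-affinity described; the label and topology key are hypothetical, not the reporter's actual policy:)

```yaml
# Hypothetical pod anti-affinity; the label and topology key are
# illustrative. This spreads the 3 replicas so that no two land on
# the same node.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: my-workload              # hypothetical label
        topologyKey: kubernetes.io/hostname
```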
Not sure if this is fully related anymore at this point.
We found the root cause: there were some "schedulable" pods which weren't actually schedulable, causing the cluster autoscaler not to scale down any nodes at all.
@nmiculinic And what do those log messages look like for the pods you thought were schedulable during the scaling process? Do they explicitly say those pods were actually not schedulable?
@bbhenry There was definitely a discrepancy between what CA thought was schedulable and what actually was schedulable. IIRC it was related to storage: maybe the node was provisioned in one AWS AZ while the PV was in a different AZ (thus it couldn't be attached)? Or the PV was deleted? Or it was something else entirely. CC @filintod if you remember exactly what the issue was with those unschedulable pods.

What is more worrisome is that this is a global flag, not per node group, so we frequently see scale-down in cooldown globally, since we run many disjoint ASGs (think ~12 per instance type, and we care which pod is scheduled to which instance type).
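(To illustrate the storage/AZ mismatch described above: a minimal hypothetical sketch of an EBS-backed PV pinned to one zone. A pod bound to this PV can only run on nodes in us-east-1a, so if capacity comes up in another AZ the pod stays unschedulable even though CA counted it as schedulable. All names, the volume ID, and the zone are assumptions:)

```yaml
# Hypothetical single-AZ PV; volume ID, names, and zone are illustrative.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: example-pv
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  csi:
    driver: ebs.csi.aws.com
    volumeHandle: vol-0123456789abcdef0   # hypothetical EBS volume ID
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.kubernetes.io/zone
              operator: In
              values:
                - us-east-1a              # PV attachable only in this zone
```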
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the rules above.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the rules above.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the rules above.
/close
@k8s-triage-robot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Which component are you using?:
cluster-autoscaler
What version of the component are you using?:
Component version: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.22.0
What k8s version are you using (kubectl version)?:
What environment is this in?:
AWS EKS
What did you expect to happen?:
Cluster autoscaler scales down the unneeded nodes, and deletes them.
What happened instead?:
There were 300 EC2 instances alive for more than a day, despite being unneeded. The cluster autoscaler did not scale the nodes down. In the logs I see "node X is unneeded for YYY time" (where YYY is >24h).
How to reproduce it (as minimally and precisely as possible):
Not sure, to be honest; it appears a bit non-deterministic.
Anything else we need to know?:
After downgrading I saw the nodes being deleted from AWS, though I'm not sure whether that was due to the version change or just the cluster autoscaler restart.