Nodes tainted with ToBeDeletedByClusterAutoscaler remain in the cluster blocking cluster-autoscaler normal operations #5657
I have seen this case happen when the CA (cluster-autoscaler) pod tries to delete the node on which it is running. In this case, it adds the ToBeDeletedByClusterAutoscaler taint. I see 2 possibilities here:
I suspect 2 might be the case here. If 2 doesn't turn out to be the case, I think it'd be hard to say what the problem is without looking at the logs.
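For anyone who needs to unblock an affected cluster in the meantime, here is a manual cleanup sketch (NODE_NAME is a placeholder; it assumes kubectl access to the affected cluster and is a workaround, not a fix):

```
# List nodes that still carry the ToBeDeletedByClusterAutoscaler taint
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints[*].key}{"\n"}{end}' \
  | grep ToBeDeletedByClusterAutoscaler

# Remove the stale taint from a given node so scheduling and CA can resume (NODE_NAME is a placeholder)
kubectl taint nodes NODE_NAME ToBeDeletedByClusterAutoscaler-
```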
I was thinking the same thing but I just double-checked:
That is exactly the effect I need: prevent the node where CA is running from being deleted. Or not?
I don't know exactly what you mean by "not in good state" but I think that it's not the case because it happens almost every night on all our clusters (3) with at least 1-3 nodes.
I'm not able to find any
I can collect and post logs somewhere, if you need.
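A quick sketch for collecting those CA logs (the kube-system namespace and the app=cluster-autoscaler label are assumptions about the install; adjust them to match the vendor's deployment):

```
# Assumed namespace/label; adjust to match your cluster-autoscaler deployment
kubectl -n kube-system logs -l app=cluster-autoscaler --tail=2000 > cluster-autoscaler.log
```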
UPDATE
@mimmus
@comeonyo What exactly "scaleDownInCooldown" means is not really clear to me :( Thanks again
I see. I thought it was for all the nodes. 👍
Sorry, should have been clearer here.
I see you have set
Of course, it would be best to share the logs here (redacting any info you think is sensitive) so that they can help others who run into similar issues. But if you are concerned about posting logs here, feel free to reach out to me on Slack.
It means scale down is disabled temporarily. This can happen for multiple reasons, e.g. scale down stays in cooldown for a while after a scale-up, a node deletion, or a failed scale-down unless you specify shorter delays (a sketch of the relevant flags follows below). P.S.: we recently merged #5632, which fixes #4456 for AWS. You can try it out and see if it fixes the issue for you.
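For reference, these are the cluster-autoscaler flags that typically drive that cooldown; the values below are placeholders for illustration, not recommendations:

```
# Cooldown-related cluster-autoscaler flags (example values only)
--scale-down-delay-after-add=10m       # wait after a scale-up before considering scale-down
--scale-down-delay-after-delete=10s    # wait after a node deletion
--scale-down-delay-after-failure=3m    # wait after a failed scale-down
```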
We'll post logs as soon as the issue happens again, usually every night.
As far as I understand, I can only use Cluster Autoscaler with the Kubernetes control plane version for which it was meant. Thanks again
Sorry, you are right. You would have to create a new cluster with the latest control plane version to try it out (I don't think all cloud providers are supporting the latest 1.27 version at the time of writing this comment). We can look into patching it back to 1.26.
Not sure why that is the case. It is recommended to use a CA release whose minor version matches the control plane's.
Our vendor's support thinks that what may be happening is that the kube-scheduler and the autoscaler are in disagreement.
If your vendor is using a scheduler with a non-default configuration, CA would be in disagreement with it, because CA makes scale-up decisions under the assumption that the default scheduler with the default scheduler configuration is used. If your vendor is in fact using a non-default scheduler or a non-default configuration, they need to make changes in CA to bring it in line with the cluster's scheduler, re-compile it, and use the customized CA image.
I think the vendor is using the default scheduler (the image is k8s.gcr.io/kube-scheduler:v1.22.8) and I don't see any non-default options configured.
It seems solved after upgrading the clusters to Kubernetes 1.23.12 / cluster-autoscaler 1.23.1.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community. /close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned". In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Which component are you using?:
cluster-autoscaler
What version of the component are you using?:
Image tag: cluster-autoscaler:v1.23.0
Component version:
What k8s version are you using (kubectl version)?:
What environment is this in?:
AWS
What did you expect to happen?:
Nodes tainted for deletion should be deleted at some point
What happened instead?:
Nodes tainted for deletion (ToBeDeletedByClusterAutoscaler) remain in the cluster, blocking further scale-up/scale-down and disrupting normal operation (i.e. new pods remain in Pending state because CA cannot add new nodes).
How to reproduce it (as minimally and precisely as possible):
I don't know.
Anything else we need to know?:
Args:
cluster-autoscaler is part of a commercial product; I have also opened a case with the vendor, but they are currently navigating in the dark.
I'm aware of:
#4456
#5048
I also tried the simplest workaround (adding the cluster-autoscaler.kubernetes.io/safe-to-evict: 'false' annotation to the cluster-autoscaler pod, as sketched below), but nodes are still not being deleted.
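For completeness, a sketch of how that annotation can be applied to the cluster-autoscaler Deployment's pod template (the kube-system namespace and the Deployment name cluster-autoscaler are assumptions; adjust them to the vendor's install):

```
# Assumed namespace/Deployment name; adjust to match the actual install
kubectl -n kube-system patch deployment cluster-autoscaler --type merge \
  -p '{"spec":{"template":{"metadata":{"annotations":{"cluster-autoscaler.kubernetes.io/safe-to-evict":"false"}}}}}'
```

Patching the pod template (rather than annotating a running pod) keeps the annotation across CA restarts.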