Nodes tainted with ToBeDeletedByClusterAutoscaler remain in the cluster blocking cluster-autoscaler normal operations #5657

Closed
mimmus opened this issue Apr 5, 2023 · 17 comments
Labels
area/cluster-autoscaler kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@mimmus

mimmus commented Apr 5, 2023

Which component are you using?:
cluster-autoscaler

What version of the component are you using?:
Image tag: cluster-autoscaler:v1.23.0

Component version:

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.8", GitCommit:"7061dbbf75f9f82e8ab21f9be7e8ffcaae8e0d44", GitTreeState:"clean", BuildDate:"2022-03-16T14:10:06Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.8", GitCommit:"7061dbbf75f9f82e8ab21f9be7e8ffcaae8e0d44", GitTreeState:"clean", BuildDate:"2022-03-16T14:04:34Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}

What environment is this in?:
AWS

What did you expect to happen?:
Nodes tainted for deletion should be deleted at some point

What happened instead?:
Nodes tainted for deletion (ToBeDeletedByClusterAutoscaler) remain in the cluster, blocking further scale-up/scale-down and disrupting normal operation (i.e., new pods remain in Pending state because new nodes cannot be added by the CA).

How to reproduce it (as minimally and precisely as possible):
I don't know.

Anything else we need to know?:
Args:

          args:
            - '--cloud-provider=clusterapi'
            - '-v5'
            - '--balance-similar-node-groups'
            - '--balancing-ignore-label="topology.ebs.csi.aws.com/zone"'
            - '--scale-down-utilization-threshold=0.65'
            - '--skip-nodes-with-system-pods=false'
            - '--skip-nodes-with-local-storage=false'

This cluster-autoscaler is part of a commercial product; I have also opened a case with the vendor, but so far we are navigating in the dark.
I'm aware of:
#4456
#5048
I also tried the simplest workaround (adding the cluster-autoscaler.kubernetes.io/safe-to-evict: 'false' annotation to the cluster-autoscaler pod), but nodes are still not being deleted.

@mimmus mimmus added the kind/bug Categorizes issue or PR as related to a bug. label Apr 5, 2023
@vadasambar
Member

cluster-autoscaler.kubernetes.io/safe-to-evict: 'false' has the opposite effect. It prevents the node from getting removed.
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-types-of-pods-can-prevent-ca-from-removing-a-node
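
For reference, the annotation goes in the pod template metadata of whatever workload should not be evicted; a minimal sketch of the relevant fragment (here assumed to be applied to the CA Deployment itself):

    spec:
      template:
        metadata:
          annotations:
            # CA treats a pod carrying this annotation as not safe to evict,
            # which in turn blocks scale-down of the node the pod runs on.
            cluster-autoscaler.kubernetes.io/safe-to-evict: "false"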

@vadasambar
Member

vadasambar commented Apr 13, 2023

I have seen this case happen when the CA (cluster-autoscaler) pod tries deleting the node on which it was running. In this case, it adds the ToBeDeletedByClusterAutoscaler taint and kicks itself off the node before deleting it.

I see 2 possibilities here:

  1. The node with ToBeDeletedByClusterAutoscaler was not in a good state. The new CA pod removes the taint from all nodes at initialization; however, there is a bug in 1.23 (fixed in later versions) where the taint is cleaned only from Ready nodes:

         if readyNodes, err := a.ReadyNodeLister().List(); err != nil {
             klog.Errorf("Failed to list ready nodes, not cleaning up taints: %v", err)
         } else {
             deletetaint.CleanAllToBeDeleted(readyNodes,
                 a.AutoscalingContext.ClientSet, a.Recorder, a.CordonNodeBeforeTerminate)
             if a.AutoscalingContext.AutoscalingOptions.MaxBulkSoftTaintCount == 0 {
                 // Clean old taints if soft taints handling is disabled
                 deletetaint.CleanAllDeletionCandidates(readyNodes,
                     a.AutoscalingContext.ClientSet, a.Recorder)
             }
         }
         a.initialized = true
  2. Scale down is in a cooldown for some reason. Check for a log line like

         klog.V(4).Infof("Scale down status: unneededOnly=%v lastScaleUpTime=%s "+
             "lastScaleDownDeleteTime=%v lastScaleDownFailTime=%s scaleDownForbidden=%v "+
             "isDeleteInProgress=%v scaleDownInCooldown=%v",
             calculateUnneededOnly, a.lastScaleUpTime,
             a.lastScaleDownDeleteTime, a.lastScaleDownFailTime, a.processorCallbacks.disableScaleDownForLoop,
             scaleDown.nodeDeletionTracker.IsNonEmptyNodeDeleteInProgress(), scaleDownInCooldown)

    If you see scaleDownInCooldown=true, scale down won't happen until the cool down is lifted.

I suspect 2 might be the case here.

If 2 doesn't turn out to be the case, I think it'd be hard to say what the problem is without looking at the logs.

@mimmus
Author

mimmus commented Apr 13, 2023

I have seen this case happen when CA (cluster-autoscaler) pod tries deleting the node on which it was running.
In this case, it adds the ToBeDeletedByClusterAutoscaler and kicks itself out from the node before deleting it.

I was thinking the same thing but I just double-checked:
cluster-autoscaler has tolerations to run on control-plane nodes (and it is indeed running on a master node), so it is never evicted.
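
(For context, the tolerations in question look roughly like this in the CA pod spec; a sketch, since the exact taint keys depend on the distribution:)

    tolerations:
      # Allow scheduling onto control-plane/master nodes despite their taints.
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      - key: node-role.kubernetes.io/control-plane
        effect: NoSchedule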

cluster-autoscaler.kubernetes.io/safe-to-evict: 'false' has the opposite effect. It prevents the node from getting removed.

That is exactly the effect I need: to prevent the node where CA is running from being deleted. Or not?
(But this is no longer applicable; see the note above.)

The node with ToBeDeletedByClusterAutoscaler was not in a good state.

I don't know exactly what you mean by "not in a good state", but I don't think that is the case, because it happens almost every night on all our clusters (3), each time with at least 1-3 nodes affected.

Scale down is in a cool down because of some reason.

I'm not able to find any scaleDownInCooldown=true in the logs.

I think it'd be hard to say what the problem is without looking at the logs.

I can collect and post logs somewhere, if you need.

@mimmus
Author

mimmus commented Apr 14, 2023

UPDATE
If I restart the cluster-autoscaler pod, it cleans all the taints.

@comeonyo
Contributor

@mimmus
By default, the cluster autoscaler sets the scale-down cooldown to 10 minutes (the --scale-down-delay-after-add default).

@mimmus
Author

mimmus commented Apr 16, 2023

@comeonyo What exactly "scaleDownInCooldown" means is not really clear to me :(
Could you elaborate?

Thanks again

@vadasambar
Member

vadasambar commented Apr 17, 2023

That is exactly the effect I need: prevent the node where CA is running to be deleted. Or not?

I see. I thought it was for all the nodes. 👍

I don't know exactly what you mean by "not in good state"

Sorry, I should have been clearer here. "Not in a good state" here means the node is in the NotReady state.

I'm not able to find any scaleDownInCooldown=true in the logs.

I see you have set -v5, which means the scaleDownInCooldown log line (a -v4 log) should get printed irrespective of whether we are seeing the problem or not.

I can collect and post logs somewhere, if you need.

Of course, it would be best to share the logs here (redacting any info you think is sensitive), so that they can help others who run into similar issues. But if you are concerned about posting logs here, feel free to reach out to me on Slack.

What exactly "scaleDownInCooldown" means is not really clear to me :(

It means scale down is temporarily disabled. This can happen for multiple reasons; e.g., unless you specify --scale-down-delay-after-add with a different value, scale down is blocked for 10 minutes after a scale-up.
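
For example, those cooldown windows map to CA flags that could be added to the args shown earlier; a sketch with illustrative values, not recommendations:

    args:
      - '--scale-down-delay-after-add=10m'      # wait after a scale-up before considering scale-down (10m is the default)
      - '--scale-down-delay-after-failure=3m'   # wait after a failed scale-down before retrying (3m is the default)
      - '--scale-down-delay-after-delete=0s'    # wait after a node deletion (defaults to the scan interval)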

P.S.: we recently merged #5632 which fixes #4456 for AWS. You can try it out and see if it fixes the issue for you.

@mimmus
Author

mimmus commented Apr 17, 2023

We'll post logs as soon as the issue happens again; it usually happens every night.

P.S.: we recently merged 5632 which fixes 4456 for AWS. You can try it out and see if it fixes the issue for you.

As far as I understand, I can use Cluster Autoscaler with the Kubernetes control plane version for which it was meant
(I have K8s 1.22.8, CA is 1.23, but it is provided by the vendor and thus I suppose it is correct).
In fact, I tried with CA 1.26 and it was a disaster.

Thanks again

@vadasambar
Member

As far as I understand, I can use Cluster Autoscaler with the Kubernetes control plane version for which it was meant

Sorry, you are right. You would have to create a new cluster with the latest control-plane version to try it out (I don't think all cloud providers support the latest 1.27 version at the time of writing this comment).

We can look into patching it back to 1.26.

(I have K8s 1.22.8, CA is 1.23, but it is provided by the vendor and thus I suppose it is correct).

Not sure why that is the case. It is recommended to use the same minor version for CA as the control plane (e.g., CA 1.22.x with Kubernetes 1.22).

@mimmus
Author

mimmus commented Apr 21, 2023

Our vendor's support thinks that what may be happening is that the Kube-scheduler and autoscaler are in disagreement.
There are many pods the autoscaler thinks can be scheduled, but the scheduler has them pending.
cluster-autoscaler believes that an upcoming node will be available, but this realistically refers back to one of my tainted nodes, hence the general "blocked" state. This is the issue outlined in:
https://github.com/kubernetes/autoscaler/issues/4456#issuecomment-1097333210

@vadasambar
Member

Our vendor's support thinks that what may be happening is that the Kube-scheduler and autoscaler are in disagreement.

If your vendor is using a scheduler

  • with non-default plugins/extenders
  • with default plugins but non-default scheduler configuration

then CA would be in disagreement with the scheduler, because CA makes scale-up decisions under the assumption that the default scheduler with the default configuration is used.

If your vendor is in fact using a non-default scheduler or a non-default configuration, the vendor needs to modify CA to bring it in line with the cluster's scheduler, recompile it, and use the customized CA image.
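
For illustration only: a "non-default scheduler configuration" means something like a KubeSchedulerConfiguration that alters the default plugin set, for example (a hypothetical sketch, not your vendor's actual config):

    apiVersion: kubescheduler.config.k8s.io/v1beta2   # config API version available on 1.22
    kind: KubeSchedulerConfiguration
    profiles:
      - schedulerName: default-scheduler
        plugins:
          score:
            disabled:
              # Disabling a default scoring plugin makes the scheduler's placement
              # decisions diverge from what CA simulates internally.
              - name: NodeResourcesBalancedAllocation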

@mimmus
Author

mimmus commented Apr 24, 2023

I think the vendor is using the default scheduler (the image is k8s.gcr.io/kube-scheduler:v1.22.8) and I don't see any particular options.
I will try to sort this out with the vendor and will update this issue if I have any news.

@mimmus
Author

mimmus commented May 22, 2023

It seems to be solved after upgrading the clusters to Kubernetes 1.23.12 / cluster-autoscaler 1.23.1.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 20, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 19, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot k8s-ci-robot closed this as not planned (Won't fix, can't repro, duplicate, stale) on Mar 21, 2024
@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
