Nodes tainted with ToBeDeletedByClusterAutoscaler remain in the cluster blocking cluster-autoscaler normal operations #5657

Closed
mimmus opened this issue Apr 5, 2023 · 17 comments
Labels
area/cluster-autoscaler kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@mimmus

mimmus commented Apr 5, 2023

Which component are you using?:
cluster-autoscaler

What version of the component are you using?:
Image tag: cluster-autoscaler:v1.23.0

Component version:

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.8", GitCommit:"7061dbbf75f9f82e8ab21f9be7e8ffcaae8e0d44", GitTreeState:"clean", BuildDate:"2022-03-16T14:10:06Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.8", GitCommit:"7061dbbf75f9f82e8ab21f9be7e8ffcaae8e0d44", GitTreeState:"clean", BuildDate:"2022-03-16T14:04:34Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}

What environment is this in?:
AWS

What did you expect to happen?:
Nodes tainted for deletion should be deleted at some point

What happened instead?:
Nodes tainted for deletion (ToBeDeletedByClusterAutoscaler) remain in the cluster, blocking further scale-up/scale-down and disrupting normal operation (i.e., new pods remain in Pending state because new nodes cannot be added by the CA).

How to reproduce it (as minimally and precisely as possible):
I don't know.

Anything else we need to know?:
Args:

          args:
            - '--cloud-provider=clusterapi'
            - '-v5'
            - '--balance-similar-node-groups'
            - '--balancing-ignore-label="topology.ebs.csi.aws.com/zone"'
            - '--scale-down-utilization-threshold=0.65'
            - '--skip-nodes-with-system-pods=false'
            - '--skip-nodes-with-local-storage=false'

This cluster-autoscaler is part of a commercial product; I have also opened a case with the vendor, but so far we are navigating in the dark.
I'm aware of:
#4456
#5048
I also tried the simplest workaround (adding the cluster-autoscaler.kubernetes.io/safe-to-evict: 'false' annotation to the cluster-autoscaler pod), but nodes are still not being deleted.

@mimmus mimmus added the kind/bug Categorizes issue or PR as related to a bug. label Apr 5, 2023
@vadasambar
Member

cluster-autoscaler.kubernetes.io/safe-to-evict: 'false' has the opposite effect. It prevents the node from getting removed.
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-types-of-pods-can-prevent-ca-from-removing-a-node
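
For reference, the annotation goes in the pod template metadata of whatever workload should not be evicted; a minimal sketch of the relevant fragment (here assumed to be applied to the CA Deployment itself):

    spec:
      template:
        metadata:
          annotations:
            # CA treats a pod carrying this annotation as not safe to evict,
            # which in turn blocks scale-down of the node the pod runs on.
            cluster-autoscaler.kubernetes.io/safe-to-evict: "false"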

@vadasambar
Member

vadasambar commented Apr 13, 2023

I have seen this case happen when the CA (cluster-autoscaler) pod tries deleting the node on which it was running. In this case, it adds the ToBeDeletedByClusterAutoscaler taint and kicks itself off the node before deleting it.

I see 2 possibilities here:

  1. The node with ToBeDeletedByClusterAutoscaler was not in a good state. The new CA pod removes the taint from all nodes at initialization; however, there is a bug in 1.23 (fixed in later versions) where the taint is cleaned only from Ready nodes:

         if readyNodes, err := a.ReadyNodeLister().List(); err != nil {
             klog.Errorf("Failed to list ready nodes, not cleaning up taints: %v", err)
         } else {
             deletetaint.CleanAllToBeDeleted(readyNodes,
                 a.AutoscalingContext.ClientSet, a.Recorder, a.CordonNodeBeforeTerminate)
             if a.AutoscalingContext.AutoscalingOptions.MaxBulkSoftTaintCount == 0 {
                 // Clean old taints if soft taints handling is disabled
                 deletetaint.CleanAllDeletionCandidates(readyNodes,
                     a.AutoscalingContext.ClientSet, a.Recorder)
             }
         }
         a.initialized = true
  2. Scale down is in a cooldown for some reason. Check for a log line like

         klog.V(4).Infof("Scale down status: unneededOnly=%v lastScaleUpTime=%s "+
             "lastScaleDownDeleteTime=%v lastScaleDownFailTime=%s scaleDownForbidden=%v "+
             "isDeleteInProgress=%v scaleDownInCooldown=%v",
             calculateUnneededOnly, a.lastScaleUpTime,
             a.lastScaleDownDeleteTime, a.lastScaleDownFailTime, a.processorCallbacks.disableScaleDownForLoop,
             scaleDown.nodeDeletionTracker.IsNonEmptyNodeDeleteInProgress(), scaleDownInCooldown)

    If you see scaleDownInCooldown=true, scale down won't happen until the cool down is lifted.

I suspect 2 might be the case here.

If 2 doesn't turn out to be the case, I think it'd be hard to say what the problem is without looking at the logs.

@mimmus
Author

mimmus commented Apr 13, 2023

I have seen this case happen when CA (cluster-autoscaler) pod tries deleting the node on which it was running.
In this case, it adds the ToBeDeletedByClusterAutoscaler and kicks itself out from the node before deleting it.

I was thinking the same thing but I just double-checked:
cluster-autoscaler has tolerations to run on control-plane nodes (and it is indeed running on a master node), so it is never evicted.
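
(For context, the tolerations in question look roughly like this in the CA pod spec; a sketch, since the exact taint keys depend on the distribution:)

    tolerations:
      # Allow scheduling onto control-plane/master nodes despite their taints.
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      - key: node-role.kubernetes.io/control-plane
        effect: NoSchedule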

cluster-autoscaler.kubernetes.io/safe-to-evict: 'false' has the opposite effect. It prevents the node from getting removed.

That is exactly the effect I need: to prevent the node where CA is running from being deleted. Or not?
(But this is no longer applicable; see the note above.)

The node with ToBeDeletedByClusterAutoscaler was not in a good state.

I don't know exactly what you mean by "not in a good state", but I don't think that is the case, because it happens almost every night on all our clusters (3), each time with at least 1-3 nodes affected.

Scale down is in a cool down because of some reason.

I'm not able to find any scaleDownInCooldown=true in the logs.

I think it'd be hard to say what the problem is without looking at the logs.

I can collect and post logs somewhere, if you need.

@mimmus
Author

mimmus commented Apr 14, 2023

UPDATE
If I restart the cluster-autoscaler pod, it cleans all the taints.

@comeonyo
Contributor

@mimmus
By default, the cluster autoscaler sets the scale-down cooldown to 10 minutes (the --scale-down-delay-after-add default).

@mimmus
Author

mimmus commented Apr 16, 2023

@comeonyo What exactly "scaleDownInCooldown" means is not really clear to me :(
Could you elaborate?

Thanks again

@vadasambar
Member

vadasambar commented Apr 17, 2023

That is exactly the effect I need: prevent the node where CA is running to be deleted. Or not?

I see. I thought it was for all the nodes. 👍

I don't know exactly what you mean by "not in good state"

Sorry, I should have been clearer here. "Not in a good state" here means the node is in the NotReady state.

I'm not able to find any scaleDownInCooldown=true in the logs.

I see you have set -v5, which means the scaleDownInCooldown log line (a -v4 log) should get printed irrespective of whether we are seeing the problem or not.

I can collect and post logs somewhere, if you need.

Of course, it would be best to share the logs here (redacting any info you think is sensitive), so that they can help others who run into similar issues. But if you are concerned about posting logs here, feel free to reach out to me on Slack.

What exactly "scaleDownInCooldown" means is not really clear to me :(

It means scale down is temporarily disabled. This can happen for multiple reasons; e.g., unless you specify --scale-down-delay-after-add with a different value, scale down is blocked for 10 minutes after a scale-up.
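
For example, those cooldown windows map to CA flags that could be added to the args shown earlier; a sketch with illustrative values, not recommendations:

    args:
      - '--scale-down-delay-after-add=10m'      # wait after a scale-up before considering scale-down (10m is the default)
      - '--scale-down-delay-after-failure=3m'   # wait after a failed scale-down before retrying (3m is the default)
      - '--scale-down-delay-after-delete=0s'    # wait after a node deletion (defaults to the scan interval)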

P.S.: we recently merged #5632 which fixes #4456 for AWS. You can try it out and see if it fixes the issue for you.

@mimmus
Author

mimmus commented Apr 17, 2023

We'll post logs as soon as the issue happens again; it usually happens every night.

P.S.: we recently merged 5632 which fixes 4456 for AWS. You can try it out and see if it fixes the issue for you.

As far as I understand, I can use Cluster Autoscaler with the Kubernetes control plane version for which it was meant
(I have K8s 1.22.8, CA is 1.23, but it is provided by the vendor and thus I suppose it is correct).
In fact, I tried with CA 1.26 and it was a disaster.

Thanks again

@vadasambar
Member

As far as I understand, I can use Cluster Autoscaler with the Kubernetes control plane version for which it was meant

Sorry, you are right. You would have to create a new cluster with the latest control-plane version to try it out (I don't think all cloud providers support the latest 1.27 version at the time of writing this comment).

We can look into patching it back to 1.26.

(I have K8s 1.22.8, CA is 1.23, but it is provided by the vendor and thus I suppose it is correct).

Not sure why that is the case. It is recommended to use the same minor version for CA as the control plane (e.g., CA 1.22.x with Kubernetes 1.22).

@mimmus
Author

mimmus commented Apr 21, 2023

Our vendor's support thinks that what may be happening is that the Kube-scheduler and autoscaler are in disagreement.
There are many pods the autoscaler thinks can be scheduled, but the scheduler has them pending.
cluster-autoscaler believes that an upcoming node will be available, but this realistically refers back to one of my tainted nodes, hence the general "blocked" state. This is the issue outlined in:
https://github.com/kubernetes/autoscaler/issues/4456#issuecomment-1097333210

@vadasambar
Member

Our vendor's support thinks that what may be happening is that the Kube-scheduler and autoscaler are in disagreement.

If your vendor is using a scheduler

  • with non-default plugins/extenders
  • with default plugins but non-default scheduler configuration

then CA would be in disagreement with the scheduler, because CA makes scale-up decisions under the assumption that the default scheduler with the default configuration is used.

If your vendor is in fact using a non-default scheduler or a non-default configuration, the vendor needs to modify CA to bring it in line with the cluster's scheduler, recompile it, and use the customized CA image.
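
For illustration only: a "non-default scheduler configuration" means something like a KubeSchedulerConfiguration that alters the default plugin set, for example (a hypothetical sketch, not your vendor's actual config):

    apiVersion: kubescheduler.config.k8s.io/v1beta2   # config API version available on 1.22
    kind: KubeSchedulerConfiguration
    profiles:
      - schedulerName: default-scheduler
        plugins:
          score:
            disabled:
              # Disabling a default scoring plugin makes the scheduler's placement
              # decisions diverge from what CA simulates internally.
              - name: NodeResourcesBalancedAllocation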

@mimmus
Author

mimmus commented Apr 24, 2023

I think the vendor is using the default scheduler (the image is k8s.gcr.io/kube-scheduler:v1.22.8) and I don't see any particular options.
I will try to sort this out with the vendor and will update this issue if I have any news.

@mimmus
Author

mimmus commented May 22, 2023

It seems to be solved after upgrading the clusters to Kubernetes 1.23.12 / cluster-autoscaler 1.23.1.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 20, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 19, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot k8s-ci-robot closed this as not planned (Won't fix, can't repro, duplicate, stale) on Mar 21, 2024
@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
