
[cluster-autoscaler] doesn't scale down unneeded nodes #4342

Closed
nmiculinic opened this issue Sep 17, 2021 · 12 comments
Labels
area/cluster-autoscaler kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

@nmiculinic

Which component are you using?:

cluster-autoscaler

What version of the component are you using?:

Component version: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.22.0
What k8s version are you using (kubectl version)?:

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.3", GitCommit:"ca643a4d1f7bfe34773c74f79527be4afd95bf39", GitTreeState:"archive", BuildDate:"2021-07-16T17:16:46Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20+", GitVersion:"v1.20.7-eks-d88609", GitCommit:"d886092805d5cc3a47ed5cf0c43de38ce442dfcb", GitTreeState:"clean", BuildDate:"2021-07-31T00:29:12Z", GoVersion:"go1.15.12", Compiler:"gc", Platform:"linux/amd64"}

What environment is this in?:

AWS EKS

./cluster-autoscaler --cloud-provider=aws --namespace=kube-system --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/staging-3 --balance-similar-node-groups=true --logtostderr=true --max-nodes-total=3000 --scale-down-unneeded-time=10m --stderrthreshold=info --v=4

What did you expect to happen?:

Cluster autoscaler scales down the unneeded nodes, and deletes them.

What happened instead?:

I0917 16:36:09.442769       1 static_autoscaler.go:502] Scale down status: unneededOnly=true lastScaleUpTime=2021-09-17 16:33:56.197319882 +0000 UTC m=+29.082938673 lastScaleDownDeleteTime=2021-09-17 16:33:56.197319982 +0000 UTC m=+29.082938763 lastScaleDownFailTime=2021-09-17 16:33:56.197320072 +0000 UTC m=+29.082938873 scaleDownForbidden=true isDeleteInProgress=false scaleDownInCooldown=true

There were 300 EC2 instances alive for more than a day despite being unneeded, and the cluster autoscaler hasn't scaled them down. In the logs I see that node X is unneeded for YYY time (which is >24h):

I0917 15:27:01.303473       1 static_autoscaler.go:510] ip-10-0-111-9.ec2.internal is unneeded since 2021-09-17 06:38:26.56670405 +0000 UTC m=+60532.599793618 duration 8h48m34.60339112s

How to reproduce it (as minimally and precisely as possible):

Not sure, to be honest; it appears a bit non-deterministic.

Anything else we need to know?:

  • I've tried downgrading to v1.21.0, but I still see worrisome things in the logs:
I0917 17:28:10.085367       1 static_autoscaler.go:491] ip-10-0-80-218.ec2.internal is unneeded since 2021-09-17 16:34:06.197693882 +0000 UTC m=+39.083312683 duration 54m3.736843166s

After the downgrade I saw the nodes being deleted from AWS, though I'm not sure whether that was due to the version change or just the cluster-autoscaler restart.
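
For anyone trying to reproduce this, the gate that blocks scale-down shows up in the "Scale down status" log line above; a quick way to watch it, assuming the autoscaler runs as a Deployment named cluster-autoscaler in kube-system (adjust for your install):

    kubectl -n kube-system logs deploy/cluster-autoscaler --tail=500 | grep -E "Scale down status|is unneeded since"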

@nmiculinic nmiculinic added the kind/bug Categorizes issue or PR as related to a bug. label Sep 17, 2021
@AnthonyPoschen

Also noticed this a few days ago. I was on 1.17.3, upgraded to 1.21.0, and am trying 1.22.0 now. I haven't pinpointed whether it started occurring around the same time or not; I'm also using the AWS cloud provider.

@AnthonyPoschen

After leaving it running for 30m from boot, scaleDownForbidden=false and scaleDownInCooldown=false finally occurred, even though everything is configured to 10m.

It gets into a loop where it finds all the nodes that can be scaled down, marks them for ~1-2m, then removes the mark and re-adds them on the next scan, so no scale-downs actually happen. I have modified the following to get around this issue for now, and am seeing scale-in.

--scale-down-unneeded-time=1m
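
For context, --scale-down-unneeded-time is only one of the knobs behind scaleDownInCooldown; a rough summary of the related flags (defaults as I understand them from the CA FAQ; the example values are illustrative only, not a recommendation):

    # Cooldown-related flags (defaults in parentheses):
    #   --scale-down-unneeded-time        (10m) how long a node must be unneeded before it is eligible for removal
    #   --scale-down-delay-after-add      (10m) how long after a scale-up before scale-down evaluation resumes
    #   --scale-down-delay-after-delete   (scan interval) how long after a node deletion before scale-down resumes
    #   --scale-down-delay-after-failure  (3m)  how long after a failed scale-down before retrying
    ./cluster-autoscaler --scale-down-unneeded-time=1m --scale-down-delay-after-add=10m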

@AnthonyPoschen

I am getting scale-up and scale-down loops with

I0929 03:21:12.884422       1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"Pod", Namespace:"istio-system", Name:"istio-ingressgateway-75db74996b-hm59x", UID:"31109140-d56a-4719-9244-d3d473b07f48", APIVersion:"v1", ResourceVersion:"217141126", FieldPath:""}): type: 'Normal' reason: 'TriggeredScaleUp' pod triggered scale-up: [{eksctl-Cluster-w4rHeWRU32Ba-nodegroup-istio-system-1-21-NodeGroup-1E0RLQJ7RHCMB 3->10 (max: 20)}]

The deployment is set for 3 replicas only, with a node selector and toleration for the node group being used. It also has this anti-affinity policy:

      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - preference:
              matchExpressions:
              - key: kubernetes.io/arch
                operator: In
                values:
                - amd64
            weight: 2
          - preference:
              matchExpressions:
              - key: kubernetes.io/arch
                operator: In
                values:
                - ppc64le
            weight: 2
          - preference:
              matchExpressions:
              - key: kubernetes.io/arch
                operator: In
                values:
                - s390x
            weight: 2
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/arch
                operator: In
                values:
                - amd64
                - ppc64le
                - s390x
              - key: role
                operator: In
                values:
                - istio-system
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: istio-ingressgateway
            topologyKey: kubernetes.io/hostname

Not sure if this is fully related anymore at this point.
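
If it helps to debug the loop, the autoscaler also publishes its per-node-group view in a status ConfigMap; a quick way to inspect it (the name below is the default and may differ if overridden):

    kubectl -n kube-system get configmap cluster-autoscaler-status -o yaml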

@nmiculinic

We found the root cause.

There were some "schedulable" pods which weren't actually schedulable, which caused the cluster autoscaler not to scale down any nodes at all.
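
In case it helps anyone else hitting this: a rough way to surface such pods, assuming plain kubectl (the pod/namespace names below are placeholders):

    # Pods stuck in Pending across the cluster
    kubectl get pods --all-namespaces --field-selector=status.phase=Pending
    # Scheduling events for a suspect pod
    kubectl -n <namespace> describe pod <pod-name> | grep -A10 Events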

@bbhenry

bbhenry commented Dec 13, 2021

@nmiculinic And what do those log messages look like for the ones you thought were schedulable during the scaling process? Do they explicitly tell you those pods were actually not schedulable?

@nmiculinic

@bbhenry There was definitely a discrepancy between what CA thought was schedulable and what actually was schedulable.

IIRC it was related to storage; maybe the node was provisioned in one AWS AZ while the PV was in a different AZ (thus it couldn't be attached)? Or the PV was deleted? Or it was something else entirely. CC @filintod if you remember exactly what the issue was with those unschedulable pods.
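
If it was the storage theory, one way to cross-check would have been comparing the PV's zone affinity against the node's zone (names below are placeholders; older clusters may use the legacy failure-domain zone label instead):

    # Zone(s) the PV is pinned to
    kubectl get pv <pv-name> -o jsonpath='{.spec.nodeAffinity.required.nodeSelectorTerms}'
    # Zone of the candidate node
    kubectl get node <node-name> -L topology.kubernetes.io/zone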


What is more worrisome is that this is a global flag, not per node group, so we frequently hit periods where scale-down is in cooldown globally, since we run many disjoint ASGs (think ~12 per instance type, and we care which pod is scheduled to which instance type).

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 15, 2022
@nmiculinic

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 21, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 19, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 19, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
