Cluster Autoscaler resets unneeded since time to 0s #5618

Closed
wreed4 opened this issue Mar 24, 2023 · 19 comments
Assignees
Labels
area/cluster-autoscaler area/core-autoscaler Denotes an issue that is related to the core autoscaler and is not specific to any provider. kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@wreed4

wreed4 commented Mar 24, 2023

Which component are you using?:

cluster-autoscaler

What version of the component are you using?:

Component version: 1.21.3

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.17", GitCommit:"953be8927218ec8067e1af2641e540238ffd7576", GitTreeState:"clean", BuildDate:"2023-03-01T02:23:41Z", GoVersion:"go1.19.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"22+", GitVersion:"v1.22.17-eks-48e63af", GitCommit:"47b89ea2caa1f7958bc6539d6865820c86b4bf60", GitTreeState:"clean", BuildDate:"2023-01-24T09:34:06Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}

What environment is this in?:

aws, EKS

What did you expect to happen?:

The "unneeded since" duration would continue to increase until the node is either removed or becomes needed.

What happened instead?:

The "unneeded since" duration dropped to 0 for many nodes at once (but not all), even though, per the logs, they were never determined to be needed, causing the --scale-down-unneeded-time timer to be reset.
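
To illustrate why the reset matters, here is a minimal Go sketch of the kind of check --scale-down-unneeded-time gates (illustrative names and values only, not the actual cluster-autoscaler code): a node only becomes a scale-down candidate once it has been continuously unneeded for the configured duration, so resetting the timestamp restarts that countdown.

package main

import (
    "fmt"
    "time"
)

// scaleDownUnneededTime mirrors the --scale-down-unneeded-time flag
// (illustrative value only).
const scaleDownUnneededTime = 10 * time.Minute

// eligibleForScaleDown is a simplified stand-in for the check CA performs:
// a node only becomes a scale-down candidate once it has been continuously
// unneeded for at least scaleDownUnneededTime.
func eligibleForScaleDown(unneededSince, now time.Time) bool {
    return now.Sub(unneededSince) >= scaleDownUnneededTime
}

func main() {
    now := time.Now()

    // Node tracked as unneeded for 11 minutes: eligible.
    fmt.Println(eligibleForScaleDown(now.Add(-11*time.Minute), now)) // true

    // If the timestamp is reset to "now", the countdown starts over and the
    // node stays in the cluster for at least another full period.
    fmt.Println(eligibleForScaleDown(now, now)) // false
}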

How to reproduce it (as minimally and precisely as possible):

I'm unsure what causes this, but our cluster has a fairly high churn rate, is around 300 nodes, and runs mostly default CA settings. I also saw a Watch on ReplicaSets close in the loop where this happened, if that matters.

Anything else we need to know?:

I'm happy to answer more questions, but I'm unsure what else to put here. The logs are far too verbose to copy in their entirety, but this is the piece of code I'm looking at that I think might be lying:

// Update stores nodes along with a time at which they were found to be
// unneeded. Previously existing timestamps are preserved.

@wreed4 wreed4 added the kind/bug Categorizes issue or PR as related to a bug. label Mar 24, 2023
@vadasambar
Member

/assign @vadasambar

@vadasambar
Member

A snippet of logs around the problematic behavior might be helpful.

Here's my understanding (feel free to correct me):

  1. I couldn't find the core/scaledown/unneeded/nodes.go file in the 1.21 version of CA. Maybe you are looking at master or a post-1.21 version of the code.
  2. We are talking about the log line defined at:
    klog.V(2).Infof("%s was unneeded for %s", nodeName, currentTime.Sub(unneededSince).String())

The unneededSince timestamp is updated at:

// Update the timestamp map.
result := make(map[string]time.Time)
unneededNodesList := make([]*apiv1.Node, 0, len(nodesToRemove))
for _, node := range nodesToRemove {
    name := node.Node.Name
    unneededNodesList = append(unneededNodesList, node.Node)
    if val, found := sd.unneededNodes[name]; !found {
        result[name] = timestamp
    } else {
        result[name] = val
    }
}

@wreed4
Author

wreed4 commented Mar 28, 2023

Thank you @vadasambar. I have too many tabs open and that one wasn't from 1.21; sorry for the confusion. Thanks for finding the right place.

Here are all the logs having to do with one particular node this happened on. I'm filtering them because we have hundreds of nodes and there are a LOT of logs, but the filter is just a simple text search on this node's name:

I0324 06:00:25.374593       1 static_autoscaler.go:497] ip-172-17-76-70.ec2.internal is unneeded since 2023-03-24 06:00:24.071480578 +0000 UTC m=+242752.166823855 duration 0s
I0324 06:00:24.698279       1 cluster.go:187] Fast evaluation: node ip-172-17-76-70.ec2.internal may be removed
I0324 06:00:24.670887       1 cluster.go:148] Fast evaluation: ip-172-17-76-70.ec2.internal for removal
I0324 06:00:24.378929       1 scale_down.go:448] Node ip-172-17-76-70.ec2.internal - memory utilization 0.888549
I0324 06:00:14.017381       1 scale_down.go:829] ip-172-17-76-70.ec2.internal was unneeded for 11m13.651288526s
I0324 06:00:14.015632       1 static_autoscaler.go:497] ip-172-17-76-70.ec2.internal is unneeded since 2023-03-24 05:48:59.326845574 +0000 UTC m=+242067.422188845 duration 11m13.651288526s
I0324 06:00:13.297503       1 scale_down.go:448] Node ip-172-17-76-70.ec2.internal - memory utilization 0.888549
I0324 06:00:13.954747       1 cluster.go:187] Fast evaluation: node ip-172-17-76-70.ec2.internal may be removed
I0324 06:00:13.930523       1 cluster.go:148] Fast evaluation: ip-172-17-76-70.ec2.internal for removal

It looks like something is getting reset between the 6:00:14 scrape and the 6:00:24 scrape. The only thing that looks like a reset is this line:

I0324 06:00:22.671491       1 reflector.go:530] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.ReplicaSet total 977 items received

During the 6:00:24 scrape there were quite a few other nodes marked as unneeded for 0s; some of them were new, but some display the same behavior as this node.

@vadasambar
Member

vadasambar commented Mar 29, 2023

It seems like something is removing the node from unneededNodes, and then the node shows up again in nodesToRemove, which resets the time to the current time on line 600 in:

// Update the timestamp map.
result := make(map[string]time.Time)
unneededNodesList := make([]*apiv1.Node, 0, len(nodesToRemove))
for _, node := range nodesToRemove {
    name := node.Node.Name
    unneededNodesList = append(unneededNodesList, node.Node)
    if val, found := sd.unneededNodes[name]; !found {
        result[name] = timestamp
    } else {
        result[name] = val
    }
}
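
To make that concrete, here is a minimal, self-contained sketch of the same preserve-or-reset behaviour (simplified names and types, not the CA code): a node that was dropped from the previous map gets the current loop time, so its reported unneeded duration goes back to 0s.

package main

import (
    "fmt"
    "time"
)

// updateUnneeded mirrors the loop above in spirit: nodes already present in
// prev keep their original timestamp; nodes that are not found get "now".
// (Names and types are simplified for illustration; this is not the CA code.)
func updateUnneeded(prev map[string]time.Time, candidates []string, now time.Time) map[string]time.Time {
    result := make(map[string]time.Time, len(candidates))
    for _, name := range candidates {
        if since, found := prev[name]; found {
            result[name] = since // timestamp preserved
        } else {
            result[name] = now // timestamp reset to the current loop time
        }
    }
    return result
}

func main() {
    now := time.Now()

    // Loop N: node-c has been unneeded for 11 minutes.
    prev := map[string]time.Time{"node-c": now.Add(-11 * time.Minute)}

    // Something (e.g. the usage tracker) deletes node-c from the map
    // between loops...
    delete(prev, "node-c")

    // ...so loop N+1 treats it as new, and it reports "unneeded for 0s"
    // even though it was never actually needed.
    next := updateUnneeded(prev, []string{"node-c"}, now)
    fmt.Println("node-c unneeded for", now.Sub(next["node-c"])) // 0s
}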

@vadasambar
Member

I think CA might be removing some nodes from the unneeded node map. Because of this, during the next iteration of CA (cluster-autoscaler), the code goes into the !found branch of the condition on line 600 above and the timestamp is updated to the current time, so the unneeded time (calculated as current timestamp - unneeded timestamp) becomes 0.

toRemove := nodesToRemove[0]
utilization := sd.nodeUtilizationMap[toRemove.Node.Name]
podNames := make([]string, 0, len(toRemove.PodsToReschedule))
for _, pod := range toRemove.PodsToReschedule {
    podNames = append(podNames, pod.Namespace+"/"+pod.Name)
}
klog.V(0).Infof("Scale-down: removing node %s, utilization: %v, pods to reschedule: %s", toRemove.Node.Name, utilization,
    strings.Join(podNames, ","))
sd.context.LogRecorder.Eventf(apiv1.EventTypeNormal, "ScaleDown", "Scale-down: removing node %s, utilization: %v, pods to reschedule: %s",
    toRemove.Node.Name, utilization, strings.Join(podNames, ","))
// Nothing super-bad should happen if the node is removed from tracker prematurely.
simulator.RemoveNodeFromTracker(sd.usageTracker, toRemove.Node.Name, sd.unneededNodes)

RemoveNodeFromTracker on line 987 modifies sd.unneededNodes without completely resetting it (note line 975; only one node is scaled down at a time).

To look a little closer at RemoveNodeFromTracker:

// RemoveNodeFromTracker removes node from tracker and also cleans the passed utilization map.
func RemoveNodeFromTracker(tracker *UsageTracker, node string, utilization map[string]time.Time) {
    keysToRemove := make([]string, 0)
    if mainRecord, found := tracker.Get(node); found {
        if mainRecord.usingTooMany {
            keysToRemove = getAllKeys(utilization)
        } else {
        usingloop:
            for usedNode := range mainRecord.using {
                if usedNodeRecord, found := tracker.Get(usedNode); found {
                    if usedNodeRecord.usedByTooMany {
                        keysToRemove = getAllKeys(utilization)
                        break usingloop
                    } else {
                        for anotherNode := range usedNodeRecord.usedBy {
                            keysToRemove = append(keysToRemove, anotherNode)
                        }
                    }
                }
            }
        }
    }
    tracker.Unregister(node)
    delete(utilization, node)
    for _, key := range keysToRemove {
        delete(utilization, key)
    }
}

utilization here is sd.unneededNodes passed by reference (for lack of a better term). As you can see on line 157, it removes the node being scaled down (let's call it node A) from sd.unneededNodes, which is expected, but it also goes a step further and deletes other nodes (lines 158~160) from sd.unneededNodes.
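
As a side note on the "passed by reference" point: Go maps are reference types, so deletions performed inside RemoveNodeFromTracker are visible in the caller's sd.unneededNodes. A tiny illustrative sketch (hypothetical names, not the CA code):

package main

import (
    "fmt"
    "time"
)

// cleanUp deletes entries from the map it receives. Because Go maps are
// reference types, the deletions are visible to the caller's map as well,
// which is why RemoveNodeFromTracker mutating `utilization` also mutates
// sd.unneededNodes.
func cleanUp(utilization map[string]time.Time, keys ...string) {
    for _, k := range keys {
        delete(utilization, k)
    }
}

func main() {
    unneeded := map[string]time.Time{
        "node-a": time.Now().Add(-15 * time.Minute),
        "node-c": time.Now().Add(-11 * time.Minute),
    }
    cleanUp(unneeded, "node-a", "node-c")
    fmt.Println(len(unneeded)) // 0 — the caller's map was modified
}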

Here's my understanding of how UsageTracker works:

  1. node A has a usage record. If pods from node A can be moved to node B, node B is added to node A's record.using
  2. node B also has a usage record. If pods from node A can be moved to node B, node A is added to node B's record.usedBy
  3. when node A is scaled down, node A is removed from sd.unneededNodes
  4. usage tracker goes a step further and also removes some more nodes based on the following logic:
    a. node A is using node B
    b. node B is using node D
    c. node B is being used by node A and node C
    d. when node A is deleted, the usage tracker sees which nodes node A is using. In our case it picks node B and gets node B's usage record. It then collects all the node names registered in node B's record.usedBy and removes those nodes from sd.unneededNodes

(Diagram illustrating the using/usedBy relationships between nodes A, B, C, and D; link to picture in the original issue.)
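
To make the relationships above concrete, here is a small hypothetical sketch that mimics the keysToRemove logic with toy using/usedBy records (illustrative types only, not the actual UsageTracker): scaling down node A drops node C from the unneeded map, while node B keeps its timestamp.

package main

import (
    "fmt"
    "time"
)

// record is a simplified stand-in for the usage tracker's per-node record:
// `using` = nodes this node's pods could move to, `usedBy` = nodes whose
// pods could move onto this node. (Illustrative only, not the CA types.)
type record struct {
    using  map[string]bool
    usedBy map[string]bool
}

// removeNode mimics the keysToRemove logic described above: when `node` is
// scaled down, every node registered in the usedBy set of each node it was
// using is dropped from the unneeded map, along with the node itself.
func removeNode(tracker map[string]record, node string, unneeded map[string]time.Time) {
    if rec, found := tracker[node]; found {
        for used := range rec.using {
            for other := range tracker[used].usedBy {
                delete(unneeded, other)
            }
        }
    }
    delete(tracker, node)
    delete(unneeded, node)
}

func main() {
    now := time.Now()
    // Node A is using node B; node B is used by nodes A and C.
    tracker := map[string]record{
        "A": {using: map[string]bool{"B": true}},
        "B": {usedBy: map[string]bool{"A": true, "C": true}},
    }
    unneeded := map[string]time.Time{
        "A": now.Add(-15 * time.Minute),
        "B": now.Add(-12 * time.Minute),
        "C": now.Add(-11 * time.Minute),
    }

    removeNode(tracker, "A", unneeded)

    // A is gone (scaled down) and C is gone (its placement must be
    // re-evaluated), but B keeps its timestamp.
    for name := range unneeded {
        fmt.Println(name, "still unneeded") // prints only B
    }
}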

My earlier understanding was that the usage tracker would remove node B from sd.unneededNodes because node A was using node B (i.e., pods from node A could be moved there), but that doesn't seem true based on the code. Instead, node C is removed from sd.unneededNodes, because pods from node C cannot be moved to node B anymore now that pods from node A are slated to move there. In other words, we need to re-evaluate where pods from node C would go if it were scaled down.

I believe node C is the node we care about here. It is getting removed from sd.unneededNodes, and hence its unneeded time starts counting from 0 all over again.

I am not sure why node B is not removed from sd.unneededNodes by the usage tracker. My guess is that this is by design and we either deal with node B somewhere else or we don't want to remove it from sd.unneededNodes (though I am not sure why).

@vadasambar vadasambar removed their assignment Apr 6, 2023
@vadasambar
Member

/label cluster-autoscaler

@k8s-ci-robot
Contributor

@vadasambar: The label(s) /label cluster-autoscaler cannot be applied. These labels are supported: api-review, tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash, team/katacoda, refactor. Is this label configured under labels -> additional_labels or labels -> restricted_labels in plugin.yaml?

In response to this:

/label cluster-autoscaler

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@bober02

bober02 commented May 3, 2023

Hi, has there been any progress on this? On large-scale deployments this is actually taking us hours to fully scale down, with the corresponding cost for unutilised machines.

@bober02

bober02 commented May 3, 2023

This is also happening in version 1.25.

@kevinkrp93

One of our customers was facing the same issue too. We replicated the customer scenario twice and found this reset in one of the runs. Is anyone else seeing this?

Cluster Autoscaler Profile: (screenshot in the original issue)

Customer Scenario: (screenshot in the original issue)

Replicated Customer Scenario: (screenshot in the original issue)

@kevinkrp93

/label cluster-autoscaler

@k8s-ci-robot
Contributor

@kevinkrp93: The label(s) /label cluster-autoscaler cannot be applied. These labels are supported: api-review, tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash, team/katacoda, refactor. Is this label configured under labels -> additional_labels or labels -> restricted_labels in plugin.yaml?

In response to this:

/label cluster-autoscaler

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Bryce-Soghigian
Member

/assign Bryce-Soghigian

@Bryce-Soghigian
Member

Not sure whether this is entirely related, but the same bug showed up in my testing and it goes away with this fix:
#5890

@Bryce-Soghigian
Member

I think we can close this issue as it was addressed in my PR, and I have the cherry-picks posted in #5962 to fix this for all supported k8s versions.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 24, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 23, 2024
@towca towca added the area/core-autoscaler Denotes an issue that is related to the core autoscaler and is not specific to any provider. label Mar 21, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot k8s-ci-robot closed this as not planned Won't fix, can't repro, duplicate, stale Apr 20, 2024
@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
