
CA suddenly tries to scale-up from min nodes instead of current node count on the VMSS node group #4462

Closed
mdahamiwal opened this issue Nov 11, 2021 · 7 comments
Labels: area/cluster-autoscaler, area/provider/azure, kind/bug

Comments

mdahamiwal commented Nov 11, 2021

Which component are you using?:
cluster-autoscaler

What version of the component are you using?:
helm version 1.4.1

Component version:

What k8s version are you using (kubectl version)?:
v1.21.3

kubectl version Output
$ kubectl version

Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.0", GitCommit:"c2b5237ccd9c0f1d600d3072634ca66cefdf272f", GitTreeState:"clean", BuildDate:"2021-08-04T18:03:20Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"windows/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.3", GitCommit:"ca643a4d1f7bfe34773c74f79527be4afd95bf39", GitTreeState:"clean", BuildDate:"2021-07-15T23:45:37Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"linux/amd64"}

What environment is this in?:
Azure VMSS

What did you expect to happen?:
CA should maintain the scale of the VMSS pool.

What happened instead?:

CA abruptly scaled down the agent pool of the cluster and remained in that state for around half an hour before correcting the scale. This caused all of the running pods to go into a Pending state for that period. There was no significant traffic spike or drop at the time that would explain it.

It eventually auto-recovered and corrected the state on its own.

CA somehow read the current usage and pod state incorrectly.

How to reproduce it (as minimally and precisely as possible):
We are not sure how to reproduce this consistently. For the most part the CA runs as expected, but we have seen some rare occurrences of the same issue across multiple clusters without any noticeable traffic spike or drop.

Anything else we need to know?:

cluster autoscaler configuration:

Args:
  --v=3
  --logtostderr=true
  --cloud-provider=azure
  --cloud-config=/etc/kubernetes/azure.json
  --skip-nodes-with-local-storage=false
  --ignore-daemonsets-utilization
  --nodes=10:480:k8s-defpool1-36836756-vmss
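
(For context, the --nodes flag packs the node group bounds and the VMSS name into a single min:max:name value. Below is a minimal Go sketch of how such a value breaks down, purely for illustration; it is not the autoscaler's actual flag parser and the names are assumptions.)

package main

import (
    "fmt"
    "strconv"
    "strings"
)

type nodeGroupSpec struct {
    MinSize int
    MaxSize int
    Name    string
}

// parseNodesFlag splits a "min:max:name" value into node group bounds and the
// VMSS name. Illustrative only, not cluster-autoscaler code.
func parseNodesFlag(value string) (nodeGroupSpec, error) {
    parts := strings.SplitN(value, ":", 3)
    if len(parts) != 3 {
        return nodeGroupSpec{}, fmt.Errorf("expected min:max:name, got %q", value)
    }
    minSize, err := strconv.Atoi(parts[0])
    if err != nil {
        return nodeGroupSpec{}, err
    }
    maxSize, err := strconv.Atoi(parts[1])
    if err != nil {
        return nodeGroupSpec{}, err
    }
    if minSize < 0 || maxSize < minSize {
        return nodeGroupSpec{}, fmt.Errorf("invalid bounds %d:%d", minSize, maxSize)
    }
    return nodeGroupSpec{MinSize: minSize, MaxSize: maxSize, Name: parts[2]}, nil
}

func main() {
    spec, err := parseNodesFlag("10:480:k8s-defpool1-36836756-vmss")
    if err != nil {
        panic(err)
    }
    fmt.Printf("%+v\n", spec) // {MinSize:10 MaxSize:480 Name:k8s-defpool1-36836756-vmss}
}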

Logs emitted:

[screenshot attached: vmss_autoscaler]

mdahamiwal added the kind/bug label Nov 11, 2021
marwanad (Member) commented Nov 12, 2021

@mdahamiwal which CA version are you using? And were there unregistered node removal attempts kicking in before this happened? This should be fixed by #4372, which is part of CA 1.21.1.

mdahamiwal (Author) commented:

@marwanad Yes, there were 2-3 unregistered node removal attempts running for a long time. That still doesn't explain why removal attempts for the other 2 nodes would cause this miscalculation, though. Let me test the 1.21.1 version.

marwanad (Member) commented:

@mdahamiwal If the cache keeps decrementing on every removal attempt, you'll end up in this state. Keep me posted with the results. If you're able to reproduce this regularly, I'd be curious to take a look at the verbose (maybe v(5)) logs.
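
(To make the failure mode above concrete, here is an illustrative Go sketch, not the actual Azure provider code; the type and field names are assumptions. It shows how a cache that is decremented on every removal attempt, rather than on every confirmed deletion, walks the believed capacity down to the node group minimum.)

package main

import "fmt"

type vmssSizeCache struct {
    size    int // what CA believes the VMSS capacity is
    minSize int // configured node group minimum
}

// onRemovalAttempt decrements the cache unconditionally -- the buggy pattern.
// A correct version would only adjust the cache after a confirmed deletion,
// or re-read the real capacity from the Azure API.
func (c *vmssSizeCache) onRemovalAttempt() {
    if c.size > c.minSize {
        c.size--
    }
}

func main() {
    cache := &vmssSizeCache{size: 50, minSize: 10}
    for i := 0; i < 100; i++ { // repeated attempts while the stuck nodes never register
        cache.onRemovalAttempt()
    }
    fmt.Println(cache.size) // 10 -- CA now believes the group sits at its minimum
}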

mdahamiwal (Author) commented Nov 13, 2021

@marwanad You are right: with every removal attempt CA decrements the cache and eventually bottoms out at the min node count. We have noticed this happening only once so far, in a few of our clusters in different regions. We will pick up the 1.21.1 version, which should fix this. Thank you!

Failed to remove node [...truncated..]k8s-defpool1-36836756-vmss/virtualMachines/60: node group min size reached, skipping unregistered node removal
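
(The guard that emits a log like the one above can be sketched as follows. removeUnregisteredNode is a hypothetical helper, not the cluster-autoscaler's real function; it only illustrates why, once the cached size has been walked down to the configured minimum, unregistered-node removal is skipped and the stuck instances are never cleaned up.)

package sketch

import "k8s.io/klog/v2"

// removeUnregisteredNode skips removal when the cached size is already at the
// configured minimum, which is how the "min size reached" log above appears
// even though the real VMSS capacity is much larger.
func removeUnregisteredNode(cachedSize, minSize int, nodeName string) bool {
    if cachedSize <= minSize {
        klog.Warningf("Failed to remove node %s: node group min size reached, skipping unregistered node removal", nodeName)
        return false
    }
    // ...issue the VMSS instance delete here and only adjust the cached size
    // once the deletion is confirmed...
    return true
}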

marwanad (Member) commented:

@mdahamiwal please keep me posted on whether the fix addresses the issue for good. I am planning on adding more logs around the cache and how the cache refreshes deal with capacity. My other theory would be stale data returned from the API, but I think it's more likely the CA bug.
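
(A sketch of the kind of extra cache logging being discussed, purely illustrative and not code from the Azure provider; capacityCache and fetchCapacity are assumed names. Comparing the cached capacity with what the API reports on each refresh would surface both a runaway cache decrement and stale API data.)

package sketch

import "k8s.io/klog/v2"

type capacityCache struct {
    cached map[string]int // node group name -> cached capacity
}

// refresh re-reads the capacity for one node group and logs when the cached
// value has drifted from what the API reports. fetchCapacity stands in for
// the real VMSS call.
func (c *capacityCache) refresh(name string, fetchCapacity func(string) (int, error)) error {
    actual, err := fetchCapacity(name)
    if err != nil {
        return err
    }
    if prev, ok := c.cached[name]; ok && prev != actual {
        klog.V(3).Infof("node group %s: cached capacity %d differs from API capacity %d, resetting cache entry", name, prev, actual)
    }
    c.cached[name] = actual
    return nil
}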

marwanad (Member) commented:

/area provider/azure

Closing this for now, but feel free to reopen if the above hasn't fixed your issue.

/close

k8s-ci-robot added the area/provider/azure label Dec 21, 2021
k8s-ci-robot (Contributor) commented:

@marwanad: Closing this issue.

In response to this:

/area provider/azure

Closing this for now, but feel free to reopen if the above hasn't fixed your issue.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
