
CA suddenly tries to scale-up from min nodes instead of current node count on the VMSS node group #4462

Closed
mdahamiwal opened this issue Nov 11, 2021 · 7 comments
Labels: area/cluster-autoscaler, area/provider/azure, kind/bug

Comments

mdahamiwal commented Nov 11, 2021

Which component are you using?:
cluster-autoscaler

What version of the component are you using?:
helm version 1.4.1

Component version:

What k8s version are you using (kubectl version)?:
v1.21.3

kubectl version Output
$ kubectl version

Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.0", GitCommit:"c2b5237ccd9c0f1d600d3072634ca66cefdf272f", GitTreeState:"clean", BuildDate:"2021-08-04T18:03:20Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"windows/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.3", GitCommit:"ca643a4d1f7bfe34773c74f79527be4afd95bf39", GitTreeState:"clean", BuildDate:"2021-07-15T23:45:37Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"linux/amd64"}

What environment is this in?:
Azure VMSS

What did you expect to happen?:
CA should maintain the scale of the VMSS pool.

What happened instead?:

CA abruptly scaled down the agent pool of the cluster and remained in that state for around half an hour before correcting the scale. This caused all of the running pods to go into a Pending state for that period. There was no significant traffic spike or drop at the time that would explain it.

It eventually auto-recovered and corrected the state on its own.

CA somehow read the current usage and pod state incorrectly.

How to reproduce it (as minimally and precisely as possible):
We are not sure how to reproduce this consistently. For the most part the CA runs as expected, but we have seen some rare occurrences of the same issue across multiple clusters without any noticeable traffic spike or drop.

Anything else we need to know?:

cluster autoscaler configuration:

Args:
  --v=3
  --logtostderr=true
  --cloud-provider=azure
  --cloud-config=/etc/kubernetes/azure.json
  --skip-nodes-with-local-storage=false
  --ignore-daemonsets-utilization
  --nodes=10:480:k8s-defpool1-36836756-vmss
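
(For context, the --nodes flag packs the node group bounds and the VMSS name into a single min:max:name value. Below is a minimal Go sketch of how such a value breaks down, purely for illustration; it is not the autoscaler's actual flag parser and the names are assumptions.)

package main

import (
    "fmt"
    "strconv"
    "strings"
)

type nodeGroupSpec struct {
    MinSize int
    MaxSize int
    Name    string
}

// parseNodesFlag splits a "min:max:name" value into node group bounds and the
// VMSS name. Illustrative only, not cluster-autoscaler code.
func parseNodesFlag(value string) (nodeGroupSpec, error) {
    parts := strings.SplitN(value, ":", 3)
    if len(parts) != 3 {
        return nodeGroupSpec{}, fmt.Errorf("expected min:max:name, got %q", value)
    }
    minSize, err := strconv.Atoi(parts[0])
    if err != nil {
        return nodeGroupSpec{}, err
    }
    maxSize, err := strconv.Atoi(parts[1])
    if err != nil {
        return nodeGroupSpec{}, err
    }
    if minSize < 0 || maxSize < minSize {
        return nodeGroupSpec{}, fmt.Errorf("invalid bounds %d:%d", minSize, maxSize)
    }
    return nodeGroupSpec{MinSize: minSize, MaxSize: maxSize, Name: parts[2]}, nil
}

func main() {
    spec, err := parseNodesFlag("10:480:k8s-defpool1-36836756-vmss")
    if err != nil {
        panic(err)
    }
    fmt.Printf("%+v\n", spec) // {MinSize:10 MaxSize:480 Name:k8s-defpool1-36836756-vmss}
}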

Logs emitted:

[screenshot attached: vmss_autoscaler]

mdahamiwal added the kind/bug label Nov 11, 2021
marwanad (Member) commented Nov 12, 2021

@mdahamiwal which CA version are you using? And were there unregistered node removal attempts kicking in before this happened? This should be fixed by #4372, which is part of CA 1.21.1.

mdahamiwal (Author) commented:

@marwanad Yes, there were 2-3 unregistered node removal attempts running for a long time. That still doesn't explain why removal attempts for the other 2 nodes would cause this miscalculation, though. Let me test the 1.21.1 version.

marwanad (Member) commented:

@mdahamiwal If the cache keeps decrementing on every removal attempt, you'll end up in this state. Keep me posted with the results. If you're able to reproduce this regularly, I'd be curious to take a look at the verbose (maybe v(5)) logs.
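
(To make the failure mode above concrete, here is an illustrative Go sketch, not the actual Azure provider code; the type and field names are assumptions. It shows how a cache that is decremented on every removal attempt, rather than on every confirmed deletion, walks the believed capacity down to the node group minimum.)

package main

import "fmt"

type vmssSizeCache struct {
    size    int // what CA believes the VMSS capacity is
    minSize int // configured node group minimum
}

// onRemovalAttempt decrements the cache unconditionally -- the buggy pattern.
// A correct version would only adjust the cache after a confirmed deletion,
// or re-read the real capacity from the Azure API.
func (c *vmssSizeCache) onRemovalAttempt() {
    if c.size > c.minSize {
        c.size--
    }
}

func main() {
    cache := &vmssSizeCache{size: 50, minSize: 10}
    for i := 0; i < 100; i++ { // repeated attempts while the stuck nodes never register
        cache.onRemovalAttempt()
    }
    fmt.Println(cache.size) // 10 -- CA now believes the group sits at its minimum
}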

mdahamiwal (Author) commented Nov 13, 2021

@marwanad You are right: with every removal attempt CA decrements the cache and eventually bottoms out at the min node count. We have noticed this happening only once so far, in a few of our clusters in different regions. We will pick up the 1.21.1 version, which should fix this. Thank you!

Failed to remove node [...truncated..]k8s-defpool1-36836756-vmss/virtualMachines/60: node group min size reached, skipping unregistered node removal
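
(The guard that emits a log like the one above can be sketched as follows. removeUnregisteredNode is a hypothetical helper, not the cluster-autoscaler's real function; it only illustrates why, once the cached size has been walked down to the configured minimum, unregistered-node removal is skipped and the stuck instances are never cleaned up.)

package sketch

import "k8s.io/klog/v2"

// removeUnregisteredNode skips removal when the cached size is already at the
// configured minimum, which is how the "min size reached" log above appears
// even though the real VMSS capacity is much larger.
func removeUnregisteredNode(cachedSize, minSize int, nodeName string) bool {
    if cachedSize <= minSize {
        klog.Warningf("Failed to remove node %s: node group min size reached, skipping unregistered node removal", nodeName)
        return false
    }
    // ...issue the VMSS instance delete here and only adjust the cached size
    // once the deletion is confirmed...
    return true
}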

marwanad (Member) commented:

@mdahamiwal please keep me posted on whether the fix addresses the issue for good. I am planning on adding more logs around the cache and how the cache refreshes deal with capacity. My other theory would be stale data returned from the API, but I think it's more likely the CA bug.
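
(A sketch of the kind of extra cache logging being discussed, purely illustrative and not code from the Azure provider; capacityCache and fetchCapacity are assumed names. Comparing the cached capacity with what the API reports on each refresh would surface both a runaway cache decrement and stale API data.)

package sketch

import "k8s.io/klog/v2"

type capacityCache struct {
    cached map[string]int // node group name -> cached capacity
}

// refresh re-reads the capacity for one node group and logs when the cached
// value has drifted from what the API reports. fetchCapacity stands in for
// the real VMSS call.
func (c *capacityCache) refresh(name string, fetchCapacity func(string) (int, error)) error {
    actual, err := fetchCapacity(name)
    if err != nil {
        return err
    }
    if prev, ok := c.cached[name]; ok && prev != actual {
        klog.V(3).Infof("node group %s: cached capacity %d differs from API capacity %d, resetting cache entry", name, prev, actual)
    }
    c.cached[name] = actual
    return nil
}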

marwanad (Member) commented:

/area provider/azure

Closing this for now, but feel free to reopen if the above hasn't fixed your issue.

/close

k8s-ci-robot added the area/provider/azure label Dec 21, 2021
k8s-ci-robot (Contributor) commented:

@marwanad: Closing this issue.

In response to this:

/area provider/azure

Closing this for now, but feel free to reopen if the above hasn't fixed your issue.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
