CA suddenly tries to scale-up from min nodes instead of current node count on the VMSS node group #4462
Comments
@mdahamiwal What CA version are you using, and were there unregistered-node removal attempts before this kicked in? This should be fixed by #4372, which is part of CA 1.21.1.
@marwanad Yes, there were 2-3 unregistered-node removal attempts that had been running for a long time. That still doesn't explain why removal attempts for those 2 nodes would cause this miscalculation. Let me test the 1.21.1 version.
@mdahamiwal
@marwanad You are right: with every removal attempt CA decrements the cache and eventually hits bottom at the min node count. We have noticed this happening only once so far, in some of our clusters in different regions. We will pick up the 1.21.1 version, which should fix this. Thank you!
Failed to remove node [...truncated..]k8s-defpool1-36836756-vmss/virtualMachines/60: node group min size reached, skipping unregistered node removal
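For illustration only, the sketch below is not actual cluster-autoscaler code; it is a minimal, hypothetical Go example (the `nodeGroupCache` type and method names are made up) of the failure mode being discussed: if a cached target size is decremented on every failed unregistered-node removal attempt, repeated retries against the same stuck nodes walk the cache down to the group minimum even though the real VMSS capacity never changed.

```go
// Hypothetical sketch of the cache-decrement failure mode; not cluster-autoscaler code.
package main

import "fmt"

// nodeGroupCache mimics a cached view of a VMSS node group's target size.
type nodeGroupCache struct {
	minSize    int
	targetSize int
}

// removeUnregistered simulates one removal attempt: the cache is decremented
// optimistically, but the underlying removal keeps failing, so the same node
// is retried (and the cache decremented again) on the next autoscaler loop.
func (c *nodeGroupCache) removeUnregistered(node string) {
	if c.targetSize <= c.minSize {
		fmt.Printf("Failed to remove node %s: node group min size reached, skipping unregistered node removal\n", node)
		return
	}
	c.targetSize-- // cache shrinks even though the API-side capacity did not change
	fmt.Printf("attempted removal of %s, cached target size now %d\n", node, c.targetSize)
}

func main() {
	cache := &nodeGroupCache{minSize: 3, targetSize: 10}
	// Two unregistered nodes that never actually get removed; each loop
	// iteration retries them and decrements the cached target size again.
	for loop := 0; loop < 5; loop++ {
		cache.removeUnregistered("vmss/virtualMachines/59")
		cache.removeUnregistered("vmss/virtualMachines/60")
	}
	fmt.Printf("final cached target size: %d (real VMSS size unchanged at 10)\n", cache.targetSize)
}
```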
@mdahamiwal Please keep me posted on whether the fix addresses the issue for good. I am planning to add more logging around the cache and the cache refreshes that deal with capacity. My other theory would be stale data returned from the API, but I think the CA bug is the more likely cause.
/area provider/azure
Closing this for now, but feel free to reopen if the above hasn't fixed your issue.
/close
@marwanad: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Which component are you using?:
cluster-autoscaler
What version of the component are you using?:
helm version 1.4.1
Component version:
What k8s version are you using (kubectl version)?:
v1.21.3
What environment is this in?:
Azure VMSS
What did you expect to happen?:
CA should maintain the scale of the VMSS pool.
What happened instead?:
CA abruptly scaled down the agent pool of the cluster and remained in that state for around half an hour before correcting the scale. This caused all the running pods to go into a Pending state for that time. There was no significant traffic spike or drop during this window that would have caused it.
It also auto-recovered and corrected the state on its own.
CA somehow read the current usage and pod state incorrectly.
How to reproduce it (as minimally and precisely as possible):
We are not sure how to reproduce this consistently. For the most part CA runs as expected, but we have seen rare occurrences of the same issue across multiple clusters, without any noticeable traffic spike or drop.
Anything else we need to know?:
cluster autoscaler configuration:
Logs emitted: