cluster autoscaler --skip-nodes-with-system-pods=false ? #1253
Comments
Just checking all our current kube-aws provisioned clusters: I don't have the above suggested workaround/hack/fix config directive in any of them (though note that they are all older versions). They all do have the following section, though, and perhaps, if we were to hack the userdata file from a kube-aws provisioned cluster, that is where the directive would go:
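(The original userdata excerpt did not survive here; the following is only a rough sketch, assuming the cluster-autoscaler container spec embedded in cloud-config-controller looks roughly like this. The image tag and the --nodes ASG spec are placeholders, not values from our clusters or the actual kube-aws template.)

```yaml
# Hypothetical excerpt of the cluster-autoscaler container spec embedded in the
# controller userdata (cloud-config-controller); tag and ASG values are placeholders.
containers:
  - name: cluster-autoscaler
    image: k8s.gcr.io/cluster-autoscaler:v1.1.0        # placeholder tag
    command:
      - ./cluster-autoscaler
      - --v=4
      - --stderrthreshold=info
      - --cloud-provider=aws
      - --nodes=1:5:my-nodepool-asg                    # placeholder ASG spec
      - --skip-nodes-with-system-pods=false            # the directive in question
```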
That is where we'd insert the flag. I haven't tried this yet, as I need to provision a test cluster first and cannot disturb the current cluster. Thanks |
Modifies the behavior of the autoscaler so that it terminates worker and controller nodes that are no longer needed when minSize is set manually. Note that I had to delete the initial autoscaler pod after deploying a new cluster with this modification in place. After that initial autoscaler pod was deleted, the second pod functioned without errors and as expected. I will update the issue comments with more details. Fixes kubernetes-retired#1253
One significant caveat/concern that I wanted to expand on and provide more details about: I tested this on the most recent kube-aws version that we have, kube-aws v0.9.10-rc.3. After initial cluster deployment, inspecting the AS pod logs:
I deleted the initial autoscaler pod, and the second (replacement) AS pod resolved the errors noted above.
I then validated that the correct updated options appeared in the logs from the newly deployed cluster-autoscaler pod:
Once the pod was deleted and redeployed, its logs were clean and the updated cloud-config file worked as expected, allowing scaling up and down. |
@cmcconnell1 Thanks for the information! This is so helpful. Have you deployed kiam or kube2iam, enabled Calico, or anything else you think might be suspicious on your cluster? Just an idea, but I suspect CA may have a startup-ordering issue with kiam/kube2iam/Calico/etc. |
Yes @mumoshu, we have kube2iam enabled in both node pools and in the global
Looking at another fresh cluster just deployed to validate (this is immediately after deployment): kk get po | grep 'cluster-autoscaler' showed three restarts of the initial AS pod. I waited until the sixth (6th) restart of the AS pod before killing it and noting that its logs look good. Thanks! |
@cmcconnell1 Yes, it's super helpful! I was reading kubernetes/kops#1796 (comment) - I remember that a custom DNS was the requirement in your env. If that's still the case, there are several questions I'd appreciate answers to:
|
@cmcconnell1 Regardless, would setting |
@mumoshu The new cluster's CA did have six (6) restarts, but its logs were clear of the previously noted errors when deployed.
|
@cmcconnell1 Thanks! Seems better than before, right? Now, perhaps you're seeing errors other than |
@mumoshu I didn't see any (concerning) errors in the final (sixth) CA pod's logs. It did not redeploy any more pods and remained stable after around 20 minutes (the time of the sixth CA pod deployment). I was not able to watch the previous CA pods' logs, as I am working on other projects at the same time. What would be very helpful for me would be some docs and utilities (i.e. a kube-aws test/validation harness) for testing new kube-aws/kube clusters that could gather all requisite details for deep-dive analysis. kube-aws test/validation harness benefits:
On that same note, could you recommend existing test/validation tools, such as Sonobuoy, that we could use to help us validate/test new kube-aws (and Kubernetes, etc.) versions? Thanks! |
FWIW, I ran a Sonobuoy scan against my PR test cluster, and the results should be available here for a while.
The only issue my Sonobuoy scan detected was a very trivial one. I also manually scaled the worker node pools and the controller up and back down successfully with our mods noted in the PR. Hope this helps. |
Yes. I prefer it over our
Depends on the changes made in a PR, but basically:
Your idea of the test harness sounds awesome! |
@cmcconnell1 Thank you so much for your efforts, anyway! Your PR LGTM now. |
* autoscaler: update cloud-config-controller. Modifies the behavior of the autoscaler to terminate worker and controller nodes no longer needed when minSize is set manually. Note that I had to delete the initial autoscaler pod after deploying a new cluster with this modification in place. After that initial autoscaler pod was deleted the second pod functioned without errors and as expected. I will update the issue comments with more details. Fixes #1253
* update CA in cloud-config-controller with dnsPolicy: Default
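For reference, dnsPolicy is a standard pod-spec field, set at the pod-template level of the cluster-autoscaler Deployment. This is only a minimal sketch of that portion of the manifest, with a placeholder image tag and abbreviated fields, not the exact content of kube-aws's cloud-config-controller:

```yaml
# Minimal sketch of the CA Deployment's pod template; dnsPolicy: Default makes
# the pod use the node's resolver instead of cluster DNS, so the autoscaler does
# not depend on in-cluster DNS being ready before it can reach the AWS APIs.
template:
  metadata:
    labels:
      app: cluster-autoscaler
  spec:
    dnsPolicy: Default
    containers:
      - name: cluster-autoscaler
        image: k8s.gcr.io/cluster-autoscaler:v1.1.0   # placeholder tag
```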
If you are using the cluster-autoscaler Helm chart, set the values as below to get it to work:
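(The original values snippet did not survive here. As a rough sketch, assuming a chart version that exposes CA flags through an extraArgs map, the relevant override would look something like this; check your chart version's documented values before using it.)

```yaml
# Hypothetical values.yaml override for the cluster-autoscaler Helm chart;
# extraArgs passes additional flags through to the cluster-autoscaler binary.
extraArgs:
  skip-nodes-with-system-pods: false
```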
|
Hello All,
Apologies if I'm missing something, but with a recently deployed new cluster on kube-aws v0.9.10-rc.3, I seem to be missing autoscaler functionality. The problem is that this new cluster's AS doesn't honor the min settings in our node pool configurations as specified in cluster.yaml, and it will not terminate nodes after we've manually scaled the node pool up and then run a kube-aws update with the new lower/min setting.
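(For context, this is roughly the shape of the node pool sizing we set in cluster.yaml; the pool name and sizes here are placeholders rather than our actual values, and the exact field layout may vary between kube-aws versions.)

```yaml
# Hypothetical cluster.yaml node-pool excerpt; name and sizes are placeholders.
worker:
  nodePools:
    - name: nodepool1
      autoScalingGroup:
        minSize: 1
        maxSize: 5
```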
Looking at the actual AS git code/repo:
ref: Cluster Autoscaler on AWS
autoscaler/cluster-autoscaler/simulator/cluster.go#L42,L43,L44
On that note, I received a private message / response to my post in the kube-aws IRC channel stating
Regarding kube-aws deployed clusters, it's not clear how the above fixes and suggestions should be applied.
Thanks