Node-pool doesn't scale down to 0 on GKE #2377

Closed
aaaaahaaaaa opened this issue Sep 24, 2019 · 8 comments

@aaaaahaaaaa

aaaaahaaaaa commented Sep 24, 2019

I can't seem to configure my k8s cluster on GKE in such a way that any of my non-default node pools properly scales down to 0. The kube-system pods seem to be the problem, but the documentation covering this specific use case doesn't help, and as far as I can tell several people are in the same situation (e.g. kubernetes/kubernetes#69696).

The PDBs mentioned here can only be applied to heapster, kube-dns and metrics-server. PDBs don't work on pods like fluentd, kube-proxy and prometheus-to-sd, I imagine because they are managed by DaemonSets?
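For reference, the PDB shape the doc suggests looks roughly like this (a sketch; policy/v1beta1 matches this cluster version, and the k8s-app selector label is the standard kube-system one, so verify it against your cluster):

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: kube-dns-pdb
  namespace: kube-system
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      k8s-app: kube-dns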

Warning NoControllers 95s (x48 over 36m) controllermanager found no controllers for pod "kube-proxy-gke-xxx"

K8s Rev: v1.13.7-gke.8

❯ kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml
apiVersion: v1
data:
  status: |+
    Cluster-autoscaler status at 2019-09-24 10:12:30.638916052 +0000 UTC:
    Cluster-wide:
      Health:      Healthy (ready=2 unready=0 notStarted=0 longNotStarted=0 registered=2 longUnregistered=0)
                   LastProbeTime:      2019-09-24 10:12:30.31663754 +0000 UTC m=+148830.819145764
                   LastTransitionTime: 2019-09-22 16:52:28.254187732 +0000 UTC m=+28.756695947
      ScaleUp:     NoActivity (ready=2 registered=2)
                   LastProbeTime:      2019-09-24 10:12:30.31663754 +0000 UTC m=+148830.819145764
                   LastTransitionTime: 2019-09-24 09:21:58.063983054 +0000 UTC m=+145798.566491267
      ScaleDown:   NoCandidates (candidates=0)
                   LastProbeTime:      2019-09-24 10:12:30.31663754 +0000 UTC m=+148830.819145764
                   LastTransitionTime: 2019-09-24 08:59:00.15511532 +0000 UTC m=+144420.657623532

    NodeGroups:
      Name:        https://content.googleapis.com/compute/v1/projects/XXX/zones/europe-west1-b/instanceGroups/gke-XXX-default-pool-ed23a39e-grp
      Health:      Healthy (ready=1 unready=0 notStarted=0 longNotStarted=0 registered=1 longUnregistered=0 cloudProviderTarget=1 (minSize=1, maxSize=3))
                   LastProbeTime:      2019-09-24 10:12:30.31663754 +0000 UTC m=+148830.819145764
                   LastTransitionTime: 2019-09-22 16:52:28.254187732 +0000 UTC m=+28.756695947
      ScaleUp:     NoActivity (ready=1 cloudProviderTarget=1)
                   LastProbeTime:      2019-09-24 10:12:30.31663754 +0000 UTC m=+148830.819145764
                   LastTransitionTime: 2019-09-22 16:52:28.254187732 +0000 UTC m=+28.756695947
      ScaleDown:   NoCandidates (candidates=0)
                   LastProbeTime:      2019-09-24 10:12:30.31663754 +0000 UTC m=+148830.819145764
                   LastTransitionTime: 2019-09-22 16:52:28.254187732 +0000 UTC m=+28.756695947

      Name:        https://content.googleapis.com/compute/v1/projects/XXX/zones/europe-west1-b/instanceGroups/gke-XXX-processing-b864ae5d-grp
      Health:      Healthy (ready=1 unready=0 notStarted=0 longNotStarted=0 registered=1 longUnregistered=0 cloudProviderTarget=1 (minSize=0, maxSize=1))
                   LastProbeTime:      2019-09-24 10:12:30.31663754 +0000 UTC m=+148830.819145764
                   LastTransitionTime: 0001-01-01 00:00:00 +0000 UTC
      ScaleUp:     NoActivity (ready=1 cloudProviderTarget=1)
                   LastProbeTime:      2019-09-24 10:12:30.31663754 +0000 UTC m=+148830.819145764
                   LastTransitionTime: 2019-09-24 09:21:58.063983054 +0000 UTC m=+145798.566491267
      ScaleDown:   NoCandidates (candidates=0)
                   LastProbeTime:      2019-09-24 10:12:30.31663754 +0000 UTC m=+148830.819145764
                   LastTransitionTime: 2019-09-24 09:21:04.100256228 +0000 UTC m=+145744.602764440

kind: ConfigMap
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/last-updated: 2019-09-24 10:12:30.638916052 +0000
      UTC
  creationTimestamp: "2019-09-20T15:55:43Z"
  name: cluster-autoscaler-status
  namespace: kube-system
  resourceVersion: "1244195"
  selfLink: /api/v1/namespaces/kube-system/configmaps/cluster-autoscaler-status
  uid: 19ef15cd-dbbf-11e9-b868-42010a840216

Node is only running kube-system pods:

kube-system fluentd-gcp-v3.2.0-ldhpk                        2/2  Running  ...
kube-system heapster-v1.6.1-8b4b64566-4krcg                 3/3  Running  ...
kube-system kube-dns-6987857fdb-29h2g                       4/4  Running  ...
kube-system kube-proxy-gke-pricing-processing-b864ae5d-ljgb 1/1  Running  ...
kube-system metrics-server-v0.3.1-57c75779f-c9nhk           2/2  Running  ...
kube-system prometheus-to-sd-8rt98                          1/1  Running  ...
@losipiuk
Contributor

The culprit is probably kube-dns. Please look for "cannot be removed" in the logs; it should shed some light on why the node was not eligible for scale-down.

@aaaaahaaaaa
Author

aaaaahaaaaa commented Sep 24, 2019

@losipiuk The logs from the CA? I believe they aren't accessible on GKE.

@losipiuk
Contributor

Oh, sorry, I did not notice the GKE part (I thought you were on GCE). It is hard to be sure what exactly the problem is without seeing the cluster logs. If you set PDBs for non-DaemonSet system pods you should be fine (provided there is room for those pods on other nodes). DaemonSet pods do not block node scale-down.

If you:

  • have PDBs set, and
  • have verified that scale-down should be possible (node utilization is below 50% and there is room for the node's pods elsewhere; see the commands below), and
  • it is still not happening,

then contacting GKE support is your best bet IMO.
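For example, a quick way to check the first two points (the node name is a placeholder):

❯ kubectl get pdb -n kube-system
❯ kubectl describe node <node-name> | grep -A 8 "Allocated resources"
❯ kubectl get pods --all-namespaces --field-selector=spec.nodeName=<node-name>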

@aaaaahaaaaa
Author

Alright, I guess GKE support it is then. Thanks for the help.

@xhanin

xhanin commented Apr 23, 2020

@aaaaahaaaaa Did you get a chance to sort this out? I have the same problem: an autoscaling node pool that doesn't scale down to 0 even though the scale-down conditions should be met.

@MaciekPytel
Contributor

@xhanin There may be any number of reasons for this, but system pods or pods using local storage are the most common ones (other reasons are listed in https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-types-of-pods-can-prevent-ca-from-removing-a-node).

One option to consider is to put a taint on the node pool that you want to be able to scale to 0. That way system pods will not be able to run on those nodes, so they won't block scale-down. The downside is that you'll need to add a toleration to all the pods that you want to run on this node pool (this can be automated with a mutating admission webhook). This is a very useful pattern if you have a node pool with particularly expensive nodes.
Alternatively, you can create PDBs for all non-DaemonSet system pods. Note: restarting some system pods can cause various types of disruption to your cluster, which is why CA does not evict them by default (e.g. restarting metrics-server will break all HPAs in your cluster for a few minutes). It's up to you to decide which disruptions you're OK with.
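A sketch of the taint-based option, with hypothetical pool and taint names (the toleration and node selector go into the pod spec of the workloads you want on that pool):

❯ gcloud container node-pools create expensive-pool --cluster=my-cluster \
    --node-taints=dedicated=expensive-pool:NoSchedule \
    --enable-autoscaling --min-nodes=0 --max-nodes=3

# In the workload's pod template:
tolerations:
- key: dedicated
  operator: Equal
  value: expensive-pool
  effect: NoSchedule
nodeSelector:
  cloud.google.com/gke-nodepool: expensive-pool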

CA will log a name of the pod that is blocking scale-down (on GKE logs are not directly accessible, but the same information is exposed via https://cloud.google.com/kubernetes-engine/docs/how-to/cluster-autoscaler-visibility).
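For example, those events can be pulled from Cloud Logging with a filter along these lines (project and cluster names are placeholders; check the linked doc for the exact log name). The noScaleDown entries include the reason a node was not removed:

❯ gcloud logging read 'resource.type="k8s_cluster"
    AND resource.labels.cluster_name="my-cluster"
    AND logName="projects/my-project/logs/container.googleapis.com%2Fcluster-autoscaler-visibility"' \
    --project=my-project --limit=20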

@xhanin

xhanin commented Apr 25, 2020

@MaciekPytel Thank you so much for your help! The page documenting how to get the visibility events is very helpful; it's exactly what I was looking for.

And the pattern of mutating webhook is very interesting.

I'll further investigate in that direction, thank you again!

@superarvind

It's been observed that, even with taints defined, system workloads still try to run on the custom node pool. Has anyone come across a similar case on GKE?
