This repository has been archived by the owner on Sep 30, 2020. It is now read-only.

Re: cluster-autoscaler support #629

Merged

mumoshu merged 1 commit into master from re-cluster-autoscaler on May 22, 2017

Conversation

mumoshu
Contributor

@mumoshu mumoshu commented May 9, 2017

In #151, we introduced the initial, incomplete, almost theoretical support for cluster-autoscaler. The situation has changed considerably since then - now we can finally complete the cluster-autoscaler support.

Features

  • Automatically scale out a node group by adding node(s) when one or more pods in the group become unschedulable due to insufficient resources.
  • Automatically scale in a node group by removing node(s) that are safe to remove. Basically, a node is considered safe to remove when it is not running a critical k8s component.

kube-aws scope changes

  • CA is now deployed automatically
    • More concretely, a k8s deployment for cluster-autoscaler is automatically created on a k8s cluster when the cluster-autoscaler addon is enabled in cluster.yaml (an illustrative sketch of such a deployment follows below)
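
For context, the rendered manifest would be roughly along these lines. This is only an illustrative sketch: the image tag, the auto-discovery tag key, and the scheduling constraints are assumptions, not the exact output of cloud-config-controller.

```yaml
# Illustrative sketch only - the actual manifest is rendered from cloud-config-controller.
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      containers:
      - name: cluster-autoscaler
        image: quay.io/kube-aws/cluster-autoscaler:<tag>
        command:
        - ./cluster-autoscaler
        - --cloud-provider=aws
        # Discovers target node pools via ASG tags instead of hard-coded min/max sizes per group
        - --node-group-auto-discovery=asg:tag=<autoscaling-target-tag-key>
        env:
        - name: AWS_REGION
          value: <aws-region>
```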

Configuration changes

A valid cluster.yaml for a CA-enabled cluster would now look like:

```yaml
# The cluster-autoscaler addon is deployed and enabled to watch node groups so that
# it can scale in/out node group(s) that are registered as autoscaling targets.
addons:
  clusterAutoscaler:
    enabled: true
worker:
  nodePools:
  - name: scaled
    # Make this node pool an autoscaling target
    autoscaling:
      clusterAutoscaler:
        enabled: true
  - name: notScaled
    # This node pool is not an autoscaling target
```
  • The former `experimental.clusterAutoscalerSupport.enabled` is dropped for controller nodes in favor of `addons.clusterAutoscaler.enabled`
  • `worker.nodePools[].clusterAutoscaler.minSize` and `maxSize` are dropped in favor of `worker.nodePools[].autoscaling.clusterAutoscaler.enabled` and the auto discovery feature of cluster-autoscaler
  • `worker.nodePools[].clusterAutoscalerSupport` is kept as-is, but it no longer needs to be `true`: when `addons.clusterAutoscaler` is enabled, kube-aws by default gives the required IAM permissions only to controller nodes, and CA is scheduled there.
  • This work currently relies on a docker image built from a fork of cluster-autoscaler which supports the automatic node group discovery feature

cloud-config-controller changes

  • Add the cluster-autoscaler deployment, which is created when the CA addon is enabled

Go changes

  • The former `ClusterAutoscalerImage` is renamed to `ClusterProportionalAutoscalerImage`
  • Introduce `ClusterAutoscalerImage` (`clusterAutoscalerImage` in cluster.yaml) for the cluster-autoscaler docker image reference
  • `ClusterAutoscalerSupport` is a no-op on controller nodes and is used only for worker nodes. `Addons.ClusterAutoscaler` is used instead to give controller nodes appropriate IAM permissions and to deploy CA to them.
  • Most CA-related types and funcs are moved from `core/controlplane/config` to the `model` package

IAM changes

  • `autoscaling:DescribeTags` is now allowed in IAM so that the automatic node group discovery feature of cluster-autoscaler can be enabled (see the sketch below)
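
For reference, the read-only statement enabling auto discovery on the controller IAM role would look roughly like the following. This is a sketch; the exact statement kube-aws renders may differ.

```json
{
  "Effect": "Allow",
  "Action": [
    "autoscaling:DescribeAutoScalingGroups",
    "autoscaling:DescribeAutoScalingInstances",
    "autoscaling:DescribeTags"
  ],
  "Resource": "*"
}
```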

Gotchas

Note that a node running cluster-autoscaler or kube-resources-autosave cannot be scaled in:

```
09 05:29:21.523426       1 cluster.go:74] Fast evaluation: ip-10-0-0-68.ap-northeast-1.compute.internal for removal
I0509 05:29:21.523467       1 cluster.go:88] Fast evaluation: node ip-10-0-0-68.ap-northeast-1.compute.internal cannot be removed: non-deamons set, non-mirrored, kube-system pod present: cluster-autoscaler-998591511-thcpj
I0509 05:29:21.523479       1 cluster.go:74] Fast evaluation: ip-10-0-0-133.ap-northeast-1.compute.internal for removal
I0509 05:29:21.523488       1 cluster.go:103] Fast evaluation: node ip-10-0-0-133.ap-northeast-1.compute.internal may be removed
I0509 05:29:21.523493       1 cluster.go:74] Fast evaluation: ip-10-0-0-150.ap-northeast-1.compute.internal for removal
I0509 05:29:21.523530       1 cluster.go:88] Fast evaluation: node ip-10-0-0-150.ap-northeast-1.compute.internal cannot be removed: non-deamons set, non-mirrored, kube-system pod present: kube-resources-autosave-2845171460-1cbs5
```

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 9, 2017
@mumoshu
Contributor Author

mumoshu commented May 9, 2017

I'm trying to complete the cluster-autoscaler support according to our roadmap for v0.9.7: https://github.com/kubernetes-incubator/kube-aws/blob/master/ROADMAP.md#v097

@codecov-io

codecov-io commented May 9, 2017

Codecov Report

Merging #629 into master will decrease coverage by 0.04%.
The diff coverage is 62.96%.

Impacted file tree graph

```diff
@@            Coverage Diff            @@
##           master    #629      +/-   ##
=========================================
- Coverage   37.14%   37.1%   -0.05%     
=========================================
  Files          51      52       +1     
  Lines        3201    3210       +9     
=========================================
+ Hits         1189    1191       +2     
- Misses       1836    1842       +6     
- Partials      176     177       +1
```

| Impacted Files | Coverage Δ | |
|---|---|---|
| model/controller.go | 0% <0%> (ø) | ⬆️ |
| model/node_pool_config.go | 20.28% <0%> (ø) | ⬆️ |
| model/node_labels.go | 0% <0%> (ø) | |
| core/controlplane/config/config.go | 57.09% <94.44%> (+0.66%) | ⬆️ |

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 910fbd0...166c34b. Read the comment docs.

@mumoshu
Contributor Author

mumoshu commented May 9, 2017

This work depends on kubernetes/autoscaler#11

e2e/run Outdated
```
@@ -156,13 +163,17 @@ customize_cluster_yaml() {
 worker:
   nodePools:
   - name: asg1
     clusterAutoscalerSupport:
       enabled: true
```
Contributor Author

@mumoshu mumoshu May 9, 2017


`clusterAutoscalerSupport` is meant to provide enough permissions to call the AWS APIs required to host cluster-autoscaler. Then we should probably provide a nodeSelector for cluster-autoscaler so that we can ensure CA is scheduled onto nodes with sufficient permissions?

Edit: And node labels corresponding to `clusterAutoscalerSupport.enabled`, to match the nodeSelector?
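
For illustration, the idea would boil down to something like this in the cluster-autoscaler pod spec. The label key here is hypothetical, not one kube-aws defines in this PR:

```yaml
spec:
  # Hypothetical label key - whatever label kube-aws would attach to nodes
  # whose clusterAutoscalerSupport is enabled.
  nodeSelector:
    kube-aws.coreos.com/cluster-autoscaler-supported: "true"
```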

Contributor Author

@mumoshu mumoshu May 9, 2017


We should add a validation which emits an error when there is no worker node pool or controller node with `clusterAutoscalerSupport` enabled.
Otherwise cluster-autoscaler may be unable to work or be scheduled due to missing IAM permissions.

Contributor


Can't we run CA on controller nodes? They already have elevated permissions; adding more won't make it much worse.

Contributor Author


@redbaron Yes, we can. kube-aws as of today already supports `controller.clusterAutoscalerSupport.enabled` to provide appropriate IAM permissions to controller nodes. It would be a matter of just adding appropriate node labels and node selectors to stick cluster-autoscaler to whatever nodes (worker or controller) have clusterAutoscalerSupport enabled.

```
@@ -145,6 +145,13 @@
],
"MinSize": "{{.MinCount}}",
"Tags": [
{{if gt .ClusterAutoscaler.MaxSize 0}}
```
Contributor Author


Should be `if .ClusterAutoscaler.Enabled`
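
That is, something along these lines. A sketch only; the tag entry shown is a placeholder rather than the actual key kube-aws emits:

```
"Tags": [
  {{if .ClusterAutoscaler.Enabled}}
  {
    "Key": "<auto-discovery-tag-key>",
    "Value": "true",
    "PropagateAtLaunch": "false"
  },
  {{end}}
```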

```yaml
- name: AWS_REGION
  value: {{.Region}}
volumeMounts:
- name: ssl-certs
```
Contributor Author


This volumeMount and the corresponding volume will be unnecessary once kubernetes/autoscaler#48 is merged into CA

Contributor Author


The PR is merged

@mumoshu
Contributor Author

mumoshu commented May 12, 2017

Since I have no control over when a docker image for cluster-autoscaler containing the improvements this PR depends on will be released, I have opened https://github.com/kube-aws/autoscaler and https://quay.io/repository/kube-aws/cluster-autoscaler for hosting our own.

Update: This is how our docker image is built and released.

```
TAG=$(git rev-parse HEAD) REGISTRY=quay.io/kube-aws make release
```

And images can be found at https://quay.io/repository/kube-aws/cluster-autoscaler?tab=tags

@mumoshu
Contributor Author

mumoshu commented May 12, 2017

kubernetes/autoscaler#11 is now merged but a docker image containing it isn't released yet.

Update: #629 (comment)

@mumoshu mumoshu added this to the backlog milestone May 12, 2017
@mumoshu mumoshu mentioned this pull request May 12, 2017
2 tasks
@mumoshu mumoshu modified the milestones: wip, backlog, tbd, v0.9.7 May 12, 2017
@mumoshu mumoshu added this to To be implemented in v0.9.7 May 12, 2017
@mumoshu mumoshu force-pushed the re-cluster-autoscaler branch 3 times, most recently from 27e9c6a to f112823 Compare May 18, 2017 08:13
```diff
@@ -203,12 +203,13 @@
 "Resource": [ "*" ]
 },
 {{end}}
-{{if .Experimental.ClusterAutoscalerSupport.Enabled}}
+{{if .Addons.ClusterAutoscaler.Enabled}}
```
Contributor

@redbaron redbaron May 19, 2017


```json
{
  "Effect": "Allow",
  "Action": [
    "autoscaling:DescribeAutoScalingGroups",
    "autoscaling:DescribeAutoScalingInstances",
    "autoscaling:DescribeTags"
  ],
  "Resource": "*"
},
{
  "Action": [
    "autoscaling:SetDesiredCapacity",
    "autoscaling:TerminateInstanceInAutoScalingGroup"
  ],
  "Condition": {
    "Null": { "autoscaling:ResourceTag/kubernetes.io/cluster/{{.ClusterName}}": "false" }
  },
  "Resource": "*",
  "Effect": "Allow"
},
```

Contributor Author


What is the Null condition? Could you point me to a doc about it?

Contributor


Contributor Author


Thanks! So it permits an action only when the targeted resource has the tag? Great - I will incorporate it.

Contributor Author


Done for controller nodes.
I'm still unsure whether we'd want to support the customization of scheduling CA onto worker nodes: #629 (comment)

```
@@ -361,6 +368,7 @@
"Action": [
"autoscaling:DescribeAutoScalingGroups",
```
Contributor


I still don't understand: what is the benefit of running cluster-autoscaler on worker nodes?

Contributor Author


@redbaron Thanks, good question 👍
The bigger a k8s cluster is, the more resources CA uses.
So I guess one might want a dedicated node pool with a single worker node, which can be recreated more easily than controller nodes, to run CA, possibly scaled up/down by a vertical-pod-autoscaler.
That's why I tried to support running CA on worker nodes, even though it isn't the default setup.
What do you think? 😃

Contributor Author


For a usual use case, running CA on a controller node is recommended, and that's the default setup.

Contributor


On larger clusters you'd have large controller nodes too :) Anyway, I see the point, and I guess adding this support isn't much work.

Contributor Author


@redbaron

> On larger clusters you'd have large controller nodes too :)

Yea, that's certainly true!
Do you think we can just recommend that users make controllers large enough, instead of complicating cluster.yaml like this?

```go
}
sort.Strings(keys)
for _, k := range keys {
	v := l[k]
```
Contributor


```go
v := l[k]
if len(v) > 0 {
	labels = append(labels, fmt.Sprintf("%s=%s", k, v))
} else {
	labels = append(labels, fmt.Sprintf("%s", k))
}
```

Contributor Author


@redbaron Thanks!
Is a node label spec like `mykey=` invalid?

Contributor


No, it isn't, but seeing args on a command line with an empty `=` is a little confusing, at least for me. The end result is the same.

Contributor Author


Thanks for the confirmation. Ok, I will make the suggested change 👍

Contributor Author


Done
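
For reference, a self-contained sketch of the behaviour settled on above. The function name and the example label keys are illustrative, not the exact kube-aws code:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// joinNodeLabels renders node labels as a comma-separated key=value list,
// omitting the "=" when a label has an empty value, as suggested above.
func joinNodeLabels(l map[string]string) string {
	keys := make([]string, 0, len(l))
	for k := range l {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	labels := make([]string, 0, len(keys))
	for _, k := range keys {
		if v := l[k]; len(v) > 0 {
			labels = append(labels, fmt.Sprintf("%s=%s", k, v))
		} else {
			labels = append(labels, k)
		}
	}
	return strings.Join(labels, ",")
}

func main() {
	// Prints: kube-aws.coreos.com/autoscaler,role=worker
	fmt.Println(joinNodeLabels(map[string]string{
		"role":                           "worker",
		"kube-aws.coreos.com/autoscaler": "",
	}))
}
```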

@mumoshu
Contributor Author

mumoshu commented May 22, 2017

I've updated the PR description to reflect the current state of the work.

@mumoshu mumoshu modified the milestones: v0.9.7-rc.1, v0.9.7-rc.<tbd> May 22, 2017
@mumoshu
Contributor Author

mumoshu commented May 22, 2017

Added a cluster-autoscaling section in the doc

@mumoshu
Contributor Author

mumoshu commented May 22, 2017

Testing E2E after merging with the master branch locally

@mumoshu mumoshu changed the title WIP: Re: cluster-autoscaler support Re: cluster-autoscaler support May 22, 2017
@mumoshu
Contributor Author

mumoshu commented May 22, 2017

Whoa, all the tests have passed.

@mumoshu mumoshu merged commit 0504707 into kubernetes-retired:master May 22, 2017
@mumoshu mumoshu deleted the re-cluster-autoscaler branch May 22, 2017 07:59
@mumoshu mumoshu mentioned this pull request May 24, 2017
@cknowles
Contributor

cluster.yaml defaults could do with a bit of an update here; they still list config which is not read any more.

@Vince-Cercury

Sorry to comment on a closed ticket (happy to open a new one if required).

What happens if the condition is missing for the IAM role:

"Condition": { "Null": { "autoscaling:ResourceTag/kubernetes.io/cluster/{{.ClusterName}}": "false" } },

I don't have much control over IAM roles and my roles are shared between all my clusters.
I tried using a wildcard / star but it is not accepted by AWS.
What would you recommend for a case where the role is shared between many clusters?
What happens if I omit the condition altogether? The upstream project does not have it.
Is there a more generic tag I should check instead of being specific to a cluster?

@mumoshu
Contributor Author

mumoshu commented Oct 11, 2017

@Vincemd If it was missing for Describe* operations, it would be OK. I'm afraid you would end up leaking more autoscaling group names when one of your pods is hijacked by attackers, but that wouldn't be a huge problem anyway?

However, if it was missing for TerminateInstanceInAutoScalingGroup, CA could theoretically terminate EC2 instances outside of the K8S cluster when, e.g., CA had bug(s) resulting in doing so. #800 is an example of that.

In your case, my suggestion is to tag EC2 instances with something like "PartOfKubernetes": "true" and use it in the condition of the IAM policy, so that CA simply can't terminate EC2 instances that are not part of K8S cluster(s). I suppose it isn't the best thing to do - the best would be to create an IAM role per cluster - but it's better than nothing.

Does my explanation make sense? 😃
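
For illustration, with the tag suggested above the write-permission statement might look like this (the tag name is the hypothetical one from the suggestion; adjust it to whatever tag you actually apply):

```json
{
  "Action": [
    "autoscaling:SetDesiredCapacity",
    "autoscaling:TerminateInstanceInAutoScalingGroup"
  ],
  "Condition": {
    "Null": { "autoscaling:ResourceTag/PartOfKubernetes": "false" }
  },
  "Resource": "*",
  "Effect": "Allow"
}
```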

@Vince-Cercury

@mumoshu thanks for the clear explanation. I got it.

Good news is that my production Kubernetes cluster will have its own IAM role, so I can be specific and use the conditions properly (can only terminate/setDesired for the prod cluster based on specific tag)

It's only my non prod clusters, which are non public facing and sharing same IAM role at the moment. I will work with a tag, as suggested.

Problem solved!

kylehodgetts pushed a commit to HotelsDotCom/kube-aws that referenced this pull request Mar 27, 2018