# Cluster Autoscaler - min at 0
### Design Document for Google Cloud Platform
##### Author: mwielgus

# Introduction

One of the common requests for Cluster Autoscaler (for example: [1], [2]) is the ability to scale some node groups to zero. This would definitely be a very useful feature, but the implementation is problematic in scale-up for a couple of reasons:

* [P1] There is no live example of what a new node would look like if the currently zero-sized node group were expanded. The node shape is defined by:
    * [P1A] Node capacity
    * [P1B] Node allocatable
    * [P1C] Node labels
* [P2] There is no live example of what manifest-run pods would be present on the new node.
* [P3] There is no live example of what DaemonSets would be run on the new node.

In general, the above can be summarized as follows: the full definition of a new node needs to be known before the node is actually created, in order to decide whether creating a new node from a particular node group makes sense. Scale-down has no issues with min@0.

# Design

Problems P1, P1A, P1B, P1C, P2, and P3 need to be solved. The primary focus is a solution for GCE/GKE, but the proposed option should be generic enough to allow expanding this feature to other cloud providers if that is needed and business-justified.

On GCP, each node group in Cluster Autoscaler is backed by a Managed Instance Group (MIG). A MIG has an instance template, and the template contains a couple of important pieces of information:

```yaml
creationTimestamp: '2017-02-24T06:16:21.761-08:00'
description: ''
id: '7272458143615811290'
kind: compute#instanceTemplate
name: kubernetes-minion-template
properties:
  canIpForward: true
  disks:
  - autoDelete: true
    boot: true
    initializeParams:
      diskSizeGb: '100'
      diskType: pd-standard
      sourceImage: https://www.googleapis.com/compute/v1/projects/google-containers/global/images/container-vm-v20170214
    kind: compute#attachedDisk
    mode: READ_WRITE
    type: PERSISTENT
  machineType: n1-standard-2
  metadata:
    fingerprint: -1QtDcJUvH0=
    items:
    - key: cluster-name
      value: |
        kubernetes
    - key: kube-env
      value: |
        ENABLE_NODE_PROBLEM_DETECTOR: 'daemonset'
        ENABLE_RESCHEDULER: 'true'
        LOGGING_DESTINATION: 'gcp'
        ELASTICSEARCH_LOGGING_REPLICAS: '1'
        ENABLE_CLUSTER_DNS: 'true'
        ENABLE_CLUSTER_REGISTRY: 'false'
        CLUSTER_REGISTRY_DISK: 'kubernetes-kube-system-kube-registry'
        CLUSTER_REGISTRY_DISK_SIZE: '200GB'
        DNS_SERVER_IP: '10.0.0.10'
        DNS_DOMAIN: 'cluster.local'
        ENABLE_DNS_HORIZONTAL_AUTOSCALER: 'true'
        [...]
```

The machine type and kube-env are the two fields that are most useful for estimating the node shape.

### [P1A] - Node capacity

To estimate the node capacity we need to decode n1-standard-2, or whatever other string appears in machineType. The format of these strings is well documented here: https://cloud.google.com/compute/docs/machine-types

However, except for the earliest prototype, we probably don't want to manually hardcode all of the values listed there. Luckily, GCP provides an API to get the machine description in terms of memory and CPU:

```json
{
  "kind": "compute#machineTypeList",
  "id": "projects/mwielgus-proj/zones/us-central1-b/machineTypes",
  "items": [
    {
      "kind": "compute#machineType",
      "id": "2000",
      "creationTimestamp": "2015-01-16T09:25:43.315-08:00",
      "name": "g1-small",
      "description": "1 vCPU (shared physical core) and 1.7 GB RAM",
      "guestCpus": 1,
      "memoryMb": 1740,
      "maximumPersistentDisks": 16,
      "maximumPersistentDisksSizeGb": "3072",
      "zone": "us-central1-b",
      "selfLink": "https://www.googleapis.com/compute/v1/projects/mwielgus-proj/zones/us-central1-b/machineTypes/g1-small",
      "isSharedCpu": true
    },
    {
      "kind": "compute#machineType",
      "id": "4016",
      "creationTimestamp": "2015-01-16T09:25:43.316-08:00",
      "name": "n1-highcpu-16",
      "description": "16 vCPUs, 14.4 GB RAM",
      "guestCpus": 16,
      "memoryMb": 14746,
      "maximumPersistentDisks": 128,
      "maximumPersistentDisksSizeGb": "65536",
      "zone": "us-central1-b",
      "selfLink": "https://www.googleapis.com/compute/v1/projects/mwielgus-proj/zones/us-central1-b/machineTypes/n1-highcpu-16",
      "isSharedCpu": false
    },
```

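
A node-shape estimator only needs a few of the fields in this response. As a minimal sketch (type and function names here are hypothetical, not the actual CA code), the list can be decoded into a name-to-shape lookup:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// machineType captures the fields of a compute#machineType entry that
// matter for capacity estimation.
type machineType struct {
	Name      string `json:"name"`
	GuestCpus int64  `json:"guestCpus"`
	MemoryMb  int64  `json:"memoryMb"`
}

type machineTypeList struct {
	Items []machineType `json:"items"`
}

// buildLookup turns a machineTypes list response into a name -> shape map.
func buildLookup(data []byte) (map[string]machineType, error) {
	var list machineTypeList
	if err := json.Unmarshal(data, &list); err != nil {
		return nil, err
	}
	byName := make(map[string]machineType)
	for _, mt := range list.Items {
		byName[mt.Name] = mt
	}
	return byName, nil
}

func main() {
	// A trimmed-down version of the response shown above.
	data := []byte(`{"items":[
		{"name":"g1-small","guestCpus":1,"memoryMb":1740},
		{"name":"n1-highcpu-16","guestCpus":16,"memoryMb":14746}]}`)
	byName, err := buildLookup(data)
	if err != nil {
		panic(err)
	}
	fmt.Println(byName["n1-highcpu-16"].GuestCpus, byName["n1-highcpu-16"].MemoryMb)
	// → 16 14746
}
```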
So all standard machine type sizes can be obtained via the API. GCP also allows custom machine types, whose names are formatted as:

```
custom-<cpu_count>-<memory_in_mb>
```

So it is also quite easy to derive all of the capacity information from the name.

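
For illustration, decoding that name format can be sketched as follows (the function name is hypothetical, not the actual CA implementation):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCustomMachineType extracts the CPU count and memory (in MB) from a
// custom machine type name such as "custom-4-8192".
func parseCustomMachineType(name string) (cpus int64, memMB int64, err error) {
	parts := strings.Split(name, "-")
	if len(parts) != 3 || parts[0] != "custom" {
		return 0, 0, fmt.Errorf("unexpected machine type name: %s", name)
	}
	cpus, err = strconv.ParseInt(parts[1], 10, 64)
	if err != nil {
		return 0, 0, err
	}
	memMB, err = strconv.ParseInt(parts[2], 10, 64)
	if err != nil {
		return 0, 0, err
	}
	return cpus, memMB, nil
}

func main() {
	cpus, mem, err := parseCustomMachineType("custom-4-8192")
	fmt.Println(cpus, mem, err)
	// → 4 8192 <nil>
}
```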
### [P1B] - Node allocatable

In GKE 1.5.6, allocatable for new nodes is equal to capacity. For simplicity we could assume that a new node will have 90% of its capacity allocatable (or capacity minus 0.1 CPU and 200 MB). Being wrong or underestimating here is not fatal; most users will probably be OK with it. Once some nodes are present we will have more precise estimates. The worst thing that can happen is that a scale-up may not be triggered if the request is exactly at node capacity minus system pods.

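
One possible version of that heuristic, shown only as a sketch (this combines the two variants above by reserving whichever amount is larger; the names and the exact policy are assumptions, not the actual CA code):

```go
package main

import "fmt"

// resources holds a node shape as milli-CPU and memory in MB.
type resources struct {
	MilliCPU int64
	MemoryMB int64
}

// estimateAllocatable reserves either 10% of capacity or a fixed
// 0.1 CPU (100m) / 200 MB, whichever is larger, and returns the rest
// as the estimated allocatable.
func estimateAllocatable(capacity resources) resources {
	cpuReserved := max64(capacity.MilliCPU/10, 100)
	memReserved := max64(capacity.MemoryMB/10, 200)
	return resources{
		MilliCPU: capacity.MilliCPU - cpuReserved,
		MemoryMB: capacity.MemoryMB - memReserved,
	}
}

func max64(a, b int64) int64 {
	if a > b {
		return a
	}
	return b
}

func main() {
	// n1-standard-2: 2 vCPUs (2000m) and 7.5 GB (7680 MB).
	fmt.Println(estimateAllocatable(resources{MilliCPU: 2000, MemoryMB: 7680}))
	// → {1800 6912}
}
```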
### [P1C] - Node labels

The basic set of labels in GKE is relatively simple:

```
labels:
  beta.kubernetes.io/arch: amd64
  beta.kubernetes.io/instance-type: n1-standard-1
  beta.kubernetes.io/os: linux
  cloud.google.com/gke-nodepool: default-pool
  failure-domain.beta.kubernetes.io/region: us-central1
  failure-domain.beta.kubernetes.io/zone: us-central1-a
  kubernetes.io/hostname: gke-cluster-1-default-pool-408619fb-0zkd
name: gke-cluster-1-default-pool-408619fb-0zkd
```

All of them can be "guessed" from the MIG description and template. Some of the labels are conditional, for example:

```
cloud.google.com/gke-local-ssd: "true"
cloud.google.com/gke-preemptible: "true"
```

There is also an option to specify custom labels for the node pool:

```
a: b
c: d
```

Luckily, all of these labels are mentioned in the NODE_LABELS variable in the kube-env metadata of the template:

```
NODE_LABELS: a=b,c=d,cloud.google.com/gke-nodepool=pool-3,cloud.google.com/gke-preemptible=true

NODE_LABELS: cloud.google.com/gke-local-ssd=true,cloud.google.com/gke-nodepool=pool-1
```

The bottom line is that all of the labels can be easily obtained.
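
Extracting them from a NODE_LABELS value is a simple split, sketched here (the function name is hypothetical):

```go
package main

import (
	"fmt"
	"strings"
)

// parseNodeLabels splits a NODE_LABELS value from kube-env
// ("a=b,c=d,...") into a label map, as a node-shape estimator would.
func parseNodeLabels(nodeLabels string) map[string]string {
	labels := map[string]string{}
	for _, kv := range strings.Split(nodeLabels, ",") {
		parts := strings.SplitN(kv, "=", 2)
		if len(parts) == 2 {
			labels[parts[0]] = parts[1]
		}
	}
	return labels
}

func main() {
	labels := parseNodeLabels("a=b,cloud.google.com/gke-nodepool=pool-3,cloud.google.com/gke-preemptible=true")
	fmt.Println(labels["cloud.google.com/gke-nodepool"], labels["cloud.google.com/gke-preemptible"])
	// → pool-3 true
}
```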

### [P2] - There is no live example of what manifest-run pods would be present on the new node.

In GKE (since 1.6) we run one type of pod on every node by default: kube-proxy, which requests only CPU. Unfortunately, the amount of CPU is not well-defined and is hidden inside the startup script: https://github.com/kubernetes/kubernetes/blob/6bf9f2f0bbf25c550e9dd93bfa0a3cda4feec954/cluster/gce/gci/configure-helper.sh#L797

The amount is fixed at 100m and is unlikely to change, so it can probably be hardcoded in CA as well.

### [P3] - There is no live example of what DaemonSets would be run on the new node.

All DaemonSets can be listed from the API server and checked against the node built from the labels, capacity, allocatable, and manifest-run pods obtained in the previous steps. The CA codebase already has the scheduler predicates imported, so checking which pods would run on the node will be relatively easy.

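
The real check goes through the full scheduler predicates, but the simplest piece of it, nodeSelector matching against the estimated labels, can be sketched as:

```go
package main

import "fmt"

// selectorMatches reports whether every key/value pair in a DaemonSet's
// nodeSelector is present on the (estimated) node labels. This is only
// the nodeSelector part of the check; the scheduler predicates also
// cover resources, taints, etc.
func selectorMatches(nodeSelector, nodeLabels map[string]string) bool {
	for k, v := range nodeSelector {
		if nodeLabels[k] != v {
			return false
		}
	}
	return true
}

func main() {
	labels := map[string]string{
		"beta.kubernetes.io/os":         "linux",
		"cloud.google.com/gke-nodepool": "default-pool",
	}
	fmt.Println(selectorMatches(map[string]string{"beta.kubernetes.io/os": "linux"}, labels))
	fmt.Println(selectorMatches(map[string]string{"cloud.google.com/gke-preemptible": "true"}, labels))
	// → true, then false
}
```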
# Solution

Given all the information above, it should be relatively simple to write a module that estimates the node shape, given access to the GCP API and the Kubernetes API server. We will expand the NodeGroup interface (https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/cloud_provider.go#L40) with a method EstimateNodeShape, taking no parameters and returning a NodeInfo (containing the api.Node and all pods run by default on the node), or an error if it is unable to do so.

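
A rough sketch of the extension, with stand-in types (the real method would return the scheduler's NodeInfo and an api.Node; everything below is illustrative, not the actual interface):

```go
package main

import "fmt"

// NodeInfo is a stand-in for the scheduler's NodeInfo: an estimated node
// plus the pods that would run on it by default.
type NodeInfo struct {
	Labels      map[string]string
	MilliCPU    int64
	MemoryMB    int64
	DefaultPods []string
}

// NodeGroup mirrors, in sketch form, the cloudprovider.NodeGroup
// interface extended with the proposed method.
type NodeGroup interface {
	// EstimateNodeShape returns the shape of a node that would be created
	// by scaling this group up from zero, or an error if it cannot.
	EstimateNodeShape() (*NodeInfo, error)
}

// gceNodeGroup is a hypothetical GCE implementation; the real one would
// read the MIG instance template, but here it returns canned data.
type gceNodeGroup struct{}

func (g *gceNodeGroup) EstimateNodeShape() (*NodeInfo, error) {
	return &NodeInfo{
		Labels:      map[string]string{"cloud.google.com/gke-nodepool": "default-pool"},
		MilliCPU:    2000,
		MemoryMB:    7680,
		DefaultPods: []string{"kube-proxy"},
	}, nil
}

func main() {
	var ng NodeGroup = &gceNodeGroup{}
	info, err := ng.EstimateNodeShape()
	fmt.Println(info.MilliCPU, info.DefaultPods[0], err)
	// → 2000 kube-proxy <nil>
}
```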
On GCP the method will fetch the MIG template and perform the analysis needed to estimate the node shape.

The method will be called here: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/core/utils.go#L132

but only if the current size of the node group is 0 or all of its nodes are unready/broken. Otherwise CA will estimate the shape of the node using live examples, to avoid repeating any mis-estimation errors.

EstimateNodeShape will also be run on CA startup, to ensure that CA is able to build an example node for the node pool in case the node group's minimum size is set to 0.