
Cluster autoscaling proposal #15304

Closed

Conversation

@mwielgus (Contributor) commented Oct 8, 2015

No description provided.

@mwielgus force-pushed the kac_proposal branch 2 times, most recently from a510d09 to 801580f on October 8, 2015
@derekwaynecarr (Member)
@smarterclayton @kubernetes/rh-cluster-infra

It looks like this plan is predicated on moving Config out of experimental in time for 1.2?

@k8s-bot commented Oct 8, 2015

GCE e2e test build/test passed for commit dadd417803372cf318c3871fab3c4cb7dcaebc02.

@k8s-bot commented Oct 8, 2015

GCE e2e test build/test passed for commit a510d09d3dd90cbe6ccd306cbad77ae4d20e5d0c.

@k8s-bot commented Oct 8, 2015

GCE e2e build/test failed for commit 801580f9cb613c512eb1d26d3b8c8fb64a37498a.

@k8s-github-robot added the size/L label (denotes a PR that changes 100-499 lines, ignoring generated files) on Oct 8, 2015

At this point we will encourage the community to contribute code that talks to a particular cloud provider's node controller.

Kubernetes 1.4-1.5 - big KAC

Review comment (Member): Prepend with ## ?

Reply (mwielgus, Contributor, Author): Done

@mwielgus force-pushed the kac_proposal branch 2 times, most recently from 5cdbffa to 46dd07a on October 8, 2015
@k8s-bot commented Oct 8, 2015

GCE e2e test build/test passed for commit 5cdbffa80f75a27f8fc6f4e202b974395c930924.

@k8s-bot commented Oct 8, 2015

GCE e2e test build/test passed for commit 46dd07afa89899549033350e7ded6963b530089e.


In the current Kubernetes version we have a basic support for cluster autoscaling. It is based
on Google Cloud Autoscaler (https://cloud.google.com/compute/docs/autoscaler/) and works only
in Google Compute Engine. It looks on cpu-usage (what is the current load) and pod-cpu-requests

Review comment (Member):

s/on/at/
I think this sentence could be clearer -- maybe something like "It looks at total CPU usage of all running pods and total CPU request of all running pods."


Reply (mwielgus, Contributor, Author):

Done

when Kubernetes is running on Google Compute Engine. It looks at total cpu usage on each node (including system stuff)
and total cpu request of all running pods. It can also scale on memory, but for simplicity
this document will focus only on the CPU usage aspect (other metrics work in exactly the same way).
The user is expected to provide the target level of utilization (which is now common for all metrics)

Review comment (Member):

Explain how the observed utilization is derived from the metrics you said it looks at, namely "total cpu usage on each node (including system stuff) and total cpu request of all running pods" (in particular I don't see how it uses cpu request of running pods)
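
For illustration only, one plausible derivation (an assumption, not something the proposal text confirms) is: per-node utilization is the chosen metric (usage or request) divided by node capacity, averaged across nodes and compared against the target. A minimal Go sketch, with all types and field names hypothetical:

// Hypothetical types; the proposal does not define these.
type NodeStats struct {
    CPUUsedMilli     int64 // observed CPU usage on the node (includes system daemons)
    CPURequestMilli  int64 // sum of CPU requests of pods bound to the node
    CPUCapacityMilli int64 // node CPU capacity, assumed non-zero
}

// averageUtilization averages metric/capacity across nodes; pass a selector
// for usage-based or request-based utilization. A controller would scale up
// when the result exceeds the target and scale down when it is well below it.
func averageUtilization(nodes []NodeStats, metric func(NodeStats) int64) float64 {
    if len(nodes) == 0 {
        return 0
    }
    sum := 0.0
    for _, n := range nodes {
        sum += float64(metric(n)) / float64(n.CPUCapacityMilli)
    }
    return sum / float64(len(nodes))
}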

@davidopp (Member)

When we discussed this in person, I believe we said KAC would use request only, not utilization. Please make it clear in the doc that some time before 1.4 you will switch to using request. Maybe 1.2 is the place to do it, since that's where you mention the hack for pending pods, and it's easy to incorporate pending pods if you just do it by adding request of pending pods to request of the other pods.
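
A sketch of the request-only calculation described above, with pending pods folded in simply by adding their requests (names hypothetical):

// requestUtilization returns total requested CPU (pods already scheduled plus
// pending pods) divided by total cluster CPU capacity. Comparing this against
// the target naturally accounts for pods that do not fit yet.
func requestUtilization(scheduledRequestMilli, pendingRequestMilli, clusterCapacityMilli int64) float64 {
    if clusterCapacityMilli == 0 {
        return 0
    }
    return float64(scheduledRequestMilli+pendingRequestMilli) / float64(clusterCapacityMilli)
}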

@bgrant0607 (Member)

"Autoscaling" is the wrong way to think about the node provisioning problem. Using the MIG autoscaler is not the long-term direction. Autoscaling based on resource usage or even custom metrics is never going to be sufficient, and nodes aren't fungible once pods are scheduled on them. Additionally, managed-infrastructure-based autoscaling would require different work on every cloud provider, and would require a MIG/autoscaler controller in order to manage heterogeneous clusters, at which point the controller might as well just deal with nodes. We shouldn't invest a lot in it. Furthermore, the existing ClusterAutoscaler API doesn't contain the right information and should be deleted.

We need to think about automatic provisioning and deprovisioning of nodes based on applications' needs. This is also being discussed for storage provisioning. #14537 @thockin

Even if there is 0 actual cpu usage on the nodes, we may need more nodes in order to schedule more pods, since scheduling is based on resource requests and other constraints, not usage. If 1000 pods, or a few large-memory pods, or pods requiring SSD or GPUs, or pods with some other special constraints or requirements are submitted to Kubernetes, we need some component to decide how many nodes of what kind need to be provisioned, based not only on standard resource requirements but also availability/spreading considerations, latency required to create nodes, hardware and software requirements, etc. Additionally, in order to get most of the benefit of provisioning new nodes, we need to rebalance via rescheduling. #12140

Similarly, if we want to be able to deprovision nodes, some component would need to take into consideration all the relevant tradeoffs before gradually draining those nodes so that they can be retired. What if those nodes were running storage servers? Storage can't be drained quickly, and maybe the storage capacity is needed, so the nodes shouldn't be deleted at all.

Once we add preemption and time-deferred scheduling, the problem will be even more complicated.

@alex-mohr suggested starting by provisioning new nodes if pods are pending and we can estimate that creating a new node would allow the pod(s) to schedule. I like that idea.

Deprovisioning is a separate problem.
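
A minimal sketch of that pending-pod-driven idea, under simplifying assumptions (hypothetical types, a single provisionable node shape, and only CPU/memory requests; hostPort, selectors, volumes, and spreading are ignored):

// Hypothetical types for illustration; not the real Kubernetes API.
type PendingPod struct {
    CPURequestMilli int64
    MemRequestBytes int64
}

type NodeShape struct {
    AllocatableCPUMilli int64
    AllocatableMemBytes int64
}

// wouldFitEmptyNode is a deliberately crude estimate of schedulability on a
// freshly provisioned node of the given shape.
func wouldFitEmptyNode(p PendingPod, shape NodeShape) bool {
    return p.CPURequestMilli <= shape.AllocatableCPUMilli &&
        p.MemRequestBytes <= shape.AllocatableMemBytes
}

// shouldProvisionNode returns true if adding one node of the given shape
// would let at least one pending pod schedule.
func shouldProvisionNode(pending []PendingPod, shape NodeShape) bool {
    for _, p := range pending {
        if wouldFitEmptyNode(p, shape) {
            return true
        }
    }
    return false
}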

@bgrant0607 (Member)

cc @justinsb

@bgrant0607 (Member)

Regarding API details:

Here's the current API:

const (
    // Percentage of node's CPUs that is currently used.
    CpuConsumption NodeResource = "CpuConsumption"

    // Percentage of node's CPUs that is currently requested for pods.
    CpuRequest NodeResource = "CpuRequest"

    // Percentage of node's memory that is currently used.
    MemConsumption NodeResource = "MemConsumption"

    // Percentage of node's memory that is currently requested for pods.
    MemRequest NodeResource = "MemRequest"
)

// NodeUtilization describes what percentage of a particular resource is used on a node.
type NodeUtilization struct {
    Resource NodeResource `json:"resource"`

    // The accepted values are from 0 to 1.
    Value float64 `json:"value"`
}

// Configuration of the Cluster Autoscaler
type ClusterAutoscalerSpec struct {
    // Minimum number of nodes that the cluster should have.
    MinNodes int `json:"minNodes"`

    // Maximum number of nodes that the cluster should have.
    MaxNodes int `json:"maxNodes"`

    // Target average utilization of the cluster nodes. New nodes will be added if one of the
    // targets is exceeded. Cluster size will be decreased if the current utilization is too low
    // for all targets.
    TargetUtilization []NodeUtilization `json:"target"`
}

This API precludes heterogeneous clusters. We should consider using aggregate resources, more similar to ResourceQuota.
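
For illustration, an aggregate, ResourceQuota-like shape might express targets as cluster-wide quantities instead of per-node percentages. The following is a hypothetical sketch, not a proposed API, and every field name here is invented:

// Hypothetical sketch only. Quantities are plain strings (e.g. "100", "64Gi")
// to stand in for resource.Quantity.
type ClusterCapacityTargetSpec struct {
    // Minimum and maximum aggregate allocatable resources the cluster should
    // provide, keyed by resource name (e.g. "cpu", "memory").
    MinAllocatable map[string]string `json:"minAllocatable"`
    MaxAllocatable map[string]string `json:"maxAllocatable"`

    // Headroom to keep free beyond the sum of pod requests, so that newly
    // created pods of common shapes still have room to schedule.
    RequestHeadroom map[string]string `json:"requestHeadroom"`
}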

targetUtilization doesn't take into account other scheduling constraints and considerations, such as hostPort.

I made additional comments on #15063 and on the HPA API (e.g., naming isn't consistent with HorizontalPodAutoscaler, Go and json field name mismatch, floats).

But I'd rather not have an API for this right now, especially since it's experimental, provider-specific, and a singleton for the cluster.

@bgrant0607 (Member)

Other considerations: max pods, max PDs per node

always a node that can fit a pod with request set to `1-X`.

If a pod has a bigger request, then there is no guarantee. This may be a problem for customers who would
like to run big pods on their nodes, or even devote whole nodes to one type of job (some kind of database, for instance).

Review comment (Member):

And:

  • pods that use hostPort
  • pods with additional constraints, such as nodeName or nodeSelector
  • pods that use hostPath and expect the data to remain resident
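
Extending a naive fit check to the first two cases listed above (hypothetical types; hostPath locality is omitted because it cannot be satisfied by adding a brand-new node):

type CandidatePod struct {
    CPURequestMilli int64
    HostPorts       []int32
    NodeSelector    map[string]string
}

type CandidateNode struct {
    AllocatableCPUMilli int64
    UsedHostPorts       map[int32]bool
    Labels              map[string]string
}

func podFitsNode(p CandidatePod, n CandidateNode) bool {
    if p.CPURequestMilli > n.AllocatableCPUMilli {
        return false
    }
    for _, port := range p.HostPorts {
        if n.UsedHostPorts[port] { // hostPort conflict
            return false
        }
    }
    for k, v := range p.NodeSelector { // nodeSelector must match node labels
        if n.Labels[k] != v {
            return false
        }
    }
    return true
}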

@bgrant0607 (Member)

See also comments on #16199.

@davidopp assigned bgrant0607 and unassigned davidopp on Oct 28, 2015
@bgrant0607 (Member)

To scale down, we also need graceful evacuation/drain: #3885

@eparis (Contributor) commented Feb 1, 2016

@k8s-bot test this issue: #IGNORE

Tests have been pending for 24 hours

@k8s-bot commented Feb 2, 2016

GCE e2e test build/test passed for commit e72f862.

@k8s-bot commented Feb 10, 2016

GCE e2e test build/test passed for commit e72f862.

@bgrant0607 (Member)

@mwielgus @fgrzadkowski @piosz Are you planning to update this proposal, or should we just close this PR?

@mwielgus (Contributor, Author)

We will create another one.

@mwielgus closed this on Apr 25, 2016
Labels
kind/design (categorizes issue or PR as related to design), size/L (denotes a PR that changes 100-499 lines, ignoring generated files)