
Cluster autoscaling proposal #15304

Closed

Conversation

@mwielgus (Contributor) commented Oct 8, 2015

No description provided.

@mwielgus force-pushed the kac_proposal branch 2 times, most recently from a510d09 to 801580f on October 8, 2015
@derekwaynecarr (Member)
@smarterclayton @kubernetes/rh-cluster-infra

It looks like this plan is predicated on moving Config out of experimental in time for 1.2?

@k8s-bot commented Oct 8, 2015

GCE e2e test build/test passed for commit dadd417803372cf318c3871fab3c4cb7dcaebc02.

@k8s-bot commented Oct 8, 2015

GCE e2e test build/test passed for commit a510d09d3dd90cbe6ccd306cbad77ae4d20e5d0c.

@k8s-bot commented Oct 8, 2015

GCE e2e build/test failed for commit 801580f9cb613c512eb1d26d3b8c8fb64a37498a.

@k8s-github-robot added the size/L label (denotes a PR that changes 100-499 lines, ignoring generated files) on Oct 8, 2015

At this point we will encourage the community to contribute code that talks to a particular cloud provider's node controller.

Kubernetes 1.4-1.5 - big KAC

Review comment (Member): Prepend with ## ?

Reply (mwielgus, Contributor, Author): Done

@mwielgus force-pushed the kac_proposal branch 2 times, most recently from 5cdbffa to 46dd07a on October 8, 2015
@k8s-bot commented Oct 8, 2015

GCE e2e test build/test passed for commit 5cdbffa80f75a27f8fc6f4e202b974395c930924.

@k8s-bot commented Oct 8, 2015

GCE e2e test build/test passed for commit 46dd07afa89899549033350e7ded6963b530089e.


In the current Kubernetes version we have a basic support for cluster autoscaling. It is based
on Google Cloud Autoscaler (https://cloud.google.com/compute/docs/autoscaler/) and works only
in Google Compute Engine. It looks on cpu-usage (what is the current load) and pod-cpu-requests

Review comment (Member):

s/on/at/
I think this sentence could be clearer -- maybe something like "It looks at total CPU usage of all running pods and total CPU request of all running pods."


Reply (mwielgus, Contributor, Author):

Done

when Kubernetes is running on Google Compute Engine. It looks at total cpu usage on each node (including system stuff)
and total cpu request of all running pods. It can also scale on memory, but for simplicity
this document will focus only on the CPU usage aspect (other metrics work in exactly the same way).
The user is expected to provide the target level of utilization (which is now common for all metrics)

Review comment (Member):

Explain how the observed utilization is derived from the metrics you said it looks at, namely "total cpu usage on each node (including system stuff) and total cpu request of all running pods" (in particular I don't see how it uses cpu request of running pods)
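
For illustration only, one plausible derivation (an assumption, not something the proposal text confirms) is: per-node utilization is the chosen metric (usage or request) divided by node capacity, averaged across nodes and compared against the target. A minimal Go sketch, with all types and field names hypothetical:

// Hypothetical types; the proposal does not define these.
type NodeStats struct {
    CPUUsedMilli     int64 // observed CPU usage on the node (includes system daemons)
    CPURequestMilli  int64 // sum of CPU requests of pods bound to the node
    CPUCapacityMilli int64 // node CPU capacity, assumed non-zero
}

// averageUtilization averages metric/capacity across nodes; pass a selector
// for usage-based or request-based utilization. A controller would scale up
// when the result exceeds the target and scale down when it is well below it.
func averageUtilization(nodes []NodeStats, metric func(NodeStats) int64) float64 {
    if len(nodes) == 0 {
        return 0
    }
    sum := 0.0
    for _, n := range nodes {
        sum += float64(metric(n)) / float64(n.CPUCapacityMilli)
    }
    return sum / float64(len(nodes))
}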

@davidopp (Member)

When we discussed this in person, I believe we said KAC would use request only, not utilization. Please make it clear in the doc that some time before 1.4 you will switch to using request. Maybe 1.2 is the place to do it, since that's where you mention the hack for pending pods, and it's easy to incorporate pending pods if you just do it by adding request of pending pods to request of the other pods.
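
A sketch of the request-only calculation described above, with pending pods folded in simply by adding their requests (names hypothetical):

// requestUtilization returns total requested CPU (pods already scheduled plus
// pending pods) divided by total cluster CPU capacity. Comparing this against
// the target naturally accounts for pods that do not fit yet.
func requestUtilization(scheduledRequestMilli, pendingRequestMilli, clusterCapacityMilli int64) float64 {
    if clusterCapacityMilli == 0 {
        return 0
    }
    return float64(scheduledRequestMilli+pendingRequestMilli) / float64(clusterCapacityMilli)
}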

@bgrant0607 (Member)

"Autoscaling" is the wrong way to think about the node provisioning problem. Using the MIG autoscaler is not the long-term direction. Autoscaling based on resource usage or even custom metrics is never going to be sufficient, and nodes aren't fungible once pods are scheduled on them. Additionally, managed-infrastructure-based autoscaling would require different work on every cloud provider, and would require a MIG/autoscaler controller in order to manage heterogeneous clusters, at which point the controller might as well just deal with nodes. We shouldn't invest a lot in it. Furthermore, the existing ClusterAutoscaler API doesn't contain the right information and should be deleted.

We need to think about automatic provisioning and deprovisioning of nodes based on applications' needs. This is also being discussed for storage provisioning. #14537 @thockin

Even if there is 0 actual cpu usage on the nodes, we may need more nodes in order to schedule more pods, since scheduling is based on resource requests and other constraints, not usage. If 1000 pods, or a few large-memory pods, or pods requiring SSD or GPUs, or pods with some other special constraints or requirements are submitted to Kubernetes, we need some component to decide how many nodes of what kind need to be provisioned, based not only on standard resource requirements but also availability/spreading considerations, latency required to create nodes, hardware and software requirements, etc. Additionally, in order to get most of the benefit of provisioning new nodes, we need to rebalance via rescheduling. #12140

Similarly, if we want to be able to deprovision nodes, some component would need to take into consideration all the relevant tradeoffs before gradually draining those nodes so that they can be retired. What if those nodes were running storage servers? Storage can't be drained quickly, and maybe the storage capacity is needed, so the nodes shouldn't be deleted at all.

Once we add preemption and time-deferred scheduling, the problem will be even more complicated.

@alex-mohr suggested starting by provisioning new nodes if pods are pending and we can estimate that creating a new node would allow the pod(s) to schedule. I like that idea.

Deprovisioning is a separate problem.
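
A minimal sketch of that pending-pod-driven idea, under simplifying assumptions (hypothetical types, a single provisionable node shape, and only CPU/memory requests; hostPort, selectors, volumes, and spreading are ignored):

// Hypothetical types for illustration; not the real Kubernetes API.
type PendingPod struct {
    CPURequestMilli int64
    MemRequestBytes int64
}

type NodeShape struct {
    AllocatableCPUMilli int64
    AllocatableMemBytes int64
}

// wouldFitEmptyNode is a deliberately crude estimate of schedulability on a
// freshly provisioned node of the given shape.
func wouldFitEmptyNode(p PendingPod, shape NodeShape) bool {
    return p.CPURequestMilli <= shape.AllocatableCPUMilli &&
        p.MemRequestBytes <= shape.AllocatableMemBytes
}

// shouldProvisionNode returns true if adding one node of the given shape
// would let at least one pending pod schedule.
func shouldProvisionNode(pending []PendingPod, shape NodeShape) bool {
    for _, p := range pending {
        if wouldFitEmptyNode(p, shape) {
            return true
        }
    }
    return false
}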

@bgrant0607 (Member)

cc @justinsb

@bgrant0607 (Member)

Regarding API details:

Here's the current API:

const (
    // Percentage of node's CPUs that is currently used.
    CpuConsumption NodeResource = "CpuConsumption"

    // Percentage of node's CPUs that is currently requested for pods.
    CpuRequest NodeResource = "CpuRequest"

    // Percentage of node's memory that is currently used.
    MemConsumption NodeResource = "MemConsumption"

    // Percentage of node's memory that is currently requested for pods.
    MemRequest NodeResource = "MemRequest"
)

// NodeUtilization describes what percentage of a particular resource is used on a node.
type NodeUtilization struct {
    Resource NodeResource `json:"resource"`

    // The accepted values are from 0 to 1.
    Value float64 `json:"value"`
}

// Configuration of the Cluster Autoscaler
type ClusterAutoscalerSpec struct {
    // Minimum number of nodes that the cluster should have.
    MinNodes int `json:"minNodes"`

    // Maximum number of nodes that the cluster should have.
    MaxNodes int `json:"maxNodes"`

    // Target average utilization of the cluster nodes. New nodes will be added if one of the
    // targets is exceeded. Cluster size will be decreased if the current utilization is too low
    // for all targets.
    TargetUtilization []NodeUtilization `json:"target"`
}

This API precludes heterogeneous clusters. We should consider using aggregate resources, more similar to ResourceQuota.
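
For illustration, an aggregate, ResourceQuota-like shape might express targets as cluster-wide quantities instead of per-node percentages. The following is a hypothetical sketch, not a proposed API, and every field name here is invented:

// Hypothetical sketch only. Quantities are plain strings (e.g. "100", "64Gi")
// to stand in for resource.Quantity.
type ClusterCapacityTargetSpec struct {
    // Minimum and maximum aggregate allocatable resources the cluster should
    // provide, keyed by resource name (e.g. "cpu", "memory").
    MinAllocatable map[string]string `json:"minAllocatable"`
    MaxAllocatable map[string]string `json:"maxAllocatable"`

    // Headroom to keep free beyond the sum of pod requests, so that newly
    // created pods of common shapes still have room to schedule.
    RequestHeadroom map[string]string `json:"requestHeadroom"`
}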

targetUtilization doesn't take into account other scheduling constraints and considerations, such as hostPort.

I made additional comments on #15063 and on the HPA API (e.g., naming isn't consistent with HorizontalPodAutoscaler, Go and json field name mismatch, floats).

But I'd rather not have an API for this right now, especially since it's experimental, provider-specific, and a singleton for the cluster.

@bgrant0607 (Member)

Other considerations: max pods, max PDs per node

always a node that can fit a pod with request set to `1-X`.

If a pod has a bigger request, then there is no guarantee. This may be a problem for customers who would
like to run big pods on their nodes, or even devote whole nodes to one type of job (some kind of database, for instance).

Review comment (Member):

And:

  • pods that use hostPort
  • pods with additional constraints, such as nodeName or nodeSelector
  • pods that use hostPath and expect the data to remain resident
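
Extending a naive fit check to the first two cases listed above (hypothetical types; hostPath locality is omitted because it cannot be satisfied by adding a brand-new node):

type CandidatePod struct {
    CPURequestMilli int64
    HostPorts       []int32
    NodeSelector    map[string]string
}

type CandidateNode struct {
    AllocatableCPUMilli int64
    UsedHostPorts       map[int32]bool
    Labels              map[string]string
}

func podFitsNode(p CandidatePod, n CandidateNode) bool {
    if p.CPURequestMilli > n.AllocatableCPUMilli {
        return false
    }
    for _, port := range p.HostPorts {
        if n.UsedHostPorts[port] { // hostPort conflict
            return false
        }
    }
    for k, v := range p.NodeSelector { // nodeSelector must match node labels
        if n.Labels[k] != v {
            return false
        }
    }
    return true
}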

@bgrant0607 (Member)

See also comments on #16199.

@davidopp assigned bgrant0607 and unassigned davidopp on Oct 28, 2015
@bgrant0607 (Member)

To scale down, we also need graceful evacuation/drain: #3885

@eparis (Contributor) commented Feb 1, 2016

@k8s-bot test this issue: #IGNORE

Tests have been pending for 24 hours

@k8s-bot commented Feb 2, 2016

GCE e2e test build/test passed for commit e72f862.

@k8s-bot commented Feb 10, 2016

GCE e2e test build/test passed for commit e72f862.

@bgrant0607 (Member)

@mwielgus @fgrzadkowski @piosz Are you planning to update this proposal, or should we just close this PR?

@mwielgus (Contributor, Author)

We will create another one.

@mwielgus closed this on Apr 25, 2016
Labels
kind/design (categorizes issue or PR as related to design), size/L (denotes a PR that changes 100-499 lines, ignoring generated files)