Cluster autoscaling proposal #15304
Conversation
Force-pushed from a510d09 to 801580f
@smarterclayton @kubernetes/rh-cluster-infra It looks like this plan is predicated on moving
GCE e2e test build/test passed for commit dadd417803372cf318c3871fab3c4cb7dcaebc02.
GCE e2e test build/test passed for commit a510d09d3dd90cbe6ccd306cbad77ae4d20e5d0c.
GCE e2e build/test failed for commit 801580f9cb613c512eb1d26d3b8c8fb64a37498a.
> At this point we will encourage the community to contribute code that talks to particular cloud provider node controller.
>
> Kubernetes 1.4-1.5 - big KAC
Prepend with `##`?
Done
Force-pushed from 5cdbffa to 46dd07a
GCE e2e test build/test passed for commit 5cdbffa80f75a27f8fc6f4e202b974395c930924.
GCE e2e test build/test passed for commit 46dd07afa89899549033350e7ded6963b530089e.
> In the current Kubernetes version we have a basic support for cluster autoscaling. It is based
> on Google Cloud Autoscaler (https://cloud.google.com/compute/docs/autoscaler/) and works only
> in Google Compute Engine. It looks on cpu-usage (what is the current load) and pod-cpu-requests
s/on/at/
I think this sentence could be clearer -- maybe something like "It looks at total CPU usage of all running pods and total CPU request of all running pods."
Done
> when Kubernetes is running on Google Compute Engine. It looks at total cpu usage on each node (including system stuff)
> and total cpu request of all running pods. It can also scale on memory but for the simplicity
> this document will focus only on the CPU usage aspect (other metrics work in the exactly same way).
> The user is expected to provide the target level of utilization (which is now common for all metrics)
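For concreteness, the utilization-target sizing the quoted paragraph describes can be sketched roughly as follows. This is an illustrative helper, not the actual KAC or Google Cloud Autoscaler code; all names and the choice to size from aggregate CPU request are assumptions for the example.

```go
package main

import (
	"fmt"
	"math"
)

// desiredNodeCount sketches utilization-target sizing: given the sum of CPU
// requests across all pods (in cores), per-node CPU capacity, and the target
// utilization level the user provides, return the smallest node count that
// keeps observed utilization at or below the target.
// (Illustrative only; the real system also has to handle memory, pending
// pods, and scheduling constraints discussed elsewhere in this thread.)
func desiredNodeCount(totalCPURequest, nodeCPUCapacity, targetUtilization float64) int {
	usablePerNode := nodeCPUCapacity * targetUtilization
	return int(math.Ceil(totalCPURequest / usablePerNode))
}

func main() {
	// 26 cores requested, 4-core nodes, target utilization 65%:
	// each node "usefully" provides 2.6 cores, so 10 nodes are needed.
	fmt.Println(desiredNodeCount(26, 4, 0.65)) // 10
}
```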
Explain how the observed utilization is derived from the metrics you said it looks at, namely "total cpu usage on each node (including system stuff) and total cpu request of all running pods" (in particular I don't see how it uses cpu request of running pods)
When we discussed this in person, I believe we said KAC would use request only, not utilization. Please make it clear in the doc that some time before 1.4 you will switch to using request. Maybe 1.2 is the place to do it, since that's where you mention the hack for pending pods, and it's easy to incorporate pending pods if you just do it by adding request of pending pods to request of the other pods.
"Autoscaling" is the wrong way to think about the node provisioning problem. Using the MIG autoscaler is not the long-term direction. Autoscaling based on resource usage or even custom metrics is never going to be sufficient, and nodes aren't fungible once pods are scheduled on them. Additionally, managed-infrastructure-based autoscaling would require different work on every cloud provider, and would require a MIG/autoscaler controller in order to manage heterogeneous clusters, at which point the controller might as well just deal with nodes. We shouldn't invest a lot in it. Furthermore, the existing ClusterAutoscaler API doesn't contain the right information and should be deleted.

We need to think about automatic provisioning and deprovisioning of nodes based on applications' needs. This is also being discussed for storage provisioning. #14537 @thockin

Even if there is 0 actual cpu usage on the nodes, we may need more nodes in order to schedule more pods, since scheduling is based on resource requests and other constraints, not usage. If 1000 pods, or a few large-memory pods, or pods requiring SSD or GPUs, or pods with some other special constraints or requirements are submitted to Kubernetes, we need some component to decide how many nodes of what kind need to be provisioned, based not only on standard resource requirements but also availability/spreading considerations, latency required to create nodes, hardware and software requirements, etc.

Additionally, in order to get most of the benefit of provisioning new nodes, we need to rebalance via rescheduling. #12140

Similarly, if we want to be able to deprovision nodes, some component would need to take into consideration all the relevant tradeoffs before gradually draining those nodes so that they can be retired. What if those nodes were running storage servers? Storage can't be drained quickly, and maybe the storage capacity is needed, so the nodes shouldn't be deleted at all.
Once we add preemption and time-deferred scheduling, the problem will be even more complicated. @alex-mohr suggested starting by provisioning new nodes if pods are pending and we can estimate that creating a new node would allow the pod(s) to schedule. I like that idea. Deprovisioning is a separate problem.
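The starting point suggested above — only provision when a fresh node would actually let a pending pod schedule — could be approximated with a fit check against an empty node template. This is a hypothetical sketch with invented names, not a real scheduler predicate; as noted elsewhere in this thread, a real check would also have to cover hostPort, nodeSelector, max pods, PD limits, and so on.

```go
package main

import "fmt"

// requirements holds the resources a pod requests, or a fresh node of a
// given template offers (milli-CPU and bytes of memory). Illustrative only.
type requirements struct {
	MilliCPU int64
	Memory   int64
}

// wouldFitOnNewNode estimates whether adding one empty node of the given
// template would allow the pending pod to schedule. If it returns false,
// provisioning a node of this template cannot help this pod.
func wouldFitOnNewNode(pending, nodeTemplate requirements) bool {
	return pending.MilliCPU <= nodeTemplate.MilliCPU &&
		pending.Memory <= nodeTemplate.Memory
}

func main() {
	node := requirements{MilliCPU: 4000, Memory: 15 << 30} // 4 cores, 15 GiB
	pod := requirements{MilliCPU: 2500, Memory: 8 << 30}
	fmt.Println(wouldFitOnNewNode(pod, node)) // true: a new node would help

	huge := requirements{MilliCPU: 9000, Memory: 8 << 30}
	fmt.Println(wouldFitOnNewNode(huge, node)) // false: no template of this size fits
}
```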
cc @justinsb
Regarding API details: Here's the current API:
This API precludes heterogeneous clusters. We should consider using aggregate resources, more similar to ResourceQuota. targetUtilization doesn't take into account other scheduling constraints and considerations, such as hostPort. I made additional comments on #15063 and on the HPA API (e.g., naming isn't consistent with HorizontalPodAutoscaler, Go and json field name mismatch, floats). But I'd rather not have an API for this right now, especially since it's experimental, provider-specific, and a singleton for the cluster.
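One way to read the "aggregate resources, more similar to ResourceQuota" suggestion is a spec that bounds total cluster capacity rather than node counts plus a targetUtilization float. The shape below is purely hypothetical, invented for illustration; it is not the existing ClusterAutoscaler API nor a concrete proposal.

```go
package main

import "fmt"

// ResourceList mirrors Kubernetes' resource-name-to-quantity map, shown here
// with plain strings for brevity (e.g. "cpu": "100", "memory": "400Gi").
type ResourceList map[string]string

// ClusterScalingSpec is a hypothetical, ResourceQuota-flavored shape: bound
// the aggregate schedulable capacity of the whole cluster, which works for
// heterogeneous node pools, instead of per-template node counts. Field names
// are invented for this example.
type ClusterScalingSpec struct {
	MinCapacity ResourceList // never scale below this aggregate capacity
	MaxCapacity ResourceList // never scale above it
}

func main() {
	spec := ClusterScalingSpec{
		MinCapacity: ResourceList{"cpu": "40", "memory": "150Gi"},
		MaxCapacity: ResourceList{"cpu": "400", "memory": "1500Gi"},
	}
	fmt.Println(spec.MaxCapacity["cpu"]) // 400
}
```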
Other considerations: max pods, max PDs per node
> always a node that can fit a pod with request set to `1-X`.
>
> If a pod has some bigger request then there is no guarantee. It may be a problem for customers who would
> like to have big pods running on nodes or even devote nodes to one type of job (some kind of database for instance).
And:
- pods that use hostPort
- pods with additional constraints, such as nodeName or nodeSelector
- pods that use hostPath and expect the data to remain resident
See also comments on #16199.
To scale down, we also need graceful evacuation/drain: #3885
@k8s-bot test this issue: #IGNORE Tests have been pending for 24 hours
GCE e2e test build/test passed for commit e72f862.
@mwielgus @fgrzadkowski @piosz Are you planning to update this proposal, or should we just close this PR?
We will create another one.
No description provided.