
Add Capacity scheduling for ML/DL workloads based on scheduler framework #9

Closed
denkensk opened this issue Jun 23, 2020 · 2 comments · Fixed by #10

@denkensk
Member

denkensk commented Jun 23, 2020

What would you like to be added:

Add Capacity scheduling for ML/DL workloads based on scheduler framework

Why is this needed:

There is increasing demand to use Kubernetes to manage batch workloads (ML/DL). In those cases, one challenge is to improve cluster utilization while ensuring that each user has a reasonable amount of resources. The problem can be partially addressed by the Kubernetes ResourceQuota. The native Kubernetes ResourceQuota API can be used to specify the maximum overall resource allocation per namespace. The quota enforcement is done through an admission check. A quota resource consumer (e.g., a Pod) cannot be created if the aggregated resource allocation exceeds the quota limit. The Kubernetes quota design has the following limitations:

  1. Quota resource usage is aggregated from the resource configurations (e.g., the Pod CPU/memory requests specified in the Pod spec). Although this mechanism guarantees that actual resource consumption never exceeds the ResourceQuota limit, it can lead to low resource utilization when the actual consumption is much smaller than the requested amount, i.e., internal resource fragmentation (see the sketch after this list).

  2. If we use ResourceQuota to strictly divide the cluster's resources among all users, to prevent the cluster from running out of resources, it can also lead to low resource utilization because some users' quotas may go unused.
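To make limitation 1 concrete, here is a minimal, purely illustrative Go sketch of request-based accounting (the `Quota` type and `Admit` function are hypothetical, not the real admission code): once a Pod's requests reserve the whole quota, no further Pods are admitted, regardless of how little the first Pod actually uses.

```go
package main

import "fmt"

// Quota is a toy stand-in for a namespace ResourceQuota: admission sums the
// declared Pod requests against the hard limit and never looks at live usage.
type Quota struct {
	HardCPU int64 // total CPU (millicores) Pods in the namespace may request
	UsedCPU int64 // sum of requests of Pods already admitted
}

// Admit mimics the admission-time check: only the declared request matters.
func (q *Quota) Admit(requestCPU int64) bool {
	if q.UsedCPU+requestCPU > q.HardCPU {
		return false
	}
	q.UsedCPU += requestCPU
	return true
}

func main() {
	q := &Quota{HardCPU: 4000}
	fmt.Println(q.Admit(4000)) // true: one Pod reserves the whole quota
	fmt.Println(q.Admit(500))  // false: rejected even if the first Pod sits idle
}
```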

Due to the above limitations, batch workloads (ML/DL) can't run in a Kubernetes cluster as efficiently as they do in other container orchestration platforms such as Yarn. To overcome these limitations, we introduce a "Queue" concept, borrowed from the Yarn capacity scheduler, into Kubernetes. Basically, a "Queue" has the notions of "max" and "min": "min" is the minimum amount of resources needed to ensure the basic functionality/performance of the consumers, and "max" specifies the upper bound of the resource consumption of the consumers. By introducing "min" and "max", Pod scheduling allows the following optimizations (a minimal sketch follows the list below):

  1. The slack between "min" and "max" can help tolerate runtime failures. For example, if a Pod that consumes the entire "min" fails to run, new Pods can still be created as long as "max" has not been reached. With the Kubernetes ResourceQuota, once the Pod that consumes the entire quota is created, no other Pods can be created even if that Pod fails to run (e.g., the Pod is stuck in the image-pulling phase).

  2. Improve overall resource utilization by allowing one queue's users to "borrow" unused, reserved "min" resources from other queues. A queue's unused "min" resources can be used by other users, provided there is a mechanism to guarantee that the "victim" user can reclaim their "min" resources whenever they need them. Typically, this is done by implementing preemption.
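As a rough illustration of the "min"/"max" semantics above (the `Queue` type and `CanAdmit` function are hypothetical; the API actually proposed in the KEP may differ), a queue can admit Pods beyond its guaranteed "min" only while other queues' unused "min" leaves idle capacity to borrow, and never beyond its "max":

```go
package capacity

// Queue is a hypothetical representation of the proposed "Queue": "Min" is the
// guaranteed share, "Max" the hard cap, and the slack in between can be lent
// to, or borrowed from, other queues.
type Queue struct {
	Name string
	Min  int64 // guaranteed resources (e.g., CPU millicores)
	Max  int64 // upper bound for this queue's consumers
	Used int64 // sum of requests of this queue's running Pods
}

// CanAdmit reports whether a Pod with the given request fits in the queue.
// Staying within Min is always fine; exceeding Min means borrowing, which is
// only possible while the cluster still has that much idle (unused-Min) capacity.
func (q *Queue) CanAdmit(request, idleClusterCapacity int64) bool {
	next := q.Used + request
	if next > q.Max {
		return false
	}
	overMin := next - q.Min
	return overMin <= 0 || overMin <= idleClusterCapacity
}
```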

The success of "Queue"-based capacity scheduling relies on a proper Pod preemption algorithm implementation. This KEP proposes a minimal scheduler extension, built on the scheduler framework, to support "Queue"-based scheduling.
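As a sketch of one possible reclaim rule (hypothetical, building on the `Queue` type above; the KEP defines the actual algorithm): a queue below its "min" may only pick victims from queues running above their "min", i.e., Pods backed by borrowed resources.

```go
package capacity

// victimCandidates returns the queues whose borrowed usage a starving queue
// could reclaim. A real implementation would then choose concrete victim Pods
// (for example, lowest priority first) from these queues during preemption.
func victimCandidates(starving *Queue, all []*Queue) []*Queue {
	if starving.Used >= starving.Min {
		return nil // the queue already has its guarantee; nothing to reclaim
	}
	var victims []*Queue
	for _, q := range all {
		if q != starving && q.Used > q.Min {
			victims = append(victims, q)
		}
	}
	return victims
}
```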

KEP: https://docs.google.com/document/d/1ViujTXLP1XX3WKYUTk6u5LTdJ1sX-tVIw9_t9_mLpIc/edit?usp=sharing

/assign
/cc @Huang-Wei

@wsxiaozhang

Capacity scheduling is definitely important for running ML/DL and batch jobs in multi-user/multi-tenant clusters. Overall cluster utilization always needs to be improved, especially when users compete for GPUs.

@Huang-Wei
Contributor

Thanks @denkensk for the continuous work on exploring the scheduler framework to resolve real pain points! I've heard from a lot of users with this requirement, such as kubernetes/kubernetes#72278.

Incubating and iterating on a plugin like this proposal (along with the previous co-scheduling one) is exactly the intent, and the right way to empower users to extend the Kubernetes scheduler. I do hope we can build a better scheduler plugin ecosystem. Let's keep the ball rolling!
