
Add Capacity scheduling for ML/DL workloads based on scheduler framework #9

Closed
denkensk opened this issue Jun 23, 2020 · 2 comments · Fixed by #10

@denkensk
Member

denkensk commented Jun 23, 2020

What would you like to be added:

Add Capacity scheduling for ML/DL workloads based on scheduler framework

Why is this needed:

There is increasing demand to use Kubernetes to manage batch workloads (ML/DL). In those cases, one challenge is to improve cluster utilization while ensuring that each user has a reasonable amount of resources. The problem can be partially addressed by the Kubernetes ResourceQuota. The native Kubernetes ResourceQuota API can be used to specify the maximum overall resource allocation per namespace. The quota enforcement is done through an admission check. A quota resource consumer (e.g., a Pod) cannot be created if the aggregated resource allocation exceeds the quota limit. The Kubernetes quota design has the following limitations:

  1. Quota resource usage is aggregated from the resource configurations (e.g., the Pod CPU/memory requests specified in the Pod spec). Although this mechanism guarantees that actual resource consumption never exceeds the ResourceQuota limit, it can lead to low resource utilization when the actual consumption is much smaller than the requested amount, i.e., internal resource fragmentation (see the sketch after this list).

  2. If we use ResourceQuota to strictly divide the cluster's resources among all users, to prevent the cluster from running out of resources, it can also lead to low resource utilization because some users' quotas may go unused.
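To make limitation 1 concrete, here is a minimal, purely illustrative Go sketch of request-based accounting (the `Quota` type and `Admit` function are hypothetical, not the real admission code): once a Pod's requests reserve the whole quota, no further Pods are admitted, regardless of how little the first Pod actually uses.

```go
package main

import "fmt"

// Quota is a toy stand-in for a namespace ResourceQuota: admission sums the
// declared Pod requests against the hard limit and never looks at live usage.
type Quota struct {
	HardCPU int64 // total CPU (millicores) Pods in the namespace may request
	UsedCPU int64 // sum of requests of Pods already admitted
}

// Admit mimics the admission-time check: only the declared request matters.
func (q *Quota) Admit(requestCPU int64) bool {
	if q.UsedCPU+requestCPU > q.HardCPU {
		return false
	}
	q.UsedCPU += requestCPU
	return true
}

func main() {
	q := &Quota{HardCPU: 4000}
	fmt.Println(q.Admit(4000)) // true: one Pod reserves the whole quota
	fmt.Println(q.Admit(500))  // false: rejected even if the first Pod sits idle
}
```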

Due to the above limitations, batch workloads (ML/DL) can't run in a Kubernetes cluster as efficiently as they do in other container orchestration platforms such as Yarn. To overcome these limitations, we introduce a "Queue" concept, borrowed from the Yarn capacity scheduler, into Kubernetes. Basically, a "Queue" has the notions of "max" and "min": "min" is the minimum amount of resources needed to ensure the basic functionality/performance of the consumers, and "max" specifies the upper bound of the resource consumption of the consumers. By introducing "min" and "max", Pod scheduling allows the following optimizations (a minimal sketch follows the list below):

  1. The slack between "min" and "max" can help tolerate runtime failures. For example, if a Pod that consumes the entire "min" fails to run, new Pods can still be created as long as "max" has not been reached. With the Kubernetes ResourceQuota, once the Pod that consumes the entire quota is created, no other Pods can be created even if that Pod fails to run (e.g., the Pod is stuck in the image-pulling phase).

  2. Improve overall resource utilization by allowing one queue's users to "borrow" unused, reserved "min" resources from other queues. A queue's unused "min" resources can be used by other users, provided there is a mechanism to guarantee that the "victim" user can reclaim their "min" resources whenever they need them. Typically, this is done by implementing preemption.
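As a rough illustration of the "min"/"max" semantics above (the `Queue` type and `CanAdmit` function are hypothetical; the API actually proposed in the KEP may differ), a queue can admit Pods beyond its guaranteed "min" only while other queues' unused "min" leaves idle capacity to borrow, and never beyond its "max":

```go
package capacity

// Queue is a hypothetical representation of the proposed "Queue": "Min" is the
// guaranteed share, "Max" the hard cap, and the slack in between can be lent
// to, or borrowed from, other queues.
type Queue struct {
	Name string
	Min  int64 // guaranteed resources (e.g., CPU millicores)
	Max  int64 // upper bound for this queue's consumers
	Used int64 // sum of requests of this queue's running Pods
}

// CanAdmit reports whether a Pod with the given request fits in the queue.
// Staying within Min is always fine; exceeding Min means borrowing, which is
// only possible while the cluster still has that much idle (unused-Min) capacity.
func (q *Queue) CanAdmit(request, idleClusterCapacity int64) bool {
	next := q.Used + request
	if next > q.Max {
		return false
	}
	overMin := next - q.Min
	return overMin <= 0 || overMin <= idleClusterCapacity
}
```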

The success of "Queue"-based capacity scheduling relies on a proper Pod preemption algorithm implementation. This KEP proposes a minimal scheduler extension, built on the scheduler framework, to support "Queue"-based scheduling.
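As a sketch of one possible reclaim rule (hypothetical, building on the `Queue` type above; the KEP defines the actual algorithm): a queue below its "min" may only pick victims from queues running above their "min", i.e., Pods backed by borrowed resources.

```go
package capacity

// victimCandidates returns the queues whose borrowed usage a starving queue
// could reclaim. A real implementation would then choose concrete victim Pods
// (for example, lowest priority first) from these queues during preemption.
func victimCandidates(starving *Queue, all []*Queue) []*Queue {
	if starving.Used >= starving.Min {
		return nil // the queue already has its guarantee; nothing to reclaim
	}
	var victims []*Queue
	for _, q := range all {
		if q != starving && q.Used > q.Min {
			victims = append(victims, q)
		}
	}
	return victims
}
```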

KEP: https://docs.google.com/document/d/1ViujTXLP1XX3WKYUTk6u5LTdJ1sX-tVIw9_t9_mLpIc/edit?usp=sharing

/assign
/cc @Huang-Wei

@wsxiaozhang

Capacity scheduling is definitely important for running ML/DL and batch jobs in multi-user/multi-tenant clusters. Overall cluster utilization always needs to be improved, especially when users compete for GPUs.

@Huang-Wei
Contributor

Thanks @denkensk for the continuous work on exploring the scheduler framework to resolve real pain points! I've heard from a lot of users with this requirement, such as kubernetes/kubernetes#72278.

Incubating and iterating on a plugin like this proposal (along with the previous co-scheduling one) is exactly the intent, and the right way to empower users to extend the Kubernetes scheduler. I do hope we can build a better scheduler plugin ecosystem. Let's keep the ball rolling!
