Create proposal on multiple schedulers #17197
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,165 @@ | ||
<!-- BEGIN MUNGE: UNVERSIONED_WARNING --> | ||
|
||
<!-- BEGIN STRIP_FOR_RELEASE --> | ||
|
||
<img src="http://kubernetes.io/img/warning.png" alt="WARNING" | ||
width="25" height="25"> | ||
<img src="http://kubernetes.io/img/warning.png" alt="WARNING" | ||
width="25" height="25"> | ||
<img src="http://kubernetes.io/img/warning.png" alt="WARNING" | ||
width="25" height="25"> | ||
<img src="http://kubernetes.io/img/warning.png" alt="WARNING" | ||
width="25" height="25"> | ||
<img src="http://kubernetes.io/img/warning.png" alt="WARNING" | ||
width="25" height="25"> | ||
|
||
<h2>PLEASE NOTE: This document applies to the HEAD of the source tree</h2> | ||
|
||
If you are using a released version of Kubernetes, you should | ||
refer to the docs that go with that version. | ||
|
||
<strong> | ||
The latest release of this document can be found | ||
[here](http://releases.k8s.io/release-1.1/docs/proposals/multiple-schedulers.md). | ||
|
||
Documentation for other releases can be found at | ||
[releases.k8s.io](http://releases.k8s.io). | ||
</strong> | ||
-- | ||
|
||
<!-- END STRIP_FOR_RELEASE --> | ||
|
||
<!-- END MUNGE: UNVERSIONED_WARNING --> | ||
|
||
# Multi-Scheduler in Kubernetes | ||
|
||
**Status**: Design & Implementation in progress. | ||
|
||
> Contact @HaiyangDING for questions & suggestions. | ||
|
||
## Motivation | ||
|
||
In the current Kubernetes design, there is only one default scheduler in a Kubernetes cluster. | ||
However, it is common for multiple types of workload, such as traditional batch, DAG batch, streaming and user-facing production services, | ||
to run in the same cluster, and they need to be scheduled in different ways. For example, in | ||
[Omega](http://research.google.com/pubs/pub41684.html), batch workloads and service workloads are scheduled by two types of schedulers: | ||
the batch workload is scheduled by a scheduler that looks at the current usage of the cluster to improve resource utilization, | ||
and the service workload is scheduled by another one that considers the reserved resources in the | ||
cluster and many other constraints, since its performance must meet higher SLOs. | ||
[Mesos](http://mesos.apache.org/) has done great work to support multiple schedulers by building a | ||
two-level scheduling structure. This proposal describes how Kubernetes is going to support multiple schedulers | ||
so that users can run their own user-provided scheduler(s) to enable customized scheduling | ||
behavior as they need. As previously discussed in [#11793](https://github.com/kubernetes/kubernetes/issues/11793), | ||
[#9920](https://github.com/kubernetes/kubernetes/issues/9920) and [#11470](https://github.com/kubernetes/kubernetes/issues/11470), | ||
the multi-scheduler design should be generic and includes adding a scheduler name annotation to separate the pods. | ||
It is worth mentioning that the proposal does not address the question of how the scheduler name annotation gets | ||
set, although it is reasonable to anticipate that it would be set by a component like an admission controller/initializer, | ||
as the doc currently does. | ||
|
||
Before going into the details of this proposal, here are a number of ways to extend the scheduler: | ||
|
||
- Write your own scheduler and run it along with Kubernetes native scheduler. This is going to be detailed in this proposal | ||
Review comment: Can we improve the second sentence by saying: "The details are going to be explained in this proposal." |
||
- Use the callout approach such as the one implemented in [#13580](https://github.com/kubernetes/kubernetes/issues/13580) | ||
- Recompile the scheduler with a new policy | ||
- Restart the scheduler with a new [scheduler policy config file](../../examples/scheduler-policy-config.json) | ||
- Or, possibly in the future, dynamically link a new policy into the running scheduler | ||
|
||
## Challenges in multiple schedulers | ||
|
||
- Separating the pods | ||
|
||
Each pod should be scheduled by only one scheduler. To implement this, a pod should | ||
have an additional field that names the scheduler it wants to be scheduled by. In addition, | ||
each scheduler, including the default one, should have its own logic for adding unscheduled | ||
pods to its to-be-scheduled pod queue. Details are explained in later sections. | ||
|
||
- Dealing with conflicts | ||
|
||
Different schedulers are essentially separate processes. When several schedulers try to schedule | ||
their pods onto the same nodes, there might be conflicts. | ||
|
||
One example of such a conflict is a resource race. Suppose `pod1`, scheduled by | ||
`my-scheduler`, has a CPU *request* of 1, and `pod2`, scheduled by `kube-scheduler` (the k8s native | ||
scheduler, acting as the default scheduler), has a CPU *request* of 2, while `node-a` has only 2.5 | ||
free CPUs. If both schedulers try to put their pods on `node-a`, then one of the pods will eventually | ||
fail when Kubelet on `node-a` performs the create action, due to insufficient CPU resources. | ||
|
||
This conflict is complex to deal with in api-server and etcd. Our current solution is to let Kubelet | ||
do the conflict check; if a conflict happens, the affected pods are returned to their scheduler | ||
to wait to be scheduled again. Implementation details are in later sections. | ||
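The race and the Kubelet-side check described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the `Pod` and `Node` classes and the `simulate_race` helper are hypothetical names invented for illustration, not Kubernetes APIs, and CPU is the only resource modelled.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    name: str
    scheduler: str
    cpu_request: float


@dataclass
class Node:
    name: str
    free_cpu: float
    running: list = field(default_factory=list)

    def admit(self, pod):
        """Kubelet-side check: accept the pod only if its request still fits."""
        if pod.cpu_request > self.free_cpu:
            return False
        self.free_cpu -= pod.cpu_request
        self.running.append(pod.name)
        return True


def simulate_race(node, pods):
    """Bind pods chosen against one stale view of the node, then admit them in order.

    Returns a dict mapping scheduler name to the pod names handed back to it.
    """
    stale_free_cpu = node.free_cpu  # what every scheduler saw before any pod started
    requeued = {}
    for pod in pods:
        if pod.cpu_request > stale_free_cpu:
            continue  # this scheduler would not have picked the node at all
        if not node.admit(pod):
            # Conflict detected on the node: hand the pod back to its own scheduler.
            requeued.setdefault(pod.scheduler, []).append(pod.name)
    return requeued


node_a = Node("node-a", free_cpu=2.5)
requeued = simulate_race(node_a, [
    Pod("pod1", "my-scheduler", 1.0),
    Pod("pod2", "kube-scheduler", 2.0),
])
print(node_a.running, requeued)
```

With this arrival order `pod1` is admitted and `pod2` is handed back to `kube-scheduler`; if the order is reversed, `pod2` is admitted and `pod1` is the one that is returned. Which pod loses depends only on which create action the node processes first.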
|
||
## Where to start: initial design | ||
|
||
We definitely want the multi-scheduler design to be a generic mechanism. The following lists the changes | ||
we want to make in the first step. | ||
|
||
- Add an annotation to the pod template: `scheduler.alpha.kubernetes.io/name: scheduler-name`. This is used to | ||
separate pods between schedulers. `scheduler-name` should match one of the schedulers' `scheduler-name` | ||
- Add a `scheduler-name` to each scheduler. This is done either by hardcoding it or via a command-line argument. The | ||
Kubernetes native scheduler (now the `kube-scheduler` process) would be named `kube-scheduler` | ||
- The `scheduler-name` plays an important part in separating the pods between different schedulers. | ||
Pods are statically dispatched to different schedulers based on `scheduler.alpha.kubernetes.io/name: scheduler-name` | ||
annotation and there should not be any conflicts between different schedulers handling their pods, i.e. one pod must | ||
NOT be claimed by more than one scheduler. To be specific, a scheduler can add a pod to its queue if and only if: | ||
1. The pod has no nodeName, **AND** | ||
2. The `scheduler-name` specified in the pod's annotation `scheduler.alpha.kubernetes.io/name: scheduler-name` | ||
Review comment: "match" sounds a little ambiguous. |
||
matches the `scheduler-name` of the scheduler. | ||
|
||
The only exception is the default scheduler. Any pod that has no `scheduler.alpha.kubernetes.io/name: scheduler-name` | ||
annotation is assumed to be handled by the "default scheduler". In the first version of the multi-scheduler feature, | ||
the default scheduler would be the Kubernetes built-in scheduler with `scheduler-name` as `kube-scheduler`. | ||
The Kubernetes built-in scheduler will claim any pod which has no `scheduler.alpha.kubernetes.io/name: scheduler-name` | ||
annotation or which has `scheduler.alpha.kubernetes.io/name: kube-scheduler`. In the future, it may be possible to | ||
change which scheduler is the default for a given cluster. | ||
|
||
- Dealing with conflicts. All schedulers must use predicate functions that are at least as strict as | ||
the ones that Kubelet applies when deciding whether to accept a pod; otherwise Kubelet and the scheduler | ||
may get into an infinite loop where Kubelet keeps rejecting a pod and the scheduler keeps re-scheduling | ||
it back to the same node. To make it easier for people who write new schedulers to obey this rule, we will | ||
create a library containing the predicates Kubelet uses. (See issue [#12744](https://github.com/kubernetes/kubernetes/issues/12744).) | ||
Review comment: We should also create an API for querying nodes where the pod fits, with an optional limit to the number of (prioritized) nodes returned, as a complement to the approach proposed in #11470.
Review comment: This would be an API exported by every scheduler in the system? BTW I created a separate issue for "guidelines for writing schedulers" (#17208); I think this comment probably belongs there? (I've added it to the list there.)
Review comment: I also think this comment should be placed in the "guidelines for writing schedulers".
Review comment: Maybe I can add another section like "related scheduler design issues" at the end of the proposal to include the many links that have appeared in the comments so far. Some of them are really helpful for readers to understand what will be going on in the future, but they are hard to include as part of this multi-scheduler proposal. Possible ones are: #13580 for scheduler extension. Is that OK?
Review comment: s/16485/16845/
Review comment: right...
Review comment: Yes that sounds like a good idea. I wouldn't bother to try to summarize them in this doc, because it will just be redundant, but pointing to the other issues is a good idea.
Review comment: +1 |
||
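The "at least as strict as Kubelet" rule can be sketched as follows. This is a minimal illustration, not the planned library: `KUBELET_PREDICATES`, `fits_cpu`, `fits_memory`, `has_ssd` and `make_scheduler_filter` are hypothetical names, and the two resource checks are assumed stand-ins for whatever predicates Kubelet actually applies.

```python
def fits_cpu(pod, node):
    return pod["cpu_request"] <= node["free_cpu"]


def fits_memory(pod, node):
    return pod["memory_request"] <= node["free_memory"]


# Hypothetical stand-in for the shared library of predicates Kubelet uses.
KUBELET_PREDICATES = (fits_cpu, fits_memory)


def kubelet_accepts(pod, node):
    return all(predicate(pod, node) for predicate in KUBELET_PREDICATES)


def make_scheduler_filter(extra_predicates=()):
    """Build a node filter that always includes every Kubelet predicate.

    Because the Kubelet predicates are a subset of the filter, any node the
    scheduler accepts is also accepted by kubelet_accepts.
    """
    predicates = KUBELET_PREDICATES + tuple(extra_predicates)

    def scheduler_accepts(pod, node):
        return all(predicate(pod, node) for predicate in predicates)

    return scheduler_accepts


def has_ssd(pod, node):
    # Example of an additional, scheduler-specific constraint (assumed).
    return node.get("labels", {}).get("disk") == "ssd"


my_filter = make_scheduler_filter([has_ssd])
```

A scheduler that dropped `fits_memory` from its filter could keep selecting a node that the Kubelet-side check rejects, which is the loop the rule is meant to prevent. Adding predicates such as `has_ssd` only narrows the set of candidate nodes, so it is always safe.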
|
||
In summary, in the initial version of this multi-scheduler design, we will achieve the following: | ||
|
||
- If a pod has the annotation `scheduler.alpha.kubernetes.io/name: kube-scheduler` or the user does not explicitly | ||
set this annotation in the template, it will be picked up by the default scheduler | ||
- If the annotation is set and refers to a valid `scheduler-name`, it will be picked up by the scheduler with the | ||
specified `scheduler-name` | ||
- If the annotation is set but refers to an invalid `scheduler-name`, the pod will not be picked up by any scheduler. | ||
The pod will remain PENDING. | ||
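The claiming rule summarized above can be written as a single predicate. The sketch below makes two assumptions that the proposal does not settle: "matches" is taken to mean exact string equality, and pods are represented as plain dictionaries shaped like their manifests. `should_claim` is a hypothetical helper, not part of any scheduler code.

```python
ANNOTATION_KEY = "scheduler.alpha.kubernetes.io/name"
DEFAULT_SCHEDULER = "kube-scheduler"


def should_claim(pod, scheduler_name):
    """Return True if the scheduler called scheduler_name may queue this pod."""
    # Condition 1: the pod has no nodeName yet.
    if pod.get("spec", {}).get("nodeName"):
        return False
    annotations = pod.get("metadata", {}).get("annotations") or {}
    requested = annotations.get(ANNOTATION_KEY)
    # Exception: pods without the annotation belong to the default scheduler.
    if requested is None:
        return scheduler_name == DEFAULT_SCHEDULER
    # Condition 2: the annotation names this scheduler (exact equality assumed).
    return requested == scheduler_name
```

Under this rule a pod annotated with a name that no running scheduler uses is claimed by nobody, which is why it stays pending.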
|
||
### An example | ||
|
||
```yaml | ||
kind: Pod | ||
apiVersion: v1 | ||
metadata: | ||
name: pod-abc | ||
labels: | ||
foo: bar | ||
annotations: | ||
scheduler.alpha.kubernetes.io/name: my-scheduler | ||
``` | ||
|
||
This pod will be scheduled by "my-scheduler" and ignored by "kube-scheduler". If there is no running scheduler | ||
named "my-scheduler", the pod will never be scheduled. | ||
|
||
## Next steps | ||
|
||
1. Use an admission controller to add and verify the annotation, and modify it if necessary. For example, the | ||
admission controller might add the scheduler annotation based on the namespace of the pod, and/or identify if | ||
there are conflicting rules, and/or set a default value for the scheduler annotation, and/or reject pods on | ||
which the client has set a scheduler annotation that does not correspond to a running scheduler. | ||
2. Dynamically launch scheduler(s) and register them with the admission controller (as an external call). This also | ||
requires some work on authorization and authentication to control which schedulers can write the /binding | ||
subresource of which pods. | ||
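As a rough illustration of step 1, an admission step might default and validate the annotation as shown below. Everything here is hypothetical: the `admit` function, the `Rejected` exception, the `namespace_defaults` mapping and the set of running schedulers are assumptions made for the sketch, and no such admission plugin is defined by this proposal. For simplicity the sketch validates the final annotation value, whether it was set by the client or filled in by a default.

```python
import copy

ANNOTATION_KEY = "scheduler.alpha.kubernetes.io/name"


class Rejected(Exception):
    """Raised when a pod names a scheduler that is not running."""


def admit(pod, running_schedulers, namespace_defaults, cluster_default="kube-scheduler"):
    """Return a copy of pod with the scheduler annotation defaulted and validated."""
    pod = copy.deepcopy(pod)  # never mutate the caller's object
    metadata = pod.setdefault("metadata", {})
    annotations = metadata.get("annotations") or {}
    metadata["annotations"] = annotations

    requested = annotations.get(ANNOTATION_KEY)
    if requested is None:
        # Default by namespace first, then fall back to the cluster default.
        namespace = metadata.get("namespace", "default")
        requested = namespace_defaults.get(namespace, cluster_default)
        annotations[ANNOTATION_KEY] = requested

    if requested not in running_schedulers:
        raise Rejected(f"no running scheduler named {requested!r}")
    return pod
```

Rejecting at admission time turns the "pod stays pending forever" case into an immediate error for the client, at the cost of requiring the admission step to know which schedulers are running.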
|
||
## Other issues/discussions related to scheduler design | ||
|
||
- [#13580](https://github.com/kubernetes/kubernetes/pull/13580): scheduler extension | ||
- [#17097](https://github.com/kubernetes/kubernetes/issues/17097): policy config file in pod template | ||
- [#16845](https://github.com/kubernetes/kubernetes/issues/16845): scheduling groups of pods | ||
- [#17208](https://github.com/kubernetes/kubernetes/issues/17208): guide to writing a new scheduler | ||
|
||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS --> | ||
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/multiple-schedulers.md?pixel)]() | ||
<!-- END MUNGE: GENERATED_ANALYTICS --> |
Review comment: Remove the last line "as the doc currently does." seems better?