
Support multiple schedulers #11793

Closed
bgrant0607 opened this Issue Jul 24, 2015 · 29 comments

Comments

@bgrant0607
Member

bgrant0607 commented Jul 24, 2015

Forked from #9920, #11470, and other issues.

Kubernetes should support multiple schedulers, including user-provided schedulers. The main thing we need is a way to not apply the default scheduling behavior, perhaps by namespace, using some sort of "admission control" plugin. I want that to be a fairly generic mechanism, since I expect to need it at least for horizontal auto-scaling and vertical auto-sizing as well, if not other things in the future. This is related to the "initializers" topic (#3585).

A custom scheduler would then need to be configured for which pods to schedule. This probably requires adding information of some form to the pods. We should take that into account when thinking about the general initializer mechanism.

The custom scheduler should then be able to watch all pods and nodes to keep its state up to date, but I imagine we could add more information, events, or something to make this more convenient, as requested in #1517.

cc @davidopp

@HaiyangDING

Contributor

HaiyangDING commented Jul 24, 2015

Hi, I am glad to see that the multiple-scheduler feature is on the table.

A while ago I started designing a plan for multiple schedulers, and it is now being verified and tested. I will show it to the community as soon as the verification is done (within a couple of days).

Anyway, I would really like to be part of implementing the multiple-scheduler feature.

/cc myself

Cheers

@davidopp

Member

davidopp commented Jul 27, 2015

@HaiyangDING looking forward to seeing what you did.

@bgrant0607 I think it's hard to know exactly how everyone will use namespaces, but I think we want something in addition to namespace to determine which scheduler is responsible (for example, a field on each Pod). A scheduler would be configured to pay attention to a particular namespace+scheduler_name combo. There's also the question of the trust model for schedulers, e.g. do we assume all the schedulers are trusted, or is there some mechanism by which the API server determines whether to accept a binding for a particular Pod from a particular scheduler?

@HaiyangDING

Contributor

HaiyangDING commented Jul 28, 2015

@bgrant0607 @davidopp

Hi, I created a PR to implement multi-scheduler support. This PR only implements the fundamental functionality to enable multiple algorithm providers and allow a pod to specify its preferred algorithm provider. It works but is far from perfect; however, I think it is a good start.

Please leave your comments so we can discuss with the community which further features should be added.

Regards,

@bgrant0607

Member

bgrant0607 commented Jul 30, 2015

More about the motivation for this issue:

We should strive to make the built-in scheduler make higher-quality decisions for a wider variety of workloads in a wider variety of cluster configurations. However, there will always be special cases that are unaddressed by the standard scheduler. Today, one can disable the standard scheduler and run a different one, but there is no way to run multiple schedulers simultaneously, with each scheduling a particular segment of the workload population.

What is needed is a way for each scheduler to know which pods it should schedule.

Potentially schedulers could look at various pod attributes in order to make this decision, but in that case they should all agree which scheduler is actually responsible. For this approach, I'm thinking more along the lines of attributes like quality of service, expected duration, etc. than labels or annotations. For instance, one might want to schedule short-duration, best-effort pods using a fast, approximate worst-fit algorithm, and use a more precise, exhaustive best-fit algorithm for long-lived, guaranteed pods.

An extreme example of such an approach would be to explicitly specify the scheduler or scheduling policy desired.

@bgrant0607

Member

bgrant0607 commented Jul 30, 2015

To be more clear: This issue is gated on developing a proposal to create a general-purpose way to defer and/or delegate asynchronous initialization behavior, such as scheduling or auto-scaling.

@HaiyangDING

Contributor

HaiyangDING commented Jul 31, 2015

@bgrant0607 @davidopp

Hi, I think I'll have to reconsider the multi-scheduler design for now, and hope to have a proposal soon. Before that, I will try to "make the built-in scheduler make higher-quality decisions for a wider variety of workloads in a wider variety of cluster configurations".

Regards,

@HaiyangDING HaiyangDING referenced this issue Jul 31, 2015

Closed

NeverMind #12063

@davidopp davidopp assigned davidopp and unassigned davidopp Aug 3, 2015

@timothysc

Member

timothysc commented Aug 3, 2015

One thing I think is missing is the notion of allowing greedy schedulers to use the unused capacity.

@AnanyaKumar

Contributor

AnanyaKumar commented Aug 12, 2015

Maybe something along the lines of what @bgrant0607 suggested might work? Have a "scheduler allocator" and a bunch of "schedulers". Each scheduler has a function canSchedulePod(pod) that returns true if the scheduler is responsible for the pod. The scheduler allocator uses the function canSchedulePod(pod) to figure out which scheduler to send the pod to. If canSchedulePod(pod) returns true for multiple schedulers, then the pod will be sent to one of these schedulers (the specific scheduler is implementation defined).

@davidopp I think we should trust the schedulers, since they aren't API objects that can be arbitrarily added. Only cluster administrators should be able to add/edit/remove schedulers.

@bgrant0607 Why is the issue gated on a general purpose way to defer initialization behavior? What do you mean by deferring asynchronous initialization behavior?

Please feel free to ignore this, since I know you guys have way more experience with scheduling than I do!
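
To make the dispatch idea above concrete, here is a minimal Go sketch of such a "scheduler allocator". The Pod and Scheduler types are illustrative placeholders, not the real Kubernetes API, and the tie-breaking rule (registration order) is just one possible choice.

```go
// Hypothetical sketch of the "scheduler allocator" idea: each registered
// scheduler reports whether it is responsible for a pod, and the allocator
// dispatches the pod to the first scheduler that claims it.
package schedalloc

import "fmt"

// Pod is a simplified stand-in for the Kubernetes Pod object.
type Pod struct {
	Name     string
	QoSClass string // e.g. "Guaranteed", "Burstable", "BestEffort"
}

// Scheduler is the interface each pluggable scheduler would implement.
type Scheduler interface {
	Name() string
	CanSchedulePod(pod Pod) bool
	Schedule(pod Pod) error
}

// Allocator holds the registered schedulers in registration order.
type Allocator struct {
	schedulers []Scheduler
}

// Dispatch sends the pod to the first scheduler that claims responsibility.
// If several schedulers claim it, the choice here is registration order
// (implementation-defined); if none claim it, an error is returned.
func (a *Allocator) Dispatch(pod Pod) error {
	for _, s := range a.schedulers {
		if s.CanSchedulePod(pod) {
			return s.Schedule(pod)
		}
	}
	return fmt.Errorf("no scheduler is responsible for pod %q", pod.Name)
}
```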

@HaiyangDING

Contributor

HaiyangDING commented Aug 17, 2015

Some thoughts on multiple schedulers

I have a few thoughts on multiple schedulers here, and it would be really great if you could provide some feedback so that we can get to a proposal (or something like it) sooner. Two things in this post:

  1. How to distinguish Pods
  2. How does each scheduler behave

Distinguishing Pods

The purpose of separating the Pods is to make sure that:

  1. any unscheduled Pod is scheduled by one of the multiple schedulers (none is left over)
  2. no Pod is ever scheduled by two or more schedulers

Since the QoS proposal #11713 has been accepted, it is natural to distinguish Pods according to their QoS classes: Guaranteed, Burstable and Best-Effort. According to the QoS proposal, the QoS class of a Pod can be inferred from its resource specification:

  • 0 < request = limit: Guaranteed
  • 0 < request < limit: Burstable
  • 0 = request: Best-Effort

Once the QoS class of a Pod is determined (QoSClass), different classes of Pods could be scheduled by different schedulers. This is along the lines of what @bgrant0607 has proposed in the comments above.

Currently, the Kubernetes scheduler "fetches" the Pods with PodHost=="" and queues them to be scheduled one by one. In the multiple-scheduler scenario, each scheduler would "fetch" Pods using both PodHost and QoSClass. For example, the scheduler for Guaranteed Pods would fetch the Pods with PodHost=="" and QoSClass==Guaranteed, then choose a node to host each Pod and post the binding to the API server.
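
As a rough illustration of this per-QoS-class "fetch", here is a small Go sketch. The Pod type, the single-resource Request/Limit fields, and the string class names are simplifications for illustration; the real QoS inference is per resource and per container.

```go
// Illustrative sketch of splitting unscheduled Pods by inferred QoS class.
package qossplit

// Pod is a simplified stand-in: one resource, one request, one limit.
type Pod struct {
	Name     string
	Request  int64
	Limit    int64
	NodeName string // empty means the Pod is not yet scheduled (PodHost == "")
}

// qosClass follows the rules quoted above from the QoS proposal.
func qosClass(p Pod) string {
	switch {
	case p.Request > 0 && p.Request == p.Limit:
		return "Guaranteed"
	case p.Request > 0 && p.Request < p.Limit:
		return "Burstable"
	default: // Request == 0
		return "BestEffort"
	}
}

// pendingPodsFor returns the unscheduled Pods belonging to the given class,
// i.e. the set a class-specific scheduler would "fetch" and queue.
func pendingPodsFor(all []Pod, class string) []Pod {
	var out []Pod
	for _, p := range all {
		if p.NodeName == "" && qosClass(p) == class {
			out = append(out, p)
		}
	}
	return out
}
```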

Behaviors of schedulers

In our scope, the scheduler for Guaranteed Pods should make high-quality decisions across a variety of cluster configurations, while the scheduler for Best-Effort Pods should make fast decisions.

Scheduler for Guaranteed/Burstable Pods

Personally, I think high-quality decision making should apply to both Guaranteed and Burstable Pods. High-quality decisions require rich cluster information. In addition to the current Kubernetes scheduling policies, the following features could be added:

  • Number of Guaranteed/Burstable/Best-Effort Pods on the Node: We may want to spread Pods of different QoS classes across the cluster.
    • Spreading Guaranteed/Burstable Pods could help those Pods achieve better performance and tolerate Node failures.
    • Spreading Best-Effort Pods could reduce the number of them that get killed when the workload rises on a given Node.
  • Ratio between different classes of Pods: We may want to maintain a reasonable ratio between Guaranteed/Burstable and Best-Effort Pods on each Node.
  • Number of Best-Effort Pods to be removed: When the entire cluster is running at high workload, it is likely that deploying a Guaranteed Pod would result in killing some Best-Effort Pods on the destination Node (although the killing is performed by the Kubelet later). The scheduler should try to minimize the number of Best-Effort Pods to be killed if possible.

These ideas are inspired by the paper of Borg system. Many thanks to Google!

There are also other issues related to enriching the scheduler policies.

Scheduler for Best-Effort Pods

To improve the speed of the scheduler for Best-Effort Pods, the following could be considered (a small sketch follows this list):

  • Instead of using all nodes in the cluster, only consider a subset of them as potential destinations; say, randomly include only 1/3 of the total Nodes in the scheduler's NodeLister
  • Use as few predicate and priority functions as possible
  • Use a worst-fit or first-fit algorithm
    • A worst-fit algorithm keeps Best-Effort Pods from competing with Guaranteed/Burstable requirements
    • A first-fit algorithm makes even quicker decisions (no priority functions)
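
As a sketch of that fast path (random node subset plus first-fit, no priority functions), under the assumption of a single simplified resource and illustrative types:

```go
// Sketch of a fast first-fit pass over a random ~1/3 sample of the nodes.
// Node, Pod and the single Free/Request resource are illustrative.
package fastpath

import (
	"errors"
	"math/rand"
)

type Node struct {
	Name string
	Free int64 // free capacity, simplified to a single resource
}

type Pod struct {
	Name    string
	Request int64
}

// firstFit shuffles the nodes, keeps roughly one third of them, and returns
// the first sampled node whose free capacity covers the pod's request.
func firstFit(pod Pod, nodes []Node) (Node, error) {
	sample := make([]Node, len(nodes))
	copy(sample, nodes)
	rand.Shuffle(len(sample), func(i, j int) { sample[i], sample[j] = sample[j], sample[i] })
	sample = sample[:(len(sample)+2)/3] // about 1/3 of the nodes

	for _, n := range sample {
		if n.Free >= pod.Request {
			return n, nil
		}
	}
	return Node{}, errors.New("no sampled node fits the pod")
}
```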

Conflicts

Scheduler processes running in parallel will eventually face decision conflicts. To the best of my knowledge, there is no perfect solution: while one scheduler is making a decision, the cluster state may be changed by another scheduler, so the decision may no longer be the best one for the Pod.

To this end, I think the idea from Omega (again, thanks to Google's paper) is a sound one: even if the scheduling decision is no longer the best fit, we carry it out despite the change in cluster (Node) state. However, if the Kubelet later finds it impossible to carry out (there is a good chance this does not happen), the Pod is put back into the queue.
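
A minimal sketch of that optimistic flow, assuming an illustrative Binder interface and requeue hook rather than the real API-server client:

```go
// Sketch of optimistic binding: act on a possibly stale view of the cluster
// and re-queue the Pod if the binding is later rejected.
package optimistic

type Pod struct{ Name string }

// Binder posts a Pod->Node binding; it returns an error if the binding is
// rejected, e.g. because the chosen node can no longer host the Pod.
type Binder interface {
	Bind(pod Pod, node string) error
}

// scheduleOne tries to bind without waiting for a globally consistent view.
// On failure the Pod simply goes back onto the scheduling queue for a retry.
func scheduleOne(pod Pod, node string, b Binder, requeue func(Pod)) {
	if err := b.Bind(pod, node); err != nil {
		requeue(pod)
	}
}
```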

At last

First, you have my gratitude if you have read this far...

This post certainly does not cover everything needed to implement multiple schedulers. @bgrant0607 has mentioned in other issues that significant changes to the scheduler are being planned. My colleagues and I would really like to take part in improving/rebuilding the Kubernetes scheduler. This post is just some of our thoughts on multiple schedulers; what do you think of them?

@bgrant0607 @davidopp

PS: please ping anyone else as you see fit.

@hurf

Contributor

hurf commented Aug 18, 2015

In my mind, there are two kinds of requirements for schedulers. One is the multi-scheduler we're talking about here, which allows multiple scheduler processes to run in parallel. The other is what @HaiyangDING did in #11921: pods are still scheduled serially, but we can assign an algorithm to the scheduler for each job.

Though the second requirement can be satisfied by using multi-scheduler, the cost is too high: for each algorithm that decides how a pod will be deployed (I will call it a 'policy' below), one scheduler process is started. What about 20 or more policies? That is a real requirement from customers, and in this case it is unnecessary for multiple scheduler processes to work in parallel.

So I think we should have two plug-in points for the scheduler. One is for schedulers: we can customize schedulers and run them in separate processes. The other is for policies: we can customize policies and register them with a single scheduler, and users can assign the policy the scheduler should use for a particular job. There is no conflict in the implementation of these two plug-in points; we can have both under a unified structure.

@bgrant0607 @davidopp

@davidopp

Member

davidopp commented Aug 20, 2015

> This issue is gated on developing a proposal to create a general-purpose way to defer and/or delegate asynchronous initialization behavior, such as scheduling or auto-scaling.

@bgrant0607 I'm not sure I agree with this. We could start by just adding a scheduler_name field to pods. Then later we could build an admission controller that sets this field based on the kinds of characteristics of the Pod you mentioned (and address the sequencing/dependency issues related to having multiple admission controllers that you alluded to).

@HaiyangDING Thanks for your comments. I think we should defer discussing what scheduling policies different schedulers would use, and first settle on the mechanism we use to support multiple schedulers. So I will only comment on one aspect of your proposal:

  • You suggested to configure each scheduler with rules about which Pods it is responsible for. I think it is better to have the policy in one place (e.g. an admission controller), and have the schedulers be "dumb" in the sense of just looking for Pods explicitly labeled with their name. This makes it easier to configure and understand (look at one config vs. many). It also makes it possible to manually or automatically enforce rules like "exactly one scheduler is responsible for scheduling any type of Pod." The correctness of the schedulers should not depend on avoiding overlapping schedulers, but it is still better to be able to avoid this situation (and it is definitely good to be able to avoid accidentally configuring things such that no scheduler is responsible for some kinds of Pods).

@hurf I think we already support the policy plug-in mechanism you're describing. You can add priority functions to the scheduler today that look at various characteristics of the Pod. So, for example, you can have a different spreading policy for best-effort Pods vs. burstable Pods vs. guaranteed Pods, by looking at the request and limit. If we add an explicit scheduler name field to Pods, this mechanism would continue to work; the scheduler name would just mean something more like "scheduler policy" than a scheduler name, if there is only one scheduler with multiple policies in it.
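
As a sketch of the "dumb scheduler" rule described above, assuming a hypothetical SchedulerName value stamped on the Pod by an admission controller (the actual field or annotation name was still undecided at this point):

```go
// Sketch of a scheduler deciding responsibility purely from an explicit name
// set on the Pod by an admission controller. Field names are hypothetical.
package dumbsched

type Pod struct {
	Name          string
	SchedulerName string // e.g. "default" or "my-custom-scheduler"
}

// responsibleFor is the only ownership logic a scheduler needs: it claims a
// Pod only if the Pod is explicitly marked with this scheduler's name, with
// an unset value treated as belonging to the default scheduler.
func responsibleFor(self string, p Pod) bool {
	if p.SchedulerName == "" {
		return self == "default"
	}
	return p.SchedulerName == self
}
```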

@HaiyangDING

Contributor

HaiyangDING commented Aug 21, 2015

@davidopp

Hi, thank you for the comments. I agree with you that we should add a scheduler_name to pods, which is simple and clear. Regarding the policies for different schedulers, let us discuss them later; maybe I will have some new ideas by then.

Anyway, I am looking forward to working out a plan for multiple schedulers. Regarding the "multiple policies" mentioned by @hurf, I will summarize something later on.

@bgrant0607

Member

bgrant0607 commented Aug 26, 2015

I posted some thoughts on initializers recently: #3585 (comment)

@bgrant0607

Member

bgrant0607 commented Aug 26, 2015

I agree with the approach of using an admission-control plugin to select a scheduler. I don't want to bake the policy into the system, and I want to make it possible for users to specify a scheduler in the case that they are running their own.

Resource QoS is per-resource, per-container, so choosing a scheduler based on QoS isn't as straightforward as one might like, though it is true that criteria for best-effort resources and the other resource classes are pretty different, since ideally best-effort resources would be based on metrics derived from observed usage, while guaranteed and burstable resources need to be based primarily on reservations.

However, selection of a scheduling strategy is not the same as selection of a scheduling component. The former is driven by properties of the workload (QoS, duration, etc.), whereas the latter would be driven by the need to extend the system without recompiling it.

There is also a proposal for a more limited extension mechanism for plugging fit predicates and prioritization criteria: #11470. But, yes, the high-performance plugin mechanism is already supported, as @davidopp described.

As far as how to specify which scheduler, we're going to have the same issue with at least vertical auto-scaling, so we need a solution that addresses both, hence the initializers proposal. I think the general solution is not much harder, at least implemented just for pods (and pod templates), which is the primary use case for initializers.

@davidopp

Member

davidopp commented Aug 26, 2015

> As far as how to specify which scheduler, we're going to have the same issue with at least vertical auto-scaling, so we need a solution that addresses both, hence the initializers proposal. I think the general solution is not much harder, at least implemented just for pods (and pod templates), which is the primary use case for initializers.

If by "auto-scaling" you mean "setting initial resource request" and by "same issue" you mean "a way to set fields of the Pod.Spec during the admission process" then I agree. Otherwise can you explain what you mean?

BTW is there an issue on initializers? I know we've talked about it, but I couldn't find an issue. My mental model has been "using an admission control plugin to set field(s) of a Pod.Spec, with some thought having gone into how to chain such plugins together." Is there more to it?

@timothysc

Member

timothysc commented Aug 26, 2015

IMHO I would much prefer a well-thought-out design doc on the topic, so folks could hash over the enumerated design options.

Right now there appears to be a separation between submission + admission control and scheduling. But typically, on other systems, each elastic application/scheduler performs its own resource requests, or scheduling is left up to the applications. Their weight is based on the application itself when it registers: when application (X) registers, an administrator has already defined its role, which indirectly implies its QoS, and all submissions will "stay in their lane".

@davidopp

Member

davidopp commented Aug 26, 2015

@timothysc IIUC you're contrasting the Mesos model with the Borg model?

I think we want to support both models. Google has found a lot of success with the Borg model--in Borg, everything from user-facing front-ends, to data processing frameworks like MapReduce and MillWheel, to storage systems like MegaStore and Spanner, to infrastructure services are scheduled by the same scheduler. Borg has a quota mechanism applied at admission time, a priority scheme applied at scheduling time, and a node-level QoS scheme applied at runtime, that together implement policy-driven sharing of the cluster and nodes. To date, the way we've been thinking about Kubernetes has been very similar to Borg.

But I think you can get almost all of the way to the Mesos model in Kubernetes by using multiple schedulers and (maybe) multiple controllers (e.g. one scheduler per framework, and maybe one controller per framework). Or, do a monolithic controller+scheduler like we're doing for the Daemon Controller, and have one of those per framework. The only thing that's left is how to keep Pods from different frameworks "in their lane." I think we can adapt the mechanisms we already have like quota and LimitRange to set the kind of "weight" you're talking about, although it would need to be making decisions based on which framework your Pods belong to, rather than (or in addition to) your username and namespace.

The one piece we're missing that is needed for both models is a preemption scheme so that we can share the cluster according to the desired policy even when resource demand exceeds supply.

Anyway, I agree a design doc would be useful.

@HaiyangDING

Contributor

HaiyangDING commented Aug 27, 2015

+1 for one scheduler (maybe controller) per framework.

also, +1 for a design doc.

@bgrant0607 bgrant0607 added this to the v1.2-candidate milestone Sep 9, 2015

@HaiyangDING

Contributor

HaiyangDING commented Sep 10, 2015

How is the multiple-scheduler work going? I see that it is labeled v1.2-candidate. Our team is working on a design for multiple schedulers and hopes to present a proposal soon.

@cameronbrunner

Contributor

cameronbrunner commented Sep 14, 2015

A bit late but a +1 on the scheduler_name attribute as part of the solution.

I would like to run my custom scheduler as a pod on kubernetes using the stock scheduler to place my custom scheduler pods. My custom scheduler would then handle the rest of the pods in the system. Having a scheduler_name field combined with a way of setting a default value (perhaps implemented as an admission controller) would elegantly solve my use case.

@bgrant0607

Member

bgrant0607 commented Oct 15, 2015

@HaiyangDING @cameronbrunner We're considering this for the 1.2 timeframe.

@cameronbrunner

Contributor

cameronbrunner commented Oct 15, 2015

@bgrant0607 Thanks for the update. I implemented a very simple version where I can limit a scheduler process to a single namespace, to prove out this feature for my use case, and was quite happy with it. I would much rather use a general-purpose solution though.

@brendandburns

Contributor

brendandburns commented Oct 22, 2015

@cameronbrunner I'd like to add an annotation experimental.kubernetes.io/scheduler that you can put on both a namespace and a pod.

If it's on a namespace and not set to default, all pods in that namespace will be ignored by the default scheduler. If it is on a pod and not set to default, that pod will be ignored by the default scheduler.

I'm planning on putting together two PRs to do this, unless you already have an implementation.
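
A small sketch of that skip rule, assuming the pod's and its namespace's annotation maps are both available to the default scheduler; the annotation key mirrors the one proposed above:

```go
// Sketch of the default scheduler's ignore rule for the proposed annotation.
package defaultignore

const schedulerAnnotation = "experimental.kubernetes.io/scheduler"

// ignoredByDefaultScheduler is true when either the pod or its namespace
// carries the annotation with a value other than "default".
func ignoredByDefaultScheduler(podAnnotations, nsAnnotations map[string]string) bool {
	if v, ok := podAnnotations[schedulerAnnotation]; ok && v != "default" {
		return true
	}
	if v, ok := nsAnnotations[schedulerAnnotation]; ok && v != "default" {
		return true
	}
	return false
}
```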

@cameronbrunner

Contributor

cameronbrunner commented Oct 22, 2015

@brendandburns I don't have an implementation that does exactly what you are describing as mine simply updated the unassigned pods listwatch in the scheduler factory to only look at one namespace. It sounds like your implementation will still require a watch on all namespaces and then filter out ones that match specific labels.

It sounds like your implementation will fully meet my needs though!
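
For reference, a namespace-restricted list/watch along the lines cameronbrunner describes would look roughly like this with today's client-go packages (the original 2015 change predates client-go and would have used the in-tree equivalents):

```go
// Sketch of a list/watch limited to unassigned pods in a single namespace.
package nsscoped

import (
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// unassignedPodsInNamespace watches only pods with spec.nodeName == "" in the
// given namespace, instead of watching unassigned pods cluster-wide.
func unassignedPodsInNamespace(client kubernetes.Interface, namespace string) *cache.ListWatch {
	selector := fields.OneTermEqualSelector("spec.nodeName", "")
	return cache.NewListWatchFromClient(client.CoreV1().RESTClient(), "pods", namespace, selector)
}
```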

@HaiyangDING

Contributor

HaiyangDING commented Oct 23, 2015

@bgrant0607 @brendandburns

We are working on an implementation of multiple schedulers. One of our plans is to separate pods based on namespace, which I think is the same as what @brendandburns mentioned earlier.

However, I do not fully understand "on a pod". Does this mean the pods are separated according to their type (labels and/or other properties)?

CC @hurf

@bgrant0607

Member

bgrant0607 commented Oct 23, 2015

We're planning to support this properly in 1.2. However, I'm not opposed to starting with such an annotation provided:

  • We should not cherrypick it into 1.1.
  • We're free to remove support for the annotation at any time.
  • The scheduler only looks at pod annotations, not namespaces.
  • If you want to implement the namespace annotation, it would be consumed by an optional admission-control plugin, which would then automatically apply annotations to pods within the namespace.
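
A sketch of that last bullet, i.e. an admission-time step that copies a namespace-level scheduler annotation onto pods so that schedulers themselves only ever look at pod annotations. The annotation key reuses the one proposed earlier in this thread and is only illustrative; plain maps stand in for the real object metadata.

```go
// Sketch of an admission-control step defaulting a pod's scheduler annotation
// from its namespace.
package admitsched

const schedulerAnnotation = "experimental.kubernetes.io/scheduler"

// applyNamespaceDefault copies the namespace's scheduler annotation onto the
// pod unless the pod already sets one; the pod-level value always wins.
func applyNamespaceDefault(podAnnotations, nsAnnotations map[string]string) map[string]string {
	if podAnnotations == nil {
		podAnnotations = map[string]string{}
	}
	if _, set := podAnnotations[schedulerAnnotation]; set {
		return podAnnotations
	}
	if v, ok := nsAnnotations[schedulerAnnotation]; ok {
		podAnnotations[schedulerAnnotation] = v
	}
	return podAnnotations
}
```
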
@bgrant0607

Member

bgrant0607 commented Oct 23, 2015

Also, annotation conventions are documented here:
https://github.com/kubernetes/kubernetes/blob/master/docs/devel/api-conventions.md#label-selector-and-annotation-conventions

The annotation would be of the form scheduler.alpha.kubernetes.io/foo-bar or scheduling.... Specifying a scheduler "name" would be the simplest approach and a reasonable starting point, though we'll likely want to augment that with a more intent-oriented scheduling policy eventually.

@HaiyangDING

Contributor

HaiyangDING commented Oct 23, 2015

@bgrant0607

That is of great help, I will look into that.

Thanks.

@davidopp

Member

davidopp commented Feb 4, 2016

I'm going to close this issue, as the core functionality has been implemented. The MetadataPolicy design is still under discussion (#18262).

@davidopp davidopp closed this Feb 4, 2016

@spzala spzala referenced this issue Jul 26, 2017

Open

Create a guide to writing a new scheduler #4517

1 of 2 tasks complete