[WIP/RFC] Rescheduling in Kubernetes design proposal #22217

Merged
merged 1 commit on Jul 10, 2016

Conversation

@davidopp
Member

davidopp commented Feb 29, 2016

Proposal by @bgrant0607 and @davidopp (and inspired by years of discussion and experience from folks who worked on Borg and Omega).

This doc is a proposal for a set of inter-related concepts related to "rescheduling" -- that is, "moving" an already-running pod to a new node in order to improve where it is running. (Specific concepts discussed are priority, preemption, disruption budget, quota, /evict subresource, and rescheduler.)

Feedback on the proposal is very welcome. For now, please stick to comments about the design, not spelling, punctuation, grammar, broken links, etc., so we can keep the doc uncluttered enough to make it easy for folks to comment on the more important things.

ref/ #22054 #18724 #19080 #12611 #20699 #17393 #12140 #22212

@HaiyangDING @mqliang @derekwaynecarr @kubernetes/sig-scheduling @kubernetes/huawei @timothysc @mml @dchen1107

@k8s-bot


k8s-bot commented Feb 29, 2016

GCE e2e build/test passed for commit a543cfe.

Kubernetes will terminate a pod that is managed by a controller, and the controller will
create a replacement pod that is then scheduled by the pod's scheduler. The terminated
pod and replacement pod are completely separate pods, and no pod migration is
implied. However, describing the process as "moving" the pod is approximately accurate
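
To make the replacement semantics concrete, here is a minimal, self-contained sketch (simplified stand-in types, not the real controller code) of why a "moved" pod is really a brand-new pod created by the controller's reconcile loop and scheduled from scratch:

package main

import "fmt"

// Controller is a simplified stand-in for a replication controller; not the real Kubernetes type.
type Controller struct {
    DesiredReplicas int
    RunningPods     []string
}

// reconcile illustrates the point above: after a pod is terminated, the controller
// only notices that it is short of replicas and creates a completely separate
// replacement pod, which the scheduler then places like any other pending pod.
// No state moves from the old pod to the new one.
func (c *Controller) reconcile() {
    for i := len(c.RunningPods); i < c.DesiredReplicas; i++ {
        c.RunningPods = append(c.RunningPods, fmt.Sprintf("replacement-%d", i)) // new pod object, scheduled afresh
    }
}

func main() {
    c := &Controller{DesiredReplicas: 3, RunningPods: []string{"pod-a", "pod-b"}} // one pod was just evicted
    c.reconcile()
    fmt.Println(c.RunningPods) // [pod-a pod-b replacement-2]
}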

@mml

mml Mar 1, 2016

Contributor

Given that we have a pretty tight container abstraction and live migration has come a long way, should we consider whether live container (pod) migration isn't a better way to go? At the least, it probably deserves an "alternatives considered" entry.

@quinton-hoole

quinton-hoole Mar 1, 2016

Member

We (and the user) would still need to deal with the node failure case, where live migration is impossible. So it becomes a matter of how many terminations the client sees, not whether they see them. Unless the reduction in the number of terminations is dramatic and crucial (which I don't think it is), I think that the consistency of termination/failure semantics wins here (i.e. pods always terminate and replacements are created, rather than pods sometimes moving, and sometimes dying and being replaced).

@derekwaynecarr

derekwaynecarr Mar 1, 2016

Member

I would prefer we avoid live container (pod) migration, and I would imagine the rescheduler will continue to respect graceful termination.

@mqliang

mqliang Mar 2, 2016

Member

+1 for avoiding live migration. Currently, pods in k8s are stateless; we save data in a PersistentVolume, so live migration is less meaningful.

@HaiyangDING

HaiyangDING Mar 2, 2016

Contributor

+1 for avoiding live migration.

@bgrant0607

bgrant0607 Mar 7, 2016

Member

This proposal doesn't preclude migration, and this functionality would be required in order to implement migration.

However, migration would be a lot more work than what is described here.

Migration is covered by #3949.

@davidopp

davidopp Mar 7, 2016

Member

Right. More generally, this proposal is an attempt to define an initial version and short-term roadmap, not the entire design space or long-term ideas. Once live migration of containers is available (the containerd docs seem to imply it will be soon? https://github.com/docker/containerd ) I would assume Kubernetes will take advantage of it.

@bgrant0607

bgrant0607 Apr 12, 2016

Member

@hurf Persisted state can still be "migrated" via persistent volume claim or flocker. Presumably the stateful applications have to be able to deal with restart after failure, so migrating the in-memory state shouldn't be strictly necessary.

@hurf

hurf Apr 12, 2016

Contributor

Yes, that's what we do now. One problem is that dealing with failure can sometimes lower a service's performance indicators (that doesn't mean we won't deal with failure, but we try to reduce it). Especially in the rescheduler case, a failure may be caused not by the service itself but by an eviction (unless we give it a disruption budget of none). If we had in-memory migration, the pod could be rescheduled without breaking its ongoing task. It would free up more pods and loosen the disruption budget restriction. It's not a necessary feature, but it would be a useful optimization.

@hurf

hurf Apr 12, 2016

Contributor

I'm not asking for live migration. It's a good thing but not urgent.

Although we could put the responsibility for checking and updating disruption budgets
solely on the client, it is safer and more convenient if we implement that functionality
in the API server. Thus we will introduce a new `/evict` subresource on pod. It is similar to

@mml

mml Mar 1, 2016

Contributor

"Evict" isn't nouny, which made me think maybe it should be called "eviction". At that point I thought, "why not just DELETE the /binding subresource?" The semantics of such a request are clear and it avoids creating another subresource.

@jiangyaoguo

jiangyaoguo Apr 15, 2016

Member

@davidopp I'm confused about what the /evict subresource will do in the rescheduler mechanism. IIUC, the reschedule initiator will delete the pod and the controller is responsible for creating the replacement pod. Will the /evict subresource delete the pod? If so, POST or PUT /pod/xxx/evict will lead to deletion of pod xxx and all subresources of pod xxx, including /evict itself (though there's no real subresource object). It's a little semantically strange.

@davidopp

davidopp Apr 15, 2016

Member

Yes, DELETE of /evict subresource should delete the pod (if the API server allows it, i.e. if DisruptionBudget is satisfied). Other operations (POST, PUT, etc.) on /evict subresource are not supported. Let's continue the discussion in #24321 (I filed that issue just now because the comments on this design doc are getting very cluttered :) )
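
To make the intended flow concrete, here is a rough, hypothetical sketch of how a rescheduler-like client might call such a subresource. The URL path, verb, and failure behavior are illustrative only; pinning them down is exactly what this proposal and #24321 are for.

package sketch

import (
    "fmt"
    "net/http"
)

// evictPod issues a DELETE against a pod's hypothetical /evict subresource.
// The API server is expected to delete the pod only if the pod's
// DisruptionBudget is satisfied; otherwise it refuses the request.
func evictPod(apiServer, namespace, name string) error {
    url := fmt.Sprintf("%s/api/v1/namespaces/%s/pods/%s/evict", apiServer, namespace, name)
    req, err := http.NewRequest(http.MethodDelete, url, nil)
    if err != nil {
        return err
    }
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    if resp.StatusCode >= 300 {
        // e.g. the disruption budget does not currently allow another eviction
        return fmt.Errorf("eviction of %s/%s refused: %s", namespace, name, resp.Status)
    }
    return nil
}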

@SrinivasChilveri

Member

SrinivasChilveri commented Mar 1, 2016

@mqliang

Member

mqliang commented Mar 1, 2016

@bgrant0607 @davidopp This proposal is huge. May I volunteer to implement part of it? I am really interested in:

  1. Priority and Preemption. I have a proposal about Preemption in #22054, but without disruption budgets. And I have a proposal about "the order in which a scheduler examines pods in its scheduling loop" in #20203
  2. One feature of the Rescheduler: moving a pod onto an under-utilized node. I'd like to implement this using the idea of "Pod Stealing", which I described in #22054
@davidopp

Member

davidopp commented Mar 1, 2016

Let's use the scheduling SIG mailing list and meetings to discuss who is interested in implementing the various pieces. But let's make sure we converge on the design for a piece before building it. :)

TBD: In addition to `PodSpec`, where do we store pointer to disruption budget
(podTemplate in controller that managed the pod?)? Do we auto-generate a disruption
budget (e.g. when instantiating a Service), or require the user to create it manually
before they create a controller? Which objects should return the disruption budget object

@mqliang

mqliang Mar 1, 2016

Member

I think we could use an admission controller to auto-generate a disruption budget. Users could enable such an admission controller so that when they create a Service, a disruption budget is auto-generated, or they can disable it and create the disruption budget manually.
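
For illustration only, a sketch of what such a defaulting admission plugin could produce. None of the names below are settled API; they are placeholders for whatever shape the disruption budget object ends up taking.

package sketch

// DisruptionBudgetSpec is a placeholder shape, not the proposed API.
type DisruptionBudgetSpec struct {
    // Selector picks the pods this budget protects.
    Selector map[string]string
    // MinAvailable is how many selected pods must remain up while
    // voluntary disruptions (evictions) are in progress.
    MinAvailable int
}

// defaultBudgetForService shows what an admission plugin might auto-generate when a
// Service is created and no budget already covers its selector. Users who disable
// the plugin would create the budget manually instead.
func defaultBudgetForService(selector map[string]string, replicas int) DisruptionBudgetSpec {
    minAvailable := replicas - 1 // tolerate one voluntary disruption at a time by default
    if minAvailable < 0 {
        minAvailable = 0
    }
    return DisruptionBudgetSpec{Selector: selector, MinAvailable: minAvailable}
}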

@hurf

hurf Mar 3, 2016

Contributor

We can do it just like LimitRange: users can create one, but if they don't specify it, apply a default one.
Question: what's the behavior if this admission controller is disabled? Can all of the pods be disrupted, or none?

evicted pod will reschedule. The evicted pod may go pending, consuming one unit of the
corresponding shard strength disruption budget indefinitely. By using the `/evict`
subresource, the rescheduler ensures that the evicted pod has sufficient budget to go
and stay pending. We expect future versions of the rescheduler may be

@davidopp

davidopp Mar 1, 2016

Member

Thinking about it more, I'm not convinced we can really get away with having the rescheduler not know at least the required predicate functions, even for the first version of the rescheduler.

Take for example the "move pods onto under-utilized nodes" use case. It's important to know whether an evicted pod will actually fail to reschedule onto any of those nodes--for example, those nodes might be under-utilized because they're being drained and are marked unschedulable, or maybe they have some kind of taint due to being in a dedicated node group or otherwise restricted to a limited set of pods. In such cases the eviction is pointless, as the evicted pod will not move onto any of those nodes. On the other hand, maybe the nodes were just added by cluster auto-scale-up, in which case they will be feasible for the pod you're considering moving. It seems important to be able to distinguish these two cases.

This is a slightly different argument than saying the rescheduler needs the predicate functions so that it can know whether an evicted pod will go pending. I think the argument that we don't really care whether the pod goes pending is reasonable, certainly for a first version of the rescheduler. The argument here is that the rescheduler needs the predicate functions so it can know whether the node(s) that seem better for the pod in question are actually feasible for that pod. If they're not, the pod might not end up on a node that is actually better.
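
A minimal sketch of that check (illustrative types and predicate signature, not an existing API): the rescheduler would only evict a pod for rebalancing if at least one of the candidate nodes actually passes the required fit predicates for that pod.

package sketch

// Pod and Node are simplified stand-ins for the real API objects.
type Pod struct{ Name string }

type Node struct {
    Name          string
    Unschedulable bool
}

// FitPredicate reports whether pod can feasibly run on node.
type FitPredicate func(pod Pod, node Node) bool

// worthEvictingFor returns true only if some candidate (e.g. under-utilized) node is
// actually feasible for the pod. If all candidates are unschedulable, tainted, or
// otherwise infeasible, evicting the pod would be pointless.
func worthEvictingFor(pod Pod, candidates []Node, predicates []FitPredicate) bool {
    for _, node := range candidates {
        feasible := true
        for _, pred := range predicates {
            if !pred(pod, node) {
                feasible = false
                break
            }
        }
        if feasible {
            return true
        }
    }
    return false
}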

@HaiyangDING

HaiyangDING Mar 2, 2016

Contributor

+1. Predicate functions are useful for the rescheduler, even in the first version of implementation.

@bgrant0607

bgrant0607 Mar 8, 2016

Member

What scenarios would you like the MVP to handle?

I don't object to including the fit predicates, since we do have a spec for feasibility, not just an implementation. We should provide an API, though it would admittedly be of significantly lower performance, even once we support protobuf.

I personally think the MVP could be simpler -- for instance don't attempt to fill tainted nodes. Most clusters are likely still homogeneous, especially since we don't yet have a simple mechanism for exporting attributes. Certainly in an auto-scaled cluster, the additional nodes are going to be similar to existing ones. Another approach could be to evict pods from "similar" nodes.

@HaiyangDING

HaiyangDING Mar 9, 2016

Contributor

Sorry, what does MVP mean here?

@davidopp

davidopp Mar 9, 2016

Member

MVP: https://en.wikipedia.org/wiki/Minimum_viable_product

@bgrant0607 : The plan for the MVP is to move pods onto nodes that were added by cluster auto-scaler (hopefully implemented generically to move pods onto under-utilized nodes) and to move pods to improve affinity. I don't think we need an API; the point of #20204 is to make it possible for a component to link in the required scheduling predicates.

* moving a running pod off of a node from which it is receiving poor service
* anomalous crashlooping or other mysterious incompatibility between the pod and the node
* repeated out-of-resource killing (see #18724)
* repeated attempts by the scheduler to schedule the pod onto some node, but it is

@quinton-hoole

quinton-hoole Mar 1, 2016

Member

Is this really possible? If so, we should presumably fix it in the scheduler, not a rescheduler? If the scheduler schedules a pod to a node, and then discovers that it had out of date information about the node, and the pod can not in fact be scheduled there, it should automatically reschedule, possibly?

@derekwaynecarr

derekwaynecarr Mar 1, 2016

Member

If a pod is scheduled to a kubelet and rejected in admission, isn't the pod immediately moved to a terminal state? I agree with @quinton-hoole here that this seems like something the scheduler should get right, but from what I can tell thus far in this PR, the rescheduler does more selective killing of pods on a node and does not actually take over primary scheduling responsibility, so maybe it's a no-op here and a solution is needed in both places?

@davidopp

davidopp Mar 9, 2016

Member

In the model where the scheduler has full information about kubelet resources, what you guys are saying is correct (and this is indeed how things are today). However, in the future it's possible that Kubelet will have more information than the scheduler, especially if the resource topology within a node becomes very complicated and it's not scalable for the scheduler to know all of the details. I would like to avoid moving to that world as long as possible, though.

@HaiyangDING

HaiyangDING Mar 9, 2016

Contributor

+1 @davidopp 's comments.

@quinton-hoole

quinton-hoole Mar 14, 2016

Member

@davidopp @HaiyangDING In my mind there are two distinct cases where a pod fails to schedule:

  1. The scheduler picks a bad node, due to out of date or incomplete information. It's not clear to me that fixing this is the responsibility of a rescheduler. It seems like the scheduler should fix its own mistakes here, e.g. by trying again.
  2. The node picked by the scheduler is non-optimal (or becomes non-optimal over time), or no suitable node exists (without moving some other pods around). This seems like the purview of the rescheduler.

Am I missing something?

@HaiyangDING

HaiyangDING Mar 15, 2016

Contributor

I will try to answer with my knowledge; correct me if I am wrong.

The scheduler picks a bad node, due to out of date or incomplete information. It's not clear to me that fixing this is the responsibility of a rescheduler. It seems like the scheduler should fix its own mistakes here, e.g. by trying again.

Currently, if the pod is denied by the node proposed by the scheduler, the pod is simply marked 'failed'; there is no retry. I think in the future:

  1. It is the scheduler's responsibility to try to schedule a pod denied by the kubelet again, possibly several times (<= N).
  2. It is the rescheduler's responsibility to handle a pod that has been denied by the kubelet several times (>= N). However, the mechanism needs further consideration. For instance, if the pod is denied by the same kubelet several times, we can add the avoid annotation on the node; but if the pod is denied by different kubelets, maybe we could just wait and see.

The node picked by the scheduler is non-optimal (or becomes non-optimal over time), or no suitable node exists (without moving some other pods around). This seems like the purview of the rescheduler.

Yes, it is. FWIW:

  • The node picked by the scheduler is non-optimal (or becomes non-optimal over time): this is in the scope of the first version of the rescheduler, but we need to figure out some policy to decide how non-optimal is bad enough (and how likely it is to improve) to trigger rescheduling. Actually, by this we are asking the rescheduler to know (at least some of) the priority functions.
  • No suitable node exists (without moving some other pods around): this is within the scope of the rescheduler but is related to preemption and priority, so it is not going to be implemented in the first step.
  • Another use case of the rescheduler in the first step is to move some pod onto an under-utilized node, and both 'some pod' and 'under-utilized' need to be defined.

And finally, yeah, I don't think you're missing anything :)

@davidopp

davidopp Mar 21, 2016

Member

Yeah, I'd say this falls into the category of things that could go into either the rescheduler or every scheduler. As mentioned somewhere in the doc, we don't actually need a rescheduler component at all--we could just implement all of the rescheduler in every scheduler, creating a virtual/distributed rescheduler. But it's easier for people to write new schedulers (and the system is easier to understand, global policies are easier to configure, etc. etc.) if we have a single rescheduler component rather than putting the responsibility on every scheduler. With that in mind, the reasoning is basically what @HaiyangDING said -- while you could make schedulers responsible for noticing and stopping "rescheduling loops," you can instead make that the responsibility of the rescheduler, which would notice it happening (for any scheduler) and would add an indication to the node that equivalent pods should avoid that node for some period of time. But it is certainly the case that you could put this logic in the scheduler. (And we are not suggesting to address this at all in the first version of the rescheduler, especially since we don't have any scenarios today AFAIK that should cause rescheduling loops, other than stale information, which I don't consider a rescheduling loop because it will quickly stop.)

rejected by Kubelet admission control due to incomplete scheduler knowledge
* poor performance due to interference from other containers on the node (CPU hogs,
cache thrashers, etc.) (note that in this case there is a choice of moving the victim
or the aggressor)

@quinton-hoole

quinton-hoole Mar 1, 2016

Member

This verges on a grammatical nit, but it goes a bit deeper than that. I'd avoid using connotative terms here, implying that one pod is 'good' and the other is 'bad'; e.g. using as much CPU as is available on the node is a perfectly reasonable strategy for, say, a CPU-bound batch job. The only thing that makes it not always work out well for everyone is our oversubscription policy and inadequate resource requirement semantics.

@resouer

resouer Mar 4, 2016

Member

Also, I propose to use latency as a metric to determine "good" or "bad"

When a scheduler is scheduling a new pod P and cannot find any node that meets all of P's
scheduling predicates, it is allowed to evict ("preempt") one or more pods that are at
the same or lower priority than P (subject to disruption budgets, see next section) from

@quinton-hoole

quinton-hoole Mar 1, 2016

Member

It might be worth clarifying why it's a good idea to preempt pods of the same priority. Superficially it sounds like replacing a pod with another of equal priority does not improve cluster-wide scheduled pod percentages, and is just busy work. I'm sure there's more to it than that though.

@quinton-hoole

quinton-hoole Mar 1, 2016

Member

I guess the assumption is that the evicted pod might be schedulable on some other node. In the worst case we break even (modulo rescheduling costs), while in the best case we have an additional pod scheduled.

@quinton-hoole

quinton-hoole Mar 1, 2016

Member

One can imagine a pretty bad situation where the above worst-case behavior cascades, evicting a large number of equal-priority pods in succession, resulting in high rescheduling costs and no net benefit. Would it be worth trying to short-circuit this sort of thing (by e.g. only preempting an equal-priority pod if it is immediately schedulable onto another node without evicting anything, or only evicting lower-priority pods)?

@quinton-hoole

quinton-hoole Mar 1, 2016

Member

Note that this is not a purely theoretical concern. I have considerable experience with borg taking out an entire serving cluster by precipitating such a "pre-emption storm", ultimately rendering the cluster useless in an attempt to schedule one or a small number of higher priority "pods". In those particular cases it would have been way, way better for the rescheduler to have done absolutely nothing (in which case the serving capacity of the cluster in question would have been reduced by a small number of percent due to the unscheduled pods, rather than to zero by the resulting cascading pre-emption storm).

@mqliang

mqliang Mar 2, 2016

Member

@quinton-hoole

It might be worth clarifying why it's a good idea to preempt pods of the same priority. Superficially it sounds like replacing a pod with another of equal priority does not improve cluster-wide scheduled pod percentages, and is just busy work. I'm sure there's more to it than that though.

I think the reason we allow preempting pods of the same priority is that we have the "disruption budget": as long as the disruption budget allows it, it won't further increase the unavailability of one Service, and we get the benefit of allowing pods of another Service to be scheduled.

@HaiyangDING

HaiyangDING Mar 2, 2016

Contributor

I think the reason we allow preempting pods of the same priority is that we have the "disruption budget": as long as the disruption budget allows it, it won't further increase the unavailability of one Service, and we get the benefit of allowing pods of another Service to be scheduled.

Good one. I had the same concern about why we would allow equal-priority preemption. So a pod from a set whose disruption budget is low would be able to preempt one from a set with plenty of disruption budget?
Are there other reasons?

@davidopp

davidopp Mar 9, 2016

Member

I guess the assumption is that the evicted pod might be schedulable on some other node. In the worst case we break even (modulo rescheduling costs), while in the best case we have an additional pod scheduled.

Correct.

One can imagine a pretty bad situation where the above worst-case behavior cascades, evicting a large number of equal-priority pods in succession, resulting in high rescheduling costs and no net benefit.

As @mqliang says, preemptions are rate-limited by disruption budgets, so you can't get preemption cascades/preemption storms.

Would it be worth trying to short-circuit this sort of thing (by e.g. only preempting an equal-priority pod if it is immediately schedulable onto another node without evicting anything, or only evicting lower-priority pods)?

What you're describing (only evict if we're pretty sure the victim can reschedule somewhere) is something we might eventually want to do for the other rescheduling use cases, and I agree it could be useful for the same-priority-preemption use case as well. I don't think it's necessary for a first version of the system, though.

I had the same concern about why we would allow equal-priority preemption.

The reason to allow equal-priority preemption is to "repack" pods to make room for a pending pod. (For example, defragment resources, or move a pod off of a machine with some label that is required by the pending pod but not required by the already-running pod.) Doing this at scheduling time on-demand rather than speculatively in the rescheduler minimizes the amount of unnecessary evictions.

a node in order to make room for P, i.e. in order to make the scheduling predicates
satisfied for P on that node. (Note that when we add cluster-level resources (#19080),
it might be necessary to preempt from multiple nodes, but that scenario is outside the
scope of this document.) The preempted pod(s) may or may not be able to reschedule. The

@quinton-hoole

quinton-hoole Mar 1, 2016

Member

This is probably not the right place for this comment/question, but here goes. Have we considered treating scheduling and rescheduling as a global constraint-satisfaction problem (rather than a local one, as per the current scheduler implementation and rescheduler design)? What I have in mind is an algorithm that takes as input all of the nodes, all of the pods and their constraints, and the current pod placements. Its job is to find a new global mapping of pods to nodes that:

  1. results in 'more' scheduled pods (by some definition of 'more'). Ideally the highest possible.
  2. results in 'more' placement constraints (affinities, anti-affinities, ... etc) being satisfied. Ideally the highest possible.
  3. favors alternative global solutions which are "closer" to the current pod placement (i.e. the "diff" is smaller), in preference to solutions that are "further" from it (to reduce rescheduling cost). A conversion function between rescheduling cost and the cost of pending/unscheduled pods makes it possible to trade off 1 against 3, etc.

The classic problem with local constraint satisfaction (as we currently design and implement it), as opposed to global constraint satisfaction, is that the former can commonly get stuck in local minima of the function being optimised. In our case, this is made worse by the relatively high cost of rescheduling (terminating and replacing pods creates a lot of churn in the system, so we'd better be sure that each of these gets us strictly closer to a near-optimal solution). This is theoretically and practically impossible with a local constraint solver, as evidenced by e.g. the "pre-emption storm" example above, and further similar discussion below.
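
One way to read point 3 above as a concrete objective (purely illustrative; choosing the terms and weights is the hard part being debated here): score each candidate global assignment by how many pods it leaves unscheduled, how many soft constraints it violates, and how far it moves from the current placement, then prefer the lowest score.

package sketch

// placementCost scores a candidate global assignment. moveWeight converts the cost of
// moving a running pod into the same units as the cost of leaving a pod unscheduled,
// which is exactly the trade-off described in point 3.
func placementCost(unscheduledPods, violatedConstraints, movedPods int, constraintWeight, moveWeight float64) float64 {
    return float64(unscheduledPods) +
        constraintWeight*float64(violatedConstraints) +
        moveWeight*float64(movedPods)
}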

@mqliang

mqliang Mar 2, 2016

Member

Have we considered treating scheduling and rescheduling as a global constraint-satisfaction problem

Getting a globally optimal solution is challenging IMO, since we need a good objective function (taking a lot of factors into consideration), and it takes time to calculate. And unfortunately several schedulers work at the same time, so it's racy. One idea that occurs to me is that we could have a distributed lock: if the rescheduler gets the lock, all schedulers pause; once the rescheduler has finished its work, it frees the lock, and the schedulers start working again. I have discussed this idea with @HaiyangDING .

@HaiyangDING

HaiyangDING Mar 2, 2016

Contributor

@mqliang I have taken a second thought on this. I think locking the schedulers at any time may not be a good idea, because pods that users submit will stay pending for quite a while as long as the rescheduler has not finished its job, which, unfortunately, in many use cases is not acceptable: new jobs cannot be started because their pods are pending, and application SLOs may be violated due to failure to scale up pods (newly created pods are scheduled by the scheduler(s)). The rescheduler's goal is to OPTIMIZE the cluster/certain-pod placement without harming user experience (or harming it as little as possible). Note that the rescheduler is trying to improve the performance of the cluster/certain pods: to help them move from a "worth improving" situation to a (locally) better one, not from complete failure to "global optimization"; locking the schedulers actually "freezes" the whole cluster for quite some time, which is not worth it at all.

Regarding the global constraint-satisfaction problem brought up by @quinton-hoole , I don't think we would be able to achieve a truly "globally optimized state", for:

  1. it is too hard, if not impossible, to define the cost function. We need a lot of a priori knowledge of the system and the input. I believe there is a lot of related research in the field;
  2. in the multi-scheduler scenario, it is even harder unless we introduce some cooperative constraints between the schedulers, which is also not easy;
  3. I am very fond of the 3rd point brought up by @quinton-hoole , that we can use some heuristics that depend heavily on the current pod placement. We can use such heuristics to see if there is anything we could do to improve the cluster/certain pods, and the improvement can be measured by some cost function with a considerable penalty for moving pods. +1 to

(terminating and replacing pods creates a lot of churn in the system, so we'd better be sure that each of these gets us strictly closer to a near-optimal solution)

All in all, the above is just some of my thoughts, and I believe there is no easy solution if we want "global optimization". We should not consider this in the first step, or even in the next few steps. The rescheduler for the community version should stay simple and easy to understand, so let us leave the complexity to researchers or dedicated commercial distributions.

FWIW, there is a project called Firmament, whose modeling can be a very good reference for those who are interested in the field. I am not sure if the authors are interested in this discussion. @ms705 @ICGog

@mqliang

mqliang Mar 3, 2016

Member

I agree. But it is still racy as long as the schedulers keep working at the same time. My idea is to pause the schedulers only for a short time. We use pessimistic conflict control with a deadline: the rescheduling algorithm has a deadline, and the rescheduler must ensure that rescheduling finishes before it. If it can't, the rescheduling algorithm returns a suboptimal (but still global) pod placement; if it does finish before the deadline, we get an optimal solution. Such an implementation ensures the schedulers are never paused for long. Even if we don't get the best cluster layout after one rescheduling pass, we get a progressively better one.

Pseudo code is like:

// Rescheduling takes all of the nodes, all of the pods and their constraints, and the
// current pod placements, and returns a new global mapping of pods to nodes. If the
// optimal computation finishes before the deadline we return it; otherwise we fall back
// to a suboptimal (but still global) placement.
// computeOptimalPlacement and bestSuboptimalPlacement are placeholders for the actual algorithm.
func Rescheduling(pods []api.Pod, nodes []api.Node, deadline time.Duration) []Binding {
    optimal := make(chan []Binding, 1)
    go func() {
        optimal <- computeOptimalPlacement(pods, nodes) // long-running optimal calculation
    }()
    select {
    case result := <-optimal:
        // finished before the deadline: optimal solution
        return result
    case <-time.After(deadline):
        // deadline hit: return the best suboptimal global placement found so far
        return bestSuboptimalPlacement(pods, nodes)
    }
}

So, if we design it like this, we solve the race problem and harm user experience as little as possible.

May be premature, but we could discuss it.

This comment has been minimized.

@hurf

hurf Mar 3, 2016

Contributor

@mqliang Can we consider the rescheduler as a controller and the scheduler as an executor? The rescheduler makes decisions, and asks the kubelet to evict a pod and the scheduler to schedule a pod (with an "exclude certain nodes" constraint). Then we don't need to lock any component.

@davidopp

davidopp Mar 9, 2016

Member

Later in the doc we say

We expect certain aspects of the design to be "permanent" (e.g. the notion and use of priorities, preemption, disruption budgets, and the /evict subresource) while others may change over time (e.g. the partitioning of functionality between schedulers, controllers, rescheduler, horizontal pod autoscaler, and cluster autoscaler).

You can definitely move the rescheduler logic into the schedulers, but that makes it more complicated for people to build schedulers. As for whether rescheduler is a "controller" or a unique type of component -- I think of controllers as "owning" pods. The rescheduler doesn't own any pods -- it only evicts pods. So I would not consider it to be a controller. (And even if some day we had rescheduler forcibly schedule evicted pods onto nodes, it still would not own the pods, and wouldn't be a controller.) So I think it is a unique type of component.

@mqliang

mqliang Mar 9, 2016

Member

Agree.

@mqliang

mqliang Mar 9, 2016

Member

Does this mean the first version of the rescheduler will just "evict" pods, but that we may eventually want the rescheduler to "rebind" pods?

@HaiyangDING

HaiyangDING Mar 9, 2016

Contributor

Evicting pods is definitely what we want to implement in the first version of the rescheduler. However, I am not fully convinced that the rescheduler should schedule pods; maybe we can discuss this based on concrete use cases or tests.

@davidopp

davidopp Mar 9, 2016

Member

Right -- only eviction in the first version. I don't think we would ever want rescheduler to directly schedule pods. I think we might want /evict subresource to have "prefer" in addition to "avoid," but we would still have the regular scheduler(s) do the scheduling for the evicted pods. Of course, since rescheduler is an independent component, people are free to experiment with their own implementations of the rescheduler. If someone could demonstrate a benefit to having rescheduler directly schedule pods, using a real workload, I'm sure I would change my mind. :)

same priorities (names and ordering). This could be done by making them constants in the
API, or using ConfigMap to configure the schedulers with the information. The advantage of
the former (at least making the names, if not the ordering, constants in the API) is that
it allows the API server to do validation (e.g. to catch misspellings).
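
As a rough sketch of the first option (illustrative only; the specific names, and the idea of also publishing the ordering, are assumptions rather than anything decided in the proposal), the shared priority names could be API constants that the API server validates against:

// Illustrative only: priority names shared by all schedulers so that the API
// server can reject unknown values. The actual names are up to the proposal.
const (
    PriorityCritical    = "Critical"
    PriorityNormal      = "Normal"
    PriorityPreemptible = "Preemptible"
)

// One possible representation of the administrator-defined total ordering,
// from lowest to highest; this could instead be configured via a ConfigMap.
var priorityOrder = []string{PriorityPreemptible, PriorityNormal, PriorityCritical}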

@quinton-hoole

quinton-hoole Mar 1, 2016

Member

I think that this sort of validation will be crucial for good user experience.

#### Relationship of priority to quota
Of course, if the decision of what priority to give a pod is solely up to the user, then
users have no incentive to ever request any priority less than the maximum. Thus

@quinton-hoole

quinton-hoole Mar 1, 2016

Member

.. unless the charges are different. What if the charge against quota were proportional to priority? High-priority pods would cost more than low-priority pods, so people would hopefully spend their priority wisely.
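
A minimal sketch of that idea (purely illustrative; the weights, names, and function are assumptions, not anything proposed here):

// Illustrative only: charge a pod's resource request against quota scaled by a
// per-priority weight, so higher-priority pods consume quota faster.
var priorityWeight = map[string]float64{
    "Preemptible": 0.5,
    "Normal":      1.0,
    "Critical":    2.0, // assumed weights, for illustration
}

func quotaCharge(cpuRequest float64, priority string) float64 {
    weight, ok := priorityWeight[priority]
    if !ok {
        weight = 1.0 // unknown priority: charge at the normal rate
    }
    return cpuRequest * weight
}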

@davidopp

davidopp Mar 9, 2016

Member

That's an interesting idea.

it terminates on its own, is deleted by the user, or experiences some unplanned event
(e.g. the node where it is running dies). Thus in a cluster with long-running pods, the
assignment of pods to nodes degrades over time, no matter how good an initial scheduling
decision the scheduler makes. This observation motivates "controlled rescheduling," a

@derekwaynecarr

derekwaynecarr Mar 1, 2016

Member

Thoughts on controllerRef for a pod to "move" pods that are backed by a controller versus one-off pods that were created (say by controllers outside of core kubernetes)?

@bgrant0607

bgrant0607 Mar 7, 2016

Member

I can see the usability argument, but am concerned about gaming in multi-tenant clusters.

@davidopp

davidopp Mar 7, 2016

Member

@derekwaynecarr I didn't understand your suggestion.

@quinton-hoole

quinton-hoole Mar 14, 2016

Member

@derekwaynecarr Neither do I. Perhaps you're thinking of storing a reference to one or more controllers that might manage a given pod? What do you plan to do with the reference?

is to move a pod specifically for the benefit of another pod)
* moving a running pod off of a node from which it is receiving poor service
* anomalous crashlooping or other mysterious incompatibility between the pod and the node
* repeated out-of-resource killing (see #18724)

@derekwaynecarr

derekwaynecarr Mar 1, 2016

Member

Out-of-resource killing will already kill a pod, based on the F2F discussion. Is this implying repeated out-of-resource killing on a specific node, or for a specific pod derived from a specific controller?

@davidopp

davidopp Mar 9, 2016

Member

Yes, exactly what you said. In this case the rescheduler is not responsible for the killing, just for ensuring that after some criterion is met (e.g. the pod has been repeatedly OOR-killed on the same node some number of times) the pod is rescheduled elsewhere.

We propose to add a required `Priority` field to `PodSpec`. Its value type is string, and
the cluster administrator defines a total ordering on these strings (for example
`Critical`, `Normal`, `Preemptible`). We choose string instead of integer so that it is

@mml

mml Mar 1, 2016

Contributor

Thought about this again on my ride in this morning. We know that exposing a numerical ordering is messy, but with this UI, users will have to think in terms of a named priority (hopefully they remember the semantics/ordering!) and the QoS behavior they get based on the quantitative rules that map to conceptual words documented here.

It's true that we don't want to expose numbers to users, but what we'd like to expose (IMO) is fully-packaged Service Levels with names. They aren't ordered themselves because they're multi-axis. "Uncheckpointable Batch", for example, would map to specific priority and QoS settings.

If we plan on going this direction, we should consider not forcing users to name the priorities and sticking with numbers, because ultimately the numbers should be hidden behind the Service Level facade and naming the objects in the total ordering is actually going to be hard and error-prone.

@hurf

hurf Mar 3, 2016

Contributor

Do we want to allow users (I mean system admins) to extend the priority levels? I think strings help with this.

@bgrant0607

bgrant0607 Apr 15, 2016

Member

Numerical priorities also have the BASIC-line-number problem: one has to leave space for insertion of additional numbers later.

Another possibility to consider is to just give up on a total order. Think about it as an authorization problem.

@bgrant0607

bgrant0607 Jun 25, 2016

Member

Maybe we should split "priority" into a separate proposal.

Pod disruption budget will create a safety mechanism so that rescheduling and other automated processes don't take down whole services. We will need a mechanism to control who/what is allowed to spend that disruption budget, and we will need a mechanism to control who can ask for what kind of disruption budget, but I believe those features to be separable.

@therc

therc Jul 8, 2016

Contributor

Whether strings or numbers, there should be a sane default so that

  1. Administrators aren't forced to make unnecessary decisions
  2. It's easier to follow tutorials or reuse other people's configurations

Regarding the type to use, I understand the reluctance to go for magic numbers, but, to give a concrete example, the distinction between "batch" and "best-effort" was always a bit of a mnemonic struggle to me, as "left" vs. "right" are to some people ("batch" vs. "free" was easier; 25 vs 0 even more so). Perhaps it's also because I'm not a native speaker. And, back to the first point, it's shifting a UI decision to cluster administrators, a group that is not generally well-versed in user experience. The pessimist in me imagines clusters from hell with priorities like "Critical", "VeryCritical", "VeryVeryCritical" and "ReallyCritical".
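
To make the field under discussion concrete, here is a minimal sketch of a string-valued priority on the pod spec (the field placement, json tag, and default value are assumptions for illustration, not part of the proposal):

// Illustrative sketch only. DefaultPriority addresses the "sane default" point
// above; the name and value are assumptions, not something the proposal defines.
const DefaultPriority = "Normal"

type PodSpec struct {
    // ... existing fields elided ...

    // Priority names one of the administrator-defined priority levels
    // (e.g. "Critical", "Normal", "Preemptible"); the administrator defines
    // a total ordering over these names.
    Priority string `json:"priority"`
}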

The first version of the rescheduler will only implement two objectives: moving a pod
onto an under-utilized node, and moving a pod onto a node that meets more of the pod's
affinity/anti-affinity preferences than wherever it is currently running. (We assume that
nodes that are intentionally under-utilized, e.g. because they are being drained, are

@HaiyangDING

HaiyangDING Mar 2, 2016

Contributor

I am a little confused here: a node that is being drained or marked unschedulable should not be considered a candidate host when rescheduling, and IMO that is what "not cause the rescheduler to fight the system" means. So should this say We assume that nodes that are **NOT** intentionally under-utilized? Or am I misunderstanding something?

@davidopp

davidopp Mar 9, 2016

Member

What I was trying to say is that there will be some nodes that are intentionally under-utilized, because they are being drained (for example, in preparation for maintenance). We will assume that these nodes will be marked unschedulable. Thus when looking for under-utilized nodes to move pods onto, we would ignore these nodes.
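
A tiny sketch of that filtering (illustrative only; the utilization signal and threshold are assumptions, not something specified in the proposal):

// Illustrative only: when scanning for under-utilized nodes to move pods onto,
// skip nodes that are marked unschedulable (e.g. because they are being drained).
func underUtilizedCandidates(nodes []api.Node, utilization map[string]float64, threshold float64) []api.Node {
    var candidates []api.Node
    for _, node := range nodes {
        if node.Spec.Unschedulable {
            continue // intentionally under-utilized; not a rescheduling target
        }
        if utilization[node.Name] < threshold {
            candidates = append(candidates, node)
        }
    }
    return candidates
}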

@erictune

erictune Apr 13, 2016

Member

If you add an /evict subresource, then we can easily authorize that separately from DELETE /pods/foo. If you use a DeleteOptions, we don't yet have a way to handle that in authorization.
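
A sketch of why the subresource helps (made-up types, not the actual authorizer interface): authorization sees request attributes such as verb, resource, and subresource, but never the request body, so a DeleteOptions field is invisible to it while a subresource is not.

// Illustrative only: with an /evict subresource, an authorizer can distinguish
// eviction from ordinary deletion using the request attributes alone.
type requestAttrs struct {
    Verb        string // e.g. "create", "delete"
    Resource    string // e.g. "pods"
    Subresource string // e.g. "evict", or "" for the main resource
}

func isEviction(attrs requestAttrs) bool {
    return attrs.Resource == "pods" && attrs.Subresource == "evict"
}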

@erictune

erictune Apr 13, 2016

Member

I haven't read all of this, but from our conversation, I got the impression the eviction is handled synchronously in the apiserver. That made me wonder:
Scheduling is asynchronous, and happens out of the apiserver process. Why isn't eviction also asynchronous, and out of the apiserver process?

Like scheduling, eviction computations:

  • might require a lot of computation, which we don't want in the request path of the apiserver.
  • might require a lot of memory for caching and holding state of all pods and nodes, which we don't want to hold in memory of apiserver.
  • might benefit from handling multiple requests in a batch.
  • benefits from letting other people try their own implementations
  • might need to parallelize someday
  • does not benefit from being "transactional", because some inputs cannot be serialized: if a node reboots, causing a pod to fail, you can't say "sorry, you can't fail, because that would exceed your disruption budget."

By the way, assigning a default disruptionBudget to newly created pods in apiserver seems fine. I'm just asking about eviction.

of overcommitment, by allowing prioritization of which pods should be allowed to run
when demand for cluster resources exceeds supply.
### Disruption budget

@bgrant0607

bgrant0607 Jun 25, 2016

Member

Let's move this to a design doc. It's underway already.

@bgrant0607

bgrant0607 Jun 25, 2016

Member

Nevermind. If we move priority out, isn't the rest resolved and underway, and this could be merged?

@davidopp

davidopp Jun 28, 2016

Member

PodDisruptionBudget API is already merged, see
https://github.com/kubernetes/kubernetes/blob/master/pkg/apis/policy/v1alpha1/types.go#L56

The controller for it is awaiting my review, #25921.

Last step is to implement /evict subresource (no PR for that yet).
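
As a rough illustration of how a disruption budget gates eviction (an assumption-laden sketch, not the merged API or controller logic; see the linked types.go for the real definition):

// Illustrative only: evicting one more pod must still leave at least
// minAvailable healthy pods in the collection covered by the budget.
func evictionAllowedByBudget(currentHealthy, minAvailable int) bool {
    return currentHealthy-1 >= minAvailable
}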

@davidopp

davidopp Jul 10, 2016

Member

Thanks to everyone for their feedback. Obviously this is a complex topic with a large design space. I have incorporated some of the suggestions, and am going to submit the doc. Note that it is intentionally in the proposals/ directory -- it's somewhere between a proposal and a design doc. I expect a fuller design doc for some issues, such as priorities and preemption, before any implementation happens. Other features, such as disruption budget and evict subresource, are almost finished. I don't think it's worth splitting different aspects of this doc into separate docs, since the ideas are closely inter-related, and I think it's helpful for someone to see a global view of how they fit together. Instead I think we should view this as an overview doc, with the expectation that there will be more detailed design docs for some features as necessary.

@davidopp

davidopp Jul 10, 2016

Member

(Note: I haven't pushed the commit with the fixes yet. Having some trouble on my machine. Will do it soon.)

@k8s-merge-robot k8s-merge-robot added size/XL and removed size/L labels Jul 10, 2016

@k8s-bot

k8s-bot commented Jul 10, 2016

GCE e2e build/test passed for commit a852773.

@k8s-bot

k8s-bot commented Jul 10, 2016

GCE e2e build/test passed for commit b77e392.

@k8s-bot

k8s-bot commented Jul 10, 2016

GCE e2e build/test passed for commit 19fbd90.

@k8s-merge-robot

k8s-merge-robot Jul 10, 2016

Contributor

Automatic merge from submit-queue

@k8s-merge-robot k8s-merge-robot merged commit 710374b into kubernetes:master Jul 10, 2016

7 checks passed

Jenkins GCE Node e2e Build finished. 124 tests run, 12 skipped, 0 failed.
Details
Jenkins GCE e2e Build finished. 331 tests run, 148 skipped, 0 failed.
Details
Jenkins GKE smoke e2e Build finished. 331 tests run, 329 skipped, 0 failed.
Details
Jenkins unit/integration Build finished. 3362 tests run, 14 skipped, 0 failed.
Details
Jenkins verification Build finished.
Details
Submit Queue Queued to run github e2e tests a second time.
Details
cla/google All necessary CLAs are signed

@davidopp davidopp referenced this pull request Aug 30, 2016

Open

PodDisruptionBudget and /eviction subresource #85

7 of 20 tasks complete
@krmayankk

krmayankk Aug 17, 2017

Contributor

@davidopp is this available in 1.7?

@timothysc

timothysc Sep 5, 2017

Member

@krmayankk It is being built out of core and is working towards incubation.

@aveshagarwal has more details.
