
Preemption priority / scheme #22212

Open
dchen1107 opened this Issue Feb 29, 2016 · 50 comments

Comments

@dchen1107
Member

dchen1107 commented Feb 29, 2016

We expect the Kubelet to make more and more decisions about which pods should run on a given node. For example,

  • Reject pods on port conflict
  • Reject incoming / evict existing pods when node constraints are violated.
  • Reject incoming / evict existing pods in response to resource starvation, such as OOD, OOM, etc. (#147, #18724)
  • ...

To really do that, the Kubelet needs to know the preemption priority / scheme, so that it knows which pod(s) should be evicted and the most important pods keep running when resource demand exceeds supply. For example, DaemonSet pods should have higher priority than the rest of the pods, and kube-system pods might have a relatively higher priority too.

cc/ @bgrant0607 @davidopp
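
Editor's note: a minimal sketch, purely hypothetical and not anything that exists in the Kubelet, of the kind of ordering described above: under resource pressure the least important pod is evicted first, with DaemonSet and kube-system pods treated as more important than ordinary workloads. The type, field names, and scoring are illustrative assumptions only.

```go
package main

import (
	"fmt"
	"sort"
)

// evictionCandidate is an illustrative stand-in for the pod metadata the
// Kubelet would consult; it is not a real Kubernetes type.
type evictionCandidate struct {
	name          string
	namespace     string
	fromDaemonSet bool
}

// importance is a made-up ranking: higher values are evicted later.
func importance(p evictionCandidate) int {
	switch {
	case p.fromDaemonSet:
		return 2
	case p.namespace == "kube-system":
		return 1
	default:
		return 0
	}
}

func main() {
	pods := []evictionCandidate{
		{name: "web-1", namespace: "default"},
		{name: "node-exporter-x", namespace: "kube-system", fromDaemonSet: true},
		{name: "dns-y", namespace: "kube-system"},
	}
	// Sort so the least important pod comes first, i.e. is evicted first.
	sort.SliceStable(pods, func(i, j int) bool {
		return importance(pods[i]) < importance(pods[j])
	})
	fmt.Println("eviction order under resource pressure:")
	for _, p := range pods {
		fmt.Println("  ", p.name)
	}
}
```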

@dchen1107

Member

dchen1107 commented Feb 29, 2016

cc/ @kubernetes/goog-node

@davidopp


Member

davidopp commented Feb 29, 2016

I think there are three separate issues:

  1. During scheduling, how does the scheduler decide preemptions?
  2. At runtime, how does the Kubelet decide which pod(s) to kill when it needs to kill pod(s), e.g. due to OOM?
  3. How does Kubelet admission work?

My assumption was that (1) would use priority, (2) would use QoS, and (3) doesn't really need to take priority into account (unless we want the Kubelet to be responsible for deciding preemptions -- and I don't think we do). (3) does need to understand the concept of preemptors waiting for victims to exit, but I don't think it necessarily needs to reason about priority directly.

ref/ #21767

@vishh

Member

vishh commented Feb 29, 2016

(3) does need to understand the concept of preemptors waiting for victims to exit, but I don't think it necessarily needs to reason about priority directly.

If the kubelet takes future availability into account for the purposes of admission, does this mean that a Guaranteed pod scheduled directly onto a node (or via a DaemonSet) will remain pending until sufficient resources are available, even if the resources are held by BestEffort pods?
Why do we not want the kubelet to understand priority and evict pods, or wait for resources, based on priority?

@dchen1107

Member

dchen1107 commented Feb 29, 2016

@davidopp re: #22212 (comment)

Shouldn't the kubelet, (re)scheduler, and other control components follow the same policy for all three issues you listed? For example, suppose the scheduler decides to preempt A in order to schedule B onto a node when resources are scarce. Then on a node where both A and B are allocated, when the node hits a system OOM, the kubelet should preempt A first to relieve the resource starvation. Right?

Of course, the rules / policies applied by the different control components don't completely overlap. For example, the scheduler shouldn't check for port conflicts. But even in that case, shouldn't the same preemption priority / scheme be applied, instead of what we have today, which simply rejects the newly arriving pod?

@davidopp davidopp self-assigned this Mar 1, 2016

@dchen1107

Member

dchen1107 commented Mar 18, 2016

@davidopp

Member

davidopp commented Mar 19, 2016

Today QoS is determined by the relationship between request and limit. Trying to tie together QoS and scheduling priority would be confusing, because we'd end up having to forbid some combinations of QoS and scheduling priority. (For example, pods with "low" scheduling priorities wouldn't be able to set limit==request, and pods with "high" scheduling priority wouldn't be able to set request==0.) Or we'd have to say the relationship between request and limit defines your scheduling priority -- but that's even worse (and it ties us to only having three scheduling priorities, which is far too few).

Now, if we come up with a different way to specify QoS (e.g. it is specified directly rather than inferred from the relationship between request and limit), then we could potentially tie together QoS and scheduling priority. But even if that were a good idea, there are only three QoS levels, yet we want cluster admins to be able to configure many scheduling priority levels; certainly we need more than three. So we'd have to make rules about how scheduling priority and QoS map to one another, like "increasing scheduling priority should never decrease QoS", and cluster admins would have to define a mapping between the two (assuming we want cluster admins to be able to define the allowed scheduling priorities and the total order on them).
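
Editor's note: for reference, a simplified single-resource sketch of the request/limit relationship described above. The real Kubelet derives QoS across all containers and resources in a pod, so treat this as an approximation under that assumption; the type and function names here are made up.

```go
package main

import "fmt"

// Resources is a simplified single-resource view; 0 means "unset".
type Resources struct {
	Request int64
	Limit   int64
}

type Class string

const (
	Guaranteed Class = "Guaranteed" // limit set and equal to the (possibly defaulted) request
	Burstable  Class = "Burstable"  // a request is set, with extra headroom in the limit
	BestEffort Class = "BestEffort" // neither request nor limit set
)

// Classify mirrors the idea that QoS falls out of how request relates to limit.
func Classify(r Resources) Class {
	req := r.Request
	if req == 0 && r.Limit > 0 {
		// Kubernetes defaults an unset request to the limit.
		req = r.Limit
	}
	switch {
	case req == 0 && r.Limit == 0:
		return BestEffort
	case r.Limit > 0 && req == r.Limit:
		return Guaranteed
	default:
		return Burstable
	}
}

func main() {
	fmt.Println(Classify(Resources{Request: 100, Limit: 100})) // Guaranteed
	fmt.Println(Classify(Resources{Request: 100, Limit: 200})) // Burstable
	fmt.Println(Classify(Resources{}))                         // BestEffort
}
```

The point in the comment stands: overloading this relationship to also carry scheduling priority would either forbid some of the combinations above at certain priorities, or cap the number of priorities at three.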

@timothysc

Member

timothysc commented Mar 24, 2016

But even if that were a good idea, there are only three QoS levels, yet we want cluster admins to be able to configure many scheduling priority levels; certainly we need more than three. So we'd have to make rules about how scheduling priority and QoS map to one another, like "increasing scheduling priority should never decrease QoS" and cluster admins would have to define a mapping between the two

Isn't this just the ye olde vectored cross product with filling, using the determinant as the weighted sum? /cc @erikerlandson to sanity-check me on this.

@timothysc

Member

timothysc commented Mar 28, 2016

So I'm going to retract my previous statement and try to switch focus to defining the user stories first. Right now we (Red Hat) live in a multi-tenant environment where we have yet to flesh out the standard use cases for priorities, but there will likely be both user- and admin-assigned priorities. Before diving into semantics, I personally believe the priority + preemption mechanics should probably be their own design document.

@davidopp

Member

davidopp commented Mar 28, 2016

#22217 is the doc you're looking for :)

I intentionally put priority + preemption in the same doc as rescheduling because they are closely related. It makes for a somewhat lengthy doc but I think the alternative is more confusing.

@a-robinson

Member

a-robinson commented Apr 14, 2016

I'm glad to see this happening. Priority will also be useful for ensuring system pods are scheduled before user pods in most circumstances.

@Q-Lee

@davidopp davidopp removed this from the next-candidate milestone Apr 14, 2016

@davidopp

Member

davidopp commented Apr 14, 2016

Just to clarify, scheduler preemption is definitely not going to be in 1.3. (We had never intended for it to be in 1.3.)

@timothysc

Member

timothysc commented Apr 15, 2016

Just to clarify, scheduler preemption is definitely not going to be in 1.3

Two comments then:

  1. We should probably change the label from P1 -> P3.
  2. Wouldn't rescheduling also need to be de-prioritized from 1.3?

@davidopp

Member

davidopp commented Apr 15, 2016

Good point on priority, but moved to P2 rather than P3.

Rescheduler uses DisruptionBudget (#12611), not priority. That said, I don't think it's likely we'll have rescheduler for 1.3 either. I do want to make sure we have the building blocks (among them DisruptionBudget) but it doesn't seem we'll actually have rescheduler.
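
Editor's note: since DisruptionBudget keeps coming up as a separate building block from priority, here is a minimal sketch of what a disruption budget looks like using the policy/v1 Go types that exist today (the API was still being designed at the time of this comment; the names and numbers below are only illustrative).

```go
package main

import (
	"fmt"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
	minAvailable := intstr.FromInt(2)

	// A budget saying: never voluntarily disrupt the "web" pods below 2 replicas.
	pdb := policyv1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{Name: "web-pdb", Namespace: "default"},
		Spec: policyv1.PodDisruptionBudgetSpec{
			MinAvailable: &minAvailable,
			Selector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"app": "web"},
			},
		},
	}
	fmt.Printf("%s: keep at least %s pods of %v available\n",
		pdb.Name, pdb.Spec.MinAvailable.String(), pdb.Spec.Selector.MatchLabels)
}
```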

k8s-merge-robot added a commit that referenced this issue Jul 10, 2016

Merge pull request #22217 from davidopp/rescheduling
Automatic merge from submit-queue

[WIP/RFC] Rescheduling in Kubernetes design proposal

Proposal by @bgrant0607 and @davidopp (and inspired by years of discussion and experience from folks who worked on Borg and Omega).

This doc is a proposal for a set of inter-related concepts related to "rescheduling" -- that is, "moving" an already-running pod to a new node in order to improve where it is running. (Specific concepts discussed are priority, preemption, disruption budget, quota, `/evict` subresource, and rescheduler.)

Feedback on the proposal is very welcome. For now, please stick to comments about the design, not spelling, punctuation, grammar, broken links, etc., so we can keep the doc uncluttered enough to make it easy for folks to comment on the more important things. 

ref/ #22054 #18724 #19080 #12611 #20699 #17393 #12140 #22212

@HaiyangDING @mqliang @derekwaynecarr @kubernetes/sig-scheduling @kubernetes/huawei @timothysc @mml @dchen1107
@dims

Member

dims commented Sep 18, 2017

@davidopp @dchen1107 @kubernetes/sig-node-feature-requests Can we please move this out of 1.8? Doesn't look like much was done and it's too late now.

@dchen1107

Member

dchen1107 commented Sep 18, 2017

OK, here is a quick status update:

  1. The design proposal was agreed upon and merged during the 1.8 timeframe.
  2. The new API for the proposal was merged during the 1.8 timeframe.
  3. There are pending PRs and discussion on the implementation.

I am retargeting this to v1.9. @bsalamat please correct me if I missed anything above.

@dchen1107 dchen1107 modified the milestones: v1.8, v1.9 Sep 18, 2017

@bsalamat

Contributor

bsalamat commented Sep 18, 2017

@dchen1107 Priority and Preemption are already implemented in 1.8. We plan to improve the preemption algorithm in 1.9, but those improvements are beyond the scope of this issue in my opinion.
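
Editor's note: for concreteness, a minimal sketch of the API referred to here, written against the scheduling.k8s.io/v1 Go types that exist today (the feature landed as an alpha API in 1.8, so the exact group/version differed then; the names and value below are examples, not anything prescribed by the proposal).

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	schedulingv1 "k8s.io/api/scheduling/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// A cluster-scoped PriorityClass that an admin creates once.
	pc := schedulingv1.PriorityClass{
		ObjectMeta:  metav1.ObjectMeta{Name: "high-priority"},
		Value:       100000, // larger values are more important
		Description: "example class for latency-critical workloads",
	}

	// A pod opts in by naming the class; admission resolves it into
	// spec.priority, which the scheduler (and, later, Kubelet eviction) uses.
	pod := corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "important-app", Namespace: "default"},
		Spec: corev1.PodSpec{
			PriorityClassName: pc.Name,
			Containers: []corev1.Container{
				{Name: "app", Image: "registry.example.com/app:latest"},
			},
		},
	}
	fmt.Printf("pod %s uses priority class %s (value %d)\n",
		pod.Name, pod.Spec.PriorityClassName, pc.Value)
}
```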

@davidopp

Member

davidopp commented Sep 19, 2017

Is there a separate issue for the Kubelet changes to use priority for eviction? I assume that will be in 1.9?

I think this issue covers both uses of priority, though it probably would have been better to have two separate issues.

@k8s-merge-robot

Contributor

k8s-merge-robot commented Oct 11, 2017

[MILESTONENOTIFIER] Milestone Removed

@bsalamat @dchen1107

Important: This issue was missing labels required for the v1.9 milestone for more than 3 days:

kind: Must specify exactly one of kind/bug, kind/cleanup or kind/feature.


@k8s-merge-robot k8s-merge-robot removed this from the v1.9 milestone Oct 11, 2017

k8s-merge-robot added a commit that referenced this issue Oct 13, 2017

Merge pull request #53542 from dashpole/priority_eviction
Automatic merge from submit-queue (batch tested with PRs 51840, 53542, 53857, 53831, 53702). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

Kubelet Evictions take Priority into account

Issue: #22212
This implements the eviction strategy documented here: kubernetes/community#1162, and discussed here: kubernetes/community#846.
When priority is not enabled, all pods are treated as equal priority.

This PR makes the following changes:

1. Changes the eviction ordering strategy to (usage < requests, priority, usage - requests)
2. Changes unit testing to account for this change in eviction strategy (including tests where priority is disabled).
3. Adds a node e2e test which tests the eviction ordering of pods with different priorities.

/assign @dchen1107 @vishh 
cc @bsalamat @derekwaynecarr 

```release-note
Kubelet evictions take pod priority into account
```
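
Editor's note: a small sketch of the eviction ordering the merged PR describes (usage relative to requests first, then priority, then usage minus requests). The types and field names are simplified stand-ins for illustration, not the Kubelet's actual eviction code.

```go
package main

import (
	"fmt"
	"sort"
)

// victim is a simplified stand-in for what the Kubelet tracks per pod for
// the starved resource (e.g. memory); it is not a real Kubernetes type.
type victim struct {
	name     string
	usage    int64 // observed usage of the starved resource
	request  int64 // the pod's request for that resource
	priority int32 // pod priority; identical for all pods if priority is disabled
}

// rankForEviction orders pods so the first element is evicted first:
// 1. pods exceeding their request come before pods within their request,
// 2. lower priority comes before higher priority,
// 3. larger usage-above-request comes first among the remaining ties.
func rankForEviction(pods []victim) {
	sort.SliceStable(pods, func(i, j int) bool {
		a, b := pods[i], pods[j]
		aOver, bOver := a.usage > a.request, b.usage > b.request
		if aOver != bOver {
			return aOver
		}
		if a.priority != b.priority {
			return a.priority < b.priority
		}
		return a.usage-a.request > b.usage-b.request
	})
}

func main() {
	pods := []victim{
		{name: "within-request", usage: 100, request: 200, priority: 0},
		{name: "low-prio-over", usage: 300, request: 100, priority: 0},
		{name: "high-prio-over", usage: 500, request: 100, priority: 1000},
	}
	rankForEviction(pods)
	fmt.Println("eviction order:", pods[0].name, pods[1].name, pods[2].name)
}
```
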
@davidopp

Member

davidopp commented Jan 5, 2018

ref/ #47604

@fejta-bot

fejta-bot commented Apr 14, 2018

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@bsalamat

Contributor

bsalamat commented Apr 16, 2018

/remove-lifecycle stale

@fejta-bot

fejta-bot commented Jul 15, 2018

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@ravisantoshgudimetla

Contributor

ravisantoshgudimetla commented Jul 15, 2018

/remove-lifecycle stale

xref: #65990
