
change sort function of scheduling queue to avoid starvation when a lot of unschedulable pods are in the queue #72619

Conversation

everpeace
Contributor

@everpeace everpeace commented Jan 7, 2019

What type of PR is this?
/kind bug

What this PR does / why we need it:

This changes the sort function of the scheduler's priority queue to prevent starvation (some pods not getting a chance to be scheduled for a very long time).

#71488, which resolved #71486, alleviates starvation by taking condition.LastTransitionTime into account in the sort function of the scheduler's priority queue. However, condition.LastTransitionTime is updated only when condition.Status changes, which happens after the scheduler binds a pod to a node. This means that once a pod is marked unschedulable, which creates a PodScheduled condition with Status=False on the pod, the LastTransitionTime field is never updated until the pod is successfully scheduled. So this still blocks newer pods when a lot of pods determined to be unschedulable exist in the queue.

This PR:

  • updates condition.LastProbeTime every time a pod is determined
    unschedulable.
  • changes the sort function to use condition.LastProbeTime, avoiding the
    starvation described above (see the sketch below).
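
For illustration, here is a minimal sketch of the resulting queue ordering. The helper names (podTimestamp, activeQComp, podPriority) and exact signatures are approximations of the scheduler's internals, not verbatim code from this PR: higher-priority pods come first, and among equal priorities the pod tried least recently comes first, so a pod that just failed a scheduling attempt falls behind its peers.

```go
package queue

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	podutil "k8s.io/kubernetes/pkg/api/v1/pod"
)

// podTimestamp picks the time to sort a pending pod by: the LastProbeTime of
// its PodScheduled condition if set (refreshed on every failed attempt after
// this PR), falling back to LastTransitionTime and then to the creation time.
func podTimestamp(pod *v1.Pod) *metav1.Time {
	_, condition := podutil.GetPodCondition(&pod.Status, v1.PodScheduled)
	if condition == nil {
		return &pod.CreationTimestamp
	}
	if condition.LastProbeTime.IsZero() {
		return &condition.LastTransitionTime
	}
	return &condition.LastProbeTime
}

// podPriority treats an unset priority as zero.
func podPriority(pod *v1.Pod) int32 {
	if pod.Spec.Priority != nil {
		return *pod.Spec.Priority
	}
	return 0
}

// activeQComp is the "less" function for the active queue: higher priority
// first; for equal priorities, the older timestamp (least recently tried)
// wins, which is what keeps freshly-failed pods from starving newer ones.
func activeQComp(pod1, pod2 *v1.Pod) bool {
	prio1, prio2 := podPriority(pod1), podPriority(pod2)
	return prio1 > prio2 ||
		(prio1 == prio2 && podTimestamp(pod1).Before(podTimestamp(pod2)))
}
```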

Special notes for your reviewer:

I'm concerned about a decrease in scheduler throughput and an increase in k8s API server load because of this change. I think a performance test would be needed to understand how much impact the PR has on them.

Does this PR introduce a user-facing change?:

Fix scheduling starvation of pods in clusters with a large number of unschedulable pods.

/sig scheduling

…chedulable pods are in the queue

When starvation happens:
- a lot of unschedulable pods exist at the head of the queue
- because condition.LastTransitionTime is updated only when condition.Status changes
- (this means that once a pod is marked unschedulable, the field is never updated until the pod is successfully scheduled.)

What was changed:
- condition.LastProbeTime is updated every time a pod is determined
unschedulable.
- changed the sort function to use LastProbeTime to avoid the starvation
described above

Consideration:
- This change increases k8s API server load because it updates Pod.Status whenever the scheduler determines a pod to be
unschedulable.

Signed-off-by: Shingo Omura <everpeace@gmail.com>
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 7, 2019
@k8s-ci-robot
Contributor

Hi @everpeace. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Huang-Wei
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 7, 2019
// v1.PodCondition written by the scheduler when a pod fails to schedule;
// LastProbeTime is now refreshed on every failed attempt.
Message:       err.Error(),
Type:          v1.PodScheduled,
Status:        v1.ConditionFalse,
LastProbeTime: metav1.Now(),


Since the active queue takes lastProbeTime into account, and if a pod keeps getting marked unschedulable, its lastProbeTime will always be the latest, so it will be farthest behind in the queue and hence least likely to be considered for scheduling? Is that the general idea?

After being marked as unschedulable, how often is it tried again?

If an unschedulable pod eventually becomes schedulable, the LastTransitionTime will still update for this condition, hence it will still be lowest in priority to get scheduled compared to other pods. I am wondering if that will cause starvation?

Contributor Author

@everpeace everpeace Jan 7, 2019


Is that the general idea?

Thank you for your clear explanation 🙇 Yes, it is. That's the idea.

After being marked as unschedulable, how often is it tried again?

I think it depends on cluster status. MoveAllToActiveQueue() moves unschedulable pods from the unschedulable queue to the active queue, with backoff applied per pod. The method is generally called when the scheduler detects that pod/node status has changed.
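
To make that concrete, here is a toy sketch of that flush (the real PriorityQueue in pkg/scheduler/internal/queue is heap-based and also handles backoff and nominated pods; the types and field names below are simplified assumptions):

```go
package queue

import "sync"

// Pod stands in for *v1.Pod in this sketch.
type Pod struct {
	Namespace string
	Name      string
}

// PriorityQueue is a simplified stand-in for the scheduler's queue.
type PriorityQueue struct {
	lock           sync.Mutex
	activeQ        []*Pod          // pods waiting to be tried (a heap ordered by activeQComp in the real code)
	unschedulableQ map[string]*Pod // pods whose last attempt failed, keyed by namespace/name
}

// MoveAllToActiveQueue moves every unschedulable pod back into the active
// queue. The scheduler calls this when cluster state changes in a way that
// might make such pods schedulable again (e.g. a node is added or a pod is
// deleted).
func (p *PriorityQueue) MoveAllToActiveQueue() {
	p.lock.Lock()
	defer p.lock.Unlock()
	for _, pod := range p.unschedulableQ {
		p.activeQ = append(p.activeQ, pod) // real code: heap push, keeping activeQComp order
	}
	p.unschedulableQ = map[string]*Pod{}
}
```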

If an unschedulable pod eventually becomes schedulable, the LastTransitionTime will still update for this condition

LastTransitionTime should, by definition, be updated only when the condition status changes. This means that once the pod is marked as unschedulable, which creates a PodScheduled condition with Status=False on the pod status, LastTransitionTime shouldn't be updated until the condition status becomes True. That is why I added code updating LastProbeTime when schedule() fails.

I think kubelet is responsible for setting PodScheduled.Status = True on the PodScheduled condition under the current implementation.
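
Roughly, that error path looks like the sketch below; the PodConditionUpdater interface here is an approximation of how the scheduler pushes status updates to the API server, not verbatim code from this PR:

```go
package scheduler

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// PodConditionUpdater approximates the scheduler's hook for patching a pod's
// status through the API server.
type PodConditionUpdater interface {
	Update(pod *v1.Pod, condition *v1.PodCondition) error
}

// recordSchedulingFailure sketches what happens when schedule() fails for a
// pod: Status stays False (so LastTransitionTime does not move), while
// LastProbeTime is refreshed on every attempt, which is what the new queue
// ordering keys on.
func recordSchedulingFailure(updater PodConditionUpdater, pod *v1.Pod, schedErr error) error {
	return updater.Update(pod, &v1.PodCondition{
		Type:          v1.PodScheduled,
		Status:        v1.ConditionFalse,
		LastProbeTime: metav1.Now(), // bumped on every failed scheduling attempt
		Reason:        v1.PodReasonUnschedulable,
		Message:       schedErr.Error(),
	})
}
```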


What I was trying to say is that this change optimizes for the case where there are a lot of unschedulable pods and favors other pods' scheduling in that case. Does it make the recovery of a pod which has been unschedulable for a while and just became schedulable slower compared to before, @everpeace @bsalamat, because it keeps getting pushed to the end by the constant updating of lastProbeTime?

Member


@krmayankk I don't think so. If an unschedulable pod has higher priority, it will still get to the head of the queue even after this change. When it has the same priority as other pods, it is fair to put it behind other pods with the same priority after the scheduler has tried it and determined that it is unschedulable.


@bsalamat I was talking about the case where no priority is involved, i.e. all pods have the same priority or the default priority. In that case the change is trying to avoid starvation of regular pods when a lot of unschedulable pods are present. How does it affect the recovery of the unschedulable pods which finally become schedulable? Does this behavior change compared to without this change?
Note: just trying to understand; the answer may be no change. It depends on how the active queue is implemented.

Member


With this change (and somewhat similarly after #71488), a pod that is determined unschedulable goes behind other similar priority pods in the scheduling queue. Once pods become schedulable they are processed by their order in the scheduling queue. So, depending on their location in the queue, they may get scheduled before or after same priority pods.
In short, we don't expect further delays in scheduling unschedulable pods after this change.

Member

@bsalamat bsalamat left a comment


Thanks, @everpeace! The change looks good to me. Your concern with respect to increase in the API server load is valid, but even before this change it was likely that the condition.message of the pod would change after a scheduling retry, causing a request to be sent to the API server. The message contains detailed information about how many nodes were considered and what predicates failed for nodes. So, in larger clusters with higher churn the message would change somewhat frequently, causing a new request to be sent to the API server. After this PR all the scheduling attempts would send a request to the API server which may make things worse than before, but I don't expect a major issue.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 8, 2019
@bsalamat
Member

bsalamat commented Jan 8, 2019

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bsalamat, everpeace

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 8, 2019
@everpeace
Contributor Author

everpeace commented Jan 8, 2019

@bsalamat Thanks for your review!! It's actually my first time contributing to kube-scheduler :-)

So, in larger clusters with higher churn the message would change somewhat frequently, causing a new request to be sent to the API server.

Oh yeah, that's right 👍 Thanks, too!

@bsalamat
Member

bsalamat commented Jan 9, 2019

We need to cherrypick this into previous releases. I will send cherrypick PRs.

@everpeace
Contributor Author

Oh, I wasn't able to take care of cherry-picking to previous versions 🙇 Thank you!!

k8s-ci-robot added a commit that referenced this pull request Jan 10, 2019
…19-upstream-release-1.12

Automated cherry-pick of #72619 into upstream release 1.12
@bsalamat
Member

@everpeace No problem. I have already sent those PRs.

k8s-ci-robot added a commit that referenced this pull request Jan 15, 2019
…19-upstream-release-1.13

Automated cherry-pick of #72619 into upstream release 1.13
k8s-ci-robot added a commit that referenced this pull request Jan 16, 2019
…19-upstream-release-1.11

Automated cherry-pick of #72619 into upstream release 1.11
@bsalamat
Member

@everpeace FYI, I just filed #73485 to address the concern you had about the increase in API server load.

@everpeace
Contributor Author

Oh, ok. I got it. I noticed another engineer just started working on the issue. Thanks for your notification!

Successfully merging this pull request may close these issues.

Unschedulable pods may block head of the scheduling queue