
change sort function of scheduling queue to avoid starvation when a lot of unschedulable pods are in the queue #72619

Merged
merged 1 commit on Jan 8, 2019

Conversation

@everpeace
Contributor

everpeace commented Jan 7, 2019

What type of PR is this?
/kind bug

What this PR does / why we need it:

This changes the sort function of the scheduler's priority queue to prevent starvation (some pods not getting a chance to be scheduled for a very long time).

#71488, which resolved #71486, alleviates starvation by taking condition.LastTransitionTime into account in the sort function of the scheduler's priority queue. However, condition.LastTransitionTime is updated only when condition.Status changes, which happens after the scheduler binds a pod to a node. This means that once a pod is marked unschedulable, which creates a PodScheduled condition with Status=False on the pod, the LastTransitionTime field is never updated until the pod is successfully scheduled. So newer pods are still blocked when many pods determined to be unschedulable exist in the queue.

This PR changes:

  • condition.LastProbeTime is now updated every time a pod is determined
    unschedulable.
  • the sort function now uses condition.LastProbeTime to avoid the starvation
    described above (a minimal sketch follows this list).
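
For illustration, here is a minimal, self-contained sketch of the ordering idea. The types and the helpers podTimestamp/less are simplified stand-ins chosen for this example, not the actual scheduler code:

```go
// Sketch only: simplified stand-ins for the relevant v1.Pod fields and the
// active-queue comparator.
package main

import (
	"fmt"
	"time"
)

type podCondition struct {
	Type               string
	Status             string
	LastProbeTime      time.Time
	LastTransitionTime time.Time
}

type pod struct {
	Name       string
	Priority   int32
	Conditions []podCondition
}

// podTimestamp returns the timestamp used for ordering: the PodScheduled
// condition's LastProbeTime if set, otherwise its LastTransitionTime.
func podTimestamp(p *pod) time.Time {
	for _, c := range p.Conditions {
		if c.Type == "PodScheduled" {
			if !c.LastProbeTime.IsZero() {
				return c.LastProbeTime
			}
			return c.LastTransitionTime
		}
	}
	return time.Time{}
}

// less orders the active queue: higher priority first; among equal
// priorities, the pod whose last scheduling attempt is older comes first,
// so a pod that was just retried and found unschedulable moves behind its peers.
func less(p1, p2 *pod) bool {
	if p1.Priority != p2.Priority {
		return p1.Priority > p2.Priority
	}
	return podTimestamp(p1).Before(podTimestamp(p2))
}

func main() {
	now := time.Now()
	retried := &pod{Name: "just-retried", Conditions: []podCondition{
		{Type: "PodScheduled", Status: "False", LastProbeTime: now}}}
	waiting := &pod{Name: "waiting", Conditions: []podCondition{
		{Type: "PodScheduled", Status: "False", LastProbeTime: now.Add(-time.Minute)}}}
	fmt.Println(less(waiting, retried)) // true: the pod with the older attempt sorts ahead
}
```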

Special notes for your reviewer:

I'm concerned about a decrease in scheduler throughput and an increase in k8s API server load because of this change. I think a performance test would be needed to understand how much impact this PR has on them.

Does this PR introduce a user-facing change?:

Fix scheduling starvation of pods in clusters with a large number of unschedulable pods.

/sig scheduling

change sort function of scheduling queue to avoid starvation when unschedulable pods are in the queue

When starvation happens:
- a lot of unschedulable pods exist at the head of the queue
- because condition.LastTransitionTime is updated only when condition.Status changes
- (this means that once a pod is marked unschedulable, the field is never updated until the pod is successfully scheduled)

What was changed:
- condition.LastProbeTime is updated every time a pod is determined
unschedulable.
- changed the sort function to use LastProbeTime to avoid the starvation
described above

Consideration:
- This change increases k8s API server load because it updates Pod.status whenever the scheduler determines a pod to be
unschedulable.

Signed-off-by: Shingo Omura <everpeace@gmail.com>
@k8s-ci-robot

Contributor

k8s-ci-robot commented Jan 7, 2019

Hi @everpeace. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Huang-Wei

Member

Huang-Wei commented Jan 7, 2019

/ok-to-test

Review thread on the diff hunk that builds the PodScheduled condition:

    Message:       err.Error(),
    Type:          v1.PodScheduled,
    Status:        v1.ConditionFalse,
    LastProbeTime: metav1.Now(),


@krmayankk

krmayankk Jan 7, 2019

Contributor

Since the active queue takes lastProbeTime into account, if a pod keeps being determined unschedulable, its lastProbeTime will always be the latest, so it will be farthest behind in the queue and hence least likely to be considered for scheduling? Is that the general idea?

After being marked as unschedulable, how often is it tried again?

If an unschedulable pod eventually becomes schedulable, the LastTransitionTime will still update for this condition, hence it will still be lowest in priority to get scheduled compared to other pods. I am wondering if that will cause starvation?


@everpeace

everpeace Jan 7, 2019

Author Contributor

Is that the general idea?

Thank you for your clear explanation 🙇 Yes, it is. That's the idea.

After being marked as unschedulable, how often is it tried again?

I think it depends on cluster status. MoveAllToActiveQueue() moves unschedulable pods from the unschedulable queue to the active queue with a per-pod backoff. The method is generally called when the scheduler detects that pod/node status has changed.

If an unschedulable pod eventually becomes schedulable, the LastTransitionTime will still update for this condition

LastTransitionTime should be updated only when the condition status changes, by definition. This means that once the pod is marked as unschedulable, which creates a PodScheduled condition with Status=False in the pod status, LastTransitionTime shouldn't be updated until the condition status becomes True. That is why I added code that updates LastProbeTime when schedule() fails.

I think kubelet is responsible for updating the PodScheduled condition to Status=True under the current implementation.
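
A minimal sketch of the timestamp semantics described above; the helper recordAttempt and the types are invented for illustration and are not the upstream condition-update code:

```go
// LastProbeTime is refreshed on every scheduling attempt, while
// LastTransitionTime only moves when Status actually flips.
package main

import (
	"fmt"
	"time"
)

type scheduledCondition struct {
	Status             string
	LastProbeTime      time.Time
	LastTransitionTime time.Time
}

func recordAttempt(c *scheduledCondition, newStatus string, now time.Time) {
	c.LastProbeTime = now // updated on every attempt (what this PR relies on)
	if c.Status != newStatus {
		c.Status = newStatus
		c.LastTransitionTime = now // updated only on a status change
	}
}

func main() {
	c := scheduledCondition{}
	recordAttempt(&c, "False", time.Now())                   // first unschedulable attempt: both timestamps set
	recordAttempt(&c, "False", time.Now().Add(time.Minute))  // retry: only LastProbeTime moves
	fmt.Println(c.LastProbeTime.After(c.LastTransitionTime)) // true
}
```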


@krmayankk

krmayankk Jan 8, 2019

Contributor

What I was trying to say is that this change optimizes for the case where there are a lot of unschedulable pods and favors scheduling other pods in that case. Does it make the recovery of a pod that has been unschedulable for a while and just became schedulable slower compared to before, @everpeace @bsalamat, because it keeps getting pushed to the end due to the constant updating of lastProbeTime?


@bsalamat

bsalamat Jan 8, 2019

Contributor

@krmayankk I don't think so. If an unschedulable pod has higher priority, it will still get to the head of the queue even after this change. When it has the same priority as other pods, it is fair to put it behind other pods with the same priority after the scheduler has tried it and determined that it is unschedulable.


@krmayankk

krmayankk Jan 8, 2019

Contributor

@bsalamat I was talking about the case where there is no priority involved: all pods have the same or default priority. In that case the change tries to avoid starvation of regular pods when a lot of unschedulable pods are present. How does it affect the recovery of the unschedulable pods that finally become schedulable? Does this behavior change compared to without this change?
Note: Just trying to understand; the answer may be no change. It depends on how the active queue is implemented.


@bsalamat

bsalamat Jan 8, 2019

Contributor

With this change (and somewhat similarly after #71488), a pod that is determined unschedulable goes behind other similar-priority pods in the scheduling queue. Once pods become schedulable, they are processed in their order in the scheduling queue. So, depending on their location in the queue, they may get scheduled before or after same-priority pods.
In short, we don't expect further delays in scheduling unschedulable pods after this change.

@bsalamat
Contributor

bsalamat left a comment

Thanks, @everpeace! The change looks good to me. Your concern with respect to increase in the API server load is valid, but even before this change it was likely that the condition.message of the pod would change after a scheduling retry, causing a request to be sent to the API server. The message contains detailed information about how many nodes were considered and what predicates failed for nodes. So, in larger clusters with higher churn the message would change somewhat frequently, causing a new request to be sent to the API server. After this PR all the scheduling attempts would send a request to the API server which may make things worse than before, but I don't expect a major issue.
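
To make the load concern concrete, here is a hypothetical, self-contained illustration; the type and function names are invented for this example and are not taken from the scheduler code. The point is that any equality-based check that previously skipped redundant status updates can no longer short-circuit, because LastProbeTime now changes on every attempt:

```go
package main

import (
	"fmt"
	"time"
)

// scheduledCondition is a simplified stand-in for the PodScheduled condition.
type scheduledCondition struct {
	Status        string
	Message       string
	LastProbeTime time.Time
}

// needsStatusUpdate mimics an equality check that skips redundant writes.
func needsStatusUpdate(prev, next scheduledCondition) bool {
	return prev.Status != next.Status ||
		prev.Message != next.Message ||
		!prev.LastProbeTime.Equal(next.LastProbeTime)
}

func main() {
	prev := scheduledCondition{Status: "False", Message: "0/10 nodes are available", LastProbeTime: time.Now()}
	next := prev
	next.LastProbeTime = time.Now().Add(time.Second) // refreshed on every retry after this PR
	fmt.Println(needsStatusUpdate(prev, next))       // true: a write is sent even though nothing else changed
}
```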

/lgtm

@bsalamat

Contributor

bsalamat commented Jan 8, 2019

/approve

@k8s-ci-robot

Contributor

k8s-ci-robot commented Jan 8, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bsalamat, everpeace

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@everpeace

Contributor Author

everpeace commented Jan 8, 2019

@bsalamat Thanks for your review!! It's actually my first time contributing to kube-scheduler :-)

So, in larger clusters with higher churn the message would change somewhat frequently, causing a new request to be sent to the API server.

oh, yeah, That's right 👍 Thanks, too!

@k8s-ci-robot k8s-ci-robot merged commit 5a70801 into kubernetes:master Jan 8, 2019

19 checks passed

cla/linuxfoundation: everpeace authorized
pull-kubernetes-bazel-build: Job succeeded.
pull-kubernetes-bazel-test: Job succeeded.
pull-kubernetes-cross: Skipped
pull-kubernetes-e2e-gce: Job succeeded.
pull-kubernetes-e2e-gce-100-performance: Job succeeded.
pull-kubernetes-e2e-gce-device-plugin-gpu: Job succeeded.
pull-kubernetes-e2e-gke: Skipped
pull-kubernetes-e2e-kops-aws: Job succeeded.
pull-kubernetes-e2e-kubeadm-gce: Skipped
pull-kubernetes-godeps: Skipped
pull-kubernetes-integration: Job succeeded.
pull-kubernetes-kubemark-e2e-gce-big: Job succeeded.
pull-kubernetes-local-e2e: Skipped
pull-kubernetes-local-e2e-containerized: Skipped
pull-kubernetes-node-e2e: Job succeeded.
pull-kubernetes-typecheck: Job succeeded.
pull-kubernetes-verify: Job succeeded.
tide: In merge pool.
@bsalamat

Contributor

bsalamat commented Jan 9, 2019

We need to cherry-pick this into previous releases. I will send cherry-pick PRs.

@everpeace

Contributor Author

everpeace commented Jan 10, 2019

Oh, I couldn't take care of cherry-picking to previous versions 🙇 Thank you!!

k8s-ci-robot added a commit that referenced this pull request Jan 10, 2019

Merge pull request #72753 from bsalamat/automated-cherry-pick-of-#72619-upstream-release-1.12

Automated cherry-pick of #72619 into upstream release 1.12
@bsalamat

Contributor

bsalamat commented Jan 10, 2019

@everpeace No problem. I have already sent those PRs.

k8s-ci-robot added a commit that referenced this pull request Jan 15, 2019

Merge pull request #72754 from bsalamat/automated-cherry-pick-of-#72619-upstream-release-1.13

Automated cherry-pick of #72619 into upstream release 1.13

k8s-ci-robot added a commit that referenced this pull request Jan 16, 2019

Merge pull request #72750 from bsalamat/automated-cherry-pick-of-#72619-upstream-release-1.11

Automated cherry-pick of #72619 into upstream release 1.11
@bsalamat

Contributor

bsalamat commented Jan 29, 2019

@everpeace FYI, I just filed #73485 to address the concern that you have had with respect to the increase of the API server load.

@everpeace

Contributor Author

everpeace commented Jan 30, 2019

Oh, ok. I got it. I noticed another engineer just started working on the issue. Thanks for your notification!
