
change sort function of scheduling queue to avoid starvation when a lot of unschedulable pods are in the queue #72619

Conversation

everpeace
Contributor

@everpeace everpeace commented Jan 7, 2019

What type of PR is this?
/kind bug

What this PR does / why we need it:

This changes the sort function of the scheduler's priority queue to prevent starvation (some pods not getting a chance to be scheduled for a very long time).

#71488, which resolved #71486, alleviates starvation by taking condition.LastTransitionTime into account in the sort function of the scheduler's priority queue. However, condition.LastTransitionTime is updated only when condition.Status changes, which happens after the scheduler binds a pod to a node. This means that once a pod is marked unschedulable, which creates a PodScheduled condition with Status=False on the pod, the LastTransitionTime field is never updated until the pod is successfully scheduled. So this still blocks newer pods when a lot of pods determined to be unschedulable exist in the queue.

This PR:

  • updates condition.LastProbeTime every time a pod is determined
    unschedulable.
  • changes the sort function to use condition.LastProbeTime, avoiding the
    starvation described above (see the sketch below).
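
For illustration, here is a minimal sketch of the resulting queue ordering. The helper names (podTimestamp, activeQComp, podPriority) and exact signatures are approximations of the scheduler's internals, not verbatim code from this PR: higher-priority pods come first, and among equal priorities the pod tried least recently comes first, so a pod that just failed a scheduling attempt falls behind its peers.

```go
package queue

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	podutil "k8s.io/kubernetes/pkg/api/v1/pod"
)

// podTimestamp picks the time to sort a pending pod by: the LastProbeTime of
// its PodScheduled condition if set (refreshed on every failed attempt after
// this PR), falling back to LastTransitionTime and then to the creation time.
func podTimestamp(pod *v1.Pod) *metav1.Time {
	_, condition := podutil.GetPodCondition(&pod.Status, v1.PodScheduled)
	if condition == nil {
		return &pod.CreationTimestamp
	}
	if condition.LastProbeTime.IsZero() {
		return &condition.LastTransitionTime
	}
	return &condition.LastProbeTime
}

// podPriority treats an unset priority as zero.
func podPriority(pod *v1.Pod) int32 {
	if pod.Spec.Priority != nil {
		return *pod.Spec.Priority
	}
	return 0
}

// activeQComp is the "less" function for the active queue: higher priority
// first; for equal priorities, the older timestamp (least recently tried)
// wins, which is what keeps freshly-failed pods from starving newer ones.
func activeQComp(pod1, pod2 *v1.Pod) bool {
	prio1, prio2 := podPriority(pod1), podPriority(pod2)
	return prio1 > prio2 ||
		(prio1 == prio2 && podTimestamp(pod1).Before(podTimestamp(pod2)))
}
```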

Special notes for your reviewer:

I'm concerned about a decrease in scheduler throughput and an increase in k8s API server load because of this change. I think a performance test would be needed to understand how much impact the PR has on them.

Does this PR introduce a user-facing change?:

Fix scheduling starvation of pods in clusters with a large number of unschedulable pods.

/sig scheduling

…chedulable pods are in the queue

When starvation happens:
- a lot of unschedulable pods exist at the head of the queue
- because condition.LastTransitionTime is updated only when condition.Status changes
- (this means that once a pod is marked unschedulable, the field is never updated until the pod is successfully scheduled.)

What was changed:
- condition.LastProbeTime is updated every time a pod is determined
unschedulable.
- changed the sort function to use LastProbeTime to avoid the starvation
described above

Consideration:
- This change increases k8s API server load because it updates Pod.Status whenever the scheduler determines a pod to be
unschedulable.

Signed-off-by: Shingo Omura <everpeace@gmail.com>
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 7, 2019
@k8s-ci-robot
Contributor

Hi @everpeace. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Huang-Wei
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 7, 2019
// v1.PodCondition written by the scheduler when a pod fails to schedule;
// LastProbeTime is now refreshed on every failed attempt.
Message:       err.Error(),
Type:          v1.PodScheduled,
Status:        v1.ConditionFalse,
LastProbeTime: metav1.Now(),


Since the active queue takes lastProbeTime into account, and if a pod keeps getting marked unschedulable, its lastProbeTime will always be the latest, so it will be farthest behind in the queue and hence least likely to be considered for scheduling? Is that the general idea?

After being marked as unschedulable, how often is it tried again?

If an unschedulable pod eventually becomes schedulable, the LastTransitionTime will still update for this condition, hence it will still be lowest in priority to get scheduled compared to other pods. I am wondering if that will cause starvation?

Contributor Author

@everpeace everpeace Jan 7, 2019


Is that the general idea?

Thank you for your clear explanation 🙇 Yes, it is. That's the idea.

After being marked as unschedulable, how often is it tried again?

I think it depends on cluster status. MoveAllToActiveQueue() moves unschedulable pods from the unschedulable queue to the active queue, with backoff applied per pod. The method is generally called when the scheduler detects that pod/node status has changed.
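
To make that concrete, here is a toy sketch of that flush (the real PriorityQueue in pkg/scheduler/internal/queue is heap-based and also handles backoff and nominated pods; the types and field names below are simplified assumptions):

```go
package queue

import "sync"

// Pod stands in for *v1.Pod in this sketch.
type Pod struct {
	Namespace string
	Name      string
}

// PriorityQueue is a simplified stand-in for the scheduler's queue.
type PriorityQueue struct {
	lock           sync.Mutex
	activeQ        []*Pod          // pods waiting to be tried (a heap ordered by activeQComp in the real code)
	unschedulableQ map[string]*Pod // pods whose last attempt failed, keyed by namespace/name
}

// MoveAllToActiveQueue moves every unschedulable pod back into the active
// queue. The scheduler calls this when cluster state changes in a way that
// might make such pods schedulable again (e.g. a node is added or a pod is
// deleted).
func (p *PriorityQueue) MoveAllToActiveQueue() {
	p.lock.Lock()
	defer p.lock.Unlock()
	for _, pod := range p.unschedulableQ {
		p.activeQ = append(p.activeQ, pod) // real code: heap push, keeping activeQComp order
	}
	p.unschedulableQ = map[string]*Pod{}
}
```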

If an unschedulable pod eventually becomes schedulable, the LastTransitionTime will still update for this condition

LastTransitionTime should, by definition, be updated only when the condition status changes. This means that once the pod is marked as unschedulable, which creates a PodScheduled condition with Status=False on the pod status, LastTransitionTime shouldn't be updated until the condition status becomes True. That is why I added code updating LastProbeTime when schedule() fails.

I think kubelet is responsible for setting PodScheduled.Status = True on the PodScheduled condition under the current implementation.
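
Roughly, that error path looks like the sketch below; the PodConditionUpdater interface here is an approximation of how the scheduler pushes status updates to the API server, not verbatim code from this PR:

```go
package scheduler

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// PodConditionUpdater approximates the scheduler's hook for patching a pod's
// status through the API server.
type PodConditionUpdater interface {
	Update(pod *v1.Pod, condition *v1.PodCondition) error
}

// recordSchedulingFailure sketches what happens when schedule() fails for a
// pod: Status stays False (so LastTransitionTime does not move), while
// LastProbeTime is refreshed on every attempt, which is what the new queue
// ordering keys on.
func recordSchedulingFailure(updater PodConditionUpdater, pod *v1.Pod, schedErr error) error {
	return updater.Update(pod, &v1.PodCondition{
		Type:          v1.PodScheduled,
		Status:        v1.ConditionFalse,
		LastProbeTime: metav1.Now(), // bumped on every failed scheduling attempt
		Reason:        v1.PodReasonUnschedulable,
		Message:       schedErr.Error(),
	})
}
```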


What I was trying to say is that this change optimizes for the case where there are a lot of unschedulable pods and favors other pods' scheduling in that case. Does it make the recovery of a pod which has been unschedulable for a while and just became schedulable slower compared to before, @everpeace @bsalamat, because it keeps getting pushed to the end by the constant updating of lastProbeTime?

Member


@krmayankk I don't think so. If an unschedulable pod has higher priority, it will still get to the head of the queue even after this change. When it has the same priority as other pods, it is fair to put it behind other pods with the same priority after the scheduler has tried it and determined that it is unschedulable.


@bsalamat I was talking about the case where no priority is involved, i.e. all pods have the same priority or the default priority. In that case the change is trying to avoid starvation of regular pods when a lot of unschedulable pods are present. How does it affect the recovery of the unschedulable pods which finally become schedulable? Does this behavior change compared to without this change?
Note: just trying to understand; the answer may be no change. It depends on how the active queue is implemented.

Member


With this change (and somewhat similarly after #71488), a pod that is determined unschedulable goes behind other similar priority pods in the scheduling queue. Once pods become schedulable they are processed by their order in the scheduling queue. So, depending on their location in the queue, they may get scheduled before or after same priority pods.
In short, we don't expect further delays in scheduling unschedulable pods after this change.

Member

@bsalamat bsalamat left a comment


Thanks, @everpeace! The change looks good to me. Your concern with respect to increase in the API server load is valid, but even before this change it was likely that the condition.message of the pod would change after a scheduling retry, causing a request to be sent to the API server. The message contains detailed information about how many nodes were considered and what predicates failed for nodes. So, in larger clusters with higher churn the message would change somewhat frequently, causing a new request to be sent to the API server. After this PR all the scheduling attempts would send a request to the API server which may make things worse than before, but I don't expect a major issue.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 8, 2019
@bsalamat
Member

bsalamat commented Jan 8, 2019

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bsalamat, everpeace

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 8, 2019
@everpeace
Contributor Author

everpeace commented Jan 8, 2019

@bsalamat Thanks for your review!! It's actually my first time contributing to kube-scheduler :-)

So, in larger clusters with higher churn the message would change somewhat frequently, causing a new request to be sent to the API server.

Oh yeah, that's right 👍 Thanks, too!

@bsalamat
Member

bsalamat commented Jan 9, 2019

We need to cherrypick this into previous releases. I will send cherrypick PRs.

@everpeace
Contributor Author

Oh, I wasn't able to take care of cherry-picking to previous versions 🙇 Thank you!!

k8s-ci-robot added a commit that referenced this pull request Jan 10, 2019
…19-upstream-release-1.12

Automated cherry-pick of #72619 into upstream release 1.12
@bsalamat
Member

@everpeace No problem. I have already sent those PRs.

k8s-ci-robot added a commit that referenced this pull request Jan 15, 2019
…19-upstream-release-1.13

Automated cherry-pick of #72619 into upstream release 1.13
k8s-ci-robot added a commit that referenced this pull request Jan 16, 2019
…19-upstream-release-1.11

Automated cherry-pick of #72619 into upstream release 1.11
@bsalamat
Member

@everpeace FYI, I just filed #73485 to address the concern you had about the increase in API server load.

@everpeace
Contributor Author

Oh, ok. I got it. I noticed another engineer just started working on the issue. Thanks for your notification!

Successfully merging this pull request may close these issues.

Unschedulable pods may block head of the scheduling queue