change sort function of scheduling queue to avoid starvation when a lot of unschedulable pods are in the queue #72619
Conversation
…chedulable pods are in the queue

When starvation happens:
- a lot of unschedulable pods exist at the head of the queue
- because condition.LastTransitionTime is updated only when condition.Status changes
- (this means that once a pod is marked unschedulable, the field is never updated until the pod is successfully scheduled)

What was changed:
- condition.LastProbeTime is now updated every time a pod is determined unschedulable
- the sort function now uses LastProbeTime to avoid the starvation described above

Consideration:
- This change increases k8s API server load because it updates Pod.status whenever the scheduler decides a pod is unschedulable.

Signed-off-by: Shingo Omura <everpeace@gmail.com>
Hi @everpeace. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/ok-to-test
Message:       err.Error(),
Type:          v1.PodScheduled,
Status:        v1.ConditionFalse,
LastProbeTime: metav1.Now(),
Since the active queue takes lastProbeTime into account, if a pod keeps getting marked unschedulable, its lastProbeTime will always be the latest, so it will be farthest behind in the queue and hence least likely to be considered for scheduling? Is that the general idea?

After being marked as unschedulable, how often is it tried again?

If an unschedulable pod eventually becomes schedulable, the LastTransitionTime will still update for this condition, hence it will still have the lowest priority to get scheduled compared to other pods. I am wondering if that will cause starvation?
> Is that the general idea?

Thank you for your clear explanation 🙇 Yes, it is. That's the idea.

> After being marked as unschedulable, how often is it tried again?

I think it depends on cluster status. MoveAllToActiveQueue() moves unschedulable pods from the unschedulable queue to the active queue with a per-pod backoff. The method is generally called when the scheduler detects that pod/node status has changed.
> If an unschedulable pod eventually becomes schedulable, the LastTransitionTime will still update for this condition

LastTransitionTime should be updated only when the condition status changes, by definition. This means that once a pod is marked as unschedulable, which creates a PodScheduled condition with Status=False on the pod status, LastTransitionTime shouldn't be updated until the condition status becomes True. It is LastProbeTime that changes, because I added code updating LastProbeTime when schedule() fails.

I think kubelet is responsible for updating the PodScheduled condition to Status=True under the current implementation.
What I was trying to say is that this change optimizes for the case where there are a lot of unschedulable pods, and favors other pods' scheduling in that case. Does it make the recovery of a pod which has been unschedulable for a while and just became schedulable slower compared to before, @everpeace @bsalamat, because it keeps being pushed to the end due to the constant updating of lastProbeTime?
@krmayankk I don't think so. If an unschedulable pod has higher priority, it will still get to the head of the queue even after this change. When it has the same priority as other pods, it is fair to put it behind other pods with the same priority after the scheduler has tried it and determined that it is unschedulable.
@bsalamat I was talking about the case when there is no priority involved: all pods have the same priority or the default priority. In that case the change is trying to avoid starvation for regular pods when a lot of unschedulable pods are present. How does it affect the recovery of the unschedulable pods which finally become schedulable? Does this behavior change compared to without this change?

Note: Just trying to understand; the answer may be no change. It depends on how the active queue is implemented.
With this change (and somewhat similarly after #71488), a pod that is determined unschedulable goes behind other similar priority pods in the scheduling queue. Once pods become schedulable they are processed by their order in the scheduling queue. So, depending on their location in the queue, they may get scheduled before or after same priority pods.
In short, we don't expect further delays in scheduling unschedulable pods after this change.
Thanks, @everpeace! The change looks good to me. Your concern with respect to increase in the API server load is valid, but even before this change it was likely that the condition.message of the pod would change after a scheduling retry, causing a request to be sent to the API server. The message contains detailed information about how many nodes were considered and what predicates failed for nodes. So, in larger clusters with higher churn the message would change somewhat frequently, causing a new request to be sent to the API server. After this PR all the scheduling attempts would send a request to the API server which may make things worse than before, but I don't expect a major issue.
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: bsalamat, everpeace. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
@bsalamat Thanks for your review!! It's actually my first time contributing to kube-scheduler :-)
Oh, yeah, that's right 👍 Thanks, too!
We need to cherry-pick this into previous releases. I will send cherry-pick PRs.
Oh, I couldn't take care of cherry-picking to previous versions 🙇 Thank you!!
…19-upstream-release-1.12 Automated cherry-pick of #72619 into upstream release 1.12
@everpeace No problem. I have already sent those PRs.
…19-upstream-release-1.13 Automated cherry-pick of #72619 into upstream release 1.13
…19-upstream-release-1.11 Automated cherry-pick of #72619 into upstream release 1.11
@everpeace FYI, I just filed #73485 to address the concern that you have had with respect to the increase of the API server load.
Oh, OK. I got it. I noticed another engineer just started working on the issue. Thanks for the notification!
What type of PR is this?
/kind bug
What this PR does / why we need it:
This changes the sort function of the scheduler's priority queue to prevent starvation (some pods not getting a chance to be scheduled for a very long time).

#71488, which resolved #71486, alleviates starvation by taking condition.LastTransitionTime into account in the sort function of the scheduler's priority queue. However, condition.LastTransitionTime is updated only when condition.Status changes, which happens after the scheduler binds a pod to a node. This means that once a pod is marked unschedulable, which creates a PodScheduled condition with Status=False on the pod, the LastTransitionTime field is never updated until the pod is successfully scheduled. So this still blocks newer pods when a lot of pods determined as unschedulable exist in the queue.

This PR changes:
- condition.LastProbeTime is now updated every time a pod is determined unschedulable
- the sort function now uses condition.LastProbeTime to avoid the starvation described above
Special notes for your reviewer:
I'm concerned about a decrease in scheduler throughput and an increase in k8s API server load because of this change. I think a performance test would be needed to understand how much impact the PR has on them.
Does this PR introduce a user-facing change?:
/sig scheduling