Retry scheduling pods after errors more consistently in scheduler #50106
k8s-ci-robot added the cncf-cla: yes label Aug 3, 2017
Hi @julia-stripe. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
k8s-ci-robot added the needs-ok-to-test label Aug 3, 2017
k8s-merge-robot assigned lavalamp and davidopp Aug 3, 2017
k8s-merge-robot added size/S release-note-label-needed labels Aug 3, 2017
julia-stripe referenced this pull request Aug 3, 2017
Fix pods stuck in Pending state forever #49661 (Closed)
wojtek-t assigned wojtek-t and unassigned lavalamp Aug 3, 2017
@@ -222,7 +227,7 @@ func (sched *Scheduler) bind(assumed *v1.Pod, b *v1.Binding) error {
 	if err != nil {
 		glog.V(1).Infof("Failed to bind pod: %v/%v", assumed.Namespace, assumed.Name)
 		if err := sched.config.SchedulerCache.ForgetPod(assumed); err != nil {
-			return fmt.Errorf("scheduler cache ForgetPod failed: %v", err)
+			glog.Errorf("scheduler cache ForgetPod failed: %v", err)
k82cn
Aug 3, 2017
Member
I think we need to call FinishBinding before ForgetPod; if ForgetPod fails, the assumed pod may still be in the cache.
wojtek-t
Aug 4, 2017
Member
Agreed. There were two problems here: the first is that we were returning early here, and the second is what @k82cn pointed out.
-	// This should be fixed properly though.
+
+	// This is most probably result of a BUG in retrying logic.
+	// We report an error here so that pod scheduling can be retried.
wojtek-t
Aug 4, 2017
Member
This relies on the fact that Error checks whether the pod was bound in the meantime and, if so, does not add it back to the set of unscheduled pods (otherwise it would lead to an infinite loop).
This is true now, but can you please add a comment about it?
k8s-merge-robot added size/M and removed size/S labels Aug 4, 2017
Thanks for the review! Added a comment & moved FinishBinding back up after PTAL @wojtek-t
/ok-to-test
k8s-ci-robot removed the needs-ok-to-test label Aug 4, 2017
wojtek-t added release-note and removed release-note-label-needed labels Aug 7, 2017
/lgtm
k8s-ci-robot added the lgtm label Aug 7, 2017
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: julia-stripe, wojtek-t
Associated issue: 49314
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these OWNERS Files:
You can indicate your approval by writing
k8s-merge-robot added the approved label Aug 7, 2017
wojtek-t added this to the v1.7 milestone Aug 7, 2017
wojtek-t added the cherrypick-candidate label Aug 7, 2017
/test all [submit-queue is verifying that this PR is safe to merge]
Automatic merge from submit-queue
k8s-merge-robot merged commit bc7ccfe into kubernetes:master Aug 7, 2017
9 of 10 checks passed
wojtek-t referenced this pull request Aug 7, 2017
Automated cherry pick of #50028 #50106 upstream release 1.7 #50240 (Merged)
Cherrypick in #50240
wojtek-t added the cherrypick-approved label Aug 7, 2017
added a commit that referenced this pull request Aug 7, 2017
k8s-cherrypick-bot commented Aug 7, 2017
Commit found in the "release-1.7" branch appears to be this PR. Removing the "cherrypick-candidate" label. If this is an error find help to get your PR picked.
julia-stripe commented Aug 3, 2017
What this PR does / why we need it:
This fixes 2 places in the scheduler where pods can get stuck in Pending forever. In both these places, errors happen and sched.config.Error is not called afterwards. This is a problem because sched.config.Error is responsible for requeuing pods to retry scheduling when there are issues (see here), so if we don't call sched.config.Error then the pod will never get scheduled (unless the scheduler is restarted).
One of these (where it returns when ForgetPod fails instead of continuing and reporting an error) is a regression from this refactor, and with the old behavior the error was reported correctly. As far as I can tell changing the error handling in that refactor wasn't intentional.
When AssumePod fails there's never been an error reported but I think adding this will help the scheduler recover when something goes wrong instead of letting pods possibly never get scheduled.
This will help prevent issues like #49314 in the future.
Release note: