New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retry scheduling pods after errors more consistently in scheduler #50106

Merged
merged 1 commit into from Aug 7, 2017

Conversation

@julia-stripe
Contributor

julia-stripe commented Aug 3, 2017

What this PR does / why we need it:

This fixes 2 places in the scheduler where pods can get stuck in Pending forever. In both these places, errors happen and sched.config.Error is not called afterwards. This is a problem because sched.config.Error is responsible for requeuing pods to retry scheduling when there are issues (see here), so if we don't call sched.config.Error then the pod will never get scheduled (unless the scheduler is restarted).

One of these (where it returns when ForgetPod fails instead of continuing and reporting an error) is a regression from this refactor, and with the old behavior the error was reported correctly. As far as I can tell changing the error handling in that refactor wasn't intentional.

When AssumePod fails there's never been an error reported but I think adding this will help the scheduler recover when something goes wrong instead of letting pods possibly never get scheduled.

This will help prevent issues like #49314 in the future.

Release note:

Fix incorrect retry logic in scheduler
@k8s-ci-robot

This comment has been minimized.

Show comment
Hide comment
@k8s-ci-robot

k8s-ci-robot Aug 3, 2017

Contributor

Hi @julia-stripe. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Contributor

k8s-ci-robot commented Aug 3, 2017

Hi @julia-stripe. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@@ -222,7 +227,7 @@ func (sched *Scheduler) bind(assumed *v1.Pod, b *v1.Binding) error {
if err != nil {
glog.V(1).Infof("Failed to bind pod: %v/%v", assumed.Namespace, assumed.Name)
if err := sched.config.SchedulerCache.ForgetPod(assumed); err != nil {
return fmt.Errorf("scheduler cache ForgetPod failed: %v", err)
glog.Errorf("scheduler cache ForgetPod failed: %v", err)

This comment has been minimized.

@k82cn

k82cn Aug 3, 2017

Member

I think we need to FinisBinding before ForgetPod; if ForgetPod failed, the assumed pod maybe still in cache.

@k82cn

k82cn Aug 3, 2017

Member

I think we need to FinisBinding before ForgetPod; if ForgetPod failed, the assumed pod maybe still in cache.

This comment has been minimized.

@wojtek-t

wojtek-t Aug 4, 2017

Member

Agree. There were two problems here. One that we were returning here, the second is what @k82cn pointed at.

@wojtek-t

wojtek-t Aug 4, 2017

Member

Agree. There were two problems here. One that we were returning here, the second is what @k82cn pointed at.

@k82cn k82cn referenced this pull request Aug 3, 2017

Closed

Can't delete pod #50077

// This should be fixed properly though.
// This is most probably result of a BUG in retrying logic.
// We report an error here so that pod scheduling can be retried.

This comment has been minimized.

@wojtek-t

wojtek-t Aug 4, 2017

Member

This relies on the fact that Error will check if pod was bounded in the meantime and if so will not add it back to the unscheduled pods set (otherwise it would lead to an infinite loop).

This is true now, but can you please add a comment about it.

@wojtek-t

wojtek-t Aug 4, 2017

Member

This relies on the fact that Error will check if pod was bounded in the meantime and if so will not add it back to the unscheduled pods set (otherwise it would lead to an infinite loop).

This is true now, but can you please add a comment about it.

@@ -222,7 +227,7 @@ func (sched *Scheduler) bind(assumed *v1.Pod, b *v1.Binding) error {
if err != nil {
glog.V(1).Infof("Failed to bind pod: %v/%v", assumed.Namespace, assumed.Name)
if err := sched.config.SchedulerCache.ForgetPod(assumed); err != nil {
return fmt.Errorf("scheduler cache ForgetPod failed: %v", err)
glog.Errorf("scheduler cache ForgetPod failed: %v", err)

This comment has been minimized.

@wojtek-t

wojtek-t Aug 4, 2017

Member

Agree. There were two problems here. One that we were returning here, the second is what @k82cn pointed at.

@wojtek-t

wojtek-t Aug 4, 2017

Member

Agree. There were two problems here. One that we were returning here, the second is what @k82cn pointed at.

@k8s-merge-robot k8s-merge-robot added size/M and removed size/S labels Aug 4, 2017

@julia-stripe

This comment has been minimized.

Show comment
Hide comment
@julia-stripe

julia-stripe Aug 4, 2017

Contributor

Thanks for the review! Added a comment & moved FinishBinding back up after bind. (as a side note, these are my first contributions to Kubernetes and I'm extremely impressed by how quickly and carefully people review code)

PTAL @wojtek-t

Contributor

julia-stripe commented Aug 4, 2017

Thanks for the review! Added a comment & moved FinishBinding back up after bind. (as a side note, these are my first contributions to Kubernetes and I'm extremely impressed by how quickly and carefully people review code)

PTAL @wojtek-t

@cblecker

This comment has been minimized.

Show comment
Hide comment
@cblecker

cblecker Aug 4, 2017

Member

/ok-to-test

Member

cblecker commented Aug 4, 2017

/ok-to-test

@wojtek-t

This comment has been minimized.

Show comment
Hide comment
@wojtek-t

wojtek-t Aug 7, 2017

Member

/lgtm
/approve no-issue

Member

wojtek-t commented Aug 7, 2017

/lgtm
/approve no-issue

@k8s-ci-robot k8s-ci-robot added the lgtm label Aug 7, 2017

@k8s-merge-robot

This comment has been minimized.

Show comment
Hide comment
@k8s-merge-robot

k8s-merge-robot Aug 7, 2017

Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: julia-stripe, wojtek-t

Associated issue: 49314

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

Contributor

k8s-merge-robot commented Aug 7, 2017

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: julia-stripe, wojtek-t

Associated issue: 49314

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@wojtek-t wojtek-t added this to the v1.7 milestone Aug 7, 2017

@k8s-merge-robot

This comment has been minimized.

Show comment
Hide comment
@k8s-merge-robot

k8s-merge-robot Aug 7, 2017

Contributor

/test all [submit-queue is verifying that this PR is safe to merge]

Contributor

k8s-merge-robot commented Aug 7, 2017

/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-merge-robot

This comment has been minimized.

Show comment
Hide comment
@k8s-merge-robot

k8s-merge-robot Aug 7, 2017

Contributor

Automatic merge from submit-queue

Contributor

k8s-merge-robot commented Aug 7, 2017

Automatic merge from submit-queue

@k8s-merge-robot k8s-merge-robot merged commit bc7ccfe into kubernetes:master Aug 7, 2017

9 of 10 checks passed

Submit Queue Required Github CI test is not green: pull-kubernetes-verify
Details
cla/linuxfoundation julia-stripe authorized
Details
pull-kubernetes-bazel Job succeeded.
Details
pull-kubernetes-e2e-gce-etcd3 Jenkins job succeeded.
Details
pull-kubernetes-e2e-kops-aws Jenkins job succeeded.
Details
pull-kubernetes-federation-e2e-gce Jenkins job succeeded.
Details
pull-kubernetes-kubemark-e2e-gce Jenkins job succeeded.
Details
pull-kubernetes-node-e2e Jenkins job succeeded.
Details
pull-kubernetes-unit Jenkins job succeeded.
Details
pull-kubernetes-verify Jenkins job succeeded.
Details
@wojtek-t

This comment has been minimized.

Show comment
Hide comment
@wojtek-t

wojtek-t Aug 7, 2017

Member

Cherrypick in #50240

Member

wojtek-t commented Aug 7, 2017

Cherrypick in #50240

k8s-merge-robot added a commit that referenced this pull request Aug 7, 2017

Merge pull request #50240 from wojtek-t/automated-cherry-pick-of-#50028-
#50106-upstream-release-1.7

Automatic merge from submit-queue

Automated cherry pick of #50028 #50106 upstream release 1.7

Cherry pick of #50028 and #50106 on release-1.7.

#50028: Fix incorrect call to 'bind' in scheduler
#50106: Retry scheduling pods after errors more consistently in scheduler
@k8s-cherrypick-bot

This comment has been minimized.

Show comment
Hide comment
@k8s-cherrypick-bot

k8s-cherrypick-bot Aug 7, 2017

Commit found in the "release-1.7" branch appears to be this PR. Removing the "cherrypick-candidate" label. If this is an error find help to get your PR picked.

k8s-cherrypick-bot commented Aug 7, 2017

Commit found in the "release-1.7" branch appears to be this PR. Removing the "cherrypick-candidate" label. If this is an error find help to get your PR picked.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment