
Do not snapshot scheduler cache before starting preemption #72895

Merged
2 commits merged into kubernetes:master from bsalamat:no_refresh_preemption on Jan 15, 2019

Conversation

@bsalamat
Member

commented Jan 14, 2019

What type of PR is this?
/kind bug

What this PR does / why we need it:
In clusters with many pending pods of the same priority, a pod that receives a nominated node goes back to the queue behind all of the other pending pods of that priority. This ordering is important to avoid starving the other pods, but it also means the nominated resources can be held for a long time while the scheduler works through the pods in front of it. That much is unavoidable.

However, in the existing implementation the preemption logic refreshes its scheduler cache snapshot before it starts. If a node is added to the cluster, or a pod terminates, after the scheduling cycle ends and before the preemption logic starts, preemption may find a feasible node without preempting any pods. The node is nominated for the pod even though no preemption occurred, the pod goes back to the queue behind other pods of similar priority, and none of those pods can be scheduled on the node because it is reserved for the nominated pod. With thousands of pending pods in the queue, it may take minutes before the nominated pod is retried, and the node sits idle the whole time.

To avoid this, we no longer refresh the snapshot for preemption; the preemption logic reuses the snapshot taken by its corresponding scheduling cycle.
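For illustration, here is a rough Go sketch of the behavioral difference (all names and types are invented for this example; they are not the scheduler's actual API). It models preemption's preference for the feasible node with the fewest victims, once against a refreshed snapshot that contains a freshly added empty node, and once against the snapshot the failed scheduling cycle used:

```go
package main

import "fmt"

// A toy model of the choice preemption makes; the names and fields are
// invented for this sketch and are not the scheduler's real API.
type node struct {
	name     string
	freeable int // capacity available after evicting `victims` lower-priority pods
	victims  int // number of pods that would have to be evicted
}

// pickNode returns the feasible node that needs the fewest evictions,
// loosely mirroring preemption's preference for cheaper victims.
func pickNode(nodes []node, need int) (best node, ok bool) {
	for _, n := range nodes {
		if n.freeable < need {
			continue
		}
		if !ok || n.victims < best.victims {
			best, ok = n, true
		}
	}
	return best, ok
}

func main() {
	// Snapshot the failed scheduling cycle used: node-a can host the pod only
	// after two lower-priority pods are evicted.
	cycleSnapshot := []node{{name: "node-a", freeable: 2, victims: 2}}

	// Refreshed snapshot: node-b joined the cluster after the cycle and is empty.
	refreshed := append(append([]node(nil), cycleSnapshot...),
		node{name: "node-b", freeable: 4, victims: 0})

	const need = 2

	if n, ok := pickNode(refreshed, need); ok {
		// Old behavior: node-b is nominated with zero evictions, the pod is
		// re-queued behind its peers, and node-b sits reserved but idle.
		fmt.Printf("refreshed snapshot: nominate %s, evict %d pod(s)\n", n.name, n.victims)
	}
	if n, ok := pickNode(cycleSnapshot, need); ok {
		// New behavior: preemption only sees the nodes the cycle saw, so it
		// evicts on node-a, and node-b stays free for the next pod in line.
		fmt.Printf("cycle snapshot:     nominate %s, evict %d pod(s)\n", n.name, n.victims)
	}
}
```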

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Improve efficiency of preemption logic in clusters with many pending pods.

/sig scheduling

@bsalamat bsalamat force-pushed the bsalamat:no_refresh_preemption branch from 8265859 to e3f4e1e Jan 14, 2019
Contributor

left a comment

The code change looks fine, but I don't completely understand the logic change.

When a node is nominated for a pending pod, some cluster resources are considered reserved for that pod and will prevent scheduling of other pods. This change seems geared at moving where that reservation happens, but why does changing a nomination from a new node to an existing node improve behavior?

@bsalamat


Member Author

commented Jan 14, 2019

This change seems geared at moving where those reserved resources happen, but why does changing a nomination from a new node to an existing node improve behavior?

I tried to explain that when new nodes are added to a cluster, there is a chance that one is added between the start of a scheduling cycle and the start of the following preemption cycle. In that case the pod gets a nominated node name with no preemption, is moved back to the queue, and is placed behind other pods with the same priority. When there are many pending pods, it can take minutes before the pod is retried, and in the meantime the nominated resources stay unused even though the scheduler keeps attempting other pods. This also causes delays for the cluster autoscaler. That is why we want preemption to use the same set of nodes that the scheduling cycle used.
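To make the queueing effect concrete, here is a toy sketch (not the scheduler's actual queue implementation): pods of equal priority are attempted roughly in arrival order, so a pod that is re-queued with a nominated node lands behind every other pending pod of that priority, and the nominated node stays reserved until its turn comes around again.

```go
package main

import "fmt"

// pendingPod is a stand-in for an entry in the scheduling queue; the real
// queue is a priority queue, but equal-priority pods are effectively FIFO,
// which is all this sketch needs.
type pendingPod struct {
	name          string
	nominatedNode string // non-empty once a node has been reserved for the pod
}

func main() {
	// A real cluster may have thousands of equal-priority pending pods.
	queue := []pendingPod{{name: "p1"}, {name: "p2"}, {name: "p3"}}

	// p0 fails scheduling, gets node-b as its nominated node (with no
	// preemption, in the scenario described above), and is re-queued
	// behind all of its equal-priority peers.
	queue = append(queue, pendingPod{name: "p0", nominatedNode: "node-b"})

	for i, p := range queue {
		if p.nominatedNode != "" {
			fmt.Printf("attempt %d: %s finally lands on %s\n", i+1, p.name, p.nominatedNode)
		} else {
			fmt.Printf("attempt %d: %s cannot use node-b; it is reserved for p0\n", i+1, p.name)
		}
	}
}
```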

@misterikkit


Contributor

commented Jan 14, 2019

the pod gets a nominated node name with no preemption

Why is the alternative better? The pod that triggered preemption still goes to the back of the queue, while some running pods get terminated. The resources freed up for the pending pod would still go unused while the scheduler works through the queue of equal-priority pods.

That is, unless the queue has a feature to jump a pending pod to the front of the queue when its nominated resources become free?

@bsalamat


Member Author

commented Jan 14, 2019

Why is the alternative better? The pod that triggered preemption still goes to the back of the queue, while some running pods get terminated. The resources freed up for the pending pod would still go unused while the scheduler works through the queue of equal-priority pods.

In the case that I described, no running pod is terminated. With this change the new node's resources are not marked as "nominated", so the next pods the scheduler attempts can use them and they don't stay unused for a long time.

@misterikkit


Contributor

commented Jan 14, 2019

Ah okay, that was my misunderstanding. Because the pod would otherwise not be scheduled at all, reserved-but-idle resources only arise when resources happen to become available during a scheduling cycle.

/lgtm

@ravisantoshgudimetla


Contributor

commented Jan 15, 2019

/retest

Member

left a comment

/lgtm

During one scheduling cycle (both the initial scheduling attempt and the subsequent preemption), the scheduler should work against a consistent view of the cache, which is snapshotted at the start of the first attempt:

if err := g.snapshot(); err != nil {
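A minimal sketch of that invariant, with simplified types and signatures (the real code is in the scheduler's generic_scheduler.go and differs in detail): the snapshot is taken once at the start of the scheduling attempt, and the preemption step reads the same cached map instead of calling snapshot() again.

```go
package main

import "fmt"

// nodeInfo stands in for the scheduler's per-node cached state.
type nodeInfo struct{ allocatableCPU int }

type genericScheduler struct {
	cachedNodeInfoMap map[string]nodeInfo // per-cycle snapshot
	cluster           map[string]nodeInfo // live state, read only by snapshot()
}

// snapshot copies the live cluster state into the per-cycle cache.
func (g *genericScheduler) snapshot() error {
	g.cachedNodeInfoMap = make(map[string]nodeInfo, len(g.cluster))
	for name, n := range g.cluster {
		g.cachedNodeInfoMap[name] = n
	}
	return nil
}

// Schedule takes the snapshot once at the start of the cycle.
func (g *genericScheduler) Schedule() error {
	if err := g.snapshot(); err != nil {
		return err
	}
	fmt.Println("scheduling against", len(g.cachedNodeInfoMap), "node(s)")
	return nil
}

// Preempt reuses the snapshot from Schedule; it does NOT call snapshot()
// again, so both halves of the cycle see the same set of nodes.
func (g *genericScheduler) Preempt() {
	fmt.Println("preempting against", len(g.cachedNodeInfoMap), "node(s)")
}

func main() {
	g := &genericScheduler{cluster: map[string]nodeInfo{"node-a": {allocatableCPU: 4}}}
	_ = g.Schedule()
	g.cluster["node-b"] = nodeInfo{allocatableCPU: 4} // a node joins mid-cycle
	g.Preempt()                                       // still sees only node-a
}
```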

Contributor

left a comment

/lgtm

Thanks for the PR and explanation @bsalamat

@k8s-ci-robot


Contributor

commented Jan 15, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bsalamat, Huang-Wei, ravisantoshgudimetla

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 1482483 into kubernetes:master Jan 15, 2019
18 checks passed
cla/linuxfoundation bsalamat authorized
pull-kubernetes-bazel-build Job succeeded.
pull-kubernetes-bazel-test Job succeeded.
pull-kubernetes-cross Skipped
pull-kubernetes-e2e-gce Job succeeded.
pull-kubernetes-e2e-gce-100-performance Job succeeded.
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
pull-kubernetes-e2e-kops-aws Job succeeded.
pull-kubernetes-e2e-kubeadm-gce Skipped
pull-kubernetes-godeps Skipped
pull-kubernetes-integration Job succeeded.
pull-kubernetes-kubemark-e2e-gce-big Job succeeded.
pull-kubernetes-local-e2e Skipped
pull-kubernetes-local-e2e-containerized Skipped
pull-kubernetes-node-e2e Job succeeded.
pull-kubernetes-typecheck Job succeeded.
pull-kubernetes-verify Job succeeded.
tide In merge pool.
k8s-ci-robot added a commit that referenced this pull request Jan 17, 2019
…95-upstream-release-1.11

Automated cherry pick of #72895 upstream release 1.11
k8s-ci-robot added a commit that referenced this pull request Jan 19, 2019
…95-upstream-release-1.13

Automated cherry pick of #72895 upstream release 1.13
k8s-ci-robot added a commit that referenced this pull request Jan 22, 2019
…95-upstream-release-1.12

Automated cherry pick of #72895 upstream release 1.12