Change sort function of the scheduling queue to avoid starvation #71488

Merged
merged 2 commits into kubernetes:master from bsalamat:queue-sort on Dec 1, 2018

Conversation

@bsalamat
Contributor

bsalamat commented Nov 28, 2018

What type of PR is this?
/kind bug

What this PR does / why we need it:
Addresses scenario #2 of #71486

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes part of #71486

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Fix scheduling starvation of pods in cluster with large number of unschedulable pods.
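
For illustration, a minimal Go sketch of the ordering idea this PR's title describes: compare pods by priority first and, for pods of equal priority, by their queue timestamp, so recently retried unschedulable pods go behind pods that are still waiting for their first attempt. The type and function names here (queuedPod, lessFunc) are hypothetical and not the scheduler's actual identifiers.

```go
// Sketch only: illustrates the ordering idea, not the exact function merged here.
package main

import (
	"fmt"
	"sort"
	"time"
)

// queuedPod is a hypothetical stand-in for the scheduler's internal pod wrapper.
type queuedPod struct {
	name      string
	priority  int32
	timestamp time.Time // when the pod was added to the queue or last attempted
}

// lessFunc orders the active queue: higher priority first; for equal priority,
// the pod with the older timestamp comes first, so a pod that was just retried
// does not jump ahead of pods awaiting their first scheduling attempt.
func lessFunc(p1, p2 *queuedPod) bool {
	if p1.priority != p2.priority {
		return p1.priority > p2.priority
	}
	return p1.timestamp.Before(p2.timestamp)
}

func main() {
	now := time.Now()
	pods := []*queuedPod{
		{"retried", 10, now},                   // just failed scheduling, re-added
		{"waiting", 10, now.Add(-time.Minute)}, // queued earlier, never tried
		{"critical", 100, now.Add(-time.Second)},
	}
	sort.Slice(pods, func(i, j int) bool { return lessFunc(pods[i], pods[j]) })
	for _, p := range pods {
		fmt.Println(p.name) // prints: critical, waiting, retried
	}
}
```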

@bsalamat bsalamat force-pushed the bsalamat:queue-sort branch from efe214a to 36f8859 Nov 28, 2018

@bsalamat bsalamat added this to the v1.13 milestone Nov 28, 2018

@k8s-ci-robot k8s-ci-robot requested review from Huang-Wei and resouer Nov 28, 2018

@k8s-ci-robot

Contributor

k8s-ci-robot commented Nov 28, 2018

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bsalamat

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@AishSundar

Contributor

AishSundar commented Nov 28, 2018

@bsalamat is this a critical PR for 1.13? Is there a CI signal for this fix? We are really close to exiting code freeze for 1.13, and I would like this to wait if it's not super critical for 1.13.

/hold

@hex108

Member

hex108 commented Nov 28, 2018

Thanks! We also observed this problem in our cluster. The patch lgtm.

Maybe the performance issue is caused by adding another compare (pod time compare)? How about solving the problem by using map[priority]PodArray instead of a heap? In the PodArray, pods are appended naturally by createTime or LastTransitionTime (if the scheduler ever tried to schedule it).
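
A rough sketch of the alternative described above, under the assumption of one append-only slice per priority; all names are hypothetical. The scan over map keys in pop() also shows the traversal cost that bsalamat points out below when the highest-priority bucket runs empty.

```go
// Sketch of the map[priority]PodArray idea, not part of this PR.
package main

import "fmt"

type pod struct {
	name     string
	priority int32
}

type priorityBuckets map[int32][]*pod

// push appends to the bucket for the pod's priority; no per-insert compare,
// so pods within a priority stay in arrival/retry order naturally.
func (b priorityBuckets) push(p *pod) {
	b[p.priority] = append(b[p.priority], p)
}

// pop removes the oldest pod from the highest non-empty priority bucket.
// Note the scan over keys: finding the next highest priority requires
// traversing the map whenever the current highest bucket is empty.
func (b priorityBuckets) pop() *pod {
	var best int32
	found := false
	for prio, pods := range b {
		if len(pods) > 0 && (!found || prio > best) {
			best, found = prio, true
		}
	}
	if !found {
		return nil
	}
	p := b[best][0]
	b[best] = b[best][1:]
	return p
}

func main() {
	q := priorityBuckets{}
	q.push(&pod{"low-1", 1})
	q.push(&pod{"high-1", 10})
	q.push(&pod{"high-2", 10})
	fmt.Println(q.pop().name, q.pop().name, q.pop().name) // high-1 high-2 low-1
}
```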

@Huang-Wei

Member

Huang-Wei commented Nov 28, 2018

/retest

@fejta-bot


fejta-bot commented Nov 28, 2018

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

@AishSundar

Contributor

AishSundar commented Nov 28, 2018

@bsalamat I understand the criticality of this fix, but it looks like this has been an issue since 1.11 (since we are backporting the fix there), so can this wait until 1.13.1? I am nervous that it might risk CI stability and delay the release this late into it, especially around performance and scalability. Can you please give a risk assessment of this fix? Thanks.

@AishSundar

Contributor

AishSundar commented Nov 28, 2018

/hold

Adding the hold label back until we can discuss this a bit further. Thanks.

@AishSundar

Contributor

AishSundar commented Nov 28, 2018

@bsalamat would it be possible for you to attend the 10am release burndown today, where we can discuss this a bit more before making a Go/No-Go call for 1.13? Thanks.

@bsalamat

Contributor Author

bsalamat commented Nov 28, 2018

In today's release burndown meeting, we decided to hold this for the next 1.13 patch release (1.13.1).

@bsalamat

Contributor Author

bsalamat commented Nov 28, 2018

@hex108 I now understand your solution better. It may work, but please keep in mind that when the highest priority array becomes empty, you will need to traverse the map to find the next highest priority queue. This may happen frequently.
I think we should measure the performance impact of your solution. If I recall correctly, we didn't see any meaningful performance change after switching from a FIFO to the current scheduling queue, which is a lot more sophisticated. I believe there are two reasons: 1. placing pods in the active queue (heap) happens in event handlers, which are executed in separate threads in parallel to the main scheduling thread; 2. the main scheduling thread mostly does Pop(), and popping the queue is such a small portion of the whole scheduling cycle that it doesn't impact the scheduler's performance.
All that being said, please feel free to try your idea. If you can show a performance improvement, we will be happy to replace the current mechanism.
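
As a rough illustration of point 1 above (not the scheduler's actual code; all names are hypothetical): event-handler goroutines add to a shared queue in parallel while the scheduling loop only pops. In the real queue the ordering work happens on add, inside the heap, so it is paid mostly off the scheduling thread.

```go
// Minimal sketch of the threading argument; a real queue would keep heap
// order on add instead of a plain slice.
package main

import (
	"fmt"
	"sync"
)

type queue struct {
	mu   sync.Mutex
	cond *sync.Cond
	pods []string
}

func newQueue() *queue {
	q := &queue{}
	q.cond = sync.NewCond(&q.mu)
	return q
}

// add runs in event-handler goroutines, in parallel with the scheduling loop.
func (q *queue) add(pod string) {
	q.mu.Lock()
	q.pods = append(q.pods, pod) // ordering/compare cost would be paid here
	q.mu.Unlock()
	q.cond.Signal()
}

// pop blocks until a pod is available; this is the only queue work the
// scheduling loop does per cycle, and it is small relative to a full
// scheduling cycle.
func (q *queue) pop() string {
	q.mu.Lock()
	defer q.mu.Unlock()
	for len(q.pods) == 0 {
		q.cond.Wait()
	}
	p := q.pods[0]
	q.pods = q.pods[1:]
	return p
}

func main() {
	q := newQueue()
	go q.add("pod-a") // event handler goroutine
	go q.add("pod-b") // another event handler goroutine
	fmt.Println(q.pop(), q.pop())
}
```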

@AishSundar

Contributor

AishSundar commented Nov 28, 2018

/remove milestone

Removing milestone for now. @aleksandra-malinowska @tpepper this is a critical candidate for 1.13.1. Is there a timeline for it yet?

@AishSundar

Contributor

AishSundar commented Nov 28, 2018

/milestone none

@k8s-ci-robot

Contributor

k8s-ci-robot commented Nov 28, 2018

@AishSundar: The provided milestone is not valid for this repository. Milestones in this repository: [next-candidate, v1.10, v1.11, v1.12, v1.13, v1.14, v1.4, v1.5, v1.6, v1.7, v1.8, v1.9]

Use /milestone clear to clear the milestone.

In response to this:

/milestone none

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@AishSundar

Contributor

AishSundar commented Nov 28, 2018

/milestone clear

@k8s-ci-robot k8s-ci-robot removed this from the v1.13 milestone Nov 28, 2018

@tpepper

Contributor

tpepper commented Nov 28, 2018

From my perspective I want a few days yet to see how much is deferred to 1.13.1 and how post-code-freeze merges progress during the thaw, to observe 1.13 branch and master CI status, and to think about risks. We'll need some time to cherry-pick the selected commits for 1.13.1 and see CI status there.

@aleksandra-malinowska

Contributor

aleksandra-malinowska commented Nov 29, 2018

Really looking forward to this fix, as we tend to hit the edge cases in scheduling much more frequently with autoscaling. Are there any plans to also handle the case where a higher priority unschedulable pod blocks lower priority pods, or to prioritize pods with a nominated node name?

@tpepper I'm in favor of releasing earlier with fewer cherry-picks, assuming CI signal is good. Let's discuss it later today if possible

@bsalamat

Contributor Author

bsalamat commented Nov 30, 2018

@aleksandra-malinowska Yes, there are two other PRs, #57057 and #71551, to address a higher priority pod blocking the head of the queue.

@fejta-bot


fejta-bot commented Dec 1, 2018

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

2 similar comments

@k8s-ci-robot k8s-ci-robot merged commit 82abbdc into kubernetes:master Dec 1, 2018

18 checks passed

cla/linuxfoundation: bsalamat authorized
pull-kubernetes-bazel-build: Job succeeded.
pull-kubernetes-bazel-test: Job succeeded.
pull-kubernetes-cross: Skipped
pull-kubernetes-e2e-gce: Job succeeded.
pull-kubernetes-e2e-gce-100-performance: Job succeeded.
pull-kubernetes-e2e-gce-device-plugin-gpu: Job succeeded.
pull-kubernetes-e2e-gke: Skipped
pull-kubernetes-e2e-kops-aws: Job succeeded.
pull-kubernetes-e2e-kubeadm-gce: Skipped
pull-kubernetes-integration: Job succeeded.
pull-kubernetes-kubemark-e2e-gce-big: Job succeeded.
pull-kubernetes-local-e2e: Skipped
pull-kubernetes-local-e2e-containerized: Skipped
pull-kubernetes-node-e2e: Job succeeded.
pull-kubernetes-typecheck: Job succeeded.
pull-kubernetes-verify: Job succeeded.
tide: In merge pool.

@bsalamat bsalamat deleted the bsalamat:queue-sort branch Dec 3, 2018

k8s-ci-robot added a commit that referenced this pull request Dec 3, 2018

Merge pull request #71503 from bsalamat/automated-cherry-pick-of-#71488-upstream-release-1.11

Cherry pick of #71488 upstream release 1.11

k8s-ci-robot added a commit that referenced this pull request Dec 4, 2018

Merge pull request #71499 from bsalamat/automated-cherry-pick-of-#71488-upstream-release-1.12

Automated cherry pick of #71488 upstream release 1.12

k8s-ci-robot added a commit that referenced this pull request Dec 6, 2018

Merge pull request #71670 from bsalamat/automated-cherry-pick-of-#71488-upstream-release-1.13

Automated cherry pick of #71488 upstream release 1.13