ensure scheduler preemptor behaves in an efficient/correct path #70898

Merged
k8s-ci-robot merged 1 commit into kubernetes:master from preemption-issue on Nov 17, 2018


@Huang-Wei Huang-Wei commented Nov 10, 2018

What type of PR is this?

/kind bug

What this PR does / why we need it:

  • Don't update the nominatedMap cache when Pop() removes an element from activeQ.
  • Instead, delete the nominated info from the cache when the pod is bound (scheduled).

Which issue(s) this PR fixes:

Fixes #70622

Special notes for your reviewer:

The most significant change in this PR: when popping a pod, the scheduler no longer removes that pod's entry from the internal nominatedMap cache immediately. Instead, it defers invalidating the entry until the pod is bound.

Why? Because of a rare case: a high-priority pod arrives and is unschedulable (its scheduling attempt fails) (1), so it gets a chance to try "preemption" (2) and preempts low-priority pods (3) to make room.

During phase (1), the function Error() (1.1) puts the pod back into unschedulableQ, which is where the nominatedMap cache gets re-updated. The key point is that this function is asynchronous (it runs in a goroutine). In other words, after (3) finishes, a backfill pod for a preempted low-priority pod (suppose it is managed by a Deployment/ReplicaSet) can be spawned and enter a scheduling cycle before (1.1) runs. At that moment the scheduler doesn't know that a nominated pod already exists (the cache hasn't been re-updated yet), so the backfill pod is scheduled and starts running, only to be preempted again. This can repeat several times. It is not endless, but it wastes resources on unnecessary scheduling and preemption.
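
To make the interleaving above concrete, here is a minimal toy model in Go. It is only a sketch and not the scheduler's code: `nominatedIndex`, `backfillAdmitted`, and `simulate` are hypothetical names, the "fits only if no pod is nominated to the node" rule is a simplification, and the sketch assumes the high-priority pod already has a nomination recorded when it is popped.

```go
package main

import "fmt"

// nominatedIndex is a toy stand-in for the scheduler's nominated-pod cache:
// node name -> names of pods nominated to run on that node.
type nominatedIndex map[string][]string

func (n nominatedIndex) add(node, pod string) {
	n[node] = append(n[node], pod)
}

func (n nominatedIndex) remove(node, pod string) {
	pods := n[node]
	for i, p := range pods {
		if p == pod {
			n[node] = append(pods[:i], pods[i+1:]...)
			break
		}
	}
	if len(n[node]) == 0 {
		delete(n, node)
	}
}

// backfillAdmitted reports whether a low-priority replacement pod would be
// placed on the node. In this toy model it fits only when no nominated pod
// holds a claim on the node.
func backfillAdmitted(n nominatedIndex, node string) bool {
	return len(n[node]) == 0
}

// simulate replays one interleaving. clearOnPop=true models the old behavior
// (entry dropped at Pop, restored later by the asynchronous requeue);
// clearOnPop=false models the fix (entry kept until the pod is assumed).
func simulate(clearOnPop bool) bool {
	idx := nominatedIndex{}
	idx.add("node-1", "high-prio") // assumption: nomination already recorded

	// Pop(): the high-priority pod is taken for a scheduling attempt.
	if clearOnPop {
		idx.remove("node-1", "high-prio")
	}

	// The attempt fails, victims are preempted, and the asynchronous requeue
	// has not run yet. The victims' controller creates a backfill pod,
	// which is scheduled at this point.
	admitted := backfillAdmitted(idx, "node-1")

	// Only now does the asynchronous requeue restore the entry (old behavior).
	if clearOnPop {
		idx.add("node-1", "high-prio")
	}
	return admitted
}

func main() {
	fmt.Println("clear on Pop, backfill admitted:", simulate(true))
	fmt.Println("clear on assume, backfill admitted:", simulate(false))
}
```

With the entry cleared at Pop, the backfill pod slips in during the window before the requeue; with the entry kept, the backfill pod sees the nomination and is not placed on that node.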

Along with this PR, I wrote an e2e test to simulate the issue.

Does this PR introduce a user-facing change?:

Fix a potential bug where the scheduler preempts pods unnecessarily.

/sig scheduling

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress, release-note, kind/bug, sig/scheduling, size/S, and cncf-cla: yes labels on Nov 10, 2018
@k8s-ci-robot k8s-ci-robot added the needs-priority and needs-rebase labels on Nov 10, 2018
@k8s-ci-robot k8s-ci-robot added the size/M label and removed the size/S and needs-rebase labels on Nov 12, 2018
@k8s-ci-robot k8s-ci-robot added the size/L and sig/testing labels and removed the size/M label on Nov 12, 2018
@Huang-Wei Huang-Wei force-pushed the preemption-issue branch 4 times, most recently from 9254427 to 024d6b5, on November 12, 2018 21:36
@Huang-Wei Huang-Wei changed the title from "[WIP] ensure scheduler preemptor behaves in an efficient/correct path" to "ensure scheduler preemptor behaves in an efficient/correct path" on Nov 12, 2018
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress label on Nov 12, 2018
@Huang-Wei
Member Author

CI is green now (including the e2e test in this PR).

/cc @bsalamat @resouer @k82cn @ravisantoshgudimetla

@Huang-Wei
Member Author

/priority important-soon

@k8s-ci-robot k8s-ci-robot added the priority/important-soon label on Nov 12, 2018
@Huang-Wei Huang-Wei force-pushed the preemption-issue branch 2 times, most recently from 5f31e41 to 4cc7c4b, on November 15, 2018 19:43
@Huang-Wei
Member Author

@bsalamat I've moved the logic that deletes the nominated pod from the cache into assume(). I didn't expose the whole scheduling queue in factory.Config; instead I exposed a single function. PTAL (latest commit).

Regarding the test, I'm still trying to build an integration test, but I've had no luck reproducing the issue so far.

framework.Logf("length of pods created so far: %v", len(podEvents))
// 13 = 5+4+4, which implies ReplicaSet{1,2,3} should create and only create
// exactly the "replicas" of pods
if podEventsNum := len(podEvents); podEventsNum != 13 {

You are right. That integration test actually exists. Have you tried running it with "stress" to see if it fails at all?


// DeleteNominatedPodIfExists is called when a pod is assumed
// It will delete the pod from internal cache if it's a nominated pod
DeleteNominatedPodIfExists func(pod *v1.Pod)

I would still prefer exposing the scheduling queue here. This function looks too low-level to me to be exposed in the scheduler config. We have the scheduler cache, event recorder, volume binder, etc. here. The scheduling queue is more in keeping with the existing members of this struct.
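
As a rough illustration of the design preferred in this comment, the sketch below exposes the queue itself on the config struct and lets the assume step call the queue's method. This is a simplified model under stated assumptions, not the actual Kubernetes types: `Pod` is a hypothetical stand-in for `v1.Pod`, `fakeQueue` and `assume` are hypothetical, and the `SchedulingQueue` interface here lists only the one method the sketch needs.

```go
package main

import "fmt"

// Pod is a hypothetical stand-in for v1.Pod.
type Pod struct {
	Name string
}

// SchedulingQueue lists only the method this sketch needs.
type SchedulingQueue interface {
	DeleteNominatedPodIfExists(pod *Pod)
}

// Config exposes the queue as a member, alongside what would be the cache,
// event recorder, and so on, instead of a single function field.
type Config struct {
	SchedulingQueue SchedulingQueue
}

// fakeQueue is a hypothetical in-memory queue: pod name -> nominated node.
type fakeQueue struct {
	nominated map[string]string
}

// DeleteNominatedPodIfExists drops the pod's nomination; it is a no-op when
// the pod was never nominated.
func (q *fakeQueue) DeleteNominatedPodIfExists(pod *Pod) {
	delete(q.nominated, pod.Name)
}

// assume is a simplified stand-in for the scheduler's assume step. A real
// implementation would also record the pod in the scheduler cache.
func assume(cfg *Config, pod *Pod) {
	cfg.SchedulingQueue.DeleteNominatedPodIfExists(pod)
}

func main() {
	q := &fakeQueue{nominated: map[string]string{"high-prio": "node-1"}}
	cfg := &Config{SchedulingQueue: q}
	assume(cfg, &Pod{Name: "high-prio"})
	fmt.Println("nominated entries left:", len(q.nominated))
}
```

Holding an interface rather than a function field keeps the config struct at the same level of abstraction as its other members, and tests can substitute a fake queue as done here.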

Member

@bsalamat bsalamat left a comment

@Huang-Wei if you can verify in your real clusters that the fix actually works, I am fine with having the fix in 1.13 and adding the test in the next few days.

@Huang-Wei
Member Author

Huang-Wei commented Nov 16, 2018

@bsalamat yes, I can reproduce both manually and using the e2e test I wrote.

> I am fine with having the fix in 1.13 and adding the test in the next few days.

This SGTM. I will update this PR to remove the e2e test and address the comment to "expose scheduling queue in factory.Config, instead of a function".

@Huang-Wei
Member Author

Huang-Wei commented Nov 16, 2018

@bsalamat @ravisantoshgudimetla PTAL. (Final squash and removal of the e2e test commit: done.)

- don't update nominatedMap cache when Pop() an element from activeQ
- instead, delete the nominated info from cache when it's "assumed"
- unit test behavior adjusted
- expose SchedulingQueue in factory.Config
@k8s-ci-robot k8s-ci-robot added the size/M label and removed the size/L label on Nov 16, 2018
@Huang-Wei
Member Author

/priority critical-urgent

@k8s-ci-robot k8s-ci-robot added the priority/critical-urgent label on Nov 16, 2018
@Huang-Wei
Member Author

/retest

@Huang-Wei
Member Author

/retest

@Huang-Wei
Member Author

/retest

@Huang-Wei
Member Author

Integration test keeps failing due to different flakes.

@Huang-Wei
Member Author

/retest

Contributor

@ravisantoshgudimetla ravisantoshgudimetla left a comment

Approving the PR based on #70898 (review)

@Huang-Wei Thanks for identifying the issue and fixing it. Please prepare a PR for a test case as follow-up.

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm label on Nov 17, 2018
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Huang-Wei, ravisantoshgudimetla

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved label on Nov 17, 2018
@k8s-ci-robot k8s-ci-robot merged commit 1f3057b into kubernetes:master Nov 17, 2018
@Huang-Wei Huang-Wei deleted the preemption-issue branch November 20, 2018 04:11
@Huang-Wei Huang-Wei mentioned this pull request Nov 20, 2018
k8s-ci-robot added a commit that referenced this pull request Dec 5, 2018
…898-#71281-upstream-release-1.12

Automated cherry pick of #70898: ensure scheduler preemptor behaves in an efficient/correct #71281: add an e2e test to verify preemption running path
k8s-ci-robot added a commit that referenced this pull request Dec 10, 2018
…898-#71281-upstream-release-1.11

Automated cherry pick of #70898: ensure scheduler preemptor behaves in an efficient/correct #71281: add an e2e test to verify preemption running path
Labels
  • approved - Indicates a PR has been approved by an approver from all required OWNERS files.
  • cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
  • kind/bug - Categorizes issue or PR as related to a bug.
  • lgtm - "Looks good to me", indicates that a PR is ready to be merged.
  • priority/critical-urgent - Highest priority. Must be actively worked on as someone's top priority right now.
  • priority/important-soon - Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
  • release-note - Denotes a PR that will be considered when it comes time to generate release notes.
  • sig/scheduling - Categorizes an issue or PR as relevant to SIG Scheduling.
  • sig/testing - Categorizes an issue or PR as relevant to SIG Testing.
  • size/M - Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Scheduler sometimes preempts unnecessary pods
5 participants