ensure scheduler preemptor behaves in an efficient/correct path #70898
Conversation
Force-pushed from 0b580bb to 07ea951
Force-pushed from 9254427 to 024d6b5
CI is green now (including the e2e test in this PR).
/priority important-soon
Force-pushed from 5f31e41 to 4cc7c4b
@bsalamat I've updated the logic so that the nominatedPod is deleted from the cache when the pod is assumed. Regarding the test, I'm still trying to build an integration test, but no luck reproducing the issue so far.
test/e2e/scheduling/preemption.go
Outdated
framework.Logf("length of pods created so far: %v", len(podEvents))
// 13 = 5+4+4, which implies ReplicaSet{1,2,3} should create and only create
// exactly the "replicas" of pods
if podEventsNum := len(podEvents); podEventsNum != 13 {
You are right. That integration test actually exists. Have you tried running it with "stress" to see if it fails at all?
pkg/scheduler/factory/factory.go
Outdated
// DeleteNominatedPodIfExists is called when a pod is assumed
// It will delete the pod from internal cache if it's a nominated pod
DeleteNominatedPodIfExists func(pod *v1.Pod)
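For context, here is a minimal sketch of where a callback like this would fire. The `Config`, `Scheduler`, and `assume` names below are hypothetical stand-ins for illustration, not the code under review:

```go
package scheduler

import (
	v1 "k8s.io/api/core/v1"
)

// Config is a drastically reduced, hypothetical stand-in for factory.Config;
// only the callback field from the diff above is shown.
type Config struct {
	DeleteNominatedPodIfExists func(pod *v1.Pod)
}

// Scheduler is likewise a placeholder for illustration.
type Scheduler struct {
	config Config
}

// assume sketches where the callback would fire: once the pod is assumed onto
// a node, its nominated-pod reservation is no longer needed, so it is dropped
// from the queue's bookkeeping via the callback.
func (s *Scheduler) assume(pod *v1.Pod, host string) {
	pod.Spec.NodeName = host
	// ... in the real scheduler, the cache's AssumePod step would run here ...
	if s.config.DeleteNominatedPodIfExists != nil {
		s.config.DeleteNominatedPodIfExists(pod)
	}
}
```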
I would still prefer exposing the scheduling queue here. This function looks too low level to me to get exposed in the scheduler config. We have the scheduler cache, event recorder, volume binder, etc. here. The scheduling queue looks more of the same nature of the existing members of this struct.
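A rough sketch of the alternative being suggested, with hypothetical interface and field names standing in for the real ones: the queue itself becomes a member of the factory config, and the scheduler calls its nominated-pod cleanup directly.

```go
package scheduler

import v1 "k8s.io/api/core/v1"

// SchedulingQueue is a reduced, hypothetical view of the queue interface;
// only the method relevant to this discussion is listed.
type SchedulingQueue interface {
	DeleteNominatedPodIfExists(pod *v1.Pod)
}

// Config sketches the shape being asked for: the queue sits next to the cache,
// event recorder, volume binder, and so on, and the scheduler calls queue
// methods directly instead of going through a bare function field.
type Config struct {
	SchedulingQueue SchedulingQueue
	// ... SchedulerCache, Recorder, VolumeBinder, etc. would sit here too ...
}
```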
@Huang-Wei if you can verify in your real clusters that the fix actually works, I am fine with having the fix in 1.13 and adding the test in the next few days.
@bsalamat yes, I can reproduce both manually and using the e2e test I wrote.
This SGTM. I will update this PR to remove the e2e test and address the comment to "expose scheduling queue in factory.Config, instead of a function".
@bsalamat @ravisantoshgudimetla PTAL.
- don't update nominatedMap cache when Pop() an element from activeQ
- instead, delete the nominated info from cache when it's "assumed"
- unit test behavior adjusted
- expose SchedulingQueue in factory.Config
Force-pushed from 8de0583 to b4fd115
/priority critical-urgent
/retest
/retest
/retest
Integration test keeps failing due to different flakes.
/retest
Approving the PR based on #70898 (review)
@Huang-Wei Thanks for identifying the issue and fixing it. Please prepare a PR for a test case as follow-up.
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Huang-Wei, ravisantoshgudimetla. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
What type of PR is this?
/kind bug
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #70622
Special notes for your reviewer:
The most significant change introduced in this PR: when a pod is popped from the scheduling queue, the scheduler no longer removes it from the internal nominatedMap cache immediately. Instead, the entry is deleted only once the pod is assumed (bound to a node).

Why? Because of a rare race: when a high-priority pod arrives and turns out to be unschedulable (scheduling fails) (1), it gets a chance to run preemption (2) and evicts low-priority pods (3) to make room.

During phase (1), the Error() handler (1.1) puts the pod back into unschedulableQ, and only there is the nominatedMap cache updated again. The key point is that this handler runs asynchronously (in a goroutine). In other words, after (3) finishes, a backfill pod replacing the preempted low-priority pod (assuming it's managed by a Deployment/ReplicaSet) can be spawned and enter the scheduling cycle before (1.1) runs. At that moment the scheduler doesn't yet know a nominated pod is pending for that node (the cache hasn't been re-populated), so the backfill pod gets scheduled and starts running, only to be preempted again. This cycle can repeat; it isn't endless, but it wastes resources on unnecessary scheduling and preemption.
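A minimal sketch of the fix described above, under the assumption that the queue keeps a map of nominated pods keyed by node; the type and field names are illustrative, not the exact ones in the PR. Pop() deliberately leaves the map untouched, so a concurrently scheduled backfill pod still "sees" the pending preemptor; the entry is removed only once the preemptor is assumed.

```go
package queue

import (
	"sync"

	v1 "k8s.io/api/core/v1"
)

// nominatedPodMap is an illustrative stand-in for the scheduler's
// nominated-pod bookkeeping: which pods have been nominated onto which node.
type nominatedPodMap struct {
	mu     sync.Mutex
	byPod  map[string]string    // pod UID -> nominated node name
	byNode map[string][]*v1.Pod // node name -> nominated pods
}

// DeleteNominatedPodIfExists removes the entry only once the preemptor is
// assumed, at which point its nominated-node reservation is no longer needed.
// Before this PR the removal happened at Pop() time, opening the window for
// the race described above.
func (m *nominatedPodMap) DeleteNominatedPodIfExists(pod *v1.Pod) {
	m.mu.Lock()
	defer m.mu.Unlock()

	node, ok := m.byPod[string(pod.UID)]
	if !ok {
		return
	}
	delete(m.byPod, string(pod.UID))

	pods := m.byNode[node]
	for i, p := range pods {
		if p.UID == pod.UID {
			m.byNode[node] = append(pods[:i], pods[i+1:]...)
			break
		}
	}
	if len(m.byNode[node]) == 0 {
		delete(m.byNode, node)
	}
}
```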
Along with this PR, I wrote an e2e test to simulate the issue.
Does this PR introduce a user-facing change?:
/sig scheduling