ensure scheduler preemptor behaves in an efficient/correct path #70898

Merged
merged 1 commit into kubernetes:master from Huang-Wei:preemption-issue Nov 17, 2018

Conversation

@Huang-Wei
Member

Huang-Wei commented Nov 10, 2018

What type of PR is this?

/kind bug

What this PR does / why we need it:

  • don't update the nominatedMap cache when Pop()-ing an element from activeQ
  • instead, delete the nominated info from the cache when the pod is bound (scheduled)

Which issue(s) this PR fixes:

Fixes #70622

Special notes for your reviewer:

The most significant change introduced in this PR is: when popping a pod, the scheduler no longer removes its entry from the internal nominatedMap cache immediately. Instead, it defers that cache invalidation until the pod is bound.

Why? Because of a rare race: a high-priority pod comes in and turns out to be unschedulable (fails scheduling) (1), so it gets a chance to run "preemption" (2) and preempts low-priority pods (3) to make room.

During phase (1), the Error() handler (1.1) puts the pod back into unschedulableQ and re-updates the nominatedMap cache. The key point is that this handler runs asynchronously (in a goroutine). In other words, after (3) finishes, a backfill pod for the preempted low-priority pod (suppose it is managed by a Deployment/ReplicaSet) can be spawned and enter its own scheduling cycle before (1.1) has run. At that moment the scheduler doesn't know a nominated pod is already targeting that node (the cache hasn't been re-updated yet), so the backfill pod is scheduled and starts running, only to be preempted again. This cycle can repeat; it isn't endless, but it wastes resources on unnecessary scheduling and preemption.
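
To make the intended flow concrete, here is a minimal, self-contained Go sketch of the idea (stand-in types only; names such as schedulingQueue, Pod, and the map layout are assumptions, not the real scheduler code): the nominated-pod entry survives Pop(), so a concurrent cycle for a backfill pod still sees the preemptor, and the entry is only dropped once the pod is assumed/bound.

package main

import (
	"fmt"
	"sync"
)

// Stand-in types only; the real scheduler uses *v1.Pod and its own
// internal queue implementation.
type Pod struct {
	Name          string
	NominatedNode string
}

type schedulingQueue struct {
	mu        sync.Mutex
	activeQ   []*Pod
	nominated map[string][]*Pod // nodeName -> pods nominated onto that node
}

// Pop hands the next pod to a scheduling cycle. With this PR the nominated
// entry is deliberately NOT removed here, so a concurrent cycle for a
// backfill pod still accounts for the preemptor's reservation on the node.
func (q *schedulingQueue) Pop() *Pod {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.activeQ) == 0 {
		return nil
	}
	p := q.activeQ[0]
	q.activeQ = q.activeQ[1:]
	return p
}

// DeleteNominatedPodIfExists is only called once the pod is assumed/bound.
func (q *schedulingQueue) DeleteNominatedPodIfExists(p *Pod) {
	q.mu.Lock()
	defer q.mu.Unlock()
	pods := q.nominated[p.NominatedNode]
	for i, np := range pods {
		if np.Name == p.Name {
			q.nominated[p.NominatedNode] = append(pods[:i], pods[i+1:]...)
			break
		}
	}
}

func main() {
	q := &schedulingQueue{nominated: map[string][]*Pod{}}
	preemptor := &Pod{Name: "high-prio", NominatedNode: "node-1"}
	q.activeQ = append(q.activeQ, preemptor)
	q.nominated["node-1"] = append(q.nominated["node-1"], preemptor)

	p := q.Pop()
	fmt.Println("nominated entries after Pop():", len(q.nominated["node-1"])) // 1
	q.DeleteNominatedPodIfExists(p)
	fmt.Println("nominated entries after assume:", len(q.nominated["node-1"])) // 0
}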

Along with this PR, I wrote an e2e test to simulate the issue.

Does this PR introduce a user-facing change?:

Fix a potential bug where the scheduler preempts pods unnecessarily.

/sig scheduling

@Huang-Wei Huang-Wei force-pushed the Huang-Wei:preemption-issue branch 4 times, most recently from 9254427 to 024d6b5 Nov 12, 2018

@Huang-Wei Huang-Wei changed the title from [WIP] ensure scheduler preemptor behaves in an efficient/correct path to ensure scheduler preemptor behaves in an efficient/correct path Nov 12, 2018

@Huang-Wei

Member

Huang-Wei commented Nov 12, 2018

CI is green now (including the e2e test in this PR).

/cc @bsalamat @resouer @k82cn @ravisantoshgudimetla

@k8s-ci-robot k8s-ci-robot requested review from bsalamat, k82cn and resouer Nov 12, 2018

@Huang-Wei

Member

Huang-Wei commented Nov 12, 2018

/priority important-soon

@Huang-Wei Huang-Wei force-pushed the Huang-Wei:preemption-issue branch from 5f31e41 to 4cc7c4b Nov 15, 2018

@Huang-Wei

Member

Huang-Wei commented Nov 15, 2018

@bsalamat I've moved the logic of deleting the nominated pod from the cache into assume(). I didn't expose the whole scheduling queue in factory.Config; instead I exposed a function. PTAL (latest commit).

Regarding the test, I'm still trying to build an integration test, but no luck reproducing the issue so far.

framework.Logf("length of pods created so far: %v", len(podEvents))
// 13 = 5+4+4: ReplicaSets 1, 2, and 3 should each create exactly their
// "replicas" number of pods, and no more
if podEventsNum := len(podEvents); podEventsNum != 13 {

@bsalamat

bsalamat Nov 16, 2018

Contributor

You are right. That integration test actually exists. Have you tried running it with "stress" to see if it fails at all?

// DeleteNominatedPodIfExists is called when a pod is assumed.
// It deletes the pod from the internal cache if it is a nominated pod.
DeleteNominatedPodIfExists func(pod *v1.Pod)

@bsalamat

bsalamat Nov 16, 2018

Contributor

I would still prefer exposing the scheduling queue here. This function looks too low level to me to be exposed in the scheduler config. We have the scheduler cache, event recorder, volume binder, etc. here. The scheduling queue is of the same nature as the existing members of this struct.
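
For illustration, a rough, self-contained sketch of the two options being discussed (stand-in types; the real factory.Config and queue interface have many more members, and the fakeQueue here is purely hypothetical):

package main

import "fmt"

// Stand-in types; the real code uses *v1.Pod and the scheduler's
// internal queue package.
type Pod struct{ Name string }

type SchedulingQueue interface {
	DeleteNominatedPodIfExists(pod *Pod)
}

// Config sketches the two alternatives from this review thread.
type Config struct {
	// Narrow hook, as in the quoted snippet above.
	DeleteNominatedPodIfExists func(pod *Pod)

	// Reviewer-preferred alternative: expose the queue itself, matching
	// the granularity of the cache, event recorder, volume binder, etc.
	SchedulingQueue SchedulingQueue
}

type fakeQueue struct{}

func (fakeQueue) DeleteNominatedPodIfExists(pod *Pod) {
	fmt.Println("dropping nominated entry for", pod.Name)
}

func main() {
	q := fakeQueue{}
	cfg := Config{
		DeleteNominatedPodIfExists: q.DeleteNominatedPodIfExists,
		SchedulingQueue:            q,
	}
	cfg.SchedulingQueue.DeleteNominatedPodIfExists(&Pod{Name: "preemptor"})
}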

@bsalamat

@Huang-Wei if you can verify in your real clusters that the fix actually works, I am fine with having the fix in 1.13 and adding the test in the next few days.

@Huang-Wei

Member

Huang-Wei commented Nov 16, 2018

@bsalamat yes, I can reproduce the issue both manually and with the e2e test I wrote.

I am fine with having the fix in 1.13 and adding the test in the next few days.

This SGTM. I will update this PR to remove the e2e test and address the comment to expose the scheduling queue in factory.Config instead of a function.

@Huang-Wei

Member

Huang-Wei commented Nov 16, 2018

@bsalamat @ravisantoshgudimetla PTAL. (I will do a final squash and remove the e2e test commit. Done.)

ensure scheduler preemptor behaves in an efficient/correct path
- don't update the nominatedMap cache when Pop()-ing an element from activeQ
- instead, delete the nominated info from the cache when the pod is "assumed"
- adjust unit test behavior accordingly
- expose SchedulingQueue in factory.Config
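
As a rough sketch of how the final shape fits together (stand-in types; the real assume() also updates the scheduler cache and handles errors, and the queue type here is hypothetical), the assume step is where the nominated entry is finally dropped, via the queue exposed in the config:

package main

import "fmt"

// Stand-in types; not the real scheduler code.
type Pod struct{ Name string }

type SchedulingQueue interface {
	DeleteNominatedPodIfExists(pod *Pod)
}

type Config struct {
	SchedulingQueue SchedulingQueue
}

type Scheduler struct{ config *Config }

// assume marks the pod as assumed on a node; only here (not in Pop()) is
// the pod's nominated entry removed from the queue.
func (s *Scheduler) assume(pod *Pod, host string) {
	// ... cache bookkeeping (assuming the pod onto the node) would go here ...
	s.config.SchedulingQueue.DeleteNominatedPodIfExists(pod)
	fmt.Printf("assumed %s on %s\n", pod.Name, host)
}

type queue struct{ nominated map[string]*Pod }

func (q *queue) DeleteNominatedPodIfExists(pod *Pod) {
	delete(q.nominated, pod.Name)
}

func main() {
	preemptor := &Pod{Name: "high-prio"}
	q := &queue{nominated: map[string]*Pod{preemptor.Name: preemptor}}
	sched := &Scheduler{config: &Config{SchedulingQueue: q}}
	sched.assume(preemptor, "node-1")
	fmt.Println("remaining nominated entries:", len(q.nominated)) // 0
}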

@Huang-Wei Huang-Wei force-pushed the Huang-Wei:preemption-issue branch from 8de0583 to b4fd115 Nov 16, 2018

@k8s-ci-robot k8s-ci-robot added size/M and removed size/L labels Nov 16, 2018

@Huang-Wei

Member

Huang-Wei commented Nov 16, 2018

/priority critical-urgent

@Huang-Wei

Member

Huang-Wei commented Nov 16, 2018

/retest

2 similar comments
@Huang-Wei

Member

Huang-Wei commented Nov 16, 2018

/retest

@Huang-Wei

Member

Huang-Wei commented Nov 16, 2018

/retest

@Huang-Wei

Member

Huang-Wei commented Nov 17, 2018

The integration test job keeps failing due to various unrelated flakes.

@Huang-Wei

Member

Huang-Wei commented Nov 17, 2018

/retest

@ravisantoshgudimetla

Approving the PR based on #70898 (review)

@Huang-Wei Thanks for identifying the issue and fixing it. Please prepare a PR for a test case as follow-up.

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm label Nov 17, 2018

@k8s-ci-robot

Contributor

k8s-ci-robot commented Nov 17, 2018

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Huang-Wei, ravisantoshgudimetla

@k8s-ci-robot k8s-ci-robot merged commit 1f3057b into kubernetes:master Nov 17, 2018

18 checks passed

cla/linuxfoundation Huang-Wei authorized
pull-kubernetes-bazel-build Job succeeded.
pull-kubernetes-bazel-test Job succeeded.
pull-kubernetes-cross Skipped
pull-kubernetes-e2e-gce Job succeeded.
pull-kubernetes-e2e-gce-100-performance Job succeeded.
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
pull-kubernetes-e2e-gke Skipped
pull-kubernetes-e2e-kops-aws Job succeeded.
pull-kubernetes-e2e-kubeadm-gce Skipped
pull-kubernetes-integration Job succeeded.
pull-kubernetes-kubemark-e2e-gce-big Job succeeded.
pull-kubernetes-local-e2e Skipped
pull-kubernetes-local-e2e-containerized Skipped
pull-kubernetes-node-e2e Job succeeded.
pull-kubernetes-typecheck Job succeeded.
pull-kubernetes-verify Job succeeded.
tide In merge pool.

@Huang-Wei Huang-Wei deleted the Huang-Wei:preemption-issue branch Nov 20, 2018

@Huang-Wei Huang-Wei referenced this pull request Nov 20, 2018

Merged

Preemption e2e test #71281

k8s-ci-robot added a commit that referenced this pull request Dec 5, 2018

Merge pull request #71724 from Huang-Wei/automated-cherry-pick-of-#70898-#71281-upstream-release-1.12

Automated cherry pick of #70898: ensure scheduler preemptor behaves in an efficient/correct path; #71281: add an e2e test to verify preemption running path

k8s-ci-robot added a commit that referenced this pull request Dec 10, 2018

Merge pull request #71884 from Huang-Wei/automated-cherry-pick-of-#70898-#71281-upstream-release-1.11

Automated cherry pick of #70898: ensure scheduler preemptor behaves in an efficient/correct path; #71281: add an e2e test to verify preemption running path