ensure scheduler preemptor behaves in an efficient/correct path #70898
Conversation
Force-pushed from 0b580bb to 07ea951
Force-pushed from 9254427 to 024d6b5
CI is green now (including the e2e test in this PR).
/priority important-soon
Force-pushed from 5f31e41 to 4cc7c4b
@bsalamat I've updated the logic so that the nominatedPod is deleted from the cache when the pod is assumed. Regarding the test, I'm still trying to build an integration test, but no luck reproducing the issue so far.
test/e2e/scheduling/preemption.go
Outdated
framework.Logf("length of pods created so far: %v", len(podEvents))
// 13 = 5+4+4, which implies ReplicaSet{1,2,3} should create and only create
// exactly the "replicas" of pods
if podEventsNum := len(podEvents); podEventsNum != 13 {
You are right. That integration test actually exists. Have you tried running it with "stress" to see if it fails at all?
pkg/scheduler/factory/factory.go
Outdated
// DeleteNominatedPodIfExists is called when a pod is assumed
// It will delete the pod from internal cache if it's a nominated pod
DeleteNominatedPodIfExists func(pod *v1.Pod)
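For context, here is a minimal sketch of where a callback like this would fire. The `Config`, `Scheduler`, and `assume` names below are hypothetical stand-ins for illustration, not the code under review:

```go
package scheduler

import (
	v1 "k8s.io/api/core/v1"
)

// Config is a drastically reduced, hypothetical stand-in for factory.Config;
// only the callback field from the diff above is shown.
type Config struct {
	DeleteNominatedPodIfExists func(pod *v1.Pod)
}

// Scheduler is likewise a placeholder for illustration.
type Scheduler struct {
	config Config
}

// assume sketches where the callback would fire: once the pod is assumed onto
// a node, its nominated-pod reservation is no longer needed, so it is dropped
// from the queue's bookkeeping via the callback.
func (s *Scheduler) assume(pod *v1.Pod, host string) {
	pod.Spec.NodeName = host
	// ... in the real scheduler, the cache's AssumePod step would run here ...
	if s.config.DeleteNominatedPodIfExists != nil {
		s.config.DeleteNominatedPodIfExists(pod)
	}
}
```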
I would still prefer exposing the scheduling queue here. This function looks too low level to me to get exposed in the scheduler config. We have the scheduler cache, event recorder, volume binder, etc. here. The scheduling queue looks more of the same nature of the existing members of this struct.
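A rough sketch of the alternative being suggested, with hypothetical interface and field names standing in for the real ones: the queue itself becomes a member of the factory config, and the scheduler calls its nominated-pod cleanup directly.

```go
package scheduler

import v1 "k8s.io/api/core/v1"

// SchedulingQueue is a reduced, hypothetical view of the queue interface;
// only the method relevant to this discussion is listed.
type SchedulingQueue interface {
	DeleteNominatedPodIfExists(pod *v1.Pod)
}

// Config sketches the shape being asked for: the queue sits next to the cache,
// event recorder, volume binder, and so on, and the scheduler calls queue
// methods directly instead of going through a bare function field.
type Config struct {
	SchedulingQueue SchedulingQueue
	// ... SchedulerCache, Recorder, VolumeBinder, etc. would sit here too ...
}
```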
@Huang-Wei if you can verify in your real clusters that the fix actually works, I am fine with having the fix in 1.13 and adding the test in the next few days.
@bsalamat yes, I can reproduce both manually and using the e2e test I wrote.
This SGTM. I will update this PR to remove the e2e test and address the comment to "expose scheduling queue in factory.Config, instead of a function".
@bsalamat @ravisantoshgudimetla PTAL.
- don't update nominatedMap cache when Pop() an element from activeQ
- instead, delete the nominated info from cache when it's "assumed"
- unit test behavior adjusted
- expose SchedulingQueue in factory.Config
Force-pushed from 8de0583 to b4fd115
/priority critical-urgent
/retest
/retest
/retest
Integration test keeps failing due to different flakes.
/retest
Approving the PR based on #70898 (review)
@Huang-Wei Thanks for identifying the issue and fixing it. Please prepare a PR for a test case as follow-up.
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Huang-Wei, ravisantoshgudimetla. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
What type of PR is this?
/kind bug
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #70622
Special notes for your reviewer:
The most significant change introduced in this PR: when a pod is popped from the scheduling queue, the scheduler no longer removes it from the internal nominatedMap cache immediately. Instead, the entry is deleted only once the pod is assumed (bound to a node).

Why? Because of a rare race: when a high-priority pod arrives and turns out to be unschedulable (scheduling fails) (1), it gets a chance to run preemption (2) and evicts low-priority pods (3) to make room.

During phase (1), the Error() handler (1.1) puts the pod back into unschedulableQ, and only there is the nominatedMap cache updated again. The key point is that this handler runs asynchronously (in a goroutine). In other words, after (3) finishes, a backfill pod replacing the preempted low-priority pod (assuming it's managed by a Deployment/ReplicaSet) can be spawned and enter the scheduling cycle before (1.1) runs. At that moment the scheduler doesn't yet know a nominated pod is pending for that node (the cache hasn't been re-populated), so the backfill pod gets scheduled and starts running, only to be preempted again. This cycle can repeat; it isn't endless, but it wastes resources on unnecessary scheduling and preemption.
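A minimal sketch of the fix described above, under the assumption that the queue keeps a map of nominated pods keyed by node; the type and field names are illustrative, not the exact ones in the PR. Pop() deliberately leaves the map untouched, so a concurrently scheduled backfill pod still "sees" the pending preemptor; the entry is removed only once the preemptor is assumed.

```go
package queue

import (
	"sync"

	v1 "k8s.io/api/core/v1"
)

// nominatedPodMap is an illustrative stand-in for the scheduler's
// nominated-pod bookkeeping: which pods have been nominated onto which node.
type nominatedPodMap struct {
	mu     sync.Mutex
	byPod  map[string]string    // pod UID -> nominated node name
	byNode map[string][]*v1.Pod // node name -> nominated pods
}

// DeleteNominatedPodIfExists removes the entry only once the preemptor is
// assumed, at which point its nominated-node reservation is no longer needed.
// Before this PR the removal happened at Pop() time, opening the window for
// the race described above.
func (m *nominatedPodMap) DeleteNominatedPodIfExists(pod *v1.Pod) {
	m.mu.Lock()
	defer m.mu.Unlock()

	node, ok := m.byPod[string(pod.UID)]
	if !ok {
		return
	}
	delete(m.byPod, string(pod.UID))

	pods := m.byNode[node]
	for i, p := range pods {
		if p.UID == pod.UID {
			m.byNode[node] = append(pods[:i], pods[i+1:]...)
			break
		}
	}
	if len(m.byNode[node]) == 0 {
		delete(m.byNode, node)
	}
}
```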
Along with this PR, I wrote an e2e test to simulate the issue.
Does this PR introduce a user-facing change?:
/sig scheduling