Ignore some update Pod events in scheduler #96071
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: Huang-Wei. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/retest
pkg/scheduler/eventhandlers.go
	return
}
if err := sched.SchedulingQueue.Update(oldObj.(*v1.Pod), pod); err != nil {
old := oldObj.(*v1.Pod)
if old.ResourceVersion == new.ResourceVersion {
Why would we get this, and why wouldn't skipPodUpdate skip it?
An enforced sync would trigger an update event carrying two identical objects.
how come this is not enough to cover that case?
kubernetes/pkg/scheduler/eventhandlers.go
Line 334 in d1c2964
p.ResourceVersion = ""
There is probably some other field we should be excluding, instead of comparing the resource version.
skipPodUpdate() deals with assumed pods only, right? It's very likely the update event carries a non-assumed pod. Let me put some logs in to verify it.
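For readers following along, here is a rough sketch of the shape of skipPodUpdate being discussed. The cache accessors (IsAssumedPod, GetPod) and the set of zeroed fields below are illustrative assumptions, not the verbatim upstream code:

```go
import (
	"reflect"

	v1 "k8s.io/api/core/v1"
)

// Sketch: an update is skippable only for an already-assumed pod whose new
// copy is semantically identical to the in-cache copy once fields the
// scheduler ignores (e.g. ResourceVersion) are cleared out.
func (sched *Scheduler) skipPodUpdate(pod *v1.Pod) bool {
	// Non-assumed pods are never skipped -- this is the hole a phantom
	// preemptor can slip through, as suspected in this thread.
	isAssumed, err := sched.SchedulerCache.IsAssumedPod(pod)
	if err != nil || !isAssumed {
		return false
	}
	assumedPod, err := sched.SchedulerCache.GetPod(pod)
	if err != nil {
		return false
	}
	// Zero out fields that may legitimately differ, then compare semantically.
	strip := func(p *v1.Pod) *v1.Pod {
		c := p.DeepCopy()
		c.ResourceVersion = ""
		return c
	}
	return reflect.DeepEqual(strip(assumedPod), strip(pod))
}
```

This matches the later observation in the thread: a resync event for a not-yet-assumed pod returns false here and sails straight into the scheduling queue.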
(force-pushed from d80f582 to 1355f12)
I've been discussing offline with @Huang-Wei the usefulness of this PR after #82222 is merged. Apparently, even when the resync period is set to 0, full resyncs might still happen, at a cadence of between 10 and 20 minutes. While this PR would avoid extra scheduling attempts, I fear that we might have been relying on those resyncs to reschedule certain pods. So the question is: do we depend on resync-driven update events to retry pods that would otherwise sit in the queue? Note that we still have the goroutine that flushes unschedulable Pods from time to time (so that's a point in favor of merging this PR).
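For context, the flushing referred to above looks roughly like the sketch below; the type and field names are illustrative, not the exact upstream ones. Pods that have waited in the unschedulable queue beyond a threshold get moved back to the active queue, so they are retried even without resync-driven update events:

```go
import (
	"sync"
	"time"

	v1 "k8s.io/api/core/v1"
)

// Illustrative shapes only; the real types live in the scheduler's internal
// queue package.
type queuedPodInfo struct {
	pod       *v1.Pod
	timestamp time.Time // when the pod last entered the unschedulable queue
}

type priorityQueue struct {
	lock              sync.Mutex
	activeQ           []*queuedPodInfo          // simplified; the real one is a heap
	unschedulablePods map[string]*queuedPodInfo // keyed by pod full name
}

// flushUnschedulableLeftover moves pods that have waited longer than maxStay
// back to the active queue for another scheduling attempt.
func (q *priorityQueue) flushUnschedulableLeftover(maxStay time.Duration) {
	q.lock.Lock()
	defer q.lock.Unlock()
	now := time.Now()
	for name, pInfo := range q.unschedulablePods {
		if now.Sub(pInfo.timestamp) > maxStay {
			q.activeQ = append(q.activeQ, pInfo)
			delete(q.unschedulablePods, name)
		}
	}
}
```

The real scheduler runs this kind of flush from a timer goroutine, which is why dropping resync-driven updates should not strand unschedulable pods forever.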
TL;DR: I can confirm that it's the enforced sync (every 1 second) that triggered the flake; it internally left a phantom preemptor obj in schedulingQ and resulted in a series of bizarre behaviors. See the details below. /cc @ahg-g All right, I think I have a full picture of what's happening in the flakes. The culprit is the "phantom" preemptor obj, which, in the flake, is triggered by an enforced resync in an Update() event:
When the event comes in, the real preemptor may not be assumed yet (e.g., it's still being scheduled), so the phantom one can pass skipPodUpdate() and get added to schedulingQ. When the real preemptor finishes its 1st attempt, it updates its status as well as the scheduler's internal nominatedPodsMap:
Then comes a bizarre thing. Remember the phantom preemptor? It gets popped out and scheduled smoothly, looking just like the real preemptor's 2nd attempt:
After this phantom one gets scheduled, the real preemptor's update event comes in, and it has to go through skipPodUpdate() as well. Since this pod has already been assumed, it goes through the semantic obj comparison:
Interestingly, it passed the semantic check because the assumed (phantom) one doesn't carry .status.nominatedNodeName, while the real one does:
This means the real preemptor gets queued, albeit in a super ridiculous way. This will cause big trouble; I will come back to it later. All right, let's continue. The binding cycle of the phantom preemptor proceeds and binds it successfully:
This successful binding triggers two things: removing the pod from schedulingQ (which also clears the nominatedPodsMap), and adding the pod to the cache.
So far, everything should be fine, except that we still have the real preemptor obj left in schedulingQ, as mentioned above. The real preemptor gets popped out and, of course, cannot be scheduled, as the phantom one has already been scheduled and the cache was updated accordingly:
As the DEBUG log shows:
kubernetes/pkg/scheduler/internal/queue/scheduling_queue.go, Lines 741 to 747 in f0dfa55
And the severe thing is that deleting all pods doesn't help clear the nominatedPodsMap: deleting the (phantom) preemptor only clears the cache, not the nominatedPodsMap. This eventually triggered the flakes - the carried-over nominatedPodsMap from the Nth test iteration blocks the initial pods from being scheduled in the (N+1)th iteration:
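To make the carried-over state concrete, here is a sketch of why a cache deletion leaves the nomination records behind; the shapes are illustrative, not the exact upstream types. The scheduling queue tracks nominations in its own maps, independent of the scheduler cache:

```go
import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
)

// The queue's nomination records live apart from the scheduler cache, so
// deleting a pod from the cache alone cannot clean them up.
type nominatedPodMap struct {
	// nodeName -> pods nominated to run on that node.
	nominatedPods map[string][]*v1.Pod
	// pod UID -> node it was nominated for (for fast deletion).
	nominatedPodToNode map[types.UID]string
}

// Deleting the (phantom) assumed pod only goes through the cache path,
// e.g. schedulerCache.RemovePod(phantom): the cache is cleaned up, but
// nominatedPodMap is never touched, so the stale nomination survives and
// blocks the next iteration's initial pods.
```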
And the real preemptor from the previous cycle will be scheduled successfully, but fail at binding, as it has now become a "phantom" obj itself:
I would prefer to block the phantom/duplicate pod from entering the scheduler in the first place, even though we may surface some other bugs (none that I'm aware of so far), as the phantom/duplicate pod is way too dangerous:
Letting the phantom/duplicate pod in and fixing things afterwards would, in my opinion, require much more mental effort, and hence be error-prone. BTW: regarding #82222, setting a 0 sync period is a best practice, but we cannot take it for granted, so this PR is still needed.
Thanks for the detailed case analysis.
oldPod := oldObj.(*v1.Pod)
if oldPod.ResourceVersion == newPod.ResourceVersion {
	return
}
Should we test the version before invoking sched.skipPodUpdate? Then we'd save some useless processing.
I thought of that as well. My idea was to make it conservative. But you're right, moving it ahead will be more efficient.
The key question is: can we get a consensus to compare ResourceVersion and do an early return?
I don't think doing this check after skipPodUpdate is any more conservative. So, fine doing it earlier.
Does ResourceVersion also change if there is a Status update?
skipPodUpdate zeroes out ResourceVersion when comparing with the in-cache copy. This seems to suggest that the resource version may change even if there isn't a real pod update, right?
That's not the intention of that zeroing. It's just that resource version will change when fields we don't care about change.
Does ResourceVersion also change if there is a Status update?
Yes, even with a status update.
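A quick client-go illustration of that point; the pod name, namespace, and helper function below are only for the example:

```go
import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// bumpStatus shows that a status-only write still changes
// metadata.resourceVersion, since resourceVersion tracks every successful
// write to the object.
func bumpStatus(ctx context.Context, client kubernetes.Interface) error {
	pod, err := client.CoreV1().Pods("default").Get(ctx, "nginx", metav1.GetOptions{})
	if err != nil {
		return err
	}
	before := pod.ResourceVersion

	pod.Status.Message = "status-only change for illustration"
	updated, err := client.CoreV1().Pods("default").UpdateStatus(ctx, pod, metav1.UpdateOptions{})
	if err != nil {
		return err
	}
	fmt.Printf("resourceVersion changed: %v\n", before != updated.ResourceVersion) // prints true
	return nil
}
```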
This seems to suggest that resource version may change even if there isn't a real pod update, right?
Nope, once any field of an obj is updated, its resource version gets changed.
One case where the resource version doesn't change is a "resync". The semantics of "resync" are to re-play the items in the informer cache every "resync period", so it happens on the client side, not the server side. Those events each carry identical objects. In some cases, some of them become "phantom" events, just like this issue described.
I hope that "resync" is the only case, so that #82222 should be good enough to handle this case. But I'm not 100% sure "resync" is the only culprit, so this PR is for extra guarding.
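A minimal client-go sketch of that resync behavior (clientset construction omitted): on each resync, UpdateFunc fires with the cached object re-played as both old and new, which is exactly what a ResourceVersion comparison catches.

```go
import (
	"time"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

func watchPods(clientset kubernetes.Interface, stopCh <-chan struct{}) {
	// A 1-second resync period, mirroring the integration-test setup that
	// triggered the flake.
	factory := informers.NewSharedInformerFactory(clientset, 1*time.Second)
	podInformer := factory.Core().V1().Pods().Informer()
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			oldPod, newPod := oldObj.(*v1.Pod), newObj.(*v1.Pod)
			if oldPod.ResourceVersion == newPod.ResourceVersion {
				// Resync path: the item was re-played from the local
				// informer cache; nothing changed on the server side.
				return
			}
			// A real update delivered by the API server watch.
		},
	})
	factory.Start(stopCh)
}
```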
I didn't find that the forced resync happens if we set the default resync period for the informer to 0. I printed the log and saw that the resync period for each event-handler listener will be 0 when the informer's default resync period is also 0. So the resync shouldn't happen.
kubernetes/staging/src/k8s.io/client-go/tools/cache/shared_informer.go, Lines 795 to 797 in 9253aa9
Maybe I missed something.
Thanks @soulxu. That was my memory from a while ago; probably the behavior got updated.
cc/ @liggitt https://kubernetes.slack.com/archives/C0EG7JC6T/p1604952531010300 was asked based on the context of this issue: in some scheduler integration tests, the resyncPeriod was set to 1 second, and it caused weird flakes (for a detailed analysis, see #96071 (comment)). This PR applies an additional check on resource version upon Update(), acting as an extra guard in addition to #82222 (which sets all resyncPeriods to 0).
(force-pushed from 1355f12 to 4bb31d6)
Thanks @Huang-Wei for the detailed investigation! Skipping updates that don't change the resourceVersion makes sense to me; ideally we shouldn't be getting those updates at all, right?
Yes, that's my understanding. Letting it sneak in will really cause unexpected behavior.
pod := newObj.(*v1.Pod)
if sched.skipPodUpdate(pod) {
oldPod, newPod := oldObj.(*v1.Pod), newObj.(*v1.Pod)
if oldPod.ResourceVersion == newPod.ResourceVersion {
adding a comment here explaining when this could happen would be great.
updated.
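Putting the thread together, the updated handler looks roughly like this; it is a sketch reconstructed from the diff fragments above with the explanatory comment the review asked for, not a verbatim copy of the merged code:

```go
import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
)

func (sched *Scheduler) updatePodInSchedulingQueue(oldObj, newObj interface{}) {
	oldPod, newPod := oldObj.(*v1.Pod), newObj.(*v1.Pod)
	// Bypass update events that carry identical objects; otherwise, a
	// duplicated ("phantom") Pod may go through scheduling and cause
	// unexpected behavior. Such events can come from a forced resync,
	// which re-plays unchanged objects from the informer cache.
	if oldPod.ResourceVersion == newPod.ResourceVersion {
		return
	}
	if sched.skipPodUpdate(newPod) {
		return
	}
	if err := sched.SchedulingQueue.Update(oldPod, newPod); err != nil {
		utilruntime.HandleError(fmt.Errorf("unable to update %T: %v", newObj, err))
	}
}
```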
Looks good to me; documenting why we are doing that would be great.
(force-pushed from 4bb31d6 to 6e0fb9a)
/lgtm, hold if others are still looking at the PR
No concerns from me
Thanks all for the review! /hold cancel
What type of PR is this?
/kind feature
/kind flake
/sig scheduling
What this PR does / why we need it:
This PR introduces an optimization to ignore Pod update events when the resourceVersions of the old and new pods are identical.
This is not only an enhancement, but also helps deflake some integration tests (e.g., TestPreemptionRaces). In terms of #85031, I ran the following command w/ and w/o this PR:
Which issue(s) this PR fixes:
Fixes #85031 (particularly the 2 failures mentioned in #85031 (comment))
Special notes for your reviewer:
Does this PR introduce a user-facing change?: