
Ignore some update Pod events in scheduler #96071

Merged
merged 1 commit into kubernetes:master from Huang-Wei:phantom-sched-update-evt on Nov 10, 2020

Conversation

Huang-Wei
Member

What type of PR is this?

/kind feature
/kind flake
/sig scheduling

What this PR does / why we need it:

This PR introduces an optimization that ignores Pod update events when the resourceVersion of the old and new Pods is identical.

This is not only an enhancement, but it also helps deflake some integration tests (e.g., TestPreemptionRaces). In the context of #85031, I ran the following command with and without this PR:

make test-integration WHAT=./test/integration/scheduler KUBE_TEST_ARGS="-run ^TestPreemptionRaces -count 10"
  • Without this PR, it failed 2 times in 10 consecutive runs (the embedded "-count 10" counts as one run).
  • With this PR, it passed 40 runs in a row.
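To make the change concrete, here is a minimal sketch of the early-return guard. The function name and surrounding wiring are illustrative stand-ins for the real handler in pkg/scheduler/eventhandlers.go; only the ResourceVersion comparison reflects the actual change.

package sketch

import v1 "k8s.io/api/core/v1"

// onPodUpdate stands in for the scheduler's Pod UpdateFunc; only the
// ResourceVersion comparison mirrors the change proposed in this PR.
func onPodUpdate(oldObj, newObj interface{}) {
    oldPod, newPod := oldObj.(*v1.Pod), newObj.(*v1.Pod)
    // A client-side resync replays the same cached object as both old and new,
    // so the ResourceVersions are identical and there is nothing to re-schedule.
    if oldPod.ResourceVersion == newPod.ResourceVersion {
        return
    }
    // ...otherwise fall through to the existing skipPodUpdate / queue-update logic.
}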

Which issue(s) this PR fixes:

Fixes #85031 (particularly the 2 failures mentioned in #85031 (comment))

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Scheduler now ignores Pod update events if the resourceVersion of the old and new Pods is identical.

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. kind/flake Categorizes issue or PR as related to a flaky test. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. release-note Denotes a PR that will be considered when it comes time to generate release notes. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Oct 31, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Huang-Wei

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 31, 2020
@Huang-Wei
Member Author

/retest
/triage accepted
/cc @ahg-g @alculquicondor
/uncc @wgliang @k82cn

@k8s-ci-robot k8s-ci-robot added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Oct 31, 2020
@k8s-ci-robot k8s-ci-robot removed the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Oct 31, 2020
pkg/scheduler/eventhandlers.go (outdated)
return
}
if err := sched.SchedulingQueue.Update(oldObj.(*v1.Pod), pod); err != nil {
old := oldObj.(*v1.Pod)
if old.ResourceVersion == new.ResourceVersion {
Member

Why would we get this, and why wouldn't skipPodUpdate skip it?

Member Author

Enforced sync would trigger an update event carrying two identical objects.

Member

how come this is not enough to cover that case?

p.ResourceVersion = ""

There is probably some other field we should be excluding, instead of comparing the resource version.

Member Author

skipPodUpdate() deals with assumed pods only, right? It's very likely the update event carries a non-assumed pod. Let me add some logging to verify it.

@alculquicondor
Member

I've been discussing offline with @Huang-Wei the usefulness of this PR after #82222 is merged.

Apparently, when the resync period is set to 0, full resyncs might still happen, at a cadence of 10 to 20 minutes.

While this PR would avoid extra scheduling attempts, I fear that we might have been relying on them to reschedule certain pods. So the question is:
Should we merge this and deal with any unknown bugs where Pods remain unschedulable, OR should we be conservative, keep things as they are, and deal with the flakes some other way?

Note that we still have the goroutine that flushes unschedulable Pods from time to time (so that's a point in favor of merging this PR).
An alternative way to deal with the flakes is to avoid doing exact comparisons of lists of events and instead make sure all the events in a list happened (though there could be others).

@Huang-Wei
Member Author

Huang-Wei commented Nov 3, 2020

TL;DR

I can confirm that it's the enforced sync (every 1 second) that triggered the flake: it internally left a phantom preemptor obj in schedulingQ, which resulted in a series of bizarre behaviors. See the details below.

/cc @ahg-g


All right, I think I have a full picture of what's happening in the flakes.

The culprit is the "phantom" preemptor obj, which, in this flake, is triggered by an enforced resync in an Update() event:

# a debug log shows the 2 objs passed to Update() hold the same resourceVersion
I1102 21:59:56.447858   20777 eventhandlers.go:185] [DEBUG] skipPodUpdate returns false. old: 526; new: 526

When the event comes in, the real preemptor may not be assumed yet (e.g., it is still being scheduled), so the phantom one passes skipPodUpdate() and gets added to schedulingQ.

When the real preemptor finishes its 1st attempt, it updates its status as well as the scheduler's internal nominatedPodsMap:

I1102 21:59:56.497336   20777 scheduler.go:327] [DEBUG] nominatedNode "node1" for "preemptor-pod"
I1102 21:59:56.497357   20777 scheduler.go:344] Updating pod condition for preemption-race477aa484-aa08-4777-9888-6e55138e504e/preemptor-pod to (PodScheduled==False, Reason=Unschedulable)

Then comes the bizarre part: remember the phantom preemptor? It gets popped out and scheduled smoothly, looking just like the real preemptor's 2nd attempt:

I1102 21:59:56.537870   20777 scheduler.go:448] Attempting to schedule pod: preemption-race477aa484-aa08-4777-9888-6e55138e504e/preemptor-pod
I1102 21:59:56.538016   20777 default_binder.go:51] Attempting to bind preemption-race477aa484-aa08-4777-9888-6e55138e504e/preemptor-pod to node1

After the phantom one gets scheduled, the real preemptor's update event comes in and has to go through skipPodUpdate() as well. Since this pod has already been assumed, it of course goes through the semantic obj comparison:

I1102 21:59:56.538039   20777 eventhandlers.go:317] [DEBUG] pod preemptor-pod is assumed ? true

Interestingly, it gets past the semantic check, as the assumed (phantom) one doesn't carry .status.nominatedNodeName while the real one does:

I1102 21:59:56.538998   20777 eventhandlers.go:358] [DEBUG] diff =   &v1.Pod{
    TypeMeta:   v1.TypeMeta{},
    ObjectMeta: v1.ObjectMeta{Name: "preemptor-pod", Namespace: "preemption-race477aa484-aa08-4777-9888-6e55138e504e", UID: "5fda0dea-efd2-47d0-8810-6729b567ccd9", CreationTimestamp: v1.Time{Time: s"2020-11-02 21:59:56 -0800 PST"}},
    Spec:       v1.PodSpec{Containers: []v1.Container{{Name: "preemptor-pod", Image: "k8s.gcr.io/pause:3.2", Resources: v1.ResourceRequirements{Requests: v1.ResourceList{s"cpu": {i: resource.int64Amount{value: 4900, scale: -3}, s: "4900m", Format: "DecimalSI"}, s"memory": {i: resource.int64Amount{value: 4900}, s: "4900", Format: "DecimalSI"}}}, TerminationMessagePath: "/dev/termination-log", TerminationMessagePolicy: "File", ImagePullPolicy: "IfNotPresent"}}, RestartPolicy: "Always", TerminationGracePeriodSeconds: &30, DNSPolicy: "ClusterFirst", SecurityContext: &v1.PodSecurityContext{}, SchedulerName: "default-scheduler", Priority: &300, EnableServiceLinks: &true},
    Status: v1.PodStatus{
        ... // 2 identical fields
        Message:           "",
        Reason:            "",
-       NominatedNodeName: "",
+       NominatedNodeName: "node1",
        HostIP:            "",
        PodIP:             "",
        ... // 6 identical fields
    },
  }

This means the real preemptor gets queued, in a pretty ridiculous way though. This will cause big trouble, which I will come back to later.

All right, let's continue. The binding cycle of the phantom preemptor proceeds and binds it successfully:

I1102 21:59:56.545621   20777 scheduler.go:593] "Successfully bound pod to node" pod="preemption-race477aa484-aa08-4777-9888-6e55138e504e/preemptor-pod" node="node1" evaluatedNodes=1 feasibleNodes=1

This successful binding triggers two things: removing the pod from schedulingQ (which also clears the nominatedPodsMap), and adding the pod to the cache.

I1102 21:59:56.545836   20777 eventhandlers.go:227] add event for scheduled pod preemption-race477aa484-aa08-4777-9888-6e55138e504e/preemptor-pod
I1102 21:59:56.545841   20777 eventhandlers.go:207] delete event for unscheduled pod preemption-race477aa484-aa08-4777-9888-6e55138e504e/preemptor-pod

So far, everything should be fine, except that we still have the real preemptor obj left in schedulingQ, as mentioned above. The real preemptor gets popped out and, of course, cannot be scheduled, because the phantom one has already been scheduled and the cache has been updated accordingly:

I1102 21:59:56.546199   20777 factory.go:329] "Unable to schedule pod; no fit; waiting" pod="preemption-race477aa484-aa08-4777-9888-6e55138e504e/preemptor-pod" err="0/1 nodes are available: 1 Insufficient cpu, 1 Insufficient memory."
I1102 21:59:56.546212   20777 scheduler.go:327] [DEBUG] nominatedNode "" for "preemptor-pod"
I1102 21:59:56.546229   20777 scheduler.go:344] Updating pod condition for preemption-race477aa484-aa08-4777-9888-6e55138e504e/preemptor-pod to (PodScheduled==False, Reason=Unschedulable)

As the DEBUG log shows, nominatedNode is empty because preemption doesn't help. But remember that this is the real preemptor: it carries .status.nominatedNodeName from the beginning, so the following logic re-fills nominatedPodsMap (L747):

if len(nnn) == 0 {
    nnn = NominatedNodeName(p)
    if len(nnn) == 0 {
        return
    }
}
npm.nominatedPodToNode[p.UID] = nnn

And the severe part is that deleting all pods doesn't help clear the nominatedPodsMap - deleting the (phantom) preemptor only clears the cache, not the nominatedPodsMap. This eventually triggers the flakes: the nominatedPodsMap carried over from the Nth test iteration blocks the initial pods from being scheduled in the (N+1)th test iteration:

I1102 21:59:58.974307   20777 scheduler.go:448] Attempting to schedule pod: preemption-race477aa484-aa08-4777-9888-6e55138e504e/rpod-1
# It's the start of a new testing iteration, len(nominatedPods) != 0 means there is something carried over from previous test
I1102 21:59:58.974392   20777 generic_scheduler.go:387] [DEBUG] len(nominatedPods) != 0 for pod rpod-1

And the real preemptor from the previous cycle gets scheduled successfully but then fails at binding, since it has now become a "phantom" obj itself:

I1102 22:00:00.536313   20777 scheduling_queue.go:814] About to try and schedule pod preemption-race477aa484-aa08-4777-9888-6e55138e504e/preemptor-pod
I1102 22:00:00.536359   20777 scheduler.go:448] Attempting to schedule pod: preemption-race477aa484-aa08-4777-9888-6e55138e504e/preemptor-pod
I1102 22:00:00.536555   20777 default_binder.go:51] Attempting to bind preemption-race477aa484-aa08-4777-9888-6e55138e504e/preemptor-pod to node1
I1102 22:00:00.539045   20777 httplog.go:89] "HTTP" verb="POST" URI="/api/v1/namespaces/preemption-race477aa484-aa08-4777-9888-6e55138e504e/pods/preemptor-pod/binding" latency="2.091263ms" userAgent="scheduler.test/v0.0.0 (darwin/amd64) kubernetes/$Format" srcIP="127.0.0.1:62697" resp=404
E1102 22:00:00.539369   20777 framework.go:743] "Failed running Bind plugin" err="pods \"preemptor-pod\" not found" plugin="DefaultBinder" pod="preemption-race477aa484-aa08-4777-9888-6e55138e504e/preemptor-pod"
I1102 22:00:00.539421   20777 scheduler.go:421] Failed to bind pod: preemption-race477aa484-aa08-4777-9888-6e55138e504e/preemptor-pod
E1102 22:00:00.539489   20777 factory.go:345] "Error scheduling pod; retrying" err="Binding rejected: running Bind plugin \"DefaultBinder\": pods \"preemptor-pod\" not found" pod="preemption-race477aa484-aa08-4777-9888-6e55138e504e/preemptor-pod"
W1102 22:00:00.539516   20777 factory.go:351] Pod preemption-race477aa484-aa08-4777-9888-6e55138e504e/preemptor-pod doesn't exist in informer cache: pod "preemptor-pod" not found
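For anyone who wants to observe this behavior outside the scheduler, here is a hedged, stand-alone client-go sketch (the kubeconfig loading and the 1-second resync period are assumptions mirroring the integration test setup; this is not scheduler code). It surfaces exactly the kind of identical-ResourceVersion updates captured in the DEBUG line above.

package main

import (
    "fmt"
    "time"

    v1 "k8s.io/api/core/v1"
    "k8s.io/client-go/informers"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/cache"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    client := kubernetes.NewForConfigOrDie(config)

    // A 1-second resync makes the informer replay its cache every second; each
    // replay surfaces as an Update event whose old and new objects are the same
    // cached copy, hence identical ResourceVersions.
    factory := informers.NewSharedInformerFactory(client, 1*time.Second)
    podInformer := factory.Core().V1().Pods().Informer()
    podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
        UpdateFunc: func(oldObj, newObj interface{}) {
            oldPod, newPod := oldObj.(*v1.Pod), newObj.(*v1.Pod)
            if oldPod.ResourceVersion == newPod.ResourceVersion {
                fmt.Printf("resync-style update for %s/%s (rv=%s)\n",
                    newPod.Namespace, newPod.Name, newPod.ResourceVersion)
            }
        },
    })

    stopCh := make(chan struct{})
    defer close(stopCh)
    factory.Start(stopCh)
    factory.WaitForCacheSync(stopCh)
    select {}
}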

@Huang-Wei
Member Author

Should we merge this and deal with any unknown bugs where Pods remain unschedulable, OR should we be conservative, keep things as they are, and deal with the flakes some other way?

I would prefer to block the phantom/duplicate pod from entering the scheduler in the first place, even though we may surface some other bugs (none that I'm aware of so far), because the phantom/duplicate pod is way too dangerous:

  1. Leaving a phantom/duplicate pod in the scheduling queue causes unexpected behavior that is almost impossible to debug
  2. A phantom/duplicate pod wastes scheduling cycles (sometimes forever) and slows down the scheduler
  3. A phantom/duplicate pod keeps updating the pod object; in other words, you may see a scheduled pod with failure conditions.

Letting the phantom/duplicate pod in and fixing things up afterwards would, in my opinion, require much more mental overhead, and hence be more error-prone.

BTW: regarding #82222, setting a 0 resync period is a best practice, but we cannot take it for granted, so this PR is still needed.

@alculquicondor
Member

Thanks for the detailed case analysis.
Regardless of the decision, I don't think we should hold #82222 on this PR.

oldPod := oldObj.(*v1.Pod)
if oldPod.ResourceVersion == newPod.ResourceVersion {
    return
}
Contributor

Should we test the resource version before invoking sched.skipPodUpdate? Then we would save some useless processing.

Member Author

I thought of that as well. My idea was to be conservative. But you're right, moving it earlier will be more efficient.

The key question is: can we get a consensus to compare ResourceVersion and do an early return?

@alculquicondor @ahg-g @damemi @adtac ^^

Member

I don't think doing this check after skipPodUpdate is any more conservative, so I'm fine with doing it earlier.

Does ResourceVersion also change if there is a Status update?

Member

skipPodUpdate zeroes out ResourceVersion when comparing with the in-cache copy. This seems to suggest that resource version may change even if there isn't a real pod update, right?

Member

That's not the intention of that zeroing. It's just that resource version will change when fields we don't care about change.
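To make the zeroing discussion concrete, here is a minimal sketch of that style of comparison. The exact set of blanked fields (ResourceVersion, Spec.NodeName, Annotations) is an assumption for illustration; the authoritative logic lives in the scheduler's skipPodUpdate.

package sketch

import (
    "reflect"

    v1 "k8s.io/api/core/v1"
)

// podsEquivalentForScheduling copies both pods, blanks fields that change
// without affecting scheduling decisions, and deep-compares the rest. The
// field list here is illustrative, not the scheduler's exact set.
func podsEquivalentForScheduling(a, b *v1.Pod) bool {
    strip := func(p *v1.Pod) *v1.Pod {
        c := p.DeepCopy()
        c.ResourceVersion = "" // bumped on every write, carries no scheduling signal
        c.Spec.NodeName = ""   // set by the scheduler itself once the pod is assumed/bound
        c.Annotations = nil    // may be mutated by other components
        return c
    }
    return reflect.DeepEqual(strip(a), strip(b))
}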

Member Author

Does ResourceVersion also change if there is a Status update?

Yes, even with a status update.

Member Author

This seems to suggest that resource version may change even if there isn't a real pod update, right?

Nope, once any field of an obj is updated, its resource version gets changed.

One case where the resource version doesn't change is "resync". The semantics of "resync" is to re-play the items in the informer cache every "resync period", so it happens on the client side, not the server side. Those events each carry identical old and new objects. In some cases, some of them become "phantom" events, just like this issue described.
I hope "resync" is the only such case, so that #82222 would be good enough to handle it. But I'm not 100% sure "resync" is the only culprit, so this PR adds an extra guard.

@soulxu
Contributor

soulxu commented Nov 7, 2020

Apparently, when resync period is set to 0, full resyncs might still happen, at a cadence between 10 to 20 minutes.

I didn't find that a forced resync happens if we set the informer's default resync period to 0. I printed the logs and saw that the resync period for each event-handler listener is also 0 when the informer's default resync period is 0. So the resync shouldn't happen.

if p.resyncPeriod == 0 {
    return false
}

Maybe I missed something.

@Huang-Wei
Member Author

So the resync shouldn't happen.

Thanks @soulxu. That was my recollection from a while ago; the behavior has probably been updated since.

@Huang-Wei
Member Author

cc/ @liggitt https://kubernetes.slack.com/archives/C0EG7JC6T/p1604952531010300 was asked in the context of this issue: in some scheduler integration tests, the resyncPeriod was set to 1 second, and it caused weird flakes (for a detailed analysis, see #96071 (comment)). This PR applies an additional check on the resource version upon Update(), acting as an extra guard in addition to #82222 (which sets all resyncPeriods to 0).

@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Nov 9, 2020
@ahg-g
Member

ahg-g commented Nov 9, 2020

Thanks @Huang-Wei for the detailed investigation! Skipping updates that don't change the resourceVersion makes sense to me; ideally we shouldn't be getting those updates at all, right?

@Huang-Wei
Member Author

ideally we shouldn't be getting those updates at all, right?

Yes, that's my understanding. Letting it sneak in will really cause unexpected behavior.

pod := newObj.(*v1.Pod)
if sched.skipPodUpdate(pod) {
oldPod, newPod := oldObj.(*v1.Pod), newObj.(*v1.Pod)
if oldPod.ResourceVersion == newPod.ResourceVersion {
Member

adding a comment here explaining when this could happen would be great.

Member Author

updated.

@ahg-g
Member

ahg-g commented Nov 10, 2020

ideally we shouldn't be getting those updates at all, right?

Yes, that's my understanding. Letting it sneak in will really cause unexpected behavior.

looks good to me, documenting why we are doing that would be great.

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Nov 10, 2020
@ahg-g
Member

ahg-g commented Nov 10, 2020

/lgtm
/hold

hold in case others are still looking at the PR

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 10, 2020
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 10, 2020
@alculquicondor
Member

No concerns from me

@Huang-Wei
Member Author

Thanks all for the review!

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 10, 2020
@k8s-ci-robot k8s-ci-robot merged commit 2b8f43b into kubernetes:master Nov 10, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.20 milestone Nov 10, 2020
@Huang-Wei Huang-Wei deleted the phantom-sched-update-evt branch November 10, 2020 20:25