
Ignore some update Pod events in scheduler #96071

Merged
merged 1 commit into kubernetes:master from Huang-Wei:phantom-sched-update-evt on Nov 10, 2020

Conversation

Huang-Wei
Member

What type of PR is this?

/kind feature
/kind flake
/sig scheduling

What this PR does / why we need it:

This PR introduces an optimization that ignores Pod update events when the resourceVersion of the old and new Pods is identical.

This is not only an enhancement, but it also helps deflake some integration tests (e.g., TestPreemptionRaces). In the context of #85031, I ran the following command with and without this PR:

make test-integration WHAT=./test/integration/scheduler KUBE_TEST_ARGS="-run ^TestPreemptionRaces -count 10"
  • Without this PR, it failed 2 times in 10 consecutive runs (the embedded "-count 10" counts as one run).
  • With this PR, it passed 40 runs in a row.
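To make the change concrete, here is a minimal sketch of the early-return guard. The function name and surrounding wiring are illustrative stand-ins for the real handler in pkg/scheduler/eventhandlers.go; only the ResourceVersion comparison reflects the actual change.

package sketch

import v1 "k8s.io/api/core/v1"

// onPodUpdate stands in for the scheduler's Pod UpdateFunc; only the
// ResourceVersion comparison mirrors the change proposed in this PR.
func onPodUpdate(oldObj, newObj interface{}) {
    oldPod, newPod := oldObj.(*v1.Pod), newObj.(*v1.Pod)
    // A client-side resync replays the same cached object as both old and new,
    // so the ResourceVersions are identical and there is nothing to re-schedule.
    if oldPod.ResourceVersion == newPod.ResourceVersion {
        return
    }
    // ...otherwise fall through to the existing skipPodUpdate / queue-update logic.
}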

Which issue(s) this PR fixes:

Fixes #85031 (particularly the 2 failures mentioned in #85031 (comment))

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Scheduler now ignores Pod update events if the resourceVersion of the old and new Pods is identical.

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. kind/flake Categorizes issue or PR as related to a flaky test. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. release-note Denotes a PR that will be considered when it comes time to generate release notes. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Oct 31, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Huang-Wei

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 31, 2020
@Huang-Wei
Member Author

/retest
/triage accepted
/cc @ahg-g @alculquicondor
/uncc @wgliang @k82cn

@k8s-ci-robot k8s-ci-robot added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Oct 31, 2020
@k8s-ci-robot k8s-ci-robot removed the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Oct 31, 2020
pkg/scheduler/eventhandlers.go (outdated)
return
}
if err := sched.SchedulingQueue.Update(oldObj.(*v1.Pod), pod); err != nil {
old := oldObj.(*v1.Pod)
if old.ResourceVersion == new.ResourceVersion {
Member

Why would we get this, and why wouldn't skipPodUpdate skip it?

Member Author

Enforced sync would trigger an update event carrying two identical objects.

Member

how come this is not enough to cover that case?

p.ResourceVersion = ""

There is probably some other field we should be excluding, instead of comparing the resource version.

Member Author

skipPodUpdate() deals with assumed pods only, right? It's very likely the update event carries a non-assumed pod. Let me add some logging to verify it.

@alculquicondor
Member

I've been discussing offline with @Huang-Wei the usefulness of this PR after #82222 is merged.

Apparently, when the resync period is set to 0, full resyncs might still happen, at a cadence of 10 to 20 minutes.

While this PR would avoid extra scheduling attempts, I fear that we might have been relying on them to reschedule certain pods. So the question is:
Should we merge this and deal with any unknown bugs where Pods remain unschedulable, OR should we be conservative, keep things as they are, and deal with the flakes some other way?

Note that we still have the goroutine that flushes unschedulable Pods from time to time (so that's a point in favor of merging this PR).
An alternative way to deal with the flakes is to avoid doing exact comparisons of lists of events and instead make sure all the events in a list happened (though there could be others).

@Huang-Wei
Member Author

Huang-Wei commented Nov 3, 2020

TL;DR

I can confirm that it's the enforced sync (every 1 second) that triggered the flake: it internally left a phantom preemptor obj in schedulingQ, which resulted in a series of bizarre behaviors. See the details below.

/cc @ahg-g


All right, I think I have a full picture of what's happening in the flakes.

The culprit is the "phantom" preemptor obj, which, in this flake, is triggered by an enforced resync in an Update() event:

# a debug log shows the 2 objs passed to Update() hold the same resourceVersion
I1102 21:59:56.447858   20777 eventhandlers.go:185] [DEBUG] skipPodUpdate returns false. old: 526; new: 526

When the event comes in, the real preemptor may not be assumed yet (e.g., it is still being scheduled), so the phantom one passes skipPodUpdate() and gets added to schedulingQ.

When the real preemptor finishes its 1st attempt, it updates its status as well as the scheduler's internal nominatedPodsMap:

I1102 21:59:56.497336   20777 scheduler.go:327] [DEBUG] nominatedNode "node1" for "preemptor-pod"
I1102 21:59:56.497357   20777 scheduler.go:344] Updating pod condition for preemption-race477aa484-aa08-4777-9888-6e55138e504e/preemptor-pod to (PodScheduled==False, Reason=Unschedulable)

Then comes the bizarre part: remember the phantom preemptor? It gets popped out and scheduled smoothly, looking just like the real preemptor's 2nd attempt:

I1102 21:59:56.537870   20777 scheduler.go:448] Attempting to schedule pod: preemption-race477aa484-aa08-4777-9888-6e55138e504e/preemptor-pod
I1102 21:59:56.538016   20777 default_binder.go:51] Attempting to bind preemption-race477aa484-aa08-4777-9888-6e55138e504e/preemptor-pod to node1

After the phantom one gets scheduled, the real preemptor's update event comes in and has to go through skipPodUpdate() as well. Since this pod has already been assumed, it of course goes through the semantic obj comparison:

I1102 21:59:56.538039   20777 eventhandlers.go:317] [DEBUG] pod preemptor-pod is assumed ? true

Interestingly, it gets past the semantic check, as the assumed (phantom) one doesn't carry .status.nominatedNodeName while the real one does:

I1102 21:59:56.538998   20777 eventhandlers.go:358] [DEBUG] diff =   &v1.Pod{
    TypeMeta:   v1.TypeMeta{},
    ObjectMeta: v1.ObjectMeta{Name: "preemptor-pod", Namespace: "preemption-race477aa484-aa08-4777-9888-6e55138e504e", UID: "5fda0dea-efd2-47d0-8810-6729b567ccd9", CreationTimestamp: v1.Time{Time: s"2020-11-02 21:59:56 -0800 PST"}},
    Spec:       v1.PodSpec{Containers: []v1.Container{{Name: "preemptor-pod", Image: "k8s.gcr.io/pause:3.2", Resources: v1.ResourceRequirements{Requests: v1.ResourceList{s"cpu": {i: resource.int64Amount{value: 4900, scale: -3}, s: "4900m", Format: "DecimalSI"}, s"memory": {i: resource.int64Amount{value: 4900}, s: "4900", Format: "DecimalSI"}}}, TerminationMessagePath: "/dev/termination-log", TerminationMessagePolicy: "File", ImagePullPolicy: "IfNotPresent"}}, RestartPolicy: "Always", TerminationGracePeriodSeconds: &30, DNSPolicy: "ClusterFirst", SecurityContext: &v1.PodSecurityContext{}, SchedulerName: "default-scheduler", Priority: &300, EnableServiceLinks: &true},
    Status: v1.PodStatus{
        ... // 2 identical fields
        Message:           "",
        Reason:            "",
-       NominatedNodeName: "",
+       NominatedNodeName: "node1",
        HostIP:            "",
        PodIP:             "",
        ... // 6 identical fields
    },
  }

This means the real preemptor gets queued, in a pretty ridiculous way though. This will cause big trouble, which I will come back to later.

All right, let's continue. The binding cycle of the phantom preemptor proceeds and binds it successfully:

I1102 21:59:56.545621   20777 scheduler.go:593] "Successfully bound pod to node" pod="preemption-race477aa484-aa08-4777-9888-6e55138e504e/preemptor-pod" node="node1" evaluatedNodes=1 feasibleNodes=1

This successful binding triggers two things: removing the pod from schedulingQ (which also clears the nominatedPodsMap), and adding the pod to the cache.

I1102 21:59:56.545836   20777 eventhandlers.go:227] add event for scheduled pod preemption-race477aa484-aa08-4777-9888-6e55138e504e/preemptor-pod
I1102 21:59:56.545841   20777 eventhandlers.go:207] delete event for unscheduled pod preemption-race477aa484-aa08-4777-9888-6e55138e504e/preemptor-pod

So far, everything should be fine, except that we still have the real preemptor obj left in schedulingQ, as mentioned above. The real preemptor gets popped out and, of course, cannot be scheduled, because the phantom one has already been scheduled and the cache has been updated accordingly:

I1102 21:59:56.546199   20777 factory.go:329] "Unable to schedule pod; no fit; waiting" pod="preemption-race477aa484-aa08-4777-9888-6e55138e504e/preemptor-pod" err="0/1 nodes are available: 1 Insufficient cpu, 1 Insufficient memory."
I1102 21:59:56.546212   20777 scheduler.go:327] [DEBUG] nominatedNode "" for "preemptor-pod"
I1102 21:59:56.546229   20777 scheduler.go:344] Updating pod condition for preemption-race477aa484-aa08-4777-9888-6e55138e504e/preemptor-pod to (PodScheduled==False, Reason=Unschedulable)

As the DEBUG log shows, nominatedNode is empty because preemption doesn't help. But remember that this is the real preemptor: it carries .status.nominatedNodeName from the beginning, so the following logic re-fills nominatedPodsMap (L747):

if len(nnn) == 0 {
    nnn = NominatedNodeName(p)
    if len(nnn) == 0 {
        return
    }
}
npm.nominatedPodToNode[p.UID] = nnn

And the severe part is that deleting all pods doesn't help clear the nominatedPodsMap - deleting the (phantom) preemptor only clears the cache, not the nominatedPodsMap. This eventually triggers the flakes: the nominatedPodsMap carried over from the Nth test iteration blocks the initial pods from being scheduled in the (N+1)th test iteration:

I1102 21:59:58.974307   20777 scheduler.go:448] Attempting to schedule pod: preemption-race477aa484-aa08-4777-9888-6e55138e504e/rpod-1
# It's the start of a new testing iteration, len(nominatedPods) != 0 means there is something carried over from previous test
I1102 21:59:58.974392   20777 generic_scheduler.go:387] [DEBUG] len(nominatedPods) != 0 for pod rpod-1

And the real preemptor from the previous cycle gets scheduled successfully but then fails at binding, since it has now become a "phantom" obj itself:

I1102 22:00:00.536313   20777 scheduling_queue.go:814] About to try and schedule pod preemption-race477aa484-aa08-4777-9888-6e55138e504e/preemptor-pod
I1102 22:00:00.536359   20777 scheduler.go:448] Attempting to schedule pod: preemption-race477aa484-aa08-4777-9888-6e55138e504e/preemptor-pod
I1102 22:00:00.536555   20777 default_binder.go:51] Attempting to bind preemption-race477aa484-aa08-4777-9888-6e55138e504e/preemptor-pod to node1
I1102 22:00:00.539045   20777 httplog.go:89] "HTTP" verb="POST" URI="/api/v1/namespaces/preemption-race477aa484-aa08-4777-9888-6e55138e504e/pods/preemptor-pod/binding" latency="2.091263ms" userAgent="scheduler.test/v0.0.0 (darwin/amd64) kubernetes/$Format" srcIP="127.0.0.1:62697" resp=404
E1102 22:00:00.539369   20777 framework.go:743] "Failed running Bind plugin" err="pods \"preemptor-pod\" not found" plugin="DefaultBinder" pod="preemption-race477aa484-aa08-4777-9888-6e55138e504e/preemptor-pod"
I1102 22:00:00.539421   20777 scheduler.go:421] Failed to bind pod: preemption-race477aa484-aa08-4777-9888-6e55138e504e/preemptor-pod
E1102 22:00:00.539489   20777 factory.go:345] "Error scheduling pod; retrying" err="Binding rejected: running Bind plugin \"DefaultBinder\": pods \"preemptor-pod\" not found" pod="preemption-race477aa484-aa08-4777-9888-6e55138e504e/preemptor-pod"
W1102 22:00:00.539516   20777 factory.go:351] Pod preemption-race477aa484-aa08-4777-9888-6e55138e504e/preemptor-pod doesn't exist in informer cache: pod "preemptor-pod" not found
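For anyone who wants to observe this behavior outside the scheduler, here is a hedged, stand-alone client-go sketch (the kubeconfig loading and the 1-second resync period are assumptions mirroring the integration test setup; this is not scheduler code). It surfaces exactly the kind of identical-ResourceVersion updates captured in the DEBUG line above.

package main

import (
    "fmt"
    "time"

    v1 "k8s.io/api/core/v1"
    "k8s.io/client-go/informers"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/cache"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    client := kubernetes.NewForConfigOrDie(config)

    // A 1-second resync makes the informer replay its cache every second; each
    // replay surfaces as an Update event whose old and new objects are the same
    // cached copy, hence identical ResourceVersions.
    factory := informers.NewSharedInformerFactory(client, 1*time.Second)
    podInformer := factory.Core().V1().Pods().Informer()
    podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
        UpdateFunc: func(oldObj, newObj interface{}) {
            oldPod, newPod := oldObj.(*v1.Pod), newObj.(*v1.Pod)
            if oldPod.ResourceVersion == newPod.ResourceVersion {
                fmt.Printf("resync-style update for %s/%s (rv=%s)\n",
                    newPod.Namespace, newPod.Name, newPod.ResourceVersion)
            }
        },
    })

    stopCh := make(chan struct{})
    defer close(stopCh)
    factory.Start(stopCh)
    factory.WaitForCacheSync(stopCh)
    select {}
}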

@Huang-Wei
Member Author

Should we merge this and deal with any unknown bugs where Pods remain unschedulable, OR should we be conservative, keep things as they are, and deal with the flakes some other way?

I would prefer to block the phantom/duplicate pod from entering the scheduler in the first place, even though we may surface some other bugs (none that I'm aware of so far), because the phantom/duplicate pod is way too dangerous:

  1. Leaving a phantom/duplicate pod in the scheduling queue causes unexpected behavior that is almost impossible to debug
  2. A phantom/duplicate pod wastes scheduling cycles (sometimes forever) and slows down the scheduler
  3. A phantom/duplicate pod keeps updating the pod object; in other words, you may see a scheduled pod with failure conditions.

Letting the phantom/duplicate pod in and fixing things up afterwards would, in my opinion, require much more mental overhead, and hence be more error-prone.

BTW: regarding #82222, setting a 0 resync period is a best practice, but we cannot take it for granted, so this PR is still needed.

@alculquicondor
Member

Thanks for the detailed case analysis.
Regardless of the decision, I don't think we should hold #82222 on this PR.

oldPod := oldObj.(*v1.Pod)
if oldPod.ResourceVersion == newPod.ResourceVersion {
    return
}
Contributor

Should we test the resource version before invoking sched.skipPodUpdate? Then we would save some useless processing.

Member Author

I thought of that as well. My idea was to be conservative. But you're right, moving it earlier will be more efficient.

The key question is: can we get a consensus to compare ResourceVersion and do an early return?

@alculquicondor @ahg-g @damemi @adtac ^^

Member

I don't think doing this check after skipPodUpdate is any more conservative, so I'm fine with doing it earlier.

Does ResourceVersion also change if there is a Status update?

Member

skipPodUpdate zeroes out ResourceVersion when comparing with the in-cache copy. This seems to suggest that resource version may change even if there isn't a real pod update, right?

Member

That's not the intention of that zeroing. It's just that resource version will change when fields we don't care about change.
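To make the zeroing discussion concrete, here is a minimal sketch of that style of comparison. The exact set of blanked fields (ResourceVersion, Spec.NodeName, Annotations) is an assumption for illustration; the authoritative logic lives in the scheduler's skipPodUpdate.

package sketch

import (
    "reflect"

    v1 "k8s.io/api/core/v1"
)

// podsEquivalentForScheduling copies both pods, blanks fields that change
// without affecting scheduling decisions, and deep-compares the rest. The
// field list here is illustrative, not the scheduler's exact set.
func podsEquivalentForScheduling(a, b *v1.Pod) bool {
    strip := func(p *v1.Pod) *v1.Pod {
        c := p.DeepCopy()
        c.ResourceVersion = "" // bumped on every write, carries no scheduling signal
        c.Spec.NodeName = ""   // set by the scheduler itself once the pod is assumed/bound
        c.Annotations = nil    // may be mutated by other components
        return c
    }
    return reflect.DeepEqual(strip(a), strip(b))
}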

Member Author

Does ResourceVersion also change if there is a Status update?

Yes, even with a status update.

Member Author

This seems to suggest that resource version may change even if there isn't a real pod update, right?

Nope, once any field of an obj is updated, its resource version gets changed.

One case where the resource version doesn't change is "resync". The semantics of "resync" is to re-play the items in the informer cache every "resync period", so it happens on the client side, not the server side. Those events each carry identical old and new objects. In some cases, some of them become "phantom" events, just like this issue described.
I hope "resync" is the only such case, so that #82222 would be good enough to handle it. But I'm not 100% sure "resync" is the only culprit, so this PR adds an extra guard.

@soulxu
Contributor

soulxu commented Nov 7, 2020

Apparently, when resync period is set to 0, full resyncs might still happen, at a cadence between 10 to 20 minutes.

I didn't find that a forced resync happens if we set the informer's default resync period to 0. I printed the logs and saw that the resync period for each event-handler listener is also 0 when the informer's default resync period is 0. So the resync shouldn't happen.

if p.resyncPeriod == 0 {
    return false
}

Maybe I missed something.

@Huang-Wei
Member Author

So the resync shouldn't happen.

Thanks @soulxu. That was my recollection from a while ago; the behavior has probably been updated since.

@Huang-Wei
Member Author

cc/ @liggitt https://kubernetes.slack.com/archives/C0EG7JC6T/p1604952531010300 was asked in the context of this issue: in some scheduler integration tests, the resyncPeriod was set to 1 second, and it caused weird flakes (for a detailed analysis, see #96071 (comment)). This PR applies an additional check on the resource version upon Update(), acting as an extra guard in addition to #82222 (which sets all resyncPeriods to 0).

@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Nov 9, 2020
@ahg-g
Member

ahg-g commented Nov 9, 2020

Thanks @Huang-Wei for the detailed investigation! Skipping updates that don't change the resourceVersion makes sense to me; ideally we shouldn't be getting those updates at all, right?

@Huang-Wei
Member Author

ideally we shouldn't be getting those updates at all, right?

Yes, that's my understanding. Letting it sneak in will really cause unexpected behavior.

pod := newObj.(*v1.Pod)
if sched.skipPodUpdate(pod) {
oldPod, newPod := oldObj.(*v1.Pod), newObj.(*v1.Pod)
if oldPod.ResourceVersion == newPod.ResourceVersion {
Member

adding a comment here explaining when this could happen would be great.

Member Author

updated.

@ahg-g
Member

ahg-g commented Nov 10, 2020

ideally we shouldn't be getting those updates at all, right?

Yes, that's my understanding. Letting it sneak in will really cause unexpected behavior.

looks good to me, documenting why we are doing that would be great.

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Nov 10, 2020
@ahg-g
Member

ahg-g commented Nov 10, 2020

/lgtm
/hold

hold in case others are still looking at the PR

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 10, 2020
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 10, 2020
@alculquicondor
Member

No concerns from me

@Huang-Wei
Member Author

Thanks all for the review!

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 10, 2020
@k8s-ci-robot k8s-ci-robot merged commit 2b8f43b into kubernetes:master Nov 10, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.20 milestone Nov 10, 2020
@Huang-Wei Huang-Wei deleted the phantom-sched-update-evt branch November 10, 2020 20:25