
Changed scheduler to use patch when updating pod status #90978

Merged
merged 1 commit into kubernetes:master from kubernetes-89259 on May 15, 2020

Conversation

@brianpursley (Member) commented May 11, 2020

What type of PR is this?
/kind flake

Which issue(s) this PR fixes:
Fixes #89259
Fixes #89728
Fixes #90627

What this PR does / why we need it:
Flake occurs in TestNominatedNodeCleanUp when a medium-priority pod preempts low-priority pods (with long grace periods) and then a high-priority pod preempts the medium-priority pod before it can be scheduled/bound.

An update conflict can sometimes occur when the high-priority pod preempts the medium-priority pod and tries to clear the nominated node from the medium-priority pod's status.

This conflict was logged but did not result in an error, so scheduling of the high-priority pod still succeeded, but the integration test would fail because the medium-priority pod's nominated node would never be cleared as expected.

This PR changes the scheduler to use Patch() instead of UpdateStatus(), applying a merge patch to the pod status in order to avoid the conflict.
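For illustration, here is a minimal sketch of the patch-based idea (not the exact code in this PR), assuming a client-go clientset cs and a hypothetical helper name; because a merge patch carries no resourceVersion, it cannot be rejected with the 409 Conflict that a full UpdateStatus of a stale object can hit:

package example

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// clearNominatedNodeName (hypothetical helper) sends a JSON merge patch to the
// pod's status subresource; setting nominatedNodeName to null removes the field
// without any resourceVersion comparison.
func clearNominatedNodeName(ctx context.Context, cs kubernetes.Interface, pod *v1.Pod) error {
	patch := []byte(`{"status":{"nominatedNodeName":null}}`)
	_, err := cs.CoreV1().Pods(pod.Namespace).Patch(
		ctx, pod.Name, types.MergePatchType, patch, metav1.PatchOptions{}, "status")
	return err
}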

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/flake Categorizes issue or PR as related to a flaky test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/test sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 11, 2020
@brianpursley brianpursley force-pushed the kubernetes-89259 branch 2 times, most recently from bb43786 to fadee47, May 11, 2020 14:02
@brianpursley (Member, Author)

/assign @Huang-Wei

@Huang-Wei (Member) left a comment

Thanks @brianpursley ! Just some nits, LGTM otherwise.

pkg/scheduler/scheduler.go (review comments, outdated/resolved)
pkg/scheduler/scheduler.go (review comments, outdated/resolved)
@alculquicondor (Member)

/priority critical-urgent

@k8s-ci-robot k8s-ci-robot added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels May 11, 2020
@brianpursley brianpursley force-pushed the kubernetes-89259 branch 2 times, most recently from cc3a95a to a3dff6f, May 11, 2020 17:03
@alculquicondor (Member) left a comment

I'm quite confused here.

A race condition occurs because the medium-priority pod is still attempting to schedule at the same time that removeNominatedNodeName() is being called to clear its nominated node value.

Is "still attempting to schedule" referring to binding? Otherwise there shouldn't be 2 pods being scheduled at the same time.

Oftentimes this test succeeds, but if the timing is unlucky, then the medium-priority pod will never have its nominated node name removed because it didn't retry on conflict,

What's the conflict exactly? I would say that, if there is a conflict, the whole scheduling attempt should be retried. Perhaps this is not a non-critical error:

rErr := sched.podPreemptor.removeNominatedNodeName(p)
if rErr != nil {
	klog.Errorf("Cannot remove 'NominatedNodeName' field of pod: %v", rErr)
	// We do not return as this error is not critical.
}

And if you disagree with my statement above, let's use the informer's cache, as in #90660

@@ -32,6 +32,7 @@ import (
 	coreinformers "k8s.io/client-go/informers/core/v1"
 	clientset "k8s.io/client-go/kubernetes"
 	"k8s.io/client-go/tools/cache"
+	clientretry "k8s.io/client-go/util/retry"
Member

nit: no need for the alias

@brianpursley (Member, Author)

I'm quite confused here.

A race condition occurs because the medium-priority pod is still attempting to schedule at the same time that removeNominatedNodeName() is being called to clear its nominated node value.

Is "still attempting to schedule" referring to binding? Otherwise there shouldn't be 2 pods being scheduled at the same time.

Yes, sorry. I think the goroutine from scheduleOne that binds the pod is still running asynchronously when the high-priority pod preempts the medium-priority pod.

In scheduler.preempt(), this line gets the nominated pods to clear:

node, victims, nominatedPodsToClear, err := sched.Algorithm.Preempt(ctx, prof, state, preemptor, scheduleErr)

but it is possible (although unlikely, which is why it shows up as a test flake) that nominatedPodsToClear gets out of date before it is updated later by the call to sched.podPreemptor.removeNominatedNodeName(p).

The integration test logs I looked at show a 409 response coming back when it tries to update the pod status, so there is definitely something causing a conflict. Retrieving the latest version and retrying the update succeeds.

Whether there is another way to solve this, I'm not sure. I was thinking that, at the very least, if a 409 comes back when the scheduler is trying to update a pod, it should retry. Currently, if any failure occurs during removeNominatedNodeName, it only logs the error and continues to return successfully. That is actually what causes the timeout in the integration test: it doesn't know that the update failed and keeps waiting for the nominated node name to be cleared from the medium-priority pod, which will never happen at that point.
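For context, a minimal sketch of the retry-on-conflict alternative being discussed here, assuming a client-go clientset cs and a hypothetical helper name (this is not the approach the PR ultimately takes):

package example

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// removeNominatedNodeNameWithRetry (hypothetical helper) re-fetches the latest
// pod and retries the status update whenever the API server answers 409 Conflict.
func removeNominatedNodeNameWithRetry(ctx context.Context, cs kubernetes.Interface, pod *v1.Pod) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		latest, err := cs.CoreV1().Pods(pod.Namespace).Get(ctx, pod.Name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		if latest.Status.NominatedNodeName == "" {
			return nil // already cleared, nothing to do
		}
		latest.Status.NominatedNodeName = ""
		_, err = cs.CoreV1().Pods(latest.Namespace).UpdateStatus(ctx, latest, metav1.UpdateOptions{})
		return err
	})
}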

Oftentimes this test succeeds, but if the timing is unlucky, then the medium-priority pod will never have its nominated node name removed because it didn't retry on conflict,

What's the conflict exactly? I would say that, if there is a conflict, the whole scheduling attempt should be retried. Perhaps this is not a non-critical error:

rErr := sched.podPreemptor.removeNominatedNodeName(p)
if rErr != nil {
	klog.Errorf("Cannot remove 'NominatedNodeName' field of pod: %v", rErr)
	// We do not return as this error is not critical.
}

And if you disagree with my statement above, let's use the informer's cache, as in #90660

I don't know if this is a critical error or not. I think it really only happens when you have two successive preemptions on the same node within a very short period of time. For practical purposes, this is probably an edge case, but I guess a system under heavy load could experience this.

Regarding retrying the whole scheduling attempt: the problem I see with that is that the conflict occurs while it is trying to remove the nominated node name from another pod, and the preemption has already occurred by then, so it might be hard to retry the whole thing.

I'm definitely new to the scheduler, so please take what I'm saying with a grain of salt and I will defer to those more familiar with it to let me know if I am wrong about something.

Thanks, I really do appreciate the comments and questions.

@brianpursley (Member, Author), May 11, 2020

One more thing, regarding the "// We do not return as this error is not critical." comment: read the comment above it for the assumption being made:

	// Clearing nominated pods should happen outside of "if node != nil". Node could
	// be nil when a pod with nominated node name is eligible to preempt again,
	// but preemption logic does not find any node for it. In that case Preempt()
	// function of generic_scheduler.go returns the pod itself for removal of
	// the 'NominatedNodeName' field.
	for _, p := range nominatedPodsToClear {
		rErr := sched.podPreemptor.removeNominatedNodeName(p)
		if rErr != nil {
			klog.Errorf("Cannot remove 'NominatedNodeName' field of pod: %v", rErr)
			// We do not return as this error is not critical.
		}
	}

The comment assumes this error happens because the node is nil, but rErr is not actually inspected; in this case the failure is not because the node is nil but because a conflict occurred, which I believe is a different problem that should not be ignored.

Actually, I find that whole comment confusing, but I still think this is an error that should not be ignored.

@brianpursley (Member, Author)

nit: no need for the alias

Fixed, thanks

@alculquicondor (Member)

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 11, 2020
@alculquicondor (Member)

cc @ahg-g for more thoughts

@Huang-Wei (Member) commented May 11, 2020

@alculquicondor I agree the wording needs to be more precise.

The conflict is caused like this:

  • The medium-priority Pod is being scheduled and is able to preempt the low-priority Pods; its nominatedNodeName field then gets set and its scheduling cycle ends. Note that although we have:

    sched.SchedulingQueue.UpdateNominatedPodForNode(preemptor, nodeName)

    to update the in-cache version of the medium-priority Pod so that its nominatedNodeName is set, its resourceVersion remains at the (now potentially stale) version that went through the current preemption cycle.

  • If the Update event for the medium-priority Pod comes first, everything is fine: the in-cache object is updated to the latest version.

  • However, if the high-priority Pod comes first, it's going to preempt the medium-priority Pod. In that case, it will collect the nominatedPodsToClear from the cache and then send API requests to remove their nominatedNodeName field. Unfortunately, the nominatedPodsToClear are fetched from the cache:

    func (g *genericScheduler) getLowerPriorityNominatedPods(pod *v1.Pod, nodeName string) []*v1.Pod {
        pods := g.schedulingQueue.NominatedPodsForNode(nodeName)

    As mentioned above, their resourceVersion is stale, so clearing the nominatedNodeName field can result in an API conflict error. In other words, although rare, fetching objects from informers may not be sufficient. (A sketch of this failure mode follows below.)
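For illustration only, a hypothetical sketch of that failure mode (assuming a client-go clientset cs; this is not scheduler code), showing how a full status update built from the stale cached Pod gets rejected:

package example

import (
	"context"

	v1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// tryClearWithUpdateStatus (hypothetical) clears nominatedNodeName with a full
// status update built from the scheduler's cached copy of the pod, whose
// resourceVersion may already be stale.
func tryClearWithUpdateStatus(ctx context.Context, cs kubernetes.Interface, cached *v1.Pod) (conflicted bool, err error) {
	stale := cached.DeepCopy() // carries the cached (possibly stale) resourceVersion
	stale.Status.NominatedNodeName = ""
	_, err = cs.CoreV1().Pods(stale.Namespace).UpdateStatus(ctx, stale, metav1.UpdateOptions{})
	// If the live pod already has a newer resourceVersion (e.g. from the earlier
	// setNominatedNodeName write), the API server rejects this update with a 409.
	return apierrors.IsConflict(err), err
}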

@alculquicondor (Member)

If the Update event for the medium-priority Pod comes first, everything is fine: the in-cache object is updated to the latest version.

So these lines update the internal cache, but there would still be a resource version change once the informer update comes in, correct?

// Update the scheduling queue with the nominated pod information. Without
// this, there would be a race condition between the next scheduling cycle
// and the time the scheduler receives a Pod Update for the nominated pod.
sched.SchedulingQueue.UpdateNominatedPodForNode(preemptor, nodeName)
// Make a call to update nominated node name of the pod on the API server.
err = sched.podPreemptor.setNominatedNodeName(preemptor, nodeName)

Apropos, maybe we should undo the change in the cache if that second line fails...

fetching objects from informers may not be sufficient.

If there are retries, it should be.

There are a few options:

  1. We consider this kind of error non-critical, i.e., it's OK for nominatedNodeName to be incorrect. In this case, we should fix the test to not rely on it. One option is to just make sure the nominatedNodeName is cleared before we create the high-priority pod.
  2. The error is somewhat critical, thus we retry clearing nominatedNodeName. I think we can still use the informers cache, but we shouldn't retry indefinitely. If errors persist, we should just error the cycle.
  3. We consider this critical on the first failure (something has changed in the cluster, and we are assuming it's just the resource version, but it could be other changes coming from the user). Again, we retry the whole cycle. However, since API changes are non-transactional, there might be other pods which already got their nominatedNodeName cleared. But I think this is fine.

I prefer either 1 or 3.

@Huang-Wei (Member)

So these lines update the internal cache

Just this line:

sched.SchedulingQueue.UpdateNominatedPodForNode(preemptor, nodeName)

but there would still be a resource version change once the informer update comes in, correct?

Yes. However, it's not guaranteed that the change happens before the high-priority Pod is processed. It can be that the live version on the etcd side has a higher resourceVersion, but on the scheduler side the update event hasn't come in yet, or hasn't been processed yet.

@brianpursley (Member, Author)

/retest

@brianpursley (Member, Author)

This pull-kubernetes-e2e-kind-ipv6 job seems to have a lot of issues... 🤕
/retest

@Huang-Wei (Member)

/approve
Thanks @brianpursley for your patience and the effort of trying out different options.
(BTW: I did some exercise on update/patch here: https://github.com/Huang-Wei/tryout-kube-strategic-merge-patch/blob/master/main.go. Feel free to help make it complete.)

I will leave /lgtm to @ahg-g and @alculquicondor.

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: brianpursley, Huang-Wei

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 14, 2020
@alculquicondor (Member) left a comment

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 14, 2020
@brianpursley (Member, Author)

/retest

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 14, 2020
@alculquicondor (Member)

sigh... Please rebase

@k8s-ci-robot k8s-ci-robot removed lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels May 14, 2020
@brianpursley (Member, Author)

Rebased

@alculquicondor (Member)

/retest

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 14, 2020
@Huang-Wei (Member)

/retest

@Huang-Wei (Member)

/retest

@Huang-Wei (Member)

/retest
/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 15, 2020
@k8s-ci-robot k8s-ci-robot merged commit 78abe8b into kubernetes:master May 15, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.19 milestone May 15, 2020
@brianpursley brianpursley deleted the kubernetes-89259 branch February 2, 2023 01:20
Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
area/test
cncf-cla: yes: Indicates the PR's author has signed the CNCF CLA.
kind/flake: Categorizes issue or PR as related to a flaky test.
lgtm: "Looks good to me", indicates that a PR is ready to be merged.
priority/critical-urgent: Highest priority. Must be actively worked on as someone's top priority right now.
release-note-none: Denotes a PR that doesn't merit a release note.
sig/scheduling: Categorizes an issue or PR as relevant to SIG Scheduling.
sig/testing: Categorizes an issue or PR as relevant to SIG Testing.
size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

TestNominatedNodeCleanUp flakes
[Flaky Test] TestNominatedNodeCleanUp
TestNominatedNodeCleanUp is flaky
5 participants