Changed scheduler to use patch when updating pod status #90978
Conversation
Force-pushed from bb43786 to fadee47
/assign @Huang-Wei
Thanks @brianpursley ! Just some nits, LGTM otherwise.
/priority critical-urgent
Force-pushed from cc3a95a to a3dff6f
I'm quite confused here.
A race condition occurs because the medium priority pod is still attempting to schedule at the same time when removeNominatedNodeName() is being called to clear its nominated node value.
Is "still attempting to schedule" referring to binding? Otherwise there shouldn't be 2 pods being scheduled at the same time.
Oftentimes this test succeeds, but if the timing is unlucky, then the medium-priority pod will never have its nominated node name removed because it didn't retry on conflict,
What's the conflict exactly? I would say that, if there is a conflict, the whole scheduling attempt should be retried. Perhaps this is not a non-critical error:
kubernetes/pkg/scheduler/scheduler.go, lines 452 to 456 in a151682:

	rErr := sched.podPreemptor.removeNominatedNodeName(p)
	if rErr != nil {
		klog.Errorf("Cannot remove 'NominatedNodeName' field of pod: %v", rErr)
		// We do not return as this error is not critical.
	}
And if you disagree with my statement above, let's use informer's cache as #90660
pkg/scheduler/scheduler.go (outdated)

	@@ -32,6 +32,7 @@ import (
	 	coreinformers "k8s.io/client-go/informers/core/v1"
	 	clientset "k8s.io/client-go/kubernetes"
	 	"k8s.io/client-go/tools/cache"
	+	clientretry "k8s.io/client-go/util/retry"
nit: no need for the alias
I'm quite confused here.
A race condition occurs because the medium priority pod is still attempting to schedule at the same time when removeNominatedNodeName() is being called to clear its nominated node value.
Is "still attempting to schedule" referring to binding? Otherwise there shouldn't be 2 pods being scheduled at the same time.
Yes, sorry. I think the goroutine from scheduleOne that binds the pod is still running asynchronously when the high-priority pod preempts the medium-priority pod.
In scheduler.preempt(), this line gets the nominated pods to clear:
node, victims, nominatedPodsToClear, err := sched.Algorithm.Preempt(ctx, prof, state, preemptor, scheduleErr)
but it is possible (although unlikely, which is why it shows up as a test flake) that nominatedPodsToClear gets out of date before it is updated later by the call to sched.podPreemptor.removeNominatedNodeName(p).
The integration test logs I looked at show a response of 409 coming back when it tries to update the pod status, so there is definitely something causing a conflict. Retrieving the latest version and retrying the update succeeds.
Whether there is another way to solve this, I'm not sure. At the very least, I was thinking that if a 409 comes back when the scheduler is trying to update a pod, it should retry. Currently, if any failure occurs during removeNominatedNodeName, it only logs the error and returns successfully. That is actually what causes the timeout in the integration test: the test doesn't know the removal failed and waits for the nominated node name to be cleared from the medium-priority pod, which at that point will never happen.
Oftentimes this test succeeds, but if the timing is unlucky, then the medium-priority pod will never have its nominated node name removed because it didn't retry on conflict,
What's the conflict exactly? I would say that, if there is a conflict, the whole scheduling attempt should be retried. Perhaps this is not a non-critical error:
kubernetes/pkg/scheduler/scheduler.go, lines 452 to 456 in a151682:

	rErr := sched.podPreemptor.removeNominatedNodeName(p)
	if rErr != nil {
		klog.Errorf("Cannot remove 'NominatedNodeName' field of pod: %v", rErr)
		// We do not return as this error is not critical.
	}

And if you disagree with my statement above, let's use informer's cache as in #90660.
I don't know whether this is a critical error or not. I think it really only happens when there are two successive preemptions on the same node within a very short period of time. For practical purposes this is probably an edge case, but I suppose a system under heavy load could experience it.
Regarding retrying the whole scheduling attempt: the problem I see with that is the conflict occurs while it is trying to remove the nominated node name from another pod, after the preemption has already occurred, so it might be hard to retry the whole thing.
I'm definitely new to the scheduler, so please take what I'm saying with a grain of salt; I'll defer to those more familiar with it to let me know if I'm wrong about something.
Thanks, I really do appreciate the comments and questions.
One more thing, regarding the // We do not return as this error is not critical. comment: read the comment above it for the assumption being made:
	// Clearing nominated pods should happen outside of "if node != nil". Node could
	// be nil when a pod with nominated node name is eligible to preempt again,
	// but preemption logic does not find any node for it. In that case Preempt()
	// function of generic_scheduler.go returns the pod itself for removal of
	// the 'NominatedNodeName' field.
	for _, p := range nominatedPodsToClear {
		rErr := sched.podPreemptor.removeNominatedNodeName(p)
		if rErr != nil {
			klog.Errorf("Cannot remove 'NominatedNodeName' field of pod: %v", rErr)
			// We do not return as this error is not critical.
		}
	}
It assumes this happens because the node is nil, but rErr is not checked, and in this case it is not because the node is nil; a conflict occurs instead, which I believe is a different problem that should not be ignored.
Actually, I find that whole comment confusing, but I still think this is an error that should not be ignored.
nit: no need for the alias
Fixed, thanks
/hold
cc @ahg-g for more thoughts
@alculquicondor I agree the wording needs to be more precise. The conflict is caused like this:
Force-pushed from a3dff6f to 530c589
So these lines update the internal cache, but there would still be a resource version change once the informer update comes in, correct?
kubernetes/pkg/scheduler/scheduler.go, lines 419 to 425 in a151682
Apropos, maybe we should undo the change in the cache if that second line fails...
If there are retries, it should be. There are a few options:
I prefer either 1 or 3.
Just this line: sched.SchedulingQueue.UpdateNominatedPodForNode(preemptor, nodeName)
Yes. However, it's not guaranteed the change would happen before processing the high-priority Pod. It can be: the live version on the etcd side has a higher resourceVersion, but on the scheduler side the update event hasn't come in yet, or hasn't been processed yet.
Force-pushed from 17e2981 to a1ee014
/retest
This pull-kubernetes-e2e-kind-ipv6 seems to have a lot of issues... 🤕
/approve
I will leave /lgtm to @ahg-g and @alculquicondor.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: brianpursley, Huang-Wei
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/lgtm
/retest
sigh... Please rebase
Force-pushed from a1ee014 to 54e67ee
Force-pushed from 54e67ee to 9eb8e7a
Rebased
/retest
/lgtm
/retest |
/retest |
What type of PR is this?
/kind flake

What this PR does / why we need it:
Flake occurs in TestNominatedNodeCleanUp when a medium-priority pod preempts low-priority pods (with long grace periods) and then a high-priority pod preempts the medium-priority pod before it can be scheduled/bound.
An update conflict can sometimes occur when the high-priority pod preempts the medium-priority pod and tries to clear the nominated node from the status of the medium-priority pod.
This conflict was logged but did not result in an error, so the scheduling of the high-priority pod still succeeded, but the integration test would fail because the medium-priority pod's nominated node would never be cleared as expected.

Which issue(s) this PR fixes:
Fixes #89259
Fixes #89728
Fixes #90627
This PR changes the scheduler to use Patch() instead of UpdateStatus() to update the pod status, using a merge patch in order to avoid a conflict.

Special notes for your reviewer:
Does this PR introduce a user-facing change?:
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: