
Changed scheduler to use patch when updating pod status #90978

Merged
merged 1 commit into kubernetes:master from kubernetes-89259 on May 15, 2020

Conversation

@brianpursley (Member) commented May 11, 2020

What type of PR is this?
/kind flake

Which issue(s) this PR fixes:
Fixes #89259
Fixes #89728
Fixes #90627

What this PR does / why we need it:
Flake occurs in TestNominatedNodeCleanUp when a medium-priority pod preempts low-priority pods (with long grace periods) and then a high-priority pod preempts the medium-priority pod before it can be scheduled/bound.

An update conflict can sometimes occur when the high-priority pod preempts the medium-priority pod and tries to clear the nominated node from the medium-priority pod's status.

This conflict was logged but did not result in an error, so scheduling of the high-priority pod still succeeded, but the integration test would fail because the medium-priority pod's nominated node would never be cleared as expected.

This PR changes the scheduler to use Patch() instead of UpdateStatus(), applying a merge patch to the pod status in order to avoid the conflict.
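For illustration, here is a minimal sketch of the patch-based idea (not the exact code in this PR), assuming a client-go clientset cs and a hypothetical helper name; because a merge patch carries no resourceVersion, it cannot be rejected with the 409 Conflict that a full UpdateStatus of a stale object can hit:

package example

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// clearNominatedNodeName (hypothetical helper) sends a JSON merge patch to the
// pod's status subresource; setting nominatedNodeName to null removes the field
// without any resourceVersion comparison.
func clearNominatedNodeName(ctx context.Context, cs kubernetes.Interface, pod *v1.Pod) error {
	patch := []byte(`{"status":{"nominatedNodeName":null}}`)
	_, err := cs.CoreV1().Pods(pod.Namespace).Patch(
		ctx, pod.Name, types.MergePatchType, patch, metav1.PatchOptions{}, "status")
	return err
}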

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/flake Categorizes issue or PR as related to a flaky test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/test sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 11, 2020
@brianpursley brianpursley force-pushed the kubernetes-89259 branch 2 times, most recently from bb43786 to fadee47, May 11, 2020 14:02
@brianpursley (Member, Author)

/assign @Huang-Wei

@Huang-Wei (Member) left a comment

Thanks @brianpursley ! Just some nits, LGTM otherwise.

pkg/scheduler/scheduler.go (review comments, outdated/resolved)
pkg/scheduler/scheduler.go (review comments, outdated/resolved)
@alculquicondor (Member)

/priority critical-urgent

@k8s-ci-robot k8s-ci-robot added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels May 11, 2020
@brianpursley brianpursley force-pushed the kubernetes-89259 branch 2 times, most recently from cc3a95a to a3dff6f, May 11, 2020 17:03
@alculquicondor (Member) left a comment

I'm quite confused here.

A race condition occurs because the medium-priority pod is still attempting to schedule at the same time that removeNominatedNodeName() is being called to clear its nominated node value.

Is "still attempting to schedule" referring to binding? Otherwise there shouldn't be 2 pods being scheduled at the same time.

Oftentimes this test succeeds, but if the timing is unlucky, then the medium-priority pod will never have its nominated node name removed because it didn't retry on conflict,

What's the conflict exactly? I would say that, if there is a conflict, the whole scheduling attempt should be retried. Perhaps this is not a non-critical error:

rErr := sched.podPreemptor.removeNominatedNodeName(p)
if rErr != nil {
	klog.Errorf("Cannot remove 'NominatedNodeName' field of pod: %v", rErr)
	// We do not return as this error is not critical.
}

And if you disagree with my statement above, let's use the informer's cache, as in #90660

@@ -32,6 +32,7 @@ import (
 	coreinformers "k8s.io/client-go/informers/core/v1"
 	clientset "k8s.io/client-go/kubernetes"
 	"k8s.io/client-go/tools/cache"
+	clientretry "k8s.io/client-go/util/retry"
Member

nit: no need for the alias

@brianpursley (Member, Author)

I'm quite confused here.

A race condition occurs because the medium-priority pod is still attempting to schedule at the same time that removeNominatedNodeName() is being called to clear its nominated node value.

Is "still attempting to schedule" referring to binding? Otherwise there shouldn't be 2 pods being scheduled at the same time.

Yes, sorry. I think the goroutine from scheduleOne that binds the pod is still running asynchronously when the high-priority pod preempts the medium-priority pod.

In scheduler.preempt(), this line gets the nominated pods to clear:

node, victims, nominatedPodsToClear, err := sched.Algorithm.Preempt(ctx, prof, state, preemptor, scheduleErr)

but it is possible (although unlikely, which is why it shows up as a test flake) that nominatedPodsToClear gets out of date before it is updated later by the call to sched.podPreemptor.removeNominatedNodeName(p).

The integration test logs I looked at show a 409 response coming back when it tries to update the pod status, so there is definitely something causing a conflict. Retrieving the latest version and retrying the update succeeds.

Whether there is another way to solve this, I'm not sure. I was thinking that, at the very least, if a 409 comes back when the scheduler is trying to update a pod, it should retry. Currently, if any failure occurs during removeNominatedNodeName, it only logs the error and continues to return successfully. That is actually what causes the timeout in the integration test: it doesn't know that the update failed and keeps waiting for the nominated node name to be cleared from the medium-priority pod, which will never happen at that point.
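For context, a minimal sketch of the retry-on-conflict alternative being discussed here, assuming a client-go clientset cs and a hypothetical helper name (this is not the approach the PR ultimately takes):

package example

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// removeNominatedNodeNameWithRetry (hypothetical helper) re-fetches the latest
// pod and retries the status update whenever the API server answers 409 Conflict.
func removeNominatedNodeNameWithRetry(ctx context.Context, cs kubernetes.Interface, pod *v1.Pod) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		latest, err := cs.CoreV1().Pods(pod.Namespace).Get(ctx, pod.Name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		if latest.Status.NominatedNodeName == "" {
			return nil // already cleared, nothing to do
		}
		latest.Status.NominatedNodeName = ""
		_, err = cs.CoreV1().Pods(latest.Namespace).UpdateStatus(ctx, latest, metav1.UpdateOptions{})
		return err
	})
}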

Oftentimes this test succeeds, but if the timing is unlucky, then the medium-priority pod will never have its nominated node name removed because it didn't retry on conflict,

What's the conflict exactly? I would say that, if there is a conflict, the whole scheduling attempt should be retried. Perhaps this is not a non-critical error:

rErr := sched.podPreemptor.removeNominatedNodeName(p)
if rErr != nil {
	klog.Errorf("Cannot remove 'NominatedNodeName' field of pod: %v", rErr)
	// We do not return as this error is not critical.
}

And if you disagree with my statement above, let's use the informer's cache, as in #90660

I don't know if this is a critical error or not. I think it really only happens when you have two successive preemptions on the same node within a very short period of time. For practical purposes, this is probably an edge case, but I guess a system under heavy load could experience this.

Regarding retrying the whole scheduling attempt: the problem I see with that is that the conflict occurs while it is trying to remove the nominated node name from another pod, and the preemption has already occurred by then, so it might be hard to retry the whole thing.

I'm definitely new to the scheduler, so please take what I'm saying with a grain of salt and I will defer to those more familiar with it to let me know if I am wrong about something.

Thanks, I really do appreciate the comments and questions.

@brianpursley (Member, Author), May 11, 2020

One more thing, regarding the "// We do not return as this error is not critical." comment: read the comment above it for the assumption being made:

	// Clearing nominated pods should happen outside of "if node != nil". Node could
	// be nil when a pod with nominated node name is eligible to preempt again,
	// but preemption logic does not find any node for it. In that case Preempt()
	// function of generic_scheduler.go returns the pod itself for removal of
	// the 'NominatedNodeName' field.
	for _, p := range nominatedPodsToClear {
		rErr := sched.podPreemptor.removeNominatedNodeName(p)
		if rErr != nil {
			klog.Errorf("Cannot remove 'NominatedNodeName' field of pod: %v", rErr)
			// We do not return as this error is not critical.
		}
	}

The comment assumes this error happens because the node is nil, but rErr is not actually inspected; in this case the failure is not because the node is nil but because a conflict occurred, which I believe is a different problem that should not be ignored.

Actually, I find that whole comment confusing, but I still think this is an error that should not be ignored.

@brianpursley (Member, Author)

nit: no need for the alias

Fixed, thanks

@alculquicondor (Member)

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 11, 2020
@alculquicondor (Member)

cc @ahg-g for more thoughts

@Huang-Wei (Member) commented May 11, 2020

@alculquicondor I agree the wording needs to be more precise.

The conflict is caused like this:

  • The medium-priority Pod is being scheduled and is able to preempt the low-priority Pods; its nominatedNodeName field then gets set and its scheduling cycle ends. Note that although we have:

    sched.SchedulingQueue.UpdateNominatedPodForNode(preemptor, nodeName)

    to update the in-cache version of the medium-priority Pod so that its nominatedNodeName is set, its resourceVersion remains at the (now potentially stale) version that went through the current preemption cycle.

  • If the Update event for the medium-priority Pod comes first, everything is fine: the in-cache object is updated to the latest version.

  • However, if the high-priority Pod comes first, it's going to preempt the medium-priority Pod. In that case, it will collect the nominatedPodsToClear from the cache and then send API requests to remove their nominatedNodeName field. Unfortunately, the nominatedPodsToClear are fetched from the cache:

    func (g *genericScheduler) getLowerPriorityNominatedPods(pod *v1.Pod, nodeName string) []*v1.Pod {
        pods := g.schedulingQueue.NominatedPodsForNode(nodeName)

    As mentioned above, their resourceVersion is stale, so clearing the nominatedNodeName field can result in an API conflict error. In other words, although rare, fetching objects from informers may not be sufficient. (A sketch of this failure mode follows below.)
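For illustration only, a hypothetical sketch of that failure mode (assuming a client-go clientset cs; this is not scheduler code), showing how a full status update built from the stale cached Pod gets rejected:

package example

import (
	"context"

	v1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// tryClearWithUpdateStatus (hypothetical) clears nominatedNodeName with a full
// status update built from the scheduler's cached copy of the pod, whose
// resourceVersion may already be stale.
func tryClearWithUpdateStatus(ctx context.Context, cs kubernetes.Interface, cached *v1.Pod) (conflicted bool, err error) {
	stale := cached.DeepCopy() // carries the cached (possibly stale) resourceVersion
	stale.Status.NominatedNodeName = ""
	_, err = cs.CoreV1().Pods(stale.Namespace).UpdateStatus(ctx, stale, metav1.UpdateOptions{})
	// If the live pod already has a newer resourceVersion (e.g. from the earlier
	// setNominatedNodeName write), the API server rejects this update with a 409.
	return apierrors.IsConflict(err), err
}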

@alculquicondor (Member)

If the Update event for the medium-priority Pod comes first, everything is fine: the in-cache object is updated to the latest version.

So these lines update the internal cache, but there would still be a resource version change once the informer update comes in, correct?

// Update the scheduling queue with the nominated pod information. Without
// this, there would be a race condition between the next scheduling cycle
// and the time the scheduler receives a Pod Update for the nominated pod.
sched.SchedulingQueue.UpdateNominatedPodForNode(preemptor, nodeName)
// Make a call to update nominated node name of the pod on the API server.
err = sched.podPreemptor.setNominatedNodeName(preemptor, nodeName)

Apropos, maybe we should undo the change in the cache if that second line fails...

fetching objects from informers may not be sufficient.

If there are retries, it should be.

There are a few options:

  1. We consider this kind of error non-critical, i.e., it's OK for nominatedNodeName to be incorrect. In this case, we should fix the test to not rely on it. One option is to just make sure the nominatedNodeName is cleared before we create the high-priority pod.
  2. The error is somewhat critical, thus we retry clearing nominatedNodeName. I think we can still use the informers cache, but we shouldn't retry indefinitely. If errors persist, we should just error the cycle.
  3. We consider this critical on the first failure (something has changed in the cluster, and we are assuming it's just the resource version, but it could be other changes coming from the user). Again, we retry the whole cycle. However, since API changes are non-transactional, there might be other pods which already got their nominatedNodeName cleared. But I think this is fine.

I prefer either 1 or 3.

@Huang-Wei (Member)

So these lines update the internal cache

Just this line:

sched.SchedulingQueue.UpdateNominatedPodForNode(preemptor, nodeName)

but there would still be a resource version change once the informer update comes in, correct?

Yes. However, it's not guaranteed that the change happens before the high-priority Pod is processed. It can be that the live version on the etcd side has a higher resourceVersion, but on the scheduler side the update event hasn't come in yet, or hasn't been processed yet.

@brianpursley (Member, Author)

/retest

@brianpursley (Member, Author)

This pull-kubernetes-e2e-kind-ipv6 job seems to have a lot of issues... 🤕
/retest

@Huang-Wei (Member)

/approve
Thanks @brianpursley for your patience and the effort of trying out different options.
(BTW: I did some exercise on update/patch here: https://github.com/Huang-Wei/tryout-kube-strategic-merge-patch/blob/master/main.go. Feel free to help make it complete.)

I will leave /lgtm to @ahg-g and @alculquicondor.

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: brianpursley, Huang-Wei

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 14, 2020
@alculquicondor (Member) left a comment

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 14, 2020
@brianpursley (Member, Author)

/retest

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 14, 2020
@alculquicondor (Member)

sigh... Please rebase

@k8s-ci-robot k8s-ci-robot removed lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels May 14, 2020
@brianpursley (Member, Author)

Rebased

@alculquicondor (Member)

/retest

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 14, 2020
@Huang-Wei (Member)

/retest

@Huang-Wei (Member)

/retest

@Huang-Wei (Member)

/retest
/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 15, 2020
@k8s-ci-robot k8s-ci-robot merged commit 78abe8b into kubernetes:master May 15, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.19 milestone May 15, 2020
@brianpursley brianpursley deleted the kubernetes-89259 branch February 2, 2023 01:20
Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
area/test
cncf-cla: yes: Indicates the PR's author has signed the CNCF CLA.
kind/flake: Categorizes issue or PR as related to a flaky test.
lgtm: "Looks good to me", indicates that a PR is ready to be merged.
priority/critical-urgent: Highest priority. Must be actively worked on as someone's top priority right now.
release-note-none: Denotes a PR that doesn't merit a release note.
sig/scheduling: Categorizes an issue or PR as relevant to SIG Scheduling.
sig/testing: Categorizes an issue or PR as relevant to SIG Testing.
size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

TestNominatedNodeCleanUp flakes
[Flaky Test] TestNominatedNodeCleanUp
TestNominatedNodeCleanUp is flaky
5 participants