Clear pod binding cache #71212
Conversation
/assign @msau42 @lichuqiang
@@ -399,6 +399,9 @@ func (sched *Scheduler) bindVolumes(assumed *v1.Pod) error {
		if forgetErr := sched.config.SchedulerCache.ForgetPod(assumed); forgetErr != nil {
			klog.Errorf("scheduler cache ForgetPod failed: %v", forgetErr)
		}
		// Volumes may be bound by PV controller asynchronously, we must clear
		// stale pod binding cache.
		sched.config.VolumeBinder.DeletePodBindings(assumed)
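The cache being cleared here can be pictured as a mutex-guarded map of per-pod, per-node binding decisions. The sketch below is a hypothetical, heavily simplified stand-in (string keys instead of `*v1.Pod`, PVC names instead of real binding objects), not the actual `PodBindingCache` implementation:

```go
package main

import (
	"fmt"
	"sync"
)

// podBindingCache is a hypothetical, simplified pod binding cache:
// podKey -> nodeName -> PVC names, guarded by a mutex because the
// scheduling loop and the async bind goroutine both touch it.
type podBindingCache struct {
	mu       sync.Mutex
	bindings map[string]map[string][]string
}

func newPodBindingCache() *podBindingCache {
	return &podBindingCache{bindings: map[string]map[string][]string{}}
}

// UpdateBindings records the volume decisions for a pod on a node.
func (c *podBindingCache) UpdateBindings(podKey, node string, pvcs []string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.bindings[podKey] == nil {
		c.bindings[podKey] = map[string][]string{}
	}
	c.bindings[podKey][node] = pvcs
}

// GetBindings returns the cached decisions, or nil if there are none.
func (c *podBindingCache) GetBindings(podKey, node string) []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.bindings[podKey][node]
}

// DeletePodBindings drops all cached decisions for a pod, mirroring the
// call this PR adds after ForgetPod in the bind error path.
func (c *podBindingCache) DeletePodBindings(podKey string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.bindings, podKey)
}

func main() {
	c := newPodBindingCache()
	c.UpdateBindings("default/mypod", "node-1", []string{"pvc-a"})
	c.DeletePodBindings("default/mypod")
	fmt.Println(c.GetBindings("default/mypod", "node-1") == nil) // true
}
```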
is there a test we can set up that would exercise the bug this fixes?
It's hard to set up a test case to cover this; we could probably add a stress test, since it's easy to reproduce this issue in a few runs locally.
+1 to async "stress" test... this looks like it should be fairly quick to trigger
We do have some existing stress integration tests, but I think they don't handle prebound PVs. It shouldn't be too difficult to randomly make a fraction of the PVs prebound.
I've created an issue for this: #71301
8058d12 to 4cd4b28
/test pull-kubernetes-verify
/test pull-kubernetes-e2e-gce-100-performance
@msau42
Not sure why.
Does this cause a hard failure, or will it fix itself on a retry? @bsalamat, could you clarify if it's intended that a pod can go through scheduling while it's still in the bind phase?
/priority critical-urgent for merging when it's ready
@@ -143,6 +143,9 @@ func (b *volumeBinder) GetBindingsCache() PodBindingCache {
func (b *volumeBinder) FindPodVolumes(pod *v1.Pod, node *v1.Node) (unboundVolumesSatisfied, boundVolumesSatisfied bool, err error) {
	podName := getPodName(pod)

	// Pod binding cache may be out of date, clear for the given pod and node first.
	b.podBindingCache.ClearBindings(pod, node.Name)
Can this go after getPodVolumes, if the number of total claims is > 0? That way we save a check with a mutex for pods that don't use volumes.
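The suggested reordering can be sketched as follows. This is a hypothetical simplification (string keys, a counter standing in for the real cache), only meant to show that pods without claims never take the mutex:

```go
package main

import (
	"fmt"
	"sync"
)

// cache is a hypothetical stand-in for PodBindingCache; clears counts
// how many times the mutex-guarded clear actually ran.
type cache struct {
	mu     sync.Mutex
	clears int
}

func (c *cache) ClearBindings(podKey, node string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.clears++
}

// findPodVolumes applies the suggested ordering: look up the pod's
// claims first, and only touch the mutex-guarded cache when the pod
// actually has volumes to bind.
func findPodVolumes(c *cache, podKey, node string, claims []string) {
	if len(claims) > 0 {
		c.ClearBindings(podKey, node)
	}
	// ... volume matching for the claims would follow here ...
}

func main() {
	c := &cache{}
	findPodVolumes(c, "ns/no-volumes", "node-1", nil)             // skips the mutex
	findPodVolumes(c, "ns/with-pvc", "node-1", []string{"pvc-1"}) // clears once
	fmt.Println(c.clears) // 1
}
```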
AssumePodVolumes runs in sequence with FindPodVolumes, so pod binding cache cannot be modified by FindPodVolumes while AssumePodVolumes is running. So the races we're concerned about are in between Find/Assume and Bind. AssumePodVolumes will:
BindPodVolumes returns:
So I think binding should succeed after some retries, unless for some reason the scheduler is constantly retrying the pod even though it's currently binding. If you're seeing the test fail because of the scheduler retries, then let's go back to the original method for now so that we can unblock the release, and revisit this again as a refactor later.
/retest
Sometimes it succeeds, sometimes it fails. I've reverted to the original method for now. I've made a quick test in
/hold cancel
/lgtm
@cofyc ran the flaky test a few hundred times to confirm it doesn't flake anymore with this fix. We'll look into adding new stress tests to more reliably repro the error condition separately, as well as follow up with sig-scheduling to confirm if it's expected behavior that pods can go through the scheduler again while they're still binding.
/assign @k82cn |
/approve |
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cofyc, k82cn

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing
/test pull-kubernetes-kubemark-e2e-gce-big
/retest |
What type of PR is this?
What this PR does / why we need it:
Clear pod binding cache.
The previous way of clearing the pod binding cache asynchronously in ErrorFunc is not enough, because the pod binding cache is only cleared when the pod is assigned a node or not found (deleted). Volumes may be bound by the PV controller. In the next run, the scheduler binder will skip these bound claims in FindPodVolumes. If the pod binding cache is not cleared, BindPodVolumes will try to bind stale volumes and fail.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Partially address #70180
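The failure mode described above can be simulated in a few lines. Everything here is a hypothetical stand-in for the real binder (claims are a map of PVC name to bound PV, `""` meaning unbound): the scheduler caches a decision, the PV controller binds the claim to a different PV in the background, and replaying the stale cached decision fails; recomputing after clearing the cache skips the now-bound claim and succeeds.

```go
package main

import (
	"errors"
	"fmt"
)

// findPodVolumes recomputes bindings for a pod, skipping claims that
// are already bound, as the real FindPodVolumes does.
func findPodVolumes(claims map[string]string, freePV string) map[string]string {
	out := map[string]string{}
	for pvc, boundPV := range claims {
		if boundPV == "" {
			out[pvc] = freePV // decide to bind the unbound claim to a free PV
		}
	}
	return out
}

// bindPodVolumes replays cached decisions; replaying a decision for a
// claim that is meanwhile bound elsewhere is the stale-cache failure.
func bindPodVolumes(claims map[string]string, cached map[string]string) error {
	for pvc, pvName := range cached {
		if bound := claims[pvc]; bound != "" && bound != pvName {
			return errors.New("stale binding: claim already bound")
		}
		claims[pvc] = pvName
	}
	return nil
}

func main() {
	claims := map[string]string{"pvc-1": ""}
	cache := findPodVolumes(claims, "pv-a") // scheduler decides pvc-1 -> pv-a

	claims["pvc-1"] = "pv-b" // PV controller binds pvc-1 to pv-b asynchronously

	fmt.Println(bindPodVolumes(claims, cache)) // stale binding: claim already bound

	cache = findPodVolumes(claims, "pv-a") // re-run Find after clearing the cache
	fmt.Println(bindPodVolumes(claims, cache)) // <nil>
}
```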
Special notes for your reviewer:
Does this PR introduce a user-facing change?: