Clear pod binding cache #71212

Merged
merged 1 commit into from Nov 21, 2018

Conversation

@cofyc
Contributor

cofyc commented Nov 19, 2018

What type of PR is this?

Uncomment only one, leave it on its own line:

/kind api-change
/kind bug
/kind cleanup
/kind design
/kind documentation
/kind failing-test
/kind feature
/kind flake

What this PR does / why we need it:

Clear pod binding cache.

The previous approach of clearing the pod binding cache asynchronously in ErrorFunc is not sufficient, because the pod binding cache is only cleared when the pod is assigned a node or is not found (deleted).

Volumes may be bound by the PV controller in the meantime. On the next run, the scheduler binder will skip these bound claims in FindPodVolumes. If the pod binding cache is not cleared, BindPodVolumes will try to bind stale volumes and fail.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Partially address #70180

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Clear pod binding cache on bind error to make sure stale pod binding cache will not be used.
@cofyc

Contributor

cofyc commented Nov 19, 2018

@@ -399,6 +399,9 @@ func (sched *Scheduler) bindVolumes(assumed *v1.Pod) error {
if forgetErr := sched.config.SchedulerCache.ForgetPod(assumed); forgetErr != nil {
klog.Errorf("scheduler cache ForgetPod failed: %v", forgetErr)
}
// Volumes may be bound by PV controller asynchronously, we must clear
// stale pod binding cache.
sched.config.VolumeBinder.DeletePodBindings(assumed)

@liggitt

liggitt Nov 19, 2018

Member

is there a test we can set up that would exercise the bug this fixes?

@cofyc

cofyc Nov 19, 2018

Contributor

It's hard to set up a test case that covers this. We could probably add a stress test; the issue is easy to reproduce locally within a few runs.

@liggitt

liggitt Nov 19, 2018

Member

+1 to async "stress" test... this looks like it should be fairly quick to trigger

@msau42

msau42 Nov 19, 2018

Member

We do have some existing stress integration tests, but I think they don't handle prebound PVs. It shouldn't be too difficult to randomly make a fraction of the PVs prebound.

@cofyc

cofyc Nov 21, 2018

Contributor

I've created an issue for this: #71301

@cofyc cofyc force-pushed the cofyc:fix71068 branch from 2566ccb to 3742680 Nov 19, 2018

@k8s-ci-robot k8s-ci-robot added size/S and removed size/XS labels Nov 19, 2018

@cofyc cofyc changed the title from Clear stale pod binding cache before rescheduling to Clear stale pod binding cache in bindVolumes Nov 19, 2018

@cofyc cofyc changed the title from Clear stale pod binding cache in bindVolumes to WIP: Clear stale pod binding cache in bindVolumes Nov 19, 2018

@cofyc cofyc changed the title from WIP: Clear stale pod binding cache in bindVolumes to Clear stale pod binding cache in bindVolumes Nov 19, 2018

@cofyc cofyc force-pushed the cofyc:fix71068 branch from 3742680 to 7cec99d Nov 19, 2018

@k8s-ci-robot k8s-ci-robot added size/XS and removed size/S labels Nov 19, 2018

@cofyc cofyc force-pushed the cofyc:fix71068 branch 2 times, most recently from 8058d12 to 4cd4b28 Nov 19, 2018

@cofyc

Contributor

cofyc commented Nov 19, 2018

/test pull-kubernetes-verify

@cofyc

Contributor

cofyc commented Nov 19, 2018

/test pull-kubernetes-e2e-gce-100-performance
/test pull-kubernetes-kubemark-e2e-gce-big

@cofyc

Contributor

cofyc commented Nov 20, 2018

/hold

@cofyc

Contributor

cofyc commented Nov 20, 2018

@msau42
I found an issue with clearing the pod binding cache in FindPodVolumes. Sometimes schedule.bindVolume fails to get cached bindings, e.g.

I1120 19:16:50.955166   56524 scheduler.go:396] Failed to bind volumes for pod "volume-schedulingbc48cddc-ecb5-11e8-9d92-000c299a5e15/pod-w-pv-prebound-w-provisioned": failed to get cached bindings for pod "volume-schedulingbc48cddc-ecb5-11e8-9d92-000c299a5e15/pod-w-pv-prebound-w-provisioned"
E1120 19:16:50.955390   56524 factory.go:1465] Error scheduling volume-schedulingbc48cddc-ecb5-11e8-9d92-000c299a5e15/pod-w-pv-prebound-w-provisioned: failed to get cached bindings for pod "volume-schedulingbc48cddc-ecb5-11e8-9d92-000c299a5e15/pod-w-pv-prebound-w-provisioned"; retrying
I1120 19:16:50.955657   56524 factory.go:1558] Updating pod condition for volume-schedulingbc48cddc-ecb5-11e8-9d92-000c299a5e15/pod-w-pv-prebound-w-provisioned to (PodScheduled==False)
E1120 19:16:50.958873   56524 scheduler.go:567] error binding volumes: failed to get cached bindings for pod "volume-schedulingbc48cddc-ecb5-11e8-9d92-000c299a5e15/pod-w-pv-prebound-w-provisioned"

Not sure why. schedule.bindVolume is asynchronous; maybe the cache is cleared in the scheduler main loop because the pod is added into the active queue by other events? Is it possible that a pod can be scheduled twice even if it's assumed to a host?

@msau42

Member

msau42 commented Nov 20, 2018

Does this cause hard failure or will it fix itself on a retry?

@bsalamat, could you clarify if it's intended that a pod can go through scheduling while it's still in the bind phase?

@nikopen

Member

nikopen commented Nov 20, 2018

/priority critical-urgent

for merging when it's ready

@@ -143,6 +143,9 @@ func (b *volumeBinder) GetBindingsCache() PodBindingCache {
func (b *volumeBinder) FindPodVolumes(pod *v1.Pod, node *v1.Node) (unboundVolumesSatisfied, boundVolumesSatisfied bool, err error) {
podName := getPodName(pod)
// Pod binding cache may be out of date, clear for the given pod and node first.
b.podBindingCache.ClearBindings(pod, node.Name)

@msau42

msau42 Nov 20, 2018

Member

Can this go after getPodVolumes, if number of total claims > 0? That way we save a check with mutex for pods that don't use volumes.
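A sketch of this suggestion (hypothetical stand-in types, not the real binder): look up the pod's claims first, and only take the cache mutex when there is actually something to clear, so pods without volumes skip the locking cost entirely.

```go
package main

import (
	"fmt"
	"sync"
)

// bindingCache is a hypothetical mutex-protected cache; clears counts how
// often the locked clear path was taken, just for illustration.
type bindingCache struct {
	mu       sync.Mutex
	clears   int
	bindings map[string][]string
}

func (c *bindingCache) ClearBindings(pod, node string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.clears++
	delete(c.bindings, pod+"/"+node)
}

// volumeBinder is a stand-in: claims maps pod name to its PVC names.
type volumeBinder struct {
	cache  *bindingCache
	claims map[string][]string
}

func (b *volumeBinder) FindPodVolumes(pod, node string) {
	// Suggested ordering: check the claim count before locking the cache.
	if len(b.claims[pod]) > 0 {
		b.cache.ClearBindings(pod, node)
	}
	// ... continue checking unbound/bound volume satisfaction as usual ...
}

func main() {
	b := &volumeBinder{
		cache:  &bindingCache{bindings: map[string][]string{}},
		claims: map[string][]string{"pod-with-pvc": {"claim-1"}},
	}
	b.FindPodVolumes("pod-without-pvc", "node-1") // no claims: mutex skipped
	b.FindPodVolumes("pod-with-pvc", "node-1")    // claims: cache cleared
	fmt.Println(b.cache.clears) // 1
}
```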

@msau42

Member

msau42 commented Nov 20, 2018

Does this cause hard failure or will it fix itself on a retry?

AssumePodVolumes runs in sequence with FindPodVolumes, so the pod binding cache cannot be modified by FindPodVolumes while AssumePodVolumes is running. The races we're concerned about are therefore between Find/Assume and Bind.

AssumePodVolumes will:

  • Return all volumes are bound, nothing to do. BindPodVolumes will be skipped.
  • Set empty pod bindings because either static binding or dynamic provisioning does not need to happen. For both to be empty, I think the first case would have happened first.
  • Set pod bindings for pvcs that need binding.

BindPodVolumes returns:

  • error if pod binding cache returns nil (vs empty)
  • success if empty
  • success/fail for relevant binding operations

So I think binding should succeed after some retries, unless for some reason the scheduler is constantly retrying the pod even though it's currently binding. If you're seeing the test fail because of the scheduler retries, then let's go back to the original method for now so that we can unblock the release, and revisit this again as a refactor later.

@cofyc cofyc force-pushed the cofyc:fix71068 branch from 6e25351 to 8fc00eb Nov 21, 2018

@k8s-ci-robot k8s-ci-robot added size/XS and removed size/S labels Nov 21, 2018

@cofyc cofyc changed the title from Clear pod binding cache before finding to Clear pod binding cache Nov 21, 2018

@cofyc

Contributor

cofyc commented Nov 21, 2018

/retest

@cofyc

Contributor

cofyc commented Nov 21, 2018

So I think binding should succeed after some retries, unless for some reason the scheduler is constantly retrying the pod even though it's currently binding. If you're seeing the test fail because of the scheduler retries, then let's go back to the original method for now so that we can unblock the release, and revisit this again as a refactor later.

Sometimes it succeeds, sometimes it fails. I've reverted to the original method for now.

I made a quick test in scheduleOne: a pod can be scheduled while it is assumed to a host, so there are races on PodBindingCache between Find/Assume and the async Bind. If this is intended, we need to revisit this as you said.

@cofyc

Contributor

cofyc commented Nov 21, 2018

/hold cancel

@msau42

Member

msau42 commented Nov 21, 2018

/lgtm

@cofyc ran the flaky test a few hundred times to confirm it doesn't flake anymore with this fix. We'll look into adding new stress tests to more reliably repro the error condition separately, as well as follow up with sig-scheduling to confirm if it's expected behavior that pods can go through the scheduler again while they're still binding.

@k8s-ci-robot k8s-ci-robot added the lgtm label Nov 21, 2018

@msau42

Member

msau42 commented Nov 21, 2018

/assign @k82cn

@k82cn

Member

k82cn commented Nov 21, 2018

/approve

@k8s-ci-robot

Contributor

k8s-ci-robot commented Nov 21, 2018

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cofyc, k82cn

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@cofyc

Contributor

cofyc commented Nov 21, 2018

/test pull-kubernetes-kubemark-e2e-gce-big

@nikopen

Member

nikopen commented Nov 21, 2018

/retest

@k8s-ci-robot k8s-ci-robot merged commit 9801572 into kubernetes:master Nov 21, 2018

17 of 18 checks passed

  • pull-kubernetes-e2e-gce-100-performance: Job triggered.
  • cla/linuxfoundation: cofyc authorized
  • pull-kubernetes-bazel-build: Job succeeded.
  • pull-kubernetes-bazel-test: Job succeeded.
  • pull-kubernetes-cross: Skipped
  • pull-kubernetes-e2e-gce: Job succeeded.
  • pull-kubernetes-e2e-gce-device-plugin-gpu: Job succeeded.
  • pull-kubernetes-e2e-gke: Skipped
  • pull-kubernetes-e2e-kops-aws: Job succeeded.
  • pull-kubernetes-e2e-kubeadm-gce: Skipped
  • pull-kubernetes-integration: Job succeeded.
  • pull-kubernetes-kubemark-e2e-gce-big: Job succeeded.
  • pull-kubernetes-local-e2e: Skipped
  • pull-kubernetes-local-e2e-containerized: Skipped
  • pull-kubernetes-node-e2e: Job succeeded.
  • pull-kubernetes-typecheck: Job succeeded.
  • pull-kubernetes-verify: Job succeeded.
  • tide: In merge pool.