
Clear pod binding cache #71212

Merged
merged 1 commit into from
Nov 21, 2018

Conversation

@cofyc (Member) commented Nov 19, 2018

What type of PR is this?

Uncomment only one, leave it on its own line:

/kind api-change
/kind bug
/kind cleanup
/kind design
/kind documentation
/kind failing-test
/kind feature
/kind flake

What this PR does / why we need it:

Clear pod binding cache.

The previous approach of clearing the pod binding cache asynchronously in ErrorFunc is not sufficient, because the pod binding cache is only cleared when the pod is assigned a node or is not found (deleted).

Volumes may be bound by the PV controller in the meantime. On the next run, the scheduler binder will skip these bound claims in FindPodVolumes. If the pod binding cache is not cleared, BindPodVolumes will try to bind stale volumes and fail.
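To make the failure mode concrete, here is a minimal, self-contained Go sketch of a pod binding cache with a delete-on-error operation. The type and method names (bindingCache, DeletePodBindings) are simplified stand-ins modeled on this PR's fix, not the real Kubernetes PodBindingCache API:

```go
package main

import (
	"fmt"
	"sync"
)

// bindingCache is a hypothetical, simplified pod binding cache:
// pod name -> node name -> PV names still waiting to be bound.
type bindingCache struct {
	mu       sync.Mutex
	bindings map[string]map[string][]string
}

func newBindingCache() *bindingCache {
	return &bindingCache{bindings: map[string]map[string][]string{}}
}

// UpdateBindings records the pending binds chosen for a pod on a node.
func (c *bindingCache) UpdateBindings(pod, node string, pvs []string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.bindings[pod] == nil {
		c.bindings[pod] = map[string][]string{}
	}
	c.bindings[pod][node] = pvs
}

// GetBindings returns the cached pending binds (nil if none cached).
func (c *bindingCache) GetBindings(pod, node string) []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.bindings[pod][node]
}

// DeletePodBindings drops every cached entry for the pod, as the fix
// does on bind error, so the next scheduling pass starts clean instead
// of retrying binds for volumes the PV controller may have bound itself.
func (c *bindingCache) DeletePodBindings(pod string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.bindings, pod)
}

func main() {
	c := newBindingCache()
	c.UpdateBindings("pod-a", "node-1", []string{"pv-1"})
	// Binding fails; without clearing, the stale ["pv-1"] entry would
	// be reused even after pv-1 was bound out-of-band.
	c.DeletePodBindings("pod-a")
	fmt.Println(c.GetBindings("pod-a", "node-1") == nil) // true: stale entry gone
}
```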

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Partially addresses #70180

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Clear the pod binding cache on bind error to ensure stale pod binding cache entries are not reused.

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 19, 2018
@cofyc (Member, Author) commented Nov 19, 2018

/assign @msau42 @lichuqiang

@@ -399,6 +399,9 @@ func (sched *Scheduler) bindVolumes(assumed *v1.Pod) error {
if forgetErr := sched.config.SchedulerCache.ForgetPod(assumed); forgetErr != nil {
klog.Errorf("scheduler cache ForgetPod failed: %v", forgetErr)
}
// Volumes may be bound by PV controller asynchronously, we must clear
// stale pod binding cache.
sched.config.VolumeBinder.DeletePodBindings(assumed)
Member:

is there a test we can set up that would exercise the bug this fixes?

Member Author:

It's hard to set up a deterministic test case to cover this; we could probably add a stress test instead, since the issue reproduces locally within a few runs.

Member:

+1 to async "stress" test... this looks like it should be fairly quick to trigger

Member:

We do have some existing stress integration tests, but I think they don't handle prebound PVs. It shouldn't be too difficult to randomly make a fraction of the PVs prebound.

Member Author:

I've created an issue for this: #71301

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Nov 19, 2018
@cofyc cofyc changed the title Clear stale pod binding cache before rescheduling Clear stale pod binding cache in bindVolumes Nov 19, 2018
@cofyc cofyc changed the title Clear stale pod binding cache in bindVolumes WIP: Clear stale pod binding cache in bindVolumes Nov 19, 2018
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 19, 2018
@cofyc cofyc changed the title WIP: Clear stale pod binding cache in bindVolumes Clear stale pod binding cache in bindVolumes Nov 19, 2018
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 19, 2018
@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Nov 19, 2018
@cofyc cofyc force-pushed the fix71068 branch 2 times, most recently from 8058d12 to 4cd4b28 Compare November 19, 2018 16:13
@cofyc (Member, Author) commented Nov 19, 2018

/test pull-kubernetes-verify

@cofyc (Member, Author) commented Nov 19, 2018

/test pull-kubernetes-e2e-gce-100-performance
/test pull-kubernetes-kubemark-e2e-gce-big

@cofyc (Member, Author) commented Nov 20, 2018

@msau42
I found an issue with clearing the pod binding cache in FindPodVolumes. Sometimes Scheduler.bindVolumes fails to get cached bindings, e.g.:

I1120 19:16:50.955166   56524 scheduler.go:396] Failed to bind volumes for pod "volume-schedulingbc48cddc-ecb5-11e8-9d92-000c299a5e15/pod-w-pv-prebound-w-provisioned": failed to get cached bindings for pod "volume-schedulingbc48cddc-ecb5-11e8-9d92-000c299a5e15/pod-w-pv-prebound-w-provisioned"
E1120 19:16:50.955390   56524 factory.go:1465] Error scheduling volume-schedulingbc48cddc-ecb5-11e8-9d92-000c299a5e15/pod-w-pv-prebound-w-provisioned: failed to get cached bindings for pod "volume-schedulingbc48cddc-ecb5-11e8-9d92-000c299a5e15/pod-w-pv-prebound-w-provisioned"; retrying
I1120 19:16:50.955657   56524 factory.go:1558] Updating pod condition for volume-schedulingbc48cddc-ecb5-11e8-9d92-000c299a5e15/pod-w-pv-prebound-w-provisioned to (PodScheduled==False)
E1120 19:16:50.958873   56524 scheduler.go:567] error binding volumes: failed to get cached bindings for pod "volume-schedulingbc48cddc-ecb5-11e8-9d92-000c299a5e15/pod-w-pv-prebound-w-provisioned"

Not sure why. Scheduler.bindVolumes is asynchronous; maybe the cache is cleared in the scheduler main loop because the pod is added to the active queue by other events? Is it possible for a pod to be scheduled twice even though it's already assumed to a host?

@msau42 (Member) commented Nov 20, 2018

Does this cause a hard failure, or will it fix itself on a retry?

@bsalamat, could you clarify if it's intended that a pod can go through scheduling while it's still in the bind phase?

@nikopen (Contributor) commented Nov 20, 2018

/priority critical-urgent

for merging when it's ready

@k8s-ci-robot k8s-ci-robot added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Nov 20, 2018
@@ -143,6 +143,9 @@ func (b *volumeBinder) GetBindingsCache() PodBindingCache {
func (b *volumeBinder) FindPodVolumes(pod *v1.Pod, node *v1.Node) (unboundVolumesSatisfied, boundVolumesSatisfied bool, err error) {
podName := getPodName(pod)

// Pod binding cache may be out of date, clear for the given pod and node first.
b.podBindingCache.ClearBindings(pod, node.Name)
Member:

Can this go after getPodVolumes, and run only if the total number of claims is > 0? That way we avoid a mutex-protected cache access for pods that don't use volumes.
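The suggested reordering might look like the following self-contained sketch, where cache, clearBindings, and findPodVolumes are hypothetical stand-ins for the real volumeBinder helpers:

```go
package main

import (
	"fmt"
	"sync"
)

// cache is a hypothetical mutex-protected binding cache keyed by
// a "pod/node" string.
type cache struct {
	mu       sync.Mutex
	bindings map[string][]string
}

func (c *cache) clearBindings(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.bindings, key)
}

// findPodVolumes sketches the reviewer's suggestion: enumerate the
// pod's claims first, and only take the cache lock when there is
// actually something to clear. Pods with no volume claims skip the
// mutex acquisition entirely.
func findPodVolumes(c *cache, podClaims []string, key string) {
	if len(podClaims) > 0 {
		c.clearBindings(key)
	}
}

func main() {
	c := &cache{bindings: map[string][]string{"p/n": {"pv-1"}}}
	findPodVolumes(c, nil, "p/n") // no claims: cache untouched, no lock taken
	fmt.Println(len(c.bindings["p/n"]))
	findPodVolumes(c, []string{"pvc-1"}, "p/n") // claims present: entry cleared
	fmt.Println(len(c.bindings["p/n"]))
}
```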

@msau42 (Member) commented Nov 20, 2018

Does this cause a hard failure, or will it fix itself on a retry?

AssumePodVolumes runs in sequence with FindPodVolumes, so the pod binding cache cannot be modified by FindPodVolumes while AssumePodVolumes is running. So the races we're concerned about are between Find/Assume and Bind.

AssumePodVolumes will:

  • Return all volumes are bound, nothing to do. BindPodVolumes will be skipped.
  • Set empty pod bindings because either static binding or dynamic provisioning does not need to happen. For both to be empty, I think the first case would have happened first.
  • Set pod bindings for pvcs that need binding.

BindPodVolumes returns:

  • error if pod binding cache returns nil (vs empty)
  • success if empty
  • success/fail for relevant binding operations

So I think binding should succeed after some retries, unless for some reason the scheduler is constantly retrying the pod even though it's currently binding. If you're seeing the test fail because of the scheduler retries, then let's go back to the original method for now so that we can unblock the release, and revisit this again as a refactor later.
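The nil-versus-empty distinction described above can be sketched in a self-contained way; bindPodVolumes here is a hypothetical simplification of the behavior listed, not the real implementation:

```go
package main

import (
	"errors"
	"fmt"
)

// bindPodVolumes sketches the return behavior described above:
// a nil cache entry means the cached bindings were lost (error),
// an empty slice means nothing needs binding (success), and a
// non-empty slice would drive the actual binding operations.
func bindPodVolumes(cached []string) error {
	if cached == nil {
		// Cache entry missing: the error seen in the logs above.
		return errors.New("failed to get cached bindings for pod")
	}
	if len(cached) == 0 {
		return nil // nothing to bind
	}
	// Perform the binding operation for each pending PV here; in this
	// sketch the operations are assumed to succeed.
	return nil
}

func main() {
	fmt.Println(bindPodVolumes(nil))        // error: cache entry missing
	fmt.Println(bindPodVolumes([]string{})) // <nil>: nothing to do
}
```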

@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Nov 21, 2018
@cofyc cofyc changed the title Clear pod binding cache before finding Clear pod binding cache Nov 21, 2018
@cofyc (Member, Author) commented Nov 21, 2018

/retest

@cofyc (Member, Author) commented Nov 21, 2018

So I think binding should succeed after some retries, unless for some reason the scheduler is constantly retrying the pod even though it's currently binding. If you're seeing the test fail because of the scheduler retries, then let's go back to the original method for now so that we can unblock the release, and revisit this again as a refactor later.

Sometimes it succeeds and sometimes it fails. I've reverted to the original method for now.

I did a quick test in scheduleOne: a pod can be scheduled again while it is still assumed to a host, so there are races on PodBindingCache between Find/Assume and the async Bind. If this is intended, we need to revisit this as you said.

@cofyc (Member, Author) commented Nov 21, 2018

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 21, 2018
@msau42 (Member) commented Nov 21, 2018

/lgtm

@cofyc ran the flaky test a few hundred times to confirm it doesn't flake anymore with this fix. We'll look into adding new stress tests to more reliably repro the error condition separately, as well as follow up with sig-scheduling to confirm if it's expected behavior that pods can go through the scheduler again while they're still binding.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 21, 2018
@msau42 (Member) commented Nov 21, 2018

/assign @k82cn

@k82cn (Member) commented Nov 21, 2018

/approve

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cofyc, k82cn

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 21, 2018
@cofyc (Member, Author) commented Nov 21, 2018

/test pull-kubernetes-kubemark-e2e-gce-big

@nikopen (Contributor) commented Nov 21, 2018

/retest

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.