Make volume binder resilient to races #72045

Merged: 4 commits into kubernetes:master on Jan 12, 2019

Conversation

@cofyc (Member) commented Dec 14, 2018

What type of PR is this?

/kind bug
What this PR does / why we need it:

Make volume binder resilient to races between main schedule loop and async binding operation.

Which issue(s) this PR fixes:

Fixes #71928 #72013 #56236

Does this PR introduce a user-facing change?:

Make volume binder resilient to races between main schedule loop and async binding operation 

/sig storage
/sig scheduling

@k8s-ci-robot added labels on Dec 14, 2018: release-note, size/S, kind/bug, sig/storage, sig/scheduling, cncf-cla: yes, needs-priority
@cofyc (Member, Author) commented Dec 14, 2018

/assign @msau42

Verified locally that this fixes the issue, but I haven't figured out how to write a test for it yet. Since this adds a new interface, please review the approach first.

@msau42 (Member) commented Dec 18, 2018

Hm, another idea I had is to treat static binding and pending provisioning similarly in the predicate. Instead of failing the predicate if selectedNode is set, can we instead make the predicate return true only for the selectedNode, and false for all other nodes? AssumePodVolumes checks whether the PVCs are fully bound (which they are not) to determine whether or not to call BindPodVolumes.

This would be a bigger change, but I think it would make our logic more resilient to these races.
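
A minimal sketch of this idea (assuming the standard selected-node annotation; the function name and signature are illustrative, not the actual volume binder interface):

```go
package sketch

import v1 "k8s.io/api/core/v1"

// annSelectedNode is the annotation used to record the node chosen for a PVC
// with delayed binding/provisioning.
const annSelectedNode = "volume.kubernetes.io/selected-node"

// pvcNodeFits illustrates the suggested predicate behavior: once a node has
// been selected for a pending provisioning PVC, pass only that node and fail
// all others, instead of failing the predicate outright. Scheduling then
// converges back to the selected node, and AssumePodVolumes/BindPodVolumes
// get another chance to finish the binding.
func pvcNodeFits(pvc *v1.PersistentVolumeClaim, nodeName string) bool {
	selected := pvc.Annotations[annSelectedNode]
	if selected == "" {
		// No node selected yet; other topology checks decide feasibility.
		return true
	}
	// selectedNode is set: true only for that node, false for all others.
	return selected == nodeName
}
```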

@cofyc (Member, Author) commented Dec 18, 2018

That makes sense and seems better! Always try to bind pod volumes if they are not fully bound.

@cofyc changed the title from "Fix stale assumed pod volumes" to "WIP: Fix stale assumed pod volumes" on Dec 18, 2018
@k8s-ci-robot added the do-not-merge/work-in-progress label on Dec 18, 2018
@cofyc changed the title from "WIP: Fix stale assumed pod volumes" to "Make volume binder resilient to races between schedule loop and async binding operation" on Dec 24, 2018
@k8s-ci-robot removed the do-not-merge/work-in-progress label on Dec 24, 2018
@k8s-ci-robot added the size/L label and removed the size/S label on Dec 24, 2018
@cofyc (Member, Author) commented Dec 24, 2018

> Instead of failing the predicate if selectedNode is set, can we instead make the predicate return true only for the selectedNode, and false for all other nodes? AssumePodVolumes checks whether the PVCs are fully bound (which they are not) to determine whether or not to call BindPodVolumes.

  1. One question is how to make the API calls again when necessary. IMO there are two methods:
  • Method 1: BindPodVolumes makes API calls whenever it finds it necessary to update the API objects.
  • Method 2: BindPodVolumes makes the API calls once for the PVs and PVCs in the binding cache, and fails with an error when it finds the API objects need updating, letting the scheduler retry.

I'm using method 2.

  2. Another question is how to detect whether it is necessary to update the API objects.

There is a special case: if another instance of the pod is already in scheduling, it may find the same bindings and selected node and assume the same PVs and PVCs, and the API changes from the apiserver would be overwritten.

First, we make FindPodVolumes/AssumePodVolumes idempotent so they do not fail with an error (as you suggested). Then BindPodVolumes periodically detects whether it is necessary to update the API objects:

  • Method 1: the simplest way is to fail whenever another schedule loop finds/assumes the pod's volumes (make API calls for each new schedule, assuming a new schedule means we need to bind again).
  • Method 2: check the PVs & PVCs against the objects from the apiserver, and fail when they are out of date (a sketch of this check follows below).

I've tested method 1, but the scheduler retries the pod repeatedly and makes API calls every second (the check period), because there are two instances of the pod in the scheduler and each one fails the other's binding operation.

I'm using method 2.
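
A rough sketch of the method-2 staleness check (the names are hypothetical; it assumes the binder keeps the assumed PVCs in a cache and can compare resource versions against the informer's copies):

```go
package sketch

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	corelisters "k8s.io/client-go/listers/core/v1"
)

// checkBindingsUpToDate compares the PVCs assumed in the binding cache with
// the latest copies from the informer (the apiserver's view). Resource
// versions are opaque, so only inequality is tested: if they differ, another
// actor has updated the object and the cached bindings are stale.
func checkBindingsUpToDate(cached []*v1.PersistentVolumeClaim, pvcLister corelisters.PersistentVolumeClaimLister) error {
	for _, pvc := range cached {
		live, err := pvcLister.PersistentVolumeClaims(pvc.Namespace).Get(pvc.Name)
		if err != nil {
			return fmt.Errorf("failed to get pvc %s/%s: %v", pvc.Namespace, pvc.Name, err)
		}
		if live.ResourceVersion != pvc.ResourceVersion {
			// Fail with an error; the scheduler reschedules the pod and
			// makes the API calls again from fresh objects.
			return fmt.Errorf("pvc %s/%s is out of date", pvc.Namespace, pvc.Name)
		}
	}
	return nil
}
```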


IMO, the simplest design is to avoid calling find/assume on a pod while it is in binding; then the bindings and assumed PVs/PVCs for the assumed pod would not change within one scheduling cycle (if we consider the binding operation part of the scheduling cycle). Scheduling an assumed pod repeatedly is unnecessary; it is better to wait until the pod is bound successfully or unassumed on failure.

The current scheduler does not support this, and we cannot do it in FindPodVolumes, because an error in FindPodVolumes would make the pod unschedulable. I'm investigating possibilities with the new scheduler framework; maybe we can achieve this in the future.

@cofyc (Member, Author) commented Dec 24, 2018

Scenarios

PVC selectedNode is rejected by the provisioner

xref: #71928

BindPodVolumes will be retried in the following cases (see the sketch after this list):

  • No pod is in scheduling. The new PVC object from the apiserver is added to the PVC cache. BindPodVolumes finds that selectedNode no longer exists and fails with an error, on which the scheduler reschedules the pod and makes the API calls again.
  • A pod is in scheduling. It assumes the new PVC object and updates the binding cache. However, BindPodVolumes finds that the PVs & PVCs in the binding cache are out of date and fails with an error, on which the scheduler reschedules the pod and makes the API calls again.
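
To make the retry behavior concrete, here is a rough sketch of the overall shape BindPodVolumes could take under this design. The one-second check period comes from the comment above; the 10-minute timeout and the helpers makeAPIUpdates, bindingsUpToDate, and allPVCsBound are hypothetical stand-ins for the binder internals:

```go
package sketch

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// Hypothetical placeholders for the real binder internals.
func makeAPIUpdates() error   { return nil }  // write cached PV/PVC bindings to the apiserver
func bindingsUpToDate() error { return nil }  // method-2 staleness check sketched earlier
func allPVCsBound() bool      { return true } // have all assumed PVCs become fully bound?

// bindPodVolumes makes the API updates once from the binding cache, then
// polls until the PVCs are fully bound, failing fast if the cached objects go
// stale so the scheduler can rerun the whole scheduling cycle.
func bindPodVolumes() error {
	if err := makeAPIUpdates(); err != nil {
		return err
	}
	return wait.Poll(time.Second, 10*time.Minute, func() (bool, error) {
		if err := bindingsUpToDate(); err != nil {
			// Out-of-date cache: abort so the scheduler reschedules the pod.
			return false, err
		}
		return allPVCsBound(), nil
	})
}
```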

@cofyc force-pushed the fix71928 branch 2 times, most recently from 123382c to 4022103, on December 24, 2018 at 07:21
There is no need to clear the stale pod binding cache during scheduling, because it is recreated at the beginning of each schedule loop and cleared when the pod is removed from the scheduling queue.
@cofyc (Member, Author) commented Jan 9, 2019

@msau42 Squashed into logical commits (no content change):

  • 13d87fb: Make volume binder resilient to races
  • 8b94b96: Unit tests only
  • cfc8ef5: Scheduler change
  • 1a62f53: If provisioning PVC's PV is not found, check next time

@msau42 (Member) commented Jan 9, 2019

/lgtm
/approve

@k8s-ci-robot added the lgtm label on Jan 9, 2019
@cofyc (Member, Author) commented Jan 9, 2019

cc @k82cn @bsalamat for approval of the scheduler change

@cofyc (Member, Author) commented Jan 10, 2019

/hold
for checking pod binding cache issues

@k8s-ci-robot added the do-not-merge/hold label on Jan 10, 2019
@cofyc (Member, Author) commented Jan 10, 2019

/hold cancel

xref: #72045 (comment)

@k8s-ci-robot removed the do-not-merge/hold label on Jan 10, 2019
@cofyc (Member, Author) commented Jan 11, 2019

/priority important-soon

@k8s-ci-robot added the priority/important-soon label and removed the needs-priority label on Jan 11, 2019
@bsalamat (Member) left a comment

/approve

I do not know the volume binder logic very well, but based on your explanation that the volume binder cache needs to be kept only for unassigned pods, the changes in the scheduler code look good to me.

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bsalamat, cofyc, msau42, xiaoxubeii

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label on Jan 11, 2019
@k8s-ci-robot merged commit ccb1e1f into kubernetes:master on Jan 12, 2019
@liggitt (Member) commented Jan 12, 2019

CI failures around PV binding sharply increased between 1/11 and 1/12.

Can we determine whether this PR contributed, and consider a rollback/rework if so?

@msau42 (Member) commented Jan 12, 2019

At a quick glance, I don't think so. It seems like only containerd jobs with CSI hostpath or in-tree NFS drivers have spiked, and some of the failing NFS tests don't use PVs. I suspect an issue with containerd instead.

@liggitt (Member) commented Jan 12, 2019

Good observation. Opened #72863.

@bsalamat (Member) commented:

We are also seeing an increase in scheduling latency right after this PR. The scheduler predicate evaluation latency has increased by about 50%.

@cofyc (Member, Author) commented Jan 16, 2019

Hi @bsalamat, could you share your benchmark commands? I'd like to debug and optimize FindPodVolumes, which is called by the predicate.

@cofyc (Member, Author) commented Jan 16, 2019

I guess it is because the new FindPodVolumes in this PR needs to clear the old pod binding cache, and to achieve this it updates the pod cache even when there are no PVCs to bind/provision. There are some optimizations we can do. Sorry about the increased scheduling latency; I'm working on a fix.
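
For illustration only, the optimization hinted at here could take roughly this shape (the cache interface and all names are hypothetical, not this PR's code): touch the per-pod binding cache only when there is something to record or a stale entry to clear.

```go
package sketch

import v1 "k8s.io/api/core/v1"

// bindingCache stands in for the scheduler's per-pod binding cache.
type bindingCache interface {
	HasEntry(pod *v1.Pod, node string) bool
	DeleteBindings(pod *v1.Pod, node string)
	UpdateBindings(pod *v1.Pod, node string, toBind, toProvision []*v1.PersistentVolumeClaim)
}

// maybeUpdateCache avoids a cache write when the pod has no PVCs to bind or
// provision: it clears a stale entry if one exists and otherwise does
// nothing, so pods without delayed-binding volumes do not pay the cache
// update cost on every predicate evaluation.
func maybeUpdateCache(cache bindingCache, pod *v1.Pod, node string, toBind, toProvision []*v1.PersistentVolumeClaim) {
	if len(toBind) == 0 && len(toProvision) == 0 {
		if cache.HasEntry(pod, node) {
			cache.DeleteBindings(pod, node)
		}
		return
	}
	cache.UpdateBindings(pod, node, toBind, toProvision)
}
```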

@bsalamat (Member) commented:

@cofyc You can go to our perf dashboard and choose gce-100 nodes > Scheduler > Scheduling Latency > predicate evaluation to see the results of the latest runs.

@cofyc (Member, Author) commented Jan 17, 2019

Thanks!
