Make volume binder resilient to races #72045

Merged

merged 4 commits into kubernetes:master from cofyc:fix71928 on Jan 12, 2019

Conversation

@cofyc (Contributor) commented Dec 14, 2018

What type of PR is this?

Uncomment only one, leave it on its own line:

/kind api-change
/kind bug
/kind cleanup
/kind design
/kind documentation
/kind failing-test
/kind feature
/kind flake

What this PR does / why we need it:

Make volume binder resilient to races between main schedule loop and async binding operation.

Which issue(s) this PR fixes:

Fixes #71928 #72013 #56236

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Make volume binder resilient to races between main schedule loop and async binding operation 

/sig storage
/sig scheduling

@cofyc (Contributor, Author) commented Dec 14, 2018

/assign @msau42

Verified that this fixes the issue locally, but I couldn't figure out how to write a test for it yet. Because this adds a new interface, please review it first.

@msau42 (Member) commented Dec 18, 2018

Hmm, another idea I had is to treat static binding and pending provisioning similarly in the predicate. Instead of failing the predicate if selectedNode is set, can we instead make the predicate return true only for the selectedNode, and false for all other nodes? AssumePodVolumes checks whether the PVCs are fully bound (which they are not) to determine whether or not to call BindPodVolumes.

This would be a bigger change, but I think would make our logic more resilient to these races.
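
For illustration, a minimal Go sketch of that idea, assuming a hypothetical helper name (the real check would live inside the volume binder's FindPodVolumes; the annotation key is the selected-node annotation the scheduler sets for delayed provisioning):

```go
package sketch

import v1 "k8s.io/api/core/v1"

// annSelectedNode is the annotation the scheduler sets on a PVC when it
// selects a node for delayed provisioning.
const annSelectedNode = "volume.kubernetes.io/selected-node"

// pendingProvisioningFits sketches the suggested predicate behavior: instead
// of failing outright when provisioning is already in flight, the PVC passes
// only on the node that was selected for it and fails on every other node.
// (Hypothetical helper, not the real FindPodVolumes signature.)
func pendingProvisioningFits(pvc *v1.PersistentVolumeClaim, node *v1.Node) bool {
	selected, ok := pvc.Annotations[annSelectedNode]
	if !ok {
		return true // no provisioning pending; other checks decide
	}
	return selected == node.Name
}
```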

@cofyc (Contributor, Author) commented Dec 18, 2018

That makes sense and seems better! Always try to bind pod volumes if they are not fully bound.

@cofyc changed the title from "Fix stale assumed pod volumes" to "WIP: Fix stale assumed pod volumes" Dec 18, 2018

@cofyc changed the title from "WIP: Fix stale assumed pod volumes" to "Make volume binder resilient to races between schedule loop and async binding operation" Dec 24, 2018

@cofyc force-pushed the cofyc:fix71928 branch from 02e4063 to d67b660 Dec 24, 2018

@k8s-ci-robot added size/L and removed size/S labels Dec 24, 2018

@cofyc (Contributor, Author) commented Dec 24, 2018

Instead of failing the predicate if selectedNode is set, can we instead make the predicate return true only for the selectedNode, and false for all other nodes? AssumePodVolumes checks whether the PVCs are fully bound (which they are not) to determine whether or not to call BindPodVolumes.

  1. One thing is how to make API calls again when necessary. IMO there are two methods:
  • Method 1: BindPodVolumes makes API calls whenever it finds it is necessary to update API objects
  • Method 2: BindPodVolumes makes API calls once for the PVs and PVCs in the binding cache, and fails with an error when it finds it is necessary to update API objects, letting the scheduler retry

I'm using method 2.

  2. Another thing is how to detect whether it is necessary to update API objects.

There is a special case: if another instance of the pod is already in scheduling, it may find the same bindings and selected node and assume the same PVs and PVCs. API changes from the apiserver would then be overwritten.

First, we make FindPodVolumes/AssumePodVolumes idempotent so they do not fail with an error (as you suggested). Then, BindPodVolumes periodically detects whether it is necessary to update API objects:

  • Method 1: the simplest way is to fail whenever another schedule loop finds/assumes pod volumes (make API calls for each new schedule, assuming a new schedule means we need to bind again)
  • Method 2: check the PVs & PVCs against the objects from the apiserver, and fail when they are out of date

I've tested method 1, but the scheduler will retry the pod repeatedly and make API calls every second (the check period), because there are two instances of the pod in the scheduler, and each one will fail the other's binding operation.

I'm using method 2.
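
As a rough illustration of method 2 (hypothetical helper name; the real check runs periodically inside BindPodVolumes and compares more than resource versions), the staleness test might look like:

```go
package sketch

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	corelisters "k8s.io/client-go/listers/core/v1"
)

// checkPVCUpToDate fails when a PVC assumed in the binding cache no longer
// matches what the apiserver has, so BindPodVolumes can return an error and
// let the scheduler re-run find/assume/bind with fresh objects.
// (Hypothetical helper; a simplification of the real staleness check.)
func checkPVCUpToDate(lister corelisters.PersistentVolumeClaimLister,
	cached *v1.PersistentVolumeClaim) error {
	live, err := lister.PersistentVolumeClaims(cached.Namespace).Get(cached.Name)
	if err != nil {
		return err
	}
	if live.ResourceVersion != cached.ResourceVersion {
		return fmt.Errorf("PVC %s/%s in binding cache is out of date; retry scheduling",
			cached.Namespace, cached.Name)
	}
	return nil
}
```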


IMO, the simplest design is to avoid calling find/assume on a pod while it is in binding; then the bindings and assumed PVs/PVCs for the assumed pod will not be changed within one scheduling cycle (if we consider the binding operation part of the scheduling cycle). Scheduling an assumed pod repeatedly is unnecessary; it is better to wait until the pod is bound successfully or unassumed on failure.

The current scheduler does not support this, and we cannot do it in FindPodVolumes, because an error in FindPodVolumes will make the pod unschedulable. I'm investigating possibilities with the new scheduler framework. Maybe we can achieve this in the future.

@cofyc (Contributor, Author) commented Dec 24, 2018

Scenarios

PVC selectedNode is rejected by provisioner

xref: #71928

BindPodVolumes will be retried in the following cases (see the sketch after this list):

  • No pod is in scheduling. The new PVC object from the apiserver is added into the PVC cache. BindPodVolumes finds that selectedNode does not exist and fails with an error, on which the scheduler will reschedule and make API calls again.
  • A pod is in scheduling. It will assume the new PVC object and update the binding cache. However, BindPodVolumes will find that the PVs & PVCs in the binding cache are out of date and fail with an error, on which the scheduler will reschedule and make API calls again.
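
To make the first case concrete, a sketch under the assumption that the provisioner signals rejection by removing the selected-node annotation from the PVC (hypothetical helper, reusing the annotation key from the earlier sketch):

```go
package sketch

import v1 "k8s.io/api/core/v1"

// selectedNodeRejected reports whether the provisioner rejected the node the
// scheduler assumed: the selected-node annotation is gone (or points at a
// different node) on the live PVC, so BindPodVolumes should fail with an
// error and let the scheduler reschedule and make API calls again.
// (Hypothetical helper for illustration only.)
func selectedNodeRejected(livePVC *v1.PersistentVolumeClaim, assumedNode string) bool {
	selected, ok := livePVC.Annotations["volume.kubernetes.io/selected-node"]
	return !ok || selected != assumedNode
}
```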

@cofyc force-pushed the cofyc:fix71928 branch 2 times, most recently from 123382c to 4022103 Dec 24, 2018

cofyc added some commits Jan 7, 2019

Make volume binder resilient to races: scheduler change
There is no need to clear the stale pod binding cache during scheduling, because
it will be recreated at the beginning of each schedule loop, and will be
cleared when the pod is removed from the scheduling queue.

@cofyc force-pushed the cofyc:fix71928 branch from 0a1d04b to 1a62f53 Jan 9, 2019

@cofyc (Contributor, Author) commented Jan 9, 2019

@msau42 Squashed into logical commits (no content change):

  • 13d87fb: Make volume binder resilient to races
  • 8b94b96: Unit tests only
  • cfc8ef5: Scheduler change
  • 1a62f53: If provisioning PVC's PV is not found, check next time

@msau42 (Member) commented Jan 9, 2019

/lgtm
/approve

@k8s-ci-robot added the lgtm label Jan 9, 2019

@cofyc (Contributor, Author) commented Jan 9, 2019

cc @k82cn @bsalamat for approval on the scheduler change

@cofyc (Contributor, Author) commented Jan 10, 2019

/hold
for checking pod binding cache issues

@cofyc (Contributor, Author) commented Jan 10, 2019

/hold cancel

xref: #72045 (comment)

@cofyc (Contributor, Author) commented Jan 11, 2019

/priority important-soon

@bsalamat (Contributor) left a comment

/approve

I do not know the volume binder logic very well, but based on your explanation that the volume binder cache needs to be kept only for unassigned pods, the changes in the scheduler code look good to me.

@k8s-ci-robot (Contributor) commented Jan 11, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bsalamat, cofyc, msau42, xiaoxubeii

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot merged commit ccb1e1f into kubernetes:master Jan 12, 2019

19 checks passed

  • cla/linuxfoundation: cofyc authorized
  • pull-kubernetes-bazel-build: Job succeeded.
  • pull-kubernetes-bazel-test: Job succeeded.
  • pull-kubernetes-cross: Skipped
  • pull-kubernetes-e2e-gce: Job succeeded.
  • pull-kubernetes-e2e-gce-100-performance: Job succeeded.
  • pull-kubernetes-e2e-gce-device-plugin-gpu: Job succeeded.
  • pull-kubernetes-e2e-gke: Skipped
  • pull-kubernetes-e2e-kops-aws: Job succeeded.
  • pull-kubernetes-e2e-kubeadm-gce: Skipped
  • pull-kubernetes-godeps: Skipped
  • pull-kubernetes-integration: Job succeeded.
  • pull-kubernetes-kubemark-e2e-gce-big: Job succeeded.
  • pull-kubernetes-local-e2e: Skipped
  • pull-kubernetes-local-e2e-containerized: Skipped
  • pull-kubernetes-node-e2e: Job succeeded.
  • pull-kubernetes-typecheck: Job succeeded.
  • pull-kubernetes-verify: Job succeeded.
  • tide: In merge pool.
@liggitt (Member) commented Jan 12, 2019

CI failures around PV binding sharply increased between 1/11 and 1/12.

Can we determine if this PR contributed, and consider rollback/rework if so?

@msau42 (Member) commented Jan 12, 2019

At a quick glance, I don't think so. It seems like only containerd jobs with CSI hostpath or in-tree NFS drivers have spiked. Also, some of the NFS tests that are failing don't use PVs. I suspect an issue with containerd instead.

@liggitt (Member) commented Jan 12, 2019

good observation. opened #72863

@bsalamat (Contributor) commented Jan 16, 2019

We are also seeing an increase in scheduling latency right after this PR. The scheduler's predicate evaluation latency has increased by about 50%.

@cofyc (Contributor, Author) commented Jan 16, 2019

Hi @bsalamat,
Could you give me your benchmark commands? I'd like to debug and optimize FindPodVolumes, which is called by the predicate.

@cofyc (Contributor, Author) commented Jan 16, 2019

I guess it is because the new FindPodVolumes in this PR needs to clear the old pod binding cache, and to achieve this I update the pod cache even if there are no PVCs to bind/provision. There are some optimizations we can do. Sorry about the increased scheduling latency. I'm working on a fix.
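
The optimization being described might look roughly like this (hypothetical cache interface and helper; the real scheduler types differ): return early before touching the pod binding cache when the pod has nothing to bind or provision, so the hot predicate path stays cheap:

```go
package sketch

import v1 "k8s.io/api/core/v1"

// podBindingCache is a stand-in for the scheduler's pod binding cache
// (hypothetical interface for this sketch).
type podBindingCache interface {
	UpdateBindings(pod *v1.Pod, node string, toBind, toProvision []*v1.PersistentVolumeClaim)
}

// updateCacheIfNeeded skips the cache write entirely when there is nothing
// to bind or provision, so pods without delayed-binding PVCs no longer pay
// the cache cost on every predicate evaluation.
func updateCacheIfNeeded(c podBindingCache, pod *v1.Pod, node string,
	toBind, toProvision []*v1.PersistentVolumeClaim) {
	if len(toBind) == 0 && len(toProvision) == 0 {
		return
	}
	c.UpdateBindings(pod, node, toBind, toProvision)
}
```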

@bsalamat (Contributor) commented Jan 16, 2019

@cofyc You can go to our perf dashboard and then choose:
gce-100 nodes > Scheduler > Scheduling Latency > predicate evaluation
to see the results of the latest runs.

@cofyc (Contributor, Author) commented Jan 17, 2019

Thanks!
