Make volume binder resilient to races #72045

Merged: 4 commits into kubernetes:master on Jan 12, 2019

Conversation

@cofyc (Member) commented Dec 14, 2018

What type of PR is this?

/kind bug
What this PR does / why we need it:

Make volume binder resilient to races between main schedule loop and async binding operation.

Which issue(s) this PR fixes:

Fixes #71928 #72013 #56236

Does this PR introduce a user-facing change?:

Make volume binder resilient to races between main schedule loop and async binding operation 

/sig storage
/sig scheduling

@k8s-ci-robot added labels on Dec 14, 2018: release-note, size/S, kind/bug, sig/storage, sig/scheduling, cncf-cla: yes, needs-priority
@cofyc (Member, Author) commented Dec 14, 2018

/assign @msau42

Verified locally that this fixes the issue, but I haven't figured out how to write a test for it yet. Since this adds a new interface, please review the approach first.

@msau42 (Member) commented Dec 18, 2018

Hm, another idea I had is to treat static binding and pending provisioning similarly in the predicate. Instead of failing the predicate if selectedNode is set, can we instead make the predicate return true only for the selectedNode, and false for all other nodes? AssumePodVolumes checks whether the PVCs are fully bound (which they are not) to determine whether or not to call BindPodVolumes.

This would be a bigger change, but I think it would make our logic more resilient to these races.
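
A minimal sketch of this idea (assuming the standard selected-node annotation; the function name and signature are illustrative, not the actual volume binder interface):

```go
package sketch

import v1 "k8s.io/api/core/v1"

// annSelectedNode is the annotation used to record the node chosen for a PVC
// with delayed binding/provisioning.
const annSelectedNode = "volume.kubernetes.io/selected-node"

// pvcNodeFits illustrates the suggested predicate behavior: once a node has
// been selected for a pending provisioning PVC, pass only that node and fail
// all others, instead of failing the predicate outright. Scheduling then
// converges back to the selected node, and AssumePodVolumes/BindPodVolumes
// get another chance to finish the binding.
func pvcNodeFits(pvc *v1.PersistentVolumeClaim, nodeName string) bool {
	selected := pvc.Annotations[annSelectedNode]
	if selected == "" {
		// No node selected yet; other topology checks decide feasibility.
		return true
	}
	// selectedNode is set: true only for that node, false for all others.
	return selected == nodeName
}
```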

@cofyc (Member, Author) commented Dec 18, 2018

That makes sense and seems better! Always try to bind pod volumes if they are not fully bound.

@cofyc changed the title from "Fix stale assumed pod volumes" to "WIP: Fix stale assumed pod volumes" on Dec 18, 2018
@k8s-ci-robot added the do-not-merge/work-in-progress label on Dec 18, 2018
@cofyc changed the title from "WIP: Fix stale assumed pod volumes" to "Make volume binder resilient to races between schedule loop and async binding operation" on Dec 24, 2018
@k8s-ci-robot removed the do-not-merge/work-in-progress label on Dec 24, 2018
@k8s-ci-robot added the size/L label and removed the size/S label on Dec 24, 2018
@cofyc (Member, Author) commented Dec 24, 2018

> Instead of failing the predicate if selectedNode is set, can we instead make the predicate return true only for the selectedNode, and false for all other nodes? AssumePodVolumes checks whether the PVCs are fully bound (which they are not) to determine whether or not to call BindPodVolumes.

  1. One question is how to make the API calls again when necessary. IMO there are two methods:
  • Method 1: BindPodVolumes makes API calls whenever it finds it necessary to update the API objects.
  • Method 2: BindPodVolumes makes the API calls once for the PVs and PVCs in the binding cache, and fails with an error when it finds the API objects need updating, letting the scheduler retry.

I'm using method 2.

  2. Another question is how to detect whether it is necessary to update the API objects.

There is a special case: if another instance of the pod is already in scheduling, it may find the same bindings and selected node and assume the same PVs and PVCs, and the API changes from the apiserver would be overwritten.

First, we make FindPodVolumes/AssumePodVolumes idempotent so they do not fail with an error (as you suggested). Then BindPodVolumes periodically detects whether it is necessary to update the API objects:

  • Method 1: the simplest way is to fail whenever another schedule loop finds/assumes the pod's volumes (make API calls for each new schedule, assuming a new schedule means we need to bind again).
  • Method 2: check the PVs & PVCs against the objects from the apiserver, and fail when they are out of date (a sketch of this check follows below).

I've tested method 1, but the scheduler retries the pod repeatedly and makes API calls every second (the check period), because there are two instances of the pod in the scheduler and each one fails the other's binding operation.

I'm using method 2.
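
A rough sketch of the method-2 staleness check (the names are hypothetical; it assumes the binder keeps the assumed PVCs in a cache and can compare resource versions against the informer's copies):

```go
package sketch

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	corelisters "k8s.io/client-go/listers/core/v1"
)

// checkBindingsUpToDate compares the PVCs assumed in the binding cache with
// the latest copies from the informer (the apiserver's view). Resource
// versions are opaque, so only inequality is tested: if they differ, another
// actor has updated the object and the cached bindings are stale.
func checkBindingsUpToDate(cached []*v1.PersistentVolumeClaim, pvcLister corelisters.PersistentVolumeClaimLister) error {
	for _, pvc := range cached {
		live, err := pvcLister.PersistentVolumeClaims(pvc.Namespace).Get(pvc.Name)
		if err != nil {
			return fmt.Errorf("failed to get pvc %s/%s: %v", pvc.Namespace, pvc.Name, err)
		}
		if live.ResourceVersion != pvc.ResourceVersion {
			// Fail with an error; the scheduler reschedules the pod and
			// makes the API calls again from fresh objects.
			return fmt.Errorf("pvc %s/%s is out of date", pvc.Namespace, pvc.Name)
		}
	}
	return nil
}
```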


IMO, the simplest design is to avoid calling find/assume on a pod while it is in binding; then the bindings and assumed PVs/PVCs for the assumed pod would not change within one scheduling cycle (if we consider the binding operation part of the scheduling cycle). Scheduling an assumed pod repeatedly is unnecessary; it is better to wait until the pod is bound successfully or unassumed on failure.

The current scheduler does not support this, and we cannot do it in FindPodVolumes, because an error in FindPodVolumes would make the pod unschedulable. I'm investigating possibilities with the new scheduler framework; maybe we can achieve this in the future.

@cofyc (Member, Author) commented Dec 24, 2018

Scenarios

PVC selectedNode is rejected by the provisioner

xref: #71928

BindPodVolumes will be retried in the following cases (see the sketch after this list):

  • No pod is in scheduling. The new PVC object from the apiserver is added to the PVC cache. BindPodVolumes finds that selectedNode no longer exists and fails with an error, on which the scheduler reschedules the pod and makes the API calls again.
  • A pod is in scheduling. It assumes the new PVC object and updates the binding cache. However, BindPodVolumes finds that the PVs & PVCs in the binding cache are out of date and fails with an error, on which the scheduler reschedules the pod and makes the API calls again.
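
To make the retry behavior concrete, here is a rough sketch of the overall shape BindPodVolumes could take under this design. The one-second check period comes from the comment above; the 10-minute timeout and the helpers makeAPIUpdates, bindingsUpToDate, and allPVCsBound are hypothetical stand-ins for the binder internals:

```go
package sketch

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// Hypothetical placeholders for the real binder internals.
func makeAPIUpdates() error   { return nil }  // write cached PV/PVC bindings to the apiserver
func bindingsUpToDate() error { return nil }  // method-2 staleness check sketched earlier
func allPVCsBound() bool      { return true } // have all assumed PVCs become fully bound?

// bindPodVolumes makes the API updates once from the binding cache, then
// polls until the PVCs are fully bound, failing fast if the cached objects go
// stale so the scheduler can rerun the whole scheduling cycle.
func bindPodVolumes() error {
	if err := makeAPIUpdates(); err != nil {
		return err
	}
	return wait.Poll(time.Second, 10*time.Minute, func() (bool, error) {
		if err := bindingsUpToDate(); err != nil {
			// Out-of-date cache: abort so the scheduler reschedules the pod.
			return false, err
		}
		return allPVCsBound(), nil
	})
}
```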

@cofyc force-pushed the fix71928 branch 2 times, most recently from 123382c to 4022103, on December 24, 2018 at 07:21
There is no need to clear the stale pod binding cache during scheduling, because it is recreated at the beginning of each schedule loop and cleared when the pod is removed from the scheduling queue.
@cofyc (Member, Author) commented Jan 9, 2019

@msau42 Squashed into logical commits (no content change):

  • 13d87fb: Make volume binder resilient to races
  • 8b94b96: Unit tests only
  • cfc8ef5: Scheduler change
  • 1a62f53: If provisioning PVC's PV is not found, check next time

@msau42 (Member) commented Jan 9, 2019

/lgtm
/approve

@k8s-ci-robot added the lgtm label on Jan 9, 2019
@cofyc (Member, Author) commented Jan 9, 2019

cc @k82cn @bsalamat for approval of the scheduler change

@cofyc (Member, Author) commented Jan 10, 2019

/hold
for checking pod binding cache issues

@k8s-ci-robot added the do-not-merge/hold label on Jan 10, 2019
@cofyc (Member, Author) commented Jan 10, 2019

/hold cancel

xref: #72045 (comment)

@k8s-ci-robot removed the do-not-merge/hold label on Jan 10, 2019
@cofyc (Member, Author) commented Jan 11, 2019

/priority important-soon

@k8s-ci-robot added the priority/important-soon label and removed the needs-priority label on Jan 11, 2019
@bsalamat (Member) left a comment

/approve

I do not know the volume binder logic very well, but based on your explanation that the volume binder cache needs to be kept only for unassigned pods, the changes in the scheduler code look good to me.

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bsalamat, cofyc, msau42, xiaoxubeii

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label on Jan 11, 2019
@k8s-ci-robot merged commit ccb1e1f into kubernetes:master on Jan 12, 2019
@liggitt (Member) commented Jan 12, 2019

CI failures around PV binding sharply increased between 1/11 and 1/12.

Can we determine whether this PR contributed, and consider a rollback/rework if so?

@msau42 (Member) commented Jan 12, 2019

At a quick glance, I don't think so. It seems like only containerd jobs with CSI hostpath or in-tree NFS drivers have spiked, and some of the failing NFS tests don't use PVs. I suspect an issue with containerd instead.

@liggitt (Member) commented Jan 12, 2019

Good observation. Opened #72863.

@bsalamat (Member) commented:

We are also seeing an increase in scheduling latency right after this PR. The scheduler predicate evaluation latency has increased by about 50%.

@cofyc (Member, Author) commented Jan 16, 2019

Hi @bsalamat, could you share your benchmark commands? I'd like to debug and optimize FindPodVolumes, which is called by the predicate.

@cofyc (Member, Author) commented Jan 16, 2019

I guess it is because the new FindPodVolumes in this PR needs to clear the old pod binding cache, and to achieve this it updates the pod cache even when there are no PVCs to bind/provision. There are some optimizations we can do. Sorry about the increased scheduling latency; I'm working on a fix.
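
For illustration only, the optimization hinted at here could take roughly this shape (the cache interface and all names are hypothetical, not this PR's code): touch the per-pod binding cache only when there is something to record or a stale entry to clear.

```go
package sketch

import v1 "k8s.io/api/core/v1"

// bindingCache stands in for the scheduler's per-pod binding cache.
type bindingCache interface {
	HasEntry(pod *v1.Pod, node string) bool
	DeleteBindings(pod *v1.Pod, node string)
	UpdateBindings(pod *v1.Pod, node string, toBind, toProvision []*v1.PersistentVolumeClaim)
}

// maybeUpdateCache avoids a cache write when the pod has no PVCs to bind or
// provision: it clears a stale entry if one exists and otherwise does
// nothing, so pods without delayed-binding volumes do not pay the cache
// update cost on every predicate evaluation.
func maybeUpdateCache(cache bindingCache, pod *v1.Pod, node string, toBind, toProvision []*v1.PersistentVolumeClaim) {
	if len(toBind) == 0 && len(toProvision) == 0 {
		if cache.HasEntry(pod, node) {
			cache.DeleteBindings(pod, node)
		}
		return
	}
	cache.UpdateBindings(pod, node, toBind, toProvision)
}
```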

@bsalamat (Member) commented:

@cofyc You can go to our perf dashboard and choose gce-100 nodes > Scheduler > Scheduling Latency > predicate evaluation to see the results of the latest runs.

@cofyc (Member, Author) commented Jan 17, 2019

Thanks!
