Make volume binder resilient to races #72045
What type of PR is this?
What this PR does / why we need it:
Make volume binder resilient to races between main schedule loop and async binding operation.
Which issue(s) this PR fixes:
Special notes for your reviewer:
Does this PR introduce a user-facing change?:
Hm, another idea I had is to treat static binding and pending provisioning similarly in the predicate. Instead of failing the predicate if selectedNode is set, can we instead make the predicate return true only for the selectedNode, and false for all other nodes? AssumePodVolumes checks whether the PVCs are fully bound (which they are not) to determine whether or not to call BindPodVolumes.
This would be a bigger change, but I think would make our logic more resilient to these races.
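A minimal sketch of the proposed predicate behavior, using hypothetical simplified types (`pvc`, `predicateFits`) rather than the real scheduler structs: when a claim already has a selectedNode, the predicate passes only that node and rejects every other node, instead of failing outright.

```go
package main

import "fmt"

// pvc is a hypothetical stand-in for a PersistentVolumeClaim; selectedNode
// models the volume.kubernetes.io/selected-node annotation ("" if unset).
type pvc struct {
	selectedNode string
}

// predicateFits sketches the idea: for both static binding and pending
// provisioning, return true only for the already-selected node and false
// for all other nodes, rather than failing the predicate entirely.
func predicateFits(claims []pvc, node string) bool {
	for _, c := range claims {
		if c.selectedNode != "" && c.selectedNode != node {
			return false // a different node was already selected for this claim
		}
	}
	return true
}

func main() {
	claims := []pvc{{selectedNode: "node-a"}}
	fmt.Println(predicateFits(claims, "node-a")) // true: only the selected node fits
	fmt.Println(predicateFits(claims, "node-b")) // false: all other nodes are rejected
}
```

Under this scheme, a pod mid-binding keeps scheduling back onto the same node instead of being marked unschedulable by a race.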
There is a special case: if another instance of the pod is already being scheduled, finds the same bindings and selected node, and assumes the same PVs and PVCs, then API changes from the apiserver will be overwritten.
First, we make FindPodVolumes/AssumePodVolumes idempotent and not fail with an error (as you suggested). Then, BindPodVolumes periodically detects whether it is necessary to update the API objects:
I've tested method 1, but the scheduler will retry the pod repeatedly and make API calls every second (the current check period), because there are two instances of the pod in the scheduler and each one fails the other's binding operation.
I'm using method 2.
IMO, the simplest design is to avoid calling find/assume on a pod while it is in binding; then the bindings and assumed PVs/PVCs for the assumed pod will not change within one scheduling cycle (if we consider the binding operation part of the scheduling cycle). Scheduling an assumed pod repeatedly is unnecessary; it is better to wait until the pod is bound successfully or unassumed on failure.
The current scheduler does not support this, and we cannot do it in FindPodVolumes, because an error in FindPodVolumes will make the pod unschedulable. I'm investigating possibilities with the new scheduler framework; maybe we can achieve this in the future.
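The "skip find/assume while binding" design above can be sketched as a tiny state check; `inBinding` and `scheduleOne` are hypothetical names, not the scheduler's actual API.

```go
package main

import "fmt"

// inBinding tracks pods whose async volume binding is in flight
// (hypothetical stand-in for the binder's assumed-pod state).
var inBinding = map[string]bool{}

// scheduleOne sketches the proposed design: if the pod is already in its
// binding phase, skip FindPodVolumes/AssumePodVolumes and wait for the
// bind to succeed (or for the pod to be unassumed on failure), so the
// cached bindings and assumed PVs/PVCs stay stable for the whole cycle.
func scheduleOne(pod string) string {
	if inBinding[pod] {
		return "wait" // don't re-run find/assume on an assumed pod
	}
	inBinding[pod] = true // binding starts asynchronously
	return "find+assume"
}

func main() {
	fmt.Println(scheduleOne("pod-1")) // find+assume
	fmt.Println(scheduleOne("pod-1")) // wait: binding already in progress
}
```

This is exactly what the comment says the current scheduler cannot express: there is no hook between the predicate and assume steps to say "wait, don't reschedule".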
BindPodVolumes will be retried in the following cases:
- PVC selectedNode is rejected by the provisioner
bsalamat left a comment
I do not know the volume binder logic very well, but based on your explanation that the volume binder cache needs to be kept only for unassigned pods, the changes in the scheduler code look good to me.
[APPROVALNOTIFIER] This PR is APPROVED
CI failures around PV binding sharply increased between 1/11 and 1/12. Can we determine whether this PR contributed, and consider a rollback/rework if so?
I guess it is because the new FindPodVolumes in this PR needs to clear the old pod binding cache, and to achieve this I update the pod cache even when there are no PVCs to bind or provision. There are some optimizations we can do. Sorry about the increased scheduling latency; I'm working on a fix.
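One shape the mentioned optimization could take, sketched with hypothetical names (`podBindingCache`, `updateFindResult`): only write to the pod binding cache when the pod actually has PVCs to bind or provision, and otherwise just clear any stale entry, avoiding cache churn on every FindPodVolumes call.

```go
package main

import "fmt"

// podBindingCache is a hypothetical per-pod binding cache, keyed by pod name.
type podBindingCache map[string][]string

// updateFindResult sketches the optimization: with no PVCs to bind or
// provision, drop any stale entry instead of writing a fresh (empty) one;
// otherwise record the new bindings as usual.
func updateFindResult(cache podBindingCache, pod string, bindings []string) {
	if len(bindings) == 0 {
		delete(cache, pod) // nothing to bind/provision: clear stale state only
		return
	}
	cache[pod] = bindings
}

func main() {
	cache := podBindingCache{"pod-1": {"pvc-old"}}
	updateFindResult(cache, "pod-1", nil) // stale entry cleared, no new write
	fmt.Println(len(cache)) // 0
	updateFindResult(cache, "pod-2", []string{"pvc-a"})
	fmt.Println(cache["pod-2"][0]) // pvc-a
}
```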