Make volume binder resilient to races #72045
Conversation
/assign @msau42 Verified this can fix the issue locally, but I couldn't figure out how to write a test for it yet. Because this adds a new interface, please review first.
Hm, another idea I had is to treat static binding and pending provisioning similarly in the predicate. Instead of failing the predicate if selectedNode is set, can we instead make the predicate return true only for the selectedNode, and false for all other nodes? AssumePodVolumes checks whether the PVCs are fully bound (which they are not) to determine whether or not to call BindPodVolumes. This would be a bigger change, but I think it would make our logic more resilient to these races.
That makes sense and seems better! Always try to bind pod volumes if they are not fully bound.
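The predicate idea discussed above can be sketched as follows. This is an illustrative stand-in, not the real scheduler predicate API; the function and parameter names are assumptions.

```go
package main

// checkSelectedNode sketches the suggested predicate behavior: when a PVC
// already has a selected node (e.g. provisioning is pending), return true
// only for that node and false for all others, instead of failing outright.
func checkSelectedNode(selectedNode, candidateNode string) bool {
	if selectedNode == "" {
		// No node chosen yet: any node is still a candidate.
		return true
	}
	// A node was already selected: only that node passes, so the pod
	// keeps converging to the same placement across scheduling cycles.
	return selectedNode == candidateNode
}

func main() {}
```

Under this sketch, a pod with a pending selected node is only ever re-placed on that same node, which is what makes the logic resilient to the race.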
I'm using method 2.
There is a special case: if another instance of the pod is already in scheduling and finds the same bindings and selected node and assumes the same PVs and PVCs, API changes from the apiserver will be overwritten. First, we make FindPodVolumes/AssumePodVolumes idempotent, so they do not fail with an error (as you suggested). Then, BindPodVolumes periodically detects whether it is necessary to update the API objects:
I've tested method 1, but currently the scheduler retries the pod repeatedly and makes API calls every second (the check period), because there are two instances of the pod in the scheduler and each one fails the other's binding operation. I'm using method 2.

IMO, the simplest design is to avoid calling find/assume on a pod while it is in binding; then the bindings and assumed PVs/PVCs for the assumed pod will not change within one scheduling cycle (if we consider the binding operation part of the scheduling cycle). Scheduling an assumed pod repeatedly is unnecessary; it is better to wait until the pod is bound successfully or unassumed on failure. The current scheduler does not support this, and we cannot do it in FindPodVolumes, because an error in FindPodVolumes would make the pod unschedulable. I'm investigating possibilities with the new scheduler framework; maybe we can achieve this in the future.
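The "detect periodically whether an API update is needed" part of method 2 can be sketched like this. The names are hypothetical stand-ins for the real binder objects; the point is only the idempotence check.

```go
package main

// bindIfNeeded compares the assumed binding against the current API state
// before writing, so that a second scheduler instance that already assumed
// and wrote the same binding does not get its API changes overwritten.
// Returns true when an API update would actually be issued.
func bindIfNeeded(assumedNode, apiBoundNode string) bool {
	if apiBoundNode == assumedNode {
		// Already bound to the node we assumed: idempotent no-op,
		// nothing to send to the apiserver.
		return false
	}
	// In the real binder this would issue the API update; here we only
	// report that an update would have been sent.
	return true
}

func main() {}
```

The comparison makes repeated BindPodVolumes checks cheap: once the API state matches the assumed state, subsequent periodic checks do nothing.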
Scenarios:
- PVC selectedNode is rejected by the provisioner (xref: #71928)

BindPodVolumes will be retried in the following cases:
There is no need to clear stale pod binding cache in scheduling, because it will be recreated at beginning of each schedule loop, and will be cleared when pod is removed from scheduling queue.
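The cache lifecycle described above can be sketched as follows. This is a simplified stand-in for the pod binding cache, not the actual scheduler code; the type and method names are assumptions.

```go
package main

// podBindingCache illustrates the lifecycle in the comment above: the entry
// for a pod is recreated at the start of each schedule loop (overwriting any
// stale bindings) and deleted when the pod leaves the scheduling queue, so
// no explicit mid-cycle cleanup is needed.
type podBindingCache map[string][]string

// startScheduleLoop recreates the pod's entry; stale bindings from a
// previous loop are discarded by the overwrite rather than cleared separately.
func (c podBindingCache) startScheduleLoop(pod string, bindings []string) {
	c[pod] = bindings
}

// podRemoved drops the entry when the pod is removed from the queue.
func (c podBindingCache) podRemoved(pod string) {
	delete(c, pod)
}

func main() {}
```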
/lgtm
/hold
/hold cancel xref: #72045 (comment)
/priority important-soon
/approve
I do not know the volume binder logic very well, but based on your explanation that the volume binder cache needs to be kept only for unassigned pods, the changes in the scheduler code look good to me.
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: bsalamat, cofyc, msau42, xiaoxubeii The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
CI failures around PV binding sharply increased between 1/11 and 1/12:
Can we determine if this PR contributed, and consider rollback/rework if so?
At a quick glance, I don't think so. It seems like only containerd jobs with CSI hostpath or in-tree NFS drivers have spiked, and some of the NFS tests that are failing don't use PVs. I suspect an issue with containerd instead.
Good observation. Opened #72863
We are also seeing an increase in scheduling latency right after this PR. The scheduler predicate evaluation latency has increased by about 50%.
Hi, @bsalamat
I guess it is because the new FindPodVolumes in this PR needs to clear the old pod binding cache, and to achieve this I update the pod cache even if there are no PVCs to bind/provision. There are some optimizations we can do. Sorry about the increased scheduling latency; I'm working on a fix.
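The optimization hinted at above can be sketched like this: skip the cache write entirely when the pod has nothing to bind or provision. This is a hypothetical helper, not the actual fix.

```go
package main

// updateCacheIfNeeded only touches the pod binding cache when there are
// PVCs to bind or provision, avoiding the per-FindPodVolumes cache write
// that contributed to the predicate latency regression discussed above.
// Returns true when the cache was actually updated.
func updateCacheIfNeeded(cache map[string]int, pod string, numPVCsToBind int) bool {
	if numPVCsToBind == 0 {
		// Nothing to bind or provision: leave the cache untouched so
		// pods without unbound PVCs pay no cache cost.
		return false
	}
	cache[pod] = numPVCsToBind
	return true
}

func main() {}
```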
@cofyc You can go to our perf dashboard and then choose:
Thanks! |
What type of PR is this?
What this PR does / why we need it:
Make volume binder resilient to races between main schedule loop and async binding operation.
Which issue(s) this PR fixes:
Fixes #71928 #72013 #56236
Special notes for your reviewer:
Does this PR introduce a user-facing change?:
/sig storage
/sig scheduling