Clear pod binding cache #71212
Conversation
/assign @msau42 @lichuqiang
@@ -399,6 +399,9 @@ func (sched *Scheduler) bindVolumes(assumed *v1.Pod) error {
		if forgetErr := sched.config.SchedulerCache.ForgetPod(assumed); forgetErr != nil {
			klog.Errorf("scheduler cache ForgetPod failed: %v", forgetErr)
		}
		// Volumes may be bound by PV controller asynchronously, we must clear
		// stale pod binding cache.
		sched.config.VolumeBinder.DeletePodBindings(assumed)
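The cache being cleared here can be pictured as a mutex-guarded map of per-pod, per-node binding decisions. The sketch below is a hypothetical, heavily simplified stand-in (string keys instead of `*v1.Pod`, PVC names instead of real binding objects), not the actual `PodBindingCache` implementation:

```go
package main

import (
	"fmt"
	"sync"
)

// podBindingCache is a hypothetical, simplified pod binding cache:
// podKey -> nodeName -> PVC names, guarded by a mutex because the
// scheduling loop and the async bind goroutine both touch it.
type podBindingCache struct {
	mu       sync.Mutex
	bindings map[string]map[string][]string
}

func newPodBindingCache() *podBindingCache {
	return &podBindingCache{bindings: map[string]map[string][]string{}}
}

// UpdateBindings records the volume decisions for a pod on a node.
func (c *podBindingCache) UpdateBindings(podKey, node string, pvcs []string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.bindings[podKey] == nil {
		c.bindings[podKey] = map[string][]string{}
	}
	c.bindings[podKey][node] = pvcs
}

// GetBindings returns the cached decisions, or nil if there are none.
func (c *podBindingCache) GetBindings(podKey, node string) []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.bindings[podKey][node]
}

// DeletePodBindings drops all cached decisions for a pod, mirroring the
// call this PR adds after ForgetPod in the bind error path.
func (c *podBindingCache) DeletePodBindings(podKey string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.bindings, podKey)
}

func main() {
	c := newPodBindingCache()
	c.UpdateBindings("default/mypod", "node-1", []string{"pvc-a"})
	c.DeletePodBindings("default/mypod")
	fmt.Println(c.GetBindings("default/mypod", "node-1") == nil) // true
}
```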
is there a test we can set up that would exercise the bug this fixes?
It's hard to set up a test case to cover this; we could probably add a stress test, since it's easy to reproduce this issue in a few runs locally.
+1 to async "stress" test... this looks like it should be fairly quick to trigger
We do have some existing stress integration tests, but I think they don't handle prebound PVs. It shouldn't be too difficult to randomly make a fraction of the PVs prebound.
I've created an issue for this: #71301
8058d12 to 4cd4b28
/test pull-kubernetes-verify
/test pull-kubernetes-e2e-gce-100-performance
@msau42
Not sure why.
Does this cause a hard failure, or will it fix itself on a retry? @bsalamat, could you clarify if it's intended that a pod can go through scheduling while it's still in the bind phase?
/priority critical-urgent for merging when it's ready
@@ -143,6 +143,9 @@ func (b *volumeBinder) GetBindingsCache() PodBindingCache {
func (b *volumeBinder) FindPodVolumes(pod *v1.Pod, node *v1.Node) (unboundVolumesSatisfied, boundVolumesSatisfied bool, err error) {
	podName := getPodName(pod)

	// Pod binding cache may be out of date, clear for the given pod and node first.
	b.podBindingCache.ClearBindings(pod, node.Name)
Can this go after getPodVolumes, if the number of total claims is > 0? That way we save a check with a mutex for pods that don't use volumes.
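The suggested reordering can be sketched as follows. This is a hypothetical simplification (string keys, a counter standing in for the real cache), only meant to show that pods without claims never take the mutex:

```go
package main

import (
	"fmt"
	"sync"
)

// cache is a hypothetical stand-in for PodBindingCache; clears counts
// how many times the mutex-guarded clear actually ran.
type cache struct {
	mu     sync.Mutex
	clears int
}

func (c *cache) ClearBindings(podKey, node string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.clears++
}

// findPodVolumes applies the suggested ordering: look up the pod's
// claims first, and only touch the mutex-guarded cache when the pod
// actually has volumes to bind.
func findPodVolumes(c *cache, podKey, node string, claims []string) {
	if len(claims) > 0 {
		c.ClearBindings(podKey, node)
	}
	// ... volume matching for the claims would follow here ...
}

func main() {
	c := &cache{}
	findPodVolumes(c, "ns/no-volumes", "node-1", nil)             // skips the mutex
	findPodVolumes(c, "ns/with-pvc", "node-1", []string{"pvc-1"}) // clears once
	fmt.Println(c.clears) // 1
}
```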
AssumePodVolumes runs in sequence with FindPodVolumes, so pod binding cache cannot be modified by FindPodVolumes while AssumePodVolumes is running. So the races we're concerned about are in between Find/Assume and Bind. AssumePodVolumes will:
BindPodVolumes returns:
So I think binding should succeed after some retries, unless for some reason the scheduler is constantly retrying the pod even though it's currently binding. If you're seeing the test fail because of the scheduler retries, then let's go back to the original method for now so that we can unblock the release, and revisit this again as a refactor later.
/retest
Sometimes it succeeds, sometimes it fails. I've reverted to the original method for now. I've made a quick test in
/hold cancel
/lgtm
@cofyc ran the flaky test a few hundred times to confirm it doesn't flake anymore with this fix. We'll look into adding new stress tests to more reliably repro the error condition separately, as well as follow up with sig-scheduling to confirm if it's expected behavior that pods can go through the scheduler again while they're still binding.
/assign @k82cn |
/approve |
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cofyc, k82cn

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing
/test pull-kubernetes-kubemark-e2e-gce-big
/retest |
What type of PR is this?
What this PR does / why we need it:
Clear pod binding cache.
The previous way of clearing the pod binding cache asynchronously in ErrorFunc is not enough, because the pod binding cache is only cleared when the pod is assigned a node or not found (deleted). Volumes may be bound by the PV controller. In the next run, the scheduler binder will skip these bound claims in FindPodVolumes. If the pod binding cache is not cleared, BindPodVolumes will try to bind stale volumes and fail.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Partially address #70180
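The failure mode described above can be simulated in a few lines. Everything here is a hypothetical stand-in for the real binder (claims are a map of PVC name to bound PV, `""` meaning unbound): the scheduler caches a decision, the PV controller binds the claim to a different PV in the background, and replaying the stale cached decision fails; recomputing after clearing the cache skips the now-bound claim and succeeds.

```go
package main

import (
	"errors"
	"fmt"
)

// findPodVolumes recomputes bindings for a pod, skipping claims that
// are already bound, as the real FindPodVolumes does.
func findPodVolumes(claims map[string]string, freePV string) map[string]string {
	out := map[string]string{}
	for pvc, boundPV := range claims {
		if boundPV == "" {
			out[pvc] = freePV // decide to bind the unbound claim to a free PV
		}
	}
	return out
}

// bindPodVolumes replays cached decisions; replaying a decision for a
// claim that is meanwhile bound elsewhere is the stale-cache failure.
func bindPodVolumes(claims map[string]string, cached map[string]string) error {
	for pvc, pvName := range cached {
		if bound := claims[pvc]; bound != "" && bound != pvName {
			return errors.New("stale binding: claim already bound")
		}
		claims[pvc] = pvName
	}
	return nil
}

func main() {
	claims := map[string]string{"pvc-1": ""}
	cache := findPodVolumes(claims, "pv-a") // scheduler decides pvc-1 -> pv-a

	claims["pvc-1"] = "pv-b" // PV controller binds pvc-1 to pv-b asynchronously

	fmt.Println(bindPodVolumes(claims, cache)) // stale binding: claim already bound

	cache = findPodVolumes(claims, "pv-a") // re-run Find after clearing the cache
	fmt.Println(bindPodVolumes(claims, cache)) // <nil>
}
```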
Special notes for your reviewer:
Does this PR introduce a user-facing change?: