Add Un-reserve extension point for the scheduling framework #77598
/assign @bsalamat @neolit123 @liggitt
Can you separate the changes that address the flake into their own commit (a straight un-revert in one commit, then the flake fixes in a second)?
Thanks for the update on the t.Errorf messages.
@liggitt Yeah, thanks for the suggestion; it makes the change easier to review.
if err = wait.Poll(10*time.Millisecond, 30*time.Second, podSchedulingError(cs, pod.Namespace, pod.Name)); err != nil {
	t.Errorf("test #%v: Expected a scheduling error, but didn't get it. error: %v", i, err)
}
if unresPlugin.numUnreserveCalled != pbdPlugin.numPrebindCalled {
Is there any explanation why the unreserve plugin might be called more than once?
From the logs, I found that the test pod is sometimes scheduled twice before it is cleaned up:
I0509 00:49:27.662644 7237 wrap.go:47] POST /api/v1/namespaces/test-2/pods: (1.030311ms) 201 [scheduler.test/v0.0.0 (linux/amd64) kubernetes/$Format 127.0.0.1:42654]
I0509 00:49:27.662832 7237 scheduling_queue.go:795] About to try and schedule pod test-2/test-pod
I0509 00:49:27.662850 7237 scheduler.go:452] Attempting to schedule pod: test-2/test-pod
I0509 00:49:27.662962 7237 scheduler_binder.go:256] AssumePodVolumes for pod "test-2/test-pod", node "test-node-0"
I0509 00:49:27.662980 7237 scheduler_binder.go:266] AssumePodVolumes for pod "test-2/test-pod", node "test-node-0": all PVCs bound and nothing to do
numPrebindCalled++ -> 1
I0509 00:49:27.663012 7237 framework.go:85] rejected by prebind-plugin at prebind: reject pod test-pod
E0509 00:49:27.663025 7237 factory.go:662] Error scheduling test-2/test-pod: rejected by prebind-plugin at prebind: reject pod test-pod; retrying
I0509 00:49:27.663043 7237 factory.go:720] Updating pod condition for test-2/test-pod to (PodScheduled==False, Reason=Unschedulable)
I0509 00:49:27.668967 7237 wrap.go:47] POST /api/v1/namespaces/test-2/events: (5.354584ms) 201 [scheduler.test/v0.0.0 (linux/amd64) kubernetes/$Format 127.0.0.1:42656]
I0509 00:49:27.669020 7237 wrap.go:47] GET /api/v1/namespaces/test-2/pods/test-pod: (5.473273ms) 200 [scheduler.test/v0.0.0 (linux/amd64) kubernetes/$Format 127.0.0.1:42652]
I0509 00:49:27.669169 7237 wrap.go:47] PUT /api/v1/namespaces/test-2/pods/test-pod/status: (5.935527ms) 200 [scheduler.test/v0.0.0 (linux/amd64) kubernetes/$Format 127.0.0.1:42654]
I0509 00:49:27.669322 7237 scheduling_queue.go:795] About to try and schedule pod test-2/test-pod
I0509 00:49:27.669347 7237 scheduler.go:452] Attempting to schedule pod: test-2/test-pod
numUnreserveCalled++ -> 1
I0509 00:49:27.669550 7237 scheduler_binder.go:256] AssumePodVolumes for pod "test-2/test-pod", node "test-node-0"
I0509 00:49:27.669570 7237 scheduler_binder.go:266] AssumePodVolumes for pod "test-2/test-pod", node "test-node-0": all PVCs bound and nothing to do
numPrebindCalled++ -> 2
I0509 00:49:27.669609 7237 framework.go:85] rejected by prebind-plugin at prebind: reject pod test-pod
E0509 00:49:27.669618 7237 factory.go:662] Error scheduling test-2/test-pod: rejected by prebind-plugin at prebind: reject pod test-pod; retrying
I0509 00:49:27.669638 7237 factory.go:720] Updating pod condition for test-2/test-pod to (PodScheduled==False, Reason=Unschedulable)
numUnreserveCalled++ -> 2
I0509 00:49:27.671425 7237 wrap.go:47] GET /api/v1/namespaces/test-2/pods/test-pod: (1.537755ms) 200 [scheduler.test/v0.0.0 (linux/amd64) kubernetes/$Format 127.0.0.1:42652]
E0509 00:49:27.671603 7237 factory.go:686] pod is already present in unschedulableQ
I0509 00:49:27.671793 7237 wrap.go:47] PATCH /api/v1/namespaces/test-2/events/test-pod.159cf4497ef3b14b: (1.799554ms) 200 [scheduler.test/v0.0.0 (linux/amd64) kubernetes/$Format 127.0.0.1:42656]
I0509 00:49:27.764167 7237 wrap.go:47] GET /api/v1/namespaces/test-2/pods/test-pod: (739.161µs) 200 [scheduler.test/v0.0.0 (linux/amd64) kubernetes/$Format 127.0.0.1:42656]
reset numUnreserveCalled -> 0
reset numPrebindCalled -> 0
I0509 00:49:27.766347 7237 scheduling_queue.go:795] About to try and schedule pod test-2/test-pod
I0509 00:49:27.766376 7237 scheduler.go:448] Skip schedule deleting pod: test-2/test-pod
I0509 00:49:27.771321 7237 wrap.go:47] POST /api/v1/namespaces/test-2/events: (4.727587ms) 201 [scheduler.test/v0.0.0 (linux/amd64) kubernetes/$Format 127.0.0.1:42652]
I0509 00:49:27.771869 7237 wrap.go:47] DELETE /api/v1/namespaces/test-2/pods/test-pod: (7.36413ms) 200 [scheduler.test/v0.0.0 (linux/amd64) kubernetes/$Format 127.0.0.1:42656]
I0509 00:49:27.774353 7237 wrap.go:47] GET /api/v1/namespaces/test-2/pods/test-pod: (1.067354ms) 404 [scheduler.test/v0.0.0 (linux/amd64) kubernetes/$Format 127.0.0.1:42656]
danieldebug: pod is deleted
@bsalamat This is caused by a race between:
1. the scheduling_queue retrying to schedule the test pod, and
2. cleanupPods deleting it.
What we can do here is check whether the number of prebind failures equals the number of unreserve calls. I think that is safe and by design, right?
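The race above can be illustrated with a toy, deterministic model: every attempt on the pod is rejected at pre-bind after it was reserved, so each attempt bumps both counters once, and only the number of attempts (how many retries happen before cleanupPods wins) varies between runs. The type `pluginCounters` and the function `runUntilDeleted` are hypothetical names for illustration, not scheduler code, and the model assumes the counters are read only after the last attempt has finished.

```go
package main

import "fmt"

// pluginCounters is a hypothetical stand-in for the test plugins' counters
// (numPrebindCalled / numUnreserveCalled); it is not scheduler code.
type pluginCounters struct {
	numPrebindCalled   int
	numUnreserveCalled int
}

// runUntilDeleted models a pod that is reserved and then rejected at pre-bind
// on every attempt. attempts stands for how many retries the scheduling queue
// manages before cleanupPods deletes the pod, which depends on timing.
func runUntilDeleted(attempts int) pluginCounters {
	var c pluginCounters
	for i := 0; i < attempts; i++ {
		c.numPrebindCalled++   // pre-bind rejects the reserved pod
		c.numUnreserveCalled++ // a rejection after reserve triggers unreserve
	}
	return c
}

func main() {
	// The attempt count varies between runs (the log above shows 2),
	// but the two counters always move together.
	for _, attempts := range []int{1, 2, 5} {
		c := runUntilDeleted(attempts)
		fmt.Println(attempts, c.numPrebindCalled, c.numUnreserveCalled)
	}
}
```

This is why asserting on an exact count of 1 is flaky, while comparing the two counters with each other is not.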
Correct. This is expected. Thanks for checking. More accurately, since the pod is rejected at pre-bind, the scheduler will retry scheduling it. The scheduler may retry the pod one or more times before we check the number of times unreserve is called. To make the test more robust, please change the condition to:
if unresPlugin.numUnreserveCalled == 0 || unresPlugin.numUnreserveCalled != pbdPlugin.numPrebindCalled
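The suggested condition flags a failure when unreserve never ran or when its count differs from the pre-bind count. Below it is pulled out into a small predicate so the cases can be checked in isolation; the name `unreserveCountOK` is hypothetical, since in the test this is an inline `if` followed by `t.Errorf`.

```go
package main

import "fmt"

// unreserveCountOK reports whether the counters pass the check suggested in
// review: unreserve must have run at least once, and exactly as many times as
// pre-bind rejected the pod. The helper name is hypothetical.
func unreserveCountOK(numUnreserveCalled, numPrebindCalled int) bool {
	return !(numUnreserveCalled == 0 || numUnreserveCalled != numPrebindCalled)
}

func main() {
	fmt.Println(unreserveCountOK(2, 2)) // true: two retries, counts match
	fmt.Println(unreserveCountOK(0, 0)) // false: unreserve never ran
	fmt.Println(unreserveCountOK(1, 2)) // false: counts diverge
}
```

The `== 0` clause matters because the original equality check alone would silently pass if neither plugin were ever invoked.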
Agreed. Condition changed. PTAL.
Thanks, @danielqsj!
Please squash commits.
/lgtm
@bsalamat Squashed. PTAL, thanks.
/lgtm
/approve
Thanks, @danielqsj!
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: bsalamat, danielqsj
What type of PR is this?
/kind feature
/sig scheduling
What this PR does / why we need it:
Add Un-reserve extension point for the scheduling framework
Which issue(s) this PR fixes:
Fixes #77288 #77573
Special notes for your reviewer:
The former PR is #77457, but the TestUnreservePlugin test it introduced is flaky (ref #77573), so #77577 reverted it.
This PR reintroduces the Un-reserve extension point and fixes TestUnreservePlugin.
Does this PR introduce a user-facing change?: