
Add Un-reserve extension point for the scheduling framework #77598

Merged
merged 1 commit into from
May 10, 2019

Conversation

danielqsj
Contributor

@danielqsj danielqsj commented May 8, 2019

What type of PR is this?

/kind feature
/sig scheduling

What this PR does / why we need it:

Add Un-reserve extension point for the scheduling framework

Which issue(s) this PR fixes:

Fixes #77288 #77573

Special notes for your reviewer:

The former PR was #77457, but the TestUnreservePlugin test it introduced was flaky (ref #77573), so #77577 reverted it.

This PR reintroduces the Un-reserve extension point and fixes TestUnreservePlugin.
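For context, the Un-reserve extension point gives Reserve-phase plugins a hook to roll back their bookkeeping when a later phase (such as pre-bind) rejects the pod. The sketch below is a simplified, self-contained illustration of that contract; the interface and plugin names here are hypothetical stand-ins, not the actual types in k8s.io/kubernetes/pkg/scheduler/framework.

```go
package main

import "fmt"

// ReservePlugin is a hypothetical, simplified version of the framework's
// Reserve/Un-reserve pairing: a plugin that reserves state for a pod must
// also be able to undo that reservation.
type ReservePlugin interface {
	Reserve(pod string) error
	// Unreserve rolls back state when a later phase (e.g. pre-bind)
	// fails after Reserve succeeded.
	Unreserve(pod string)
}

// fakeQuotaPlugin tracks which pods it has reserved capacity for.
type fakeQuotaPlugin struct {
	reserved map[string]bool
}

func (p *fakeQuotaPlugin) Reserve(pod string) error {
	p.reserved[pod] = true
	return nil
}

func (p *fakeQuotaPlugin) Unreserve(pod string) {
	delete(p.reserved, pod)
}

// runSchedulingCycle mimics the framework's flow: if pre-bind rejects the
// pod, every reserve plugin's Unreserve hook is invoked to undo its
// bookkeeping.
func runSchedulingCycle(pod string, plugins []ReservePlugin, prebindOK bool) {
	for _, pl := range plugins {
		pl.Reserve(pod)
	}
	if !prebindOK {
		for _, pl := range plugins {
			pl.Unreserve(pod)
		}
	}
}

func main() {
	quota := &fakeQuotaPlugin{reserved: map[string]bool{}}
	// Pre-bind rejects the pod, so the reservation is rolled back.
	runSchedulingCycle("test-pod", []ReservePlugin{quota}, false)
	fmt.Println(len(quota.reserved))
}
```

In the real framework, every failed pre-bind triggers the Un-reserve hooks, which is why the test in this PR can pair unreserve counts with prebind rejections.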

Does this PR introduce a user-facing change?:

Add Un-reserve extension point for the scheduling framework.

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/feature Categorizes issue or PR as related to a new feature. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels May 8, 2019
@danielqsj
Contributor Author

/assign @bsalamat @neolit123 @liggitt
/cc @tedyu

@liggitt
Member

liggitt commented May 8, 2019

can you separate out the changes made that address the flake into their own commit (straight un-revert in one commit, then flake fixes in a second)?

Member

@neolit123 neolit123 left a comment


thanks for the update on the t.Errorf messages.

@danielqsj
Contributor Author

@liggitt Yeah, thanks for the suggestion; the split commits are easier to review.

if err = wait.Poll(10*time.Millisecond, 30*time.Second, podSchedulingError(cs, pod.Namespace, pod.Name)); err != nil {
t.Errorf("test #%v: Expected a scheduling error, but didn't get it. error: %v", i, err)
}
if unresPlugin.numUnreserveCalled != pbdPlugin.numPrebindCalled {
Member


Is there any explanation why the unreserve plugin might be called more than once?

Contributor Author


From the logs, I found that sometimes the test pod is scheduled twice before it is cleaned up.

I0509 00:49:27.662644    7237 wrap.go:47] POST /api/v1/namespaces/test-2/pods: (1.030311ms) 201 [scheduler.test/v0.0.0 (linux/amd64) kubernetes/$Format 127.0.0.1:42654]
I0509 00:49:27.662832    7237 scheduling_queue.go:795] About to try and schedule pod test-2/test-pod
I0509 00:49:27.662850    7237 scheduler.go:452] Attempting to schedule pod: test-2/test-pod
I0509 00:49:27.662962    7237 scheduler_binder.go:256] AssumePodVolumes for pod "test-2/test-pod", node "test-node-0"
I0509 00:49:27.662980    7237 scheduler_binder.go:266] AssumePodVolumes for pod "test-2/test-pod", node "test-node-0": all PVCs bound and nothing to do
numPrebindCalled++ -> 1
I0509 00:49:27.663012    7237 framework.go:85] rejected by prebind-plugin at prebind: reject pod test-pod
E0509 00:49:27.663025    7237 factory.go:662] Error scheduling test-2/test-pod: rejected by prebind-plugin at prebind: reject pod test-pod; retrying
I0509 00:49:27.663043    7237 factory.go:720] Updating pod condition for test-2/test-pod to (PodScheduled==False, Reason=Unschedulable)
I0509 00:49:27.668967    7237 wrap.go:47] POST /api/v1/namespaces/test-2/events: (5.354584ms) 201 [scheduler.test/v0.0.0 (linux/amd64) kubernetes/$Format 127.0.0.1:42656]
I0509 00:49:27.669020    7237 wrap.go:47] GET /api/v1/namespaces/test-2/pods/test-pod: (5.473273ms) 200 [scheduler.test/v0.0.0 (linux/amd64) kubernetes/$Format 127.0.0.1:42652]
I0509 00:49:27.669169    7237 wrap.go:47] PUT /api/v1/namespaces/test-2/pods/test-pod/status: (5.935527ms) 200 [scheduler.test/v0.0.0 (linux/amd64) kubernetes/$Format 127.0.0.1:42654]
I0509 00:49:27.669322    7237 scheduling_queue.go:795] About to try and schedule pod test-2/test-pod
I0509 00:49:27.669347    7237 scheduler.go:452] Attempting to schedule pod: test-2/test-pod
numUnreserveCalled++ -> 1
I0509 00:49:27.669550    7237 scheduler_binder.go:256] AssumePodVolumes for pod "test-2/test-pod", node "test-node-0"
I0509 00:49:27.669570    7237 scheduler_binder.go:266] AssumePodVolumes for pod "test-2/test-pod", node "test-node-0": all PVCs bound and nothing to do
numPrebindCalled++ -> 2
I0509 00:49:27.669609    7237 framework.go:85] rejected by prebind-plugin at prebind: reject pod test-pod
E0509 00:49:27.669618    7237 factory.go:662] Error scheduling test-2/test-pod: rejected by prebind-plugin at prebind: reject pod test-pod; retrying
I0509 00:49:27.669638    7237 factory.go:720] Updating pod condition for test-2/test-pod to (PodScheduled==False, Reason=Unschedulable)
numUnreserveCalled++ -> 2
I0509 00:49:27.671425    7237 wrap.go:47] GET /api/v1/namespaces/test-2/pods/test-pod: (1.537755ms) 200 [scheduler.test/v0.0.0 (linux/amd64) kubernetes/$Format 127.0.0.1:42652]
E0509 00:49:27.671603    7237 factory.go:686] pod is already present in unschedulableQ
I0509 00:49:27.671793    7237 wrap.go:47] PATCH /api/v1/namespaces/test-2/events/test-pod.159cf4497ef3b14b: (1.799554ms) 200 [scheduler.test/v0.0.0 (linux/amd64) kubernetes/$Format 127.0.0.1:42656]
I0509 00:49:27.764167    7237 wrap.go:47] GET /api/v1/namespaces/test-2/pods/test-pod: (739.161µs) 200 [scheduler.test/v0.0.0 (linux/amd64) kubernetes/$Format 127.0.0.1:42656]
reset numUnreserveCalled -> 0
reset numPrebindCalled -> 0
I0509 00:49:27.766347    7237 scheduling_queue.go:795] About to try and schedule pod test-2/test-pod
I0509 00:49:27.766376    7237 scheduler.go:448] Skip schedule deleting pod: test-2/test-pod
I0509 00:49:27.771321    7237 wrap.go:47] POST /api/v1/namespaces/test-2/events: (4.727587ms) 201 [scheduler.test/v0.0.0 (linux/amd64) kubernetes/$Format 127.0.0.1:42652]
I0509 00:49:27.771869    7237 wrap.go:47] DELETE /api/v1/namespaces/test-2/pods/test-pod: (7.36413ms) 200 [scheduler.test/v0.0.0 (linux/amd64) kubernetes/$Format 127.0.0.1:42656]
I0509 00:49:27.774353    7237 wrap.go:47] GET /api/v1/namespaces/test-2/pods/test-pod: (1.067354ms) 404 [scheduler.test/v0.0.0 (linux/amd64) kubernetes/$Format 127.0.0.1:42656]
danieldebug: pod is deleted

Contributor Author


@bsalamat This is due to a race between:
1. the scheduling_queue retrying the test pod
2. cleanupPods

What we can do here is check whether the number of prebind failures equals the number of unreserve calls. I think that's safe and by design, right?

Member


Correct. This is expected. Thanks for checking. More accurately, since the pod is rejected at pre-bind, the scheduler will retry scheduling it. The scheduler may retry the pod one or more times before we check the number of times unreserve is called. In order to make the test more robust, please change the condition to:

if unresPlugin.numUnreserveCalled == 0 || unresPlugin.numUnreserveCalled != pbdPlugin.numPrebindCalled
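To illustrate why the extra `numUnreserveCalled == 0` guard matters, here is a small self-contained sketch (the counter variables are hypothetical stand-ins, not the real test code): the scheduler may retry the rejected pod one or more times before the counters are inspected, so asserting an exact attempt count is flaky, while asserting equality plus at-least-one-call is stable across any number of retries.

```go
package main

import "fmt"

func main() {
	// The number of scheduling attempts before the test's check runs is
	// nondeterministic. The invariant is: each failed pre-bind triggers
	// exactly one unreserve call, and at least one attempt has happened.
	for _, attempts := range []int{1, 2, 5} {
		numPrebindCalled := attempts   // pre-bind rejects on every attempt
		numUnreserveCalled := attempts // one unreserve per rejection
		// Brittle check: assumes exactly one attempt occurred.
		brittleFails := numUnreserveCalled != 1
		// Robust check: only fails if the invariant itself is violated.
		robustFails := numUnreserveCalled == 0 ||
			numUnreserveCalled != numPrebindCalled
		fmt.Printf("attempts=%d brittleFails=%v robustFails=%v\n",
			attempts, brittleFails, robustFails)
	}
}
```

The brittle check fires as soon as the scheduler happens to retry more than once, while the robust check passes for any positive number of attempts.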

Contributor Author


Agree. Condition changed. PTAL

@liggitt liggitt removed their assignment May 10, 2019
Member

@bsalamat bsalamat left a comment


Thanks, @danielqsj!
Please squash commits.

/lgtm

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels May 10, 2019
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 10, 2019
@danielqsj
Contributor Author

@bsalamat squashed. PTAL, thanks

Member

@bsalamat bsalamat left a comment


/lgtm
/approve

Thanks, @danielqsj!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 10, 2019
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bsalamat, danielqsj

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 10, 2019
@k8s-ci-robot k8s-ci-robot merged commit b9ccdd2 into kubernetes:master May 10, 2019
5 participants