Add Un-reserve extension point for the scheduling framework #77598

Merged
merged 1 commit into from May 10, 2019

Conversation

@danielqsj
Member

commented May 8, 2019

What type of PR is this?

/kind feature
/sig scheduling

What this PR does / why we need it:

Add Un-reserve extension point for the scheduling framework

Which issue(s) this PR fixes:

Fixes #77288 #77573

Special notes for your reviewer:

The former PR is #77457, but the TestUnreservePlugin test it introduced is flaky (ref #77573), so #77577 reverted it.

This PR reintroduces the Un-reserve extension point and fixes TestUnreservePlugin.

Does this PR introduce a user-facing change?:

Add Un-reserve extension point for the scheduling framework.
@danielqsj

Member Author

commented May 8, 2019

@liggitt

Member

commented May 8, 2019

can you separate out the changes made that address the flake into their own commit (straight un-revert in one commit, then flake fixes in a second)?

@danielqsj danielqsj force-pushed the danielqsj:unreserve branch from eae0b65 to 3599159 May 8, 2019

@neolit123
Member

left a comment

thanks for the update on the t.Errorf messages.

@danielqsj

Member Author

commented May 8, 2019

@liggitt yeah. Thanks for your suggestion, it's easier to review this way.

if err = wait.Poll(10*time.Millisecond, 30*time.Second, podSchedulingError(cs, pod.Namespace, pod.Name)); err != nil {
t.Errorf("test #%v: Expected a scheduling error, but didn't get it. error: %v", i, err)
}
if unresPlugin.numUnreserveCalled != pbdPlugin.numPrebindCalled {

@bsalamat

bsalamat May 8, 2019

Member

Is there any explanation why the unreserve plugin might be called more than once?

@danielqsj

danielqsj May 9, 2019

Author Member

From the log, I find that the test pod is sometimes scheduled twice before it is cleaned up.

I0509 00:49:27.662644    7237 wrap.go:47] POST /api/v1/namespaces/test-2/pods: (1.030311ms) 201 [scheduler.test/v0.0.0 (linux/amd64) kubernetes/$Format 127.0.0.1:42654]
I0509 00:49:27.662832    7237 scheduling_queue.go:795] About to try and schedule pod test-2/test-pod
I0509 00:49:27.662850    7237 scheduler.go:452] Attempting to schedule pod: test-2/test-pod
I0509 00:49:27.662962    7237 scheduler_binder.go:256] AssumePodVolumes for pod "test-2/test-pod", node "test-node-0"
I0509 00:49:27.662980    7237 scheduler_binder.go:266] AssumePodVolumes for pod "test-2/test-pod", node "test-node-0": all PVCs bound and nothing to do
numPrebindCalled++ -> 1
I0509 00:49:27.663012    7237 framework.go:85] rejected by prebind-plugin at prebind: reject pod test-pod
E0509 00:49:27.663025    7237 factory.go:662] Error scheduling test-2/test-pod: rejected by prebind-plugin at prebind: reject pod test-pod; retrying
I0509 00:49:27.663043    7237 factory.go:720] Updating pod condition for test-2/test-pod to (PodScheduled==False, Reason=Unschedulable)
I0509 00:49:27.668967    7237 wrap.go:47] POST /api/v1/namespaces/test-2/events: (5.354584ms) 201 [scheduler.test/v0.0.0 (linux/amd64) kubernetes/$Format 127.0.0.1:42656]
I0509 00:49:27.669020    7237 wrap.go:47] GET /api/v1/namespaces/test-2/pods/test-pod: (5.473273ms) 200 [scheduler.test/v0.0.0 (linux/amd64) kubernetes/$Format 127.0.0.1:42652]
I0509 00:49:27.669169    7237 wrap.go:47] PUT /api/v1/namespaces/test-2/pods/test-pod/status: (5.935527ms) 200 [scheduler.test/v0.0.0 (linux/amd64) kubernetes/$Format 127.0.0.1:42654]
I0509 00:49:27.669322    7237 scheduling_queue.go:795] About to try and schedule pod test-2/test-pod
I0509 00:49:27.669347    7237 scheduler.go:452] Attempting to schedule pod: test-2/test-pod
numUnreserveCalled++ -> 1
I0509 00:49:27.669550    7237 scheduler_binder.go:256] AssumePodVolumes for pod "test-2/test-pod", node "test-node-0"
I0509 00:49:27.669570    7237 scheduler_binder.go:266] AssumePodVolumes for pod "test-2/test-pod", node "test-node-0": all PVCs bound and nothing to do
numPrebindCalled++ -> 2
I0509 00:49:27.669609    7237 framework.go:85] rejected by prebind-plugin at prebind: reject pod test-pod
E0509 00:49:27.669618    7237 factory.go:662] Error scheduling test-2/test-pod: rejected by prebind-plugin at prebind: reject pod test-pod; retrying
I0509 00:49:27.669638    7237 factory.go:720] Updating pod condition for test-2/test-pod to (PodScheduled==False, Reason=Unschedulable)
numUnreserveCalled++ -> 2
I0509 00:49:27.671425    7237 wrap.go:47] GET /api/v1/namespaces/test-2/pods/test-pod: (1.537755ms) 200 [scheduler.test/v0.0.0 (linux/amd64) kubernetes/$Format 127.0.0.1:42652]
E0509 00:49:27.671603    7237 factory.go:686] pod is already present in unschedulableQ
I0509 00:49:27.671793    7237 wrap.go:47] PATCH /api/v1/namespaces/test-2/events/test-pod.159cf4497ef3b14b: (1.799554ms) 200 [scheduler.test/v0.0.0 (linux/amd64) kubernetes/$Format 127.0.0.1:42656]
I0509 00:49:27.764167    7237 wrap.go:47] GET /api/v1/namespaces/test-2/pods/test-pod: (739.161µs) 200 [scheduler.test/v0.0.0 (linux/amd64) kubernetes/$Format 127.0.0.1:42656]
reset numUnreserveCalled -> 0
reset numPrebindCalled -> 0
I0509 00:49:27.766347    7237 scheduling_queue.go:795] About to try and schedule pod test-2/test-pod
I0509 00:49:27.766376    7237 scheduler.go:448] Skip schedule deleting pod: test-2/test-pod
I0509 00:49:27.771321    7237 wrap.go:47] POST /api/v1/namespaces/test-2/events: (4.727587ms) 201 [scheduler.test/v0.0.0 (linux/amd64) kubernetes/$Format 127.0.0.1:42652]
I0509 00:49:27.771869    7237 wrap.go:47] DELETE /api/v1/namespaces/test-2/pods/test-pod: (7.36413ms) 200 [scheduler.test/v0.0.0 (linux/amd64) kubernetes/$Format 127.0.0.1:42656]
I0509 00:49:27.774353    7237 wrap.go:47] GET /api/v1/namespaces/test-2/pods/test-pod: (1.067354ms) 404 [scheduler.test/v0.0.0 (linux/amd64) kubernetes/$Format 127.0.0.1:42656]
danieldebug: pod is deleted

@danielqsj

danielqsj May 9, 2019

Author Member

@bsalamat This is due to a race between:
1. scheduling_queue retrying to schedule the test pod
2. cleanupPods

What we can do here is check whether the number of prebind failures equals the number of unreserve calls. I think that's safe and by design, right?
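
The race described here can be modeled in a few lines. This is only an illustrative sketch, not code from the PR: the `counters` type, `scheduleOnce`, and `retry` are hypothetical names, and the model assumes that every prebind rejection is followed by exactly one unreserve call. Under that assumption, the number of retries before cleanup is unpredictable, but the two counters stay equal.

```go
package main

// counters is a hypothetical stand-in for the call counters kept by the
// test's prebind and unreserve plugins.
type counters struct {
	numPrebindCalled   int
	numUnreserveCalled int
}

// scheduleOnce models a single scheduling attempt in which the prebind
// plugin rejects the pod and the framework then runs unreserve.
func (c *counters) scheduleOnce() {
	c.numPrebindCalled++   // prebind runs and rejects the pod
	c.numUnreserveCalled++ // unreserve runs because prebind failed
}

// retry models the scheduling queue retrying the pod `attempts` times
// before cleanup wins the race and deletes the pod.
func retry(attempts int) counters {
	var c counters
	for i := 0; i < attempts; i++ {
		c.scheduleOnce()
	}
	return c
}
```

Calling `retry(1)` or `retry(2)` corresponds to the pod being scheduled once or twice before cleanup; in both cases the counters match, which is why comparing them is more stable than expecting an exact count.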

@bsalamat

bsalamat May 9, 2019

Member

Correct. This is expected. Thanks for checking. More accurately, since the pod is rejected at pre-bind, the scheduler will retry scheduling it. The scheduler may retry the pod one or more times before we check the number of times unreserve is called. In order to make the test more robust, please change the condition to:

if unresPlugin.numUnreserveCalled == 0 || unresPlugin.numUnreserveCalled != pbdPlugin.numPrebindCalled
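
To make the suggested condition concrete, here is a small standalone sketch. The helper name `checkUnreserveCounts` and its error message are assumptions for illustration; the actual test reads the counters directly from `unresPlugin` and `pbdPlugin` and reports via `t.Errorf`.

```go
package main

import "fmt"

// checkUnreserveCounts mirrors the condition suggested above: it reports a
// failure when unreserve was never called, or when its call count differs
// from the number of prebind calls. (Hypothetical helper, not from the PR.)
func checkUnreserveCounts(numUnreserveCalled, numPrebindCalled int) error {
	if numUnreserveCalled == 0 || numUnreserveCalled != numPrebindCalled {
		return fmt.Errorf("expected unreserve to be called %d time(s), got %d",
			numPrebindCalled, numUnreserveCalled)
	}
	return nil
}
```

The `== 0` clause matters because a bare equality check would also pass when neither plugin ran at all (0 == 0), which would hide a scheduler that never attempted the pod.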

@danielqsj

danielqsj May 10, 2019

Author Member

Agreed. Condition changed. PTAL

@liggitt liggitt removed their assignment May 10, 2019

@bsalamat
Member

left a comment

Thanks, @danielqsj!
Please squash commits.

/lgtm

@danielqsj danielqsj force-pushed the danielqsj:unreserve branch from ea59c86 to 997648a May 10, 2019

@k8s-ci-robot k8s-ci-robot removed the lgtm label May 10, 2019

@danielqsj

Member Author

commented May 10, 2019

@bsalamat squashed. PTAL, thanks

@bsalamat
Member

left a comment

/lgtm
/approve

Thanks, @danielqsj!

@k8s-ci-robot k8s-ci-robot added the lgtm label May 10, 2019

@k8s-ci-robot

Contributor

commented May 10, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bsalamat, danielqsj

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit b9ccdd2 into kubernetes:master May 10, 2019

20 checks passed

cla/linuxfoundation danielqsj authorized
Details
pull-kubernetes-bazel-build Job succeeded.
Details
pull-kubernetes-bazel-test Job succeeded.
Details
pull-kubernetes-conformance-image-test Skipped.
pull-kubernetes-cross Skipped.
pull-kubernetes-dependencies Job succeeded.
Details
pull-kubernetes-e2e-gce Job succeeded.
Details
pull-kubernetes-e2e-gce-100-performance Job succeeded.
Details
pull-kubernetes-e2e-gce-csi-serial Skipped.
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
Details
pull-kubernetes-e2e-gce-storage-slow Skipped.
pull-kubernetes-godeps Skipped.
pull-kubernetes-integration Job succeeded.
Details
pull-kubernetes-kubemark-e2e-gce-big Job succeeded.
Details
pull-kubernetes-local-e2e Skipped.
pull-kubernetes-node-e2e Job succeeded.
Details
pull-kubernetes-typecheck Job succeeded.
Details
pull-kubernetes-verify Job succeeded.
Details
pull-publishing-bot-validate Skipped.
tide In merge pool.
Details

@draveness draveness referenced this pull request Jun 3, 2019

Open

Scheduling Framework #624
