
Pick node for preemption based on start time of pods #74974

Merged

Conversation

@goodluckbot
Contributor

commented Mar 5, 2019

What type of PR is this?
/kind feature

What this PR does / why we need it:
While picking a node for preemption, we can break ties based on the start time of the victim pods, i.e., the node whose victims have the latest start time is picked.
This is because a pod that has been running for a longer time may be more valuable than a pod that has only been running for a short time, especially in batch job scenarios.

The node selection method will change to the sequence below (a rough sketch follows the list):
// 1. A node with minimum number of PDB violations.
// 2. A node with minimum highest priority victim is picked.
// 3. Ties are broken by sum of priorities of all victims.
// 4. If there are still ties, node with the minimum number of victims is picked.
// 5. If there are still ties, node with the latest start time of all highest priority victims is picked.
// 6. If there are still ties, the first such node is picked (sort of randomly).
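
Roughly, step 5 could look like the sketch below. It is illustrative only: the helper and parameter names are assumed, and unlike the real change this simplified helper does not restrict itself to the highest-priority victims.

    package preemption

    import (
        v1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // earliestStartTime returns the earliest StartTime among the given victim pods,
    // or nil if none of them has a StartTime set.
    func earliestStartTime(victims []*v1.Pod) *metav1.Time {
        var earliest *metav1.Time
        for _, pod := range victims {
            if pod.Status.StartTime == nil {
                continue
            }
            if earliest == nil || pod.Status.StartTime.Before(earliest) {
                earliest = pod.Status.StartTime
            }
        }
        return earliest
    }

    // pickByLatestStartTime applies tie-break step 5: among the remaining candidate
    // nodes, pick the one whose victims have the latest earliest start time, so that
    // longer-running pods are less likely to be preempted.
    // candidates must be non-empty.
    func pickByLatestStartTime(candidates []string, nodesToVictims map[string][]*v1.Pod) string {
        best := candidates[0]
        latest := earliestStartTime(nodesToVictims[best])
        for _, node := range candidates[1:] {
            t := earliestStartTime(nodesToVictims[node])
            if t == nil {
                continue
            }
            if latest == nil || t.After(latest.Time) {
                latest = t
                best = node
            }
        }
        return best
    }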

Also, I noticed there is a scheduler framework being designed; I'm not sure whether the tie-breaking in this PR would be better implemented using that future scheduler framework.

cc @bsalamat @k82cn @resouer
thanks.

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:
Not user-facing

NONE
@k8s-ci-robot

Contributor

commented Mar 5, 2019

Hi @goodluckbot. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot requested review from k82cn and wgliang Mar 5, 2019

Review threads (outdated, resolved): pkg/scheduler/core/generic_scheduler.go, pkg/scheduler/core/generic_scheduler_test.go
@goodluckbot

Contributor Author

commented Mar 7, 2019

Thanks @bsalamat @wgliang for reviewing. Updated the code, PTAL.


for _, pod := range victims.Pods {
	if *pod.Spec.Priority < highestPriority {
		// break since the victims are ranked from highest priority to lowest
		break
	}
}

@bsalamat

bsalamat Mar 7, 2019

Member

This is brittle. Your logic in this function makes a strong assumption about the order of victims, which is implemented in another function.
Since this function is not called frequently and we don't expect the number of victims to be large (often fewer than 10), I don't think it is worth optimizing the logic at the expense of making it error-prone.
I would remove the break and just compare the priority; if it is greater than or equal to the highest priority, update the start time if needed.
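
A minimal sketch of this suggestion (the function name and types are assumed, not necessarily the final code in the PR; imports as in the earlier sketch, k8s.io/api/core/v1 and k8s.io/apimachinery/pkg/apis/meta/v1): scan every victim without relying on any ordering; a strictly higher priority resets the tracked start time, and an equal priority only moves it earlier.

    // earliestStartTimeOfHighestPriorityVictims returns the earliest StartTime among
    // the victims that have the highest priority, without assuming the slice is sorted.
    func earliestStartTimeOfHighestPriorityVictims(victims []*v1.Pod) *metav1.Time {
        var earliest *metav1.Time
        var highestPriority int32
        for _, pod := range victims {
            if pod.Spec.Priority == nil || pod.Status.StartTime == nil {
                continue
            }
            switch {
            case earliest == nil || *pod.Spec.Priority > highestPriority:
                // A strictly higher priority (or the first usable victim) resets both values.
                highestPriority = *pod.Spec.Priority
                earliest = pod.Status.StartTime
            case *pod.Spec.Priority == highestPriority && pod.Status.StartTime.Before(earliest):
                // Same priority: keep the earlier start time.
                earliest = pod.Status.StartTime
            }
        }
        return earliest
    }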

@goodluckbot

goodluckbot Mar 8, 2019

Author Contributor

Agreed. Especially if there is a scheduler extender, the victim pods returned from the extender may not be sorted by priority. Updated the code, PTAL, thanks.

@bsalamat
Member

left a comment

Thanks. Looks good. A few minor comments.

Review threads (outdated, resolved): pkg/scheduler/util/utils.go, pkg/scheduler/core/generic_scheduler.go
@goodluckbot

Contributor Author

commented Mar 9, 2019

Thanks @bsalamat, updated the PR, PTAL

Review threads (outdated, resolved): pkg/scheduler/core/generic_scheduler.go, pkg/scheduler/util/utils.go
@wgliang

Member

commented Mar 11, 2019

@goodluckbot Please fill out the update note.
/ok-to-test

@goodluckbot

Contributor Author

commented Mar 11, 2019

@goodluckbot Please fill out the update note.
/ok-to-test

Can you please point me to the note that I should fill in? Thanks.

@bsalamat
Member

left a comment

/lgtm

Thanks, @goodluckbot!

@wgliang

Member

commented Mar 11, 2019

/lgtm

@bsalamat

Member

commented Mar 12, 2019

@goodluckbot could you please squash commits?

@goodluckbot goodluckbot force-pushed the goodluckbot:pick-node-preempt-start-time branch from db52a5d to 8d991e6 Mar 12, 2019

@k8s-ci-robot k8s-ci-robot removed the lgtm label Mar 12, 2019

@goodluckbot

Contributor Author

commented Mar 14, 2019

@bsalamat, @wgliang squashed the commits, could you PTAL? Thanks.

@wgliang

Member

commented Mar 14, 2019

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label Mar 14, 2019

@bsalamat
Member

left a comment

/approve

Thanks, @goodluckbot!

@k8s-ci-robot

Contributor

commented Mar 15, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bsalamat, goodluckbot

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 8592098 into kubernetes:master Mar 20, 2019

17 checks passed

cla/linuxfoundation goodluckbot authorized
pull-kubernetes-bazel-build Job succeeded.
pull-kubernetes-bazel-test Job succeeded.
pull-kubernetes-conformance-image-test Skipped.
pull-kubernetes-cross Skipped.
pull-kubernetes-e2e-gce Job succeeded.
pull-kubernetes-e2e-gce-100-performance Job succeeded.
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
pull-kubernetes-godeps Skipped.
pull-kubernetes-integration Job succeeded.
pull-kubernetes-kubemark-e2e-gce-big Job succeeded.
pull-kubernetes-local-e2e Skipped.
pull-kubernetes-node-e2e Job succeeded.
pull-kubernetes-typecheck Job succeeded.
pull-kubernetes-verify Job succeeded.
pull-publishing-bot-validate Skipped.
tide In merge pool.

latestStartTime := util.GetEarliestPodStartTime(nodesToVictims[minNodes2[0]])
if latestStartTime == nil {
	// If the earliest start time of all pods on the 1st node is nil, just return it,
	// which is not expected to happen.
	klog.Errorf("earliestStartTime is nil for node %s. Should not reach here.", minNodes2[0])
	return minNodes2[0]
}

@tedyu

tedyu Mar 20, 2019

Contributor

Should we check the second node's start time (in this case)?

@goodluckbot

goodluckbot Mar 20, 2019

Author Contributor

I think this is a case which should not be reached; the log says "Should not reach here".

@tedyu

tedyu Mar 20, 2019

Contributor

Later on line 925:

                        klog.Errorf("earliestStartTime is nil for node %s. Should not reach here.", node)
                        continue

I think using a loop for finding start time around line 913 would make the code more consistent.
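
A minimal sketch of that suggestion (variable and helper names are assumed; uses k8s.io/klog and metav1 from apimachinery): treat every candidate node, including the first, in one loop, so the nil start time case is logged and skipped in a single place.

    // pickNodeWithLatestStartTime handles all candidates uniformly instead of
    // special-casing minNodes2[0] before the loop. candidates must be non-empty.
    func pickNodeWithLatestStartTime(candidates []string, earliestStartTimeOf func(node string) *metav1.Time) string {
        best := candidates[0]
        var latest *metav1.Time
        for _, node := range candidates {
            t := earliestStartTimeOf(node)
            if t == nil {
                klog.Errorf("earliestStartTime is nil for node %s. Should not reach here.", node)
                continue
            }
            if latest == nil || t.After(latest.Time) {
                latest = t
                best = node
            }
        }
        return best
    }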

@bsalamat

bsalamat Mar 20, 2019

Member

@tedyu latestStartTime should never be nil. If for some unexpected reason it becomes nil, we prefer to see an error message and investigate the issue instead of tolerating the error.

@tedyu

tedyu Mar 20, 2019

Contributor

In that case, panicking would be the proper choice in this scenario.

@goodluckbot

goodluckbot Mar 20, 2019

Author Contributor

Hi @tedyu, IMO a nil timestamp is unlikely to happen, and if it does happen, we can still fall back by returning minNodes2[0], which is the same as before, when we didn't take the "start time of pods" into account.

@hex108

hex108 Mar 21, 2019

Member

In that case, panic'ing would be proper choice in this scenario.

Some test cases would panic because they may not set StartTime :(

@tedyu

tedyu Mar 21, 2019

Contributor

Interesting.
In another popular Apache project, my test encountered an NPE because the test(s) passed null to configure() (in production it wouldn't be null).
I am modifying the test not to do that.

For k8s, we should do the same: tests should mimic what happens in production.
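
For instance, a test fixture along these lines (the helper name is made up for illustration) keeps the start-time tie-break from ever seeing a nil timestamp:

    package preemption

    import (
        "time"

        v1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // testPodWithStartTime builds a pod the way it looks in production: priority set
    // and Status.StartTime populated, so start-time tie-breaking never sees nil.
    func testPodWithStartTime(name string, priority int32, started time.Time) *v1.Pod {
        return &v1.Pod{
            ObjectMeta: metav1.ObjectMeta{Name: name},
            Spec:       v1.PodSpec{Priority: &priority},
            Status: v1.PodStatus{
                StartTime: &metav1.Time{Time: started},
            },
        }
    }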
