
optimise defaultpreemption: enumerate fewer candidates #94814

Merged · 1 commit merged into kubernetes:master on Nov 6, 2020

Conversation

@adtac (Member) commented Sep 15, 2020

What type of PR is this?

/kind feature

What this PR does / why we need it: Instead of considering all nodes as preemption candidates, dry run preemption on a smaller pool of nodes and stop once a threshold number of candidates has been found. In the average case (a good number of nodes are preemptible), this optimisation produces very good results (~25.3% throughput improvement). In the worst case (preemption is impossible), this approach does no better and no worse than the in-tree approach. The improvement is especially pronounced in large clusters (> 1k nodes); clusters smaller than 100 nodes will not see any improvement from this optimisation.

The following benchmarks were performed on a 5,000 node cluster where 20,000 low priority pods are scheduled first, followed by 5,000 high priority pods (test run time is ~10 minutes). The high priority pods do not fit on any node without preempting one or more low priority pods. See the in-tree scheduler_perf integration benchmark for more details. Throughput percentile numbers are left out because the median is 0, which makes them fairly meaningless.

| Quantity | Before | After | Change |
| --- | --- | --- | --- |
| Throughput (average) | 13.12 pods/sec | 16.45 pods/sec | +25.4% |
| preemption_evaluation_seconds (average) | 37.96ms | 30.20ms | -20.4% |
| preemption_evaluation_seconds (p50) | 7.12ms | 6.22ms | -12.6% |
| preemption_evaluation_seconds (p90) | 30.42ms | 24.01ms | -21.1% |
| preemption_evaluation_seconds (p99) | 57.65ms | 41.75ms | -27.6% |
| Time spent in main thread (fraction of test time) | 6.18% | 2.51% | -59.3% |
| Time spent in parallel code (fraction of test time) | 16.26% | 6.25% | -61.6% |

Full diff (minus is after optimisation, plus is before): http://ix.io/2xHm

(Before/after screenshots: scheduler_perf test CPU profiles, 2020-09-15.)

No performance regression in 100 node tests (in fact, throughput is up slightly by 9%, not sure how).

Which issue(s) this PR fixes:

Ref #89036

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Fewer candidates are enumerated for preemption to improve performance in large clusters

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


/sig scheduling
/cc @alculquicondor @ahg-g

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Sep 15, 2020
@k8s-ci-robot (Contributor):

Hi @adtac. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@adtac (Member, Author) commented Sep 15, 2020

Will add/modify tests tomorrow.

@ahg-g (Member) commented Sep 16, 2020

/ok-to-test
/assign @Huang-Wei

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 16, 2020
@alculquicondor (Member):

Did you add a new performance test case? Or are the above numbers from the existing ones?

Also, avoid "fixes #123" in the description, as that will cause GitHub to close the issue once this merges.

@ahg-g (Member) commented Sep 16, 2020

Just noticed a problem with the windowing approach: it breaks the promise of picking "A node with minimum number of PDB violations":

// 1. A node with minimum number of PDB violations.

I am not sure we want to break that promise.

@alculquicondor (Member):

I am not sure we want to break that promise.

I think it's ok to break it as long as the default is kept at 100%

@ahg-g (Member) commented Sep 16, 2020

I am not sure we want to break that promise.

I think it's ok to break it as long as the default is kept at 100%

I think we can modify the promise in the following way:

  1. the scheduler is free to select any node where the evicted pods don't violate pod disruption budget (the minimum within the window)
  2. if we can't find a node where the evicted pods don't violate pdb, then we have to select one where the evicted pods have the lowest impact on pdb across all nodes

@Huang-Wei (Member):

I think we can modify the promise in the following way:

  1. the scheduler is free to select any node where the evicted pods don't violate pod disruption budget (the minimum within the window)
  2. if we can't find a node where the evicted pods don't violate pdb, then we have to select one where the evicted pods have the lowest impact on pdb across all nodes

The semantics look good to me; however, in practice we should evaluate the difficulty of implementing them. If it's too much work (I'd assume we need to know in advance which nodes don't have pods violating PDBs, so as to prioritize them in the search scope), maybe it's ok to break the promise - we're just going over the disruption budget by 1, rather than totally breaking PDB enforcement (the semantics of PDB are guarded by the API server).

@alculquicondor (Member):

The semantics looks good to me, however, in practice, we should evaluate the difficulty of getting them implemented.

It shouldn't be hard at all. It's a matter of continuing to iterate while:

  • a node with zero violations is not found, OR
  • percentage has not been reached.

@adtac (Member, Author) commented Sep 16, 2020

Yes, it's just a matter of changing the context cancel condition slightly; we already have the PDB violation data for each node to make that decision. I'm still working on the component config changes haha, they're pretty confusing :)

@Huang-Wei (Member):

It shouldn't be hard at all. It's a matter of continuing iterating while:

  • a node with zero violations is not found, OR
  • percentage has not been reached.

What if the percentage has been reached, and all so-far-calculated candidates have non-zero violations? Would you continue searching?

@adtac (Member, Author) commented Sep 16, 2020

What if the percentage has been reached, and all so-far-calculated candidates have non-zero violations? Would you continue searching?

Yes. The exit condition is basically len(candidates) > threshold && nonViolatingNodeFound. In the worst case (all nodes need to be searched), we don't do any better/worse than the in-tree approach. In the average case, we check fewer nodes.
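A minimal sketch of that exit condition in plain, sequential Go (hypothetical candidate type and simulate callback; the real code runs the dry runs in parallel and cancels a context rather than breaking out of a loop):

```go
package main

import "fmt"

// candidate is a hypothetical stand-in for one node's dry-run result.
type candidate struct {
	node             string
	numPDBViolations int
}

// dryRunPreemption sketches the exit condition described above: keep
// scanning nodes until we have at least `threshold` candidates AND at
// least one of them violates no PDBs; in the worst case every node is
// visited, matching the previous behaviour.
func dryRunPreemption(nodes []string, threshold int, simulate func(string) (candidate, bool)) []candidate {
	var candidates []candidate
	nonViolatingNodeFound := false
	for _, n := range nodes {
		c, ok := simulate(n) // hypothetical per-node dry run
		if !ok {
			continue // preempting on this node cannot make the pod fit
		}
		candidates = append(candidates, c)
		if c.numPDBViolations == 0 {
			nonViolatingNodeFound = true
		}
		if len(candidates) >= threshold && nonViolatingNodeFound {
			break // enough candidates, and at least one is PDB-clean
		}
	}
	return candidates
}

func main() {
	nodes := []string{"n1", "n2", "n3", "n4", "n5"}
	simulate := func(n string) (candidate, bool) { return candidate{node: n}, true }
	fmt.Println(len(dryRunPreemption(nodes, 2, simulate))) // 2: stops early
}
```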

@Huang-Wei (Member):

Yes. The exit condition is basically len(candidates) > threshold && nonViolatingNodeFound. In the worst case (all nodes need to be searched), we don't do any better/worse than the in-tree approach. In the average case, we check fewer nodes.

SG.

@Huang-Wei (Member):

@liggitt could you help review/approve the API changes? Thanks!

pkg/scheduler/apis/config/types_pluginargs.go (outdated review thread, resolved)
// validateMinCandidateNodesPercentage validates that
// minCandidateNodesPercentage is within the allowed range.
func validateMinCandidateNodesPercentage(minCandidateNodesPercentage int32) error {
	if minCandidateNodesPercentage < 0 || minCandidateNodesPercentage > 100 {
Reviewer (Member):

what does a 0 percent minimum mean? consider 0 candidate nodes? wouldn't that break the scheduler?

Author (Member):

the absolute config parameter has precedence over the percentage one, so one can use minCandidateNodesPercentage = 0% to denote that they only want to use an absolute minimum

there's a problem only if the absolute and the percentage parameters are both zero, as that would break things; we check for that scenario in ValidateDefaultPreemptionArgs
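For illustration, a standalone sketch of the checks being described (hypothetical function; the actual validation lives in the scheduler's component config package):

```go
package main

import (
	"fmt"
	"os"
)

// validateDefaultPreemptionArgs sketches the rules discussed above:
// the percentage must lie in [0, 100], the absolute value must be
// non-negative, and the two knobs must not both be zero, since that
// would leave the scheduler with no candidates to dry run.
func validateDefaultPreemptionArgs(minCandidateNodesPercentage, minCandidateNodesAbsolute int32) error {
	if minCandidateNodesPercentage < 0 || minCandidateNodesPercentage > 100 {
		return fmt.Errorf("minCandidateNodesPercentage is not in the range [0, 100]")
	}
	if minCandidateNodesAbsolute < 0 {
		return fmt.Errorf("minCandidateNodesAbsolute is negative")
	}
	if minCandidateNodesPercentage == 0 && minCandidateNodesAbsolute == 0 {
		return fmt.Errorf("minCandidateNodesPercentage and minCandidateNodesAbsolute cannot both be zero")
	}
	return nil
}

func main() {
	if err := validateDefaultPreemptionArgs(0, 0); err != nil {
		fmt.Fprintln(os.Stderr, err) // rejected: both knobs are zero
	}
}
```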

// validateMinCandidateNodesAbsolute validates that minCandidateNodesAbsolute
// is within the allowed range.
func validateMinCandidateNodesAbsolute(minCandidateNodesAbsolute int32) error {
	if minCandidateNodesAbsolute < 0 {
Reviewer (Member):

what does a 0 minimum mean? consider 0 candidate nodes? wouldn't that break the scheduler?

Author (Member):

setting minCandidateNodesAbsolute = 0 can be used to turn off the absolute knob entirely. For example, if an operator wants only 10% of the cluster size to be evaluated, with no absolute minimum, they can set the percentage to 10 and the absolute value to 0

see the previous comment regarding the absolute and percentage parameters both being zero
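As a rough illustration, here is one plausible way the two knobs could combine into a candidate count, assuming the percentage applies to the number of potential nodes, the absolute value acts as a floor, and the node count is a ceiling; the exact formula is in the PR diff:

```go
package main

import "fmt"

// numCandidates sketches one plausible way to combine the two knobs:
// take the configured percentage of the potential nodes, never go
// below the absolute floor, and never exceed the number of nodes.
func numCandidates(numNodes, minCandidateNodesPercentage, minCandidateNodesAbsolute int32) int32 {
	n := numNodes * minCandidateNodesPercentage / 100
	if n < minCandidateNodesAbsolute {
		n = minCandidateNodesAbsolute
	}
	if n > numNodes {
		n = numNodes
	}
	return n
}

func main() {
	// "10% of the cluster, no absolute minimum" from the comment above:
	fmt.Println(numCandidates(5000, 10, 0)) // 500
	// An absolute floor keeps small clusters from searching too few nodes:
	fmt.Println(numCandidates(50, 10, 100)) // 50 (capped at the node count)
}
```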

return nil, err
}

offset, numCandidates := pl.updateAndReturnOffset(int32(len(potentialNodes)))
Reviewer (Member):

would it be simpler to pick a random start point rather than making *DefaultPreemption stateful?

Author (Member):

we did consider that, but the round-robin-like approach guarantees that preemption is distributed evenly across all nodes, whereas a random approach could produce results that are confusing to the end user (e.g. the same node getting picked for preemption twice in a row)

but I totally see why a random starting index might be nice to have too -- this is something that's certainly worth exploring in the future once we have more feedback from users

Reviewer (Member):

picking the same node twice isn't an issue (not doing so is not part of the contract); it could happen with the existing logic as well.

@adtac (Member, Author) commented Oct 8, 2020:

it could happen with the existing logic as well

picking the same node twice in a row could happen, but if it does, it also guarantees that all other nodes have been tried. While this isn't a formal guarantee on the plugin's behalf, it's a nice thing to have

Reviewer (Member):

making this stateful and self-modifying makes it a lot harder to reason about (especially in terms of thread safety and data races)

Author (Member):

Done. Switched to a random offset. There is no difference in throughput between random and moving offset in integration benchmarks (as expected).
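A minimal sketch of the random-offset iteration with wraparound (generic Go, not the PR's exact code):

```go
package main

import (
	"fmt"
	"math/rand"
)

// visitFromRandomOffset walks every node at most once, starting at a
// random index and wrapping around, so no fixed prefix of the node
// list is favoured by repeated preemption attempts.
func visitFromRandomOffset(nodes []string, visit func(node string) (stop bool)) {
	if len(nodes) == 0 {
		return
	}
	offset := rand.Intn(len(nodes))
	for i := 0; i < len(nodes); i++ {
		if visit(nodes[(offset+i)%len(nodes)]) {
			return
		}
	}
}

func main() {
	visitFromRandomOffset([]string{"n1", "n2", "n3"}, func(node string) bool {
		fmt.Println("dry running preemption on", node)
		return false // return true once enough candidates have been found
	})
}
```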

// get returns the internal candidate array. This function is NOT atomic and
// assumes that the caller has acquired mutual exclusion on the list.
func (cl *candidateList) get() []Candidate {
	return cl.items[:cl.size()]
Reviewer (Member):

returning a subslice here means that callers can modify members of cl.items beyond cl.size() which will be overwritten if cl.add is called again (see https://play.golang.org/p/Uw_cq4NK4Tz)
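As a self-contained illustration of the aliasing concern (generic Go; items and size stand in for cl.items and cl.size(), mirroring what the playground link demonstrates):

```go
package main

import "fmt"

func main() {
	// items and size stand in for cl.items and cl.size().
	items := make([]string, 4)
	items[0], items[1] = "a", "b"
	size := 2

	got := items[:size]              // what get() returns
	got = append(got, "caller-adds") // len 2, cap 4: writes items[2], beyond size

	// A later add() reuses the same backing slot and overwrites it.
	items[size] = "added-later"
	size++

	fmt.Println(got)          // [a b added-later]: the caller's element is gone
	fmt.Println(items[:size]) // [a b added-later]
}
```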

Author (Member):

IMO this isn't a concern. get() is guaranteed to be called after all add() operations, but perhaps it's worth adding this as a note to the function documentation -- done.

Reviewer (Member):

making the other methods atomically safe implies multiple goroutines are interacting with this object... are all parallel operations guaranteed to be stopped before this is called or can a timeout return control to the get() caller while some background operations are still running?

Author (Member):

are all parallel operations guaranteed to be stopped before this is called

Yes, parallelize.Until will guarantee that all goroutines interacting with this candidateList have completed before this is called.
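A minimal sketch of that contract, assuming the parallel helper behaves like client-go's workqueue.ParallelizeUntil, which blocks until every work piece has finished:

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	nodes := []string{"n1", "n2", "n3", "n4", "n5"}

	// Fixed-size slice plus an atomically advanced cursor, mirroring the
	// candidateList idea: add() runs concurrently, get() only afterwards.
	items := make([]string, len(nodes))
	var size int32

	workqueue.ParallelizeUntil(context.Background(), 4, len(nodes), func(i int) {
		// the per-node preemption dry run would happen here
		idx := atomic.AddInt32(&size, 1) - 1
		items[idx] = nodes[i]
	})

	// ParallelizeUntil has already waited for every worker goroutine, so
	// reading size and the subslice here needs no extra synchronization.
	fmt.Println(items[:atomic.LoadInt32(&size)])
}
```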

pkg/scheduler/apis/config/v1beta1/defaults.go (review thread, resolved)
@adtac (Member, Author) commented Oct 7, 2020

/test pull-kubernetes-e2e-kind

}

// size returns the number of candidates stored atomically.
func (cl *candidateList) size() int32 {
Reviewer (Member):

size() is also not atomic and assumes all add() operations have been completed

Author (Member):

True, but we also don't access the list elements until all add() operations complete. I've updated the function doc to reflect this more accurately.

const parallelism = 16
// Parallelism is exported for tests where non-determinism from parallelism is
// not desired. Do NOT modify outside tests.
var Parallelism = 16
Reviewer (Member):

can you add a todo to make this private again once #94636 merges (or as part of that PR?)

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 14, 2020
@alculquicondor (Member):

can we get this to a mergeable state? code freeze is approaching

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 28, 2020
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 30, 2020
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 4, 2020
@adtac (Member, Author) commented Nov 4, 2020

ping @liggitt :)

Signed-off-by: Adhityaa Chandrasekar <adtac@google.com>
@liggitt (Member) commented Nov 6, 2020

/approve
for API bits

scheduler reviewers have LGTM

@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: adtac, alculquicondor, liggitt

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 6, 2020
@ahg-g (Member) commented Nov 6, 2020

/lgtm
/priority important-soon

@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Nov 6, 2020
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 6, 2020
@k8s-ci-robot k8s-ci-robot merged commit bd95fb1 into kubernetes:master Nov 6, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.20 milestone Nov 6, 2020