
optimise defaultpreemption: enumerate fewer candidates #94814

Merged · 1 commit merged into kubernetes:master on Nov 6, 2020

Conversation

@adtac (Member) commented Sep 15, 2020

What type of PR is this?

/kind feature

What this PR does / why we need it: Instead of considering all nodes as preemption candidates, dry run preemption on a smaller pool of nodes and stop once a threshold number of candidates has been found. In the average case (a good number of nodes are preemptible), this optimisation produces very good results (~25.3% throughput improvement). In the worst case (preemption is impossible), this approach does no better and no worse than the in-tree approach. The improvement is especially pronounced in large clusters (> 1k nodes); clusters smaller than 100 nodes will not see any improvement from this optimisation.

The following benchmarks were performed on a 5,000 node cluster where 20,000 low priority pods are scheduled first, followed by 5,000 high priority pods (test run time is ~10 minutes). The high priority pods do not fit on any node without preempting one or more low priority pods. See the in-tree scheduler_perf integration benchmark for more details. Throughput percentile numbers are left out because the median is 0, which makes them fairly meaningless.

| Quantity | Before | After | Change |
| --- | --- | --- | --- |
| Throughput (average) | 13.12 pods/sec | 16.45 pods/sec | +25.4% |
| preemption_evaluation_seconds (average) | 37.96ms | 30.20ms | -20.4% |
| preemption_evaluation_seconds (p50) | 7.12ms | 6.22ms | -12.6% |
| preemption_evaluation_seconds (p90) | 30.42ms | 24.01ms | -21.1% |
| preemption_evaluation_seconds (p99) | 57.65ms | 41.75ms | -27.6% |
| Time spent in main thread (fraction of test time) | 6.18% | 2.51% | -59.3% |
| Time spent in parallel code (fraction of test time) | 16.26% | 6.25% | -61.6% |

Full diff (minus is after optimisation, plus is before): http://ix.io/2xHm

(Before/after screenshots: scheduler_perf test CPU profiles, 2020-09-15.)

No performance regression in 100 node tests (in fact, throughput is up slightly by 9%, not sure how).

Which issue(s) this PR fixes:

Ref #89036

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Fewer candidates are enumerated for preemption to improve performance in large clusters

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


/sig scheduling
/cc @alculquicondor @ahg-g

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Sep 15, 2020
@k8s-ci-robot (Contributor):

Hi @adtac. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@adtac (Member, Author) commented Sep 15, 2020

Will add/modify tests tomorrow.

@ahg-g (Member) commented Sep 16, 2020

/ok-to-test
/assign @Huang-Wei

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 16, 2020
@alculquicondor (Member):

Did you add a new performance test case? Or are the above numbers from the existing ones?

Also, avoid "fixes #123" in the description, as that will cause GitHub to close the issue once this merges.

@ahg-g (Member) commented Sep 16, 2020

Just noticed a problem with the windowing approach: it breaks the promise of picking "A node with minimum number of PDB violations":

// 1. A node with minimum number of PDB violations.

I am not sure we want to break that promise.

@alculquicondor (Member):

I am not sure we want to break that promise.

I think it's ok to break it as long as the default is kept at 100%

@ahg-g (Member) commented Sep 16, 2020

I am not sure we want to break that promise.

I think it's ok to break it as long as the default is kept at 100%

I think we can modify the promise in the following way:

  1. the scheduler is free to select any node where the evicted pods don't violate pod disruption budget (the minimum within the window)
  2. if we can't find a node where the evicted pods don't violate pdb, then we have to select one where the evicted pods have the lowest impact on pdb across all nodes

@Huang-Wei (Member):

I think we can modify the promise in the following way:

  1. the scheduler is free to select any node where the evicted pods don't violate pod disruption budget (the minimum within the window)
  2. if we can't find a node where the evicted pods don't violate pdb, then we have to select one where the evicted pods have the lowest impact on pdb across all nodes

The semantics look good to me; however, in practice we should evaluate the difficulty of implementing them. If it's too much work (I'd assume we need to know in advance which nodes don't have pods violating PDBs, so as to prioritize them in the search scope), maybe it's ok to break the promise - we're just going over the disruption budget by 1, rather than totally breaking PDB enforcement (the semantics of PDB are guarded by the API server).

@alculquicondor (Member):

The semantics looks good to me, however, in practice, we should evaluate the difficulty of getting them implemented.

It shouldn't be hard at all. It's a matter of continuing to iterate while:

  • a node with zero violations is not found, OR
  • percentage has not been reached.

@adtac (Member, Author) commented Sep 16, 2020

Yes, it's just a matter of changing the context cancel condition slightly; we already have the PDB violation data for each node to make that decision. I'm still working on the component config changes haha, they're pretty confusing :)

@Huang-Wei (Member):

It shouldn't be hard at all. It's a matter of continuing iterating while:

  • a node with zero violations is not found, OR
  • percentage has not been reached.

What if the percentage has been reached, and all so-far-calculated candidates have non-zero violations? Would you continue searching?

@adtac (Member, Author) commented Sep 16, 2020

What if the percentage has been reached, and all so-far-calculated candidates have non-zero violations? Would you continue searching?

Yes. The exit condition is basically len(candidates) > threshold && nonViolatingNodeFound. In the worst case (all nodes need to be searched), we don't do any better/worse than the in-tree approach. In the average case, we check fewer nodes.
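A minimal sketch of that exit condition in plain, sequential Go (hypothetical candidate type and simulate callback; the real code runs the dry runs in parallel and cancels a context rather than breaking out of a loop):

```go
package main

import "fmt"

// candidate is a hypothetical stand-in for one node's dry-run result.
type candidate struct {
	node             string
	numPDBViolations int
}

// dryRunPreemption sketches the exit condition described above: keep
// scanning nodes until we have at least `threshold` candidates AND at
// least one of them violates no PDBs; in the worst case every node is
// visited, matching the previous behaviour.
func dryRunPreemption(nodes []string, threshold int, simulate func(string) (candidate, bool)) []candidate {
	var candidates []candidate
	nonViolatingNodeFound := false
	for _, n := range nodes {
		c, ok := simulate(n) // hypothetical per-node dry run
		if !ok {
			continue // preempting on this node cannot make the pod fit
		}
		candidates = append(candidates, c)
		if c.numPDBViolations == 0 {
			nonViolatingNodeFound = true
		}
		if len(candidates) >= threshold && nonViolatingNodeFound {
			break // enough candidates, and at least one is PDB-clean
		}
	}
	return candidates
}

func main() {
	nodes := []string{"n1", "n2", "n3", "n4", "n5"}
	simulate := func(n string) (candidate, bool) { return candidate{node: n}, true }
	fmt.Println(len(dryRunPreemption(nodes, 2, simulate))) // 2: stops early
}
```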

@Huang-Wei (Member):

Yes. The exit condition is basically len(candidates) > threshold && nonViolatingNodeFound. In the worst case (all nodes need to be searched), we don't do any better/worse than the in-tree approach. In the average case, we check fewer nodes.

SG.

@Huang-Wei (Member):

@liggitt could you help review/approve the API changes? Thanks!

pkg/scheduler/apis/config/types_pluginargs.go (outdated review thread, resolved)
// validateMinCandidateNodesPercentage validates that
// minCandidateNodesPercentage is within the allowed range.
func validateMinCandidateNodesPercentage(minCandidateNodesPercentage int32) error {
	if minCandidateNodesPercentage < 0 || minCandidateNodesPercentage > 100 {
Reviewer (Member):

what does a 0 percent minimum mean? consider 0 candidate nodes? wouldn't that break the scheduler?

Author (Member):

the absolute config parameter has precedence over the percentage one, so one can use minCandidateNodesPercentage = 0% to denote that they only want to use an absolute minimum

there's a problem only if the absolute and the percentage parameters are both zero, as that would break things; we check for that scenario in ValidateDefaultPreemptionArgs
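For illustration, a standalone sketch of the checks being described (hypothetical function; the actual validation lives in the scheduler's component config package):

```go
package main

import (
	"fmt"
	"os"
)

// validateDefaultPreemptionArgs sketches the rules discussed above:
// the percentage must lie in [0, 100], the absolute value must be
// non-negative, and the two knobs must not both be zero, since that
// would leave the scheduler with no candidates to dry run.
func validateDefaultPreemptionArgs(minCandidateNodesPercentage, minCandidateNodesAbsolute int32) error {
	if minCandidateNodesPercentage < 0 || minCandidateNodesPercentage > 100 {
		return fmt.Errorf("minCandidateNodesPercentage is not in the range [0, 100]")
	}
	if minCandidateNodesAbsolute < 0 {
		return fmt.Errorf("minCandidateNodesAbsolute is negative")
	}
	if minCandidateNodesPercentage == 0 && minCandidateNodesAbsolute == 0 {
		return fmt.Errorf("minCandidateNodesPercentage and minCandidateNodesAbsolute cannot both be zero")
	}
	return nil
}

func main() {
	if err := validateDefaultPreemptionArgs(0, 0); err != nil {
		fmt.Fprintln(os.Stderr, err) // rejected: both knobs are zero
	}
}
```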

// validateMinCandidateNodesAbsolute validates that minCandidateNodesAbsolute
// is within the allowed range.
func validateMinCandidateNodesAbsolute(minCandidateNodesAbsolute int32) error {
	if minCandidateNodesAbsolute < 0 {
Reviewer (Member):

what does a 0 minimum mean? consider 0 candidate nodes? wouldn't that break the scheduler?

Author (Member):

setting minCandidateNodesAbsolute = 0 can be used to turn off the absolute knob entirely. For example, if an operator wants only 10% of the cluster size to be evaluated, with no absolute minimum, they can set the percentage to 10 and the absolute value to 0

see the previous comment regarding the absolute and percentage parameters both being zero
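As a rough illustration, here is one plausible way the two knobs could combine into a candidate count, assuming the percentage applies to the number of potential nodes, the absolute value acts as a floor, and the node count is a ceiling; the exact formula is in the PR diff:

```go
package main

import "fmt"

// numCandidates sketches one plausible way to combine the two knobs:
// take the configured percentage of the potential nodes, never go
// below the absolute floor, and never exceed the number of nodes.
func numCandidates(numNodes, minCandidateNodesPercentage, minCandidateNodesAbsolute int32) int32 {
	n := numNodes * minCandidateNodesPercentage / 100
	if n < minCandidateNodesAbsolute {
		n = minCandidateNodesAbsolute
	}
	if n > numNodes {
		n = numNodes
	}
	return n
}

func main() {
	// "10% of the cluster, no absolute minimum" from the comment above:
	fmt.Println(numCandidates(5000, 10, 0)) // 500
	// An absolute floor keeps small clusters from searching too few nodes:
	fmt.Println(numCandidates(50, 10, 100)) // 50 (capped at the node count)
}
```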

return nil, err
}

offset, numCandidates := pl.updateAndReturnOffset(int32(len(potentialNodes)))
Reviewer (Member):

would it be simpler to pick a random start point rather than making *DefaultPreemption stateful?

Author (Member):

we did consider that, but the round-robin-like approach guarantees that preemption is distributed evenly across all nodes, whereas a random approach could produce results that are confusing to the end user (e.g. the same node getting picked for preemption twice in a row)

but I totally see why a random starting index might be nice to have too -- this is something that's certainly worth exploring in the future once we have more feedback from users

Reviewer (Member):

picking the same node twice isn't an issue (not doing so is not part of the contract); it could happen with the existing logic as well.

@adtac (Member, Author) commented Oct 8, 2020:

it could happen with the existing logic as well

picking the same node twice in a row could happen, but if it does, it also guarantees that all other nodes have been tried. While this isn't a formal guarantee on the plugin's behalf, it's a nice thing to have

Reviewer (Member):

making this stateful and self-modifying makes it a lot harder to reason about (especially in terms of thread safety and data races)

Author (Member):

Done. Switched to a random offset. There is no difference in throughput between random and moving offset in integration benchmarks (as expected).
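A minimal sketch of the random-offset iteration with wraparound (generic Go, not the PR's exact code):

```go
package main

import (
	"fmt"
	"math/rand"
)

// visitFromRandomOffset walks every node at most once, starting at a
// random index and wrapping around, so no fixed prefix of the node
// list is favoured by repeated preemption attempts.
func visitFromRandomOffset(nodes []string, visit func(node string) (stop bool)) {
	if len(nodes) == 0 {
		return
	}
	offset := rand.Intn(len(nodes))
	for i := 0; i < len(nodes); i++ {
		if visit(nodes[(offset+i)%len(nodes)]) {
			return
		}
	}
}

func main() {
	visitFromRandomOffset([]string{"n1", "n2", "n3"}, func(node string) bool {
		fmt.Println("dry running preemption on", node)
		return false // return true once enough candidates have been found
	})
}
```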

// get returns the internal candidate array. This function is NOT atomic and
// assumes that the caller has acquired mutual exclusion on the list.
func (cl *candidateList) get() []Candidate {
	return cl.items[:cl.size()]
Reviewer (Member):

returning a subslice here means that callers can modify members of cl.items beyond cl.size() which will be overwritten if cl.add is called again (see https://play.golang.org/p/Uw_cq4NK4Tz)
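As a self-contained illustration of the aliasing concern (generic Go; items and size stand in for cl.items and cl.size(), mirroring what the playground link demonstrates):

```go
package main

import "fmt"

func main() {
	// items and size stand in for cl.items and cl.size().
	items := make([]string, 4)
	items[0], items[1] = "a", "b"
	size := 2

	got := items[:size]              // what get() returns
	got = append(got, "caller-adds") // len 2, cap 4: writes items[2], beyond size

	// A later add() reuses the same backing slot and overwrites it.
	items[size] = "added-later"
	size++

	fmt.Println(got)          // [a b added-later]: the caller's element is gone
	fmt.Println(items[:size]) // [a b added-later]
}
```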

Author (Member):

IMO this isn't a concern. get() is guaranteed to be called after all add() operations, but perhaps it's worth adding this as a note to the function documentation -- done.

Reviewer (Member):

making the other methods atomically safe implies multiple goroutines are interacting with this object... are all parallel operations guaranteed to be stopped before this is called or can a timeout return control to the get() caller while some background operations are still running?

Author (Member):

are all parallel operations guaranteed to be stopped before this is called

Yes, parallelize.Until will guarantee that all goroutines interacting with this candidateList have completed before this is called.
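A minimal sketch of that contract, assuming the parallel helper behaves like client-go's workqueue.ParallelizeUntil, which blocks until every work piece has finished:

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	nodes := []string{"n1", "n2", "n3", "n4", "n5"}

	// Fixed-size slice plus an atomically advanced cursor, mirroring the
	// candidateList idea: add() runs concurrently, get() only afterwards.
	items := make([]string, len(nodes))
	var size int32

	workqueue.ParallelizeUntil(context.Background(), 4, len(nodes), func(i int) {
		// the per-node preemption dry run would happen here
		idx := atomic.AddInt32(&size, 1) - 1
		items[idx] = nodes[i]
	})

	// ParallelizeUntil has already waited for every worker goroutine, so
	// reading size and the subslice here needs no extra synchronization.
	fmt.Println(items[:atomic.LoadInt32(&size)])
}
```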

pkg/scheduler/apis/config/v1beta1/defaults.go (review thread, resolved)
@adtac (Member, Author) commented Oct 7, 2020

/test pull-kubernetes-e2e-kind

}

// size returns the number of candidates stored atomically.
func (cl *candidateList) size() int32 {
Reviewer (Member):

size() is also not atomic and assumes all add() operations have been completed

Author (Member):

True, but we also don't access the list elements until all add() operations complete. I've updated the function doc to reflect this more accurately.

const parallelism = 16
// Parallelism is exported for tests where non-determinism from parallelism is
// not desired. Do NOT modify outside tests.
var Parallelism = 16
Reviewer (Member):

can you add a todo to make this private again once #94636 merges (or as part of that PR?)

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 14, 2020
@alculquicondor (Member):

can we get this to a mergeable state? code freeze is approaching

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 28, 2020
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 30, 2020
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 4, 2020
@adtac (Member, Author) commented Nov 4, 2020

ping @liggitt :)

Signed-off-by: Adhityaa Chandrasekar <adtac@google.com>
@liggitt (Member) commented Nov 6, 2020

/approve
for API bits

scheduler reviewers have LGTM

@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: adtac, alculquicondor, liggitt

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 6, 2020
@ahg-g (Member) commented Nov 6, 2020

/lgtm
/priority important-soon

@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Nov 6, 2020
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 6, 2020
@k8s-ci-robot k8s-ci-robot merged commit bd95fb1 into kubernetes:master Nov 6, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.20 milestone Nov 6, 2020