optimise defaultpreemption: enumerate fewer candidates #94814
Conversation
Hi @adtac. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Will add/modify tests tomorrow.
/ok-to-test
Did you add a new performance test case? Or are the above numbers from the existing ones? Also, avoid …
Just noticed a problem with the windowing approach: it breaks the promise of picking "A node with minimum number of PDB violations." (see `pkg/scheduler/framework/plugins/defaultpreemption/default_preemption.go`, line 356 at commit 5d095c8).
I am not sure we want to break that promise.
I think it's ok to break it as long as the default is kept at 100%.
I think we can modify the promise in the following way: …
The semantics look good to me; however, in practice, we should evaluate the difficulty of getting them implemented. If it's too much work (I'd assume we need to know in advance which nodes don't have pods violating PDBs, so as to prioritize them in the searching scope), maybe it's ok to break the promise, as we're just increasing the disruption budget by 1 rather than totally breaking PDB enforcement (the semantics of PDBs are guarded by the API server).
It shouldn't be hard at all. It's a matter of continuing to iterate while: …
Yes, it's just a matter of changing the context cancel condition slightly; we already have the PDB violation data for each node to make that decision. I'm still working on the component config changes haha, they're pretty confusing :)
What if the percentage has been reached, and all so-far-calculated candidates have non-zero violations? Would you continue searching?
Yes. The exit condition is basically …
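For concreteness, here's a minimal sketch of what such an exit condition could look like; the names (`candidate`, `stopSearching`, `pdbViolations`) are illustrative assumptions, not the PR's code, and the elided expression above is not preserved in this extract:

```go
package main

import "fmt"

// candidate is a stand-in for the scheduler's preemption candidate type.
type candidate struct {
	node          string
	pdbViolations int
}

// stopSearching reports whether the dry run can cancel its context: enough
// candidates have been found AND at least one of them violates no PDBs.
func stopSearching(found []candidate, numCandidates int) bool {
	if len(found) < numCandidates {
		return false
	}
	for _, c := range found {
		if c.pdbViolations == 0 {
			return true
		}
	}
	// Percentage reached, but every candidate violates some PDB: keep going.
	return false
}

func main() {
	found := []candidate{{"node-a", 1}, {"node-b", 2}}
	fmt.Println(stopSearching(found, 2)) // false: all candidates violate PDBs
	found = append(found, candidate{"node-c", 0})
	fmt.Println(stopSearching(found, 2)) // true: enough found, one is clean
}
```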
SG.
@liggitt could you help review/approve the API changes? Thanks!
```go
// validateMinCandidateNodesPercentage validates that
// minCandidateNodesPercentage is within the allowed range.
func validateMinCandidateNodesPercentage(minCandidateNodesPercentage int32) error {
	if minCandidateNodesPercentage < 0 || minCandidateNodesPercentage > 100 {
```
what does a 0 percent minimum mean? consider 0 candidate nodes? wouldn't that break the scheduler?
the absolute config parameter has precedence over the percentage one, so one can use `minCandidateNodesPercentage = 0` to denote that they only want to use an absolute minimum. There's a problem only if the absolute and the percentage parameters are both zero, as that would break things; we check for that scenario in `ValidateDefaultPreemptionArgs`.
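A small sketch of that precedence (a hypothetical helper; the plugin's real computation may differ in details): the effective target is the larger of the two knobs, capped at the cluster size, so a zero percentage leaves only the absolute floor in effect.

```go
package main

import "fmt"

// targetCandidates: the effective number of candidates is the larger of the
// percentage-derived value and the absolute minimum, capped at numNodes.
func targetCandidates(numNodes, minPercentage, minAbsolute int32) int32 {
	n := numNodes * minPercentage / 100
	if n < minAbsolute {
		n = minAbsolute
	}
	if n > numNodes {
		n = numNodes
	}
	return n
}

func main() {
	// minCandidateNodesPercentage = 0: only the absolute knob applies.
	fmt.Println(targetCandidates(5000, 0, 200)) // 200
	// Both knobs zero would yield 0 candidates; ValidateDefaultPreemptionArgs
	// rejects that combination for exactly this reason.
	fmt.Println(targetCandidates(5000, 0, 0)) // 0
}
```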
```go
// validateMinCandidateNodesAbsolute validates that minCandidateNodesAbsolute
// is within the allowed range.
func validateMinCandidateNodesAbsolute(minCandidateNodesAbsolute int32) error {
	if minCandidateNodesAbsolute < 0 {
```
what does a 0 minimum mean? consider 0 candidate nodes? wouldn't that break the scheduler?
setting `minCandidateNodesAbsolute = 0` could be used to turn off the absolute knob entirely. For example, if an operator wants only 10% of the cluster size to be evaluated without an absolute minimum, they can do this. See previous comment re: the absolute and percentage both being zero.
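For instance, a percentage-only configuration might look like this (a sketch; the package path and value-vs-pointer field shapes are assumptions about this PR's internal config type):

```go
package main

import (
	"fmt"

	"k8s.io/kubernetes/pkg/scheduler/apis/config"
)

func main() {
	// Evaluate 10% of the cluster with no absolute floor: setting the
	// absolute knob to 0 turns it off entirely.
	args := config.DefaultPreemptionArgs{
		MinCandidateNodesPercentage: 10,
		MinCandidateNodesAbsolute:   0,
	}
	fmt.Printf("%+v\n", args)
}
```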
```go
	return nil, err
}

offset, numCandidates := pl.updateAndReturnOffset(int32(len(potentialNodes)))
```
would it be simpler to pick a random start point rather than making *DefaultPreemption stateful?
we did consider that, but the round-robin-like approach guarantees that it evenly distributes preemption across all nodes, whereas a random approach could produce results that are confusing to the end user (e.g. the same node getting picked for preemption twice in a row). But I totally see why a random starting index might be nice to have too; this is something that's certainly worth exploring in the future once we have more feedback from users.
picking the same node twice isn't an issue (not doing so is not part of the contract); it could happen with the existing logic as well.
> it could happen with the existing logic as well

picking the same node twice in a row could happen, but if it does, it also guarantees that all other nodes have been tried. While this isn't a formal guarantee on the plugin's behalf, it's a nice thing to have.
making this stateful and self-modifying makes it a lot harder to reason about (especially in terms of thread safety and data races)
Done. Switched to a random offset. There is no difference in throughput between random and moving offset in integration benchmarks (as expected).
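For reference, a minimal sketch of the random-offset iteration (variable names are illustrative): start at a random index and wrap around, so every node remains reachable while the starting point varies between attempts and the plugin stays stateless.

```go
package main

import (
	"fmt"
	"math/rand"
)

func main() {
	potentialNodes := []string{"node-a", "node-b", "node-c", "node-d", "node-e"}

	// A random starting index replaces the persisted moving offset.
	offset := rand.Intn(len(potentialNodes))
	for i := range potentialNodes {
		node := potentialNodes[(offset+i)%len(potentialNodes)]
		fmt.Println("evaluating", node)
		// ... dry-run preemption here, stopping early once enough
		// candidates have been found.
	}
}
```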
```go
// get returns the internal candidate array. This function is NOT atomic and
// assumes that the caller has acquired mutual exclusion on the list.
func (cl *candidateList) get() []Candidate {
	return cl.items[:cl.size()]
}
```
returning a subslice here means that callers can modify members of `cl.items` beyond `cl.size()`, which will be overwritten if `cl.add` is called again (see https://play.golang.org/p/Uw_cq4NK4Tz)
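A self-contained illustration of the aliasing concern, along the lines of the playground link above (names are hypothetical): a subslice shares backing storage with the full slice, so a caller's append writes into slots the owner will later reuse.

```go
package main

import "fmt"

func main() {
	items := make([]string, 0, 4)
	items = append(items, "a", "b")

	view := items[:2] // what a get()-style accessor would return

	// The caller appends to the returned subslice: this writes into items'
	// backing array at index 2, even though items itself has length 2.
	view = append(view, "caller-added")

	// The owner's next add overwrites that very slot.
	items = append(items, "owner-added")

	fmt.Println(view[2]) // prints "owner-added", not "caller-added"
}
```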
IMO this isn't a concern. `get()` is guaranteed to be called after all `add()` operations, but perhaps it's worth adding this as a note to the function documentation -- done.
making the other methods atomically safe implies multiple goroutines are interacting with this object... are all parallel operations guaranteed to be stopped before this is called, or can a timeout return control to the `get()` caller while some background operations are still running?
> are all parallel operations guaranteed to be stopped before this is called

Yes, `parallelize.Until` will guarantee that all goroutines interacting with this `candidateList` will have completed before this is called.
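To illustrate why that guarantee makes the unsynchronized read safe, here is a sketch using client-go's `workqueue.ParallelizeUntil` directly (an assumption: the scheduler's `parallelize.Until` behaves like it, blocking until every started worker returns, even when the context is cancelled early):

```go
package main

import (
	"context"
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	nodes := []string{"node-a", "node-b", "node-c", "node-d"}
	results := make([]string, len(nodes)) // one slot per piece: no locking needed

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// ParallelizeUntil returns only after all in-flight pieces finish; a
	// cancelled context stops new pieces from starting but is still awaited.
	workqueue.ParallelizeUntil(ctx, 16, len(nodes), func(i int) {
		results[i] = "evaluated " + nodes[i]
	})

	// Safe to read without synchronization: all writers are done.
	for _, r := range results {
		fmt.Println(r)
	}
}
```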
/test pull-kubernetes-e2e-kind
```go
}

// size returns the number of candidates stored; the count is read atomically.
func (cl *candidateList) size() int32 {
```
size() is also not atomic and assumes all add() operations have been completed
True, but we also don't access the list elements until all add() operations complete. I've updated the function doc to reflect this more accurately.
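Sketching the scheme under discussion (field names and method bodies are assumptions for illustration): slots are claimed with an atomic counter so concurrent `add()` calls never collide, while the elements themselves are only read after all writers finish.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

type candidate struct{ node string }

type candidateList struct {
	idx   int32        // number of slots claimed so far
	items []*candidate // fixed-capacity backing array
}

// add claims the next slot atomically; candidates beyond capacity are dropped.
func (cl *candidateList) add(c *candidate) {
	if idx := atomic.AddInt32(&cl.idx, 1); idx <= int32(len(cl.items)) {
		cl.items[idx-1] = c
	}
}

// size returns how many candidates are stored; the counter read is atomic,
// but the items must not be read until all add() calls have completed.
func (cl *candidateList) size() int32 {
	n := atomic.LoadInt32(&cl.idx)
	if n > int32(len(cl.items)) {
		n = int32(len(cl.items))
	}
	return n
}

func main() {
	cl := &candidateList{items: make([]*candidate, 4)}
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			cl.add(&candidate{node: fmt.Sprintf("node-%d", i)})
		}(i)
	}
	wg.Wait() // all writers done; reading the list is now safe
	fmt.Println("stored:", cl.size())
}
```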
```diff
-const parallelism = 16
+// Parallelism is exported for tests where non-determinism from parallelism is
+// not desired. Do NOT modify outside tests.
+var Parallelism = 16
```
can you add a todo to make this private again once #94636 merges (or as part of that PR?)
can we get this to a mergeable state? code freeze is approaching
ping @liggitt :)
Signed-off-by: Adhityaa Chandrasekar <adtac@google.com>
/approve

scheduler reviewers have LGTM
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: adtac, alculquicondor, liggitt

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
/lgtm
What type of PR is this?
/kind feature
What this PR does / why we need it: Instead of considering all nodes as preemption candidates, look for a smaller pool when dry-running preemption and stop when a threshold number of candidates is found. In the average case (a good number of nodes are preemptible), this optimisation produces very good results (~25.3% throughput improvement). In the worst case (preemption is impossible), this approach will do no better or worse than the in-tree approach. The improvement is especially pronounced in large clusters (> 1k nodes). Clusters smaller than 100 nodes will not see any improvement from this optimisation.
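As a worked example with illustrative settings (10% and an absolute floor of 100, not necessarily the shipped defaults): a 5,000-node cluster stops the dry run after max(5000 × 10%, 100) = 500 candidates instead of evaluating all 5,000 nodes, while a 100-node cluster still evaluates max(10, 100) = 100 candidates, i.e. every node, which is why small clusters see no gain.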
The following benchmarks were performed with a 5,000-node cluster where 20,000 low-priority pods are scheduled and then 5,000 high-priority pods are scheduled (test run time is ~10 minutes). The high-priority pods do not fit on any node without preempting one or more low-priority pods. See the in-tree `scheduler_perf` integration benchmark for more details. Throughput percentile numbers are left out because the median is 0 (kinda meaningless as a result).

[Benchmark table: `preemption_evaluation_seconds` average, p50, p90, and p99]

Full diff (minus is after optimisation, plus is before): http://ix.io/2xHm
No performance regression in 100-node tests (in fact, throughput is up slightly by 9%, not sure how).
Which issue(s) this PR fixes:
Ref #89036
Special notes for your reviewer:
Does this PR introduce a user-facing change?:
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
/sig scheduling
/cc @alculquicondor @ahg-g