Optimize pod topology spread performance #107623
Conversation
@bbarnes52: This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the `triage/accepted` label.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Welcome @bbarnes52!
Hi @bbarnes52. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/ok-to-test
```diff
 }
 count := countPodsMatchSelector(nodeInfo.Pods, constraint.Selector, pod.Namespace)
-atomic.AddInt32(tpCount, int32(count))
+atomic.AddInt32(tpCount, count)
```
If this is no longer a concurrent calculation, do we still need atomic?
nice catch, fixed. I've also updated TpPairToMatchNum to use an int32 rather than an *int32.
/retest
lgtm

/assign @ahg-g
/assign
Thanks @bbarnes52. The procedure used to pinpoint the bottleneck in `nodeLabelsMatchSpreadConstraints` is decent and impressive.
```go
requiredSchedulingTerm := nodeaffinity.GetRequiredNodeAffinity(pod)
for _, n := range allNodes {
	node := n.Node()
tpCountsByNode := make([]map[topologyPair]int32, len(allNodes))
```
Aha, we also used this "space-for-time" trick somewhere else.
yup, in pod affinities
```go
topoScores := make([]scoreMap, len(allNodes))
topoMaps := make([]topologyToMatchedTermCount, len(nodes))
```
```go
}
pl.parallelizer.Until(context.Background(), len(allNodes), processNode)
```
(This code pre-exists.) Can you help pass `ctx` down here, so the signature becomes `calPreFilterState(ctx context.Context, pod *v1.Pod)`?
absolutely, updated.
LGTM overall. Could you squash the commits?
Thanks @bbarnes52, this looks great!
@bbarnes52 would you mind squashing the commits into one commit? Then I think it's good to be merged. BTW: as mentioned in today's sig-meeting, could you help summarize the profiling procedure into a practical document in https://github.com/kubernetes/community, under the appropriate folder?

FYI @alculquicondor this is the PR I mentioned in today's meeting.
I'd like to take a look as well if you don't mind, as I've personally tried to optimize these memory accesses myself in the past. @bbarnes52, do you think there are learnings from here that could be applied to the PreScore/Score extension points?
Great!
```diff
@@ -46,7 +45,7 @@ type preFilterState struct {
 	// it's not guaranteed to be the 2nd minimum match number.
 	TpKeyToCriticalPaths map[string]*criticalPaths
 	// TpPairToMatchNum is keyed with topologyPair, and valued with the number of matching pods.
-	TpPairToMatchNum map[topologyPair]*int32
+	TpPairToMatchNum map[topologyPair]int32
```
Now that we are not using atomic, this can probably be just `map[topologyPair]int`.
`int` instead of `int32`... but just a nit.
updated
Ping @bbarnes52, could you follow up on #107623 (comment)?
Force-pushed from d822262 to 1fd0fa6.
/approve
```diff
@@ -306,7 +296,7 @@ func (pl *PodTopologySpread) Filter(ctx context.Context, cycleState *framework.C
 		return framework.NewStatus(framework.UnschedulableAndUnresolvable, ErrReasonNodeLabelNotMatch)
 	}
 
-	selfMatchNum := int32(0)
+	selfMatchNum := int(0)
```
nit:

```diff
-selfMatchNum := int(0)
+selfMatchNum := 0
```
same a few lines below
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, bbarnes52

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Approvers can indicate their approval by writing `/approve` in a comment.
Good question. I took a look at parallelizing node label accesses in initPreScoreState. This is less of a bottleneck because it iterates over filtered nodes rather than all nodes. The benchmarks I performed yielded statistically insignificant results, so I do not believe it is worth introducing additional synchronization at this time. Regarding capturing the benchmarking procedure in a practical document, I would suggest Russ Cox's "Profiling Go Programs", which describes all the methods used here and more.
Force-pushed from 1fd0fa6 to 8399fc1.
Force-pushed from 8399fc1 to 4222d3a.
/lgtm

Thanks! Do you plan to evaluate what can be done to improve Score?
Good question. Score appears to be less of a performance bottleneck. The latencies of PreScore/Score from my benchmarks are reproduced below (all units in microseconds). I've also attached a CPU profile. This plugin's Score function is not shown because it consumes comparatively little CPU. Looking at the code, I cannot identify any non-intrusive changes that would yield performance improvements.
Yeah, I squeezed the performance of Score as much as I could. But I thought this might help across the codebase: #107504
The Kubernetes project has merge-blocking tests that are currently too flaky to consistently pass. This bot retests PRs for certain kubernetes repos according to the following rules:

You can:

/retest

/retest
What type of PR is this?

/kind cleanup

What this PR does / why we need it:

As documented in #105750, the performance of the pod topology spread plugin degrades as the number of pods increases. The following CPU profile of the scheduler was taken during a benchmarking run with the same setup described in #105750:

A disproportionate amount of CPU time is spent in the `calPreFilterState` function, as expected. Let's zoom in on this function:

Interestingly, the cumulative CPU time spent on `mapaccess2_faststr` accounts for nearly half (9.15/19.88 = 0.46) of the CPU time of `calPreFilterState`. pprof's `list` command can help us locate the bottlenecks:

A disproportionate amount of CPU time is spent accessing node labels; these accesses are currently performed in sequence. This PR relaxes this bottleneck by parallelizing the node label accesses. Note that `calPreFilterState` now accesses each key of `TpPairToMatchNum` in sequence, which amounts to the same number of sequential map accesses as before. This is faster because the accesses to `TpPairToMatchNum` exhibit better spatial locality and are less likely to incur a memory access. Our benchmarks (methodology described in #105750) yielded the following reductions in PreFilter latencies with a single topology constraint for hostname.