
Optimize pod topology spread performance #107623

Merged

Conversation

@bbarnes52-zz (Contributor) commented Jan 18, 2022

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

As documented in #105750, the performance of the pod topology spread plugin degrades as the number of pods increases. The following CPU profile of the scheduler was taken during a benchmarking run with the same setup described in #105750:

[Screenshot: CPU profile of the scheduler, 2021-12-13]

A disproportionate amount of CPU time is spent in the calPreFilterState function, as expected. Let's zoom in on this function:

[Screenshot: CPU profile zoomed in on calPreFilterState, 2021-12-13]

Interestingly, the cumulative CPU time spent on mapaccess2_faststr accounts for nearly half (9.15/19.88 = 0.46) of the CPU time of calPreFilterState. pprof's list command can help us locate the bottlenecks:

(pprof) list calPreFilterState
     7.50s     19.88s (flat, cum) 34.39% of Total

        3s     10.63s    238:		if !nodeLabelsMatchSpreadConstraints(node.Labels, constraints) {
     130ms      6.80s    257:		count := countPodsMatchSelector(nodeInfo.Pods, constraint.Selector, pod.Namespace)

(pprof) list nodeLabelsMatchSpreadConstraints
     110ms      7.63s (flat, cum) 13.20% of Total

      40ms      7.56s     62:		if _, ok := nodeLabels[c.TopologyKey]; !ok {
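For context, the hot check that pprof points at is a per-node loop over the pod's spread constraints with one map lookup per constraint. A simplified sketch, reconstructed from the source lines quoted above (not necessarily the exact code in the tree):

// nodeLabelsMatchSpreadConstraints reports whether the node's labels contain
// every constraint's topology key; line 62 above is the map lookup.
func nodeLabelsMatchSpreadConstraints(nodeLabels map[string]string, constraints []topologySpreadConstraint) bool {
	for _, c := range constraints {
		if _, ok := nodeLabels[c.TopologyKey]; !ok {
			return false
		}
	}
	return true
}

With thousands of nodes, this check runs once per node, which explains why the lookups dominate the profile above.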

A disproportionate amount of CPU time is spent accessing node labels, and these accesses are currently performed in sequence. This PR relaxes the bottleneck by parallelizing the node label accesses. Note that calPreFilterState now accesses each key of TpPairToMatchNum in sequence, which amounts to the same number of sequential map accesses as before; this is faster because the accesses to TpPairToMatchNum exhibit better spatial locality and are less likely to incur a main-memory access. Our benchmarks (methodology described in #105750) yielded the following reductions in PreFilter latency with a single topology constraint for hostname; a sketch of the new pattern follows the list below.

  • p50: 41.0% reduction
  • p90: 12.0% reduction
  • p99: 21.0% reduction
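The shape of the change, as a hedged sketch: identifiers such as tpCountsByNode, processNode, and countPodsMatchSelector are taken from the diff excerpts quoted later in this thread, while the surrounding scaffolding is illustrative rather than the exact PR code.

// Fan out: each goroutine handles one node and writes its per-topology-pair
// counts into its own slice slot, so the label lookups run in parallel and
// no locking is needed during this phase.
tpCountsByNode := make([]map[topologyPair]int32, len(allNodes))
processNode := func(i int) {
	nodeInfo := allNodes[i]
	node := nodeInfo.Node()
	if node == nil || !nodeLabelsMatchSpreadConstraints(node.Labels, constraints) {
		return
	}
	counts := make(map[topologyPair]int32, len(constraints))
	for _, c := range constraints {
		pair := topologyPair{key: c.TopologyKey, value: node.Labels[c.TopologyKey]}
		counts[pair] += int32(countPodsMatchSelector(nodeInfo.Pods, c.Selector, pod.Namespace))
	}
	tpCountsByNode[i] = counts
}
pl.parallelizer.Until(ctx, len(allNodes), processNode) // ctx plumbing is discussed later in the thread

The per-node maps are then merged into TpPairToMatchNum on a single goroutine, which is where the sequential, cache-friendly accesses described above come from.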
Release note: NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 18, 2022

linux-foundation-easycla bot commented Jan 18, 2022

CLA Signed

The committers are authorized under a signed CLA.

  • ✅ bbarnes52 (b1453b1378a5e49c9580337d4d1ceac11941390f)

@k8s-ci-robot k8s-ci-robot added do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 18, 2022
@k8s-ci-robot (Contributor):

@bbarnes52: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot (Contributor):

Welcome @bbarnes52!

It looks like this is your first PR to kubernetes/kubernetes 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/kubernetes has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jan 18, 2022
@k8s-ci-robot (Contributor):

Hi @bbarnes52. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-priority Indicates a PR lacks a `priority/foo` label and requires one. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jan 18, 2022
@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jan 18, 2022
@bbarnes52-zz bbarnes52-zz marked this pull request as ready for review January 18, 2022 22:50
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 18, 2022
@denkensk (Member):

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 19, 2022
 }
 count := countPodsMatchSelector(nodeInfo.Pods, constraint.Selector, pod.Namespace)
-atomic.AddInt32(tpCount, int32(count))
+atomic.AddInt32(tpCount, count)
Review comment (Member):

If this is no longer a concurrent calculation, do we still need atomic?

Reply from the author (Contributor):

nice catch, fixed. I've also updated TpPairToMatchNum to use an int32 rather than an *int32.
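For readers following along, the single-threaded merge that replaced the atomic adds might look roughly like this (a sketch; s.TpPairToMatchNum and tpCountsByNode follow the diff excerpts in this thread):

// Merge the per-node counts on one goroutine: plain map writes suffice,
// so neither atomics nor *int32 indirection is needed anymore.
for _, counts := range tpCountsByNode {
	for pair, count := range counts {
		s.TpPairToMatchNum[pair] += count
	}
}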

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 19, 2022
@sanposhiho (Member):

/retest

@denkensk (Member):

lgtm
Thanks @bbarnes52
Please squash commits to one.

/assign @ahg-g
Pls take a look.

@Huang-Wei (Member):

/assign
I will take a look.

@Huang-Wei (Member) left a comment:

Thanks @bbarnes52. The procedure used to pinpoint the bottleneck in nodeLabelsMatchSpreadConstraints is methodical and impressive.

 requiredSchedulingTerm := nodeaffinity.GetRequiredNodeAffinity(pod)
-for _, n := range allNodes {
-	node := n.Node()
+tpCountsByNode := make([]map[topologyPair]int32, len(allNodes))
Review comment (Member):

Aha, we also used this "space-for-time" trick somewhere else.

Reply (Member):

yup, in pod affinities

topoScores := make([]scoreMap, len(allNodes))
and
topoMaps := make([]topologyToMatchedTermCount, len(nodes))

}
pl.parallelizer.Until(context.Background(), len(allNodes), processNode)
@Huang-Wei (Member) commented Jan 27, 2022:

(this code pre-exists)

Can you help pass ctx down here?

calPreFilterState(ctx context.Context, pod *v1.Pod)

Reply from the author (Contributor):

absolutely, updated.
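The requested plumbing is small; a sketch under the assumption that the signature suggested above is adopted as-is:

// Pass the scheduling cycle's context down instead of context.Background(),
// so the parallel fan-out is cancelled together with the cycle.
func (pl *PodTopologySpread) calPreFilterState(ctx context.Context, pod *v1.Pod) (*preFilterState, error) {
	// ...
	pl.parallelizer.Until(ctx, len(allNodes), processNode)
	// ...
}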

Review comment (Member):

LGTM overall. Could you squash the commits?

@ahg-g (Member) commented Jan 27, 2022

Thanks @bbarnes52, this looks great!

@Huang-Wei (Member):

> LGTM overall. Could you squash the commits?

@bbarnes52 would you mind squashing the commits into one? Then I think it's good to be merged. BTW: as mentioned in today's sig-meeting, could you help summarize the profiling procedure into a practical document in https://github.com/kubernetes/community, under the folder contributors/devel/sig-scheduling? I believe this will be of great help to the whole community.

FYI @alculquicondor this is the PR I mentioned in today's meeting.

@alculquicondor (Member):

I'd like to take a look as well, if you don't mind, as I've tried to optimize these memory accesses myself in the past.

@bbarnes52, do you think there are learnings from here that could be applied to the PreScore/Score extension points?

@alculquicondor (Member) left a comment:

Great!

@@ -46,7 +45,7 @@ type preFilterState struct {
 	// it's not guaranteed to be the 2nd minimum match number.
 	TpKeyToCriticalPaths map[string]*criticalPaths
 	// TpPairToMatchNum is keyed with topologyPair, and valued with the number of matching pods.
-	TpPairToMatchNum map[topologyPair]*int32
+	TpPairToMatchNum map[topologyPair]int32
Review comment (Member):

now that we are not using atomic, this can probably be just map[topologyPair]int

Reply (Member):

int instead of int32... but just a nit

Reply from the author (Contributor):

updated

@Huang-Wei (Member):

Ping @bbarnes52, could you follow up on #107623 (comment)?

> @bbarnes52 would you mind squashing the commits into one? Then I think it's good to be merged. BTW: as mentioned in today's sig-meeting, could you help summarize the profiling procedure into a practical document in https://github.com/kubernetes/community, under the folder contributors/devel/sig-scheduling? I believe this will be of great help to the whole community.

@alculquicondor (Member) left a comment:

/approve

@@ -306,7 +296,7 @@ func (pl *PodTopologySpread) Filter(ctx context.Context, cycleState *framework.C
 		return framework.NewStatus(framework.UnschedulableAndUnresolvable, ErrReasonNodeLabelNotMatch)
 	}
 
-	selfMatchNum := int32(0)
+	selfMatchNum := int(0)
Review comment (Member):

nit

Suggested change:
-	selfMatchNum := int(0)
+	selfMatchNum := 0

Reply (Member):

same a few lines below

@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, bbarnes52

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 4, 2022
@bbarnes52-zz (Contributor, Author):

> I'd like to take a look as well, if you don't mind, as I've tried to optimize these memory accesses myself in the past.
>
> @bbarnes52, do you think there are learnings from here that could be applied to the PreScore/Score extension points?

Good question. I took a look at parallelizing node label accesses in initPreScoreState. This is less of a bottleneck because it iterates over the filtered nodes rather than all nodes. The benchmarks I performed yielded statistically insignificant results, so I do not believe it is worth introducing additional synchronization at this time.

Regarding capturing the benchmarking procedure in a practical document, I would suggest Russ Cox's Profiling Go Programs, which describes all the methods used here and more.
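For anyone wanting to reproduce this kind of analysis, here is a minimal sketch of capturing a CPU profile with the standard runtime/pprof package (not the exact harness used for these benchmarks; runWorkload is a hypothetical stand-in for the code under test):

package main

import (
	"log"
	"os"
	"runtime/pprof"
)

func main() {
	// Write a CPU profile to disk; inspect it later with
	// `go tool pprof cpu.prof` and its top/list commands, as shown above.
	f, err := os.Create("cpu.prof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	runWorkload() // hypothetical: the code being profiled
}

func runWorkload() { /* ... */ }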

@alculquicondor (Member):

/lgtm

Thanks!

Do you plan to evaluate what can be done to improve Score?

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 4, 2022
@bbarnes52-zz (Contributor, Author):

> Do you plan to evaluate what can be done to improve Score?

Good question. Score appears to be less of a performance bottleneck. The PreScore/Score latencies from my benchmarks are reproduced below (all units in microseconds). I've also attached a CPU profile; this plugin's Score function is not shown because it consumes comparatively little CPU.

Looking at the code, I cannot identify any non-intrusive changes that would yield performance improvements.

   "PreScore": {
      "Perc50": 2212528,
      "Perc90": 3526917,
      "Perc99": 6195231
    },
    "Score": {
      "Perc50": 950042,
      "Perc90": 1478549,
      "Perc99": 1597463
    },

[Screenshot: PreScore/Score CPU profile, 2022-01-25]

@alculquicondor (Member):

Yeah, I squeezed as much performance out of Score as I could. But I thought this might help across the codebase: #107504

@k8s-triage-robot:

The Kubernetes project has merge-blocking tests that are currently too flaky to consistently pass.

This bot retests PRs for certain kubernetes repos according to the following rules:

  • The PR does not have any do-not-merge/* labels
  • The PR does not have the needs-ok-to-test label
  • The PR is mergeable (does not have a needs-rebase label)
  • The PR is approved (has cncf-cla: yes, lgtm, approved labels)
  • The PR is failing tests required for merge

You can:

/retest

@Huang-Wei (Member):

/retest

@k8s-ci-robot k8s-ci-robot merged commit 6410dda into kubernetes:master Feb 4, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.24 milestone Feb 4, 2022
@Ramyak Ramyak mentioned this pull request Jun 22, 2022