
Optimize pod topology spread performance #107623

Merged

Conversation

@bbarnes52-zz (Contributor) commented Jan 18, 2022

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

As documented in #105750, the performance of the pod topology spread plugin degrades as the number of pods increases. The following CPU profile of the scheduler was taken during a benchmarking run with the same setup described in #105750:

[Screenshot: CPU profile of the scheduler, 2021-12-13]

A disproportionate amount of CPU time is spent in the calPreFilterState function, as expected. Let's zoom in on this function:

[Screenshot: CPU profile zoomed in on calPreFilterState, 2021-12-13]

Interestingly, the cumulative CPU time spent on mapaccess2_faststr accounts for nearly half (9.15/19.88 = 0.46) of the CPU time of calPreFilterState. pprof's list command can help us locate the bottlenecks:

(pprof) list calPreFilterState
     7.50s     19.88s (flat, cum) 34.39% of Total

        3s     10.63s    238:		if !nodeLabelsMatchSpreadConstraints(node.Labels, constraints) {
     130ms      6.80s    257:		count := countPodsMatchSelector(nodeInfo.Pods, constraint.Selector, pod.Namespace)

(pprof) list nodeLabelsMatchSpreadConstraints
     110ms      7.63s (flat, cum) 13.20% of Total

      40ms      7.56s     62:		if _, ok := nodeLabels[c.TopologyKey]; !ok {
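For context, the hot check that pprof points at is a per-node loop over the pod's spread constraints with one map lookup per constraint. A simplified sketch, reconstructed from the source lines quoted above (not necessarily the exact code in the tree):

// nodeLabelsMatchSpreadConstraints reports whether the node's labels contain
// every constraint's topology key; line 62 above is the map lookup.
func nodeLabelsMatchSpreadConstraints(nodeLabels map[string]string, constraints []topologySpreadConstraint) bool {
	for _, c := range constraints {
		if _, ok := nodeLabels[c.TopologyKey]; !ok {
			return false
		}
	}
	return true
}

With thousands of nodes, this check runs once per node, which explains why the lookups dominate the profile above.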

A disproportionate amount of CPU time is spent accessing node labels, and these accesses are currently performed in sequence. This PR relaxes the bottleneck by parallelizing the node label accesses. Note that calPreFilterState now accesses each key of TpPairToMatchNum in sequence, which amounts to the same number of sequential map accesses as before; this is faster because the accesses to TpPairToMatchNum exhibit better spatial locality and are less likely to incur a main-memory access. Our benchmarks (methodology described in #105750) yielded the following reductions in PreFilter latency with a single topology constraint for hostname; a sketch of the new pattern follows the list below.

  • p50: 41.0% reduction
  • p90: 12.0% reduction
  • p99: 21.0% reduction
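The shape of the change, as a hedged sketch: identifiers such as tpCountsByNode, processNode, and countPodsMatchSelector are taken from the diff excerpts quoted later in this thread, while the surrounding scaffolding is illustrative rather than the exact PR code.

// Fan out: each goroutine handles one node and writes its per-topology-pair
// counts into its own slice slot, so the label lookups run in parallel and
// no locking is needed during this phase.
tpCountsByNode := make([]map[topologyPair]int32, len(allNodes))
processNode := func(i int) {
	nodeInfo := allNodes[i]
	node := nodeInfo.Node()
	if node == nil || !nodeLabelsMatchSpreadConstraints(node.Labels, constraints) {
		return
	}
	counts := make(map[topologyPair]int32, len(constraints))
	for _, c := range constraints {
		pair := topologyPair{key: c.TopologyKey, value: node.Labels[c.TopologyKey]}
		counts[pair] += int32(countPodsMatchSelector(nodeInfo.Pods, c.Selector, pod.Namespace))
	}
	tpCountsByNode[i] = counts
}
pl.parallelizer.Until(ctx, len(allNodes), processNode) // ctx plumbing is discussed later in the thread

The per-node maps are then merged into TpPairToMatchNum on a single goroutine, which is where the sequential, cache-friendly accesses described above come from.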
Release note: NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 18, 2022

linux-foundation-easycla bot commented Jan 18, 2022

CLA Signed

The committers are authorized under a signed CLA.

  • ✅ bbarnes52 (b1453b1378a5e49c9580337d4d1ceac11941390f)

@k8s-ci-robot k8s-ci-robot added do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 18, 2022
@k8s-ci-robot (Contributor):

@bbarnes52: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot (Contributor):

Welcome @bbarnes52!

It looks like this is your first PR to kubernetes/kubernetes 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/kubernetes has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jan 18, 2022
@k8s-ci-robot (Contributor):

Hi @bbarnes52. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-priority Indicates a PR lacks a `priority/foo` label and requires one. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jan 18, 2022
@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jan 18, 2022
@bbarnes52-zz bbarnes52-zz marked this pull request as ready for review January 18, 2022 22:50
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 18, 2022
@denkensk (Member):

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 19, 2022
 }
 count := countPodsMatchSelector(nodeInfo.Pods, constraint.Selector, pod.Namespace)
-atomic.AddInt32(tpCount, int32(count))
+atomic.AddInt32(tpCount, count)
Review comment (Member):

If this is no longer a concurrent calculation, do we still need atomic?

Reply from the author (Contributor):

nice catch, fixed. I've also updated TpPairToMatchNum to use an int32 rather than an *int32.
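For readers following along, the single-threaded merge that replaced the atomic adds might look roughly like this (a sketch; s.TpPairToMatchNum and tpCountsByNode follow the diff excerpts in this thread):

// Merge the per-node counts on one goroutine: plain map writes suffice,
// so neither atomics nor *int32 indirection is needed anymore.
for _, counts := range tpCountsByNode {
	for pair, count := range counts {
		s.TpPairToMatchNum[pair] += count
	}
}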

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 19, 2022
@sanposhiho (Member):

/retest

@denkensk (Member):

lgtm
Thanks @bbarnes52
Please squash commits to one.

/assign @ahg-g
Pls take a look.

@Huang-Wei (Member):

/assign
I will take a look.

@Huang-Wei (Member) left a comment:

Thanks @bbarnes52. The procedure used to pinpoint the bottleneck in nodeLabelsMatchSpreadConstraints is methodical and impressive.

 requiredSchedulingTerm := nodeaffinity.GetRequiredNodeAffinity(pod)
-for _, n := range allNodes {
-	node := n.Node()
+tpCountsByNode := make([]map[topologyPair]int32, len(allNodes))
Review comment (Member):

Aha, we also used this "space-for-time" trick somewhere else.

Reply (Member):

yup, in pod affinities

topoScores := make([]scoreMap, len(allNodes))
and
topoMaps := make([]topologyToMatchedTermCount, len(nodes))

}
pl.parallelizer.Until(context.Background(), len(allNodes), processNode)
@Huang-Wei (Member) commented Jan 27, 2022:

(this code pre-exists)

Can you help pass ctx down here?

calPreFilterState(ctx context.Context, pod *v1.Pod)

Reply from the author (Contributor):

absolutely, updated.
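The requested plumbing is small; a sketch under the assumption that the signature suggested above is adopted as-is:

// Pass the scheduling cycle's context down instead of context.Background(),
// so the parallel fan-out is cancelled together with the cycle.
func (pl *PodTopologySpread) calPreFilterState(ctx context.Context, pod *v1.Pod) (*preFilterState, error) {
	// ...
	pl.parallelizer.Until(ctx, len(allNodes), processNode)
	// ...
}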

Review comment (Member):

LGTM overall. Could you squash the commits?

@ahg-g (Member) commented Jan 27, 2022

Thanks @bbarnes52, this looks great!

@Huang-Wei (Member):

> LGTM overall. Could you squash the commits?

@bbarnes52 would you mind squashing the commits into one? Then I think it's good to be merged. BTW: as mentioned in today's sig-meeting, could you help summarize the profiling procedure into a practical document in https://github.com/kubernetes/community, under the folder contributors/devel/sig-scheduling? I believe this will be of great help to the whole community.

FYI @alculquicondor this is the PR I mentioned in today's meeting.

@alculquicondor (Member):

I'd like to take a look as well, if you don't mind, as I've tried to optimize these memory accesses myself in the past.

@bbarnes52, do you think there are learnings from here that could be applied to the PreScore/Score extension points?

@alculquicondor (Member) left a comment:

Great!

@@ -46,7 +45,7 @@ type preFilterState struct {
 	// it's not guaranteed to be the 2nd minimum match number.
 	TpKeyToCriticalPaths map[string]*criticalPaths
 	// TpPairToMatchNum is keyed with topologyPair, and valued with the number of matching pods.
-	TpPairToMatchNum map[topologyPair]*int32
+	TpPairToMatchNum map[topologyPair]int32
Review comment (Member):

now that we are not using atomic, this can probably be just map[topologyPair]int

Reply (Member):

int instead of int32... but just a nit

Reply from the author (Contributor):

updated

@Huang-Wei (Member):

Ping @bbarnes52, could you follow up on #107623 (comment)?

> @bbarnes52 would you mind squashing the commits into one? Then I think it's good to be merged. BTW: as mentioned in today's sig-meeting, could you help summarize the profiling procedure into a practical document in https://github.com/kubernetes/community, under the folder contributors/devel/sig-scheduling? I believe this will be of great help to the whole community.

@alculquicondor (Member) left a comment:

/approve

@@ -306,7 +296,7 @@ func (pl *PodTopologySpread) Filter(ctx context.Context, cycleState *framework.C
 		return framework.NewStatus(framework.UnschedulableAndUnresolvable, ErrReasonNodeLabelNotMatch)
 	}
 
-	selfMatchNum := int32(0)
+	selfMatchNum := int(0)
Review comment (Member):

nit

Suggested change:
-	selfMatchNum := int(0)
+	selfMatchNum := 0

Reply (Member):

same a few lines below

@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, bbarnes52

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 4, 2022
@bbarnes52-zz (Contributor, Author):

> I'd like to take a look as well, if you don't mind, as I've tried to optimize these memory accesses myself in the past.
>
> @bbarnes52, do you think there are learnings from here that could be applied to the PreScore/Score extension points?

Good question. I took a look at parallelizing node label accesses in initPreScoreState. This is less of a bottleneck because it iterates over the filtered nodes rather than all nodes. The benchmarks I performed yielded statistically insignificant results, so I do not believe it is worth introducing additional synchronization at this time.

Regarding capturing the benchmarking procedure in a practical document, I would suggest Russ Cox's Profiling Go Programs, which describes all the methods used here and more.
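For anyone wanting to reproduce this kind of analysis, here is a minimal sketch of capturing a CPU profile with the standard runtime/pprof package (not the exact harness used for these benchmarks; runWorkload is a hypothetical stand-in for the code under test):

package main

import (
	"log"
	"os"
	"runtime/pprof"
)

func main() {
	// Write a CPU profile to disk; inspect it later with
	// `go tool pprof cpu.prof` and its top/list commands, as shown above.
	f, err := os.Create("cpu.prof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	runWorkload() // hypothetical: the code being profiled
}

func runWorkload() { /* ... */ }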

@alculquicondor (Member):

/lgtm

Thanks!

Do you plan to evaluate what can be done to improve Score?

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 4, 2022
@bbarnes52-zz (Contributor, Author):

> Do you plan to evaluate what can be done to improve Score?

Good question. Score appears to be less of a performance bottleneck. The PreScore/Score latencies from my benchmarks are reproduced below (all units in microseconds). I've also attached a CPU profile; this plugin's Score function is not shown because it consumes comparatively little CPU.

Looking at the code, I cannot identify any non-intrusive changes that would yield performance improvements.

   "PreScore": {
      "Perc50": 2212528,
      "Perc90": 3526917,
      "Perc99": 6195231
    },
    "Score": {
      "Perc50": 950042,
      "Perc90": 1478549,
      "Perc99": 1597463
    },

[Screenshot: PreScore/Score CPU profile, 2022-01-25]

@alculquicondor (Member):

Yeah, I squeezed as much performance out of Score as I could. But I thought this might help across the codebase: #107504

@k8s-triage-robot:

The Kubernetes project has merge-blocking tests that are currently too flaky to consistently pass.

This bot retests PRs for certain kubernetes repos according to the following rules:

  • The PR does not have any do-not-merge/* labels
  • The PR does not have the needs-ok-to-test label
  • The PR is mergeable (does not have a needs-rebase label)
  • The PR is approved (has cncf-cla: yes, lgtm, approved labels)
  • The PR is failing tests required for merge

You can:

/retest

@Huang-Wei (Member):

/retest

@k8s-ci-robot k8s-ci-robot merged commit 6410dda into kubernetes:master Feb 4, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.24 milestone Feb 4, 2022
@Ramyak Ramyak mentioned this pull request Jun 22, 2022