
scheduler: performance improvement on PodAffinity #76243

Merged
1 commit merged into kubernetes:master from the perf-podaffinity branch on Apr 10, 2019

Conversation

@Huang-Wei (Member) commented Apr 7, 2019

What type of PR is this?

/kind design
/sig scheduling
/assign @bsalamat

What this PR does / why we need it:

This PR tries to eliminate unnecessary Lock/Unlock in the logic of the InterPodAffinity priorities. Replacing them with atomic AddInt64 significantly improves the performance of the following (a minimal sketch of the idea follows the benchmark numbers):

  • Hard PodAffinity (2.2X performance improvement, see below)

    • Before

      BenchmarkSchedulingPodAffinity/5000Nodes/1000Pods-8               1000         284,892,826 ns/op
      BenchmarkSchedulingPodAffinity/5000Nodes/1000Pods-8               1000         285,320,757 ns/op
      BenchmarkSchedulingPodAffinity/5000Nodes/1000Pods-8               1000         287,498,053 ns/op
      
    • After (with this PR)

      BenchmarkSchedulingPodAffinity/5000Nodes/1000Pods-8               1000         127,001,991 ns/op
      BenchmarkSchedulingPodAffinity/5000Nodes/1000Pods-8               1000         129,263,078 ns/op
      BenchmarkSchedulingPodAffinity/5000Nodes/1000Pods-8               1000         125,813,803 ns/op
      
  • Soft PodAffinity/PodAntiAffinity (can be inferred from the code and the benchmark results for hard PodAffinity; we can add benchmark tests if necessary)
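
To make the change concrete, here is a minimal, self-contained sketch of the before/after idea; the type and field names are hypothetical, not the actual scheduler code. It contrasts per-node scores guarded by a shared mutex with lock-free updates via atomic.AddInt64.

    package main

    import (
        "fmt"
        "sync"
        "sync/atomic"
    )

    // mutexCounts: roughly the old approach, where every score update from a
    // parallel worker has to take a lock shared across all nodes.
    type mutexCounts struct {
        mu     sync.Mutex
        counts map[string]float64
    }

    func (m *mutexCounts) add(node string, weight float64) {
        m.mu.Lock()
        m.counts[node] += weight
        m.mu.Unlock()
    }

    // atomicCounts: the idea behind this PR, where each node has its own int64
    // counter that workers update lock-free via atomic.AddInt64.
    type atomicCounts struct {
        counts map[string]*int64 // populated up front; only the values are mutated afterwards
    }

    func (a *atomicCounts) add(node string, weight int64) {
        atomic.AddInt64(a.counts[node], weight)
    }

    func main() {
        mc := &mutexCounts{counts: map[string]float64{}}
        mc.add("node-1", 1)

        ac := &atomicCounts{counts: map[string]*int64{"node-1": new(int64)}}
        ac.add("node-1", 1)

        fmt.Println(mc.counts["node-1"], *ac.counts["node-1"])
    }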

Which issue(s) this PR fixes:

Special notes for your reviewer:

The benchmark results above were run on a bare-metal machine with an 8-core CPU and 32GB of memory.

Does this PR introduce a user-facing change?:

2X performance improvement on both required and preferred PodAffinity.

- replace unnecessary Lock/Unlock with atomic AddInt64
@k8s-ci-robot added the release-note-none, kind/design, sig/scheduling, size/M, cncf-cla: yes, and needs-priority labels on Apr 7, 2019
@k8s-ci-robot added the approved label on Apr 7, 2019
@wgliang (Contributor) left a comment

Cool, LGTM.
Will leave /lgtm to @bsalamat.

@@ -230,7 +227,7 @@ func (ipa *InterPodAffinity) CalculateInterPodAffinityPriority(pod *v1.Pod, node
 	for _, node := range nodes {
 		fScore := float64(0)
 		if (maxCount - minCount) > 0 {
-			fScore = float64(schedulerapi.MaxPriority) * ((pm.counts[node.Name] - minCount) / (maxCount - minCount))
+			fScore = float64(schedulerapi.MaxPriority) * (float64(*pm.counts[node.Name]-minCount) / float64(maxCount-minCount))
Contributor:

I guess we might have lost the float scores now. I believe we tried something similar in the past and it affected correctness. The one thing I am curious about is: is there any difference between the scores computed with this patch and without it? (IIRC, last time the issue was related to too many nodes having the same score.)

@Huang-Wei (Member, Author):

@ravisantoshgudimetla could you point me to the issue #?

This PR introduces the following changes related to fractions (a short sketch of the last point follows the list):

  • weight is always an integer, so the new change that loads weight into p.counts[node.Name] (from float64 to int64) in an atomic way should be fine.
  • the change from float64(term.Weight*int32(multiplier)) to int64(term.Weight*int32(multiplier)) should also be fine, as multiplier is of type int.
  • maxCount/minCount are assigned from p.counts[node.Name], so they should be good as well.
  • the last one: the change from (pm.counts[node.Name] - minCount) / (maxCount - minCount) to float64(*pm.counts[node.Name]-minCount) / float64(maxCount-minCount) doesn't lose correctness either.
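
As a small, self-contained illustration of that last point (the counts below are made up, not taken from the PR): scores accumulate as integers and only the final normalization divides in float64, so fractional precision is preserved.

    package main

    import "fmt"

    const maxPriority = 10 // stands in for schedulerapi.MaxPriority

    func main() {
        // Hypothetical per-node counts after all affinity terms were processed.
        counts := map[string]int64{"node-a": 10, "node-b": 30, "node-c": 50}
        minCount, maxCount := int64(10), int64(50)

        // Integer arithmetic first, then a single float64 division at the end.
        for node, c := range counts {
            fScore := float64(maxPriority) * (float64(c-minCount) / float64(maxCount-minCount))
            fmt.Printf("%s: %.1f\n", node, fScore) // node-b gets 5.0, not truncated to an integer
        }
    }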

Contributor:

Got it, thanks for the explanation. I had some questions about term.Weight*int32(multiplier), but I see that both of them are of type int.

@Huang-Wei (Member, Author):

/hold
in case this gets merged without thorough discussion.

@k8s-ci-robot added the do-not-merge/hold label on Apr 9, 2019
@@ -63,15 +64,15 @@ type podAffinityPriorityMap struct {
 	nodes []*v1.Node
 	// counts store the mapping from node name to so-far computed score of
 	// the node.
-	counts map[string]float64
+	counts map[string]*int64
Member:

why does the value need to be an int64 pointer, instead of just an int64?

@Huang-Wei (Member, Author):

It's because atomic.AddInt64 takes a pointer as its parameter.

If we go with map[string]int64, the map value is not addressable (&map["key"] is illegal); and you also can't do val := map["key"]; ptr := &val because that's a different address.
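
A quick, self-contained illustration of that constraint (the map keys here are hypothetical):

    package main

    import (
        "fmt"
        "sync/atomic"
    )

    func main() {
        // With map[string]*int64 every entry holds a stable pointer,
        // which is exactly what atomic.AddInt64 needs.
        counts := map[string]*int64{"node-1": new(int64)}
        atomic.AddInt64(counts["node-1"], 5)
        fmt.Println(*counts["node-1"]) // 5

        // With map[string]int64 this doesn't work:
        //   plain := map[string]int64{"node-1": 0}
        //   atomic.AddInt64(&plain["node-1"], 5) // compile error: cannot take the address of a map value
        //   val := plain["node-1"]               // copying the value out...
        //   atomic.AddInt64(&val, 5)             // ...only updates the copy, not the map entry
    }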

Member:

That's right. In Go you cannot get a pointer to map entries as they may change during execution.

@bsalamat (Member) left a comment

Overall looks good and makes sense. Just a small comment.

@bsalamat changed the title from scheduler: performance improvement on PodAffinity to scheduler: performance improvement on "preferred" PodAffinity on Apr 9, 2019
@bsalamat (Member) commented Apr 9, 2019

Performance improvement is over 2X for preferred affinity. This is worth noting in the release notes.

@Huang-Wei (Member, Author):

@bsalamat Will update the release note.

One thing to note is that the 2X improvement is not for soft pod affinity; it's for hard pod affinity.

BTW: performance for soft (preferred) pod (anti-)affinity hasn't been measured due to the lack of benchmark tests. But from the code's perspective, it's expected to show a performance improvement as well.

The reason the priority changes can also impact hard pod affinity is hardPodAffinitySymmetricWeight:

    if existingHasAffinityConstraints {
        // For every hard pod affinity term of <existingPod>, if <pod> matches the term,
        // increment <pm.counts> for every node in the cluster with the same <term.TopologyKey>
        // value as that of <existingPod>'s node by the constant <ipa.hardPodAffinityWeight>
        if ipa.hardPodAffinityWeight > 0 {
            terms := existingPodAffinity.PodAffinity.RequiredDuringSchedulingIgnoredDuringExecution
            // TODO: Uncomment this block when implement RequiredDuringSchedulingRequiredDuringExecution.
            //if len(existingPodAffinity.PodAffinity.RequiredDuringSchedulingRequiredDuringExecution) != 0 {
            //    terms = append(terms, existingPodAffinity.PodAffinity.RequiredDuringSchedulingRequiredDuringExecution...)
            //}
            for _, term := range terms {
                pm.processTerm(&term, existingPod, pod, existingPodNode, float64(ipa.hardPodAffinityWeight))
            }
        }

@k8s-ci-robot added the release-note label and removed the release-note-none label on Apr 9, 2019
@Huang-Wei changed the title from scheduler: performance improvement on "preferred" PodAffinity back to scheduler: performance improvement on PodAffinity on Apr 9, 2019
@bsalamat (Member) commented Apr 9, 2019

Thanks, @Huang-Wei for clarifying. You are right. This improvement impacts both soft and hard affinity.

@bsalamat (Member) left a comment

/lgtm
/approve

Please change the release note and remove "hard" from it. This PR improves performance of both hard and soft affinity.


@k8s-ci-robot added the lgtm label on Apr 9, 2019
@k8s-ci-robot:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bsalamat, Huang-Wei

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Huang-Wei (Member, Author):

Yeap. And hard pod anti-affinity isn't involved in Priorities, so this PR doesn't help for that case.

BTW: I'm also trying to improve hard pod anti-affinity, for example:

  1. change podsWithAffinity from a slice to a set:

     podsWithAffinity []*v1.Pod

  2. split podsWithAffinity into podsWithAffinity and podsWithHardAntiAffinity, so as to short-circuit:

     for _, existingPod := range nodeInfo.PodsWithAffinity() {

But I don't see either of them gaining a performance improvement yet. Will continue digging.

@bsalamat (Member):

One thing that could improve performance is to remove podsWithAffinity []*v1.Pod from NodeInfo and make it independent (or maybe a part of the metadata). Today, we go over the whole NodeInfo array and check whether any entry has a non-empty podsWithAffinity. In clusters with only a few affinity pods, podsWithAffinity is empty for most NodeInfo entries. If podsWithAffinity lived outside of NodeInfo, we could populate entries only for those nodes that have a pod with affinity.

something like:

NodesWithAffinityPods map[string]podsWithAffinity     // map from node name to affinity pods
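
Expanded into a rough sketch (the type and function names here are hypothetical, not actual Kubernetes code), the point is that scoring would iterate only over nodes that actually have affinity pods:

    package cache

    import (
        v1 "k8s.io/api/core/v1"
    )

    // affinityIndex keeps pods with (anti-)affinity outside of NodeInfo, keyed
    // by node name, so only nodes that actually have such pods are visited.
    type affinityIndex struct {
        // nodesWithAffinityPods maps a node name to the affinity pods running on it.
        nodesWithAffinityPods map[string][]*v1.Pod
    }

    // forEachAffinityPod visits every pod with affinity without scanning nodes
    // whose podsWithAffinity would have been empty.
    func (idx *affinityIndex) forEachAffinityPod(fn func(nodeName string, pod *v1.Pod)) {
        for nodeName, pods := range idx.nodesWithAffinityPods {
            for _, pod := range pods {
                fn(nodeName, pod)
            }
        }
    }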

@Huang-Wei (Member, Author):

/hold cancel

@k8s-ci-robot removed the do-not-merge/hold label on Apr 10, 2019
@k8s-ci-robot merged commit 8a9ed4c into kubernetes:master on Apr 10, 2019
@Huang-Wei deleted the perf-podaffinity branch on April 10, 2019 05:18
@Huang-Wei (Member, Author) commented Apr 11, 2019

FYI: after this PR merged, the perf improvement can be observed on perf-dash.k8s.io:

[image: perf-dash.k8s.io graph showing the improvement]
