
scheduler: performance improvement on PodAffinity #76243

Merged
merged 1 commit into kubernetes:master from Huang-Wei:perf-podaffinity on Apr 10, 2019

Conversation
Conversation

@Huang-Wei
Member

commented Apr 7, 2019

What type of PR is this?

/kind design
/sig scheduling
/assign @bsalamat

What this PR does / why we need it:

This PR eliminates unnecessary Lock/Unlock calls in the InterPodAffinity priority logic. Replacing them with atomic AddInt64 significantly improves the performance of (a rough sketch of the change follows the benchmark numbers below):

  • Hard PodAffinity (2.2X performance improvement, see below)

    • Before

      BenchmarkSchedulingPodAffinity/5000Nodes/1000Pods-8               1000         284,892,826 ns/op
      BenchmarkSchedulingPodAffinity/5000Nodes/1000Pods-8               1000         285,320,757 ns/op
      BenchmarkSchedulingPodAffinity/5000Nodes/1000Pods-8               1000         287,498,053 ns/op
      
    • After (with this PR)

      BenchmarkSchedulingPodAffinity/5000Nodes/1000Pods-8               1000         127,001,991 ns/op
      BenchmarkSchedulingPodAffinity/5000Nodes/1000Pods-8               1000         129,263,078 ns/op
      BenchmarkSchedulingPodAffinity/5000Nodes/1000Pods-8               1000         125,813,803 ns/op
      
  • Soft PodAffinity/PodAntiAffinity (can be inferred from the code and the benchmark results for Hard PodAffinity; we can add benchmark tests if necessary)
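For illustration, a minimal sketch (not the exact PR diff; names are simplified) of the kind of change this PR makes: replacing a mutex-guarded per-node score update with a lock-free atomic add on a pre-allocated *int64.

package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Before (sketch): every per-node score update contends on a single mutex.
type lockedCounts struct {
	mu     sync.Mutex
	counts map[string]float64
}

func (c *lockedCounts) add(node string, weight float64) {
	c.mu.Lock()
	c.counts[node] += weight
	c.mu.Unlock()
}

// After (sketch): per-node counters are allocated up front, then incremented
// with atomic.AddInt64 and no locking.
type atomicCounts struct {
	counts map[string]*int64
}

func (c *atomicCounts) add(node string, weight int64) {
	atomic.AddInt64(c.counts[node], weight)
}

func main() {
	c := atomicCounts{counts: map[string]*int64{"node-1": new(int64)}}
	c.add("node-1", 5)
	fmt.Println(*c.counts["node-1"]) // 5
}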

Which issue(s) this PR fixes:

Special notes for your reviewer:

The above test results were collected on a bare-metal machine with an 8-core CPU and 32GB of memory.

Does this PR introduce a user-facing change?:

2X performance improvement on both required and preferred PodAffinity.
scheduler: performance improvement on PodAffinity
- replace unnecessary Lock/Unlock with atomic AddInt64
@wgliang

wgliang approved these changes Apr 8, 2019

Member

left a comment

Cool, LGTM.
Will leave /lgtm to @bsalamat.

@@ -230,7 +227,7 @@ func (ipa *InterPodAffinity) CalculateInterPodAffinityPriority(pod *v1.Pod, node
	for _, node := range nodes {
		fScore := float64(0)
		if (maxCount - minCount) > 0 {
-			fScore = float64(schedulerapi.MaxPriority) * ((pm.counts[node.Name] - minCount) / (maxCount - minCount))
+			fScore = float64(schedulerapi.MaxPriority) * (float64(*pm.counts[node.Name]-minCount) / float64(maxCount-minCount))


ravisantoshgudimetla Apr 8, 2019

Contributor

I suspect we may be losing the fractional scores here. I believe we tried something similar in the past and it affected correctness. The one thing I'm curious about: is there any difference between the scores computed with and without this patch? (IIRC, last time the issue was that too many nodes ended up with the same score.)


Huang-Wei Apr 8, 2019

Author Member

@ravisantoshgudimetla could you point me to the issue #?

This PR introduces the following changes related to fractions:

  • weight is always an integer, so the new change that loads weight into p.counts[node.Name] (from float64 to int64) in an atomic way should be fine.
  • the change from float64(term.Weight*int32(multiplier)) to int64(term.Weight*int32(multiplier)) should also be fine, as multiplier is of type int.
  • maxCount/minCount are assigned from p.counts[node.Name], so they should be good as well.
  • the last one: the change from (pm.counts[node.Name] - minCount) / (maxCount - minCount) to float64(*pm.counts[node.Name]-minCount) / float64(maxCount-minCount); this doesn't lose correctness either (see the sketch below).
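To make the last point concrete, a small self-contained illustration (the values are made up): the per-node counts are now int64, but the final normalization still divides in float64, so fractional scores are preserved.

package main

import "fmt"

func main() {
	// Hypothetical per-node count and the min/max across all nodes.
	count := int64(7)
	minCount, maxCount := int64(2), int64(12)
	maxPriority := 10.0 // schedulerapi.MaxPriority

	// Same shape as the new expression in the diff above.
	fScore := maxPriority * (float64(count-minCount) / float64(maxCount-minCount))
	fmt.Println(fScore) // 5, i.e. half of MaxPriority; the fraction is not truncated
}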


ravisantoshgudimetla Apr 10, 2019

Contributor

Got it, thanks for the explanation. I have some questions about term.Weight*int32(multiplier), but I see that both of them are of type int.

@Huang-Wei

Member Author

commented Apr 9, 2019

/hold
to avoid a merge without thorough discussion.

@@ -63,15 +64,15 @@ type podAffinityPriorityMap struct {
	nodes []*v1.Node
	// counts store the mapping from node name to so-far computed score of
	// the node.
-	counts map[string]float64
+	counts map[string]*int64


bsalamat Apr 9, 2019

Member

why does the value need to be an int64 pointer, instead of just an int64?


Huang-Wei Apr 9, 2019

Author Member

It's because atomic.AddInt64 takes a pointer as its parameter.

If we go with map[string]int64, the map value is not addressable (&map["key"] is illegal); and you also can't do val := map["key"]; ptr := &val because that points to a different address (the copy, not the map entry).
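A tiny, illustrative sketch of the addressability point (not code from the PR):

package main

import (
	"fmt"
	"sync/atomic"
)

func main() {
	// With map[string]int64 this does NOT compile:
	//   m := map[string]int64{"node-1": 0}
	//   atomic.AddInt64(&m["node-1"], 1) // error: cannot take the address of m["node-1"]
	// and copying the value out first would only increment the copy, not the map entry.

	// With map[string]*int64 the counter lives outside the map, so its address is
	// stable even if the map grows or rehashes.
	m := map[string]*int64{"node-1": new(int64)}
	atomic.AddInt64(m["node-1"], 1)
	fmt.Println(*m["node-1"]) // 1
}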


bsalamat Apr 9, 2019

Member

That's right. In Go you cannot get a pointer to map entries as they may change during execution.

@bsalamat
Member

left a comment

Overall looks good and makes sense. Just a small comment.

@bsalamat bsalamat changed the title scheduler: performance improvement on PodAffinity scheduler: performance improvement on "preferred" PodAffinity Apr 9, 2019

@bsalamat

Member

commented Apr 9, 2019

Performance improvement is over 2X for preferred affinity. This is worth noting in the release notes.

@Huang-Wei

Member Author

commented Apr 9, 2019

@bsalamat Will update the release note.

One thing to note is that the 2X improvement is not for soft pod affinity, it's for hard pod affinity.

BTW: the performance of soft (preferred) pod (anti-)affinity hasn't been measured due to a lack of benchmark tests. But from the code's perspective, it's expected to improve as well.

The reason the priority changes also impact hard pod affinity is hardPodAffinitySymmetricWeight:

if existingHasAffinityConstraints {
	// For every hard pod affinity term of <existingPod>, if <pod> matches the term,
	// increment <pm.counts> for every node in the cluster with the same <term.TopologyKey>
	// value as that of <existingPod>'s node by the constant <ipa.hardPodAffinityWeight>
	if ipa.hardPodAffinityWeight > 0 {
		terms := existingPodAffinity.PodAffinity.RequiredDuringSchedulingIgnoredDuringExecution
		// TODO: Uncomment this block when implement RequiredDuringSchedulingRequiredDuringExecution.
		//if len(existingPodAffinity.PodAffinity.RequiredDuringSchedulingRequiredDuringExecution) != 0 {
		//	terms = append(terms, existingPodAffinity.PodAffinity.RequiredDuringSchedulingRequiredDuringExecution...)
		//}
		for _, term := range terms {
			pm.processTerm(&term, existingPod, pod, existingPodNode, float64(ipa.hardPodAffinityWeight))
		}
	}

@Huang-Wei Huang-Wei changed the title scheduler: performance improvement on "preferred" PodAffinity scheduler: performance improvement on PodAffinity Apr 9, 2019

@bsalamat

Member

commented Apr 9, 2019

Thanks, @Huang-Wei for clarifying. You are right. This improvement impacts both soft and hard affinity.

@bsalamat
Member

left a comment

/lgtm
/approve

Please change the release note and remove "hard" from it. This PR improves performance of both hard and soft affinity.


@k8s-ci-robot k8s-ci-robot added the lgtm label Apr 9, 2019

@k8s-ci-robot

Contributor

commented Apr 9, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bsalamat, Huang-Wei

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Huang-Wei

Member Author

commented Apr 9, 2019

Yeap. And hard pod anti-affinity isn't involved in Priorities, so this PR doesn't help for that case.

BTW: I'm also trying to improve hard pod anti-affinity, such as:

  1. change podsWithAffinity from slice to set

podsWithAffinity []*v1.Pod

  2. split podsWithAffinity into podsWithAffinity and podsWithHardAntiAffinity, so as to short-circuit (a rough sketch follows below):

for _, existingPod := range nodeInfo.PodsWithAffinity() {

But I don't see either of them gaining a performance improvement yet. Will continue digging.
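For concreteness, a rough sketch of what the second idea could look like (the struct and field names here are hypothetical, not the real NodeInfo API):

package nodeinfo

import v1 "k8s.io/api/core/v1"

// Hypothetical split: callers that only care about required anti-affinity could
// iterate the (usually much smaller) second slice and short-circuit early.
type NodeInfo struct {
	podsWithAffinity         []*v1.Pod // pods with any affinity/anti-affinity terms
	podsWithHardAntiAffinity []*v1.Pod // subset with required anti-affinity only
}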

@bsalamat

Member

commented Apr 10, 2019

One thing that can improve performance is to remove podsWithAffinity []*v1.Pod from NodeInfo and make it independent (or maybe a part of the metadata). Today, we go over the whole "NodeInfo" array and check whether any of them has a non-zero-length podsWithAffinity. In clusters with only a few affinity pods, podsWithAffinity is empty for most of the "NodeInfo" entries. If podsWithAffinity were outside of NodeInfo, we could populate entries only for those nodes which have a pod with affinity.

something like:

NodesWithAffinityPods map[string]podsWithAffinity     // map from node name to affinity pods
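A rough expansion of that idea (the type and field names here are illustrative, not an actual scheduler API):

package cache

import v1 "k8s.io/api/core/v1"

type podsWithAffinity []*v1.Pod

// Sketch: instead of scanning every NodeInfo for a non-empty podsWithAffinity
// slice, keep one map with entries only for nodes that actually host such pods.
type affinityIndex struct {
	// map from node name to the pods with (anti-)affinity on that node
	NodesWithAffinityPods map[string]podsWithAffinity
}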
@Huang-Wei

Member Author

commented Apr 10, 2019

/hold cancel

@k8s-ci-robot k8s-ci-robot merged commit 8a9ed4c into kubernetes:master Apr 10, 2019

17 checks passed

cla/linuxfoundation: Huang-Wei authorized
pull-kubernetes-bazel-build: Job succeeded.
pull-kubernetes-bazel-test: Job succeeded.
pull-kubernetes-conformance-image-test: Skipped.
pull-kubernetes-cross: Skipped.
pull-kubernetes-e2e-gce: Job succeeded.
pull-kubernetes-e2e-gce-100-performance: Job succeeded.
pull-kubernetes-e2e-gce-device-plugin-gpu: Job succeeded.
pull-kubernetes-godeps: Skipped.
pull-kubernetes-integration: Job succeeded.
pull-kubernetes-kubemark-e2e-gce-big: Job succeeded.
pull-kubernetes-local-e2e: Skipped.
pull-kubernetes-node-e2e: Job succeeded.
pull-kubernetes-typecheck: Job succeeded.
pull-kubernetes-verify: Job succeeded.
pull-publishing-bot-validate: Skipped.
tide: In merge pool.

@Huang-Wei Huang-Wei deleted the Huang-Wei:perf-podaffinity branch Apr 10, 2019

@Huang-Wei

Member Author

commented Apr 11, 2019

FYI: after the merge of this PR, the perf improvement can be observed on perf-dash.k8s.io:

[image: perf-dash.k8s.io screenshot]

@bsalamat

Member

commented Apr 12, 2019
