
scheduler: performance improvement on PodAffinity #76243

Merged
1 commit merged into kubernetes:master from the perf-podaffinity branch on Apr 10, 2019

Conversation

@Huang-Wei (Member) commented Apr 7, 2019

What type of PR is this?

/kind design
/sig scheduling
/assign @bsalamat

What this PR does / why we need it:

This PR tries to eliminate unnecessary Lock/Unlock in the logic of the InterPodAffinity priorities. Replacing them with atomic AddInt64 significantly improves the performance of the following (a minimal sketch of the idea follows the benchmark numbers):

  • Hard PodAffinity (2.2X performance improvement, see below)

    • Before

      BenchmarkSchedulingPodAffinity/5000Nodes/1000Pods-8               1000         284,892,826 ns/op
      BenchmarkSchedulingPodAffinity/5000Nodes/1000Pods-8               1000         285,320,757 ns/op
      BenchmarkSchedulingPodAffinity/5000Nodes/1000Pods-8               1000         287,498,053 ns/op
      
    • After (with this PR)

      BenchmarkSchedulingPodAffinity/5000Nodes/1000Pods-8               1000         127,001,991 ns/op
      BenchmarkSchedulingPodAffinity/5000Nodes/1000Pods-8               1000         129,263,078 ns/op
      BenchmarkSchedulingPodAffinity/5000Nodes/1000Pods-8               1000         125,813,803 ns/op
      
  • Soft PodAffinity/PodAntiAffinity (can be inferred from the code and the benchmark results for hard PodAffinity; we can add benchmark tests if necessary)
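
To make the change concrete, here is a minimal, self-contained sketch of the before/after idea; the type and field names are hypothetical, not the actual scheduler code. It contrasts per-node scores guarded by a shared mutex with lock-free updates via atomic.AddInt64.

    package main

    import (
        "fmt"
        "sync"
        "sync/atomic"
    )

    // mutexCounts: roughly the old approach, where every score update from a
    // parallel worker has to take a lock shared across all nodes.
    type mutexCounts struct {
        mu     sync.Mutex
        counts map[string]float64
    }

    func (m *mutexCounts) add(node string, weight float64) {
        m.mu.Lock()
        m.counts[node] += weight
        m.mu.Unlock()
    }

    // atomicCounts: the idea behind this PR, where each node has its own int64
    // counter that workers update lock-free via atomic.AddInt64.
    type atomicCounts struct {
        counts map[string]*int64 // populated up front; only the values are mutated afterwards
    }

    func (a *atomicCounts) add(node string, weight int64) {
        atomic.AddInt64(a.counts[node], weight)
    }

    func main() {
        mc := &mutexCounts{counts: map[string]float64{}}
        mc.add("node-1", 1)

        ac := &atomicCounts{counts: map[string]*int64{"node-1": new(int64)}}
        ac.add("node-1", 1)

        fmt.Println(mc.counts["node-1"], *ac.counts["node-1"])
    }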

Which issue(s) this PR fixes:

Special notes for your reviewer:

The benchmark results above were run on a bare-metal machine with an 8-core CPU and 32GB of memory.

Does this PR introduce a user-facing change?:

2X performance improvement on both required and preferred PodAffinity.

- replace unnecessary Lock/Unlock with atomic AddInt64
@k8s-ci-robot added the release-note-none, kind/design, sig/scheduling, size/M, cncf-cla: yes, and needs-priority labels on Apr 7, 2019
@k8s-ci-robot added the approved label on Apr 7, 2019
@wgliang (Contributor) left a comment

Cool, LGTM.
Will leave /lgtm to @bsalamat.

@@ -230,7 +227,7 @@ func (ipa *InterPodAffinity) CalculateInterPodAffinityPriority(pod *v1.Pod, node
 	for _, node := range nodes {
 		fScore := float64(0)
 		if (maxCount - minCount) > 0 {
-			fScore = float64(schedulerapi.MaxPriority) * ((pm.counts[node.Name] - minCount) / (maxCount - minCount))
+			fScore = float64(schedulerapi.MaxPriority) * (float64(*pm.counts[node.Name]-minCount) / float64(maxCount-minCount))
Contributor:

I guess we might have lost the float scores now. I believe we tried something similar in the past and it affected correctness. The one thing I am curious about is: is there any difference between the scores computed with this patch and without it? (IIRC, last time the issue was related to too many nodes having the same score.)

@Huang-Wei (Member, Author):

@ravisantoshgudimetla could you point me to the issue #?

This PR introduces the following changes related to fractions (a short sketch of the last point follows the list):

  • weight is always an integer, so the new change that loads weight into p.counts[node.Name] (from float64 to int64) in an atomic way should be fine.
  • the change from float64(term.Weight*int32(multiplier)) to int64(term.Weight*int32(multiplier)) should also be fine, as multiplier is of type int.
  • maxCount/minCount are assigned from p.counts[node.Name], so they should be good as well.
  • the last one: the change from (pm.counts[node.Name] - minCount) / (maxCount - minCount) to float64(*pm.counts[node.Name]-minCount) / float64(maxCount-minCount) doesn't lose correctness either.
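
As a small, self-contained illustration of that last point (the counts below are made up, not taken from the PR): scores accumulate as integers and only the final normalization divides in float64, so fractional precision is preserved.

    package main

    import "fmt"

    const maxPriority = 10 // stands in for schedulerapi.MaxPriority

    func main() {
        // Hypothetical per-node counts after all affinity terms were processed.
        counts := map[string]int64{"node-a": 10, "node-b": 30, "node-c": 50}
        minCount, maxCount := int64(10), int64(50)

        // Integer arithmetic first, then a single float64 division at the end.
        for node, c := range counts {
            fScore := float64(maxPriority) * (float64(c-minCount) / float64(maxCount-minCount))
            fmt.Printf("%s: %.1f\n", node, fScore) // node-b gets 5.0, not truncated to an integer
        }
    }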

Contributor:

Got it, thanks for the explanation. I had some questions about term.Weight*int32(multiplier), but I see that both of them are of type int.

@Huang-Wei (Member, Author):

/hold
in case this gets merged without thorough discussion.

@k8s-ci-robot added the do-not-merge/hold label on Apr 9, 2019
@@ -63,15 +64,15 @@ type podAffinityPriorityMap struct {
 	nodes []*v1.Node
 	// counts store the mapping from node name to so-far computed score of
 	// the node.
-	counts map[string]float64
+	counts map[string]*int64
Member:

why does the value need to be an int64 pointer, instead of just an int64?

@Huang-Wei (Member, Author):

It's because atomic.AddInt64 takes a pointer as its parameter.

If we go with map[string]int64, the map value is not addressable (&map["key"] is illegal); and you also can't do val := map["key"]; ptr := &val because that's a different address.
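
A quick, self-contained illustration of that constraint (the map keys here are hypothetical):

    package main

    import (
        "fmt"
        "sync/atomic"
    )

    func main() {
        // With map[string]*int64 every entry holds a stable pointer,
        // which is exactly what atomic.AddInt64 needs.
        counts := map[string]*int64{"node-1": new(int64)}
        atomic.AddInt64(counts["node-1"], 5)
        fmt.Println(*counts["node-1"]) // 5

        // With map[string]int64 this doesn't work:
        //   plain := map[string]int64{"node-1": 0}
        //   atomic.AddInt64(&plain["node-1"], 5) // compile error: cannot take the address of a map value
        //   val := plain["node-1"]               // copying the value out...
        //   atomic.AddInt64(&val, 5)             // ...only updates the copy, not the map entry
    }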

Member:

That's right. In Go you cannot get a pointer to map entries as they may change during execution.

@bsalamat (Member) left a comment

Overall looks good and makes sense. Just a small comment.

@bsalamat changed the title from scheduler: performance improvement on PodAffinity to scheduler: performance improvement on "preferred" PodAffinity on Apr 9, 2019
@bsalamat (Member) commented Apr 9, 2019

Performance improvement is over 2X for preferred affinity. This is worth noting in the release notes.

@Huang-Wei (Member, Author):

@bsalamat Will update the release note.

One thing to note is that the 2X improvement is not for soft pod affinity; it's for hard pod affinity.

BTW: performance for soft (preferred) pod (anti-)affinity hasn't been measured due to the lack of benchmark tests. But from the code's perspective, it's expected to show a performance improvement as well.

The reason the priority changes can also impact hard pod affinity is hardPodAffinitySymmetricWeight:

    if existingHasAffinityConstraints {
        // For every hard pod affinity term of <existingPod>, if <pod> matches the term,
        // increment <pm.counts> for every node in the cluster with the same <term.TopologyKey>
        // value as that of <existingPod>'s node by the constant <ipa.hardPodAffinityWeight>
        if ipa.hardPodAffinityWeight > 0 {
            terms := existingPodAffinity.PodAffinity.RequiredDuringSchedulingIgnoredDuringExecution
            // TODO: Uncomment this block when implement RequiredDuringSchedulingRequiredDuringExecution.
            //if len(existingPodAffinity.PodAffinity.RequiredDuringSchedulingRequiredDuringExecution) != 0 {
            //    terms = append(terms, existingPodAffinity.PodAffinity.RequiredDuringSchedulingRequiredDuringExecution...)
            //}
            for _, term := range terms {
                pm.processTerm(&term, existingPod, pod, existingPodNode, float64(ipa.hardPodAffinityWeight))
            }
        }

@k8s-ci-robot added the release-note label and removed the release-note-none label on Apr 9, 2019
@Huang-Wei changed the title from scheduler: performance improvement on "preferred" PodAffinity back to scheduler: performance improvement on PodAffinity on Apr 9, 2019
@bsalamat (Member) commented Apr 9, 2019

Thanks, @Huang-Wei for clarifying. You are right. This improvement impacts both soft and hard affinity.

@bsalamat (Member) left a comment

/lgtm
/approve

Please change the release note and remove "hard" from it. This PR improves performance of both hard and soft affinity.


@k8s-ci-robot added the lgtm label on Apr 9, 2019
@k8s-ci-robot:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bsalamat, Huang-Wei

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Huang-Wei (Member, Author):

Yeap. And hard pod anti-affinity isn't involved in Priorities, so this PR doesn't help for that case.

BTW: I'm also trying to improve hard pod anti-affinity, for example:

  1. change podsWithAffinity from a slice to a set:

     podsWithAffinity []*v1.Pod

  2. split podsWithAffinity into podsWithAffinity and podsWithHardAntiAffinity, so as to short-circuit:

     for _, existingPod := range nodeInfo.PodsWithAffinity() {

But I don't see either of them gaining a performance improvement yet. Will continue digging.

@bsalamat (Member):

One thing that could improve performance is to remove podsWithAffinity []*v1.Pod from NodeInfo and make it independent (or maybe a part of the metadata). Today, we go over the whole NodeInfo array and check whether any entry has a non-empty podsWithAffinity. In clusters with only a few affinity pods, podsWithAffinity is empty for most NodeInfo entries. If podsWithAffinity lived outside of NodeInfo, we could populate entries only for those nodes that have a pod with affinity.

something like:

NodesWithAffinityPods map[string]podsWithAffinity     // map from node name to affinity pods
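
Expanded into a rough sketch (the type and function names here are hypothetical, not actual Kubernetes code), the point is that scoring would iterate only over nodes that actually have affinity pods:

    package cache

    import (
        v1 "k8s.io/api/core/v1"
    )

    // affinityIndex keeps pods with (anti-)affinity outside of NodeInfo, keyed
    // by node name, so only nodes that actually have such pods are visited.
    type affinityIndex struct {
        // nodesWithAffinityPods maps a node name to the affinity pods running on it.
        nodesWithAffinityPods map[string][]*v1.Pod
    }

    // forEachAffinityPod visits every pod with affinity without scanning nodes
    // whose podsWithAffinity would have been empty.
    func (idx *affinityIndex) forEachAffinityPod(fn func(nodeName string, pod *v1.Pod)) {
        for nodeName, pods := range idx.nodesWithAffinityPods {
            for _, pod := range pods {
                fn(nodeName, pod)
            }
        }
    }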

@Huang-Wei (Member, Author):

/hold cancel

@k8s-ci-robot removed the do-not-merge/hold label on Apr 10, 2019
@k8s-ci-robot merged commit 8a9ed4c into kubernetes:master on Apr 10, 2019
@Huang-Wei deleted the perf-podaffinity branch on April 10, 2019 05:18
@Huang-Wei (Member, Author) commented Apr 11, 2019

FYI: after this PR merged, the perf improvement can be observed on perf-dash.k8s.io:

[image: perf-dash.k8s.io graph showing the improvement]
