
Improve performance of affinity/anti-affinity predicate by 20x in large clusters #62211

Merged
merged 3 commits into kubernetes:master on Apr 13, 2018

Conversation

@bsalamat
Contributor

bsalamat commented Apr 6, 2018

What this PR does / why we need it:
Improves performance of the affinity/anti-affinity predicate by over 20x in large clusters. The improvement is smaller in small clusters, but it is still very significant, at about 4x. Also, before this PR, performance of the predicate was dropping quadratically with increasing numbers of nodes and pods. As the results show, the slowdown is now linear in larger clusters.

The affinity/anti-affinity predicate was checking all pods of the cluster for every node in the cluster to determine the feasibility of the affinity/anti-affinity terms of the pod being scheduled. This optimization first finds, once per scheduling cycle, all the pods in the cluster that match the affinity/anti-affinity terms of the pod being scheduled and stores them in the predicate metadata. It then only checks the topology of the matching pods for each node in the cluster.
This results in a major reduction of the search space per node and improves performance significantly.
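To illustrate the idea, here is a minimal, self-contained sketch rather than the actual code in this PR; all types and names in it are made up for illustration, and the real predicate of course evaluates full label selectors and both affinity and anti-affinity terms:

package main

import "fmt"

// pod and node are simplified stand-ins for v1.Pod and v1.Node.
type pod struct {
	name     string
	nodeName string
	labels   map[string]string
}

type node struct {
	name   string
	labels map[string]string // topology labels, e.g. "region"
}

// termMatches is a stand-in for evaluating an (anti-)affinity term's label selector.
func termMatches(selector map[string]string, p pod) bool {
	for k, v := range selector {
		if p.labels[k] != v {
			return false
		}
	}
	return true
}

func main() {
	// Anti-affinity term of the pod being scheduled.
	selector := map[string]string{"app": "db"}
	topologyKey := "region"

	nodes := []node{
		{"n1", map[string]string{"region": "us-east"}},
		{"n2", map[string]string{"region": "us-west"}},
	}
	pods := []pod{
		{"db-0", "n1", map[string]string{"app": "db"}},
		{"web-0", "n2", map[string]string{"app": "web"}},
	}

	// Pass 1, done once per scheduling cycle: find every pod matching the term
	// and remember it by node name. Cost: O(#pods).
	matchingByNode := map[string][]pod{}
	for _, p := range pods {
		if termMatches(selector, p) {
			matchingByNode[p.nodeName] = append(matchingByNode[p.nodeName], p)
		}
	}

	nodeByName := map[string]node{}
	for _, n := range nodes {
		nodeByName[n.name] = n
	}

	// Pass 2, per candidate node: only compare topology values of the (few)
	// matching pods, instead of re-scanning every pod for every node.
	for _, n := range nodes {
		feasible := true
		for nodeName, matches := range matchingByNode {
			if len(matches) > 0 && nodeByName[nodeName].labels[topologyKey] == n.labels[topologyKey] {
				feasible = false // a matching pod already runs in this topology domain
				break
			}
		}
		fmt.Printf("node %s feasible: %v\n", n.name, feasible)
	}
}

The expensive selector matching runs once over all pods, and the per-node loop only touches the small set of matches.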

The results below were obtained by running the scheduler benchmarks:

make test-integration WHAT=./test/integration/scheduler_perf KUBE_TEST_ARGS="-run=xxx -bench=.*BenchmarkSchedulingAntiAffinity"
AntiAffinity Topology: Hostname
before: BenchmarkSchedulingAntiAffinity/500Nodes/250Pods-12      37031638 ns/op
after:  BenchmarkSchedulingAntiAffinity/500Nodes/250Pods-12      10373222 ns/op

before: BenchmarkSchedulingAntiAffinity/500Nodes/5000Pods-12    134205302 ns/op
after:  BenchmarkSchedulingAntiAffinity/500Nodes/5000Pods-12     12000580 ns/op

before: BenchmarkSchedulingAntiAffinity/1000Nodes/10000Pods-12  498439953 ns/op
after:  BenchmarkSchedulingAntiAffinity/1000Nodes/10000Pods-12   24692552 ns/op

AntiAffinity Topology: Region
before: BenchmarkSchedulingAntiAffinity/500Nodes/250Pods-12      60003672 ns/op
after:  BenchmarkSchedulingAntiAffinity/500Nodes/250Pods-12      13346400 ns/op

before: BenchmarkSchedulingAntiAffinity/1000Nodes/10000Pods-12  600085491 ns/op
after:  BenchmarkSchedulingAntiAffinity/1000Nodes/10000Pods-12   27783333 ns/op

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

ref: #56032 #47318 #25319

Release note:

Improve performance of the affinity/anti-affinity predicate of the default scheduler significantly.

/sig scheduling

@k8s-ci-robot k8s-ci-robot requested review from jayunit100 and timothysc Apr 6, 2018

@bsalamat bsalamat requested review from k82cn and removed request for timothysc and jayunit100 Apr 6, 2018


@k82cn

Member

k82cn commented Apr 9, 2018

/assign

I'm going to review this ASAP :)

@k82cn

Member

k82cn commented Apr 11, 2018

Status update: started reviewing today, not finished yet :)

@k82cn

Member

k82cn commented Apr 12, 2018

LGTM overall; performance is improved by avoiding checking all pods for each node.
I'd like to check more details tomorrow. Thanks very much for your patience.

matchingAntiAffinityTerms map[string][]matchingPodAntiAffinityTerm
// A map of node name to a list of Pods on the node that can potentially match
// the affinity rules of the "pod".
matchingAffinityPods map[string][]*v1.Pod

@ravisantoshgudimetla

ravisantoshgudimetla Apr 12, 2018

Contributor

probably nodeNameToMatchingAffinityPods conveys the intention more clearly.

@bsalamat

bsalamat Apr 12, 2018

Contributor

Done.

if affinity != nil && len(podNodeName) > 0 {
if affinity.PodAffinity != nil {
for i, p := range meta.matchingAffinityPods[podNodeName] {
if p == deletedPod {

@ravisantoshgudimetla

ravisantoshgudimetla Apr 12, 2018

Contributor

Shouldn't we use reflect.DeepEqual?

@bsalamat

bsalamat Apr 12, 2018

Contributor

It is not needed, as the removal can happen only within a single scheduling cycle. We take a snapshot of the NodeInfo at the start of a scheduling cycle, and the pointers are expected to remain the same during the cycle for the taken snapshot. But if we decide to change that logic one day, we should then compare the UIDs of the pods.

// "pod". Similar to getPodsMatchingAffinity, this function does not check topology.
// So, whether the targetPod actually matches or not needs further checks for a specific
// node.
func targetPodMatchesAffinityOfPod(pod, targetPod *v1.Pod) bool {

@ravisantoshgudimetla

ravisantoshgudimetla Apr 12, 2018

Contributor

I think you could combine this function and targetPodMatchesAntiaffinity by having one more field for affinity or antiaffinity terms.

@bsalamat

bsalamat Apr 12, 2018

Contributor

We can, but the first half of the functions would have to be duplicated and more condition checks would have to be added. I feel that would be less readable, just for the sake of not duplicating only several lines of code.

for i, p := range meta.nodeNameToMatchingAffinityPods[podNodeName] {
if p == deletedPod {
meta.nodeNameToMatchingAffinityPods[podNodeName] = append(meta.nodeNameToMatchingAffinityPods[podNodeName][:i],
meta.nodeNameToMatchingAffinityPods[podNodeName][i+1:]...)

@ravisantoshgudimetla

ravisantoshgudimetla Apr 12, 2018

Contributor

nit: Since this PR is performance related and we are using pointers to pod structs in the array, the append may cause potential memory leaks. I think it would be better if we could set the value of the item to nil.

@bsalamat

bsalamat Apr 12, 2018

Contributor

I am not sure I understand the concern. This way of deleting the pod from the array removes the pod pointer from the array. Once the reference to the pod is removed from the array and other objects, the garbage collector should be able to release the memory.

@ravisantoshgudimetla

ravisantoshgudimetla Apr 12, 2018

Contributor

While I am not 100% sure, IIUC the pod pointer element will still have values for pod.Spec and the other items in the v1.Pod struct, which won't be deleted by deleting the item here. So, in a way, they are still referenced by this slice and probably not garbage collected. I think it is always recommended in Go, if the elements are pointers, to copy and set the vacated slot to nil.

Ref: https://github.com/golang/go/wiki/SliceTricks

@bsalamat

bsalamat Apr 12, 2018

Contributor

Ref: https://github.com/golang/go/wiki/SliceTricks

Oh, that's a slightly different problem. When you shrink an array's length (like we do in delete) but not its capacity, a few items at the end of the array (between the array length and the array capacity) remain in memory. When the items are pointers, those pointers keep referencing objects and prevent the garbage collector from cleaning those objects. So, it is recommended to set those items to nil. Those items are not necessarily the deleted items.
In this piece of code, we should be fine as the whole array is deleted at the end of a scheduling cycle and all of its elements will go.
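For reference, a sketch of that nil-ing pattern from the SliceTricks page, illustrative only and not code from this PR; it assumes a []*v1.Pod slice with the usual k8s.io/api/core/v1 import, and deleteAt is a made-up name:

// deleteAt removes pods[i] and clears the slot left between the new length
// and the capacity so it no longer pins a Pod in memory.
func deleteAt(pods []*v1.Pod, i int) []*v1.Pod {
	copy(pods[i:], pods[i+1:])  // shift the tail left by one, overwriting element i
	pods[len(pods)-1] = nil     // clear the vacated trailing slot so the GC can reclaim the Pod
	return pods[:len(pods)-1]   // shrink the length; the capacity stays the same
}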

@bsalamat

bsalamat Apr 12, 2018

Contributor

But your link reminded me that we should swap the last element of the array with the deleted item and shrink the array length instead of removing an item from the middle of array. That method would be faster and avoids copying many elements.
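A sketch of that swap-with-last variant (again illustrative, with a made-up name and the same assumptions about a []*v1.Pod slice):

// deleteAtUnordered removes pods[i] by moving the last element into its place.
// Order is not preserved, but only one element is copied regardless of i.
func deleteAtUnordered(pods []*v1.Pod, i int) []*v1.Pod {
	last := len(pods) - 1
	pods[i] = pods[last] // overwrite the removed item with the last element
	pods[last] = nil     // clear the vacated slot so it does not pin a Pod
	return pods[:last]
}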

@ravisantoshgudimetla

ravisantoshgudimetla Apr 13, 2018

Contributor

Thanks, Bobby, for the explanation.

So, it is recommended to set those items to nil. Those items are not necessarily the deleted items. In this piece of code, we should be fine as the whole array is deleted at the end of a scheduling cycle and all of its elements will go.

To be clear: if we use the append method as above (with pointers) for the delete, we will have garbage collection issues, but since we are deleting this array at the end of the scheduling cycle, we won't have a problem(?)

But your link reminded me that we should swap the last element of the array with the deleted item and shrink the array length instead of removing an item from the middle of array. That method would be faster and avoids copying many elements.

Right, theoretically yes. But last time I looked around, the benchmark differences were in the range of 6-10 ns/op, so I am not sure it would add much benefit (to be fair, the data type was an int and it was in Go 1.7, so we may have some improvement).

@bsalamat

bsalamat Apr 13, 2018

Contributor

To be clear: if we use the append method as above (with pointers) for the delete, we will have garbage collection issues, but since we are deleting this array at the end of the scheduling cycle, we won't have a problem(?)

Appending pointers to an array does not have memory leak problems if the pointers are deleted from the array; the problem arises when pointers remain in the array. In Go, slices have a length, but their actual number of items in memory is often larger than their length. The total size of a slice is called its capacity. The problem happens when there are pointers in the memory layout of a slice between its length and its capacity.

But last time I looked around, the benchmark differences were in the range of 6-10 ns/op, so I am not sure it would add much benefit (to be fair, the data type was an int and it was in Go 1.7, so we may have some improvement).

The savings are definitely minimal, especially since these arrays are pretty short. In longer arrays the number of copies is larger and the savings would be more significant, but any amount of saving is beneficial if it can be achieved easily.

@ravisantoshgudimetla

Contributor

ravisantoshgudimetla commented Apr 13, 2018

/retest

@@ -108,12 +117,19 @@ func (pfactory *PredicateMetadataFactory) GetMetadata(pod *v1.Pod, nodeNameToInf
if err != nil {
return nil
}
affinityPods, antiAffinityPods, err := getPodsMatchingAffinity(pod, nodeNameToInfoMap)
if err != nil {
glog.Errorf("[predicate meta data generation] error finding pods that match affinity terms")

@k82cn

k82cn Apr 13, 2018

Member

please also include the underlying error in the error message :)
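For example, something along these lines (just a sketch of the suggestion, not the final code):

glog.Errorf("[predicate meta data generation] error finding pods that match affinity terms: %v", err)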

@k82cn

Member

k82cn commented Apr 13, 2018

/lgtm

The fix is ok to me :)

@k8s-ci-robot k8s-ci-robot added the lgtm label Apr 13, 2018

@k8s-ci-robot

Contributor

k8s-ci-robot commented Apr 13, 2018

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bsalamat, k82cn

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-merge-robot

Contributor

k8s-merge-robot commented Apr 13, 2018

Automatic merge from submit-queue (batch tested with PRs 62467, 62482, 62211). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-merge-robot k8s-merge-robot merged commit 3cdf5ee into kubernetes:master Apr 13, 2018

15 checks passed

Submit Queue: Queued to run github e2e tests a second time.
cla/linuxfoundation: bsalamat authorized
pull-kubernetes-bazel-build: Job succeeded.
pull-kubernetes-bazel-test: Job succeeded.
pull-kubernetes-cross: Skipped
pull-kubernetes-e2e-gce: Job succeeded.
pull-kubernetes-e2e-gce-device-plugin-gpu: Job succeeded.
pull-kubernetes-e2e-gke: Skipped
pull-kubernetes-e2e-kops-aws: Job succeeded.
pull-kubernetes-integration: Job succeeded.
pull-kubernetes-kubemark-e2e-gce: Job succeeded.
pull-kubernetes-local-e2e: Skipped
pull-kubernetes-node-e2e: Job succeeded.
pull-kubernetes-typecheck: Job succeeded.
pull-kubernetes-verify: Job succeeded.