
Optimize required pod affinity #86046

Merged

merged 1 commit into kubernetes:master from ahg-g:ahg1-affinity on Dec 10, 2019

Conversation

@ahg-g ahg-g (Member) commented Dec 9, 2019

What type of PR is this?

/kind feature

What this PR does / why we need it:
This is the second PR in optimizing required pod affinity. This PR converts the data structure used to calculate pod affinity from a map of topology-to-list-of-pods to a map of topology-to-pod-counts. This significantly reduces the overhead of creating the map without compromising the predicate performance.

Basically, for each topology, instead of tracking the exact set of existing pods, we just track their count.
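
For illustration, a minimal sketch of this change is below (the update helper and the struct literal for topologyPair are illustrative, not taken from the PR; the real type names appear in the review excerpts further down):

type topologyPair struct{ key, value string }

// Before: every matching existing pod was tracked per topology pair, e.g.
//   topologyPairToPods map[topologyPair]podSet
// After: only the number of matching pods is tracked per topology pair.
type topologyToMatchedTermCount map[topologyPair]int64

// update adjusts the count for a pair; value may be negative when pods are
// removed, and the entry is dropped once the count reaches zero.
func (m topologyToMatchedTermCount) update(pair topologyPair, value int64) {
  m[pair] += value
  if m[pair] == 0 {
    delete(m, pair)
  }
}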

This PR builds on #86030. It offers up to 2.3x improvement over #86030. The two PRs combined offer up to 3.7x improvement.

Before

BenchmarkSchedulingPodAntiAffinity/5000Nodes/1000Pods-12           1000  19391988  ns/op
BenchmarkSchedulingPodAffinity/5000Nodes/5000Pods-12               1000  29507122  ns/op

After #86030

BenchmarkSchedulingPodAntiAffinity/5000Nodes/1000Pods-12           1000  12974929  ns/op
BenchmarkSchedulingPodAffinity/5000Nodes/5000Pods-12               1000  18056693  ns/op

After this PR

BenchmarkSchedulingPodAntiAffinity/5000Nodes/1000Pods-12           1000  8942505   ns/op
BenchmarkSchedulingPodAffinity/5000Nodes/5000Pods-12               1000  7800239   ns/op
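
(For reference, the headline ratios follow from the PodAffinity benchmark: 18056693 / 7800239 ≈ 2.3x over #86030, and 29507122 / 7800239 gives the combined improvement over the original baseline; the anti-affinity benchmark improves by roughly 1.5x and 2.2x, respectively.)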

Does this PR introduce a user-facing change?:

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/feature Categorizes issue or PR as related to a new feature. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Dec 9, 2019
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 9, 2019
@ahg-g ahg-g (Member Author) commented Dec 9, 2019

/cc @Huang-Wei @alculquicondor

@alculquicondor alculquicondor (Member) commented Dec 9, 2019

LOL, I just commented that we should do this in the previous PR. Is it worth having 2 PRs?

@ahg-g ahg-g (Member Author) commented Dec 9, 2019

> LOL, I just commented that we should do this in the previous PR. Is it worth having 2 PRs?

It is up to you (the reviewers). Do you think this one alone is manageable?

@ahg-g ahg-g (Member Author) commented Dec 9, 2019

> LOL, I just commented that we should do this in the previous PR. Is it worth having 2 PRs?
>
> It is up to you (the reviewers). Do you think this one alone is manageable?

@Huang-Wei do you want to just look at this PR and close the other one?

@Huang-Wei Huang-Wei (Member) left a comment

Thanks @ahg-g ! LGTM generally, just some nits.

pkg/scheduler/algorithm/predicates/metadata.go (outdated review thread, resolved)
topologyPairToPods map[topologyPair]podSet
podToTopologyPairs map[string]topologyPairSet
}
type topologyToMatchedTermCount map[topologyPair]int64
Member

Can we use map[topologyPair]*int64 so that in the initialization of the metadata, we can concurrently manipulate the value without locking (appendResult()).

(can be a followup PR maybe)

Member

Nvm; if possible, I believe it can be a followup, along with this comment:

// TODO(Huang-Wei): It might be possible to use "make(map[topologyPair]*int32)".
// In that case, need to consider how to init each tpPairToCount[pair] in an atomic fashion.

@ahg-g ahg-g (Member Author) Dec 9, 2019

Yeah, I thought about that. I did a mutex contention profile, and this can potentially bring ~10% improvement. We can do that in a followup PR.

Member

+1 for isolating in a separate PR. There might be other forms of locking that we could consider too. Or using a channel.

Member Author

To initialize, you can do something like this:

if topologyToMatchedTermCount[pair] == nil {
  mutex.Lock()
  // We have to check again: by the time we get the lock, another thread might have already initialized the entry.
  if topologyToMatchedTermCount[pair] == nil {
    topologyToMatchedTermCount[pair] = new(int64)
  }
  mutex.Unlock()
}

Member

I'm not so sure that's thread safe. But let's leave the discussion for another PR :)

Member Author

I guess we will need to use an RLock:

mutex.RLock()
ptr := topologyToMatchedTermCount[pair]
mutex.RUnlock()
if ptr == nil {
  mutex.Lock()
  // We have to check again: by the time we get the lock, another thread might have already initialized the entry.
  if topologyToMatchedTermCount[pair] == nil {
    topologyToMatchedTermCount[pair] = new(int64)
  }
  ptr = topologyToMatchedTermCount[pair]
  mutex.Unlock()
}

atomic.AddInt64(ptr, value)
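
For reference, a self-contained sketch of what this could look like with map[topologyPair]*int64 is below (type and function names are illustrative; the merged PR kept the plain int64 map guarded by a global mutex and deferred this optimization to a followup):

package example

import (
  "sync"
  "sync/atomic"
)

type topologyPair struct{ key, value string }

// matchedTermCounts holds per-topology-pair counters. The RWMutex guards only
// access to the map itself; the counters are updated with atomic adds, so
// concurrent workers contend on the write lock only the first time a pair is
// seen.
type matchedTermCounts struct {
  mu     sync.RWMutex
  counts map[topologyPair]*int64
}

func newMatchedTermCounts() *matchedTermCounts {
  return &matchedTermCounts{counts: make(map[topologyPair]*int64)}
}

// add increments (or, with a negative value, decrements) the counter for pair.
func (c *matchedTermCounts) add(pair topologyPair, value int64) {
  c.mu.RLock()
  ptr := c.counts[pair]
  c.mu.RUnlock()
  if ptr == nil {
    c.mu.Lock()
    // Re-check under the write lock: another goroutine may have initialized
    // the entry between our RUnlock and Lock.
    if c.counts[pair] == nil {
      c.counts[pair] = new(int64)
    }
    ptr = c.counts[pair]
    c.mu.Unlock()
  }
  atomic.AddInt64(ptr, value)
}

With this layout, the parallel per-node workers only take the write lock the first time a topology pair is seen; all subsequent updates are lock-free atomic adds.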

pkg/scheduler/algorithm/predicates/metadata.go (two outdated review threads, resolved)
pkg/scheduler/algorithm/predicates/predicates.go (two outdated review threads, resolved)
@alculquicondor alculquicondor (Member) left a comment

Is there another PR coming to cache the affinity terms of the incoming pod?

m[pair] += value
// value could be a negative value, hence we delete the entry if
// the entry is down to zero.
if m[pair] == 0 {
  delete(m, pair)
}
Member

Is it actually worth deleting? We might end up re-adding it with the preemption algorithm.

Member

Same here: checking if m[pair] == 0 works both for a non-existing entry and for an existing entry whose value is 0.

  • If we want to manually delete the entry, we'd better keep the logic consistent and use if _, ok := m[pair]; ok when checking for existence.
  • If we don't delete the entry, it probably increases the memory footprint a bit, but we save time on checking and deletion; in that case, checking if m[pair] == 0 should be used to check existence.

Member Author

Since the previous logic relied on the existence of the entry rather than on whether its value is zero, I opted to do this to avoid potential bugs. It also made it easy to pass the tests that do a DeepEqual comparison. We can clean this up in a followup PR that converts the type to *int64 so that we can do atomic adds instead of using a global mutex.

pkg/scheduler/algorithm/predicates/metadata.go (six outdated review threads, resolved)
@Huang-Wei (Member)

> @Huang-Wei do you want to just look at this PR and close the other one?

Yes, the volume of code changes is manageable in one PR, but feel free to keep 2 commits in this PR.

@ahg-g ahg-g (Member Author) commented Dec 9, 2019

> Is there another PR coming to cache the affinity terms of the incoming pod?

I guess you are referring to the preemption logic? Yes, we can cache that in the metadata, but I thought we'd just wait for PodInfo.

@alculquicondor alculquicondor (Member) left a comment

/lgtm

/hold

for squash

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Dec 9, 2019
@ahg-g ahg-g (Member Author) commented Dec 9, 2019

> /lgtm
>
> /hold
>
> for squash

Thanks, @Huang-Wei please let me know if I can squash.

@Huang-Wei (Member)

@ahg-g LGTM. Please go squashing.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 9, 2019
@alculquicondor (Member)

/lgtm
/hold cancel

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Dec 9, 2019
@k8s-ci-robot k8s-ci-robot merged commit 9bf52c2 into kubernetes:master Dec 10, 2019
@k8s-ci-robot k8s-ci-robot added this to the v1.18 milestone Dec 10, 2019
@ahg-g ahg-g deleted the ahg1-affinity branch January 10, 2020 15:38
@ahg-g ahg-g changed the title from "Optimize required pod affinity (2)" to "Optimize required pod affinity" on Mar 3, 2020