
scheduler: internal data structure optimization #81068

Merged: 2 commits merged into kubernetes:master from Huang-Wei:eps-structure-optimize on Aug 21, 2019

Conversation

@Huang-Wei (Member) commented Aug 7, 2019:

What type of PR is this?

/kind feature

What this PR does / why we need it:

Optimize the data structure of `podSpreadCache` (renamed from `topologyPairsPodSpreadMap`) to reduce memory and runtime overhead, and to boost the performance of the EvenPodsSpread predicate and preemption logic.

Here is the new structure:

// podSpreadCache combines tpKeyToCriticalPaths and tpPairToMatchNum
// to represent:
// (1) critical paths where the least pods are matched on each spread constraint.
// (2) number of pods matched on each spread constraint.
type podSpreadCache struct {
	// We don't record all critical paths here.
	// If there is only one criticalPath, criticalPaths[0] always holds the minimum matchNum;
	// If there are multiple criticalPaths, keep arbitrary 2 of them holding the minimum matchNum.
	tpKeyToCriticalPaths map[string][2]*criticalPath
	// tpPairToMatchNum is keyed with topologyPair, and valued with the number of matching pods.
	tpPairToMatchNum map[topologyPair]int32
}

CAVEAT: the reason this PR (in particular the [2]criticalPath structure) works is based on the implementation of the current preemption algorithm:
Fact 1: we only preempt pods on the same node, instead of pods on multiple nodes.
Fact 2: each node is evaluated on a separate copy of the metadata during its preemption cycle.
If we plan to turn to a more complex algorithm like "arbitrary pods on multiple nodes", this enhancement needs to be revisited.
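
To illustrate how the two maps are meant to work together, here is a minimal, hedged sketch of a predicate-style check against the cache. `satisfySpreadConstraint`, its parameters, and the `topologyPair` field names are illustrative assumptions, not the merged code:

// Sketch only: assumes topologyPair{key, value} and the podSpreadCache above.
// selfMatch is 1 if the incoming pod matches the constraint's own selector, else 0.
func satisfySpreadConstraint(c *podSpreadCache, tpKey, tpVal string, selfMatch, maxSkew int32) bool {
	paths, ok := c.tpKeyToCriticalPaths[tpKey]
	if !ok {
		return true // no pods matched on this constraint yet
	}
	minMatch := paths[0].matchNum // invariant: paths[0] holds the global minimum
	matchNum := c.tpPairToMatchNum[topologyPair{key: tpKey, value: tpVal}]
	// Placing the pod here raises this pair's count to matchNum+selfMatch;
	// the skew against the minimum must stay within maxSkew.
	return matchNum+selfMatch-minMatch <= maxSkew
}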

Which issue(s) this PR fixes:

Part of #77284. Resolves comment1 and comment2.

Special notes for your reviewer:

The performance looks good (~6X in time, ~4X in space). Let's focus the review on functionality.

Before (3 runs):

single-constraint-zone-8        100   16717361 ns/op  12650462 B/op  129741 allocs/op
single-constraint-node-8        100   17020951 ns/op  12674237 B/op  132412 allocs/op
two-constraints-zone-node-8      50   22534697 ns/op  16043751 B/op  203792 allocs/op

single-constraint-zone-8        100   17250549 ns/op  12646159 B/op  129737 allocs/op
single-constraint-node-8        100   14650029 ns/op  12672823 B/op  132416 allocs/op
two-constraints-zone-node-8      50   21257976 ns/op  16043452 B/op  203793 allocs/op

single-constraint-zone-8        100   14535822 ns/op  12651489 B/op  129740 allocs/op
single-constraint-node-8        100   14415079 ns/op  12672163 B/op  132414 allocs/op
two-constraints-zone-node-8     100   18504999 ns/op  16041346 B/op  203793 allocs/op

After (3 runs):

single-constraint-zone-8        500    2523942 ns/op   2796400 B/op   60027 allocs/op
single-constraint-node-8        500    2863317 ns/op   2984588 B/op   60051 allocs/op
two-constraints-zone-node-8     300    5202418 ns/op   5753595 B/op  120058 allocs/op

single-constraint-zone-8        500    2574097 ns/op   2795724 B/op   60027 allocs/op
single-constraint-node-8        500    2824376 ns/op   2985523 B/op   60051 allocs/op
two-constraints-zone-node-8     300    4965061 ns/op   5751401 B/op  120059 allocs/op

single-constraint-zone-8        500    2488182 ns/op   2796486 B/op   60027 allocs/op
single-constraint-node-8        500    2922101 ns/op   2988127 B/op   60051 allocs/op
two-constraints-zone-node-8     300    5169896 ns/op   5751460 B/op  120059 allocs/op

Does this PR introduce a user-facing change?:

NONE

/sig scheduling
/assign @ahg-g @alculquicondor
cc/ @bsalamat @liu-cong @draveness @krmayankk

@k8s-ci-robot added labels on Aug 7, 2019: release-note-none, kind/feature, size/XXL, sig/scheduling, needs-priority, cncf-cla: yes, approved.
@Huang-Wei (Member, Author) commented:

/hold
There is a data race.

@k8s-ci-robot added the do-not-merge/hold label on Aug 7, 2019.
@alculquicondor (Member) left a review:

I stopped the review because I think this implementation doesn't work. Specifically, it's not enough to rely on 2 critical paths when we can have "almost" arbitrary removal/addition of pods (see the comments for details).

We have 2 options:

  • Implement a priority queue (maybe we can reuse one already implemented).
  • Do a linear search for the minimum each time we add a node. This may be enough, because preemption is not a common case, but we should back it with performance tests.

Additionally, some memory can be saved if we store a map from node to topology keys (instead of pod to topologyKeys).

pkg/scheduler/algorithm/predicates/metadata.go (review threads outdated, resolved)
topologyPairsPodSpreadMap := &topologyPairsPodSpreadMap{
	// topologyKeyToMinPodsMap will be initialized with proper size later.
	topologyPairsMaps: newTopologyPairsMaps(),
	// TODO(Huang-Wei): It might be possible to use "make(map[topologyPair]*int32)".
@alculquicondor (Member) commented:

We should consider implementing a well-defined data structure in our utilities library. Atomic map counters are a very common use case when implementing scheduling algorithms. I'll give it a thought.

// is definitely greater than "paths[1].matchNum".
if v == paths[0].topologyValue {
	paths[0].matchNum = curMatch
	// paths[0].matchNum has increased by 1,
@alculquicondor (Member) commented:

In this case, at least 2 keys had the same count, say (4, 4). After adding, we get (5, 4), so we need to swap to (4, 5). This works if those were the only 2 keys with that count. If we had (4, 4, 4), then after adding, the state should be (4, 4, 5).

@Huang-Wei (Member, Author) replied:

`[2]criticalPath` in the preemption case doesn't always hold the minimum two paths. We only need to guarantee that paths[0] always holds the minimum matchNum, and that is also what our predicate algorithm relies on: comparing against minMatch.

So in your case, it's fine that at some point it becomes [4, 5] (one of the original 4s gets lost). At any time, we must keep minMatch tracked correctly.
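
To make the reply concrete, a hypothetical trace (counts invented for illustration):

// Constraint has three topology values with counts: a=4, b=4, c=4.
// Tracked pair: [{a, 4}, {b, 4}] (c is untracked; there are only 2 slots).
//
// A pod lands on "a": update("a", 5) -> swap -> [{b, 4}, {a, 5}].
// The pair now reads [4, 5] even though c still has 4 matching pods,
// but paths[0].matchNum == 4 remains the true minimum, which is the
// only invariant the predicate (compare against minMatch) depends on.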

@Huang-Wei (Member, Author) commented:

TestPreemptWithPermitPlugin flake +1: #81238

@Huang-Wei (Member, Author) commented:

/retest

@Huang-Wei (Member, Author) commented:

Recent (3rd) commit basically involves a number of changes due to the renaming, but it doesn't change the core logic/algorithm of this PR.

@Huang-Wei (Member, Author) commented:

/retest

@alculquicondor (Member) left a review:

As discussed, let's keep the 2-critical-paths implementation, documenting its caveats (in the PR description and code comments).

@@ -337,21 +347,16 @@ func newTopologyPairsMaps() *topologyPairsMaps {

func (m *topologyPairsMaps) addTopologyPair(pair topologyPair, pod *v1.Pod) {
	podFullName := schedutil.GetPodFullName(pod)
	m.addTopologyPairWithoutPods(pair)
	if m.topologyPairToPods[pair] == nil {
@alculquicondor (Member) commented:
Is this just reverting the initial changes to affinity structures?

@Huang-Wei (Member, Author) replied:
Correct :)

	// We don't record all critical paths here.
	// If there is only one criticalPath, criticalPaths[0] always holds the minimum matchNum;
	// If there are multiple criticalPaths, keep arbitrary 2 of them holding the minimum matchNum.
	tpKeyToCriticalPaths map[string][2]*criticalPath
@alculquicondor (Member) commented:
There are 3 implementations for updating this array. Let's make it a type:

type criticalPaths [2]*struct{ topologyValue string; matchNum int32 }

with only one function, func (*criticalPaths) update(topologyValue string, matchNum int32), and a constructor.
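
For reference, a minimal sketch of such a type following this suggestion (illustrative, not necessarily the merged code; value elements are used instead of pointers for simplicity, and the package name is assumed from the file path under review):

package predicates

import "math"

type criticalPath struct {
	topologyValue string
	matchNum      int32
}

// Invariant: paths[0] always holds the minimum matchNum seen so far;
// paths[1].matchNum >= paths[0].matchNum, but it is not guaranteed
// to be the second-smallest value.
type criticalPaths [2]criticalPath

func newCriticalPaths() *criticalPaths {
	return &criticalPaths{{matchNum: math.MaxInt32}, {matchNum: math.MaxInt32}}
}

func (paths *criticalPaths) update(tpVal string, num int32) {
	// check whether `tpVal` is already tracked
	i := -1
	if tpVal == paths[0].topologyValue {
		i = 0
	} else if tpVal == paths[1].topologyValue {
		i = 1
	}

	if i >= 0 {
		// `tpVal` exists: write the new count, then restore the ordering invariant.
		paths[i].matchNum = num
		if paths[0].matchNum > paths[1].matchNum {
			paths[0], paths[1] = paths[1], paths[0]
		}
	} else {
		// `tpVal` is not tracked: it may displace one of the two slots.
		if num < paths[0].matchNum {
			paths[1] = paths[0]
			paths[0] = criticalPath{topologyValue: tpVal, matchNum: num}
		} else if num < paths[1].matchNum {
			paths[1] = criticalPath{topologyValue: tpVal, matchNum: num}
		}
	}
}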

@alculquicondor (Member) commented:
Also, we are choosing int32 here, whereas we are using int64 for the priorities

@Huang-Wei (Member, Author) replied:
To make them consistent, I'd like to change the int64 in priorities to int32. Any preference?

@alculquicondor (Member) replied:
Since this is counting pods, that should be fine.

}

// sortCriticalPaths is for testing purposes only.
func (c *podSpreadCache) sortCriticalPaths() {
@alculquicondor (Member) commented:
Can we move this to metadata_test.go?

@Huang-Wei (Member, Author) commented:

/retest

1 similar comment
@Huang-Wei (Member, Author) commented:

/retest

	tpPairToMatchNum: make(map[topologyPair]int32),
}
for tpKey, paths := range c.tpKeyToCriticalPaths {
-	copyPaths := [2]*criticalPath{
+	copyPaths := criticalPaths{
@alculquicondor (Member) commented:
nit: criticalPaths{paths[0], paths[1]}

@Huang-Wei (Member, Author) replied:
Neat! It's an array, so we can simply put its value (as a copy) here.
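
A tiny standalone illustration of the Go semantics being relied on here (example code, not from the PR): constructing an array from another array's elements copies them by value, so mutating the copy leaves the original intact.

package main

import "fmt"

type criticalPath struct {
	topologyValue string
	matchNum      int32
}
type criticalPaths [2]criticalPath

func main() {
	orig := criticalPaths{{"zone-a", 4}, {"zone-b", 5}}
	cp := criticalPaths{orig[0], orig[1]} // array elements are copied by value
	cp[0].matchNum = 99

	fmt.Println(orig[0].matchNum) // 4: the original is untouched
	fmt.Println(cp[0].matchNum)   // 99
}

Note this holds because the array elements are structs; with `[2]*criticalPath` the copy would share the pointed-to values.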

@@ -1635,7 +1635,7 @@ func TestPodSpreadCache_removePod(t *testing.T) {
	} else {
		deletedPod = tt.deletedPod
	}
-	podSpreadCache.removePod(deletedPod, tt.preemptor, tt.nodes[tt.nodeIdx])
+	podSpreadCache.updatePod(deletedPod, tt.preemptor, tt.nodes[tt.nodeIdx], -1)
@alculquicondor (Member) commented:
Not a big fan of the bare numbers. Maybe just add addPod and removePod that call updatePod? But this is a nit.

@Huang-Wei (Member, Author) replied:
Updated.
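
The wrappers the nit asks for are straightforward; a hedged sketch (a fragment; signatures inferred from the test diff above, assuming v1 = k8s.io/api/core/v1 is in scope):

func (c *podSpreadCache) addPod(addedPod, preemptorPod *v1.Pod, node *v1.Node) {
	c.updatePod(addedPod, preemptorPod, node, 1)
}

func (c *podSpreadCache) removePod(deletedPod, preemptorPod *v1.Pod, node *v1.Node) {
	c.updatePod(deletedPod, preemptorPod, node, -1)
}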

@alculquicondor (Member) commented:

/lgtm
unless you want to implement the nits

@k8s-ci-robot added the lgtm label on Aug 13, 2019.
@Huang-Wei (Member, Author) commented:

/hold cancel

Comments addressed, and commits (except for the benchmark one) have been squashed.

@alculquicondor PTAL.

@k8s-ci-robot removed the do-not-merge/hold label on Aug 13, 2019.
@Huang-Wei (Member, Author) commented:

/hold

@ahg-g wants to do a quick review.

@k8s-ci-robot added the do-not-merge/hold label on Aug 13, 2019.
@alculquicondor (Member) left a review:

/lgtm

@k8s-ci-robot added the lgtm label on Aug 13, 2019.
@Huang-Wei (Member, Author) commented:

/retest

@ahg-g (Member) left a review:

Thanks! This is clearly more efficient and easier to reason about compared to the original implementation.

type podSpreadCache struct {
	// We don't record all critical paths here.
	// If there is only one criticalPath, criticalPaths[0] always holds the minimum matchNum;
	// If there are multiple criticalPaths, keep arbitrary 2 of them holding the minimum matchNum.
@ahg-g (Member) commented:
This comment is a little confusing; to me it reads as if we keep two values only when we have more than two with the same minimum matchNum, which is not the case. I would re-phrase this whole three-line comment as: we keep track of the two topology values with the minimum matchNum.

@Huang-Wei (Member, Author) replied:
That sounds too short... How about:

	// We record 2 critical paths instead of all critical paths here.
	// criticalPaths[0].matchNum always holds the minimum matching number.
	// criticalPaths[1].matchNum is always greater than or equal to criticalPaths[0].matchNum,
	// but it's not guaranteed to be the 2nd minimum match number.

} else {
	// `tpVal` doesn't exist
	if num < paths[0].matchNum {
		// update paths[1] with paths[1]
@ahg-g (Member) commented:
with paths[0]

@Huang-Wei (Member, Author) replied:
Done.

j := (i + 1) % 2 // the other index
// adjust i and j to make sure paths[0] always holds the minimum match num.
if (i < j && paths[i].matchNum > paths[j].matchNum) ||
	(i > j && paths[i].matchNum < paths[j].matchNum) {
@ahg-g (Member) commented:
Can this condition be simplified to: if paths[0].matchNum > paths[1].matchNum?

@Huang-Wei (Member, Author) replied Aug 19, 2019:

It looks good.

Another (fancy but maybe hard to read) way I thought of was: (i < j) == (paths[i].matchNum > paths[j].matchNum).

@ahg-g (Member) replied:
It looks much better now :)
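
For context, the simplified form agreed on here reduces the index juggling to a single swap that restores the invariant (a sketch; surrounding code elided):

// after writing the new matchNum into whichever slot changed:
if paths[0].matchNum > paths[1].matchNum {
	paths[0], paths[1] = paths[1], paths[0]
}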

// CAVEAT: the reason that `[2]criticalPath` can work is based on the implementation of current
// preemption algorithm - we only preempt pods on the same node, instead of pods on multiple nodes.
// If we plan to turn to a more complex algorithm like "arbitrary pods on multiple nodes", this
// structure needs to be revisited.
@ahg-g (Member) commented:
We also need to clarify that this works because each node is evaluated on a separate copy of the metadata.

@Huang-Wei (Member, Author) replied:
SG.

@k8s-ci-robot removed the lgtm label on Aug 19, 2019.
@Huang-Wei (Member, Author) commented:

@ahg-g Comments addressed. PTAL.

@Huang-Wei (Member, Author) commented:

/retest

@ahg-g (Member) commented Aug 20, 2019:

/lgtm
/approve

@k8s-ci-robot added the lgtm label on Aug 20, 2019.
@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, Huang-Wei

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Commits:

- go test k8s.io/kubernetes/pkg/scheduler/algorithm/predicates -benchmem -run=^$ -bench .
- Rename 'topologyPairsPodSpreadMap' to 'podSpreadCache'
- New struct `criticalPaths`
- Add unified method `*criticalPaths.update()` for:
    - regular update
    - addPod in preemption case
    - removePod in preemption case
@k8s-ci-robot removed the lgtm label on Aug 20, 2019.
@Huang-Wei (Member, Author) commented:

@ahg-g @alculquicondor Commits squashed. Please help to relabel /lgtm.

Thanks for the effort reviewing this!

@Huang-Wei (Member, Author) commented:

/hold cancel

@k8s-ci-robot removed the do-not-merge/hold label on Aug 20, 2019.
@ahg-g (Member) commented Aug 20, 2019:

/lgtm

@k8s-ci-robot added the lgtm label on Aug 20, 2019.
@Huang-Wei (Member, Author) commented:

/retest

1 similar comment
@Huang-Wei (Member, Author) commented:

/retest

@k8s-ci-robot merged commit e7ecb22 into kubernetes:master on Aug 21, 2019.
@k8s-ci-robot added this to the v1.16 milestone on Aug 21, 2019.
@Huang-Wei deleted the eps-structure-optimize branch on August 21, 2019 at 04:30.
Labels
approved, cncf-cla: yes, kind/feature, lgtm, needs-priority, release-note-none, sig/scheduling, size/XXL