
scheduler: internal data structure optimization #81068

Merged
2 commits merged into kubernetes:master from Huang-Wei:eps-structure-optimize on Aug 21, 2019

Conversation

@Huang-Wei (Member) commented Aug 7, 2019

What type of PR is this?

/kind feature

What this PR does / why we need it:

Optimize the data structure of topologyPairsPodSpreadMap (renamed in this PR to podSpreadCache) to reduce memory and runtime overhead, and to boost the performance of the EvenPodsSpread predicates and preemption.

Here is the new structure:

// podSpreadCache combines tpKeyToCriticalPaths and tpPairToMatchNum
// to represent:
// (1) critical paths where the least pods are matched on each spread constraint.
// (2) number of pods matched on each spread constraint.
type podSpreadCache struct {
	// We don't record all critical paths here.
	// If there is only one criticalPath, criticalPaths[0] always holds the minimum matchNum;
	// If there are multiple criticalPaths, keep arbitrary 2 of them holding the minimum matchNum.
	tpKeyToCriticalPaths map[string][2]*criticalPath
	// tpPairToMatchNum is keyed with topologyPair, and valued with the number of matching pods.
	tpPairToMatchNum map[topologyPair]int32
}
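
For reference, a minimal sketch of the element types assumed above (illustrative only; the names mirror the scheduler's metadata package, but treat the exact shapes as assumptions):

type topologyPair struct {
	key   string // topology key, e.g. "zone"
	value string // topology value, e.g. "zone1"
}

type criticalPath struct {
	// topologyValue is the topology value that maps to the minimum number of matching pods.
	topologyValue string
	// matchNum is the number of matching pods.
	matchNum int32
}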

CAVEAT: the reason this PR (in particular the [2]criticalPath structure) works is based on the implementation of the current preemption algorithm:
Fact 1: we only preempt pods on the same node, instead of pods on multiple nodes.
Fact 2: each node is evaluated on a separate copy of the metadata during its preemption cycle.
If we plan to turn to a more complex algorithm like "arbitrary pods on multiple nodes", this enhancement needs to be revisited.

Which issue(s) this PR fixes:

Part of #77284. Resolves comment1 and comment2.

Special notes for your reviewer:

The performance looks good (~6X improvement in time, ~4X in space). Let's focus the review on functionality.
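
(The benchmarks below were produced with the command recorded in the commit message: go test k8s.io/kubernetes/pkg/scheduler/algorithm/predicates -benchmem -run=^$ -bench .)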

Before

single-constraint-zone-8        100   16717361 ns/op  12650462 B/op  129741 allocs/op
single-constraint-node-8        100   17020951 ns/op  12674237 B/op  132412 allocs/op
two-constraints-zone-node-8      50   22534697 ns/op  16043751 B/op  203792 allocs/op

single-constraint-zone-8        100	  17250549 ns/op  12646159 B/op  129737 allocs/op
single-constraint-node-8        100	  14650029 ns/op  12672823 B/op  132416 allocs/op
two-constraints-zone-node-8      50	  21257976 ns/op  16043452 B/op  203793 allocs/op

single-constraint-zone-8        100	  14535822 ns/op  12651489 B/op  129740 allocs/op
single-constraint-node-8        100	  14415079 ns/op  12672163 B/op  132414 allocs/op
two-constraints-zone-node-8     100	  18504999 ns/op  16041346 B/op  203793 allocs/op

After

single-constraint-zone-8        500    2523942 ns/op  2796400 B/op   60027 allocs/op
single-constraint-node-8        500    2863317 ns/op  2984588 B/op   60051 allocs/op
two-constraints-zone-node-8     300	   5202418 ns/op  5753595 B/op  120058 allocs/op

single-constraint-zone-8        500	   2574097 ns/op  2795724 B/op   60027 allocs/op
single-constraint-node-8        500	   2824376 ns/op  2985523 B/op   60051 allocs/op
two-constraints-zone-node-8     300	   4965061 ns/op  5751401 B/op  120059 allocs/op

single-constraint-zone-8        500	   2488182 ns/op  2796486 B/op   60027 allocs/op
single-constraint-node-8        500	   2922101 ns/op  2988127 B/op   60051 allocs/op
two-constraints-zone-node-8     300	   5169896 ns/op  5751460 B/op  120059 allocs/op

Does this PR introduce a user-facing change?:

NONE

/sig scheduling
/assign @ahg-g @alculquicondor
cc/ @bsalamat @liu-cong @draveness @krmayankk

@Huang-Wei (Member Author) commented Aug 7, 2019

/hold
There is a data race.

Member left a comment

I stopped the review because I think this implementation doesn't work. Specifically, it's not enough to rely on 2 critical paths when we can have "almost" arbitrary removal/addition of pods (see the comments for details).

We have 2 options:

  • Implement a priority queue (maybe we can reuse one already implemented),
  • Do a linear search for the minimum each time we add a node; see the sketch after this comment. Maybe this is enough, because preemption is not a common case, but we should back this up with performance tests.

Additionally, some memory can be saved if we store a map from node to topology keys (instead of pod to topology keys).
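
A minimal sketch of the linear-search option, assuming the tpPairToMatchNum map from the PR description (hypothetical helper, not code from this PR; math.MaxInt32 requires the "math" import):

// minMatchNum re-derives the minimum matching count for one topology key
// by scanning every pair under that key. O(number of pairs), which may be
// acceptable since preemption is not the common case.
func minMatchNum(tpPairToMatchNum map[topologyPair]int32, tpKey string) int32 {
	min := int32(math.MaxInt32)
	for pair, num := range tpPairToMatchNum {
		if pair.key == tpKey && num < min {
			min = num
		}
	}
	return min
}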

pkg/scheduler/algorithm/predicates/metadata.go (2 outdated review threads, resolved)
topologyPairsPodSpreadMap := &topologyPairsPodSpreadMap{
	// topologyKeyToMinPodsMap will be initialized with proper size later.
	topologyPairsMaps: newTopologyPairsMaps(),
	// TODO(Huang-Wei): It might be possible to use "make(map[topologyPair]*int32)".

@alculquicondor (Member), Aug 7, 2019

We should consider implementing a well-defined data structure in our utilities library. Atomic map counters are a very common use case in scheduling algorithms. I'll give it a thought.

pkg/scheduler/algorithm/predicates/metadata.go (2 outdated review threads, resolved)
// is definitely greater than "paths[1].matchNum".
if v == paths[0].topologyValue {
	paths[0].matchNum = curMatch
	// paths[0].matchNum has increased by 1,

@alculquicondor (Member), Aug 7, 2019

In this case, at least 2 keys had the same count, say (4, 4). After adding, we get (5, 4), so we need to swap to (4, 5). This works if those were the only 2 keys with that count. If we had (4, 4, 4), then after adding, the state should be (4, 4, 5).

@Huang-Wei (Author, Member), Aug 7, 2019

In the preemption case, [2]criticalPath doesn't always hold the two minimum paths. We only need to guarantee that paths[0] always holds the minimum matchNum, and that is what our predicate algorithm relies on: comparing against the minimum match number.

So in your case, it's fine that at some point the state becomes [4, 5] (one of the original 4s gets lost). At any time, we must keep the minimum tracked correctly.

@Huang-Wei Huang-Wei force-pushed the Huang-Wei:eps-structure-optimize branch from a2d0499 to 87d5fad Aug 9, 2019
@Huang-Wei (Member Author) commented Aug 9, 2019

TestPreemptWithPermitPlugin flake +1: #81238

@Huang-Wei (Member Author) commented Aug 9, 2019

/retest

@Huang-Wei (Member Author) commented Aug 9, 2019

The most recent (3rd) commit mainly contains changes due to the renaming; it doesn't change the core logic/algorithm of this PR.

@Huang-Wei Huang-Wei force-pushed the Huang-Wei:eps-structure-optimize branch from 87d5fad to a907482 Aug 9, 2019
@Huang-Wei (Member Author) commented Aug 10, 2019

/retest

Member left a comment

As discussed, let's keep the 2-critical-paths implementation, documenting its caveats (in the PR description and code comments).

@@ -337,21 +347,16 @@ func newTopologyPairsMaps() *topologyPairsMaps {

func (m *topologyPairsMaps) addTopologyPair(pair topologyPair, pod *v1.Pod) {
	podFullName := schedutil.GetPodFullName(pod)
	m.addTopologyPairWithoutPods(pair)
	if m.topologyPairToPods[pair] == nil {

@alculquicondor (Member), Aug 12, 2019

Is this just reverting the initial changes to affinity structures?

@Huang-Wei (Author, Member), Aug 12, 2019

Correct :)

	// We don't record all critical paths here.
	// If there is only one criticalPath, criticalPaths[0] always holds the minimum matchNum;
	// If there are multiple criticalPaths, keep arbitrary 2 of them holding the minimum matchNum.
	tpKeyToCriticalPaths map[string][2]*criticalPath

@alculquicondor (Member), Aug 12, 2019

There are 3 implementations for updating this array. Let's make it a type:

type criticalPaths [2]*struct{ topologyValue string; matchNum int32 }

with only one function, func (*criticalPaths) update(topologyValue string, matchNum int32), and a constructor.
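
A sketch of the suggested type with its single update method (assumptions: value elements instead of pointers, and math.MaxInt32 as the "empty" sentinel; the PR's final code may differ in detail):

type criticalPaths [2]struct {
	topologyValue string
	matchNum      int32
}

func newCriticalPaths() *criticalPaths {
	// Both slots start at MaxInt32 so any real count becomes the new minimum.
	return &criticalPaths{{matchNum: math.MaxInt32}, {matchNum: math.MaxInt32}}
}

func (p *criticalPaths) update(tpVal string, num int32) {
	// Check whether tpVal is already tracked in one of the two slots.
	i := -1
	if tpVal == p[0].topologyValue {
		i = 0
	} else if tpVal == p[1].topologyValue {
		i = 1
	}

	if i >= 0 {
		// tpVal exists: refresh its count, then restore the invariant
		// that p[0] holds the minimum matchNum.
		p[i].matchNum = num
		if p[0].matchNum > p[1].matchNum {
			p[0], p[1] = p[1], p[0]
		}
	} else if num < p[0].matchNum {
		// tpVal is new and beats the current minimum: demote p[0] to p[1]
		// and install the new minimum in p[0].
		p[1] = p[0]
		p[0].topologyValue, p[0].matchNum = tpVal, num
	} else if num < p[1].matchNum {
		// tpVal is new and only beats the runner-up.
		p[1].topologyValue, p[1].matchNum = tpVal, num
	}
}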

@alculquicondor (Member), Aug 12, 2019

Also, we are choosing int32 here, whereas we are using int64 for the priorities

@Huang-Wei (Author, Member), Aug 12, 2019

To make them consistent, I'd like to change the int64 in priorities to int32. Any preference?

@alculquicondor (Member), Aug 12, 2019

Since this is counting pods, that should be fine.

}

// sortCriticalPaths is for testing purposes only.
func (c *podSpreadCache) sortCriticalPaths() {

@alculquicondor (Member), Aug 12, 2019

can we move this to metadata_test.go?

@Huang-Wei Huang-Wei force-pushed the Huang-Wei:eps-structure-optimize branch from a907482 to 6841314 Aug 12, 2019
@Huang-Wei (Member Author) commented Aug 12, 2019

/retest

1 similar comment
@Huang-Wei (Member Author) commented Aug 12, 2019

/retest

	tpPairToMatchNum: make(map[topologyPair]int32),
}
for tpKey, paths := range c.tpKeyToCriticalPaths {
-	copyPaths := [2]*criticalPath{
+	copyPaths := criticalPaths{

@alculquicondor (Member), Aug 13, 2019

nit: criticalPaths{paths[0], paths[1]}

@Huang-Wei (Author, Member), Aug 13, 2019

Neat! It's an array, so we can simply assign its value (as a copy) here.
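
A tiny illustration of the value semantics (assuming criticalPaths holds struct values rather than pointers):

var paths criticalPaths
copyPaths := criticalPaths{paths[0], paths[1]} // copies both elements by value
copyPaths[0].matchNum++                        // paths[0] is unaffected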

@@ -1635,7 +1635,7 @@ func TestPodSpreadCache_removePod(t *testing.T) {
	} else {
		deletedPod = tt.deletedPod
	}
-	podSpreadCache.removePod(deletedPod, tt.preemptor, tt.nodes[tt.nodeIdx])
+	podSpreadCache.updatePod(deletedPod, tt.preemptor, tt.nodes[tt.nodeIdx], -1)

@alculquicondor (Member), Aug 13, 2019

Not a big fan of the bare numbers. Maybe just add addPod and removePod helpers that call updatePod? But this is a nit.
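
A sketch of the suggested wrappers (hypothetical, mirroring the updatePod signature used in the test above):

func (c *podSpreadCache) addPod(addedPod, preemptorPod *v1.Pod, node *v1.Node) {
	c.updatePod(addedPod, preemptorPod, node, 1)
}

func (c *podSpreadCache) removePod(deletedPod, preemptorPod *v1.Pod, node *v1.Node) {
	c.updatePod(deletedPod, preemptorPod, node, -1)
}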

@Huang-Wei (Author, Member), Aug 13, 2019

Updated.

@alculquicondor (Member) commented Aug 13, 2019

/lgtm
unless you want to implement the nits

@k8s-ci-robot k8s-ci-robot added the lgtm label Aug 13, 2019
@Huang-Wei Huang-Wei force-pushed the Huang-Wei:eps-structure-optimize branch from 6841314 to 3056b2c Aug 13, 2019
@Huang-Wei (Member Author) commented Aug 13, 2019

/hold

@ahg-g wants to do a quick review.

Member left a comment

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label Aug 13, 2019
@Huang-Wei (Member Author) commented Aug 13, 2019

/retest

Member left a comment

Thanks! This is clearly more efficient and easier to reason about than the original implementation.

type podSpreadCache struct {
	// We don't record all critical paths here.
	// If there is only one criticalPath, criticalPaths[0] always holds the minimum matchNum;
	// If there are multiple criticalPaths, keep arbitrary 2 of them holding the minimum matchNum.

@ahg-g (Member), Aug 19, 2019

This comment is a little confusing; to me it reads as if we keep two values only when we have more than two with the same minimum matchNum, which is not the case. I would rephrase this whole three-line comment as: we keep track of the two topology values with the minimum matchNum.

@Huang-Wei (Author, Member), Aug 19, 2019

That sounds too short... How about:

	// We record 2 critical paths instead of all critical paths here.
	// criticalPaths[0].matchNum always holds the minimum matching number.
	// criticalPaths[1].matchNum is always greater or equal to criticalPaths[0].matchNum,
	// but it's not guaranteed to be the 2nd minimum match number.
} else {
	// `tpVal` doesn't exist
	if num < paths[0].matchNum {
		// update paths[1] with paths[1]

@ahg-g (Member), Aug 19, 2019

with paths[0]

@Huang-Wei (Author, Member), Aug 19, 2019

Done.

j := (i + 1) % 2 // the other index
// adjust i and j to make sure paths[0] always holds the minimum match num.
if (i < j && paths[i].matchNum > paths[j].matchNum) ||
	(i > j && paths[i].matchNum < paths[j].matchNum) {

@ahg-g (Member), Aug 19, 2019

Can this condition be simplified to: if paths[0].matchNum > paths[1].matchNum?

@Huang-Wei (Author, Member), Aug 19, 2019

It looks good.

Another (fancy but perhaps hard to read) way I thought of was: (i < j) == (paths[i].matchNum > paths[j].matchNum).

@alculquicondor (Member), Aug 19, 2019

It looks much better now :)

// CAVEAT: the reason that `[2]criticalPath` can work is based on the implementation of current
// preemption algorithm - we only preempt pods on the same node, instead of pods on multiple nodes.
// If we plan to turn to a more complex algorithm like "arbitrary pods on multiple nodes", this
// structure needs to be revisited.

@ahg-g (Member), Aug 19, 2019

We also need to clarify that this works because each node is evaluated on a separate copy of the metadata.

@Huang-Wei (Author, Member), Aug 19, 2019

SG.

@Huang-Wei Huang-Wei force-pushed the Huang-Wei:eps-structure-optimize branch from 3056b2c to 1f78a6c Aug 19, 2019
@k8s-ci-robot k8s-ci-robot removed the lgtm label Aug 19, 2019
@Huang-Wei (Member Author) commented Aug 19, 2019

@ahg-g Comments addressed. PTAL.

@Huang-Wei (Member Author) commented Aug 19, 2019

/retest

@ahg-g (Member) commented Aug 20, 2019

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm label Aug 20, 2019
@k8s-ci-robot (Contributor) commented Aug 20, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, Huang-Wei

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Huang-Wei added 2 commits Aug 7, 2019
- go test k8s.io/kubernetes/pkg/scheduler/algorithm/predicates -benchmem -run=^$ -bench .
- Rename 'topologyPairsPodSpreadMap' to 'podSpreadCache'
- New type `criticalPaths`
- Add unified method `*criticalPaths.update()` for:
    - regular update
    - addPod in preemption case
    - removePod in preemption case
@Huang-Wei Huang-Wei force-pushed the Huang-Wei:eps-structure-optimize branch from 1f78a6c to 8f559ea Aug 20, 2019
@k8s-ci-robot k8s-ci-robot removed the lgtm label Aug 20, 2019
@Huang-Wei (Member Author) commented Aug 20, 2019

@ahg-g @alculquicondor Commits squashed. Please help re-apply the lgtm label.

Thanks for your efforts reviewing this!

@Huang-Wei (Member Author) commented Aug 20, 2019

/hold cancel

@ahg-g (Member) commented Aug 20, 2019

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label Aug 20, 2019
@Huang-Wei (Member Author) commented Aug 20, 2019

/retest

1 similar comment
@Huang-Wei (Member Author) commented Aug 20, 2019

/retest

@k8s-ci-robot k8s-ci-robot merged commit e7ecb22 into kubernetes:master Aug 21, 2019
23 checks passed

cla/linuxfoundation Huang-Wei authorized
pull-kubernetes-bazel-build Job succeeded.
pull-kubernetes-bazel-test Job succeeded.
pull-kubernetes-conformance-image-test Skipped.
pull-kubernetes-cross Skipped.
pull-kubernetes-dependencies Job succeeded.
pull-kubernetes-e2e-gce Job succeeded.
pull-kubernetes-e2e-gce-100-performance Job succeeded.
pull-kubernetes-e2e-gce-csi-serial Skipped.
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
pull-kubernetes-e2e-gce-iscsi Skipped.
pull-kubernetes-e2e-gce-iscsi-serial Skipped.
pull-kubernetes-e2e-gce-storage-slow Skipped.
pull-kubernetes-godeps Skipped.
pull-kubernetes-integration Job succeeded.
pull-kubernetes-kubemark-e2e-gce-big Job succeeded.
pull-kubernetes-local-e2e Skipped.
pull-kubernetes-node-e2e Job succeeded.
pull-kubernetes-node-e2e-containerd Job succeeded.
pull-kubernetes-typecheck Job succeeded.
pull-kubernetes-verify Job succeeded.
pull-publishing-bot-validate Skipped.
tide In merge pool.
@k8s-ci-robot k8s-ci-robot added this to the v1.16 milestone Aug 21, 2019
@Huang-Wei Huang-Wei deleted the Huang-Wei:eps-structure-optimize branch Aug 21, 2019