
Make node tree order part of the snapshot #84014

Merged

merged 1 commit into kubernetes:master on Oct 18, 2019

Conversation

ahg-g
Member

@ahg-g ahg-g commented Oct 16, 2019

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

Makes the scheduler's nodeTree object private to the cache and part of the snapshot. The purpose of NodeTree is to decide the iteration order of the nodes (for example, forcing iteration across zones when evaluating nodes). Currently the scheduler iterates over the live tree instead of the snapshot, which has two drawbacks:

  1. iterating over the scheduler cache's nodeTree requires acquiring a lock, which causes contention (see "NodeTree.Next() in scheduler is really expensive" #72408)
  2. it causes inconsistencies in clusters that bring up and shut down nodes frequently (e.g. with cluster autoscaling)

This PR makes the order part of the snapshot, which improves performance (by about 3%) since no locking is required, and removes the inconsistencies since the scheduler now iterates over a static list of nodes.
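
To illustrate the approach (the types and names below are a simplified sketch, not the actual scheduler code): the cache builds an ordered NodeInfoList once per cycle while holding its lock, and the scheduling cycle then iterates that list lock-free.

package main

import (
    "fmt"
    "sync"
)

// nodeInfo stands in for the scheduler's NodeInfo type.
type nodeInfo struct {
    name string
}

// cache stands in for schedulerCache: the live node tree is guarded by a mutex.
type cache struct {
    mu        sync.Mutex
    treeOrder []string             // iteration order decided by the node tree
    nodes     map[string]*nodeInfo // live node info, keyed by node name
}

// snapshot holds both the map and an ordered list, so a scheduling cycle can
// iterate by index without touching the cache lock.
type snapshot struct {
    NodeInfoMap  map[string]*nodeInfo
    NodeInfoList []*nodeInfo
}

// updateSnapshot copies the tree order into the snapshot once, under the lock;
// the rest of the cycle then works on a static, consistent list.
func (c *cache) updateSnapshot(s *snapshot) {
    c.mu.Lock()
    defer c.mu.Unlock()
    s.NodeInfoMap = make(map[string]*nodeInfo, len(c.nodes))
    for name, n := range c.nodes {
        s.NodeInfoMap[name] = n
    }
    s.NodeInfoList = make([]*nodeInfo, 0, len(c.treeOrder))
    for _, name := range c.treeOrder {
        if n := s.NodeInfoMap[name]; n != nil {
            s.NodeInfoList = append(s.NodeInfoList, n)
        }
    }
}

func main() {
    c := &cache{
        treeOrder: []string{"zone-a-node-1", "zone-b-node-1", "zone-a-node-2"},
        nodes: map[string]*nodeInfo{
            "zone-a-node-1": {name: "zone-a-node-1"},
            "zone-b-node-1": {name: "zone-b-node-1"},
            "zone-a-node-2": {name: "zone-a-node-2"},
        },
    }
    var s snapshot
    c.updateSnapshot(&s)
    for i, n := range s.NodeInfoList {
        fmt.Println(i, n.name) // evaluated in snapshot order, no lock held here
    }
}

The key property is that the lock is taken once per scheduling cycle rather than once per node evaluated.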

Which issue(s) this PR fixes:
Part of #83922

Does this PR introduce a user-facing change?:

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Oct 16, 2019
@ahg-g
Member Author

ahg-g commented Oct 16, 2019

/assign @Huang-Wei

@ahg-g ahg-g changed the title from "create an ordered list of nodes instead of iterating over the tree" to "Make node tree order part of the snapshot" on Oct 16, 2019
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Oct 16, 2019
@ahg-g
Member Author

ahg-g commented Oct 16, 2019

/priority important-soon

@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Oct 16, 2019
@ahg-g ahg-g force-pushed the ahg-tree branch 5 times, most recently from f56e1cd to 20cc6f5, on October 18, 2019 03:05
@ahg-g
Member Author

ahg-g commented Oct 18, 2019

/assign @liu-cong


// NodeTree is a tree-like data structure that holds node names in each zone. Zone names are
// keys to "NodeTree.tree" and values of "NodeTree.tree" are arrays of node names.
// NodeTree is NOT thread-safe, any concurrent updates/reads from it must be synchronized by the caller.
// It is used only by schedulerCache, and should stay as such.
type NodeTree struct {
Contributor

NodeTree -> nodeTree?

Member Author

done
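
For context, here is a minimal, self-contained sketch of a zone-keyed node tree like the one described in the doc comment above (illustrative types only, not the actual scheduler implementation):

package main

import "fmt"

// nodeTree maps zone names to the node names in that zone, and keeps a
// stable zone iteration order so nodes can be flattened round-robin.
type nodeTree struct {
    tree     map[string][]string // zone name -> node names in that zone
    zones    []string            // zone iteration order
    numNodes int
}

func (nt *nodeTree) addNode(zone, name string) {
    if _, ok := nt.tree[zone]; !ok {
        nt.zones = append(nt.zones, zone)
    }
    nt.tree[zone] = append(nt.tree[zone], name)
    nt.numNodes++
}

// list flattens the tree round-robin across zones; this is the order a
// snapshot would capture.
func (nt *nodeTree) list() []string {
    out := make([]string, 0, nt.numNodes)
    for i := 0; len(out) < nt.numNodes; i++ {
        for _, z := range nt.zones {
            if nodes := nt.tree[z]; i < len(nodes) {
                out = append(out, nodes[i])
            }
        }
    }
    return out
}

func main() {
    nt := &nodeTree{tree: map[string][]string{}}
    nt.addNode("zone-a", "a1")
    nt.addNode("zone-a", "a2")
    nt.addNode("zone-b", "b1")
    fmt.Println(nt.list()) // [a1 b1 a2]: zones are interleaved
}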

@@ -716,3 +718,10 @@ func (cache *schedulerCache) GetCSINodeInfo(nodeName string) (*storagev1beta1.CS

return n, nil
}

// NumNodes returns the number of nodes.
func (cache *schedulerCache) NumNodes() int {
Contributor

It seems that this function is not called anywhere.

Member Author

done.

@@ -479,20 +478,22 @@ func (g *genericScheduler) findNodesThatFit(ctx context.Context, state *framewor
state.Write(migration.PredicatesStateKey, &migration.PredicatesStateData{Reference: meta})

checkNode := func(i int) {
nodeName := g.cache.NodeTree().Next()
nodeInfo := g.nodeInfoSnapshot.NodeInfoList[i]
if nodeInfo == nil {
Contributor

Can this be nil? Or can we make sure it's not nil?

Member Author

done.

// Take a snapshot of the nodes order in the tree
nodeSnapshot.NodeInfoList = make([]*schedulernodeinfo.NodeInfo, 0, cache.nodeTree.numNodes)
for i := 0; i < cache.nodeTree.numNodes; i++ {
if n := nodeSnapshot.NodeInfoMap[cache.nodeTree.next()]; n != nil {
Contributor

Do we know under what circumstances n can be nil? Not sure if we should log something if it happens.

Member Author

It would indicate an inconsistency between nodeTree and nodeInfoMap; I will log an error.

Member Author

done.
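
A minimal sketch of the loop with that error logging added (illustrative, self-contained code rather than the actual cache method):

package main

import (
    "fmt"
    "log"
)

type nodeInfo struct{ name string }

// snapshotList builds the ordered list from the tree order; a name present in
// the tree but missing from the map is logged instead of silently skipped.
func snapshotList(treeOrder []string, infoMap map[string]*nodeInfo) []*nodeInfo {
    list := make([]*nodeInfo, 0, len(treeOrder))
    for _, name := range treeOrder {
        if n := infoMap[name]; n != nil {
            list = append(list, n)
        } else {
            // nodeTree and NodeInfoMap disagree; this should not happen.
            log.Printf("node %q exists in the node tree but not in NodeInfoMap", name)
        }
    }
    return list
}

func main() {
    infoMap := map[string]*nodeInfo{"n1": {name: "n1"}}
    list := snapshotList([]string{"n1", "n2"}, infoMap) // logs for n2
    fmt.Println(len(list))                              // 1
}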

@liu-cong
Contributor

/lgtm

/hold

you can unhold when you want to merge.

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Oct 18, 2019
@ahg-g
Member Author

ahg-g commented Oct 18, 2019

/retest

@ahg-g
Member Author

ahg-g commented Oct 18, 2019

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 18, 2019
@Huang-Wei
Member

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 18, 2019
Member

@Huang-Wei Huang-Wei left a comment

Thanks @ahg-g. Some comments below.

@@ -520,8 +516,8 @@ func (g *genericScheduler) findNodesThatFit(ctx context.Context, state *framewor
workqueue.ParallelizeUntil(ctx, 16, int(allNodes), checkNode)

filtered = filtered[:filteredLen]
if len(errs) > 0 {
return []*v1.Node{}, FailedPredicateMap{}, framework.NodeToStatusMap{}, errors.CreateAggregateFromMessageCountMap(errs)
if err := errCh.ReceiveError(); err != nil {
Member

This doesn't look right; we used to return all predicate errors to the users, but here it changes to a single error.

Member Author

This was collecting errors from all nodes, not predicates. Why is it useful to continue examining all nodes when we end up exiting anyway? Remember that this is an error, not a predicate failure, so it is likely that something internal went wrong and caused it; continuing to iterate over all nodes does not seem useful to me.

Member

Ah, I misread; err is for an unexpected internal error, not a PredicateFailure. Never mind then.
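
For reference, a small sketch of the "first internal error wins" pattern being discussed; the helper type here is illustrative and not the scheduler's actual error-channel utility:

package main

import (
    "errors"
    "fmt"
    "sync"
)

// errorChannel records at most one error from parallel workers.
type errorChannel struct{ ch chan error }

func newErrorChannel() *errorChannel { return &errorChannel{ch: make(chan error, 1)} }

// send records the first error and drops the rest without blocking.
func (e *errorChannel) send(err error) {
    select {
    case e.ch <- err:
    default:
    }
}

// receive returns the recorded error, if any, without blocking.
func (e *errorChannel) receive() error {
    select {
    case err := <-e.ch:
        return err
    default:
        return nil
    }
}

func main() {
    errCh := newErrorChannel()
    var wg sync.WaitGroup
    for i := 0; i < 4; i++ {
        wg.Add(1)
        go func(i int) {
            defer wg.Done()
            if i == 2 { // pretend one worker hits an internal error
                errCh.send(errors.New("internal error evaluating node"))
            }
        }(i)
    }
    wg.Wait()
    if err := errCh.receive(); err != nil {
        fmt.Println("abort scheduling cycle:", err)
    }
}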


// Take a snapshot of the nodes order in the tree
nodeSnapshot.NodeInfoList = make([]*schedulernodeinfo.NodeInfo, 0, cache.nodeTree.numNodes)
for i := 0; i < cache.nodeTree.numNodes; i++ {
Member

The loop seems to be both memory- and time-consuming; can we have a benchmark covering the whole scheduling cycle?

Member Author

@ahg-g ahg-g Oct 18, 2019

It is not, really. As described in the PR description, we gain about 3%. As for memory, this is just an array of pointers, so even for a cluster with 5k nodes, for example, the overhead is negligible.

Note that we do something similar in the predicates metadata; here are examples:

allNodeNames := make([]string, 0, len(nodeInfoMap))

allNodeNames := make([]string, 0, len(nodeInfoMap))

NodeInfoMap map[string]*NodeInfo
Generation int64
// NodeInfoList is the list of nodes as ordered in the cache's nodeTree.
NodeInfoList []*NodeInfo
Member

Can we make it NodeNameList []string, and then in each checkNode() fetch the nodeInfo via nodeInfo := g.nodeInfoSnapshot.NodeInfoMap[NodeNameList[i]] (or via a helper method)?

Member Author

Why? We are going to fetch the nodeInfos anyway while iterating over the nodes to evaluate the filters/predicates. Also, as I mentioned, we look up the NodeInfo for all nodes in the predicate metadata; in a follow-up PR I want to change that so that we just index this new list. Finally, this should have a smaller footprint since we are storing pointers rather than strings.

Member

sounds good.
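
A small sketch of the trade-off discussed above (illustrative names): indexing a []*NodeInfo directly versus looking each node up by name in the map.

package main

import "fmt"

type nodeInfo struct{ name string }

func main() {
    infoMap := map[string]*nodeInfo{
        "n1": {name: "n1"},
        "n2": {name: "n2"},
    }

    // Alternative raised in review: store names and look each one up per node.
    nameList := []string{"n1", "n2"}
    for i := range nameList {
        ni := infoMap[nameList[i]] // extra hash lookup inside the hot loop
        fmt.Println("via name list:", ni.name)
    }

    // What the PR stores instead: NodeInfo pointers in snapshot order.
    infoList := []*nodeInfo{infoMap["n1"], infoMap["n2"]}
    checkNode := func(i int) {
        ni := infoList[i] // direct index: no map lookup, no lock
        fmt.Println("via info list:", ni.name)
    }
    for i := range infoList {
        checkNode(i)
    }
}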

Member Author

@ahg-g ahg-g left a comment

Thanks, please see replies.



@Huang-Wei
Member

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 18, 2019
@k8s-ci-robot k8s-ci-robot merged commit 70f6806 into kubernetes:master Oct 18, 2019
@k8s-ci-robot k8s-ci-robot added this to the v1.17 milestone Oct 18, 2019
@@ -479,20 +478,17 @@ func (g *genericScheduler) findNodesThatFit(ctx context.Context, state *framewor
state.Write(migration.PredicatesStateKey, &migration.PredicatesStateData{Reference: meta})

checkNode := func(i int) {
nodeName := g.cache.NodeTree().Next()

nodeInfo := g.nodeInfoSnapshot.NodeInfoList[i]
Member

This broke scalability tests: #84151

TL;DR: this breaks spreading of pods in large clusters.

What exactly has happened:

In large enough clusters, we use the feature of finding only N feasible nodes and scoring only those:

numNodesToFind := g.numFeasibleNodesToFind(allNodes)

  1. Before this change, we were looking for nodes starting at the point where we previously stopped (because Next() was done at the level of the original tree).
  2. With this change, we're always starting from 0.

So assume you have 5k nodes, all of them feasible, and numFeasible chooses 250.

  • previously, you first chose nodes [1..250], then [251..500], ... [4751..5000], [1..250], ...
  • with this change, we will choose [1..250], [1..250], [1..250], ... until one of those nodes becomes infeasible and we choose something different.

With this PR, next() is called only in UpdateNodeSnapshotInfo:
https://github.com/kubernetes/kubernetes/pull/84014/files#diff-f4a894ca5e905aa5f613269fc967fe2cR206
and if the set of nodes doesn't change, we will pretty much always be generating the same set of nodes.

This kind of breaks the fact that the scheduler schedules across the whole cluster. While it's not a documented feature per se, I don't think this is the right thing to do.

I'm going to open a revert of this PR to fix the scalability tests (or half of them, because we seem to have two different regressions), but will wait for your explicit approval.
We can discuss how to fix that later.
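
An illustrative sketch of the behavior being described, and of one possible way to restore the rotation by persisting a start index across cycles (only a sketch of the idea, not necessarily the fix that was adopted):

package main

import "fmt"

// nextStartNodeIndex persists across scheduling cycles, playing the role the
// old Next() cursor used to play on the live tree.
var nextStartNodeIndex int

// findFeasible examines at most numToFind nodes, starting where the previous
// cycle stopped and wrapping around the snapshot list; without the offset it
// would always re-examine nodes [0..numToFind) and break spreading.
func findFeasible(nodes []string, numToFind int) []string {
    found := make([]string, 0, numToFind)
    for i := 0; i < len(nodes) && len(found) < numToFind; i++ {
        n := nodes[(nextStartNodeIndex+i)%len(nodes)]
        found = append(found, n) // pretend every node is feasible
    }
    nextStartNodeIndex = (nextStartNodeIndex + len(found)) % len(nodes)
    return found
}

func main() {
    nodes := []string{"n1", "n2", "n3", "n4", "n5", "n6"}
    fmt.Println(findFeasible(nodes, 2)) // [n1 n2]
    fmt.Println(findFeasible(nodes, 2)) // [n3 n4], not [n1 n2] again
    fmt.Println(findFeasible(nodes, 2)) // [n5 n6]
}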

Member

heh... it's no longer possible to autorevert it.

Member

Fortunately the conflicts were trivial - opened #84222

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note-none Denotes a PR that doesn't merit a release note. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.