Make node tree order part of the snapshot #84014
Conversation
/assign @Huang-Wei
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: ahg-g. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/priority important-soon
force-pushed from f56e1cd to 20cc6f5
/assign @liu-cong
// NodeTree is a tree-like data structure that holds node names in each zone. Zone names are
// keys to "NodeTree.tree" and values of "NodeTree.tree" are arrays of node names.
// NodeTree is NOT thread-safe, any concurrent updates/reads from it must be synchronized by the caller.
// It is used only by schedulerCache, and should stay as such.
type NodeTree struct {
NodeTree -> nodeTree?
done
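For background on what this type is for, here is a minimal, hedged sketch of a zone-aware node tree with round-robin iteration; the names (zoneTree, addNode, next) and the exact cursor handling are illustrative, not the scheduler's actual implementation:

```go
package main

import "fmt"

// zoneTree groups node names by zone and hands them out round-robin across
// zones, so consecutive calls to next() spread over zones instead of
// exhausting one zone first.
type zoneTree struct {
	tree     map[string][]string // zone name -> node names in that zone
	zones    []string            // zone names in insertion order
	zoneIdx  int                 // which zone to look at next
	nodeIdx  map[string]int      // per-zone cursor
	numNodes int
}

func newZoneTree() *zoneTree {
	return &zoneTree{tree: map[string][]string{}, nodeIdx: map[string]int{}}
}

func (t *zoneTree) addNode(zone, name string) {
	if _, ok := t.tree[zone]; !ok {
		t.zones = append(t.zones, zone)
	}
	t.tree[zone] = append(t.tree[zone], name)
	t.numNodes++
}

// next returns the next node name, rotating across zones. When every zone is
// exhausted, the per-zone cursors reset and iteration starts over.
func (t *zoneTree) next() string {
	if t.numNodes == 0 {
		return ""
	}
	for exhausted := 0; exhausted < len(t.zones); {
		zone := t.zones[t.zoneIdx]
		t.zoneIdx = (t.zoneIdx + 1) % len(t.zones)
		if i := t.nodeIdx[zone]; i < len(t.tree[zone]) {
			t.nodeIdx[zone] = i + 1
			return t.tree[zone][i]
		}
		exhausted++
	}
	for z := range t.nodeIdx {
		t.nodeIdx[z] = 0
	}
	return t.next()
}

func main() {
	t := newZoneTree()
	t.addNode("zone-a", "a1")
	t.addNode("zone-a", "a2")
	t.addNode("zone-b", "b1")
	for i := 0; i < 4; i++ {
		fmt.Println(t.next()) // a1, b1, a2, b1
	}
}
```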
@@ -716,3 +718,10 @@ func (cache *schedulerCache) GetCSINodeInfo(nodeName string) (*storagev1beta1.CS
	return n, nil
}

// NumNodes returns the number of nodes.
func (cache *schedulerCache) NumNodes() int {
It seems that this function is not called anywhere.
done.
@@ -479,20 +478,22 @@ func (g *genericScheduler) findNodesThatFit(ctx context.Context, state *framewor
	state.Write(migration.PredicatesStateKey, &migration.PredicatesStateData{Reference: meta})

	checkNode := func(i int) {
		nodeName := g.cache.NodeTree().Next()
		nodeInfo := g.nodeInfoSnapshot.NodeInfoList[i]
		if nodeInfo == nil {
Can this be nil? Or can we make sure it's not nil?
done.
// Take a snapshot of the nodes order in the tree
nodeSnapshot.NodeInfoList = make([]*schedulernodeinfo.NodeInfo, 0, cache.nodeTree.numNodes)
for i := 0; i < cache.nodeTree.numNodes; i++ {
	if n := nodeSnapshot.NodeInfoMap[cache.nodeTree.next()]; n != nil {
Do we know under what circumstances n can be nil? Not sure if we should log something if it happens.
That would be an inconsistency between nodeTree and nodeInfoMap; I will log an error.
done.
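As a rough, hedged illustration of that resolution (hypothetical names and types, not the actual cache code), the snapshot loop can log whenever the tree yields a name that is missing from the map:

```go
package main

import (
	"fmt"
	"log"
)

// nodeInfo stands in for the scheduler's NodeInfo; illustrative only.
type nodeInfo struct{ name string }

// snapshotOrder walks the tree-provided order (next yields one node name per
// call, numNodes times) and keeps only entries that exist in infoMap, logging
// any inconsistency between the two structures.
func snapshotOrder(numNodes int, next func() string, infoMap map[string]*nodeInfo) []*nodeInfo {
	list := make([]*nodeInfo, 0, numNodes)
	for i := 0; i < numNodes; i++ {
		name := next()
		if n, ok := infoMap[name]; ok && n != nil {
			list = append(list, n)
		} else {
			log.Printf("node %q is in the node tree but not in the info map; this should not happen", name)
		}
	}
	return list
}

func main() {
	infos := map[string]*nodeInfo{"a": {name: "a"}, "b": {name: "b"}}
	names := []string{"a", "b", "ghost"} // "ghost" simulates the inconsistency
	i := 0
	next := func() string { n := names[i%len(names)]; i++; return n }
	for _, n := range snapshotOrder(3, next, infos) {
		fmt.Println(n.name)
	}
}
```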
/lgtm
/hold
you can unhold when you want to merge.
/retest
/hold cancel
/hold
Thanks @ahg-g. Some comments below.
@@ -520,8 +516,8 @@ func (g *genericScheduler) findNodesThatFit(ctx context.Context, state *framewor
	workqueue.ParallelizeUntil(ctx, 16, int(allNodes), checkNode)

	filtered = filtered[:filteredLen]
	if len(errs) > 0 {
		return []*v1.Node{}, FailedPredicateMap{}, framework.NodeToStatusMap{}, errors.CreateAggregateFromMessageCountMap(errs)
	if err := errCh.ReceiveError(); err != nil {
This doesn't look right; we used to return all predicate errors to the users, but here it changes to a single error.
This was collecting errors from all nodes, not predicates. Why is it useful to continue to examine all nodes and end up exiting anyway? Remember that this is an error, not a predicate failure, so it is likely that something internal went wrong and caused it; continuing to iterate over all nodes does not seem useful to me.
Ah, I misread: err is for unexpected internal errors, not predicate failures. Never mind then.
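Since the errCh pattern came up here, below is a self-contained sketch of the "first internal error wins, cancel the rest" idea; the names (errorChannel, sendErrorWithCancel, receiveError) only mirror, and are not, the scheduler's actual ErrorChannel API:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// errorChannel lets workers report internal errors without blocking; the
// caller reads at most one of them after the parallel phase.
type errorChannel struct{ ch chan error }

func newErrorChannel() *errorChannel { return &errorChannel{ch: make(chan error, 1)} }

// sendErrorWithCancel records the first error and cancels the remaining work,
// since an internal error (unlike a predicate failure) aborts the whole cycle.
func (e *errorChannel) sendErrorWithCancel(err error, cancel context.CancelFunc) {
	select {
	case e.ch <- err:
	default: // an error was already recorded; drop this one
	}
	cancel()
}

func (e *errorChannel) receiveError() error {
	select {
	case err := <-e.ch:
		return err
	default:
		return nil
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	errCh := newErrorChannel()

	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			select {
			case <-ctx.Done(): // stop early once someone reported an error
				return
			default:
			}
			if i == 3 {
				errCh.sendErrorWithCancel(errors.New("internal failure on node 3"), cancel)
			}
		}(i)
	}
	wg.Wait()
	if err := errCh.receiveError(); err != nil {
		fmt.Println("aborting cycle:", err)
	}
}
```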
// Take a snapshot of the nodes order in the tree
nodeSnapshot.NodeInfoList = make([]*schedulernodeinfo.NodeInfo, 0, cache.nodeTree.numNodes)
for i := 0; i < cache.nodeTree.numNodes; i++ {
The loop seems to be both memory- and time-consuming; can we have a benchmark testing the whole scheduling cycle?
It is not, really. As described in the issue description, we gain about 3%. As for memory, this is just an array of pointers, so even for a cluster with 5k nodes, for example, the overhead is negligible.
Note that we do something similar in the predicates metadata; here are examples:
allNodeNames := make([]string, 0, len(nodeInfoMap))
allNodeNames := make([]string, 0, len(nodeInfoMap))
NodeInfoMap map[string]*NodeInfo
Generation int64
// NodeInfoList is the list of nodes as ordered in the cache's nodeTree.
NodeInfoList []*NodeInfo
Can we make it NodeNameList []string, and then in each checkNode() the "nodeInfo" can be fetched by nodeInfo := g.nodeInfoSnapshot.NodeInfoMap[NodeNameList[i]] (or from a helper method)?
Why? We are going to fetch the NodeInfos anyway while iterating over the nodes to evaluate the filters/predicates. Also, as I mentioned, we look up the NodeInfo for all nodes in the predicate metadata; in a follow-up PR, I want to change that so that we just index this new list. Finally, this should have a smaller footprint since we are storing pointers rather than strings.
sounds good.
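To make the trade-off in this thread concrete, here is a small hedged sketch (illustrative types, not the real snapshot struct) comparing an ordered pointer list with an ordered name list that needs an extra map lookup per node:

```go
package main

import "fmt"

type nodeInfo struct{ name string }

type snapshot struct {
	nodeInfoMap  map[string]*nodeInfo
	nodeInfoList []*nodeInfo // ordered pointers: what this PR adds
	nodeNameList []string    // ordered names: the alternative suggested above
}

func main() {
	a, b := &nodeInfo{"a"}, &nodeInfo{"b"}
	s := snapshot{
		nodeInfoMap:  map[string]*nodeInfo{"a": a, "b": b},
		nodeInfoList: []*nodeInfo{a, b},
		nodeNameList: []string{"a", "b"},
	}

	// Pointer list: one slice index per node.
	fmt.Println(s.nodeInfoList[1].name)

	// Name list: a slice index plus a map lookup per node.
	fmt.Println(s.nodeInfoMap[s.nodeNameList[1]].name)
}
```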
Thanks, please see replies.
/hold cancel
@@ -479,20 +478,17 @@ func (g *genericScheduler) findNodesThatFit(ctx context.Context, state *framewor
	state.Write(migration.PredicatesStateKey, &migration.PredicatesStateData{Reference: meta})

	checkNode := func(i int) {
		nodeName := g.cache.NodeTree().Next()

		nodeInfo := g.nodeInfoSnapshot.NodeInfoList[i]
This broke scalability tests: #84151
TL;DR: this breaks spreading of pods in large clusters.
What exactly has happened:
1. In large enough clusters, we use the feature of finding only N feasible nodes and scoring only those:
numNodesToFind := g.numFeasibleNodesToFind(allNodes)
- Before this change, we were looking for nodes starting at the point where we previously stopped (because Next() was done at the level of the original tree).
- With this change, we're always starting from 0.
So assume you have 5k nodes, all feasible, and numFeasible is choosing 250.
- previously, you first chose nodes [1..250], then [251..500], ... [4751..5000], [1..250], ...
- with this change, we will choose [1..250], [1..250], [1..250], ... until one of those nodes becomes infeasible and we choose something different.
With this PR, next() is called only in UpdateNodeSnapshotInfo:
https://github.com/kubernetes/kubernetes/pull/84014/files#diff-f4a894ca5e905aa5f613269fc967fe2cR206
and if the set of nodes doesn't change, we will pretty much always be generating the same set of nodes.
This kind of breaks the fact that the scheduler schedules across the whole cluster. While it's not a documented feature per se, I think this isn't the right thing to do.
I'm going to open a revert of this PR to fix scalability tests (or half of them, because we seem to have two different regressions), but will wait for your explicit approval.
We can discuss how to fix that later.
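A toy, hedged illustration of the behavior described above (not scheduler code): with 5000 nodes and 250 to find per cycle, restarting from index 0 examines the same prefix every cycle, whereas resuming from a persisted cursor walks the whole cluster over successive cycles:

```go
package main

import "fmt"

// pickFromZero models the post-PR behavior: every cycle scans the snapshot
// list starting from index 0.
func pickFromZero(total, numToFind int) []int {
	out := make([]int, 0, numToFind)
	for i := 0; i < numToFind; i++ {
		out = append(out, i)
	}
	return out
}

var nextStart int // persisted across cycles, like the live tree's own cursor

// pickResuming models the pre-PR behavior: each cycle resumes where the
// previous one stopped, wrapping around the cluster.
func pickResuming(total, numToFind int) []int {
	out := make([]int, 0, numToFind)
	for i := 0; i < numToFind; i++ {
		out = append(out, (nextStart+i)%total)
	}
	nextStart = (nextStart + numToFind) % total
	return out
}

func main() {
	const total, numToFind = 5000, 250
	for cycle := 0; cycle < 3; cycle++ {
		a := pickFromZero(total, numToFind)
		b := pickResuming(total, numToFind)
		fmt.Printf("cycle %d: from-zero starts at %d, resuming starts at %d\n", cycle, a[0], b[0])
	}
}
```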
Heh... it's no longer possible to auto-revert it.
Fortunately the conflicts were trivial - opened #84222
What type of PR is this?
/kind cleanup
What this PR does / why we need it:
Makes the scheduler's nodeTree object private to the cache and part of the snapshot. The purpose of the node tree is to decide the iteration order of the nodes (for example, forcing iteration across zones when evaluating nodes). Currently the scheduler iterates over the live tree instead of the snapshot, which has two drawbacks: it requires locking the cache on every access, and the live tree can change mid-cycle, making the iteration order inconsistent with the snapshot being evaluated.
This PR makes the order part of the snapshot, which improves performance (by about 3%) since we don't require locking, and removes the inconsistencies since the scheduler is now iterating over a static list of nodes.
Which issue(s) this PR fixes:
Part of #83922
Does this PR introduce a user-facing change?: