
Make node tree order part of the snapshot #84014

Merged

merged 1 commit into kubernetes:master on Oct 18, 2019

Conversation

ahg-g
Member

@ahg-g ahg-g commented Oct 16, 2019

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

Makes the scheduler's nodeTree object private to the cache and part of the snapshot. The purpose of NodeTree is to decide the iteration order of the nodes (for example, forcing iteration across zones when evaluating nodes). Currently the scheduler iterates over the live tree instead of the snapshot, which has two drawbacks:

  1. iterating over the scheduler cache's nodeTree requires acquiring a lock, which causes contention (see "NodeTree.Next() in scheduler is really expensive" #72408)
  2. it causes inconsistencies in clusters that bring up and shut down nodes frequently (e.g. with cluster autoscaling)

This PR makes the order part of the snapshot, which improves performance (by about 3%) since no locking is required, and removes the inconsistencies since the scheduler now iterates over a static list of nodes.
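
To illustrate the approach (the types and names below are a simplified sketch, not the actual scheduler code): the cache builds an ordered NodeInfoList once per cycle while holding its lock, and the scheduling cycle then iterates that list lock-free.

package main

import (
    "fmt"
    "sync"
)

// nodeInfo stands in for the scheduler's NodeInfo type.
type nodeInfo struct {
    name string
}

// cache stands in for schedulerCache: the live node tree is guarded by a mutex.
type cache struct {
    mu        sync.Mutex
    treeOrder []string             // iteration order decided by the node tree
    nodes     map[string]*nodeInfo // live node info, keyed by node name
}

// snapshot holds both the map and an ordered list, so a scheduling cycle can
// iterate by index without touching the cache lock.
type snapshot struct {
    NodeInfoMap  map[string]*nodeInfo
    NodeInfoList []*nodeInfo
}

// updateSnapshot copies the tree order into the snapshot once, under the lock;
// the rest of the cycle then works on a static, consistent list.
func (c *cache) updateSnapshot(s *snapshot) {
    c.mu.Lock()
    defer c.mu.Unlock()
    s.NodeInfoMap = make(map[string]*nodeInfo, len(c.nodes))
    for name, n := range c.nodes {
        s.NodeInfoMap[name] = n
    }
    s.NodeInfoList = make([]*nodeInfo, 0, len(c.treeOrder))
    for _, name := range c.treeOrder {
        if n := s.NodeInfoMap[name]; n != nil {
            s.NodeInfoList = append(s.NodeInfoList, n)
        }
    }
}

func main() {
    c := &cache{
        treeOrder: []string{"zone-a-node-1", "zone-b-node-1", "zone-a-node-2"},
        nodes: map[string]*nodeInfo{
            "zone-a-node-1": {name: "zone-a-node-1"},
            "zone-b-node-1": {name: "zone-b-node-1"},
            "zone-a-node-2": {name: "zone-a-node-2"},
        },
    }
    var s snapshot
    c.updateSnapshot(&s)
    for i, n := range s.NodeInfoList {
        fmt.Println(i, n.name) // evaluated in snapshot order, no lock held here
    }
}

The key property is that the lock is taken once per scheduling cycle rather than once per node evaluated.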

Which issue(s) this PR fixes:
Part of #83922

Does this PR introduce a user-facing change?:

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Oct 16, 2019
@ahg-g
Member Author

ahg-g commented Oct 16, 2019

/assign @Huang-Wei

@ahg-g ahg-g changed the title from "create an ordered list of nodes instead of iterating over the tree" to "Make node tree order part of the snapshot" on Oct 16, 2019
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Oct 16, 2019
@ahg-g
Member Author

ahg-g commented Oct 16, 2019

/priority important-soon

@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Oct 16, 2019
@ahg-g ahg-g force-pushed the ahg-tree branch 5 times, most recently from f56e1cd to 20cc6f5, on October 18, 2019 03:05
@ahg-g
Member Author

ahg-g commented Oct 18, 2019

/assign @liu-cong


// NodeTree is a tree-like data structure that holds node names in each zone. Zone names are
// keys to "NodeTree.tree" and values of "NodeTree.tree" are arrays of node names.
// NodeTree is NOT thread-safe, any concurrent updates/reads from it must be synchronized by the caller.
// It is used only by schedulerCache, and should stay as such.
type NodeTree struct {
Contributor

NodeTree -> nodeTree?

Member Author

done
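
For context, here is a minimal, self-contained sketch of a zone-keyed node tree like the one described in the doc comment above (illustrative types only, not the actual scheduler implementation):

package main

import "fmt"

// nodeTree maps zone names to the node names in that zone, and keeps a
// stable zone iteration order so nodes can be flattened round-robin.
type nodeTree struct {
    tree     map[string][]string // zone name -> node names in that zone
    zones    []string            // zone iteration order
    numNodes int
}

func (nt *nodeTree) addNode(zone, name string) {
    if _, ok := nt.tree[zone]; !ok {
        nt.zones = append(nt.zones, zone)
    }
    nt.tree[zone] = append(nt.tree[zone], name)
    nt.numNodes++
}

// list flattens the tree round-robin across zones; this is the order a
// snapshot would capture.
func (nt *nodeTree) list() []string {
    out := make([]string, 0, nt.numNodes)
    for i := 0; len(out) < nt.numNodes; i++ {
        for _, z := range nt.zones {
            if nodes := nt.tree[z]; i < len(nodes) {
                out = append(out, nodes[i])
            }
        }
    }
    return out
}

func main() {
    nt := &nodeTree{tree: map[string][]string{}}
    nt.addNode("zone-a", "a1")
    nt.addNode("zone-a", "a2")
    nt.addNode("zone-b", "b1")
    fmt.Println(nt.list()) // [a1 b1 a2]: zones are interleaved
}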

@@ -716,3 +718,10 @@ func (cache *schedulerCache) GetCSINodeInfo(nodeName string) (*storagev1beta1.CS

return n, nil
}

// NumNodes returns the number of nodes.
func (cache *schedulerCache) NumNodes() int {
Contributor

It seems that this function is not called anywhere.

Member Author

done.

@@ -479,20 +478,22 @@ func (g *genericScheduler) findNodesThatFit(ctx context.Context, state *framewor
state.Write(migration.PredicatesStateKey, &migration.PredicatesStateData{Reference: meta})

checkNode := func(i int) {
nodeName := g.cache.NodeTree().Next()
nodeInfo := g.nodeInfoSnapshot.NodeInfoList[i]
if nodeInfo == nil {
Contributor

Can this be nil? Or can we make sure it's not nil?

Member Author

done.

// Take a snapshot of the nodes order in the tree
nodeSnapshot.NodeInfoList = make([]*schedulernodeinfo.NodeInfo, 0, cache.nodeTree.numNodes)
for i := 0; i < cache.nodeTree.numNodes; i++ {
if n := nodeSnapshot.NodeInfoMap[cache.nodeTree.next()]; n != nil {
Contributor

Do we know under what circumstances n can be nil? Not sure if we should log something if it happens.

Member Author

It would indicate an inconsistency between nodeTree and nodeInfoMap; I will log an error.

Member Author

done.
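
A minimal sketch of the loop with that error logging added (illustrative, self-contained code rather than the actual cache method):

package main

import (
    "fmt"
    "log"
)

type nodeInfo struct{ name string }

// snapshotList builds the ordered list from the tree order; a name present in
// the tree but missing from the map is logged instead of silently skipped.
func snapshotList(treeOrder []string, infoMap map[string]*nodeInfo) []*nodeInfo {
    list := make([]*nodeInfo, 0, len(treeOrder))
    for _, name := range treeOrder {
        if n := infoMap[name]; n != nil {
            list = append(list, n)
        } else {
            // nodeTree and NodeInfoMap disagree; this should not happen.
            log.Printf("node %q exists in the node tree but not in NodeInfoMap", name)
        }
    }
    return list
}

func main() {
    infoMap := map[string]*nodeInfo{"n1": {name: "n1"}}
    list := snapshotList([]string{"n1", "n2"}, infoMap) // logs for n2
    fmt.Println(len(list))                              // 1
}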

@liu-cong
Contributor

/lgtm

/hold

you can unhold when you want to merge.

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Oct 18, 2019
@ahg-g
Member Author

ahg-g commented Oct 18, 2019

/retest

@ahg-g
Member Author

ahg-g commented Oct 18, 2019

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 18, 2019
@Huang-Wei
Member

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 18, 2019
Member

@Huang-Wei Huang-Wei left a comment

Thanks @ahg-g. Some comments below.

@@ -520,8 +516,8 @@ func (g *genericScheduler) findNodesThatFit(ctx context.Context, state *framewor
workqueue.ParallelizeUntil(ctx, 16, int(allNodes), checkNode)

filtered = filtered[:filteredLen]
if len(errs) > 0 {
return []*v1.Node{}, FailedPredicateMap{}, framework.NodeToStatusMap{}, errors.CreateAggregateFromMessageCountMap(errs)
if err := errCh.ReceiveError(); err != nil {
Member

This doesn't look right; we used to return all predicate errors to the users, but here it changes to a single error.

Member Author

This was collecting errors from all nodes, not predicates. Why is it useful to continue examining all nodes when we end up exiting anyway? Remember that this is an error, not a predicate failure, so it is likely that something internal went wrong and caused it; continuing to iterate over all nodes does not seem useful to me.

Member

Ah, I misread; err is for an unexpected internal error, not a PredicateFailure. Never mind then.
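
For reference, a small sketch of the "first internal error wins" pattern being discussed; the helper type here is illustrative and not the scheduler's actual error-channel utility:

package main

import (
    "errors"
    "fmt"
    "sync"
)

// errorChannel records at most one error from parallel workers.
type errorChannel struct{ ch chan error }

func newErrorChannel() *errorChannel { return &errorChannel{ch: make(chan error, 1)} }

// send records the first error and drops the rest without blocking.
func (e *errorChannel) send(err error) {
    select {
    case e.ch <- err:
    default:
    }
}

// receive returns the recorded error, if any, without blocking.
func (e *errorChannel) receive() error {
    select {
    case err := <-e.ch:
        return err
    default:
        return nil
    }
}

func main() {
    errCh := newErrorChannel()
    var wg sync.WaitGroup
    for i := 0; i < 4; i++ {
        wg.Add(1)
        go func(i int) {
            defer wg.Done()
            if i == 2 { // pretend one worker hits an internal error
                errCh.send(errors.New("internal error evaluating node"))
            }
        }(i)
    }
    wg.Wait()
    if err := errCh.receive(); err != nil {
        fmt.Println("abort scheduling cycle:", err)
    }
}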


// Take a snapshot of the nodes order in the tree
nodeSnapshot.NodeInfoList = make([]*schedulernodeinfo.NodeInfo, 0, cache.nodeTree.numNodes)
for i := 0; i < cache.nodeTree.numNodes; i++ {
Member

The loop seems to be both memory- and time-consuming; can we have a benchmark covering the whole scheduling cycle?

Member Author

@ahg-g ahg-g Oct 18, 2019

It is not, really. As described in the PR description, we gain about 3%. As for memory, this is just an array of pointers, so even for a cluster with 5k nodes, for example, the overhead is negligible.

Note that we do something similar in the predicates metadata; here are examples:

allNodeNames := make([]string, 0, len(nodeInfoMap))

allNodeNames := make([]string, 0, len(nodeInfoMap))

NodeInfoMap map[string]*NodeInfo
Generation int64
// NodeInfoList is the list of nodes as ordered in the cache's nodeTree.
NodeInfoList []*NodeInfo
Member

Can we make it NodeNameList []string, and then in each checkNode() fetch the nodeInfo via nodeInfo := g.nodeInfoSnapshot.NodeInfoMap[NodeNameList[i]] (or via a helper method)?

Member Author

Why? We are going to fetch the nodeInfos anyway while iterating over the nodes to evaluate the filters/predicates. Also, as I mentioned, we look up the NodeInfo for all nodes in the predicate metadata; in a follow-up PR I want to change that so that we just index this new list. Finally, this should have a smaller footprint since we are storing pointers rather than strings.

Member

sounds good.
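
A small sketch of the trade-off discussed above (illustrative names): indexing a []*NodeInfo directly versus looking each node up by name in the map.

package main

import "fmt"

type nodeInfo struct{ name string }

func main() {
    infoMap := map[string]*nodeInfo{
        "n1": {name: "n1"},
        "n2": {name: "n2"},
    }

    // Alternative raised in review: store names and look each one up per node.
    nameList := []string{"n1", "n2"}
    for i := range nameList {
        ni := infoMap[nameList[i]] // extra hash lookup inside the hot loop
        fmt.Println("via name list:", ni.name)
    }

    // What the PR stores instead: NodeInfo pointers in snapshot order.
    infoList := []*nodeInfo{infoMap["n1"], infoMap["n2"]}
    checkNode := func(i int) {
        ni := infoList[i] // direct index: no map lookup, no lock
        fmt.Println("via info list:", ni.name)
    }
    for i := range infoList {
        checkNode(i)
    }
}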

Member Author

@ahg-g ahg-g left a comment

Thanks, please see replies.



@Huang-Wei
Member

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 18, 2019
@k8s-ci-robot k8s-ci-robot merged commit 70f6806 into kubernetes:master Oct 18, 2019
@k8s-ci-robot k8s-ci-robot added this to the v1.17 milestone Oct 18, 2019
@@ -479,20 +478,17 @@ func (g *genericScheduler) findNodesThatFit(ctx context.Context, state *framewor
state.Write(migration.PredicatesStateKey, &migration.PredicatesStateData{Reference: meta})

checkNode := func(i int) {
nodeName := g.cache.NodeTree().Next()

nodeInfo := g.nodeInfoSnapshot.NodeInfoList[i]
Member

This broke scalability tests: #84151

TL;DR: this breaks spreading of pods in large clusters.

What exactly has happened:

In large enough clusters, we use the feature of finding only N feasible nodes and scoring only those:

numNodesToFind := g.numFeasibleNodesToFind(allNodes)

  1. Before this change, we were looking for nodes starting at the point where we previously stopped (because Next() was done at the level of the original tree).
  2. With this change, we're always starting from 0.

So assume you have 5k nodes, all of them feasible, and numFeasible chooses 250.

  • previously, you first chose nodes [1..250], then [251..500], ... [4751..5000], [1..250], ...
  • with this change, we will choose [1..250], [1..250], [1..250], ... until one of those nodes becomes infeasible and we choose something different.

With this PR, next() is called only in UpdateNodeSnapshotInfo:
https://github.com/kubernetes/kubernetes/pull/84014/files#diff-f4a894ca5e905aa5f613269fc967fe2cR206
and if the set of nodes doesn't change, we will pretty much always be generating the same set of nodes.

This kind of breaks the fact that the scheduler schedules across the whole cluster. While it's not a documented feature per se, I don't think this is the right thing to do.

I'm going to open a revert of this PR to fix the scalability tests (or half of them, because we seem to have two different regressions), but will wait for your explicit approval.
We can discuss how to fix that later.
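
An illustrative sketch of the behavior being described, and of one possible way to restore the rotation by persisting a start index across cycles (only a sketch of the idea, not necessarily the fix that was adopted):

package main

import "fmt"

// nextStartNodeIndex persists across scheduling cycles, playing the role the
// old Next() cursor used to play on the live tree.
var nextStartNodeIndex int

// findFeasible examines at most numToFind nodes, starting where the previous
// cycle stopped and wrapping around the snapshot list; without the offset it
// would always re-examine nodes [0..numToFind) and break spreading.
func findFeasible(nodes []string, numToFind int) []string {
    found := make([]string, 0, numToFind)
    for i := 0; i < len(nodes) && len(found) < numToFind; i++ {
        n := nodes[(nextStartNodeIndex+i)%len(nodes)]
        found = append(found, n) // pretend every node is feasible
    }
    nextStartNodeIndex = (nextStartNodeIndex + len(found)) % len(nodes)
    return found
}

func main() {
    nodes := []string{"n1", "n2", "n3", "n4", "n5", "n6"}
    fmt.Println(findFeasible(nodes, 2)) // [n1 n2]
    fmt.Println(findFeasible(nodes, 2)) // [n3 n4], not [n1 n2] again
    fmt.Println(findFeasible(nodes, 2)) // [n5 n6]
}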

Member

heh... it's no longer possible to autorevert it.

Member

Fortunately the conflicts were trivial - opened #84222

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note-none Denotes a PR that doesn't merit a release note. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.