Store node information in NodeInfo #24598
Conversation
			return predicates.NewSelectorMatchPredicate(args.NodeInfo)
		},
	),
	// TODO: This is run as part of GeneralPredicates - figure out the way to remove it.
@davidopp - any idea how to solve it?
This looks duplicated. My suggestion is to remove one of them.
I'm not sure we can, because of backward compatibility issues. That's why I'm asking @davidopp.
I think moving this to init() is safe enough for backward compatibility. I should have moved it in #20204, but I forgot to do it, sorry.
Yeah, I agree with @HaiyangDING, you can just move this to init().
done
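For readers following along, a minimal sketch of what moving the registration into init() could look like; the registration helper, predicate name, and surrounding package layout are assumptions based on the diff above, not taken verbatim from the PR:

```go
// Sketch: keep the legacy predicate registered in init() so existing
// configurations that reference it keep resolving, while it is no longer
// part of the default predicate set.
func init() {
	factory.RegisterFitPredicateFactory(
		"MatchNodeSelector",
		func(args factory.PluginFactoryArgs) algorithm.FitPredicate {
			return predicates.NewSelectorMatchPredicate(args.NodeInfo)
		},
	)
}
```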
@wojtek-t

		},
	)

	return c
}

func (c *ConfigFactory) informerAddPod(obj interface{}) { |
I feel this function name should be addPodToCache instead of informerAddPod. It is clearer about what this method really does. The informer is a consumer of this function, and when calling it from the consumer side, the consumer prefix seems duplicated too.
Agree - changed.
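A minimal sketch of what the renamed handler might look like; the schedulerCache field and its AddPod method are assumptions used for illustration:

```go
// addPodToCache is called by the informer on pod-add events; it only
// forwards the pod to the scheduler cache.
func (c *ConfigFactory) addPodToCache(obj interface{}) {
	pod, ok := obj.(*api.Pod)
	if !ok {
		glog.Errorf("cannot convert to *api.Pod")
		return
	}
	if err := c.schedulerCache.AddPod(pod); err != nil {
		glog.Errorf("scheduler cache AddPod failed: %v", err)
	}
}
```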
@wojtek-t Thanks! The overall idea makes a lot of sense to me. Looking forward to the numbers!
cc @kubernetes/rh-cluster-infra @kubernetes/rh-scalability @smarterclayton
@hongchaodeng - we were using local caches anyway. Previously, every GetNodeInfo() call (which we were doing a few times for every node, in different predicate functions) was a get call to the cache, which internally took a lock on the cache. This was just expensive. In the new version, we do synchronization (locking) when updating pods/nodes in the same cache (which may potentially increase latency there, but the differences are pretty small and it's not critical). BTW - my feeling is that all the differences now between your benchmark and our system are a result of lock contention caused by updates coming from other components (pod & node updates).
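To illustrate the point, a sketch of the structure being described; field and method names are assumptions, and package/import boilerplate is omitted. The node object lives on NodeInfo, so predicates read it from the snapshot they already hold, and the cache lock is taken only when pod/node updates are applied:

```go
type NodeInfo struct {
	node *api.Node  // set/cleared under the cache lock on node add/remove
	pods []*api.Pod // pods assigned to this node
}

// Node returns the stored node; predicates call this without touching the
// cache lock, unlike the old per-predicate GetNodeInfo() lookup.
func (n *NodeInfo) Node() *api.Node {
	return n.node
}

type schedulerCache struct {
	mu    sync.Mutex
	nodes map[string]*NodeInfo
}

// AddNode locks only while applying the update; predicate evaluation no
// longer contends on this lock.
func (c *schedulerCache) AddNode(node *api.Node) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	n, ok := c.nodes[node.Name]
	if !ok {
		n = &NodeInfo{}
		c.nodes[node.Name] = n
	}
	n.node = node
	return nil
}
```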
}

// Removes the overall information about the node.
func (n *NodeInfo) RemoveNode(node *api.Node) error {
Well, if we remove a node, we should just remove that entry from the cache.
I didn't do that on purpose - potentially there may be a race: in the system, pods are removed and then the node is removed, but since pods & nodes are delivered on different watch channels, the node deletion may be delivered first. In that case, we don't want to remove it until all pods are also removed. This is handled in cache.go.
since pods & nodes are delivered on different watch channels, the node deletion may be delivered first.

Ah, I see. That doesn't sound right though. If a node is deleted, but we still know of some pods on it... Are there any explicit docs or discussion on this?
I think you misunderstood. If the node is deleted, there shouldn't be any pods assigned to it in the DB.
But we don't control the watches - they may be delivered in different orders, because those are different watches.
It's not a bug - it's how distributed systems work.
Got that!
Please add a comment here explaining why you do not remove the entry from the cache; I think other people may have the same question @hongchaodeng had. Also maybe mention it in cache.go
done
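A minimal sketch of the behavior described above; names are assumptions and the real logic lives in cache.go. RemoveNode only clears the node object, and the cache entry is deleted only once no pods remain, which tolerates the node delete arriving before the pod deletes:

```go
// RemoveNode clears the node object but keeps the NodeInfo entry alive
// while pods are still associated with it.
func (n *NodeInfo) RemoveNode(node *api.Node) error {
	n.node = nil
	return nil
}

func (c *schedulerCache) RemoveNode(node *api.Node) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	n, ok := c.nodes[node.Name]
	if !ok {
		return fmt.Errorf("node %v is not found", node.Name)
	}
	if err := n.RemoveNode(node); err != nil {
		return err
	}
	// Pod and node deletions arrive on separate watch channels, so the node
	// delete may be observed first; only drop the entry once it is empty.
	if len(n.pods) == 0 {
		delete(c.nodes, node.Name)
	}
	return nil
}
```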
Update PR description with new numbers.
@davidopp - can you please take a look or delegate?
Both @xiang90 and I have done multiple rounds of review. I definitely would like to help with the review and merge.
/CC @HaiyangDING
@@ -732,7 +686,10 @@ func getUsedPorts(pods ...*api.Pod) map[int]bool {
 	for _, pod := range pods {
 		for _, container := range pod.Spec.Containers {
 			for _, podPort := range container.Ports {
-				ports[podPort.HostPort] = true
+				// TODO: Aggregate it at the NodeInfo level.
+				if podPort.HostPort != 0 {
Why did you add this?
This gives us significant performance gains.
To clarify: currently, for every port in container.Ports of each container, we are adding its HostPort to the "ports" result. However, if you don't have HostPort defined, it is 0, and "0" is explicitly ignored in PodFitsHostPorts, which is the only function using this one.
However, since we were previously returning "0" in the result, we didn't hit the optimization of:
if len(wantPorts) == 0 {
return true, nil
}
so we were computing ports from all pods on that node, which made it significantly more expensive.
Now we simply don't do it.
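To make the interaction concrete, here is a sketch of the two functions involved; the signatures are simplified and approximate, not copied from the PR. Skipping HostPort == 0 means a pod with no host ports yields an empty wantPorts map, so PodFitsHostPorts returns before computing the ports of every pod already on the node:

```go
func PodFitsHostPorts(pod *api.Pod, nodeInfo *schedulercache.NodeInfo) (bool, error) {
	wantPorts := getUsedPorts(pod)
	if len(wantPorts) == 0 {
		// Nothing requested: skip the expensive scan of existing pods.
		return true, nil
	}
	existingPorts := getUsedPorts(nodeInfo.Pods()...)
	for wport := range wantPorts {
		if existingPorts[wport] {
			return false, nil
		}
	}
	return true, nil
}

func getUsedPorts(pods ...*api.Pod) map[int]bool {
	ports := make(map[int]bool)
	for _, pod := range pods {
		for _, container := range pod.Spec.Containers {
			for _, podPort := range container.Ports {
				// 0 means "no HostPort requested" and is explicitly ignored
				// by PodFitsHostPorts, so don't record it at all.
				if podPort.HostPort != 0 {
					ports[podPort.HostPort] = true
				}
			}
		}
	}
	return ports
}
```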
I see. Can you add a comment "0 is explicitly ignored in PodFitsHostPorts, which is the only function that uses this value"?
done
}
	newPod, ok := newObj.(*api.Pod)
	if !ok {
		glog.Errorf("cannot convert to *api.Pod")
To distinguish this from the other error, maybe say "cannot convert newPod to *api.Pod".
done
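For illustration, a sketch of an update handler with the two conversions distinguished in their error messages; the handler and cache names are assumptions:

```go
func (c *ConfigFactory) updatePodInCache(oldObj, newObj interface{}) {
	oldPod, ok := oldObj.(*api.Pod)
	if !ok {
		glog.Errorf("cannot convert oldObj to *api.Pod")
		return
	}
	newPod, ok := newObj.(*api.Pod)
	if !ok {
		glog.Errorf("cannot convert newPod to *api.Pod")
		return
	}
	if err := c.schedulerCache.UpdatePod(oldPod, newPod); err != nil {
		glog.Errorf("scheduler cache UpdatePod failed: %v", err)
	}
}
```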
@davidopp - comments addressed. PTAL
LGTM
GCE e2e build/test passed for commit 1835c85.
@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]
GCE e2e build/test passed for commit 1835c85.
Automatic merge from submit-queue
This significantly improves scheduler throughput.
On a 1000-node cluster:
The drop in throughput is mostly related to priority functions, which I will be looking into next (I already have a PR, [WIP] [Do NOT merge] Support for reverse index in scheduler #24095, but it needs a few more things first).
This is roughly a ~40% increase.
However, we still need a better understanding of the predicate functions, because in my opinion they should be even faster than they are now. I'm going to look into it next week.
@gmarek @hongchaodeng @xiang90