
Store node information in NodeInfo #24598

Merged

Conversation

wojtek-t
Member

@wojtek-t wojtek-t commented Apr 21, 2016

This significantly improves scheduler throughput.

On a 1000-node cluster:

This is roughly a ~40% increase.
However, we still need a better understanding of the predicate functions, because in my opinion they should be even faster than they are now. I'm going to look into it next week.

@gmarek @hongchaodeng @xiang90

@k8s-github-robot k8s-github-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. release-note-label-needed labels Apr 21, 2016
return predicates.NewSelectorMatchPredicate(args.NodeInfo)
},
),
// TODO: This is run as part of GeneralPredicates - figure out the way to remove it.
Member Author

@davidopp - any idea how to solve it?

Contributor

This looks like a duplicate. My suggestion is to remove one?

Member Author

I'm not sure we can because of backward compatibility issues. That's why I'm asking @davidopp

I think moving this to init() is safe enough for backward compatibility. I should have moved it in #20204, but I forgot to do it, sorry.

Member

Yeah, I agree with @HaiyangDING, you can just move this to init().

Member Author

done

@hongchaodeng
Contributor

hongchaodeng commented Apr 21, 2016

@wojtek-t
Why is it improving the scheduler by a significant amount? Can you help explain?
From a basic point of view, it seems like GetNodeInfo() is replaced by a local cache, which might reduce latency if it's a remote call.

},
)

return c
}

func (c *ConfigFactory) informerAddPod(obj interface{}) {
Contributor

I feel this function name should be addPodToCache instead of informerAddPod; it is clearer about what this method really does. The informer is a consumer of this function, and when calling it from the consumer side, the consumer prefix seems duplicated too.

Member Author

Agree - changed.
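
For illustration, a minimal self-contained sketch of the naming point (hypothetical, simplified types, not the actual pkg/scheduler code): the method is named after what it does to the cache, so the call site at the informer registration reads naturally without repeating "informer".

```go
package main

import "fmt"

type Pod struct{ Name string }

type schedulerCache struct{ pods map[string]*Pod }

type ConfigFactory struct{ cache *schedulerCache }

// addPodToCache says what the method does; the old name, informerAddPod,
// repeated the caller ("informer") in the method name.
func (c *ConfigFactory) addPodToCache(obj interface{}) {
	pod, ok := obj.(*Pod)
	if !ok {
		fmt.Println("cannot convert to *Pod")
		return
	}
	c.cache.pods[pod.Name] = pod
}

func main() {
	c := &ConfigFactory{cache: &schedulerCache{pods: map[string]*Pod{}}}
	// The informer would register this as its Add handler, e.g. AddFunc: c.addPodToCache.
	var addFunc func(interface{}) = c.addPodToCache
	addFunc(&Pod{Name: "nginx"})
	fmt.Println(len(c.cache.pods)) // 1
}
```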

@xiang90
Contributor

xiang90 commented Apr 21, 2016

@wojtek-t Thanks! The overall idea makes a lot of sense to me. Looking forward to the numbers!

@ncdc
Member

ncdc commented Apr 21, 2016

cc @kubernetes/rh-cluster-infra @kubernetes/rh-scalability @smarterclayton

@wojtek-t
Member Author

@wojtek-t
Why is it improving the scheduler by a significant amount? Can you help explain?
From a basic point of view, it seems like GetNodeInfo() is replaced by a local cache, which might reduce latency if it's a remote call.

@hongchaodeng - we were using local caches anyway.
What we are improving is basically lock contention.

Previously, every GetNodeInfo() call (which we were making a few times for every node, in different predicate functions) was a get call to the cache, which internally took a lock on the cache. This was just expensive.

In the new version, we do synchronization (locking) when updating pods/nodes in the same cache (which may potentially increase latency there, but the differences are pretty small and it's not critical).
However, when scheduling, we acquire the lock only once at the beginning, to call GetNodeNameToInfoMap, and after that we don't do any locking.

BTW - my feeling is that all the differences now between your benchmark and our system are a result of lock contention caused by updates coming from other components (pod & node updates).
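
A minimal self-contained sketch of the locking change described above, using hypothetical simplified types (not the actual scheduler cache code): the old pattern pays a lock acquisition on every GetNodeInfo() call inside the predicates, while the new one takes the lock once per scheduling cycle in GetNodeNameToInfoMap and lets the predicates read the returned snapshot without further locking.

```go
package main

import (
	"fmt"
	"sync"
)

type NodeInfo struct {
	RequestedMilliCPU int64
}

type Cache struct {
	mu    sync.RWMutex
	nodes map[string]*NodeInfo
}

// Old pattern: every predicate call pays for a lock acquisition.
func (c *Cache) GetNodeInfo(name string) *NodeInfo {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.nodes[name]
}

// New pattern: one lock acquisition per scheduling cycle; the returned map
// is a snapshot that predicates can read without any locking.
func (c *Cache) GetNodeNameToInfoMap() map[string]*NodeInfo {
	c.mu.RLock()
	defer c.mu.RUnlock()
	snapshot := make(map[string]*NodeInfo, len(c.nodes))
	for name, info := range c.nodes {
		copied := *info
		snapshot[name] = &copied
	}
	return snapshot
}

func main() {
	c := &Cache{nodes: map[string]*NodeInfo{"node-1": {RequestedMilliCPU: 500}}}
	infos := c.GetNodeNameToInfoMap() // single lock acquisition per cycle
	for name, info := range infos {   // predicates read the snapshot lock-free
		fmt.Println(name, info.RequestedMilliCPU)
	}
}
```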

@wojtek-t wojtek-t force-pushed the improve_scheduler_predicates branch from 272ea1c to 21ac1f8 Compare April 22, 2016 07:49
@wojtek-t wojtek-t added release-note-none Denotes a PR that doesn't merit a release note. and removed release-note-label-needed labels Apr 22, 2016
}

// Removes the overall information about the node.
func (n *NodeInfo) RemoveNode(node *api.Node) error {
Contributor

Well, if we remove a node, we should just remove that entry from the cache.

Member Author

I didn't do that on purpose. Potentially there may be a race: in the system, pods are removed and then the node is removed, but since pods & nodes are delivered in different watch channels, the node deletion may be delivered first. In that case, we don't want to remove the entry until all pods are also removed. This is handled in cache.go.
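
A minimal sketch of the behavior described here, with hypothetical names and simplified types rather than the actual cache.go code: a node-deletion event that arrives before the deletions of the node's pods only clears the node object, and the cache entry itself is dropped once the last pod removal arrives.

```go
package main

import "fmt"

type NodeInfo struct {
	node string   // "" once the node object itself has been removed
	pods []string // pods still assigned to this node according to the pod watch
}

type Cache struct {
	nodes map[string]*NodeInfo
}

func (c *Cache) RemoveNode(name string) {
	info, ok := c.nodes[name]
	if !ok {
		return
	}
	info.node = "" // forget the node object itself
	if len(info.pods) == 0 {
		delete(c.nodes, name) // safe: no pods still reference this entry
	}
	// Otherwise keep the entry; it is dropped once the last pod removal arrives.
}

func (c *Cache) RemovePod(nodeName, pod string) {
	info, ok := c.nodes[nodeName]
	if !ok {
		return
	}
	for i, p := range info.pods {
		if p == pod {
			info.pods = append(info.pods[:i], info.pods[i+1:]...)
			break
		}
	}
	if info.node == "" && len(info.pods) == 0 {
		delete(c.nodes, nodeName) // node was already deleted; entry can go now
	}
}

func main() {
	c := &Cache{nodes: map[string]*NodeInfo{"node-1": {node: "node-1", pods: []string{"p1"}}}}
	c.RemoveNode("node-1")    // node deletion delivered first; entry kept because p1 remains
	fmt.Println(len(c.nodes)) // 1
	c.RemovePod("node-1", "p1")
	fmt.Println(len(c.nodes)) // 0
}
```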

Contributor

since pods & nodes are delivered in different watch channels, there may be a race and node deletion may be delivered first.

Ah, I see. That doesn't sound right though. If a node is deleted, but we still know of some pods on it... Are there any explicit docs or discussion on this?

Member Author

I think you misunderstood. If the node is deleted, there shouldn't be any pods assigned to it in the database.
But we don't control watches - they may be delivered in different orders, because those are different watches.
It's not a bug - it's how distributed systems work.

Contributor

Got that!

Member

Please add a comment here explaining why you do not remove the entry from the cache; I think other people may have the same question @hongchaodeng had. Also maybe mention it in cache.go

Member Author

done

@wojtek-t wojtek-t force-pushed the improve_scheduler_predicates branch 2 times, most recently from b72da9b to 2946653 Compare April 22, 2016 11:06
@wojtek-t
Member Author

Updated the PR description with new numbers.

@wojtek-t
Member Author

@davidopp - can you please take a look or delegate?

@hongchaodeng
Contributor

hongchaodeng commented Apr 22, 2016

can you please take a look or delegate?

Both @xiang90 and I have done multiple rounds of review. I definitely would like to help with the review and merge.

@HaiyangDING

/CC @HaiyangDING

@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 23, 2016
@@ -732,7 +686,10 @@ func getUsedPorts(pods ...*api.Pod) map[int]bool {
for _, pod := range pods {
for _, container := range pod.Spec.Containers {
for _, podPort := range container.Ports {
ports[podPort.HostPort] = true
// TODO: Aggregate it at the NodeInfo level.
if podPort.HostPort != 0 {
Member

Why did you add this?

Member Author

This gives us significant performance gains.
To clarify: currently, for every entry in container.Ports in each container, we add its HostPort to the "ports" result. However, if HostPort is not defined, it is 0, and 0 is explicitly ignored in PodFitsHostPorts, which is the only function using this one.

However, since we were previously returning 0 in the result, we didn't hit the optimization:
if len(wantPorts) == 0 {
    return true, nil
}
so we were computing the ports of all pods on that node, which made it significantly more expensive.

Now we simply don't do it.
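
A simplified sketch of the optimization being discussed, using hypothetical stripped-down types rather than the real api and predicates packages: skipping HostPort 0 in getUsedPorts means a pod that requests no host ports produces an empty wantPorts map, so the early return fires and the ports of all pods already on the node are never computed.

```go
package main

import "fmt"

type ContainerPort struct{ HostPort int }
type Container struct{ Ports []ContainerPort }
type Pod struct{ Containers []Container }

func getUsedPorts(pods ...*Pod) map[int]bool {
	ports := map[int]bool{}
	for _, pod := range pods {
		for _, container := range pod.Containers {
			for _, podPort := range container.Ports {
				// 0 means "no HostPort requested"; recording it would defeat
				// the early return in podFitsHostPorts below.
				if podPort.HostPort != 0 {
					ports[podPort.HostPort] = true
				}
			}
		}
	}
	return ports
}

func podFitsHostPorts(pod *Pod, podsOnNode []*Pod) bool {
	wantPorts := getUsedPorts(pod)
	if len(wantPorts) == 0 {
		return true // cheap path: never look at the ports of all pods on the node
	}
	existingPorts := getUsedPorts(podsOnNode...)
	for port := range wantPorts {
		if existingPorts[port] {
			return false
		}
	}
	return true
}

func main() {
	noHostPort := &Pod{Containers: []Container{{Ports: []ContainerPort{{HostPort: 0}}}}}
	fmt.Println(podFitsHostPorts(noHostPort, nil)) // true, via the early return
}
```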

Member

I see. Can you add a comment "0 is explicitly ignored in PodFitsHostPorts, which is the only function that uses this value" ?

Member Author

done

}
newPod, ok := newObj.(*api.Pod)
if !ok {
glog.Errorf("cannot convert to *api.Pod")
Member

To distinguish this from the other error, maybe say "cannot convert newPod to *api.Pod"?

Member Author

done

@wojtek-t wojtek-t force-pushed the improve_scheduler_predicates branch from 2946653 to 6c950fa Compare April 25, 2016 07:34
@wojtek-t
Member Author

@davidopp - comments addressed.

PTAL

@k8s-github-robot k8s-github-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 25, 2016
@wojtek-t wojtek-t force-pushed the improve_scheduler_predicates branch from 6c950fa to e12162d Compare April 25, 2016 07:46
@wojtek-t wojtek-t force-pushed the improve_scheduler_predicates branch from e12162d to 1835c85 Compare April 25, 2016 08:08
@davidopp
Member

LGTM

@davidopp davidopp added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 25, 2016
@k8s-bot

k8s-bot commented Apr 27, 2016

GCE e2e build/test passed for commit 1835c85.

@k8s-github-robot

@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

@k8s-bot

k8s-bot commented Apr 28, 2016

GCE e2e build/test passed for commit 1835c85.

@k8s-github-robot

Automatic merge from submit-queue

@k8s-github-robot k8s-github-robot merged commit 00308f7 into kubernetes:master Apr 28, 2016
@wojtek-t wojtek-t deleted the improve_scheduler_predicates branch May 17, 2016 08:41