Store node information in NodeInfo #24598
Conversation
			return predicates.NewSelectorMatchPredicate(args.NodeInfo)
		},
	),
	// TODO: This is run as part of GeneralPredicates - figure out the way to remove it.
@davidopp - any idea how to solve it?
This looks duplicated. My suggestion is to remove one of them.
I'm not sure we can, because of backward compatibility issues. That's why I'm asking @davidopp.
I think moving this to init() is safe enough for backward compatibility. I should have moved it in #20204, but I forgot to do it, sorry.
Yeah, I agree with @HaiyangDING, you can just move this to init().
done
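For readers following along, a minimal sketch of what moving the registration into init() could look like; the registration helper, predicate name, and surrounding package layout are assumptions based on the diff above, not taken verbatim from the PR:

```go
// Sketch: keep the legacy predicate registered in init() so existing
// configurations that reference it keep resolving, while it is no longer
// part of the default predicate set.
func init() {
	factory.RegisterFitPredicateFactory(
		"MatchNodeSelector",
		func(args factory.PluginFactoryArgs) algorithm.FitPredicate {
			return predicates.NewSelectorMatchPredicate(args.NodeInfo)
		},
	)
}
```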
@wojtek-t

		},
	)

	return c
}

func (c *ConfigFactory) informerAddPod(obj interface{}) { |
I feel this function name should be addPodToCache instead of informerAddPod. It is clearer about what this method really does. The informer is a consumer of this function, and when calling it from the consumer side, the consumer prefix seems duplicated too.
Agree - changed.
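A minimal sketch of what the renamed handler might look like; the schedulerCache field and its AddPod method are assumptions used for illustration:

```go
// addPodToCache is called by the informer on pod-add events; it only
// forwards the pod to the scheduler cache.
func (c *ConfigFactory) addPodToCache(obj interface{}) {
	pod, ok := obj.(*api.Pod)
	if !ok {
		glog.Errorf("cannot convert to *api.Pod")
		return
	}
	if err := c.schedulerCache.AddPod(pod); err != nil {
		glog.Errorf("scheduler cache AddPod failed: %v", err)
	}
}
```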
@wojtek-t Thanks! The overall idea makes a lot of sense to me. Looking forward to the numbers!
cc @kubernetes/rh-cluster-infra @kubernetes/rh-scalability @smarterclayton
@hongchaodeng - we were using local caches anyway. Previously, every GetNodeInfo() call (which we were doing a few times for every node, in different predicate functions) was a get call to the cache, which internally took a lock on the cache. This was just expensive. In the new version, we do synchronization (locking) when updating pods/nodes in the same cache (which may potentially increase latency there, but the differences are pretty small and it's not critical). BTW - my feeling is that all the differences now between your benchmark and our system are a result of lock contention caused by updates coming from other components (pod & node updates).
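To illustrate the point, a sketch of the structure being described; field and method names are assumptions, and package/import boilerplate is omitted. The node object lives on NodeInfo, so predicates read it from the snapshot they already hold, and the cache lock is taken only when pod/node updates are applied:

```go
type NodeInfo struct {
	node *api.Node  // set/cleared under the cache lock on node add/remove
	pods []*api.Pod // pods assigned to this node
}

// Node returns the stored node; predicates call this without touching the
// cache lock, unlike the old per-predicate GetNodeInfo() lookup.
func (n *NodeInfo) Node() *api.Node {
	return n.node
}

type schedulerCache struct {
	mu    sync.Mutex
	nodes map[string]*NodeInfo
}

// AddNode locks only while applying the update; predicate evaluation no
// longer contends on this lock.
func (c *schedulerCache) AddNode(node *api.Node) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	n, ok := c.nodes[node.Name]
	if !ok {
		n = &NodeInfo{}
		c.nodes[node.Name] = n
	}
	n.node = node
	return nil
}
```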
}

// Removes the overall information about the node.
func (n *NodeInfo) RemoveNode(node *api.Node) error {
Well, if we remove a node, we should just remove that entry from the cache.
I didn't do that on purpose - potentially there may be a race: in the system, pods are removed and then the node is removed, but since pods & nodes are delivered on different watch channels, the node deletion may be delivered first. In that case, we don't want to remove it until all pods are also removed. This is handled in cache.go.
since pods & nodes are delivered on different watch channels, the node deletion may be delivered first.

Ah, I see. That doesn't sound right though. If a node is deleted, but we still know of some pods on it... Are there any explicit docs or discussion on this?
I think you misunderstood. If the node is deleted, there shouldn't be any pods assigned to it in the DB.
But we don't control the watches - they may be delivered in different orders, because those are different watches.
It's not a bug - it's how distributed systems work.
Got that!
Please add a comment here explaining why you do not remove the entry from the cache; I think other people may have the same question @hongchaodeng had. Also maybe mention it in cache.go
done
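A minimal sketch of the behavior described above; names are assumptions and the real logic lives in cache.go. RemoveNode only clears the node object, and the cache entry is deleted only once no pods remain, which tolerates the node delete arriving before the pod deletes:

```go
// RemoveNode clears the node object but keeps the NodeInfo entry alive
// while pods are still associated with it.
func (n *NodeInfo) RemoveNode(node *api.Node) error {
	n.node = nil
	return nil
}

func (c *schedulerCache) RemoveNode(node *api.Node) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	n, ok := c.nodes[node.Name]
	if !ok {
		return fmt.Errorf("node %v is not found", node.Name)
	}
	if err := n.RemoveNode(node); err != nil {
		return err
	}
	// Pod and node deletions arrive on separate watch channels, so the node
	// delete may be observed first; only drop the entry once it is empty.
	if len(n.pods) == 0 {
		delete(c.nodes, node.Name)
	}
	return nil
}
```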
Update PR description with new numbers.
@davidopp - can you please take a look or delegate?
Both @xiang90 and I have done multiple rounds of review. I definitely would like to help with the review and merge.
/CC @HaiyangDING
@@ -732,7 +686,10 @@ func getUsedPorts(pods ...*api.Pod) map[int]bool {
 	for _, pod := range pods {
 		for _, container := range pod.Spec.Containers {
 			for _, podPort := range container.Ports {
-				ports[podPort.HostPort] = true
+				// TODO: Aggregate it at the NodeInfo level.
+				if podPort.HostPort != 0 {
Why did you add this?
This gives us significant performance gains.
To clarify: currently, for every port in container.Ports of each container, we are adding its HostPort to the "ports" result. However, if you don't have HostPort defined, it is 0, and "0" is explicitly ignored in PodFitsHostPorts, which is the only function using this one.
However, since we were previously returning "0" in the result, we didn't hit the optimization of:
if len(wantPorts) == 0 {
return true, nil
}
so we were computing ports from all pods on that node, which made it significantly more expensive.
Now we simply don't do it.
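To make the interaction concrete, here is a sketch of the two functions involved; the signatures are simplified and approximate, not copied from the PR. Skipping HostPort == 0 means a pod with no host ports yields an empty wantPorts map, so PodFitsHostPorts returns before computing the ports of every pod already on the node:

```go
func PodFitsHostPorts(pod *api.Pod, nodeInfo *schedulercache.NodeInfo) (bool, error) {
	wantPorts := getUsedPorts(pod)
	if len(wantPorts) == 0 {
		// Nothing requested: skip the expensive scan of existing pods.
		return true, nil
	}
	existingPorts := getUsedPorts(nodeInfo.Pods()...)
	for wport := range wantPorts {
		if existingPorts[wport] {
			return false, nil
		}
	}
	return true, nil
}

func getUsedPorts(pods ...*api.Pod) map[int]bool {
	ports := make(map[int]bool)
	for _, pod := range pods {
		for _, container := range pod.Spec.Containers {
			for _, podPort := range container.Ports {
				// 0 means "no HostPort requested" and is explicitly ignored
				// by PodFitsHostPorts, so don't record it at all.
				if podPort.HostPort != 0 {
					ports[podPort.HostPort] = true
				}
			}
		}
	}
	return ports
}
```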
I see. Can you add a comment "0 is explicitly ignored in PodFitsHostPorts, which is the only function that uses this value"?
done
}
	newPod, ok := newObj.(*api.Pod)
	if !ok {
		glog.Errorf("cannot convert to *api.Pod")
To distinguish this from the other error, maybe say "cannot convert newPod to *api.Pod".
done
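For illustration, a sketch of an update handler with the two conversions distinguished in their error messages; the handler and cache names are assumptions:

```go
func (c *ConfigFactory) updatePodInCache(oldObj, newObj interface{}) {
	oldPod, ok := oldObj.(*api.Pod)
	if !ok {
		glog.Errorf("cannot convert oldObj to *api.Pod")
		return
	}
	newPod, ok := newObj.(*api.Pod)
	if !ok {
		glog.Errorf("cannot convert newPod to *api.Pod")
		return
	}
	if err := c.schedulerCache.UpdatePod(oldPod, newPod); err != nil {
		glog.Errorf("scheduler cache UpdatePod failed: %v", err)
	}
}
```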
@davidopp - comments addressed. PTAL
LGTM
GCE e2e build/test passed for commit 1835c85.
@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]
GCE e2e build/test passed for commit 1835c85.
Automatic merge from submit-queue
This significantly improves scheduler throughput.
On a 1000-node cluster:
The drop in throughput is mostly related to priority functions, which I will be looking into next (I already have a PR, [WIP] [Do NOT merge] Support for reverse index in scheduler #24095, but it needs a few more things first).
This is roughly a ~40% increase.
However, we still need a better understanding of the predicate functions, because in my opinion they should be even faster than they are now. I'm going to look into it next week.
@gmarek @hongchaodeng @xiang90