Figure out and implement custom handling for MatchInterPodAffinity predicate #257
Comments
This will not be easy to deal with, and it looks like we can work around the performance problem with #253, so I think this is best left to the 1.9 timeframe.
@wojtek-t I remember discussing with you some time ago about not precomputing stuff in the scheduler. Has that changed in the meantime? :)
I don't understand. Can you clarify?
Sorry - when we were discussing e.g. zone-level soft inter-pod anti-affinity, from what I understood the scheduler was doing O(nodes*pods) computation when scheduling a new pod, though it seemed doable in O(pods + nodes) with precomputation. From the comment it sounded like it's precomputing now. But I think I misunderstood, as soft anti-affinity uses priorities, while hard anti-affinity uses predicates. Correct me if I said something stupid.
Yes - we were discussing some potential optimizations in the scheduler, and those are still not done. But we do some caching in the scheduler, and we don't do anything in Cluster Autoscaler.
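For illustration, a minimal sketch of the precomputation idea from the last two comments, with made-up types and a deliberately simplified matching rule (not the scheduler's actual implementation): one pass over pods to count anti-affinity conflicts per zone, then a map lookup per node, i.e. O(pods + nodes) instead of O(nodes*pods).

```go
package main

import "fmt"

// Pod and Node are simplified stand-ins for the real API objects.
type Pod struct {
	Name   string
	Labels map[string]string
}

type Node struct {
	Name string
	Zone string // value of the topology key, e.g. the zone label
}

// matchesAntiAffinity is a hypothetical check: does an existing pod's
// anti-affinity term select the pod we are trying to schedule?
func matchesAntiAffinity(existing, candidate Pod) bool {
	return existing.Labels["app"] == candidate.Labels["app"]
}

// feasibleZones is the O(pods + nodes) variant: one pass over pods to
// count conflicts per zone, then one map lookup per node.
func feasibleZones(pods []Pod, podZone map[string]string, nodes []Node, candidate Pod) []Node {
	conflicts := map[string]int{}
	for _, p := range pods { // O(pods)
		if matchesAntiAffinity(p, candidate) {
			conflicts[podZone[p.Name]]++
		}
	}
	var ok []Node
	for _, n := range nodes { // O(nodes)
		if conflicts[n.Zone] == 0 {
			ok = append(ok, n)
		}
	}
	return ok
}

func main() {
	pods := []Pod{{Name: "a", Labels: map[string]string{"app": "web"}}}
	podZone := map[string]string{"a": "zone-1"}
	nodes := []Node{{Name: "n1", Zone: "zone-1"}, {Name: "n2", Zone: "zone-2"}}
	fmt.Println(feasibleZones(pods, podZone, nodes, Pod{Name: "b", Labels: map[string]string{"app": "web"}}))
}
```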
If I'm understanding your statement correctly, @bsalamat ran into almost exactly the same problem in the preemption work (the need to run scheduler predicates against cluster state minus preempted pods on some node). He wrote a function FilteredList() that is like List() but subtracts out the pods that match a filter provided as an argument. You might be able to do the opposite: define a function that does a List() but also returns some extra pods that you supply as an argument? I may be misunderstanding your issue completely, though.
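A rough sketch of that "opposite of FilteredList()" idea, using made-up minimal types rather than the scheduler's real lister interface: wrap the informer-backed List() and append the extra, simulated pods supplied by the caller.

```go
package main

import "fmt"

// Pod is a simplified stand-in for *v1.Pod.
type Pod struct{ Name string }

// PodLister is a minimal, hypothetical version of the lister the
// predicates consult for "all pods in the cluster".
type PodLister interface {
	List() ([]Pod, error)
}

// augmentedLister returns everything the underlying (informer-backed)
// lister returns, plus pods that only exist in the simulated state.
type augmentedLister struct {
	base  PodLister
	extra []Pod
}

func (l *augmentedLister) List() ([]Pod, error) {
	pods, err := l.base.List()
	if err != nil {
		return nil, err
	}
	return append(append([]Pod{}, pods...), l.extra...), nil
}

// staticLister stands in for the real informer-backed lister.
type staticLister struct{ pods []Pod }

func (s *staticLister) List() ([]Pod, error) { return s.pods, nil }

func main() {
	real := &staticLister{pods: []Pod{{Name: "existing-pod"}}}
	sim := &augmentedLister{base: real, extra: []Pod{{Name: "simulated-pod"}}}
	pods, _ := sim.List()
	fmt.Println(pods) // existing-pod plus simulated-pod
}
```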
@bskiba ran some follow-up performance tests and I've spent time experimenting with predicates. Below are the results of those investigations:
My current plan is to have PRs implementing both points 2 and 3 ready by tomorrow noon. @davidopp @bsalamat does this approach sound reasonable to you?
Issues go stale after 90d of inactivity. Prevent issues from auto-closing with an /lifecycle frozen comment. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle frozen
We should look more deeply into kubernetes/kubernetes#54189 and see how it impacts CA. We will get those changes with the godeps update for 1.2.0 anyway, but we should spend some time understanding them. In particular we should make sure that:
@aleksandra-malinowska pointed out to me that a deployment with self-pod-antiaffinity causes nodes to be added one by one. This is because the problem with the affinity predicate taking data from informers, mentioned in the description of this issue, is still not fixed. Other optimizations were easier to implement and got us to our performance target, so we never actually started using predicateMeta when binpacking. Note that this is trickier than in the scale-down case, because:
@MaciekPytel what do you mean? Is it a soft AntiAffinity (preferredDuringSchedulingIgnoredDuringExecution)?
It's a hard Pod AntiAffinity (requiredDuringSchedulingIgnoredDuringExecution).
Cluster Autoscaler completely ignores "soft" scheduler requirements (aka scores). preferredDuringScheduling doesn't impact CA's decisions or performance in any way.
Thanks, so it is safe to use soft AntiAffinity, since CA completely ignores it. Thanks.
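For reference, a rough sketch of the two forms discussed above, built with the standard k8s.io/api types (the selector, weight, and topology key are just illustrative values): the required form is a hard predicate that CA evaluates, while the preferred form is a soft score that CA ignores.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// antiAffinityExamples builds a hard and a soft pod anti-affinity rule
// with the same selector and topology key, to show where they differ.
func antiAffinityExamples() (hard, soft *v1.PodAntiAffinity) {
	selector := &metav1.LabelSelector{MatchLabels: map[string]string{"app": "web"}}

	// Hard form: evaluated as a predicate, so it affects CA's simulations.
	hard = &v1.PodAntiAffinity{
		RequiredDuringSchedulingIgnoredDuringExecution: []v1.PodAffinityTerm{{
			LabelSelector: selector,
			TopologyKey:   "kubernetes.io/hostname",
		}},
	}

	// Soft form: only a scheduling score, ignored by Cluster Autoscaler.
	soft = &v1.PodAntiAffinity{
		PreferredDuringSchedulingIgnoredDuringExecution: []v1.WeightedPodAffinityTerm{{
			Weight: 100,
			PodAffinityTerm: v1.PodAffinityTerm{
				LabelSelector: selector,
				TopologyKey:   "kubernetes.io/hostname",
			},
		}},
	}
	return hard, soft
}

func main() {
	hard, soft := antiAffinityExamples()
	fmt.Println(hard.RequiredDuringSchedulingIgnoredDuringExecution[0].TopologyKey,
		soft.PreferredDuringSchedulingIgnoredDuringExecution[0].Weight)
}
```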
This seems to be doable now that the autoscaler can pass in the whole snapshot and is able to run PreFilters directly.
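A conceptual sketch of that flow, using hypothetical stand-in interfaces rather than the real scheduler framework API: run PreFilter once per pending pod against the whole snapshot to build per-pod state, then run the cheap per-node Filter reusing that state, so inter-pod affinity no longer needs a full pod scan for every node.

```go
package main

import "fmt"

// Illustrative stand-ins only; not the scheduler framework's real interfaces.
type Pod struct{ Name string }
type NodeInfo struct{ Name string }

// Snapshot is the simulated cluster state CA can now pass in as a whole.
type Snapshot struct {
	Nodes []NodeInfo
	Pods  []Pod
}

// CycleState carries whatever PreFilter precomputed for one pod.
type CycleState map[string]interface{}

// Plugin mirrors the PreFilter/Filter split: PreFilter sees the whole
// snapshot once per pod, Filter sees one node plus the precomputed state.
type Plugin interface {
	PreFilter(snap *Snapshot, pod Pod, state CycleState) error
	Filter(state CycleState, pod Pod, node NodeInfo) bool
}

// fitsOn runs PreFilter once, then the cheap per-node Filter for each node.
func fitsOn(p Plugin, snap *Snapshot, pod Pod) []string {
	state := CycleState{}
	if err := p.PreFilter(snap, pod, state); err != nil {
		return nil
	}
	var feasible []string
	for _, n := range snap.Nodes {
		if p.Filter(state, pod, n) {
			feasible = append(feasible, n.Name)
		}
	}
	return feasible
}

// acceptAll accepts every node; a real inter-pod-affinity plugin would
// precompute conflicts per topology value in PreFilter instead.
type acceptAll struct{}

func (acceptAll) PreFilter(*Snapshot, Pod, CycleState) error { return nil }
func (acceptAll) Filter(CycleState, Pod, NodeInfo) bool      { return true }

func main() {
	snap := &Snapshot{Nodes: []NodeInfo{{Name: "n1"}, {Name: "n2"}}}
	fmt.Println(fitsOn(acceptAll{}, snap, Pod{Name: "pending-pod"}))
}
```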
As part of working on CA performance we ran a large scale-up test with some additional logging, including call count and total duration spent in each predicate. The results are as follows:
It turns out that the MatchInterPodAffinity predicate is 3 orders of magnitude slower than the other predicates. This is likely because, contrary to the scheduler, we don't do any precomputation for it and we don't maintain a predicateMeta object.
After a quick glance at the predicate code this makes sense - it needs to iterate over all existing pods to check whether any of them has pod anti-affinity targeting the pod we're running predicates for. This brings up another problem - how does it get all pods and nodes? We only provide NodeInfo for a single node; the rest comes out of informers. However, that means it reflects the real state of the cluster, not our simulated state. If we've already placed a pod with zone-level anti-affinity on a simulated node, that won't prevent adding pods to other simulated nodes in the same zone.
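A simplified illustration of that blind spot (made-up types and a naive check, not CA's actual code): when the predicate only sees the informer-backed pod list, a pod that exists only in the simulated state can't block other nodes in the same zone.

```go
package main

import "fmt"

// Simplified stand-ins, not CA's real data structures.
type Pod struct {
	Name string
	App  string // label the anti-affinity term selects on
	Zone string // zone of the node the pod is (or would be) placed on
}

// fitsZoneAntiAffinity is a naive zone-level anti-affinity check for a
// candidate pod against whatever pod list it is given.
func fitsZoneAntiAffinity(pods []Pod, candidate Pod, zone string) bool {
	for _, p := range pods {
		if p.App == candidate.App && p.Zone == zone {
			return false
		}
	}
	return true
}

func main() {
	realPods := []Pod{}                                             // informer view: conflicting pod not in the cluster yet
	simulated := []Pod{{Name: "web-1", App: "web", Zone: "zone-a"}} // placed by CA in simulation only

	candidate := Pod{Name: "web-2", App: "web"}

	// Checking only informer data misses the simulated placement, so another
	// node in zone-a still looks feasible and CA overshoots.
	fmt.Println(fitsZoneAntiAffinity(realPods, candidate, "zone-a")) // true (wrong)

	// Checking real + simulated state together gives the correct answer.
	fmt.Println(fitsZoneAntiAffinity(append(realPods, simulated...), candidate, "zone-a")) // false
}
```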
The bottom line is that using zone-level anti-affinity can cause CA to "overshoot", creating some nodes for pods that won't be able to schedule on them anyway. Fortunately, this is a pretty unlikely edge case and we will scale down the unnecessary nodes without any problem.