Long scheduling latency in large clusters #22262
Comments
Selectors have a fast reverse-lookup procedure. We may need to actually… |
@lavalamp Can you briefly describe it?
I agree. It should be O(1) in all reasonable cases.
One thing is looping over all of them. |
Oh, I missed that. We should just maintain a node-to-pod map. Then we don't have to worry about the selectors for a bit probably. And you must have meant O(#pods + #matching (RS, RC, services))? Surely we don't stack those two things, I can't imagine why we would do that? |
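As a rough illustration of the node-to-pod map idea (hypothetical names, not the actual schedulercache types): the index is maintained incrementally from pod add/delete events, so a priority function can look up the pods on a node without scanning every pod in the cluster.

    // Sketch of a node-to-pod index kept up to date from pod add/delete events.
    type podEntry struct {
        Key      string // namespace/name
        NodeName string
        Labels   map[string]string
    }

    type nodeToPodIndex struct {
        podsByNode map[string]map[string]podEntry // nodeName -> pod key -> pod
    }

    func newNodeToPodIndex() *nodeToPodIndex {
        return &nodeToPodIndex{podsByNode: map[string]map[string]podEntry{}}
    }

    func (idx *nodeToPodIndex) addPod(p podEntry) {
        if idx.podsByNode[p.NodeName] == nil {
            idx.podsByNode[p.NodeName] = map[string]podEntry{}
        }
        idx.podsByNode[p.NodeName][p.Key] = p
    }

    func (idx *nodeToPodIndex) deletePod(p podEntry) {
        delete(idx.podsByNode[p.NodeName], p.Key)
    }

    // podsOn is O(#pods on that node), not O(#pods in the cluster).
    func (idx *nodeToPodIndex) podsOn(nodeName string) map[string]podEntry {
        return idx.podsByNode[nodeName]
    }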
(The selector speedup involves keeping indexes & merging sets-- I can look that up after 1.2 or @bgrant0607 might just know off the top of his head where it's documented) |
Selector indexing: #4817 |
API for reverse selector lookup: #1348 |
ControllerRef backpointers to controller: #14961 |
This one we already have - it doesn't help.
No, I actually meant: O(#pods x #matching (RS, RC, services)). So it's definitely not that easy (to be honest, if we have more than 1 matching selector, I don't know how to do it even if we have efficient indexing) |
Oh, for the spreading? We should just maintain pod <-> "spreading group" map. It's silly to recreate that every time (sorry, haven't looked at the code). |
hmm - it probably should be "pod labels <-> spreading group" - this may be more efficient. |
If we do a serviceRef analogous to controllerRef (but it is an array of pointers rather than a pointer, since a pod can be in multiple services by design), then when scheduling pod P you can just do
This would just be O(# pods). Also, we should figure out a way of sorting the predicate functions so that the ones that are most computationally expensive go last, and have the predicate functions only process nodes that earlier predicate functions considered feasible. That would make this O(# pods that are on nodes that earlier predicate functions considered feasible for P). |
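A sketch of the O(#pods) loop described above, assuming hypothetical ControllerRef/ServiceRefs back-pointer fields on each pod (the actual fields are only proposed in the issues linked earlier):

    // Sketch: spreading counts via back-pointer comparison instead of selector matching.
    type podRefs struct {
        NodeName      string
        ControllerRef string   // hypothetical back-pointer to the owning RC/RS
        ServiceRefs   []string // hypothetical back-pointers to matching services
    }

    // countMatchingPodsPerNode does a single O(#pods) pass with no selector evaluation.
    func countMatchingPodsPerNode(p podRefs, allPods []podRefs) map[string]int {
        counts := map[string]int{}
        for _, other := range allPods {
            sameController := p.ControllerRef != "" && other.ControllerRef == p.ControllerRef
            if sameController || intersects(other.ServiceRefs, p.ServiceRefs) {
                counts[other.NodeName]++
            }
        }
        return counts
    }

    func intersects(a, b []string) bool {
        set := map[string]bool{}
        for _, x := range a {
            set[x] = true
        }
        for _, y := range b {
            if set[y] {
                return true
            }
        }
        return false
    }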
@davidopp - O(#pods) is not acceptable in my opinion. If we are targeting 5000 nodes and potentially 100 pods/node at some point, then scheduling a single pod would take a second... |
I agree with @wojtek-t - we need some way to make the scheduler work on higher-level abstractions than Pods if we want to keep the 5s e2e startup latency... |
It's not about e2e startup latency - it's about throughput. |
@wojtek-t It seems that it should not be hard to go from O(#pods x #matching (RS, RC, services)) to O(#nodes x #matching (RS, RC, services)) by incrementally caching the information at the node level. And ordering the computations so that the most computationally expensive one goes last also seems to be promising. |
Probably the optimizations described under "performance/scalability improvement ideas" here would help. ("higher level abstractions" reminded me of equivalence classes) |
The thing @xiang90 suggested above is also a good idea, and a lot simpler than equivalence classes. |
Agree with @xiang90 's idea. Just want to clarify it. It should be:
for (each node)
  if node.controllerRefs.Contains(p.controllerRef) || node.serviceRefs.Contains(p.serviceRef)
    countsByNodeName[node]++
We can incrementally update all kinds of refs and labels aggregated at the node level. |
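A sketch of that incremental node-level aggregation (hypothetical names): each node keeps counters of the controller/service refs of its pods, updated on pod add/delete, so scheduling only loops over nodes. Note the over-counting caveat the discussion below turns to.

    // Sketch: per-node multiset of refs, maintained incrementally.
    type nodeRefCounts struct {
        controllerRefCount map[string]int // ref -> pods on this node owned by that controller
        serviceRefCount    map[string]int // ref -> pods on this node selected by that service
    }

    func newNodeRefCounts() *nodeRefCounts {
        return &nodeRefCounts{
            controllerRefCount: map[string]int{},
            serviceRefCount:    map[string]int{},
        }
    }

    func (n *nodeRefCounts) addPod(controllerRef string, serviceRefs []string) {
        n.controllerRefCount[controllerRef]++
        for _, s := range serviceRefs {
            n.serviceRefCount[s]++
        }
    }

    // matchingPods sums the counters for the refs of the pod being scheduled.
    // A pod on the node that matches several of those refs is counted several times,
    // which is the deduplication problem raised further down in the thread.
    func (n *nodeRefCounts) matchingPods(controllerRef string, serviceRefs []string) int {
        count := n.controllerRefCount[controllerRef]
        for _, s := range serviceRefs {
            count += n.serviceRefCount[s]
        }
        return count
    }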
To be honest, I'm completely not following your idea above. So your suggestion from above would translate to (in that simplified case):
It is something completely different from what we are computing now. Currently we are computing the "number of pods on a given node matching the selector" (not just whether it's matching or not). Also, your generalization to multiple services/RCs/RSs is something different, because you are counting the number of those objects that are matching our pod - not the number of pods on the node that match the same objects as the pod we would like to schedule. So, unless I'm missing something:
So in my opinion what you are suggesting is something completely different from what we currently have. |
Yes. It should be
|
Yeah - but then if you generalize it to multiple RCs/Services, you need to be able to deduplicate those pods, which is basically impossible to do just operating on numbers... For example, if a single pod on a node belongs to RC A and is also selected by services B and C (all of which match the pod being scheduled), the per-ref counters report 3 matches where the true answer is 1, and the counters alone don't tell you how much to subtract. |
Getting back to this - we are done with the 1.2 work, so it's time to pick it up again.
This should work, because:
WDYT? |
In the meantime, I will do a quick fix by parallelizing these computations, which seems pretty easy to do with the recent changes introducing schedulercache done by @hongchaodeng |
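The "quick fix" of parallelizing could look roughly like the sketch below. This only illustrates the fan-out shape (names and structure are hypothetical); the real change built on schedulercache.

    package main

    import (
        "fmt"
        "sync"
    )

    // computeScoresInParallel spreads the per-node priority computation across a
    // bounded number of goroutines and collects the results into one map.
    func computeScoresInParallel(nodes []string, score func(node string) int, workers int) map[string]int {
        var mu sync.Mutex
        var wg sync.WaitGroup
        results := make(map[string]int, len(nodes))
        work := make(chan string)
        for i := 0; i < workers; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for node := range work {
                    s := score(node) // the expensive per-node part, e.g. spreading counts
                    mu.Lock()
                    results[node] = s
                    mu.Unlock()
                }
            }()
        }
        for _, n := range nodes {
            work <- n
        }
        close(work)
        wg.Wait()
        return results
    }

    func main() {
        nodes := []string{"node-1", "node-2", "node-3"}
        scores := computeScoresInParallel(nodes, func(n string) int { return len(n) }, 2)
        fmt.Println(scores)
    }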
Forgive me for being a little out of the loop here, but the scheduler knowing about services and RCs seems to actually be a leaky abstraction. Ideally there should be nothing special about RCs and Services at this layer. Are there going to have to be similar hacks in the scheduler for peers to RCs as they come out? What if someone isn't using the k8s definition of services but instead rolling their own discovery? Wouldn't it be better to have users explicitly specify the labels that matter for spreading and key off of those directly? |
I like the idea of spread-by label keys. I'm afraid that it's not backward compatible though... @bgrant0607 |
For back compat, I'd do something like create "invisible/hidden/automatic" spread-by labels for RCs and Services as they stand in addition to explicit. The scheduler can then be moved to only pay attention to labels. |
@jbeda 's idea is also good for caching results. I have a rough plan in my mind for optimizing the spread calculation like this:
If I'm right about this simplified version, the complexity is O(nodes * keys). A huge win from a scale perspective. Maybe we should talk about this in the scale meeting :) |
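Reading the plan as described, a simplified sketch (hypothetical names) could keep, per node, a counter per spread key/value pair; scoring a pod then only touches its own spread keys on each node, giving the O(nodes * keys) mentioned above.

    // Sketch: per-node counters keyed by "spread key=value", updated incrementally
    // as pods are added to or removed from nodes.
    type spreadCounts map[string]map[string]int // nodeName -> "key=value" -> pod count

    func (c spreadCounts) addPod(nodeName string, spreadLabels map[string]string) {
        if c[nodeName] == nil {
            c[nodeName] = map[string]int{}
        }
        for k, v := range spreadLabels {
            c[nodeName][k+"="+v]++
        }
    }

    // score returns how many pods on the node share a spread key/value with the pod
    // being scheduled; lower is better for spreading. O(#spread keys) per node.
    func (c spreadCounts) score(nodeName string, spreadLabels map[string]string) int {
        total := 0
        for k, v := range spreadLabels {
            total += c[nodeName][k+"="+v]
        }
        return total
    }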
We're having too many sigs working on the same thing... |
Hmm - this is a very interesting idea, but I'm still not fully convinced about it:
How does it change the currently existing code? IIUC, we still need to maintain more-or-less the currently existing code. And if that's the case, adding additional condition for spreading will complicate the system even more. But I might be missing something here. Also - @bgrant0607 for thoughts about it. |
Regardless of the path forward here, a couple of points:
I'd suggest that first we come up with the ideal model for how spreading should be explicitly represented to the scheduler. The idea of a "spread label/key" is only a starting point. And instead of labels, annotations could be used. Once we have that model, we can decide how to move forward. Perhaps one of these:
|
cc @davidopp |
The scheduler doesn't require RCs or Services to work - they just affect how we compute priorities.
Agree. Also, to clarify - I'm not against this idea - I just think that we need to be careful with that (maybe breaking compatibility is even fine - but it needs to be a conscious decision). |
Using services and controllers is a hack. Should use anti-affinity and disruption budget #22774 |
Labels are for identifying attributes and grouping, not semantically meaningful properties |
If we want/need to continue with hacks for now, they can't be visible in the api |
My recommendation is to continue with the default spreading in the scheduler, but consider making the set of API types used in the spreading be a configuration option to the scheduler rather than hard-coded.
I agree, but this was the only way to provide the spreading we wanted without having to modify controllers to add information to the pod representing the identity of the controller (a la controllerRef) and the service.
Many of the affinity and anti-affinity use cases rely on semantically meaningful labels (on the node or pod). Constrain to a specific zone, spread across nodes or zones, etc. We already have a moderate-sized library of semantically meaningful labels, and node and pod affinity express their requests over labels. It's not clear to me if you're saying this is a problem, or just arguing we shouldn't have controllers automatically add labels identifying the pod's controller and service. I don't see the relevance of DisruptionBudget; can you explain? |
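One possible shape for the "configuration option to the scheduler" suggested above (purely hypothetical; not an existing scheduler flag or API field):

    // Hypothetical sketch: the resources whose selectors drive default spreading are
    // listed in the scheduler configuration instead of being hard-coded in the priority.
    type spreadingConfig struct {
        // SpreadBy names the resource kinds whose selectors group pods for spreading.
        SpreadBy []string
    }

    func defaultSpreadingConfig() spreadingConfig {
        return spreadingConfig{
            SpreadBy: []string{"ReplicationController", "ReplicaSet", "Service"},
        }
    }

A cluster admin could then add or remove kinds (for example a future DisruptionBudget) without changing the priority function itself.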
I think it means we need to consider DisruptionBudget when spreading pods. I saw that we were discussing two things that we should split:
Labelling
I want to point out that @jbeda 's idea is not about "user" labelling. Let's look at the function:
// It favors nodes that have fewer existing matching pods.
// i.e. it pushes the scheduler towards a node where there's the smallest number of
// pods which match the same service selectors or RC selectors as the pod being scheduled.
// Where zone information is included on the nodes, it favors nodes in zones
// with fewer existing matching pods.
func (s *SelectorSpread) CalculateSpreadPriority(...)
The problem is that the scheduler now needs to know about higher-order objects like services, replication controllers, etc. We should provide a layer of abstraction for the scheduler to do spreading. We could add some "system" labelling, grouping keys, etc.
Pod Anti-affinity
The problem scope might become bigger, but I still want to ask the question: what's our future plan for pod anti-affinity? It seems pod anti-affinity might overlap with or include spreading. In the near future, we will face two choices:
I'm trying to explain the problem. So this is not an argument but for everyone's future consideration. |
Ah, I see. That's a very good point--that we should treat a shared DisruptionBudget the same way we treat service, ReplicationController, ReplicaSet, etc., today for spreading. Whether that means we don't need to look at service, ReplicationController, ReplicaSet, etc. for spreading (which I now understand is what @bgrant0607 was suggesting) is an interesting question. I'm not fully convinced. Spreading by itself gives you "best-effort" protection from correlated failure, which can be beneficial even if you don't need the stronger guarantee of a DisruptionBudget. From a usability standpoint, it is also simpler to provide some default spreading even if the user does not define a DisruptionBudget. We could auto-generate DisruptionBudget from Service and/or Deployment, I guess, which would alleviate the need for the user to explicitly define a DisruptionBudget in most cases, but I think we should think through the details before we go so far as to remove the default spreading we currently do.
Right, I understand that's what Joe is suggesting, and that it would allow us to make CalculateSpreadPriority() generic and not have to know about specific controllers and services. It sounded like Brian was objecting to that, but I'm not sure. I don't really have a problem with the idea, especially since we already use labels to denote semantically meaningful node properties. But I'd like to better understand Brian's comment on this. An alternative is to make the list of spreading rules be a config parameter to the scheduler. Obviously that's much closer to what we have today, so doesn't fully accomplish what Joe is going after.
It definitely does include spreading. I think the issue is exactly what you mentioned. For usability reasons, we want to provide reasonable default spreading that doesn't require any explicit user input. This is why I don't think we should replace the controller/service-based spreading rules with one based on DisruptionBudget, unless DisruptionBudget is auto-generated (which may or may not be a good idea). Likewise, I don't think we should rely on users to explicitly indicate spreading requests using pod anti-affinity. That should only be for "special" requests. The system should do reasonable spreading by default, without any explicit pod anti-affinity. |
I completely agree with @davidopp that we should have some default spreading for simple users who do not specify a DisruptionBudget, specific spreading labels, etc. Whether that default should be based on RC/Service is a separate question. So my current thinking is that we should first fix the default spreading we currently have (by fixing I actually mean improving performance, possibly by changing the algorithm), and then add things like DisruptionBudget, anti-affinity spreading and so on on top of that. |
Short update from an in-person discussion with @bgrant0607 and others: we don't treat backward compatibility of scheduler decisions as critical, as long as the scheduling is sane, for some definition of sane. |
Sorry for the terse comments. I'm timesliced among 1000 different things.
Yes, as you've figured out, a DisruptionBudget would imply that pods associated with the same budget would need to be spread. Generating a DisruptionBudget automatically is worth considering.
I'm fine with maintaining legacy behavior so long as we don't need to add anything to the API just for that purpose. With controllerRef and the /scale subresource, which contains a selector for identifying the controlled pods, we could make the spreading behavior generic for any controller supporting /scale. I wouldn't bother spreading pods of jobs.
On labels not being semantically meaningful: I was referring to user-created resources. The way we identify user resources is with selectors, not with special labels: configuration rather than convention. Nodes are system resources, so cluster admins can label them however they like and we can use those labels, provided the mechanism is configurable by the cluster admin. |
Thanks for the explanation @bgrant0607. Also, we had some offline conversations about it last week and we decided that:
|
I'm starting to implement 1. |
After the recent changes in the scheduler, done mainly by @hongchaodeng, scheduling latency is much better, but we are still not where we would like to be.
Currently the main problem is the SelectorSpread::CalculateSpreadPriority function:
https://github.com/kubernetes/kubernetes/blob/master/plugin/pkg/scheduler/algorithm/priorities/selector_spreading.go#L81
What is happening:
So basically scheduling of a single pod is: O(#pods x #matching (RS, RC, services))
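For reference, a simplified sketch of what the linked function does per scheduled pod (not a verbatim copy; selectors and matches stand in for the real selector objects and label matching), which is where the complexity above comes from:

    // Simplified sketch of CalculateSpreadPriority today:
    // selectors = the selectors of every service/RC/RS that matches the pod being
    // scheduled; then every scheduled pod is tested against every one of them.
    countsByNode := map[string]int{}
    for _, pod := range allScheduledPods { // O(#pods)
        for _, sel := range selectors {    // O(#matching (RS, RC, services))
            if matches(sel, pod.Labels) {
                countsByNode[pod.NodeName]++
                break // count each pod at most once
            }
        }
    }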
@davidopp @gmarek @hongchaodeng @xiang90 @kubernetes/sig-scalability