
Long scheduling latency in large clusters #22262

Closed
wojtek-t opened this issue Mar 1, 2016 · 49 comments
Assignees
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling.

Comments

@wojtek-t
Member

wojtek-t commented Mar 1, 2016

After recent changes in the scheduler, done mainly by @hongchaodeng, scheduling latency is much better, but we are still not where we would like to be.

Currently the problem is the SelectorSpread::CalculateSpreadPriority function:
https://github.com/kubernetes/kubernetes/blob/master/plugin/pkg/scheduler/algorithm/priorities/selector_spreading.go#L81

What is happening:

  1. We first iterate over all services, replication controllers, and replica sets in the system to check whether their selectors match the labels of the pod we are trying to schedule.
  2. Then we list all pods in the namespace (which can potentially be all the pods in the system).
  3. Then, for each service, RC, and RS selected in point 1, we iterate over all pods from point 2 and build a mapping from node to number of pods.

So basically scheduling a single pod is: O(#pods x #matching (RS, RC, services))
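
For illustration, a minimal Go sketch of the shape of that computation (the types and the matches helper are placeholders, not the actual scheduler code):

```go
package main

import "fmt"

// Placeholder types standing in for the scheduler's cached objects.
type Pod struct {
	Name   string
	Node   string
	Labels map[string]string
}

type Object struct { // stands in for a Service, RC, or RS
	Name     string
	Selector map[string]string
}

// matches reports whether every key/value in selector is present in labels.
func matches(selector, labels map[string]string) bool {
	for k, v := range selector {
		if labels[k] != v {
			return false
		}
	}
	return true
}

// spreadCounts mirrors the three steps above: find the matching objects, then
// for each of them re-scan the full pod list, giving
// O(#pods x #matching objects).
func spreadCounts(p Pod, objects []Object, allPods []Pod) map[string]int {
	counts := map[string]int{}
	for _, o := range objects { // step 1: objects whose selector matches p
		if !matches(o.Selector, p.Labels) {
			continue
		}
		for _, other := range allPods { // steps 2+3: scan every pod, per object
			if matches(o.Selector, other.Labels) {
				counts[other.Node]++
			}
		}
	}
	return counts
}

func main() {
	pods := []Pod{
		{Name: "a", Node: "node1", Labels: map[string]string{"app": "web"}},
		{Name: "b", Node: "node2", Labels: map[string]string{"app": "web"}},
	}
	objs := []Object{{Name: "web-svc", Selector: map[string]string{"app": "web"}}}
	fmt.Println(spreadCounts(pods[0], objs, pods)) // map[node1:1 node2:1]
}
```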

@davidopp @gmarek @hongchaodeng @xiang90 @kubernetes/sig-scalability

@wojtek-t wojtek-t added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. team/control-plane labels Mar 1, 2016
@wojtek-t
Member Author

wojtek-t commented Mar 1, 2016

FYI - we are planning to support more generic indexing hopefully for 1.3: #4817
But I'm not entirely sure whether it will even help here. I will think about it more.
@lavalamp

@wojtek-t wojtek-t added this to the next-candidate milestone Mar 1, 2016
@lavalamp
Member

lavalamp commented Mar 1, 2016

Selectors have a fast reverse-lookup procedure. We may need to actually implement it. The number of matching RC, RS, services should be pretty low; I assume the problem is that finding that currently involves looping over all of them?


@wojtek-t
Member Author

wojtek-t commented Mar 2, 2016

Selectors have a fast reverse-lookup procedure.

@lavalamp Can you briefly describe it?

The number of matching RC, RS, services should be pretty low,

I agree. It should be O(1) in all reasonable cases.

I assume the problem is that finding that currently involves looping over all of them?

One thing is looping over all of them.
But the biggest issue is that we are looping over all pods, which is really painful.

@lavalamp
Member

lavalamp commented Mar 2, 2016

Oh, I missed that. We should just maintain a node-to-pod map. Then we don't have to worry about the selectors for a bit probably. And you must have meant O(#pods + #matching (RS, RC, services))? Surely we don't stack those two things, I can't imagine why we would do that?

@lavalamp
Member

lavalamp commented Mar 2, 2016

(The selector speedup involves keeping indexes & merging sets-- I can look that up after 1.2 or @bgrant0607 might just know off the top of his head where it's documented)

@bgrant0607
Member

Selector indexing: #4817

@bgrant0607
Member

API for reverse selector lookup: #1348

@bgrant0607
Member

ControllerRef backpointers to controller: #14961

@wojtek-t
Member Author

wojtek-t commented Mar 3, 2016

Oh, I missed that. We should just maintain a node-to-pod map.

This one we already have - it doesn't help.

Then we don't have to worry about the selectors for a bit probably. And you must have meant O(#pods + #matching (RS, RC, services))? Surely we don't stack those two things, I can't imagine why we would do that?

No, I actually meant: O(#pods x #matching (RS, RC, services))
Just need to look at the code: for every RC, RS, or Service whose selector matches the pod, we are iterating over all pods (or, once we have indexing, we choose all pods matching that selector).
However, we need to merge those results at the end.

So it's definitely not that easy (to be honest, if we have more than one matching selector, I don't know how to do it even with efficient indexing).

@lavalamp
Member

lavalamp commented Mar 4, 2016

Oh, for the spreading? We should just maintain pod <-> "spreading group" map. It's silly to recreate that every time (sorry, haven't looked at the code).

@wojtek-t
Member Author

wojtek-t commented Mar 4, 2016

Oh, for the spreading? We should just maintain pod <-> "spreading group" map. It's silly to recreate that every time (sorry, haven't looked at the code).

hmm - it probably should be "pod labels <-> spreading group" - this may be more efficient.

@davidopp
Member

davidopp commented Mar 6, 2016

If we do a serviceRef analogous to controllerRef (but it is an array of pointers rather than a pointer, since a pod can be in multiple services by design), then when scheduling pod P you can just do

for (each node)
   for (each pod on the node)
       if pod.controllerRef == p.controllerRef || pod.serviceRef.Contains(p.serviceRef)
           countsByNodeName[node]++

This would just be O(# pods).
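
For illustration, a rough Go rendering of the loop above. ControllerRef is the proposal referenced earlier (#14961), but the ServiceRefs field and these type names are hypothetical, used only to make the single-pass O(#pods) idea concrete:

```go
// Hypothetical cached-pod shape; ServiceRefs does not exist in the real API.
type CachedPod struct {
	Node          string
	ControllerRef string   // e.g. UID of the owning RC/RS
	ServiceRefs   []string // hypothetical back-pointers to matching services
}

// anyCommonRef reports whether the two ref lists share at least one element.
func anyCommonRef(a, b []string) bool {
	for _, x := range a {
		for _, y := range b {
			if x == y {
				return true
			}
		}
	}
	return false
}

// countsByNode is a single pass over all cached pods: O(#pods), with no
// per-service or per-controller re-scan.
func countsByNode(p CachedPod, allPods []CachedPod) map[string]int {
	counts := map[string]int{}
	for _, other := range allPods {
		if other.ControllerRef == p.ControllerRef || anyCommonRef(other.ServiceRefs, p.ServiceRefs) {
			counts[other.Node]++
		}
	}
	return counts
}
```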

Also, we should figure out a way of sorting the predicate functions so that the ones that are most computationally expensive go last, and have the predicate functions only process nodes that earlier predicate functions considered feasible. That would make this O(# pods that are on nodes that earlier predicate functions considered feasible for P).

@wojtek-t
Member Author

wojtek-t commented Mar 6, 2016

@davidopp - O(#pods) is not acceptable in my opinion. If we are targeting 5000 nodes and potentially 100 pods/node at some point, then scheduling a single pod would take a second...

@gmarek
Contributor

gmarek commented Mar 6, 2016

I agree with @wojtek-t - we need some way to make the scheduler work on higher-level abstractions than Pods if we want to keep the 5s e2e startup latency...

@wojtek-t
Member Author

wojtek-t commented Mar 6, 2016

It's not about e2e startup latency - it's about throughput.

@xiang90
Contributor

xiang90 commented Mar 6, 2016

@wojtek-t It seems that it should not be hard to make

for (each node)
   for (each pod on the node)
       if pod.controllerRef == p.controllerRef || pod.serviceRef.Contains(p.serviceRef)
           countsByNodeName[node]++

to

for (each node)
   if node.controllerRef == p.controllerRef || node.serviceRefs.Contains(p.serviceRef)
       countsByNodeName[node]++

by incrementally caching the information at node level again.

And the idea of ordering by computational cost, so the most expensive functions run last, also seems promising.

@davidopp
Member

davidopp commented Mar 6, 2016

I agree with @wojtek-t - we need some way to make the scheduler work on higher-level abstractions than Pods if we want to keep the 5s e2e startup latency...

Probably the optimizations described under "performance/scalability improvement ideas" here would help.

#18418 (comment)

("higher level abstractions" reminded me of equivalence classes)

@davidopp
Member

davidopp commented Mar 6, 2016

The thing @xiang90 suggested above is also a good idea, and a lot simpler than equivalence classes.

@hongchaodeng
Contributor

Agree with @xiang90 's idea.

Just want to clarify it. It should be

for (each node)
   if node.controllerRefs.Contains(p.controllerRef) || node.serviceRefs.Contains(p.serviceRef)
       countsByNodeName[node]++

We can incrementally update all kinds of refs and labels aggregated at the node level.
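
For illustration, a minimal sketch of that incremental node-level cache (all type and field names are illustrative): per-node counts keyed by ref are updated on pod add/delete, so the scoring pass only walks nodes. Note that, as discussed below, a pod matching several refs is counted once per ref, which is a semantic change from the current deduplicated behavior:

```go
// Illustrative cache: per-node counts keyed by an opaque ref string
// (e.g. a controller UID or the namespace/name of a service).
type NodeInfo struct {
	RefCounts map[string]int // ref -> number of its pods on this node
}

type SpreadCache struct {
	Nodes map[string]*NodeInfo // node name -> cached info
}

func NewSpreadCache() *SpreadCache {
	return &SpreadCache{Nodes: map[string]*NodeInfo{}}
}

// AddPod / RemovePod keep the per-node counts up to date incrementally.
func (c *SpreadCache) AddPod(node string, refs []string) {
	ni := c.Nodes[node]
	if ni == nil {
		ni = &NodeInfo{RefCounts: map[string]int{}}
		c.Nodes[node] = ni
	}
	for _, r := range refs {
		ni.RefCounts[r]++
	}
}

func (c *SpreadCache) RemovePod(node string, refs []string) {
	if ni := c.Nodes[node]; ni != nil {
		for _, r := range refs {
			if ni.RefCounts[r] > 0 {
				ni.RefCounts[r]--
			}
		}
	}
}

// Counts walks nodes only (O(#nodes x #refs of the incoming pod)); the
// per-pod work was already paid when pods were added or removed.
func (c *SpreadCache) Counts(podRefs []string) map[string]int {
	counts := map[string]int{}
	for name, ni := range c.Nodes {
		for _, r := range podRefs {
			counts[name] += ni.RefCounts[r]
		}
	}
	return counts
}
```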

@wojtek-t
Member Author

wojtek-t commented Mar 7, 2016

To be honest, I'm not following your idea above at all.
For simplicity, let's first assume that we have just services (ignoring RC, RS, etc.), and that there is only one such service. We can try to generalize it later.

So your suggestion from above would translate to (in that simplified case):

for (each node) {
  if node.controllerRef.Contains(p.controllerRef) {
    countsByNodeName[node]++
  }
}

This is something completely different from what we are computing now. Currently we compute the "number of pods on a given node matching the selector" (not just whether something on the node matches or not).

Also, your generalization to multiple services/RCs/RSs is something different, because you are counting the number of those objects that match our pod, not the number of pods on the node that match the same objects as the pod we would like to schedule.

So, unless I'm missing something:

  1. An ObjectRef doesn't give us the number of objects matching it (as that would be very difficult to keep up to date)
  2. With your approach, we don't get deduplication of pods on a given node that match different RCs, services, etc.

So in my opinion what you are suggesting is something completely different from what we currently have.

@hongchaodeng
Contributor

Yes. It should be

for (each node) {
   if c := node.controllerRefCount(p.controllerRef); c != 0 {
      countsByNodeName[node] += c
   }
}

@wojtek-t
Member Author

wojtek-t commented Mar 7, 2016

Yeah - but then if you generalize it to multiple RCs/Services, you need to be able to deduplicate those pods, which is basically impossible to do when operating only on numbers...
So it would be a semantic change, which we need to be very careful about.

@wojtek-t
Member Author

Getting back to this - we are done with 1.2 work, so it's time to pick it up again.
Let me write a few comments about it.

  1. I'm not sure using "controllerRef" here is the best idea - basically because we should do spreading based on different objects here (already RC, RS, Service, and probably in the near future also Job, Deployment, etc.) - and I don't think we will have a "backpointer" to all of those objects soon [and we would like to fix this particular issue soon].
  2. We may need to slightly change the current metric so that if a pod satisfies X different selectors, we count this pod X times. I think it's not a big change, and it would significantly simplify computing it. @davidopp - thoughts?
  3. Personally, I would go with a modified version of the suggestion Daniel mentioned somewhere above; in particular, I think we should maintain the following mappings in memory (see the sketch at the end of this comment):
    [pod labels] => [ list of selectors that are satisfied by this pod ]
    [selector] => [ node => #pods on that node matching that selector ]
    [We probably want to switch selector with selector+namespace, but that's a detail.]

This should work, because:

  • in general a pod should satisfy only a small number of selectors
  • add/update/delete pod is doable in O(#matching selectors)
  • add/update/delete RC/RS/Service/... can easily be done in O(total number of pods), which I think is enough

WDYT?
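
For illustration, a minimal Go sketch of the second mapping and its incremental maintenance, under the relaxed metric from point 2 (all names are illustrative, and flattening selectors to opaque string keys is an assumption, not the real scheduler representation):

```go
// Illustrative index: selectors are flattened to opaque string keys
// (e.g. serialized selector + namespace).
type SpreadIndex struct {
	// selector key -> node name -> #pods on that node matching the selector
	podsPerNode map[string]map[string]int
}

func NewSpreadIndex() *SpreadIndex {
	return &SpreadIndex{podsPerNode: map[string]map[string]int{}}
}

// OnPodAdd is O(#selectors matched by the pod), which should be small.
func (i *SpreadIndex) OnPodAdd(node string, matchingSelectors []string) {
	for _, sel := range matchingSelectors {
		m := i.podsPerNode[sel]
		if m == nil {
			m = map[string]int{}
			i.podsPerNode[sel] = m
		}
		m[node]++
	}
}

func (i *SpreadIndex) OnPodDelete(node string, matchingSelectors []string) {
	for _, sel := range matchingSelectors {
		if m := i.podsPerNode[sel]; m != nil && m[node] > 0 {
			m[node]--
		}
	}
}

// CountsFor returns node -> count for the scheduled pod's matching selectors.
// With the relaxed metric from point 2, a pod satisfying X selectors is
// counted X times (no deduplication across selectors).
func (i *SpreadIndex) CountsFor(matchingSelectors []string) map[string]int {
	counts := map[string]int{}
	for _, sel := range matchingSelectors {
		for node, c := range i.podsPerNode[sel] {
			counts[node] += c
		}
	}
	return counts
}
```

The first mapping ([pod labels] => matching selectors) would be what produces matchingSelectors on pod add/update/delete, keeping those events O(#matching selectors) as listed above.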

@wojtek-t
Member Author

In the meantime, I will do a quick fix by parallelizing these computations, which seems pretty easy to do given the recent changes introducing the scheduler cache done by @hongchaodeng.
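
For illustration only, a rough sketch of what parallelizing a per-node computation could look like (plain goroutines with a placeholder callback; this is not the actual scheduler change):

```go
import "sync"

// parallelCounts fans the per-node work out across goroutines; countPodsOnNode
// is a placeholder for whatever the existing per-node computation is.
func parallelCounts(nodes []string, countPodsOnNode func(node string) int) map[string]int {
	type result struct {
		node  string
		count int
	}
	results := make(chan result, len(nodes)) // buffered, so sends never block
	var wg sync.WaitGroup
	for _, n := range nodes {
		wg.Add(1)
		go func(node string) {
			defer wg.Done()
			results <- result{node: node, count: countPodsOnNode(node)}
		}(n)
	}
	wg.Wait()
	close(results)

	counts := map[string]int{}
	for r := range results {
		counts[r.node] = r.count
	}
	return counts
}
```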

@jbeda
Contributor

jbeda commented Mar 21, 2016

Forgive me for being a little out of the loop here, but the scheduler knowing about services and RCs seems to actually be a leaky abstraction. Ideally there should be nothing special about RCs and Services at this layer. Are there going to have to be similar hacks in the scheduler for peers to RCs as they come out? What if someone isn't using the k8s definition of services but instead rolling their own discovery?

Wouldn't it be better to have users explicitly specify the labels that matter for spreading and key off of those directly?

@gmarek
Contributor

gmarek commented Mar 21, 2016

I like the idea of spread-by label keys. I'm afraid that it's not backward compatible though... @bgrant0607

@jbeda
Contributor

jbeda commented Mar 21, 2016

For back compat, I'd do something like create "invisible/hidden/automatic" spread-by labels for RCs and Services as they stand in addition to explicit. The scheduler can then be moved to only pay attention to labels.

@hongchaodeng
Contributor

@jbeda 's idea is also good for caching results.

I have a rough plan in mind for optimizing the spread calculation like this:

for node in nodes:
    for spreadKey in pod.SpreadKeys:
        counts[node.Name] += node.cachedRefCount(spreadKey)

If I'm right about this simplified version, the complexity is O(nodes * keys). A huge win from a scale perspective. Maybe we should talk about this in the scalability meeting :)

@gmarek
Contributor

gmarek commented Mar 21, 2016

We have too many SIGs working on the same thing...

@wojtek-t
Member Author

Hmm - this is a very interesting idea, but I'm still not fully convinced about it:

  1. If we force ReplicationController (or any other controller) to set some additional labels, this would be misleading from the user perspective - a user provides a pod template for the RC specifying the labels those pods are supposed to have, and yet the created pods get some additional labels added automatically. I think this might be confusing for users and we need to be careful about it.
    [On the other hand, we could solve it by using annotations for this purpose, maybe...]
  2. Backward compatibility is the second issue, which @gmarek already mentioned. @jbeda - to be honest I don't really follow your suggestion:

For back compat, I'd do something like create "invisible/hidden/automatic" spread-by labels for RCs and Services as they stand in addition to explicit. The scheduler can then be moved to only pay attention to labels.

How does it change the currently existing code? IIUC, we still need to maintain more-or-less the current code. And if that's the case, adding an additional condition for spreading will complicate the system even more. But I might be missing something here.

Also - @bgrant0607 for thoughts on this.

@jbeda
Contributor

jbeda commented Mar 22, 2016

Regardless of the path forward here a couple of points:

  • The fact that the scheduler even knows what an RC or Service are smells bad. There are going to be usages of kubernetes that don't use either of these higher level services. We also tell users that they could replace the RC with their own control loop and it would be just the same. Furthermore, this is a vague requirement on any replacement scheduler.
  • As we add more types of controllers the need to do spreading will ... spread. The current code in the scheduler will only get more complex and more tightly coupled to the higher level constructs.

I'd suggest that first we come up with the ideal model for how spreading should be explicitly represented to the scheduler. The idea of a "spread label/key" is only a starting point. And instead of labels, annotations could be used.

Once we have that model, we can decide how to move forward. Perhaps something one of these:

  • Document that RCs now put a label on a Pod. Have an upgrade process for 1.next to make this happen.
  • Have the scheduler only apply a "virtual label" while it reads and caches the state of the world. This would be in RAM and never be visible outside the scheduler and would be done in an ingest/preprocessing step. Then the code to do spreading would work purely on labels -- real and virtual. At least then the legacy is isolated and getting rid of that "virtual label" could be a separable project.
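
For illustration, a minimal sketch of the second option, i.e. a scheduler-internal "virtual label" applied at cache-ingest time (the type, function, and label-key format below are all hypothetical):

```go
// Scheduler-internal representation only; VirtualLabels never leave RAM and
// are never written back to the API server.
type cachedPod struct {
	Labels        map[string]string // labels as stored in the API
	VirtualLabels map[string]string // derived at ingest, scheduler-only
}

// ingest attaches one virtual spread label per legacy object (RC/RS/Service)
// that selects the pod; the key format below is purely illustrative.
func ingest(apiLabels map[string]string, owningObjects []string) cachedPod {
	v := map[string]string{}
	for _, obj := range owningObjects {
		v["spread.scheduler.internal/"+obj] = "true"
	}
	return cachedPod{Labels: apiLabels, VirtualLabels: v}
}
```

The spreading code would then look only at labels, real and virtual, so removing the legacy behavior later would mean deleting just the ingest step.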

@gmarek
Contributor

gmarek commented Mar 23, 2016

cc @davidopp

@wojtek-t
Member Author

The fact that the scheduler even knows what an RC or Service are smells bad. There are going to be usages of kubernetes that don't use either of these higher level services. We also tell users that they could replace the RC with their own control loop and it would be just the same. Furthermore, this is a vague requirement on any replacement scheduler.

The scheduler doesn't require RCs or Services to work - they just affect how we compute priorities.
However, I agree that giving users the ability to control spreading the way they want is a good point.

As we add more types of controllers the need to do spreading will ... spread. The current code in the scheduler will only get more complex and more tightly coupled to the higher level constructs.

Agree.

Also, to clarify - I'm not against this idea - I just think that we need to be careful with that (maybe breaking compatibility is even fine - but it needs to be a conscious decision).

@bgrant0607
Member

Using services and controllers is a hack. Should use anti-affinity and disruption budget #22774

@bgrant0607
Member

Labels are for identifying attributes and grouping, not semantically meaningful properties

@bgrant0607
Member

If we want/need to continue with hacks for now, they can't be visible in the api

@davidopp
Member

davidopp commented Apr 3, 2016

My recommendation is to continue with the default spreading in the scheduler, but consider making the set of API types used in the spreading be a configuration option to the scheduler rather than hard-coded.

Using services and controllers is a hack.

I agree, but this was the only way to provide the spreading we wanted without having to modify controllers to add information to the pod representing the identity of the controller (a la controllerRef) and the service.

Labels are for identifying attributes and grouping, not semantically meaningful properties

Many of the affinity and anti-affinity use cases rely on semantically meaningful labels (on the node or pod). Constrain to a specific zone, spread across nodes or zones, etc. We already have a moderate-sized library of semantically meaningful labels, and node and pod affinity express their requests over labels. It's not clear to me if you're saying this is a problem, or just arguing we shouldn't have controllers automatically add labels identifying the pod's controller and service.

I don't see the relevance of DisruptionBudget; can you explain?

@hongchaodeng
Contributor

I don't see the relevance of DisruptionBudget; can you explain?

I think it means we need to consider DisruptionBudget when spreading pods.

I saw that we were discussing two things that we should split:

  • labelling.
  • pod anti-affinity.

Labelling

I want to point out that @jbeda 's idea is not about "user" labelling. Let's look at the function:

// It favors nodes that have fewer existing matching pods.
// i.e. it pushes the scheduler towards a node where there's the smallest number of
// pods which match the same service selectors or RC selectors as the pod being scheduled.
// Where zone information is included on the nodes, it favors nodes in zones 
// with fewer existing matching pods.
func (s *SelectorSpread) CalculateSpreadPriority(...)

The problem is that the scheduler now needs to know about higher-order objects like services, replication controllers, etc. We should provide a layer of abstraction for the scheduler to do spreading. We could add some "system" labelling, grouping keys, etc.

Pod Anti-affinity

The problem scope might become bigger, but I still want to ask the question: What's our future plan for pod anti-affinity? It seems pod anti-affinity might overlap/include spreading. In the near future, we will face two choices:

  • Merge this selector spreading into pod anti-affinity. The advantage is that we combine the codebases. The disadvantage is complexity. Let me clarify: this selector spreading (or whatever name it ends up with) is more of a system-internal thing, while pod anti-affinity is more of a user-facing feature. I'm not sure if we want to combine the two.
  • Separate them. The opposite.

I'm just trying to lay out the problem, so this is not an argument but something for everyone's future consideration.

@davidopp
Member

davidopp commented Apr 3, 2016

I think it means we need to consider DisruptionBudget when spreading pods.

Ah, I see. That's a very good point--that we should treat a shared DisruptionBudget the same way we treat service, ReplicationController, ReplicaSet, etc., today for spreading. Whether that means we don't need to look at service, ReplicationController, ReplicaSet, etc. for spreading (which I now understand is what @bgrant0607 was suggesting) is an interesting question. I'm not fully convinced. Spreading by itself gives you "best-effort" protection from correlated failure, which can be beneficial even if you don't need the stronger guarantee of a DisruptionBudget. From a usability standpoint, it is also simpler to provide some default spreading even if the user does not define a DisruptionBudget. We could auto-generate DisruptionBudget from Service and/or Deployment, I guess, which would alleviate the need for the user to explicitly define a DisruptionBudget in most cases, but I think we should think through the details before we go so far as to remove the default spreading we currently do.

We could add some "system" labelling, keys, etc.

Right, I understand that's what Joe is suggesting, and that it would allow us to make CalculateSpreadPriority() generic and not have to know about specific controllers and services. It sounded like Brian was objecting to that, but I'm not sure. I don't really have a problem with the idea, especially since we already use labels to denote semantically meaningful node properties. But I'd like to better understand Brian's comment on this. An alternative is to make the list of spreading rules be a config parameter to the scheduler. Obviously that's much closer to what we have today, so doesn't fully accomplish what Joe is going after.

It seems pod anti-affinity might overlap/include spreading

It definitely does include spreading. I think the issue is exactly what you mentioned. For usability reasons, we want to provide reasonable default spreading that doesn't require any explicit user input. This is why I don't think we should replace the controller/service-based spreading rules with one based on DisruptionBudget, unless DisruptionBudget is auto-generated (which may or may not be a good idea). Likewise, I don't think we should rely on users to explicitly indicate spreading requests using pod anti-affinity. That should only be for "special" requests. The system should do reasonable spreading by default, without any explicit pod anti-affinity.

@wojtek-t
Member Author

wojtek-t commented Apr 4, 2016

I completely agree with @davidopp that we should have some default spreading for users who do not specify a DisruptionBudget, any specific spreading labels, etc. Whether it should be based on RC/Service is a separate question here.
Also, we need to answer how much we care about backward compatibility of the scheduling algorithm - I guess we can break it between minor releases, but I wanted to make sure this is really the case.

So my current thinking is that we should fix the default spreading that we currently have (possibly changing the algorithm - by fixing I actually mean improving performance) and then add things like DisruptionBudget, anti-affinity spreading, and so on on top of that.

@gmarek
Contributor

gmarek commented Apr 5, 2016

Short update from an in-person discussion with @bgrant0607 and others: we don't treat backward compatibility of scheduler decisions as critical, as long as the scheduling is sane, for some definition of sane.

@bgrant0607
Member

Sorry for the terse comments. I'm timesliced among 1000 different things.

Yes, as you've figured out, a DisruptionBudget would imply that pods associated with the same budget would need to be spread.

Generating a DisruptionBudget automatically is worth considering.

I'm fine with maintaining legacy behavior so long as we don't need to add anything to the API just for that purpose.

With controllerRef and the /scale subresource, which contains a selector for identifying the controlled pods, we could make the spreading behavior generic for any controller supporting /scale.

I wouldn't bother spreading pods of jobs.

On labels not being semantically meaningful: I was referring to user-created resources. The way we identify user resources is with selectors, not with special labels: configuration rather than convention.

Nodes are system resources, so cluster admins can label them however they like and we can use those labels, provided the mechanism is configurable by the cluster admin.

@wojtek-t
Member Author

Thanks for the explanation @bgrant0607

Also, we had some offline conversations about it last week and we decided that:

  1. Short term, we will add a reverse index to the scheduler to maintain backward compatibility.
  2. Medium/long term, we will switch to using controllerRef, add DisruptionBudget support, etc.
  3. Note that once controllerRef is added, we will need the reverse index only for services, and at that point we may want to rethink whether we still need spreading by services (maybe spreading by deployments, which we will be able to effectively infer from controllerRef, would be enough?).

@wojtek-t
Member Author

I'm starting to implement 1.

@wojtek-t wojtek-t self-assigned this Apr 11, 2016
@timothysc timothysc added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Dec 9, 2016
@fejta-bot

Issues go stale after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 19, 2017
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 18, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
