core: initialize round robin load balancer picker to random index #4462
Conversation
Thank you for your pull request. Before we can look at your contribution, we need to ensure all contributors are covered by a Contributor License Agreement. After the following items are addressed, please respond with a new comment here, and the automated system will re-verify.
Regards,
@tleach One alternative is to shuffle the addresses in the name resolver before passing them to the LB. This avoids the problem of all clients overloading the backends in the same order. Picking a random start index means all backends would still end up synchronizing (like metronomes).
@carl-mastrangelo great point regarding clients synchronizing their access to backends. I considered shuffling the addresses but was concerned about introducing an O(n) operation every time the address list is updated, if that list changes frequently. Perhaps that concern is overblown.
Another thought that occurred to me: shuffling the addresses is likely a redundant operation for other load balancer implementations (e.g. a random load balancer). Therefore the name resolver is probably the wrong place to do it (and arguably it leaks information about the choice of load balancing algorithm). It still might make sense to shuffle the address list inside the load balancer itself.
@tleach I wouldn't worry about the time complexity here. The number of backends is usually less than 1000, and the list being shuffled very likely fits in cache. I have an internal benchmark that uses 50,000 backends and didn't see any significant issues. (Also, see #3579) As for a random load balancer: probably not a good idea, since some of the backends get way more load than the others. For an example of how this happens, see Genius's blog post. For gRPC we try to keep the LB implementations the same across languages, so they all behave roughly equally. @zhangkun83 and @ejona86 might be able to shed light on potentially applying your change across implementations.
This is probably a good idea to do cross-language. But I'd like to discuss it with Kun, who will be back next week. I do agree that shuffling on every update is a bit weird across all NameResolvers (for some it would be a good idea, for others not so much). But I can see the discussion going in a few different directions. Can we sit on this for a week-ish and get back to you?
@ejona86 sure - no urgency on my end. We've shifted to a different (in-house) LB implementation for now. Thanks.
I am back :). I think shuffling in the round robin balancer is the right thing to do. Shuffling helps balance the load; it has nothing to do with the NameResolver.
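As a rough illustration of that suggestion, the shuffle could live entirely inside the balancer's handling of a resolved address update. This is a minimal sketch under assumed names (`ShuffleSketch`, `onAddressUpdate` are hypothetical, not the actual grpc-java internals):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch: shuffle the resolved addresses once per update inside
// the load balancer, keeping the NameResolver agnostic of the LB policy.
final class ShuffleSketch<T> {
  private final Random random = new Random();

  List<T> onAddressUpdate(List<T> resolvedAddresses) {
    List<T> copy = new ArrayList<>(resolvedAddresses);
    // O(n) per update; cheap for typical backend counts (see discussion above)
    Collections.shuffle(copy, random);
    return copy;
  }
}
```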
```java
this.stickinessState = stickinessState;
// start off with a random address to ensure significant Picker churn does not skew
// subchannel selection toward lower-indexed addresses in the list
Random random = new Random();
```
Don't re-create the Random instance so frequently. Instead, make it a field of RoundRobinLoadBalancer and pass the value (or the chosen starting index) into the Picker constructor. Since Pickers are only created from one thread at a time, there are no concurrency issues from using Random at that point.
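A minimal sketch of the suggested shape (class and method names here are illustrative, not the real grpc-java code):

```java
import java.util.List;
import java.util.Random;

// Hypothetical sketch: Random is created once as a balancer field, and only
// the chosen start index is handed to each new Picker.
final class BalancerSketch {
  private final Random random = new Random();  // one instance for the balancer's lifetime

  // Pickers are created from a single thread at a time, so plain Random is fine here.
  PickerSketch newPicker(List<String> activeList) {
    int startIndex = activeList.isEmpty() ? 0 : random.nextInt(activeList.size());
    return new PickerSketch(activeList, startIndex);
  }

  static final class PickerSketch {
    final List<String> list;
    int index;

    PickerSketch(List<String> list, int startIndex) {
      this.list = list;
      this.index = startIndex;
    }
  }
}
```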
Force-pushed from 52a445f to 9d8290d.
Force-pushed from 9d8290d to 40c4aef.
```java
helper.updateBalancingState(state, new Picker(activeList, error, stickinessState));
// initialize the Picker to a random start index to ensure that a high frequency of Picker
// churn does not skew subchannel selection.
int startIndex = activeList.isEmpty() ? -1 : random.nextInt(activeList.size()) - 1;
```
nit: "startIndex" being -1 is a bit obscure here. How about making startIndex zero-based, and in the Picker constructor letting index = startIndex - 1?
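A sketch of the zero-based variant being suggested (names are illustrative; the offset works because this pick logic increments the index before reading it):

```java
import java.util.List;

// Hypothetical sketch: the caller passes a zero-based startIndex, and the
// Picker applies the -1 offset itself because pick() pre-increments.
final class PickerSketch<T> {
  private final List<T> list;
  private int index;

  PickerSketch(List<T> list, int startIndex) {
    this.list = list;
    this.index = startIndex - 1;  // pick()'s pre-increment lands on startIndex first
  }

  T pick() {
    index = (index + 1) % list.size();
    return list.get(index);
  }
}
```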
```java
assertEquals(subchannel1, picker.pickSubchannel(mockArgs).getSubchannel());
assertEquals(subchannel2, picker.pickSubchannel(mockArgs).getSubchannel());
assertEquals(subchannel, picker.pickSubchannel(mockArgs).getSubchannel());
Subchannel picked1 = picker.pickSubchannel(mockArgs).getSubchannel();
```
This test is actually weaker than the previous one, since it no longer verifies that subchannels are visited in order and that no subchannel is skipped.
I would define an internal interface RandomProvider which can be mocked out, and deterministically test that it has been called and that its value is used as the start index.
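A sketch of what that could look like (the interface and class names are hypothetical):

```java
import java.util.Random;

// Hypothetical sketch: randomness hidden behind a small interface so tests
// can substitute a deterministic implementation.
interface RandomProvider {
  int nextInt(int bound);
}

final class JdkRandomProvider implements RandomProvider {
  private final Random random = new Random();

  @Override
  public int nextInt(int bound) {
    return random.nextInt(bound);
  }
}

// A test can then fix the start index and assert the exact pick order:
final class FixedRandomProvider implements RandomProvider {
  private final int value;

  FixedRandomProvider(int value) {
    this.value = value;
  }

  @Override
  public int nextInt(int bound) {
    return value;  // deterministic; assumes value < bound in tests
  }
}
```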
Force-pushed from 40c4aef to 14a9260.
@zhangkun83 I've updated the pickerRoundRobin() test, which now simply passes in a start index and verifies that specific subchannels are returned in a specific order, as before. This should address your concern about weakening the test. Given that, in other tests I've preserved the approach of using nextSubchannel() to dynamically determine the next expected subchannel. This approach seems preferable to wiring through a random provider for the start index, which introduces a fair amount of boilerplate.
zhangkun83 left a comment:
LGTM.
P.S. It's better to keep the commits, to allow reviewers to track the progression. They will be squashed when the PR is merged.
RoundRobinLoadBalancerFactory creates a new Picker instance every time the set of provided address groups changes or the connection state of subchannels associated with existing address groups changes. In certain scenarios, such as deployment/replacement of the target service cluster, this can lead to high churn of Picker objects. Given that each new Picker's subchannel index is initialized to zero, in these scenarios requests can end up getting disproportionately routed through subchannels (and hence server nodes) which are earlier in the list of address groups.
At Netflix we have measured that some service nodes end up taking 3-4x the load of other nodes during deployment.
This commit randomizes the start index of the RoundRobinLoadBalancerFactory.Picker, which eliminates this behavior.