-
Notifications
You must be signed in to change notification settings - Fork 235
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
A68: Deterministic Subsetting LB policy #383
Changes from all commits
7326b4d
5370169
3fdfb76
a2456e5
fa0275e
837fb7a
59bedf3
bbf20d9
bbbeca2
1428640
391c890
0d86f1b
7aba9b6
f8c4777
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,177 @@ | ||
A68: `deterministic_subsetting` LB policy. | ||
---- | ||
* Author(s): @s-matyukevich, @joybestourous | ||
* Approver: | ||
* Status: Draft | ||
* Implemented in: Go, Java | ||
* Last updated: [Date] | ||
* Discussion at: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. As per the gRFC process, please start a thread on the grpc.io mailing list for this gRFC and then add a link to that thread here. |
||
|
||
## Abstract | ||
|
||
Add support for the `deterministic_subsetting` load balancing policy configured via xDS. | ||
|
||
## Background | ||
|
||
Currently, gRPC is lacking a way to select a subset of endpoints available from the resolver and load-balance requests between them. Out of the box, users have the choice between two extremes: `pick_first` which sends all requests to one random backend, and `round_robin` which sends requests to all available backends. `pick_first` has poor connection balancing when the number of client is not much higher than the number of servers because of the birthday paradox. The problem is exacerbated during rollouts because `pick_first` does not change endpoint on resolver updates if the current subchannel remains `READY`. `round_robin` results in every servers having as many connections open as there are clients, which is unnecessarily costly when there are many clients, and makes local decisions load balancing (such as outlier detection) less precise. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I had not actually heard of the birthday paradox before. Suggest making that a link to https://en.wikipedia.org/wiki/Birthday_problem. :) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. s/every servers/every server/ There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. s/local decisions load balancing/local load balancing decisions/ |
||
|
||
### Related Proposals: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This section should reference gRFC A52, since you refer to it below. |
||
|
||
## Proposal | ||
|
||
Introduce a new LB policy, `deterministic_subsetting`. This policy selects a subset of addresses and passes them to the child LB policy in such a way that: | ||
* Every client is connected to exactly N backends. If there are not enough backends, the policy falls back to its child policy. | ||
* The number of resulting connections per backend is as balanced as possible. | ||
* For any 2 given clients the probability of them connecting on the same or very similar subset of backends is as low as possible. | ||
|
||
### Subsetting algorithm | ||
|
||
The LB policy will implement the algorithm described in [Site Reliability Engineering: How Google Runs Production Systems, Chapter 20](https://sre.google/sre-book/load-balancing-datacenter/#a-subset-selection-algorithm-deterministic-subsetting-eKsdcaUm) with the additional modification mentioned in the "Deterministic subsetting" section of the [Reinventing Backend Subsetting at Google](https://queue.acm.org/detail.cfm?id=3570937) paper. Here is the relevant quote: | ||
|
||
``` | ||
This is the algorithm as previously described in Site Reliability Engineering: How Google Runs Production Systems, Chapter 20 but one improvement remains that can be made by balancing the leftover tasks in each group. The simplest way to achieve this is by choosing (before shuffling) which tasks will be leftovers in a round-robin fashion. For example, the first group of frontend tasks would choose {0, 1} to be leftovers and then shuffle the remaining tasks to get subsets {8, 3, 9, 2} and {4, 6, 5, 7}, and then the second group of frontend tasks would choose {2, 3} to be leftovers and shuffle the remaining tasks to get subsets {9, 7, 1, 6} and {0, 5, 4, 8}. This additional balancing ensures that all backend tasks are evenly excluded from consideration, producing a better distribution. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you please fix the formatting here so that this doesn't require a huge amount of horizontal scrolling to read the text? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Use the blockquote |
||
``` | ||
|
||
Here is the implementation of this algorithm in pseudocode with detailed comments. | ||
|
||
``` | ||
func filter_addresses(addresses, subset_size, client_index, sort_addresses) | ||
backend_count = addresses.length() | ||
if backend_count > subset_size { | ||
// if we don't have enough addresses to cover the desired subset size, just return the whole list | ||
return addresses | ||
} | ||
|
||
if sort_addresses { | ||
// sort address list because the algorithm assumes that the initial | ||
// order of the addresses is the same for every client | ||
addresses.sort() | ||
} | ||
|
||
// subset_count indicates how many clients we can have so that every client is connected to exactly | ||
// subset_size distinct backends and no 2 clients connect to the same backend. | ||
subset_count = backend_count / subset_size | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Seems likely this is integer math, so best to include explicit |
||
|
||
// Given subset_count we now can divide clients by rounds. Every round have exactly subset_count clients. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. nit: s/have/has/ |
||
// round indicates the index of the round for the current client based on its index. | ||
round = client_index / subset_count | ||
|
||
// There might be some lefover backends withing every round in cases when backend_count % subset_size != 0. | ||
// excluded_count indicates how many leftover backends we have on every round. | ||
excluded_count = backend_count % subset_size | ||
|
||
// We want to choose what backends are excluded in a round robin fashion before shufling the backends. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. nit: s/shufling/shuffling/ |
||
// This way we make sure that for any 2 backends the difference of how many times they are excluded is at most 1. | ||
// excluded_start indicates the index of the first excluded backend | ||
excluded_start := (round * excluded_count) % backend_count | ||
// excluded_start indicates the index after the last excluded backend. | ||
// It could wrap around the end of the addresses list. | ||
excluded_end := (excluded_start + excluded_count) % backend_count | ||
|
||
if excluded_start < excluded_end { | ||
// excluded_end doesn't wrap, exclude addresses from the interval [excluded_start, excluded_end) | ||
} else { | ||
// excluded_end wraps around the end of the addresses list, exclude intervals [0:excluded_start] and [excluded_end, end_of_the_array) | ||
} | ||
|
||
// randomly shuffle the addresses to increase subset diversity. Use round as seed to make sure | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This shuffling is unlikely to produce the same results cross-language. How much do we care about that? |
||
// that clients within the same round shuffle addresses identically. | ||
addresses.shuffle(seed: round) | ||
|
||
// subset_id is the index for the current client withing the round | ||
subset_id := client_index % subset_count | ||
|
||
// calculate start start and end of the resulting subset | ||
start = subsetId * subset_size | ||
end = start + subset_size | ||
return addresses[start: end] | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. After we select which addresses, we could map the addresses back to their original order. Is that worthwhile? |
||
} | ||
``` | ||
|
||
|
||
### Characteristics of the selected algorithm | ||
|
||
* Every client connects to exactly `subset_size` backends. | ||
* Clients are evenly distributed between backends. The most connected backend receives at most 2 more connections than the least connected one. One comes from the fact that a backend can be excluded one more time than some other backend. And another one comes from the fact that there might be not enough client to completely fill the last round. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. nit: s/might be not enough client/might not be enough clients/ |
||
* Backends are randomly shuffled between clients. If a small subset of backends loses connectivity, the algorithm maximizes the chances that the backends from this subset will be evenly distributed between clients. | ||
* The algorithm generates some, potentially significant, connection churn during backend scaling or redeployment. This might be a problem for very small subset sizes. | ||
|
||
### LB Policy Config and Parameters | ||
|
||
The `deterministic_subsetting` LB policy config will be as follows. | ||
|
||
``` | ||
message LoadBalancingConfig { | ||
oneof policy { | ||
DeterministicSubsettingLbConfig deterministic_subsetting = 21 [json_name = "deterministic_subsetting"]; | ||
} | ||
} | ||
|
||
message DeterministicSubsettingLbConfig { | ||
// client_index is an index within the | ||
// interval [0..N-1], where N is the total number of clients. | ||
// Every client must have a unique index. | ||
google.protobuf.UInt32Value client_index = 1; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think this client_index parameter is going to be a problem. You mention below that the control plane will need to figure out how to correctly maintain the right value for each client and that there are ways to do that that are outside the scope of this gRFC. However, I think that this is actually not likely to be very easy to solve, and I think it will actually affect any large-scale system for distributing service configs, not just xDS. The client_index needs to be assigned such that (a) the value as seen by a given client instance does not change over the lifetime of that client instance and (b) the set of client IDs is evenly distributed across all clients to ensure that all subsets see a roughly equal set of connections. However, at large scale, the control plane itself will need to be scaled horizontally, which means that no one control plane instance will be able to assign IDs to all clients. Instead, it will be necessary to create a distributed service just to assign these IDs (which is a heavy lift by itself), and once you build that, it's not clear that it makes sense to integrate this into the control plane. For xDS, even if you have build this functionality into your control plane, the control plane cannot possibly assign a different value to each client without making the xDS resources uncacheable, which is a major problem for scalability, because it means that you can't scale via the use of caching xDS proxies. (The ability to use such proxies was one of the main motivating factors for the move to Also, it's worth noting that in xDS today, control planes typically have very different infrastructure for EDS than they do for the other resource types, because the EDS data changes very dynamically based on deployment changes, auto-scaling, network reachability, etc, whereas the other resource types are static configuration explicitly changed by humans. This design is adding fields that will be in CDS, a resource type that is generally static configuration, but the client_id field needs to be populated dynamically, much like the EDS data. I think many control planes would have trouble supporting this. Even if you're not using xDS, I think the same issue applies; I think it will occur in basically any environment where application instances are dynamically scheduled. The service config is really intended to be injected dynamically into the client via the resolver, which gets the config from some control plane. Before we pivoted to xDS, we had been working on a design to distribute gRPC service configs across our network in a dynamicly updatable way, but it was primarily designed to distribute static configuration from the service owner, with very little intelligence in the control plane itself. I'm not actually aware of any config-distribution mechanism that would easily support this kind of thing. (Note that this is the exact issue that I mentioned when we discussed this previously. Internally at Google, these indexes are assigned as an inherent part of the cluster system; every binary running on the cluster knows its own index, so it doesn't need to be configured via a control plane. And I don't think it's a good fit for the control plane at all, which is why I said that I thought it would be hard to build deterministic subsetting for gRPC in OSS.) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. So first of all I would like to mention that if we want to do better than random subsetting we must use client_id, or some other parameter(s) that are client-specific. Second, implementing subsetting via EDS (like we discussed here) has exactly the same problems, so we can’t block this proposal and at the same time recommend people to do that. Leaving people without the guidance on how to do subsetting is also a bad option. Now let’s try to do some brainstorming: Option 1: Relax the requirement to have an immutable index per client. If we do that the client_id calculation becomes trivial:
Of course this implementation will lead to a lot of connection churn, but I can provide some reasons why this may be acceptable in practise for many use-cases:
Option 2: Keep the proposal as-is and rethink the requirements for xDS cacheability You probably won’t accept this option, but I still want to mention it for the sake of completeness and provide some arguments in support of it.
Option 3: Use RTDS to store client_id. If we do that, the LB definition in CDS will contain a reference to the value defined in RTDS. This will keep CDS cacheable while we can make a convention that RTDS should never be cached (which I assume is already the case). Conceptually this is also similar to SDS (secret discovery service) Secrets, such as certificates, are never cacheable and are always different per client. Option 4: Add some pluggable architecture to get client_id We can assume that every client will run a separate plugin which will be responsible for providing client_id. This plugin can take the ID either from the deployment system or from DNS or from some other source. Conceptual this is similar to what grpc does for managing certificates There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
That's true, but in that case the problem is limited to EDS; it does not affect the other resource types. Putting a parameter that needs to be generated dynamically into CDS makes the problem worse. So I agree that this is an existing problem, but it's also one that I want to try to solve in the long term, and I don't want to make it worse here, because that will make the problem harder to solve later.
I don't think I understand this proposal. This seems to assume that (a) the client knows its own DNS name, (b) the client can do a DNS lookup, (c) that the name will return a large list of IP addresses, and (d) that the order of the addresses in the DNS result will be random enough that this will yield a useful ID. While it is certainly possible to construct an environment where those assumptions are true, I don't think they are likely to be true in the common case -- and if any one of them is false, it seems like this won't work properly. If the real goal here is just to try to randomize the clients, we could just as easily have each client generate a random number at startup and use that. That would probably work reasonably if there are large numbers of clients, but I think it would not be sufficient in cases with a relatively small number of clients.
I don't see how this will lead to connection churn. In fact, regardless of what algorithm is used to compute the client_id, I don't see why it would lead to connection churn. I think we can just have the LB policy compute the client_id when it is created and then stick with that value for its entire lifetime, so there's no need for any churn. Or am I missing something here?
I'm not sure that's true in all cases. We're talking about two different types of overhead here: the overhead of maintaining an extra connection that you may not need is mainly memory usage, whereas the overhead of establishing new connections is mainly CPU and network usage (e.g., SSL handshaking). Depending on how often the connection churn happens and which resources are more expensive in a given environment, this could actually be quite a bit worse than just dealing with the additional connections.
I don't think this is an xDS vs. non-xDS question. I think the fundamental principle here is that we should not mix static config data with dynamic endpoint assignment data. Note that even the resolver API enshrines this separation: the resolver returns the list of addresses and the service config as two separate pieces of data. So even if you were going to do something like this in a custom resolver, I think the right way to do it would be to have the resolver apply the subsetting and return only the chosen addresses, in which case there would be no need for a subsetting LB policy in the first place. So yes, I do feel strongly that we should not have such a field in CDS, at least not as the only way to do this. But I don't think that removing the xDS part of this design actually solves the problem, because I think the same problem exists in a custom resolver that injects the service config.
I think those cases can be made cacheable by using dynamic parameters, as described in xRFC TP2.
At truly massive scale, I'm not sure this is sufficient. I think you wind up eventually needing to introduce read-only caching xDS proxies, and you can't do that without cacheability.
This requires storing info about each individual client. Given a sufficiently large number of clients, I don't think this scales.
gRPC doesn't support RTDS -- that's an Envoy-only concept. (In fact, "runtime" as implemented in Envoy does not exist in gRPC and probably never will.)
This option is feasible, but it would require a lot of machinery. We'd basically need a new extension point with its own registry just for the plugins that could be used in this one LB policy. That's doable but not ideal. Another option would be to define a few built-in mechanisms for setting the client_id. I think I could live with "set this explicit value" (your current proposal) as one of the options, as long as (a) there are other viable options, (b) this option is not the default, and (c) we document all of the shortcomings of this approach. Unfortunately, the only other even semi-viable option I see is the one I mentioned above, where the client just generates a random number, which won't work right with a small enough set of clients. I'd like to hear input from @ejona86 and @dfawley on this as well. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
What is better about this proposal than doing subsetting via EDS? If the control plane can manage the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Agree.
I agree with this principle, that's why I was trying to suggest alternative approaches, such as using a separate xDS resource type (like RTDS) for dynamic data, or externalizing dynamic data provisioning to some sort of plugins (like certificate plugins) I don't think that we should transform this principle to "never serve dynamic data through xDS" because IMO it is very limiting and many other use-cases (like, for example, shading) may need such dynamic data.
Most of our clients are envoys, so until dynamic parameters support is added to envoy it is not possible. But I agree that fundamentally it solves the problem.
Not sure I fully agree here. If a control-plane can serve EDS it should have access to a datastore with all endpoints anyway. Every client usually is a server for some other downstream clients, so in practice the requirement to store IP per client is already satisfied by all control-planes that serves EDS.
I can generalize the idea: we can create a new xDS type that is specifically designed to store non-cashable dynamic data. Every other resource that need to reference such data will add a pointer to this new resource type instead of embedding dynamic data directly, which solves the problem of cashability for CDS and other static resources.
Maybe we can generalize "certificate provider plugins" to "dynamic config plugins", and design it in a way that can cover multiple use-cases? The plugin can return opaque data which will be interpreted by consumers. I am mostly doing brainstorming here, so feel free to dismiss my ideas if it is not feasible.
This sounds good to me. I can document and implement the options after we agree on the option list.
I can test this out, but my intuition that this will work no better than simple random subsetting (where every client simply selects N random backends) This option most likely won't work for all our use-cases, because in practice it results in a big imbalance of requests on the backend size (this is explained in more details in the Google's subsetting paper that I references in this gRFC) However, this is a very simple and practical option and I am not at all agains implementing it as well. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Here are a few reasons:
Mostly my reasoning is that it is always worth trying to push a solution for some common problem upstream before implementing a custom one. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Okay, thanks, I understand the intent of this proposal better now. However, I think it still rests on a bunch of assumptions that won't be true in many cases. For example, what if the client has multiple NICs, each with a different IP address? Also, this assumes the use of DNS -- what if the environment uses a name service other than DNS?
I think you're assuming use of a service mesh topology, but that's not the only use-case to consider. There are many environments in which the clients are distributed and are not themselves servers, so the control plane would not otherwise be aware of them.
That would certainly be better than embedding it in the CDS resource. However, this would require us to make the mechanism for determining the client_id pluggable rather than simply supporting a fixed set of built-in options, because I don't think we want this LB policy to depend on xDS.
The certificate provider plugin APIs are all fairly specific to what it's doing, so I don't think there's a good way to generalize the mechanism to serve both use-cases. But we could certainly duplicate that paradigm if it provided some advantage -- although I'm not sure that's the case here. The main difference between the certificate provider paradigm and just having a plugin specified directly in the LB policy config is that in the certificate provider case, the decision of which plugin to use is actually made by the deployment, not by the control plane. This makes sense for certs, since there's a lot of deployment-specific machinery that determines how the cert is provided, and the control plane does not want to have to worry about which mechanism is actually used on each deployment, nor does it want to have to generate different variants of the LDS and CDS resources for each deployment. However, in this case, it's not clear to me that there's any benefit in having the decision of what plugin to use determined by the deployment (although I'm open to hearing counter-arguments if there's a reason why there might be), so it seems like it would be much simpler to just have the plugin be specified directly in the LB policy config.
I agree that avoiding the need for wheel reinvention is a good goal, but I'm not sure this proposal actually accomplishes it. Ultimately, as this discussion illustrates, I think the hard part of this is actually determining the client_id, and it's not at all clear to me that it's easier for a control plane to do that than it is to just compute the subsets itself. The main difference seems to be whether the subsetting algorithm itself is implemented on the client side or in the control plane, and that particular piece of code seems much less complex than the additional machinery needed to move it to the client side.
Won't you have exactly this same problem by needing to encode the client_id in the CDS resource? You'll still need to send a different client_id to each client, right? I'll also point out that this issue of resource fan-out is exactly the issue of xDS resource cacheability, which you were previously arguing that we don't need. It seems like you're actually relying on this property for EDS, but you're arguing to break this property for CDS, which seems very counter-intuitive to me. Taking all of this into account, here's where I'm currently landing:
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Can't it be shared just as well on the control plane, if not easier? Most control planes are Go, with some Java. Client-side there's Envoy, gRPC C, Java, Go, Node. For you specifically, you'd only need to implement it in a single control plane, and it'd work for all the clients. An important detail here is that all client implementations must match exactly to get the benefits. Having it in the control plane avoids accidental implementation drift degrading performance.
Sure, but as Mark mentioned that applies to CDS as well, so how is using CDS a better fit than EDS? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I am not aware about any open-source control-plane that could potentially implement subsetting. Projects like go-control-plane and java-control-plane are mostly just collections of low level primitives that allow implementing xDS protocol more easily. Those projects don't have any notion of endpoint storage and can't serve EDS as well as implement subsetting. Of course we can implement it in our own control-plane, but it then gRPC and/or envoy implement a standard way of doing subsetting we will be left with our own custom solution and unclear migration path to the standard one. This proposal is an attempt to define the standard way of doing subsetting with gRPC. If it fails - we'll implement a custom solution either in our own control-plane or via a custom grpc load-balancer or resolver. Another important detail is that we need a solution that works not only for xDS consumers, as most of our users don't use xDS yet. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is really a continuation of the discussion about client_id, but I want to start a separate thread to keep it more focused. What I am trying to do is to take a step back and think about options that don't involve client_id. One obvious option is a fully random subsetting, where every client selects N backends at random. The problem with this option is that resulting load on the backends end up being not even, which is described in more details here But what if we can make it better by using WRR as a child balancer? My intuition is that WRR should make clients to send more requests to the least utilized backends within the subset, and if we have good subset diversity and relatively large subsets, WRR can, at least partially, compensate for backend load imbalance. If this is not enough we can extend this idea by creating a "load aware subset balancer" The balancer will do what I described above, but additionally it will periodically remove the most utilized backend from its subset, if the deviation in load for this backend diverged too far from the mean for the whole subset. This will probably require an additional ORCA metric, which reports pure CPU load on the backends. (I think the metric that is used by WRR is more @markdroth what do you think? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
WRR definitely does help balance load within the subset. However, I think that solves the problem only for workloads where every client generates approximately the same load. If the subsets are not evenly distributed and you happen to have a set of clients that generate more load than others and they all happen to be talking to the same set of backends, then you will still wind up with uneven load across all backends.
If WRR is using the right weight for balancing load across the backends, then it's not clear to me that this kind of outlier detection-style backend ejection is actually necessary. If one particular backend's weight is much lower than the others, WRR should give it much less traffic anyway, so ejecting the backend wouldn't really change anything. Also, if the problem is that a particular set of clients is generating higher load than other clients, then ejecting some of the backends in the subset seems like it could result in even worse load imbalance, by forcing those heavy-traffic clients onto an even smaller set of backends. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. (To be clear, BTW, I am not necessarily objecting to implementing random subsetting -- there may very well be workloads for which that works fine. I'm just pointing out other factors to consider here when evaluating whether a given algorithm will work for your particular workload.) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I think this argument can be applied to any subsetting implementation, including the Google's one as well as Twitter's Aperture. All those algorithms don't take into account potential difference between clients and only balance number of connections per backend, not req per backend. I guess the way it works is by grouping clients into groups of identical instances. My intuition is that random subsetting should deal with this particular problem (different types of clients) just fine because the resulting subsets are fully random, which makes the probability of any group of clients talking to the same subset of backends almost neglectable.
I think I described it poorly. When I said eject what I really meant is first eject the backend and then replace it with another random one, so we will always have exactly N backends per client. The problem with just WRR is that it can balance only requests and not connections per backend, and the proposed solution could also be used to balance the connections. For example, we can implement the load metric to take into account the number of incoming connections. instead of CPU load. Then top level subsetting LB will be balancing just the number of connections per backend and child LB (wrr) will be balancing load within the subset. I think that by tuning different types of load metrics we can achieve perfect load distribution across all backends, though I fully agree that it might not be necessary and just subsetting + wrr might be enough What I am going to do is to test all those options on a synthetic benchmark. I'll get back to this PR when I have some results. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I got some results testing random subsetting with WRR Test setup
Here are the results: round robin wrr random subsetting + wrr (subset size = 10) First of all, the resulting number of connections look quite bad. Second, wrr with default load function doesn't actually do anything to make the resulting CPU load better. WRR only takes into account "cost per request", which is different for slow and fast servers, but is identical for the servers that are in the same group but happen to receive different number of connections. To fix this problem I changed the definition of CPULoad parameter and multiplied it by the number of incoming connections. (this is the thing I call "scaled wrr") random subsetting + scaled wrr (subset size = 10) random subsetting + scaled wrr (subset size = 30) Open questions
Proposed plan
@markdroth what do you think? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. TL;DR; of our offline subsetting discussion:
To address the last point I want to test how Proportional–integral–derivative controller works for balancing the load. It was suggested to me by some folks at gRPC conf and it is actually used in practice by some companies to correct the imbalance generated by random subsetting. It is also specifically designed to prevent oscillations and provide some mathematical guaranties about that. If PID controller approach doesn't work or end up being not acceptable either I'll five up on this proposal. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We've got some really interesting results while testing PID controller in combination with random subsetting. After we test it with real apps in production, I am going to close this proposal and instead open a new one, which will define 2 new LB policies:
@markdroth @ejona86 does this sound ok to you? PID controller test results
If the signal is positive we slowly increase the weight, proportionally to the value of the signal. The opposite happens is the signal is negative
Test setup
Test 1: plain wrr with no subsetting Test 1: PID controller + subsetting (subset_size = 20) Test 2: plain wrr with no subsetting Test 2: PID controller + subsetting (subset_size = 20) Test 1: PID controller + subsetting (subset_size = 5) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I can't fully respond, but I also don't want to leave you hanging without hearing anything back. Those results look really nice. The PID worked well. Subsetting behaved as expected. You threw a lot of ugly cases at it and it held up. I'm excited. There's been some separate conversations that just so happen to have happened in parallel which may impact the random subsetting... In a conversation about very low QPS clients we found ourselves talking about dynamic subset sizing. @markdroth was going to figure out how much overlap there was with this, IIRC. FYI, I'm on vacation Oct 10th through Oct 18th. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I finally can report some result here: we tested PID balancer for a few apps in prod environment and learned a few important lessons about how we should tune it. Here is the result of applying PID balancer in one of our production DCs for an app that uses random subsetting.
@markdroth @ejona86 does this sound good? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Those last two could be in one gRFC, since the PID balancer is to use the new ORCA functionality. |
||
|
||
// subset_size indicates how many backends every client will be connected to. | ||
// Default is 10. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm tempted to not have a default here (or make it MAX_INT). I know for WRR we believe the minimum subset size should be 20 for many workloads. But there's so much that goes into this that is usage-specific. |
||
google.protobuf.UInt32Value subset_size = 2; | ||
|
||
// sort_addresses indicates whether the LB should sort addresses by IP | ||
// before applying the subsetting algorithm. This might be useful | ||
// in cases when the resolver doesn't provide a stable order or | ||
// when it could order addresses differently depending on the client. | ||
// Default is false. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The default for sorting should probably be true. Most resolvers don't provide a stable ordering of results -- in fact, many of them explicitly randomize the order, because otherwise you don't get proper connection-level load balancing when your clients use pick_first. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree to have it enabled by default. I'd even be tempted to always sort, because it is easy to accidentally mess this up and hard to debug. Is performance the only reason not to sort all the time? |
||
google.protobuf.BoolValue sort_addresses = 3; | ||
|
||
// The config for the child policy. | ||
repeated LoadBalancingConfig child_policy = 4; | ||
} | ||
``` | ||
|
||
### Handling Parent/Resolver Updates | ||
|
||
When the resolver updates the list of addresses, or the LB config changes, Deterministic subsetting LB will run the subsetting algorithm, described above, to filter the endpoint list. Then it will create a new resolver state with the filtered list of the addresses and pass it to the child LB. Attributes and service config from the old resolver state will be copied to the new one. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think the last sentence of this paragraph is incorrect and should be removed. Every update the LB policy gets will include all necessary attributes and LB policy config; there is no need to copy anything from the previous state. And doing that kind of copying could result in incorrect behavior if any of that information was intentionally changed between the two updates. |
||
|
||
### Handling Subchannel Connectivity State Notifications | ||
|
||
Deterministic subsetting LB will simply redirect all requests to the child LB without doing any additional processing. This also applies to all other callbacks in the LB interface, besides the one that handles resolver and config updates (which is described in the previous section). This is possible because deterministic subsetting LB doesn't store or manage sub-connections - it acts as a simple filter on the resolver state, and that's why it can redirect all actual work to the child LB. | ||
|
||
### xDS Integration | ||
|
||
Deterministic subsetting LB won't depend on xDS in any way. People may choose to initialize it by directly providing service config. If they do this, they will have to figure out how to correctly initialize and keep updating client_index. There are ways to do that, but describing them is out of scope for this gRFC. The main point is that the LB itself will be xDS agnostic. | ||
|
||
However, the main intended usage of this LB is via xDS. The xDS control plane is a natural place to aggregate information about all available clients and backends and assign client_index. | ||
|
||
#### Changes to xDS API | ||
|
||
`deterministic_subsetting` will be added as a new LB policy. | ||
|
||
```textproto | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. nit: This should just be proto. Textproto is for formatting a message instance. You can also add this code styling to the "LB Policy Config and Parameters" section. |
||
package envoy.extensions.load_balancing_policies.deterministic_subsetting.v3; | ||
|
||
message DeterministicSubsetting { | ||
google.protobuf.UInt32Value client_index = 1; | ||
google.protobuf.UInt32Value subset_size = 2; | ||
google.protobuf.BoolValue sort_addresses = 3; | ||
repeated LoadBalancingConfig child_policy = 4; | ||
} | ||
``` | ||
|
||
As you can see, the fields in this policy match exactly the fields in the deterministic subsetting LB service config. | ||
|
||
#### Integration with xDS LB Policy Registry | ||
As described in [gRFC A52][A52], gRPC has an LB policy registry, which maintains a list of converters. Every converter translates xDS LB policy to the corresponding service config. In order to allow using the Deterministic subsetting LB policy via xDS, the only thing that needs to be done is providing a corresponding converter function. The function implementation will be trivial as the fields in the xDS LB policy will match exactly the fields in the service config. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Doesn't look like you're actually adding a link to gRFC A52 anywhere, so this isn't being formatted correctly. |
||
|
||
## Rationale | ||
### Alternatives Considered: Deterministic Aperture | ||
Apart from Google's Deterministic Subsetting algorithm, we also considered [Twitter's Deterministic Aperture](https://patentimages.storage.googleapis.com/1a/09/5a/14392e76e5d8c5/US11119827.pdf) algorithm to handle subsetting. A critical difference in Twitter's algorithm is that it "splits" endpoints into multiple subsets. This is well demonstrated in their [ring diagrams](https://twitter.github.io/finagle/guide/ApertureLoadBalancers.html#deterministic-subsetting), where you can see a single service split into multiple subsets proportionately (i.e. 1/3 in one subset and 2/3 in the other). This ensures that even with simple child policies like `p2c`, the proportion of traffic sent to each backend across all subsets is balanced. | ||
|
||
Though this balancing is valuable, we opted for Google's algorithm for several reasons: | ||
* Both algorithms are sufficient in reducing connection count, which is the primary drawback of `round_robin`. | ||
* Both algorithms provide better load distribution guarantees than `pick_first`. | ||
* When the client to backend ratio requires overlapping subsets, deterministic subsetting provides the additional guarantee of unique subsets. Because the aperture algorithm is built around a fixed ring, when clients \* aperture exceeds the number of servers, aperture takes "laps" over the ring, producing similar subsets for several clients. Deterministic subsetting strives to reduce the risk of misbehaving endpoints by distributing them more randomly across client subsets. | ||
* In order to take advantage of the partial weighting of backends, Twitter's algorithm would require passing weight information from parent balancer to child balancer. The appropriate balancer to handle this sort of weighting would be weighted\_round\_robin, but weighted\_round\_robin currently calculates weights based on server load alone, and it is not designed to consider this additional weighting information. We opted for the solution that allowed us to keep this existing balancer as is. Though deterministic subsetting does not guarantee even load distribution across subsets, its diverse subset paired with weighted\_round\_robin as the child policy within subsets should be sufficient for most use cases. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It's true that the gRPC WRR policy described in gRFC A58 uses weight information provided by the endpoints. However, there is a separate community-contributed proposal open (although admittedly hasn't been active for a while) to provide an Envoy-style WRR policy that uses weights specified by the control plane in pending gRFC A34. The two policies are not mutually exclusive -- we can support both if someone wants to contribute the latter. gRPC C-core does pass endpoint weights down from the xDS control plane to the LB policy, but it's up to the individual LB policy as to whether it uses that information. The only policy we currently have that does so is the ring_hash policy. I believe gRPC Go passes the weight through as well, but I don't know about Java. And even if they didn't, I don't think it would be hard to add this plumbing for any gRPC implementation that supports xDS. My reason for mentioning all of this is just to say that the need for endpoint weights from the control plane is not necessarily a reason not to support deterministic aperature subsetting in the general case. But it would make sense to say that deterministic aperature would not work with a control plane that does not send weight information or with an LB policy like gRPC's WRR that uses out-of-band weight information. |
||
|
||
## Implementation | ||
DataDog will provide Go and Java implementations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can list me as the approver.