From 7326b4d36d9bee93fa77c4979eb1fcb92c3bc639 Mon Sep 17 00:00:00 2001 From: Sergey Matyukevich Date: Tue, 25 Jul 2023 10:37:32 -0600 Subject: [PATCH 01/13] Add Subsetting lb policy --- A66-deterministic-subsetting-lb-policy.md | 53 +++++++++++++++++++++++ 1 file changed, 53 insertions(+) create mode 100644 A66-deterministic-subsetting-lb-policy.md diff --git a/A66-deterministic-subsetting-lb-policy.md b/A66-deterministic-subsetting-lb-policy.md new file mode 100644 index 000000000..e2bb67393 --- /dev/null +++ b/A66-deterministic-subsetting-lb-policy.md @@ -0,0 +1,53 @@ +A58: `deterministic_subsetting` LB policy. +---- +* Author(s): @s-matyukevich +* Approver: +* Status: Draft +* Implemented in: +* Last updated: [Date] +* Discussion at: + +## Abstract + +Add support for the `deterministic_subsetting` load balancing configured via xDS. + +## Background + +Currently, gRPC is lacking a way to select a subset of backend endpoints and load-balance requests between them. Mostly people are using 2 extremes: `pick_first` and `round_robin`. With`pick_first` the resulting load distribution on the backend servers ends up being very uneven, especially during server rollouts. With `round_rogin` gRPC generates a lot of unnecessary connections, that eat up resource on both clients and servers. The proposed LB policy provides a middlegroud between those 2 extremes. + +### Related Proposals: + +## Proposal + +Introduce a new LB policy `deterministic_subsetting`. This policy selects a subset of addresses and passes them to the clild LB policy in such a way that: +* Every client is connected to at most N backends. If there are not enough backends the policy falls back to its child policy. +* The number of resulting connections per backend is as balanced as possible. +* Every client is connected to a different subset of backends. + +### Subsetting algorithm + +The LB policy will implement the algorithm desribed in [Site Reliability Engineering: How Google Runs Production Systems, Chapter 20](https://sre.google/sre-book/load-balancing-datacenter/#a-subset-selection-algorithm-deterministic-subsetting-eKsdcaUm) with the additional modification mentioned in the "Deterministic subsetting" section of the [Reinventing Backend Subsetting at Google](https://queue.acm.org/detail.cfm?id=3570937) paper. Here is the relevant quote: + +``` +This is the algorithm as previously described in Site Reliability Engineering: How Google Runs Production Systems, Chapter 20) but one improvement remains that can be made by balancing the leftover tasks in each group. The simplest way to achieve this is by choosing (before shuffling) which tasks will be leftovers in a round-robin fashion. For example, the first group of frontend tasks would choose {0, 1} to be leftovers and then shuffle the remaining tasks to get subsets {8, 3, 9, 2} and {4, 6, 5, 7}, and then the second group of frontend tasks would choose {2, 3} to be leftovers and shuffle the remaining tasks to get subsets {9, 7, 1, 6} and {0, 5, 4, 8}. This additional balancing ensures that all backend tasks are evenly excluded from consideration, producing a better distribution. +``` + +### LB Policy Config and Parameters + +The `deterministic_subsetting` LB policy config will be as follows. + +``` +``` + +## Rationale + +[A discussion of alternate approaches and the trade offs, advantages, and disadvantages of the specified approach.] + + +## Implementation + +[A description of the steps in the implementation, who will do them, and when. If a particular language is going to get the implementation first, this section should list the proposed order.] + +## Open issues (if applicable) + +[A discussion of issues relating to this proposal for which the author does not know the solution. This section may be omitted if there are none.] \ No newline at end of file From 5370169f3907e112e906e90bcd06cb961d5fe4f0 Mon Sep 17 00:00:00 2001 From: Sergey Matyukevich Date: Thu, 27 Jul 2023 14:02:04 -0600 Subject: [PATCH 02/13] more details on the proposal --- A66-deterministic-subsetting-lb-policy.md | 75 +++++++++++++++++++++-- 1 file changed, 69 insertions(+), 6 deletions(-) diff --git a/A66-deterministic-subsetting-lb-policy.md b/A66-deterministic-subsetting-lb-policy.md index e2bb67393..094374b79 100644 --- a/A66-deterministic-subsetting-lb-policy.md +++ b/A66-deterministic-subsetting-lb-policy.md @@ -29,15 +29,82 @@ Introduce a new LB policy `deterministic_subsetting`. This policy selects a subs The LB policy will implement the algorithm desribed in [Site Reliability Engineering: How Google Runs Production Systems, Chapter 20](https://sre.google/sre-book/load-balancing-datacenter/#a-subset-selection-algorithm-deterministic-subsetting-eKsdcaUm) with the additional modification mentioned in the "Deterministic subsetting" section of the [Reinventing Backend Subsetting at Google](https://queue.acm.org/detail.cfm?id=3570937) paper. Here is the relevant quote: ``` -This is the algorithm as previously described in Site Reliability Engineering: How Google Runs Production Systems, Chapter 20) but one improvement remains that can be made by balancing the leftover tasks in each group. The simplest way to achieve this is by choosing (before shuffling) which tasks will be leftovers in a round-robin fashion. For example, the first group of frontend tasks would choose {0, 1} to be leftovers and then shuffle the remaining tasks to get subsets {8, 3, 9, 2} and {4, 6, 5, 7}, and then the second group of frontend tasks would choose {2, 3} to be leftovers and shuffle the remaining tasks to get subsets {9, 7, 1, 6} and {0, 5, 4, 8}. This additional balancing ensures that all backend tasks are evenly excluded from consideration, producing a better distribution. +This is the algorithm as previously described in Site Reliability Engineering: How Google Runs Production Systems, Chapter 20 but one improvement remains that can be made by balancing the leftover tasks in each group. The simplest way to achieve this is by choosing (before shuffling) which tasks will be leftovers in a round-robin fashion. For example, the first group of frontend tasks would choose {0, 1} to be leftovers and then shuffle the remaining tasks to get subsets {8, 3, 9, 2} and {4, 6, 5, 7}, and then the second group of frontend tasks would choose {2, 3} to be leftovers and shuffle the remaining tasks to get subsets {9, 7, 1, 6} and {0, 5, 4, 8}. This additional balancing ensures that all backend tasks are evenly excluded from consideration, producing a better distribution. ``` +### Characteristics of the selected algorithm + +* Every client connects to exactly `subset_size` backends. +* Clients are evenly distributed between backends. The most connected backend recieves at most 2 more connections than the least connected one. +* Backends and randomly shuffled between clients. If a small subset of backends looses connectivity, the algoitm maximizes chances that the backends from this subset will be evenly distributed between clients. +* The algoritm generates some, potentially significant, connection charn during backends scale or redeployment. This might be a problem for very small subset sizes. + ### LB Policy Config and Parameters The `deterministic_subsetting` LB policy config will be as follows. ``` +message LoadBalancingConfig { + oneof policy { + DeterministicSubsettingLbConfig deterministic_subsetting = 21 [json_name = "deterministic_subsetting"]; + } +} + +message DeterministicSubsettingLbConfig { + // client_index is an index within the + // interval [0..N-1], where N is the total number of clients. + // Every client must have a unque index. + google.protobuf.UInt32Value client_index = 1; + + // subset_size indicates to how many backends every client will be connected to. + // Default is 10. + google.protobuf.UInt32Value subset_size = 2; + + // sort_addresses indicates whether the LB should sort addresses by IP + // before applying subsetting algorithm. This might be usefull + // in cases when the resolver doesn't provide a stable order or + // when it could order addresses differently depending on the client. + // Default is false. + google.protobuf.BoolValue sort_addresses = 3; + + // The config for the child policy. + repeated LoadBalancingConfig child_policy = 4; +} +``` + +### Handling Parent/Resolver Updates + +When resolver updates the list of addresses, or the LB config changes Deterministic subsetting LB will run the subsetting algorithm, described above, to filter the endpoint list. Then it will create new resolver state with the filtered list of the addresses and passes it to the clild LB. Attributes and service config from the old resolver state will be copied to the new one. + +### Handling Subchannel Connectivity State Notifications + +Deterministic subsetting LB will simply redirect all requests to the child LB without doing any additional processing. This also applies to all other calllbacks in the LB interface, besides the one that hadles resolver and config updates (which is described in the previous section). This is possible because deterministic subsetting LB doesn't store or manage sub connections - it acts as a simple filter on the resolver state, and that's why it can redirect all actual work to the child LB. + +### xDS Integration + +Deterministic subsetting LB won't depend on xDS in any way. People may chose to initialize it by directly providing service config. If they do this they will have to figure out how to correctly initialize and keep updating client_index. There are way how to do that, but desribing them is out of scope for this gRPC. The main point is that the LB itself will be xDS agnostic. + +However, the main intended usage of this LB is via xDS. xDS control plane is a natural place to aggregate information about all avilable clients and backends and assign client_index. + +#### Changes to xDS API + +`deterministic_subsetting` will be added as a new LB policy. + +```textproto +package envoy.extensions.load_balancing_policies.deterministic_subsetting.v3; + +message DeterministicSubsetting { + google.protobuf.UInt32Value client_index = 1; + google.protobuf.UInt32Value subset_size = 2; + google.protobuf.BoolValue sort_addresses = 3; + repeated LoadBalancingConfig child_policy = 4; +} ``` +As you can see, the fields in this posicy match exactly the fields in the deterministic subsetting LB service config. + +#### Integration with xDS LB Policy Registry + +As described in [gRFC A52][A52], grpc has an LB policy registry, which maintains alist of convertes. Every converter translates xDS LB policy to the corresponding service config. In order to allow using Deterministic subsetting LB policy via xDS the only thing that needs to be done is providing a corresponding converter function. The function implementation will be trivial as the fields in the xDS LB policy will match exactly the fields in the service config. ## Rationale @@ -46,8 +113,4 @@ The `deterministic_subsetting` LB policy config will be as follows. ## Implementation -[A description of the steps in the implementation, who will do them, and when. If a particular language is going to get the implementation first, this section should list the proposed order.] - -## Open issues (if applicable) - -[A discussion of issues relating to this proposal for which the author does not know the solution. This section may be omitted if there are none.] \ No newline at end of file +DataDog will provide Go and Java implementations. From 3fdfb762f27da7db115f59c60d8d66e670678519 Mon Sep 17 00:00:00 2001 From: Sergey Matyukevich Date: Thu, 27 Jul 2023 14:16:08 -0600 Subject: [PATCH 03/13] correct grammar mistakes --- A66-deterministic-subsetting-lb-policy.md | 82 +++++++++++------------ 1 file changed, 40 insertions(+), 42 deletions(-) diff --git a/A66-deterministic-subsetting-lb-policy.md b/A66-deterministic-subsetting-lb-policy.md index 094374b79..294120a12 100644 --- a/A66-deterministic-subsetting-lb-policy.md +++ b/A66-deterministic-subsetting-lb-policy.md @@ -9,35 +9,36 @@ A58: `deterministic_subsetting` LB policy. ## Abstract -Add support for the `deterministic_subsetting` load balancing configured via xDS. +Add support for the `deterministic_subsetting` load balancing policy configured via xDS. ## Background -Currently, gRPC is lacking a way to select a subset of backend endpoints and load-balance requests between them. Mostly people are using 2 extremes: `pick_first` and `round_robin`. With`pick_first` the resulting load distribution on the backend servers ends up being very uneven, especially during server rollouts. With `round_rogin` gRPC generates a lot of unnecessary connections, that eat up resource on both clients and servers. The proposed LB policy provides a middlegroud between those 2 extremes. +Currently, gRPC is lacking a way to select a subset of backend endpoints and load-balance requests between them. Mostly people are using two extremes: `pick_first` and `round_robin`. With `pick_first`, the resulting load distribution on the backend servers ends up being very uneven, especially during server rollouts. With `round_robin`, gRPC generates a lot of unnecessary connections, which consume resources on both clients and servers. The proposed LB policy provides a middle ground between these two extremes. ### Related Proposals: ## Proposal -Introduce a new LB policy `deterministic_subsetting`. This policy selects a subset of addresses and passes them to the clild LB policy in such a way that: -* Every client is connected to at most N backends. If there are not enough backends the policy falls back to its child policy. +Introduce a new LB policy, `deterministic_subsetting`. This policy selects a subset of addresses and passes them to the child LB policy in such a way that: +* Every client is connected to at most N backends. If there are not enough backends, the policy falls back to its child policy. * The number of resulting connections per backend is as balanced as possible. * Every client is connected to a different subset of backends. ### Subsetting algorithm -The LB policy will implement the algorithm desribed in [Site Reliability Engineering: How Google Runs Production Systems, Chapter 20](https://sre.google/sre-book/load-balancing-datacenter/#a-subset-selection-algorithm-deterministic-subsetting-eKsdcaUm) with the additional modification mentioned in the "Deterministic subsetting" section of the [Reinventing Backend Subsetting at Google](https://queue.acm.org/detail.cfm?id=3570937) paper. Here is the relevant quote: +The LB policy will implement the algorithm described in [Site Reliability Engineering: How Google Runs Production Systems, Chapter 20](https://sre.google/sre-book/load-balancing-datacenter/#a-subset-selection-algorithm-deterministic-subsetting-eKsdcaUm) with the additional modification mentioned in the "Deterministic subsetting" section of the [Reinventing Backend Subsetting at Google](https://queue.acm.org/detail.cfm?id=3570937) paper. Here is the relevant quote: ``` This is the algorithm as previously described in Site Reliability Engineering: How Google Runs Production Systems, Chapter 20 but one improvement remains that can be made by balancing the leftover tasks in each group. The simplest way to achieve this is by choosing (before shuffling) which tasks will be leftovers in a round-robin fashion. For example, the first group of frontend tasks would choose {0, 1} to be leftovers and then shuffle the remaining tasks to get subsets {8, 3, 9, 2} and {4, 6, 5, 7}, and then the second group of frontend tasks would choose {2, 3} to be leftovers and shuffle the remaining tasks to get subsets {9, 7, 1, 6} and {0, 5, 4, 8}. This additional balancing ensures that all backend tasks are evenly excluded from consideration, producing a better distribution. ``` + ### Characteristics of the selected algorithm * Every client connects to exactly `subset_size` backends. -* Clients are evenly distributed between backends. The most connected backend recieves at most 2 more connections than the least connected one. -* Backends and randomly shuffled between clients. If a small subset of backends looses connectivity, the algoitm maximizes chances that the backends from this subset will be evenly distributed between clients. -* The algoritm generates some, potentially significant, connection charn during backends scale or redeployment. This might be a problem for very small subset sizes. +* Clients are evenly distributed between backends. The most connected backend receives at most 2 more connections than the least connected one. +* Backends are randomly shuffled between clients. If a small subset of backends loses connectivity, the algorithm maximizes the chances that the backends from this subset will be evenly distributed between clients. +* The algorithm generates some, potentially significant, connection churn during backend scaling or redeployment. This might be a problem for very small subset sizes. ### LB Policy Config and Parameters @@ -45,46 +46,46 @@ The `deterministic_subsetting` LB policy config will be as follows. ``` message LoadBalancingConfig { - oneof policy { - DeterministicSubsettingLbConfig deterministic_subsetting = 21 [json_name = "deterministic_subsetting"]; - } + oneof policy { + DeterministicSubsettingLbConfig deterministic_subsetting = 21 [json_name = "deterministic_subsetting"]; + } } message DeterministicSubsettingLbConfig { - // client_index is an index within the - // interval [0..N-1], where N is the total number of clients. - // Every client must have a unque index. - google.protobuf.UInt32Value client_index = 1; - - // subset_size indicates to how many backends every client will be connected to. - // Default is 10. - google.protobuf.UInt32Value subset_size = 2; - - // sort_addresses indicates whether the LB should sort addresses by IP - // before applying subsetting algorithm. This might be usefull - // in cases when the resolver doesn't provide a stable order or - // when it could order addresses differently depending on the client. - // Default is false. - google.protobuf.BoolValue sort_addresses = 3; - - // The config for the child policy. - repeated LoadBalancingConfig child_policy = 4; + // client_index is an index within the + // interval [0..N-1], where N is the total number of clients. + // Every client must have a unique index. + google.protobuf.UInt32Value client_index = 1; + + // subset_size indicates how many backends every client will be connected to. + // Default is 10. + google.protobuf.UInt32Value subset_size = 2; + + // sort_addresses indicates whether the LB should sort addresses by IP + // before applying the subsetting algorithm. This might be useful + // in cases when the resolver doesn't provide a stable order or + // when it could order addresses differently depending on the client. + // Default is false. + google.protobuf.BoolValue sort_addresses = 3; + + // The config for the child policy. + repeated LoadBalancingConfig child_policy = 4; } -``` + ### Handling Parent/Resolver Updates -When resolver updates the list of addresses, or the LB config changes Deterministic subsetting LB will run the subsetting algorithm, described above, to filter the endpoint list. Then it will create new resolver state with the filtered list of the addresses and passes it to the clild LB. Attributes and service config from the old resolver state will be copied to the new one. +When the resolver updates the list of addresses, or the LB config changes, Deterministic subsetting LB will run the subsetting algorithm, described above, to filter the endpoint list. Then it will create a new resolver state with the filtered list of the addresses and pass it to the child LB. Attributes and service config from the old resolver state will be copied to the new one. ### Handling Subchannel Connectivity State Notifications -Deterministic subsetting LB will simply redirect all requests to the child LB without doing any additional processing. This also applies to all other calllbacks in the LB interface, besides the one that hadles resolver and config updates (which is described in the previous section). This is possible because deterministic subsetting LB doesn't store or manage sub connections - it acts as a simple filter on the resolver state, and that's why it can redirect all actual work to the child LB. +Deterministic subsetting LB will simply redirect all requests to the child LB without doing any additional processing. This also applies to all other callbacks in the LB interface, besides the one that handles resolver and config updates (which is described in the previous section). This is possible because deterministic subsetting LB doesn't store or manage sub-connections - it acts as a simple filter on the resolver state, and that's why it can redirect all actual work to the child LB. ### xDS Integration -Deterministic subsetting LB won't depend on xDS in any way. People may chose to initialize it by directly providing service config. If they do this they will have to figure out how to correctly initialize and keep updating client_index. There are way how to do that, but desribing them is out of scope for this gRPC. The main point is that the LB itself will be xDS agnostic. +Deterministic subsetting LB won't depend on xDS in any way. People may choose to initialize it by directly providing service config. If they do this, they will have to figure out how to correctly initialize and keep updating client_index. There are ways to do that, but describing them is out of scope for this gRPC. The main point is that the LB itself will be xDS agnostic. -However, the main intended usage of this LB is via xDS. xDS control plane is a natural place to aggregate information about all avilable clients and backends and assign client_index. +However, the main intended usage of this LB is via xDS. The xDS control plane is a natural place to aggregate information about all available clients and backends and assign client_index. #### Changes to xDS API @@ -100,17 +101,14 @@ message DeterministicSubsetting { repeated LoadBalancingConfig child_policy = 4; } ``` -As you can see, the fields in this posicy match exactly the fields in the deterministic subsetting LB service config. -#### Integration with xDS LB Policy Registry +As you can see, the fields in this policy match exactly the fields in the deterministic subsetting LB service config. -As described in [gRFC A52][A52], grpc has an LB policy registry, which maintains alist of convertes. Every converter translates xDS LB policy to the corresponding service config. In order to allow using Deterministic subsetting LB policy via xDS the only thing that needs to be done is providing a corresponding converter function. The function implementation will be trivial as the fields in the xDS LB policy will match exactly the fields in the service config. +#### Integration with xDS LB Policy Registry +As described in [gRFC A52][A52], gRPC has an LB policy registry, which maintains a list of converters. Every converter translates xDS LB policy to the corresponding service config. In order to allow using the Deterministic subsetting LB policy via xDS, the only thing that needs to be done is providing a corresponding converter function. The function implementation will be trivial as the fields in the xDS LB policy will match exactly the fields in the service config. ## Rationale - -[A discussion of alternate approaches and the trade offs, advantages, and disadvantages of the specified approach.] - +[A discussion of alternate approaches and theThe proposal will be revised when implementation experience is obtained.] ## Implementation - -DataDog will provide Go and Java implementations. +DataDog will provide Go and Java implementations. \ No newline at end of file From a2456e592cd436d650e18ecbd0ec11eec611f456 Mon Sep 17 00:00:00 2001 From: Joy Bestourous Date: Thu, 27 Jul 2023 18:18:51 -0400 Subject: [PATCH 04/13] rationale and typos --- A66-deterministic-subsetting-lb-policy.md | 23 +++++++++++++++-------- 1 file changed, 15 insertions(+), 8 deletions(-) diff --git a/A66-deterministic-subsetting-lb-policy.md b/A66-deterministic-subsetting-lb-policy.md index 294120a12..9a42be3d4 100644 --- a/A66-deterministic-subsetting-lb-policy.md +++ b/A66-deterministic-subsetting-lb-policy.md @@ -1,9 +1,9 @@ -A58: `deterministic_subsetting` LB policy. +A66: `deterministic_subsetting` LB policy. ---- -* Author(s): @s-matyukevich +* Author(s): @s-matyukevich, @joybestourous * Approver: * Status: Draft -* Implemented in: +* Implemented in: Go, Java * Last updated: [Date] * Discussion at: @@ -71,7 +71,7 @@ message DeterministicSubsettingLbConfig { // The config for the child policy. repeated LoadBalancingConfig child_policy = 4; } - +``` ### Handling Parent/Resolver Updates @@ -83,9 +83,9 @@ Deterministic subsetting LB will simply redirect all requests to the child LB wi ### xDS Integration -Deterministic subsetting LB won't depend on xDS in any way. People may choose to initialize it by directly providing service config. If they do this, they will have to figure out how to correctly initialize and keep updating client_index. There are ways to do that, but describing them is out of scope for this gRPC. The main point is that the LB itself will be xDS agnostic. +Deterministic subsetting LB won't depend on xDS in any way. People may choose to initialize it by directly providing service config. If they do this, they will have to figure out how to correctly initialize and keep updating client\_index. There are ways to do that, but describing them is out of scope for this gRPC. The main point is that the LB itself will be xDS agnostic. -However, the main intended usage of this LB is via xDS. The xDS control plane is a natural place to aggregate information about all available clients and backends and assign client_index. +However, the main intended usage of this LB is via xDS. The xDS control plane is a natural place to aggregate information about all available clients and backends and assign client\_index. #### Changes to xDS API @@ -108,7 +108,14 @@ As you can see, the fields in this policy match exactly the fields in the determ As described in [gRFC A52][A52], gRPC has an LB policy registry, which maintains a list of converters. Every converter translates xDS LB policy to the corresponding service config. In order to allow using the Deterministic subsetting LB policy via xDS, the only thing that needs to be done is providing a corresponding converter function. The function implementation will be trivial as the fields in the xDS LB policy will match exactly the fields in the service config. ## Rationale -[A discussion of alternate approaches and theThe proposal will be revised when implementation experience is obtained.] +### Alternatives Considered: Deterministic Aperture +Apart from Google's Deterministic Subsetting algorithm, we also considered [Twitter's Deterministic Aperture](https://patentimages.storage.googleapis.com/1a/09/5a/14392e76e5d8c5/US11119827.pdf) algorithm to handle subsetting. A critical difference in Twitter's algorithm is that it "splits" endpoints into multiple subsets. This is well demonstrated in their [ring diagrams](https://twitter.github.io/finagle/guide/ApertureLoadBalancers.html#deterministic-subsetting), where you can see a single service split into multiple subsets proportionately (i.e. 1/3 in one subset and 2/3 in the other). This ensures that even with simple child policies like `p2c`, the proportion of traffic sent to each backend across all subsets is balanced. + +Though this balancing is valuable, we opted for Google's algorithm for several reasons: +* Both algorithms are sufficient in reducing connection count, which is the primary drawback of `round_robin`. +* Both algorithms provide better load distribution guarantees than `pick_first`. +* When the client to backend ratio requires overlapping subsets, deterministic subsetting provides the additional guarantee of unique subsets. Because the aperture algorithm is built around a fixed ring, when clients \* aperture exceeds the number of servers, aperture takes "laps" over the ring, producing similar subsets for several clients. Deterministic subsetting strives to reduce the risk of misbehaving endpoints by distributing them more randomly across client subsets. +* In order to take advantage of the partial weighting of backends, Twitter's algorithm would require passing weight information from parent balancer to child balancer. The appropriate balancer to handle this sort of weighting would be weighted\_round\_robin, but weighted\_round\_robin currently calculates weights based on server load alone, and it is not designed to consider this additional weighting information. We opted for the solution that allowed us to keep this existing balancer as is. Though deterministic subsetting does not guarantee even load distribution across subsets, its diverse subset paired with weighted\_round\_robin as the child policy within subsets should be sufficient for most use cases. ## Implementation -DataDog will provide Go and Java implementations. \ No newline at end of file +DataDog will provide Go and Java implementations. From fa0275eb7d2d3539da035ee528bc274c30092d1f Mon Sep 17 00:00:00 2001 From: Sergey Matyukevich Date: Fri, 28 Jul 2023 08:51:58 -0600 Subject: [PATCH 05/13] Update A66-deterministic-subsetting-lb-policy.md Co-authored-by: Antoine Tollenaere --- A66-deterministic-subsetting-lb-policy.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/A66-deterministic-subsetting-lb-policy.md b/A66-deterministic-subsetting-lb-policy.md index 9a42be3d4..cecdd6fa5 100644 --- a/A66-deterministic-subsetting-lb-policy.md +++ b/A66-deterministic-subsetting-lb-policy.md @@ -13,7 +13,7 @@ Add support for the `deterministic_subsetting` load balancing policy configured ## Background -Currently, gRPC is lacking a way to select a subset of backend endpoints and load-balance requests between them. Mostly people are using two extremes: `pick_first` and `round_robin`. With `pick_first`, the resulting load distribution on the backend servers ends up being very uneven, especially during server rollouts. With `round_robin`, gRPC generates a lot of unnecessary connections, which consume resources on both clients and servers. The proposed LB policy provides a middle ground between these two extremes. +Currently, gRPC is lacking a way to select a subset of endpoints available from the resolver and load-balance requests between them. Out of the box, users have the choice between two extremes: `pick_first` which sends all requests to one random backend, and `round_robin` which sends requests to all available backends. `pick_first` has poor connection balancing when the number of client is not much higher than the number of servers because of the birthday paradox. The problem is exacerbated during rollouts because `pick_first` does not change endpoint on resolver updates if the current subchannel remains `READY`. `round_robin` results in every servers having as many connections open as there are clients, which is unnecessarily costly when there are many clients, and makes local decisions load balancing (such as outlier detection) less precise. ### Related Proposals: From 837fb7ad9eba32bd830b74ec07ccf02810768a33 Mon Sep 17 00:00:00 2001 From: Sergey Matyukevich Date: Fri, 28 Jul 2023 08:52:47 -0600 Subject: [PATCH 06/13] Update A66-deterministic-subsetting-lb-policy.md Co-authored-by: Antoine Tollenaere --- A66-deterministic-subsetting-lb-policy.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/A66-deterministic-subsetting-lb-policy.md b/A66-deterministic-subsetting-lb-policy.md index cecdd6fa5..001333006 100644 --- a/A66-deterministic-subsetting-lb-policy.md +++ b/A66-deterministic-subsetting-lb-policy.md @@ -20,7 +20,7 @@ Currently, gRPC is lacking a way to select a subset of endpoints available from ## Proposal Introduce a new LB policy, `deterministic_subsetting`. This policy selects a subset of addresses and passes them to the child LB policy in such a way that: -* Every client is connected to at most N backends. If there are not enough backends, the policy falls back to its child policy. +* Every client is connected to exactly N backends. If there are not enough backends, the policy falls back to its child policy. * The number of resulting connections per backend is as balanced as possible. * Every client is connected to a different subset of backends. From 59bedf3221be72aec281aa611252a3a9ef703290 Mon Sep 17 00:00:00 2001 From: Sergey Matyukevich Date: Fri, 28 Jul 2023 11:00:10 -0600 Subject: [PATCH 07/13] add some pseudocode --- A66-deterministic-subsetting-lb-policy.md | 60 ++++++++++++++++++++++- 1 file changed, 58 insertions(+), 2 deletions(-) diff --git a/A66-deterministic-subsetting-lb-policy.md b/A66-deterministic-subsetting-lb-policy.md index 001333006..ccd186764 100644 --- a/A66-deterministic-subsetting-lb-policy.md +++ b/A66-deterministic-subsetting-lb-policy.md @@ -22,7 +22,7 @@ Currently, gRPC is lacking a way to select a subset of endpoints available from Introduce a new LB policy, `deterministic_subsetting`. This policy selects a subset of addresses and passes them to the child LB policy in such a way that: * Every client is connected to exactly N backends. If there are not enough backends, the policy falls back to its child policy. * The number of resulting connections per backend is as balanced as possible. -* Every client is connected to a different subset of backends. +* For any 2 given clients the probability of them connecting on the same or very similar subset of backends is as low as possible. ### Subsetting algorithm @@ -32,11 +32,67 @@ The LB policy will implement the algorithm described in [Site Reliability Engine This is the algorithm as previously described in Site Reliability Engineering: How Google Runs Production Systems, Chapter 20 but one improvement remains that can be made by balancing the leftover tasks in each group. The simplest way to achieve this is by choosing (before shuffling) which tasks will be leftovers in a round-robin fashion. For example, the first group of frontend tasks would choose {0, 1} to be leftovers and then shuffle the remaining tasks to get subsets {8, 3, 9, 2} and {4, 6, 5, 7}, and then the second group of frontend tasks would choose {2, 3} to be leftovers and shuffle the remaining tasks to get subsets {9, 7, 1, 6} and {0, 5, 4, 8}. This additional balancing ensures that all backend tasks are evenly excluded from consideration, producing a better distribution. ``` +Here is the implementation of this algorithm in pseudocode with detailed comments. + +``` +func filter_addresses(addresses, subset_size, client_index, sort_addresses) + backend_count = addresses.length() + if backend_count > subset_size { + // if we don't have enough addresses to cover the desired subset size, just return the whole list + return addresses + } + + if sort_addresses { + // sort address list because the algorithm assumes that the initial + // order of the addresses is the same for every client + addresses.sort() + } + + // subset_count indicates how much clients we can have so that every cleint is connected to exactly + // subset_size distinct backends and no 2 clients connect to the same backend. + subset_count = backend_count / subset_size + + // Given subset_count we not cat divide clients by rounds. Every round have exactly subset_count clients. + // round indicates the index of the round for the current client based on its index. + round = client_index / subset_count + + // There might be some lefover backends withing every round in cases when backend_count % subset_size != 0 + // excluded_count indicates how mumanych leftover backends we have on every round. + excluded_count = backend_count % subset_size + + // We want to choose what backends are excluded in a round robin fashion before shufling the backends. + // This way we make sure that for any 2 backends the difference of how many times they are excluded is at most 1. + // excluded_start indicates the index of the first excluded backend + excluded_start := (round * excluded_count) % backend_count + // excluded_start indicates the index after the last excluded backend. + // It could wrap around the end of the addresses list. + excluded_end := (excluded_start + excluded_count) % backend_count + + if excluded_start < excluded_end { + // excluded_end doesn't wrap, exclude addresses from the interval [excluded_start, excluded_end) + } else { + // excluded_end wraps around the end of the addresses list, exclude intervals [0:excluded_start] and [excluded_end, end_of_the_array) + } + + // randomly shuffle the addresses to increase subset diversity. Use round as seed to make sure + // that clients within the same round shuffle addresses identically. + addresses.shuffle(seed: round) + + // subset_id is the index for the current client withing the round + subset_id := client_index % subset_count + + // calculate start start and end of the resulting subset + start = subsetId * subset_size + end = start + subset_size + return addresses[start: end] +} +``` + ### Characteristics of the selected algorithm * Every client connects to exactly `subset_size` backends. -* Clients are evenly distributed between backends. The most connected backend receives at most 2 more connections than the least connected one. +* Clients are evenly distributed between backends. The most connected backend receives at most 2 more connections than the least connected one. One comes from the fact that a backend can be excluded one more time than some other backend. And another one comes from the fact that there might be not enough client to completely fill the last round. * Backends are randomly shuffled between clients. If a small subset of backends loses connectivity, the algorithm maximizes the chances that the backends from this subset will be evenly distributed between clients. * The algorithm generates some, potentially significant, connection churn during backend scaling or redeployment. This might be a problem for very small subset sizes. From bbf20d9f1744d4836c4521e8a74b855dd60a920c Mon Sep 17 00:00:00 2001 From: Sergey Matyukevich Date: Mon, 31 Jul 2023 08:03:20 -0600 Subject: [PATCH 08/13] Update A66-deterministic-subsetting-lb-policy.md Co-authored-by: Anna Grigoriu <125993460+anna-grigoriu@users.noreply.github.com> --- A66-deterministic-subsetting-lb-policy.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/A66-deterministic-subsetting-lb-policy.md b/A66-deterministic-subsetting-lb-policy.md index ccd186764..10a3b647f 100644 --- a/A66-deterministic-subsetting-lb-policy.md +++ b/A66-deterministic-subsetting-lb-policy.md @@ -139,7 +139,7 @@ Deterministic subsetting LB will simply redirect all requests to the child LB wi ### xDS Integration -Deterministic subsetting LB won't depend on xDS in any way. People may choose to initialize it by directly providing service config. If they do this, they will have to figure out how to correctly initialize and keep updating client\_index. There are ways to do that, but describing them is out of scope for this gRPC. The main point is that the LB itself will be xDS agnostic. +Deterministic subsetting LB won't depend on xDS in any way. People may choose to initialize it by directly providing service config. If they do this, they will have to figure out how to correctly initialize and keep updating client\_index. There are ways to do that, but describing them is out of scope for this gRFC. The main point is that the LB itself will be xDS agnostic. However, the main intended usage of this LB is via xDS. The xDS control plane is a natural place to aggregate information about all available clients and backends and assign client\_index. From bbbeca2ae12067a36f36a48488cafceb6d4a54cf Mon Sep 17 00:00:00 2001 From: Sergey Matyukevich Date: Mon, 31 Jul 2023 08:04:26 -0600 Subject: [PATCH 09/13] rename --- A68-deterministic-subsetting-lb-policy.md | 177 ++++++++++++++++++++++ 1 file changed, 177 insertions(+) create mode 100644 A68-deterministic-subsetting-lb-policy.md diff --git a/A68-deterministic-subsetting-lb-policy.md b/A68-deterministic-subsetting-lb-policy.md new file mode 100644 index 000000000..0ddcc6391 --- /dev/null +++ b/A68-deterministic-subsetting-lb-policy.md @@ -0,0 +1,177 @@ +A68: `deterministic_subsetting` LB policy. +---- +* Author(s): @s-matyukevich, @joybestourous +* Approver: +* Status: Draft +* Implemented in: Go, Java +* Last updated: [Date] +* Discussion at: + +## Abstract + +Add support for the `deterministic_subsetting` load balancing policy configured via xDS. + +## Background + +Currently, gRPC is lacking a way to select a subset of endpoints available from the resolver and load-balance requests between them. Out of the box, users have the choice between two extremes: `pick_first` which sends all requests to one random backend, and `round_robin` which sends requests to all available backends. `pick_first` has poor connection balancing when the number of client is not much higher than the number of servers because of the birthday paradox. The problem is exacerbated during rollouts because `pick_first` does not change endpoint on resolver updates if the current subchannel remains `READY`. `round_robin` results in every servers having as many connections open as there are clients, which is unnecessarily costly when there are many clients, and makes local decisions load balancing (such as outlier detection) less precise. + +### Related Proposals: + +## Proposal + +Introduce a new LB policy, `deterministic_subsetting`. This policy selects a subset of addresses and passes them to the child LB policy in such a way that: +* Every client is connected to exactly N backends. If there are not enough backends, the policy falls back to its child policy. +* The number of resulting connections per backend is as balanced as possible. +* For any 2 given clients the probability of them connecting on the same or very similar subset of backends is as low as possible. + +### Subsetting algorithm + +The LB policy will implement the algorithm described in [Site Reliability Engineering: How Google Runs Production Systems, Chapter 20](https://sre.google/sre-book/load-balancing-datacenter/#a-subset-selection-algorithm-deterministic-subsetting-eKsdcaUm) with the additional modification mentioned in the "Deterministic subsetting" section of the [Reinventing Backend Subsetting at Google](https://queue.acm.org/detail.cfm?id=3570937) paper. Here is the relevant quote: + +``` +This is the algorithm as previously described in Site Reliability Engineering: How Google Runs Production Systems, Chapter 20 but one improvement remains that can be made by balancing the leftover tasks in each group. The simplest way to achieve this is by choosing (before shuffling) which tasks will be leftovers in a round-robin fashion. For example, the first group of frontend tasks would choose {0, 1} to be leftovers and then shuffle the remaining tasks to get subsets {8, 3, 9, 2} and {4, 6, 5, 7}, and then the second group of frontend tasks would choose {2, 3} to be leftovers and shuffle the remaining tasks to get subsets {9, 7, 1, 6} and {0, 5, 4, 8}. This additional balancing ensures that all backend tasks are evenly excluded from consideration, producing a better distribution. +``` + +Here is the implementation of this algorithm in pseudocode with detailed comments. + +``` +func filter_addresses(addresses, subset_size, client_index, sort_addresses) + backend_count = addresses.length() + if backend_count > subset_size { + // if we don't have enough addresses to cover the desired subset size, just return the whole list + return addresses + } + + if sort_addresses { + // sort address list because the algorithm assumes that the initial + // order of the addresses is the same for every client + addresses.sort() + } + + // subset_count indicates how much clients we can have so that every cleint is connected to exactly + // subset_size distinct backends and no 2 clients connect to the same backend. + subset_count = backend_count / subset_size + + // Given subset_count we not cat divide clients by rounds. Every round have exactly subset_count clients. + // round indicates the index of the round for the current client based on its index. + round = client_index / subset_count + + // There might be some lefover backends withing every round in cases when backend_count % subset_size != 0 + // excluded_count indicates how mumanych leftover backends we have on every round. + excluded_count = backend_count % subset_size + + // We want to choose what backends are excluded in a round robin fashion before shufling the backends. + // This way we make sure that for any 2 backends the difference of how many times they are excluded is at most 1. + // excluded_start indicates the index of the first excluded backend + excluded_start := (round * excluded_count) % backend_count + // excluded_start indicates the index after the last excluded backend. + // It could wrap around the end of the addresses list. + excluded_end := (excluded_start + excluded_count) % backend_count + + if excluded_start < excluded_end { + // excluded_end doesn't wrap, exclude addresses from the interval [excluded_start, excluded_end) + } else { + // excluded_end wraps around the end of the addresses list, exclude intervals [0:excluded_start] and [excluded_end, end_of_the_array) + } + + // randomly shuffle the addresses to increase subset diversity. Use round as seed to make sure + // that clients within the same round shuffle addresses identically. + addresses.shuffle(seed: round) + + // subset_id is the index for the current client withing the round + subset_id := client_index % subset_count + + // calculate start start and end of the resulting subset + start = subsetId * subset_size + end = start + subset_size + return addresses[start: end] +} +``` + + +### Characteristics of the selected algorithm + +* Every client connects to exactly `subset_size` backends. +* Clients are evenly distributed between backends. The most connected backend receives at most 2 more connections than the least connected one. One comes from the fact that a backend can be excluded one more time than some other backend. And another one comes from the fact that there might be not enough client to completely fill the last round. +* Backends are randomly shuffled between clients. If a small subset of backends loses connectivity, the algorithm maximizes the chances that the backends from this subset will be evenly distributed between clients. +* The algorithm generates some, potentially significant, connection churn during backend scaling or redeployment. This might be a problem for very small subset sizes. + +### LB Policy Config and Parameters + +The `deterministic_subsetting` LB policy config will be as follows. + +``` +message LoadBalancingConfig { + oneof policy { + DeterministicSubsettingLbConfig deterministic_subsetting = 21 [json_name = "deterministic_subsetting"]; + } +} + +message DeterministicSubsettingLbConfig { + // client_index is an index within the + // interval [0..N-1], where N is the total number of clients. + // Every client must have a unique index. + google.protobuf.UInt32Value client_index = 1; + + // subset_size indicates how many backends every client will be connected to. + // Default is 10. + google.protobuf.UInt32Value subset_size = 2; + + // sort_addresses indicates whether the LB should sort addresses by IP + // before applying the subsetting algorithm. This might be useful + // in cases when the resolver doesn't provide a stable order or + // when it could order addresses differently depending on the client. + // Default is false. + google.protobuf.BoolValue sort_addresses = 3; + + // The config for the child policy. + repeated LoadBalancingConfig child_policy = 4; +} +``` + +### Handling Parent/Resolver Updates + +When the resolver updates the list of addresses, or the LB config changes, Deterministic subsetting LB will run the subsetting algorithm, described above, to filter the endpoint list. Then it will create a new resolver state with the filtered list of the addresses and pass it to the child LB. Attributes and service config from the old resolver state will be copied to the new one. + +### Handling Subchannel Connectivity State Notifications + +Deterministic subsetting LB will simply redirect all requests to the child LB without doing any additional processing. This also applies to all other callbacks in the LB interface, besides the one that handles resolver and config updates (which is described in the previous section). This is possible because deterministic subsetting LB doesn't store or manage sub-connections - it acts as a simple filter on the resolver state, and that's why it can redirect all actual work to the child LB. + +### xDS Integration + +Deterministic subsetting LB won't depend on xDS in any way. People may choose to initialize it by directly providing service config. If they do this, they will have to figure out how to correctly initialize and keep updating client\_index. There are ways to do that, but describing them is out of scope for this gRPC. The main point is that the LB itself will be xDS agnostic. + +However, the main intended usage of this LB is via xDS. The xDS control plane is a natural place to aggregate information about all available clients and backends and assign client\_index. + +#### Changes to xDS API + +`deterministic_subsetting` will be added as a new LB policy. + +```textproto +package envoy.extensions.load_balancing_policies.deterministic_subsetting.v3; + +message DeterministicSubsetting { + google.protobuf.UInt32Value client_index = 1; + google.protobuf.UInt32Value subset_size = 2; + google.protobuf.BoolValue sort_addresses = 3; + repeated LoadBalancingConfig child_policy = 4; +} +``` + +As you can see, the fields in this policy match exactly the fields in the deterministic subsetting LB service config. + +#### Integration with xDS LB Policy Registry +As described in [gRFC A52][A52], gRPC has an LB policy registry, which maintains a list of converters. Every converter translates xDS LB policy to the corresponding service config. In order to allow using the Deterministic subsetting LB policy via xDS, the only thing that needs to be done is providing a corresponding converter function. The function implementation will be trivial as the fields in the xDS LB policy will match exactly the fields in the service config. + +## Rationale +### Alternatives Considered: Deterministic Aperture +Apart from Google's Deterministic Subsetting algorithm, we also considered [Twitter's Deterministic Aperture](https://patentimages.storage.googleapis.com/1a/09/5a/14392e76e5d8c5/US11119827.pdf) algorithm to handle subsetting. A critical difference in Twitter's algorithm is that it "splits" endpoints into multiple subsets. This is well demonstrated in their [ring diagrams](https://twitter.github.io/finagle/guide/ApertureLoadBalancers.html#deterministic-subsetting), where you can see a single service split into multiple subsets proportionately (i.e. 1/3 in one subset and 2/3 in the other). This ensures that even with simple child policies like `p2c`, the proportion of traffic sent to each backend across all subsets is balanced. + +Though this balancing is valuable, we opted for Google's algorithm for several reasons: +* Both algorithms are sufficient in reducing connection count, which is the primary drawback of `round_robin`. +* Both algorithms provide better load distribution guarantees than `pick_first`. +* When the client to backend ratio requires overlapping subsets, deterministic subsetting provides the additional guarantee of unique subsets. Because the aperture algorithm is built around a fixed ring, when clients \* aperture exceeds the number of servers, aperture takes "laps" over the ring, producing similar subsets for several clients. Deterministic subsetting strives to reduce the risk of misbehaving endpoints by distributing them more randomly across client subsets. +* In order to take advantage of the partial weighting of backends, Twitter's algorithm would require passing weight information from parent balancer to child balancer. The appropriate balancer to handle this sort of weighting would be weighted\_round\_robin, but weighted\_round\_robin currently calculates weights based on server load alone, and it is not designed to consider this additional weighting information. We opted for the solution that allowed us to keep this existing balancer as is. Though deterministic subsetting does not guarantee even load distribution across subsets, its diverse subset paired with weighted\_round\_robin as the child policy within subsets should be sufficient for most use cases. + +## Implementation +DataDog will provide Go and Java implementations. From 1428640354b4459e0e0811f9168247e158d0f750 Mon Sep 17 00:00:00 2001 From: Sergey Matyukevich Date: Mon, 31 Jul 2023 08:15:57 -0600 Subject: [PATCH 10/13] Update A66-deterministic-subsetting-lb-policy.md Co-authored-by: Antoine Tollenaere --- A66-deterministic-subsetting-lb-policy.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/A66-deterministic-subsetting-lb-policy.md b/A66-deterministic-subsetting-lb-policy.md index 10a3b647f..bc86a95f6 100644 --- a/A66-deterministic-subsetting-lb-policy.md +++ b/A66-deterministic-subsetting-lb-policy.md @@ -48,7 +48,7 @@ func filter_addresses(addresses, subset_size, client_index, sort_addresses) addresses.sort() } - // subset_count indicates how much clients we can have so that every cleint is connected to exactly + // subset_count indicates how many clients we can have so that every client is connected to exactly // subset_size distinct backends and no 2 clients connect to the same backend. subset_count = backend_count / subset_size From 391c8905b6b3e0c874b51e4f6f29d321694b3aa8 Mon Sep 17 00:00:00 2001 From: Sergey Matyukevich Date: Mon, 31 Jul 2023 08:16:59 -0600 Subject: [PATCH 11/13] Update A66-deterministic-subsetting-lb-policy.md Co-authored-by: Antoine Tollenaere --- A66-deterministic-subsetting-lb-policy.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/A66-deterministic-subsetting-lb-policy.md b/A66-deterministic-subsetting-lb-policy.md index bc86a95f6..54080e02e 100644 --- a/A66-deterministic-subsetting-lb-policy.md +++ b/A66-deterministic-subsetting-lb-policy.md @@ -56,7 +56,7 @@ func filter_addresses(addresses, subset_size, client_index, sort_addresses) // round indicates the index of the round for the current client based on its index. round = client_index / subset_count - // There might be some lefover backends withing every round in cases when backend_count % subset_size != 0 + // There might be some lefover backends withing every round in cases when backend_count % subset_size != 0. // excluded_count indicates how mumanych leftover backends we have on every round. excluded_count = backend_count % subset_size From 0d86f1be039f9ddc0c5d4588b440271a6670bb53 Mon Sep 17 00:00:00 2001 From: Sergey Matyukevich Date: Mon, 31 Jul 2023 08:17:58 -0600 Subject: [PATCH 12/13] fix typos --- A68-deterministic-subsetting-lb-policy.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/A68-deterministic-subsetting-lb-policy.md b/A68-deterministic-subsetting-lb-policy.md index 0ddcc6391..6f165e1cb 100644 --- a/A68-deterministic-subsetting-lb-policy.md +++ b/A68-deterministic-subsetting-lb-policy.md @@ -52,12 +52,12 @@ func filter_addresses(addresses, subset_size, client_index, sort_addresses) // subset_size distinct backends and no 2 clients connect to the same backend. subset_count = backend_count / subset_size - // Given subset_count we not cat divide clients by rounds. Every round have exactly subset_count clients. + // Given subset_count we now can divide clients by rounds. Every round have exactly subset_count clients. // round indicates the index of the round for the current client based on its index. round = client_index / subset_count // There might be some lefover backends withing every round in cases when backend_count % subset_size != 0 - // excluded_count indicates how mumanych leftover backends we have on every round. + // excluded_count indicates how many leftover backends we have on every round. excluded_count = backend_count % subset_size // We want to choose what backends are excluded in a round robin fashion before shufling the backends. From f8c4777ac5862b4b0409d22fee453c3dc7568752 Mon Sep 17 00:00:00 2001 From: Sergey Matyukevich Date: Mon, 31 Jul 2023 11:58:00 -0600 Subject: [PATCH 13/13] remove duplicated file --- A66-deterministic-subsetting-lb-policy.md | 177 ---------------------- A68-deterministic-subsetting-lb-policy.md | 8 +- 2 files changed, 4 insertions(+), 181 deletions(-) delete mode 100644 A66-deterministic-subsetting-lb-policy.md diff --git a/A66-deterministic-subsetting-lb-policy.md b/A66-deterministic-subsetting-lb-policy.md deleted file mode 100644 index 54080e02e..000000000 --- a/A66-deterministic-subsetting-lb-policy.md +++ /dev/null @@ -1,177 +0,0 @@ -A66: `deterministic_subsetting` LB policy. ----- -* Author(s): @s-matyukevich, @joybestourous -* Approver: -* Status: Draft -* Implemented in: Go, Java -* Last updated: [Date] -* Discussion at: - -## Abstract - -Add support for the `deterministic_subsetting` load balancing policy configured via xDS. - -## Background - -Currently, gRPC is lacking a way to select a subset of endpoints available from the resolver and load-balance requests between them. Out of the box, users have the choice between two extremes: `pick_first` which sends all requests to one random backend, and `round_robin` which sends requests to all available backends. `pick_first` has poor connection balancing when the number of client is not much higher than the number of servers because of the birthday paradox. The problem is exacerbated during rollouts because `pick_first` does not change endpoint on resolver updates if the current subchannel remains `READY`. `round_robin` results in every servers having as many connections open as there are clients, which is unnecessarily costly when there are many clients, and makes local decisions load balancing (such as outlier detection) less precise. - -### Related Proposals: - -## Proposal - -Introduce a new LB policy, `deterministic_subsetting`. This policy selects a subset of addresses and passes them to the child LB policy in such a way that: -* Every client is connected to exactly N backends. If there are not enough backends, the policy falls back to its child policy. -* The number of resulting connections per backend is as balanced as possible. -* For any 2 given clients the probability of them connecting on the same or very similar subset of backends is as low as possible. - -### Subsetting algorithm - -The LB policy will implement the algorithm described in [Site Reliability Engineering: How Google Runs Production Systems, Chapter 20](https://sre.google/sre-book/load-balancing-datacenter/#a-subset-selection-algorithm-deterministic-subsetting-eKsdcaUm) with the additional modification mentioned in the "Deterministic subsetting" section of the [Reinventing Backend Subsetting at Google](https://queue.acm.org/detail.cfm?id=3570937) paper. Here is the relevant quote: - -``` -This is the algorithm as previously described in Site Reliability Engineering: How Google Runs Production Systems, Chapter 20 but one improvement remains that can be made by balancing the leftover tasks in each group. The simplest way to achieve this is by choosing (before shuffling) which tasks will be leftovers in a round-robin fashion. For example, the first group of frontend tasks would choose {0, 1} to be leftovers and then shuffle the remaining tasks to get subsets {8, 3, 9, 2} and {4, 6, 5, 7}, and then the second group of frontend tasks would choose {2, 3} to be leftovers and shuffle the remaining tasks to get subsets {9, 7, 1, 6} and {0, 5, 4, 8}. This additional balancing ensures that all backend tasks are evenly excluded from consideration, producing a better distribution. -``` - -Here is the implementation of this algorithm in pseudocode with detailed comments. - -``` -func filter_addresses(addresses, subset_size, client_index, sort_addresses) - backend_count = addresses.length() - if backend_count > subset_size { - // if we don't have enough addresses to cover the desired subset size, just return the whole list - return addresses - } - - if sort_addresses { - // sort address list because the algorithm assumes that the initial - // order of the addresses is the same for every client - addresses.sort() - } - - // subset_count indicates how many clients we can have so that every client is connected to exactly - // subset_size distinct backends and no 2 clients connect to the same backend. - subset_count = backend_count / subset_size - - // Given subset_count we not cat divide clients by rounds. Every round have exactly subset_count clients. - // round indicates the index of the round for the current client based on its index. - round = client_index / subset_count - - // There might be some lefover backends withing every round in cases when backend_count % subset_size != 0. - // excluded_count indicates how mumanych leftover backends we have on every round. - excluded_count = backend_count % subset_size - - // We want to choose what backends are excluded in a round robin fashion before shufling the backends. - // This way we make sure that for any 2 backends the difference of how many times they are excluded is at most 1. - // excluded_start indicates the index of the first excluded backend - excluded_start := (round * excluded_count) % backend_count - // excluded_start indicates the index after the last excluded backend. - // It could wrap around the end of the addresses list. - excluded_end := (excluded_start + excluded_count) % backend_count - - if excluded_start < excluded_end { - // excluded_end doesn't wrap, exclude addresses from the interval [excluded_start, excluded_end) - } else { - // excluded_end wraps around the end of the addresses list, exclude intervals [0:excluded_start] and [excluded_end, end_of_the_array) - } - - // randomly shuffle the addresses to increase subset diversity. Use round as seed to make sure - // that clients within the same round shuffle addresses identically. - addresses.shuffle(seed: round) - - // subset_id is the index for the current client withing the round - subset_id := client_index % subset_count - - // calculate start start and end of the resulting subset - start = subsetId * subset_size - end = start + subset_size - return addresses[start: end] -} -``` - - -### Characteristics of the selected algorithm - -* Every client connects to exactly `subset_size` backends. -* Clients are evenly distributed between backends. The most connected backend receives at most 2 more connections than the least connected one. One comes from the fact that a backend can be excluded one more time than some other backend. And another one comes from the fact that there might be not enough client to completely fill the last round. -* Backends are randomly shuffled between clients. If a small subset of backends loses connectivity, the algorithm maximizes the chances that the backends from this subset will be evenly distributed between clients. -* The algorithm generates some, potentially significant, connection churn during backend scaling or redeployment. This might be a problem for very small subset sizes. - -### LB Policy Config and Parameters - -The `deterministic_subsetting` LB policy config will be as follows. - -``` -message LoadBalancingConfig { - oneof policy { - DeterministicSubsettingLbConfig deterministic_subsetting = 21 [json_name = "deterministic_subsetting"]; - } -} - -message DeterministicSubsettingLbConfig { - // client_index is an index within the - // interval [0..N-1], where N is the total number of clients. - // Every client must have a unique index. - google.protobuf.UInt32Value client_index = 1; - - // subset_size indicates how many backends every client will be connected to. - // Default is 10. - google.protobuf.UInt32Value subset_size = 2; - - // sort_addresses indicates whether the LB should sort addresses by IP - // before applying the subsetting algorithm. This might be useful - // in cases when the resolver doesn't provide a stable order or - // when it could order addresses differently depending on the client. - // Default is false. - google.protobuf.BoolValue sort_addresses = 3; - - // The config for the child policy. - repeated LoadBalancingConfig child_policy = 4; -} -``` - -### Handling Parent/Resolver Updates - -When the resolver updates the list of addresses, or the LB config changes, Deterministic subsetting LB will run the subsetting algorithm, described above, to filter the endpoint list. Then it will create a new resolver state with the filtered list of the addresses and pass it to the child LB. Attributes and service config from the old resolver state will be copied to the new one. - -### Handling Subchannel Connectivity State Notifications - -Deterministic subsetting LB will simply redirect all requests to the child LB without doing any additional processing. This also applies to all other callbacks in the LB interface, besides the one that handles resolver and config updates (which is described in the previous section). This is possible because deterministic subsetting LB doesn't store or manage sub-connections - it acts as a simple filter on the resolver state, and that's why it can redirect all actual work to the child LB. - -### xDS Integration - -Deterministic subsetting LB won't depend on xDS in any way. People may choose to initialize it by directly providing service config. If they do this, they will have to figure out how to correctly initialize and keep updating client\_index. There are ways to do that, but describing them is out of scope for this gRFC. The main point is that the LB itself will be xDS agnostic. - -However, the main intended usage of this LB is via xDS. The xDS control plane is a natural place to aggregate information about all available clients and backends and assign client\_index. - -#### Changes to xDS API - -`deterministic_subsetting` will be added as a new LB policy. - -```textproto -package envoy.extensions.load_balancing_policies.deterministic_subsetting.v3; - -message DeterministicSubsetting { - google.protobuf.UInt32Value client_index = 1; - google.protobuf.UInt32Value subset_size = 2; - google.protobuf.BoolValue sort_addresses = 3; - repeated LoadBalancingConfig child_policy = 4; -} -``` - -As you can see, the fields in this policy match exactly the fields in the deterministic subsetting LB service config. - -#### Integration with xDS LB Policy Registry -As described in [gRFC A52][A52], gRPC has an LB policy registry, which maintains a list of converters. Every converter translates xDS LB policy to the corresponding service config. In order to allow using the Deterministic subsetting LB policy via xDS, the only thing that needs to be done is providing a corresponding converter function. The function implementation will be trivial as the fields in the xDS LB policy will match exactly the fields in the service config. - -## Rationale -### Alternatives Considered: Deterministic Aperture -Apart from Google's Deterministic Subsetting algorithm, we also considered [Twitter's Deterministic Aperture](https://patentimages.storage.googleapis.com/1a/09/5a/14392e76e5d8c5/US11119827.pdf) algorithm to handle subsetting. A critical difference in Twitter's algorithm is that it "splits" endpoints into multiple subsets. This is well demonstrated in their [ring diagrams](https://twitter.github.io/finagle/guide/ApertureLoadBalancers.html#deterministic-subsetting), where you can see a single service split into multiple subsets proportionately (i.e. 1/3 in one subset and 2/3 in the other). This ensures that even with simple child policies like `p2c`, the proportion of traffic sent to each backend across all subsets is balanced. - -Though this balancing is valuable, we opted for Google's algorithm for several reasons: -* Both algorithms are sufficient in reducing connection count, which is the primary drawback of `round_robin`. -* Both algorithms provide better load distribution guarantees than `pick_first`. -* When the client to backend ratio requires overlapping subsets, deterministic subsetting provides the additional guarantee of unique subsets. Because the aperture algorithm is built around a fixed ring, when clients \* aperture exceeds the number of servers, aperture takes "laps" over the ring, producing similar subsets for several clients. Deterministic subsetting strives to reduce the risk of misbehaving endpoints by distributing them more randomly across client subsets. -* In order to take advantage of the partial weighting of backends, Twitter's algorithm would require passing weight information from parent balancer to child balancer. The appropriate balancer to handle this sort of weighting would be weighted\_round\_robin, but weighted\_round\_robin currently calculates weights based on server load alone, and it is not designed to consider this additional weighting information. We opted for the solution that allowed us to keep this existing balancer as is. Though deterministic subsetting does not guarantee even load distribution across subsets, its diverse subset paired with weighted\_round\_robin as the child policy within subsets should be sufficient for most use cases. - -## Implementation -DataDog will provide Go and Java implementations. diff --git a/A68-deterministic-subsetting-lb-policy.md b/A68-deterministic-subsetting-lb-policy.md index 6f165e1cb..73359082d 100644 --- a/A68-deterministic-subsetting-lb-policy.md +++ b/A68-deterministic-subsetting-lb-policy.md @@ -48,7 +48,7 @@ func filter_addresses(addresses, subset_size, client_index, sort_addresses) addresses.sort() } - // subset_count indicates how much clients we can have so that every cleint is connected to exactly + // subset_count indicates how many clients we can have so that every client is connected to exactly // subset_size distinct backends and no 2 clients connect to the same backend. subset_count = backend_count / subset_size @@ -56,7 +56,7 @@ func filter_addresses(addresses, subset_size, client_index, sort_addresses) // round indicates the index of the round for the current client based on its index. round = client_index / subset_count - // There might be some lefover backends withing every round in cases when backend_count % subset_size != 0 + // There might be some lefover backends withing every round in cases when backend_count % subset_size != 0. // excluded_count indicates how many leftover backends we have on every round. excluded_count = backend_count % subset_size @@ -139,9 +139,9 @@ Deterministic subsetting LB will simply redirect all requests to the child LB wi ### xDS Integration -Deterministic subsetting LB won't depend on xDS in any way. People may choose to initialize it by directly providing service config. If they do this, they will have to figure out how to correctly initialize and keep updating client\_index. There are ways to do that, but describing them is out of scope for this gRPC. The main point is that the LB itself will be xDS agnostic. +Deterministic subsetting LB won't depend on xDS in any way. People may choose to initialize it by directly providing service config. If they do this, they will have to figure out how to correctly initialize and keep updating client_index. There are ways to do that, but describing them is out of scope for this gRFC. The main point is that the LB itself will be xDS agnostic. -However, the main intended usage of this LB is via xDS. The xDS control plane is a natural place to aggregate information about all available clients and backends and assign client\_index. +However, the main intended usage of this LB is via xDS. The xDS control plane is a natural place to aggregate information about all available clients and backends and assign client_index. #### Changes to xDS API