Implement PickFirst load balancer #2570
nathanielford wants to merge 11 commits into hyperium:master
    fn start_connection_pass(&mut self, channel_controller: &mut dyn ChannelController) {
        self.current_index = 0;
        self.selected = None;
        if let Some(sc) = self.subchannels.get(0) {
Since subchannels are reused when reconciling address lists during a resolver update, the first subchannel may be in backoff (reporting TRANSIENT_FAILURE). Instead of unconditionally calling connect() on the first subchannel, pick_first should ideally find the first IDLE or CONNECTING subchannel.
Here is Go's implementation.
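As a rough illustration of the suggestion above, the following sketch scans a list of cached connectivity states and returns the first entry that is IDLE or CONNECTING. The `ConnectivityState` enum and the `first_connectable` helper are hypothetical stand-ins written for this example; they are not the types or functions used in the PR or in gRPC Go.

```rust
/// Local stand-in for the connectivity states named in the review comment.
/// (Assumption: the real type in the PR may have different variants or names.)
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ConnectivityState {
    Idle,
    Connecting,
    Ready,
    TransientFailure,
}

/// Returns the index of the first subchannel that is not in backoff,
/// i.e. the first one reporting IDLE or CONNECTING.
/// Returns `None` when every subchannel is in another state (or the list is empty).
pub fn first_connectable(states: &[ConnectivityState]) -> Option<usize> {
    states.iter().position(|s| {
        matches!(s, ConnectivityState::Idle | ConnectivityState::Connecting)
    })
}
```

With states `[TransientFailure, Idle, Idle]`, the helper returns `Some(1)`, so a connection pass would start at the second subchannel instead of calling connect() on one that is still in backoff.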
        Ok(())
    }

    fn subchannel_update(
The iteration over the subchannel list needs to be split into two passes, as described in A61. During the initial pass, pick_first will sequentially iterate over the subchannel list, ensuring it sees a TRANSIENT_FAILURE for each subchannel before proceeding to the next. Once all subchannels have failed, the first phase is complete. Now, pick_first will reconnect each subchannel as it reports IDLE. Ignoring the Happy Eyeballs timer, this corresponds to the circled portion of the gRFC:
To be clear, this is a Happy Eyeballs requirement, correct? I think we were going to try to put in a pick_first implementation without Happy Eyeballs first. If you think it's important, I can include that, but it was initially out of scope.
Technically, Happy Eyeballs involves only the interleaving of IPv4/IPv6 addresses and running a timer to resume iteration if subchannels are slow to fail. The first-pass/second-pass logic is independent of that. Given this, the Happy Eyeballs implementation probably accounts for only around 20% of the dev effort. If we don't add the two-pass logic, we would end up re-implementing a significant portion of the LB policy later. After discussing this with @dfawley, I suggest we increase the scope of this change to implement the two-pass logic without Happy Eyeballs. This would involve increasing the estimates for this work and reducing the estimates for the Happy Eyeballs work, which should be fine.
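The two-pass behavior discussed in this thread could be sketched as a small state machine, without any Happy Eyeballs timer. Everything here (`State`, `Action`, `TwoPass`, and their methods) is a hypothetical illustration written for this example, not code from the PR or from gRFC A61.

```rust
/// Local stand-in for subchannel connectivity states (assumed names).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum State {
    Idle,
    Connecting,
    Ready,
    TransientFailure,
}

/// What the policy should do in response to an update (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum Action {
    Connect(usize),
    Select(usize),
    FirstPassExhausted,
    Nothing,
}

pub struct TwoPass {
    len: usize,
    index: usize,          // subchannel being attempted during the first pass
    first_pass_done: bool, // true once every subchannel has failed once
}

impl TwoPass {
    pub fn new(len: usize) -> Self {
        Self { len, index: 0, first_pass_done: false }
    }

    /// Begins the first pass at the head of the list.
    pub fn start(&mut self) -> Action {
        self.index = 0;
        self.first_pass_done = false;
        if self.len > 0 { Action::Connect(0) } else { Action::Nothing }
    }

    /// Handles a state update from subchannel `i`.
    pub fn on_update(&mut self, i: usize, state: State) -> Action {
        match state {
            // A READY subchannel is selected regardless of the pass.
            State::Ready => Action::Select(i),
            // First pass: advance only after the attempted subchannel fails.
            State::TransientFailure if !self.first_pass_done && i == self.index => {
                self.index += 1;
                if self.index < self.len {
                    Action::Connect(self.index)
                } else {
                    self.first_pass_done = true;
                    Action::FirstPassExhausted
                }
            }
            // Second pass: reconnect each subchannel as it reports IDLE.
            State::Idle if self.first_pass_done => Action::Connect(i),
            _ => Action::Nothing,
        }
    }
}
```

This sketch deliberately leaves out one case: a subchannel that already returned to IDLE before the first pass ended would not report IDLE again, so a real policy would need to handle it when the first pass completes.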
    if let Some(attempting) = self.subchannels.get(self.current_index) {
        if attempting.address() == subchannel.address() {
            match state.connectivity_state {
                ConnectivityState::Ready => {
Since we have subchannel sharing in Rust, we would also need to consider this section of A61:
- pick_first needs to be prepared for any subchannel to report READY at any time
- When we choose a subchannel that has become successfully connected, we will unref all of the other subchannels
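A minimal sketch of the two points above, assuming subchannels are reference-counted handles: a READY report is accepted from any subchannel in the list, not only the one currently being attempted, and the policy then drops its references to all the others. `Subchannel`, `Policy`, and their methods are hypothetical names invented for this example.

```rust
use std::rc::Rc;

/// Hypothetical shared subchannel handle, identified by its address.
#[derive(Debug)]
pub struct Subchannel {
    pub address: String,
}

/// Hypothetical, heavily simplified policy state.
pub struct Policy {
    subchannels: Vec<Rc<Subchannel>>,
    selected: Option<Rc<Subchannel>>,
}

impl Policy {
    pub fn new(addresses: &[&str]) -> Self {
        let subchannels = addresses
            .iter()
            .map(|a| Rc::new(Subchannel { address: a.to_string() }))
            .collect();
        Self { subchannels, selected: None }
    }

    /// Called when the subchannel at `address` reports READY, whether or
    /// not it is the one currently being attempted. Returns `false` if the
    /// address is not among the subchannels this policy still holds.
    pub fn on_ready(&mut self, address: &str) -> bool {
        let Some(pos) = self.subchannels.iter().position(|sc| sc.address == address) else {
            return false;
        };
        let chosen = self.subchannels.swap_remove(pos);
        // Dropping the remaining handles releases this policy's references ("unref").
        self.subchannels.clear();
        self.selected = Some(chosen);
        true
    }

    pub fn selected_address(&self) -> Option<&str> {
        self.selected.as_ref().map(|sc| sc.address.as_str())
    }

    pub fn pending_count(&self) -> usize {
        self.subchannels.len()
    }
}
```

Because the handles are shared, dropping them here only releases this policy's references; another holder of the same subchannel would keep it alive.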
    }

    #[cfg(test)]
    mod test {
There are some tricky edge cases found by Go's e2e-style tests that we discovered when implementing a new dual-stack pick_first LB in Go. I added unit tests for them to ensure the balancer behaved as expected. These tests are mainly in two files:
- pickfirst_ext_test.go: These are e2e-style tests that create a gRPC channel to a fake server.
- pickfirst_test.go: These use a mock channel, similar to the tests in Rust.
It may not be possible to write e2e-style tests in Rust right now since we don't have the same test utilities as Go. However, we can still test the same scenarios using a mock channel.
I would recommend seeing if Gemini can convert these Go tests into tests for Rust, skipping the Happy Eyeballs ones. This would act as a conformance test. I did something similar to generate tests for credentials in the following way:
- Clone gRPC Go.
- Point the Gemini CLI to the test files for gRPC Go and gRPC Rust.
- Ask Gemini to create similar test cases for gRPC Rust.
If it gives decent results without much effort, that's great; otherwise, we can improve test coverage later.
…ss, and updated sync testing framework
…A61 endpoint handling
This commit implements the PickFirst load balancer policy for Tonic gRPC, focusing on:
- Efficient subchannel management with backoff preservation.
- "Stickiness" support: continuing to use an existing Ready subchannel if it remains in resolver updates.
- Compliance with gRFC A61: endpoints are now shuffled before being flattened into an address list, ensuring multiple addresses for a single endpoint (e.g., IPv4/IPv6) stay together.
- Clean state reset: subchannels and selected state are now cleared when receiving an empty address list.
- Alignment with the updated synchronous testing framework in master.
Includes comprehensive test coverage for basic connection, failover, stickiness, exhaustion, deterministic endpoint shuffling, de-duplication, and empty updates.
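The "shuffle endpoints, then flatten" ordering from this commit message can be illustrated with a small sketch. It assumes endpoints are plain lists of address strings and that duplicates are dropped after flattening; the function name `shuffle_then_flatten` and the caller-supplied `pick` closure (used in place of a random number generator so the result is deterministic) are hypothetical and not taken from the PR.

```rust
use std::collections::HashSet;

/// Shuffles whole endpoints (Fisher-Yates), then flattens them into one
/// address list and drops duplicate addresses, keeping the first occurrence.
/// `pick(i)` supplies the random draw for step `i`; it is reduced modulo
/// `i + 1`, so any `usize` is acceptable. (Hypothetical helper.)
pub fn shuffle_then_flatten<F>(mut endpoints: Vec<Vec<String>>, mut pick: F) -> Vec<String>
where
    F: FnMut(usize) -> usize,
{
    // Shuffle at endpoint granularity so that the addresses belonging to
    // one endpoint stay adjacent in the final list.
    for i in (1..endpoints.len()).rev() {
        let j = pick(i) % (i + 1);
        endpoints.swap(i, j);
    }

    // Flatten and de-duplicate while preserving order.
    let mut seen = HashSet::new();
    endpoints
        .into_iter()
        .flatten()
        .filter(|addr| seen.insert(addr.clone()))
        .collect()
}
```

Shuffling before flattening is what keeps, for example, an endpoint's IPv4 and IPv6 addresses next to each other; shuffling the flattened list would scatter them.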
…active failover
This change enhances the PickFirst load balancing policy to better support gRFC A61 (Happy Eyeballs) and improve connection establishment latency. Key changes:
- Implement IPv6/IPv4 address interleaving in `compile_address` to ensure subsequent connection attempts alternate between protocol families.
- Introduce a `subchannel_states` cache in `PickFirstPolicy` to track the connectivity status of managed subchannels.
- Refactor connection logic to use a `frontier_index` and proactively skip subchannels known to be in `TransientFailure` (e.g., during backoff).
- Update `advance_frontier` to safely maintain the index within the bounds of the address list, ensuring the policy remains reactive to recovery.
- Add deterministic unit tests for shuffling and interleaving logic.
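One way the address-family interleaving mentioned above might look is sketched below. This is not the PR's `compile_address`; the function name `interleave_families` is hypothetical, and the choice to lead with the family of the first address in the input is an assumption made for this example.

```rust
use std::net::SocketAddr;

/// Reorders addresses so that connection attempts alternate between IPv6
/// and IPv4, starting with the family of the first input address and
/// preserving relative order within each family. (Hypothetical helper.)
pub fn interleave_families(addresses: &[SocketAddr]) -> Vec<SocketAddr> {
    let Some(first) = addresses.first() else {
        return Vec::new();
    };
    let first_is_v6 = first.is_ipv6();

    // Split into the leading family and the other family.
    let (lead, other): (Vec<SocketAddr>, Vec<SocketAddr>) = addresses
        .iter()
        .copied()
        .partition(|a| a.is_ipv6() == first_is_v6);

    let mut out = Vec::with_capacity(addresses.len());
    let mut lead = lead.into_iter();
    let mut other = other.into_iter();
    loop {
        match (lead.next(), other.next()) {
            (None, None) => break,
            // When one family runs out, the rest of the other is appended in order.
            (a, b) => {
                out.extend(a);
                out.extend(b);
            }
        }
    }
    out
}
```

For the input `[::1]:1, [::2]:1, 127.0.0.1:1, [::3]:1`, the sketch yields `[::1]:1, 127.0.0.1:1, [::2]:1, [::3]:1`.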
…x tests
This change completes the Happy Eyeballs implementation in the `pick_first` load balancer by:
- Implementing the "Steady State" mode for continuous retries after the initial connection pass fails for all addresses.
- Adding two new unit tests to verify failover and steady-state behavior under multi-backend scenarios.
- Fixing the timer advancement unit test to avoid deadlocks in async test contexts.
Motivation
Full implementation of the pick_first load balancer, including Happy Eyeballs features.
Solution
A load balancing policy that connects to the first available endpoint and, if configured, maintains stickiness across endpoint updates. It also handles new LB configuration and subchannel reconstruction.
Prototype is at https://github.com/nathanielford/grpc-rust-testbed/tree/main/pick_first_lib
Notes