Implement PickFirst load balancer#2570

Open
nathanielford wants to merge 11 commits into hyperium:master from nathanielford:implement/PickFirstLB

Conversation

@nathanielford
Collaborator

@nathanielford nathanielford commented Mar 25, 2026

Motivation

Full implementation of the pick first load balancer, including 'Happy eyeballs' features.

Solution

Load balancing implementation to pick the first available endpoint to connect to, maintaining stickiness across endpoint updates if configured. Handles accepting new LB configuration and subchannel reconstruction.

Prototype is at https://github.com/nathanielford/grpc-rust-testbed/tree/main/pick_first_lib

Notes

  • Ended up including all Happy Eyeballs features because it wasn't clear where best to draw the line. Consider this a full implementation; it should be reviewed as such.
  • This does use tokio::spawn and tokio::time, which may need to be replaced to make things runtime agnostic. Please comment in the PR whether this is the case.

@nathanielford nathanielford requested a review from dfawley March 25, 2026 15:30
@nathanielford nathanielford self-assigned this Mar 25, 2026
@nathanielford nathanielford requested review from arjan-bal and removed request for dfawley March 25, 2026 16:17
@nathanielford nathanielford force-pushed the implement/PickFirstLB branch 3 times, most recently from a442272 to 73397ff Compare March 25, 2026 17:29
Collaborator

@arjan-bal arjan-bal left a comment

Leaving initial comments while I review the remaining changes.

Comment thread grpc/src/client/load_balancing/pick_first.rs
Comment thread grpc/src/client/load_balancing/pick_first.rs Outdated
Comment thread grpc/src/client/load_balancing/pick_first.rs Outdated
Comment thread grpc/src/client/load_balancing/pick_first.rs Outdated
Comment thread grpc/src/client/load_balancing/pick_first.rs Outdated
Comment thread grpc/src/client/load_balancing/pick_first.rs Outdated
Comment thread grpc/src/client/load_balancing/pick_first.rs Outdated
Comment thread grpc/src/client/load_balancing/pick_first.rs Outdated
fn start_connection_pass(&mut self, channel_controller: &mut dyn ChannelController) {
    self.current_index = 0;
    self.selected = None;
    if let Some(sc) = self.subchannels.get(0) {
Collaborator

Since subchannels are reused when reconciling address lists during a resolver update, the first subchannel may be in backoff (reporting TRANSIENT_FAILURE). Instead of unconditionally calling connect() on the first subchannel, pick_first should ideally find the first IDLE or CONNECTING subchannel.

Here is Go's implementation.
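
A minimal sketch of the selection rule suggested above, not the PR's code: `first_connectable` is a hypothetical helper, and the `ConnectivityState` enum here is a local stand-in for the crate's type. It returns the index of the first subchannel that is IDLE or CONNECTING, so a subchannel still in backoff is skipped.

```rust
/// Local stand-in for the connectivity states named in the comment above.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ConnectivityState {
    Idle,
    Connecting,
    Ready,
    TransientFailure,
}

/// Hypothetical helper: index of the first subchannel a new connection pass
/// should start from. Subchannels in TRANSIENT_FAILURE (backoff) are skipped.
fn first_connectable(states: &[ConnectivityState]) -> Option<usize> {
    states.iter().position(|s| {
        matches!(s, ConnectivityState::Idle | ConnectivityState::Connecting)
    })
}
```

A `None` result means no subchannel can be started right now; what the policy does then (for example, waiting for a subchannel to leave backoff) is outside this sketch.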

Ok(())
}

fn subchannel_update(
Collaborator

The iteration over the subchannel list needs to be split into two passes, as described in A61. During the initial pass, pick_first will sequentially iterate over the subchannel list, ensuring it sees a TRANSIENT_FAILURE for each subchannel before proceeding to the next. Once all subchannels have failed, the first phase is complete. Now, pick_first will reconnect each subchannel as it reports IDLE. Ignoring the Happy Eyeballs timer, this corresponds to the circled portion of the gRFC:

[Image: excerpt from gRFC A61 with the relevant portion circled]
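
The two passes described above can be sketched as a small state machine. This is an illustration under stated assumptions, not the PR's implementation: `TwoPass`, `Phase`, and `Action` are hypothetical names, the Happy Eyeballs timer is ignored, and the caller is assumed to have already requested a connection on subchannel 0.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ConnectivityState {
    Idle,
    Connecting,
    Ready,
    TransientFailure,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Phase {
    /// Walk the list in order, waiting for each subchannel to fail.
    FirstPass,
    /// Every subchannel has failed once; reconnect each as it reports IDLE.
    SteadyState,
}

/// What the policy should do in response to one subchannel update.
#[derive(Debug, PartialEq, Eq)]
enum Action {
    Connect(usize),
    Select(usize),
    Wait,
}

struct TwoPass {
    phase: Phase,
    current: usize, // index being attempted during the first pass
    len: usize,     // number of subchannels in the list
}

impl TwoPass {
    fn new(len: usize) -> Self {
        TwoPass { phase: Phase::FirstPass, current: 0, len }
    }

    fn on_update(&mut self, index: usize, state: ConnectivityState) -> Action {
        match (self.phase, state) {
            // READY is honored from any subchannel in either phase.
            (_, ConnectivityState::Ready) => Action::Select(index),
            // First pass: only a failure of the current subchannel advances.
            (Phase::FirstPass, ConnectivityState::TransientFailure)
                if index == self.current =>
            {
                if self.current + 1 < self.len {
                    self.current += 1;
                    Action::Connect(self.current)
                } else {
                    self.phase = Phase::SteadyState;
                    Action::Wait
                }
            }
            // Second pass: backoff has ended for this subchannel, so retry it.
            (Phase::SteadyState, ConnectivityState::Idle) => Action::Connect(index),
            _ => Action::Wait,
        }
    }
}
```

With two subchannels, failures on index 0 and then index 1 end the first pass; a later IDLE report from either one triggers a reconnect on that subchannel alone.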

Collaborator Author

To be clear, this is a happy eyeballs requirement, correct? I think we were going to try and put in a pick_first implementation without Happy eyeballs first. If you think it's important, I can include that but it was initially out of scope.

Collaborator

Technically, Happy Eyeballs involves only the interleaving of IPv4/IPv6 addresses and running a timer to resume iteration if subchannels are slow to fail. The first-pass/second-pass logic is independent of that. Given this, the Happy Eyeballs implementation probably accounts for only around 20% of the dev effort. If we don't add the two-pass logic, we would end up re-implementing a significant portion of the LB policy later. After discussing this with @dfawley, I suggest we increase the scope of this change to implement the two-pass logic without Happy Eyeballs. This would involve increasing the estimates for this work and reducing the estimates for the Happy Eyeballs work, which should be fine.

Comment on lines +258 to +261
if let Some(attempting) = self.subchannels.get(self.current_index) {
    if attempting.address() == subchannel.address() {
        match state.connectivity_state {
            ConnectivityState::Ready => {
Collaborator

Since we have subchannel sharing in Rust, we would also need to consider this section of A61:

  • pick_first needs to be prepared for any subchannel to report READY at any time
  • When we choose a subchannel that has become successfully connected, we will unref all of the other subchannels
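
A rough sketch of the two points above, with hypothetical types rather than the PR's: `on_ready` looks the reporting subchannel up by address instead of assuming it is the one currently being attempted, and dropping the remaining entries stands in for unref'ing them.

```rust
#[derive(Debug)]
struct Subchannel {
    address: String,
}

struct PickFirst {
    /// Subchannels still being considered for connection.
    subchannels: Vec<Subchannel>,
    /// The subchannel chosen for picks, once one has become READY.
    selected: Option<Subchannel>,
}

impl PickFirst {
    /// Handles a READY report from any subchannel in the list.
    /// Returns false if the address is not one this policy is tracking.
    fn on_ready(&mut self, address: &str) -> bool {
        let Some(pos) = self.subchannels.iter().position(|sc| sc.address == address) else {
            return false;
        };
        let chosen = self.subchannels.swap_remove(pos);
        // Dropping the rest stands in for releasing our references to them.
        self.subchannels.clear();
        self.selected = Some(chosen);
        true
    }
}
```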

}

#[cfg(test)]
mod test {
Collaborator

There are some tricky edge cases found by Go's e2e-style tests that we discovered when implementing a new dual-stack pick_first LB in Go. I added unit tests for them to ensure the balancer behaved as expected. These tests are mainly in two files:

It may not be possible to write e2e-style tests in Rust right now since we don't have the same test utilities as Go. However, we can still test the same scenarios using a mock channel.

I would recommend seeing if Gemini can convert these Go tests into tests for Rust, skipping the Happy Eyeballs ones. This would act as a conformance test. I did something similar to generate tests for credentials in the following way:

  1. Clone gRPC Go.
  2. Point the Gemini CLI to the test files for gRPC Go and gRPC Rust.
  3. Ask Gemini to create similar test cases for gRPC Rust.

If it gives decent results without much effort, that's great; otherwise, we can improve test coverage later.

@arjan-bal arjan-bal removed their assignment Mar 30, 2026
@nathanielford nathanielford force-pushed the implement/PickFirstLB branch from 73397ff to bdf7fc3 Compare April 6, 2026 19:32
@nathanielford nathanielford force-pushed the implement/PickFirstLB branch from 7ea10ad to e1bcaf4 Compare April 27, 2026 20:55
@nathanielford nathanielford requested a review from arjan-bal April 27, 2026 21:38
@nathanielford nathanielford marked this pull request as ready for review April 27, 2026 21:39
@arjan-bal arjan-bal assigned arjan-bal and unassigned nathanielford Apr 28, 2026
…A61 endpoint handling

This commit implements the PickFirst load balancer policy for Tonic gRPC, focusing on:
- Efficient subchannel management with backoff preservation.
- "Stickiness" support: continuing to use an existing Ready subchannel if it remains in resolver updates.
- Compliance with gRFC A61: endpoints are now shuffled before being flattened into an address list, ensuring multiple addresses for a single endpoint (e.g., IPv4/IPv6) stay together.
- Clean state reset: subchannels and selected state are now cleared when receiving an empty address list.
- Alignment with the updated synchronous testing framework in master.

Includes comprehensive test coverage for basic connection, failover, stickiness, exhaustion, deterministic endpoint shuffling, de-duplication, and empty updates.
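
As an illustration of the shuffle-then-flatten ordering described in this commit message, here is a sketch with hypothetical names (`shuffle_then_flatten`, `rand_below`), not the code in the PR. Endpoints are modeled as lists of address strings, and the random source is passed in as a closure that must return a value below its argument, which keeps the sketch dependency-free and deterministic under test.

```rust
use std::collections::HashSet;

/// Shuffles whole endpoints (Fisher-Yates), then flattens them into one
/// address list, dropping addresses already seen. Because the shuffle moves
/// endpoints rather than individual addresses, each endpoint's addresses
/// stay adjacent in the result.
fn shuffle_then_flatten(
    mut endpoints: Vec<Vec<String>>,
    mut rand_below: impl FnMut(usize) -> usize,
) -> Vec<String> {
    for i in (1..endpoints.len()).rev() {
        // Assumption: rand_below(n) returns an index in 0..n.
        let j = rand_below(i + 1);
        endpoints.swap(i, j);
    }
    let mut seen = HashSet::new();
    endpoints
        .into_iter()
        .flatten()
        .filter(|addr| seen.insert(addr.clone()))
        .collect()
}
```

Passing a fixed closure in tests makes the resulting order reproducible; a real random source would be substituted in production code.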
…active failover

This change enhances the PickFirst load balancing policy to better support
gRFC A61 (Happy Eyeballs) and improve connection establishment latency.

Key changes:
- Implement IPv6/IPv4 address interleaving in `compile_address` to ensure
  subsequent connection attempts alternate between protocol families.
- Introduce a `subchannel_states` cache in `PickFirstPolicy` to track the
  connectivity status of managed subchannels.
- Refactor connection logic to use a `frontier_index` and proactively skip
  subchannels known to be in `TransientFailure` (e.g., during backoff).
- Update `advance_frontier` to safely maintain the index within the bounds
  of the address list, ensuring the policy remains reactive to recovery.
- Add deterministic unit tests for shuffling and interleaving logic.
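
The interleaving idea can be sketched as follows. This is not the PR's `compile_address`; `interleave_families` is a hypothetical function, and it assumes the family of the first address in the list leads, with relative order preserved inside each family.

```rust
use std::net::SocketAddr;

/// Alternates between address families, starting with the family of the
/// first address. Leftover addresses from the longer family go at the end.
fn interleave_families(addrs: &[SocketAddr]) -> Vec<SocketAddr> {
    let Some(first) = addrs.first() else {
        return Vec::new();
    };
    let first_is_v6 = first.is_ipv6();
    // Split into "same family as the first address" and "the other family".
    let (same, other): (Vec<SocketAddr>, Vec<SocketAddr>) = addrs
        .iter()
        .copied()
        .partition(|a| a.is_ipv6() == first_is_v6);

    let mut out = Vec::with_capacity(addrs.len());
    let (mut a, mut b) = (same.into_iter(), other.into_iter());
    loop {
        match (a.next(), b.next()) {
            (None, None) => break,
            // Option implements IntoIterator, so a missing side adds nothing.
            (x, y) => {
                out.extend(x);
                out.extend(y);
            }
        }
    }
    out
}
```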
…x tests

This change completes the Happy Eyeballs implementation in the `pick_first` load balancer by:

- Implementing the "Steady State" mode for continuous retries after the initial connection pass fails for all addresses.

- Adding two new unit tests to verify failover and steady-state behavior under multi-backend scenarios.

- Fixing the timer advancement unit test to avoid deadlocks in async test contexts.
@nathanielford nathanielford force-pushed the implement/PickFirstLB branch from e1bcaf4 to 798270f Compare April 29, 2026 15:17
@nathanielford nathanielford force-pushed the implement/PickFirstLB branch from fa711db to 566cb6d Compare April 29, 2026 15:24
Comment thread grpc/src/client/load_balancing/pick_first.rs Outdated
2 participants