Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

balancer: fix connectivity state aggregation algorithm to follow the spec #5473

merged 2 commits into from Jul 7, 2022


Copy link

@easwars easwars commented Jun 27, 2022

The connectivity state aggregation algorithm as specified in the spec is as follows:

The policy sets the channel's connectivity state by aggregating the states of the subchannels:

- If any one subchannel is in READY state, the channel's state is READY.
- Otherwise, if there is any subchannel in state CONNECTING, the channel's state is CONNECTING.
- Otherwise, if there is any subchannel in state IDLE, the channel's state is IDLE.
- Otherwise, if all subchannels are in state TRANSIENT_FAILURE, the channel's state is TRANSIENT_FAILURE.

Note that when a given subchannel reports TRANSIENT_FAILURE, it is considered to still be in
TRANSIENT_FAILURE until it successfully reconnects and reports READY. In particular, we ignore 
the transition from TRANSIENT_FAILURE to CONNECTING.

In our existing implementation, we were giving more precedence to TRANSIENT_FAILURE than IDLE. This was not causing any issue so far since this algorithm was only being used by the round_robin LB policy where IDLE is a fleeting state (the policy triggers a connection attempt whenever the subConn enters IDLE). We plan to use this connectivity state aggregation algorithm in the weightedtarget LB policy, where IDLE is not a fleeting state and needs to be prioritized over TRANSIENT_FAILURE.

Also, this algorithm does not take care of suppressing connectivity state changes from TRANSIENT_FAILURE to non-READY states. This is currently taken care by the LB policy implementations and will continue to remain that way.

Fixes #5458


  • balancer: give more precedence to IDLE over TRANSIENT_FAILURE when aggregating connectivity state

@easwars easwars requested a review from dfawley June 27, 2022 18:46
@easwars easwars added this to the 1.48 Release milestone Jun 27, 2022
@dfawley dfawley modified the milestones: 1.48 Release, 1.49 Release Jul 1, 2022
@dfawley dfawley assigned easwars and unassigned dfawley Jul 7, 2022
@easwars easwars merged commit 03fee09 into grpc:master Jul 7, 2022
10 checks passed
@easwars easwars deleted the connectivity_state_aggregation branch July 7, 2022 20:48
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 4, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
None yet

Successfully merging this pull request may close these issues.

balancer: connectivity state aggregation algorithm needs fixing
2 participants