balancer: connectivity state aggregation algorithm needs fixing #5458

easwars · 2022-06-22T22:20:52Z

The round_robin LB policy's implementation is broken into two pieces:

- The base balancer implementation found here, and
- The picker implementation specific to round_robin found here

The base balancer implementation uses the connectivity state aggregation logic provided by ConnectivityStateEvaluator:

grpc-go/balancer/balancer.go

Line 379 in 28de486

type ConnectivityStateEvaluator struct {

The algorithm is as follows:

//  - If at least one SubConn in Ready, the aggregated state is Ready;
//  - Else if at least one SubConn in Connecting, the aggregated state is Connecting;
//  - Else if at least one SubConn is TransientFailure, the aggregated state is Transient Failure;
//  - Else if at least one SubConn is Idle, the aggregated state is Idle;
//  - Else there are no subconns and the aggregated state is Transient Failure

The algorithm as defined in the load balancing spec is as follows though:

The policy sets the channel's connectivity state by aggregating the states of the subchannels:

- If any one subchannel is in READY state, the channel's state is READY.
- Otherwise, if there is any subchannel in state CONNECTING, the channel's state is CONNECTING.
- Otherwise, if there is any subchannel in state IDLE, the channel's state is IDLE.
- Otherwise, if all subchannels are in state TRANSIENT_FAILURE, the channel's state is TRANSIENT_FAILURE.

Note that when a given subchannel reports TRANSIENT_FAILURE, it is considered to still be in
TRANSIENT_FAILURE until it successfully reconnects and reports READY. In particular, we ignore 
the transition from TRANSIENT_FAILURE to CONNECTING.
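
As a rough illustration of the spec text quoted above, here is a minimal sketch of the aggregation rule together with the "sticky" TRANSIENT_FAILURE note. Everything in it is hypothetical: `State` is a stand-in for `connectivity.State`, and `specAggregator`, `newSpecAggregator`, `update`, and the string subchannel ids are names made up for this example, not grpc-go APIs. Returning TRANSIENT_FAILURE when there are no subchannels is an assumption carried over from the existing code comment; the quoted spec text does not cover that case.

```go
package main

// State is a stand-in for connectivity.State so the sketch has no dependencies.
type State int

const (
	Idle State = iota
	Connecting
	Ready
	TransientFailure
)

// specAggregator is a hypothetical helper (not a grpc-go type) that tracks the
// effective state of each subchannel, keyed by an arbitrary id.
type specAggregator struct {
	effective map[string]State
}

func newSpecAggregator() *specAggregator {
	return &specAggregator{effective: make(map[string]State)}
}

// update records a state reported by subchannel id and returns the new
// aggregated state.
func (a *specAggregator) update(id string, reported State) State {
	prev, seen := a.effective[id]
	// Sticky TRANSIENT_FAILURE: once a subchannel has reported
	// TRANSIENT_FAILURE, ignore everything except READY.
	sticky := seen && prev == TransientFailure && reported != Ready
	if !sticky {
		a.effective[id] = reported
	}
	return a.aggregate()
}

// aggregate applies the precedence READY > CONNECTING > IDLE > TRANSIENT_FAILURE.
func (a *specAggregator) aggregate() State {
	counts := make(map[State]int)
	for _, s := range a.effective {
		counts[s]++
	}
	switch {
	case counts[Ready] > 0:
		return Ready
	case counts[Connecting] > 0:
		return Connecting
	case counts[Idle] > 0:
		return Idle
	default:
		// All subchannels are in TRANSIENT_FAILURE, or (assumption) there are none.
		return TransientFailure
	}
}
```

With this sketch, a subchannel that reports TRANSIENT_FAILURE and then CONNECTING keeps the aggregate at TRANSIENT_FAILURE until it reports READY or another subchannel reports a higher-precedence state.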

Note that the implemented algorithm gives precedence to TRANSIENT_FAILURE over IDLE, whereas the spec gives precedence to IDLE over TRANSIENT_FAILURE. The implemented order works fine for round_robin because in round_robin, we push the subConn into CONNECTING as soon as it enters IDLE. But if we want to use this connectivity state aggregation algorithm in other LB policies, IDLE should take precedence over TRANSIENT_FAILURE. For example, this is exactly what we do in weightedtarget:

grpc-go/balancer/weightedtarget/weightedaggregator/aggregator.go

Line 218 in 28de486

var aggregatedState connectivity.State

We even have a TODO in weightedtarget to use balancer.ConnectivityStateEvaluator:

grpc-go/balancer/weightedtarget/weightedaggregator/aggregator.go

Line 203 in 28de486

// TODO: use balancer.ConnectivityStateEvaluator to calculate the aggregated

We cannot use balancer.ConnectivityStateEvaluator in weightedtarget unless we fix the algorithm implementation.
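
To make the difference concrete, here is a small sketch comparing the two precedence orders on the same set of subConn states. It is illustrative only: `State` is a stand-in for `connectivity.State`, and `aggregateByPrecedence`, `implementedOrder`, and `specOrder` are hypothetical names, not part of grpc-go.

```go
package main

// State is a stand-in for connectivity.State so the sketch has no dependencies.
type State int

const (
	Idle State = iota
	Connecting
	Ready
	TransientFailure
)

var (
	// Order described in the ConnectivityStateEvaluator comment.
	implementedOrder = []State{Ready, Connecting, TransientFailure, Idle}
	// Order described in the load balancing spec.
	specOrder = []State{Ready, Connecting, Idle, TransientFailure}
)

// aggregateByPrecedence returns the first state in order that is present in
// states. With no subConns at all it falls back to TransientFailure.
func aggregateByPrecedence(order []State, states []State) State {
	present := make(map[State]bool)
	for _, s := range states {
		present[s] = true
	}
	for _, s := range order {
		if present[s] {
			return s
		}
	}
	return TransientFailure
}
```

For subConns in `{Idle, TransientFailure}`, `aggregateByPrecedence(implementedOrder, ...)` yields `TransientFailure` while `aggregateByPrecedence(specOrder, ...)` yields `Idle`. The two orders agree whenever a Ready or Connecting subConn exists, which is why round_robin does not notice the difference.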

Also, the c-core implementation of round_robin sets the connectivity state of the subConn to CONNECTING when it enters IDLE, because the LB policy starts connecting as soon as the subConn enters IDLE. We also start connecting as soon as the subConn enters IDLE, but we don't set its connectivity state to CONNECTING.

easwars added the Type: Bug label Jun 22, 2022

easwars self-assigned this Jun 27, 2022

easwars mentioned this issue Jun 27, 2022

balancer: fix connectivity state aggregation algorithm to follow the spec #5473

Merged

easwars closed this as completed in #5473 Jul 7, 2022

github-actions bot locked as resolved and limited conversation to collaborators Jan 4, 2023
