grpclb: switch to fallback mode if all connections are lost #3744
Conversation
Previously, fallback mode could be entered only if the client had not received any server list and the fallback timeout had expired. Now fallback mode is entered when the following condition is met:

```
timeoutExpired && readySubchannels.size() == 0
    && !(connectedToBalancer && receivedServerListFromCurrentBalancer)
```

Fallback mode is exited when a new server list is received from the balancer.
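For illustration, a minimal sketch of that condition as a standalone predicate; the parameter names come from the expression above, not from the actual GrpclbState fields:

```java
// Hypothetical sketch only: mirrors the fallback-entry condition quoted above.
final class FallbackCondition {
  static boolean shouldEnterFallback(
      boolean timeoutExpired,
      int readySubchannelCount,
      boolean connectedToBalancer,
      boolean receivedServerListFromCurrentBalancer) {
    return timeoutExpired
        && readySubchannelCount == 0
        && !(connectedToBalancer && receivedServerListFromCurrentBalancer);
  }
}
```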
@markdroth please focus on the behavior.
    List<Subchannel> subchannels =
        fallbackTestVerifyUseOfBalancerBackendLists(inOrder, helper, serverList);

    // Let fallback timer expire
I don't know this code well enough to understand exactly what this is covering, so let me say what the behavior should be here and let you tell me if that's what the code is doing.
I think the behavior we want here is for the fallback timer to start again at the moment that we lose contact with the balancers and backends. If we don't reestablish a connection with either a balancer or a backend by the time the timer expires, then we use the fallback addresses until we regain contact with the balancer.
Note that the above is different from the initial fallback timer that we had implemented previously. For example, consider the following scenario:
- Channel is created and is moved out of idle state (either by starting an RPC or by requesting that the channel start connecting).
- We start trying to connect to the balancer. At the same time, we start the fallback timer.
- If the fallback timer expires before we connect to the balancer, we use the fallback addresses while we continue to attempt to connect to the balancers.
- We successfully connect to the balancer and receive a serverlist. If the fallback timer had not previously expired, we cancel it; if it had previously expired, we stop using the fallback addresses. We then start using the backends that the balancer told us about.
- Some time later, we lose contact with the balancer. However, we are still in contact with the backends from the last serverlist the balancer sent us, so we keep using them while we try to reconnect to the balancer.
- Now we lose contact with all backends from the last serverlist the balancer sent us. We start a new fallback timer.
- If we regain contact with the balancer and get a new serverlist before the fallback timer expires, then we cancel the timer and use the new serverlist.
- If the timer fires before we regain contact with the balancer, then we switch to using the fallback addresses until we regain contact with the balancer.
In other words, there are two different scenarios in which the fallback timer will be used. The one that we had previously implemented is a timer that starts when we start trying to contact the balancer. The new one that we need in this PR will start at the moment when we lose contact with the last backend if we have also lost contact with the balancer.
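Illustrative only, assuming hypothetical names like FallbackTimerLifecycle, onAllConnectionsLost, and useFallbackAddresses (not the actual grpc-java GrpclbState API), here is a minimal sketch of where those two timers would start and where a fresh serverlist cancels them:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the two fallback-timer scenarios described above.
final class FallbackTimerLifecycle {
  private static final long FALLBACK_TIMEOUT_SECONDS = 10;

  private final ScheduledExecutorService timerService =
      Executors.newSingleThreadScheduledExecutor();
  private ScheduledFuture<?> fallbackTimer;

  // Scenario 1: the channel leaves idle and we start connecting to the balancer.
  void onStartConnectingToBalancer() {
    startTimerIfNotRunning();
  }

  // Scenario 2: we have lost the balancer AND the last ready backend.
  void onAllConnectionsLost() {
    startTimerIfNotRunning();
  }

  // A fresh serverlist cancels a pending timer, or exits fallback mode.
  void onServerListReceived() {
    if (fallbackTimer != null) {
      fallbackTimer.cancel(false);
      fallbackTimer = null;
    }
    stopUsingFallbackAddresses();
  }

  private void startTimerIfNotRunning() {
    if (fallbackTimer != null) {
      return;
    }
    fallbackTimer = timerService.schedule(
        this::useFallbackAddresses, FALLBACK_TIMEOUT_SECONDS, TimeUnit.SECONDS);
  }

  private void useFallbackAddresses() {
    // switch the picker to the resolver-provided fallback addresses
  }

  private void stopUsingFallbackAddresses() {
    // switch back to the backends from the balancer's serverlist
  }
}
```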
In the current code the timeout is used only for the initial fallback. Later fallbacks are not subject to the timeout, and thus will activate immediately after all connections are lost (as long as the initial timeout has passed).
We should discuss and decide which approach we want to take.
My understanding is that the initial timeout is necessary because the client needs time to connect to balancer and receive the server list. It doesn't seem to be the case for the later fallbacks. Losing connections to the balancer and all backends is pretty unusual, and I don't see a reason why we still have to wait for 10 seconds instead of using the fallbacks immediately.
@markdroth I have updated the PR to add the timeout to subsequent fallbacks. Now it matches your description.
Sounds great! Then feel free to merge this PR as soon as Penn reviews it. Thanks!
| "NameResolver returned no LB address while asking for GRPCLB")); | ||
| } else { | ||
| grpclbState.updateAddresses(newLbAddressGroups, newBackendServers); | ||
| grpclbState.handleAddresses(newLbAddressGroups, newBackendServers); |
Since the method name is changed to handleAddresses, maybe also handle `if (newLbAddressGroups.isEmpty())` inside handleAddresses.
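A minimal sketch of that suggestion, assuming EquivalentAddressGroup for both parameter lists and a hypothetical propagateError(...) error path (the real GrpclbState internals may differ):

```java
import io.grpc.EquivalentAddressGroup;
import io.grpc.Status;
import java.util.List;

// Hypothetical sketch: fold the empty-LB-address check into handleAddresses itself.
abstract class HandleAddressesSketch {
  void handleAddresses(
      List<EquivalentAddressGroup> newLbAddressGroups,
      List<EquivalentAddressGroup> newBackendServers) {
    if (newLbAddressGroups.isEmpty()) {
      // No balancer address: surface the error here instead of making the caller check.
      propagateError(Status.UNAVAILABLE.withDescription(
          "NameResolver returned no LB address while asking for GRPCLB"));
      return;
    }
    // ... connect to the balancer and keep newBackendServers as the fallback list ...
  }

  abstract void propagateError(Status error);
}
```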
Done.
   * Start the fallback timer if it's not already started and all connections are lost.
   */
  private void maybeStartFallbackTimer() {
    if (fallbackTimer == null) {
if (fallbackTimer != null) {
  return;
}
to follow the existing pattern.
Done.
    }
    logger.log(Level.FINE, "[{0}] Starting fallback timer.", new Object[] {logId});
    fallbackTimer = new FallbackModeTask();
    fallbackTimer.scheduledFuture =
Maybe add a schedule(timeout, timeUnit) method to FallbackModeTask instead of setting the field.
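A minimal sketch of that suggestion, assuming the task is handed a ScheduledExecutorService; this is illustrative, not the actual FallbackModeTask in grpc-java:

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: FallbackModeTask owns its own scheduling instead of the
// caller assigning scheduledFuture directly.
final class FallbackModeTask implements Runnable {
  private final ScheduledExecutorService timerService;
  private ScheduledFuture<?> scheduledFuture;

  FallbackModeTask(ScheduledExecutorService timerService) {
    this.timerService = timerService;
  }

  void schedule(long timeout, TimeUnit unit) {
    scheduledFuture = timerService.schedule(this, timeout, unit);
  }

  void cancel() {
    if (scheduledFuture != null) {
      scheduledFuture.cancel(false);
    }
  }

  @Override
  public void run() {
    // enter fallback mode: start using the resolver-provided backend addresses
  }
}
```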
Done.
    }
    if (balancerWorking) {
      return;
    }
also
if (usingFallbackBackends) {
  return;
}
Nice catch! Fixed and added a test for it.
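Putting the suggestions from this review together, a hedged fragment of how the method body might read, assuming surrounding fields such as balancerWorking, usingFallbackBackends, and a FALLBACK_TIMEOUT_MS constant, plus the schedule(...) helper suggested earlier:

```java
// Hypothetical fragment: maybeStartFallbackTimer with the early returns from this review.
private void maybeStartFallbackTimer() {
  if (fallbackTimer != null) {
    return;  // a fallback timer is already pending
  }
  if (balancerWorking) {
    return;  // we still have a usable serverlist from the balancer
  }
  if (usingFallbackBackends) {
    return;  // already in fallback mode
  }
  logger.log(Level.FINE, "[{0}] Starting fallback timer.", new Object[] {logId});
  fallbackTimer = new FallbackModeTask();
  fallbackTimer.schedule(FALLBACK_TIMEOUT_MS, TimeUnit.MILLISECONDS);
}
```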
zhangkun83 left a comment:
PTAL.
| "NameResolver returned no LB address while asking for GRPCLB")); | ||
| } else { | ||
| grpclbState.updateAddresses(newLbAddressGroups, newBackendServers); | ||
| grpclbState.handleAddresses(newLbAddressGroups, newBackendServers); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| * Start the fallback timer if it's not already started and all connections are lost. | ||
| */ | ||
| private void maybeStartFallbackTimer() { | ||
| if (fallbackTimer == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| } | ||
| if (balancerWorking) { | ||
| return; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch! Fixed and added test for it.
| } | ||
| logger.log(Level.FINE, "[{0}] Starting fallback timer.", new Object[] {logId}); | ||
| fallbackTimer = new FallbackModeTask(); | ||
| fallbackTimer.scheduledFuture = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
dapengzhang0 left a comment:
LGTM
Previously, fallback mode could be entered only if the client had not
received any server list and the fallback timeout had expired.
Now the fallback timer is started when the stream to the balancer is broken
AND there are no ready Subchannels. Fallback mode is activated when the
timer expires. When a new server list is received from the balancer, either
the fallback timer is cancelled, or fallback mode is exited.
Also fixed a bug where the fallback timer was not cancelled when GrpclbState
was shut down.
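As a rough illustration of that shutdown fix (hypothetical method and field names, not the actual grpc-java code):

```java
// Hypothetical fragment: cancel a pending fallback timer when the LB state shuts down.
void shutdown() {
  if (fallbackTimer != null) {
    fallbackTimer.cancel();
    fallbackTimer = null;
  }
  // ... shut down the balancer RPC, the LB channel, and all subchannels ...
}
```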