Conversation

Contributor

@zhangkun83 zhangkun83 commented Nov 15, 2017

Previously, fallback mode could be entered only if the client had not
received any server list and the fallback timeout had expired.

Now the fallback timer is started when the stream to the balancer is broken
AND there are no ready Subchannels. Fallback mode is activated when the
timer expires. When a new server list is received from the balancer, either
the fallback timer is cancelled or fallback mode is exited.

Also fixed a bug where the fallback timer was not cancelled when GrpclbState
is shut down.

Previously, fallback mode could be entered only if the client had not
received any server list and the fallback timeout had expired. Now
fallback mode is entered when the following condition is met:

```
 (timeoutExpired
  && readySubchannels.size() == 0
  && !(connectedToBalancer && receivedServerListFromCurrentBalancer))
```

Fallback mode is exited when a new server list is received from the
balancer.
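
For illustration only, here is a minimal, self-contained sketch of that condition as a Java predicate. The class and field names (FallbackDecision, fallbackTimeoutExpired, balancerWorking, readySubchannels) are hypothetical and only stand in for the state the real GrpclbState tracks:

```java
import java.util.List;

// Illustrative sketch of the fallback-entry condition; not the actual grpclb code.
final class FallbackDecision {
  boolean fallbackTimeoutExpired; // the fallback timer has fired
  boolean balancerWorking;        // connected to the balancer AND a server list
                                  // was received from the current balancer stream
  List<Object> readySubchannels;  // subchannels currently in the READY state

  // True when fallback mode should be entered, per the condition above.
  boolean shouldUseFallbackBackends() {
    return fallbackTimeoutExpired
        && readySubchannels.isEmpty()
        && !balancerWorking;
  }
}
```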
Contributor Author

@markdroth please focus on the behavior.
@dapengzhang0 please focus on the language.

```java
List<Subchannel> subchannels =
    fallbackTestVerifyUseOfBalancerBackendLists(inOrder, helper, serverList);

// Let fallback timer expire
```
Member

I don't know this code well enough to understand exactly what this is covering, so let me say what the behavior should be here and let you tell me if that's what the code is doing.

I think the behavior we want here is for the fallback timer to start again at the moment that we lose contact with the balancers and backends. If we don't reestablish a connection with either a balancer or a backend by the time the timer expires, then we use the fallback addresses until we regain contact with the balancer.

Note that the above is different from the initial fallback timer that we had implemented previously. For example, consider the following scenario:

  1. Channel is created and is moved out of idle state (either by starting an RPC or by requesting that the channel start connecting).
  2. We start trying to connect to the balancer. At the same time, we start the fallback timer.
  3. If the fallback timer expires before we connect to the balancer, we use the fallback addresses while we continue to attempt to connect to the balancers.
  4. We successfully connect to the balancer and receive a serverlist. If the fallback timer had not previously expired, we cancel it; if it had previously expired, we stop using the fallback addresses. We then start using the backends that the balancer told us about.
  5. Some time later, we lose contact with the balancer. However, we are still in contact with the backends from the last serverlist the balancer sent us, so we keep using them while we try to reconnect to the balancer.
  6. Now we lose contact with all backends from the last serverlist the balancer sent us. We start a new fallback timer.
  7. If we regain contact with the balancer and get a new serverlist before the fallback timer expires, then we cancel the timer and use the new serverlist.
  8. If the timer fires before we regain contact with the balancer, then we switch to using the fallback addresses until we regain contact with the balancer.

In other words, there are two different scenarios in which the fallback timer will be used. The one that we had previously implemented is a timer that starts when we start trying to contact the balancer. The new one that we need in this PR will start at the moment when we lose contact with the last backend if we have also lost contact with the balancer.
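
To make the timing concrete, here is a rough, self-contained sketch of the timer lifecycle described in the scenario above. All names (FALLBACK_TIMEOUT_MS, the event methods, the executor wiring) are assumptions for illustration, not the actual grpclb implementation:

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Sketch of when the fallback timer is started and cancelled; illustrative only.
final class FallbackTimerLifecycle {
  static final long FALLBACK_TIMEOUT_MS = 10_000;  // assumed timeout value

  private final ScheduledExecutorService timerService;
  private ScheduledFuture<?> fallbackTimer;
  private boolean usingFallbackBackends;

  FallbackTimerLifecycle(ScheduledExecutorService timerService) {
    this.timerService = timerService;
  }

  // Steps 2 and 6: start the timer when we are trying to reach the balancer
  // and have no usable backends.
  void onLostContactWithBalancerAndBackends() {
    if (fallbackTimer == null) {
      fallbackTimer = timerService.schedule(
          this::enterFallbackMode, FALLBACK_TIMEOUT_MS, TimeUnit.MILLISECONDS);
    }
  }

  // Steps 4 and 7: a fresh server list either cancels the timer or exits fallback.
  void onServerListReceived() {
    if (fallbackTimer != null) {
      fallbackTimer.cancel(false);
      fallbackTimer = null;
    }
    usingFallbackBackends = false;  // resume using balancer-provided backends
  }

  // Steps 3 and 8: the timer fired before contact was regained.
  private void enterFallbackMode() {
    usingFallbackBackends = true;
  }
}
```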

Contributor Author

In the current code, the timeout is used only for the initial fallback. Later fallbacks are not subject to the timeout and thus activate immediately after all connections are lost (as long as the initial timeout has passed).

We should discuss and decide which approach we want to take.

My understanding is that the initial timeout is necessary because the client needs time to connect to the balancer and receive the server list. That doesn't seem to apply to later fallbacks. Losing connections to the balancer and all backends is pretty unusual, and I don't see a reason why we would still have to wait 10 seconds instead of using the fallback addresses immediately.

Contributor Author

@markdroth I have updated the PR to add the timeout to subsequent fallbacks. Now it matches your description.

Member

Sounds great! Then feel free to merge this PR as soon as Penn reviews it. Thanks!

"NameResolver returned no LB address while asking for GRPCLB"));
} else {
grpclbState.updateAddresses(newLbAddressGroups, newBackendServers);
grpclbState.handleAddresses(newLbAddressGroups, newBackendServers);
Contributor

Since the method is renamed to handleAddresses, maybe also handle the if (newLbAddressGroups.isEmpty()) case inside handleAddresses.
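
A rough sketch of what that could look like, with simplified types and a hypothetical error path (handleMissingLbAddress) standing in for the existing error handling:

```java
import java.util.List;

// Illustrative only; the real handleAddresses takes grpc-java address types.
final class AddressHandlingSketch {
  void handleAddresses(List<Object> newLbAddressGroups, List<Object> newBackendServers) {
    if (newLbAddressGroups.isEmpty()) {
      // The caller no longer needs to special-case the missing-balancer case.
      handleMissingLbAddress();
      return;
    }
    // ... normal path: update the balancer connection and backend lists ...
  }

  private void handleMissingLbAddress() {
    // e.g. report "NameResolver returned no LB address while asking for GRPCLB"
  }
}
```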

Contributor Author

Done.

```java
 * Start the fallback timer if it's not already started and all connections are lost.
 */
private void maybeStartFallbackTimer() {
  if (fallbackTimer == null) {
```
Contributor

```java
if (fallbackTimer != null) {
  return;
}
```

to follow the existing pattern.

Contributor Author

Done.

```java
}
logger.log(Level.FINE, "[{0}] Starting fallback timer.", new Object[] {logId});
fallbackTimer = new FallbackModeTask();
fallbackTimer.scheduledFuture =
```
Contributor

Maybe add a schedule(timeout, timeUnit) method to FallbackModeTask instead of setting the field.
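
For illustration, a schedule method on the task might look roughly like this; the timer-service wiring and names are assumptions, not the real FallbackModeTask:

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Sketch of the suggested shape; not the actual grpclb FallbackModeTask.
final class FallbackModeTask implements Runnable {
  private final ScheduledExecutorService timerService;
  private ScheduledFuture<?> scheduledFuture;

  FallbackModeTask(ScheduledExecutorService timerService) {
    this.timerService = timerService;
  }

  // Encapsulates scheduling so callers never set scheduledFuture directly.
  void schedule(long timeout, TimeUnit timeUnit) {
    scheduledFuture = timerService.schedule(this, timeout, timeUnit);
  }

  void cancel() {
    if (scheduledFuture != null) {
      scheduledFuture.cancel(false);
    }
  }

  @Override
  public void run() {
    // enter fallback mode when the timer fires
  }
}
```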

Contributor Author

Done.

```java
}
if (balancerWorking) {
  return;
}
```
Contributor

also

```java
if (usingFallbackBackends) {
  return;
}
```

Contributor Author

Nice catch! Fixed and added a test for it.
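
Putting the review suggestions together, the guarded method might end up shaped roughly like this; the field names and timeout value are illustrative, not the actual GrpclbState members:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Sketch of maybeStartFallbackTimer() with the early-return guards; illustrative only.
final class GuardedFallbackTimerStart {
  private static final long FALLBACK_TIMEOUT_MS = 10_000;  // assumed value

  private final ScheduledExecutorService timerService =
      Executors.newSingleThreadScheduledExecutor();
  private ScheduledFuture<?> fallbackTimer;
  private boolean balancerWorking;
  private boolean usingFallbackBackends;

  void maybeStartFallbackTimer() {
    if (fallbackTimer != null) {
      return;  // timer already running
    }
    if (balancerWorking) {
      return;  // the balancer stream is healthy; no fallback needed
    }
    if (usingFallbackBackends) {
      return;  // already in fallback mode; nothing to time out
    }
    fallbackTimer = timerService.schedule(
        this::enterFallbackMode, FALLBACK_TIMEOUT_MS, TimeUnit.MILLISECONDS);
  }

  private void enterFallbackMode() {
    usingFallbackBackends = true;
  }
}
```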

Contributor Author

@zhangkun83 zhangkun83 left a comment

PTAL.

"NameResolver returned no LB address while asking for GRPCLB"));
} else {
grpclbState.updateAddresses(newLbAddressGroups, newBackendServers);
grpclbState.handleAddresses(newLbAddressGroups, newBackendServers);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

* Start the fallback timer if it's not already started and all connections are lost.
*/
private void maybeStartFallbackTimer() {
if (fallbackTimer == null) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

}
if (balancerWorking) {
return;
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch! Fixed and added test for it.

}
logger.log(Level.FINE, "[{0}] Starting fallback timer.", new Object[] {logId});
fallbackTimer = new FallbackModeTask();
fallbackTimer.scheduledFuture =
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Contributor

@dapengzhang0 dapengzhang0 left a comment

LGTM

@zhangkun83 zhangkun83 merged commit 9239984 into grpc:master Nov 29, 2017
@zhangkun83 zhangkun83 deleted the grpclb_fallback branch November 29, 2017 18:50
@carl-mastrangelo carl-mastrangelo added this to the 1.9 milestone Nov 30, 2017
@lock lock bot locked as resolved and limited conversation to collaborators Jan 19, 2019