grpclb: switch to fallback mode if all connections are lost #3744
Conversation
Previously, fallback mode could be entered only if the client had not received any server list and the fallback timeout had expired. Now fallback mode is entered when the following condition is met:

```
timeoutExpired && readySubchannels.size() == 0
    && !(connectedToBalancer && receivedServerListFromCurrentBalancer)
```

Fallback mode is exited when a new server list is received from the balancer.
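For illustration, a minimal sketch of that condition as a standalone predicate; the parameter names come from the expression above, not from the actual GrpclbState fields:

```java
// Hypothetical sketch only: mirrors the fallback-entry condition quoted above.
final class FallbackCondition {
  static boolean shouldEnterFallback(
      boolean timeoutExpired,
      int readySubchannelCount,
      boolean connectedToBalancer,
      boolean receivedServerListFromCurrentBalancer) {
    return timeoutExpired
        && readySubchannelCount == 0
        && !(connectedToBalancer && receivedServerListFromCurrentBalancer);
  }
}
```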
@markdroth please focus on the behavior.
    List<Subchannel> subchannels =
        fallbackTestVerifyUseOfBalancerBackendLists(inOrder, helper, serverList);

    // Let fallback timer expire
I don't know this code well enough to understand exactly what this is covering, so let me say what the behavior should be here and let you tell me if that's what the code is doing.
I think the behavior we want here is for the fallback timer to start again at the moment that we lose contact with the balancers and backends. If we don't reestablish a connection with either a balancer or a backend by the time the timer expires, then we use the fallback addresses until we regain contact with the balancer.
Note that the above is different from the initial fallback timer that we had implemented previously. For example, consider the following scenario:
- Channel is created and is moved out of idle state (either by starting an RPC or by requesting that the channel start connecting).
- We start trying to connect to the balancer. At the same time, we start the fallback timer.
- If the fallback timer expires before we connect to the balancer, we use the fallback addresses while we continue to attempt to connect to the balancers.
- We successfully connect to the balancer and receive a serverlist. If the fallback timer had not previously expired, we cancel it; if it had previously expired, we stop using the fallback addresses. We then start using the backends that the balancer told us about.
- Some time later, we lose contact with the balancer. However, we are still in contact with the backends from the last serverlist the balancer sent us, so we keep using them while we try to reconnect to the balancer.
- Now we lose contact with all backends from the last serverlist the balancer sent us. We start a new fallback timer.
- If we regain contact with the balancer and get a new serverlist before the fallback timer expires, then we cancel the timer and use the new serverlist.
- If the timer fires before we regain contact with the balancer, then we switch to using the fallback addresses until we regain contact with the balancer.
In other words, there are two different scenarios in which the fallback timer will be used. The one that we had previously implemented is a timer that starts when we start trying to contact the balancer. The new one that we need in this PR will start at the moment when we lose contact with the last backend if we have also lost contact with the balancer.
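Illustrative only, assuming hypothetical names like FallbackTimerLifecycle, onAllConnectionsLost, and useFallbackAddresses (not the actual grpc-java GrpclbState API), here is a minimal sketch of where those two timers would start and where a fresh serverlist cancels them:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the two fallback-timer scenarios described above.
final class FallbackTimerLifecycle {
  private static final long FALLBACK_TIMEOUT_SECONDS = 10;

  private final ScheduledExecutorService timerService =
      Executors.newSingleThreadScheduledExecutor();
  private ScheduledFuture<?> fallbackTimer;

  // Scenario 1: the channel leaves idle and we start connecting to the balancer.
  void onStartConnectingToBalancer() {
    startTimerIfNotRunning();
  }

  // Scenario 2: we have lost the balancer AND the last ready backend.
  void onAllConnectionsLost() {
    startTimerIfNotRunning();
  }

  // A fresh serverlist cancels a pending timer, or exits fallback mode.
  void onServerListReceived() {
    if (fallbackTimer != null) {
      fallbackTimer.cancel(false);
      fallbackTimer = null;
    }
    stopUsingFallbackAddresses();
  }

  private void startTimerIfNotRunning() {
    if (fallbackTimer != null) {
      return;
    }
    fallbackTimer = timerService.schedule(
        this::useFallbackAddresses, FALLBACK_TIMEOUT_SECONDS, TimeUnit.SECONDS);
  }

  private void useFallbackAddresses() {
    // switch the picker to the resolver-provided fallback addresses
  }

  private void stopUsingFallbackAddresses() {
    // switch back to the backends from the balancer's serverlist
  }
}
```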
In the current code the timeout is used only for the initial fallback. Later fallbacks are not subject to the timeout, and thus will activate immediately after all connections are lost (as long as the initial timeout has passed).
We should discuss and decide which approach we want to take.
My understanding is that the initial timeout is necessary because the client needs time to connect to balancer and receive the server list. It doesn't seem to be the case for the later fallbacks. Losing connections to the balancer and all backends is pretty unusual, and I don't see a reason why we still have to wait for 10 seconds instead of using the fallbacks immediately.
@markdroth I have updated the PR to add the timeout to subsequent fallbacks. Now it matches your description.
Sounds great! Then feel free to merge this PR as soon as Penn reviews it. Thanks!
| "NameResolver returned no LB address while asking for GRPCLB")); | ||
| } else { | ||
| grpclbState.updateAddresses(newLbAddressGroups, newBackendServers); | ||
| grpclbState.handleAddresses(newLbAddressGroups, newBackendServers); |
Since the method name is changed to handleAddresses, maybe also handle `if (newLbAddressGroups.isEmpty())` inside handleAddresses.
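A minimal sketch of that suggestion, assuming EquivalentAddressGroup for both parameter lists and a hypothetical propagateError(...) error path (the real GrpclbState internals may differ):

```java
import io.grpc.EquivalentAddressGroup;
import io.grpc.Status;
import java.util.List;

// Hypothetical sketch: fold the empty-LB-address check into handleAddresses itself.
abstract class HandleAddressesSketch {
  void handleAddresses(
      List<EquivalentAddressGroup> newLbAddressGroups,
      List<EquivalentAddressGroup> newBackendServers) {
    if (newLbAddressGroups.isEmpty()) {
      // No balancer address: surface the error here instead of making the caller check.
      propagateError(Status.UNAVAILABLE.withDescription(
          "NameResolver returned no LB address while asking for GRPCLB"));
      return;
    }
    // ... connect to the balancer and keep newBackendServers as the fallback list ...
  }

  abstract void propagateError(Status error);
}
```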
Done.
   * Start the fallback timer if it's not already started and all connections are lost.
   */
  private void maybeStartFallbackTimer() {
    if (fallbackTimer == null) {
if (fallbackTimer != null) {
  return;
}
to follow the existing pattern.
Done.
    }
    logger.log(Level.FINE, "[{0}] Starting fallback timer.", new Object[] {logId});
    fallbackTimer = new FallbackModeTask();
    fallbackTimer.scheduledFuture =
Maybe add a schedule(timeout, timeUnit) method to FallbackModeTask instead of setting the field.
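A minimal sketch of that suggestion, assuming the task is handed a ScheduledExecutorService; this is illustrative, not the actual FallbackModeTask in grpc-java:

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: FallbackModeTask owns its own scheduling instead of the
// caller assigning scheduledFuture directly.
final class FallbackModeTask implements Runnable {
  private final ScheduledExecutorService timerService;
  private ScheduledFuture<?> scheduledFuture;

  FallbackModeTask(ScheduledExecutorService timerService) {
    this.timerService = timerService;
  }

  void schedule(long timeout, TimeUnit unit) {
    scheduledFuture = timerService.schedule(this, timeout, unit);
  }

  void cancel() {
    if (scheduledFuture != null) {
      scheduledFuture.cancel(false);
    }
  }

  @Override
  public void run() {
    // enter fallback mode: start using the resolver-provided backend addresses
  }
}
```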
Done.
    }
    if (balancerWorking) {
      return;
    }
also
if (usingFallbackBackends) {
  return;
}
Nice catch! Fixed and added a test for it.
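Putting the suggestions from this review together, a hedged fragment of how the method body might read, assuming surrounding fields such as balancerWorking, usingFallbackBackends, and a FALLBACK_TIMEOUT_MS constant, plus the schedule(...) helper suggested earlier:

```java
// Hypothetical fragment: maybeStartFallbackTimer with the early returns from this review.
private void maybeStartFallbackTimer() {
  if (fallbackTimer != null) {
    return;  // a fallback timer is already pending
  }
  if (balancerWorking) {
    return;  // we still have a usable serverlist from the balancer
  }
  if (usingFallbackBackends) {
    return;  // already in fallback mode
  }
  logger.log(Level.FINE, "[{0}] Starting fallback timer.", new Object[] {logId});
  fallbackTimer = new FallbackModeTask();
  fallbackTimer.schedule(FALLBACK_TIMEOUT_MS, TimeUnit.MILLISECONDS);
}
```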
zhangkun83 left a comment:
PTAL.
| "NameResolver returned no LB address while asking for GRPCLB")); | ||
| } else { | ||
| grpclbState.updateAddresses(newLbAddressGroups, newBackendServers); | ||
| grpclbState.handleAddresses(newLbAddressGroups, newBackendServers); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| * Start the fallback timer if it's not already started and all connections are lost. | ||
| */ | ||
| private void maybeStartFallbackTimer() { | ||
| if (fallbackTimer == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| } | ||
| if (balancerWorking) { | ||
| return; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch! Fixed and added test for it.
| } | ||
| logger.log(Level.FINE, "[{0}] Starting fallback timer.", new Object[] {logId}); | ||
| fallbackTimer = new FallbackModeTask(); | ||
| fallbackTimer.scheduledFuture = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
dapengzhang0 left a comment:
LGTM
Previously, fallback mode could be entered only if the client had not
received any server list and the fallback timeout had expired.
Now the fallback timer is started when the stream to the balancer is broken
AND there are no ready Subchannels. Fallback mode is activated when the
timer expires. When a new server list is received from the balancer, either
the fallback timer is cancelled, or fallback mode is exited.
Also fixed a bug where the fallback timer was not cancelled when GrpclbState
was shut down.
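As a rough illustration of that shutdown fix (hypothetical method and field names, not the actual grpc-java code):

```java
// Hypothetical fragment: cancel a pending fallback timer when the LB state shuts down.
void shutdown() {
  if (fallbackTimer != null) {
    fallbackTimer.cancel();
    fallbackTimer = null;
  }
  // ... shut down the balancer RPC, the LB channel, and all subchannels ...
}
```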