
grpclb fallback after startup #18344

Merged 1 commit into grpc:master on Mar 18, 2019

Conversation

@markdroth (Member) commented Mar 12, 2019

Based on #18340.


This change is Reviewable

@markdroth added the release notes: no label on Mar 12, 2019
@markdroth requested a review from apolcyn as a code owner on March 12, 2019 19:52
@AspirinSJL (Member) left a comment

Reviewed 3 of 3 files at r1.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @apolcyn and @markdroth)


src/core/ext/filters/client_channel/lb_policy/grpclb/grpclb.cc, line 1059 at r1 (raw file):

      // addresses instead of the addresses from the new serverlist.
      // However, if we can't reach any of the servers in the new
      // serverlist, then the child policy will never switch away from

If we can't reach any of the servers in the new serverlist, then the child policy will never switch back to the fallback mode, right? Why would the child policy want to switch away from the fallback addresses?

@markdroth (Member, Author) left a comment

Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @apolcyn and @AspirinSJL)


src/core/ext/filters/client_channel/lb_policy/grpclb/grpclb.cc, line 1059 at r1 (raw file):

Previously, AspirinSJL (Juanli Shen) wrote…

If we can't reach any of the servers in the new serverlist, then the child policy will never switch back to the fallback mode, right? Why would the child policy want to switch away from the fallback addresses?

From the perspective of the child policy, if we're currently in fallback mode, we have a single subchannel list containing the fallback addresses. Then we get a new serverlist from the balancer, and we give the child policy an update containing the addresses from the serverlist. The child policy will create a new subchannel list with the addresses from the serverlist and attempt to connect to them, but that new subchannel list will remain pending until one of the subchannels goes into state READY; until then, the child policy is continuing to use the original subchannel list containing the fallback addresses.
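
For illustration, here is a minimal sketch of the pending-subchannel-list behavior just described; the class and method names are hypothetical and heavily simplified, not the actual child-policy code. The point it shows is that a new address list from the parent policy does not take effect until one of its subchannels reports READY.

```cpp
// Minimal sketch, with hypothetical names, of the pending-subchannel-list
// behavior described above; this is not the actual child-policy code.
#include <memory>
#include <string>
#include <utility>
#include <vector>

enum class ConnectivityState { IDLE, CONNECTING, READY, TRANSIENT_FAILURE };

struct SubchannelList {
  std::vector<std::string> addresses;
};

class ChildPolicy {
 public:
  // The parent (grpclb) policy sends a new address list, e.g. the
  // addresses from a freshly received serverlist.
  void UpdateAddresses(std::vector<std::string> addresses) {
    pending_ = std::make_unique<SubchannelList>();
    pending_->addresses = std::move(addresses);
    // Connection attempts on the pending list would start here.
  }

  // The aggregate connectivity state of the pending list changed.  Until
  // it reports READY, picks keep going to current_ (the fallback list).
  void OnPendingListStateChange(ConnectivityState state) {
    if (pending_ != nullptr && state == ConnectivityState::READY) {
      current_ = std::move(pending_);
    }
  }

 private:
  std::unique_ptr<SubchannelList> current_;  // still the fallback addresses
  std::unique_ptr<SubchannelList> pending_;  // the new serverlist addresses
};
```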

In this situation, the child policy is continuing to use the fallback addresses, but the grpclb policy thinks we've exited fallback mode. As a result, if the resolver delivers an update to the grpclb policy that updates the list of fallback addresses, the grpclb policy will not pass those addresses to the child policy; instead, it will pass in the addresses from the serverlist, which we are still not able to connect to.

Note that if we get to the point that all fallback addresses are out of date and we lose contact with all of them, then the child policy will report TRANSIENT_FAILURE, and the grpclb policy will then go back into fallback mode, at which point we will pass the updated fallback addresses to the child policy. But between the point where the grpclb policy thinks we've exited fallback mode and the point where the last of the original fallback addresses stops responding, we may slowly lose backends.

This is definitely sub-optimal, but in order for it to be a problem, three separate things need to go wrong:

  1. We need to initially go into fallback mode for some reason (either the fallback-at-startup case or the fallback-after-startup case).
  2. We get back in touch with the balancer, which gives us a serverlist containing only addresses we can't reach.
  3. The fallback backends then go down.

I think this should be sufficiently rare that I'm not too worried about it for grpclb. And we will not have this problem in xds, because it will have a separate fallback policy.
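
To make the window concrete, here is a minimal sketch of the address-selection step being discussed, again with hypothetical names rather than the actual grpclb.cc logic: once a serverlist has arrived, a resolver update that changes the fallback addresses is no longer forwarded to the child policy.

```cpp
// Minimal sketch, with hypothetical names, of the address-selection step
// being discussed; this is not the actual grpclb.cc logic.
#include <string>
#include <vector>

struct GrpclbState {
  bool fallback_mode = false;                   // what the grpclb policy thinks
  std::vector<std::string> fallback_addresses;  // latest from the resolver
  std::vector<std::string> serverlist;          // latest from the balancer
};

// Addresses handed to the child policy whenever the resolver or the
// balancer delivers an update.
std::vector<std::string> AddressesForChild(const GrpclbState& state) {
  if (!state.fallback_mode && !state.serverlist.empty()) {
    // Fallback mode is considered exited, so an updated fallback address
    // list from the resolver is not forwarded, even though the child
    // policy may still be unable to reach any serverlist address.
    return state.serverlist;
  }
  return state.fallback_addresses;
}
```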

@AspirinSJL (Member) left a comment

Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @apolcyn)


src/core/ext/filters/client_channel/lb_policy/grpclb/grpclb.cc, line 1059 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…


I see. Sounds like a very rare case.

@markdroth (Member, Author) commented

Known failures: #18416 #16392 #18207 #18404

@markdroth force-pushed the grpclb_fallback_after_startup branch from f6e91df to adc2163 on March 18, 2019 16:17
@markdroth merged commit 6a4ddc9 into grpc:master on Mar 18, 2019
@markdroth deleted the grpclb_fallback_after_startup branch on March 18, 2019 18:42
@lock bot locked as resolved and limited conversation to collaborators on Jun 16, 2019