
grpclb fallback after startup #18344

Merged 1 commit into grpc:master on Mar 18, 2019

Conversation

@markdroth (Member) commented Mar 12, 2019

Based on #18340.


This change is Reviewable

@markdroth added the release notes: no label on Mar 12, 2019
@markdroth requested a review from apolcyn as a code owner on March 12, 2019 19:52
@AspirinSJL (Member) left a comment

Reviewed 3 of 3 files at r1.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @apolcyn and @markdroth)


src/core/ext/filters/client_channel/lb_policy/grpclb/grpclb.cc, line 1059 at r1 (raw file):

      // addresses instead of the addresses from the new serverlist.
      // However, if we can't reach any of the servers in the new
      // serverlist, then the child policy will never switch away from

If we can't reach any of the servers in the new serverlist, then the child policy will never switch back to the fallback mode, right? Why would the child policy want to switch away from the fallback addresses?

@markdroth (Member, Author) left a comment

Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @apolcyn and @AspirinSJL)


src/core/ext/filters/client_channel/lb_policy/grpclb/grpclb.cc, line 1059 at r1 (raw file):

Previously, AspirinSJL (Juanli Shen) wrote…

If we can't reach any of the servers in the new serverlist, then the child policy will never switch back to the fallback mode, right? Why would the child policy want to switch away from the fallback addresses?

From the perspective of the child policy, if we're currently in fallback mode, we have a single subchannel list containing the fallback addresses. Then we get a new serverlist from the balancer, and we give the child policy an update containing the addresses from the serverlist. The child policy will create a new subchannel list with the addresses from the serverlist and attempt to connect to them, but that new subchannel list will remain pending until one of the subchannels goes into state READY; until then, the child policy is continuing to use the original subchannel list containing the fallback addresses.
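
For illustration, here is a minimal sketch of the pending-subchannel-list behavior just described; the class and method names are hypothetical and heavily simplified, not the actual child-policy code. The point it shows is that a new address list from the parent policy does not take effect until one of its subchannels reports READY.

```cpp
// Minimal sketch, with hypothetical names, of the pending-subchannel-list
// behavior described above; this is not the actual child-policy code.
#include <memory>
#include <string>
#include <utility>
#include <vector>

enum class ConnectivityState { IDLE, CONNECTING, READY, TRANSIENT_FAILURE };

struct SubchannelList {
  std::vector<std::string> addresses;
};

class ChildPolicy {
 public:
  // The parent (grpclb) policy sends a new address list, e.g. the
  // addresses from a freshly received serverlist.
  void UpdateAddresses(std::vector<std::string> addresses) {
    pending_ = std::make_unique<SubchannelList>();
    pending_->addresses = std::move(addresses);
    // Connection attempts on the pending list would start here.
  }

  // The aggregate connectivity state of the pending list changed.  Until
  // it reports READY, picks keep going to current_ (the fallback list).
  void OnPendingListStateChange(ConnectivityState state) {
    if (pending_ != nullptr && state == ConnectivityState::READY) {
      current_ = std::move(pending_);
    }
  }

 private:
  std::unique_ptr<SubchannelList> current_;  // still the fallback addresses
  std::unique_ptr<SubchannelList> pending_;  // the new serverlist addresses
};
```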

In this situation, the child policy is continuing to use the fallback addresses, but the grpclb policy thinks we've exited fallback mode. As a result, if the resolver delivers an update to the grpclb policy that updates the list of fallback addresses, the grpclb policy will not pass those addresses to the child policy; instead, it will pass in the addresses from the serverlist, which we are still not able to connect to.

Note that if we get to the point that all fallback addresses are out of date and we lose contact with all of them, then the child policy will report TRANSIENT_FAILURE, and the grpclb policy will then go back into fallback mode, at which point we will pass the updated fallback addresses to the child policy. But between the point where the grpclb policy thinks we've exited fallback mode and the point where the last of the original fallback addresses stops responding, we may slowly lose backends.

This is definitely sub-optimal, but in order for it to be a problem, three separate things need to go wrong:

  1. We need to initially go into fallback mode for some reason (either the fallback-at-startup case or the fallback-after-startup case).
  2. We get back in touch with the balancer, which gives us a serverlist containing only addresses we can't reach.
  3. The fallback backends then go down.

I think this should be sufficiently rare that I'm not too worried about it for grpclb. And we will not have this problem in xds, because it will have a separate fallback policy.
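
To make the window concrete, here is a minimal sketch of the address-selection step being discussed, again with hypothetical names rather than the actual grpclb.cc logic: once a serverlist has arrived, a resolver update that changes the fallback addresses is no longer forwarded to the child policy.

```cpp
// Minimal sketch, with hypothetical names, of the address-selection step
// being discussed; this is not the actual grpclb.cc logic.
#include <string>
#include <vector>

struct GrpclbState {
  bool fallback_mode = false;                   // what the grpclb policy thinks
  std::vector<std::string> fallback_addresses;  // latest from the resolver
  std::vector<std::string> serverlist;          // latest from the balancer
};

// Addresses handed to the child policy whenever the resolver or the
// balancer delivers an update.
std::vector<std::string> AddressesForChild(const GrpclbState& state) {
  if (!state.fallback_mode && !state.serverlist.empty()) {
    // Fallback mode is considered exited, so an updated fallback address
    // list from the resolver is not forwarded, even though the child
    // policy may still be unable to reach any serverlist address.
    return state.serverlist;
  }
  return state.fallback_addresses;
}
```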

@AspirinSJL (Member) left a comment

Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @apolcyn)


src/core/ext/filters/client_channel/lb_policy/grpclb/grpclb.cc, line 1059 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…


I see. Sounds like a very rare case.

@markdroth (Member, Author) commented

Known failures: #18416 #16392 #18207 #18404

@markdroth force-pushed the grpclb_fallback_after_startup branch from f6e91df to adc2163 on March 18, 2019 16:17
@markdroth merged commit 6a4ddc9 into grpc:master on Mar 18, 2019
@markdroth deleted the grpclb_fallback_after_startup branch on March 18, 2019 18:42
@lock bot locked as resolved and limited conversation to collaborators on Jun 16, 2019