Make RR re-resolve when any of its subchannels fail. #14170
Conversation
Generally, re-resolution will call grpc_resolver_channel_saw_error_locked() to get the next resolution. In #13671, fake_resolver_channel_saw_error_locked() will return the results_upon_error as the next resolution. To serve results_upon_error only when we pull for re-resolution, we don't trigger a resolution when we set results_upon_error.

I think it's better to base this PR on #13671 to reuse grpc_fake_resolver_response_generator_set_response_upon_error().
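To make the sequencing concrete, here is a minimal, self-contained sketch of the behavior being described. FakeResolverSketch and its members are illustrative stand-ins, not the real fake_resolver API from #13671:

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct Resolution {
  std::vector<std::string> addresses;
};

// Hypothetical analogue of the fake resolver's error path.
class FakeResolverSketch {
 public:
  // Stores the results to serve on the next error-triggered re-resolution.
  // Deliberately fires no callback: results_upon_error must only be served
  // when the channel pulls for re-resolution, not when the test sets it.
  void SetResultsUponError(Resolution results) {
    results_upon_error_ = std::make_unique<Resolution>(std::move(results));
  }

  // Analogue of fake_resolver_channel_saw_error_locked(): hand out the
  // stored error results, if any, as the next resolution.
  std::unique_ptr<Resolution> ChannelSawError() {
    return std::move(results_upon_error_);
  }

 private:
  std::unique_ptr<Resolution> results_upon_error_;
};
```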
I have not yet done a complete review, but at a high level, I think we need to rethink this a bit. Please let me know if you have any questions. Thanks!

src/core/ext/filters/client_channel/lb_policy/round_robin/round_robin.cc, line 352 at r1:
I think this change has some implications. Prior to #13932, the subchannel reported TRANSIENT_FAILURE when a connection attempt failed and SHUTDOWN when it received a GOAWAY, and we only re-resolved in the latter case. Now that the subchannel reports TRANSIENT_FAILURE in both of those cases, this change means that we will re-resolve in both cases. I think that's okay in principle, but I'm a bit concerned that this could cause us to hammer the DNS server with too many requests.

In particular, consider the case where all of the backends are down: as soon as we initially attempt to connect to all of the backends, all subchannels will go into TRANSIENT_FAILURE, and we will immediately re-resolve. If the backends all stay down for an extended period of time, the subchannels will all re-attempt to connect using the same backoff algorithm, so they will all go into CONNECTING and then back into TRANSIENT_FAILURE at about the same time for each attempt, meaning that we will wind up re-resolving each time the backoff algorithm tells us to try again. Even worse, if one of the backends comes up briefly (long enough for us to establish a connection) and then dies again, that subchannel will be off-cycle with the others with respect to the backoff algorithm, which means that we'll re-resolve twice as often.

After a long talk with @ejona86 yesterday, I think we need to re-think our overall strategy for re-resolution. Currently, we try to be conservative about when we request re-resolution, to avoid hammering the DNS server. A better approach would be to change the DNS resolver code to impose a minimum time between re-resolution requests, as discussed in https://reviewable.io/reviews/grpc/grpc/13671#-L0p5_pdzqIYgVgEu-9d:-L2LWGHF4c9_cItXiqbb:btdcjm. Once we do that, we can be much more aggressive in the LB policy code about deciding when to re-resolve, because no matter how often we request re-resolution, there will still be a minimum time between DNS requests. So, I suggest making both changes: the minimum interval in the DNS resolver, and the more aggressive re-resolution in the LB policy.
It's fine to split the DNS resolver change into a separate PR if you want, but that would need to go in before this one can. We should probably try to get this done quickly, because we don't want to be in a state where we're not re-resolving at all. (In fact, we should check that the code in the currently broken state didn't make its way into a release.)
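For concreteness, here is a rough, self-contained sketch of the cooldown idea described above. It is not the actual gRPC resolver code; ReResolutionThrottle and its methods are hypothetical names, and a real implementation would coalesce requests that arrive while a query is already scheduled rather than queuing a new slot for each one:

```cpp
#include <chrono>
#include <iostream>

// Hypothetical helper enforcing a minimum time between DNS queries.
class ReResolutionThrottle {
 public:
  explicit ReResolutionThrottle(std::chrono::milliseconds min_interval)
      : min_interval_(min_interval) {}

  // Called whenever an LB policy requests re-resolution. Returns how long
  // the resolver should wait before issuing the next DNS query: zero if the
  // cooldown has already elapsed, otherwise the remaining cooldown.
  std::chrono::milliseconds DelayBeforeNextQuery() {
    const auto now = std::chrono::steady_clock::now();
    const auto earliest = last_query_ + min_interval_;
    if (now >= earliest) {
      last_query_ = now;  // Query can go out immediately.
      return std::chrono::milliseconds(0);
    }
    last_query_ = earliest;  // Query goes out when the cooldown expires.
    return std::chrono::duration_cast<std::chrono::milliseconds>(earliest - now);
  }

 private:
  const std::chrono::milliseconds min_interval_;
  std::chrono::steady_clock::time_point last_query_{};
};

int main() {
  ReResolutionThrottle throttle(std::chrono::milliseconds(30000));
  // The first request is served immediately; a second one right afterwards
  // must wait for (close to) the full 30s cooldown.
  std::cout << throttle.DelayBeforeNextQuery().count() << "ms\n";  // 0ms
  std::cout << throttle.DelayBeforeNextQuery().count() << "ms\n";  // ~30000ms
}
```

With this in place, the LB policy can request re-resolution as often as it likes; the rate limiting lives entirely in the resolver.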
src/core/ext/filters/client_channel/lb_policy/round_robin/round_robin.cc, line 352 at r1: Previously, markdroth (Mark D. Roth) wrote…

Do all stacks need to be in agreement on this DNS resolver behavior, or can we proceed?
src/core/ext/filters/client_channel/lb_policy/round_robin/round_robin.cc, line 352 at r1: Previously, dgquintas (David G. Quintas) wrote…

We don't have uniform behavior between stacks now, so changing our behavior won't make that problem any worse. As it happens, Java already does basically what I described above, except that their DNS resolver doesn't enforce a minimum time between queries. However, since their usual DNS resolver relies on the JVM's caching behavior, it's kind of a moot point for them. We can worry about consistency more later when we get around to writing a client channel spec.
src/core/ext/filters/client_channel/lb_policy/round_robin/round_robin.cc, line 352 at r1: Previously, markdroth (Mark D. Roth) wrote…

Ok. I'm aiming to have those changes ready by mid-next week.
src/core/ext/filters/client_channel/lb_policy/round_robin/round_robin.cc, line 352 at r1: Previously, dgquintas (David G. Quintas) wrote…

#14228 has been merged.
src/core/ext/filters/client_channel/lb_policy/round_robin/round_robin.cc, line 352 at r1: Previously, dgquintas (David G. Quintas) wrote…

Super. Please merge those changes into this PR and change it to re-resolve whenever any one subchannel goes into TRANSIENT_FAILURE.
@sreecha Please have a look at https://paste.googleplex.com/6424444587212800?raw , the result of running […]
It seems to only happen with epoll1. The current guess is that the client side keeps queuing callbacks to the combiner, and the callbacks are picked up by server-side threads polling the cq (because of epoll1), which blocks the server from returning from a wait. The evidence: in a hanging test, the server side is waiting for a ThreadManager thread to finish, and in that thread the backtrace shows client-side work: #8 0x000055555567038e in rr_update_locked (policy=0x5555559bb090, args=0x7fffdeffc2e0) at src/core/ext/filters/client_channel/lb_policy/round_robin/round_robin.cc:538
src/core/ext/filters/client_channel/lb_policy/round_robin/round_robin.cc, line 352 at r1: Previously, markdroth (Mark D. Roth) wrote…

Done. PTAL.
omg green |
Only one significant issue here; the other comments are minor. I am a little concerned that our tests are all green when there's still a significant bug here. Is there a reasonable way to add a test to catch this kind of problem?

src/core/ext/filters/client_channel/lb_policy/round_robin/round_robin.cc, line 333 at r3: No need to mention requesting re-resolution here, because we're not actually triggering that here.

src/core/ext/filters/client_channel/lb_policy/round_robin/round_robin.cc, line 354 at r3: Since we're no longer using the […]

src/core/ext/filters/client_channel/lb_policy/round_robin/round_robin.cc, line 568 at r3: I don't think it's safe to reset […] In the long term, I'd like to see a more general solution here, but I think that will require some careful thought about how to improve the subchannel_list API. I'll do that later as part of C++-ifying that API. For now, please add the following comment: TODO(roth): As part of C++-ifying the subchannel_list API, design a better API for notifying the LB policy of subchannel states, which can be used both for the subchannel's initial state and for subsequent state changes. This will allow us to handle this more generally instead of special-casing TRANSIENT_FAILURE (e.g., we can also distribute any pending picks across all READY subchannels rather than sending them all to the first one).

test/cpp/end2end/client_lb_end2end_test.cc, line 141 at r3: Please split this into separate methods for setting the next resolution and the re-resolution response, as @AspirinSJL did in grpclb_end2end_test in #14281 (see the sketch after this comment).

test/cpp/end2end/client_lb_end2end_test.cc, line 581 at r3: I assume these log messages are leftovers from debugging. :)
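Regarding the requested split of the test helper, here is a rough sketch of the shape it might take; the method names follow the review comment, but the class and bodies are illustrative, not the actual helper from #14281:

```cpp
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical test helper with separate setters for the two cases.
class FakeResolverTestHelper {
 public:
  // Addresses returned for the next ordinary resolution.
  void SetNextResolution(std::vector<std::string> addresses) {
    std::lock_guard<std::mutex> lock(mu_);
    next_resolution_ = std::move(addresses);
  }

  // Addresses returned only when the channel re-resolves after an error.
  // Setting this must not itself trigger a resolution callback, or it would
  // cause an LB update that bypasses the code path under test.
  void SetNextReresolutionResponse(std::vector<std::string> addresses) {
    std::lock_guard<std::mutex> lock(mu_);
    reresolution_response_ = std::move(addresses);
  }

 private:
  std::mutex mu_;
  std::vector<std::string> next_resolution_;
  std::vector<std::string> reresolution_response_;
};
```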
src/core/ext/filters/client_channel/lb_policy/round_robin/round_robin.cc, line 402 at r3: Not directly related to the rest of this PR, but I think we should only do this if […]
I couldn't come up with any effective way to make the previous state change :/

src/core/ext/filters/client_channel/lb_policy/round_robin/round_robin.cc, line 333 at r3: Done.

src/core/ext/filters/client_channel/lb_policy/round_robin/round_robin.cc, line 354 at r3: Done.

src/core/ext/filters/client_channel/lb_policy/round_robin/round_robin.cc, line 402 at r3: Good catch. Done.

src/core/ext/filters/client_channel/lb_policy/round_robin/round_robin.cc, line 568 at r3: Done.

test/cpp/end2end/client_lb_end2end_test.cc, line 141 at r3: Done.

test/cpp/end2end/client_lb_end2end_test.cc, line 581 at r3: Whoops. Done. At least I don't use swear words in these any more... learnt that the embarrassing way.
This looks great! All remaining comments are minor.

src/core/ext/filters/client_channel/lb_policy/round_robin/round_robin.cc, line 352 at r4: I think this should still say […]

test/cpp/end2end/client_lb_end2end_test.cc, line 168 at r4: Could refactor out the code for generating the addresses into its own function, since that's the same for both this function and the previous one.

test/cpp/end2end/client_lb_end2end_test.cc, line 765 at r4: Nit: No need for the semicolon here, since you've added the braces.

test/cpp/end2end/client_lb_end2end_test.cc, line 767 at r4: I assume this is another log message that was added for debugging.
Whoops, missed the last set of comments. Putting together another tiny PR to address them.
I've addressed the missed comments in #14374.

src/core/ext/filters/client_channel/lb_policy/round_robin/round_robin.cc, line 352 at r4: Done.

test/cpp/end2end/client_lb_end2end_test.cc, line 168 at r4: Done.

test/cpp/end2end/client_lb_end2end_test.cc, line 765 at r4: Done.

test/cpp/end2end/client_lb_end2end_test.cc, line 767 at r4: Done.
Addressing #14170 leftover comments
There were leftovers from #13932 (for example, it doesn't make sense anymore to look at num_shutdown). This was preventing RR from re-resolving the moment all its subchannels were in TRANSIENT_FAILURE. The fake resolver has been modified to allow setting up the next resolution without triggering a call to the registered notification closure, as doing so would trigger an LB update that wouldn't exercise the new code path.

Fixes #14097