
Make RR re-resolve when any of its subchannels fail. #14170

Merged

Conversation

dgquintas
Contributor

@dgquintas dgquintas commented Jan 24, 2018

There were leftovers from #13932 (for example, it no longer makes sense to look at num_shutdown). These were preventing RR from re-resolving the moment all its subchannels went into TRANSIENT_FAILURE.

The fake resolver has been modified to allow setting up the next resolution without triggering a call to the registered notification closure, since that notification would trigger an LB update that wouldn't exercise the new codepath.
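
For illustration, a minimal sketch of how a test might use the new setter. The exact signature of grpc_fake_resolver_response_generator_set_response_no_notify is assumed here (only the symbol name appears in the size report below), and BuildFakeAddresses is a hypothetical test helper:

    // Hypothetical usage in a test (sketch only):
    grpc_fake_resolver_response_generator* gen =
        grpc_fake_resolver_response_generator_create();
    grpc_channel_args* next_results = BuildFakeAddresses({port1, port2});
    // Stages the next resolution WITHOUT invoking the registered notification
    // closure, so no LB update is pushed; RR only sees these addresses once it
    // requests re-resolution (e.g. when all its subchannels hit TRANSIENT_FAILURE).
    grpc_fake_resolver_response_generator_set_response_no_notify(gen, next_results);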

Fixes #14097


This change is Reviewable

@grpc-testing

****************************************************************

libgrpc.so

     VM SIZE                                                                          FILE SIZE
 ++++++++++++++ GROWING                                                            ++++++++++++++
  +0.0%    +240 [None]                                                             +1.35Ki  +0.0%
  +4.0%     +48 src/core/ext/filters/client_channel/resolver/fake/fake_resolver.cc     +48  +4.0%
      [NEW]    +157 set_response_and_maybe_push                                           +157  [NEW]
      [NEW]      +7 grpc_fake_resolver_response_generator_set_response_no_notify            +7  [NEW]
      +2.8%      +6 [Unmapped]                                                              +6  +2.8%
      +6.7%      +6 set_response_closure_fn                                                 +6  +6.7%

  +0.0%    +288 TOTAL                                                              +1.40Ki  +0.0%


****************************************************************

libgrpc++.so

     VM SIZE        FILE SIZE
 ++++++++++++++  ++++++++++++++

  [ = ]       0        0  [ = ]



@grpc-testing

[trickle] No significant performance differences

@grpc-testing

[microbenchmarks] No significant performance differences

Member

@AspirinSJL AspirinSJL left a comment

Generally, re-resolution will call grpc_resolver_channel_saw_error_locked() to get the next resolution. In #13671, fake_resolver_channel_saw_error_locked() will return the results_upon_error as the next resolution. To serve results_upon_error only when we pull for re-resolution, we don't trigger a resolution when we set results_upon_error.

I think it's better to base this PR on #13671 to reuse grpc_fake_resolver_response_generator_set_response_upon_error().
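
For context, a rough sketch of the #13671 behavior being described; the field and helper names are approximations of fake_resolver.cc, not exact code:

    static void fake_resolver_channel_saw_error_locked(grpc_resolver* resolver) {
      fake_resolver* r = (fake_resolver*)resolver;
      if (r->results_upon_error != nullptr) {
        // Only now, when the channel asks for re-resolution, promote the staged
        // "upon error" results to be the next resolution that gets served.
        grpc_channel_args_destroy(r->next_results);
        r->next_results = grpc_channel_args_copy(r->results_upon_error);
      }
      fake_resolver_maybe_finish_next_locked(r);
    }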

@markdroth
Member

I have not yet done a complete review, but at a high level, I think we need to rethink this a bit. Please let me know if you have any questions. Thanks!


Review status: 0 of 4 files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.


src/core/ext/filters/client_channel/lb_policy/round_robin/round_robin.cc, line 352 at r1 (raw file):

    /* 3) TRANSIENT_FAILURE and re-resolve */
    grpc_connectivity_state_set(
        &p->state_tracker, GRPC_CHANNEL_TRANSIENT_FAILURE,

I think this change has some implications.

Prior to #13932, the subchannel reported TRANSIENT_FAILURE when a connection attempt failed and SHUTDOWN when it received a GOAWAY, and we only re-resolved in the latter case. Now that the subchannel reports TRANSIENT_FAILURE in both of those cases, this change means that we will re-resolve in both cases.

I think that's okay in principle, but I'm a bit concerned that this could cause us to hammer the DNS server with too many requests. In particular, consider the case where all of the backends are down: in this case, as soon as we initially attempt to connect to all of the backends, all subchannels will go into TRANSIENT_FAILURE, and we will immediately re-resolve. If the backends all stay down for an extended period of time, the subchannels will all re-attempt to connect using the same backoff algorithm, so they will all go into CONNECTING and then back into TRANSIENT_FAILURE at about the same time for each attempt, meaning that we will wind up re-resolving each time the backoff algorithm tells us to try again. But even worse, if one of the backends comes up briefly (long enough for us to establish a connection) and then dies again, now one subchannel will be off-cycle with the others with respect to the backoff algorithm, which means that we'll re-resolve twice as often.

After a long talk with @ejona86 yesterday, I think we need to re-think our overall strategy for re-resolution. Currently, we try to be conservative in when we request re-resolution to avoid hammering the DNS server. But I think a better approach would be to change the DNS resolver code to impose a minimum time between re-resolution requests, as discussed in https://reviewable.io/reviews/grpc/grpc/13671#-L0p5_pdzqIYgVgEu-9d:-L2LWGHF4c9_cItXiqbb:btdcjm. Once we do that, we can then be much more aggressive in the LB policy code in deciding when to re-resolve, because no matter how often we request re-resolution, there will still be a minimum time between DNS requests.

So, I suggest the following changes:

  1. Add a minimum time between re-resolutions to the DNS resolver (configurable via a channel arg, with some reasonable default). This will need to be added to both the native and c-ares DNS resolver implementations.
    • The resolver should record the time of the last DNS request.
    • When channel_saw_error() is called, it will check how long it's been since the last DNS request. If it's been longer than the configured minimum interval, a new DNS request can be sent immediately. Otherwise, a timer can be started to perform a DNS request at the timestamp of the last request plus the configured minimum interval.
    • Note: Now that the LB policies are not relying on the updated DNS response to try to reconnect to the subchannels that got GOAWAYs, there is no requirement for channel_saw_error() to immediately cause new data to be returned. Instead, we can just wait for the timer to fire.
  2. Change LB policy code to be more aggressive about requesting re-resolution. In particular, I think we can now request re-resolution whenever any one individual subchannel goes into TRANSIENT_FAILURE -- we no longer need to wait for all subchannels to be in this state.
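
For concreteness, a rough sketch of the check described in item 1; the struct fields, helper, and timer/closure names here are placeholders rather than actual resolver code, and the real change would live in both the native and c-ares DNS resolvers and read the minimum interval from a channel arg:

    static void dns_channel_saw_error_locked(grpc_resolver* resolver) {
      dns_resolver* r = (dns_resolver*)resolver;
      const grpc_millis now = grpc_core::ExecCtx::Get()->Now();
      const grpc_millis earliest_allowed =
          r->last_resolution_timestamp + r->min_time_between_resolutions;
      if (now >= earliest_allowed) {
        // Long enough since the last DNS request: send a new one immediately.
        dns_start_resolving_locked(r);
      } else if (!r->have_cooldown_timer) {
        // Too soon: schedule a request for last-request-time + minimum interval.
        r->have_cooldown_timer = true;
        grpc_timer_init(&r->cooldown_timer, earliest_allowed, &r->on_cooldown_closure);
      }
    }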

It's fine to split the DNS resolver change into a separate PR if you want, but that would need to go in before this one can.

We should probably try to get this done quickly, because we don't want to be in a state where we're not re-resolving at all. (In fact, we should check that the code in the currently broken state didn't make its way into a release.)


Comments from Reviewable

@ejona86
Member

ejona86 commented Jan 25, 2018

CC @dfawley @menghanl

@markdroth markdroth mentioned this pull request Jan 25, 2018
@dgquintas
Contributor Author

Review status: 0 of 4 files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.


src/core/ext/filters/client_channel/lb_policy/round_robin/round_robin.cc, line 352 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I think this change has some implications. […]

Do all stacks need to be in agreement on this DNS resolver behavior or can we proceed?


Comments from Reviewable

@markdroth
Member

Review status: 0 of 4 files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.


src/core/ext/filters/client_channel/lb_policy/round_robin/round_robin.cc, line 352 at r1 (raw file):

Previously, dgquintas (David G. Quintas) wrote…

Do all stacks need to be in agreement on this DNS resolver behavior or can we proceed?

We don't have uniform behavior between stacks now, so changing our behavior won't make that problem any worse.

As it happens, Java already does basically what I described above, except that their DNS resolver doesn't enforce a minimum time between queries. However, since their usual DNS resolver relies on the JVM's caching behavior, it's kind of a moot point for them.

We can worry about consistency more later when we get around to writing a client channel spec.


Comments from Reviewable

@dgquintas
Contributor Author

Review status: 0 of 4 files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.


src/core/ext/filters/client_channel/lb_policy/round_robin/round_robin.cc, line 352 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

We don't have uniform behavior between stacks now, so changing our behavior won't make that problem any worse.

As it happens, Java already does basically what I described above, except that their DNS resolver doesn't enforce a minimum time between queries. However, since their usual DNS resolver relies on the JVM's caching behavior, it's kind of a moot point for them.

We can worry about consistency more later when we get around to writing a client channel spec.

Ok. I'm aiming to have those changes ready by mid-next week.


Comments from Reviewable

@dgquintas
Contributor Author

Review status: 0 of 4 files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.


src/core/ext/filters/client_channel/lb_policy/round_robin/round_robin.cc, line 352 at r1 (raw file):

Previously, dgquintas (David G. Quintas) wrote…

Ok. I'm aiming to have those changes ready by mid-next week.

#14228 has been merged.


Comments from Reviewable

@markdroth
Member

Review status: 0 of 4 files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.


src/core/ext/filters/client_channel/lb_policy/round_robin/round_robin.cc, line 352 at r1 (raw file):

Previously, dgquintas (David G. Quintas) wrote…

#14228 has been merged.

Super. Please merge those changes into this PR and change it to re-resolve whenever any one subchannel goes into TRANSIENT_FAILURE.
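
For illustration, roughly what that request looks like from the RR connectivity callback. This is a sketch only; grpc_lb_policy_try_reresolve is the re-resolution helper and its exact name and signature are assumed here:

    // On any connectivity change for a subchannel:
    if (sd->curr_connectivity_state == GRPC_CHANNEL_TRANSIENT_FAILURE) {
      // A single failing subchannel is now enough to request re-resolution;
      // the resolver's minimum-interval logic from #14228 keeps this from
      // turning into a flood of DNS queries.
      grpc_lb_policy_try_reresolve(&p->base, &grpc_lb_round_robin_trace,
                                   GRPC_ERROR_NONE);
    }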


Comments from Reviewable

@dgquintas
Contributor Author

@sreecha Please have a look at https://paste.googleplex.com/6424444587212800?raw , the result of running GRPC_VERBOSITY=debug GRPC_TRACE=round_robin /usr/local/google/home/dgq/grpc/forks/grpc/bins/dbg/client_lb_end2end_test --gtest_filter=ClientLbEnd2endTest.RoundRobin with this PR patched in. (The output will be arbitrarily long, so I'd recommend piping it all to gvim and Ctrl-C'ing the test after ~3 seconds: ... |& gvim -)

@grpc-testing

****************************************************************

libgrpc.so

     VM SIZE                                                                                FILE SIZE
 ++++++++++++++ GROWING                                                                  ++++++++++++++
  +0.0%    +288 [None]                                                                   +1.35Ki  +0.0%
  +4.0%     +48 src/core/ext/filters/client_channel/resolver/fake/fake_resolver.cc           +48  +4.0%
      [NEW]    +157 set_response_and_maybe_push                                                 +157  [NEW]
      [NEW]      +7 grpc_fake_resolver_response_generator_set_response_no_notify                  +7  [NEW]
      +2.8%      +6 [Unmapped]                                                                    +6  +2.8%
      +6.7%      +6 set_response_closure_fn                                                       +6  +6.7%
  +0.5%     +32 src/core/ext/filters/client_channel/lb_policy/round_robin/round_robin.cc     +32  +0.5%
      +1.6%     +32 rr_connectivity_changed_locked                                               +32  +1.6%

  +0.0%    +368 TOTAL                                                                    +1.43Ki  +0.0%


****************************************************************

libgrpc++.so

     VM SIZE        FILE SIZE
 ++++++++++++++  ++++++++++++++

  [ = ]       0        0  [ = ]



@grpc-testing

[trickle] No significant performance differences

@grpc-testing

[microbenchmarks] No significant performance differences

@yang-g
Member

yang-g commented Feb 3, 2018

It seems to only happen with epoll1. The current guess is that the client side keeps queuing callbacks onto the combiner, and those callbacks get picked up by the server-side threads polling the cq (because of epoll1), which blocks the server from returning from a wait.

The evidence: in a hanging test, the server side is waiting for a ThreadManager thread to finish, and that thread's backtrace shows client-side work:

#8 0x000055555567038e in rr_update_locked (policy=0x5555559bb090, args=0x7fffdeffc2e0) at src/core/ext/filters/client_channel/lb_policy/round_robin/round_robin.cc:538
#9 0x000055555562e200 in grpc_lb_policy_update_locked (policy=0x5555559bb090, lb_policy_args=0x7fffdeffc2e0) at src/core/ext/filters/client_channel/lb_policy.cc:116
#10 0x000055555562a1b8 in on_resolver_result_changed_locked (arg=0x5555559bbbd0, error=0x0) at src/core/ext/filters/client_channel/client_channel.cc:453
#11 0x00005555555ba207 in grpc_combiner_continue_exec_ctx () at src/core/lib/iomgr/combiner.cc:257
#12 0x00005555555be110 in grpc_core::ExecCtx::Flush (this=0x7fffdeffca30) at src/core/lib/iomgr/exec_ctx.cc:128
#13 0x00005555555d5373 in end_worker (pollset=0x7fffe8004690, worker=0x7fffdeffc950, worker_hdl=0x0) at src/core/lib/iomgr/ev_epoll1_linux.cc:878
#14 0x00005555555d586f in pollset_work (ps=0x7fffe8004690, worker_hdl=0x0, deadline=144) at src/core/lib/iomgr/ev_epoll1_linux.cc:978
#15 0x00005555555bd690 in grpc_pollset_work (pollset=0x7fffe8004690, worker=0x0, deadline=144) at src/core/lib/iomgr/ev_posix.cc:247
#16 0x000055555561db5b in cq_next (cq=0x7fffe8004580, deadline=..., reserved=0x0) at src/core/lib/surface/completion_queue.cc:926
#17 0x000055555561dfa0 in grpc_completion_queue_next (cq=0x7fffe8004580, deadline=..., reserved=0x0) at src/core/lib/surface/completion_queue.cc:1001
#18 0x00005555555f24d6 in grpc::CompletionQueue::AsyncNextInternal (this=0x7fffe8005060, tag=0x7fffdeffcc90, ok=0x7fffdeffcc8f, deadline=...) at src/cpp/common/completion_queue_cc.cc:56
#19 0x0000555555600514 in grpc::CompletionQueue::AsyncNext<gpr_timespec> (this=0x7fffe8005060, tag=0x7fffdeffcc90, ok=0x7fffdeffcc8f, deadline=...) at include/grpc++/impl/codegen/completion_queue.h:157
#20 0x00005555555ff759 in grpc::Server::SyncRequestThreadManager::PollForWork (this=0x7fffe8004830, tag=0x7fffdeffcc90, ok=0x7fffdeffcc8f) at src/cpp/server/server_cc.cc:276

@grpc-testing

****************************************************************

libgrpc.so

     VM SIZE                                                                                       FILE SIZE
 ++++++++++++++ GROWING                                                                         ++++++++++++++
  +0.5%     +32 src/core/ext/filters/client_channel/lb_policy/round_robin/round_robin.cc            +32  +0.5%
      +1.6%     +32 rr_connectivity_changed_locked                                                      +32  +1.6%
  +0.2%     +24 src/core/lib/iomgr/ev_epollex_linux.cc                                              +24  +0.2%
      +6.9%     +32 pollable_process_events                                                             +32  +6.9%

 -------------- SHRINKING                                                                       --------------
  -0.0%    -128 [None]                                                                          -2.18Ki  -0.0%
      -0.0%    -128 [Unmapped]                                                                      -2.18Ki  -0.0%
      [DEL]     -32 vtable for grpc_core::RefCountedWithTracing<grpc_core::ConnectedSubchannel>         -32  [DEL]
 -17.9%    -272 src/core/ext/filters/client_channel/resolver/fake/fake_resolver.cc                 -272 -17.9%
     -94.8%    -184 grpc_fake_resolver_response_generator_set_response                                 -184 -94.8%
      [DEL]    -150 grpc_fake_resolver_response_generator_set_response_upon_error                      -150  [DEL]
      [DEL]    -117 set_response_closure_locked                                                        -117  [DEL]
     -47.4%     -46 fake_resolver_channel_saw_error_locked                                              -46 -47.4%
     -10.4%     -26 [Unmapped]                                                                          -26 -10.4%
     -18.4%      -9 fake_resolver_destroy                                                                -9 -18.4%
  -0.7%     -80 src/core/lib/iomgr/ev_poll_posix.cc                                                 -80  -0.7%
     -27.9%     -46 fd_shutdown_error                                                                   -46 -27.9%
      -8.3%     -32 notify_on_locked                                                                    -32  -8.3%
      -0.6%      -2 [Unmapped]                                                                           -2  -0.6%

  -0.0%    -424 TOTAL                                                                           -2.47Ki  -0.0%


****************************************************************

libgrpc++.so

     VM SIZE              FILE SIZE
 ++++++++++++++ GROWIN ++++++++++++++

 -------------- SHRINK --------------
  [ = ]       0 [None]     -64  -0.0%

  [ = ]       0 TOTAL      -64  -0.0%



@grpc-testing

[trickle] No significant performance differences

@dgquintas dgquintas force-pushed the fake_resolver_dont_push_rr_reresolve2 branch from 42def10 to 6b7ac64 Compare February 6, 2018 22:46
@dgquintas dgquintas force-pushed the fake_resolver_dont_push_rr_reresolve2 branch from 6b7ac64 to 2033170 Compare February 6, 2018 22:48
@grpc-testing

[microbenchmarks] Performance differences noted:
Benchmark                                                                                        allocs_per_iteration    atm_add_per_iteration    atm_cas_per_iteration    cpu_time    locks_per_iteration    nows_per_iteration    real_time
-----------------------------------------------------------------------------------------------  ----------------------  -----------------------  -----------------------  ----------  ---------------------  --------------------  -----------
BM_CreateDestroyCore                                                                                                     -25%                                                          +100%
BM_CreateDestroyCpp                                                                                                      -25%                                                          +33%
BM_CreateDestroyCpp2                                                                                                     -25%                                                          +100%
BM_CreateDestroyPollset                                                                                                  -99%                                              +25%        +100%                                        +25%
BM_EmptyCore                                                                                                             -14%                     +50%                     +188%       +49%                                         +187%
BM_PollAddFd                                                                                                             -99%                                              -95%        -99%                                         -95%
BM_PollEmptyPollset                                                                                                      -33%                     +9999%                   +374%       +66%                                         +374%
BM_PumpStreamClientToServer<InProcessCHTTP2>/2097152                                                                                                                                   +5%
BM_PumpStreamClientToServer<SockPair>/16777216                                                                                                                                         +67%                   +4%
BM_PumpStreamClientToServer<SockPair>/2097152                                                                                                                                          +18%
BM_PumpStreamClientToServer<SockPair>/262144                                                     -5%                     -6%                                                                                  -6%
BM_PumpStreamClientToServer<TCP>/262144                                                                                                                                                +9%
BM_PumpStreamClientToServer<UDS>/2097152                                                                                                                                               +16%
BM_PumpStreamClientToServer<UDS>/262144                                                          -5%                     -7%                                                                                  -7%
BM_PumpStreamServerToClient<InProcessCHTTP2>/2097152                                                                                                                                   +6%
BM_PumpStreamServerToClient<SockPair>/16777216                                                                                                                                         +7%
BM_PumpStreamServerToClient<SockPair>/2097152                                                                                                     +4%                                  +35%                   +22%
BM_PumpStreamServerToClient<SockPair>/262144                                                                             -7%                                                                                  -6%
BM_PumpStreamServerToClient<TCP>/262144                                                                                                                                                +5%
BM_PumpStreamServerToClient<UDS>/16777216                                                                                                         +7%                                  +64%                   +26%
BM_PumpStreamServerToClient<UDS>/2097152                                                                                                                                               +19%
BM_PumpStreamServerToClient<UDS>/262144                                                          -4%                     -5%
BM_SingleThreadPollOneFd                                                                                                 -33%                     +49%                     +104%       +66%                                         +104%
BM_StreamingPingPong<MinInProcessCHTTP2, NoOpMutator, NoOpMutator>/2097152/2                                                                                                           +4%
BM_StreamingPingPong<MinTCP, NoOpMutator, NoOpMutator>/0/0                                                                                                                             +17%
BM_StreamingPingPong<MinTCP, NoOpMutator, NoOpMutator>/0/1                                                                                        +4%                                  +20%
BM_StreamingPingPong<MinTCP, NoOpMutator, NoOpMutator>/0/2                                                                                        +4%                                  +22%
BM_StreamingPingPong<MinTCP, NoOpMutator, NoOpMutator>/1/1                                                                                                                             +20%
BM_StreamingPingPong<MinTCP, NoOpMutator, NoOpMutator>/1/2                                                                                        +4%                                  +22%
BM_StreamingPingPong<MinTCP, NoOpMutator, NoOpMutator>/262144/1                                                                                                                        +19%
BM_StreamingPingPong<MinTCP, NoOpMutator, NoOpMutator>/262144/2                                                                                                                        +20%
BM_StreamingPingPong<MinTCP, NoOpMutator, NoOpMutator>/32768/1                                                                                                                         +20%
BM_StreamingPingPong<MinTCP, NoOpMutator, NoOpMutator>/32768/2                                                                                                                         +21%
BM_StreamingPingPong<MinTCP, NoOpMutator, NoOpMutator>/4096/1                                                                                                                          +20%
BM_StreamingPingPong<MinTCP, NoOpMutator, NoOpMutator>/4096/2                                                                                     +4%                                  +21%
BM_StreamingPingPong<MinTCP, NoOpMutator, NoOpMutator>/512/1                                                                                                                           +20%
BM_StreamingPingPong<MinTCP, NoOpMutator, NoOpMutator>/512/2                                                                                      +4%                                  +21%
BM_StreamingPingPong<MinTCP, NoOpMutator, NoOpMutator>/64/1                                                                                       +4%                                  +20%
BM_StreamingPingPong<MinTCP, NoOpMutator, NoOpMutator>/64/2                                                                                       +4%                                  +21%
BM_StreamingPingPong<MinTCP, NoOpMutator, NoOpMutator>/8/1                                                                                        +4%                                  +20%
BM_StreamingPingPong<MinTCP, NoOpMutator, NoOpMutator>/8/2                                                                                        +4%                                  +22%
BM_StreamingPingPong<TCP, NoOpMutator, NoOpMutator>/0/0                                                                                                                                +17%
BM_StreamingPingPong<TCP, NoOpMutator, NoOpMutator>/0/1                                                                                           +4%                                  +20%
BM_StreamingPingPong<TCP, NoOpMutator, NoOpMutator>/0/2                                                                                           +4%                                  +22%
BM_StreamingPingPong<TCP, NoOpMutator, NoOpMutator>/1/1                                                                                           +4%                                  +20%
BM_StreamingPingPong<TCP, NoOpMutator, NoOpMutator>/1/2                                                                                           +4%                                  +22%
BM_StreamingPingPong<TCP, NoOpMutator, NoOpMutator>/262144/1                                                                                                                           +19%
BM_StreamingPingPong<TCP, NoOpMutator, NoOpMutator>/262144/2                                                                                                                           +20%
BM_StreamingPingPong<TCP, NoOpMutator, NoOpMutator>/32768/1                                                                                                                            +20%
BM_StreamingPingPong<TCP, NoOpMutator, NoOpMutator>/32768/2                                                                                                                            +21%
BM_StreamingPingPong<TCP, NoOpMutator, NoOpMutator>/4096/1                                                                                                                             +20%
BM_StreamingPingPong<TCP, NoOpMutator, NoOpMutator>/4096/2                                                                                        +4%                                  +21%
BM_StreamingPingPong<TCP, NoOpMutator, NoOpMutator>/512/1                                                                                                                              +20%
BM_StreamingPingPong<TCP, NoOpMutator, NoOpMutator>/512/2                                                                                         +4%                                  +21%
BM_StreamingPingPong<TCP, NoOpMutator, NoOpMutator>/64/1                                                                                          +4%                                  +20%
BM_StreamingPingPong<TCP, NoOpMutator, NoOpMutator>/64/2                                                                                          +4%                                  +21%
BM_StreamingPingPong<TCP, NoOpMutator, NoOpMutator>/8/1                                                                                           +4%                                  +20%
BM_StreamingPingPong<TCP, NoOpMutator, NoOpMutator>/8/2                                                                                           +4%                                  +22%
BM_StreamingPingPongMsgs<InProcessCHTTP2, NoOpMutator, NoOpMutator>/2097152                                                                                                            +5%
BM_StreamingPingPongMsgs<MinInProcessCHTTP2, NoOpMutator, NoOpMutator>/2097152                                                                                                         +5%
BM_StreamingPingPongMsgs<MinTCP, NoOpMutator, NoOpMutator>/0                                                                                      +4%                                  +27%
BM_StreamingPingPongMsgs<MinTCP, NoOpMutator, NoOpMutator>/1                                                                                      +4%                                  +27%
BM_StreamingPingPongMsgs<MinTCP, NoOpMutator, NoOpMutator>/262144                                                                                                                      +22%
BM_StreamingPingPongMsgs<MinTCP, NoOpMutator, NoOpMutator>/32768                                                                                  +4%                                  +24%
BM_StreamingPingPongMsgs<MinTCP, NoOpMutator, NoOpMutator>/4096                                                                                   +4%                                  +24%
BM_StreamingPingPongMsgs<MinTCP, NoOpMutator, NoOpMutator>/512                                                                                    +4%                                  +24%
BM_StreamingPingPongMsgs<MinTCP, NoOpMutator, NoOpMutator>/64                                                                                     +4%                                  +25%
BM_StreamingPingPongMsgs<MinTCP, NoOpMutator, NoOpMutator>/8                                                                                      +4%                                  +27%
BM_StreamingPingPongMsgs<TCP, NoOpMutator, NoOpMutator>/0                                                                                         +4%                                  +27%
BM_StreamingPingPongMsgs<TCP, NoOpMutator, NoOpMutator>/1                                                                                         +4%                                  +27%
BM_StreamingPingPongMsgs<TCP, NoOpMutator, NoOpMutator>/262144                                                                                                                         +22%
BM_StreamingPingPongMsgs<TCP, NoOpMutator, NoOpMutator>/32768                                                                                     +4%                                  +24%
BM_StreamingPingPongMsgs<TCP, NoOpMutator, NoOpMutator>/4096                                                                                      +4%                                  +24%
BM_StreamingPingPongMsgs<TCP, NoOpMutator, NoOpMutator>/512                                                                                       +4%                                  +24%
BM_StreamingPingPongMsgs<TCP, NoOpMutator, NoOpMutator>/64                                                                                        +4%                                  +25%
BM_StreamingPingPongMsgs<TCP, NoOpMutator, NoOpMutator>/8                                                                                         +4%                                  +27%
BM_StreamingPingPongWithCoalescingApi<InProcessCHTTP2, NoOpMutator, NoOpMutator>/2097152/1/0                                                                                           +4%
BM_StreamingPingPongWithCoalescingApi<InProcessCHTTP2, NoOpMutator, NoOpMutator>/2097152/1/1                                                                                           +4%
BM_StreamingPingPongWithCoalescingApi<InProcessCHTTP2, NoOpMutator, NoOpMutator>/2097152/2/0                                                                                           +4%
BM_StreamingPingPongWithCoalescingApi<InProcessCHTTP2, NoOpMutator, NoOpMutator>/2097152/2/1                                                                                           +5%
BM_StreamingPingPongWithCoalescingApi<MinInProcessCHTTP2, NoOpMutator, NoOpMutator>/2097152/1/0                                                                                        +4%
BM_StreamingPingPongWithCoalescingApi<MinInProcessCHTTP2, NoOpMutator, NoOpMutator>/2097152/1/1                                                                                        +4%
BM_StreamingPingPongWithCoalescingApi<MinInProcessCHTTP2, NoOpMutator, NoOpMutator>/2097152/2/0                                                                                        +4%
BM_StreamingPingPongWithCoalescingApi<MinInProcessCHTTP2, NoOpMutator, NoOpMutator>/2097152/2/1                                                                                        +5%
BM_UnaryPingPong<InProcessCHTTP2, NoOpMutator, NoOpMutator>/0/2097152                                                                                                                  +4%
BM_UnaryPingPong<InProcessCHTTP2, NoOpMutator, NoOpMutator>/2097152/0                                                                                                                  +4%
BM_UnaryPingPong<InProcessCHTTP2, NoOpMutator, NoOpMutator>/2097152/2097152                                                                                                            +5%
BM_UnaryPingPong<MinInProcessCHTTP2, NoOpMutator, NoOpMutator>/0/2097152                                                                                                               +4%
BM_UnaryPingPong<MinInProcessCHTTP2, NoOpMutator, NoOpMutator>/2097152/0                                                                                                               +4%
BM_UnaryPingPong<MinInProcessCHTTP2, NoOpMutator, NoOpMutator>/2097152/2097152                                                                                                         +5%
BM_UnaryPingPong<MinSockPair, NoOpMutator, NoOpMutator>/0/0                                                                                                                            +11%
BM_UnaryPingPong<MinTCP, NoOpMutator, NoOpMutator>/0/0                                                                                                                                 +11%
BM_UnaryPingPong<MinTCP, NoOpMutator, NoOpMutator>/0/1                                                                                                                                 +11%
BM_UnaryPingPong<MinTCP, NoOpMutator, NoOpMutator>/0/262144                                                                                                                            +10%
BM_UnaryPingPong<MinTCP, NoOpMutator, NoOpMutator>/0/32768                                                                                                                             +10%
BM_UnaryPingPong<MinTCP, NoOpMutator, NoOpMutator>/0/4096                                                                                                                              +10%
BM_UnaryPingPong<MinTCP, NoOpMutator, NoOpMutator>/0/512                                                                                                                               +10%
BM_UnaryPingPong<MinTCP, NoOpMutator, NoOpMutator>/0/64                                                                                                                                +10%
BM_UnaryPingPong<MinTCP, NoOpMutator, NoOpMutator>/0/8                                                                                                                                 +11%
BM_UnaryPingPong<MinTCP, NoOpMutator, NoOpMutator>/1/0                                                                                                                                 +11%
BM_UnaryPingPong<MinTCP, NoOpMutator, NoOpMutator>/1/1                                                                                                                                 +11%
BM_UnaryPingPong<MinTCP, NoOpMutator, NoOpMutator>/134217728/0                                   -37%
BM_UnaryPingPong<MinTCP, NoOpMutator, NoOpMutator>/134217728/134217728                           -26%
BM_UnaryPingPong<MinTCP, NoOpMutator, NoOpMutator>/262144/0                                                                                                                            +10%
BM_UnaryPingPong<MinTCP, NoOpMutator, NoOpMutator>/262144/262144                                                                                                                       +9%
BM_UnaryPingPong<MinTCP, NoOpMutator, NoOpMutator>/32768/0                                                                                                                             +10%
BM_UnaryPingPong<MinTCP, NoOpMutator, NoOpMutator>/32768/32768                                                                                                                         +10%
BM_UnaryPingPong<MinTCP, NoOpMutator, NoOpMutator>/4096/0                                                                                                                              +10%
BM_UnaryPingPong<MinTCP, NoOpMutator, NoOpMutator>/4096/4096                                                                                                                           +10%
BM_UnaryPingPong<MinTCP, NoOpMutator, NoOpMutator>/512/0                                                                                                                               +10%
BM_UnaryPingPong<MinTCP, NoOpMutator, NoOpMutator>/512/512                                                                                                                             +10%
BM_UnaryPingPong<MinTCP, NoOpMutator, NoOpMutator>/64/0                                                                                                                                +10%
BM_UnaryPingPong<MinTCP, NoOpMutator, NoOpMutator>/64/64                                                                                                                               +10%
BM_UnaryPingPong<MinTCP, NoOpMutator, NoOpMutator>/8/0                                                                                                                                 +11%
BM_UnaryPingPong<MinTCP, NoOpMutator, NoOpMutator>/8/8                                                                                                                                 +11%
BM_UnaryPingPong<MinUDS, NoOpMutator, NoOpMutator>/0/0                                                                                                                                 +11%
BM_UnaryPingPong<SockPair, NoOpMutator, NoOpMutator>/0/0                                                                                                                               +11%
BM_UnaryPingPong<TCP, NoOpMutator, NoOpMutator>/0/0                                                                                                                                    +11%
BM_UnaryPingPong<TCP, NoOpMutator, NoOpMutator>/0/1                                                                                                                                    +11%
BM_UnaryPingPong<TCP, NoOpMutator, NoOpMutator>/0/262144                                                                                                                               +10%
BM_UnaryPingPong<TCP, NoOpMutator, NoOpMutator>/0/32768                                                                                                                                +10%
BM_UnaryPingPong<TCP, NoOpMutator, NoOpMutator>/0/4096                                                                                                                                 +10%
BM_UnaryPingPong<TCP, NoOpMutator, NoOpMutator>/0/512                                                                                                                                  +10%
BM_UnaryPingPong<TCP, NoOpMutator, NoOpMutator>/0/64                                                                                                                                   +10%
BM_UnaryPingPong<TCP, NoOpMutator, NoOpMutator>/0/8                                                                                                                                    +11%
BM_UnaryPingPong<TCP, NoOpMutator, NoOpMutator>/1/0                                                                                                                                    +11%
BM_UnaryPingPong<TCP, NoOpMutator, NoOpMutator>/1/1                                                                                                                                    +11%
BM_UnaryPingPong<TCP, NoOpMutator, NoOpMutator>/262144/0                                                                                                                               +10%
BM_UnaryPingPong<TCP, NoOpMutator, NoOpMutator>/262144/262144                                                                                                                          +9%
BM_UnaryPingPong<TCP, NoOpMutator, NoOpMutator>/32768/0                                                                                                                                +10%
BM_UnaryPingPong<TCP, NoOpMutator, NoOpMutator>/32768/32768                                                                                                                            +10%
BM_UnaryPingPong<TCP, NoOpMutator, NoOpMutator>/4096/0                                                                                                                                 +10%
BM_UnaryPingPong<TCP, NoOpMutator, NoOpMutator>/4096/4096                                                                                                                              +10%
BM_UnaryPingPong<TCP, NoOpMutator, NoOpMutator>/512/0                                                                                                                                  +10%
BM_UnaryPingPong<TCP, NoOpMutator, NoOpMutator>/512/512                                                                                                                                +10%
BM_UnaryPingPong<TCP, NoOpMutator, NoOpMutator>/64/0                                                                                                                                   +10%
BM_UnaryPingPong<TCP, NoOpMutator, NoOpMutator>/64/64                                                                                                                                  +10%
BM_UnaryPingPong<TCP, NoOpMutator, NoOpMutator>/8/0                                                                                                                                    +11%
BM_UnaryPingPong<TCP, NoOpMutator, NoOpMutator>/8/8                                                                                                                                    +11%
BM_UnaryPingPong<UDS, NoOpMutator, NoOpMutator>/0/0                                                                                                                                    +11%

@grpc-testing

****************************************************************

libgrpc.so

     VM SIZE                                                                                FILE SIZE
 ++++++++++++++ GROWING                                                                  ++++++++++++++
  +1.5%     +96 src/core/ext/filters/client_channel/lb_policy/round_robin/round_robin.cc     +96  +1.5%
      +6.9%     +49 rr_update_locked                                                             +49  +6.9%
      +2.0%     +40 rr_connectivity_changed_locked                                               +40  +2.0%
      +5.6%      +7 [Unmapped]                                                                    +7  +5.6%
  +0.7%     +16 src/core/ext/filters/client_channel/lb_policy/subchannel_list.cc             +16  +0.7%
       +17%     +20 grpc_lb_subchannel_data_start_connectivity_watch                             +20   +17%

 -+-+-+-+-+-+-+ MIXED                                                                    +-+-+-+-+-+-+-
  +0.0%    +152 [None]                                                                       -16  -0.0%

  +0.0%    +264 TOTAL                                                                        +96  +0.0%


****************************************************************

libgrpc++.so

     VM SIZE        FILE SIZE
 ++++++++++++++  ++++++++++++++

  [ = ]       0        0  [ = ]



1 similar comment
@grpc-testing

[trickle] No significant performance differences

@dgquintas
Contributor Author

Review status: 0 of 3 files reviewed at latest revision, 1 unresolved discussion, some commit checks pending.


src/core/ext/filters/client_channel/lb_policy/round_robin/round_robin.cc, line 352 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Super. Please merge those changes into this PR and change it to re-resolve whenever any one subchannel goes into TRANSIENT_FAILURE.

Done. PTAL.


Comments from Reviewable

@grpc-testing

Corrupt JSON data (indicates timeout or crash): 
    bm_fullstack_streaming_pump.BM_PumpStreamServerToClient_SockPair__8.counters.old: 1


[microbenchmarks] Performance differences noted:
Benchmark                                                                                 cpu_time    real_time
----------------------------------------------------------------------------------------  ----------  -----------
BM_PumpStreamClientToServer<InProcess>/262144                                             -8%         -8%
BM_PumpStreamClientToServer<InProcess>/32768                                              -6%         -6%
BM_PumpStreamServerToClient<InProcess>/32768                                              -8%         -8%
BM_StreamingPingPong<InProcess, NoOpMutator, NoOpMutator>/262144/1                        -5%         -5%
BM_StreamingPingPong<InProcess, NoOpMutator, NoOpMutator>/262144/2                        -6%         -6%
BM_StreamingPingPong<MinInProcess, NoOpMutator, NoOpMutator>/262144/1                     -6%         -6%
BM_StreamingPingPong<MinInProcess, NoOpMutator, NoOpMutator>/262144/2                     -8%         -8%
BM_StreamingPingPong<MinInProcess, NoOpMutator, NoOpMutator>/32768/2                      -7%         -7%
BM_StreamingPingPongMsgs<InProcess, NoOpMutator, NoOpMutator>/32768                       -8%         -8%
BM_StreamingPingPongMsgs<MinInProcess, NoOpMutator, NoOpMutator>/262144                   -7%         -7%
BM_StreamingPingPongMsgs<MinInProcess, NoOpMutator, NoOpMutator>/32768                    -9%         -9%
BM_StreamingPingPongWithCoalescingApi<InProcess, NoOpMutator, NoOpMutator>/262144/2/0     -8%         -8%
BM_StreamingPingPongWithCoalescingApi<MinInProcess, NoOpMutator, NoOpMutator>/262144/2/0  -5%         -5%
BM_StreamingPingPongWithCoalescingApi<MinInProcess, NoOpMutator, NoOpMutator>/32768/1/0   -4%         -4%
BM_StreamingPingPongWithCoalescingApi<MinInProcess, NoOpMutator, NoOpMutator>/32768/2/0   -5%         -5%
BM_StreamingPingPongWithCoalescingApi<MinInProcess, NoOpMutator, NoOpMutator>/32768/2/1   -5%         -5%
BM_UnaryPingPong<InProcess, NoOpMutator, NoOpMutator>/0/2097152                           -6%         -6%
BM_UnaryPingPong<InProcess, NoOpMutator, NoOpMutator>/0/262144                            -8%         -8%
BM_UnaryPingPong<MinInProcess, NoOpMutator, NoOpMutator>/32768/0                          -4%         -4%

@grpc-testing

Corrupt JSON data (indicates timeout or crash): 
    bm_fullstack_streaming_pump.BM_PumpStreamServerToClient_SockPair__64.opt.old: 1


[microbenchmarks] Performance differences noted:
Benchmark                                                               cpu_time    real_time
----------------------------------------------------------------------  ----------  -----------
BM_StreamingPingPongMsgs<MinInProcess, NoOpMutator, NoOpMutator>/32768  -6%         -6%

@dgquintas
Contributor Author

omg green

@markdroth
Member

Only one significant issue here; the other comments are minor.

I am a little concerned that our tests are all green when there's still a significant bug here. Is there a reasonable way to add a test to catch this kind of problem?


Reviewed 3 of 5 files at r3.
Review status: all files reviewed at latest revision, 5 unresolved discussions.


src/core/ext/filters/client_channel/lb_policy/round_robin/round_robin.cc, line 333 at r3 (raw file):

   * 3) RULE: ALL subchannels are TRANSIENT_FAILURE => policy is
   *                                                   TRANSIENT_FAILURE (and
   *                                                   requests re-resolution).

No need to mention requesting re-resolution here, because we're not actually triggering that here.


src/core/ext/filters/client_channel/lb_policy/round_robin/round_robin.cc, line 354 at r3 (raw file):

    grpc_connectivity_state_set(&p->state_tracker, GRPC_CHANNEL_CONNECTING,
                                GRPC_ERROR_NONE, "rr_connecting");
  } else if (subchannel_list->num_shutdown ==

Since we're no longer using the num_shutdown field anywhere, let's remove it from the grpc_lb_subchannel_list struct in subchannel_list.h.


src/core/ext/filters/client_channel/lb_policy/round_robin/round_robin.cc, line 568 at r3 (raw file):

      // discrepancy, attempt to re-resolve and end up here again.
      if (subchannel_state == GRPC_CHANNEL_TRANSIENT_FAILURE) {
        subchannel_list->subchannels[i].pending_connectivity_state_unsafe =

I don't think it's safe to reset pending_connectivity_state_unsafe without also (a) resetting curr_connectivity_state and prev_connectivity_state and (b) updating the state counters. Otherwise, the internal state will be inconsistent and the behavior incorrect (e.g., if all subchannels are initially in state TRANSIENT_FAILURE, num_transient_failure will still be 0, so we will not report that state back from the LB policy). I think something like this should do what we need:

subchannel_list->subchannels[i].prev_connectivity_state = subchannel_state;
subchannel_list->subchannels[i].curr_connectivity_state = subchannel_state;
--subchannel_list->num_idle;
++subchannel_list->num_transient_failure;

In the long term, I'd like to see a more general solution here, but I think that will require some careful thought about how to improve the subchannel_list API. I'll do that later as part of C++-ifying that API. For now, please add the following comment:

TODO(roth): As part of C++-ifying the subchannel_list API, design a better API for notifying the LB policy of subchannel states, which can be used both for the subchannel's initial state and for subsequent state changes. This will allow us to handle this more generally instead of special-casing TRANSIENT_FAILURE (e.g., we can also distribute any pending picks across all READY subchannels rather than sending them all to the first one).
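
For illustration, a self-contained toy model of the bookkeeping described above (simplified types, not the real grpc_lb_subchannel_list): the cached per-subchannel states and the per-state counters have to move together, otherwise the aggregation step reads stale numbers.

#include <cstddef>

enum State { IDLE, CONNECTING, READY, TRANSIENT_FAILURE };

struct SubchannelData {
  State prev_connectivity_state = IDLE;
  State curr_connectivity_state = IDLE;
  State pending_connectivity_state_unsafe = IDLE;
};

struct SubchannelList {
  static const size_t kNumSubchannels = 3;
  SubchannelData subchannels[kNumSubchannels];
  size_t num_idle = kNumSubchannels;  // every subchannel starts counted as IDLE
  size_t num_transient_failure = 0;
};

// Mirrors the fix suggested above: when subchannel i is already in
// TRANSIENT_FAILURE at list-creation time, update its cached states *and*
// the counters in one step.
void NoteInitialTransientFailure(SubchannelList* list, size_t i) {
  list->subchannels[i].prev_connectivity_state = TRANSIENT_FAILURE;
  list->subchannels[i].curr_connectivity_state = TRANSIENT_FAILURE;
  list->subchannels[i].pending_connectivity_state_unsafe = TRANSIENT_FAILURE;
  --list->num_idle;
  ++list->num_transient_failure;
}

// With the counters kept in sync, rule 3 quoted earlier is a single comparison.
bool AllTransientFailure(const SubchannelList& list) {
  return list.num_transient_failure == SubchannelList::kNumSubchannels;
}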


test/cpp/end2end/client_lb_end2end_test.cc, line 141 at r3 (raw file):

  }

  void SetNextResolution(const std::vector<int>& ports, bool notify = true) {

Please split this into separate methods for setting the next resolution and the re-resolution response, as @AspirinSJL did in grpclb_end2end_test in #14281.
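
Roughly the shape being asked for, as a toy stand-in (the real methods build grpc_lb_addresses and hand them to the fake resolver's response generator; the class and field names here are made up for illustration):

#include <vector>

class FakeResolverDriver {
 public:
  // Pushed to the channel right away as the next resolution result.
  void SetNextResolution(const std::vector<int>& ports) { next_ = ports; }
  // Stored and only served when the channel re-resolves after an error.
  void SetNextResolutionUponError(const std::vector<int>& ports) {
    upon_error_ = ports;
  }

 private:
  std::vector<int> next_;
  std::vector<int> upon_error_;
};

Splitting the two cases into separate methods lets each call site say what it means, instead of passing a bare notify flag.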


test/cpp/end2end/client_lb_end2end_test.cc, line 581 at r3 (raw file):

  ports.emplace_back(servers_[1]->port_);
  ports.emplace_back(servers_[2]->port_);
  gpr_log(GPR_INFO, "ABOUT TO SEND ALLLLL");

I assume these log messages are leftovers from debugging. :)



@markdroth
Member

Review status: all files reviewed at latest revision, 6 unresolved discussions.


src/core/ext/filters/client_channel/lb_policy/round_robin/round_robin.cc, line 402 at r3 (raw file):

  // Update state counters and new overall state.
  update_state_counters_locked(sd);
  update_lb_connectivity_status_locked(sd, GRPC_ERROR_REF(error));

Not directly related to the rest of this PR, but I think we should only do this if sd is from p->subchannel_list, because we don't want to report the state if it's from p->latest_pending_subchannel_list. And we should probably move this down to the end of this function, because if sd is from p->latest_pending_subchannel_list and reports READY, we might promote p->latest_pending_subchannel_list to p->subchannel_list below.
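
In toy form, the ordering being suggested (simplified stand-in types, not the real round_robin.cc code): promote a READY pending list first, then publish the aggregate state only if the notification came from whichever list is current after that.

struct List;

struct Policy {
  List* subchannel_list = nullptr;                 // list picks are served from
  List* latest_pending_subchannel_list = nullptr;  // list from the latest update
};

void PublishAggregateStatus(Policy* /*p*/) {
  // In the real code this is where the policy's connectivity status is set.
}

void OnConnectivityChanged(Policy* p, List* from_list, bool now_ready) {
  if (from_list == p->latest_pending_subchannel_list && now_ready) {
    // Promote the pending list; the previously current one gets shut down.
    p->subchannel_list = p->latest_pending_subchannel_list;
    p->latest_pending_subchannel_list = nullptr;
  }
  if (from_list == p->subchannel_list) {
    PublishAggregateStatus(p);  // never reported on behalf of a pending list
  }
}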



@dgquintas
Contributor Author

I couldn't come up with any effective way to make the previous state change :/


Review status: all files reviewed at latest revision, 6 unresolved discussions.


src/core/ext/filters/client_channel/lb_policy/round_robin/round_robin.cc, line 333 at r3 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

No need to mention requesting re-resolution here, because we're not actually triggering that here.

Done.


src/core/ext/filters/client_channel/lb_policy/round_robin/round_robin.cc, line 354 at r3 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Since we're no longer using the num_shutdown field anywhere, let's remove it from the grpc_lb_subchannel_list struct in subchannel_list.h.

Done.


src/core/ext/filters/client_channel/lb_policy/round_robin/round_robin.cc, line 402 at r3 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Not directly related to the rest of this PR, but I think we should only do this if sd is from p->subchannel_list, because we don't want to report the state if it's from p->latest_pending_subchannel_list. And we should probably move this down to the end of this function, because if sd is from p->latest_pending_subchannel_list and reports READY, we might promote p->latest_pending_subchannel_list to p->subchannel_list below.

Good catch. Done.


src/core/ext/filters/client_channel/lb_policy/round_robin/round_robin.cc, line 568 at r3 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I don't think it's safe to reset pending_connectivity_state_unsafe without also (a) resetting curr_connectivity_state and prev_connectivity_state and (b) updating the state counters. Otherwise, the internal state will be inconsistent and the behavior incorrect (e.g., if all subchannels are initially in state TRANSIENT_FAILURE, num_transient_failure will still be 0, so we will not report that state back from the LB policy). I think something like this should do what we need:

subchannel_list->subchannels[i].prev_connectivity_state = subchannel_state;
subchannel_list->subchannels[i].curr_connectivity_state = subchannel_state;
--subchannel_list->num_idle;
++subchannel_list->num_transient_failure;

In the long term, I'd like to see a more general solution here, but I think that will require some careful thought about how to improve the subchannel_list API. I'll do that later as part of C++-ifying that API. For now, please add the following comment:

TODO(roth): As part of C++-ifying the subchannel_list API, design a better API for notifying the LB policy of subchannel states, which can be used both for the subchannel's initial state and for subsequent state changes. This will allow us to handle this more generally instead of special-casing TRANSIENT_FAILURE (e.g., we can also distribute any pending picks across all READY subchannels rather than sending them all to the first one).

Done. prev_connectivity_state also needed an update to fully replicate what update_state_counters_locked does; without that, tests fail.


test/cpp/end2end/client_lb_end2end_test.cc, line 141 at r3 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Please split this into separate methods for setting the next resolution and the re-resolution response, as @AspirinSJL did in grpclb_end2end_test in #14281.

Done.


test/cpp/end2end/client_lb_end2end_test.cc, line 581 at r3 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I assume these log messages are leftovers from debugging. :)

Whoops. Done. At least I don't use swear words in these any more... learnt that the embarrassing way.



@grpc-testing

****************************************************************

libgrpc.so

     VM SIZE                                                                                FILE SIZE
 ++++++++++++++ GROWING                                                                  ++++++++++++++
  +2.5%    +160 src/core/ext/filters/client_channel/lb_policy/round_robin/round_robin.cc    +160  +2.5%
      +4.4%     +86 rr_connectivity_changed_locked                                               +86  +4.4%
       +11%     +81 rr_update_locked                                                             +81   +11%
  +0.7%     +16 src/core/ext/filters/client_channel/lb_policy/subchannel_list.cc             +16  +0.7%
       +17%     +20 grpc_lb_subchannel_data_start_connectivity_watch                             +20   +17%

 -+-+-+-+-+-+-+ MIXED                                                                    +-+-+-+-+-+-+-
  +0.0%    +160 [None]                                                                      -336  -0.0%

  +0.0%    +336 TOTAL                                                                       -160  -0.0%


****************************************************************

libgrpc++.so

     VM SIZE        FILE SIZE
 ++++++++++++++  ++++++++++++++

  [ = ]       0        0  [ = ]



@grpc-testing

[trickle] No significant performance differences

@grpc-testing

[microbenchmarks] No significant performance differences

@markdroth
Member

This looks great! All remaining comments are minor.


Reviewed 3 of 3 files at r4.
Review status: all files reviewed at latest revision, 4 unresolved discussions.


src/core/ext/filters/client_channel/lb_policy/round_robin/round_robin.cc, line 352 at r4 (raw file):

    grpc_connectivity_state_set(
        &p->state_tracker, GRPC_CHANNEL_TRANSIENT_FAILURE,
        GRPC_ERROR_REF(error), "rr_exhausted_subchannels");

I think this should still say "rr_transient_failure".
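
That is, the call quoted above with just the reason string restored (addressed in the follow-up PR):

    grpc_connectivity_state_set(
        &p->state_tracker, GRPC_CHANNEL_TRANSIENT_FAILURE,
        GRPC_ERROR_REF(error), "rr_transient_failure");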


test/cpp/end2end/client_lb_end2end_test.cc, line 168 at r4 (raw file):

  void SetNextResolutionUponError(const std::vector<int>& ports) {
    grpc_core::ExecCtx exec_ctx;
    grpc_lb_addresses* addresses =

Could refactor out the code for generating the addresses into its own function, since that's the same for both this function and the previous one.
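
A hypothetical shape for that shared helper (the name and the ipv4 URI format are assumptions for illustration; the real code produces grpc_lb_addresses rather than strings):

#include <string>
#include <vector>

// Both SetNextResolution() and SetNextResolutionUponError() would call this
// instead of duplicating the ports-to-addresses loop.
std::vector<std::string> BuildFakeResolverUris(const std::vector<int>& ports) {
  std::vector<std::string> uris;
  uris.reserve(ports.size());
  for (int port : ports) {
    uris.push_back("ipv4:127.0.0.1:" + std::to_string(port));
  }
  return uris;
}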


test/cpp/end2end/client_lb_end2end_test.cc, line 765 at r4 (raw file):

  // before noticing the change in the server's connectivity.
  while (!SendRpc(stub)) {
    ;  // Retry until success.

Nit: No need for the semicolon here, since you've added the braces.
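
That is, the quoted loop without the stray semicolon:

  while (!SendRpc(stub)) {
    // Retry until success.
  }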


test/cpp/end2end/client_lb_end2end_test.cc, line 767 at r4 (raw file):

    ;  // Retry until success.
  }
  gpr_log(GPR_INFO, "------------------------------------------------------");

I assume this is another log message that was added for debugging.



@markdroth changed the title from "Make RR re-resolve when all its subchannels fail." to "Make RR re-resolve when any of its subchannels fail." on Feb 8, 2018
@dgquintas merged commit c8f572b into grpc:master on Feb 8, 2018
@dgquintas
Contributor Author

Whoops, missed the last set of comments. Putting together another tiny PR to address them.

@dgquintas
Contributor Author

I've addressed the missed comments in #14374


Review status: all files reviewed at latest revision, 4 unresolved discussions.


src/core/ext/filters/client_channel/lb_policy/round_robin/round_robin.cc, line 352 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I think this should still say "rr_transient_failure".

Done.


test/cpp/end2end/client_lb_end2end_test.cc, line 168 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Could refactor out the code for generating the addresses into its own function, since that's the same for both this function and the previous one.

Done.


test/cpp/end2end/client_lb_end2end_test.cc, line 765 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Nit: No need for the semicolon here, since you've added the braces.

Done.


test/cpp/end2end/client_lb_end2end_test.cc, line 767 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I assume this is another log message that was added for debugging.

Done.



Successfully merging this pull request may close these issues.

Channel stuck in TRANSIENT_FAILURE and no DNS Refresh