PF: Check connectivity state before watching #16306

Merged: 1 commit, Aug 17, 2018

Conversation

@AspirinSJL (Member) commented Aug 10, 2018

Fixes #15514. This is extracted from the initial attempt, #16176. This PR is much simpler but still fixes the problem.

Instead of always starting a subchannel from IDLE, we now check the subchannel's current connectivity state and start watching for connectivity changes from that state. With this change, if a subchannel's current state is TRANSIENT_FAILURE, we try connecting to it first and determine its real state from the connection result.



@AspirinSJL added the "release notes: yes" label Aug 10, 2018
@grpc-testing

****************************************************************

libgrpc.so

     VM SIZE                                                                                            FILE SIZE
 ++++++++++++++ GROWING                                                                              ++++++++++++++
  +0.0%    +368 [None]                                                                               +1.44Ki  +0.0%
      +0.0%    +336 [Unmapped]                                                                           +1.41Ki  +0.0%
       +17%      +8 vtable for grpc_core::(anonymous namespace)::RoundRobin::RoundRobinSubchannelData         +8   +17%
       +17%      +8 vtable for grpc_core::(anonymous namespace)::PickFirst::PickFirstSubchannelData           +8   +17%
       +17%      +8 vtable for grpc_core::SubchannelData<grpc_core::(anonymous namespace)::RoundRobin::R      +8   +17%
       +17%      +8 vtable for grpc_core::SubchannelData<grpc_core::(anonymous namespace)::PickFirst::Pi      +8   +17%
  +5.0%    +640 src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc                  +640  +5.0%
      [NEW]    +897 grpc_core::(anonymous namespace)::PickFirst::PickFirstSubchannelData::ProcessUnselec    +897  [NEW]
      [NEW]    +200 grpc_core::(anonymous namespace)::PickFirst::PickFirstSubchannelData::StartConnectiv    +200  [NEW]
       +13%     +37 [Unmapped]                                                                               +37   +13%
      +2.5%      +8 grpc_core::SubchannelData<grpc_core::(anonymous namespace)::PickFirst::PickFirstSubc      +8  +2.5%

  +0.1%   +1008 TOTAL                                                                                +2.06Ki  +0.0%


****************************************************************

libgrpc++.so

     VM SIZE        FILE SIZE
 ++++++++++++++  ++++++++++++++

  [ = ]       0        0  [ = ]



@grpc-testing

[trickle] No significant performance differences

@grpc-testing

Objective-C binary sizes
*****************STATIC******************
  New size                      Old size
 1,951,592      Total (>)      1,950,932

 No significant differences in binary sizes

***************FRAMEWORKS****************
  New size                      Old size
 3,617,434       Core (>)      3,615,418

10,671,344      Total (>)     10,669,329


@grpc-testing

[microbenchmarks] No significant performance differences

@grpc-testing

****************************************************************

libgrpc.so

     VM SIZE                                                                                            FILE SIZE
 ++++++++++++++ GROWING                                                                              ++++++++++++++
  +0.0%    +352 [None]                                                                               +1.41Ki  +0.0%
      +0.0%    +320 [Unmapped]                                                                           +1.38Ki  +0.0%
       +17%      +8 vtable for grpc_core::(anonymous namespace)::RoundRobin::RoundRobinSubchannelData         +8   +17%
       +17%      +8 vtable for grpc_core::(anonymous namespace)::PickFirst::PickFirstSubchannelData           +8   +17%
       +17%      +8 vtable for grpc_core::SubchannelData<grpc_core::(anonymous namespace)::RoundRobin::R      +8   +17%
       +17%      +8 vtable for grpc_core::SubchannelData<grpc_core::(anonymous namespace)::PickFirst::Pi      +8   +17%
  +4.3%    +544 src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc                  +544  +4.3%
      [NEW]    +812 grpc_core::(anonymous namespace)::PickFirst::PickFirstSubchannelData::ProcessUnselec    +812  [NEW]
      [NEW]    +200 grpc_core::(anonymous namespace)::PickFirst::PickFirstSubchannelData::StartConnectiv    +200  [NEW]
      +9.0%     +26 [Unmapped]                                                                               +26  +9.0%
      +2.5%      +8 grpc_core::SubchannelData<grpc_core::(anonymous namespace)::PickFirst::PickFirstSubc      +8  +2.5%

  +0.1%    +896 TOTAL                                                                                +1.95Ki  +0.0%


****************************************************************

libgrpc++.so

     VM SIZE        FILE SIZE
 ++++++++++++++  ++++++++++++++

  [ = ]       0        0  [ = ]



@grpc-testing

[trickle] No significant performance differences

@grpc-testing

Objective-C binary sizes
*****************STATIC******************
  New size                      Old size
 1,951,584      Total (>)      1,950,932

 No significant differences in binary sizes

***************FRAMEWORKS****************
  New size                      Old size
 3,617,434       Core (>)      3,615,418

10,671,347      Total (>)     10,669,332


@grpc-testing

[microbenchmarks] No significant performance differences

@markdroth (Member) left a comment

This looks really good! All of my comments are relatively minor.

Please let me know if you have any questions.

Reviewed 3 of 3 files at r1.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @AspirinSJL and @dgquintas)


src/core/ext/filters/client_channel/lb_policy/subchannel_list.h, line 118 at r1 (raw file):

  // ProcessConnectivityChangeLocked() will be called when the
  // connectivity state changes.
  virtual void StartConnectivityWatchLocked();

Instead of making this virtual, I suggest just having PF provide a separate method called something like CheckConnectivityStateAndStartWatchingLocked() that checks the current state and then calls this method.
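A minimal sketch of the shape suggested here, using toy stand-in classes rather than the real SubchannelData types: the base class keeps a non-virtual StartConnectivityWatchLocked(), and the pick_first subclass adds its own method that checks the current state first and then calls it. The names ToySubchannelData, ToyPickFirstData, ConnState, and processed_ready() are hypothetical, and treating READY as the state that needs processing before the watch starts is an assumption based on the later comments about ProcessUnselectedReadyLocked().

```cpp
#include <cassert>

// Simplified stand-in for grpc_connectivity_state (hypothetical).
enum class ConnState { kIdle, kConnecting, kReady, kTransientFailure };

// Toy stand-in for the shared subchannel data base class.
// StartConnectivityWatchLocked() stays non-virtual.
class ToySubchannelData {
 public:
  explicit ToySubchannelData(ConnState state) : state_(state) {}

  void StartConnectivityWatchLocked() {
    assert(!watch_started_);  // a watch must not be started twice
    watch_started_ = true;
  }

  ConnState CheckConnectivityStateLocked() const { return state_; }
  bool watch_started() const { return watch_started_; }

 private:
  ConnState state_;
  bool watch_started_ = false;
};

// Toy stand-in for the pick_first subchannel data.
class ToyPickFirstData : public ToySubchannelData {
 public:
  using ToySubchannelData::ToySubchannelData;

  // Checks the current state, handles the case that needs handling before
  // the watch begins, and only then starts the watch via the base method.
  void CheckConnectivityStateAndStartWatchingLocked() {
    if (CheckConnectivityStateLocked() == ConnState::kReady) {
      // Assumption: this is where ProcessUnselectedReadyLocked() would run.
      processed_ready_ = true;
    }
    StartConnectivityWatchLocked();
  }

  bool processed_ready() const { return processed_ready_; }

 private:
  bool processed_ready_ = false;
};
```

With this shape, round_robin would keep calling the base method directly, and only pick_first pays for the extra state check.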


src/core/ext/filters/client_channel/lb_policy/subchannel_list.h, line 159 at r1 (raw file):

  // Returns the connectivity state. Must be called only while there is no
  // connectivity notification pending.
  grpc_connectivity_state connectivity_state() const;

I really don't want to expose pending_connectivity_state_unsafe_. As the name implies, it's really not safe to use that value anywhere but where we're using it internally, so I deliberately structured this API to avoid exposing it.

Also, it should not be necessary to expose this in the first place. It looks like the only place it's being used is in an assertion, and I don't think the assertion is actually necessary (see below).


src/core/ext/filters/client_channel/lb_policy/subchannel_list.h, line 365 at r1 (raw file):

  }
  GPR_ASSERT(!connectivity_notification_pending_);
  connectivity_notification_pending_ = true;

It looks like you're changing the semantics of connectivity_notification_pending_, but it's not clear why.

The current semantics are that we set this to true in StartConnectivityWatchLocked(). Whenever the watch returns, the caller needs to call either RenewConnectivityWatchLocked(), in which case we don't reset the value, because it's already true, or StopConnectivityWatchLocked(), in which case we set it to false. Note that we also ref the subchannel list in StartConnectivityWatchLocked() and unref it in StopConnectivityWatchLocked(), so the value of connectivity_notification_pending_ basically tells us whether we're holding that ref. To say this another way, because of that ref, the caller is required to call either RenewConnectivityWatchLocked() or StopConnectivityWatchLocked() when it receives the callback anyway, so it seems reasonable to update connectivity_notification_pending_ at those same points.

It looks like you've changed this such that we unset connectivity_notification_pending_ whenever we get a callback and then reset it when renewing. It's not clear to me why that semantic is better than what we already have.
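The existing semantics described above can be sketched as a small state machine. This is a toy model, not the real SubchannelData code: ToyWatchState and list_refs() are hypothetical names, and an integer counter stands in for the ref held on the subchannel list.

```cpp
#include <cassert>

// Toy model of the flag/ref pairing described in the comment above.
class ToyWatchState {
 public:
  // Sets the flag and takes the ref on the subchannel list.
  void StartConnectivityWatchLocked() {
    assert(!connectivity_notification_pending_);
    connectivity_notification_pending_ = true;
    ++list_refs_;
  }

  // Called from the callback when the caller wants to keep watching.
  // The flag is already true and the ref is still held, so nothing changes.
  void RenewConnectivityWatchLocked() {
    assert(connectivity_notification_pending_);
  }

  // Called from the callback when the caller is done watching.
  // Clears the flag and drops the ref.
  void StopConnectivityWatchLocked() {
    assert(connectivity_notification_pending_);
    connectivity_notification_pending_ = false;
    --list_refs_;
  }

  bool connectivity_notification_pending() const {
    return connectivity_notification_pending_;
  }
  int list_refs() const { return list_refs_; }

 private:
  bool connectivity_notification_pending_ = false;
  int list_refs_ = 0;  // stands in for the ref on the subchannel list
};
```

In this model the flag is true exactly when the ref count is 1, which is the invariant the comment relies on.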


src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc, line 571 at r1 (raw file):

void PickFirst::PickFirstSubchannelData::ProcessUnselectedReadyLocked() {
  PickFirst* p = static_cast<PickFirst*>(subchannel_list()->policy());
  GPR_ASSERT(p->selected_ != this);

What's the benefit of this assertion? What would break if it was false?


src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc, line 572 at r1 (raw file):

  PickFirst* p = static_cast<PickFirst*>(subchannel_list()->policy());
  GPR_ASSERT(p->selected_ != this);
  GPR_ASSERT(connectivity_state() == GRPC_CHANNEL_READY);

This assertion shouldn't be necessary. We're only calling this from two places, both of which explicitly check that the state is READY.


test/cpp/end2end/client_lb_end2end_test.cc, line 593 at r1 (raw file):

  auto channel_2 = BuildChannel("pick_first");
  auto stub_2 = BuildStub(channel_2);
  SetNextResolution(ports);

Please add a TODO here explaining that this resolution data will only be visible to channel 2, not channel 1, due to the way that we're sharing the fake resolver response generator between the two channels. We should ideally fix this by changing the response generator to be able to deliver updates to multiple channels at once, but we don't need to block this PR on this.


test/cpp/end2end/client_lb_end2end_test.cc, line 604 at r1 (raw file):

  // Wait for a while so that the disconnection has triggered the connectivity
  // notification. Otherwise, the subchannel may be picked but will fail soon.
  sleep(1);

Is there something we can do here more clever than just sleeping? For example, can we wait for the channel's connectivity state to go to something other than READY?
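One way the idea could look, sketched against a toy channel interface so the example stays self-contained. The real gRPC C++ channel exposes GetState() and WaitForStateChange(); the helper below only imitates that shape with simplified signatures. WaitForChannelNotReady, ScriptedChannel, and ConnState are hypothetical names, and the scripted channel exists purely to exercise the loop.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <utility>
#include <vector>

// Simplified stand-in for grpc_connectivity_state (hypothetical).
enum class ConnState { kIdle, kConnecting, kReady, kTransientFailure };

// Blocks until the channel reports something other than READY.
// Returns false if the deadline passes while the channel is still READY.
template <typename ChannelLike>
bool WaitForChannelNotReady(ChannelLike& channel,
                            std::chrono::milliseconds timeout) {
  const std::chrono::steady_clock::time_point deadline =
      std::chrono::steady_clock::now() +
      std::chrono::duration_cast<std::chrono::steady_clock::duration>(timeout);
  ConnState state = channel.GetState();
  while (state == ConnState::kReady) {
    if (!channel.WaitForStateChange(state, deadline)) return false;
    state = channel.GetState();
  }
  return true;
}

// Fake channel that walks through a fixed, non-empty script of states.
// Running out of script entries is modelled as hitting the deadline.
class ScriptedChannel {
 public:
  explicit ScriptedChannel(std::vector<ConnState> states)
      : states_(std::move(states)) {
    assert(!states_.empty());
  }

  ConnState GetState() const { return states_[index_]; }

  bool WaitForStateChange(ConnState last_observed,
                          std::chrono::steady_clock::time_point /*deadline*/) {
    while (index_ + 1 < states_.size()) {
      ++index_;
      if (states_[index_] != last_observed) return true;
    }
    return false;
  }

 private:
  std::vector<ConnState> states_;
  std::size_t index_ = 0;
};
```

Compared with sleep(1), a helper like this returns as soon as the disconnection is visible and fails explicitly if it never is.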

@AspirinSJL (Member Author) left a comment

Thanks for reviewing!

Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @markdroth, @AspirinSJL, and @dgquintas)


src/core/ext/filters/client_channel/lb_policy/subchannel_list.h, line 118 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Instead of making this virtual, I suggest just having PF provide a separate method called something like CheckConnectivityStateAndStartWatchingLocked() that checks the current state and then calls this method.

Done.


src/core/ext/filters/client_channel/lb_policy/subchannel_list.h, line 159 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I really don't want to expose pending_connectivity_state_unsafe_. As the name implies, it's really not safe to use that value anywhere but where we're using it internally, so I deliberately structured this API to avoid exposing it.

Also, it should not be necessary to expose this in the first place. It looks like the only place it's being used is in an assertion, and I don't think the assertion is actually necessary (see below).

Removed.


src/core/ext/filters/client_channel/lb_policy/subchannel_list.h, line 365 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

It looks like you're changing the semantics of connectivity_notification_pending_, but it's not clear why.

The current semantics are that we set this to true in StartConnectivityWatchLocked(). Whenever the watch returns, the caller needs to call either RenewConnectivityWatchLocked(), in which case we don't reset the value, because it's already true, or StopConnectivityWatchLocked(), in which case we set it to false. Note that we also ref the subchannel list in StartConnectivityWatchLocked() and unref it in StopConnectivityWatchLocked(), so the value of connectivity_notification_pending_ basically tells us whether we're holding that ref. To say this another way, because of that ref, the caller is required to call either RenewConnectivityWatchLocked() or StopConnectivityWatchLocked() when it receives the callback anyway, so it seems reasonable to update connectivity_notification_pending_ at those same points.

It looks like you've changed this such that we unset connectivity_notification_pending_ whenever we get a callback and then reset it when renewing. It's not clear to me why that semantic is better than what we already have.

This was required because of the assertion I added. I wanted to check the current status in ProcessUnselectedReadyLocked(). I can't do it by calling CheckConnectivityStateLocked() because it's unsafe. But it's safe to return pending_connectivity_state_unsafe_ at that point because we haven't subscribed to the connectivity change. That's why I added a connectivity_state() API that can only be called with connectivity_notification_pending_ being false. But then I found that connectivity_notification_pending_ is true in ProcessConnectivityChangeLocked(). Actually, the notification is not pending when we are in ProcessConnectivityChangeLocked(); it just happened. So I chose to reset connectivity_notification_pending_ in OnConnectivityChangedLocked().

I've removed the assertion, so this has been reverted.

That said, I think a name like holding_watch_ref_ would describe the variable more precisely.


src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc, line 571 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

What's the benefit of this assertion? What would break if it was false?

The method will be a no-op if the assertion fails. But "Pick First %p selected subchannel %p" will be printed again, which might be misleading.

I removed the assertion and added a check instead.


src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc, line 572 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This assertion shouldn't be necessary. We're only calling this from two places, both of which explicitly check that the state is READY.

Done.

I added this assertion to be defensive. But it might have introduced too many other changes.


test/cpp/end2end/client_lb_end2end_test.cc, line 593 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Please add a TODO here explaining that this resolution data will only be visible to channel 2, not channel 1, due to the way that we're sharing the fake resolver response generator between the two channels. We should ideally fix this by changing the response generator to be able to deliver updates to multiple channels at once, but we don't need to block this PR on this.

Done.


test/cpp/end2end/client_lb_end2end_test.cc, line 604 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Is there something we can do here more clever than just sleeping? For example, can we wait for the channel's connectivity state to go to something other than READY?

Done.

I should have considered this.

@grpc-testing

****************************************************************

libgrpc.so

     VM SIZE                                                                                            FILE SIZE
 ++++++++++++++ GROWING                                                                              ++++++++++++++
  +0.0%    +224 [None]                                                                                  +816  +0.0%
  +3.5%    +448 src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc                  +448  +3.5%
      [NEW]    +729 grpc_core::(anonymous namespace)::PickFirst::PickFirstSubchannelData::ProcessUnselec    +729  [NEW]
      [NEW]    +200 grpc_core::(anonymous namespace)::PickFirst::PickFirstSubchannelData::CheckConnectiv    +200  [NEW]
      +7.3%     +21 [Unmapped]                                                                               +21  +7.3%

  +0.0%    +672 TOTAL                                                                                +1.23Ki  +0.0%


****************************************************************

libgrpc++.so

     VM SIZE        FILE SIZE
 ++++++++++++++  ++++++++++++++

  [ = ]       0        0  [ = ]



@grpc-testing

[trickle] No significant performance differences

@grpc-testing

Objective-C binary sizes
*****************STATIC******************
  New size                      Old size
 1,951,347      Total (>)      1,950,932

 No significant differences in binary sizes

***************FRAMEWORKS****************
  New size                      Old size
 3,616,466       Core (>)      3,615,418

10,663,195      Total (>)     10,662,142


@grpc-testing

[microbenchmarks] No significant performance differences

@markdroth (Member) left a comment

Thanks for doing this!

Reviewed 3 of 3 files at r2.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @markdroth and @dgquintas)


src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc, line 571 at r1 (raw file):

Previously, AspirinSJL (Juanli Shen) wrote…

The method will be a no-op if the assertion fails. But "Pick First %p selected subchannel %p" will be printed again, which might be misleading.

I removed the assertion and added a check instead.

It seems unlikely to me that this would actually happen. We're only calling this from two places, and in neither one is this the selected subchannel. And the name of this method is a fairly strong hint to our future selves not to do this accidentally.

I would just remove this check. I don't think it's worth the complexity for something that is very unlikely to happen and wouldn't actually break anything anyway.


src/core/ext/filters/client_channel/lb_policy/subchannel_list.h, line 365 at r1 (raw file):

Previously, AspirinSJL (Juanli Shen) wrote…

This was required because of the assertion I added. I wanted to check the current status in ProcessUnselectedReadyLocked(). I can't do it by calling CheckConnectivityStateLocked() because it's unsafe. But it's safe to return pending_connectivity_state_unsafe_ at that point because we haven't subscribed to the connectivity change. That's why I added a connectivity_state() API that can only be called with connectivity_notification_pending_ being false. But then I found that connectivity_notification_pending_ is true in ProcessConnectivityChangeLocked(). Actually, the notification is not pending when we are in ProcessConnectivityChangeLocked(); it just happened. So I chose to reset connectivity_notification_pending_ in OnConnectivityChangedLocked().

I've removed the assertion, so this has been reverted.

That said, I think a name like holding_watch_ref_ would describe the variable more precisely.

I think the current name is appropriate. The variable is not really just about holding the ref; it's really about whether there is a watch pending from the perspective of the caller of the SubchannelList API. The fact that we also hold a ref at the same time is an implementation detail.

Anyway, this looks good.

@AspirinSJL (Member Author) left a comment

Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @markdroth and @dgquintas)


src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc, line 571 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

It seems unlikely to me that this would actually happen. We're only calling this from two places, and in neither one is this the selected subchannel. And the name of this method is a fairly strong hint to our future selves not to do this accidentally.

I would just remove this check. I don't think it's worth the complexity for something that is very unlikely to happen and wouldn't actually break anything anyway.

Done.

I may have been paranoid...

@grpc-testing

****************************************************************

libgrpc.so

     VM SIZE                                                                                            FILE SIZE
 ++++++++++++++ GROWING                                                                              ++++++++++++++
  +0.0%    +120 [None]                                                                               +4.88Ki  +0.1%
  +2.8%    +352 src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc                  +352  +2.8%
      [NEW]    +636 grpc_core::(anonymous namespace)::PickFirst::PickFirstSubchannelData::ProcessUnselec    +636  [NEW]
      [NEW]    +200 grpc_core::(anonymous namespace)::PickFirst::PickFirstSubchannelData::CheckConnectiv    +200  [NEW]
      +5.6%     +16 [Unmapped]                                                                               +16  +5.6%
      +0.7%      +2 grpc_core::(anonymous namespace)::PickFirst::PickLocked                                   +2  +0.7%

  +0.0%    +472 TOTAL                                                                                +5.23Ki  +0.1%


****************************************************************

libgrpc++.so

     VM SIZE        FILE SIZE
 ++++++++++++++  ++++++++++++++

  [ = ]       0        0  [ = ]



@grpc-testing

[trickle] No significant performance differences

@grpc-testing

Objective-C binary sizes
*****************STATIC******************
  New size                      Old size
 1,952,986      Total (>)      1,952,734

 No significant differences in binary sizes

***************FRAMEWORKS****************
  New size                      Old size
 3,617,538       Core (>)      3,616,490

10,664,631      Total (>)     10,663,581


@grpc-testing

[microbenchmarks] No significant performance differences

@markdroth (Member) left a comment

Reviewed 1 of 1 files at r3.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @markdroth, @AspirinSJL, and @dgquintas)


src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc, line 288 at r3 (raw file):

    return true;
  }
  pick->next = pending_picks_;

This is the same fix that @dgquintas has in progress in #16054. We should probably get that in before this PR. All that's missing there is a test, which should be similar to the one added in #15947.

@dgquintas
Contributor

Ah, thanks for bumping that up. With all the things going on it went down my list. I'll try to get that PR merged, with a test, ASAP.

@dgquintas
Contributor

I've just updated #16054

@grpc-testing

****************************************************************

libgrpc.so

     VM SIZE                                                                                            FILE SIZE
 ++++++++++++++ GROWING                                                                              ++++++++++++++
  +0.0%    +120 [None]                                                                               +4.88Ki  +0.1%
  +2.8%    +352 src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc                  +352  +2.8%
      [NEW]    +636 grpc_core::(anonymous namespace)::PickFirst::PickFirstSubchannelData::ProcessUnselec    +636  [NEW]
      [NEW]    +200 grpc_core::(anonymous namespace)::PickFirst::PickFirstSubchannelData::CheckConnectiv    +200  [NEW]
      +6.3%     +18 [Unmapped]                                                                               +18  +6.3%

  +0.0%    +472 TOTAL                                                                                +5.23Ki  +0.1%


****************************************************************

libgrpc++.so

     VM SIZE        FILE SIZE
 ++++++++++++++  ++++++++++++++

  [ = ]       0        0  [ = ]



@grpc-testing

[trickle] No significant performance differences

@grpc-testing

Objective-C binary sizes
*****************STATIC******************
  New size                      Old size
 1,952,998      Total (>)      1,952,730

 No significant differences in binary sizes

***************FRAMEWORKS****************
  New size                      Old size
 3,617,710       Core (>)      3,616,662

10,664,797      Total (>)     10,663,751


@grpc-testing

[microbenchmarks] No significant performance differences

@grpc-testing

****************************************************************

libgrpc.so

     VM SIZE                                                                                            FILE SIZE
 ++++++++++++++ GROWING                                                                              ++++++++++++++
  +0.0%    +120 [None]                                                                               +4.88Ki  +0.1%
  +2.8%    +352 src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc                  +352  +2.8%
      [NEW]    +636 grpc_core::(anonymous namespace)::PickFirst::PickFirstSubchannelData::ProcessUnselec    +636  [NEW]
      [NEW]    +200 grpc_core::(anonymous namespace)::PickFirst::PickFirstSubchannelData::CheckConnectiv    +200  [NEW]
      +6.3%     +18 [Unmapped]                                                                               +18  +6.3%

  +0.0%    +472 TOTAL                                                                                +5.23Ki  +0.1%


****************************************************************

libgrpc++.so

     VM SIZE        FILE SIZE
 ++++++++++++++  ++++++++++++++

  [ = ]       0        0  [ = ]



@grpc-testing

[trickle] No significant performance differences

@grpc-testing

Objective-C binary sizes
*****************STATIC******************
  New size                      Old size
 1,952,998      Total (>)      1,952,730

 No significant differences in binary sizes

***************FRAMEWORKS****************
  New size                      Old size
 3,617,710       Core (>)      3,616,662

10,664,796      Total (>)     10,663,741


@grpc-testing

[microbenchmarks] No significant performance differences

Labels
area/client channel, kind/bug, lang/core, release notes: yes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unexpected GOAWAY in Ruby client
4 participants