-
Notifications
You must be signed in to change notification settings - Fork 10.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PF: Check connectivity state before watching #16306
PF: Check connectivity state before watching #16306
Conversation
|
|
f1c1704
to
9b72650
Compare
|
|
|
|
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks really good! All of my comments are relatively minor.
Please let me know if you have any questions.
Reviewed 3 of 3 files at r1.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @AspirinSJL and @dgquintas)
src/core/ext/filters/client_channel/lb_policy/subchannel_list.h, line 118 at r1 (raw file):
// ProcessConnectivityChangeLocked() will be called when the // connectivity state changes. virtual void StartConnectivityWatchLocked();
Instead of making this virtual, I suggest just having PF provide a separate method called something like CheckConnectivityStateAndStartWatchingLocked()
that checks the current state and then calls this method.
src/core/ext/filters/client_channel/lb_policy/subchannel_list.h, line 159 at r1 (raw file):
// Returns the connectivity state. Must be called only while there is no // connectivity notification pending. grpc_connectivity_state connectivity_state() const;
I really don't want to expose pending_connectivity_state_unsafe_
. As the name implies, it's really not safe to use that value anywhere but where we're using it internally, so I deliberately structured this API to avoid exposing it.
Also, it should not be necessary to expose this in the first place. It looks like the only place it's being used is in an assertion, and I don't think the assertion is actually necessary (see below).
src/core/ext/filters/client_channel/lb_policy/subchannel_list.h, line 365 at r1 (raw file):
} GPR_ASSERT(!connectivity_notification_pending_); connectivity_notification_pending_ = true;
It looks like you're changing the semantics of connectivity_notification_pending_
, but it's not clear why.
The current semantics are that we set this to true in StartConnectivityWatchLocked()
. Whenever the watch returns, the caller needs to call either RenewConnectivityWatchLocked()
, in which case we don't reset the value, because it's already true, or StopConnectivityWatchLocked()
, in which case we set it to false. Note that we also ref the subchannel list in StartConnectivityWatchLocked()
and unref it in StopConnectivityWatchLocked()
, so the value of connectivity_notification_pending_
basically tells us whether we're holding that ref. To say this another way, because of that ref, the caller is required to call either RenewConnectivityWatchLocked()
or StopConnectivityWatchLocked()
when it receives the callback anyway, so it seems reasonable to update connectivity_notification_pending_
at those same points.
It looks like you've changed this such that we unset connectivity_notification_pending_
whenever we get a callback and then reset it when renewing. It's not clear to me why that semantic is better than what we already have.
src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc, line 571 at r1 (raw file):
void PickFirst::PickFirstSubchannelData::ProcessUnselectedReadyLocked() { PickFirst* p = static_cast<PickFirst*>(subchannel_list()->policy()); GPR_ASSERT(p->selected_ != this);
What's the benefit of this assertion? What would break if it was false?
src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc, line 572 at r1 (raw file):
PickFirst* p = static_cast<PickFirst*>(subchannel_list()->policy()); GPR_ASSERT(p->selected_ != this); GPR_ASSERT(connectivity_state() == GRPC_CHANNEL_READY);
This assertion shouldn't be necessary. We're only calling this from two places, both of which explicitly check that the state is READY.
test/cpp/end2end/client_lb_end2end_test.cc, line 593 at r1 (raw file):
auto channel_2 = BuildChannel("pick_first"); auto stub_2 = BuildStub(channel_2); SetNextResolution(ports);
Please add a TODO here explaining that this resolution data will only be visible to channel 2, not channel 1, due to the way that we're sharing the fake resolver response generator between the two channels. We should ideally fix this by changing the response generator to be able to deliver updates to multiple channels at once, but we don't need to block this PR on this.
test/cpp/end2end/client_lb_end2end_test.cc, line 604 at r1 (raw file):
// Wait for a while so that the disconnection has triggered the connectivity // notification. Otherwise, the subchannel may be picked but will fail soon. sleep(1);
Is there something we can do here more clever than just sleeping? For example, can we wait for the channel's connectivity state to go to something other than READY?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for reviewing!
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @markdroth, @AspirinSJL, and @dgquintas)
src/core/ext/filters/client_channel/lb_policy/subchannel_list.h, line 118 at r1 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
Instead of making this virtual, I suggest just having PF provide a separate method called something like
CheckConnectivityStateAndStartWatchingLocked()
that checks the current state and then calls this method.
Done.
src/core/ext/filters/client_channel/lb_policy/subchannel_list.h, line 159 at r1 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
I really don't want to expose
pending_connectivity_state_unsafe_
. As the name implies, it's really not safe to use that value anywhere but where we're using it internally, so I deliberately structured this API to avoid exposing it.Also, it should not be necessary to expose this in the first place. It looks like the only place it's being used is in an assertion, and I don't think the assertion is actually necessary (see below).
Removed.
src/core/ext/filters/client_channel/lb_policy/subchannel_list.h, line 365 at r1 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
It looks like you're changing the semantics of
connectivity_notification_pending_
, but it's not clear why.The current semantics are that we set this to true in
StartConnectivityWatchLocked()
. Whenever the watch returns, the caller needs to call eitherRenewConnectivityWatchLocked()
, in which case we don't reset the value, because it's already true, orStopConnectivityWatchLocked()
, in which case we set it to false. Note that we also ref the subchannel list inStartConnectivityWatchLocked()
and unref it inStopConnectivityWatchLocked()
, so the value ofconnectivity_notification_pending_
basically tells us whether we're holding that ref. To say this another way, because of that ref, the caller is required to call eitherRenewConnectivityWatchLocked()
orStopConnectivityWatchLocked()
when it receives the callback anyway, so it seems reasonable to updateconnectivity_notification_pending_
at those same points.It looks like you've changed this such that we unset
connectivity_notification_pending_
whenever we get a callback and then reset it when renewing. It's not clear to me why that semantic is better than what we already have.
This was required because of the assertion I added. I wanted to check the current status in ProcessUnselectedReadyLocked()
. I can't do it by calling CheckConnectivityStateLocked()
because it's unsafe. But it's safe to return pending_connectivity_state_unsafe_
at that point because we haven't subscribed to the connectivity change. That's why I added a connectivity_state()
API that can only be called with connectivity_notification_pending_
being false. But then I found that connectivity_notification_pending_
is true in ProcessConnectivityChangeLocked()
. Actually, the notification is not pending when we are in ProcessConnectivityChangeLocked()
; it just happened. So I chose to reset connectivity_notification_pending_
in OnConnectivityChangedLocked()
.
I've removed the assertion, so this has been reverted.
But I think the variable should better be named like holding_watch_ref_
to be precise.
src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc, line 571 at r1 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
What's the benefit of this assertion? What would break if it was false?
The method will be a no-op if the assertion fails. But "Pick First %p selected subchannel %p"
will be printed again, which might be misleading.
I removed the assertion and added a check instead.
src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc, line 572 at r1 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
This assertion shouldn't be necessary. We're only calling this from two places, both of which explicitly check that the state is READY.
Done.
I added this assertion to be defensive. But it might have introduced too many other changes.
test/cpp/end2end/client_lb_end2end_test.cc, line 593 at r1 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
Please add a TODO here explaining that this resolution data will only be visible to channel 2, not channel 1, due to the way that we're sharing the fake resolver response generator between the two channels. We should ideally fix this by changing the response generator to be able to deliver updates to multiple channels at once, but we don't need to block this PR on this.
Done.
test/cpp/end2end/client_lb_end2end_test.cc, line 604 at r1 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
Is there something we can do here more clever than just sleeping? For example, can we wait for the channel's connectivity state to go to something other than READY?
Done.
I should have considered this.
|
|
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing this!
Reviewed 3 of 3 files at r2.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @markdroth and @dgquintas)
src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc, line 571 at r1 (raw file):
Previously, AspirinSJL (Juanli Shen) wrote…
The method will be a no-op if the assertion fails. But
"Pick First %p selected subchannel %p"
will be printed again, which might be misleading.I removed the assertion and added a check instead.
It seems unlikely to me that this would actually happen. We're only calling this from two places, and in neither one is this the selected subchannel. And the name of this method is a fairly strong hint to our future selves not to do this accidentally.
I would just remove this check. I don't think it's worth the complexity for something that is very unlikely to happen and wouldn't actually break anything anyway.
src/core/ext/filters/client_channel/lb_policy/subchannel_list.h, line 365 at r1 (raw file):
Previously, AspirinSJL (Juanli Shen) wrote…
This was required because of the assertion I added. I wanted to check the current status in
ProcessUnselectedReadyLocked()
. I can't do it by callingCheckConnectivityStateLocked()
because it's unsafe. But it's safe to returnpending_connectivity_state_unsafe_
at that point because we haven't subscribed to the connectivity change. That's why I added aconnectivity_state()
API that can only be called withconnectivity_notification_pending_
being false. But then I found thatconnectivity_notification_pending_
is true inProcessConnectivityChangeLocked()
. Actually, the notification is not pending when we are inProcessConnectivityChangeLocked()
; it just happened. So I chose to resetconnectivity_notification_pending_
inOnConnectivityChangedLocked()
.I've removed the assertion, so this has been reverted.
But I think the variable should better be named like
holding_watch_ref_
to be precise.
I think the current name is appropriate. The variable is not really just about holding the ref; it's really about whether there is a watch pending from the perspective of the caller of the SubchannelList
API. The fact that we also hold a ref at the same time is an implementation detail.
Anyway, this looks good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @markdroth and @dgquintas)
src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc, line 571 at r1 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
It seems unlikely to me that this would actually happen. We're only calling this from two places, and in neither one is this the selected subchannel. And the name of this method is a fairly strong hint to our future selves not to do this accidentally.
I would just remove this check. I don't think it's worth the complexity for something that is very unlikely to happen and wouldn't actually break anything anyway.
Done.
I may have been paranoid...
|
|
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r3.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @markdroth, @AspirinSJL, and @dgquintas)
src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc, line 288 at r3 (raw file):
return true; } pick->next = pending_picks_;
This is the same fix that @dgquintas has in progress in #16054. We should probably get that in before this PR. All that's missing there is a test, which should be similar to the one added in #15947.
Ah, thanks for bumping that up. With all the things going on it went down my list. I'll try to get that PR merged, with a test, ASAP. |
I've just updated #16054 |
|
|
|
|
cc27885
to
d19fd1c
Compare
|
|
|
|
|
Fix #15514. This is extracted from the initial attempt #16176. This PR is much simpler but does fix the problem.
Instead of always starting a subchannel from
IDLE
, we will check the current connectivity state of a subchannel and start watching the connectivity change from that state. With this change, if the current state of a subchannel isTRANSIENT_FAILURE
, we will try connecting to it first and determine its real state by the connection result.This change is