
Race between pick and transport shutdown #2562

Open
zhangkun83 opened this issue Jan 3, 2017 · 6 comments
@zhangkun83
Contributor

Right now, transport selection and stream start are done in two steps:

  1. A transport that is in READY state is selected
  2. newStream() is called on the selected transport.

If the transport is shut down (by the LoadBalancer or by channel idle mode) between the two steps, Step 2 will fail spuriously. Currently we work around this by adding a delay between when a subchannel (which owns the transport) stops being selected and when it is shut down. As long as the delay is longer than the time between Step 1 and Step 2, the race won't happen.

This is not ideal because it relies on timing to work correctly, and it will still fail in extreme cases where the time between the two steps is longer than the pre-set delay.
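
For illustration, a minimal sketch of the two-step pattern and where the race lives. The names here (Transport, Stream, the method shapes) are simplified stand-ins, not the actual grpc-java internal API:

```java
// Simplified stand-ins for the internals described above; the real interfaces
// differ, but the shape of the race is the same.
interface Stream {}

interface Transport {
  boolean isReady();   // Step 1 selects a transport that reports READY
  Stream newStream();  // Step 2 fails if the transport has since shut down
}

class TwoStepPick {
  Stream startRpc(Transport picked) {
    // 'picked' was READY when it was selected. If the LoadBalancer or channel
    // idle mode shuts it down before this line runs, newStream() fails
    // spuriously even though a fresh pick could have served the RPC.
    return picked.newStream();
  }
}
```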

A better solution would be to differentiate a racy shutdown from an intended shutdown (the Channel is shut down for good). In response to a racy shutdown, transport selection would be retried. The clientTransportProvider in ManagedChannelImpl is in the best position to do this, because it knows whether the Channel has shut down. clientTransportProvider would have to call newStream(), start the stream, and return the started stream to ClientCallImpl instead of a transport.
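
A rough sketch of that idea, assuming hypothetical pickReadyTransport()/isChannelShutdown() helpers rather than the actual ManagedChannelImpl code: the provider retries the pick when the failure was a racy transport shutdown, and only fails the RPC when the Channel itself has shut down.

```java
// Sketch only: the helpers below are stand-ins for ManagedChannelImpl
// internals, not the real API.
abstract class ClientTransportProviderSketch {
  interface Stream {}
  interface Transport {
    Stream newStreamOrNull();  // returns null if the transport already shut down
  }

  abstract Transport pickReadyTransport();  // Step 1
  abstract boolean isChannelShutdown();     // intended, permanent shutdown?

  Stream newStream() {
    while (true) {
      Transport transport = pickReadyTransport();
      Stream stream = transport.newStreamOrNull();  // Step 2
      if (stream != null) {
        return stream;  // started stream handed to ClientCallImpl
      }
      if (isChannelShutdown()) {
        throw new IllegalStateException("Channel is shut down");
      }
      // Racy shutdown: the transport went away between Step 1 and Step 2.
      // Pick again instead of failing the RPC.
    }
  }
}
```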

@ejona86
Member

ejona86 commented Mar 28, 2017

We should investigate whether we can use this to fix the race seen in #2857

@biran0079

biran0079 commented Nov 2, 2017

What happens to an ongoing RPC when the channel shuts down?

  1. pick READY transport
  2. channel goes idle, scheduled to shut down
  3. newStream, RPC starts
  4. channel shuts down

In the scenario above, if the RPC did not finish before the channel shut down, would it fail?

@zhangkun83
Contributor Author

@biran0079 I think you meant "transport shuts down" for the last item. If a transport shuts down with active RPCs, these RPCs will continue normally.

@biran0079

@zhangkun83 I see. Thanks for explaining!

@zhangkun83
Contributor Author

The transport could also be shut down by the server sending a GOAWAY. Unlike the case described in my first post, this case cannot be mitigated by adding a delay, and is thus more problematic.

@zhangkun83 zhangkun83 changed the title Make transport selection and stream start atomic Race between pick and transport shutdown Feb 7, 2019
@zhangkun83
Contributor Author

zhangkun83 commented Feb 7, 2019

It seems transparent retry is the answer for both the local shutdown and the GOAWAY cases. Go's and C++'s balancer APIs have the same issue, and they also rely on transparent retry.

One issue that prevents us from enabling transparent retry by default is that it may not be compatible with stats keeping.
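
For context, enabling the retry machinery on a channel in grpc-java looks like the sketch below. Transparent retry itself is applied internally by the library when a stream never reached the server; the target address is a placeholder, and whether retry is on by default depends on the grpc-java version.

```java
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

public class RetryEnabledChannel {
  public static void main(String[] args) {
    // enableRetry() turns on the channel's retry machinery. Transparent retry
    // (re-sending a stream that never reached the server, e.g. after losing
    // the pick-vs-shutdown race or a GOAWAY) is handled inside the library and
    // is not configured per-call.
    ManagedChannel channel = ManagedChannelBuilder
        .forTarget("localhost:50051")  // placeholder target
        .usePlaintext()
        .enableRetry()
        .build();

    // ... create stubs and issue RPCs ...

    channel.shutdown();
  }
}
```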

zhangkun83 added a commit to zhangkun83/grpc-java that referenced this issue Feb 7, 2019
…hannels.

This should lower the chance of the race between the pick and the
shutdown (grpc#2562).
zhangkun83 added a commit that referenced this issue Feb 7, 2019
…hannels. (#5338)

This should lower the chance of the race between the pick and the
shutdown (#2562).
@zhangkun83 zhangkun83 added the bug label Jul 24, 2019
ejona86 added a commit to ejona86/grpc-java that referenced this issue Apr 9, 2020
A user has been seeing "InternalSubchannel closed transport due to address
change" errors (b/153064566). It is unclear if they are predomenent, but they
are at least adding noise. Since grpc#2562 is still far from being generally
solved, we delay the shutdown a while to side-step the race.
ejona86 added a commit that referenced this issue Apr 9, 2020
A user has been seeing "InternalSubchannel closed transport due to address
change" errors (b/153064566). It is unclear if they are predomenent, but they
are at least adding noise. Since #2562 is still far from being generally
solved, we delay the shutdown a while to side-step the race.
ejona86 added a commit to ejona86/grpc-java that referenced this issue Apr 10, 2020
The race between new streams and transport shutdown is grpc#2562, but it is still
far from being generally solved. This reduces the race window of new streams
from (transport selection → stream created on network thread) to (transport
selection → stream enqueued on network thread). Since only a single thread now
needs to do work in the stream creation race window, the window should be
dramatically smaller.

This only reduces GOAWAY races when the server performs a graceful shutdown
(using two GOAWAYs), as that is the only non-racy way on-the-wire to shutdown a
connection in HTTP/2.
ejona86 added a commit that referenced this issue Apr 10, 2020
The race between new streams and transport shutdown is #2562, but it is still
far from being generally solved. This reduces the race window of new streams
from (transport selection → stream created on network thread) to (transport
selection → stream enqueued on network thread). Since only a single thread now
needs to do work in the stream creation race window, the window should be
dramatically smaller.

This only reduces GOAWAY races when the server performs a graceful shutdown
(using two GOAWAYs), as that is the only non-racy way on-the-wire to shutdown a
connection in HTTP/2.
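
A sketch of the idea in the commit message above, with assumed names rather than the actual Netty transport code: stream creation is handed off to a single network thread, so the racy window ends at the enqueue rather than at stream creation.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch only: grpc-java's Netty transport uses its own event loop, not a bare
// ExecutorService, but the single-threaded hand-off is the same idea.
class NetworkThreadEnqueue {
  private final ExecutorService networkThread = Executors.newSingleThreadExecutor();

  void createStream(Runnable createStreamOnTransport) {
    // The calling thread only enqueues the work. The network thread, which also
    // processes shutdown and GOAWAY, actually creates the stream, so any
    // shutdown queued ahead of this task is seen first and the race window
    // shrinks to (transport selection -> enqueue).
    networkThread.execute(createStreamOnTransport);
  }
}
```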
@ejona86 ejona86 removed the bug label Jun 23, 2020
@creamsoup creamsoup removed their assignment Jul 8, 2020
dfawley pushed a commit to dfawley/grpc-java that referenced this issue Jan 15, 2021
A user has been seeing "InternalSubchannel closed transport due to address
change" errors (b/153064566). It is unclear if they are predomenent, but they
are at least adding noise. Since grpc#2562 is still far from being generally
solved, we delay the shutdown a while to side-step the race.
dfawley pushed a commit to dfawley/grpc-java that referenced this issue Jan 15, 2021
The race between new streams and transport shutdown is grpc#2562, but it is still
far from being generally solved. This reduces the race window of new streams
from (transport selection → stream created on network thread) to (transport
selection → stream enqueued on network thread). Since only a single thread now
needs to do work in the stream creation race window, the window should be
dramatically smaller.

This only reduces GOAWAY races when the server performs a graceful shutdown
(using two GOAWAYs), as that is the only non-racy way on-the-wire to shutdown a
connection in HTTP/2.
fixmebot bot referenced this issue in aomsw13/develop_test Apr 12, 2021