Race between pick and transport shutdown #2562
Comments
We should investigate whether we can use this to fix the race seen in #2857
What happens to an on-going RPC when:
- the channel shuts down?
- we pick a READY transport and it shuts down?
In the scenario above, if the RPC did not finish before the channel shuts down, would it fail?
@biran0079 I think you meant "transport shuts down" for the last item. If a transport shuts down with active RPCs, these RPCs will continue normally.
@zhangkun83 I see. Thanks for explaining!
The transport could also be shut down by the server sending a GOAWAY. Unlike the case described in my first post, this case cannot be mitigated by adding a delay, and is thus more problematic.
It seems transparent retry is the answer for both the local shutdown and the GOAWAY cases. Go and C++'s balancer APIs have the same issue and they are also relying on transparent retry. One issue that prevents us from enabling transparent retry by default is that it may not be compatible with stats keeping. |
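The "transparent retry" idea mentioned above can be sketched roughly as follows. This is only an illustration of the concept, not grpc-java's actual retry machinery; all names here (`TransparentRetrySketch`, `startStream`, the use of `IllegalStateException` to signal a dead transport) are hypothetical. The key property is that an attempt is only re-issued when the failure means the server never processed the stream, so the application cannot observe the retry.

```java
import java.util.function.Supplier;

final class TransparentRetrySketch {
  /**
   * Re-issues the stream while attempts fail before reaching the server
   * (e.g. the picked transport had already started shutting down).
   */
  static String startStream(Supplier<String> attempt, int maxAttempts) {
    RuntimeException last = null;
    for (int i = 0; i < maxAttempts; i++) {
      try {
        return attempt.get();
      } catch (IllegalStateException transportDied) {
        last = transportDied; // the server never saw the stream, so retrying is safe
      }
    }
    throw last; // give up after too many racy failures
  }
}
```

As the comment above notes, enabling something like this by default interacts with stats keeping, since a retried attempt may otherwise be double-counted.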
…hannels. This should lower the chance of the race between the pick and the shutdown (grpc#2562).
A user has been seeing "InternalSubchannel closed transport due to address change" errors (b/153064566). It is unclear if they are predominant, but they are at least adding noise. Since #2562 is still far from being generally solved, we delay the shutdown a while to side-step the race.
The race between new streams and transport shutdown is grpc#2562, but it is still far from being generally solved. This reduces the race window of new streams from (transport selection → stream created on network thread) to (transport selection → stream enqueued on network thread). Since only a single thread now needs to do work in the stream creation race window, the window should be dramatically smaller. This only reduces GOAWAY races when the server performs a graceful shutdown (using two GOAWAYs), as that is the only non-racy way on-the-wire to shutdown a connection in HTTP/2.
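The window-narrowing described above can be sketched with a toy transport. This is a hedged illustration, not grpc-java's real transport code: `SingleThreadTransport` and its methods are hypothetical. Because both the shutdown flag and the stream-creation work run on one network thread, the caller's race window shrinks to just the enqueue.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class SingleThreadTransport {
  private final ExecutorService networkThread = Executors.newSingleThreadExecutor();
  private boolean shutdown; // only ever touched on the network thread

  /** The caller's race window is now just (transport selection -> task enqueued). */
  CompletableFuture<String> newStream() {
    CompletableFuture<String> stream = new CompletableFuture<>();
    networkThread.execute(() -> {
      if (shutdown) {
        stream.completeExceptionally(new IllegalStateException("transport is shutdown"));
      } else {
        stream.complete("stream"); // real code would create the HTTP/2 stream here
      }
    });
    return stream;
  }

  /** Shutdown is also serialized onto the network thread, so it cannot interleave. */
  void shutdown() {
    networkThread.execute(() -> shutdown = true);
  }

  void close() {
    networkThread.shutdown();
  }
}
```

Since the single-threaded executor preserves task order, a `newStream()` enqueued before `shutdown()` always wins, and one enqueued after always fails cleanly; only the enqueue itself can still race.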
Right now the pick and the stream creation are done in two steps:
1. A transport is selected by the pick.
2. newStream() is called on the selected transport.
If the transport is shut down (by the LoadBalancer or by channel idle mode) between the two steps, Step 2 will fail spuriously. Currently we work around this by adding a delay between stopping selecting a subchannel (which owns the transport) and shutting it down. As long as the delay is longer than the time between Step 1 and Step 2, the race won't happen.
This is not ideal because it relies on timing to work correctly, and it will still fail in extreme cases where the time between the two steps is longer than the pre-set delay.
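The two-step race can be made concrete with a toy model. `FakeTransport` below is hypothetical, not one of grpc-java's real classes; it only demonstrates how a shutdown that lands between Step 1 and Step 2 makes Step 2 fail spuriously.

```java
final class FakeTransport {
  private volatile boolean shutdown;

  // What the LoadBalancer or channel idle mode may do between Step 1 and Step 2.
  void shutdown() {
    shutdown = true;
  }

  // Step 2: fails spuriously if shutdown() already ran after the pick.
  String newStream() {
    if (shutdown) {
      throw new IllegalStateException("transport is shutdown");
    }
    return "stream";
  }
}
```

The delay workaround amounts to betting that `shutdown()` is never called within the short gap between holding the picked transport and calling `newStream()` on it.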
It would be a better solution to differentiate the racy shutdown from the intended shutdown (the Channel is shut down for good). In response to a racy shutdown, transport selection would be retried. The clientTransportProvider in ManagedChannelImpl is in the best position to do this, because it knows whether the Channel has shut down. clientTransportProvider would have to call newStream() and start the stream, and return the started stream to ClientCallImpl instead of a transport.