
Race between pick and transport shutdown #2562

Open
zhangkun83 opened this issue Jan 3, 2017 · 6 comments
@zhangkun83
Contributor

Right now, transport selection and stream start are done in two steps:

  1. A transport that is in READY state is selected
  2. newStream() is called on the selected transport.

If the transport is shut down (by the LoadBalancer or by channel idle mode) between the two steps, Step 2 will fail spuriously. Currently we work around this by adding a delay between when a subchannel (which owns the transport) stops being selected and when it is shut down. As long as the delay is longer than the time between Step 1 and Step 2, the race won't happen.

This is not ideal because it relies on timing to work correctly, and it will still fail in extreme cases where the time between the two steps is longer than the pre-set delay.
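
For illustration, a minimal sketch of the two-step pattern and where the race lives. The names here (Transport, Stream, the method shapes) are simplified stand-ins, not the actual grpc-java internal API:

```java
// Simplified stand-ins for the internals described above; the real interfaces
// differ, but the shape of the race is the same.
interface Stream {}

interface Transport {
  boolean isReady();   // Step 1 selects a transport that reports READY
  Stream newStream();  // Step 2 fails if the transport has since shut down
}

class TwoStepPick {
  Stream startRpc(Transport picked) {
    // 'picked' was READY when it was selected. If the LoadBalancer or channel
    // idle mode shuts it down before this line runs, newStream() fails
    // spuriously even though a fresh pick could have served the RPC.
    return picked.newStream();
  }
}
```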

A better solution would be to differentiate a racy shutdown from an intended shutdown (the Channel is shut down for good). In response to a racy shutdown, transport selection would be retried. The clientTransportProvider in ManagedChannelImpl is in the best position to do this, because it knows whether the Channel has shut down. clientTransportProvider would have to call newStream(), start the stream, and return the started stream to ClientCallImpl instead of a transport.
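
A rough sketch of that idea, assuming hypothetical pickReadyTransport()/isChannelShutdown() helpers rather than the actual ManagedChannelImpl code: the provider retries the pick when the failure was a racy transport shutdown, and only fails the RPC when the Channel itself has shut down.

```java
// Sketch only: the helpers below are stand-ins for ManagedChannelImpl
// internals, not the real API.
abstract class ClientTransportProviderSketch {
  interface Stream {}
  interface Transport {
    Stream newStreamOrNull();  // returns null if the transport already shut down
  }

  abstract Transport pickReadyTransport();  // Step 1
  abstract boolean isChannelShutdown();     // intended, permanent shutdown?

  Stream newStream() {
    while (true) {
      Transport transport = pickReadyTransport();
      Stream stream = transport.newStreamOrNull();  // Step 2
      if (stream != null) {
        return stream;  // started stream handed to ClientCallImpl
      }
      if (isChannelShutdown()) {
        throw new IllegalStateException("Channel is shut down");
      }
      // Racy shutdown: the transport went away between Step 1 and Step 2.
      // Pick again instead of failing the RPC.
    }
  }
}
```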

@ejona86
Member

ejona86 commented Mar 28, 2017

We should investigate whether we can use this to fix the race seen in #2857

@biran0079

biran0079 commented Nov 2, 2017

What happens to an ongoing RPC when the channel shuts down?

  1. pick READY transport
  2. channel goes idle, scheduled to shut down
  3. newStream, RPC starts
  4. channel shuts down

In the scenario above, if the RPC did not finish before the channel shut down, would it fail?

@zhangkun83
Contributor Author

@biran0079 I think you meant "transport shuts down" for the last item. If a transport shuts down with active RPCs, these RPCs will continue normally.

@biran0079

@zhangkun83 I see. Thanks for explaining!

@zhangkun83
Contributor Author

The transport could also be shut down by the server sending a GOAWAY. Unlike the case described in my first post, this case cannot be mitigated by adding a delay, and is thus more problematic.

@zhangkun83 zhangkun83 changed the title Make transport selection and stream start atomic Race between pick and transport shutdown Feb 7, 2019
@zhangkun83
Contributor Author

zhangkun83 commented Feb 7, 2019

It seems transparent retry is the answer for both the local shutdown and the GOAWAY cases. Go's and C++'s balancer APIs have the same issue, and they also rely on transparent retry.

One issue that prevents us from enabling transparent retry by default is that it may not be compatible with stats keeping.
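
For context, enabling the retry machinery on a channel in grpc-java looks like the sketch below. Transparent retry itself is applied internally by the library when a stream never reached the server; the target address is a placeholder, and whether retry is on by default depends on the grpc-java version.

```java
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

public class RetryEnabledChannel {
  public static void main(String[] args) {
    // enableRetry() turns on the channel's retry machinery. Transparent retry
    // (re-sending a stream that never reached the server, e.g. after losing
    // the pick-vs-shutdown race or a GOAWAY) is handled inside the library and
    // is not configured per-call.
    ManagedChannel channel = ManagedChannelBuilder
        .forTarget("localhost:50051")  // placeholder target
        .usePlaintext()
        .enableRetry()
        .build();

    // ... create stubs and issue RPCs ...

    channel.shutdown();
  }
}
```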

zhangkun83 added a commit to zhangkun83/grpc-java that referenced this issue Feb 7, 2019
…hannels.

This should lower the chance of the race between the pick and the
shutdown (grpc#2562).
zhangkun83 added a commit that referenced this issue Feb 7, 2019
…hannels. (#5338)

This should lower the chance of the race between the pick and the
shutdown (#2562).
@zhangkun83 zhangkun83 added the bug label Jul 24, 2019
ejona86 added a commit to ejona86/grpc-java that referenced this issue Apr 9, 2020
A user has been seeing "InternalSubchannel closed transport due to address
change" errors (b/153064566). It is unclear if they are predomenent, but they
are at least adding noise. Since grpc#2562 is still far from being generally
solved, we delay the shutdown a while to side-step the race.
ejona86 added a commit that referenced this issue Apr 9, 2020
A user has been seeing "InternalSubchannel closed transport due to address
change" errors (b/153064566). It is unclear if they are predomenent, but they
are at least adding noise. Since #2562 is still far from being generally
solved, we delay the shutdown a while to side-step the race.
ejona86 added a commit to ejona86/grpc-java that referenced this issue Apr 10, 2020
The race between new streams and transport shutdown is grpc#2562, but it is still
far from being generally solved. This reduces the race window of new streams
from (transport selection → stream created on network thread) to (transport
selection → stream enqueued on network thread). Since only a single thread now
needs to do work in the stream creation race window, the window should be
dramatically smaller.

This only reduces GOAWAY races when the server performs a graceful shutdown
(using two GOAWAYs), as that is the only non-racy way on-the-wire to shutdown a
connection in HTTP/2.
ejona86 added a commit that referenced this issue Apr 10, 2020
The race between new streams and transport shutdown is #2562, but it is still
far from being generally solved. This reduces the race window of new streams
from (transport selection → stream created on network thread) to (transport
selection → stream enqueued on network thread). Since only a single thread now
needs to do work in the stream creation race window, the window should be
dramatically smaller.

This only reduces GOAWAY races when the server performs a graceful shutdown
(using two GOAWAYs), as that is the only non-racy way on-the-wire to shutdown a
connection in HTTP/2.
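
A sketch of the idea in the commit message above, with assumed names rather than the actual Netty transport code: stream creation is handed off to a single network thread, so the racy window ends at the enqueue rather than at stream creation.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch only: grpc-java's Netty transport uses its own event loop, not a bare
// ExecutorService, but the single-threaded hand-off is the same idea.
class NetworkThreadEnqueue {
  private final ExecutorService networkThread = Executors.newSingleThreadExecutor();

  void createStream(Runnable createStreamOnTransport) {
    // The calling thread only enqueues the work. The network thread, which also
    // processes shutdown and GOAWAY, actually creates the stream, so any
    // shutdown queued ahead of this task is seen first and the race window
    // shrinks to (transport selection -> enqueue).
    networkThread.execute(createStreamOnTransport);
  }
}
```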
@ejona86 ejona86 removed the bug label Jun 23, 2020
@creamsoup creamsoup removed their assignment Jul 8, 2020
dfawley pushed a commit to dfawley/grpc-java that referenced this issue Jan 15, 2021
A user has been seeing "InternalSubchannel closed transport due to address
change" errors (b/153064566). It is unclear if they are predomenent, but they
are at least adding noise. Since grpc#2562 is still far from being generally
solved, we delay the shutdown a while to side-step the race.
dfawley pushed a commit to dfawley/grpc-java that referenced this issue Jan 15, 2021
The race between new streams and transport shutdown is grpc#2562, but it is still
far from being generally solved. This reduces the race window of new streams
from (transport selection → stream created on network thread) to (transport
selection → stream enqueued on network thread). Since only a single thread now
needs to do work in the stream creation race window, the window should be
dramatically smaller.

This only reduces GOAWAY races when the server performs a graceful shutdown
(using two GOAWAYs), as that is the only non-racy way on-the-wire to shutdown a
connection in HTTP/2.
fixmebot bot referenced this issue in aomsw13/develop_test Apr 12, 2021