Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

core: fix bug RetriableStream cancel() racing with start() #8386

Merged
merged 2 commits into from
Aug 6, 2021

Conversation

dapengzhang0
Copy link
Member

@dapengzhang0 dapengzhang0 commented Aug 5, 2021

There is a bug in the scenario of the following sequence of events:

  • call.start()
  • received retryable status and about to retry
  • The retry attempt Substream is created but not yet drain()
  • call.cancel()

But stream.cancel() cannot be called prior to stream.start(), otherwise retry will cause a failure with IllegalStateException. The current RetriableStream code must be fixed to not cancel a stream until start() is called in drain().

if (savedState.winningSubstream != null && savedState.winningSubstream != substream) {
// committed but not me
if (savedState.winningSubstream != null && savedState.winningSubstream != substream
&& streamStarted) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If streamStarted is false here, we can also short-circuit by returning directly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not quite. At the moment, every created stream has to be used. Things like DelayedClientTransport.newStream() add the stream to their tracking list, so you can't just drop the stream on the floor.

Copy link
Member

@ejona86 ejona86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we'll want to clean this up later by separating the before-start and after-start methods, so that we call the before-start methods+start inline when creating a new stream. (And even better, get rid of the before-start methods entirely and make a new object that holds the setting.) But this is appropriate at the moment.

if (savedState.winningSubstream != null && savedState.winningSubstream != substream) {
// committed but not me
if (savedState.winningSubstream != null && savedState.winningSubstream != substream
&& streamStarted) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not quite. At the moment, every created stream has to be used. Things like DelayedClientTransport.newStream() add the stream to their tracking list, so you can't just drop the stream on the floor.

@@ -440,6 +448,7 @@ public void run() {

@Override
public final void cancel(Status reason) {
cancellationStatus = reason;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this to just before the synchronized block (or as an else within the synchronized block)? Right now if commit() fails the status is still set, which may be confusing.

@dapengzhang0 dapengzhang0 added the TODO:backport PR needs to be backported. Removed after backport complete label Aug 6, 2021
@dapengzhang0 dapengzhang0 enabled auto-merge (squash) August 6, 2021 00:37
@dapengzhang0 dapengzhang0 merged commit 3668f2e into grpc:master Aug 6, 2021
@dapengzhang0 dapengzhang0 deleted the fix-retriable-stream-bug branch August 6, 2021 02:08
dapengzhang0 added a commit to dapengzhang0/grpc-java that referenced this pull request Aug 6, 2021
There is a bug in the scenario of the following sequence of events:

- `call.start()` 
- received retryable status and about to retry
- The retry attempt Substream is created but not yet `drain()` 
- `call.cancel()`

But `stream.cancel()` cannot be called prior to `stream.start()`, otherwise retry will cause a failure with IllegalStateException. The current RetriableStream code must be fixed to not cancel a stream until `start()` is called in `drain()`.
@dapengzhang0 dapengzhang0 removed the TODO:backport PR needs to be backported. Removed after backport complete label Aug 6, 2021
dapengzhang0 added a commit to dapengzhang0/grpc-java that referenced this pull request Aug 6, 2021
dapengzhang0 added a commit to dapengzhang0/grpc-java that referenced this pull request Aug 6, 2021
dapengzhang0 added a commit that referenced this pull request Aug 7, 2021
While adding regression tests to #8386, I found a bug in an edge case: while retry attempt is draining the last buffered entry, if it is in the mean time committed and then we cancel the call, the stream will never be cancelled. See the regression test case `commitAndCancelWhileDraining()`.
dapengzhang0 added a commit to dapengzhang0/grpc-java that referenced this pull request Aug 7, 2021
…#8393)

While adding regression tests to grpc#8386, I found a bug in an edge case: while retry attempt is draining the last buffered entry, if it is in the mean time committed and then we cancel the call, the stream will never be cancelled. See the regression test case `commitAndCancelWhileDraining()`.
dapengzhang0 added a commit that referenced this pull request Aug 9, 2021
There is a bug in the scenario of the following sequence of events:

- `call.start()` 
- received retryable status and about to retry
- The retry attempt Substream is created but not yet `drain()` 
- `call.cancel()`

But `stream.cancel()` cannot be called prior to `stream.start()`, otherwise retry will cause a failure with IllegalStateException. The current RetriableStream code must be fixed to not cancel a stream until `start()` is called in `drain()`.
dapengzhang0 added a commit that referenced this pull request Aug 9, 2021
While adding regression tests to #8386, I found a bug in an edge case: while retry attempt is draining the last buffered entry, if it is in the mean time committed and then we cancel the call, the stream will never be cancelled. See the regression test case `commitAndCancelWhileDraining()`.
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 4, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants