core: fix retriablestream deadlock #10386

Merged
merged 2 commits into from Jul 21, 2023

YifeiZhuang
Contributor

@YifeiZhuang YifeiZhuang commented Jul 18, 2023

fix #10314

The deadlock can be reproduced with the unit test in this change plus the following diff, which adds an artificial delay:

@@ -863,6 +870,12 @@ abstract class RetriableStream<ReqT> implements ClientStream {
         headers.discardAll(GRPC_PREVIOUS_RPC_ATTEMPTS);
         headers.put(GRPC_PREVIOUS_RPC_ATTEMPTS, String.valueOf(substream.previousAttemptCount));
       }
+      try  {
+        Thread.sleep(500);
+      } catch (InterruptedException ex) {
+        log.log(Level.INFO, "nothing");
+      }

The issue was that two streams from two transports each held their own transport lock while waiting for the other's, hence the deadlock. In this particular case, one subListener.close() creates a new substream on another transport, which requires that transport's thread. Meanwhile, another sublistener receives headersRead() and tries to cancel all other streams, which requires the corresponding transport lock. It is believed that this cannot happen in Netty, only in OkHttp.
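The lock-order inversion described above can be sketched in isolation. This is not the gRPC code: `LockInversionSketch`, `transportA`, and `transportB` are hypothetical names, and plain `ReentrantLock`s stand in for the per-transport locks. `tryLock` with a timeout is used instead of `lock()` so the sketch terminates rather than actually deadlocking.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

class LockInversionSketch {
  /** Returns how many of the two threads could not obtain the other transport's lock. */
  static int run() {
    ReentrantLock transportA = new ReentrantLock(); // hypothetical stand-in for a transport lock
    ReentrantLock transportB = new ReentrantLock();
    CountDownLatch bothHoldOwnLock = new CountDownLatch(2);
    CountDownLatch bothAttempted = new CountDownLatch(2);
    AtomicInteger blocked = new AtomicInteger();

    Thread t1 = new Thread(
        () -> attempt(transportA, transportB, bothHoldOwnLock, bothAttempted, blocked));
    Thread t2 = new Thread(
        () -> attempt(transportB, transportA, bothHoldOwnLock, bothAttempted, blocked));
    t1.start();
    t2.start();
    try {
      t1.join();
      t2.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return blocked.get();
  }

  private static void attempt(
      ReentrantLock own,
      ReentrantLock other,
      CountDownLatch bothHoldOwnLock,
      CountDownLatch bothAttempted,
      AtomicInteger blocked) {
    own.lock();
    try {
      // Wait until both threads hold their own lock, mirroring two callbacks
      // that each run under their own transport lock.
      bothHoldOwnLock.countDown();
      bothHoldOwnLock.await();
      // With lock() this call would never return; tryLock lets the sketch observe the cycle.
      if (other.tryLock(100, TimeUnit.MILLISECONDS)) {
        other.unlock();
      } else {
        blocked.incrementAndGet();
      }
      // Keep holding our own lock until the peer has finished its attempt.
      bothAttempted.countDown();
      bothAttempted.await();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    } finally {
      own.unlock();
    }
  }

  public static void main(String[] args) {
    System.out.println("threads blocked on the other lock: " + run());
  }
}
```

Each thread keeps its own lock for the whole duration of the other thread's attempt, so neither `tryLock` can succeed. Replacing `tryLock` with `lock()` would turn the same cycle into a permanent hang.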

Solutions

We should break the deadlock in both places.
This fix changes headersRead() so that the other streams are cancelled from the call executor thread.
The other part of the fix is to have createSubstream run from the call executor. This will be fixed in a follow-up PR.
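The idea of moving the cancellation onto the call executor can be illustrated with a minimal, single-threaded sketch. Everything here is an assumption made for illustration: `FakeStream`, `QueueingExecutor`, and `onHeadersRead` are hypothetical and do not mirror the actual `RetriableStream` implementation. The point is only that the callback running under one transport lock enqueues the cancellations instead of taking the other transport's lock itself.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Executor;
import java.util.concurrent.locks.ReentrantLock;

class DeferredCancelSketch {
  /** Hypothetical substream whose cancel() needs its transport's lock. */
  static final class FakeStream {
    final ReentrantLock transportLock;
    boolean cancelled;

    FakeStream(ReentrantLock transportLock) {
      this.transportLock = transportLock;
    }

    void cancel() {
      transportLock.lock();
      try {
        cancelled = true;
      } finally {
        transportLock.unlock();
      }
    }
  }

  /** Hypothetical stand-in for the call executor: queues tasks until drain() is called. */
  static final class QueueingExecutor implements Executor {
    private final ArrayDeque<Runnable> tasks = new ArrayDeque<>();

    @Override
    public void execute(Runnable task) {
      tasks.add(task);
    }

    void drain() {
      Runnable task;
      while ((task = tasks.poll()) != null) {
        task.run();
      }
    }
  }

  /** Runs with the winner's transport lock held; it only schedules the cancellations. */
  static void onHeadersRead(FakeStream winner, List<FakeStream> all, Executor callExecutor) {
    for (FakeStream stream : all) {
      if (stream != winner) {
        callExecutor.execute(stream::cancel);
      }
    }
  }

  /** Returns {cancelled while the lock was held, cancelled after the executor drained}. */
  static boolean[] run() {
    ReentrantLock lockA = new ReentrantLock();
    ReentrantLock lockB = new ReentrantLock();
    FakeStream a = new FakeStream(lockA);
    FakeStream b = new FakeStream(lockB);
    QueueingExecutor callExecutor = new QueueingExecutor();

    boolean cancelledWhileLocked;
    lockA.lock();
    try {
      onHeadersRead(a, Arrays.asList(a, b), callExecutor);
      cancelledWhileLocked = b.cancelled; // lockB has not been touched yet
    } finally {
      lockA.unlock();
    }
    callExecutor.drain(); // the cancel now runs with no transport lock held by the caller
    return new boolean[] {cancelledWhileLocked, b.cancelled};
  }

  public static void main(String[] args) {
    System.out.println(Arrays.toString(run()));
  }
}
```

Because the cancel task runs only after `lockA` is released, no thread ever holds one transport lock while waiting for another, which removes one edge of the cycle.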

Tests

Global TAP results currently show no bad signals (the failures present are from already-failing issues).

@YifeiZhuang YifeiZhuang requested a review from ejona86 July 20, 2023 22:10
@YifeiZhuang
Contributor Author

@ejona86 did you want to take a look at the new changes? They are just comments.

@YifeiZhuang YifeiZhuang merged commit e179212 into grpc:master Jul 21, 2023
14 checks passed
@YifeiZhuang YifeiZhuang deleted the fix-retry-deadlock branch July 21, 2023 20:48
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 20, 2023
Successfully merging this pull request may close these issues.

Hedging retry seems to cause a deadlock in rare cases with OkHttp
2 participants