
Fix bugs related to channel dispose while there are active calls #2120

Merged
Merged 8 commits on May 15, 2023

Conversation

JamesNK
Member

@JamesNK JamesNK commented May 10, 2023

Fixes #2119

@amanda-tarafa I tried to recreate this in a unit test without success. I went with your suggestion. Also, I modified the dispose check when starting gRPC calls to remove the chance of a new active call being added after dispose.

Update: Deadlock reproduced in a unit test. Fix in PR confirmed to work.


This PR fixes a number of bugs related to disposing a channel while there are ongoing active calls:

  • Fix potential deadlock when a call and channel are disposed at the same time.
  • Fix race that could allow a call to be added to the channel's active call collection after dispose.
  • Fix hedge/retry calls not immediately throwing ObjectDisposedException after channel dispose.
  • Fix hedge/retry calls not immediately canceling after channel dispose.
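The core of the deadlock fix in the first bullet is to snapshot the active calls under the lock, then cancel them after releasing it, so a cancellation callback that re-enters channel state can still acquire the lock. A minimal sketch of that pattern in Python (a toy analogue, not the actual Grpc.Net.Client code; the `Channel`/`Call` names and shapes here are hypothetical):

```python
import threading

class Channel:
    """Toy channel that tracks active calls; cancel callbacks re-enter its lock."""
    def __init__(self):
        self._lock = threading.Lock()
        self._active_calls = set()
        self._disposed = False

    def register_call(self, call):
        with self._lock:
            # Checking the disposed flag under the same lock closes the race
            # where a new call could be added after dispose has already run.
            if self._disposed:
                raise RuntimeError("channel disposed")  # stand-in for ObjectDisposedException
            self._active_calls.add(call)

    def dispose(self):
        with self._lock:
            if self._disposed:
                return
            self._disposed = True
            calls = list(self._active_calls)  # snapshot while holding the lock
        # Cancel OUTSIDE the lock: a cancellation callback that re-enters the
        # channel (e.g. to unregister the call) can take the lock without
        # deadlocking against the disposing thread.
        for call in calls:
            call.cancel()

class Call:
    def __init__(self, channel):
        self._channel = channel
        self.cancelled = False
        channel.register_call(self)

    def cancel(self):
        self.cancelled = True
        with self._channel._lock:  # callback-style re-entry into channel state
            self._channel._active_calls.discard(self)
```

Had `dispose` invoked `call.cancel()` while still inside `with self._lock:`, the re-entrant lock acquisition in `Call.cancel` would block forever, which is the shape of the deadlock this PR fixes.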

@amanda-tarafa

The changes look good, and the test as well. I also haven't been able to reproduce it — I even pulled your test and played around with it, but nothing.
I'm pretty certain the race condition is there, but it'd definitely be nice to see it happen.

@JamesNK JamesNK changed the title Cancel calls outside of lock on dispose Fix bugs related to channel dispose while there are active calls May 11, 2023
@amanda-tarafa

amanda-tarafa commented May 12, 2023

I've continued trying to reproduce this in Grpc.Net.Client directly (no Google library involved) and I haven't been able to. The code I've been using can be seen in amanda-tarafa#1. My test is very similar to the test in this PR, with only a few tweaks here and there.

There are changes to production code, mostly logging, for the sake of keeping track of things, but I did try to mimic losing the CPU where I thought it mattered, without success in reproducing the deadlock.

Basically, I can't find any sensible way to trigger the Dispose thread blocking waiting on the cancellation token registration callbacks to be executed elsewhere, which is what we've seen in googleapis/google-cloud-dotnet#10318. I'm a little out of my depth here, but I'll try to pick this up again at some point next week.
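The blocking behavior described here comes from the fact that `CancellationTokenSource.Cancel` runs registered callbacks synchronously on the canceling thread (when no synchronization context is captured), so `Cancel` does not return until every callback finishes. A toy Python analogue of that mechanism (the `TokenSource` class is a hypothetical stand-in, not a real API):

```python
import threading

class TokenSource:
    """Toy analogue of CancellationTokenSource: cancel() executes registered
    callbacks inline on the canceling thread and only then returns."""
    def __init__(self):
        self._callbacks = []

    def register(self, cb):
        self._callbacks.append(cb)

    def cancel(self):
        for cb in self._callbacks:
            cb()  # runs synchronously; cancel() blocks until callbacks finish

ran_on = []
ts = TokenSource()
ts.register(lambda: ran_on.append(threading.get_ident()))
ts.cancel()
# The callback executed on the canceling thread, before cancel() returned.
assert ran_on == [threading.get_ident()]
```

If one of those callbacks needs a lock that the disposing thread is holding while it calls cancel, cancel never returns — which is why releasing the lock before canceling, as this PR does, removes the deadlock.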

@amanda-tarafa

Scratch my last comment, I can now reproduce this almost 100% reliably (of the 20 or so times I've executed my test, we might still get the occasional success).
See amanda-tarafa#1 for the unit test (no production code changes there whatsoever).

@JamesNK the main issue with your test is that the HTTP handler does not return, so GrpcCall.RunCall is awaiting the response instead of the GrpcCall.CallTask whose continuations are the ones that block the dispose thread. Hopefully you can run this with your changes and confirm that releasing the lock earlier is indeed a fix.

@JamesNK
Member Author

JamesNK commented May 13, 2023

Good find. I've updated the test in the PR.

I confirmed the test deadlocked on my machine before the changes and succeeded after the changes.

@JamesNK JamesNK merged commit 7786079 into grpc:master May 15, 2023
5 checks passed
Successfully merging this pull request may close these issues.

Race condition when disposing of channels while cancelling calls.