
Fix bugs related to channel dispose while there are active calls #2120

Merged
Merged 8 commits on May 15, 2023

Conversation

JamesNK
Member

@JamesNK JamesNK commented May 10, 2023

Fixes #2119

@amanda-tarafa I tried to recreate this in a unit test without success. I went with your suggestion. Also, I modified the dispose check when starting gRPC calls to remove the chance of a new active call being added after dispose.

Update: Deadlock reproduced in a unit test. Fix in PR confirmed to work.


This PR fixes a number of bugs related to disposing a channel while there are ongoing active calls:

  • Fix potential deadlock when a call and channel are disposed at the same time.
  • Fix race that could allow a call to be added to the channel's active call collection after dispose.
  • Fix hedge/retry calls not immediately throwing ObjectDisposedException after channel dispose.
  • Fix hedge/retry calls not immediately canceling after channel dispose.
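The core of the deadlock fix in the first bullet is to snapshot the active calls under the lock, then cancel them after releasing it, so a cancellation callback that re-enters channel state can still acquire the lock. A minimal sketch of that pattern in Python (a toy analogue, not the actual Grpc.Net.Client code; the `Channel`/`Call` names and shapes here are hypothetical):

```python
import threading

class Channel:
    """Toy channel that tracks active calls; cancel callbacks re-enter its lock."""
    def __init__(self):
        self._lock = threading.Lock()
        self._active_calls = set()
        self._disposed = False

    def register_call(self, call):
        with self._lock:
            # Checking the disposed flag under the same lock closes the race
            # where a new call could be added after dispose has already run.
            if self._disposed:
                raise RuntimeError("channel disposed")  # stand-in for ObjectDisposedException
            self._active_calls.add(call)

    def dispose(self):
        with self._lock:
            if self._disposed:
                return
            self._disposed = True
            calls = list(self._active_calls)  # snapshot while holding the lock
        # Cancel OUTSIDE the lock: a cancellation callback that re-enters the
        # channel (e.g. to unregister the call) can take the lock without
        # deadlocking against the disposing thread.
        for call in calls:
            call.cancel()

class Call:
    def __init__(self, channel):
        self._channel = channel
        self.cancelled = False
        channel.register_call(self)

    def cancel(self):
        self.cancelled = True
        with self._channel._lock:  # callback-style re-entry into channel state
            self._channel._active_calls.discard(self)
```

Had `dispose` invoked `call.cancel()` while still inside `with self._lock:`, the re-entrant lock acquisition in `Call.cancel` would block forever, which is the shape of the deadlock this PR fixes.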

@amanda-tarafa

The changes look good, and the test as well. I also haven't been able to reproduce it — I even pulled your test and played around with it, but nothing.
I'm pretty certain the race condition is there, but it'd definitely be nice to see it happen.

@JamesNK JamesNK changed the title Cancel calls outside of lock on dispose Fix bugs related to channel dispose while there are active calls May 11, 2023
@amanda-tarafa

amanda-tarafa commented May 12, 2023

I've continued trying to reproduce this in Grpc.Net.Client directly (no Google library involved) and I haven't been able to. The code I've been using can be seen in amanda-tarafa#1. My test is very similar to the test in this PR, with only a few tweaks here and there.

There are changes to production code, mostly logging, for the sake of keeping track of things, but I did try to mimic losing the CPU where I thought it mattered, without success in reproducing the deadlock.

Basically, I can't find any sensible way to trigger the Dispose thread blocking waiting on the cancellation token registration callbacks to be executed elsewhere, which is what we've seen in googleapis/google-cloud-dotnet#10318. I'm a little out of my depth here, but I'll try to pick this up again at some point next week.
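The blocking behavior described here comes from the fact that `CancellationTokenSource.Cancel` runs registered callbacks synchronously on the canceling thread (when no synchronization context is captured), so `Cancel` does not return until every callback finishes. A toy Python analogue of that mechanism (the `TokenSource` class is a hypothetical stand-in, not a real API):

```python
import threading

class TokenSource:
    """Toy analogue of CancellationTokenSource: cancel() executes registered
    callbacks inline on the canceling thread and only then returns."""
    def __init__(self):
        self._callbacks = []

    def register(self, cb):
        self._callbacks.append(cb)

    def cancel(self):
        for cb in self._callbacks:
            cb()  # runs synchronously; cancel() blocks until callbacks finish

ran_on = []
ts = TokenSource()
ts.register(lambda: ran_on.append(threading.get_ident()))
ts.cancel()
# The callback executed on the canceling thread, before cancel() returned.
assert ran_on == [threading.get_ident()]
```

If one of those callbacks needs a lock that the disposing thread is holding while it calls cancel, cancel never returns — which is why releasing the lock before canceling, as this PR does, removes the deadlock.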

@amanda-tarafa

Scratch my last comment, I can now reproduce this almost 100% reliably (of the 20 or so times I've executed my test, we might still get the occasional success).
See amanda-tarafa#1 for the unit test (no production code changes there whatsoever).

@JamesNK the main issue with your test is that the HTTP handler does not return, so GrpcCall.RunCall is awaiting the response instead of the GrpcCall.CallTask whose continuations are the ones that block the dispose thread. Hopefully you can run this with your changes and confirm that releasing the lock earlier is indeed a fix.

@JamesNK
Member Author

JamesNK commented May 13, 2023

Good find. I've updated the test in the PR.

I confirmed the test deadlocked on my machine before the changes and succeeded after the changes.

@JamesNK JamesNK merged commit 7786079 into grpc:master May 15, 2023
5 checks passed
Successfully merging this pull request may close these issues.

Race condition when disposing of channels while cancelling calls.