
C# crashes with AccessViolation exception if a unary call is started after all the channels have been shut down #19090

Closed
olivieruzan opened this issue May 21, 2019 · 16 comments · Fixed by #23003

Comments

@olivieruzan

olivieruzan commented May 21, 2019

Using grpc 1.19.0 for C# as Nuget package for VisualStudio 2017 on Win2012 r2 server

The C# program is crashing with AccessViolation error when the remote Python server returns an error:

   at Grpc.Core.Internal.CompletionQueueSafeHandle.ReleaseHandle()
   at System.Runtime.InteropServices.SafeHandle.InternalDispose()
   at Grpc.Core.Internal.AsyncCall`2.UnaryCall(TRequest msg)
   at Grpc.Core.DefaultCallInvoker.BlockingUnaryCall[TRequest,TResponse](Method`2 method, String host, CallOptions options, TRequest request)
   at Grpc.Core.Interceptors.InterceptingCallInvoker.<BlockingUnaryCall>b__3_0[TRequest,TResponse](TRequest req, ClientInterceptorContext`2 ctx)
   at Grpc.Core.ClientBase.ClientBaseConfiguration.ClientBaseConfigurationInterceptor.BlockingUnaryCall[TRequest,TResponse](TRequest request, ClientInterceptorContext`2 context, BlockingUnaryCallContinuation`2 continuation)
   at Grpc.Core.Interceptors.InterceptingCallInvoker.BlockingUnaryCall[TRequest,TResponse](Method`2 method, String host, CallOptions options, TRequest request)
@jtattermusch
Contributor

Seeing the native stack would be useful. The native symbols are in the Grpc.Core.NativeDebug package.

@jtattermusch
Contributor

Also, what kind of error does the Python server return? Do you have a reproduction of the problem?

Please see https://github.com/grpc/grpc/blob/master/TROUBLESHOOTING.md and try to include some extra logs.
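For reference, a minimal sketch of how to enable extra client-side logging with Grpc.Core (following TROUBLESHOOTING.md; the class name GrpcLogging below is just for the example):

using Grpc.Core;
using Grpc.Core.Logging;

static class GrpcLogging
{
    public static void EnableVerboseLogs()
    {
        // Send Grpc.Core's managed log output to the console.
        GrpcEnvironment.SetLogger(new ConsoleLogger());

        // The native tracing described in TROUBLESHOOTING.md is controlled by
        // environment variables that should be set before the process starts,
        // e.g. from the shell:
        //   set GRPC_TRACE=all
        //   set GRPC_VERBOSITY=DEBUG
    }
}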

@olivieruzan
Author

olivieruzan commented May 22, 2019 via email

@olivieruzan
Author

olivieruzan commented May 29, 2019 via email

@cosborne83

cosborne83 commented Jan 20, 2020

I'm seeing the same exception/stack trace on 2.26.0, and (in my case) it seems to be related to calls being made around the time of a channel being shut down.

I have been able to capture the following native stack:

00000000`7772d261 : 00000000`00000000 000007fe`c3c9c288 00000000`02a3c660 000007fe`c3db04f8 : ntdll!RtlpWaitOnCriticalSection+0xb4
000007fe`c3b70f0f : 00000000`1c419869 00000000`00000120 00000000`1bbaa5b0 00000000`00000000 : ntdll!RtlEnterCriticalSection+0xd1
000007fe`c3a18117 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : grpc_csharp_ext_x64!gpr_mu_lock+0xf [t:\src\github\grpc\workspace_csharp_ext_windows_x64\src\core\lib\gpr\sync_windows.cc @ 38]
000007fe`c3a1b4b2 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : grpc_csharp_ext_x64!cq_shutdown_pluck+0x17 [t:\src\github\grpc\workspace_csharp_ext_windows_x64\src\core\lib\surface\completion_queue.cc @ 1334]
000007fe`c3a1b201 : 00000000`00aa44b0 00000000`0274ddf8 00000000`00aa44b0 00000000`00000000 : grpc_csharp_ext_x64!grpc_completion_queue_shutdown+0x102 [t:\src\github\grpc\workspace_csharp_ext_windows_x64\src\core\lib\surface\completion_queue.cc @ 1399]
000007fe`928a7db9 : 00000000`1bb1ff60 00000000`0274ddf8 00000000`00aa44b0 00000000`00000000 : grpc_csharp_ext_x64!grpc_completion_queue_destroy+0x51 [t:\src\github\grpc\workspace_csharp_ext_windows_x64\src\core\lib\surface\completion_queue.cc @ 1407]
[Managed stack trace above follows]

I have also been able to put together a fairly compact repro, though it seems timing-dependent, so it may take some time for the access violation to occur. You should expect to see "Caught" written to the console every few seconds, followed eventually by the exception in question. I have tested x64 only, in both Debug and Release modes:

using System;
using System.Threading;
using System.Threading.Tasks;
using Grpc.Core;

namespace GrpcBug
{
    class Program
    {
        static void Main(string[] args)
        {
            var client = new BugTest();
            client.Test();
        }
    }

    class BugTest
    {
        private readonly object _lock = new object();
        private Channel _channel;
        private TestBug.TestBugClient _client;

        private TestBug.TestBugClient Client { get { lock (_lock) return _client; } }

        public void Test()
        {
            Task.Run(() => Connect());
            Console.ReadLine();
        }

        private void Call()
        {
            try
            {
                Client.Request(new TestRequest());
            }
            catch (ObjectDisposedException)
            {
                Console.WriteLine("Caught");
            }
        }

        private void Connect()
        {
            while (true)
            {
                lock (_lock)
                {
                    // There should *not* be a gRPC service listening on this port
                    _client = new TestBug.TestBugClient(_channel = new Channel("127.0.0.1", 12345, ChannelCredentials.Insecure));
                    Task.Run(() => Call());
                    try
                    {
                        var tokenSource = new CancellationTokenSource(2000);
                        _channel.ConnectAsync().Wait(tokenSource.Token);
                    }
                    catch
                    {
                        _channel.ShutdownAsync().Wait();
                    }
                }

                Thread.Sleep(500);
            }
        }
    }
}

The above code uses the following .proto definition for the service:

syntax = "proto3";

package GrpcBug;

service TestBug
{
    rpc Request(TestRequest) returns (TestResponse);
}

message TestRequest
{
}

message TestResponse
{
}

@stale

stale bot commented May 6, 2020

This issue/PR has been automatically marked as stale because it has not had any update (including commits, comments, labels, milestones, etc) for 30 days. It will be closed automatically if no further update occurs in 7 days. Thank you for your contributions!

@jtattermusch
Contributor

jtattermusch commented May 11, 2020

@cosborne83 on what platform do I have to run this to reproduce?
It would be best if you provided the reproduction as a ready-to-run project (e.g. a GitHub repository).

@jtattermusch
Contributor

@olivieruzan what exactly do you mean by "try to Cancel the thread?"

@jtattermusch
Contributor

@cosborne83 it looks like you only reproduced the exception that is thrown when you try to start a call on an already-disposed channel (which is working as expected; you shouldn't invoke calls on a disposed channel), but you haven't reproduced the crash that @olivieruzan mentioned.

@jtattermusch
Contributor

For now I'll close the issue as "cannot reproduce", since the instructions to reproduce aren't clear. I can reopen it if more evidence is provided.

@cosborne83

@jtattermusch - I've created https://github.com/cosborne83/GrpcBug which contains a self-contained Visual Studio 2019 solution that reproduces the original AccessViolation issue per the code in my comment above and using the latest gRPC NuGet package (2.28.1).

I have confirmed it reproduces the issue on Windows (10, v1909, build 18363.778 in my case, though I don't believe the OS is a factor), in both Debug and Release configurations for both x86 and x64 targets.

My expectation is that you would simply see "Caught" displayed repeatedly in the console window as it catches the ObjectDisposedException that results from the channel being disposed on another thread. Instead, however, an AccessViolationException occurs almost immediately, with the following managed stack trace:

Unhandled Exception: System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
   at Grpc.Core.Internal.CompletionQueueSafeHandle.ReleaseHandle()
   at System.Runtime.InteropServices.SafeHandle.InternalDispose()
   at Grpc.Core.Internal.AsyncCall`2.UnaryCall(TRequest msg)
   at Grpc.Core.DefaultCallInvoker.BlockingUnaryCall[TRequest,TResponse](Method`2 method, String host, CallOptions options, TRequest request)
   at Grpc.Core.Interceptors.InterceptingCallInvoker.<BlockingUnaryCall>b__3_0[TRequest,TResponse](TRequest req, ClientInterceptorContext`2 ctx)
   at Grpc.Core.ClientBase.ClientBaseConfiguration.ClientBaseConfigurationInterceptor.BlockingUnaryCall[TRequest,TResponse](TRequest request, ClientInterceptorContext`2 context, BlockingUnaryCallContinuation`2 continuation)
   at Grpc.Core.Interceptors.InterceptingCallInvoker.BlockingUnaryCall[TRequest,TResponse](Method`2 method, String host, CallOptions options, TRequest request)
   at GrpcBug.TestBug.TestBugClient.Request(TestRequest request, CallOptions options) in ...\GrpcBug\GrpcBug\obj\Debug\net472\GrpcBugGrpc.cs:line 71
   at GrpcBug.TestBug.TestBugClient.Request(TestRequest request, Metadata headers, Nullable`1 deadline, CancellationToken cancellationToken) in ...\GrpcBug\GrpcBug\obj\Debug\net472\GrpcBugGrpc.cs:line 67
   at GrpcBug.BugTest.Call() in ...\GrpcBug\GrpcBug\Program.cs:line 35
   at GrpcBug.BugTest.<Connect>b__7_0() in ...\GrpcBug\GrpcBug\Program.cs:line 51
   at System.Threading.Tasks.Task.Execute()
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot)
   at System.Threading.Tasks.Task.ExecuteEntry(Boolean bPreventDoubleExecution)
   at System.Threading.ThreadPoolWorkQueue.Dispatch()

@jtattermusch jtattermusch reopened this May 12, 2020
@jtattermusch
Contributor

@cosborne83 I can reproduce using the code you provided. I will investigate.

@jtattermusch
Contributor

@cosborne83 some results on analyzing your reproduction:

  • due to the way locking is set up in your example, the call (client.Request(new TestRequest());) won't actually be started until _channel.ShutdownAsync().Wait(); has finished, because you're reading the value of "_client" under a lock and that lock isn't released until the end of the loop's body (which is also under the lock).

so basically, your example can be simplified to this:

var channel = new Channel("127.0.0.1", 12345, ChannelCredentials.Insecure);
var client = new TestBug.TestBugClient(channel);
            
channel.ShutdownAsync().Wait();
client.Request(new TestRequest());

So basically, you're trying to start a new call on a channel after the shutdown of that channel has been requested and has also fully finished (the Wait() ensures that), which makes little sense in a real application.

That said, while your example doesn't really do anything that a real application should ever do (and is thus mostly theoretical), it does indeed crash and that's wrong.
Actually I think this is the root cause for the crash:

  • your example only has one channel, and when the last channel/server in your application is shut down, gRPC internally also does some cleanup (the native grpc_shutdown() function is called). I suspect that in this case it deinitializes some internal state, which results in the crash (trying to start a call while the native internals are uninitialized is bad).

Possible workarounds:

  • don't start any calls after you've shut down the last channel in your application. That means either making your call-invocation logic behave well (not starting calls on something you've already disposed is a fair requirement), or creating a dummy channel that increments the internal channel count and prevents the gRPC internals from shutting down when you shut down the first channel (see the sketch after this list).

  • use an async unary call (not the blocking version). From my experiments, it seems that only the blocking version of the unary call is affected, because it tries to create a new completion queue before checking whether the channel has been disposed. Trying to initialize a completion queue after the gRPC environment has been shut down leads to the crash you're experiencing.
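
For illustration, a rough sketch of the first ("dummy channel") workaround; only the Channel usage is real Grpc.Core API, the class and method names are made up for the example:

using System;
using Grpc.Core;

static class GrpcEnvironmentKeepAlive
{
    // A channel that is created once and intentionally never shut down.
    // It does not need to point at a reachable server; its only purpose is
    // to keep gRPC's internal channel count above zero so the native
    // environment is not torn down while the application may still issue calls.
    private static readonly Channel DummyChannel =
        new Channel("127.0.0.1", 1, ChannelCredentials.Insecure);

    public static void Init() => GC.KeepAlive(DummyChannel);
}

Calling GrpcEnvironmentKeepAlive.Init() once at startup is enough. The second workaround is simpler still: prefer the generated async stub (e.g. client.RequestAsync(...)) over the blocking client.Request(...).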

@jtattermusch
Contributor

I think the unary call crashes here:

using (CompletionQueueSafeHandle cq = CompletionQueueSafeHandle.CreateSync())

(when one attempts to start a unary call after all channels have been shut down, and thus the internal gRPC environment has been shut down too).

@jtattermusch jtattermusch changed the title from "C# AccessViolation error when Python server returns an error" to "C# crashes with AccessViolation exception if a unary call is started after all the channels have been shut down" May 19, 2020
@cosborne83

@jtattermusch - I wouldn't say it's entirely theoretical; the repro I posted was based on a slimmed-down version of our production client code.

In our case we use short-lived SSL certificates for client authentication, and found that if the certificate was rejected by the server (e.g. because it had expired), the ConnectAsync() call would hang indefinitely, requiring a timeout, shutdown, and recreation of the channel (with a renewed authentication certificate). This all occurs in the background relative to the rest of the application, which is actually performing calls on the channel. There may of course be a better way to achieve this.
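
As an aside, a minimal sketch of that reconnect step, assuming the Channel.ConnectAsync(DateTime) deadline overload in Grpc.Core (the host, port, and helper name are illustrative, not the production code described above):

using System;
using System.Threading.Tasks;
using Grpc.Core;

static class ChannelRecycler
{
    // Shut down the previous channel and create a fresh one with renewed
    // credentials, bounding the connection attempt with a deadline instead
    // of a CancellationTokenSource + Wait().
    public static async Task<Channel> RecreateAsync(Channel oldChannel, ChannelCredentials renewedCredentials)
    {
        if (oldChannel != null)
        {
            await oldChannel.ShutdownAsync();
        }

        var channel = new Channel("server.example.com", 443, renewedCredentials);
        try
        {
            await channel.ConnectAsync(DateTime.UtcNow.AddSeconds(2));
        }
        catch
        {
            // The channel did not become Ready before the deadline (for example
            // because the server rejected an expired certificate); clean up so
            // the caller can retry with a renewed certificate.
            await channel.ShutdownAsync();
            throw;
        }
        return channel;
    }
}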

We have however managed to work around the issue using the "dummy channel" approach you mention.

jtattermusch added 3 commits to jtattermusch/grpc that referenced this issue May 20, 2020
@jtattermusch
Contributor

@cosborne83 a fix that prevents the crash on the grpc side is here: #23003.

What I meant by "theoretical" is that the issue is completely preventable in the user code (maybe my wording wasn't the best but you get the idea).

jtattermusch added a commit to jtattermusch/grpc that referenced this issue Mar 30, 2022