
Stream removed when server stream takes more than 60 seconds in very specific scenario #31640

Closed
Pascal1986 opened this issue Nov 14, 2022 · 16 comments

Comments

@Pascal1986

Pascal1986 commented Nov 14, 2022

What version of gRPC and what language are you using?

Grpc.Core 2.0.0.0, .NET Standard 2.0

What operating system (Linux, Windows,...) and version?

Windows 10

What runtime / compiler are you using (e.g. python version or version of gcc)

Client within local network
Server in Azure with a private IP connected to our local network using an internal network hub.

What did you do?

A stream removed error occurs in the following specific scenario:

  • gRPC server stream where the server takes more than 60 seconds before returning to the client
  • Using Grpc.Core.Channel (Grpc.Net.Client.GrpcChannel does not cause this)
  • Running and calling the server in Azure over a public IP does not cause this issue.
  • Running and calling the server in Azure using a private IP over an internal network hub does cause the issue.

What did you expect to see?

A successful response

What did you see instead?

The following exception message:
Status(StatusCode="Unknown", Detail="Stream removed", DebugException="Grpc.Core.Internal.CoreErrorDetailException: {"created":"@1668425723.993000000","description":"Error received from peer ipv4::5000","file":"......\src\core\lib\surface\call.cc","file_line":953,"grpc_message":"Stream removed","grpc_status":2}")

Anything else we should know about your project / environment?

Wireshark capture: (screenshot attached)

Proto file:

syntax = "proto3";

option csharp_namespace = "GrpcDemoServer";

package greet;

service Greeter {
  rpc SayHelloStream (HelloRequest) returns (stream StreamHelloReply);
}

message HelloRequest {  
  string name = 1;
  int32 sleepTime = 2;
}

message StreamHelloReply {
    string message = 1;
}

Client code:

var channel = new Channel("<IP ADDRESS>", 1234, ChannelCredentials.Insecure, new[]
{
   new ChannelOption( "grpc.keepalive_time_ms", 1000 ),
   new ChannelOption( "grpc.http2.max_pings_without_data", 0 ),
   new ChannelOption( "grpc.keepalive_permit_without_calls", 1 )
});

var client = new Greeter.GreeterClient(channel);
using var transfer = client.SayHelloStream(GetRequest("Test", 70));
while (await transfer.ResponseStream.MoveNext(CancellationToken.None).ConfigureAwait(false))
{
   var response = transfer.ResponseStream.Current;
   client.SayHello(GetRequest(response.Message, 1));
   Console.WriteLine(response.Message);
}

Server code:

public override async Task SayHelloStream(HelloRequest request, IServerStreamWriter<StreamHelloReply> responseStream, ServerCallContext context)
{
    for (int i = 1; i <= 10; i++)
    {
        Thread.Sleep(request.SleepTime); // 70 seconds
        await responseStream.WriteAsync(new StreamHelloReply
        {
            Message = $"Hello {i}"
        });
    }
}
public class Program
{
    public static void Main(string[] args)
    {
        CreateHostBuilder(args).Build().Run();
    }

    public static IHostBuilder CreateHostBuilder(string[] args) =>
        Host.CreateDefaultBuilder(args)
            .UseWindowsService()
            .ConfigureWebHostDefaults(webBuilder =>
            {
                webBuilder.ConfigureKestrel(options =>
                {
                    // Setup a HTTP/2 endpoint without TLS.
                    options.ListenAnyIP(5000, o => o.Protocols =
                        HttpProtocols.Http2);

                });
                webBuilder.UseStartup<Startup>();
            });
}
public class Startup
    {
        public Startup(IConfiguration configuration)
        {
            Configuration = configuration;
        }
        public IConfiguration Configuration { get; }
       
        public void ConfigureServices(IServiceCollection services)
        {
            services.AddGrpc();
            services.AddApplicationInsightsTelemetry(Configuration["APPINSIGHTS_CONNECTIONSTRING"]);
        }
       
        public void Configure(IApplicationBuilder app, IWebHostEnvironment env)
        {
            app.UseRouting();
                        
            app.UseEndpoints(endpoints =>
            { 
                endpoints.MapGrpcService<GreeterService>();
                endpoints.MapGet("/", async context =>
                {
                    await context.Response.WriteAsync("Communication with gRPC endpoints must be made through a gRPC client. To learn how to create a client, visit: https://go.microsoft.com/fwlink/?linkid=2086909");
                });
            });
        }
    }

I tried using the following ChannelOptions to prevent this, but without success:

grpc.keepalive_time_ms => 1000
grpc.http2.max_pings_without_data => 0
grpc.keepalive_permit_without_calls => 1

@yashykt
Member

yashykt commented Nov 15, 2022

@jtattermusch can you do the initial triage and if it is a Core issue, can you re-assign?

@yashykt yashykt assigned jtattermusch and unassigned veblush Nov 15, 2022
@tonydnewell
Contributor

As far as I can tell the server (or something between the server and the client) is doing the disconnect (the RST) - since it is going in the direction 5000 -> 52687 (and server is on port 5000).

Could you provide a gRPC debug trace on the client side? See https://github.com/grpc/grpc/blob/master/TROUBLESHOOTING.md

@jtattermusch
Contributor

Since the server is likely running behind a proxy (has private IP and there's an "internal network hub"), this looks like a classic case of the proxy terminating the connection (which then shows up as a "reset / stream removed" on the client).
What's puzzling me is that the user says the same scenario works when grpc-dotnet is used.

It would be good to understand the scenario more:

  • Can we get more details about the proxy that's being used (not sure what an "internal network hub" is)? If the client is connecting directly to the server with no middleman, then the difference between local network vs internet shouldn't really matter.
  • What is the traffic pattern? Do you get some responses initially and then you wait for too long, after which the connection gets terminated? Or do you get stream removed as the first thing after trying to call the RPC?
  • Does the connection always get terminated after exactly 60 seconds? (Which sounds like the proxy kills connections with no traffic for 60s.)
  • Btw, using Thread.Sleep(request.SleepTime); is an antipattern since it is blocking. You should use await Task.Delay(sleepTime) instead (see the sketch below this list).
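
A minimal sketch of that last point applied to the server method from this issue (an illustration of the suggestion only, assuming SleepTime carries milliseconds as Thread.Sleep expects):

public override async Task SayHelloStream(HelloRequest request, IServerStreamWriter<StreamHelloReply> responseStream, ServerCallContext context)
{
    for (int i = 1; i <= 10; i++)
    {
        // Task.Delay yields the thread back to the pool instead of blocking it.
        await Task.Delay(request.SleepTime);
        await responseStream.WriteAsync(new StreamHelloReply
        {
            Message = $"Hello {i}"
        });
    }
}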

@Pascal1986
Author

Pascal1986 commented Jan 31, 2023

Thanks for your reactions! Sorry for my late response, I got caught up in other work :).

I have changed the test scenario a bit to get some more information.
Instead of waiting 60 seconds on the server side I now wait 5 seconds (using Task.Delay) and repeat the streaming call 10 times. The Stream removed error now appears after 35 seconds, which leads me to believe that the timeout is at 30 seconds instead of the previously mentioned 60 seconds. Also, sending a request 30 seconds after the stream was created still results in a stream removed error, despite "refreshing" the stream every 5 seconds.

To answer your questions:

  • I am not sure exactly how the traffic goes from my client PC within our business network towards Azure, but there are multiple hops the traffic has to go through before it reaches Azure.
  • The error seems to appear on the client; the stream removed error is the first thing returned after trying to call the RPC, as long as it has been at least 30 seconds since the stream was started.
  • Yes, it always gets terminated after the exact same amount of time.

Attached is the client-side logging when using Grpc.Core.Channel (ChannelOutput.log) and when using Grpc.Net.Client.GrpcChannel (GrpcChannelOutput.log) to make the same calls.
ChannelOutput.log
GrpcChannelOutput.log

@tonydnewell
Contributor

tonydnewell commented Feb 3, 2023

@Pascal1986 Thank you for the additional information. A very brief look at the log files does confirm that the connection is being dropped on the non-client side (the server or something in between). I will spend some time later seeing whether I can reproduce this, but I expect it is specific to your environment.

I see you have provided extracts of the code used above, but are you able to provide complete example code (zip or GitHub repository) for a test client and server so I can make sure I'm testing exactly the same code as you are using?

If you have any more information about firewalls or other network configuration then that might be helpful.

Have you tried to reproduce this connecting to a server in your local network rather than Azure?

@Pascal1986
Author

The source code can be found at: https://github.com/Pascal1986/GrpcDemoRepository
This contains 3 projects:
GrpcDemoServer => The grpc server
GrpcClient => The client that uses Grpc.Core.Channel
NewGrpcClient => The client that uses Grpc.Net.Client.GrpcChannel to make the same call.

I will try to gather more information on the network configuration in the meantime.

I also tested this on a server on the local network and everything worked as expected.

Can you think of anything that differs between the two types of Channel that explains why everything works as expected for GrpcChannel and not for Channel?

@tonydnewell
Contributor

tonydnewell commented Feb 8, 2023

@Pascal1986 thank you for the code. It works fine in my environment, but as I don't currently have access to Azure I know I'm not replicating your setup.

There are a few things that you can do to try and diagnose the problem and the difference between the gRPC-core and gRPC-dotnet clients.

  1. Enable this logging on the gRPC core client (see the sketch after this list for setting it programmatically):
    These commands set the environment variables in a CMD prompt - but you can also set them in the Visual Studio debug properties for the client.
set GRPC_TRACE=http,http1,http_keepalive,http2_stream_state
set GRPC_VERBOSITY=DEBUG

This will give us the HTTP chatter and the keepalive pings.

  2. Look for any logging on firewalls between you and the Azure network.

  3. Provide Wireshark traces of both clients so we can try and compare the differences in the traffic behaviour.
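
A sketch (not from the thread) of setting those trace variables programmatically in the Grpc.Core client, assuming they are set before the first gRPC object is created so the native library sees them at load time:

// using Grpc.Core; using Grpc.Core.Logging;
// Hypothetical snippet: set the trace variables at the very start of Main,
// before the Channel is constructed, so Grpc.Core picks them up.
Environment.SetEnvironmentVariable("GRPC_TRACE", "http,http1,http_keepalive,http2_stream_state");
Environment.SetEnvironmentVariable("GRPC_VERBOSITY", "DEBUG");
GrpcEnvironment.SetLogger(new ConsoleLogger()); // route gRPC log output to the console

var channel = new Channel("<IP ADDRESS>", 1234, ChannelCredentials.Insecure);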

@Pascal1986
Author

Pascal1986 commented Feb 10, 2023

Attached are some screenshots of the Wireshark logging of both the client and the server in a Channel and a GrpcChannel scenario.

Client with Channel (RST after 30 seconds): ClientChannel (screenshot)
Client with GrpcChannel (no RST): ClientGrpcChannel (screenshot)
Server on VM in Azure with Channel (RST after 30 seconds): ServerChannel (screenshot)
Server on VM in Azure with GrpcChannel (no RST): ServerGrpcChannel (screenshot)

@Pascal1986
Author

And the log file of the client using Channel
grpclient.log

@tonydnewell
Contributor

@Pascal1986 thank you for the information.

The Wireshark logs do show that the connection is being terminated by something in between the client and the server, since both the client and the server receive an RST from the "remote" port.

I'm not sure if there are any known functional differences in the HTTP2 implementations (Grpc.Core and Grpc-dotnet) - I'll have to get input from elsewhere. I know they are built on completely different network stacks.

I also note that the HTTP2 implementation in the v1.46.x branch (which Grpc.Core uses) is quite behind the master branch. I'll need to check with someone whether there have been any significant bug fixes that need to be merged across.

In the meantime if you do find any information about what may be in the middle that might be terminating the connections then that might help.

@jtattermusch @JamesNK - any ideas why Grpc.Core and Grpc-dotnet might behave differently? Or if there have been fixes to HTTP2 that need to be merged to the branch?

@Pascal1986
Author

We looked at what might cause the termination of the connection in the middle, but unfortunately could not find anything. Is there any more information on the difference between Grpc.Core and Grpc-dotnet?

@Pascal1986
Author

Pascal1986 commented Apr 17, 2023

We did another session to try to pinpoint the issue somewhere in the connection between the client and the server.

  1. Test from on-premise to an Azure VM with only a public IP => Stream removed error occurs.
  2. Test from a public client to an Azure VM with only a public IP => Stream removed error does not occur.
    We now know the issue is within our network or with our ISP. But when looking at firewalls/logging/packets being sent we see no specific errors/resets.

Also, we played around with the time between the messages being sent over the stream:

  1. 5 seconds between messages: first 5 messages successful
  2. 1 second between messages: first 28 messages successful
  3. 40 seconds between messages: first 2 messages successful
  4. 61 seconds between messages: first 2 messages successful
    This didn't quite make sense, as we first expected there to be a timeout after approximately 30 seconds. But then increasing the interval to 30+ seconds should never return a successful message, let alone two.
    Can you maybe explain this behaviour?

Also, is it possible for us to try using the Grpc.Core channel with a newer implementation of HTTP2 so we can maybe find a solution this way?

@tonydnewell
Contributor

tonydnewell commented Apr 19, 2023

@Pascal1986 Without access to your network or the full network traces it is hard to say what is happening. I agree the behaviour doesn't seem to make sense with scenarios 3 and 4 above.

It is unlikely that all the changes to the core code for HTTP/2 will be back ported to the C# 1.46.x branch without a good reason to do so (as in, we can identify the exact issue that needs fixing).

Two suggestions I have:

  1. Continue to experiment with the different ChannelOptions settings (a sketch with additional keepalive-related options follows this list). See:

  2. Eliminate the C# implementation by using a Python or C++ gRPC client (see https://github.com/grpc/grpc/tree/master/examples/python/hellostreamingworld for an example). This might tell us if it is a core issue (and the version it occurs in).
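
A sketch for the first suggestion (not a recommendation from the maintainers; the option values are placeholders to experiment with, applied to the client from this issue):

// using Grpc.Core;
var channel = new Channel("<IP ADDRESS>", 1234, ChannelCredentials.Insecure, new[]
{
    new ChannelOption("grpc.keepalive_time_ms", 10000),          // ping every 10s on an otherwise idle connection
    new ChannelOption("grpc.keepalive_timeout_ms", 5000),        // wait 5s for the ping ack before dropping the connection
    new ChannelOption("grpc.keepalive_permit_without_calls", 1), // allow keepalive pings even with no active calls
    new ChannelOption("grpc.http2.max_pings_without_data", 0)    // do not limit pings sent without data frames
});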

@Pascal1986
Author

Pascal1986 commented Apr 21, 2023

I got the ChannelOptions working on the server side by using Grpc.Core.Server (a sketch of that setup follows the list below). When playing around with some traces and settings I found that the stream does sometimes continue successfully after 30 seconds. But it only works sometimes, under the following conditions:

  • No extra configuration is needed (no keepalive etc.)
  • The server is running as a Grpc.Core server in a console window with GRPC_TRACE set to TCP and GRPC_VERBOSITY set to DEBUG
  • The client is running in a console window with GRPC_TRACE set to TCP and GRPC_VERBOSITY set to DEBUG
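
For reference, a minimal sketch of what that Grpc.Core.Server setup could look like (not taken from the demo repository; the bind address, port and option values are assumptions, and GreeterService is the same implementation class used by the Kestrel host above):

// using Grpc.Core;
var server = new Server(new[]
{
    new ChannelOption("grpc.keepalive_time_ms", 10000),
    new ChannelOption("grpc.keepalive_permit_without_calls", 1)
})
{
    Services = { Greeter.BindService(new GreeterService()) },
    Ports = { new ServerPort("0.0.0.0", 5000, ServerCredentials.Insecure) }
};
server.Start();
Console.WriteLine("Grpc.Core server listening on port 5000. Press any key to stop.");
Console.ReadKey();
server.ShutdownAsync().Wait();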

I gathered the trace logs for the client and server for when it fails and when it succeeds:
StreamSuccessServer.txt
StreamClientFailed.txt
StreamClientSuccess.txt
StreamFailedServer.txt
StreamClientFailed2.txt

There are 2 different errors being logged at the client when it fails; I included them both. Can you maybe find a reason why it sometimes works under these circumstances?

@tonydnewell
Contributor

@Pascal1986 The errors in the logs are consistent with something in the middle terminating the connections. So no wiser I'm afraid.

@jtattermusch
Contributor

@Pascal1986 The errors in the logs are consistent with something in the middle terminating the connections. So no wiser I'm afraid.

Since we already put quite a bit of effort into investigating this and we don't seem to be getting anywhere, I'll close this issue as "not actionable". We can reopen if there are some more findings/inputs that would help us determine if this is an actual grpc bug or if something else in user's environment is at fault.
