
gRPC stream "Response ended prematurely (ResponseEnded)" #2361

Open
jeffijoe opened this issue Jan 16, 2024 · 17 comments
Labels
bug Something isn't working

Comments

@jeffijoe

What version of gRPC and what language are you using?

C#, Grpc.Net.Client version 2.60.0

What operating system (Linux, Windows,...) and version?

macOS 14.2.1, Linux (Ubuntu on Google Cloud Kubernetes Engine)

What runtime / compiler are you using (e.g. .NET Core SDK version dotnet --info)

.NET SDK:
 Version:           8.0.100

What did you do?

This may potentially be related to #2358

We have a long-lived gRPC streaming call that gets canceled after receiving a few messages with the following:

Grpc.Core.RpcException: Status(StatusCode="Unavailable", Detail="Error reading next message. HttpIOException: The response ended prematurely while waiting for the next frame from the server. (ResponseEnded)", DebugException="System.Net.Http.HttpIOException: The response ended prematurely while waiting for the next frame from the server. (ResponseEnded)")
 ---> System.Net.Http.HttpIOException: The response ended prematurely while waiting for the next frame from the server. (ResponseEnded)
   at System.Net.Http.Http2Connection.ThrowRequestAborted(Exception innerException)
   at System.Net.Http.Http2Connection.Http2Stream.CheckResponseBodyState()
   at System.Net.Http.Http2Connection.Http2Stream.TryReadFromBuffer(Span`1 buffer, Boolean partOfSyncRead)
   at System.Net.Http.Http2Connection.Http2Stream.ReadDataAsync(Memory`1 buffer, HttpResponseMessage responseMessage, CancellationToken cancellationToken)
   at Grpc.Net.Client.Internal.StreamExtensions.ReadMessageAsync[TResponse](Stream responseStream, GrpcCall call, Func`2 deserializer, String grpcEncoding, Boolean singleMessage, CancellationToken cancellationToken)
   at Grpc.Net.Client.Internal.GrpcCall`2.ReadMessageAsync(Stream responseStream, String grpcEncoding, Boolean singleMessage, CancellationToken cancellationToken)
   at Grpc.Net.Client.Internal.HttpContentClientStreamReader`2.MoveNextCore(CancellationToken cancellationToken)
   --- End of inner exception stack trace ---
   at Grpc.Net.Client.Internal.HttpContentClientStreamReader`2.MoveNextCore(CancellationToken cancellationToken)
   at Grpc.Core.AsyncStreamReaderExtensions.ReadAllAsyncCore[T](IAsyncStreamReader`1 streamReader, CancellationToken cancellationToken)+MoveNext()
   at Grpc.Core.AsyncStreamReaderExtensions.ReadAllAsyncCore[T](IAsyncStreamReader`1 streamReader, CancellationToken cancellationToken)+System.Threading.Tasks.Sources.IValueTaskSource<System.Boolean>.GetResult()

So far, we can reproduce this when using a L7 external load balancer in Google Cloud (Classic Application Load Balancer). When using an L4 external load balancer, it works as expected.

We have built an extensive reproduction, including multiple servers and clients in different languages to isolate the root cause as being the gRPC dotnet client library. The following clients have been tested and work without issue:

  • grpcurl
  • custom Golang client (see repo)
  • custom Rust client (see repo)

The linked repository also includes information about the infrastructure setup.

The following proto is used:

syntax = "proto3";

option csharp_namespace = "LbIssue";
option go_package = "github.com/taxfyle/lb-issue-repro/src/go/pb";

package lb_issue;

service Demo {
  rpc StreamMessages (StreamMessagesRequest) returns (stream StreamMessagesResponse);
}

message StreamMessagesRequest {
  string name = 1;
}

message StreamMessagesResponse {
  string message = 1;
}
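For reference, a minimal client for this service might look like the sketch below. The channel address is a placeholder, and the `Demo.DemoClient` type name assumes code generated from the proto above via Grpc.Tools; actual generated names depend on the build setup.

```csharp
// Sketch: consume the server stream until the server completes it
// or an RpcException is thrown. Requires the Grpc.Net.Client package
// and code generated from the proto above.
using Grpc.Core;
using Grpc.Net.Client;
using LbIssue;

using var channel = GrpcChannel.ForAddress("https://lb-example.example.com");
var client = new Demo.DemoClient(channel);

using var call = client.StreamMessages(new StreamMessagesRequest { Name = "repro" });

await foreach (var response in call.ResponseStream.ReadAllAsync())
{
    Console.WriteLine(response.Message);
}
```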

At first, we thought the root cause was the L7 load balancer; however, given that other gRPC client libraries work without issue, we figured it might be something specific to the dotnet one.

The issue occurs after receiving between 5 and 8 messages on the stream when sending a message roughly every second; 6 messages is the most common count we've observed.

If we send messages faster (using a while loop that calls a HTTP endpoint on the server to trigger messages), we get between 20-25 messages before the connection is broken.

We also observed that the connection breaks after receiving a message, never while waiting for one.

@jeffijoe jeffijoe added the bug Something isn't working label Jan 16, 2024
@jeffijoe jeffijoe changed the title gRPC stream "Response ended prematurely(ResponseEnded)" gRPC stream "Response ended prematurely (ResponseEnded)" Jan 16, 2024
@JamesNK
Member

JamesNK commented Jan 17, 2024

I looked at your .NET client and server sample apps and they both look fine. I also tested them on my local computer (no load balancer involved), and they worked without issue. That must mean there is an interaction between .NET gRPC client + the L7 load balancer that causes the failure.

Is setting up the solution in GCP required? I don't have experience with it, so that would be a bit of a barrier to quickly reproducing the failure. I think the issue is in the gRPC client (since the server works with all the other clients), so that's the part I'd want to test.

If you have a GCP endpoint I can point a client on my local machine at, that will speed things up. I can set up gathering of low-level diagnostics and then run the client until there is a failure. If you don't want to share the address publicly then contact me over email at username@microsoft.com (jamesnk)

If that isn't possible, could you gather some diagnostics? I can create a PR to your repository with changes to enable detailed logging in the client, and you can attach the results here.

@jeffijoe
Author

Thank you, James. I've sent you an email with the endpoint.

@JamesNK
Member

JamesNK commented Jan 17, 2024

I reproduced your error and looked at detailed internal logs. The problem is that the .NET client sends the server pings to calculate the RTT, and the server closes the connection because it believes the client is malicious.

Logs: ping-goaway.txt

Important parts:

ProcessIncomingFramesAsync,Frame 15: StreamId=0; Type=GoAway; Flags=None; PayloadLength=22.
ReadGoAwayFrame,lastStreamId=1, errorCode=EnhanceYourCalm

What is happening is the server is sending a GOAWAY frame with an EnhanceYourCalm error code to the client right after the client sends a ping. The server is telling the client that it thinks the client is malicious.

Adding this code to the start of the client app disables RTT pings and fixes the problem:

AppContext.SetSwitch("System.Net.SocketsHttpHandler.Http2FlowControl.DisableDynamicWindowSizing", true);

A bug fix needs to happen inside .NET to improve how pings are sent, but you can use that setting to work around the problem until a fix is publicly available.
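As a sketch, the workaround switch should run before any channel or HttpClient makes an HTTP/2 request, e.g. at the top of Program.cs (the channel address below is a placeholder):

```csharp
// Workaround: disable dynamic window sizing (and with it the RTT pings)
// before the first HTTP/2 request is made. This must execute before any
// GrpcChannel or HttpClient is created.
AppContext.SetSwitch(
    "System.Net.SocketsHttpHandler.Http2FlowControl.DisableDynamicWindowSizing",
    true);

using var channel = Grpc.Net.Client.GrpcChannel.ForAddress("https://example.com");
// ... create clients and make calls as usual ...
```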

@jeffijoe
Author

jeffijoe commented Jan 17, 2024

It sounds like this is related to #2358 then.

So the server sends a GOAWAY, but the client does not surface this information in the exception that is thrown; is that intentional, or is that one of the things that need to be fixed? And regarding the ping, is it because the .NET client sends excessive pings?

EDIT: Also found this, seems related: dotnet/runtime#62216
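Until the real reason is surfaced in the exception message, the transport-level cause can at least be inspected from the RpcException (a fragment sketch; `call`, `Handle`, and `cancellationToken` are placeholders for the caller's own stream, handler, and token):

```csharp
// Sketch: inspect the low-level cause of a failed streaming call.
try
{
    await foreach (var msg in call.ResponseStream.ReadAllAsync(cancellationToken))
    {
        Handle(msg);
    }
}
catch (Grpc.Core.RpcException ex) when (ex.StatusCode == Grpc.Core.StatusCode.Unavailable)
{
    // Status.DebugException carries the transport-level error,
    // e.g. the HttpIOException seen in the stack trace above.
    Console.Error.WriteLine(ex.Status.DebugException);
}
```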

@JamesNK
Member

JamesNK commented Jan 17, 2024

but the client does not surface this information in the exception that is thrown; is that intentional, or is that one of the things that need to be fixed?

Not surfacing the real reason an HTTP request failed is a bug and should be fixed. I passed the feedback to the team. The ENHANCE_YOUR_CALM error code should appear in the exception message and shouldn't require low-level debugging.

And regarding the ping, is it because the .NET client sends excessive pings?

Yes. It fails after the 5th ping. Some servers are very conservative and see five pings as excessive. The algorithm used to send pings needs to be updated.

@jeffijoe
Author

The AppContext switch works perfectly, thank you!

Will this issue remain open until the underlying issues have been fixed? Or how will we be able to track when we can remove the switch?

@JamesNK
Member

JamesNK commented Jan 17, 2024

It sounds like this is related to #2358 then.

Yes it is. Exactly the same.

Will this issue remain open until the underlying issues have been fixed? Or how will we be able to track when we can remove the switch?

The fix will happen in the dotnet/runtime repository. I'll post a comment here linking to the new issue there when I close this one.

@antonfirsov

Regardless of whether this gets fixed in .NET or in the GCP load balancer, we need to involve the owners of the load balancer in order to make the right decision.

I revived an old email thread about a similar issue, but the primary contact in that discussion seems to have left Google, so I don't have high hopes of reaching the people I need over there.

@jeffijoe could you please also raise a ticket against Google? IMO that would be the most efficient way to have this routed to the right people so we can chat.

@jeffijoe
Author

@antonfirsov we don't have a direct way to contact Google, but we've opened a ticket with DoIT, who is our support provider, and linked them to this issue; they may be able to escalate it to Google.

@jeffijoe
Author

jeffijoe commented Feb 2, 2024

@antonfirsov our support rep forwarded the ticket to Google, this is what we got back:

I have reached out regarding this request and it is currently not possible to involve the product team as the issue lies on the .NET client side. The .NET team provided a workaround that the customer can use in their code in order to disable the pings while they (the .NET team) work on fixing the issue with the .NET client. Please note that changing the LB to accept this behavior will potentially introduce a vulnerability that can be used by any gRPC client software to generate attacks.
In conclusion, kindly proceed with the workaround proposed by the .NET team in order for traffic to not be considered as malicious by the LB.

@antonfirsov

@jeffijoe thanks for doing the round with google!

The fix is underway on the .NET side: dotnet/runtime#97881. With that change, the problem no longer occurs against your repro; however, it would be nice to get confirmation that it fixes your real problem as well. Customer validation would raise the chance of the fix being accepted for a 6.0-8.0 backport.

Would it be possible for you to try the attached custom Linux build of System.Net.Http.dll?

You would need to replace it in <DOTNETDIR>/shared/Microsoft.NETCore.App/8.0.1/.

@jeffijoe
Author

jeffijoe commented Feb 5, 2024

@antonfirsov if you tested it and it works against our reproduction server, and considering that disabling the feature worked for both our repro and the real problem, I'm confident the issue has been resolved by your fix. 👍

@jochenkirstaetter

@antonfirsov I tested your custom Linux build of System.Net.Http.dll successfully in my scenario, against a Google Cloud API.

I copied the DLL into the 8.0.3 runtime, lacking a runtime 8.0.1. ;-)
I ran the test three times to be sure it wouldn't throw the exception again; it failed every time with the original DLL.

I implemented an unofficial Gemini SDK for .NET and came across the reported HttpIOException when using long-running stream requests. The method involved, GenerateContentStream, is available in the Mscc.GenerativeAI repository, and I'm using it to create a transcription of an audio file. The xUnit test case is located here: TranscribeStream_Audio_From_FileAPI

As this is a .NET library, I'm not sure whether the suggested workaround with AppContext.SetSwitch() is applicable; I'd assume it needs to be done in any application using my NuGet package instead.
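For what it's worth, a consuming application can also set the same AppContext switch without code changes, through the runtime host configuration in its project file (a sketch; the switch name is the one from the workaround above):

```xml
<!-- In the consuming application's .csproj -->
<ItemGroup>
  <RuntimeHostConfigurationOption
      Include="System.Net.SocketsHttpHandler.Http2FlowControl.DisableDynamicWindowSizing"
      Value="true" />
</ItemGroup>
```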

With your custom DLL, the streaming response from the server simply stops instead of throwing the HttpIOException that the original DLL throws.

I hope this helps, and that this change will be merged into .NET 8 and hopefully backported to .NET 6.

Cheers, JoKi

System environment and .NET version

$ dotnet --info
.NET SDK:
Version: 8.0.203
Commit: 5e1ceea679
Workload version: 8.0.200-manifests.4e94be9c

Runtime Environment:
OS Name: ubuntu
OS Version: 22.04
OS Platform: Linux
RID: linux-x64
Base Path: /home/jochen/.dotnet8/sdk/8.0.203/

.NET workloads installed:
There are no installed workloads to display.

Host:
Version: 8.0.3
Architecture: x64
Commit: 9f4b1f5d66

.NET SDKs installed:
8.0.100-preview.5.23303.2 [/home/jochen/.dotnet8/sdk]
8.0.100-rc.1.23463.5 [/home/jochen/.dotnet8/sdk]
8.0.100-rc.2.23502.2 [/home/jochen/.dotnet8/sdk]
8.0.201 [/home/jochen/.dotnet8/sdk]
8.0.202 [/home/jochen/.dotnet8/sdk]
8.0.203 [/home/jochen/.dotnet8/sdk]

.NET runtimes installed:
Microsoft.AspNetCore.App 8.0.0-preview.5.23302.2 [/home/jochen/.dotnet8/shared/Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 8.0.0-rc.1.23421.29 [/home/jochen/.dotnet8/shared/Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 8.0.0-rc.2.23480.2 [/home/jochen/.dotnet8/shared/Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 8.0.2 [/home/jochen/.dotnet8/shared/Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 8.0.3 [/home/jochen/.dotnet8/shared/Microsoft.AspNetCore.App]
Microsoft.NETCore.App 8.0.0-preview.5.23280.8 [/home/jochen/.dotnet8/shared/Microsoft.NETCore.App]
Microsoft.NETCore.App 8.0.0-rc.1.23419.4 [/home/jochen/.dotnet8/shared/Microsoft.NETCore.App]
Microsoft.NETCore.App 8.0.0-rc.2.23479.6 [/home/jochen/.dotnet8/shared/Microsoft.NETCore.App]
Microsoft.NETCore.App 8.0.2 [/home/jochen/.dotnet8/shared/Microsoft.NETCore.App]
Microsoft.NETCore.App 8.0.3 [/home/jochen/.dotnet8/shared/Microsoft.NETCore.App]

@jochenkirstaetter

@antonfirsov
After upgrading to the latest SDK and runtime 8.0.4, the problem seems to be resolved.

Thanks!

@antonfirsov

@jochenkirstaetter that is very strange, since we changed our mind regarding the backport of dotnet/runtime#97881 because we consider the change risky. It might be backported to 8.0 after a short battle test by 9.0 adopters (early 2025).

@jochenkirstaetter

@antonfirsov that's indeed really strange.

Okay, let's see what happened here.
I observed the exception with SDK and runtime 8.0.3, reproducibly.

Given the exception, I searched the Internet and found the issues here describing the problem.

Then I downloaded the custom System.Net.Http 8.0.1 DLL and placed it (as suggested) in the shared folder of 8.0.3.

Every run completed without an exception, although the output was "incomplete", as if stopped in the middle of the response content (text).

Then I replaced the custom DLL with the original one, and I got the exception again, as expected.

Next, I upgraded to v8.0.4 and haven't observed the exception since. The incomplete response is still there.

I guess I'll give my Linux system a complete reboot and check again.

FYI, I'm using JetBrains Rider for development and testing.

Cheers, JoKi

@careless6666

I have the same error:

StatusCode="Unavailable", Detail="Error reading next message. HttpIOException: The response ended prematurely while waiting for the next frame from the server. (ResponseEnded)"

In my case I have separate client and server applications. When I want to restart the server with a graceful shutdown, I can't send

await responseStream.CompleteAsync();

and there is no way to cleanly finish the communication between server and client.

Why don't we have a CompleteAsync method on IServerStreamWriter?
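For context: in gRPC, a server streaming response is completed by returning from the service handler, which is why IServerStreamWriter has no CompleteAsync. A handler can end the stream cleanly on shutdown by observing a stopping token, as in the sketch below (`_lifetime` is an injected IHostApplicationLifetime and `_source` a hypothetical ChannelReader<string>; both are assumptions for illustration):

```csharp
// Sketch: end a server stream cleanly when the host begins shutting down.
public override async Task StreamMessages(
    StreamMessagesRequest request,
    IServerStreamWriter<StreamMessagesResponse> responseStream,
    ServerCallContext context)
{
    // Combine the client's cancellation with the host's shutdown signal.
    using var linked = CancellationTokenSource.CreateLinkedTokenSource(
        context.CancellationToken, _lifetime.ApplicationStopping);

    try
    {
        await foreach (var message in _source.ReadAllAsync(linked.Token))
        {
            await responseStream.WriteAsync(
                new StreamMessagesResponse { Message = message });
        }
    }
    catch (OperationCanceledException)
    {
        // Fall through: returning from the handler completes the stream
        // with StatusCode.OK, which the client observes as a normal end.
    }
}
```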
