Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix "server closed the stream without sending trailers" error #52

Merged
merged 5 commits into from
May 25, 2021

Conversation

guggero
Copy link
Member

@guggero guggero commented May 7, 2021

This PR adds a test to reproduce the elusive server closed the stream without sending trailers error that we first discovered through user feedback in lightninglabs/pool#225.

Apparently this is a bug in the net/http/httputil package itself, so we can not really fix it but add a workaround.
The tl;dr is that sometimes, the HTTP/2 trailer fields aren't sent back to the client correctly when an error occurs on the proxied connection. We add our own HTTP transport implementation that fixes the problem.

@guggero guggero requested review from Roasbeef and wpaulino May 7, 2021 14:43
Copy link
Contributor

@wpaulino wpaulino left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome find! My guess is that there's a race condition somewhere within the ReverseProxy/Transport code making this reproducible only sometimes. We should probably report this upstream.

@guggero
Copy link
Member Author

guggero commented May 10, 2021

This could be somewhat related to golang/go#34413. Maybe the backend closes the stream after sending an error. I'm not sure if I should invest more time into a repro scenario. Would be cool to fix something in golang though :)

@Roasbeef
Copy link
Member

Lint fails:

proxy/proxy.go:363:22: resp.Header.Values undefined (type http.Header has no field or method Values) (typecheck)
		if len(resp.Header.Values(hdrGrpcStatus)) > 0 {
		                   ^
		                 ```

@@ -329,11 +331,42 @@ func sendDirectResponse(w http.ResponseWriter, r *http.Request,
// so we can use that.
switch {
case strings.HasPrefix(r.Header.Get(hdrContentType), hdrTypeGrpc):
w.Header().Set("Grpc-Status", strconv.Itoa(int(codes.Internal)))
w.Header().Set("Grpc-Message", errInfo)
w.Header().Set(hdrGrpcStatus, strconv.Itoa(int(codes.Internal)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any link handy to allow me to better understand whcih headers we're setting here and why?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not a very good doc, but that's the best I could find: https://github.com/grpc/grpc/blob/master/doc/PROTOCOL-HTTP2.md

The Grpc-Status is the field that holds the numerical value for the gRPC specific states (e.g. OK, INTERNAL, UNKNOWN, UNIMPLEMENTED, UNAVAILABLE and so on). The Grpc-Message field contains the actual error message in case the gRPC status isn't OK (0). Those two fields should be sent as a HTTP/2 trailer field and not header field. That's what we tell the golang http2 library by setting the Trailer field with the name of those two fields.

The Content-Length and :status are probably not necessary as they would be set anyway by the proxy. Maybe I should remove them as they are from trying out stuff.


resp, err := l.next.RoundTrip(req)
if resp != nil && len(resp.Trailer) == 0 {
if len(resp.Header.Values(hdrGrpcStatus)) > 0 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could use a comment stating that the OK code is zero, so anything beyond that is actually considered an error: https://pkg.go.dev/google.golang.org/grpc/codes

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I'll rewrite this to make it a bit more clear. We don't look at the actual value. This just checks "if there is no Trailer (as in HTTP/2 trailer data field) set but we have a Grpc-Status Header field, then, per gRPC convention, we need to set the same value in the Trailer part of the message". This is the actual fix as the client complains that there are no trailers set when the proxy closes the connection.

@@ -195,6 +213,17 @@ func TestProxyGRPC(t *testing.T) {
runGRPCTest(t, tc)
})
}

for i := 0; i < 20; i++ {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the goal w/ this added iteration of the test case? Given that the iteration index (or no other new information) is actually propagated into the test context.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error is only hit every 3 or 4 tries. That's why suspect a race condition in the underlying code. Without the fix we'd never get 20 successful runs, so this just "brute forces" the test to make sure it doesn't flake anymore.

if !haveNPN {
tlsConf.NextProtos = append(tlsConf.NextProtos, "h2")
}
tlsConf.NextProtos = append(tlsConf.NextProtos, "h2-14")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this serve to resolve those issues we observed when we attempted to expose Aperture directly over the load balancer?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, maybe, yes. I think we should look into that again.

rpc SayHello (HelloRequest) returns (HelloReply) {}
rpc SayHelloNoAuth (HelloRequest) returns (HelloReply) {}
rpc SayHello (HelloRequest) returns (HelloReply);
rpc SayHelloNoAuth (HelloRequest) returns (HelloReply);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Woah, new proto syntax change?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really. You can use the {} notation if you add additional metadata (for example the REST options) or use a semicolon directly if there is no "body".

@guggero
Copy link
Member Author

guggero commented May 11, 2021

Fixed the linter by bumping the golang versions.

@guggero
Copy link
Member Author

guggero commented May 11, 2021

Argh, now there's a data race in the built in library...

EDIT: All fixed now.

@guggero guggero requested a review from Roasbeef May 21, 2021 08:30
Copy link
Member

@Roasbeef Roasbeef left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🌛

@guggero guggero merged commit 0194e0d into lightninglabs:master May 25, 2021
@guggero guggero deleted the stream-no-trailers-error branch May 25, 2021 20:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants