
x/net/http2: server sends RST_STREAM w/ PROTOCOL_ERROR to clients it incorrectly believes have violated max advertised num streams #42777

Open
jared2501 opened this issue Nov 23, 2020 · 4 comments

@jared2501 commented Nov 23, 2020

What version of Go are you using (go version)?

$ go version
go version go1.15.0 darwin/amd64

Does this issue reproduce with the latest release?

Yes

What's the issue?

I'm seeing an issue with a grpc-go client talking to an http2.Server{}. I've documented the issue in grpc/grpc-go#4057. To summarize:

  • I have a grpc-go client that is making requests to an http2.Server{} on the same machine. The client is generating a lot of load on the server. The server is configured to allow a maximum of 320 concurrent streams on a single connection (a minimal sketch of this configuration follows the list).
  • I'm seeing sporadic, and relatively regular, RST_STREAM w/ PROTOCOL_ERROR frames from the server.
  • The client usually has upwards of 500 simultaneous requests open at a time, which means there are 320 concurrent streams and ~200 streams waiting for these streams to complete. The machine that hosts the server is under heavy load.
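
For reference, here is a minimal sketch of the server side of the setup described above, assuming the plain net/http + golang.org/x/net/http2 path; cert.pem/key.pem and the handler are placeholders, and the real service and the grpc-go client generating the load are not shown:

package main

import (
	"io"
	"log"
	"net/http"

	"golang.org/x/net/http2"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		io.WriteString(w, "ok")
	})

	srv := &http.Server{Addr: ":8443", Handler: mux}

	// Advertise at most 320 concurrent streams per connection, matching
	// the configuration described above.
	if err := http2.ConfigureServer(srv, &http2.Server{MaxConcurrentStreams: 320}); err != nil {
		log.Fatal(err)
	}

	log.Fatal(srv.ListenAndServeTLS("cert.pem", "key.pem"))
}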

I put some println statements in the client and server code and traced the error the server returns to this line. That code checks sc.curClientStreams and rejects the new stream if it believes the client has violated the advertised maximum number of concurrent streams. I added further logging to both the server and the client to record the number of active streams and found that this number never creeps above 320 on either side - the grpc-go client believes it is being well behaved, which suggests something is fishy in the http2.Server code.
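
For context, the check being hit is the MAX_CONCURRENT_STREAMS admission check in (*serverConn).processHeaders. Roughly (paraphrased from x/net/http2, not a verbatim quote; names and details may differ between versions):

// Inside processHeaders, before opening the new stream (paraphrased):
if sc.curClientStreams+1 > sc.advMaxStreams {
	if sc.unackedSettings == 0 {
		// Our SETTINGS frame was acked, so the client "should know
		// better": reset the stream with PROTOCOL_ERROR.
		return streamError(f.StreamID, ErrCodeProtocol)
	}
	// Otherwise assume a SETTINGS race and refuse the stream instead.
	return streamError(f.StreamID, ErrCodeRefusedStream)
}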

The only place in the http2.Server code that decrements sc.curClientStreams is here in sc.closeStream, which is called from sc.wroteFrame, and sc.wroteFrame is only ever called in the sc.serve loop. Writes to the client can be triggered asynchronously by sc.writeFrameAsync (see here); once the write is complete, the result is sent to sc.wroteFrameCh, which is read by the sc.serve loop (see here). The sc.serve loop contains a big select statement that reads from sc.wroteFrameCh, but also from sc.readFrameCh, which receives new frames from clients. Since both channels are read in the same select statement, if both contain a message there is no guarantee which one will be read first.
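
That non-determinism comes straight from the Go spec: when several cases of a select are ready, one is chosen pseudo-randomly. A standalone illustration, independent of the http2 code, with channel names chosen to mirror it:

package main

import "fmt"

func main() {
	wroteFrameCh := make(chan string, 1)
	readFrameCh := make(chan string, 1)

	for i := 0; i < 5; i++ {
		// Make both channels ready before selecting, as happens when a
		// frame-write result and a new client frame arrive back to back.
		wroteFrameCh <- "write result"
		readFrameCh <- "new frame"

		select { // either case may win; the order is not guaranteed
		case msg := <-wroteFrameCh:
			fmt.Println("processed", msg, "first")
			<-readFrameCh // drain the other channel for the next round
		case msg := <-readFrameCh:
			fmt.Println("processed", msg, "first")
			<-wroteFrameCh
		}
	}
}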

I believe what is happening to cause the spurious PROTOCOL_ERRORs is:

  • The server is maxed out with 320 concurrent streams.
  • One of the streams on the server finishes. This causes sc.writeFrameAsync to be spawned in a separate goroutine to write a frame f1 that ends the stream.
  • sc.writeFrameAsync writes the frame f1 to the TCP connection and adds a message to the sc.wroteFrameCh.
  • The client reads frame f1 immediately, sees that the stream has completed, and therefore starts a new stream to the server by sending a frame f2.
  • The sc.readFrames loop reads frame f2 and pushes it onto the sc.readFrameCh channel.
  • Now the sc.serve loop goes through another iteration, and since both sc.readFrameCh and sc.wroteFrameCh contain messages, it is undefined which one it processes first. In this case, it processes the message on sc.readFrameCh first, which contains frame f2.
  • Since the message on sc.wroteFrameCh has not been processed yet, sc.curClientStreams has not been decremented, so the server still believes sc.curClientStreams is 320. When sc.processFrameFromReader handles f2, it therefore returns a PROTOCOL_ERROR (a rough paraphrase of the decrement path follows this list).
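
For reference, the decrement path from the last step looks roughly like this (paraphrased from x/net/http2, not verbatim); it only ever runs on the sc.serve goroutine, which is why an unprocessed write result leaves the counter stale:

// In the sc.serve loop, a completed (possibly asynchronous) write is
// handled here:
case res := <-sc.wroteFrameCh:
	sc.wroteFrame(res)

// wroteFrame closes the stream whose final frame was just written, and
// closeStream is the only place sc.curClientStreams is decremented:
func (sc *serverConn) closeStream(st *stream, err error) {
	// ...
	sc.curClientStreams--
	// ...
}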

To attempt to fix this, I patched the http2.Server code:

diff --git a/vendor/golang.org/x/net/http2/server.go b/vendor/golang.org/x/net/http2/server.go
index 686501b4..a79fed98 100644
--- a/vendor/golang.org/x/net/http2/server.go
+++ b/vendor/golang.org/x/net/http2/server.go
@@ -868,6 +868,13 @@ func (sc *serverConn) serve() {
                case res := <-sc.wroteFrameCh:
                        sc.wroteFrame(res)
                case res := <-sc.readFrameCh:
+                       // Process any wrote frames before reading new frames from the client since a
+                       // wrote frame could have triggered a new stream to be started.
+                       select {
+                       case wroteRes := <-sc.wroteFrameCh:
+                               sc.wroteFrame(wroteRes)
+                       default:
+                       }
                        if !sc.processFrameFromReader(res) {
                                return
                        }

When I ran this on the server it significantly reduced the occurrences of the PROTOCOL_ERRORs I saw: before the patch I was seeing about 20 every 3 minutes; afterwards I observed about 10 over the course of 2 hours.

I believe this patch solves some of the problem, but not all of it, as the error still persists. I may also be totally wrong in my diagnosis above. Does the reasoning make sense as a possible failure mode?

I'm going to try to get a minimal repro working, but wanted to create this issue first to get folks' thoughts.

@networkimprov commented Nov 24, 2020

@fraenkel (Contributor) commented Nov 25, 2020

I agree with your explanation; however, it is not guaranteed that the above will fix most of the racing that occurs. The problem is that between writeFrame and sending the frameWriteResult, one can receive the readFrame.
We can mitigate the select by only checking when there is an async write occurring; otherwise checking wroteFrameCh is just a waste (see the sketch below).
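
A sketch of that narrower mitigation, assuming the serve loop consults the existing serverConn.writingFrameAsync flag (set while a frame is being written on another goroutine; the exact field name may differ between x/net/http2 versions):

case res := <-sc.readFrameCh:
	// A stale frameWriteResult can only be pending when a frame write
	// was handed off to another goroutine, so only drain wroteFrameCh then.
	if sc.writingFrameAsync {
		select {
		case wroteRes := <-sc.wroteFrameCh:
			sc.wroteFrame(wroteRes)
		default:
		}
	}
	if !sc.processFrameFromReader(res) {
		return
	}

As noted above, this only narrows the window rather than closing it: the read can still arrive after the frame is on the wire but before the frameWriteResult has been received.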

BTW, this is the reverse of what the client does between the bodyWriter and the readLoop.

A testcase would go a long way.

@jared2501 (Author) commented Nov 26, 2020

@fraenkel yeah, that makes sense - I was thinking along those lines too. The weird thing is that when I made a hacky fix that used a mutex to lock, I still saw some PROTOCOL_ERRORs. But there's also a chance that my hack was wrong, since I'm not as familiar with this part of the code.

Do you still need a repro from me, or do you think you have a sufficient understanding of the failure mode here? Happy to try any fixes on our prod system to see if they help (and can do a bit of debugging to see what else could be up).

@fraenkel (Contributor) commented Dec 2, 2020

I understand the failure. The trick is coming up with a test case, which may not be so simple.

@cagedmantis added this to the Backlog milestone Dec 2, 2020