
x/net/http2: flow control desync when receiving data for canceled stream #56558

Open
mrdg opened this issue Nov 3, 2022 · 5 comments
Assignees: neild
Labels: NeedsInvestigation (Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.)
Milestone: Go1.20

Comments

mrdg commented Nov 3, 2022

What version of Go are you using (go version)?

$ go version
go version go1.19 darwin/amd64

Does this issue reproduce with the latest release?

yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env

What did you do?

We have a proxy server in Go that proxies over HTTP/2 to a third-party server. We're seeing that the third-party server sometimes runs out of flow control tokens and can't send any more data back to the client. There's no clear reproduction path, but it seems to happen when the proxy is under heavy load and sees an increase in canceled requests due to higher latency.

Looking for potential causes in the http2 package, I noticed that when the transport reads a DATA frame for a stream that was already canceled, it returns the flow control tokens to the server, but it also increments the connection-level flow control counter:

https://github.com/golang/net/blob/a1278a7f7ee0c218caeda793b867e0568bbe1e77/http2/transport.go#L2557-L2559

As far as I can tell, the tokens haven't been subtracted at this point, so we shouldn't be adding to the window? I ran the test case below, with some logging added to the lines linked above, e.g.:

cc.inflow.add(int32(f.Length))
cc.logf("http2: Transport received DATA frame for canceled stream (window: %d)", cc.inflow.available())
func TestFlowControlDesync(t *testing.T) {
	ct := newClientTester(t)
	sync := make(chan struct{})

	ct.client = func() error {
		ctx, cancel := context.WithCancel(context.Background())
		req, _ := http.NewRequest("GET", "https://dummy.tld/", nil)
		req = req.WithContext(ctx)
		if _, err := ct.tr.RoundTrip(req); err != nil {
			return err
		}
		// Wait until the server has sent the response headers, then
		// cancel the request so subsequent DATA frames arrive for a
		// canceled stream.
		<-sync
		cancel()
		return nil
	}
	ct.server = func() error {
		ct.greet()
		window := initialWindowSize

		f, err := ct.fr.ReadFrame()
		if err != nil {
			return err
		}
		wuf := f.(*WindowUpdateFrame)
		window += int(wuf.Increment)

		f, err = ct.fr.ReadFrame()
		if err != nil {
			return err
		}
		hf := f.(*HeadersFrame)

		var buf bytes.Buffer
		enc := hpack.NewEncoder(&buf)
		enc.WriteField(hpack.HeaderField{Name: ":status", Value: "200"})
		ct.fr.WriteHeaders(HeadersFrameParam{
			StreamID:      hf.StreamID,
			EndHeaders:    true,
			EndStream:     false,
			BlockFragment: buf.Bytes(),
		})
		close(sync)
		// Keep sending DATA on the (now canceled) stream and track the
		// server-side view of the connection window.
		for {
			data := make([]byte, maxFrameSize)
			window -= len(data)
			ct.fr.WriteData(hf.StreamID, false, data)
			f, err := ct.fr.ReadFrame()
			if err != nil {
				return err
			}
			switch fr := f.(type) {
			case *WindowUpdateFrame:
				window += int(fr.Increment)
				t.Logf("server window: %v", window)
			}
			time.Sleep(time.Second)
		}
	}
	ct.run()
}

What did you expect to see?

I'd expect flow control to be returned to the server, but the transport's counter shouldn't increase in this case.

What did you see instead?

The flow control counter keeps increasing. I think this will lead to a flow control desync between client and server.

@mdempsky mdempsky added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Nov 3, 2022

mdempsky commented Nov 3, 2022

/cc @neild

@mdempsky mdempsky added this to the Go1.20 milestone Nov 3, 2022
@neild neild self-assigned this Nov 3, 2022

neild commented Nov 3, 2022

cc.inflow.add(int32(f.Length))

cc.inflow.add is increasing the flow control window, and the subsequent lines send the increase to the server in a WINDOW_UPDATE frame. There might be a flow control bug here somewhere, but I don't think these lines are it unless I'm misreading something.
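
For reference, the canceled-stream path being discussed looks roughly like this (a paraphrase of the linked transport.go lines, not the verbatim source):

	cc.mu.Lock()
	cc.inflow.add(int32(f.Length)) // credit the connection-level record
	cc.mu.Unlock()

	cc.wmu.Lock()
	// Stream ID 0 makes this a connection-level WINDOW_UPDATE.
	cc.fr.WriteWindowUpdate(0, uint32(f.Length))
	cc.bw.Flush()
	cc.wmu.Unlock()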


mrdg commented Nov 4, 2022

The reason those lines seemed wrong to me is that in the general case, when data is received for an open stream, the window is reduced first (both stream and connection):

if cs.inflow.available() >= int32(f.Length) {
  cs.inflow.take(int32(f.Length))
}

It then refunds for padding right away, and later increases the window again when the caller reads the body:

if v := cc.inflow.available(); v < transportDefaultConnFlow/2 {
  connAdd = transportDefaultConnFlow - v
  cc.inflow.add(connAdd)
}

In the case of a canceled stream, it looks like the window is only ever increased.
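
To make the asymmetry concrete, here is a minimal self-contained sketch (toy types, not the real http2 ones): in the open-stream path every add is paired with an earlier take, so the record stays balanced, while the canceled-stream path adds without taking and inflates it.

package main

import "fmt"

// toyInflow stands in for the transport's connection-level flow
// record; illustrative only, not the actual x/net/http2 type.
type toyInflow struct{ avail int32 }

func (f *toyInflow) take(n int32) { f.avail -= n }
func (f *toyInflow) add(n int32)  { f.avail += n }

func main() {
	conn := toyInflow{avail: 65535}

	// Open stream: DATA arrives (take), then the body is read (add).
	conn.take(16384)
	conn.add(16384)
	fmt.Println(conn.avail) // 65535: balanced

	// Canceled stream, pre-fix: add with no matching take.
	conn.add(16384)
	fmt.Println(conn.avail) // 81919: record now overstates the window
}

Since the transport only tops up the connection window once cc.inflow.available() drops below transportDefaultConnFlow/2 (the second snippet above), an inflated record crosses that threshold later and later, so window updates for legitimately consumed data eventually stop while the server's real window drains.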


neild commented Nov 4, 2022

Ah, you're right. That does look wrong; we're returning flow control tokens to cs.inflow that were never taken. (I thought we consumed the connection-level flow earlier, but we don't; connection and stream flow are consumed at the same time.) Over time, that'll lead to the transport thinking the server has more tokens than it does, and it'll stop sending window updates.

Thanks for the analysis.


gopherbot commented Nov 4, 2022

Change https://go.dev/cl/448155 mentions this issue: http2: refactor inbound flow control tracking

mrdg pushed a commit to netlify/net that referenced this issue Nov 7, 2022
Add a new inflow type for tracking inbound flow control.
An inflow tracks both the window sent to the peer, and the
window we are willing to send. Updates are accumulated and
sent in a batch when the unsent window update is large
enough.

Large enough is currently defined as at least doubling
the peer's current window.

This change makes both the client and server use the same
algorithm to decide when to send window updates. This should
slightly reduce the rate of updates sent by the client, and
significantly reduce the rate sent by the server.

Fix a client flow control tracking bug: When processing data
for a canceled stream, the record of flow control consumed
by the peer was not updated to account for the discarded
data.

Fixes golang/go#28732
Fixes golang/go#56558

Change-Id: Id119d17b84b46f3dc2719f28a86758d9a10085d9
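
The batching rule described in that commit message can be sketched roughly as follows (simplified and illustrative; the actual inflow type in CL 448155 also enforces a minimum refresh size and the RFC 7540 window-size limit):

// inflowSketch illustrates the update-batching rule: returned window
// credit accumulates locally, and a WINDOW_UPDATE is only emitted once
// the unsent credit would at least double the peer's current window.
type inflowSketch struct {
	avail  int32 // receive window the peer currently believes it has
	unsent int32 // credit returned locally but not yet announced
}

// add records n bytes of returned window and reports the increment to
// place in a WINDOW_UPDATE frame, or 0 to keep accumulating. (A take
// method, called when DATA arrives, would decrement avail; it is
// omitted here for brevity.)
func (f *inflowSketch) add(n int32) int32 {
	f.unsent += n
	if f.unsent < f.avail {
		return 0 // sending now would not at least double the window
	}
	f.avail += f.unsent
	inc := f.unsent
	f.unsent = 0
	return inc
}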