
x/net/http2: client can hang forever if headers' size exceeds connection's buffer size and server hangs past request time #23559

Open
gwik opened this Issue Jan 25, 2018 · 4 comments

@gwik (Contributor) commented Jan 25, 2018

What version of Go are you using (go version)?

go version go1.9.3 darwin/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/<redacted>/Dev/go"
GORACE=""
GOROOT="/usr/local/go"
GOTOOLDIR="/usr/local/go/pkg/tool/darwin_amd64"
GCCGO="gccgo"
CC="clang"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/cd/_7rcv5812531s2lswhn6kp680000gp/T/go-build436968147=/tmp/go-build -gno-record-gcc-switches -fno-common"
CXX="clang++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"

Also happened on linux/amd64, see below.

What did you do?

A bug in production showed an http2 client hanging for more than 250 seconds, ignoring the request context, which was set to time out after 5 seconds.
All possible timeouts are set on the connection and the TLS handshake, and I didn't see many dials (they are monitored).
Latency graph in ms: [screenshot]

Clients are using net/http2.Transport directly.

What did you expect to see?

The requests should have timed out after 5s.

What did you see instead?

No timeout, or a very long one (I believe the server reset the connection or TCP timed out).

The synchronous write cc.writeHeaders to the connection in ClientConn.roundTrip does not set any deadline on the connection, so it can block forever (or until TCP times out) if the server or network hangs:
https://github.com/golang/net/blob/0ed95abb35c445290478a5348a7b38bb154135fd/http2/transport.go#L833

I wrote a test that demonstrates this:
gwik/net@e4c191a

// TestTransportTimeoutServerHangs demonstrates that the client can hang
// forever, ignoring the context's deadline, when the server hangs and the
// headers exceed the connection's buffer size (forcing a flush).
func TestTransportTimeoutServerHangs(t *testing.T) {
	clientDone := make(chan struct{})
	ct := newClientTester(t)
	ct.client = func() error {
		defer ct.cc.(*net.TCPConn).CloseWrite()
		defer close(clientDone)

		buf := make([]byte, 1<<19)
	_, err := rand.Read(buf)
	if err != nil {
		// Return the error instead of t.Fatal, which must not be
		// called outside the test goroutine.
		return fmt.Errorf("failed to generate random data: %v", err)
	}
		headerVal := hex.EncodeToString(buf)

		req, err := http.NewRequest("PUT", "https://dummy.tld/", nil)
		if err != nil {
			return err
		}

		ctx, cancel := context.WithTimeout(context.Background(), 1*time.Second)
		defer cancel()
		req = req.WithContext(ctx)
		req.Header.Add("Authorization", headerVal)
		_, err = ct.tr.RoundTrip(req)
		if err == nil {
			return errors.New("error should not be nil")
		}
		if ne, ok := err.(net.Error); !ok || !ne.Timeout() {
			return fmt.Errorf("error should be a net error timeout was: +%v", err)
		}
		return nil
	}
	ct.server = func() error {
		ct.greet()
		select {
		case <-time.After(5 * time.Second):
		case <-clientDone:
		}
		return nil
	}
	ct.run()
}

Setting the write deadline fixes the test:
gwik/net@052de95
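
For illustration, here is a minimal sketch of the idea behind that fix (not the literal patch; the helper name and shape are invented): derive a write deadline from the request context before the potentially flushing header write.

import (
	"context"
	"net"
	"time"
)

// writeHeadersWithDeadline is a sketch, not the actual patch: before the
// potentially flushing header write in roundTrip, propagate the request
// context's deadline to the connection so a hung peer cannot block the
// write forever.
func writeHeadersWithDeadline(ctx context.Context, conn net.Conn, writeHeaders func() error) error {
	if d, ok := ctx.Deadline(); ok {
		if err := conn.SetWriteDeadline(d); err != nil {
			return err
		}
		// Clear the deadline afterwards: the connection is shared by
		// other streams that may not have a deadline at all.
		defer conn.SetWriteDeadline(time.Time{})
	}
	return writeHeaders()
}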

It seems to fail when the header value exceeds 1 MB. I might be missing something here: the default buffer size of a bufio.Writer is 4096 bytes, so I expected it to fail around that value; maybe compression and/or TCP buffers explain the difference.
Also, I don't think the client sent 1 MB of headers when it failed in production; something else must have filled the buffer.

The buffer belongs to the connection and is shared among streams, so it can be filled by other requests on the same connection.

Besides this particular call, which is synchronous, no write or read on the connection has a deadline set. Can't this lead to goroutine leaks and HTTP/2 streams being stuck in the background?

@odeke-em odeke-em changed the title x/net/http2: client can hang forever x/net/http2: client can hang forever if headers' size exceeds connection's buffer size and server hangs past request time Jan 26, 2018

@odeke-em (Member) commented Jan 26, 2018 [comment minimized]

@gwik (Contributor, Author) commented Jan 26, 2018

Working on a patch to fix this, I realized that using the context deadline to time out the I/O calls may not be appropriate.
Timing out the I/O might leave the framing protocol in an inconsistent state, and the only safe option would be to close the whole connection. That is probably not what the caller intends when the context deadline expires. It would work if the deadline were set on the context only for this purpose, just before making the request, but that's usually not the case.
I think there should be an I/O timeout option on the transport that would serve the purpose of timing out the I/O calls: every I/O call would set the deadline to now + timeout and reset it after the call to Read or Write.
I'm going to implement this for now. Let me know what you think about it.
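
A minimal sketch of that option, assuming a hypothetical deadlineConn wrapper (nothing named like this exists in x/net/http2): every Read and Write arms a deadline of now + timeout and disarms it when the call returns, so a single blocked syscall can never outlive the timeout.

import (
	"net"
	"time"
)

// deadlineConn is a hypothetical wrapper illustrating the proposed
// Transport I/O timeout option.
type deadlineConn struct {
	net.Conn
	ioTimeout time.Duration
}

func (c *deadlineConn) Read(p []byte) (int, error) {
	if err := c.Conn.SetReadDeadline(time.Now().Add(c.ioTimeout)); err != nil {
		return 0, err
	}
	n, err := c.Conn.Read(p)
	c.Conn.SetReadDeadline(time.Time{}) // disarm after the call
	return n, err
}

func (c *deadlineConn) Write(p []byte) (int, error) {
	if err := c.Conn.SetWriteDeadline(time.Now().Add(c.ioTimeout)); err != nil {
		return 0, err
	}
	n, err := c.Conn.Write(p)
	c.Conn.SetWriteDeadline(time.Time{}) // disarm after the call
	return n, err
}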

gwik added a commit to gwik/net that referenced this issue Feb 3, 2018

net/http2: add I/O timeouts
Addresses the transport hanging on blocking I/O. There are many scenarios where
the roundtrip hangs on a write or read and won't be unblocked by the current
cancelation mechanisms (context, Request.Cancel, ...).

This adds support for read and write deadlines.
The writer disables the read deadline and enables the write deadline, then,
after the write succeeds, disables the write deadline and re-enables the read
deadline.
The read loop also sets its read deadline after a successful read, since the
next frame is not predictable.
This guarantees that an I/O call will not time out before IOTimeout, and that
a completely blocked connection will time out after at least IOTimeout.

See issue: golang/go#23559
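
As a hedged sketch (invented helper, not the CL's actual code), the write-path sequence that commit message describes looks like this:

import (
	"net"
	"time"
)

// writeWithTimeout mirrors the sequence above. ioTimeout stands in for the
// CL's IOTimeout option; writeFrame is whatever actually writes to conn.
func writeWithTimeout(conn net.Conn, ioTimeout time.Duration, writeFrame func() error) error {
	conn.SetReadDeadline(time.Time{})                // disable the read deadline during the write
	conn.SetWriteDeadline(time.Now().Add(ioTimeout)) // enable the write deadline
	if err := writeFrame(); err != nil {
		return err
	}
	conn.SetWriteDeadline(time.Time{})              // disable the write deadline
	conn.SetReadDeadline(time.Now().Add(ioTimeout)) // re-enable the read deadline
	return nil
}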

gwik added a commit to gwik/net that referenced this issue Feb 3, 2018

http2: add I/O timeouts (same commit message as above)

@bradfitz bradfitz added this to the Go1.11 milestone Feb 3, 2018

gwik added a commit to gwik/net that referenced this issue Mar 9, 2018

http2: add I/O timeouts (same commit message as above)

Change-Id: If618a63857cc32d8c3175c0d9bef1f8bf83c89df

gwik added a commit to znly/golang-x-net that referenced this issue Mar 26, 2018

http2: add I/O timeouts (same commit message and Change-Id as above)

ikenchina referenced this issue in gwik/net May 14, 2018

net/http2: add test to demonstrate transport can hang forever
The transport can hang forever, ignoring the context's deadline, if the server
hangs after accepting the connection and the request headers (or control
frames) exceed the connection's write buffer.

The ClientConn writes to the connection without setting a write deadline and
thus blocks forever once the buffer is flushed to the socket.

go test -v -run TestTransportTimeoutServerHangs ./http2
@bradfitz (Member) commented Jul 12, 2018

Any update here?

Keep in mind that there are 3 goroutines per Transport connection to a server:

  • the writer (which might block forever)
  • the reader (blocked by default, waiting for frames from server)
  • the control one, which never blocks, and owns all state

I think the behavior we want is:

  • if the user said 5 seconds timeout on a request, the control channel should notice and return an error to the user's RoundTrip. I'd be surprised if this part isn't already working?
  • don't accept new streams (new RoundTrips) on that ClientConn if the conn has been stuck in a write for "too long", for some definition TBD
  • if the write appears totally hung, nuke the whole connection and all streams still attached to it
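
A rough sketch of the last two points, under assumed names (no such fields exist in the real ClientConn): the writer records when a write started, the pool stops handing the connection new streams once a write has been stuck past one threshold, and a watchdog closes the connection past a larger one.

import (
	"net"
	"sync"
	"time"
)

type monitoredConn struct {
	mu         sync.Mutex
	conn       net.Conn
	writeStart time.Time // zero when no write is in flight
}

func (c *monitoredConn) Write(p []byte) (int, error) {
	c.mu.Lock()
	c.writeStart = time.Now()
	c.mu.Unlock()
	n, err := c.conn.Write(p)
	c.mu.Lock()
	c.writeStart = time.Time{}
	c.mu.Unlock()
	return n, err
}

// canTakeNewRequest would be consulted by the connection pool before
// starting a new RoundTrip on this connection.
func (c *monitoredConn) canTakeNewRequest(stuckAfter time.Duration) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.writeStart.IsZero() || time.Since(c.writeStart) < stuckAfter
}

// watch closes the connection if a single write blocks past hungAfter,
// which unblocks both the writer and the read loop and fails all streams.
func (c *monitoredConn) watch(hungAfter time.Duration) {
	for range time.Tick(hungAfter / 4) {
		c.mu.Lock()
		hung := !c.writeStart.IsZero() && time.Since(c.writeStart) > hungAfter
		c.mu.Unlock()
		if hung {
			c.conn.Close()
			return
		}
	}
}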

@bradfitz bradfitz modified the milestones: Go1.11, Unplanned Jul 12, 2018

@gwik (Contributor, Author) commented Jul 16, 2018

> if the user said 5 seconds timeout on a request, the control channel should notice and return an error to the user's RoundTrip. I'd be surprised if this part isn't already working?

There are cases where this isn't working. I believe the graph and tests demonstrate it.

> don't accept new streams (new RoundTrips) on that ClientConn if the conn has been stuck in a write for "too long", for some definition TBD

Unfortunately, the writes almost never time out. I don't know whether this is specific to my particular use case or whether buffering makes it take very long to happen.

I run this CL in production for a very specific service that only makes requests to APNS and Firebase push gateways.

The patch helps recover from stuck connections and kills them, but it's slow to converge, incomplete, and may not work properly in the general case.

What happens is that there are situations where the APNS gateway blocks the connection; it behaves like a firewall that drops packets and never resets the connection. Most of the time the requests time out correctly: the context unblocks the client stream goroutine and returns to the caller. However, the connection read loop never times out nor receives any other error, causing all requests to time out. (Also note that the APNS gateway doesn't allow pings.)

What I see most often is that the writes are not timing out. I believe the only way to guarantee timing out the connection is to have a way to unblock the read loop. The reads (ClientConn) are asynchronous to the writes (ClientStream), so we have to find a way to set a deadline on the reads that follow the writes (requests), or that follow a read (e.g. reading the rest of the body), from the client stream side.

The CL takes a simple approach and avoids synchronization between the client stream and the connection read loop. It resets the read deadline and sets a write deadline before every write from the client stream. After the write is done, it sets the read deadline.
This kinda works: I see that connections are no longer stuck forever, but it is slow to time out the connection, because resetting the read deadline before every write continuously pushes back the deadline.
Also, there is no read timeout when waiting for subsequent body reads, for example.

I wanted to experiment with the following algorithm (see the sketch below):

Upon waiting for something from the network on the ClientStream / caller side, push a deadline onto a heap data structure (probably under the ClientConn.wmu lock), keeping track of the stream that pushed it.
If no deadline was set, unblock the read loop by setting one in the past, and set the new deadline before the next read.
Before every read, look at the root of the deadline heap and set that deadline.
After every read, remove the deadline of the stream that successfully completed the read. We need some sort of reverse index to find the entry in the heap and remove it (maybe every client stream can track its unique read-deadline entry in the heap).

What do you think?
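
For what it's worth, a minimal sketch of that deadline heap, with invented names (nothing here exists in x/net/http2); for simplicity it uses lazy deletion instead of the reverse index mentioned above, and assumes a stream has at most one pending deadline:

import (
	"container/heap"
	"net"
	"sync"
	"time"
)

type entry struct {
	streamID uint32
	deadline time.Time
}

// entryHeap is a min-heap ordered by deadline.
type entryHeap []entry

func (h entryHeap) Len() int            { return len(h) }
func (h entryHeap) Less(i, j int) bool  { return h[i].deadline.Before(h[j].deadline) }
func (h entryHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *entryHeap) Push(x interface{}) { *h = append(*h, x.(entry)) }
func (h *entryHeap) Pop() interface{} {
	old := *h
	x := old[len(old)-1]
	*h = old[:len(old)-1]
	return x
}

type readDeadlines struct {
	mu      sync.Mutex
	h       entryHeap
	removed map[uint32]bool // stream IDs whose entry was lazily deleted
}

func newReadDeadlines() *readDeadlines {
	return &readDeadlines{removed: make(map[uint32]bool)}
}

// push registers a deadline for a stream about to wait on the network.
func (r *readDeadlines) push(streamID uint32, d time.Time) {
	r.mu.Lock()
	defer r.mu.Unlock()
	heap.Push(&r.h, entry{streamID, d})
}

// done marks a stream's read as completed; its deadline no longer applies.
func (r *readDeadlines) done(streamID uint32) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.removed[streamID] = true
}

// armBeforeRead sets the earliest live deadline on the connection before the
// read loop blocks on the next frame; no registered deadline means no timeout.
func (r *readDeadlines) armBeforeRead(conn net.Conn) {
	r.mu.Lock()
	defer r.mu.Unlock()
	for r.h.Len() > 0 && r.removed[r.h[0].streamID] {
		delete(r.removed, r.h[0].streamID)
		heap.Pop(&r.h)
	}
	if r.h.Len() > 0 {
		conn.SetReadDeadline(r.h[0].deadline)
	} else {
		conn.SetReadDeadline(time.Time{})
	}
}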

gwik added a commit to znly/golang-x-net that referenced this issue Nov 14, 2018

http2: add I/O timeouts (same commit message and Change-Id as above)