
net/http: http/2 connection management magnifies the effect of network timeouts #63422

rhysh opened this issue Oct 6, 2023 · 3 comments
Label: NeedsInvestigation (someone must examine and confirm this is a valid issue and not a duplicate of an existing one)



rhysh commented Oct 6, 2023

What version of Go are you using (go version)?

$ go1.21 version
go version go1.21.2 darwin/arm64

Does this issue reproduce with the latest release?

Yes, this is present in the Go 1.21 series and in the development branch.

What operating system and processor architecture are you using (go env)?

go env Output
$ go1.21 env                     
GOENV='/Users/rhys/Library/Application Support/go/env'
GOGCCFLAGS='-fPIC -arch arm64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -ffile-prefix-map=/var/folders/lf/n4ry0qv5639_3z0nhyhy3z7c0000gs/T/go-build3568422790=/tmp/go-build -gno-record-gcc-switches -fno-common'

What did you do?

I made HTTP requests over a network where routing changes sometimes result in TCP flows transitioning from 0% packet loss to 100% packet loss.

What did you expect to see?

I expected that a TCP flow with 100% packet loss would not be used for new requests.

What did you see instead?

With HTTP/1, each connection is used for one request at a time. Complete packet loss on a connection leads to failure by timeout of the request that was assigned to it (if any), or the failure of the next request to be assigned to that connection. The http.Transport pool responds to that situation by closing the connection, preventing its use for any other requests. This is good.

With HTTP/2, each connection may be used for multiple concurrent requests. Complete packet loss on a connection leads to failure by timeout of every request assigned to it (if any). This is understandable; without a user-defined limit on the number of streams the client is willing to open, it seems unavoidable.

But continuing with HTTP/2: if the connection has not yet reached its limit for concurrent streams, the http.Transport pool will continue to assign new requests to that connection, even if the other requests assigned to that connection have failed to result in any HTTP/2 frames received at the client. If I understand correctly, a Go httptest.Server will allow up to 250 concurrent streams per connection. Failing hundreds of requests when one connection experiences packet loss is unfortunate, and seems like something we could improve.

Finally on HTTP/2, when a connection is experiencing 100% packet loss and so results in timeouts as I described above, any request that has timed out stops counting against the connection's concurrent streams limit. This results in the dead connection appearing to have capacity for new requests. This can result in thousands upon thousands of requests failing, especially if the concurrency never grows beyond what the http.Transport expects a single connection can handle. This can go on until the TCP connection's buffer within the OS fills, or until the OS times out the connection (which can take several minutes). Failing thousands of requests, beyond the connection's max streams limit, when one connection experiences packet loss is something we should be able to fix.

These stark differences between how HTTP/1 and Go's HTTP/2 respond to network perturbations make it hard to rely on Go's HTTP/2 support in production.
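As an aside, a partial mitigation (it does not fix the stream-accounting problem described above) is the ping-based connection health check in the golang.org/x/net/http2 package, which is disabled by default. The sketch below shows how it might be enabled; the 10s/5s values are illustrative, not recommendations, and it assumes the x/net module is available:

```go
package main

import (
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

// newPingingTransport returns an *http.Transport whose HTTP/2 connections
// send a PING frame when the connection has been idle on reads, and are
// closed if the PING goes unanswered.
func newPingingTransport() (*http.Transport, error) {
	t1 := &http.Transport{}
	// ConfigureTransports registers HTTP/2 support on t1 and returns the
	// underlying *http2.Transport so its HTTP/2-specific knobs can be set.
	t2, err := http2.ConfigureTransports(t1)
	if err != nil {
		return nil, err
	}
	t2.ReadIdleTimeout = 10 * time.Second // send a PING if no frames are read for this long
	t2.PingTimeout = 5 * time.Second      // close the connection if the PING goes unanswered
	return t1, nil
}
```

With this enabled, a connection experiencing 100% packet loss is torn down after roughly ReadIdleTimeout + PingTimeout, rather than lingering until the OS gives up on the TCP connection.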

CC @neild

The reproducer below (collapsed) is structured as a test that fails unconditionally in order to print logs. It makes one request at a time, with a timeout. Every so often, it "unplugs" one of the TCP connections in the pool to simulate (from Go's perspective, if not the OS's) 100% packet loss. Usually there's only one connection in the pool.

Here's how it looks when making 30,000 requests over HTTP/1.1 with a 10ms timeout, unplugging a connection after 5,000, 15,000, and 25,000 requests. It observes failures in 3 requests:

$ go1.21 test ./h2_unplug_test.go -request-timeout=10ms -request-count=30000 -unplug-interval=10000
--- FAIL: TestNetsplit (2.17s)
    --- FAIL: TestNetsplit/http1 (2.17s)
        h2_unplug_test.go:155: total=30000 unplugs=3 failures=3
FAIL    command-line-arguments  2.358s

Here's how it looks when making 30,000 requests over HTTP/2 with a 10ms timeout, unplugging a connection after 5,000 requests and attempting to unplug after 15,000 and 25,000 requests as well (though there's no new connection to unplug). It observes failures in 25,000 requests:

$ go1.21 test ./h2_unplug_test.go -request-timeout=10ms -request-count=30000 -unplug-interval=10000
--- FAIL: TestNetsplit (275.02s)
    --- FAIL: TestNetsplit/http2 (275.02s)
        h2_unplug_test.go:155: total=30000 unplugs=1 failures=25000
FAIL    command-line-arguments  275.201s
package main

import (
	"context"
	"flag"
	"io"
	"math/rand"
	"net"
	"net/http"
	"net/http/httptest"
	"sync"
	"testing"
	"time"
)

var (
	requestTimeout = flag.Duration("request-timeout", 10*time.Millisecond, "Timeout for each HTTP request")
	requestCount   = flag.Int("request-count", 1000, "Number of HTTP requests to send")
	unplugInterval = flag.Int("unplug-interval", 10, "Number of HTTP requests to issue between each unplug event")
)

func TestNetsplit(t *testing.T) {
	testcase := func(
		srvConfig func(srv *httptest.Server),
		rtConfig func(rt *http.Transport),
		test func(t *testing.T, url string, client *http.Client,
			withConns func(fn func(conns []*proxyConn) []*proxyConn)),
	) func(t *testing.T) {
		return func(t *testing.T) {
			srv := httptest.NewUnstartedServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {}))
			srvConfig(srv)
			srv.StartTLS()
			defer srv.Close()

			dialer := &net.Dialer{}

			var (
				pipeMu sync.Mutex
				pipes  []*proxyConn
			)

			var proxyMu sync.Mutex
			proxy, err := net.Listen("tcp", "")
			if err != nil {
				t.Fatalf("proxy listen: %v", err)
			}
			defer proxy.Close()

			// connPair dials the in-process proxy listener and returns the
			// client half plus the matching accepted half.
			connPair := func() (net.Conn, net.Conn, error) {
				proxyMu.Lock()
				defer proxyMu.Unlock()

				c1, err := net.Dial(proxy.Addr().Network(), proxy.Addr().String())
				if err != nil {
					return nil, nil, err
				}
				c2, err := proxy.Accept()
				if err != nil {
					return nil, nil, err
				}
				return c1, c2, nil
			}

			transport := srv.Client().Transport.(*http.Transport).Clone()
			transport.ForceAttemptHTTP2 = true
			transport.DialContext = func(ctx context.Context, network, addr string) (net.Conn, error) {
				c3, err := dialer.DialContext(ctx, network, addr)
				if err != nil {
					return nil, err
				}
				c1, c2, err := connPair()
				if err != nil {
					return nil, err
				}
				pipe := newProxyConn(c2, c3)

				pipeMu.Lock()
				pipes = append(pipes, pipe)
				pipeMu.Unlock()

				return c1, nil
			}
			rtConfig(transport)

			client := &http.Client{Transport: transport}

			withConns := func(fn func(conns []*proxyConn) []*proxyConn) {
				pipeMu.Lock()
				defer pipeMu.Unlock()
				pipes = fn(pipes)
			}

			test(t, srv.URL, client, withConns)
		}
	}

	makeRequest := func(ctx context.Context, url string, client *http.Client) error {
		req, err := http.NewRequestWithContext(ctx, "GET", url, nil)
		if err != nil {
			return err
		}
		resp, err := client.Do(req)
		if err != nil {
			return err
		}
		defer resp.Body.Close()

		_, err = io.Copy(io.Discard, resp.Body)
		if err != nil {
			return err
		}
		return nil
	}

	run := func(t *testing.T, url string, client *http.Client, withConns func(func(conns []*proxyConn) []*proxyConn)) {
		var failures int64
		var unplugs int64
		total := *requestCount
		for i := 0; i < total; i++ {
			if interval := *unplugInterval; i%interval == interval/2 { // midway or a bit less
				withConns(func(conns []*proxyConn) []*proxyConn {
					if l := len(conns); l > 0 {
						idx := rand.Intn(l)
						pipe := conns[idx]
						pipe.unplug()
						unplugs++

						conns[idx] = conns[l-1]
						conns = conns[:l-1]
					}
					return conns
				})
			}

			err := func() error {
				ctx := context.Background()
				ctx, cancel := context.WithTimeout(ctx, *requestTimeout)
				defer cancel()
				return makeRequest(ctx, url, client)
			}()
			if err != nil {
				failures++
			}
		}

		// Fail unconditionally so the test always prints its summary line.
		t.Errorf("total=%d unplugs=%d failures=%d", total, unplugs, failures)
	}

	t.Run("http1", testcase(
		func(srv *httptest.Server) { srv.EnableHTTP2 = false },
		func(rt *http.Transport) {},
		run,
	))

	t.Run("http2", testcase(
		func(srv *httptest.Server) { srv.EnableHTTP2 = true },
		func(rt *http.Transport) {},
		run,
	))
}

// A proxyConn connects two net.Conns together, copying the data in each
// direction, until it is closed or unplugged. Unplugging a proxyConn causes it
// to silently stop copying the data.
type proxyConn struct {
	c1 net.Conn
	c2 net.Conn

	once sync.Once
}

func newProxyConn(c1, c2 net.Conn) *proxyConn {
	pc := &proxyConn{c1: c1, c2: c2}

	go io.Copy(c1, c2)
	go io.Copy(c2, c1)

	return pc
}

func (pc *proxyConn) unplug() {
	// A deadline in the past makes the pending io.Copy reads and writes
	// return, so data stops flowing while both endpoints stay open.
	dl := time.Now().Add(-1 * time.Second)
	pc.c1.SetDeadline(dl)
	pc.c2.SetDeadline(dl)
}

func (pc *proxyConn) close() {
	pc.once.Do(func() {
		pc.c1.Close()
		pc.c2.Close()
	})
}

prattmic commented Oct 6, 2023

cc @neild

prattmic added the NeedsInvestigation label on Oct 6, 2023.

When you say "100% packet loss" are you detecting this due to timeouts via DeadlineExceeded (and/or 499)?

This is an interesting problem and I wonder how other languages cope with this.


rhysh commented Nov 2, 2023

This shows up in the app as timeouts / DeadlineExceeded, yes. (Or, if the tightest timeout comes from an earlier layer of the call stack, as Canceled.)

"100% packet loss" is a type of problem with the TCP connection that results in the kernel not having / not giving us the information we expect to be available. Other causes of that observable behavior (such as the application on the other end deciding to sleep instead of writing our data) would have the same effect on users of Go's HTTP/2 support.
