Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net/http: http/2 connection management magnifies the effect of network timeouts #63422

Open
rhysh opened this issue Oct 6, 2023 · 3 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@rhysh
Copy link
Contributor

rhysh commented Oct 6, 2023

What version of Go are you using (go version)?

$ go1.21 version
go version go1.21.2 darwin/arm64

Does this issue reproduce with the latest release?

Yes, this is present in the Go 1.21 series and in the development branch.

What operating system and processor architecture are you using (go env)?

go env Output
$ go1.21 env                     
GO111MODULE=''
GOARCH='arm64'
GOBIN=''
GOCACHE='/Users/rhys/Library/Caches/go-build'
GOENV='/Users/rhys/Library/Application Support/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='arm64'
GOHOSTOS='darwin'
GOINSECURE=''
GOMODCACHE='/Users/rhys/go/pkg/mod'
GONOPROXY='*'
GONOSUMDB='*'
GOOS='darwin'
GOPATH='/Users/rhys/go'
GOPRIVATE='*'
GOPROXY='direct'
GOROOT='/Users/rhys/go/version/go1.21'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='local'
GOTOOLDIR='/Users/rhys/go/version/go1.21/pkg/tool/darwin_arm64'
GOVCS=''
GOVERSION='go1.21.2'
GCCGO='gccgo'
AR='ar'
CC='clang'
CXX='clang++'
CGO_ENABLED='1'
GOMOD='/dev/null'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -arch arm64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -ffile-prefix-map=/var/folders/lf/n4ry0qv5639_3z0nhyhy3z7c0000gs/T/go-build3568422790=/tmp/go-build -gno-record-gcc-switches -fno-common'

What did you do?

I made HTTP requests over a network where routing changes sometimes result in TCP flows transitioning from 0% packet loss to 100% packet loss.

What did you expect to see?

I expected that a TCP flow with 100% packet loss would not be used for new requests.

What did you see instead?

With HTTP/1, each connection is used for one request at a time. Complete packet loss on a connection leads to failure by timeout of the request that was assigned to it (if any), or the failure of the next request to be assigned to that connection. The http.Transport pool responds to that situation by closing the connection, preventing its use for any other requests. This is good.

With HTTP/2, each connection may be used for multiple concurrent requests. Complete packet loss on a connection leads to failure by timeout of every request assigned to it (if any). This is understandable, but without a user-defined limit on the number of streams the client is willing to open it seems unavoidable.

But continuing with HTTP/2, if the connection has not yet reached its limit for concurrent streams, the http.Transport pool will continue to assign new requests to that connection, even if the other requests it's assigned to that connection have failed to result in any HTTP/2 frames received at the client. If I understand correctly, a Go httptest.Server will allow up to 250 concurrent streams per connection. Failing hundreds of requests when one connection experiences packet loss is unfortunate, and seems like something we could improve.

Finally on HTTP/2, when a connection is experiencing 100% packet loss and so results in timeouts as I described above, any request that has timed out stops counting against the connection's concurrent streams limit. This results in the dead connection appearing to have capacity for new requests. This can result in thousands upon thousands of requests failing, especially if the concurrency never grows beyond what the http.Transport expects a single connection can handle. This can go on until the TCP connection's buffer within the OS fills, or until the OS times out the connection (which can take several minutes). Failing thousands of requests, beyond the connection's max streams limit, when one connection experiences packet loss is something we should be able to fix.

These stark differences between how HTTP/1 and Go's HTTP/2 respond to network perturbations makes it hard rely on Go's HTTP/2 support in production.

CC @neild


The reproducer below (collapsed) is structured as a test that fails unconditionally in order to print logs. It makes one request at a time, with a timeout. Every so often, it "unplugs" one of the TCP connections in the pool to simulate (from Go's perspective, if not the OS's) 100% packet loss. Usually there's only one connection in the pool.

Here's how it looks when making 30,000 requests over HTTP/1.1 with a 10ms timeout, unplugging a connection after 5000, 15,000, and 25,000 requests. It observes failure in 3 requests:

$ go1.21 test ./h2_unplug_test.go -test.run=/http1 -request-timeout=10ms -request-count=30000 -unplug-interval=10000
--- FAIL: TestNetsplit (2.17s)
    --- FAIL: TestNetsplit/http1 (2.17s)
        h2_unplug_test.go:155: total=30000 unplugs=3 failures=3
FAIL
FAIL    command-line-arguments  2.358s
FAIL

Here's how it looks when making 30,000 requests over HTTP/2 with a 10ms timeout, unplugging a connection after 5000 requests and attempting to unplug after 15,000 and 25,000 requests as well (though there's no new connection to unplug). It observes failure in 25,000 requests:

$ go1.21 test ./h2_unplug_test.go -test.run=/http2 -request-timeout=10ms -request-count=30000 -unplug-interval=10000
--- FAIL: TestNetsplit (275.02s)
    --- FAIL: TestNetsplit/http2 (275.02s)
        h2_unplug_test.go:155: total=30000 unplugs=1 failures=25000
FAIL
FAIL    command-line-arguments  275.201s
FAIL
./h2_unplug_test.go
package main

import (
	"context"
	"flag"
	"io"
	"math/rand"
	"net"
	"net/http"
	"net/http/httptest"
	"sync"
	"testing"
	"time"
)

var (
	requestTimeout = flag.Duration("request-timeout", 10*time.Millisecond, "Timeout for each HTTP request")
	requestCount   = flag.Int("request-count", 1000, "Number of HTTP requests to send")
	unplugInterval = flag.Int("unplug-interval", 10, "Number of HTTP requests to issue between each unplug event")
)

func TestNetsplit(t *testing.T) {
	testcase := func(
		srvConfig func(srv *httptest.Server),
		rtConfig func(rt *http.Transport),
		test func(t *testing.T, url string, client *http.Client,
			withConns func(fn func(conns []*proxyConn) []*proxyConn),
		),
	) func(t *testing.T) {
		return func(t *testing.T) {
			srv := httptest.NewUnstartedServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
				return
			}))
			srvConfig(srv)
			srv.StartTLS()
			defer srv.Close()

			dialer := &net.Dialer{}

			var (
				pipeMu sync.Mutex
				pipes  []*proxyConn
			)

			var proxyMu sync.Mutex
			proxy, err := net.Listen("tcp", "127.0.0.1:0")
			if err != nil {
				t.Fatalf("proxy listen: %v", err)
			}
			defer proxy.Close()

			connPair := func() (net.Conn, net.Conn, error) {
				proxyMu.Lock()
				defer proxyMu.Unlock()

				c1, err := net.Dial(proxy.Addr().Network(), proxy.Addr().String())
				if err != nil {
					return nil, nil, err
				}

				c2, err := proxy.Accept()
				if err != nil {
					return nil, nil, err
				}

				return c1, c2, nil
			}

			transport := srv.Client().Transport.(*http.Transport).Clone()
			transport.ForceAttemptHTTP2 = true
			transport.DialContext = func(ctx context.Context, network, addr string) (net.Conn, error) {
				c3, err := dialer.DialContext(ctx, network, addr)
				if err != nil {
					return nil, err
				}

				c1, c2, err := connPair()
				if err != nil {
					return nil, err
				}

				pipe := newProxyConn(c2, c3)

				pipeMu.Lock()
				pipes = append(pipes, pipe)
				pipeMu.Unlock()

				return c1, nil
			}

			client := &http.Client{Transport: transport}

			withConns := func(fn func(conns []*proxyConn) []*proxyConn) {
				pipeMu.Lock()
				defer pipeMu.Unlock()
				pipes = fn(pipes)
			}

			test(t, srv.URL, client, withConns)
		}
	}

	makeRequest := func(ctx context.Context, url string, client *http.Client) error {
		req, err := http.NewRequestWithContext(ctx, "GET", url, nil)
		if err != nil {
			return err
		}

		resp, err := client.Do(req)
		if err != nil {
			return err
		}
		defer resp.Body.Close()

		_, err = io.Copy(io.Discard, resp.Body)
		if err != nil {
			return err
		}

		return nil
	}

	run := func(t *testing.T, url string, client *http.Client, withConns func(func(conns []*proxyConn) []*proxyConn)) {
		var failures int64
		var unplugs int64
		total := *requestCount
		for i := 0; i < total; i++ {
			if interval := *unplugInterval; i%interval == interval/2 { // midway or a bit less
				withConns(func(conns []*proxyConn) []*proxyConn {
					if l := len(conns); l > 0 {
						idx := rand.Intn(l)
						pipe := conns[idx]
						pipe.unplug()
						unplugs++

						conns[idx] = conns[l-1]
						conns = conns[:l-1]
					}
					return conns
				})
			}

			err := func() error {
				ctx := context.Background()
				ctx, cancel := context.WithTimeout(ctx, *requestTimeout)
				defer cancel()
				return makeRequest(ctx, url, client)
			}()

			if err != nil {
				failures++
			}
		}

		t.Errorf("total=%d unplugs=%d failures=%d", total, unplugs, failures)
	}

	t.Run("http1", testcase(
		func(srv *httptest.Server) { srv.EnableHTTP2 = false },
		func(rt *http.Transport) {},
		run,
	))

	t.Run("http2", testcase(
		func(srv *httptest.Server) { srv.EnableHTTP2 = true },
		func(rt *http.Transport) {},
		run,
	))
}

// A proxyConn connects two net.Conns together, copying the data in each
// direction, until it is closed or unplugged. Unplugging a proxyConn causes it
// to silently stop copying the data.
type proxyConn struct {
	c1 net.Conn
	c2 net.Conn

	once sync.Once
}

func newProxyConn(c1, c2 net.Conn) *proxyConn {
	pc := &proxyConn{c1: c1, c2: c2}

	go io.Copy(c1, c2)
	go io.Copy(c2, c1)

	return pc
}

func (pc *proxyConn) unplug() {
	dl := time.Now().Add(-1 * time.Second)
	pc.c1.SetDeadline(dl)
	pc.c2.SetDeadline(dl)
}

func (pc *proxyConn) close() {
	pc.once.Do(func() {
		pc.c1.Close()
		pc.c2.Close()
	})
}
@prattmic
Copy link
Member

prattmic commented Oct 6, 2023

cc @neild

@prattmic prattmic added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Oct 6, 2023
@cgthayer
Copy link

When you say "100% packet loss" are you detecting this due to timeouts via DeadlineExceeded (and/or 499)?

This is an interesting problem and I wonder how other languages cope with this.

@rhysh
Copy link
Contributor Author

rhysh commented Nov 2, 2023

This shows up in the app as timeouts / DeadlineExceeded, yes. (Or, if the tightest timeout is from an earlier layer of caller, as Canceled.)

"100% packet loss" is a type of problem with the TCP connection that results in the kernel not having / not giving us the information we expect to be available. Other causes of that observable behavior (such as the application on the other end deciding to sleep instead of writing our data) would have the same effect on users of Go's HTTP/2 support.

@seankhliao seankhliao added this to the Unplanned milestone Jul 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

4 participants