Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net/http, x/net/http2: TCP connection is not closed after http client request times out after missing RST (TCP) or GOAWAY(HTTP/2) frame #36026

Open
jim-minter opened this issue Dec 6, 2019 · 5 comments
Milestone

Comments

@jim-minter
Copy link

@jim-minter jim-minter commented Dec 6, 2019

What version of Go are you using (go version)?

$ go version
go version go1.13.4 linux/amd64

Does this issue reproduce with the latest release?

I believe so.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/data/home/jminter/.cache/go-build"
GOENV="/data/home/jminter/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/data/home/jminter/go"
GOPRIVATE=""
GOPROXY="direct"
GOROOT="/usr/lib/golang"
GOSUMDB="off"
GOTMPDIR=""
GOTOOLDIR="/usr/lib/golang/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build951550205=/tmp/go-build -gno-record-gcc-switches"

What did you do?

With a simple client like this, and using it to make a dumb GET request to some server:

c := &http.Client{
Timeout: 30 * time.Second,
}

If the server doesn't respond within 30 seconds, the client gets a timeout error as expected. But, Go does not close the underlying TCP connection to the server. If the client keeps retrying, unfortunately its subsequent requests get queued on the same TCP connection.

If the server doesn't close the connection, or perhaps if there's an underlying TCP connection (e.g. connectivity) break which means the client and server aren't actually communicating, the client doesn't redial to establish a new connection, at least perhaps until the local kernel socket send buffer is full, or there's a TCP keepalive timeout, or something (conjecture).

You can see this when it happens because netstat continues to show the connection, and you can see its local Recv-Q figure incrementing on each retry.

Setting DisableKeepAlives works around this problem, but it's a pretty big hammer.

IMO it would make more sense for Go to abandon and close the underlying TCP connection if an HTTP timeout has occured on it.

What did you expect to see?

I'd like to see Go net/http close underlying TCP connections if the HTTP request they are running times out.

What did you see instead?

It keeps the underlying TCP connection open.

@bradfitz

This comment has been minimized.

Copy link
Member

@bradfitz bradfitz commented Dec 6, 2019

Got a repro?

@jim-minter

This comment has been minimized.

Copy link
Author

@jim-minter jim-minter commented Dec 7, 2019

@bradfitz I understand my problem better now, and I can reproduce this easily, but there's not a straightforward free-standing reproducer.

I think this is actually primarily a problem with Azure load balancers being badly behaved, but there's also a difference in Go client behaviour between HTTP/1.1 and HTTP/2. I do wonder whether there is scope for Go to be more defensive here.

The principal issue is that Azure load balancers by default don't send a TCP RST for incoming packets on a connection they don't recognise.

With my issue, I've got a long-standing connection to Azure CosmosDB (which sits behind an Azure load balancer), and I log onto a VPN which changes the default route on my laptop. The same thing can also happen if there's a network break that lasts longer than the Azure LB idle timeout, or, similarly, if I suspend my laptop for longer than the idle timeout.

When any of these things happen, the open TCP connection gets into a nasty state. The connection can't make progress because Azure doesn't recognise it or has timed it out, but it also doesn't die quickly because Azure blackholes it instead of sending a RST back to the client. Linux keeps sending packets to Azure hoping for a response until it hits the tcp_retries2 timeout and finally gives up.

In the case of a Go HTTP client with a Timeout set, using HTTP/2, when the underlying TCP connection goes bad, individual requests time out, but Go does not collect the connection. All future requests that are multiplexed to that connection also timeout because the underlying connection is basically dead - it cannot make progress.

For me, adding the following after c.Do() helps, although I suspect it's racy in the face of multiple concurrent requests on a single connection:

if err, ok := err.(net.Error); ok && err.Timeout() {
	     c.CloseIdleConnections()
}

Setting DisableKeepAlives: true works around the issue safely in the HTTP/2 case, but as mentioned, is a big hammer.

If I force the Go client to HTTP/1.1 (with keep-alives enabled), then when the Timeout hits the connection is automatically collected, which seems to be much better behaviour.

I think in my case, for safety, it makes sense for me to disable the HTTP/2 client for now.

I wonder whether, as a safety feature, it would make sense for the Go HTTP/2 client to abandon a connection for future requests if it has had a timeout - or whether this would be too intrusive? Conjecturing, it would be also a useful safety mechanism for any client sitting behind a stupid firewall that drops RSTs, if that brokenness still happens these days.

@toothrot toothrot added this to the Backlog milestone Dec 9, 2019
@odeke-em

This comment has been minimized.

Copy link
Member

@odeke-em odeke-em commented Dec 24, 2019

Thank you for the report @jim-minter!

To help get us to the last mile and make this issue actionable for Brad and myself, here is some seed code to help with the repro

package main

import (
	"fmt"
	"log"
	"net"
	"net/http"
	"net/http/httptest"
	"os"
	"os/exec"
	"time"

	"golang.org/x/net/http2"
)

func main() {
	fmt.Printf("Our PID: %d", os.Getpid())
	stallResp := make(chan bool)
	cst := httptest.NewUnstartedServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Printf("Request proto: %s\n", r.Proto)
		<-stallResp
		w.Write([]byte("Done here"))
	}))
	if err := http2.ConfigureServer(cst.Config, new(http2.Server)); err != nil {
		log.Fatalf("Failed to configure HTTP/2 server: %v", err)
	}
	cst.TLS = cst.Config.TLSConfig
	cst.StartTLS()
	defer cst.Close()

	tr := &http.Transport{TLSClientConfig: cst.Config.TLSConfig}
	if err := http2.ConfigureTransport(tr); err != nil {
		log.Fatalf("Failed to configure HTTP2 transport: %v", err)
	}
	tr.TLSClientConfig.InsecureSkipVerify = true
	client := &http.Client{
		Timeout:   400 * time.Millisecond,
		Transport: tr,
	}
	req, _ := http.NewRequest("GET", cst.URL, nil)
	res, err := client.Do(req)
	if err == nil {
		log.Fatal("Unexpectedly didn't get an error")
		res.Body.Close()
	}
	ne := err.(net.Error)
	if !ne.Timeout() {
		log.Fatal("Failed to get a timeout")
	}

	close(stallResp)

	// Now finally print the summary from lsof every 3 seconds.
	for {
		output, _ := lsofUs()
		fmt.Printf("%s\n\n", output)
		<-time.After(3 * time.Second)
	}
}

func lsofUs() ([]byte, error) {
	return exec.Command("lsof", "-p", fmt.Sprintf("%d", os.Getpid())).CombinedOutput()
}

It contains a server and a client both in HTTP/2 mode, please tweak it to make it representative of the failures you reported and every 3 seconds it'll report lsof -p PID output which will also print out the states of the sockets e.g.

main    34481 emmanuelodeke    8u    IPv4 0xc0636412fd530a8f       0t0      TCP localhost:49402->localhost:49401 (ESTABLISHED)
main    34481 emmanuelodeke    9u    IPv4 0xc06364130796210f       0t0      TCP localhost:49401->localhost:49402 (ESTABLISHED)

which you can then use to verify your report and then we can also use to diagnose the issue. Thank you!

@jim-minter

This comment has been minimized.

Copy link
Author

@jim-minter jim-minter commented Dec 24, 2019

@odeke-em, thanks for spurring me into action. Here's a reproducer that works on Linux (note: uses sudo and iptables):

package main

import (
	"crypto/tls"
	"log"
	"net"
	"net/http"
	"net/http/httptest"
	"os/exec"
	"time"
)

// configure returns a TLS test server and corresponding client.  If http2 is
// set, the server is configured to enable HTTP/2; otherwise HTTP/1.x will be
// negotiated.
func configure(dialer func(network, addr string) (net.Conn, error), http2 bool) (s *httptest.Server, c *http.Client) {
	s = httptest.NewUnstartedServer(http.HandlerFunc(func(http.ResponseWriter, *http.Request) {}))
	if http2 {
		s.TLS = &tls.Config{
			NextProtos: []string{"h2"},
		}
	}
	s.StartTLS()

	c = s.Client()
	c.Timeout = 500 * time.Millisecond
	c.Transport.(*http.Transport).Dial = dialer
	c.Transport.(*http.Transport).ForceAttemptHTTP2 = true

	return
}

// doRequest makes a GET request from the client to the server and logs the
// first line of the response.
func doRequest(s *httptest.Server, c *http.Client) {
	resp, err := c.Get(s.URL)
	if err != nil {
		log.Printf("doRequest: %v", err)
		return
	}

	log.Printf("doRequest: %s %s", resp.Proto, resp.Status)

	err = resp.Body.Close()
	if err != nil {
		log.Printf("doRequest: %v", err)
	}
}

// iptables manages a local iptables rule blackholing localhost traffic on a
// given TCP port.  It simulates bad behaviour by Azure load balancers, which by
// default don't send a TCP RST for incoming packets on a connection they don't
// recognise.  If command is "-I", the rule is inserted; if it is "-D", it is
// deleted.  For more details, see:
// https://docs.microsoft.com/en-us/azure/load-balancer/load-balancer-tcp-reset
func iptables(command, port string) error {
	log.Printf("iptables: %s %s", command, port)
	return exec.Command("sudo", "iptables", command, "INPUT", "-s", "127.0.0.1", "-p", "tcp", "--sport", port, "-j", "DROP").Run()
}

func test(http2 bool) error {
	var localPort string

	// dialer wraps net.Dial and leaks the local port of the connection it has
	// just dialed to localPort
	dialer := func(network, addr string) (net.Conn, error) {
		log.Printf("dialing %s", addr)
		conn, err := net.Dial(network, addr)
		if err != nil {
			return nil, err
		}

		_, localPort, err = net.SplitHostPort(conn.(*net.TCPConn).LocalAddr().String())
		if err != nil {
			return nil, err
		}

		return conn, nil
	}

	s, c := configure(dialer, http2)

	// first request is expected to succeed (no reason for it to fail)
	doRequest(s, c)

	// now simulate the underlying connection being broken
	defer iptables("-D", localPort)
	err := iptables("-I", localPort)
	if err != nil {
		return err
	}

	// second request is expected to time out because the underlying connection
	// is busted
	doRequest(s, c)

	// with HTTP/1.x, the third and subsequent requests succeed because the
	// broken underlying connection is discarded and a new one is dialed.

	// with HTTP/2, the third and subsequent requests fail because the
	// underlying connection is *not* discarded.  Very approximately, I think
	// that HTTP/2 should not make new requests on an underlying connection if
	// it has seen a request time out on that connection.
	doRequest(s, c)
	doRequest(s, c)

	return nil
}

func main() {
	log.Print("testing HTTP/1.x")
	err := test(false)
	if err != nil {
		panic(err)
	}

	log.Print("testing HTTP/2")
	err = test(true)
	if err != nil {
		panic(err)
	}
}

I get the following output:

2019/12/24 09:03:57 testing HTTP/1.x
2019/12/24 09:03:57 dialing 127.0.0.1:43105
2019/12/24 09:03:57 doRequest: HTTP/1.1 200 OK
2019/12/24 09:03:57 iptables: -I 47758
2019/12/24 09:03:58 doRequest: Get https://127.0.0.1:43105: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2019/12/24 09:03:58 dialing 127.0.0.1:43105
2019/12/24 09:03:58 doRequest: HTTP/1.1 200 OK
2019/12/24 09:03:58 doRequest: HTTP/1.1 200 OK
2019/12/24 09:03:58 iptables: -D 47758
2019/12/24 09:03:58 testing HTTP/2
2019/12/24 09:03:58 dialing 127.0.0.1:41429
2019/12/24 09:03:58 doRequest: HTTP/2.0 200 OK
2019/12/24 09:03:58 iptables: -I 42876
2019/12/24 09:03:58 doRequest: Get https://127.0.0.1:41429: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2019/12/24 09:03:59 doRequest: Get https://127.0.0.1:41429: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2019/12/24 09:03:59 doRequest: Get https://127.0.0.1:41429: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2019/12/24 09:03:59 iptables: -D 42876

Thanks!

@odeke-em

This comment has been minimized.

Copy link
Member

@odeke-em odeke-em commented Dec 26, 2019

Awesome, thank you for the complete repro @jim-minter!

I'll update this issue to match the problem and also kindly ask our HTTP/2 experts
@bradfitz @rs @tombergan:
How do we deal with a missing GOAWAY frame from the server? Sounds like
an HTTP/2 spec violation to me.
Should we be checking x/net/http2's RoundTripOpts for errors
that are non-retryable such as context.Canceled while waiting for headers? How do we differentiate the case of timeouts from no response, from just a server that's still processing?

As reported by this issue the HTTP/1 stack already closes connections on such errors but assumes that the HTTP/2 stack will do that too as per

go/src/net/http/transport.go

Lines 2576 to 2582 in c170b14

// Close HTTP/1 (pc.alt == nil) connection.
// HTTP/2 closes its connection itself.
if pc.alt == nil {
if err != errCallerOwnsConn {
pc.conn.Close()
}
close(pc.closech)

@odeke-em odeke-em changed the title net/http: TCP connection is not closed after http client request times out net/http, x/net/http2: TCP connection is not closed after http client request times out after missing RST (TCP) or GOAWAY(HTTP/2) frame Dec 26, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.