net/http: investigate Transport's use of cached connections upon resume from sleep #29308

bradfitz · 2018-12-17T19:52:33Z

When the http.Transport has cached persistent connections to a server and the machine is suspended (e.g. laptop lid closing, serverless environments), the monotonic clock might not advance but the wall clock will.

This is a tracking bug to verify how Transport behaves in such cases.

It would be bad if during the machine's time asleep we got a TCP packet to close a connection but we never saw it (due to being asleep), and then upon resume we try to re-use that dead TCP connection, get a write error, and then are unable to retry for whatever reason (non-idempotent or non-replayable request).

Before using a connection we should look at the current wall time and compare it to the wall time of when it was last idle. We might already be doing that (by accident?) but we might also be accidentally using monotonic time, in which case we wouldn't notice the missing chunks of time.

Investigate.

/cc @jadekler

ianlancetaylor · 2018-12-17T20:11:09Z

odeke-em · 2019-06-22

Way forward

Perhaps a //go:linkname to time.mono (e.g. //go:linkname monotonicNs time.mono) and then some mechanism for detecting drift, as in this patch, which is meant to handle clock drift/frozen time:

diff --git a/src/net/http/transport.go b/src/net/http/transport.go
index 26f642aa7a..138df57765 100644
--- a/src/net/http/transport.go
+++ b/src/net/http/transport.go
@@ -29,6 +29,7 @@ import (
 	"sync"
 	"sync/atomic"
 	"time"
+	_ "unsafe"
 
 	"golang.org/x/net/http/httpguts"
 	"golang.org/x/net/http/httpproxy"
@@ -603,7 +604,10 @@ func (pc *persistConn) shouldRetryRequest(req *Request, err error) bool {
 		// creating new connections and retrying if the server
 		// is just hanging up on us because it doesn't like
 		// our request (as opposed to sending an error).
-		return false
+
+		// As per golang.org/issue/29308, let's only return true if time
+		// drifted.
+		return pc.timeDrifted()
 	}
 	if _, ok := err.(nothingWrittenError); ok {
 		// We never wrote anything, so it's safe to retry, if there's no body or we
@@ -628,6 +632,44 @@ func (pc *persistConn) shouldRetryRequest(req *Request, err error) bool {
 	return false // conservatively
 }
 
+// timeDrifted reports whether we can detect that our wall and monotonic clocks
+// have drifted apart. There are 2 scenarios in which it returns true:
+//  1. If pc.t.IdleConnTimeout is non-zero and
+//      pc.idleAt+pc.t.IdleConnTimeout is before time.Now()
+//  2. If the monotonic clock difference and the wall clock time difference
+//      diverge by more than a heuristic percentage.
+func (pc *persistConn) timeDrifted() bool {
+	pc.mu.Lock()
+	idleAt := pc.idleAt
+	var idleConnTimeout time.Duration
+	if pc.t != nil {
+		idleConnTimeout = pc.t.IdleConnTimeout
+	}
+	pc.mu.Unlock()
+
+	if idleAt.IsZero() {
+		// We need to have a non-zero idleAt in order
+		// to try to guess if we've drifted.
+		return false
+	}
+
+	now := time.Now()
+
+	if idleConnTimeout > 0 {
+		// Since we have a Transport.IdleConnTimeout, if we were put in limbo
+		// and our idle conn cleanup routine didn't fire, manually check here.
+		return now.After(idleAt.Add(idleConnTimeout))
+	}
+
+	diffWallSecs := float64(now.UnixNano()-idleAt.UnixNano()) / 1e9
+	diffMonoSecs := float64(monotonicNs(now)-monotonicNs(idleAt)) / 1e9
+	driftPercentage := 100 * (diffWallSecs - diffMonoSecs) / diffWallSecs
+
+	// As a heuristic, if the monotonic elapsed time lags the wall clock elapsed
+	// time by say over 10%, then report this as a drifted time.
+	return diffWallSecs > 0 && driftPercentage >= 10
+}
+
 // ErrSkipAltProtocol is a sentinel error value defined by Transport.RegisterProtocol.
 var ErrSkipAltProtocol = errors.New("net/http: skip alternate protocol")
 
@@ -878,6 +920,9 @@ func (t *Transport) tryPutIdleConn(pconn *persistConn) error {
 	return nil
 }
 
+//go:linkname monotonicNs time.mono
+func monotonicNs(t time.Time) int64
+
 // getIdleConnCh returns a channel to receive and return idle
 // persistent connection for the given connectMethod.
 // It may return nil, if persistent connections are not being used.
@@ -1429,6 +1474,10 @@ func (t *Transport) dialConn(ctx context.Context, cm connectMethod) (*persistCon
 	pconn.br = bufio.NewReaderSize(pconn, t.readBufferSize())
 	pconn.bw = bufio.NewWriterSize(persistConnWriter{pconn}, t.writeBufferSize())
 
+	// Perhaps we are abusing idleAt but really we want to
+	// capture the time that the connection was created.
+	pconn.idleAt = time.Now()
+
 	go pconn.readLoop()
 	go pconn.writeLoop()
 	return pconn, nil
diff --git a/src/time/time.go b/src/time/time.go
index c8116a74f4..06da8009f3 100644
--- a/src/time/time.go
+++ b/src/time/time.go
@@ -1091,6 +1091,10 @@ func daysIn(m Month, year int) int {
 	return int(daysBefore[m] - daysBefore[m-1])
 }
 
+func mono(t Time) int64 {
+	return t.ext
+}
+
 // Provided by package runtime.
 func now() (sec int64, nsec int32, mono int64)

Results

Before patch

server: 2019/06/21 19:12:54 Listening at "[::]:49373"
client: 2019/06/21 19:12:55 URL: http://[::]:49373/hello
server: 2019/06/21 19:12:55 Request from: [::1]:49376

client: 2019/06/21 19:12:55 Blob: Server responding ASAP
client: 2019/06/21 19:12:55 Pausing for 1.5s

client: 2019/06/21 19:12:57 URL: http://[::]:49373/hello
server: 2019/06/21 19:12:57 Request from: [::1]:49376

server: 2019/06/21 19:12:57 Now going to lag for 3s
server: 2019/06/21 19:12:57 Pausing for 2s before reviving client with pid: 21447
server: 2019/06/21 19:13:27 Request from: [::1]:49377

server: 2019/06/21 19:13:27 Now going to lag for 3s
client: 2019/06/21 19:13:33 Failed to make request("http://[::]:49373/hello"): Get http://[::]:49373/hello: read tcp [::1]:49377->[::1]:49373: read: connection reset by peer

After proposed patch

server: 2019/06/21 19:11:00 Listening at "[::]:49346"
client: 2019/06/21 19:11:01 URL: http://[::]:49346/hello
server: 2019/06/21 19:11:01 Request from: [::1]:49349

client: 2019/06/21 19:11:01 Blob: Server responding ASAP
client: 2019/06/21 19:11:01 Pausing for 1.5s

client: 2019/06/21 19:11:02 URL: http://[::]:49346/hello
server: 2019/06/21 19:11:02 Request from: [::1]:49349

server: 2019/06/21 19:11:02 Now going to lag for 3s
server: 2019/06/21 19:11:02 Pausing for 2s before reviving client with pid: 21387
server: 2019/06/21 19:11:32 Request from: [::1]:49360

server: 2019/06/21 19:11:32 Now going to lag for 3s
server: 2019/06/21 19:11:39 Request from: [::1]:49362

Please let me know what y'all think.

gopherbot · 2019-06-24T05:30:38Z

Change https://golang.org/cl/183557 mentions this issue: net/http: detect and make persistConn handle time drifts

odeke-em · 2019-09-13T03:36:29Z

One other thing I realized today while thinking out loud about this issue: perhaps the runtime, on receiving a SIGCONT, could refetch the current time and update any existing timers that may have drifted. This might even be a simpler and more correct solution than adding a way to get at time.mono.

bradfitz · 2019-11-01T18:25:56Z

@odeke-em, what about https://go-review.googlesource.com/c/go/+/204797? Seems a bit simpler.

Could you try it out?

odeke-em · 2019-11-01T18:39:01Z

Roger that and nice work @bradfitz! Let me try it out right now.

gopherbot · 2019-11-01T18:41:44Z

Change https://golang.org/cl/204797 mentions this issue: net/http: only use wall time in Transport idle conn timeouts

odeke-em · 2019-11-01T19:07:45Z

@bradfitz, this gist might help automate the testing, for a faster feedback loop when making the change: https://gist.github.com/odeke-em/639f947edded2f86ae34d286fb12f875#file-main-go

…) time

Both laptops closing their lids and cloud container runtimes suspending VMs both faced the problem where an idle HTTP connection used by the Transport could be cached for later reuse before the machine is frozen, only to wake up many minutes later to think that their HTTP connection was still good (because only a second or two of monotonic time passed), only to find out that the peer hung up on them when they went to write.

HTTP/1 connection reuse is inherently racy like this, but no need for us to step into a trap if we can avoid it. Also, not everybody sets Request.GetBody to enable re-tryable POSTs. And we can only safely retry requests in some cases.

So with this CL, before reusing an old connection, double check the walltime.

Testing was done both with a laptop (closing the lid for a bit) and with QEMU, running "stop" and "cont" commands in the monitor and sending QMP guest agent commands to update its wall clock after the "cont":

    echo '{"execute":"guest-set-time"}' | socat STDIN UNIX-CONNECT:/var/run/qemu-server/108.qga

In both cases, I was running https://gist.github.com/bradfitz/260851776f08e4bc4dacedd82afa7aea and watching that the RemoteAddr changed after resume.

It's kinda difficult to write an automated test for. I gave a lightning talk on using pure emulation user mode qemu for such tests:

https://www.youtube.com/watch?v=69Zy77O-BUM
https://docs.google.com/presentation/d/1rAAyOTCsB8GLbMgI0CAbn69r6EVWL8j3DPl4qc0sSlc/edit?usp=sharing
https://github.com/google/embiggen-disk/blob/master/integration_test.go

... that would probably be a good direction if we want an automated test here. But I don't have time to do that now.

Updates #29308 (HTTP/2 remains)

Change-Id: I03997e00491f861629d67a0292da000bd94ed5ca
Reviewed-on: https://go-review.googlesource.com/c/go/+/204797
Reviewed-by: Bryan C. Mills <bcmills@google.com>

mpx · 2019-11-25T10:48:04Z

This seems like another example where using BOOTTIME would help? (#24595)

gopherbot · 2019-11-25T22:02:36Z

Change https://golang.org/cl/208798 mentions this issue: http2: make Transport.IdleConnTimeout consider wall (not monotonic) time

This is the http2 version of CL 204797.

Updates golang/go#29308 (fixes once bundled into std)

Change-Id: I7cd97d38c941e9a8a62808e23b6533c72760f003
Reviewed-on: https://go-review.googlesource.com/c/net/+/208798
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Bryan C. Mills <bcmills@google.com>

gopherbot · 2019-11-27T00:02:55Z

Change https://golang.org/cl/209077 mentions this issue: net/http: update bundled x/net/http2

bradfitz added the help wanted and NeedsInvestigation labels Dec 17, 2018

bradfitz added this to the Go1.13 milestone Dec 17, 2018

jeanbza mentioned this issue Dec 17, 2018

storage: connection reset by peer on Cloud Functions googleapis/google-cloud-go#1253

Closed

sethvargo mentioned this issue May 23, 2019

GCS backend errors from Google Storage API, barrier errors & no TLS config found hashicorp/vault#6641

Closed

odeke-em modified the milestones: Go1.13, Go1.14 Jun 12, 2019

odeke-em self-assigned this Jun 12, 2019

rsc modified the milestones: Go1.14, Backlog Oct 9, 2019

smasher164 modified the milestones: Backlog, Go1.14 Oct 11, 2019

odeke-em mentioned this issue Oct 20, 2019

time: NewTimer firing later if computer sleeps, how to use wall clock? #35012

Open

bradfitz self-assigned this Nov 22, 2019

gopherbot closed this as completed in 22688f7 Nov 27, 2019

mpx mentioned this issue Nov 27, 2019

runtime: use CLOCK_BOOTTIME, not CLOCK_MONOTONIC, when possible #24595

Closed

golang locked and limited conversation to collaborators Nov 26, 2020

gopherbot added the FrozenDueToAge label Nov 26, 2020

rsc unassigned bradfitz and odeke-em Jun 23, 2022
