Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net/http: racy use of timers in http2Transport #37449

Closed
rsc opened this issue Feb 25, 2020 · 3 comments
Closed

net/http: racy use of timers in http2Transport #37449

rsc opened this issue Feb 25, 2020 · 3 comments

Comments

@rsc
Copy link
Contributor

@rsc rsc commented Feb 25, 2020

With these strategically-placed sleeps:

$ git diff 
diff --git a/src/runtime/time.go b/src/runtime/time.go
index d64bea814f..b01a664378 100644
--- a/src/runtime/time.go
+++ b/src/runtime/time.go
@@ -302,6 +302,7 @@ func deltimer(t *timer) bool {
 		case timerModifiedEarlier:
 			tpp := t.pp.ptr()
 			if atomic.Cas(&t.status, s, timerModifying) {
+				usleep(10000)
 				atomic.Xadd(&tpp.adjustTimers, -1)
 				if !atomic.Cas(&t.status, timerModifying, timerDeleted) {
 					badTimer()
@@ -487,6 +488,7 @@ func resettimer(t *timer, when int64) {
 			}
 		case timerDeleted:
 			if atomic.Cas(&t.status, s, timerModifying) {
+				usleep(10000)
 				t.nextwhen = when
 				newStatus := uint32(timerModifiedLater)
 				if when < t.when {
$ 

I get reliable crashes out of GOTRACEBACK=system go test -race in net/http.

The crash is a detected race between s.timer.Reset during the request writing goroutine (specifically a delayed body write) and s.timer.Stop during response processing of a 100 Continue header. Full stacks below.

The code is correct except for provoking the race panic. The timer setup is:

s.fnonce = new(sync.Once)

// Arm the timer with a very large duration, which we'll
// intentionally lower later. It has to be large now because
// we need a handle to it before writing the headers, but the
// s.delay value is defined to not start until after the
// request headers were written.
const hugeDuration = 365 * 24 * time.Hour
s.timer = time.AfterFunc(hugeDuration, func() {
	s.fnonce.Do(s.fn)
})

The request writing does:

// scheduleBodyWrite starts writing the body, either immediately (in
// the common case) or after the delay timeout. It should not be
// called until after the headers have been written.
func (s http2bodyWriterState) scheduleBodyWrite() {
	if s.timer == nil {
		// We're not doing a delayed write (see
		// getBodyWriterState), so just start the writing
		// goroutine immediately.
		go s.fn()
		return
	}
	http2traceWait100Continue(s.cs.trace)
	if s.timer.Stop() {
		s.timer.Reset(s.delay)
	}
}

If we get a 100 Continue then we stop the timer and run the function:

...
if statusCode == 100 {
	http2traceGot100Continue(cs.trace)
	if cs.on100 != nil {
		cs.on100() // forces any write delay timer to fire
	}
}
...

where on100 is:

func (s http2bodyWriterState) on100() {
	if s.timer == nil {
		// If we didn't do a delayed write, ignore the server's
		// bogus 100 continue response.
		return
	}
	s.timer.Stop()
	go func() { s.fnonce.Do(s.fn) }()
}

The Stop is cleanup, and fnonce is protecting the actual work.
If this on100 Stop happens first, then s.timer.Stop above returns false, and s.timer.Reset does not run.
If the scheduleBodyWrite Stop happens first, then it's true that there is a race where maybe the
on100 Stop happens before the scheduleBodyWrite Reset, so that the timer still does fire after s.delay.
But the code is aware of that potential problem and using fnonce to address it.
This code is basically unchanged since it was written in 2016.
What changed is that the timers got pickier in Go 1.11.

If we don't remove the panic in #37400 then we need to rewrite this code.

goroutine 16462 [runnable]:
runtime.wakeNetPoller(0xe1bdc93b1876)
	/Users/rsc/go/src/runtime/proc.go:2424 +0x52 fp=0xc00085c890 sp=0xc00085c888 pc=0x107deb2
runtime.resettimer(0xc000a94828, 0xe1bdc93b1876)
	/Users/rsc/go/src/runtime/time.go:502 +0x12f fp=0xc00085c8b0 sp=0xc00085c890 pc=0x1099fcf
time.resetTimer(0xc000a94828, 0xe1bdc93b1876)
	/Users/rsc/go/src/runtime/time.go:228 +0x51 fp=0xc00085c8d0 sp=0xc00085c8b0 pc=0x1099331
time.(*Timer).Reset(0xc000a94820, 0x3b9aca00, 0x10dfbef)
	/Users/rsc/go/src/time/sleep.go:126 +0x9f fp=0xc00085c908 sp=0xc00085c8d0 pc=0x115dc7f
net/http.http2bodyWriterState.scheduleBodyWrite(0xc0001e0420, 0xc000a94820, 0xc00086ac40, 0xc000252ff0, 0xc0009dc360, 0x3b9aca00)
	/Users/rsc/go/src/net/http/h2_bundle.go:9074 +0xd5 fp=0xc00085c948 sp=0xc00085c908 pc=0x1567235
net/http.(*http2ClientConn).roundTrip(0xc0006a8780, 0xc0008d4400, 0x0, 0x0, 0x0, 0x0)
	/Users/rsc/go/src/net/http/h2_bundle.go:7557 +0x912 fp=0xc00085cd40 sp=0xc00085c948 pc=0x15591f2
net/http.(*http2Transport).RoundTripOpt(0xc000970360, 0xc0008d4400, 0x0, 0x0, 0xc0003a6170, 0xc00085cd80)
	/Users/rsc/go/src/net/http/h2_bundle.go:6948 +0x273 fp=0xc00085cec0 sp=0xc00085cd40 pc=0x15526e3
net/http.(*http2Transport).RoundTrip(0xc000970360, 0xc0008d4400, 0x0, 0xc0001d8690, 0x5)
	/Users/rsc/go/src/net/http/h2_bundle.go:6909 +0x4c fp=0xc00085cf18 sp=0xc00085cec0 pc=0x1551fbc
net/http.(*Transport).roundTrip(0xc0003a6140, 0xc0008d4400, 0x0, 0x0, 0xc00016a9a0)
	/Users/rsc/go/src/net/http/transport.go:555 +0xb00 fp=0xc00085d268 sp=0xc00085cf18 pc=0x15aca20
net/http.(*Transport).RoundTrip(0xc0003a6140, 0xc0008d4400, 0xc0003a6140, 0x0, 0x0)
	/Users/rsc/go/src/net/http/roundtrip.go:17 +0x43 fp=0xc00085d2b8 sp=0xc00085d268 pc=0x157f423
net/http.send(0xc0008d4400, 0x19805e0, 0xc0003a6140, 0x0, 0x0, 0x0, 0x108fc30, 0x0, 0xc00085d580, 0x108fe1d)
	/Users/rsc/go/src/net/http/client.go:252 +0x6bb fp=0xc00085d4d8 sp=0xc00085d2b8 pc=0x1511eeb
net/http.(*Client).send(0xc000a38390, 0xc0008d4400, 0x0, 0x0, 0x0, 0xc0006accc8, 0x0, 0x1, 0x1b)
	/Users/rsc/go/src/net/http/client.go:176 +0x1ce fp=0xc00085d590 sp=0xc00085d4d8 pc=0x15114fe
net/http.(*Client).do(0xc000a38390, 0xc0008d4400, 0x0, 0x0, 0x0)
	/Users/rsc/go/src/net/http/client.go:692 +0x2cd fp=0xc00085d8d8 sp=0xc00085d590 pc=0x1513f0d
net/http.(*Client).Do(...)
	/Users/rsc/go/src/net/http/client.go:560
net/http_test.testTransportEventTrace(0xc00089e6c0, 0xc0007d0001)
	/Users/rsc/go/src/net/http/transport_test.go:4489 +0x1570 fp=0xc00085deb0 sp=0xc00085d8d8 pc=0x16ffb10
net/http_test.TestTransportEventTrace_h2(0xc00089e6c0)
	/Users/rsc/go/src/net/http/transport_test.go:4391 +0x40 fp=0xc00085ded0 sp=0xc00085deb0 pc=0x16fe4c0
testing.tRunner(0xc00089e6c0, 0x18b48f8)
	/Users/rsc/go/src/testing/testing.go:954 +0x1ef fp=0xc00085dfd0 sp=0xc00085ded0 pc=0x11c7e5f
runtime.goexit()
	/Users/rsc/go/src/runtime/asm_amd64.s:1376 +0x1 fp=0xc00085dfd8 sp=0xc00085dfd0 pc=0x10abb61
created by testing.(*T).Run
	/Users/rsc/go/src/testing/testing.go:1005 +0x6d9

goroutine 16487 [running]:
panic(0x17cd040, 0x19738c0)
	/Users/rsc/go/src/runtime/panic.go:1060 +0x420 fp=0xc0003fd470 sp=0xc0003fd3c8 pc=0x10757c0
net/http.(*http2pipe).closeWithError(0xc0001e0448, 0xc0001e04a0, 0x0, 0x0, 0x0)
	/Users/rsc/go/src/net/http/h2_bundle.go:3558 +0x3cc fp=0xc0003fd558 sp=0xc0003fd470 pc=0x153859c
net/http.(*http2pipe).CloseWithError(...)
	/Users/rsc/go/src/net/http/h2_bundle.go:3545
net/http.(*http2clientConnReadLoop).cleanup(0xc0003fdfb8)
	/Users/rsc/go/src/net/http/h2_bundle.go:8228 +0x499 fp=0xc0003fd798 sp=0xc0003fd558 pc=0x155f459
runtime.call32(0x0, 0x18b30f8, 0xc0003fdf98, 0x800000008)
	/Users/rsc/go/src/runtime/asm_amd64.s:540 +0x41 fp=0xc0003fd7c8 sp=0xc0003fd798 pc=0x10a9dc1
panic(0x17f6a40, 0x1975cc0)
	/Users/rsc/go/src/runtime/panic.go:973 +0x396 fp=0xc0003fd870 sp=0xc0003fd7c8 pc=0x1075736
runtime.badTimer(...)
	/Users/rsc/go/src/runtime/time.go:1048
runtime.deltimer(0xc000a94828, 0xc0003fd968)
	/Users/rsc/go/src/runtime/time.go:326 +0x188 fp=0xc0003fd898 sp=0xc0003fd870 pc=0x1099848
time.stopTimer(0xc000a94828, 0x0)
	/Users/rsc/go/src/runtime/time.go:219 +0x2b fp=0xc0003fd8b8 sp=0xc0003fd898 pc=0x10992bb
time.(*Timer).Stop(...)
	/Users/rsc/go/src/time/sleep.go:78
net/http.http2bodyWriterState.on100(0xc0001e0420, 0xc000a94820, 0xc00086ac40, 0xc000252ff0, 0xc0009dc360, 0x3b9aca00)
	/Users/rsc/go/src/net/http/h2_bundle.go:9057 +0x73 fp=0xc0003fd940 sp=0xc0003fd8b8 pc=0x15670b3
net/http.http2bodyWriterState.on100-fm()
	/Users/rsc/go/src/net/http/h2_bundle.go:9051 +0x72 fp=0xc0003fd9b8 sp=0xc0003fd940 pc=0x1618a22
net/http.(*http2clientConnReadLoop).handleResponse(0xc0003fdfb8, 0xc0001e0420, 0xc0000cec90, 0x1000000015650d8, 0x4006a87d0, 0x19837c0)
	/Users/rsc/go/src/net/http/h2_bundle.go:8433 +0x126a fp=0xc0003fdc60 sp=0xc0003fd9b8 pc=0x1561ffa
net/http.(*http2clientConnReadLoop).processHeaders(0xc0003fdfb8, 0xc0000cec90, 0x0, 0x0)
	/Users/rsc/go/src/net/http/h2_bundle.go:8355 +0x1d0 fp=0xc0003fdd78 sp=0xc0003fdc60 pc=0x15607f0
net/http.(*http2clientConnReadLoop).run(0xc0003fdfb8, 0x0, 0x0)
	/Users/rsc/go/src/net/http/h2_bundle.go:8277 +0x81b fp=0xc0003fdf00 sp=0xc0003fdd78 pc=0x15600bb
net/http.(*http2ClientConn).readLoop(0xc0006a8780)
	/Users/rsc/go/src/net/http/h2_bundle.go:8174 +0xb7 fp=0xc0003fdfd8 sp=0xc0003fdf00 pc=0x155ec97
runtime.goexit()
	/Users/rsc/go/src/runtime/asm_amd64.s:1376 +0x1 fp=0xc0003fdfe0 sp=0xc0003fdfd8 pc=0x10abb61
created by net/http.(*http2Transport).newClientConn
	/Users/rsc/go/src/net/http/h2_bundle.go:7174 +0xbc5
@gopherbot
Copy link

@gopherbot gopherbot commented Feb 27, 2020

Change https://golang.org/cl/221298 mentions this issue: [release-branch.go1.14] runtime: don't panic on racy use of timers

gopherbot pushed a commit that referenced this issue Feb 27, 2020
If we see a racy use of timers, as in concurrent calls to Timer.Reset,
do the operations in an unpredictable order, rather than crashing.

Updates #37400
Updates #37449
Fixes #37494

Change-Id: Idbac295df2dfd551b6d762909d5040fc532c1b34
Reviewed-on: https://go-review.googlesource.com/c/go/+/221077
Run-TryBot: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
(cherry picked from commit 98858c4)
Reviewed-on: https://go-review.googlesource.com/c/go/+/221298
@ianlancetaylor ianlancetaylor modified the milestones: Go1.15, Go1.14.1 Feb 28, 2020
@gopherbot
Copy link

@gopherbot gopherbot commented Feb 28, 2020

Closed by merging 3293174 to release-branch.go1.14.

@gopherbot gopherbot closed this Feb 28, 2020
@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Feb 28, 2020

This is now fixed on tip and 1.14 branch.

peterebden added a commit to thought-machine/please-servers that referenced this issue Apr 5, 2020
It appears we are affected by golang/go#37449; I have seen one panic from it so far.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
3 participants
You can’t perform that action at this time.