net,runtime: apparent deadlock in (*net.conn).Close and runtime.netpollblock on arm64 platforms #48925

Closed · bcmills opened this issue Oct 12, 2021 · 21 comments

@bcmills (Member) commented Oct 12, 2021

greplogs --dashboard -md -l -e '(?m).*\[semacquire, \d+ minutes\]:\n(.+\n\t.+\n)*net\.\(\*conn\)\.Close'

2021-10-08T16:26:20-99c1b24/windows-arm64-10
2021-10-05T15:58:05-f1f626d/linux-arm64-aws
2021-09-30T18:10:18-e81a3d9-205640e/linux-arm64-packet
2021-09-29T18:36:18-e48cf0d/android-arm64-corellium
2021-09-28T18:58:50-7d9f5e0-9c43872/windows-arm64-10
2021-09-28T17:19:08-1c6e50a/android-arm64-corellium
2021-09-27T22:22:04-52b23a5/android-arm64-corellium
2021-09-27T19:42:34-4e4d966-315cec2/linux-arm64-packet
2021-09-24T18:21:14-f6b5ffb/android-arm64-corellium
2021-09-22T20:53:48-41bb744/linux-arm64-packet
2021-09-18T06:04:41-c084706-c894b44/windows-arm64-10
2021-09-17T18:33:15-70493b3/ios-arm64-corellium
2021-09-16T23:13:58-e09dcc2/android-arm64-corellium
2021-09-09T21:28:56-c163c31-c981874/linux-arm64-aws
2021-09-07T23:27:08-3fff213/windows-arm64-10
2021-09-03T01:38:54-9f69a44/ios-arm64-corellium
2021-09-02T19:36:22-a8aa6cf/windows-arm64-10
2021-09-02T19:16:19-9633195/ios-arm64-corellium
2021-09-02T19:16:19-9633195/linux-arm64-packet
2021-09-02T16:59:57-2a463a2/android-arm64-corellium
2021-09-02T14:02:42-b8420ba/ios-arm64-corellium
2021-09-01T22:28:21-0bfd6fc/android-arm64-corellium
2021-08-28T04:46:13-0108177/windows-arm64-10
2021-08-24T15:55:38-60bc85c-d70c69d/windows-arm64-10
2021-08-23T12:26:30-c7e354d/linux-arm64-packet
2021-08-21T00:55:22-e17439e/windows-arm64-10
2021-08-17T15:00:04-3001b0a/windows-arm64-10
2021-08-05T20:22:31-fd45e26/android-arm64-corellium
2021-07-31T23:59:40-c6fcb2d-b8ca6e5/windows-arm64-10
2021-07-26T22:15:24-bfbb288/ios-arm64-corellium
2021-07-20T19:54:36-d568e6e/ios-arm64-corellium
2021-07-16T02:43:48-aa4e0f5/linux-arm64-aws
2021-06-30T01:29:49-c45e800/windows-arm64-10
2021-05-04T14:38:36-5e4f9b0/linux-arm64-aws
2021-05-03T16:23:09-7b768d4/linux-arm64-aws
2021-04-29T17:08:05-e03cca6/linux-arm64-aws
2021-04-29T11:37:05-c4c68fb/linux-arm64-aws
2021-04-22T17:47:59-537cde0/linux-arm64-aws
2021-04-16T16:36:57-2f0e5bf/linux-arm64-packet
2021-04-14T03:15:00-2d4ba26/linux-arm64-packet
2021-04-13T21:13:09-e512bc2/ios-arm64-corellium
2021-04-09T18:49:05-c3faff7/linux-arm64-aws
2021-04-09T18:49:05-c3faff7/linux-arm64-packet
2021-04-09T14:33:44-a690a5d/android-arm64-corellium
2021-04-08T18:47:17-96a6745/linux-arm64-packet
2021-03-29T21:50:16-4e1bf8e/linux-arm64-packet
2021-03-15T23:47:56-661f3f1/linux-arm64-aws
2021-03-12T21:52:25-5ea612d-f39c4de/linux-arm64-aws
2021-03-05T22:14:48-a829114/linux-arm64-aws
2021-01-25T22:54:28-e6b6d10/android-arm64-corellium
2020-12-16T16:40:57-5abda26/android-arm64-corellium
2020-12-10T23:11:25-1fe891a/ios-arm64-corellium
2020-10-15T18:07:26-3c9488e/linux-arm64-aws
2020-09-10T14:44:25-8098dbb/linux-arm64-aws

@bcmills (Member, Author) commented Oct 12, 2021

Given that this appears to affect all arm64 platforms, it smells like a runtime or compiler bug to me.

Notably, all of these stack traces seem to feature goroutines blocked in runtime.netpollblock and at least one of runtime.semacquire1 or internal/poll.runtime_Semacquire.

@jeremyfaller, could you suggest someone to route it to?

@bcmills changed the title from "net: apparent deadlock in (*conn).Close on arm64 platforms" to "net,runtime: apparent deadlock in (*net.conn).Close and runtime.netpollblock on arm64 platforms" on Oct 12, 2021
@jeremyfaller (Contributor) commented Oct 13, 2021

Hmm. I feel like @cherrymui, @prattmic, or @dr2chase might be able to offer insights.

@dr2chase (Contributor) commented Oct 13, 2021

If I recall, arm64 needs a few more memory fences than amd64.
(And it will be really hard to figure out which ones aren't there, if this is the case.)
Do we know the particular flavors of arm involved?
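For illustration, this is the classic message-passing shape where arm64's weaker ordering bites: with plain (non-atomic) accesses to the flag, the two writes below may become visible to another core out of order, while amd64's stronger (TSO) ordering usually hides the missing barriers. The flag/data pair here is hypothetical, not the runtime's actual code:

package main

import (
	"fmt"
	"sync/atomic"
)

var (
	data  int    // the payload being published
	ready uint32 // the flag guarding it
)

func main() {
	go func() {
		data = 42
		// Release: the atomic store orders the write to data before the
		// flag becomes visible. With a plain "ready = 1", arm64 may make
		// the flag visible first, and the reader could see stale data.
		atomic.StoreUint32(&ready, 1)
	}()
	// Acquire: pairs with the atomic store above.
	for atomic.LoadUint32(&ready) == 0 {
	}
	fmt.Println(data) // guaranteed to print 42
}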

@bcmills (Member, Author) commented Oct 13, 2021

The darwin-arm64-11_0-toothrot, freebsd-arm64-dmgk, openbsd-arm64-jsing, and netbsd-arm64-bsiegert builders are notably absent from the observed failures.

That leaves: android-arm64-corellium, ios-arm64-corellium, linux-arm64-aws, linux-arm64-packet, and windows-arm64-10.

@prattmic (Member) commented Oct 13, 2021

Some notes:

  • internal/poll.(*FD).Close acquires FD.csema to wait for other users of the FD (primarily system calls) to finish.
  • Each user of the FD holds a reference count; the final decref releases FD.csema, which unblocks the Close.
  • Close also (before acquiring csema) "evicts" the FD, which should wake any blocked waiters so they can decref. (A sketch of this protocol follows.)
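A minimal sketch of that refcount-and-close protocol, assuming simplified stand-ins (a mutex-guarded counter for the real lock-free fdMutex, a buffered channel for the runtime semaphore csema); this is not the actual internal/poll source:

package fdsketch

import (
	"errors"
	"sync"
)

var errClosing = errors.New("use of closed network connection")

// FD models internal/poll.FD's close protocol. A mutex-guarded counter
// stands in for the real lock-free fdMutex, and a buffered channel
// stands in for the runtime semaphore FD.csema.
type FD struct {
	mu     sync.Mutex
	refs   int           // operations currently using the FD
	closed bool          // set by Close; rejects new increfs
	csema  chan struct{} // signaled by the final decref after close
	evict  func()        // wakes goroutines parked in the poller (pd.evict)
}

func NewFD(evict func()) *FD {
	return &FD{csema: make(chan struct{}, 1), evict: evict}
}

// incref is called at the start of every operation (Read, Accept, ...).
func (fd *FD) incref() error {
	fd.mu.Lock()
	defer fd.mu.Unlock()
	if fd.closed {
		return errClosing
	}
	fd.refs++
	return nil
}

// decref is called when an operation finishes. Once the FD is closed,
// the last user out signals csema, unblocking the pending Close.
func (fd *FD) decref() {
	fd.mu.Lock()
	fd.refs--
	last := fd.closed && fd.refs == 0
	fd.mu.Unlock()
	if last {
		fd.csema <- struct{}{}
	}
}

// Close marks the FD closed, evicts blocked waiters so their pending
// operations fail out and decref, then waits for the count to drain.
// The hangs in this issue are Closes stuck on csema because a blocked
// reader was never actually woken by evict, so it never decref'd.
func (fd *FD) Close() error {
	fd.mu.Lock()
	if fd.closed {
		fd.mu.Unlock()
		return errClosing
	}
	fd.closed = true
	drained := fd.refs == 0
	fd.mu.Unlock()

	fd.evict() // should unblock goroutines parked in the poller
	if !drained {
		<-fd.csema // wait for the final decref
	}
	return nil
}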

In https://build.golang.org/log/93f6f9bde154a60a2dcad7c9a3e7b66f4bf2f41a, we have 2 goroutines using the same stuck FD:

goroutine 9859 [semacquire, 9 minutes]:
internal/poll.runtime_Semacquire(0xffff54148070)
	/workdir/go/src/runtime/sema.go:61 +0x2c
internal/poll.(*FD).Close(0x400048b080)
	/workdir/go/src/internal/poll/fd_unix.go:116 +0x68
net.(*netFD).Close(0x400048b080)
	/workdir/go/src/net/fd_posix.go:38 +0x44
net.(*conn).Close(0x40000100a0)
	/workdir/go/src/net/net.go:207 +0x34
golang.org/x/net/http2.(*ClientConn).closeIfIdle(0x40004de600)
	/workdir/gopath/src/golang.org/x/net/http2/transport.go:888 +0x224
golang.org/x/net/http2.(*clientConnPool).closeIdleConnections(0x4001406000)
	/workdir/gopath/src/golang.org/x/net/http2/client_conn_pool.go:275 +0x148
golang.org/x/net/http2.(*Transport).CloseIdleConnections(0x10)
	/workdir/gopath/src/golang.org/x/net/http2/transport.go:544 +0x7c
golang.org/x/net/http2.(*clientTester).cleanup(0x40004b2140)
	/workdir/gopath/src/golang.org/x/net/http2/transport_test.go:760 +0x2c
sync.(*Once).doSlow(0x4001406000, 0x400070c200)
	/workdir/go/src/sync/once.go:68 +0x10c
sync.(*Once).Do(...)
	/workdir/go/src/sync/once.go:59
golang.org/x/net/http2.(*clientTester).run(0x40004b2140)
	/workdir/gopath/src/golang.org/x/net/http2/transport_test.go:786 +0x1bc
golang.org/x/net/http2.testTransportReqBodyAfterResponse(0x4000602758, 0x193)
	/workdir/gopath/src/golang.org/x/net/http2/transport_test.go:931 +0x15c
golang.org/x/net/http2.TestTransportReqBodyAfterResponse_403(0x40010c8000)
	/workdir/gopath/src/golang.org/x/net/http2/transport_test.go:824 +0x24
testing.tRunner(0x4000308000, 0x3f86b8)
	/workdir/go/src/testing/testing.go:1389 +0x110
created by testing.(*T).Run
	/workdir/go/src/testing/testing.go:1436 +0x328
goroutine 9861 [IO wait, 9 minutes]:
internal/poll.runtime_pollWait(0xffff54148070, 0x72)
	/workdir/go/src/runtime/netpoll.go:229 +0x9c
internal/poll.(*pollDesc).wait(0x400048b080, 0x4000343000, 0x0)
	/workdir/go/src/internal/poll/fd_poll_runtime.go:84 +0x2c
internal/poll.(*pollDesc).waitRead(...)
	/workdir/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0x400048b080, {0x4000343000, 0x1000, 0x1000})
	/workdir/go/src/internal/poll/fd_unix.go:167 +0x1ec
net.(*netFD).Read(0x400048b080, {0x4000343000, 0x4000065c28, 0x0})
	/workdir/go/src/net/fd_posix.go:56 +0x2c
net.(*conn).Read(0x40000100a0, {0x4000343000, 0x400000801, 0x1})
	/workdir/go/src/net/net.go:183 +0x38
bufio.(*Reader).Read(0x40003ae0c0, {0x40004ca120, 0x9, 0x40007ff640})
	/workdir/go/src/bufio/bufio.go:232 +0x1ec
io.ReadAtLeast({0x45b640, 0x40003ae0c0}, {0x40004ca120, 0x9, 0x9}, 0x9)
	/workdir/go/src/io/io.go:331 +0xac
io.ReadFull(...)
	/workdir/go/src/io/io.go:350
golang.org/x/net/http2.readFrameHeader({0x40004ca120, 0x9, 0x40007ff650}, {0x45b640, 0x40003ae0c0})
	/workdir/gopath/src/golang.org/x/net/http2/frame.go:237 +0x5c
golang.org/x/net/http2.(*Framer).ReadFrame(0x40004ca0e0)
	/workdir/gopath/src/golang.org/x/net/http2/frame.go:498 +0x84
golang.org/x/net/http2.(*clientConnReadLoop).run(0x4000065f90)
	/workdir/gopath/src/golang.org/x/net/http2/transport.go:1973 +0x134
golang.org/x/net/http2.(*ClientConn).readLoop(0x40004de600)
	/workdir/gopath/src/golang.org/x/net/http2/transport.go:1864 +0x68
created by golang.org/x/net/http2.(*Transport).newClientConn
	/workdir/gopath/src/golang.org/x/net/http2/transport.go:739 +0x8fc

The fact that the second goroutine is still blocked in Read seems to indicate that the eviction failed to unblock that goroutine, or that the second goroutine was able to acquire the FD's fdmu after close.

@prattmic (Member) commented Oct 13, 2021

I reproduced this on linux-arm64-aws quite quickly, so I'll dig in some more:

$ GOROOT=$(pwd) gomote run $(gomote-instance) ./go/bin/go test -count=1000 crypto/tls
panic: test timed out after 10m0s

goroutine 117595 [running]:
testing.(*M).startAlarm.func1()
        /workdir/go/src/testing/testing.go:1961 +0x8c
created by time.goFunc
        /workdir/go/src/time/sleep.go:176 +0x3c

goroutine 1 [chan receive, 6 minutes]:
testing.(*T).Run(0x40003e4340, {0x2cce47, 0x75bdd8b719}, 0x2eef18)
        /workdir/go/src/testing/testing.go:1425 +0x33c
testing.runTests.func1(0x0)
        /workdir/go/src/testing/testing.go:1771 +0x74
testing.tRunner(0x40003e4340, 0x4000133b68)
        /workdir/go/src/testing/testing.go:1377 +0x110
testing.runTests(0x4000106b40, {0x4e2da0, 0x95, 0x95}, {0x46c, 0x4000014280, 0x4e7040})
        /workdir/go/src/testing/testing.go:1769 +0x3e0
testing.(*M).Run(0x4000106b40)
        /workdir/go/src/testing/testing.go:1657 +0x510
crypto/tls.runMain(0x400005c180)
        /workdir/go/src/crypto/tls/handshake_test.go:381 +0x5b4
crypto/tls.TestMain(0x4000045ec8)
        /workdir/go/src/crypto/tls/handshake_test.go:333 +0x80
main.main()
        _testmain.go:353 +0x1ec

goroutine 6 [IO wait, 6 minutes]:
internal/poll.runtime_pollWait(0xffff65e8e818, 0x72)
        /workdir/go/src/runtime/netpoll.go:229 +0x9c
internal/poll.(*pollDesc).wait(0x400013a080, 0x87ac4, 0x0)
        /workdir/go/src/internal/poll/fd_poll_runtime.go:84 +0x2c
internal/poll.(*pollDesc).waitRead(...)
        /workdir/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Accept(0x400013a080)
        /workdir/go/src/internal/poll/fd_unix.go:508 +0x1d8
net.(*netFD).accept(0x400013a080)
        /workdir/go/src/net/fd_unix.go:173 +0x28
net.(*TCPListener).accept(0x400000e198)
        /workdir/go/src/net/tcpsock_posix.go:140 +0x2c
net.(*TCPListener).Accept(0x400000e198)
        /workdir/go/src/net/tcpsock.go:262 +0x30
crypto/tls.localServer({0x340ac8, 0x400000e198})
        /workdir/go/src/crypto/tls/handshake_test.go:247 +0x5c
created by crypto/tls.runMain
        /workdir/go/src/crypto/tls/handshake_test.go:353 +0x1e0

goroutine 117621 [IO wait, 6 minutes]:
internal/poll.runtime_pollWait(0xffff65e8db68, 0x72)
        /workdir/go/src/runtime/netpoll.go:229 +0x9c
internal/poll.(*pollDesc).wait(0x400013a500, 0x40004e3440, 0x0)
        /workdir/go/src/internal/poll/fd_poll_runtime.go:84 +0x2c
internal/poll.(*pollDesc).waitRead(...)
        /workdir/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0x400013a500, {0x40004e3440, 0x205, 0x205})
        /workdir/go/src/internal/poll/fd_unix.go:167 +0x1ec
net.(*netFD).Read(0x400013a500, {0x40004e3440, 0x0, 0x1c58933d99f3})
        /workdir/go/src/net/fd_posix.go:56 +0x2c
net.(*conn).Read(0x4000558058, {0x40004e3440, 0x40003f2698, 0x160ca8})
        /workdir/go/src/net/net.go:183 +0x38
crypto/tls.(*atLeastReader).Read(0x40004e05a0, {0x40004e3440, 0x0, 0x1ac18})
        /workdir/go/src/crypto/tls/conn.go:777 +0x44
bytes.(*Buffer).ReadFrom(0x4000077af8, {0x33f328, 0x40004e05a0})
        /workdir/go/src/bytes/buffer.go:204 +0x94
crypto/tls.(*Conn).readFromUntil(0x4000077880, {0x33f588, 0x4000558058}, 0x160ca8)
        /workdir/go/src/crypto/tls/conn.go:799 +0xd8
crypto/tls.(*Conn).readRecordOrCCS(0x4000077880, 0x0)
        /workdir/go/src/crypto/tls/conn.go:606 +0xdc
crypto/tls.(*Conn).readRecord(...)
        /workdir/go/src/crypto/tls/conn.go:574
crypto/tls.(*Conn).readHandshake(0x4000077880)
        /workdir/go/src/crypto/tls/conn.go:1009 +0x4c
crypto/tls.(*Conn).clientHandshake(0x4000077880, {0x340dd0, 0x40005bc480})
        /workdir/go/src/crypto/tls/handshake_client.go:179 +0x1c0
crypto/tls.(*Conn).handshakeContext(0x4000077880, {0x340e08, 0x400001a040})
        /workdir/go/src/crypto/tls/conn.go:1445 +0x3c8
crypto/tls.(*Conn).HandshakeContext(...)
        /workdir/go/src/crypto/tls/conn.go:1395
crypto/tls.(*Conn).Handshake(...)
        /workdir/go/src/crypto/tls/conn.go:1379
crypto/tls.TestHostnameInSNI.func1({0x2cacbe, 0xb})
        /workdir/go/src/crypto/tls/handshake_client_test.go:1439 +0x154
created by crypto/tls.TestHostnameInSNI
        /workdir/go/src/crypto/tls/handshake_client_test.go:1438 +0x184

goroutine 117585 [semacquire, 6 minutes]:
internal/poll.runtime_Semacquire(0xffff65e8db68)
        /workdir/go/src/runtime/sema.go:61 +0x2c
internal/poll.(*FD).Close(0x400013a500)
        /workdir/go/src/internal/poll/fd_unix.go:116 +0x68
net.(*netFD).Close(0x400013a500)
        /workdir/go/src/net/fd_posix.go:38 +0x44
net.(*conn).Close(0x4000558058)
        /workdir/go/src/net/net.go:207 +0x34
crypto/tls.TestHostnameInSNI(0x40003e4820)
        /workdir/go/src/crypto/tls/handshake_client_test.go:1453 +0x278
testing.tRunner(0x40003e4820, 0x2eef18)
        /workdir/go/src/testing/testing.go:1377 +0x110
created by testing.(*T).Run
        /workdir/go/src/testing/testing.go:1424 +0x328
FAIL    crypto/tls      600.010s
FAIL
Error running run: exit status 1

@prattmic (Member) commented Oct 13, 2021

More logging:

[0.719056680 P 0] fd 0x4000362200 Read: enter
[0.719057213 P 0] fd 0x4000362200 Read: lock done
[0.719057533 P 0] fd 0x4000362200 Read: prepare done
[0.719057828 P 0] fd 0x4000362200 Read: syscall.Read enter
[0.719058977 P 0] fd 0x4000362200 Read: syscall.Read error
[0.719059264 P 0] fd 0x4000362200 Read: waitRead enter
[0.719059420 P 1] fd 0x4000362200 Close: enter
[0.719059519 P 0] pd 0xffff65c5a678 pollWait: enter
[0.719059781 P 0] pd 0xffff65c5a678 pollWait: netpollcheckerr done
[0.719059798 P 1] fd 0x4000362200 Close: increfAndClose success
[0.719060077 P 0] pd 0xffff65c5a678 pollWait: netpollblock enter
[0.719060150 P 1] pd 0xffff65c5a678 pollUnblock: enter
[0.719060511 P 1] pd 0xffff65c5a678 pollUnblock: netpollunblock done
[0.719060881 P 1] fd 0x4000362200 Close: evict done
[0.719061168 P 1] fd 0x4000362200 Close: decref done
[0.719061455 P 1] fd 0x4000362200 Close: block enter

My read of this is that runtime.poll_runtime_pollUnblock is failing to properly wake the reader, which is in poll_runtime_pollWait, but may or may not have yet fully blocked in netpollblock.

The synchronization between netpollblock and netpollunblock looks a bit suspicious to me, particularly on a weakly ordered architecture like ARM, but I will need to give it more thought tomorrow.

@prattmic (Member) commented Oct 13, 2021

One theory:

Inside netpollblock and netpollunblock, gpp is read using a non-atomic read. In the normal case, writes use atomics, creating a barrier.

However, when consuming a ready event, netpollblock resets gpp using a non-atomic store, thus skipping a barrier.

In netpollunblock, we can return early if gpp == pdReady, not actually performing an unblock.

Thus I posit that an earlier netpollblock sees pdReady and resets it to 0. Later, on another core, we read gpp, see the stale pdReady, and incorrectly return early without triggering a new unblock.

I haven't validated this theory, but my log does show that this pollDesc has earlier calls to netpollblock, so it is possible. If this is the case, I believe the fix is simply to use an atomic store to reset gpp to 0.
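In sketch form (heavily simplified: the real state machine in runtime/netpoll.go also tracks a parked goroutine pointer and the pdWait transition via CAS loops; gpp here is a hypothetical stand-in for pd.rg/pd.wg, and the function names only loosely follow the runtime's):

package netpollsketch

import "sync/atomic"

const (
	pdReady uintptr = 1 // an I/O event arrived and has not been consumed
	pdWait  uintptr = 2 // a goroutine is committing to park
)

// Consumer side (cf. netpollblock): consume a pending ready event.
// Before the fix, the reset was a plain store ("*gpp = 0"), so other
// cores could keep observing the stale pdReady.
func consumeReady(gpp *uintptr) bool {
	if atomic.LoadUintptr(gpp) == pdReady {
		atomic.StoreUintptr(gpp, 0) // the fix: reset with an atomic store
		return true
	}
	return false
}

// Waker side (cf. netpollunblock): mark the descriptor ready and report
// whether a parked goroutine must be woken. With the non-atomic reset
// above, the load here could observe a stale pdReady and take the early
// return, never waking a goroutine that has since parked -- which is
// exactly the posited deadlock: Close then waits on csema forever.
func wake(gpp *uintptr) (mustWake bool) {
	for {
		old := atomic.LoadUintptr(gpp)
		if old == pdReady {
			return false // already ready; nothing to wake (the suspect path)
		}
		if atomic.CompareAndSwapUintptr(gpp, old, pdReady) {
			return old == pdWait
		}
	}
}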

@ianlancetaylor (Contributor) commented Oct 14, 2021

@harshavardhana (Contributor) commented Oct 14, 2021

> One theory:
>
> Inside netpollblock and netpollunblock, gpp is read using a non-atomic read. In the normal case, writes use atomics, creating a barrier.
>
> However, when consuming a ready event, netpollblock resets gpp using a non-atomic store, thus skipping a barrier.
>
> In netpollunblock, we can return early if gpp == pdReady, not actually performing an unblock.
>
> Thus I posit that an earlier netpollblock sees pdReady and resets it to 0. Later, on another core, we read gpp, see the stale pdReady, and incorrectly return early without triggering a new unblock.
>
> I haven't validated this theory, but my log does show that this pollDesc has earlier calls to netpollblock, so it is possible. If this is the case, I believe the fix is simply to use an atomic store to reset gpp to 0.

Does this only affect arm64?

@ianlancetaylor (Contributor) commented Oct 14, 2021

> Does this only affect arm64?

It's a fair question why this would affect arm64 and not arm32.

@prattmic (Member) commented Oct 14, 2021

Using all atomic loads and stores does make this issue disappear, so I think that's it. Though I also can't explain why we haven't seen this on other architectures, especially arm32.

@bcmills (Member, Author) commented Oct 14, 2021

Are aligned 32-bit stores always atomic (FSVO “atomic”) on arm32? Maybe the failure mode happens from a torn 64-bit uintptr store.

@LaPingvino commented Oct 14, 2021

fwiw, I think bash all.bash on my arm64 Chromebook fails on the same tests... I could probably test a possible fix...

@gopherbot commented Oct 14, 2021

Change https://golang.org/cl/355952 mentions this issue: runtime: consistently access pollDesc r/w Gs with atomics

@LaPingvino commented Oct 15, 2021

Trying just this code change on top of a tip checkout doesn't fix the tests failing on my arm64 Chromebook, though that might be an unrelated issue (or issues). I haven't yet taken the time to fully isolate whether this is the bug that makes the tests fail, but from the kind of failure it seems likely to me: mostly test timeouts, which look like possible deadlocks, and the place where it failed now (FAIL runtime 541.176s, with the stack trace touching poll) suspiciously looks like code touched by that change.

Also, I have a weak suspicion that the failing tests differ between runs. I'll keep a closer eye on it from now on; if necessary I can paste some stack traces.

@prattmic (Member) commented Oct 15, 2021

@LaPingvino give the new version a try, though it sounds like you may be having a different issue. Logs would definitely help.

@gopherbot please backport to 1.17 and 1.16. This is a pre-existing issue with no workaround, and it is causing flakes on trybots.

@gopherbot commented Oct 15, 2021

Backport issue(s) opened: #49009 (for 1.16), #49010 (for 1.17).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://golang.org/wiki/MinorReleases.

@LaPingvino commented Oct 15, 2021

@prattmic it does seem to have fixed the failing tests on my Debian arm64 in Crostini (Chromebook). runtime took a long time to test (431.701s), but it didn't fail this time. Another test fails at the moment, but I don't think that one is related; now it's a cmd/go test that fails.

@gopherbot commented Oct 15, 2021

Change https://golang.org/cl/356369 mentions this issue: [release-branch.go1.17] runtime: consistently access pollDesc r/w Gs with atomics

@gopherbot commented Oct 15, 2021

Change https://golang.org/cl/356370 mentions this issue: [release-branch.go1.16] runtime: consistently access pollDesc r/w Gs with atomics
