Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime,net/http: apparent deadlock in (*connReader).abortPendingRead via (*sync.Cond).Wait on ARM and ARM64 #50469

Closed
bcmills opened this issue Jan 6, 2022 · 9 comments
Labels
arch-arm Issues solely affecting the 32-bit arm architecture. arch-arm64 NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. release-blocker
Milestone

Comments

@bcmills
Copy link
Member

bcmills commented Jan 6, 2022

greplogs --dashboard -md -l -e net/http\\..\*\\n\\t.\*\\nnet/http_test\\.TestTLSServerClosesConnection

2022-01-05T21:22:03-2b39d86/windows-arm64-10

goroutine 3455 [sync.Cond.Wait, 2 minutes]:
runtime.gopark(0x4000759858?, 0x7ff7d54bd570?, 0x0?, 0x40?, 0xfd52ed98c?)
	C:/workdir/go/src/runtime/proc.go:366 +0xe8 fp=0x40006c5800 sp=0x40006c57e0 pc=0x7ff7d53194a8
runtime.goparkunlock(...)
	C:/workdir/go/src/runtime/proc.go:372
sync.runtime_notifyListWait(0x4000522090, 0x0)
	C:/workdir/go/src/runtime/sema.go:513 +0x178 fp=0x40006c5850 sp=0x40006c5800 pc=0x7ff7d5342958
sync.(*Cond).Wait(0x4000522080)
	C:/workdir/go/src/sync/cond.go:56 +0xd8 fp=0x40006c5890 sp=0x40006c5850 pc=0x7ff7d534f388
net/http.(*connReader).abortPendingRead(0x400019c420)
	C:/workdir/go/src/net/http/server.go:720 +0xc8 fp=0x40006c58e0 sp=0x40006c5890 pc=0x7ff7d55548e8
net/http.(*conn).hijackLocked(0x40005ba000)
	C:/workdir/go/src/net/http/server.go:315 +0x34 fp=0x40006c5960 sp=0x40006c58e0 pc=0x7ff7d5553284
net/http.(*response).Hijack(0x40005e2000)
	C:/workdir/go/src/net/http/server.go:2034 +0x104 fp=0x40006c59e0 sp=0x40006c5960 pc=0x7ff7d555b114
net/http_test.TestTLSServerClosesConnection.func1({0x7ff7d57f6750?, 0x40005e2000}, 0x4000759ab8?)
	C:/workdir/go/src/net/http/transport_test.go:3245 +0x74 fp=0x40006c5a50 sp=0x40006c59e0 pc=0x7ff7d564acf4
net/http.HandlerFunc.ServeHTTP(0x0?, {0x7ff7d57f6750?, 0x40005e2000?}, 0x7ff7d5321cec?)
	C:/workdir/go/src/net/http/server.go:2084 +0x3c fp=0x40006c5a80 sp=0x40006c5a50 pc=0x7ff7d555b55c
net/http.serverHandler.ServeHTTP({0x400019c420?}, {0x7ff7d57f6750, 0x40005e2000}, 0x400019b700)
	C:/workdir/go/src/net/http/server.go:2916 +0x408 fp=0x40006c5b40 sp=0x40006c5a80 pc=0x7ff7d555ebc8
net/http.(*conn).serve(0x40005ba000, {0x7ff7d57f6ca8, 0x40000297d0})
	C:/workdir/go/src/net/http/server.go:1966 +0x550 fp=0x40006c5fa0 sp=0x40006c5b40 pc=0x7ff7d555a290
net/http.(*Server).Serve.func3()
	C:/workdir/go/src/net/http/server.go:3071 +0x34 fp=0x40006c5fd0 sp=0x40006c5fa0 pc=0x7ff7d555f664
runtime.goexit()
	C:/workdir/go/src/runtime/asm_arm64.s:1259 +0x4 fp=0x40006c5fd0 sp=0x40006c5fd0 pc=0x7ff7d5346dc4
created by net/http.(*Server).Serve
	C:/workdir/go/src/net/http/server.go:3071 +0x448

…

goroutine 2865 [select, 2 minutes]:
runtime.gopark(0x40006eb668?, 0x6?, 0x48?, 0xb3?, 0x40006eb474?)
	C:/workdir/go/src/runtime/proc.go:366 +0xe8 fp=0x40006eb2f0 sp=0x40006eb2d0 pc=0x7ff7d53194a8
runtime.selectgo(0x40006eb668, 0x40006eb468, 0x7ff7d5726e7a?, 0x0, 0x40006eb528?, 0x1)
	C:/workdir/go/src/runtime/select.go:328 +0x6d0 fp=0x40006eb410 sp=0x40006eb2f0 pc=0x7ff7d5329b50
net/http.(*persistConn).roundTrip(0x40000d6000, 0x40000a2480)
	C:/workdir/go/src/net/http/transport.go:2614 +0x804 fp=0x40006eb6d0 sp=0x40006eb410 pc=0x7ff7d55767c4
net/http.(*Transport).roundTrip(0x4000624140, 0x400010c200)
	C:/workdir/go/src/net/http/transport.go:594 +0x6c0 fp=0x40006eb910 sp=0x40006eb6d0 pc=0x7ff7d556b4d0
net/http.(*Transport).RoundTrip(0x40002c0210?, 0x7ff7d57f3a28?)
	C:/workdir/go/src/net/http/roundtrip.go:17 +0x20 fp=0x40006eb930 sp=0x40006eb910 pc=0x7ff7d55530a0
net/http.send(0x400010c200, {0x7ff7d57f3a28, 0x4000624140}, {0x7ff7d5516f18?, 0x8?, 0x0?})
	C:/workdir/go/src/net/http/client.go:252 +0x4ec fp=0x40006ebb10 sp=0x40006eb930 pc=0x7ff7d551561c
net/http.(*Client).send(0x40002185a0, 0x400010c200, {0x80000?, 0x10000?, 0x0?})
	C:/workdir/go/src/net/http/client.go:176 +0xa4 fp=0x40006ebb90 sp=0x40006ebb10 pc=0x7ff7d5514fb4
net/http.(*Client).do(0x40002185a0, 0x400010c200)
	C:/workdir/go/src/net/http/client.go:725 +0x714 fp=0x40006ebda0 sp=0x40006ebb90 pc=0x7ff7d5516f74
net/http.(*Client).Do(...)
	C:/workdir/go/src/net/http/client.go:593
net/http.(*Client).Get(0x4000624140?, {0x400026c030?, 0x0?})
	C:/workdir/go/src/net/http/client.go:480 +0x7c fp=0x40006ebe00 sp=0x40006ebda0 pc=0x7ff7d55165dc
net/http_test.TestTLSServerClosesConnection(0x40006101a0)
	C:/workdir/go/src/net/http/transport_test.go:3263 +0x1bc fp=0x40006ebf60 sp=0x40006ebe00 pc=0x7ff7d564a7fc
testing.tRunner(0x40006101a0, 0x7ff7d576f808)
	C:/workdir/go/src/testing/testing.go:1440 +0x110 fp=0x40006ebfb0 sp=0x40006ebf60 pc=0x7ff7d53d7550
testing.(*T).Run.func1()
	C:/workdir/go/src/testing/testing.go:1487 +0x30 fp=0x40006ebfd0 sp=0x40006ebfb0 pc=0x7ff7d53d82c0
runtime.goexit()
	C:/workdir/go/src/runtime/asm_arm64.s:1259 +0x4 fp=0x40006ebfd0 sp=0x40006ebfd0 pc=0x7ff7d5346dc4
created by testing.(*T).Run
	C:/workdir/go/src/testing/testing.go:1487 +0x328

CC @neild

See previously #7634 (CC @bradfitz).

@bcmills bcmills added this to the Backlog milestone Jan 6, 2022
@bcmills bcmills added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jan 6, 2022
@bcmills
Copy link
Member Author

bcmills commented Jan 6, 2022

This looks to me like a probable bug in (*connReader).abortPendingRead on ARM CPUs.

greplogs --dashboard -md -l -e '^goroutine \d+ \[sync\.Cond\.Wait, \d+ minutes\]:\n(?:.+\n\t.+\n)*net/http\.\(\*connReader\)\.abortPendingRead'

2022-01-05T21:22:03-2b39d86/windows-arm64-10
2022-01-04T22:05:15-e39ab9b/windows-arm64-10
2021-12-02T22:05:01-f9b2733/windows-arm64-10
2021-12-02T22:03:11-d3bf627/windows-arm64-10
2021-11-15T18:43:07-0e65410/plan9-arm
2021-11-15T17:10:25-49ce184-f986191/plan9-arm
2021-11-12T18:10:48-0c6a6cd/windows-arm64-10
2021-09-27T05:10:56-54079df/darwin-arm64-11_0-toothrot
2021-09-22T22:55:16-1319b14/windows-arm64-10
2021-09-21T20:28:50-2fc7df9/windows-arm64-10
2021-09-21T16:34:40-901ac46/plan9-arm
2021-09-09T22:20:16-1a708bc/windows-arm64-10
2021-08-23T22:55:34-be1a693/windows-arm64-10
2021-08-18T23:01:08-a32a1a4-0d53084/darwin-arm64-11_0-toothrot
2021-03-10T09:01:05-cf59850/darwin-arm64-11_0-toothrot
2021-03-05T23:13:20-fb03be9/linux-arm64-packet
2021-03-04T06:36:17-4b8b2c5/linux-arm64-packet
2020-06-01T23:25:14-0eb9767/plan9-arm
2020-05-31T02:07:28-06a226f-f1f8f9a/plan9-arm
2020-05-20T03:40:22-4ec4a79/plan9-arm
2020-05-19T19:55:14-1361738/plan9-arm
2020-05-14T16:53:34-1f325af/plan9-arm
2020-05-14T15:18:29-364a05e/plan9-arm
2020-05-07T20:18:02-3321303/plan9-arm
2020-05-05T03:40:39-b40c658/plan9-arm
2020-04-29T22:40:20-a7e5396/plan9-arm
2020-04-29T18:50:32-202c43b/plan9-arm
2020-04-29T04:16:08-67d4087/plan9-arm
2020-04-28T21:46:16-b1b6784/plan9-arm
2020-04-27T18:40:56-729f1e8-08bf64a/plan9-arm
2020-04-27T16:56:52-729f1e8-f1a2a0e/plan9-arm
2020-04-27T16:18:30-f1a2a0e/plan9-arm
2020-04-24T01:48:48-f5fcc9b/plan9-arm
2020-04-21T00:09:00-eec981e/plan9-arm
2020-04-20T02:39:16-62ccee4/plan9-arm
2020-04-18T18:07:52-2a20f5c/plan9-arm
2020-04-13T20:38:15-300ed43/plan9-arm
2020-04-11T16:20:57-b78109e/plan9-arm
2020-04-11T15:42:09-fd18f3b/plan9-arm
2020-04-09T15:56:51-a3ef804/plan9-arm
2020-03-25T22:44:57-2975b27/plan9-arm
2020-03-24T20:19:40-191118a/plan9-arm
2020-03-23T17:07:22-67c2dcb/plan9-arm
2020-03-18T19:44:13-0205790/plan9-arm
2020-03-18T17:29:01-1b76d66-6197104/plan9-arm
2020-03-15T08:14:24-dc32553/plan9-arm
2020-03-15T07:51:41-85a8526/plan9-arm
2020-03-12T20:40:54-29b36a8/plan9-arm
2020-03-12T17:37:13-040855e/plan9-arm
2020-03-09T20:45:44-78000ba-d49fecc/plan9-arm
2020-03-05T19:59:18-bda42a7/plan9-arm
2020-02-29T10:21:33-5756808/plan9-arm
2020-02-27T21:33:33-62ff72d/plan9-arm
2020-02-24T15:55:01-90c71ce/plan9-arm
2020-02-24T08:31:48-31acdcc/plan9-arm
2020-02-22T04:31:20-f9c51e3/plan9-arm
2020-02-22T03:34:22-af686da/plan9-arm
2020-02-20T14:59:27-73269b8/plan9-arm
2020-02-19T23:28:35-8fa2b6d/plan9-arm
2020-02-19T22:04:41-3457359/plan9-arm
2020-02-13T16:25:03-e237df5/plan9-arm
2020-02-07T18:07:47-cb16d26/plan9-arm
2020-01-31T18:50:59-d91c3bc/plan9-arm
2020-01-31T06:02:00-69ecbb4-b7689f5/plan9-arm
2020-01-30T22:33:02-b8dd33b/plan9-arm
2020-01-27T14:32:54-f652015/plan9-arm
2020-01-25T05:48:48-69660ed/plan9-arm
2020-01-23T07:33:33-fa610c7/plan9-arm
2020-01-22T18:10:42-895b7c8/plan9-arm
2020-01-22T16:57:26-fb2e3b8/plan9-arm
2020-01-19T14:04:09-8e0be05/plan9-arm
2020-01-10T10:04:49-5d076d5/plan9-arm
2020-01-02T22:26:56-73b657e/plan9-arm
2019-12-26T21:16:03-cd1b9c1/plan9-arm
2019-12-20T19:01:41-21713f4/plan9-arm
2019-12-20T15:34:39-3f51350/plan9-arm
2019-12-15T15:34:17-a3dc6da/plan9-arm
2019-12-11T19:51:34-1b1fbb3/plan9-arm
2019-12-09T15:40:12-9d4717d/plan9-arm
2019-12-09T02:02:17-daaab44/plan9-arm
2019-11-25T23:37:03-87805c9/plan9-arm
2019-11-15T22:47:41-72f333a/plan9-arm
2019-11-15T21:02:07-d856e05/plan9-arm
2019-11-12T05:35:33-99957b6/plan9-arm
2019-11-05T21:26:19-649f341/plan9-arm
2019-11-01T21:32:33-08a72c8/plan9-arm
2019-11-01T01:51:26-efd395f/plan9-arm
2019-10-28T21:45:30-60f34f7/plan9-arm
2019-10-26T07:06:49-7e68f81/plan9-arm
2019-10-25T17:59:05-6b6e67f/plan9-arm
2019-10-24T13:58:59-05ee506/plan9-arm
2019-10-22T21:40:27-220150f/plan9-arm
2019-10-16T00:05:37-c4817f5/plan9-arm
2019-10-14T22:39:28-1ed932d/plan9-arm
2019-10-12T09:17:14-3c56eb4/plan9-arm
2019-10-10T22:38:38-03ef105/plan9-arm
2019-10-09T23:03:40-aa09e75/plan9-arm
2019-10-07T18:50:14-a0894ea/plan9-arm
2019-10-04T21:22:55-5f4aa5d/plan9-arm
2019-10-04T17:09:10-c8e7c53/plan9-arm
2019-09-30T23:42:51-ecc2d61/plan9-arm
2019-09-23T19:40:07-376fc48/plan9-arm
2019-09-21T21:01:41-9c0e56b/plan9-arm
2019-09-20T05:02:16-66e562c/plan9-arm
2019-09-19T15:36:49-5d548f1/plan9-arm
2019-09-18T18:32:20-e70e0a6/plan9-arm
2019-09-16T21:37:16-0e0bff8/plan9-arm
2019-09-11T00:16:10-a5026af/plan9-arm
2019-09-06T17:41:07-c78ac39/plan9-arm
2019-09-04T21:52:18-aae0b5b/plan9-arm
2019-09-04T20:38:05-d21953d/plan9-arm
2019-08-29T19:32:42-abda0a6/plan9-arm
2019-08-28T18:38:52-7ed973b/plan9-arm
2019-08-23T17:03:17-145e193/plan9-arm
2019-08-10T20:06:28-3928915/plan9-arm
2019-08-06T11:13:05-546ea78/plan9-arm
2019-07-29T20:29:54-7b8234b/plan9-arm
2019-07-08T18:21:41-0fddd66/plan9-arm
2019-06-19T18:13:09-7119f4b/plan9-arm
2019-06-14T18:41:07-9838f4d/plan9-arm
2019-06-13T17:44:27-6a9b22c/plan9-arm
2019-06-10T20:32:02-39f7561/plan9-arm
2019-06-06T19:44:12-f5e5bc1/plan9-arm

@bcmills bcmills changed the title net/http: apparent deadlock in TestTLSServerClosesConnection net/http: apparent deadlock in (*connReader).abortPendingRead via (*sync.Cond).Wait Jan 6, 2022
@bcmills bcmills added arch-arm Issues solely affecting the 32-bit arm architecture. arch-arm64 labels Jan 6, 2022
@bcmills bcmills changed the title net/http: apparent deadlock in (*connReader).abortPendingRead via (*sync.Cond).Wait net/http: apparent deadlock in (*connReader).abortPendingRead via (*sync.Cond).Wait on ARM and ARM64 Jan 6, 2022
@bcmills
Copy link
Member Author

bcmills commented Jan 6, 2022

This bug affects (and has a nonzero number of failures on builders for) darwin/arm64 and linux/arm64, which are first-class ports. However, it appears to be a very long-standing bug and might be difficult to debug or invasive to fix.

Marking as release-blocker, but for Go 1.19 rather than 1.18.

@bcmills bcmills modified the milestones: Backlog, Go1.19 Jan 6, 2022
@bcmills bcmills changed the title net/http: apparent deadlock in (*connReader).abortPendingRead via (*sync.Cond).Wait on ARM and ARM64 runtime,net/http: apparent deadlock in (*connReader).abortPendingRead via (*sync.Cond).Wait on ARM and ARM64 Jan 7, 2022
@chengxuncc
Copy link

chengxuncc commented Feb 6, 2022

It happens many times to me when I'm using gorilla/websocket, releated issue #20963 gorilla/websocket#246.
This appears not only on arm, but also windows/amd64, linux/amd64 and darwin/amd64.

@rhysh
Copy link
Contributor

rhysh commented Mar 9, 2022

I'm able to reproduce this on linux/arm64. I've been focusing on the failure in 2021-03-04T06:36:17-4b8b2c5/linux-arm64-packet, which means running the tests for net/http/httputil in "short" mode at Go commit 4b8b2c5.

A machine with 4 cores / 4 HW threads works fine. Trying to run the test more often (on a bigger machine) leads to ephemeral port exhaustion.

$ uname -a
Linux ip-172-31-53-48.us-west-2.compute.internal 5.10.102-99.473.amzn2.aarch64 #1 SMP Wed Mar 2 19:14:15 UTC 2022 aarch64 aarch64 aarch64 GNU/Linux
$ ~/goroot/bin/go version
go version devel +4b8b2c5864 Thu Mar 4 06:36:17 2021 +0000 linux/arm64

In the failures I've gotten on linux/arm64, and in the five dashboard failures across linux/arm64 and darwin/arm64, I see that the goroutine blocked in sync.Cond.Wait really is waiting on another goroutine. The address in net/http.(*connReader).abortPendingRead on its stack shows up on another goroutine's stack at net/http.(*connReader).backgroundRead. That second goroutine is waiting inside the netpoller.

Maybe a problem with SetReadDeadline(aLongTimeAgo) failing to abort in-flight reads?

@rhysh
Copy link
Contributor

rhysh commented Mar 14, 2022

I tracked the 2021-03-04 failure to the netpoller, so I see this issue as closely related to #48925 (closed on 2021-10-15) and #45211 (closed on 2022-01-14).

I have a reproducer that gave results within a few minutes on my darwin/arm64 M1 machine, but it stopped working as of the 2021-10-15 commit.

Note that the linux/arm64 and darwin/arm64 failures in this issue (mentioning abortPendingRead and sync.Cond) are all from before 2021-10-15. The more recent failures are on plan9/arm (a 32-bit system) and windows/arm64 (apparently run under qemu). The discussion in #45211 mentions runtime.arm64HasATOMICS as a relevant setting. That wouldn't be available on GOARCH=arm, and I wonder if it's also not available in the windows/arm64 qemu environment, and if that explains the failures between 2021-10-15 and 2022-01-14. And I don't see any dashboard failures mentioning abortPendingRead and sync.Cond after 2022-01-14.


Here's the reproducer that worked until the 2021-10-15 / #48925 fix
package main

import (
	"flag"
	"fmt"
	"io"
	"log"
	"net/http"
	"net/http/httptest"
	"sync"
)

func main() {
	count := flag.Int("count", 1000, "Number of requests to issue per worker")
	workers := flag.Int("workers", 10, "Number of workers to run")
	flag.Parse()

	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		out := "this call was relayed by the reverse proxy"
		// Coerce a wrong content length to induce io.ErrUnexpectedEOF
		w.Header().Set("Content-Length", fmt.Sprintf("%d", len(out)*2))
		fmt.Fprintln(w, out)
	}))
	defer srv.Close()

	var wg sync.WaitGroup
	for i := 0; i < *workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < *count; i++ {
				resp, err := http.Get(srv.URL)
				if err != nil {
					log.Fatalf("GET: %v", err)
				}
				_, err = io.Copy(io.Discard, resp.Body)
				if err == nil {
					log.Fatalf("Expected error, io.ErrUnexpectedEOF")
				}
				resp.Body.Close()
			}
		}()
	}
	wg.Wait()
}

@rhysh
Copy link
Contributor

rhysh commented Mar 14, 2022

When I disable arm64HasATOMICS = cpu.ARM64.HasATOMICS in src/runtime/proc.go, the reproducer I shared above is able to fail again on my darwin/arm64 machine even after the 2021-10-15 change. It fails at the parent of the 2022-01-14 change (at commit b41185c) with the usual abortPendingRead / sync.Cond deadlock. With stress -p=1 /tmp/q -workers=1 -count=1000, I see a failure rate of a few percent (and about 50 runs per minute).

At 17b2fb1 (the 2022-01-14 change) with arm64HasATOMICS disabled, it no longer reproduces that failure. Instead of a failure rate of a few percent, I've seen 0 failures out of 1000+ runs.

@bcmills
Copy link
Member Author

bcmills commented Mar 17, 2022

Still reproducing on the builders. @rhysh, do you have an angle on a fix?

greplogs --dashboard -md -l -e '^goroutine \d+ \[sync\.Cond\.Wait, \d+ minutes\]:\n(?:.+\n\t.+\n)*net/http\.\(\*connReader\)\.abortPendingRead' --since=2022-01-07

2022-03-15T18:18:20-5461726-88be85f/windows-arm64-10

@rhysh
Copy link
Contributor

rhysh commented Mar 17, 2022

The 2022-03-15T18:18:20-5461726-88be85f/windows-arm64-10 log starts with windows-arm64-10 at 88be85f18bf0244a2470fdf6719e1b5ca5a5e50a building website at 5461726d879af8cbbfe90e38dc26c8320b1b7d49. That Go commit, 88be85f, is in release-branch.go1.17, which does not contain rsc's fix https://go.dev/cl/392714.

I think that @prattmic fixed it for machines with arm64HasATOMICS in #48925. Yesterday's windows/arm64 failure includes a backport of that fix.

I think that @rsc fixed it for all ARM/ARM64/"weak memory" machines in #45211. That fix has not yet been released in the Go 1.17 series, so was not present during yesterday's windows/arm64 failure.

The only angle I have on a fix is the backport described in #50611. I think that backport is necessary and sufficient.

@bcmills
Copy link
Member Author

bcmills commented Apr 6, 2022

The backport in #50611 is merged, so I'm going to close this issue on the theory that it is fixed. (I'll reopen if I see it again on the builders.)

@rhysh, thank you for the analysis!

@bcmills bcmills closed this as completed Apr 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arch-arm Issues solely affecting the 32-bit arm architecture. arch-arm64 NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. release-blocker
Projects
None yet
Development

No branches or pull requests

3 participants