Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net: deadlock in TestDialParallelSpuriousConnection on darwin-arm* #37795

Open
bcmills opened this issue Mar 11, 2020 · 13 comments
Open

net: deadlock in TestDialParallelSpuriousConnection on darwin-arm* #37795

bcmills opened this issue Mar 11, 2020 · 13 comments
Labels
arch-arm64 NeedsInvestigation OS-Darwin release-blocker
Milestone

Comments

@bcmills bcmills added OS-Darwin NeedsInvestigation mobile labels Mar 11, 2020
@bcmills bcmills added this to the Backlog milestone Mar 11, 2020
@bcmills
Copy link
Member Author

@bcmills bcmills commented Mar 11, 2020

I think everything, but especially the network, is just slow on this builder. (Compare #37322, #35692, #35498, #34837.)

(CC @toothrot @cagedmantis @andybons @dmitshur, since they may suggest a builder-based solution to this cluster of flakes.)

@gopherbot
Copy link

@gopherbot gopherbot commented Mar 11, 2020

Change https://golang.org/cl/222959 mentions this issue: net: use t.Deadline instead of an arbitrary read deadline in TestDialParallelSpuriousConnection

@bcmills
Copy link
Member Author

@bcmills bcmills commented Apr 9, 2020

@bcmills bcmills reopened this Apr 9, 2020
@bcmills bcmills changed the title net: occasional "i/o timeout" it TestDialParallelSpuriousConnection on darwin-arm* builders net: deadlock in TestDialParallelSpuriousConnection on darwin-arm* Apr 9, 2020
@bcmills
Copy link
Member Author

@bcmills bcmills commented Apr 9, 2020

The apparent deadlock in net suggests a possible connection to #34837. (Perhaps a platform bug on Corellium?)

@bcmills
Copy link
Member Author

@bcmills bcmills commented Oct 20, 2020

@bcmills
Copy link
Member Author

@bcmills bcmills commented Jan 5, 2022

greplogs --dashboard -md -l -e 'FAIL: TestDialParallelSpuriousConnection.*\n.*i/o timeout' --since=2020-10-20

2020-10-20T00:59:23-a505312/ios-arm64-corellium

Looks fixed to me: none of these in over a year.

@bcmills bcmills closed this as completed Jan 5, 2022
@bcmills
Copy link
Member Author

@bcmills bcmills commented Mar 30, 2022

Looks fixed to me: none of these in over a year.

...aaand it's back. 😞

greplogs --dashboard -md -l -e 'FAIL: TestDialParallelSpuriousConnection.*\n.*i/o timeout' --since=2022-01-05

2022-03-28T19:34:01-e84a370/darwin-arm64-11_0-toothrot
2022-03-22T16:02:31-3870705/darwin-arm64-11_0-toothrot

@bcmills bcmills reopened this Mar 30, 2022
@bcmills bcmills added arch-arm64 and removed mobile labels Mar 30, 2022
@bcmills
Copy link
Member Author

@bcmills bcmills commented Mar 30, 2022

darwin/arm64 is a first class port, so this is a release blocker for Go 1.19.

@bcmills bcmills removed this from the Backlog milestone Mar 30, 2022
@bcmills bcmills added this to the Go1.19 milestone Mar 30, 2022
@bcmills
Copy link
Member Author

@bcmills bcmills commented May 6, 2022

greplogs -l -e 'FAIL: TestDialParallelSpuriousConnection.*\n.*i/o timeout' --since=2022-03-29
2022-05-05T18:47:55-14fed2c/darwin-arm64-11_0-toothrot
2022-03-31T19:46:55-12acf9b/darwin-arm64-11_0-toothrot

@bcmills
Copy link
Member Author

@bcmills bcmills commented May 16, 2022

Looks like a slightly different failure mode on darwin-arm64-12, but I'm guessing the root cause is either the same or related: the test failed after almost exactly one minute.

--- FAIL: TestDialParallelSpuriousConnection (60.11s)
    dial_test.go:445: got read tcp4 127.0.0.1:52097->127.0.0.1:52268: read: connection reset by peer; want EOF
FAIL
FAIL	net	67.297s

greplogs -l -e 'FAIL: TestDialParallelSpuriousConnection ' --since=2022-05-06
2022-05-13T14:14:49-d4dd7ac/darwin-arm64-12

@dmitshur
Copy link
Contributor

@dmitshur dmitshur commented May 18, 2022

Running the test 100'000 times on a darwin/amd64 machine did not reproduce the problem:

$ go test -count=100000 -timeout=10h -run=TestDialParallelSpuriousConnection net
ok  	net	10214.424s

Running the same on a darwin/arm64 machine seems to reproduce after a while:

$ go test -count=100000 -timeout=10h -run=TestDialParallelSpuriousConnection net
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp4 127.0.0.1:63672->127.0.0.1:63674: read: connection reset by peer; want EOF
[...]
FAIL
FAIL	net	14609.803s
FAIL
Complete output
$ go test -count=100000 -timeout=10h -run=TestDialParallelSpuriousConnection net
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp4 127.0.0.1:63672->127.0.0.1:63674: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp6 [::1]:54481->[::1]:54482: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp6 [::1]:55077->[::1]:55078: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp4 127.0.0.1:57926->127.0.0.1:57928: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp6 [::1]:63669->[::1]:63670: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp6 [::1]:51744->[::1]:51746: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp4 127.0.0.1:55708->127.0.0.1:55710: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.15s)
    dial_test.go:445: got read tcp6 [::1]:65484->[::1]:65485: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp4 127.0.0.1:49742->127.0.0.1:49744: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.11s)
    dial_test.go:445: got read tcp6 [::1]:51501->[::1]:51503: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp4 127.0.0.1:52425->127.0.0.1:52427: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp4 127.0.0.1:60940->127.0.0.1:60942: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp4 127.0.0.1:54599->127.0.0.1:54600: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp4 127.0.0.1:51920->127.0.0.1:51922: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp4 127.0.0.1:59242->127.0.0.1:59244: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp6 [::1]:56513->[::1]:56514: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.13s)
    dial_test.go:445: got read tcp6 [::1]:58734->[::1]:58735: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp6 [::1]:59301->[::1]:59302: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.55s)
    dial_test.go:445: got read tcp4 127.0.0.1:63063->127.0.0.1:63065: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.11s)
    dial_test.go:445: got read tcp6 [::1]:64137->[::1]:64138: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.11s)
    dial_test.go:445: got read tcp6 [::1]:65499->[::1]:65501: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.11s)
    dial_test.go:445: got read tcp4 127.0.0.1:53488->127.0.0.1:53489: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.11s)
    dial_test.go:445: got read tcp6 [::1]:53839->[::1]:53840: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.11s)
    dial_test.go:445: got read tcp6 [::1]:60746->[::1]:60747: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp6 [::1]:55986->[::1]:55988: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp4 127.0.0.1:57475->127.0.0.1:57477: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.19s)
    dial_test.go:445: got read tcp4 127.0.0.1:59726->127.0.0.1:59728: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp4 127.0.0.1:60149->127.0.0.1:60151: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.57s)
    dial_test.go:445: got read tcp4 127.0.0.1:60743->127.0.0.1:60745: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp6 [::1]:62183->[::1]:62184: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp6 [::1]:49745->[::1]:49746: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp4 127.0.0.1:51823->127.0.0.1:51824: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.11s)
    dial_test.go:445: got read tcp4 127.0.0.1:52795->127.0.0.1:52796: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp6 [::1]:56493->[::1]:56495: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp4 127.0.0.1:56652->127.0.0.1:56653: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp6 [::1]:57838->[::1]:57839: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.11s)
    dial_test.go:445: got read tcp4 127.0.0.1:59977->127.0.0.1:59979: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp6 [::1]:62461->[::1]:62462: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp6 [::1]:53526->[::1]:53528: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp4 127.0.0.1:50751->127.0.0.1:50753: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp4 127.0.0.1:51924->127.0.0.1:51926: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.11s)
    dial_test.go:445: got read tcp6 [::1]:55363->[::1]:55365: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.11s)
    dial_test.go:445: got read tcp6 [::1]:58250->[::1]:58251: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp6 [::1]:60293->[::1]:60295: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp4 127.0.0.1:61091->127.0.0.1:61092: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.36s)
    dial_test.go:445: got read tcp4 127.0.0.1:63470->127.0.0.1:63472: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp4 127.0.0.1:63614->127.0.0.1:63616: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.11s)
    dial_test.go:445: got read tcp4 127.0.0.1:49644->127.0.0.1:49646: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.11s)
    dial_test.go:445: got read tcp4 127.0.0.1:50178->127.0.0.1:50179: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp4 127.0.0.1:54815->127.0.0.1:54817: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp4 127.0.0.1:56958->127.0.0.1:56960: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp6 [::1]:57606->[::1]:57607: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp4 127.0.0.1:62890->127.0.0.1:62891: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp6 [::1]:64451->[::1]:64452: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.11s)
    dial_test.go:445: got read tcp4 127.0.0.1:51542->127.0.0.1:51544: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.11s)
    dial_test.go:445: got read tcp4 127.0.0.1:52580->127.0.0.1:52582: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.11s)
    dial_test.go:445: got read tcp6 [::1]:52769->[::1]:52770: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp6 [::1]:53588->[::1]:53590: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.11s)
    dial_test.go:445: got read tcp4 127.0.0.1:57702->127.0.0.1:57704: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.14s)
    dial_test.go:445: got read tcp6 [::1]:59139->[::1]:59140: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.11s)
    dial_test.go:445: got read tcp4 127.0.0.1:60891->127.0.0.1:60892: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp4 127.0.0.1:61188->127.0.0.1:61189: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp6 [::1]:61320->[::1]:61321: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.11s)
    dial_test.go:445: got read tcp4 127.0.0.1:63618->127.0.0.1:63619: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp4 127.0.0.1:63942->127.0.0.1:63944: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp6 [::1]:65073->[::1]:65074: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp4 127.0.0.1:62988->127.0.0.1:62990: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp6 [::1]:53143->[::1]:53144: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp6 [::1]:57219->[::1]:57220: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.11s)
    dial_test.go:445: got read tcp6 [::1]:62580->[::1]:62581: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp6 [::1]:63111->[::1]:63112: read: connection reset by peer; want EOF
--- FAIL: TestDialParallelSpuriousConnection (60.10s)
    dial_test.go:445: got read tcp4 127.0.0.1:55112->127.0.0.1:55113: read: connection reset by peer; want EOF
FAIL
FAIL	net	14609.803s
FAIL

Tested with Go 1.18.2, both with macOS 12 (though slightly different minor versions).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arch-arm64 NeedsInvestigation OS-Darwin release-blocker
Projects
None yet
Development

No branches or pull requests

3 participants