Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net: TestDialerLocalAddr and TestNetpollDeadlock are flaky on macOS 10.12 #22019

Open
dr2chase opened this issue Sep 25, 2017 · 15 comments

Comments

@dr2chase
Copy link
Contributor

commented Sep 25, 2017

What version of Go are you using (go version)?

development (1.9+)
go version devel +4fa3d69e00 Fri Sep 22 14:03:36 2017 -0400 darwin/amd64

Does this issue reproduce with the latest release?

intermittently. That's the bug.

What operating system and processor architecture are you using (go env)?

go env
GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/drchase/GoogleDrive/work/gocode"
GORACE=""
GOROOT="/Users/drchase/GoogleDrive/work/go"
GOTOOLDIR="/Users/drchase/GoogleDrive/work/go/pkg/tool/darwin_amd64"
GCCGO="gccgo"
CC="clang"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/y4/90mx9nxx54bfvgh9jh0_410h0095tn/T/go-build459430235=/tmp/go-build -gno-record-gcc-switches -fno-common"
CXX="clang++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"

If you run run.bash repeatedly on a Mac, for instance running a binary search to isolate a single failing function, the test sometimes fails. This then causes the binary search to go down a false path and not work.

@ianlancetaylor ianlancetaylor changed the title net/TestDialerLocalAddr is flaky on Mac and should be disabled net: TestDialerLocalAddr is flaky on Mac and should be disabled Sep 25, 2017
@ianlancetaylor ianlancetaylor added this to the Go1.10 milestone Sep 25, 2017
@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Sep 25, 2017

CC @mikioh

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Jan 3, 2018

Here are some failures from the builders:

2017-12-01T00:23:05-301b127/darwin-amd64-10_12:
--- FAIL: TestDialerLocalAddr (75.11s)
	dial_test.go:642: tcp6 [::]:0->::1: got dial tcp6 [::]:0->[::1]:59465: getsockopt: operation timed out; want <nil>
FAIL
FAIL	net

2017-12-01T15:43:52-08176b2/darwin-amd64-10_12:
--- FAIL: TestDialerLocalAddr (75.03s)
	dial_test.go:642: tcp [::1]:0->::1: got dial tcp [::1]:0->[::1]:49317: getsockopt: operation timed out; want <nil>
FAIL
FAIL	net

2017-12-01T20:54:31-7b57e21/darwin-amd64-10_12:
--- FAIL: TestDialerLocalAddr (75.05s)
	dial_test.go:642: tcp6 :0->::1: got dial tcp6 :0->[::1]:59475: getsockopt: operation timed out; want <nil>
FAIL
FAIL	net

2017-12-01T21:05:40-8bb51a7/darwin-amd64-10_12:
--- FAIL: TestDialerLocalAddr (75.03s)
	dial_test.go:642: tcp [::]:0->::1: got dial tcp [::]:0->[::1]:49294: getsockopt: operation timed out; want <nil>
FAIL
FAIL	net

2017-12-01T23:50:23-bfa7a55/darwin-amd64-10_12:
--- FAIL: TestDialerLocalAddr (75.03s)
	dial_test.go:642: tcp 0.0.0.0:0->::1: got dial tcp 0.0.0.0:0->[::1]:59413: getsockopt: operation timed out; want <nil>
FAIL
FAIL	net

2017-12-04T21:45:40-9d70b3a/darwin-amd64-10_12:
--- FAIL: TestDialerLocalAddr (75.02s)
	dial_test.go:642: tcp6 0.0.0.0:0->::1: got dial tcp6 0.0.0.0:0->[::1]:49309: getsockopt: operation timed out; want <nil>
FAIL
FAIL	net

2017-12-05T18:42:38-49fec9b/darwin-amd64-10_12:
--- FAIL: TestDialerLocalAddr (75.03s)
	dial_test.go:642: tcp6 [::1]:0->::1: got dial tcp6 [::1]:0->[::1]:49296: getsockopt: operation timed out; want <nil>
FAIL
FAIL	net

2017-12-06T18:33:50-3e52f22/darwin-amd64-10_12:
--- FAIL: TestDialerLocalAddr (75.03s)
	dial_test.go:642: tcp6 [::]:0->::1: got dial tcp6 [::]:0->[::1]:53707: getsockopt: operation timed out; want <nil>
FAIL
FAIL	net

2017-12-14T17:12:09-1fc71e3/darwin-amd64-10_12:
--- FAIL: TestDialerLocalAddr (75.03s)
	dial_test.go:637: tcp 0.0.0.0:0->::1: got dial tcp 0.0.0.0:0->[::1]:49306: getsockopt: operation timed out; want <nil>
FAIL
FAIL	net

2017-12-18T20:24:12-3aeb395/freebsd-arm-paulzhol:
--- FAIL: TestDialerLocalAddr (0.03s)
	dial_test.go:637: tcp [::]:0->127.0.0.1: got dial tcp [::]:0->127.0.0.1:46629: connect: address already in use; want <nil>
FAIL
FAIL	net

2018-01-02T20:02:06-2ba3fd4/darwin-amd64-10_12:
--- FAIL: TestDialerLocalAddr (75.02s)
	dial_test.go:637: tcp [::]:0->::1: got dial tcp [::]:0->[::1]:59610: getsockopt: operation timed out; want <nil>
FAIL
FAIL	net
@gopherbot

This comment has been minimized.

Copy link

commented Jan 3, 2018

Change https://golang.org/cl/86000 mentions this issue: net: ignore TestDialerLocalAddr timeout failures on Darwin

gopherbot pushed a commit that referenced this issue Jan 3, 2018
I don't know why these errors occur. Ignore them to avoid breaking the
build.

Updates #22019

Change-Id: Ia048e6d9b928e8e237b311ff3a364e7a23af4aa4
Reviewed-on: https://go-review.googlesource.com/86000
Run-TryBot: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
@ianlancetaylor ianlancetaylor modified the milestones: Go1.10, Unplanned Jan 3, 2018
@bcmills

This comment has been minimized.

Copy link
Member

commented May 15, 2019

#22403 was marked as a duplicate of this, and recurred in https://build.golang.org/log/b8eabc31874c5ab9c691a4a3e4dceb2c8656b16c:

--- FAIL: TestNetpollDeadlock (60.07s)
    crash_test.go:95: testprognet NetpollDeadlock exit status: exit status 2
    crash_test.go:409: output does not start with "done\n":
        dialing
        SIGQUIT: quit
        PC=0x7fffa2d6dc86 m=0 sigcode=0
        
        goroutine 0 [idle]:
        runtime.pthread_cond_wait(0x123c288, 0x123c248, 0x7fff00000000)
        	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/runtime/sys_darwin.go:368 +0x39
        runtime.semasleep(0xffffffffffffffff, 0x7fff5fbff4f8)
        	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/runtime/os_darwin.go:63 +0x85
        runtime.notesleep(0x123c048)
        	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/runtime/lock_sema.go:173 +0xe0
        runtime.stoplockedm()
        	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/runtime/proc.go:2068 +0x88
        runtime.schedule()
        	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/runtime/proc.go:2469 +0x485
        runtime.park_m(0xc000000180)
        	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/runtime/proc.go:2610 +0x9d
        runtime.mcall(0x1057ce6)
        	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/runtime/asm_amd64.s:318 +0x5b
        
        goroutine 1 [select, locked to thread]:
        net.(*sysDialer).dialParallel(0xc0000ee080, 0x1168940, 0xc0000ae008, 0xc0000dc060, 0x1, 0x1, 0xc0000dc070, 0x1, 0x1, 0x0, ...)
        	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/net/dial.go:489 +0x322
        net.(*Dialer).DialContext(0xc000044e30, 0x1168940, 0xc0000ae008, 0x1140620, 0x3, 0x11421f0, 0xf, 0x0, 0x0, 0x0, ...)
        	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/net/dial.go:422 +0x4f0
        net.(*Dialer).Dial(...)
        	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/net/dial.go:347
        net.Dial(0x1140620, 0x3, 0x11421f0, 0xf, 0x1, 0x8, 0x0, 0x0)
        	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/net/dial.go:318 +0xa8
        main.NetpollDeadlockInit()
        	/private/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/runtime/testdata/testprognet/net.go:19 +0xa7
        main.registerInit(...)
        	/private/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/runtime/testdata/testprognet/main.go:20
        main.init.0()
        	/private/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/runtime/testdata/testprognet/net.go:13 +0x106
        
        goroutine 34 [syscall]:
        os/signal.signal_recv(0x0)
        	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/runtime/sigqueue.go:139 +0x9f
        os/signal.loop()
        	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/os/signal/signal_unix.go:23 +0x22
        created by os/signal.init.0
        	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/os/signal/signal_unix.go:29 +0x41
        
        goroutine 23 [IO wait]:
        internal/poll.runtime_pollWait(0x1671f08, 0x77, 0xc000045950)
        	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/runtime/netpoll.go:184 +0x55
        internal/poll.(*pollDesc).wait(0xc0000f4018, 0x77, 0x1168900, 0xc000086200, 0xc0000f4000)
        	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/internal/poll/fd_poll_runtime.go:87 +0x10b
        internal/poll.(*pollDesc).waitWrite(...)
        	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/internal/poll/fd_poll_runtime.go:96
        internal/poll.(*FD).WaitWrite(...)
        	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/internal/poll/fd_unix.go:498
        net.(*netFD).connect(0xc0000f4000, 0x1168900, 0xc000086200, 0x0, 0x0, 0x11674c0, 0xc0000f2080, 0x0, 0x0, 0x0, ...)
        	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/net/fd_unix.go:152 +0x27e
        net.(*netFD).dial(0xc0000f4000, 0x1168900, 0xc000086200, 0x1168f20, 0x0, 0x1168f20, 0xc0000b0150, 0x0, 0x1, 0xc000045b78)
        	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/net/sock_posix.go:149 +0xff
        net.socket(0x1168900, 0xc000086200, 0x1140620, 0x3, 0x1e, 0x1, 0x0, 0x0, 0x1168f20, 0x0, ...)
        	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/net/sock_posix.go:70 +0x1c0
        net.internetSocket(0x1168900, 0xc000086200, 0x1140620, 0x3, 0x1168f20, 0x0, 0x1168f20, 0xc0000b0150, 0x1, 0x0, ...)
        	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/net/ipsock_posix.go:141 +0x141
        net.(*sysDialer).doDialTCP(0xc0000ee080, 0x1168900, 0xc000086200, 0x0, 0xc0000b0150, 0x1116f00, 0x1256a30, 0x0)
        	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/net/tcpsock_posix.go:65 +0xc2
        net.(*sysDialer).dialTCP(0xc0000ee080, 0x1168900, 0xc000086200, 0x0, 0xc0000b0150, 0xa3036326a7, 0x2b9c0830, 0x2b9c0830000345b8)
        	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/net/tcpsock_posix.go:61 +0xd7
        net.(*sysDialer).dialSingle(0xc0000ee080, 0x1168900, 0xc000086200, 0x1167900, 0xc0000b0150, 0x0, 0x0, 0x0, 0x0)
        	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/net/dial.go:578 +0x34b
        net.(*sysDialer).dialSerial(0xc0000ee080, 0x1168900, 0xc000086200, 0xc0000dc060, 0x1, 0x1, 0x0, 0x0, 0x0, 0x0)
        	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/net/dial.go:546 +0x221
        net.(*sysDialer).dialParallel.func1(0x1168900, 0xc000086200, 0xc0000dc001)
        	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/net/dial.go:467 +0x95
        created by net.(*sysDialer).dialParallel
        	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/net/dial.go:482 +0x226
        
        goroutine 35 [select]:
        net.(*netFD).connect.func2(0x1168900, 0xc000086200, 0xc0000f4000, 0xc00008a1e0, 0xc00008a180)
        	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/net/fd_unix.go:129 +0xba
        created by net.(*netFD).connect
        	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/net/fd_unix.go:128 +0x256
        
        rax    0x104
        rbx    0x500
        rcx    0x7fff5fbff318
        rdx    0x500
        rdi    0x123c288
        rsi    0x50100000600
        rbp    0x7fff5fbff3b0
        rsp    0x7fff5fbff318
        r8     0x0
        r9     0x60
        r10    0x0
        r11    0x202
        r12    0x123c288
        r13    0x16
        r14    0x1
        r15    0x7fffaba873c0
        rip    0x7fffa2d6dc86
        rflags 0x203
        cs     0x7
        fs     0x0
        gs     0x0
FAIL
FAIL	runtime	90.150s

@bcmills bcmills changed the title net: TestDialerLocalAddr is flaky on Mac and should be disabled net: TestDialerLocalAddr and TestNetpollDeadlock are flaky on Mac May 15, 2019
@bcmills

This comment has been minimized.

Copy link
Member

commented Sep 11, 2019

@bcmills bcmills added the Testing label Sep 11, 2019
@bcmills bcmills modified the milestones: Unplanned, Go1.14 Sep 11, 2019
@bcmills

This comment has been minimized.

Copy link
Member

commented Sep 11, 2019

@mikioh @bradfitz @ianlancetaylor, I think we need someone to look into this for 1.14.

I'm worried about another case like #32336, where an occasional test flake turned out to be a symptom of a serious bug.

@bcmills

This comment has been minimized.

Copy link
Member

commented Sep 16, 2019

@bcmills

This comment has been minimized.

Copy link
Member

commented Sep 26, 2019

@bcmills

This comment has been minimized.

Copy link
Member

commented Oct 2, 2019

@bcmills

This comment has been minimized.

Copy link
Member

commented Oct 2, 2019

#32919 may be related.

@rsc rsc modified the milestones: Go1.14, Backlog Oct 9, 2019
@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Oct 11, 2019

As far as I can see all failures are on Darwin 10.12. There are no failures on newer versions. This makes me suspect that it is a kernel problem that was fixed.

@bcmills

This comment has been minimized.

Copy link
Member

commented Oct 15, 2019

@ianlancetaylor, as far as I can tell that is indeed the case.

Do you think 10.12 is in wide enough usage to warrant a workaround for the kernel issue?

If not, should we skip TestNetpollDeadlock on the darwin-amd64-10_12 builder, and perhaps upgrade the darwin-amd64-race builder to get past the bug?

(CC @bradfitz @dmitshur @toothrot)

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Oct 15, 2019

I personally have no idea whether 10.12 is in wide use. I would be fine with a workaround but I don't have one. The test seems to be timing out waiting for a non-blocking connect to complete. I don't know why.

@dmitshur

This comment has been minimized.

Copy link
Member

commented Oct 15, 2019

In case it's helpful, this StatCounter Global Stats page suggests 10.12 is currently at just under 10% use, while 10.13 is at just over 20%, and 10.14 is at over 50%. (10.15 isn't on the chart yet; the latest entry is from September, before it was out.)

@bcmills

This comment has been minimized.

Copy link
Member

commented Oct 15, 2019

To me that suggests that we should probably upgrade the -race builder and add some skips for the flaky tests. I'll file an issue for the upgrade.

@bcmills bcmills changed the title net: TestDialerLocalAddr and TestNetpollDeadlock are flaky on Mac net: TestDialerLocalAddr and TestNetpollDeadlock are flaky on macOS 10.12 Oct 15, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants
You can’t perform that action at this time.