runtime: apparent deadlock in TestCgoNumGoroutine #39024

Open
bcmills opened this issue May 12, 2020 · 8 comments

Comments

@bcmills (Member) commented May 12, 2020

2020-05-11T22:38:32-8c1db77/openbsd-amd64-64

--- FAIL: TestCgoNumGoroutine (60.25s)
    crash_test.go:95: testprogcgo NumGoroutine exit status: exit status 2
    crash_cgo_test.go:417: expected "OK\n" got SIGQUIT: quit
        PC=0x469fff m=6 sigcode=0
        
        goroutine 0 [idle]:
        runtime.thrsleep(0xc00002f738, 0x200000003, 0x0, 0x0, 0xc00002f738, 0x58, 0xc00006a000, 0x8000, 0x0, 0xc000001980, ...)
        	/tmp/workdir/go/src/runtime/sys_openbsd_amd64.s:72 +0x1f
        runtime.semasleep(0xffffffffffffffff, 0x200e7039c)
        	/tmp/workdir/go/src/runtime/os_openbsd.go:167 +0xb4
        runtime.notesleep(0x88c178)
        	/tmp/workdir/go/src/runtime/lock_sema.go:181 +0xcf
        runtime.templateThread()
        	/tmp/workdir/go/src/runtime/proc.go:1863 +0xfa
        runtime.mstart1()
        	/tmp/workdir/go/src/runtime/proc.go:1156 +0xc8
        runtime.mstart()
        	/tmp/workdir/go/src/runtime/proc.go:1121 +0x6e
        
        goroutine 1 [syscall]:
        main._Cfunc_CheckNumGoroutine()
        	_cgo_gotypes.go:139 +0x45
        main.NumGoroutine()
        	/tmp/workdir/go/src/runtime/testdata/testprogcgo/numgoroutine.go:49 +0x59
        main.main()
        	/tmp/workdir/go/src/runtime/testdata/testprogcgo/main.go:34 +0x1da
        
        rax    0x58
        rbx    0xc00002f400
        rcx    0x469fff
        rdx    0x0
        rdi    0xc00002f738
        rsi    0x3
        rbp    0x200e70370
        rsp    0x200e70310
        r8     0xc00002f738
        r9     0x0
        r10    0x0
        r11    0x246
        r12    0x5205e0
        r13    0x7f7ffffd72f0
        r14    0xc000001980
        r15    0x4398c0
        rip    0x469fff
        rflags 0x246
        cs     0x2b
        fs     0x0
        gs     0x0
FAIL
FAIL	runtime	105.009s

Marking as release-blocker until we understand whether this is a regression.

@aclements (Member) commented May 14, 2020

/cc @mknyszek

@aclements (Member) commented May 14, 2020

/cc @prattmic

@mknyszek (Contributor) commented May 14, 2020

I'll take a look at this one after the runtime/trace test failures, if no one else gets to it first.

@prattmic (Member) commented May 15, 2020

Similar failure in another openbsd cgo test:

2020-05-12T19:15:34-cb11c98/openbsd-386-62

--- FAIL: TestEnsureDropM (120.07s)
    crash_test.go:95: testprogcgo EnsureDropM exit status: exit status 2
    crash_cgo_test.go:174: expected "OK\n", got SIGQUIT: quit
        PC=0x1c05a717 m=6 sigcode=0
        
        goroutine 0 [idle]:
        runtime.thrsleep(0x3c42cacc, 0x3, 0x0, 0x0, 0x3c42cacc, 0x4, 0x1c041aa2, 0x7c28a9ec, 0x0, 0x0, ...)
        	/tmp/workdir/go/src/runtime/sys_openbsd_386.s:384 +0x7
        runtime.semasleep(0xffffffff, 0xffffffff, 0x3c42c900)
        	/tmp/workdir/go/src/runtime/os_openbsd.go:167 +0xcc
        runtime.notesleep(0x3c1223fc)
        	/tmp/workdir/go/src/runtime/lock_sema.go:181 +0xda
        runtime.templateThread()
        	/tmp/workdir/go/src/runtime/proc.go:1863 +0xd4
        runtime.mstart1()
        	/tmp/workdir/go/src/runtime/proc.go:1156 +0x8f
        runtime.mstart()
        	/tmp/workdir/go/src/runtime/proc.go:1121 +0x4f
        
        goroutine 1 [syscall]:
        main._Cfunc_CheckM()
        	_cgo_gotypes.go:127 +0x2d
        main.EnsureDropM()
        	/tmp/workdir/go/src/runtime/testdata/testprogcgo/dropm.go:57 +0x14
        main.main()
        	/tmp/workdir/go/src/runtime/testdata/testprogcgo/main.go:34 +0x148
        
        eax    0x4
        ebx    0xffffffff
        ecx    0x0
        edx    0x3c42cacc
        edi    0x1c031b40
        esi    0x3c400d20
        ebp    0x1
        esp    0x7c28a9c4
        eip    0x1c05a717
        eflags 0x206
        cs     0x2b
        fs     0x7c24005b
        gs     0x9720063

@mknyszek self-assigned this May 19, 2020

@mknyszek (Contributor) commented May 19, 2020

I've been running this test continuously (directly by compiling the testprog, and via go test) on openbsd-amd64-64 for about an hour now and haven't been able to reproduce at tip.

Judging by the failures above, the template thread is sleeping (not unexpected) and goroutine 1 is in C code. Both failures look like a timeout while the goroutine sits in C code, but there isn't much to go on here. I might try to see if reproducing on 386 is easier.

@mknyszek (Contributor) commented May 21, 2020

Both EnsureDropM and NumGoroutine involve calling a C function which creates a C thread that calls into Go code. In both of the deadlocks above the goroutine is blocked in C code, so it is probably waiting for that other C thread to do what it needs to do, and that thread may not be in Go code yet.

I pored over the needm logic and I can't think of a way that it could deadlock, e.g. via a C->Go thread waiting indefinitely for an m which will never come, so I don't think there's anything to pursue there until we have more evidence.

@aclements (Member) commented May 21, 2020

Similar recent failures:

$ greplogs -dashboard -e "crash_cgo_test.*SIGQUIT" -md -l

2020-05-12T19:15:34-cb11c98/openbsd-386-62
2020-05-11T22:38:32-8c1db77/openbsd-amd64-64
2020-05-01T05:25:54-e1d1684/freebsd-arm64-dmgk
2020-04-29T20:33:31-197a2a3/netbsd-amd64-9_0

Then there's nothing until 2018. The last two (freebsd-arm64-dmgk and netbsd-amd64-9_0) seem to have a lot more going on, so they might not be the same.

@cagedmantis (Contributor) commented May 21, 2020

I think that it is ok to work on this after beta1.
