
runtime: hang after concurrent panic from two threads after apparent memory corruption #57420

Open
davepacheco opened this issue Dec 21, 2022 · 8 comments
Labels: compiler/runtime (Issues related to the Go compiler and/or runtime.), NeedsFix (The path to resolution is known, but the work has not been done.)
Milestone: Go1.21

Comments

@davepacheco

What version of Go are you using (go version)?

This is the cockroach binary, running the simple cockroach version --build-tag command. A successful invocation prints:

dap@ivanova ~ $ /home/dap/tools/cockroach-v22.1.9/bin/cockroach version
Build Tag:        v22.1.9-dirty
Build Time:       2022/10/21 16:56:49
Distribution:     OSS
Platform:         illumos amd64 (x86_64-pc-solaris2.11)
Go Version:       go1.17.13
C Compiler:       gcc 10.3.0
Build Commit ID:  e438c2f89282e607e0e6ca1d38b2e0a622f94493
Build Type:       release

That same Go version on my system prints:

 $ /opt/ooce/go-1.17/bin/go version
go version go1.17.13 illumos/amd64

Does this issue reproduce with the latest release?

Unknown -- I've only ever seen it once.

What operating system and processor architecture are you using (go env)?

I'm not positive that this is the same build that was used to build the cockroach binary, but I suspect it is:

go env Output:

dap@ivanova ~ $ /opt/ooce/go-1.17/bin/go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/dap/.cache/go-build"
GOENV="/home/dap/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="illumos"
GOINSECURE=""
GOMODCACHE="/home/dap/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="illumos"
GOPATH="/home/dap/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/opt/ooce/go-1.17"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/opt/ooce/go-1.17/pkg/tool/illumos_amd64"
GOVCS=""
GOVERSION="go1.17.13"
GCCGO="/bin/gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/dev/null"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/dangerzone/omicron_tmp/go-build3691127031=/tmp/go-build -gno-record-gcc-switches"

What did you do?

Ran cockroach version --build-tag

What did you expect to see?

The output above, showing cockroach version information

What did you see instead?

No output. The command hung. I saved a core file and have a lot more information about it though!

More details

I spent some time digging into this under oxidecomputer/omicron#1876. The punchline is that this system suffered from a bug where %ymm0 was not preserved across signal handlers, which can result in memclrNoHeapPointers failing to zero address ranges. In this specific case, judging from the core file, it looks like most of one of the GC bits arenas was not properly cleared.

There are two threads panicking in my core file. Thread 1 is panicking on "sweep increased allocation count", a known possible consequence of this %ymm0 issue. Thread 14 is panicking in reportZombies, another known possible consequence of the same issue. For details on how this corruption can cause these failures, see oxidecomputer/omicron#1146.

So at this point, we've got good reason to believe there's memory corruption here. However, I don't know how to tell whether that's why Go ultimately hung. I figured I'd file this and let y'all decide how far it's worth digging. There could be a legitimate issue here related to hanging during a concurrent panic.

Stacks:

> ::walk thread | ::findstack
stack pointer for thread 1: fffffc7fffdfb0b0
[ fffffc7fffdfb0b0 libc.so.1`__lwp_park+0x17() ]
  fffffc7fffdfb0d0 libc.so.1`sema_wait+0x10()
  fffffc7fffdfb100 libc.so.1`sem_wait+0x22()
  fffffc7fffdfb190 runtime.asmsysvicall6+0x5a()
  fffffc7fffdfb1c8 runtime.lock2+0x185()
  fffffc7fffdfb1e8 runtime.printlock+0x5b()
  fffffc7fffdfb250 runtime.dopanic_m+0x290()
  fffffc7fffdfb298 runtime.fatalthrow.func1+0x67()
  000000c000b03c08 runtime.systemstack+0x52()
  000000c000b03c38 runtime.throw+0x74()
  000000c000b03d20 runtime.(*sweepLocked).sweep+0x99e()
  000000c000b03da0 runtime.(*mcentral).cacheSpan+0x4a5()
  000000c000b03de8 runtime.(*mcache).refill+0xaa()
  000000c000b03e28 runtime.(*mcache).nextFree+0x8d()
  000000c000b03ea8 runtime.mallocgc+0x530()
  000000c000b03f28 runtime.mapassign+0x485()
  000000c000b17798 github.com/aws/aws-sdk-go/aws/endpoints.init+0x4d19d()
  000000c000b178e8 runtime.doInit+0x129()
  000000c000b17a38 runtime.doInit+0x7e()
  000000c000b17b88 runtime.doInit+0x7e()
  000000c000b17cd8 runtime.doInit+0x7e()
  000000c000b17e28 runtime.doInit+0x7e()
  000000c000b17f78 runtime.doInit+0x7e()
  000000c000b17fd0 runtime.main+0x205()
  0000000000000000 runtime.goexit+1()
stack pointer for thread 2: fffffc7fed1ffd90
[ fffffc7fed1ffd90 libc.so.1`__lwp_park+0x17() ]
  fffffc7fed1ffdb0 libc.so.1`sema_wait+0x10()
  fffffc7fed1ffde0 libc.so.1`sem_wait+0x22()
  fffffc7fed1ffe68 runtime.asmsysvicall6+0x5a()
  fffffc7fed1ffea0 runtime.lock2+0x185()
  fffffc7fed1fff08 runtime.sysmon+0x12d()
  fffffc7fed1fff30 runtime.mstart1+0x97()
  fffffc7fed1fff50 runtime.mstart0+0x66()
  fffffc7fed1fffb0 runtime.mstart+5()
  fffffc7fed1fffe0 libc.so.1`_thrp_setup+0x6c()
  fffffc7fed1ffff0 libc.so.1`_lwp_start()
stack pointer for thread 3: fffffc7fed7ffc50
[ fffffc7fed7ffc50 libc.so.1`__lwp_park+0x17() ]
  fffffc7fed7ffc70 libc.so.1`sema_wait+0x10()
  fffffc7fed7ffca0 libc.so.1`sem_wait+0x22()
  fffffc7fed7ffd28 runtime.asmsysvicall6+0x5a()
  fffffc7fed7ffd58 runtime.notesleep+0x93()
  fffffc7fed7ffd78 runtime.mPark+0x39()
  fffffc7fed7ffda0 runtime.stopm+0x92()
  fffffc7fed7ffe98 runtime.findrunnable+0xa07()
  fffffc7fed7ffef8 runtime.schedule+0x297()
  fffffc7fed7fff28 runtime.park_m+0x18e()
  000000c0007a1f50 runtime.mcall+0x63()
  000000c0007a1fd0 runtime.gcBgMarkWorker+0x118()
  0000000000000000 runtime.goexit+1()
stack pointer for thread 4: fffffc7fecdffc50
[ fffffc7fecdffc50 libc.so.1`__lwp_park+0x17() ]
  fffffc7fecdffc70 libc.so.1`sema_wait+0x10()
  fffffc7fecdffca0 libc.so.1`sem_wait+0x22()
  fffffc7fecdffd28 runtime.asmsysvicall6+0x5a()
  fffffc7fecdffd58 runtime.notesleep+0x93()
  fffffc7fecdffd78 runtime.mPark+0x39()
  fffffc7fecdffda0 runtime.stopm+0x92()
  fffffc7fecdffe98 runtime.findrunnable+0xa07()
  fffffc7fecdffef8 runtime.schedule+0x297()
  fffffc7fecdfff28 runtime.park_m+0x18e()
  000000c000589f50 runtime.mcall+0x63()
  000000c000589fd0 runtime.gcBgMarkWorker+0x118()
  0000000000000000 runtime.goexit+1()
stack pointer for thread 5: fffffc7feb7ffde0
[ fffffc7feb7ffde0 libc.so.1`__lwp_park+0x17() ]
  fffffc7feb7ffe00 libc.so.1`sema_wait+0x10()
  fffffc7feb7ffe30 libc.so.1`sem_wait+0x22()
  fffffc7feb7ffeb8 runtime.asmsysvicall6+0x5a()
  fffffc7feb7ffee8 runtime.notesleep+0x93()
  fffffc7feb7fff08 runtime.templateThread+0x86()
  fffffc7feb7fff30 runtime.mstart1+0x97()
  fffffc7feb7fff50 runtime.mstart0+0x66()
  fffffc7feb7fffb0 runtime.mstart+5()
  fffffc7feb7fffe0 libc.so.1`_thrp_setup+0x6c()
  fffffc7feb7ffff0 libc.so.1`_lwp_start()
stack pointer for thread 6: fffffc7febfffc50
[ fffffc7febfffc50 libc.so.1`__lwp_park+0x17() ]
  fffffc7febfffc70 libc.so.1`sema_wait+0x10()
  fffffc7febfffca0 libc.so.1`sem_wait+0x22()
  fffffc7febfffd28 runtime.asmsysvicall6+0x5a()
  fffffc7febfffd58 runtime.notesleep+0x93()
  fffffc7febfffd78 runtime.mPark+0x39()
  fffffc7febfffda0 runtime.stopm+0x92()
  fffffc7febfffe98 runtime.findrunnable+0xa07()
  fffffc7febfffef8 runtime.schedule+0x297()
  fffffc7febffff28 runtime.park_m+0x18e()
  000000c000583f50 runtime.mcall+0x63()
  000000c000583fd0 runtime.gcBgMarkWorker+0x118()
  0000000000000000 runtime.goexit+1()
stack pointer for thread 7: fffffc7febbffd20
[ fffffc7febbffd20 libc.so.1`__lwp_park+0x17() ]
  fffffc7febbffd40 libc.so.1`sema_wait+0x10()
  fffffc7febbffd70 libc.so.1`sem_wait+0x22()
  fffffc7febbffdf8 runtime.asmsysvicall6+0x5a()
  fffffc7febbffe28 runtime.notesleep+0x93()
  fffffc7febbffe48 runtime.mPark+0x39()
  fffffc7febbffe98 runtime.stoplockedm+0x7b()
  fffffc7febbffef8 runtime.schedule+0x4f()
  fffffc7febbfff28 runtime.park_m+0x18e()
  000000c0000afe10 runtime.mcall+0x63()
  000000c0000aff30 runtime.selectgo+0x7b0()
  000000c0000affd0 runtime.ensureSigM.func1+0x1f2()
  0000000000000000 runtime.goexit+1()
stack pointer for thread 8: fffffc7fea1ffe90
[ fffffc7fea1ffe90 libc.so.1`__lwp_park+0x17() ]
  fffffc7fea1ffeb0 libc.so.1`sema_wait+0x10()
  fffffc7fea1ffee0 libc.so.1`sem_wait+0x22()
  000000c0000b0710 runtime.asmsysvicall6+0x5a()
  000000c0000b0748 runtime.notetsleep_internal+0x66()
  000000c0000b0788 runtime.notetsleepg+0x67()
  000000c0000b07b0 os/signal.signal_recv+0xab()
  000000c0000b07d0 os/signal.loop+0x25()
  0000000000000000 runtime.goexit+1()
stack pointer for thread 9: fffffc7feabffc50
[ fffffc7feabffc50 libc.so.1`__lwp_park+0x17() ]
  fffffc7feabffc70 libc.so.1`sema_wait+0x10()
  fffffc7feabffca0 libc.so.1`sem_wait+0x22()
  fffffc7feabffd28 runtime.asmsysvicall6+0x5a()
  fffffc7feabffd58 runtime.notesleep+0x93()
  fffffc7feabffd78 runtime.mPark+0x39()
  fffffc7feabffda0 runtime.stopm+0x92()
  fffffc7feabffe98 runtime.findrunnable+0xa07()
  fffffc7feabffef8 runtime.schedule+0x297()
  fffffc7feabfff28 runtime.park_m+0x18e()
  000000c000582f50 runtime.mcall+0x63()
  000000c000582fd0 runtime.gcBgMarkWorker+0x118()
  0000000000000000 runtime.goexit+1()
stack pointer for thread 10: fffffc7feafffd20
[ fffffc7feafffd20 libc.so.1`__lwp_park+0x17() ]
  fffffc7feafffd40 libc.so.1`sema_wait+0x10()
  fffffc7feafffd70 libc.so.1`sem_wait+0x22()
  fffffc7feafffdf8 runtime.asmsysvicall6+0x5a()
  fffffc7feafffe28 runtime.notesleep+0x93()
  fffffc7feafffe48 runtime.mPark+0x39()
  fffffc7feafffe70 runtime.stopm+0x92()
  fffffc7feafffe98 runtime.startlockedm+0x85()
  fffffc7feafffef8 runtime.schedule+0x8e()
  fffffc7feaffff28 runtime.park_m+0x18e()
  000000c000587750 runtime.mcall+0x63()
  000000c0005877d0 runtime.gcBgMarkWorker+0x118()
  0000000000000000 runtime.goexit+1()
stack pointer for thread 11: fffffc7feb3ffc50
[ fffffc7feb3ffc50 libc.so.1`__lwp_park+0x17() ]
  fffffc7feb3ffc70 libc.so.1`sema_wait+0x10()
  fffffc7feb3ffca0 libc.so.1`sem_wait+0x22()
  fffffc7feb3ffd28 runtime.asmsysvicall6+0x5a()
  fffffc7feb3ffd58 runtime.notesleep+0x93()
  fffffc7feb3ffd78 runtime.mPark+0x39()
  fffffc7feb3ffda0 runtime.stopm+0x92()
  fffffc7feb3ffe98 runtime.findrunnable+0xa07()
  fffffc7feb3ffef8 runtime.schedule+0x297()
  fffffc7feb3fff28 runtime.park_m+0x18e()
  000000c000589f50 runtime.mcall+0x63()
  000000c000589fd0 runtime.gcBgMarkWorker+0x118()
  0000000000000000 runtime.goexit+1()
stack pointer for thread 12: fffffc7fea7ffc50
[ fffffc7fea7ffc50 libc.so.1`__lwp_park+0x17() ]
  fffffc7fea7ffc70 libc.so.1`sema_wait+0x10()
  fffffc7fea7ffca0 libc.so.1`sem_wait+0x22()
  fffffc7fea7ffd28 runtime.asmsysvicall6+0x5a()
  fffffc7fea7ffd58 runtime.notesleep+0x93()
  fffffc7fea7ffd78 runtime.mPark+0x39()
  fffffc7fea7ffda0 runtime.stopm+0x92()
  fffffc7fea7ffe98 runtime.findrunnable+0xa07()
  fffffc7fea7ffef8 runtime.schedule+0x297()
  fffffc7fea7fff28 runtime.park_m+0x18e()
  000000c000582f50 runtime.mcall+0x63()
  000000c000582fd0 runtime.gcBgMarkWorker+0x118()
  0000000000000000 runtime.goexit+1()
stack pointer for thread 13: fffffc7fe9dffd00
[ fffffc7fe9dffd00 libc.so.1`__lwp_park+0x17() ]
  fffffc7fe9dffd20 libc.so.1`sema_wait+0x10()
  fffffc7fe9dffd50 libc.so.1`sem_wait+0x22()
  fffffc7fe9dffdd8 runtime.asmsysvicall6+0x5a()
  fffffc7fe9dffe10 runtime.lock2+0x185()
  fffffc7fe9dffe58 runtime.startm+0x50()
  fffffc7fe9dffe78 runtime.wakep+0x66()
  fffffc7fe9dffe98 runtime.resetspinning+0x59()
  fffffc7fe9dffef8 runtime.schedule+0x2c7()
  fffffc7fe9dfff28 runtime.park_m+0x18e()
  000000c0007a1f50 runtime.mcall+0x63()
  000000c0007a1fd0 runtime.gcBgMarkWorker+0x118()
  0000000000000000 runtime.goexit+1()
stack pointer for thread 14: fffffc7fe65ffa60
[ fffffc7fe65ffa60 libc.so.1`__lwp_park+0x17() ]
  fffffc7fe65ffa80 libc.so.1`sema_wait+0x10()
  fffffc7fe65ffab0 libc.so.1`sem_wait+0x22()
  fffffc7fe65ffb38 runtime.asmsysvicall6+0x5a()
  fffffc7fe65ffb70 runtime.lock2+0x185()
  fffffc7fe65ffb88 runtime.lockWithRank+0x2b()
  fffffc7fe65ffba8 runtime.lock+0x34()
  fffffc7fe65ffbe0 runtime.startpanic_m+0x17e()
  fffffc7fe65ffc28 runtime.fatalthrow.func1+0x45()
  fffffc7fe65ffc60 runtime.fatalthrow+0x5e()
  fffffc7fe65ffc90 runtime.throw+0x74()
  fffffc7fe65ffd10 runtime.(*mspan).reportZombies+0x345()
  fffffc7fe65ffdf8 runtime.(*sweepLocked).sweep+0x35a()
  fffffc7fe65ffe28 runtime.(*mcentral).uncacheSpan+0xcf()
  fffffc7fe65ffe70 runtime.(*mcache).releaseAll+0x134()
  fffffc7fe65ffe98 runtime.(*mcache).prepareForSweep+0x46()
  fffffc7fe65ffeb0 runtime.gcMarkTermination.func4.1+0x2f()
  fffffc7fe65fff18 runtime.forEachP+0x12a()
  fffffc7fe65fff30 runtime.gcMarkTermination.func4+0x2d()
  000000c0005846f8 runtime.systemstack+0x52()
  000000c000584750 runtime.gcMarkDone+0x2d5()
  000000c0005847d0 runtime.gcBgMarkWorker+0x30c()
  0000000000000000 runtime.goexit+1()
stack pointer for thread 15: fffffc7fe89ffc50
[ fffffc7fe89ffc50 libc.so.1`__lwp_park+0x17() ]
  fffffc7fe89ffc70 libc.so.1`sema_wait+0x10()
  fffffc7fe89ffca0 libc.so.1`sem_wait+0x22()
  fffffc7fe89ffd28 runtime.asmsysvicall6+0x5a()
  fffffc7fe89ffd58 runtime.notesleep+0x93()
  fffffc7fe89ffd78 runtime.mPark+0x39()
  fffffc7fe89ffda0 runtime.stopm+0x92()
  fffffc7fe89ffe98 runtime.findrunnable+0xa07()
  fffffc7fe89ffef8 runtime.schedule+0x297()
  fffffc7fe89fff28 runtime.park_m+0x18e()
  000000c000584f50 runtime.mcall+0x63()
  000000c000584fd0 runtime.gcBgMarkWorker+0x118()
  0000000000000000 runtime.goexit+1()
stack pointer for thread 16: fffffc7fe93ffc50
[ fffffc7fe93ffc50 libc.so.1`__lwp_park+0x17() ]
  fffffc7fe93ffc70 libc.so.1`sema_wait+0x10()
  fffffc7fe93ffca0 libc.so.1`sem_wait+0x22()
  fffffc7fe93ffd28 runtime.asmsysvicall6+0x5a()
  fffffc7fe93ffd58 runtime.notesleep+0x93()
  fffffc7fe93ffd78 runtime.mPark+0x39()
  fffffc7fe93ffda0 runtime.stopm+0x92()
  fffffc7fe93ffe98 runtime.findrunnable+0xa07()
  fffffc7fe93ffef8 runtime.schedule+0x297()
  fffffc7fe93fff28 runtime.park_m+0x18e()
  000000c000588f50 runtime.mcall+0x63()
  000000c000588fd0 runtime.gcBgMarkWorker+0x118()
  0000000000000000 runtime.goexit+1()
stack pointer for thread 17: fffffc7fe8dffd20
[ fffffc7fe8dffd20 libc.so.1`__lwp_park+0x17() ]
  fffffc7fe8dffd40 libc.so.1`sema_wait+0x10()
  fffffc7fe8dffd70 libc.so.1`sem_wait+0x22()
  fffffc7fe8dffdf8 runtime.asmsysvicall6+0x5a()
  fffffc7fe8dffe28 runtime.notesleep+0x93()
  fffffc7fe8dffe48 runtime.mPark+0x39()
  fffffc7fe8dffe70 runtime.stopm+0x92()
  fffffc7fe8dffe98 runtime.startlockedm+0x85()
  fffffc7fe8dffef8 runtime.schedule+0x8e()
  fffffc7fe8dfff28 runtime.park_m+0x18e()
  000000c0007a1f50 runtime.mcall+0x63()
  000000c0007a1fd0 runtime.gcBgMarkWorker+0x118()
  0000000000000000 runtime.goexit+1()
stack pointer for thread 18: fffffc7fe7fffc50
[ fffffc7fe7fffc50 libc.so.1`__lwp_park+0x17() ]
  fffffc7fe7fffc70 libc.so.1`sema_wait+0x10()
  fffffc7fe7fffca0 libc.so.1`sem_wait+0x22()
  fffffc7fe7fffd28 runtime.asmsysvicall6+0x5a()
  fffffc7fe7fffd58 runtime.notesleep+0x93()
  fffffc7fe7fffd78 runtime.mPark+0x39()
  fffffc7fe7fffda0 runtime.stopm+0x92()
  fffffc7fe7fffe98 runtime.findrunnable+0xa07()
  fffffc7fe7fffef8 runtime.schedule+0x297()
  fffffc7fe7ffff28 runtime.park_m+0x18e()
  000000c0000b4dd8 runtime.mcall+0x63()
  000000c0000b4e68 runtime.chanrecv+0x5f7()
  000000c0000b4e98 runtime.chanrecv1+0x2b()
  000000c0000b4fd0 github.com/cockroachdb/cockroach/pkg/util/goschedstats.init.0.func1+0x1de()
  0000000000000000 runtime.goexit+1()
stack pointer for thread 19: fffffc7fe85ffc50
[ fffffc7fe85ffc50 libc.so.1`__lwp_park+0x17() ]
  fffffc7fe85ffc70 libc.so.1`sema_wait+0x10()
  fffffc7fe85ffca0 libc.so.1`sem_wait+0x22()
  fffffc7fe85ffd28 runtime.asmsysvicall6+0x5a()
  fffffc7fe85ffd58 runtime.notesleep+0x93()
  fffffc7fe85ffd78 runtime.mPark+0x39()
  fffffc7fe85ffda0 runtime.stopm+0x92()
  fffffc7fe85ffe98 runtime.findrunnable+0xa07()
  fffffc7fe85ffef8 runtime.schedule+0x297()
  fffffc7fe85fff28 runtime.park_m+0x18e()
  000000c000582750 runtime.mcall+0x63()
  000000c0005827d0 runtime.gcBgMarkWorker+0x118()
  0000000000000000 runtime.goexit+1()
stack pointer for thread 20: fffffc7fe7bffc50
[ fffffc7fe7bffc50 libc.so.1`__lwp_park+0x17() ]
  fffffc7fe7bffc70 libc.so.1`sema_wait+0x10()
  fffffc7fe7bffca0 libc.so.1`sem_wait+0x22()
  fffffc7fe7bffd28 runtime.asmsysvicall6+0x5a()
  fffffc7fe7bffd58 runtime.notesleep+0x93()
  fffffc7fe7bffd78 runtime.mPark+0x39()
  fffffc7fe7bffda0 runtime.stopm+0x92()
  fffffc7fe7bffe98 runtime.findrunnable+0xa07()
  fffffc7fe7bffef8 runtime.schedule+0x297()
  fffffc7fe7bfff28 runtime.park_m+0x18e()
  000000c000586f50 runtime.mcall+0x63()
  000000c000586fd0 runtime.gcBgMarkWorker+0x118()
  0000000000000000 runtime.goexit+1()
stack pointer for thread 21: fffffc7fe73ffc50
[ fffffc7fe73ffc50 libc.so.1`__lwp_park+0x17() ]
  fffffc7fe73ffc70 libc.so.1`sema_wait+0x10()
  fffffc7fe73ffca0 libc.so.1`sem_wait+0x22()
  fffffc7fe73ffd28 runtime.asmsysvicall6+0x5a()
  fffffc7fe73ffd58 runtime.notesleep+0x93()
  fffffc7fe73ffd78 runtime.mPark+0x39()
  fffffc7fe73ffda0 runtime.stopm+0x92()
  fffffc7fe73ffe98 runtime.findrunnable+0xa07()
  fffffc7fe73ffef8 runtime.schedule+0x297()
  fffffc7fe73fff28 runtime.park_m+0x18e()
  000000c000587750 runtime.mcall+0x63()
  000000c0005877d0 runtime.gcBgMarkWorker+0x118()
  0000000000000000 runtime.goexit+1()

I'm happy to make the core file available but I'm not sure the best way to do that.

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Dec 21, 2022
@dr2chase
Contributor

I think as long as there is memory corruption, the default analysis of this bug is to blame everything on that. It's also possible that the Go runtime for illumos has a bug because it is undertested, but given memory corruption, nothing is certain.

Also, go1.17 is fairly old and unsupported; you might try tip. We do have an illumos-amd64 builder and it is looking good (https://build.golang.org/). We don't have recent coverage for 1.19, 1.18, or golang.org/x (not failures, just no runs), so I cannot say for sure about those.

CC @golang/illumos (yes, I know this is just one guy at Oxide; I looked first to be sure it wasn't empty like /solaris). Not sure what the process is for more illumos builders; I can ask today (we also have a 1.19 builder coverage problem for loong64 that just bit us this week, so, ugh).

@dr2chase dr2chase added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Dec 21, 2022
@prattmic
Member

I took a look at what is going on:

  • Thread 1: throwing
    • Holding paniclk
    • Waiting for printlock
  • Thread 2: sysmon
    • Holding ??
    • Waiting for sched.lock, sched.sysmonlock, or forcegc.lock
  • Thread 13
    • Waiting on sched.lock
  • Thread 14: throwing
    • Holding printlock (in reportZombies)
    • Maybe holding sched.lock (in forEachP)
    • Waiting on paniclk

There is a lock ordering consistency problem here between paniclk and printlock (aka debuglock). Thread 1 takes paniclk first and then printlock (this is the default behavior of throw). Thread 14 takes printlock before paniclk (by calling printlock explicitly before throw).

reportZombies should call printunlock before throw. I see a few other similar cases as well.

cc @golang/runtime

@prattmic prattmic added NeedsFix The path to resolution is known, but the work has not been done. and removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Dec 21, 2022
@prattmic prattmic added this to the Go1.21 milestone Dec 21, 2022
@prattmic
Member

prattmic commented Dec 21, 2022

Another option may be to have throw take printlock prior to paniclk. I'm not sure if there is a concrete reason we don't do that currently. This would have the advantage of allowing places that want to add more context to a throw to guarantee it prints right alongside the fatal error.

@davepacheco
Author

Thanks @prattmic for taking a look. I'm glad it turned out to be worthwhile (i.e., sounds like a real issue here).

@davepacheco
Author

@dr2chase Thanks. I'll look into the question of illumos builders for those recent releases.

@dr2chase
Contributor

The instructions for adding a builder are hard to find using search engines; they're in the Wiki here: https://github.com/golang/go/wiki/DashboardBuilders
Any feedback about confusing bits is helpful.

@davepacheco
Author

@dr2chase Thanks. So we do already have one, but you mentioned it's not running several recent releases. Is there some config that needs to be changed somewhere to cover those too?

@dr2chase
Contributor

I am not good at builders; there is absolutely a config that needs to be changed somewhere.
But I just looked, and the runs are there now: https://build.golang.org/
So, no configuration needed for now. And 1.19 looks okay on illumos.

There's only one builder; a second one would be nice.
The list of builders and queues etc is here: https://farmer.golang.org/

Projects: In Progress
