
runtime: sporadic but frequent "signal: illegal instruction" during all.bash on darwin/arm64 at CL 272258 PS 1 #42774

Closed
dmitshur opened this issue Nov 22, 2020 · 49 comments


@dmitshur (Contributor) commented Nov 22, 2020

CL 272258 has made it easy to run all.bash, so I ran it a few times. It has passed at least once, but most of the time some test fails with "signal: illegal instruction".

This is on a MacBook Air (M1, 2020), macOS 11.0.1 (20B29).

This may get resolved as part of finishing work on #42684, but I'm reporting it in case these failures are helpful to see sooner, since I don't think any existing macOS/ARM64 issue covers this failure mode specifically.

What version of Go are you using (go version)?

$ go version
go version devel +7716a2fbb7 Sun Nov 22 11:44:49 2020 -0500 darwin/arm64

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="arm64"
GOBIN=""
GOCACHE="/Users/dmitri/Library/Caches/go-build"
GOENV="/Users/dmitri/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="arm64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/dmitri/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/dmitri/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/Users/dmitri/go116r2"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/Users/dmitri/go116r2/pkg/tool/darwin_arm64"
GOVCS=""
GOVERSION="devel +7716a2fbb7 Sun Nov 22 11:44:49 2020 -0500"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/dev/null"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/_0/h0671fcn4rgb5pn9c745dx2h0000gn/T/go-build446541792=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

Built Go at commit 7716a2fbb7 (CL 272258 PS 1) successfully, pointed GOROOT_BOOTSTRAP to it, then checked out the same commit elsewhere and ran all.bash repeatedly.

What did you expect to see?

$ ./all.bash
Building Go cmd/dist using /Users/dmitri/go116r2. (devel +7716a2fbb7 Sun Nov 22 11:44:49 2020 -0500 darwin/arm64)
Building Go toolchain1 using /Users/dmitri/go116r2.
Building Go bootstrap cmd/go (go_bootstrap) using Go toolchain1.
Building Go toolchain2 using go_bootstrap and Go toolchain1.
Building Go toolchain3 using go_bootstrap and Go toolchain2.
Building packages and commands for darwin/arm64.

[...]

ALL TESTS PASSED
---
Installed Go for darwin/arm64 in /Users/dmitri/gotip
Installed commands in /Users/dmitri/gotip/bin
*** You need to add /Users/dmitri/gotip/bin to your PATH.

What did you see instead?

Most of the time, a failure like:

##### ../test
# go run run.go -- fixedbugs/issue5856.go
exit status 1
go tool compile: signal: illegal instruction

FAIL	fixedbugs/issue5856.go	0.015s
2020/11/22 15:17:43 Failed: exit status 1
##### ../test
# go run run.go -- fixedbugs/issue16095.go
exit status 1
go tool link: signal: illegal instruction

FAIL	fixedbugs/issue16095.go	0.027s
2020/11/22 15:49:41 Failed: exit status 1
##### ../test
# go run run.go -- fixedbugs/issue29264.go
exit status 1
go tool compile: signal: illegal instruction

FAIL	fixedbugs/issue29264.go	0.020s
2020/11/22 16:02:34 Failed: exit status 1
$ ./all.bash
Building Go cmd/dist using /Users/dmitri/gotip. (devel +7716a2fbb7 Sun Nov 22 11:44:49 2020 -0500 darwin/arm64)
Building Go toolchain1 using /Users/dmitri/gotip.
Building Go bootstrap cmd/go (go_bootstrap) using Go toolchain1.
Building Go toolchain2 using go_bootstrap and Go toolchain1.
go build internal/testlog: /Users/dmitri/go116r2/pkg/tool/darwin_arm64/compile: signal: illegal instruction
go tool dist: FAILED: /Users/dmitri/go116r2/pkg/tool/darwin_arm64/go_bootstrap install -gcflags=all= -ldflags=all= -i cmd/asm cmd/cgo cmd/compile cmd/link: exit status 1
--- FAIL: TestCover (1.48s)
    cover_test.go:141: running [go -build -o /var/folders/_0/h0671fcn4rgb5pn9c745dx2h0000gn/T/go-testcover103682725/toolexec.exe testdata/toolexec.go]
    cover_test.go:130: running [go build -o /var/folders/_0/h0671fcn4rgb5pn9c745dx2h0000gn/T/go-testcover103682725/cover.exe]
    cover_test.go:173: failed to build testcover or toolexec program: signal: illegal instruction
--- FAIL: TestCoverHTML (1.48s)
    cover_test.go:390: failed to build testcover or toolexec program: signal: illegal instruction
--- FAIL: TestFuncWithDuplicateLines (1.48s)
    cover_test.go:537: failed to build testcover or toolexec program: signal: illegal instruction
--- FAIL: TestHtmlUnformatted (1.48s)
    cover_test.go:454: failed to build testcover or toolexec program: signal: illegal instruction
--- FAIL: TestDirectives (1.48s)
    cover_test.go:249: failed to build testcover or toolexec program: signal: illegal instruction
--- FAIL: TestCoverFunc (1.48s)
    cover_test.go:368: failed to build testcover or toolexec program: signal: illegal instruction
FAIL
FAIL	cmd/cover	1.728s
--- FAIL: TestDWARF (0.85s)
    --- FAIL: TestDWARF/testprog (1.75s)
        dwarf_test.go:99: symbols testprog.exe: exit status 255: /var/folders/_0/h0671fcn4rgb5pn9c745dx2h0000gn/T/go-link-TestDWARF652205567/testprog.exe [arm64, 0.054295 seconds]:
                null-uuid                            /private/var/folders/_0/h0671fcn4rgb5pn9c745dx2h0000gn/T/go-link-TestDWARF652205567/testprog.exe [AOUT, PIE, FaultedFromDisk]  
                    0x0000000000000000 (0x100000000) __PAGEZERO SEGMENT
                    0x0000000100000000 ( 0xf0000) __TEXT SEGMENT
                        0x0000000100000000 (  0x1000) MACH_HEADER
                        0x0000000100001000 ( 0xc6c90) __TEXT __text
                            0x0000000100001000 (    0x70) go.buildid [FUNC, NameNList, MangledNameNList, Merged, NList] 
                            0x0000000100001070 (    0x50) internal/cpu.Initialize [FUNC, NameNList, MangledNameNList, NList] 
[...]
                            0x00000001002683a8 (     0x8) p [EXT, NameNList, MangledNameNList, NList] 
                            0x00000001002683b0 (     0x4) threadExited [EXT, NameNList, MangledNameNList, NList] 
                    0x000000010026c000 ( 0x402d2) __LINKEDIT SEGMENT
            symbols[31119]: [fatal]  child process status could not be determined; exit code unavailable.
FAIL
FAIL	cmd/link	20.081s

(Full log for TestDWARF/testprog failure.)

CC @cherrymui.

@tmm1 (Contributor) commented Nov 22, 2020

It's worth trying with GODEBUG=asyncpreemptoff=1 set /cc #42700

@dmitshur (Contributor, Author) commented Nov 22, 2020

I think that issue is about running the Go darwin/amd64 port under Rosetta 2. This one is about running the darwin/arm64 port natively.

I can try it anyway to see what effect it has, thanks for the suggestion.

@cherrymui (Contributor) commented Nov 22, 2020

Interesting. Looks similar to #41702. If I'm not mistaken, all failed cases above involve starting processes. Maybe it is more likely to fail on ARM64 than AMD64? Maybe the workaround in #41702 is not effective enough?

@tmm1 this is not Rosetta 2, so #42700 is irrelevant.

That said, GODEBUG=asyncpreemptoff=1 is a workaround of #41702 as well.

@cherrymui (Contributor) commented Nov 22, 2020

I rarely see this on the DTK. I don't think I have seen it since the workaround for #41702 was added. It is interesting that it is more likely to fail on M1.

Does TestExec in syscall package pass reliably?

@rolandshoemaker (Member) commented Nov 22, 2020

I've also seen this happen a few times. Interestingly, the place I ran into it most often is when running the io benchmarks, particularly BenchmarkCopyNLarge-8.

~/go/src % ../bin/go test ./io -run=XXX -bench=. -count=10
goos: darwin
goarch: arm64
pkg: io
BenchmarkCopyNSmall-8   	 5467614	       425.9 ns/op
BenchmarkCopyNSmall-8   	 3287067	       428.3 ns/op
BenchmarkCopyNSmall-8   	 5085464	       740.8 ns/op
BenchmarkCopyNLarge-8   	signal: illegal instruction
FAIL	io	24.012s
FAIL
@dmitshur (Contributor, Author) commented Nov 22, 2020

Does TestExec in syscall package pass reliably?

I believe it does:


$ go version
go version devel +7716a2fbb7 Sun Nov 22 11:44:49 2020 -0500 darwin/arm64
$ go test -run=TestExec syscall
ok  	syscall	0.500s
$ go test -count=1 -run=TestExec syscall
ok  	syscall	0.227s
$ go test -count=10 -run=TestExec syscall
ok  	syscall	1.216s
$ go test -count=100 -run=TestExec syscall
ok  	syscall	7.949s
$ go test -count=1000 -run=TestExec syscall
ok  	syscall	137.463s
$ go test -count=1 -v -run=TestExec syscall 
=== RUN   TestExec
--- PASS: TestExec (0.02s)
=== RUN   TestExecHelper
--- PASS: TestExecHelper (0.00s)
=== RUN   TestExecPtrace
--- PASS: TestExecPtrace (0.00s)
=== RUN   TestExecErrPermutedFds
--- PASS: TestExecErrPermutedFds (0.00s)
PASS
ok  	syscall	0.193s
@dmitshur (Contributor, Author) commented Nov 22, 2020

I'm able to reproduce @rolandshoemaker's finding above with go test io -run='^$' -bench=BenchmarkCopy -count=10, 5 out of 5 times. It triggers within 5-30 seconds.


I've tried GODEBUG=asyncpreemptoff=1. With it set, all.bash passed 5 of 5 times, and BenchmarkCopy also ran okay 5 of 5 times, all without any instances of "signal: illegal instruction". When I cleared GODEBUG, the problem started happening again (although seemingly less frequently...).

@cherrymui (Contributor) commented Nov 23, 2020

Okay, I can reproduce the SIGILL by running the io benchmarks in a loop on the DTK, although with a much lower failure rate. I'll take it from there. (I'll be OOO this afternoon, though, so I probably won't get back to this very soon.)

@tonistiigi (Contributor) commented Nov 23, 2020

Crash report for this error: https://gist.github.com/tonistiigi/23fbd4be8d1562ff935aa36b2dcdb299

With GODEBUG=asyncpreemptoff=1, the ./io.test benchmark doesn't reproduce the issue for me either.

@cherrymui (Contributor) commented Nov 24, 2020

@tonistiigi thanks for the crash report. I can reproduce the failure and get similar crash reports. I also managed to get a core dump, which is also quite similar.

In the crash report, it says it SIGILLs at __psynch_cvwait + 8, which doesn't make sense: that points to a valid instruction (b.lo $pc-4). In the core dump, the registers and stack trace match, and the memory it points to does have valid instructions. However, the core dump doesn't specify which thread crashed, so I checked them all and found two other threads whose PCs are supposed to be in __semwait_signal and __pthread_kill, respectively, but actually point into a page of zeros. This page is next to __psynch_cvwait and in the range of libsystem_kernel.dylib. That page should be mapped read-only, right? So it should not be possible for Go to overwrite that page.

The core file could somehow be truncated. Or, if it is actually right, a page of system functions has somehow been replaced with a page of zeros. I'll try to get more core files.

I haven't been able to get a crash while running under lldb (to eliminate the possibility of truncated core file).

Either way, it seems it crashed in some system function, with SIGILL. I don't see how that could be possible from the Go runtime. Also, normally, when a Go program encounters a SIGILL it will print a stack trace. In this case it doesn't. The only possibilities I can think of are that it crashes before the runtime is initialized, or that it SIGILLs while handling a signal (so it cannot handle another signal). From the crash report and the core dump, neither seems to be the case. Maybe the kernel directly kills it when it sees a SIGILL in a system library?

@cherrymui (Contributor) commented Nov 24, 2020

Hmmm, running in the debugger I saw the runtime does a few MAP_FIXED mmaps at fairly low addresses. I'll look into why it does that.

@markmentovai commented Nov 24, 2020

The crash report from @tonistiigi #42774 (comment) doesn’t look like a genuine SIGILL triggered by a hardware trap to me. It looks like the process itself called kill(getpid(), SIGILL).

https://gist.github.com/tonistiigi/23fbd4be8d1562ff935aa36b2dcdb299:

Exception Type:        EXC_CRASH (SIGILL)
Exception Codes:       0x0000000000000000, 0x0000000000000000

This is suspicious: a genuine illegal instruction in hardware will transform into EXC_BAD_INSTRUCTION, and codes[0] will appear as, in nearly every case, EXC_ARM_UNDEFINED. It will never be 0. Actually, it won’t be translated symbolically, it’ll show up numerically as 0x0000000000000001. Refer to 10.15.6 xnu-6153.141.1/osfmk/arm64/sleh.c handle_uncategorized, handle_wf_trap, handle_msr_trap, handle_user_trapped_instruction32, and handle_simd_trap. codes[1] would be the instruction, although seeing 0 here isn’t suspicious, as 0 is indeed an illegal instruction in arm64. In fact, asm(".long 0") is my preferred way to trigger this fault when I need to.

Termination Signal:    Illegal instruction: 4
Termination Reason:    Namespace SIGNAL, Code 0x4
Terminating Process:   io.test [65832]

Still more suspicious stuff here: for a genuine hardware trap, the Mach exception will be transformed to a signal by the in-kernel exc/mach_exc subsystem server, in 10.15.6 xnu-6153.141.1/bsd/uxkern/ux_exception.c handle_ux_exception. That would be blamed as the terminating process (although, as an in-kernel Mach server, it’s not a “process” in the strictest sense, but since it will have been operating on behalf of the faulting process, the PID will be the crashing process’ own PID). Here, you’d expect:

Terminating Process:   exc handler [65832]

Looks to me like this isn’t a hardware trap at all, but a software SIGILL. On the other hand, if you’re catching SIGILL and re-raising by calling, for example, kill(getpid(), SIGILL) rather than letting execution resume at the faulting instruction without your signal handler in place, the software SIGILL may ultimately have a hardware origin. In that case, though, information about the true origin of the SIGILL will unfortunately have been lost, making debugging a trickier proposition than it needs to be.

Is there anywhere a SIGILL might be generated in software by code?

@cherrymui #42774 (comment):

However, the core dump doesn't specify which thread crashes, so I checked them all, and found that two other threads, the PC of which are supposed to be in __semwait_signal and __pthread_kill, respectively, but actually in a page of zeros.

If you see a thread in __pthread_kill, it just may be the source of the software SIGILL that’s taking things down.

@tonistiigi (Contributor) commented Nov 24, 2020

I went through all my crash reports. While most of them are almost identical to the one I posted earlier (and with the same pc), I did find two that were quite different. https://gist.github.com/tonistiigi/9d0d40afe5248932cdac266c357d3135 . One of them has a stacktrace to pthread_kill.

@cherrymui (Contributor) commented Nov 24, 2020

No, I don't think the Go program sends itself a SIGILL. The __pthread_kill is likely sending SIGURG.

@cherrymui (Contributor) commented Nov 24, 2020

Exception Type:        EXC_CRASH (SIGILL)
Exception Codes:       0x0000000000000000, 0x0000000000000000
Exception Note:        EXC_CORPSE_NOTIFY

Application Specific Information:
dyld2 mode

Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0   libsystem_kernel.dylib        	0x00000001e54c75d0 __pthread_kill + 8
1   libsystem_pthread.dylib       	0x00000001e5572c54 pthread_kill + 292
2   a.out                         	0x0000000100c3ff5c main + 36
3   libdyld.dylib                 	0x00000001e5356844 start + 4

Here is what I got for a process that pthread_kills itself with SIGILL. The Exception Type and Exception Codes look like the above, but it doesn't have Termination Signal, etc.

@markmentovai commented Nov 25, 2020

I reproduced this 0/10 times running all.bash and 0/15 times running go test io -run='^$' -bench=BenchmarkCopy -count=10 on my DTK running 11.0.1 20B29. My M1 isn’t here yet.

Are others seeing this only on a different OS version than I’m using? Or is the hardware the crucial difference?

OK, I only thought I reproduced 0/10 running all.bash, but when I checked ~/Library/Logs/DiagnosticReports, I saw a single crash from link, although I never saw anything like go tool link: signal: illegal instruction at all.

Process:               link [34564]
[…]
Code Type:             ARM-64 (Native)
[…]
Crashed Thread:        0  Dispatch queue: com.apple.main-thread

Exception Type:        EXC_CRASH (SIGILL)
Exception Codes:       0x0000000000000000, 0x0000000000000000
Exception Note:        EXC_CORPSE_NOTIFY

Termination Signal:    Illegal instruction: 4
Termination Reason:    Namespace SIGNAL, Code 0x4
Terminating Process:   link [34564]

Application Specific Information:
dyld2 mode

Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0   libsystem_kernel.dylib              0x000000018ff3f488 __psynch_cvwait + 8
1   libsystem_pthread.dylib             0x000000018ff75568 _pthread_cond_wait + 1192
2   link                                0x00000001043526d8 runtime.pthread_cond_wait_trampoline + 24
3   link                                0x00000001043515b8 runtime.asmcgocall + 200
@cherrymui (Contributor) commented Nov 25, 2020

I did some dtracing. It appears something sends a SIGILL signal. The kernel stack varies. Apparently, the kernel entry point can be a syscall (e.g. madvise), somewhere in user code (not a syscall, e.g. runtime.scanobject), or there is no explicit entry point.

  4   2606 get_system_inshutdown:signal-send pid=606 tid=54569 io.test send signal 4

              kernel.release.t8027`0xfffffe00202d9038+0x244
              kernel.release.t8027`0xfffffe00202d9540+0x494
              kernel.release.t8027`0xfffffe001fde79f4+0xdc
              kernel.release.t8027`0xfffffe001fdabd9c+0xc
              0x1c4dee4d0 // _madvise
  6   2606 get_system_inshutdown:signal-send pid=590 tid=53893 io.test send signal 4

              kernel.release.t8027`0xfffffe00202d9038+0x244
              kernel.release.t8027`0xfffffe00202d9540+0x494
              kernel.release.t8027`0xfffffe001fde79f4+0xdc
              kernel.release.t8027`0xfffffe001fdabd9c+0xc
              kernel.release.t8027`0xfffffe001fef6714+0x564
              kernel.release.t8027`0xfffffe001fdab858+0x28
              0x1003337f0 // runtime.markrootSpans
  3   2606 get_system_inshutdown:signal-send pid=572 tid=53266 io.test send signal 4

              kernel.release.t8027`0xfffffe00202d9038+0x244
              kernel.release.t8027`0xfffffe00202d9540+0x494
              kernel.release.t8027`0xfffffe001fde79f4+0xdc
              kernel.release.t8027`0xfffffe001fdabd9c+0xc
              kernel.release.t8027`0xfffffe001fde7d24
              kernel.release.t8027`0xfffffe001fdb47c0+0x7c

The top 4 frames are essentially always the same, but lower frames vary.

The user stack also varies.

In contrast, for a program that pthread_kills itself with SIGILL:

  2   2606 get_system_inshutdown:signal-send pid=394 tid=11884 a.out send signal 4

              kernel.release.t8027`0xfffffe001170d210+0x2e4
              kernel.release.t8027`0xfffffe0011236714+0x274
              kernel.release.t8027`0xfffffe00110eb858+0x28
              0x1d258f5d0 // __pthread_kill

It is a different code path, and the entry point is always __pthread_kill.

I also traced syscall __pthread_kill and I didn't see any SIGILL (all SIGURG).

@markmentovai commented Nov 25, 2020

That's definite progress! Can you symbolize those kernel frames or even try the debug kernel?

@cherrymui (Contributor) commented Nov 25, 2020

I don't know how to do that. Tried sym(address) and func(address) with dtrace and it still just prints the hex.

@cherrymui (Contributor) commented Nov 25, 2020

I tried to manually symbolize a kernel stack trace (by comparing the kernel memory with the symbol information of a kernel with a nearby version):

  6   2606 get_system_inshutdown:signal-send pid=670 tid=51032 io.test send signal 4 to 670 51030

              kernel.release.t8027`0xfffffe001765d038+0x244 // postsig_locked
              kernel.release.t8027`0xfffffe001765d540+0x494 // bsd_ast
              kernel.release.t8027`0xfffffe001716b9f4+0xdc  // ast_taken_user
              kernel.release.t8027`0xfffffe001712fd9c+0xc   // _arm64_thread_exception_return (label user_take_ast)
              kernel.release.t8027`0xfffffe001716bd24       // ???
              kernel.release.t8027`0xfffffe00171387c0+0x7c  // _Call_continuation

The fifth frame is weird, as it is not a call instruction. Maybe a faked return address? I'll see if I can make sense of it...

@cherrymui (Contributor) commented Nov 26, 2020

This is another case:

  0   2606 get_system_inshutdown:signal-send pid=671 tid=51041 io.test send signal 4 to 671 51037

              kernel.release.t8027`0xfffffe001765d038+0x244 // postsig_locked
              kernel.release.t8027`0xfffffe001765d540+0x494 // bsd_ast
              kernel.release.t8027`0xfffffe001716b9f4+0xdc  // ast_taken_user
              kernel.release.t8027`0xfffffe001712fd9c+0xc   // _arm64_thread_exception_return (label user_take_ast)
              kernel.release.t8027`0xfffffe001727a714+0x564 // sleh_synchronous ???
              kernel.release.t8027`0xfffffe001712f858+0x28  // fleh_synchronous
              0x102387f84 // user code

Not quite sure about the sleh_synchronous frame. It seems to have changed quite a bit from the in-memory kernel to the one with symbol information.

For comparison, this is the case where the program pthread_kills itself with SIGILL:

  2   2606 get_system_inshutdown:signal-send pid=1317 tid=176087 a.out send signal 4 to 0 176090

              kernel.release.t8027`0xfffffe0017751210+0x2e4 // unix_syscall
              kernel.release.t8027`0xfffffe001727a714+0x274 // sleh_synchronous ???
              kernel.release.t8027`0xfffffe001712f858+0x28  // fleh_synchronous
              0x1d66235d0 // __pthread_kill

It seems to make sense.

@markmentovai commented Nov 26, 2020

@cherrymui #42774 (comment), #42774 (comment), #42774 (comment):

kernel.release.t8027

What OS version are you using (sw_vers or uname -v)? I suspect that it’s quite out of date.

On 11.0.1 20B29, the kernel is named kernel.release.t8020 for the DTK (A12) and kernel.release.t8101 for production hardware (M1, note that T8101 is the same designation for M1 and A14).

kernel.release.t8020 is the kernel that the DTK has used since 11.0db4 20A5343i (2020-08-04). kernel.release.t8027 was also present in earlier betas, as late as 11.0db3 20a5323l (2020-07-22). T8027 indicates A12X/A12Z (A12Z being the part in the DTK) and is more specific than T8020, but it seems that the two kernels were merged into a single one with the more generic T8020 label as the beta period progressed.

The point I’m getting at is that the kernel on your DTK may be a beta that’s at least 4 months old: either the original 11.0db1 20A5299w that it shipped with, or a very early intermediate beta up to 11.0db3 20a5323l. This is going to be a very different animal from what’s in the factory image now shipping with M1-based machines (11.0 20A2411), the current released kernel for both DTK and M1 (11.0.1 20B29/20B50), and the current beta for both (11.1b1 20C5048k). This isn’t a kernel that we should be testing against at this point.

The KDK for those old releases where there was a separate T8027 kernel never included a .dSYM for them either, so it won’t be the nicest possible debugging experience. (Incidentally, the newer post-M1 KDKs don’t include any T8101 kernels at all, so it is good that you’re able to do this on a DTK, where .dSYMs are available for contemporary release and beta kernels, and debug kernels are also an option.)

@cherrymui (Contributor) commented Nov 26, 2020

I have the original 20A5299w kernel. I tried to update the DTK earlier but somehow it failed, and I didn't try again. I understand this is not a good debugging experience. I'll try updating it again at some point.

@markmentovai commented Nov 26, 2020

I kept looking anyway, and as you may have found, the T8020 and T8027 kernels in those early releases were identical, so I was able to use the T8020 .dSYM from the 20A5299w KDK. Nothing much came of that; it basically confirmed your symbolization. But I was looking at what seemed to be the most promising stack you shared:

mark@sweet16 zsh% pbpaste
              kernel.release.t8027`0xfffffe001765d038+0x244 // postsig_locked
              kernel.release.t8027`0xfffffe001765d540+0x494 // bsd_ast
              kernel.release.t8027`0xfffffe001716b9f4+0xdc  // ast_taken_user
              kernel.release.t8027`0xfffffe001712fd9c+0xc   // _arm64_thread_exception_return (label user_take_ast)
              kernel.release.t8027`0xfffffe001727a714+0x564 // sleh_synchronous ???
              kernel.release.t8027`0xfffffe001712f858+0x28  // fleh_synchronous
              0x102387f84 // user code
mark@sweet16 zsh% pbpaste | sed -En -e 's/^.*kernel\.release\.t8027`([^ ]+).*$/print(hex(\1))/p' | python3 | atos -o kernel.release.t8020.dSYM/Contents/Resources/DWARF/kernel.release.t8020 -s 0xff68000
postsig_locked (in kernel.release.t8020) (kern_sig.c:3359)
bsd_ast (in kernel.release.t8020) (kern_sig.c:3573)
ast_taken_user (in kernel.release.t8020) (ast.c:219)
user_take_ast (in kernel.release.t8020) + 12
sleh_synchronous (in kernel.release.t8020) (sleh.c:594)
fleh_synchronous (in kernel.release.t8020) + 40

(0xff68000 computed as 0xfffffe001765d038 minus the address shown by nm kernel.release.t8020.dSYM/Contents/Resources/DWARF/kernel.release.t8020 | grep ' [Ss] _postsig_locked$'.)

Nothing too new or interesting there, and we can’t do much with the filenames and line numbers without a source drop. So I looked for other sources of SIGILL in xnu that aren’t strictly hardware traps, and I found one, right in 10.15.6 xnu-6153.141.1/bsd/dev/arm/unix_signal.c sendsig. Incidentally, that’s exactly where bsd/kern/kern_sig.c postsig_locked calls into.

At the bottom of sendsig, at the label bad, you’ll see a hard-coded SIGILL. bad is used when an error occurs while attempting to fulfill a request to send a signal to a user-space handler. Here, the errors can be in getting or setting thread state, or in copying data to user-space, so as to establish the user-space handler’s stack frame. This SIGILL is forced to SIG_DFL, which also agrees with your observation that the user-space handler never ran to show a backtrace.

(x86_64 need not feel left out, it has the same SIGILL behavior in bsd/dev/i386/unix_signal.c sendsig.)

Here’s a sample: t_impossible_sigaltstack. It sets up a sigaltstack to handle SIGABRT, but sets the protection on it to PROT_NONE, preventing the kernel from setting up the stack frame, before calling abort() to crash. You were expecting SIGABRT? In reality:

mark@arm-and-hammer zsh% clang++ -Wall -Werror t_impossible_sigaltstack.cc -o t_impossible_sigaltstack
mark@arm-and-hammer zsh% ./t_impossible_sigaltstack 
zsh: illegal hardware instruction  ./t_impossible_sigaltstack
mark@arm-and-hammer zsh% cat Library/Logs/DiagnosticReports/t_impossible_sigaltstack_*.crash
Process:               t_impossible_sigaltstack [72514]
[…]
Code Type:             ARM-64 (Native)
[…]
Exception Type:        EXC_CRASH (SIGILL)
Exception Codes:       0x0000000000000000, 0x0000000000000000
Exception Note:        EXC_CORPSE_NOTIFY

Termination Signal:    Illegal instruction: 4
Termination Reason:    Namespace SIGNAL, Code 0x4
Terminating Process:   t_impossible_sigaltstack [72514]

Application Specific Information:
dyld2 mode
abort() called

(Another note: sigaltstack is broken for arm64 in the 10.15.6 xnu sources, but it’s working in 11.0. Doesn’t actually matter for this bug, though: the 10.15.6 xnu source shows everything you need to see how this SIGILL is generated.)

How to dig deeper from here? Well, the Mach exception type that was recorded (here, EXC_CRASH) will actually be the true exception type from the original hardware fault if there was one, and will be EXC_CRASH if it was a software fault. Since your reports don’t show EXC_BAD_INSTRUCTION, they’re not real illegal instructions at all, they’re just failures to invoke the user-space signal handler. In fact yours are showing EXC_CRASH, just like my example here.

Since golang is a heavy user of SIGURG internally, and since you mentioned that the __pthread_kill calls I originally suggested might be relevant were raising SIGURG, I think we have a very strong theory that the problem is a failure to invoke the user-space SIGURG signal handler. You don’t normally think of SIGURG as being a fatal signal, because its default disposition (at SIG_DFL) is to be discarded, but in fact when a user-space handler can’t be invoked, the default disposition of SIGILL will be the controlling behavior.

The remaining question concerns why the user-space signal handler couldn’t be invoked. It could be something like my example here, with an unusable signal stack. Conceivably, it could be stack exhaustion or some other memory problem, or it could be a problem with the user thread state or signal handler thread state. But I think that this new lead will be valuable in troubleshooting this further.

@cherrymui (Contributor) commented Nov 27, 2020

the T8020 and T8027 kernels in those early releases were identical

Interesting. The T8020 kernel I found isn't identical to the running T8027 kernel, just pretty close. There are small changes here and there, like being off by one instruction. The relative PCs generally don't match, so I had to do some manual work.

sigaltstack may be the right direction. We do use sigaltstack, and if I disable it, it seems to no longer SIGILL. I'll look more in that direction.

Out of curiosity, I wonder why the kernel chooses SIGILL in this situation. (I hope it is not a typo of SIGKILL...)

@gopherbot commented Nov 27, 2020

Change https://golang.org/cl/273627 mentions this issue: Revert "runtime: use sigaltstack on macOS/ARM64"

@cherrymui (Contributor) commented Nov 27, 2020

For people who want to try, the CL above disables sigaltstack. It seems to fix this on my machine.

@markmentovai commented Nov 27, 2020

Interesting. The T8020 kernel I found isn't identical to the running T8027 kernel, just pretty close. There are small changes here and there, like off by one instruction. The relative PCs generally don't match. So I had to do some manual work.

If you have ADC access, log in to it by clicking Downloads on the ADC page here and then download Kernel_Debug_Kit_11.0_build_20A5299w.dmg. If you don’t have an ADC account, find me internally.

signaltstack may be the right direction. We do use sigaltstack, and if I disable it it seems no longer SIGILLs. I'll look more on that direction.

I suspect that the conditions to trigger the bug are merely more likely with a sigaltstack in effect, but doubt that they disappear entirely without sigaltstack. They may become so rare that testing won’t show them, though.

That said, sigaltstack working on arm64 in xnu at all is a brand-new thing.

Out of curiosity, I wonder why the kernel chooses SIGILL in this situation. (I hope it is not a typo of SIGKILL...)

It’s been there forever: 10.0.0 /xnu-123.5/bsd/dev/ppc/unix_signal.c sendsig. That’s from way before all of these SIGKILLs were fashionable. We’re way off in the fringe edges of xnu that don’t see much love or attention. I bet that it was sorta-intentionally chosen over 20 years ago and then simply forgotten.

Very interesting: #41702 is another way to get the same SIGILL to trigger, although presumably that exact trigger has been worked around. Recent, too.

🦃

@gopherbot
Copy link

@gopherbot gopherbot commented Nov 27, 2020

Change https://golang.org/cl/273686 mentions this issue: runtime: mlock signal stack on macOS/ARM64

@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented Nov 27, 2020

Apparently, it is more likely to fail when the system is under load (e.g. running multiple instances of the test in parallel). So one hypothesis is that it fails when the signal stack is not faulted in at the time the signal arrives. I tried mlocking the signal stack, and that does seem to fix this.

CL https://golang.org/cl/273686 does the mlock. If you want to try it, let me know whether it works on your machine. Thanks!

@markmentovai
Copy link

@markmentovai markmentovai commented Nov 27, 2020

Oh, that’s a good find. I thought that this might have to do with faulting in the signal stack, but despite spending a half hour trying to reproduce the problem with purge, I had no luck.

@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented Nov 27, 2020

If you have ADC access, log in to it by clicking Downloads on the ADC page here and then download Kernel_Debug_Kit_11.0_build_20A5299w.dmg.

I couldn't find 20A5299w kernel debug kit on that download page... So I downloaded 20A2411 and hoped for the best...

Anyway, I updated my kernel now.

@dmitshur
Copy link
Contributor Author

@dmitshur dmitshur commented Nov 27, 2020

I tried CL 272258 (aka commit 7716a2fbb7abc24f1069b1fc4c4b10b2274eee8a) with and without the above sigaltstack revert CL 273627 rebased on top.

With revert, all.bash passed 3/3 times.

Without revert, all.bash passed 5/5 times.

When I filed this issue, all.bash was failing at 7716a2fbb7abc24f1069b1fc4c4b10b2274eee8a what felt like 80% of the time—I'm not sure what has changed since then to cause it to start to pass so often. (I checked and GODEBUG was unset.)

@markmentovai
Copy link

@markmentovai markmentovai commented Nov 27, 2020

I’ve come up with a reduced C testcase that exhibits the bug fairly readily: t_fault_sigaltstack.cc. It should be possible to use this to prototype other possible workarounds, and for filing with Apple.

If it doesn’t crash with SIGILL shortly after starting it, start while true; do sudo purge; done (either before or while t_fault_sigaltstack is running) and it should crash.

Process:               t_fault_sigaltstack [80089]
[…]
Code Type:             ARM-64 (Native)
[…]
Crashed Thread:        1

Exception Type:        EXC_CRASH (SIGILL)
Exception Codes:       0x0000000000000000, 0x0000000000000000
Exception Note:        EXC_CORPSE_NOTIFY

Termination Signal:    Illegal instruction: 4
Termination Reason:    Namespace SIGNAL, Code 0x4
Terminating Process:   t_fault_sigaltstack [80089]

Application Specific Information:
dyld2 mode

Thread 0:: Dispatch queue: com.apple.main-thread
0   libsystem_kernel.dylib              0x000000018ff43cec __pthread_kill + 8
1   libsystem_pthread.dylib             0x000000018ff74c24 pthread_kill + 292
2   t_fault_sigaltstack                 0x0000000102b9dd60 main + 360
3   libdyld.dylib                       0x000000018ff90f54 start + 4

Thread 1 Crashed:
0   libsystem_kernel.dylib              0x000000018ff3f284 __semwait_signal + 8
1   libsystem_c.dylib                   0x000000018feb93a0 nanosleep + 216
2   t_fault_sigaltstack                 0x0000000102b9dfb4 (anonymous namespace)::ThreadMain(void*) + 420
3   libsystem_pthread.dylib             0x000000018ff7506c _pthread_start + 320
4   libsystem_pthread.dylib             0x000000018ff6fda0 thread_start + 8
[…]

Other times, it will show the thread calling pthread_kill as the crash thread.

@markmentovai
Copy link

@markmentovai markmentovai commented Nov 27, 2020

The key to the t_fault_sigaltstack reproduction above is that I’ve allocated the signal stacks using mmap with file backing, as opposed to MAP_ANONYMOUS, malloc, vm_allocate, etc. That makes it much easier to see the problem with just purge. In theory, if there was enough pressure that the system was actually swapping, then this bug could occur regardless of the allocation method, and even without SA_ONSTACK, but it’s certainly easier to see with a file-backed mmap.

How is Go allocating the memory used for thread stacks and signal stacks?

@dmitshur
Copy link
Contributor Author

@dmitshur dmitshur commented Nov 27, 2020

When I posted my last comment, I hadn't seen the other comments that came in after #42774 (comment).

dmitshur: I'm not sure what has changed since then to cause it to start to pass so often
cherrymui: it more likely to fail when the system is under load

Ah, adding system load in the background while running all.bash made the initial issue very reproducible on my machine once again. I tried both CL 273627 and CL 273686, this time using background load during all.bash, and got more conclusive results:

System load in the background was generated with:
$ go test io -run='^$' -bench=BenchmarkCopy -count=10  # (run repeatedly)

original (CL 272258):
• all.bash + background load = 5/5 fail
• io.BenchmarkCopy (on its own) = 5/5 fail

CL 272258 with CL 273627 (Revert "runtime: use sigaltstack on macOS/ARM64") on top:
• all.bash + background load = 3/3 ok, but also 2 inconclusive or unrelated fails¹
• io.BenchmarkCopy (on its own) = 6/6 ok

CL 272258 with CL 273686 (runtime: mlock signal stack on macOS/ARM64) on top:
• all.bash + background load = 3/3 ok
• io.BenchmarkCopy (on its own) = 5/5 ok
¹ (details about the two inconclusive failures)
--- FAIL: TestAllDependenciesVendored (4.88s)
    --- FAIL: TestAllDependenciesVendored/std (2.84s)
        moddeps_test.go:121: /Users/dmitri/gotip/bin/go list -mod=vendor -deps ./...: read |0: bad address
        moddeps_test.go:122: (Run 'go mod vendor' in /Users/dmitri/gotip/src to ensure that dependecies have been vendored.)
FAIL
FAIL	cmd/internal/moddeps	5.452s
src $ time ./all.bash
[...]
ok  	reflect	0.489s
ok  	regexp	0.228s
ok  	regexp/syntax	0.795s
./all.bash: line 13: 66061 Illegal instruction: 4  bash run.bash --no-rebuild
./all.bash  125.95s user 14.23s system 132% cpu 1:45.80 total
ok  	runtime	60.007s
ok  	runtime/debug	0.234s
ok  	runtime/internal/atomic	0.228s
[...]
ALL TESTS PASSED

In summary, both CL 273627 and CL 273686 seem effective at resolving the original issue on my machine.

@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented Nov 28, 2020

Thanks for the C++ reproducer!

How is Go allocating the memory used for thread stacks and signal stacks?

I think they are allocated from the Go heap (free page pool); if that cannot be satisfied, the heap grows using mmap with MAP_ANONYMOUS.

We always have SA_ONSTACK set, so mlocking the signal stacks should suffice as a workaround.

@markmentovai
Copy link

@markmentovai markmentovai commented Nov 28, 2020

I just revised t_fault_sigaltstack to be a bit more robust. I plan on using this to file the bug with Apple.

I also found that if there’s only one child thread, it gives a reliable SIGILL (with a file-backed mmap) on the very first SIGURG, even without all of the purge. So I made that the default mode of operation for the test program, but you can still select the number of threads (formerly 32) with -n.

mlock is a viable workaround, but I'm worried about unnecessarily wiring too much memory, since wired memory is a constrained resource and may in fact be limited by RLIMIT_MEMLOCK (though it's not limited by default on macOS). It should suffice to mlock just the highest page in the stack_t given to sigaltstack, because that's the only page the kernel needs to write the signal handler's stack frame to.

(macOS’ SIGSTKSZ is 128kB, and even its MINSIGSTKSZ is 32kB, or two pages on arm64. No need to wire that much memory.)

@markmentovai
Copy link

@markmentovai markmentovai commented Nov 30, 2020

I filed FB8922558 with Apple about these uncatchable spurious SIGILLs.

@gopherbot gopherbot closed this in 7f688d1 Nov 30, 2020
@bcmills
Copy link
Member

@bcmills bcmills commented Dec 1, 2020

Is / was this a duplicate of #35851? (Can that issue be closed now?)

@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented Dec 1, 2020

@bcmills probably not. #35851 is about iOS, where we don't use sigaltstack.

There is still the possibility that #35851 occurs because, when a signal is delivered on the main stack and the main stack happens to be paged out, the kernel may still be buggy (at least I can't rule that out), though that's much less likely. Even if that is the case, the workaround for this issue won't help, as it is specific to sigaltstack.

@markmentovai
Copy link

@markmentovai markmentovai commented Dec 2, 2020

@cherrymui sigaltstack ought to be working on iOS 14 (and all iOS versions on x86_64 aka the simulator). You can probably address #35851 at least on newer OS versions by enabling it subject to an OS version and architecture check.

@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented Dec 2, 2020

@eliasnaur what versions of iOS do we support?

Last time I checked, our iOS builder machine didn't support sigaltstack. The builder's kernel was updated recently, though, so I could try again.

For iOS simulators we always use sigaltstack, as on macOS/AMD64.

@eliasnaur
Copy link
Contributor

@eliasnaur eliasnaur commented Dec 2, 2020

As luck would have it, Corellium recently added iOS 14 images. I've upgraded all 3 builders to iOS 14.2.

@eliasnaur
Copy link
Contributor

@eliasnaur eliasnaur commented Dec 2, 2020

@eliasnaur what versions of iOS do we support?

I don't think we have a policy for iOS versions. However, requiring iOS 14 for working Go programs is probably too strict. Perhaps disable signal preemption on iOS < 14?

@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented Dec 2, 2020

The preemption signal is not the only issue here; this can be triggered by any signal, including user-generated ones. So I don't think disabling preemption is sufficient.

We could use sigaltstack+mlock on iOS >=14. But before we do that, I think we would want to understand better if #35851 is the same cause or not.

@eliasnaur
Copy link
Contributor

@eliasnaur eliasnaur commented Dec 2, 2020

Ok, thank you for being thorough; my concern was to not leave iOS out of the fix for this issue in 1.16.

How do we understand #35851 better? FWIW, I still have the iOS 13 images ready to run.

@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented Dec 2, 2020

If there is some way we can reproduce the failure more easily (like running the io benchmarks in this issue), that would be helpful. Then we can try the workaround and other ideas and see whether they are effective.

It may also be helpful if you could get a crash report.

Do you know how often the failure occurs? From the dashboard, it seems it failed once in February, then didn't fail for several months, then failed 3 times in a row on October 29, and hasn't failed since.

2020-10-29T19:03:09-f2c0c2b/ios-arm64-corellium
2020-10-29T18:26:18-2e6f500/ios-arm64-corellium
2020-10-29T17:53:57-3a819e8/ios-arm64-corellium
2020-02-19T21:34:59-1e43298/darwin-arm64-corellium
2019-11-25T03:41:16-9940c77/darwin-arm64-corellium
2019-11-14T17:43:48-d3c2b1f/darwin-arm64-corellium
2019-11-07T19:18:49-69dcdbd/darwin-arm64-corellium
@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented Dec 2, 2020

Another possibility could be #41702. The workaround (CL https://go-review.googlesource.com/c/go/+/262817 and https://go-review.googlesource.com/c/go/+/262438) only covers GOOS="darwin". Maybe we should extend them to GOOS="ios" as well.
