runtime: "unexpected return pc for runtime.pthread_kill_trampoline" on darwin builders #37605

Closed
bcmills opened this issue Mar 2, 2020 · 8 comments



@bcmills bcmills commented Mar 2, 2020

2020-03-02T16:12:20-ab7ecea/darwin-amd64-race
2019-12-24T16:13:37-c170b14/darwin-amd64-race

--- FAIL: TestCgoCallbackGC (0.34s)
    crash_test.go:95: testprogcgo CgoCallbackGC exit status: exit status 2
    crash_cgo_test.go:70: expected "OK\n", but got:
        fatal error: unexpected signal during runtime execution
        [signal SIGSEGV: segmentation violation code=0x1 addr=0x7000052b0046 pc=0x7fff66890bd0]
        
        runtime stack:
        runtime: unexpected return pc for runtime.pthread_kill_trampoline called from 0x7ffeefbff3b8
        stack: frame={sp:0x7ffeefbff2c8, fp:0x7ffeefbff2d0} stack=[0x7ffeefb80468,0x7ffeefbff4d0)

I suspect (without evidence) that this may be related to goroutine preemption in 1.14.

See also #34039.

CC @aclements @ianlancetaylor @cherrymui

@bcmills bcmills added this to the Go1.15 milestone Mar 2, 2020

@bcmills bcmills commented Mar 2, 2020

Possibly related to #32023.


@bcmills bcmills commented Mar 2, 2020

Possibly related to #36996: does the fix for that issue in signal_unix.go need a corresponding fix in signal_darwin.go? (Hmm, but signal_unix.go is also used on Darwin.)

@bcmills bcmills added the OS-Darwin label Mar 2, 2020

@cherrymui cherrymui commented Mar 2, 2020

It doesn't seem the same as #36996 to me. (And as you said, signal_unix.go covers darwin.)

Looking at the faulting PC, it seems the fault is inside pthread_kill?

@bcmills bcmills changed the title runtime: "unexpected return pc for runtime.pthread_kill_trampoline" on darwin-amd64-race builder runtime: "unexpected return pc for runtime.pthread_kill_trampoline" on darwin builders Jun 10, 2020

@cherrymui cherrymui commented Jun 10, 2020

I took another look at the hex dumps. I'm pretty sure that it faults within pthread_kill itself. I also tried the macOS 10.14 and 10.15 builders, using stress:

10.15: 5043 runs so far, 0 failures
10.14: 5076 runs so far, 21 failures

I also cannot reproduce it on my laptop (10.15). So it seems this may be specific to macOS 10.14, where there may be a bug in pthread_kill.

The above-listed failures on the darwin-amd64-race builders all happened before we updated the darwin-amd64-race builder to 10.15, which supports my hypothesis. It would be good to check that it doesn't fail after the update.

I wrote the following C program, which creates a number of threads that send signals to each other as fast as they can (derived from TestCgoCallbackGC):

#include <pthread.h>
#include <signal.h>
#include <stdint.h>	/* uintptr_t */
#include <unistd.h>

#define P 100

pthread_t t[P];

void *thr(void *arg) {
	int i = (int)(uintptr_t)arg;
	int j;
	usleep(1);
	for (j = 0; j < i; j++)
		pthread_kill(t[j], SIGURG);
	return 0;
}

void foo(int i) {
	pthread_attr_t attr;
	pthread_attr_init(&attr);
	pthread_attr_setstacksize(&attr, 256 << 10);
	pthread_create(&t[i], &attr, thr, (void*)(uintptr_t)i);
}

int main() {
	int i;
	for (i = 0; i < P; i++)
		foo(i);
	usleep(2);
	for (i = 0; i < P; i++)
		pthread_kill(t[i], SIGURG);
	for (i = 0; i < P; i++)
		pthread_join(t[i], 0);
	return 0;
}

It doesn't fail on my laptop, with 10000+ runs under stress. On the 10.14 builder, it fails with a segmentation fault 4 times in 100 runs, even when running sequentially. So I guess it has something to do with the pthread_kill implementation on macOS 10.14.
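
For anyone who wants to repeat the sequential measurement without the stress tool, here is a minimal sketch of a run counter. It is not how stress works internally; `count_failures` is a hypothetical helper name, and it simply runs a command a fixed number of times and counts every run that does not exit with status 0 (a run killed by SIGSEGV counts as a failure).

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: run argv[0] with arguments argv, `runs` times,
 * one after another. Returns how many runs did not exit cleanly with
 * status 0 (nonzero exit, death by signal, or fork/wait failure).
 */
int count_failures(char *const argv[], int runs) {
	int failures = 0;

	assert(argv != NULL && argv[0] != NULL && runs >= 0);
	for (int i = 0; i < runs; i++) {
		pid_t pid = fork();
		if (pid < 0) {
			perror("fork");
			failures++;
			continue;
		}
		if (pid == 0) {
			/* Child: replace ourselves with the program under test. */
			execvp(argv[0], argv);
			_exit(127);	/* exec failed */
		}
		int status = 0;
		if (waitpid(pid, &status, 0) < 0 ||
		    !WIFEXITED(status) || WEXITSTATUS(status) != 0)
			failures++;
	}
	return failures;
}
```

Called with an argv of `{"./a.out", NULL}` and `runs` set to 100, this would give a count comparable to the sequential figure above, assuming the reproducer was built as `a.out`.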


@cherrymui cherrymui commented Jun 10, 2020

I got the following trace for the C program above using lldb. It faults within pthread_kill.

(lldb) run
Process 25547 stopped
* thread #22, stop reason = EXC_BAD_ACCESS (code=1, address=0x700000bc0046)
    frame #0: 0x00007fff73414bd0 libsystem_pthread.dylib`pthread_kill + 251
libsystem_pthread.dylib`pthread_kill:
->  0x7fff73414bd0 <+251>: movzwl 0x46(%rbx), %eax
    0x7fff73414bd4 <+255>: andl   $0xc00, %eax              ; imm = 0xC00 
    0x7fff73414bd9 <+260>: movl   $0x2d, %r15d
    0x7fff73414bdf <+266>: cmpl   $0x400, %eax              ; imm = 0x400 
  thread #37, stop reason = signal SIGURG
    frame #0: 0x00007fff7335b9de libsystem_kernel.dylib`__ulock_wait + 10
libsystem_kernel.dylib`__ulock_wait:
->  0x7fff7335b9de <+10>: jae    0x7fff7335b9e8            ; <+20>
    0x7fff7335b9e0 <+12>: movq   %rax, %rdi
    0x7fff7335b9e3 <+15>: jmp    0x7fff73359457            ; cerror_nocancel
    0x7fff7335b9e8 <+20>: retq   
Target 0: (a.out) stopped.

Process 25547 launched: '/private/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/a.out' (x86_64)
(lldb) bt
* thread #22, stop reason = EXC_BAD_ACCESS (code=1, address=0x700000bc0046)
  * frame #0: 0x00007fff73414bd0 libsystem_pthread.dylib`pthread_kill + 251
    frame #1: 0x0000000100000db0 a.out`thr + 80
    frame #2: 0x00007fff734122eb libsystem_pthread.dylib`_pthread_body + 126
    frame #3: 0x00007fff73415249 libsystem_pthread.dylib`_pthread_start + 66
    frame #4: 0x00007fff7341140d libsystem_pthread.dylib`thread_start + 13

Just a guess: it might be a race between pthread_kill and a thread exiting, where pthread_kill doesn't handle the race nicely (it should return an error instead of faulting).
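
One way to probe that guess is a variant of the reproducer in which no thread can exit while signals are still being sent. The sketch below is only that: it is untested on macOS 10.14, and the names `run_signal_storm`, `worker`, and `MAX_THREADS` are made up for illustration. Each thread sends its signals, then waits on a condition variable until every sender (all workers plus main) has finished, so pthread_kill never targets a thread that has already returned. It uses a mutex and condition variable rather than pthread_barrier_t, which macOS does not provide.

```c
#include <assert.h>
#include <pthread.h>
#include <signal.h>
#include <stdint.h>

#define MAX_THREADS 64

static pthread_t threads[MAX_THREADS];
static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
static int n_total;	/* number of worker threads in this run */
static int n_done;	/* senders (workers + main) that finished signaling */
static int n_errors;	/* pthread_kill calls that returned nonzero */

/* Record this sender's result; optionally wait for all other senders. */
static void finish_sending(int errs, int wait) {
	pthread_mutex_lock(&mu);
	n_errors += errs;
	if (++n_done == n_total + 1)
		pthread_cond_broadcast(&cv);
	while (wait && n_done < n_total + 1)
		pthread_cond_wait(&cv, &mu);
	pthread_mutex_unlock(&mu);
}

static void *worker(void *arg) {
	int i = (int)(uintptr_t)arg;
	int errs = 0;

	/* Signal every thread created before this one, as in the original. */
	for (int j = 0; j < i; j++)
		if (pthread_kill(threads[j], SIGURG) != 0)
			errs++;
	/* Stay alive until nobody can still be signaling us. */
	finish_sending(errs, 1);
	return NULL;
}

/* Runs one round with n workers; returns the number of pthread_kill errors. */
int run_signal_storm(int n) {
	int errs = 0;

	assert(n > 0 && n <= MAX_THREADS);
	n_total = n;
	n_done = 0;
	n_errors = 0;
	for (int i = 0; i < n; i++) {
		int rc = pthread_create(&threads[i], NULL, worker,
		    (void *)(uintptr_t)i);
		assert(rc == 0);
		(void)rc;
	}
	for (int i = 0; i < n; i++)
		if (pthread_kill(threads[i], SIGURG) != 0)
			errs++;
	finish_sending(errs, 0);
	for (int i = 0; i < n; i++)
		pthread_join(threads[i], NULL);
	return n_errors;
}
```

If this variant never faults on a 10.14 builder while the original still does, that would be consistent with the exit-race guess; if it still faults, the guess is probably wrong. SIGURG is left at its default disposition (ignored), as in the original program.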


@bcmills bcmills commented Jun 15, 2020

The darwin-amd64-race builder was upgraded to macOS 10.15 in early March (CL 222238), and the only failure since then was on a 10_14 builder. So that's consistent with the theory of a pthread_kill platform bug in macOS 10.14.


@ianlancetaylor ianlancetaylor commented Jun 16, 2020

Based on the discussion, closing this as a macOS 10.14 bug that is fixed in macOS 10.15.

If someone wants to figure out a workaround in Go, great, but it doesn't seem that we must do that.
