-
Notifications
You must be signed in to change notification settings - Fork 17.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
os/signal: flaky test on darwin/arm #10207
Comments
This happens when C code calls into Go code before any Go code calls into C code. For example, it could happen if a C++ global constructor calls a Go function. I don't think there is any other way it could happen. In a program that uses cgo, needextram is initialized to 1. It is set to 0 in the first cgo call, and is never to set to 1 again. The error occurs when a call from C code to Go code sees that needextram is still set to 1. |
In particular the failing test is TestStress.
Mercifully this is SIGUSR1 which rules out the EXC_BAD_ACCESS handler. Unfortunately I fear that this is some kind of bug in lldb's signal interception and replaying. Still investigating. |
My current working theory is that the new thread is getting a signal after the signal handler is installed but before it calls crosscall_arm1(ts.fn, ...). I believe this race exists on several platforms, but is much more likely on darwin/arm because darwin_arm_init_thread_exception_port happens in between. https://github.com/golang/go/blob/master/src/runtime/cgo/gcc_darwin_arm.c#L70 A cheap workaround just for darwin/arm would be to have the new thread set a condition that the creating thread can wait on, and only unblock signals after that. That would reduce the race to what it is on other platforms. If I'm right about this, a real fix would require the signals staying blocked until the new thread can successfully inform it that it is back in Go. Or have the signal handler know about this intermediate state. Not sure how to go about either option. @ianlancetaylor does this theory make any sense? |
The bug can only occur when C/C++ code, generated by cgo, calls into Go code before any Go code calls in C/C++ code. That is, something needs to call runtime·cgocallback_gofunc or runtime·cgocallback. So you are suggesting that when a signal occurs, it causes C/C++ code to call into Go code, even though nothing has called into C/C++ code. That would be a problem. But how could it happen? You say that calling crosscall_arm has something to do with fixing this, but I don't see that. Perhaps there is a problem there, but I don't see how it could be this problem. This problem can only happen in a callback from C/C++ code. And, to be clear, it has to be a callback from code generated by cgo, or something that is behaving the way that cgo behaves. I don't see any cgo code in os/signal. So, is there cgo in the darwin/arm support? cgo code that calls exported Go functions? I don't see anything but I don't know if I am looking in the right places. |
If the signal handler gets called somewhere around the call to In |
No, I am misunderstanding pthread_create. All signals are blocked when it's called, and that means the new thread starts with all signals blocked and none pending. So that cannot be cause. |
To confirm however, the cgocallback being triggered is the one in badsignal. (I put a print statement in it and then signal.TestStress eventually died trying to get runtime.printlock.) Now what's not clear is how badsignal is being called. |
Try adding some print statement in runtime/cgo to print the current thread
signal
mask, esp. in threadentry.
|
Huh, badsignal uses cgocallback. That can't possibly work. It will fail any time a program that uses cgo starts a new thread that receives a signal before it makes any cgo calls. This isn't a matter of the signal mask or of the new thread preparing itself to receive signals. The only question is whether the program has made any cgo calls yet. If it hasn't, it is doomed. The comment in badsignal explains the situation well enough, without bothering to point out that this can happen for any signal, not just SIGPROF. I think the only possible fix is for mstart1 to say something like
before calling initsig. And then the check in badsignal is not needed. |
@ianlancetaylor, I don't understand why it can't work. I think the only case
where it will fail is when the foreign thread is started by static
constructor.
In this specific case, I don't understand why our current solution is
inadequate.
we block all signals just before pthread_create, and the new thread should
inherent the parent threads's signal masks, so no signals will be sent to it
until it explicitly unblocks the signals, but at which point, g is
guaranteed to
be set in its TLS slot.
I still think something is wrong, either lldb's signal handling or the
kernel
fails to honor the signal masks.
|
I agree that we haven't found the cause of this bug yet, but we may as well fix badsignal on the way. After that's in, I'm going to do a bunch of instrumenting and see if I can work out what is happening. |
I think badsignal only breaks when threads are started by static
constructor,
which should be rare.
it won't break if the user code uses cgo to create threads, and there is
test
in misc/cgo/test to verify this.
|
@minux Yes, I was imprecise. It's when a thread is started by non-Go code before any cgo call, which does imply that the thread is started by a global constructor. |
Previously the extra m needed for cgo callbacks was created on the first callback. This works for cgo, however the cgocallback mechanism is also borrowed by badsignal which can run before any cgo calls are made. Now we initialize the extra M at runtime startup before any signal handlers are registered, so badsignal cannot be called until the extra M is ready. Updates #10207. Change-Id: Iddda2c80db6dc52d8b60e2b269670fbaa704c7b3 Reviewed-on: https://go-review.googlesource.com/7978 Reviewed-by: Ian Lance Taylor <iant@golang.org> Run-TryBot: David Crawshaw <crawshaw@golang.org>
Tentatively closing this, it appears to be fixed by golang.org/cl/7978. |
I occasionally see this failure on the darwin/arm builder:
Of the few examples I have, it's always in the os/signal package.
The text was updated successfully, but these errors were encountered: