Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: signal 2 received on thread with no signal stack #18600

Closed
jbardin opened this issue Jan 10, 2017 · 21 comments

Comments

Projects
None yet
7 participants
@jbardin
Copy link
Contributor

commented Jan 10, 2017

go1.8rc1 && master on darwin

Sending a SIGINT to a go build ./... caused a runtime error:

signal 2 received on thread with no signal stack
fatal error: non-Go code disabled sigaltstack

runtime stack:
runtime.throw(0x1499cf6, 0x20)
	/usr/local/go/src/runtime/panic.go:596 +0x95
runtime.noSignalStack(0x2)
	/usr/local/go/src/runtime/signal_unix.go:455 +0x94
runtime.sigtrampgo(0x2, 0xc4207fcd10, 0xc4207fcd78)
	/usr/local/go/src/runtime/signal_unix.go:238 +0x2d8
runtime.sigtramp(0xc420114386, 0x5, 0x5956, 0x200005a00000085, 0xc4202f43a0, 0x0, 0xe, 0xc4207fce30, 0x0, 0x3, ...)
	/usr/local/go/src/runtime/sys_darwin_amd64.s:240 +0x28

Managed to replicate this by building in a fairly large project, and running for a while

$ while true; do go build ./...& sleep .$RANDOM; kill -2 %; done
@bradfitz

This comment has been minimized.

Copy link
Member

commented Jan 10, 2017

@bradfitz bradfitz added this to the Go1.8Maybe milestone Jan 10, 2017

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Jan 11, 2017

Which version of Darwin?

Is there any more to that stack trace? Please try setting the environment variable GOTRACEBACK=system when running go build.

@jbardin

This comment has been minimized.

Copy link
Contributor Author

commented Jan 11, 2017

Sorry, running on macOS 10.12.2 (16C67)

Here's a full system stack trace: https://gist.github.com/jbardin/0421e4a7689ebc46732894f3f8e2af8d

I wasn't able to trigger it on Linux, but I may also just be missing the timing window on that machine.

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Jan 11, 2017

I've looked over the Go runtime code, and it looks fine.

@quentinmit Is it still possible to ssh into that Sierra system?

@quentinmit

This comment has been minimized.

Copy link
Contributor

commented Jan 17, 2017

@ianlancetaylor Yeah, I don't remember what we had actually set up though. Ping me on Hangouts and I can set it up again.

@jbardin Do you have any antivirus software installed? Or haxies?

@jbardin

This comment has been minimized.

Copy link
Contributor Author

commented Jan 17, 2017

@quentinmit, nope. running vanilla macOS on real hardware.

@quentinmit

This comment has been minimized.

Copy link
Contributor

commented Jan 17, 2017

"haxies" = things that inject code into running programs; Application Enhancer and SIMBL are examples of that.

@jbardin

This comment has been minimized.

Copy link
Contributor Author

commented Jan 17, 2017

Not running anything like that that I know of, unless macOS is injecting something by default.

I did see that the go tool is dynamically linked, so as an experiment I rebuilt the toolchain with CGO_ENABLED=0 and haven't been able to reproduce the crash. I'm not able to reliably reproduce the crash in the first place though, so this doesn't prove anything, just adds another data point.

@rsc

This comment has been minimized.

Copy link
Contributor

commented Jan 25, 2017

Reproduced on 10.12.2 16C67 using the while loop in $GOPATH/src/camlistore.org.

@rsc

This comment has been minimized.

Copy link
Contributor

commented Jan 25, 2017

I confirmed that the signal is arriving on the ordinary goroutine stack, and the stack trace for that goroutine at the time of the signal is:

syscall.RawSyscall(0x2, 0x0, 0x0, 0x0, 0x3, 0x3, 0x5c)
	/Users/rsc/go/src/syscall/asm_darwin_amd64.s:109 +0x25 fp=0xc420cf4e38 sp=0xc420cf4e30
syscall.forkAndExecInChild(0xc42072db90, 0xc4208b8210, 0x15, 0x15, 0xc420a78000, 0x40, 0x40, 0x0, 0xc4204ff2c0, 0xc420cf5110, ...)
	/Users/rsc/go/src/syscall/exec_bsd.go:69 +0xee fp=0xc420cf4f00 sp=0xc420cf4e38
syscall.forkExec(0xc42072dad0, 0x2b, 0xc420a643c0, 0x14, 0x14, 0xc420cf5110, 0x0, 0xc420cec118, 0xc420a1a9c0)
	/Users/rsc/go/src/syscall/exec_unix.go:193 +0x368 fp=0xc420cf5028 sp=0xc420cf4f00
syscall.StartProcess(0xc42072dad0, 0x2b, 0xc420a643c0, 0x14, 0x14, 0xc420cf5110, 0x2, 0x4, 0x0, 0x20)
	/Users/rsc/go/src/syscall/exec_unix.go:240 +0x64 fp=0xc420cf5080 sp=0xc420cf5028
os.startProcess(0xc42072dad0, 0x2b, 0xc420a643c0, 0x14, 0x14, 0xc420cf52b8, 0xc420778d90, 0x1, 0x0)
	/Users/rsc/go/src/os/exec_posix.go:45[436] 81429
 +0x1a3 fp=0xc420cf5168 sp=0xc420cf5080

That's the RawSyscall(SYS_FORK, ...) call that creates the child process.

I changed the runtime to record its own pid at init time and then at the time of the signal print both the init-time pid and the current pid. That appears to confirm that the signal has been received in the child process (at least not in the parent process!), before the child process has gotten a chance to continue executing after the fork system call.

It looks like kill -2 % is sending the signal to the background job's process group, not the parent pid (confirmed by looking at Apple's bash-103 source code: jobs.c is full of killpg calls). If the child has just been created, then it's in the process group and gets the signal but apparently has had alternate signal stack information cleared.

When we create a new thread we mask all signals, make the new thread, and then bring back the signals. We could do the same for fork, using the beforefork and afterfork runtime hooks, except there is no good time to run the afterfork hook and turn signals back on. Any window between 'after the fork' and 'before the exec' where signals are enabled will confuse the runtime if a signal is received. And we can't start the child with signals masked off, or it will likely keep them masked off.

It seems like the best we can do is detect when we receive a signal in a child process that hasn't quite gotten to exec yet and make the child die quietly?

I assume that using posix_spawn would be another way around this.

I'm a little surprised this hasn't turned up on other systems too, but I guess the race window is small and it only affects killpg. Or maybe macOS is the only one that clears alternate signal stacks during fork.

/cc @ianlancetaylor for thoughts about how to proceed

@rsc

This comment has been minimized.

Copy link
Contributor

commented Jan 25, 2017

Confirmed that using 'kill -2 $! ' instead of 'kill -2 %' makes the problem go away: this is about killpg, not a kernel bug where parent signals get delivered to a child.

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Jan 25, 2017

This problem does not happen on GNU/Linux because the child process still has an alternate signal stack in effect. At least, that is what I see from this program, which on GNU/Linux prints

0x2517010 0 8192

Program:

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

static void
die(const char *s)
{
  perror(s);
  exit(EXIT_FAILURE);
}

int
main()
{
  stack_t st;
  pid_t pid;

  st.ss_sp = malloc(SIGSTKSZ);
  if (st.ss_sp == NULL)
    die("malloc");
  st.ss_flags = 0;
  st.ss_size = SIGSTKSZ;
  if (sigaltstack(&st, NULL) < 0)
    die("sigaltstack");

  pid = fork();
  if (pid < 0)
    die("fork");
  if (pid == 0) {
    if (sigaltstack(NULL, &st) < 0)
      die("sigaltstack in child");
    printf("%p %#x %zu\n", st.ss_sp, st.ss_flags, st.ss_size);
    exit(EXIT_SUCCESS);
  }
  if (waitpid(pid, NULL, 0) < 0)
    die("waitpid");
  exit(EXIT_SUCCESS);
}
@rsc

This comment has been minimized.

Copy link
Contributor

commented Jan 25, 2017

On macOS 12.12.2 I get

0x0 0x4 0

OK, so we know that Linux doesn't clear sigaltstack at fork (that's nice), but there's still technically a problem. A signal has arrived and we're handling it as if we're in the parent, but we're in the child. If the signal handler did anything that depended on being in the real parent process, we'd be in trouble. We probably do need to either make this impossible or detect it and react accordingly.

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Jan 25, 2017

The most correct procedure I can think of is

  1. block all signals before fork
  2. after fork in parent, restore original signal mask
  3. after fork in child, change all signal handlers that are not SIG_IGN to SIG_DFL, then restore signal mask

I will work on a CL.

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Jan 25, 2017

To clarify, the reason for changing all the signal handlers to SIG_DFL is 1) that avoids the problem of invoking a signal handler in the child; 2) that is what execve is going to do anyhow, so any race condition already exists.

@rsc

This comment has been minimized.

Copy link
Contributor

commented Jan 25, 2017

SGTM. Thanks.

@rsc

This comment has been minimized.

Copy link
Contributor

commented Jan 25, 2017

I think this can wait for Go 1.9. We're more likely to break something common than fix this rare event.

@aclements

This comment has been minimized.

Copy link
Member

commented Jun 13, 2017

@gopherbot

This comment has been minimized.

Copy link

commented Jun 13, 2017

CL https://golang.org/cl/45471 mentions this issue.

@aclements

This comment has been minimized.

Copy link
Member

commented Jul 20, 2017

@ianlancetaylor should this (plus the follow-up fix 28f650a) get back-ported to Go 1.8.4?

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Jul 20, 2017

Russ decided above that this was rare enough that we wouldn't worry about for 1.8.

@golang golang locked and limited conversation to collaborators Jul 20, 2018

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
You can’t perform that action at this time.