Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: cgo stuck because of go signal handler went into dead loop if pid=1 #59569

Open
cocktail828 opened this issue Apr 12, 2023 · 14 comments
Assignees
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made.
Milestone

Comments

@cocktail828
Copy link

What version of Go are you using (go version)?

$ go version
go version go1.19 linux/amd64

Does this issue reproduce with the latest release?

Sure

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE="auto"
GOARCH="amd64"
GOBIN=""
GOCACHE="/root/.cache/go-build"
GOENV="/root/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/root/go/pkg/mod"
GONOPROXY=
GONOSUMDB=
GOOS="linux"
GOPATH="/root/go"
GOPRIVATE=
GOPROXY="https://goproxy.cn,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.19"
GCCGO="gccgo"
GOAMD64="v1"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
GOWORK=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build2188387714=/tmp/go-build -gno-record-gcc-switches"

What did you do?

The bug found in go1.10.3 at first, the I update go to go1.19, it happends too. I find the root case lives here:

In almost all real cases the program is about to crash, but in my situation, it does not after usleep. But why, I cannot explain!!!.

Then I try fix it myself with the following patch:

	raise(sig)
	// Give the signal a chance to be delivered.
	// In almost all real cases the program is about to crash,
	// so sleeping here is not a waste of time.
	usleep(1000)

	switch sig {
	case _SIGSEGV, _SIGABRT, _SIGFPE, _SIGILL, _SIGBUS:
		print("fatal error: unexpected signal during runtime execution\n")
		signame := signame(sig)
		if signame != "" {
			print("[signal ", signame)
		} else {
			print("[signal ", hex(sig))
		}
		print(" code=", hex(c.sigcode()), " addr=", hex(c.sigaddr()), "]\n")
		dieFromSignal(sig)

	default:
		// If the signal didn't cause the program to exit, restore the
		// Go signal handler and carry on.
		//
		// We may receive another instance of the signal before we
		// restore the Go handler, but that is not so bad: we know
		// that the Go program has been ignoring the signal.
		setsig(sig, abi.FuncPCABIInternal(sighandler))
	}

What did you expect to see?

panic

What did you see instead?

See #56649

@seankhliao seankhliao added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Apr 12, 2023
@seankhliao
Copy link
Member

We will still need a way to reproduce this issue to verify any fixes

@seankhliao seankhliao changed the title affected/package: runtime runtime: cgo stuck because of go signal handler went into dead loop Apr 12, 2023
@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Apr 12, 2023
@prattmic
Copy link
Member

What does “containered CentOS 7” mean? Does this change the semantics of rt_sigaction? We set the action for SIGSEGV to DFL, so it shouldn’t be possible to deliver to a handler anymore.

@cocktail828
Copy link
Author

First thanks for your help.

What does “containered CentOS 7” mean? Does this change the semantics of rt_sigaction?

The container is built from centos 7.
The host is running kernel 3.10.0 and we don't make any change so rt_sigaction should work as expect.

We set the action for SIGSEGV to DFL, so it shouldn’t be possible to deliver to a handler anymore.

Yes, I can see that from strace. It should crash after usleep, but it doesn't. That's what I confused. Maybe the kernel bug???
The issue can only reproduce in small probility.

I can keep an environment for debug the next time it reproduce, if anybody interest enough to debug the issue.

@cocktail828
Copy link
Author

I don't want to merge the patch, but want somebody professional enough to review the changes.

@ianlancetaylor
Copy link
Contributor

Can you show us the output from strace -f? Thanks.

@prattmic
Copy link
Member

@ianlancetaylor the strace is on #56649 (we probably should have reopened that one rather than making a new bug)

@ianlancetaylor
Copy link
Contributor

Thanks. I also don't understand that strace. It appears to be running the code in raisebadsignal. That means that the SIGSEGV was received on a thread that was not started by Go code.

// unblocksig(sig)
[pid 108096] rt_sigprocmask(SIG_UNBLOCK, [SEGV], NULL, 8) = 0
// setsig(sig, handler) where handler == SIG_DFL
[pid 108096] rt_sigaction(SIGSEGV, {SIG_DFL, ~[], SA_RESTORER|SA_STACK|SA_RESTART|SA_SIGINFO, 0x7f695ef1d630}, NULL, 8) = 0
// raise(sig)
[pid 108096] getpid()                   = 1
[pid 108096] gettid()                   = 223
[pid 108096] tgkill(1, 223, SIGSEGV)    = 0
[pid 108096] --- SIGSEGV {si_signo=SIGSEGV, si_code=SI_TKILL, si_pid=1, si_uid=0} ---
// usleep(1000)
[pid 108096] nanosleep({0, 1000000}, NULL) = 0

In other words the process receives a SIGSEGV signal while the signal handler is set to SIG_DFL and the signal is not blocked. But the program just keeps running. That should not happen.

You said that this only happens occasionally. What does the strace -f output look like when the program does not go into an endless loop? Thanks.

@cocktail828
Copy link
Author

In other words the process receives a SIGSEGV signal while the signal handler is set to SIG_DFL and the signal is not blocked. But the program just keeps running. That should not happen.

Yep. I agree with u. So this maybe the kernel bug, I guess. Recently, we also request our SRE to record the kernel version of the host which reproduce the issue.

You said that this only happens occasionally. What does the strace -f output look like when the program does not go into an endless loop? Thanks.

It take some times. I will provide the log and keep an environment for debug the next time it reproduce.
Thank you very much.

@cocktail828
Copy link
Author

@ianlancetaylor
The following attachment is the output by strace -f -p ${pid}. May you get something useful from it.
strace-f.txt

Thanks in advance.

@ianlancetaylor
Copy link
Contributor

Thanks. I just noticed si_pid=1 in the strace. I think that must be the problem. It sounds like you are running this program in a container, and you are running it as process 1. As it happens, process 1 is magic in Unix kernels: the kernel only delivers signals that actually have signal handlers. So when the Go runtime removes the signal handlers and raises the signal again, nothing happens. Now I don't know why this ever works at all, at least when running as process 1.

I don't know how much we care about this edge case. But we could change raisebadsignal to check for a process ID of 1 and a signal expected to kill the process.

@ianlancetaylor ianlancetaylor removed the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label May 5, 2023
@cocktail828
Copy link
Author

It's weird, the process crash and exit in almost all case, and they all run as pid 1 in a container. In other words, a container with only one process(no matter it comes from C/C++, Golang, or others) should crash and exit for SIGSEGV, or something unexpected should happed. Image a case, that all container crash but not exit... So I still think there is a bug hide in go or the kernel.

Thanks for your attention!

@cherrymui cherrymui changed the title runtime: cgo stuck because of go signal handler went into dead loop runtime: cgo stuck because of go signal handler went into dead loop if pid=1 May 9, 2023
@cherrymui cherrymui added the NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. label May 9, 2023
@cherrymui cherrymui added this to the Backlog milestone May 9, 2023
@ianlancetaylor
Copy link
Contributor

The failure will only occur if the signal is delivered on a thread that was created by C, not by Go. So there is some unpredictability as to precisely how it will be handled. It is only when the signal is delivered to a C thread that we go into the endless loop of signal delivery due to running on PID 1.

@prattmic
Copy link
Member

For reference, the kernel ignores signals with SIG_DFL handler for init tasks (SIGNAL_UNKILLABLE) unless it is a "forced" signal: https://elixir.bootlin.com/linux/v6.3/source/kernel/signal.c#L89

Kernel generated signals (such as SIGSEGV from a page fault) are "forced": https://elixir.bootlin.com/linux/v6.3/source/kernel/signal.c#L1245

So we should be able to raise the signal by explicitly faulting after removing the signal handler.

@cocktail828
Copy link
Author

cocktail828 commented May 11, 2023

Yep, you are right. I can reproduce the issue myself with follwing step.
And the issue will not be reproduce when not ru at pid 1

  • step1: compile the code and get an executable a.out
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

#ifndef SYS_gettid
#error"SYS_gettid unavailable on this system"
#endif
#define gettid() ((pid_t)syscall(SYS_gettid))

long sys_tgkill(pid_t pid, pid_t tid, int signo) {
    printf(">>>>>%ld\n", syscall(SYS_tgkill, pid, tid, signo));
}

pid_t tid_send_signal;
pid_t tid_recv_signal;

void* thread_func(void* arg) {
    pid_t tid = gettid();
    if (*(int*)arg)
        tid_send_signal = tid;
    else
        tid_recv_signal = tid;
    printf("======%d %ld\n", *(int*)arg, tid);

    pthread_detach(pthread_self());
    sleep(1);

    while (1) {
        if (*(int*)arg)
                sys_tgkill(getpid(), tid_recv_signal, SIGSEGV);
        sleep(1);
    }
}

int main(int argc, char* argv[]) {
    pthread_t t1, t2;
    int should_send_signal = 1;
    pthread_create(&t1, NULL, thread_func, &should_send_signal);

    int should_recv_signal = 0;
    pthread_create(&t2, NULL, thread_func, &should_recv_signal);
    while (1) { sleep(100); }
    return 0;
}
  • step2: make an docker image, and add the a.out into the image.
cat > Dockerfile <<EOF
from debian:bullseye-slim
add a.out /root
cmd ["/root/a.out"]
EOF
  • step3: run the image and the issue reproduce...
[root@sentry-dev ~]# strace -ff -p `pidof a.out`
strace: Process 23367 attached with 3 threads
[pid 23399] restart_syscall(<... resuming interrupted read ...> <unfinished ...>
[pid 23367] restart_syscall(<... resuming interrupted clock_nanosleep ...> <unfinished ...>
[pid 23398] restart_syscall(<... resuming interrupted restart_syscall ...> <unfinished ...>
[pid 23399] <... restart_syscall resumed>) = 0
[pid 23399] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0},  <unfinished ...>
[pid 23398] <... restart_syscall resumed>) = 0
[pid 23398] getpid()                    = 1
[pid 23398] tgkill(1, 8, SIGSEGV <unfinished ...>
[pid 23399] <... clock_nanosleep resumed>{tv_sec=0, tv_nsec=983584817}) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
[pid 23398] <... tgkill resumed>)       = 0
[pid 23399] --- SIGSEGV {si_signo=SIGSEGV, si_code=SI_TKILL, si_pid=1, si_uid=0} ---
[pid 23398] write(1, ">>>>>0\n", 7 <unfinished ...>
[pid 23399] restart_syscall(<... resuming interrupted clock_nanosleep ...> <unfinished ...>
[pid 23398] <... write resumed>)        = 7
[pid 23398] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0},  <unfinished ...>
[pid 23399] <... restart_syscall resumed>) = 0
[pid 23399] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0},  <unfinished ...>
[pid 23398] <... clock_nanosleep resumed>0x7f72f8363e80) = 0
[pid 23398] getpid()                    = 1
[pid 23398] tgkill(1, 8, SIGSEGV <unfinished ...>
[pid 23399] <... clock_nanosleep resumed>{tv_sec=0, tv_nsec=983160829}) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
[pid 23398] <... tgkill resumed>)       = 0
[pid 23399] --- SIGSEGV {si_signo=SIGSEGV, si_code=SI_TKILL, si_pid=1, si_uid=0} ---
[pid 23399] restart_syscall(<... resuming interrupted clock_nanosleep ...> <unfinished ...>
[pid 23398] write(1, ">>>>>0\n", 7)     = 7
[pid 23398] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, ^Cstrace: Process 23367 detached
strace: Process 23398 detached
 <detached ...>
strace: Process 23399 detached
[root@sentry-dev ~]#

Notice:
run with --init to make the process not start with pid 1, will fix this issue!

[root@sentry-dev xxx]# docker run -it --rm --init xxx
======1 8
======0 9
>>>>>0 # the process exit successfully

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made.
Projects
Development

No branches or pull requests

6 participants