runtime: SIGSEGV after performing clone(CLONE_PARENT) via C constructor prior to runtime start #65625
Comments
I can't repro this locally (Fedora 39, kernel 6.7.3-200.fc39.x86_64). |
I tested two kernels:
Is there a race condition? I have only seen one run produce the correct log; most of the time the output is wrong. |
Bisecting details
For bisecting, I used the code provided by @lifubang (see above), an x86_64 VM with Ubuntu 20.04 installed, and the following script:
#!/bin/bash
set -e -u -o pipefail
# Build go.
(cd src && ./make.bash)
# Build our repro.
./bin/go build -o main-tst ~/cgoclone2/main.go
# Run it; check if "main" is printed.
./main-tst | grep main
The bisect itself (cd to the Go source repo first) can be run like this:
|
Most of the code in CL 519457 deals with libc/pthreads. I wonder if this is actually related to the kernel version or to the glibc version. Could you test with a newer version of glibc on the machine with the older kernel (e.g., by running in a Docker container based on a newer distro that ships a newer glibc)? Could you also provide |
cc @golang/runtime |
@prattmic it looks like it is indeed related to glibc: Here's a backtrace from a failed run:
and the last few lines from
This is glibc 2.31 (libc6:amd64 2.31-0ubuntu9.14). I will run it with a newer glibc later today. |
We initialize the I bet that call is failing when the
|
After reading the reproducer more closely, I can see how this could happen.
This reproducer is doing I don't have a reproducer environment, but I suspect if you adjust the program like this:
That it will hit the abort. And that even extracting this program from the Go file and making it 100% C would have the same effect. Why does it work on some versions of glibc and not on others? I suspect that the old version of glibc sets some global state prior to |
In my testing it doesn't, and Going to try a different glibc now. |
Hmm, glibc 2.31 from an older release of Fedora (Fedora 32) works fine (I tried the glibc-2.31-6.fc32.x86_64 rpm, the glibc-2.31-2.fc32.x86_64 rpm, and glibc-2.31 compiled from source). glibc 2.32 from Fedora 33 also works. glibc 2.31-13+deb11u7 from Debian 11 also works, as do older versions (2.31-13+deb11u5, 2.31-7, 2.31-1). My guess is that the issue is somehow specific to Ubuntu (their glibc patches, their gcc patches, or some configuration detail, but not the kernel). I also tried bisecting upstream glibc, but that is quite challenging. |
@dr2chase could you please specify what kind of info you need me to provide (beyond what was provided above)? |
To reproduce this, I added this Dockerfile to https://github.com/lifubang/cgoclone2, and placed a GOROOT source tree in the
|
My theory of a stale PID from #65625 (comment) is indeed the immediate cause of issues:
We are on TID/PID 982, but pthreads thinks the TID is 980, and ultimately passes this to |
OK, I think I've gotten to the bottom of this. tl;dr, it is indeed the same issue in #65625 (comment). This C program reproduces the problem:
#define _GNU_SOURCE
#include <endian.h>
#include <errno.h>
#include <fcntl.h>
#include <grp.h>
#include <limits.h>
#include <pthread.h>
#include <sched.h>
#include <setjmp.h>
#include <signal.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/prctl.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <linux/limits.h>
#include <linux/netlink.h>
#include <linux/types.h>
#include <sched.h>
#define STAGE_SETUP -1
#define STAGE_PARENT 0
#define STAGE_CHILD 1
#define STAGE_INIT 2
int current_stage = STAGE_SETUP;
struct clone_t {
char stack[4096] __attribute__((aligned(16)));
char stack_ptr[0];
jmp_buf *env;
int jmpval;
};
static int child_func(void *arg) __attribute__((noinline));
static int child_func(void *arg)
{
struct clone_t *ca = (struct clone_t *)arg;
longjmp(*ca->env, ca->jmpval);
}
static int clone_parent(jmp_buf *env, int jmpval) __attribute__((noinline));
static int clone_parent(jmp_buf *env, int jmpval)
{
struct clone_t ca = {
.env = env,
.jmpval = jmpval,
};
return clone(child_func, ca.stack_ptr, CLONE_PARENT | SIGCHLD, &ca);
}
static void nsexec() {
jmp_buf env;
current_stage = setjmp(env);
switch (current_stage) {
case STAGE_PARENT: {
printf("STAGE_PARENT\n");
clone_parent(&env, STAGE_CHILD);
exit(0);
}
break;
case STAGE_CHILD: {
printf("STAGE_CHILD\n");
clone_parent(&env, STAGE_INIT);
exit(0);
}
break;
case STAGE_INIT: {
printf("STAGE_INIT\n");
}
break;
}
printf("This from nsexec\n");
return;
}
void __attribute__((constructor)) init(void) {
nsexec();
pthread_attr_t attr;
int ret = pthread_getattr_np(pthread_self(), &attr);
if (ret != 0) {
printf("pthread_getattr_np: %s\n", strerror(ret));
/* Try to destroy attr anyway. Bad idea, because getattr fails, but this is what Go does. */
pthread_attr_destroy(&attr);
abort();
}
}
int main(void) {
printf("Hello from main!\n");
}
With glibc 2.31, it crashes in the same way as Go:
With glibc 2.37 (the local version I happen to have), it still gets ESRCH, but doesn't SIGSEGV:
In #65625 (comment), I assumed that So newer glibc is still getting confused about the current TID, it just isn't crashing in In my opinion, this is ultimately a bug in the C program (which I assume is extracted from runc). i.e., it is not safe to do anything not async-signal-safe after I don't think that Go should work around this. I do think we should check for errors from |
Change https://go.dev/cl/563379 mentions this issue: |
In triage, we're of the opinion at this point that this is generally not a bug in the Go project. However, we'll keep the issue open until we land a planned change to add error checking to some of the glibc calls made (that @prattmic described in the previous comment), and at least try to fail with a nicer error. |
Go 1.22 currently causes crashes on older Debian/Ubuntu systems. lxc/incus#497 golang/go#65625 opencontainers/runc#4193 Signed-off-by: Stéphane Graber <stgraber@stgraber.org>
If we remove |
I managed to finally root-cause the reason for the stale @lifubang pointed out that the PID cache was removed from glibc in 2016, which also removed the code they had to update Critically, this means the issue is not with Ultimately I think the core issue is that we are violating |
Fixes: golang#65625 Before go1.22, the example in golang#65625 can be run successfully, though the core issue is in old version glibc, but it will be better to provide this backward compatibility to let people can upgrade to go 1.22+. Signed-off-by: lifubang <lifubang@acmcoder.com>
Change https://go.dev/cl/585019 mentions this issue: |
Change https://go.dev/cl/587919 mentions this issue: |
Change https://go.dev/cl/587920 mentions this issue: |
In glibc versions older than 2.32 (before commit 4721f95), pthread_getattr_np does not always initialize the `attr` argument, and when it fails, it results in a NULL pointer dereference in pthread_attr_destroy down the road. This is the simplest way to avoid this, and an alternative to CL 585019. Updates #65625. Change-Id: If490fd37020b03eb084ebbdbf9ae0248916426d0 Reviewed-on: https://go-review.googlesource.com/c/go/+/587919 Auto-Submit: Ian Lance Taylor <iant@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Ian Lance Taylor <iant@google.com> Reviewed-by: Cherry Mui <cherryyz@google.com> TryBot-Result: Gopher Robot <gobot@golang.org> Run-TryBot: Cherry Mui <cherryyz@google.com>
I forgot to mention this here after we did some more work on the issue on the runc side -- it turns out we cannot |
This is being fixed in Go 1.22 by https://go.dev/cl/587920, can we please have |
Sorry, I am not very familiar with the backport process in this project; if I'm doing something wrong, please point me to the relevant documentation. |
@kolyshkin You can find documentation for our backporting process at https://go.dev/wiki/MinorReleases. If you believe this issue meets the criteria to be considered for backporting, you should ask @gopherbot to create backport issues, and include a rationale. The 1.22 backport issue will be automatically added to the Go1.22.4 milestone. Thanks. |
@gopherbot please consider this for backport to 1.22, it’s a major regression for runc (it stopped working completely). |
Backport issue(s) opened: #67650 (for 1.22). Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases. |
See * https://github.com/sylabs/singularity/releases/tag/v4.1.3 * sylabs/singularity#2677 * golang/go#65625 * https://dev.arvados.org/issues/21705#note-13 Arvados-DCO-1.1-Signed-off-by: Tom Clegg <tom@curii.com>
Go version
go version go1.22.0 linux/amd64
Output of go env in your module/workspace:
What did you do?
The main language of runc is Go, but we use C to enter some Linux namespaces. Go 1.22.0 was released recently, and when we tried to bump the Go version to 1.22.0 (opencontainers/runc#4193), CI failed. It seems that after we call clone(2) in C, the child process can't return to Go if the first process exited in C.
The test code is in https://github.com/lifubang/cgoclone2/blob/main/main.go
I think we should see "From main!" in the last line.
What did you see happen?
root@acmcoder:/home/acmcoder/cgo# go version
go version go1.22.0 linux/amd64
root@acmcoder:/home/acmcoder/cgo# go run main.go
STAGE_PARENT
STAGE_CHILD
STAGE_INIT
This from nsexec
What did you expect to see?
root@acmcoder:/home/acmcoder/cgo# go version
go version go1.21.1 linux/amd64
root@acmcoder:/home/acmcoder/cgo# go run main.go
STAGE_PARENT
STAGE_CHILD
STAGE_INIT
This from nsexec
From main!