-
Notifications
You must be signed in to change notification settings - Fork 17.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
runtime: crash since Go 1.22.5 on Alpine (newstack at runtime.printlock) #68285
Comments
Do you have any local patches to the Go standard library? I know that some Alpine systems use local patches. |
You are hitting the fatal error at https://cs.opensource.google/go/go/+/refs/tags/go1.22.5:src/runtime/cgocall.go;l=241 due to an apparently bogus SP.
From the mention of FrankenPHP, I assume this is just a continuation of #62130? In #62130 (comment) and @dunglas' reply, we established that FrankenPHP changes stacks between a Go-to-C call and subsequent C-to-Go callback on the same thread. Go does not support this, and we never made a concrete decision on whether we should support this or not. Seems like this should be closed as a duplicate of #62130? |
@ianlancetaylor we're using the official Docker images for Go (Alpine variants), I assume the standard library is not patched. @prattmic this doesn't seem to be the same problem as #62130. #62130 is triggered only in a special case that is not very popular (Fibers, a new PHP feature that changes stack). Since yesterday, all FrankenPHP installations (including ones not using Fibers) on Alpine are crashing, it wasn't the case before Go 1.22.5, so I suppose this is a problem introduced in this version. |
We have an easy way to reproduce and debug, feel free to contact me in private (kevin@dunglas.dev, X, etc) if you want assistance to setup a reproducer. |
Thanks for the clarification, let's keep this separate then. |
cc @golang/runtime @cherrymui |
Not sure if this is actionable / supported, but assigning for the moment to Michael. |
Here is the stack trace with http://go.dev/cl/585817, as suggested by @prattmic:
To be honest, I've no idea what's going on. It may be a bug in our code, but it's surprising that the problem only occurs when using musl. I tested and can confirm that the problem only happens when using musl (both with Alpine and with a static build) and not with glibc. |
I created a custom Go build that reverts 3560cf0, and this makes our test suite green. So I can confirm that the problem has been introduced by this commit: https://github.com/dunglas/frankenphp/actions/runs/9877453302/job/27279127840?pr=913 Would it be possible to revert it while a better fix isn't available? I'm not confident relying on a custom build, and the latest patch version contains security fixes in |
@dunglas thanks for the update! The reason why this is musl specific is probably that the code getting the stack bounds https://source.corp.google.com/h/go/go/+/release-branch.go1.22:src/runtime/cgo/gcc_stack_unix.c;l=24 uses different code paths for glibc and musl. With glibc, we get accurate stack bounds from pthread. With musl, it goes to the fallback path (the It seems musl also supports But there is still an issue on systems that we couldn't get the accurate bounds. It seems this is a C created thread, which calls into Go, which then calls back into C, which calls back to Go again, which throws. Presumably during the first C->Go call it gets the estimated stack bounds. At the inner C-> Go call, the SP is slightly above the estimated upper bound, which is weird. I don't see how this could happen. The inner callback should be at a lower stack address than the outer callback... @dunglas could you confirm if the code does C->Go->C->Go calls? And does it adjust the stacks in anyway, or using non-local control flow like Perhaps in the case that we can't get the accurate bounds we don't throw if the SP is out of the estimated bounds? But that can also be tricky as if we update the bounds in the inner callback, when it returns, we may have weird stack bounds for the outer callback... |
Maybe what happens is that
The callback at step 4 is at a shallower stack (higher SP), but it sees the stack bounds from step 2, which is at a lower address, then throws. @dunglas could you confirm that the code does C->Go calls like that? Thanks. |
@cherrymui thank you very much for investigating! Yes: FrankenPHP creates a C thread (in C: https://github.com/dunglas/frankenphp/blob/4fab5a3169357f6c6cadee1aa77ff8fa8480419f/frankenphp.c#L810), and the C code calls back to Go (and the Go code then calls the C code, etc). The C code uses Let me know if you want me to try some patches (I can try to use |
Using Regarding the detection problem, a solution could be to check if the function is available (using a shell script similar to I can work on a patch doing this if you agree on the approach. |
Here are my experiments so far: diff --git a/src/make.bash b/src/make.bash
index 76ad51624a..c6c3b87157 100755
--- a/src/make.bash
+++ b/src/make.bash
@@ -138,6 +138,13 @@ if [[ "$(uname -s)" == "GNU/kFreeBSD" ]]; then
export CGO_ENABLED=0
fi
+# Check pthread_getattr_np() availability
+if [[ "${CGO_ENABLED}" != "0" ]] && ${CC:-gcc} -S runtime/cgo/testdata/detect_pthread_getattr_np.c -o /dev/null 2> /dev/null; then
+ export CGO_CFLAGS="-DHAVE_PTHREAD_GETATTR_NP ${CGO_CFLAGS}"
+ export CFLAGS="-DHAVE_PTHREAD_GETATTR_NP ${CFLAGS}"
+ echo $CGO_CFLAGS
+fi
+
# Clean old generated file that will cause problems in the build.
rm -f ./runtime/runtime_defs.go
diff --git a/src/runtime/cgo/gcc_stack_unix.c b/src/runtime/cgo/gcc_stack_unix.c
index 884281dc15..39d3402eca 100644
--- a/src/runtime/cgo/gcc_stack_unix.c
+++ b/src/runtime/cgo/gcc_stack_unix.c
@@ -21,7 +21,7 @@ x_cgo_getstackbound(uintptr bounds[2])
// Needed before pthread_getattr_np, too, since before glibc 2.32
// it did not call pthread_attr_init in all cases (see #65625).
pthread_attr_init(&attr);
-#if defined(__GLIBC__) || (defined(__sun) && !defined(__illumos__))
+#if defined(HAVE_PTHREAD_GETATTR_NP) || defined(__GLIBC__) || (defined(__sun) && !defined(__illumos__))
// pthread_getattr_np is a GNU extension supported in glibc.
// Solaris is not glibc but does support pthread_getattr_np
// (and the fallback doesn't work...). Illumos does not.
diff --git a/src/runtime/cgo/testdata/detect_pthread_getattr_np.c b/src/runtime/cgo/testdata/detect_pthread_getattr_np.c
new file mode 100644
index 0000000000..f15fcd0e25
--- /dev/null
+++ b/src/runtime/cgo/testdata/detect_pthread_getattr_np.c
@@ -0,0 +1,10 @@
+/* Dummy program used to check if pthread_getattr_np() is available. */
+
+#define _GNU_SOURCE
+#include <pthread.h>
+
+int main(void) {
+ pthread_getattr_np(pthread_self(), NULL);
+
+ return 0;
+} The detection code of |
Thanks for the update! I don't think we want to introduce any autoconf-like things into Either way, I think we still want to make the fallback path (if
|
PHP doesn't manage threads itself, it's up to the webserver to create and manage them. In our case, we use simple ad-hoc code: https://github.com/dunglas/frankenphp/blob/323edefc4b7cb69bcb318745efac4dec9b7ac488/frankenphp.c#L810. It indeed only
This seems indeed a good fix. By the way, @withinboredom instigated #62130, and it appears that having more permissive checks fixes this issue too: dunglas/frankenphp#46 (comment) |
What do you think about just skipping the initial bound check in |
I don't think we can just skip the bounds check. If C1 calls Go1, which calls C2, which calls Go2, if at Go2 (where |
When I was originally approaching this problem (before I simply deleted the check to see how it crashed -- it didn't), my idea was to "trick" go into thinking it was a C1' to G1' type call via a stack -- probably a less refined version of what you are thinking. My thinking was that we could basically ignore the original G1 call until we returned back to C2. |
Since go 1.22.5, this results in a panic. There is an upstream bug report at [1] for musl, but the same thing happens with bionic. The same fix of using pthread_getattr_np() on bionic also works, but it's better to avoid needing folks to recompile the go toolchain. We were just using these calls as a small hack to pass a list of structs to the Java side, which we can easily do another way. [1] golang/go#68285 Signed-off-by: Andrew Gunnerson <accounts+github@chiller3.com>
I encountered this crash with bionic libc on Android as well (calling Java functions over JNI via gomobile). It looks like bionic also supports |
CL 600296 implements the update+restore logic. Could you try if this works with your case? Thanks. |
Change https://go.dev/cl/600296 mentions this issue: |
@chenxiaolong thanks. We could add |
I took a quick look and I think we should be good with unconditionally enabling it for Android. |
@cherrymui I will give this a test! |
@cherrymui CL 600296 fixes this issue with FrankenPHP: https://github.com/dunglas/frankenphp/actions/runs/10091715703/job/27903861001?pr=938 🎉 |
I think I'm also hitting this on FreeBSD 14.1 (amd64). I'm in a C->Go->C path that happens right after another apparently stack-hungry C->Go->C path that shares the first C->Go part. I verified that everything was fine with I applied CL 600296 to As a side note, the stack-hungry C->Go->C path (which initializes numpy), triggers So aside from testing the way ahead I'm in the search of a good workaround for going backwards. Is there any way to see how far are the stack bounds? |
I think I hit the same issue, with Gio, on Android. Which does a lot of C -> Go and Go -> C calls. It shows "fatal error: runtime: stack split at bad time" since I update to Go 1.22.6. I rollback to 1.22.4 and it's working fine. Since the issue still open, I'm not sure if CL 600296 landed in any version of Go. |
I'm hitting the same issue on Windows with go1.23.2 and go-gl/glfw in glfw.PoolEvents that calls to C code which can generate events consumed by callbacks in Go. It only happens if my project is build with @cherrymui Applying CL 600296 fixes the issue so far. |
Currently, at a cgo callback where there is already a Go frame on the stack (i.e. C->Go->C->Go), we require that at the inner Go callback the SP is within the g0's stack bounds set by a previous callback. This is to prevent that the C code switches stack while having a Go frame on the stack, which we don't really support. But this could also happen when we cannot get accurate stack bounds, e.g. when pthread_getattr_np is not available. Since the stack bounds are just estimates based on the current SP, if there are multiple C->Go callbacks with various stack depth, it is possible that the SP of a later callback falls out of a previous call's estimate. This leads to runtime throw in a seemingly reasonable program. This CL changes it to save the old g0 stack bounds at cgocallback, update the bounds, and restore the old bounds at return. So each callback will get its own stack bounds based on the current SP, and when it returns, the outer callback has the its old stack bounds restored. TODO: not sure if this works if inner callback panics and outer callback recovers. Maybe we need to do something in unwindm? TODO: do we do this only when we cannot get the accurate bounds, and still throw if we can? This makes it not throw if the C code switches stacks. What else could go wrong? For golang#68285. Change-Id: I3422badd5ad8ff63e1a733152d05fb7a44d5d435
This reverts commit 7ea28b0. This commit was originally added to pin the golang build version to 1.20.7 because of an issue in golang related to CGO: golang/go#62130 (comment) This issue was fixed for most systems, although the issue is still open because it appears to have opened up a separate issue with MUSL-based systems (such as alpine): golang/go#68285 Since the fluent-bit container is on a glibc-based system (Amazon Linux), we should be safe to revert back to building with the latest version of golang.
Go version
go version go1.22.5 linux/arm64
Output of
go env
in your module/workspace:What did you do?
Reproducer:
What did you see happen?
Full stack trace: https://github.com/dunglas/frankenphp/actions/runs/9778553362/job/26995650580?pr=898
What did you expect to see?
No crash.
The test suite wasn't crashing with previous Go versions.
It also doesn't crash on Debian and macOS, only Alpine (musl?) is affected.
The project (FrankenPHP) heavily relies on cgo.
We change the default Alpine stack trace
-ldflags "-w -s -extldflags '-Wl,-z,stack-size=0x80000'
.The binary is compressed with UPX.
The text was updated successfully, but these errors were encountered: