runtime: mayMoreStackPreempt failures #55160
Comments
Found new matching flaky dashboard failures:
2022-09-02 19:08 linux-386-longtest go@dbf442b1 runtime.TestVDSO (log)
2022-09-02 19:08 linux-386-longtest go@b91e3737 runtime.TestVDSO (log)
2022-09-02 19:09 linux-386-longtest go@55ca6a20 runtime.TestVDSO (log)
2022-09-02 19:22 linux-386-longtest go@0fda8b19 runtime.TestVDSO (log)
2022-09-02 19:22 linux-386-longtest go@0fda8b19 runtime.TestCgoPprofCallback (log)
2022-09-03 15:45 linux-386-longtest go@f798dc68 runtime.TestVDSO (log)
2022-09-03 18:21 linux-386-longtest go@a0f05823 runtime.TestVDSO (log)
2022-09-03 18:21 linux-386-longtest go@a0f05823 runtime.TestCgoPprofCallback (log)
2022-09-04 04:17 linux-386-longtest go@535fe2b2 runtime.TestVDSO (log)
2022-09-05 07:14 linux-386-longtest go@3fbcf05d runtime.TestVDSO (log)
2022-09-05 07:17 linux-386-longtest go@4e7e7ae1 runtime.TestVDSO (log)
2022-09-05 08:08 linux-386-longtest go@4ad55cd9 runtime.TestVDSO (log)
2022-09-05 08:08 linux-386-longtest go@af7f4176 runtime.TestVDSO (log)
2022-09-05 08:28 linux-386-longtest go@bd5595d7 runtime.TestVDSO (log)
2022-09-05 08:28 linux-386-longtest go@bd5595d7 runtime.TestCgoPprofCallback (log)
2022-09-05 21:39 linux-386-longtest go@4c1ca42a runtime.TestVDSO (log)
2022-09-06 14:44 linux-386-longtest go@a60a3dc5 runtime.TestCgoPprofCallback (log)
2022-09-06 15:44 linux-386-longtest go@32f68b5a runtime.TestVDSO (log)
2022-09-06 15:44 linux-386-longtest go@32f68b5a runtime.TestCgoPprofCallback (log)
2022-09-06 15:48 linux-386-longtest go@07b19bf5 runtime.TestVDSO (log)
2022-09-02 19:08 linux-amd64-longtest go@dbf442b1 runtime.TestCgoPprofCallback (log)
2022-09-02 19:08 linux-amd64-longtest go@dbf442b1 runtime.TestCgoCCodeSIGPROF (log)
2022-09-02 19:08 linux-amd64-longtest go@b91e3737 runtime.TestCgoPprofCallback (log)
2022-09-03 15:45 linux-amd64-longtest go@f798dc68 runtime.TestCgoPprofCallback (log)
2022-09-03 18:21 linux-amd64-longtest go@a0f05823 runtime.TestCgoPprofCallback (log)
2022-09-04 04:17 linux-amd64-longtest go@535fe2b2 runtime.TestCgoPprofCallback (log)
2022-09-05 08:12 linux-amd64-longtest go@67e65424 runtime.TestCgoPprofCallback (log)
2022-09-05 08:12 linux-amd64-longtest go@67e65424 runtime.TestCgoCCodeSIGPROF (log)
2022-09-05 21:39 linux-amd64-longtest go@4c1ca42a runtime.TestCgoPprofCallback (log)
2022-09-06 11:14 linux-amd64-longtest go@1c504843 runtime.TestCgoPprofCallback (log)
These were coming in multiple times per day and have now stopped for almost two weeks. Closing.
Found new matching flaky dashboard failures for:
2022-08-23 03:09 windows-amd64-longtest go@0a52d806 (log)
Found new dashboard test flakes for:
2022-09-06 15:48 linux-amd64-longtest go@07b19bf5 runtime.TestCgoPprofCallback (log)
Found new dashboard test flakes for:
2022-10-31 21:00 linux-386-longtest go@ec0b5402 runtime.TestCgoTracebackContextPreemption (log)
2022-11-02 18:19 linux-386-longtest go@07a70bca runtime.TestCgoTracebackContextPreemption (log)
2022-11-03 15:17 linux-386-longtest go@1bfb51f8 runtime.TestCgoTracebackContextPreemption (log)
2022-11-03 15:30 linux-386-longtest go@e81263c7 runtime.TestCgoTracebackContextPreemption (log)
2022-11-03 17:01 linux-386-longtest go@667c53e1 runtime.TestCgoTracebackContextPreemption (log)
2022-11-03 17:40 linux-386-longtest go@7abc8a2e runtime.TestCgoTracebackContextPreemption (log)
2022-11-03 18:33 linux-386-longtest go@44cabb80 runtime.TestCgoTracebackContextPreemption (log)
2022-11-03 19:34 linux-386-longtest go@3511c822 runtime.TestCgoTracebackContextPreemption (log)
2022-11-03 19:59 linux-386-longtest go@d031e9e0 runtime.TestCgoTracebackContextPreemption (log)
No more failures for a while. Probably fixed by CL https://golang.org/cl/447495, which was submitted on Nov. 4, so the timing seems to match.
Found new dashboard test flakes for:
2023-01-25 16:38 linux-arm64-longtest go@1d3088ef runtime.TestCrashDumpsAllThreads (log)
2023-02-01 21:30 linux-amd64-longtest go@cda461bb runtime.TestCrashDumpsAllThreads (log)
cc @aclements. These timeouts blocked release branch CLs as flakes; we should get them fixed or skipped.
This reproduces pretty quickly on my linux/amd64 laptop with:

```
cd runtime
go test -c
GOFLAGS="-gcflags=runtime=-d=maymorestack=runtime.mayMoreStackPreempt -gcflags=runtime/testdata/...=" stress2 ./runtime.test -test.run=CrashDumpsAllThreads -test.timeout=30m
```

Of course, my usual first step would be to send it a SIGQUIT and see what's happening, but that's exactly what fails in this test. 😛 I haven't debugged yet. gdb may hold answers.

For reference, I initially tried running the testprog directly with the script below before I realized there are various pipes and signals involved in running this test. We might still need to reproduce those, but I'm hoping we can debug this through the test driver.

```
cd runtime/testdata/testprog
GOFLAGS="-gcflags=runtime=-d=maymorestack=runtime.mayMoreStackPreempt -gcflags=runtime/testdata/...=" go build
# Expects to write to FD 3 and then get a SIGQUIT
GOTRACEBACK=crash GOGC=off GODEBUG=asyncpreemptoff=1 ./testprog CrashDumpsAllThreads
```
All of the Ms are just parked in
Ah hah, it's getting stuck before the SIGQUIT is involved. It's actually getting stuck in this loop, before the subprocess even indicates to the parent that it's ready for the SIGQUIT.
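For context, here is a rough, self-contained sketch of the shape of that loop, reconstructed from the diff quoted later in this thread. Only the `<-c` receive loop, the "Tell our parent" comment, and the pipe write appear verbatim in that diff; the worker setup, the constant, and all names are assumptions.

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	const count = 4 // assumed number of worker goroutines
	chans := make([]chan bool, count)
	for i := range chans {
		chans[i] = make(chan bool)
		go func(c chan bool) {
			c <- true // signal that this goroutine is running
			for {     // then spin so the SIGQUIT dump has busy threads to show
			}
		}(chans[i])
	}

	// The subprocess reportedly gets stuck here: if a worker goroutine is
	// never scheduled, its send never happens, this receive blocks, and the
	// pipe write below (the "ready" signal to the parent test) never runs.
	for _, c := range chans {
		<-c
	}

	// Tell our parent that all the goroutines are executing.
	if _, err := os.NewFile(3, "pipe").WriteString("x"); err != nil {
		fmt.Fprintf(os.Stderr, "write to pipe failed: %v\n", err)
	}
}
```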
Amusingly, this means you can send the subprocess a SIGQUIT. I've pasted the interesting part of the traceback below. All other goroutines and threads are uninteresting (this is running with
Faster repro. Apply the following diff:

```diff
--- a/src/runtime/testdata/testprog/crashdump.go
+++ b/src/runtime/testdata/testprog/crashdump.go
@@ -29,6 +29,8 @@ func CrashDumpsAllThreads() {
 		<-c
 	}
 
+	return
+
 	// Tell our parent that all the goroutines are executing.
 	if _, err := os.NewFile(3, "pipe").WriteString("x"); err != nil {
 		fmt.Fprintf(os.Stderr, "write to pipe failed: %v\n", err)
```

Then:

```
cd runtime/testdata/testprog
GOFLAGS="-gcflags=runtime=-d=maymorestack=runtime.mayMoreStackPreempt -gcflags=runtime/testdata/...=" go build
GOTRACEBACK=crash GOGC=off GODEBUG=asyncpreemptoff=1 stress2 ./testprog CrashDumpsAllThreads
```
Here's a version that works without a patch:

```
cd runtime/testdata/testprog
GOFLAGS="-gcflags=runtime=-d=maymorestack=runtime.mayMoreStackPreempt -gcflags=runtime/testdata/...=" go build
GOTRACEBACK=crash GOGC=off GODEBUG=asyncpreemptoff=1 stress2 -timeouts-fail -max-fails=1 -timeout=10s -pass="write to pipe" -max-runs=50000 ./testprog CrashDumpsAllThreads
```
This isn't new, either. I reproduced it in Go 1.18, which was the first release to have mayMoreStack.
Found new dashboard test flakes for:
2023-02-23 06:07 linux-arm64-longtest go@71c02bed runtime.TestCrashDumpsAllThreads (log)
There is actually a second, similar issue:
Change https://go.dev/cl/501976 mentions this issue: |
Change https://go.dev/cl/501975 mentions this issue: |
Found new dashboard test flakes for:
2023-06-16 17:08 linux-arm64-longtest go@2b0ff4b6 runtime.TestCrashDumpsAllThreads (log)
Although this bug was causing failures, I think @prattmic decided to wait until Go 1.22 to land the fix? The changes aren't that complex, but the modified parts of the codebase are subtle. The primary failure modes are mitigated (I think? Why did we see a failure on June 16th if the test was skipped with mayMoreStack on June 6th? Oh. Is that a release branch?). We can also choose to backport it later after the fix gets a chance to soak.
Correct, the sweet results from my CLs are:
The effect here is pretty small, but has the potential to be bigger for edge-case applications. I'd be more comfortable giving this time to soak during 1.22. This can cause complete stuckness when

[1] Technically, if async preemption is disabled, then it takes GOMAXPROCS goroutines spinning in an infinite loop to cause stuckness. So GOMAXPROCS=2 could still hang if there are 2 spinning goroutines.
https://go.dev/cl/501229 should be skipping this test entirely. I agree it is odd that we got another failure on June 16 (it isn't a release branch). Maybe that GCFLAGS check isn't correct?
Change https://go.dev/cl/507359 mentions this issue: |
GCFLAGS doesn't have any defined meaning. cmd/dist enables mayMoreStackPreempt with GOFLAGS.

For #55160.

Change-Id: I7ac71e4a1a983a56bd228ab5d24294db5cc595f7
Reviewed-on: https://go-review.googlesource.com/c/go/+/507359
Run-TryBot: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Auto-Submit: Michael Pratt <mpratt@google.com>
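As a concrete illustration of what that change describes, here is a minimal sketch of a hypothetical test helper that detects the hook via GOFLAGS instead of GCFLAGS; this is not the actual test code from the CL, and the helper name is made up.

```go
package runtime_test

import (
	"os"
	"strings"
	"testing"
)

// skipIfMayMoreStackPreempt is a hypothetical helper: it skips tests known to
// hang under the mayMoreStackPreempt hook, detecting the hook via GOFLAGS
// (which is how cmd/dist enables it) rather than GCFLAGS, which has no
// defined meaning to the go command.
func skipIfMayMoreStackPreempt(t *testing.T) {
	if strings.Contains(os.Getenv("GOFLAGS"), "mayMoreStackPreempt") {
		t.Skip("known to hang with -d=maymorestack=runtime.mayMoreStackPreempt; see go.dev/issue/55160")
	}
}
```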
This issue is currently labeled as early-in-cycle for Go 1.22.
When a thread transitions from spinning to non-spinning it must recheck all sources of work because other threads may submit new work but skip wakep because they see a spinning thread. However, since the beginning of time (CL 7314062) we do not check the global run queue, only the local per-P run queues.

The global run queue is checked just above the spinning checks while dropping the P. I am unsure what the purpose of this check is. It appears to simply be opportunistic since sched.lock is already held there in order to drop the P. It is not sufficient to synchronize with threads adding work because it occurs before decrementing sched.nmspinning, which is what threads use to decide to wake a thread.

Resolve this by adding an explicit global run queue check alongside the local per-P run queue checks.

Almost nothing happens between dropping sched.lock after dropping the P and relocking sched.lock: just clearing mp.spinning and decrementing sched.nmspinning. Thus it may be better to just hold sched.lock for this entire period, but this is a larger change that I would prefer to avoid in the freeze and backports.

For #55160.

Change-Id: Ifd88b5a4c561c063cedcfcfe1dd8ae04202d9666
Reviewed-on: https://go-review.googlesource.com/c/go/+/501975
Run-TryBot: Michael Pratt <mpratt@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Auto-Submit: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
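To make the ordering described above concrete, here is a simplified, hypothetical sketch of the spinning-to-non-spinning transition; the real logic lives in runtime.findRunnable, and none of these identifiers are the actual runtime names.

```go
package sched

// g is a stand-in for a runnable goroutine.
type g struct{}

// Stubs standing in for the real scheduler operations.
func becomeNonSpinning()      {} // clears m.spinning and decrements sched.nmspinning
func checkPerPRunQueues() *g  { return nil }
func checkGlobalRunQueue() *g { return nil }

// transitionToNonSpinning sketches the ordering constraint: become
// non-spinning first, then recheck every source of work.
func transitionToNonSpinning() *g {
	// Work submitters only call wakep when they see no spinning threads, so
	// once nmspinning is decremented we can no longer count on another
	// thread being woken for work submitted concurrently.
	becomeNonSpinning()

	// Therefore recheck every source of work a submitter could have used
	// while this thread was still counted as spinning.
	if gp := checkPerPRunQueues(); gp != nil {
		return gp
	}
	// The fix: also recheck the global run queue. Before the change, only
	// the per-P run queues were rechecked here, so work left on the global
	// queue could be missed and every thread could park, hanging the program.
	if gp := checkGlobalRunQueue(); gp != nil {
		return gp
	}
	return nil // nothing found; this thread will park
}
```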
Found new dashboard test flakes for:
2023-11-19 02:15 linux-arm64-longtest go@aa9dd500 runtime.TestRuntimeLockMetricsAndProfile (log)
Found new dashboard test flakes for:
2023-11-21 22:06 linux-amd64-longtest go@5a6f1b35 runtime.TestRuntimeLockMetricsAndProfile (log)
Both failures are #64253.
Intermittent failures in the new mayMoreStackPreempt testing. This bug will replace #54778 and #55073.