runtime: crashes with "runtime: traceback stuck" #62086

Open
aciduck opened this issue Aug 17, 2023 · 4 comments
Labels: compiler/runtime (Issues related to the Go compiler and/or runtime) · NeedsInvestigation (Someone must examine and confirm this is a valid issue and not a duplicate of an existing one) · WaitingForInfo (Issue is not actionable because of missing required information, which needs to be provided)
Milestone: Backlog

Comments

aciduck commented Aug 17, 2023

What version of Go are you using (go version)?

$ go version
go version go1.20.5 linux/arm64

Does this issue reproduce with the latest release?

We can't use 1.20.6 because of #61431

What operating system and processor architecture are you using (go env)?

go env Output
$ go env

What did you do?

Our production environment runs tens of thousands of containers written in Go, most of them running for only a few minutes or hours. About once a day one of them crashes with runtime: traceback stuck. It is not always the same service, and this has been happening for months, across multiple Go versions going back to at least Go 1.18. We are not sure exactly when it started.
We did notice a common pattern: the stack trace is always of the goroutine running our internal MemoryMonitor. It is a small library that runs in all our services, samples the cgroup memory parameters from procfs every second, and logs all running operations if we hit 90% of available memory. When we turn off this functionality the problem disappears, so we know it is related. None of the containers that crashed reached this limit during their run, so only the sampling occurred.
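For illustration, a minimal sketch of what such a monitor might look like is below; the package name, the cgroup v1 file paths, and the exact 90% threshold are illustrative assumptions, since the actual internal library is not shown here.

```go
// Hypothetical sketch of a cgroup memory monitor along the lines described
// above. The cgroup v1 paths and the 90% threshold are assumptions.
package memmonitor

import (
	"bytes"
	"context"
	"log"
	"os"
	"strconv"
	"time"
)

// readUint reads a single unsigned integer from a procfs/cgroupfs file.
func readUint(path string) (uint64, error) {
	b, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return strconv.ParseUint(string(bytes.TrimSpace(b)), 10, 64)
}

// Run samples cgroup memory usage once per second and logs when usage
// crosses 90% of the limit.
func Run(ctx context.Context) {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			usage, err1 := readUint("/sys/fs/cgroup/memory/memory.usage_in_bytes")
			limit, err2 := readUint("/sys/fs/cgroup/memory/memory.limit_in_bytes")
			if err1 != nil || err2 != nil || limit == 0 {
				continue
			}
			if float64(usage) >= 0.9*float64(limit) {
				log.Printf("memory usage at %.0f%% of cgroup limit",
					100*float64(usage)/float64(limit))
			}
		}
	}
}
```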
Another thing we always see in the dump is active runtime profiling driven by the DataDog agent we integrate with. It runs every 30 seconds and takes a CPU and memory profile using the standard pprof library. We are not sure whether this is related.
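For context, the periodic profiling described above boils down to standard runtime/pprof calls such as the ones sketched below. This is not the DataDog profiler's actual code; the 30-second interval is from the description above, while the CPU-profile duration is an assumption.

```go
// Rough illustration of periodic CPU and heap profiling using only the
// standard runtime/pprof package. Not the DataDog profiler's actual code.
package main

import (
	"bytes"
	"log"
	"runtime/pprof"
	"time"
)

func main() {
	for range time.Tick(30 * time.Second) {
		var cpu, heap bytes.Buffer

		// CPU profile over a short window (duration is an assumption).
		if err := pprof.StartCPUProfile(&cpu); err != nil {
			log.Print(err)
			continue
		}
		time.Sleep(10 * time.Second)
		pprof.StopCPUProfile()

		// Heap (memory) profile.
		if err := pprof.Lookup("heap").WriteTo(&heap, 0); err != nil {
			log.Print(err)
		}

		// A real agent would upload the profiles; here we just log their sizes.
		log.Printf("cpu profile: %d bytes, heap profile: %d bytes", cpu.Len(), heap.Len())
	}
}
```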

What did you expect to see?

No crashes.

What did you see instead?

runtime: traceback stuck. pc=0xfdfcf0 sp=0x40097fbf90
stack: frame={sp:0x40097fbf90, fp:0x40097fbf90} stack=[0x40097fa000,0x40097fc000)
0x00000040097fbe90: 0x00000040075ae200 0x0000004003a53d00
0x00000040097fbea0: 0x0000004007587d40 0x0000000000000001
0x00000040097fbeb0: 0x0000000001b38580 <github.com/xxx/xxx/commonlib/memmonitor.(*memoryMonitor).run.func1.1+0x0000000000000000> 0x000000400771e500
0x00000040097fbec0: 0x0000004003a13e60 0x0000000000000000
0x00000040097fbed0: 0x0000000000000000 0x0000000000000000
0x00000040097fbee0: 0x00000040097fbeb0 0x0000004000507ef8
0x00000040097fbef0: 0x0000000000fe02c8 <github.com/xxx/xxx/commonlib/waitgroup.(*SafeErrorGroup).Go.func1+0x0000000000000088> 0x0000004000507f58
0x00000040097fbf00: 0x0000000000fdfbfc <golang.org/x/sync/errgroup.(*Group).Go.func1+0x000000000000005c> 0x0000004004eed080
0x00000040097fbf10: 0x0300000000000000 0x0000000000000000
0x00000040097fbf20: 0x0000000000000000 0x0000000000fe0330 <github.com/xxx/xxx/commonlib/waitgroup.(*SafeErrorGroup).Go.func1.1+0x0000000000000000>
0x00000040097fbf30: 0x0000000004324400 0x0000004007587d40
0x00000040097fbf40: 0x00000040097fbf18 0x00000000036f2020
0x00000040097fbf50: 0x00000040097fbf28 0x0000000000000000
0x00000040097fbf60: 0x000000000007fd44 <runtime.goexit+0x0000000000000004> 0x00000040076be780
0x00000040097fbf70: 0x0000004002347b60 0x0000000000000000
0x00000040097fbf80: 0x0100004004d079e0 0x0000004007585f00
0x00000040097fbf90: >0x0000000000fdfcf0 <golang.org/x/sync/errgroup.(*Group).Go.func1.2+0x0000000000000000> 0x0000004007585f00
0x00000040097fbfa0: 0x0000004000507f48 0x0000000000000000
0x00000040097fbfb0: 0x000000000007fd44 <runtime.goexit+0x0000000000000004> 0x000000400074f7a0
0x00000040097fbfc0: 0x00000040097fbf90 0x0000000000000000
0x00000040097fbfd0: 0x0000000000000000 0x0000000000000000
0x00000040097fbfe0: 0x0000000000000000 0x0000000000000000
0x00000040097fbff0: 0x0000000000000000 0x0000000000000000
fatal error: traceback stuck
runtime stack:
runtime.throw({0x30b0158?, 0x63afaa0?})
/usr/local/go/src/runtime/panic.go:1047 +0x40 fp=0x4000111bc0 sp=0x4000111b90 pc=0x49cd0
runtime.gentraceback(0x4000111f98?, 0x76480?, 0x50e55d58a9b8?, 0x4000102b60, 0x0, 0x4009100c00, 0x20, 0x0, 0x0?, 0x0)
/usr/local/go/src/runtime/traceback.go:487 +0xdd8 fp=0x4000111f30 sp=0x4000111bc0 pc=0x70658
runtime.saveg(0x0?, 0x7d92c?, 0x4005e36600?, 0x4009100c00)
/usr/local/go/src/runtime/mprof.go:1181 +0x44 fp=0x4000111f90 sp=0x4000111f30 pc=0x42154
runtime.doRecordGoroutineProfile.func1()
/usr/local/go/src/runtime/mprof.go:1093 +0x48 fp=0x4000111fc0 sp=0x4000111f90 pc=0x420e8
runtime.systemstack()
/usr/local/go/src/runtime/asm_arm64.s:243 +0x6c fp=0x4000111fd0 sp=0x4000111fc0 pc=0x7d92c
@bcmills bcmills added compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Aug 17, 2023
@bcmills bcmills changed the title Crashes with "runtime: traceback stuck" runtime: crashes with "runtime: traceback stuck" Aug 17, 2023
@mknyszek mknyszek added this to the Backlog milestone Aug 23, 2023
cherrymui (Member) commented

It looks like it is stuck while collecting a goroutine profile, which takes stack traces of all goroutines. The place where it is stuck is the entry of a deferred function; the unwinder should never unwind to there. The value looks like a stack slot used to save the defer record rather than a return PC, so the unwinding had already gone off track at this point. It could be due to a failure to unwind a frame lower down the stack, or to the stack being corrupted.
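(For reference only: the runtime stack in the report shows the crash inside the goroutine-profile path, runtime.doRecordGoroutineProfile -> saveg -> gentraceback. A goroutine profile is what a call like the following requests; this is just an illustration of what drives the unwinder over every goroutine's stack, not code from the reporter's program.)

```go
// Minimal illustration of requesting a goroutine profile, which makes the
// runtime take a stack trace of every goroutine; that is the operation that
// was in progress when the "traceback stuck" throw fired.
package main

import (
	"os"
	"runtime/pprof"
)

func main() {
	if err := pprof.Lookup("goroutine").WriteTo(os.Stdout, 0); err != nil {
		panic(err)
	}
}
```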

@aciduck is your program a pure-Go program, or does it use cgo? Does it use any unsafe code? Is there a way we can reproduce the failure? Thanks.

@cherrymui cherrymui added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Sep 13, 2023

aciduck commented Sep 14, 2023

This happens in multiple services, some of which are compiled with cgo and some are not.
We don't use unsafe code directly, but it might be called from some dependency of ours.
Unfortunately we can't reproduce it consistently or in a simple program. The only pattern we found is that it happens only when our MemoryMonitor code is running, and we suspect it is related to the fact that it makes frequent syscalls.

cherrymui (Member) commented

Could you show a more complete error message? There should be other goroutine stack dumps below, which may have information about how the unwinder got to that point (it is okay to have symbol names redacted). Thanks.


aciduck commented Oct 9, 2023

The full dump is in the attached file:

extract-2023-08-16T12_49_41.656Z - Copy.txt
