-
Notifications
You must be signed in to change notification settings - Fork 17.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
runtime: some runtime throws print all goroutine stack dumps by default #46995
Comments
For the sake of completeness, here's an example of "only print the current goroutine by default" working as documented:
|
I think the distinction here is the difference between |
I see; I hadn't caught on to the distinction between panic and throw. Note that the docs simply talk about "an unrecovered panic or an unexpected runtime condition", which I would assume would cover both. So if we want to keep that distinction, maybe that's also something to clarify in the godoc. Regardless, I agree that concurrent map writes shouldn't dump all goroutines - they aren't a bug in the runtime package itself. I'd think that any such runtime failures due to bugs in the user's code should cause something akin to the current panic, not the current throw. |
This has come up in the past in the context of differentiating throws due to "user bugs" (unlock of unlocked mutex, concurrent map write, etc) and "runtime bugs" (most throws) when analyzing crashes. Perhaps we should have a |
I agree that we should differentiate runtime-internal fatal errors from user-caused fatal errors, but isn't "concurrent map writes" actually one of the few cases where it is useful to see the other goroutines so you have a chance of catching the other goroutine that's writing to the map? |
I think it is unlikely to actually catch the particular write itself. By the time the runtime stops that goroutine it has finished the write and moved on. (write-write conflicts might actually catch both offenders, but for read-write conflicts we only catch the read side.) |
Perhaps, but note that dumping all goroutines is an entirely different matter. See my mention above of a problem we ran into on production: all panics we had seen up to that point were under a kilobyte, as they only contained one stack trace. Then one day we hit a concurrent map write, and the tens of megabytes dumped all of a sudden caused the logging system to get stuck. You can say that's something that we should fix, but it's also extremely surprising that a tiny minority of user-caused panics can surprisingly dump all goroutines :) I wouldn't oppose if this map panic printed the stack traces of all goroutines involved in the data race, which I assume would almost never be more than a handful. If that's not possible, then I think it should just use the running goroutine. |
(user experience report) Encountered this issue at my work yesterday. We had a service crash in production due to concurrent map writes. Our debugging got slowed down a bit because our process printed stack traces for hundreds of thousands of goroutines. Our log viewer "kubectl logs " was only displaying the tail end of the stack traces, because of the length. This made it hard to understand the cause until we figured out how to get all of the logs. |
I'm going to be using
go version devel go1.17-d19a53338f Wed Jun 30 02:00:49 2021 +0000 linux/amd64
as an example here, but this also reproduces on 1.16.5.From https://golang.org/pkg/runtime/, emphasis mine:
So it seems like, by default, a panic should only print the stack trace for the goroutine that panicked, and nothing else.
However, that doesn't seem correct in some cases: https://play.golang.org/p/zWw2TUS4YzZ
As per the docs, I would expect to just see one goroutine stack dump - the "running" one that hit the panic via
runtime.mapassign_fast64
, but I get all of them.This seems to be on purpose in the code:
go/src/runtime/panic.go
Lines 1385 to 1399 in d19a533
This is problematic for two reasons:
A: It's confusing. This had me puzzled when looking at a Go process crash output along with the runtime docs.
B: It can lead to large amounts of panic output, depending on what kind of panic one runs into. At work, someone was very surprised that a panic dumped ~50MiB of goroutine stacks, while most other panics dumped just one goroutine stack. That actually caused infra issues, because it appears our log ingestion didn't like that sudden dump of output very much, and we never found out before given that most panics don't behave that way.
I see three possible resolutions to this problem:
The code is working as intended. We should update the docs to clarify the nuance that the default does print all goroutine stack traces in some cases, and what those cases are.
We should amend the implementation to match the docs. Concurrent map writes should result in a panic that only prints one stack trace, i.e. the "running" goroutine from above.
Something in between, such as "print the stack traces for all goroutines that are involved in writes to the map in question". I can imagine that the current behavior exists for debugging purposes: if you get concurrent map writes, you might want to know what those two or more goroutines were doing, not just one of them. However, you presumably don't need all goroutines for that; just the ones involved in the race.
cc @aclements @randall77 @mknyszek @prattmic as per https://dev.golang.org/owners
The text was updated successfully, but these errors were encountered: