-
Notifications
You must be signed in to change notification settings - Fork 17.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
runtime/race: report TSAN memory statistics #37233
Comments
/cc @dvyukov |
Change https://golang.org/cl/220586 mentions this issue: |
Change https://golang.org/cl/220586 mentions this issue: |
As far as I can tell, there is no public documentation on this topic, which cost me several days of debugging. I am possibly unusual in that I run binaries in production with the race detector turned on, but I think that others who do the same may want to be aware of the risk. Updates #26813. Updates #37233. Change-Id: I1f8111bd01d0000596e6057b7cb5ed017d5dc655 Reviewed-on: https://go-review.googlesource.com/c/go/+/220586 Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
Change https://golang.org/cl/221019 mentions this issue: |
…ak potential As far as I can tell, there is no public documentation on this topic, which cost me several days of debugging. I am possibly unusual in that I run binaries in production with the race detector turned on, but I think that others who do the same may want to be aware of the risk. Updates #26813. Updates #37233. Change-Id: I1f8111bd01d0000596e6057b7cb5ed017d5dc655 Reviewed-on: https://go-review.googlesource.com/c/go/+/220586 Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org> (cherry picked from commit ba093c4) Reviewed-on: https://go-review.googlesource.com/c/go/+/221019 Run-TryBot: Dmitri Shuralyov <dmitshur@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org>
Sorry for the noise, but is this still the case as of Go 1.19's update to thread sanitizer v3? |
Change https://go.dev/cl/458218 mentions this issue: |
Currently, all stack walking logic is in one venerable, large, and very, very complicated function: runtime.gentraceback. This function has three distinct operating modes: printing, populating a PC buffer, or invoking a callback. And it has three different modes of unwinding: physical Go frames, inlined Go frames, and cgo frames. It also has several flags. All of this logic is very interwoven. This CL reimplements the monolithic gentraceback function as an "unwinder" type with an iterator API. It moves all of the logic for stack walking into this new type, and gentraceback is now a much-simplified wrapper around the new unwinder type that still implements printing, populating a PC buffer, and invoking a callback. Follow-up CLs will replace uses of gentraceback with direct uses of unwinder. Exposing traceback functionality as an iterator API will enable a lot of follow-up work such as simplifying the open-coded defer implementation (which should in turn help with #26813 and #37233), printing the bottom of deep stacks (#7181), and eliminating the small limit on CPU stacks in profiles (#56029). Fixes #54466. Change-Id: I36e046dc423c9429c4f286d47162af61aff49a0d Reviewed-on: https://go-review.googlesource.com/c/go/+/458218 Reviewed-by: Michael Pratt <mpratt@google.com> TryBot-Result: Gopher Robot <gobot@golang.org> Run-TryBot: Austin Clements <austin@google.com>
Looks like I've had the same entertainment for couple of days. I vote for info about race detector allocs and frees in runtime.MemStats. |
It is still the case with GO |
As far as I understand (which might not be correct), the “bug” is an unimplemented feature on the Go side rather than on the TSAN side. See Dmitry’s comment here: #26813 (comment) If that’s correct, I would expect that upgrading TSAN would not help. Instead, the new iterator-based approach for stack unwinding would hopefully enable fixing the root cause of this memory growth. See for example the reference to #26813 by @aclements in the CL description for https://go.dev/cl/458218. In any event, #26813 is probably the bug to follow if you want to track a fix for the root cause. |
Since #26813 is about the bug that caused the memory leak, I've retitled this bug to focus on reporting memory used by the TSAN runtime. |
I don't really have specific recommendations here, but I wanted to report this troubleshooting session, because this is the worst time I've ever had trying to troubleshoot Go program performance, I was at my wit's end trying to debug it, Go has a brand of being not difficult to troubleshoot, and "I can find the problems quickly" is a big part of the reason I like the language. So some time spent thinking about and fixing these issues seems useful.
The details of the program are not that important, but I had a pretty simple program that checks for rows in a Postgres database, updates their status if it finds one, then makes an HTTP request with the contents of the database row. I tested it by writing a single database row once per second, with a single worker. Via
*DBStats
I never had more than 4 database conns open at a time.What I observed was high and growing RSS usage. In production this would result in the program being OOM killed after some length of time. Locally on my Mac, RAM use grew from 140MB to 850MB when I left it running overnight.
I tried all of the tricks I could find via blog posts:
None of these made a difference, the memory usage kept growing. Running pprof and looking at the runtime memory statistics, I was struck by a difference between the reported values and the actual observed memory usage. These numbers are after the program had been running all night and Activity Monitor told me it was using 850MB of RAM.
Specifically there's only about 200MB of "Sys" allocated and that number stayed pretty constant even though Activity Monitor reported the program was using about 850MB of RSS.
Another odd thing was the
inuse_space
heap profile only reported about 340kB of RAM usage. This sample was taken after reducingruntime.MemProfileRate
to 512 and after leaving the program running all night.I would have expected that number to be a lot higher.
Eventually I realized that I had been running this program with the race detector on - performance isn't super important but correctness is so I wanted to ensure I caught any unsafe reads or writes. Some searching around in the issue tracker revealed this issue - #26813 - which seems to indicate that the race detector might not clean up correctly after defers and recovers.
Recompiling and running the program without the race detector seems to eliminate the leak. It would be great to have confirmation that #26813 is actually the problem, though. github.com/lib/pq uses recover and panic quite heavily to do fast stack unwinds - similar to the JSON package - but as far as I can tell from adding print statements to my test program, it's not ever actually panicking, so maybe just the heavy use of
recover
and my use ofdefer context.cancel()
is enough to set it off.Look, it is fine if the race detector leaks memory, and I understand running programs in production for days on end with the race detector on is atypical. What bothered me more was how hard this was to track down, and how poorly the tooling did at identifying the actual problem. The only documentation on the runtime performance of the race detector states:
I read pretty much every blog post anyone had written on using pprof and none of them mentioned this issue, either.
It would be nice if runtime.MemStats had a section about memory allocated for the race detector, or if there was more awareness/documentation of this issue somewhere. I'm sure there are other possible solutions that I'm probably not aware of because I'm not super familiar
The text was updated successfully, but these errors were encountered: