runtime: -race leaks memory on Go >= 1.19 #63276
Comments
@golang/compiler @golang/runtime
Thanks for the report. @walles you write that this happens "On Linux (not on macOS)..." -- does that mean you've tried it on a mac and the leak doesn't happen, or just that the program is Linux-specific?
It leaks on macOS as well, sorry about that. I updated the repro case at the start of this ticket to use
In triage, we're a bit unsure as to what the cause could be. It seems like bisecting the toolchain might be the best path forward to try and identify the root cause.
@walles Would you be willing to do the bisection, since you have a reproducer readily available in your environment?
@mknyszek What happened when you tried the repro? Did
@walles, we might not get a chance to bisect this for a while. If you have a chance to bisect it, that might help us a lot with moving this issue forward more quickly.
Do you think you could at least validate the repro case now? I have only run it myself, so if at least one other person could verify that it shows a leak, that would be super!
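One low-friction way to sanity-check the repro, assuming a Linux environment: run it under -race and periodically compare what the Go runtime reports for its own heap against the OS-reported RSS. The helper below is only a sketch (not code from this issue); the 10-second interval and output format are arbitrary.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"strconv"
	"strings"
	"time"
)

// vmRSSKiB returns the process resident set size in KiB by parsing
// /proc/self/status (Linux only; returns 0 elsewhere or on error).
func vmRSSKiB() int64 {
	data, err := os.ReadFile("/proc/self/status")
	if err != nil {
		return 0
	}
	for _, line := range strings.Split(string(data), "\n") {
		if strings.HasPrefix(line, "VmRSS:") {
			fields := strings.Fields(line) // "VmRSS:", "<value>", "kB"
			if len(fields) >= 2 {
				kib, _ := strconv.ParseInt(fields[1], 10, 64)
				return kib
			}
		}
	}
	return 0
}

func main() {
	for {
		var ms runtime.MemStats
		runtime.ReadMemStats(&ms)
		// If HeapSys stays roughly flat while RSS keeps climbing, the
		// growth is happening outside the Go heap (e.g. in the race
		// runtime's own mappings), which is why heap-based profilers
		// don't see it.
		fmt.Printf("Go HeapSys=%d MiB  OS RSS=%d MiB\n",
			ms.HeapSys>>20, vmRSSKiB()>>10)
		time.Sleep(10 * time.Second)
	}
}
```

Dropping a loop like this into the leaking program (or pointing it at /proc/&lt;pid&gt;/status from a separate process) should make the -race vs. non-race difference visible without any extra tooling.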
Can confirm I am seeing a leak with the same script on macOS.
Output:
git bisect points to https://go-review.googlesource.com/c/go/+/333529
Thank you @cuonglm for the bisect. The suspect CL seems plausible to me. I will dig into it a bit more.
I didn't add
go version go1.21.0 linux/amd64
cat /etc/*release:
uname -a:
What did you do? go build main.go
What did you see instead? After 2 hours:
@thanm Anything we can do to assist you? Need any more info to track this down and get it resolved? Happy to be a test subject 😀
I spent a little while looking at this. I built a fresh copy of runtime/race/internal/amd64v1/race_linux.syso from LLVM source at dffa28fa3a907a3e0c64c9b06a46b55fc5cea40a, e.g.
I then hacked the TSAN runtime to expose the memory profiling hooks (which are normally stubbed out for Go). Here are the changes I made:
Then finally I modified the original program to call into TSAN's memory profiler as well:
and finally did the run using this:
Here's an excerpt of the generated profile:
So in essence it seems that we have a lot of dead threads on our hands. Given the way TSAN works, it is not clear to me whether these dead threads are actually created by TSAN or just created somewhere else and registered with TSAN via an interceptor. I will spend a little time in the debugger to see if I can learn more.
From what I can see in the debugger, the threads in question are being created in Go. For example, here's one call stack:
At least from a sample that I took, all the CreateThread calls are resulting from the Go side (the one above looks like it is from this line: https://go.googlesource.com/go/+/b788e91badd523e5bb0fc8d50cd76b8ae04ffb20/src/net/http/transport.go#1544).

Initially I thought that perhaps we were somehow missing a callback from Go to TSAN when a goroutine exits, but in fact that seems to be working just fine based on what I see in the debugger: I set breakpoints in both

I also took a closer look at the TSAN thread registry and it looks as though the large number of "dead" threads is actually expected given the way this program works (e.g. it creates lots of goroutines); the number being reported is total threads created, including those that are reaped properly with

I think the next thing to do is figure out some way to collect a heap profile from the program to see where the memory is going. It seems likely at this point that the TSAN runtime is what's holding the memory, just not clear which part.
Well, profiling with https://gperftools.github.io/gperftools/heapprofile.html was largely a waste of time -- it is clear that whatever memory growth is happening, it is not via the regular heap but rather via anonymous mmaps.
I'm not sure if this is useful info or not (probably obvious to you). We are using DD's continuous profiler, which I believe hooks into runtime/trace, and it is not reporting any of this increased heap usage. The graphs for all of our staging environments are flat according to that. So at least according to runtime/trace it's invisible. Not sure if that helps narrow it down for you at all. 😄
It's definitely not Go heap or C heap usage; it looks like the space is coming from the anonymous mmaps that the TSAN runtime does. I will spend some more time with that code to see if I can figure out what the issue is. The main job is going to be determining which TSAN data structures are the major culprits.
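As a cross-check on the "anonymous mmaps" theory that doesn't require touching the TSAN sources: on Linux one can sum the resident size of pathless mappings in /proc/self/smaps from inside the leaking process. The snippet below is a rough sketch with heuristic parsing, not code used in this investigation.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
	"time"
)

// anonRSSKiB sums the Rss of anonymous (pathless) mappings listed in
// /proc/self/smaps. Mapping headers look like
// "7f12e4000000-7f12e8000000 rw-p 00000000 00:00 0 [pathname]";
// a header with no pathname field marks an anonymous mapping.
func anonRSSKiB() int64 {
	f, err := os.Open("/proc/self/smaps")
	if err != nil {
		return 0
	}
	defer f.Close()

	var total int64
	inAnon := false
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) == 0 {
			continue
		}
		if strings.Contains(fields[0], "-") && !strings.HasSuffix(fields[0], ":") {
			// New mapping header; fewer than 6 fields means no pathname.
			inAnon = len(fields) < 6
			continue
		}
		if inAnon && fields[0] == "Rss:" && len(fields) >= 2 {
			kib, _ := strconv.ParseInt(fields[1], 10, 64)
			total += kib
		}
	}
	return total
}

func main() {
	// A loop like this, dropped into the leaking program, shows whether
	// the anonymous mappings are what keeps growing over time.
	for {
		fmt.Printf("anonymous mappings RSS: %d MiB\n", anonRSSKiB()>>10)
		time.Sleep(30 * time.Second)
	}
}
```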
I wrote some throw-away code to profile internal mmaps (e.g. instrumented this function: https://github.com/llvm/llvm-project/blob/dffa28fa3a907a3e0c64c9b06a46b55fc5cea40a/compiler-rt/lib/sanitizer_common/sanitizer_posix.cpp#L401) and did a lengthy run to see how mmap'd data grows for the example program. It turns out to be a bit tricky, since the memory profiler itself actually allocates a fair amount of memory, but eventually I was able to gather a little data. At the start of a run, here's what mmap usage looks like (for each line, the number on the left is in bytes; the tag on the right is the name passed to
and at the end of the run:
Not sure if this really tells us anything, however. At this point, given what I've seen from the numbers (Go heap / C++ heap numbers, mmap numbers from
@dvyukov do you think you might be able to offer some advice here? Thanks
I see you already did a great deal of debugging. Re the profiles in the previous comment, it would be useful to capture several profiles during the run, rather than just start/end. They should show which components are constantly growing w/o bound, and which were just subject to some initial growth.
This is suspicious and what I would check first. I see that one part that disappeared in v3 is this kThreadQuarantineSize/kMaxTidReuse: kThreadQuarantineSize was supposed to bound the number of memorized dead threads. Probably my idea was to remove all dead threads in DoResetImpl, but maybe I never did that. I am not sure how easy/safe it is to reset all dead threads in DoResetImpl now, so probably the easiest solution is to restore the quarantine (at least for Go).
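For readers unfamiliar with the quarantine idea being referred to, the sketch below is a toy Go illustration of the concept only (remember a bounded number of finished threads, then recycle the oldest instead of growing forever); it is not TSAN's actual registry code, and the constant is just a stand-in for kThreadQuarantineSize.

```go
package main

import "fmt"

const quarantineSize = 4 // stand-in for kThreadQuarantineSize

type registry struct {
	nextID     int
	dead       []int // FIFO of quarantined (finished) thread IDs
	recyclable []int // IDs whose state was reset and can be reused
}

// create hands out a recycled ID if one is available, otherwise a new one.
func (r *registry) create() int {
	if len(r.recyclable) > 0 {
		id := r.recyclable[0]
		r.recyclable = r.recyclable[1:]
		return id
	}
	id := r.nextID
	r.nextID++
	return id
}

// finish quarantines a dead thread; once the quarantine is full, the
// oldest dead entry is evicted so the registry stays bounded.
func (r *registry) finish(id int) {
	r.dead = append(r.dead, id)
	if len(r.dead) > quarantineSize {
		evicted := r.dead[0]
		r.dead = r.dead[1:]
		r.recyclable = append(r.recyclable, evicted)
	}
}

func main() {
	var r registry
	for i := 0; i < 10; i++ {
		id := r.create()
		r.finish(id)
	}
	// Ten "threads" were created, but the registry holds only a bounded
	// number of entries because IDs get recycled out of the quarantine.
	fmt.Printf("dead=%d recyclable=%d total IDs ever issued=%d\n",
		len(r.dead), len(r.recyclable), r.nextID)
}
```

Without the eviction step in finish, the dead list (like TSAN v3's registry of finished goroutines, as described above) grows with every goroutine ever created, which matches the unbounded growth seen in this issue.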
Thanks, that seems like a promising direction to follow. I'll read up on the old quarantine code and see what I can do there.
Following up on the suggestion in this comment, I tried this patch to restore the thread quarantine limit:
This did not seem to go well. The Go testcase fails fairly quickly with a failed assertion:
A bit more from the debugger:
Code:
I unfortunately don't have any suggestions off the top of my head. Things are a bit elaborate there. Probably there is some latent bug or unimplemented part somewhere, but things worked fine because thread reuse was never enabled in the new version.
What version of Go are you using (go version)?
Inside of docker run -it --rm golang:1.21.

Does this issue reproduce with the latest release?
Yes.

What operating system and processor architecture are you using (go env)?
go env Output:

What did you do?
main.go:
go run -race main.go (without -race there is no leak).
Also, just having the HTTP server part is enough to leak, but I wanted a repro that is standalone, so the client code and the memory reporting code are there for that.
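The main.go attachment itself is collapsed in this extract. Purely as an illustration of the shape described above (HTTP server, client loop, periodic memory report), a standalone sketch might look like the following; the port, intervals, and output format are guesses, not the original code.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"runtime"
	"time"
)

func main() {
	// HTTP server part (this alone is reportedly enough to leak under -race).
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "hello")
	})
	go http.ListenAndServe("localhost:8080", nil)

	// Client part: keep hitting the server so connections and goroutines churn.
	go func() {
		for {
			resp, err := http.Get("http://localhost:8080/")
			if err == nil {
				io.Copy(io.Discard, resp.Body)
				resp.Body.Close()
			}
			time.Sleep(time.Millisecond)
		}
	}()

	// Memory reporting part: print runtime stats periodically. Note that
	// the leak itself shows up in the process RSS (ps/top or /proc),
	// since the growth is in the race runtime rather than the Go heap.
	for {
		var ms runtime.MemStats
		runtime.ReadMemStats(&ms)
		fmt.Printf("Sys=%d MiB HeapSys=%d MiB NumGoroutine=%d\n",
			ms.Sys>>20, ms.HeapSys>>20, runtime.NumGoroutine())
		time.Sleep(30 * time.Second)
	}
}
```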
What did you expect to see?
Stable memory usage

What did you see instead?

Notes