runtime/race: leak in long-running programs, more transparent memory statistics #37233
I don't really have specific recommendations here, but I wanted to report this troubleshooting session, because this is the worst time I've ever had trying to troubleshoot Go program performance. I was at my wit's end trying to debug it. Go has a brand of not being difficult to troubleshoot, and "I can find the problems quickly" is a big part of the reason I like the language. So some time spent thinking about and fixing these issues seems useful.
The details of the program are not that important, but I had a pretty simple program that checks for rows in a Postgres database, updates their status if it finds one, then makes an HTTP request with the contents of the database row. I tested it by writing a single database row once per second, with a single worker.
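For concreteness, the shape of the worker is roughly this. Everything here (the type names, the claim step, the test endpoint) is a simplified stand-in, since the real program talks to Postgres through github.com/lib/pq:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

// job mirrors a row pulled from the queue table.
type job struct {
	ID      int
	Payload string
}

// claimJob stands in for the "find a row and update its status" step,
// which the real program does against Postgres.
func claimJob(n int) *job {
	return &job{ID: n, Payload: fmt.Sprintf("row-%d", n)}
}

// deliver makes the HTTP request with the row's contents.
func deliver(client *http.Client, url string, j *job) error {
	resp, err := client.Post(url, "text/plain", strings.NewReader(j.Payload))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	_, err = io.Copy(io.Discard, resp.Body)
	return err
}

func main() {
	// Stand-in for the real downstream service.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()

	// Single worker; the real test wrote one row per second
	// (interval shortened here so the sketch finishes quickly).
	for i := 0; i < 3; i++ {
		j := claimJob(i)
		if err := deliver(srv.Client(), srv.URL, j); err != nil {
			fmt.Println("deliver failed:", err)
		}
		time.Sleep(10 * time.Millisecond)
	}
}
```

The point is just that the loop is small and boring; nothing in it should accumulate memory on its own.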
What I observed was high and growing RSS usage. In production this would result in the program being OOM killed after some length of time. Locally on my Mac, RAM use grew from 140MB to 850MB when I left it running overnight.
I tried all of the tricks I could find via blog posts.
None of these made a difference; the memory usage kept growing. Running pprof and looking at the runtime memory statistics, I was struck by the difference between the reported values and the actual observed memory usage. These numbers were taken after the program had been running all night, when Activity Monitor told me it was using 850MB of RAM.
Specifically, there was only about 200MB of "Sys" allocated, and that number stayed pretty constant even though Activity Monitor reported the program was using about 850MB of RSS.
Another odd thing was the
I would have expected that number to be a lot higher.
Eventually I realized that I had been running this program with the race detector on: performance isn't super important, but correctness is, so I wanted to ensure I caught any unsafe reads or writes. Some searching around in the issue tracker turned up #26813, which seems to indicate that the race detector might not clean up correctly after defers and recovers.
Recompiling and running the program without the race detector seems to eliminate the leak. It would be great to have confirmation that #26813 is actually the problem, though. github.com/lib/pq uses recover and panic quite heavily to do fast stack unwinds - similar to the JSON package - but as far as I can tell from adding print statements to my test program, it's not ever actually panicking, so maybe just the heavy use of
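For reference, the panic/recover unwinding pattern in question looks roughly like this. The types and names here are hypothetical stand-ins, not lib/pq's actual internals; the technique is panicking with a private sentinel type deep inside the parser and converting it back to an ordinary error at the package boundary:

```go
package main

import "fmt"

// pqError is a stand-in for the internal type that packages like
// lib/pq and encoding/json panic with to unwind deeply nested
// parsing code in one jump.
type pqError struct{ msg string }

// errRecover converts that internal panic back into an ordinary
// error at the package boundary; any other panic is re-raised.
func errRecover(err *error) {
	if r := recover(); r != nil {
		pe, ok := r.(pqError)
		if !ok {
			panic(r) // not ours: propagate
		}
		*err = fmt.Errorf("pq: %s", pe.msg)
	}
}

func parseDeeplyNested(input string) (err error) {
	defer errRecover(&err)
	step1(input)
	return nil
}

func step1(input string) { step2(input) }

func step2(input string) {
	if input == "bad" {
		// Unwinds straight back to errRecover, skipping every
		// intermediate return.
		panic(pqError{msg: "malformed input"})
	}
}

func main() {
	fmt.Println(parseDeeplyNested("ok"))  // <nil>
	fmt.Println(parseDeeplyNested("bad")) // pq: malformed input
}
```

Note the defer fires on every call, even on the happy path where nothing panics, which is why heavy use of this pattern exercises the defer machinery constantly.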
Look, it is fine if the race detector leaks memory, and I understand that running programs in production for days on end with the race detector on is atypical. What bothered me more was how hard this was to track down, and how poorly the tooling did at identifying the actual problem. The only documentation on the runtime performance of the race detector states:
I read pretty much every blog post anyone had written on using pprof and none of them mentioned this issue, either.
It would be nice if runtime.MemStats had a section about memory allocated for the race detector, or if there were more awareness/documentation of this issue somewhere. I'm sure there are other possible solutions that I'm not aware of because I'm not super familiar
As far as I can tell, there is no public documentation on this topic, which cost me several days of debugging. I am possibly unusual in that I run binaries in production with the race detector turned on, but I think that others who do the same may want to be aware of the risk.

Updates #26813.
Updates #37233.

Change-Id: I1f8111bd01d0000596e6057b7cb5ed017d5dc655
Reviewed-on: https://go-review.googlesource.com/c/go/+/220586
Reviewed-by: Dmitri Shuralyov <firstname.lastname@example.org>
…ak potential

(cherry picked from commit ba093c4)
Reviewed-on: https://go-review.googlesource.com/c/go/+/221019
Run-TryBot: Dmitri Shuralyov <firstname.lastname@example.org>
TryBot-Result: Gobot Gobot <email@example.com>