
runtime/race: leak in long-running programs, more transparent memory statistics #37233

kevinburkemeter opened this issue Feb 14, 2020 · 4 comments


kevinburkemeter commented Feb 14, 2020

I don't really have specific recommendations here, but I wanted to report this troubleshooting session because it was the worst time I've ever had troubleshooting Go program performance; I was at my wit's end trying to debug it. Go has a reputation for being easy to troubleshoot, and "I can find the problems quickly" is a big part of the reason I like the language, so some time spent thinking about and fixing these issues seems useful.

The details of the program are not that important, but I had a pretty simple program that checks for rows in a Postgres database, updates a row's status if it finds one, then makes an HTTP request with the contents of the row. I tested it by writing a single database row once per second, with a single worker. According to *sql.DBStats I never had more than 4 database connections open at a time.
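Roughly, the connection check looked like this (a simplified sketch, not the actual program; the DSN and logging here are illustrative):

package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq" // Postgres driver, as used by the program
)

func main() {
	// Illustrative DSN, not the real one.
	db, err := sql.Open("postgres", "postgres://localhost/example?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	// Log connection-pool statistics (database/sql.DBStats) once per second.
	for range time.Tick(time.Second) {
		s := db.Stats()
		log.Printf("open=%d inUse=%d idle=%d", s.OpenConnections, s.InUse, s.Idle)
	}
}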

What I observed was high and growing RSS usage. In production this would result in the program being OOM killed after some length of time. Locally on my Mac, RAM use grew from 140MB to 850MB when I left it running overnight.

I tried all of the tricks I could find in blog posts (a sketch of the first few follows the list):

  • disabling HTTP2 support
  • disabling MADV_DONTNEED
  • setting Connection: close on the HTTP Client
  • manually inspecting the code for goroutine leaks (constant use of around 20 goroutines)
  • manually inspecting the code for defers in a for loop in a long-running function, unbounded array appends, etc. (none)
  • ensuring HTTP bodies were fully consumed, DB queries closed after use, etc.
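For reference, the first few items amount to something like the following (simplified, not the actual program's code), plus toggling the GODEBUG=madvdontneed setting when starting the binary:

package main

import (
	"crypto/tls"
	"net/http"
)

// newClient applies two of the mitigations above: a non-nil, empty
// TLSNextProto map disables HTTP/2 on the client, and DisableKeepAlives
// makes every request go out with "Connection: close".
func newClient() *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			TLSNextProto:      map[string]func(string, *tls.Conn) http.RoundTripper{},
			DisableKeepAlives: true,
		},
	}
}

func main() {
	resp, err := newClient().Get("https://example.com/")
	if err == nil {
		resp.Body.Close()
	}
}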

None of these made a difference; the memory usage kept growing. Running pprof and looking at the runtime memory statistics, I was struck by the difference between the reported values and the actual observed memory usage. These numbers are from after the program had been running all night, when Activity Monitor told me it was using 850MB of RAM.

# runtime.MemStats
# Alloc = 3410240
# TotalAlloc = 2834416608
# Sys = 213078288
# Lookups = 0
# Mallocs = 38955235
# Frees = 38943138
# HeapAlloc = 3410240
# HeapSys = 199819264
# HeapIdle = 194469888
# HeapInuse = 5349376
# HeapReleased = 192643072
# HeapObjects = 12097
# Stack = 1376256 / 1376256
# MSpan = 132600 / 196608
# MCache = 13888 / 16384
# BuckHashSys = 2045579
# GCSys = 8042776
# OtherSys = 1581421
# NextGC = 4194304
# LastGC = 1581710840991223000
# PauseNs = [...]
# PauseEnd = [...]
# NumGC = 1187
# NumForcedGC = 0
# GCCPUFraction = 4.524177278457185e-06
# DebugGC = false

Specifically, only about 200MB of "Sys" is allocated, and that number stayed pretty constant even though Activity Monitor reported the program was using about 850MB of RSS.
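For reference, these numbers can also be read in-process rather than through pprof; a minimal sketch:

package main

import (
	"fmt"
	"runtime"
)

func main() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	// Sys counts memory obtained from the OS by the Go runtime itself.
	// Mappings made outside the runtime's accounting (for example the
	// race detector's shadow memory) do not show up here, which would
	// explain Sys sitting far below the RSS the OS reports.
	fmt.Printf("Sys = %d\nHeapSys = %d\nHeapReleased = %d\n", m.Sys, m.HeapSys, m.HeapReleased)
}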

Another odd thing was that the inuse_space heap profile reported only about 340kB of memory in use. This sample was taken after reducing runtime.MemProfileRate to 512 and after leaving the program running all night.

(pprof) top20
Showing nodes accounting for 313.47kB, 91.83% of 341.36kB total
Dropped 100 nodes (cum <= 1.71kB)
Showing top 20 nodes out of 130
      flat  flat%   sum%        cum   cum%
   76.76kB 22.49% 22.49%    76.76kB 22.49%  runtime.malg
   68.02kB 19.93% 42.41%    68.02kB 19.93%  bufio.NewReaderSize (inline)
      54kB 15.82% 58.23%       54kB 15.82%  github.com/rcrowley/go-metrics.newExpDecaySampleHeap (inline)
   15.35kB  4.50% 62.73%    57.89kB 16.96%  github.com/kevinburke/rickover/dequeuer.(*Dequeuer).Work
   15.35kB  4.50% 67.23%    15.35kB  4.50%  net/http.(*persistConn).roundTrip
   14.80kB  4.34% 71.56%    15.90kB  4.66%  database/sql.(*Rows).awaitDone
    9.99kB  2.93% 74.49%    13.55kB  3.97%  github.com/lib/pq.NewConnector
    8.77kB  2.57% 77.06%     8.77kB  2.57%  database/sql.(*Tx).awaitDone
    8.28kB  2.43% 79.48%    69.90kB 20.48%  github.com/lib/pq.(*Connector).open
       8kB  2.34% 81.83%        8kB  2.34%  bufio.NewWriterSize
       8kB  2.34% 84.17%        8kB  2.34%  hash/crc32.slicingMakeTable
    5.24kB  1.53% 85.71%    10.23kB  3.00%  github.com/lib/pq.(*conn).prepareTo
       4kB  1.17% 86.88%        4kB  1.17%  runtime.allgadd

I would have expected that number to be a lot higher.
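For completeness, the profile above was collected roughly like this (simplified; the port and server setup are illustrative):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/ handlers
	"runtime"
)

func main() {
	// Sample every 512 bytes of allocation instead of the default 512 KiB
	// so that smaller allocations show up in the heap profile.
	runtime.MemProfileRate = 512

	// The heap profile is then available via
	//   go tool pprof http://localhost:6060/debug/pprof/heap
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}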

Eventually I realized that I had been running this program with the race detector on. Performance isn't super important for this program, but correctness is, so I wanted to ensure I caught any unsafe reads or writes. Some searching around in the issue tracker revealed #26813, which seems to indicate that the race detector might not clean up correctly after defers and recovers.

Recompiling and running the program without the race detector seems to eliminate the leak. It would be great to have confirmation that #26813 is actually the problem, though. github.com/lib/pq uses panic and recover quite heavily for fast stack unwinding (similar to the encoding/json package), but as far as I can tell from adding print statements to my test program, it never actually panics, so maybe the heavy use of recover plus my use of deferred context cancel functions is enough to trigger it.
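To illustrate the pattern I mean (this is only a sketch of the suspected trigger, not a confirmed reproducer for #26813):

package main

import (
	"context"
	"time"
)

// query combines a deferred context cancel with a deferred recover, the
// style lib/pq uses for error handling. Nothing here ever panics.
func query(ctx context.Context) {
	ctx, cancel := context.WithTimeout(ctx, time.Second)
	defer cancel()

	defer func() {
		if err := recover(); err != nil {
			// lib/pq converts the recovered value to an error here;
			// in this sketch the branch is never taken.
			_ = err
		}
	}()

	_ = ctx // the real code runs a database query with ctx here
}

func main() {
	for {
		query(context.Background())
		time.Sleep(time.Second)
	}
}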

Look, it is fine if the race detector leaks memory, and I understand running programs in production for days on end with the race detector on is atypical. What bothered me more was how hard this was to track down, and how poorly the tooling did at identifying the actual problem. The only documentation on the runtime performance of the race detector states:

The cost of race detection varies by program, but for a typical program, memory usage may increase by 5-10x and execution time by 2-20x.

I read pretty much every blog post anyone had written on using pprof and none of them mentioned this issue, either.

It would be nice if runtime.MemStats had a section about memory allocated for the race detector, or if there were more awareness/documentation of this issue somewhere. I'm sure there are other possible solutions that I'm not aware of because I'm not super familiar with the runtime internals.

dgryski (Contributor) commented Feb 14, 2020

/cc @dvyukov

gopherbot commented Feb 24, 2020

Change https://golang.org/cl/220586 mentions this issue: doc/articles: mention memory leak in race detector docs

gopherbot commented Feb 24, 2020

Change https://golang.org/cl/220586 mentions this issue: doc/articles/race_detector: mention memory leak potential

gopherbot pushed a commit that referenced this issue Feb 25, 2020

doc/articles/race_detector: mention memory leak potential

As far as I can tell, there is no public documentation on this topic,
which cost me several days of debugging.

I am possibly unusual in that I run binaries in production with the
race detector turned on, but I think that others who do the same may
want to be aware of the risk.

Updates #26813.
Updates #37233.

Change-Id: I1f8111bd01d0000596e6057b7cb5ed017d5dc655
Reviewed-on: https://go-review.googlesource.com/c/go/+/220586
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>

gopherbot commented Feb 25, 2020

Change https://golang.org/cl/221019 mentions this issue: [release-branch.go1.14] doc/articles/race_detector: mention memory leak potential

gopherbot pushed a commit that referenced this issue Feb 25, 2020

[release-branch.go1.14] doc/articles/race_detector: mention memory leak potential

As far as I can tell, there is no public documentation on this topic,
which cost me several days of debugging.

I am possibly unusual in that I run binaries in production with the
race detector turned on, but I think that others who do the same may
want to be aware of the risk.

Updates #26813.
Updates #37233.

Change-Id: I1f8111bd01d0000596e6057b7cb5ed017d5dc655
Reviewed-on: https://go-review.googlesource.com/c/go/+/220586
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
(cherry picked from commit ba093c4)
Reviewed-on: https://go-review.googlesource.com/c/go/+/221019
Run-TryBot: Dmitri Shuralyov <dmitshur@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>