Decay-based purging results in ~50% dirty pages. #956

Open

mberhault opened this issue Jul 18, 2017 · 16 comments

@mberhault

mberhault commented Jul 18, 2017

We (cockroachdb) recently updated jemalloc to 5.0.1 (see here for our fork of 5.0.1 plus a few of our own commits).

The change from ratio to decay-based purging results in approximately 50% dirty pages.
Here is a graph showing the go/cgo allocated/total memory for a single one of our nodes (in a cluster of 6). The cgo part of cockroach consists mostly of rocksdb, though it's a bit more complicated than that.
Three distinct sections can be seen in the graph:

  • 16:30-17:10 (and again from 17:15-17:25): 5.0.1 with MALLOC_CONF=prof:true
  • 17:25-21:00: 5.0.1 with MALLOC_CONF=prof:true,dirty_decay_ms:0
  • 21:00-now: 4.5.0 with MALLOC_CONF=prof:true
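
For reference, the same option strings can also be baked into the binary via jemalloc's malloc_conf symbol instead of the MALLOC_CONF environment variable. A minimal sketch of the second run's settings, assuming an unprefixed jemalloc build (with --with-jemalloc-prefix=je_ the symbol is je_malloc_conf):

#include <jemalloc/jemalloc.h>

/* Parsed by jemalloc during initialization, before anything is allocated;
 * equivalent to MALLOC_CONF=prof:true,dirty_decay_ms:0. */
const char *malloc_conf = "prof:true,dirty_decay_ms:0";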

[graph: omega_mem_threeway_comparison]

The relevant lines are the orange (CGO allocated, from jemalloc stats.allocated) and red (CGO total, from jemalloc stats.resident). Here's the graph again with only those stats shown:
[graph: omega_mem_threeway_comparison_cgoonly]

We almost immediately allocate ~4GB; this is our allowed table cache size in rocksdb (25% of system memory).

In the first case (default 5.0.1 options), total memory quickly reaches double the active memory. The forced purge (with decay=0 in the second case) keeps it more or less in line. The final case (4.5.0 defaults) is not quite 1/8th, but much more reasonable.

There is more discussion about tracking this down in our issue.

And here are the stats summaries from about 5 minutes into the 17:15-17:25 run and the 17:25-21:00 run:

default 5.0.1 decay settings:

Allocated: 5024958400, active: 5579403264, metadata: 47390248, resident: 8354721792, mapped: 8425742336, retained: 1572429824
Merged arenas stats:
...
decaying:  time       npages       sweeps     madvises       purged
   dirty:   N/A       666235          115         1021        28278
   muzzy:   N/A            0            0            0            0

dirty_decay_ms=0:

Allocated: 5028887256, active: 5585518592, metadata: 44305784, resident: 5628817408, mapped: 5681987584, retained: 3381903360
Merged arenas stats:
...
decaying:  time       npages       sweeps     madvises       purged
   dirty:   N/A            0       322624       322722      4024960
   muzzy:   N/A            0            0            0            0

I'm also attaching the logs for both runs. You'll find jemalloc stats being logged every 10 seconds.
jemalloc_5.0.1_defaults.log.gz
jemalloc_5.0.1_decay0.log.gz
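
For reference, the counters in those summaries come from jemalloc's stats mallctls (stats.allocated, stats.resident, etc.). A minimal sketch of that kind of periodic polling, assuming an unprefixed jemalloc build (log_jemalloc_stats is a hypothetical helper, not our exact code):

#include <stdint.h>
#include <stdio.h>
#include <jemalloc/jemalloc.h>

static size_t
read_size_stat(const char *name) {
    size_t val = 0;
    size_t len = sizeof(val);
    if (mallctl(name, &val, &len, NULL, 0) != 0)
        return 0;
    return val;
}

void
log_jemalloc_stats(void) {
    /* jemalloc caches its stats; bump the epoch to refresh them before reading. */
    uint64_t epoch = 1;
    size_t len = sizeof(epoch);
    mallctl("epoch", &epoch, &len, &epoch, sizeof(epoch));

    printf("Allocated: %zu, active: %zu, metadata: %zu, "
        "resident: %zu, mapped: %zu, retained: %zu\n",
        read_size_stat("stats.allocated"),
        read_size_stat("stats.active"),
        read_size_stat("stats.metadata"),
        read_size_stat("stats.resident"),
        read_size_stat("stats.mapped"),
        read_size_stat("stats.retained"));
}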

Machine details:

$ uname -a
Linux cockroach-omega-0006 4.4.0-64-generic #85-Ubuntu SMP Mon Feb 20 11:50:30 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

$ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 16.04.2 LTS
Release:	16.04
Codename:	xenial

$ cat /proc/meminfo 
MemTotal:       16432076 kB
MemFree:          171792 kB
MemAvailable:    7954448 kB
Buffers:           88420 kB
Cached:          7811452 kB
SwapCached:            0 kB
Active:         11659116 kB
Inactive:        4075848 kB
Active(anon):    7864600 kB
Inactive(anon):    82500 kB
Active(file):    3794516 kB
Inactive(file):  3993348 kB
Unevictable:        3656 kB
Mlocked:            3656 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:               560 kB
Writeback:             0 kB
AnonPages:       7841176 kB
Mapped:            44504 kB
Shmem:            107160 kB
Slab:             367884 kB
SReclaimable:     333120 kB
SUnreclaim:        34764 kB
KernelStack:        8544 kB
PageTables:        50332 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     8216036 kB
Committed_AS:    5039368 kB
VmallocTotal:   34359738367 kB
VmallocUsed:           0 kB
VmallocChunk:          0 kB
HardwareCorrupted:     0 kB
AnonHugePages:    176128 kB
CmaTotal:              0 kB
CmaFree:               0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:      790516 kB
DirectMap2M:    15986688 kB
DirectMap1G:     2097152 kB

This is running on GCE. We've seen similar behavior on other clusters running in Azure and Digital Ocean (all Ubuntu, slightly different kernels).

@mberhault
Author

I should also clarify the title. The graphs and logs above are for a node/cluster that OOMed shortly after, so the long-term trend is not available.

Here is the memory graph for a different cluster running for 3.5 days (it was rolled back to 4.5.0 yesterday):

[graph: adriatic_mem_jemalloc_5.0.1]

This is the sum of memory usage across 6 nodes, each with 8GB of system memory, hence the ~12GB active CGO memory when summed across all nodes (still using 25% of memory for rocksdb cache).
CGO allocated hovers around 12.5GB, while CGO total is around 22GB.

@davidtgoldblatt
Member

Wow, this is nasty looking. @interwq knows the purging code better and will be back from vacation soon. In the meantime though, a couple things I might try:

  • Cutting the number of arenas (I'd try 8 at first and tweak as appropriate). This will allow better reuse across threads, and more opportunities to go down the purging pathways, at the possible cost of more contention.
  • Enabling background threads (a lot of your arenas only have one thread in them -- if that thread is sleeping or otherwise not interacting with the allocator much, it might never get far enough into the arena code to purge). See the sketch after this list.
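
For what it's worth, background threads can also be toggled at runtime through mallctl rather than MALLOC_CONF=background_thread:true. A minimal sketch, assuming an unprefixed jemalloc 5.x build (enable_background_threads is a hypothetical helper name; narenas, by contrast, can only be set at startup, e.g. MALLOC_CONF=narenas:8):

#include <stdbool.h>
#include <jemalloc/jemalloc.h>

int
enable_background_threads(void) {
    /* Writing true to "background_thread" starts the purging threads. */
    bool enable = true;
    return mallctl("background_thread", NULL, NULL, &enable, sizeof(enable));
}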

@mberhault
Author

Thanks for the pointers.
I haven't tried fewer arenas yet, but here are the results with the background thread:
Three sections again:

  • up to 19:38: jemalloc 4.5.0 with defaults
  • 19:40-20:00: jemalloc 5.0.1 with prof:true
  • 20:02-now: jemalloc 5.0.1 with prof:true,background_thread:true

You'll notice that the second run starts sloping upwards (probably headed for OOM, but I interrupted it early; I do want to run it longer to see whether the periodic drops are frequent and big enough to keep up).
The last run remains flat after reaching 4.8GB/5.5GB allocated/total. There's no noticeable difference in CPU usage.

[graph: omega_background_thread]

I suspect that decreasing narenas will reduce the fraction of dirty pages due to combined pools, but I'm not too concerned about that as long as it's linear.

@mberhault
Author

Ouch. I tried with prof:true,narenas:8 (the 4th run in the graph below; the first three are the ones mentioned in the previous comment).
[graph: omega_background_arenas_8]

While the climb is not as sharp as with 32 arenas, the side effects are pretty terrible: we ended up with goroutines slowly piling up behind calls to the cgo code, eventually reaching a point where various ops timed out, resulting in an increase in errors and latency.

The background thread was much friendlier. Is it disabled by default because it's not supported on all platforms?

@davidtgoldblatt
Member

Oh dang, am I reading it right that it's a 10% memory drop relative to 4.5 when background threads are on?

We don't ship with background threads enabled because there are some esoteric situations in which they can cause bugs in bug-free programs, e.g. if people are carefully counting their threads through some out-of-band mechanism (a signal-based stop-the-world scheme, say). Internally, we enable them almost everywhere (one of these days we need to put a "tuning" wiki page together).

Am I right in guessing that there's a whole lot of allocator activity from the C threads at the beginning of your program, but very little afterwards (or, concentrated just to a small number of threads)? That might begin to explain the lack of purging we're seeing.

Is the test you're using to generate this load proprietary? I'm trying to come up with some performance / behavioral test cases that we can use going forward.

@mberhault
Author

The 10% seems about right, but the numbers aren't strictly comparable: this is a 6-node cluster with load moving around the cluster. Utilization (mem/cpu/etc.) on the go side of the app is much more erratic, and some of that translates into the cgo side.

We can probably try to enable background threads at startup. However, we're trying to be pretty cross-platform so we may have to be careful there.

The load is cockroach, which is completely open source. Specifically, it was the build at cockroachdb/cockroach@bf59245 (see cockroachdb/cockroach#17044 for details). I can give you the binary if you like, although the previously released alpha version has the same issue; the switch to 5.0.1 happened just before v1.1-alpha.20170713.

The cgo part is 99% rocksdb, with some custom logic from us to interface with it. I can try to see if I can find some rocksdb load tests that exhibit this behavior, but it's also not impossible that cgo is doing something odd.

@mberhault
Author

Oh, I forgot to talk about allocations. I honestly don't know how much allocation churn there is. I think I tried jeprof --alloc_(space|objects) on my profiles, but that gave me nothing; only --inuse_(space|objects) had data. Is it supposed to work?

@bdarnell

This is a Go program (which calls into C++ for RocksDB), so the threading situation is... non-trivial. It's entirely possible that this is tripping some sort of pathological case in jemalloc (it kind of reminds me of problems I encountered with tcmalloc years ago). Go has its own allocator, so most threads interact with jemalloc infrequently. It is plausible that many threads allocate memory initially, then the system settles into a steady state in which only a few threads do so and memory is trapped in those other threads' caches. But Go has an M:N scheduler, so there shouldn't be huge numbers of threads. Sounds like enabling the background thread is the right thing for us to do.
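
If that theory is right, one hypothetical mitigation (a sketch under that assumption, not something we currently do) would be to have a cgo-facing thread flush its thread cache before going quiet, so the memory it holds can be reused or purged from its arena. Assumes an unprefixed jemalloc build; flush_thread_cache_before_idle is a made-up helper name:

#include <jemalloc/jemalloc.h>

void
flush_thread_cache_before_idle(void) {
    /* Returns this thread's cached allocations to its arena; harmless if
     * tcache is disabled. */
    mallctl("thread.tcache.flush", NULL, NULL, NULL, 0);
}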

@davidtgoldblatt
Member

The load is cockroach, which is completely open source. Specifically, it was the build at cockroachdb/cockroach@bf59245 (see cockroachdb/cockroach#17044 for details). I can give you the binary if you like, although the previously released alpha version has the same issue; the switch to 5.0.1 happened just before v1.1-alpha.20170713.

Ah, thanks. Are you doing something to throw load at it? Or is this just how the cluster acts on boot?

I think you (@bdarnell) are probably right about background threads here (and I think it's a good idea in general), but hopefully we can also figure something out on our end that prevents this sort of catastrophically bad behavior for people who don't or can't tweak their malloc_confs.

@mberhault
Author

There's some load going on (both reads/writes down to rocksdb and general cluster activity), but the initial spike in memory allocation is the rocksdb cache.
Here's the graph showing the rocksdb cache usage vs alloc vs total cgo memory for the same runs listed above:
[graph: cache_vs_mem]

My next step is to see if I can get something similar happening without using cgo. That would be of much greater concern for jemalloc. If it's a cgo-related problem, I think tweaking malloc_conf becomes a slightly easier case to make as that's already a rather specific solution that requires some work.

Could you take a quick look at the --alloc_space issue? It's a reasonably easy way to get an idea of the order of magnitude of memory churn. Although that's probably accessible through jemalloc stats as well, I haven't searched around much there.

@davidtgoldblatt
Member

You know, I'm honestly not sure. IIRC that all comes from gperftools and has been only lightly edited since. @djwatson might know more?

@djwatson
Member

djwatson commented Jul 19, 2017

For the --alloc options, you need to specify prof_accum:true when creating the profile data.
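
In other words, something along these lines when configuring the profiled run (a sketch assuming an unprefixed jemalloc build; the same string can be passed via MALLOC_CONF instead):

#include <jemalloc/jemalloc.h>

/* With profiles dumped under these options, cumulative views such as
 * `jeprof --alloc_space <binary> <heap-profile>` should have data. */
const char *malloc_conf = "prof:true,prof_accum:true";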

@mberhault
Author

Thank you both. Sorry, I should really have kept reading the man page.

I looked at --alloc_space sometime after the initial 4GB allocation, and diffs (--base) at 5-minute intervals show approximately 130MB/s of allocations (I did this on a few samples).

@mberhault
Author

Friendly ping to see if there's been any movement that may impact the behavior seen here. I'm happy to try again with fresh experiments in case things have changed.

@davidtgoldblatt
Member

I had thought that turning on background threads had fixed things for you -- is that not the case?

The issue of lots of unpurged memory sitting in arenas that no longer get touched by active threads is conceptually very straightforward on our end (we just need another level of operation count ticker; once that happens, go check and see if other arenas need purging), but annoying enough (for dumb reasons, mostly) that it might take a while. A straightforward workaround could be to periodically execute some purging operations on specific arenas through direct mallctl calls.
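
A minimal sketch of that workaround, assuming jemalloc 5.x (which defines the MALLCTL_ARENAS_ALL sentinel) and an unprefixed build; purge_all_arenas is a hypothetical helper the application would call periodically:

#include <stdio.h>
#include <jemalloc/jemalloc.h>

void
purge_all_arenas(void) {
    /* "arena.<i>.purge" with the all-arenas sentinel forces purging of
     * unused dirty pages in every arena, including idle ones. */
    char cmd[64];
    snprintf(cmd, sizeof(cmd), "arena.%u.purge",
        (unsigned)MALLCTL_ARENAS_ALL);
    mallctl(cmd, NULL, NULL, NULL, 0);
}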

It may be worth trying things on the dev branch as well; we fixed a pretty gnarly high-fragmentation edge case between the last cut and now. It doesn't look implicated here, but it's hard to tell for sure just from the graphs.

@interwq
Member

interwq commented Apr 11, 2018

As David mentioned, we've made improvements to purging behavior and fragmentation; I'd recommend testing again (we are cutting the 5.1 release in a couple of weeks).
