Decay-based purging results in ~50% dirty pages. #956
Comments
I should also clarify the title. The graphs and logs above are for a node/cluster that OOMed shortly afterwards, so the long-term trend is not available. Here is the memory graph for a different cluster that has been running for 3.5 days (it was rolled back to 4.5.0 yesterday). This graph is the sum of memory usage across 6 nodes, each with 8GB of system memory, hence the ~12GB of active CGO memory when summed across all nodes (still using 25% of memory for the rocksdb cache).
Wow, this is nasty looking. @interwq knows the purging code better and will be back from vacation soon. In the meantime though, a couple things I might try:
Thanks for the pointers.
You'll notice that the second run starts sloping upwards (probably until OOM, but I interrupted it early; I do want to run it longer to see whether the periodic drops are frequent and large enough to keep up). I suspect the decreased narenas will reduce the fraction of dirty pages due to combined pools, but I'm not too concerned about that as long as it's linear.
Oh dang, am I reading it right that it's a 10% memory drop relative to 4.5 when background threads are on? We don't ship with background threads enabled because there are some esoteric situations in which they can cause bugs in otherwise bug-free programs, e.g. if people are carefully counting their threads through some out-of-band mechanism (like some sort of signal-based stop-the-world scheme). Internally, we enable them almost everywhere (one of these days we need to put a "tuning" wiki page together).

Am I right in guessing that there's a whole lot of allocator activity from the C threads at the beginning of your program, but very little afterwards (or activity concentrated in just a small number of threads)? That might begin to explain the lack of purging we're seeing. Is the test you're using to generate this load proprietary? I'm trying to come up with some performance/behavioral test cases that we can use going forward.
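For readers following along, a minimal sketch of opting in to background threads at runtime through jemalloc 5.x's mallctl interface. The helper name is mine, not anything from jemalloc or cockroach, and it assumes an unprefixed jemalloc build (a prefixed build would call je_mallctl):

```c
#include <stdbool.h>
#include <stddef.h>
#include <jemalloc/jemalloc.h>

/* Sketch: opt in to jemalloc's background purging threads at runtime.
 * Returns 0 on success, an errno-style value otherwise. */
static int enable_background_threads(void) {
    bool enable = true;
    return mallctl("background_thread", NULL, NULL, &enable, sizeof(enable));
}
```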
The 10% seems about right, but the numbers aren't strictly comparable: this is a 6-node cluster with load moving around the cluster. Utilization (mem/cpu/etc.) on the Go side of the app is much more erratic, and some of that translates into the cgo side. We can probably try to enable background threads at startup; however, we're trying to be pretty cross-platform, so we may have to be careful there.

The load is cockroach, which is completely open source. Specifically, it was the binary built at cockroachdb/cockroach@bf59245 (see cockroachdb/cockroach#17044 for details). I can give you the binary if you like, although the previously released alpha version has the same issue; the switch to 5.0.1 happened just before v1.1-alpha.20170713. The cgo part is 99% rocksdb, with some custom logic from us to interface with it. I can try to see if I can find some rocksdb load tests that exhibit this behavior, but it's also not impossible that cgo is doing something odd.
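One way to enable background threads at startup without touching the environment is jemalloc's compile-time malloc_conf string; a minimal sketch, assuming the option is set from the C/C++ code linked against jemalloc:

```c
/* Sketch: compile-time default options that jemalloc reads at startup.
 * Equivalent to setting MALLOC_CONF=background_thread:true in the
 * environment; with a prefixed jemalloc build the symbol is
 * je_malloc_conf instead. */
const char *malloc_conf = "background_thread:true";
```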
Oh, I forgot to talk about allocations. I honestly don't know how much allocation churn there is. I think I tried
This is a Go program (which calls into C++ for RocksDB), so the threading situation is... non-trivial. It's entirely possible that this is tripping some sort of pathological case in jemalloc (it kind of reminds me of problems I encountered with tcmalloc years ago). Go has its own allocator, so most threads interact with jemalloc infrequently. It is plausible that many threads allocate memory initially, and then the system settles into a steady state in which only a few threads do so and memory is trapped in those other threads' caches. But Go has an M:N scheduler, so there shouldn't be huge numbers of threads. Sounds like enabling the background thread is the right thing for us to do.
Ah, thanks. Are you doing something to throw load at it, or is this just how the cluster acts on boot? I think you (@bdarnell) are probably right about background threads here (and I think it's a good idea in general), but hopefully we can also figure something out on our end that prevents this sort of catastrophically bad behavior for people who don't or can't tweak their malloc_confs.
You know, I'm honestly not sure. IIRC that all comes from gperftools and has been only lightly edited since. @djwatson might know more? |
For the --alloc options, you need to specify prof_accum:true when creating the profile data.
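In other words (my reading of the comment above, shown here as a hedged sketch), the profile has to be collected with cumulative counting enabled from the start, for example via the compile-time malloc_conf string:

```c
/* Sketch: collect heap profiles with cumulative allocation counts so
 * that jeprof's --alloc_* views have data to report, per the comment
 * above. Equivalent to MALLOC_CONF=prof:true,prof_accum:true. */
const char *malloc_conf = "prof:true,prof_accum:true";
```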
Thank you both. Sorry, I should really have kept reading the man page. I looked at
Friendly ping to see if there's been any movement that may impact the behavior seen here. I'm happy to try again with fresh experiments if things may have changed. |
I had thought that turning on background threads had fixed things for you -- is that not the case?

The issue of lots of unpurged memory sitting in arenas that are no longer touched by active threads is conceptually very straightforward on our end (we just need another level of operation-count ticker; once that fires, go check whether other arenas need purging), but annoying enough (for dumb reasons, mostly) that it might take a while. A straightforward workaround could be to periodically execute some purging operations on specific arenas through direct mallctl calls, as sketched below.

It may be worth trying things at dev as well; we fixed a pretty gnarly high-fragmentation edge case between the last cut and now. It doesn't look implicated here, but it's hard to tell for sure just from the graphs.
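A minimal sketch of that workaround, assuming jemalloc 5.x and an unprefixed build (MALLCTL_ARENAS_ALL is the 5.x way to address every arena at once); how often to call it is left to the application:

```c
#include <stdio.h>
#include <jemalloc/jemalloc.h>

/* Sketch: force a purge of dirty pages across all arenas. Something like
 * this could run periodically (e.g. off a timer) as a stopgap until
 * jemalloc purges idle arenas on its own. */
static void purge_all_arenas(void) {
    char cmd[64];
    snprintf(cmd, sizeof(cmd), "arena.%d.purge", MALLCTL_ARENAS_ALL);
    mallctl(cmd, NULL, NULL, NULL, 0);
}
```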
As David mentioned, we have made improvements to purging behavior and fragmentation; I'd recommend testing again (we are cutting the 5.1 release in a couple of weeks).




We (cockroachdb) recently updated jemalloc to 5.0.1 (see here for our fork of 5.0.1 plus a few of our own commits).
The change from ratio-based to decay-based purging results in approximately 50% dirty pages.
Here is a graph showing the go/cgo allocated/total memory for a single one of our nodes (in a cluster of 6). The cgo part of cockroach consists mostly of rocksdb though it's a bit more complicated than that.
Three distinct sections can be seen in the graph:
- MALLOC_CONF=prof:true (jemalloc 5.0.1 defaults)
- MALLOC_CONF=prof:true,dirty_decay_ms:0
- MALLOC_CONF=prof:true (back on jemalloc 4.5.0)

The relevant lines are the orange (CGO allocated, from jemalloc stats.allocated) and red (CGO total, from jemalloc stats.resident). Here's the graph again with only those stats shown:

We almost immediately allocate ~4GB; this is our allowed table cache size in rocksdb (25% of system memory).
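For reference, these two counters can be read programmatically through mallctl; a minimal sketch, assuming an unprefixed jemalloc build (cockroach's actual stats plumbing goes through cgo and is not shown here):

```c
#include <stdint.h>
#include <stdio.h>
#include <jemalloc/jemalloc.h>

/* Sketch: read the allocator-wide counters behind the "CGO allocated"
 * (stats.allocated) and "CGO total" (stats.resident) lines. Writing to
 * "epoch" refreshes jemalloc's cached statistics first. */
static void print_jemalloc_stats(void) {
    uint64_t epoch = 1;
    size_t sz = sizeof(epoch);
    mallctl("epoch", &epoch, &sz, &epoch, sz);

    size_t allocated = 0, resident = 0, len = sizeof(size_t);
    mallctl("stats.allocated", &allocated, &len, NULL, 0);
    len = sizeof(size_t);
    mallctl("stats.resident", &resident, &len, NULL, 0);
    printf("allocated=%zu resident=%zu\n", allocated, resident);
}
```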
In the first case (default 5.0.1 options), total memory quickly reaches double the active memory. The forced purge (decay=0, the second case) keeps it more or less in line. The final case (4.5.0 defaults) is not quite down to the 1/8th dirty-page ratio, but is much more reasonable.
There is more discussion about tracking this down in our issue.
And here is just the summary from about 5 minutes into the 17:15-17:25 run and the 17:25-21:00 run:
default 5.0.1 decay settings:
dirty_decay_ms=0:
I'm also attaching the logs for both runs. You'll find jemalloc stats being logged every 10 seconds.
jemalloc_5.0.1_defaults.log.gz
jemalloc_5.0.1_decay0.log.gz
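A minimal sketch of how such periodic stats dumps can be produced with jemalloc's own malloc_stats_print; cockroach's real logging goes through its own infrastructure, so this only illustrates the jemalloc call involved:

```c
#include <unistd.h>
#include <jemalloc/jemalloc.h>

/* Sketch: dump jemalloc's full stats report every 10 seconds. With NULL
 * arguments, malloc_stats_print writes a human-readable report to stderr.
 * Intended to run on its own thread, e.g. started via pthread_create. */
static void *stats_logger(void *arg) {
    (void)arg;
    for (;;) {
        malloc_stats_print(NULL, NULL, NULL);
        sleep(10);
    }
    return NULL;
}
```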
Machine details:
This cluster is running on GCE. We've seen similar behavior on other clusters running on Azure and Digital Ocean (all Ubuntu, slightly different kernels).