Question: Please help interpret stats.resident and stats.retained #1098

Closed

igelhaus opened this issue Dec 26, 2017 · 14 comments

@igelhaus

Hello,

Thanks for your work on jemalloc. I'm experimenting with using jemalloc for an implementation of a scripting language (Lua) and I cannot figure out how to interpret jemalloc's stats.

My setup:

  • 4.4.0-72-generic #93~14.04.1-Ubuntu SMP Fri Mar 31 15:05:15 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
  • An application server with 16 worker threads, with each worker thread hosting an independent single-threaded instance of the Lua interpreter.
  • The Lua interpreter is statically linked with jemalloc 5.0.1, configured with --disable-cxx --without-export --with-jemalloc-prefix=jemalloc_

During the test, I simulated the workload by replaying real-world traffic (~2000 requests per second), which was processed by business logic implemented in Lua. On top of that, I soft-reloaded the application server several times: without stopping the process, it gently terminated existing worker threads (the Lua interpreter was eventually shut down completely inside the terminated threads, freeing all resources) and created new ones.

I measured the following stats (the chart is attached; a sketch of how such stats can be read via mallctl follows the list):

  1. stats.allocated
  2. stats.active
  3. stats.metadata
  4. stats.resident
  5. stats.retained
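
For context, stats like these can be read programmatically with jemalloc's mallctl API. The sketch below is only illustrative, not the exact sampling code used in this test; the jemalloc_-prefixed symbol names follow from the configure flags listed above (a default build uses the unprefixed names).

```c
/* Minimal sketch: read the five stats above via mallctl.
 * Symbol names assume the --with-jemalloc-prefix=jemalloc_ build described
 * earlier; with a default build they would be mallctl(), etc. */
#include <stdio.h>
#include <stdint.h>
#include <jemalloc/jemalloc.h>

static size_t read_stat(const char *name) {
    size_t value = 0;
    size_t len = sizeof(value);
    if (jemalloc_mallctl(name, &value, &len, NULL, 0) != 0)
        return 0;
    return value;
}

void dump_stats(void) {
    /* Stats are cached; bump the epoch to refresh them before reading. */
    uint64_t epoch = 1;
    size_t len = sizeof(epoch);
    jemalloc_mallctl("epoch", &epoch, &len, &epoch, sizeof(epoch));

    const char *names[] = {
        "stats.allocated", "stats.active", "stats.metadata",
        "stats.resident", "stats.retained",
    };
    for (size_t i = 0; i < sizeof(names) / sizeof(names[0]); i++)
        printf("%s = %zu\n", names[i], read_stat(names[i]));
}
```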

I'm fine with the first three metrics: the server is under a constant load, business logic allocates memory which is eventually collected by the garbage collector, and the overall amount of application-allocated memory remains the same. So far, so good. However, here are the things that I do not understand:

  1. Is it just a coincidence that stats.resident and stats.retained "mirror" each other on the chart? If not, how to interpret stats.retained + stats.resident?
  2. Changes in stats.retained / stats.resident correlate with soft-reloads of the application server. Do I get it right that growth of stats.resident can be interpreted as: An instance of the Lua interpreter shuts down on soft-reload and frees all previously allocated memory, but this freed memory is not reused immediately for a newly created instance - some fresh memory is allocated instead? If so, is there a way to re-use the freed memory more actively?
  3. Until 14:20 (a dip in the chart) I ran the test with all jemalloc settings set to their default values. At 14:20 I set retain:false in /etc/jemalloc_malloc.conf, but the behaviour of stats.retained did not seem to change. Why?

[Attached chart: jemalloc-stats — the five metrics above plotted over the test period]

P.S. Just for added clarity: the "saw" on the chart is the number of seconds since the last soft-reload. A drop to zero indicates a reload.

@interwq (Member) commented Jan 2, 2018

Is it just a coincidence that stats.resident and stats.retained "mirror" each other on the chart? If not, how to interpret stats.retained + stats.resident?

It's no coincidence -- when freed resident memory is returned to the kernel (i.e. by calling madvise on it), it turns into retained memory. In other words, retained memory represents only virtual memory usage; no physical memory is backing retained regions. You can see more details here: http://jemalloc.net/jemalloc.3.html#stats.resident.

Changes in stats.retained / stats.resident correlate with soft-reloads of the application server. Do I get it right that growth of stats.resident can be interpreted as: An instance of the Lua interpreter shuts down on soft-reload and frees all previously allocated memory, but this freed memory is not reused immediately for a newly created instance - some fresh memory is allocated instead? If so, is there a way to re-use the freed memory more actively?

By soft-reloads, do you mean cleanup such as shutting down child processes? If that's the case, all jemalloc-related metadata in those processes will be gone as well. Currently there is no shared-memory management support in jemalloc (though there have been discussions around it), and that part could be a bit tricky to use. However, if it's "reloading" within the same process, everything such as arenas and the thread cache will remain active and can be reused.

Until 14:20 (a dip in the chart) I ran the test with all jemalloc settings set to their default values. At 14:20 I set retain:false in /etc/jemalloc_malloc.conf, but the behaviour of stats.retained did not seem to change. Why?

Is that a soft-reload as you mentioned? If so you may want to restart the entire process. Also make sure the conf file name uses the prefix you built jemalloc with (e.g. without a custom prefix it should be malloc.conf). You can also use an environment variable to set options, e.g. MALLOC_CONF="retain:false". Note that this option (disabling retain) is not recommended, as it uses munmap (instead of madvise), which could cause a high number of VM mappings in the kernel.
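
Besides the conf file and the environment variable, options can also be baked in via the malloc_conf global. A minimal sketch, assuming the jemalloc_-prefixed symbol name implied by the --with-jemalloc-prefix=jemalloc_ build above (a default build uses plain malloc_conf):

```c
/* Sketch: compile-time default options, read by jemalloc before the
 * first allocation; no code needs to run.  The prefixed symbol name is
 * an assumption based on --with-jemalloc-prefix=jemalloc_. */
#include <jemalloc/jemalloc.h>

const char *jemalloc_malloc_conf = "retain:false";
```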

@igelhaus (Author) commented Jan 9, 2018

Thanks for your replies @interwq!

It's no coincidence -- when freed resident memory is returned to the kernel (i.e. by calling madvise on them), they turn into retained memory.

Thanks, it helped!

By soft-reloads, does it clean up such as shutdown child processes etc?

The application server I'm using is single-process and multi-threaded. So a soft-reload means that the process itself persists, some of its threads are terminated, and some new threads are created within the same process. So by saying

However if it's "reloading" within the same process, everything such as arenas and thread cache will remain active and can be reused.

you confirmed my expectations of how everything works as a whole. However, I still do not understand why stats.resident is so much bigger than stats.active. Probably I should run jemalloc's memory profiler to get a better grasp of how allocated objects are spread across arenas. If you could point out any other directions to look at, I'd appreciate it very much :-)
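
A minimal sketch of how a heap profile could be dumped via mallctl, assuming a build with --enable-prof (which the configure line quoted earlier does not include), profiling enabled at startup (e.g. "prof:true" in the option string), and the same jemalloc_ prefix:

```c
/* Sketch: trigger a heap profile dump via mallctl("prof.dump", ...).
 * Requires jemalloc built with --enable-prof and profiling enabled at
 * startup.  Prefixed symbol names assumed as above. */
#include <jemalloc/jemalloc.h>

void dump_heap_profile(void) {
    /* NULL filename: jemalloc names the dump file from opt.prof_prefix. */
    jemalloc_mallctl("prof.dump", NULL, NULL, NULL, 0);
}
```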

Is that a soft-reload as you mentioned?

Nope, to test the retain:false option I killed the entire process and started it anew. I used a prefixed file name (/etc/jemalloc_malloc.conf), but I will double-check. Yes, I understand that disabling retain is not recommended, my intention was simply to experiment with various options and find a difference in metrics.

Thanks for the support anyway!

@interwq (Member) commented Jan 19, 2018

re: resident being much higher than active -- this is because stats.active does not include unpurged dirty pages (i.e. pages freed by the application but not yet returned to the OS). See http://jemalloc.net/jemalloc.3.html#stats.active for more details.

If you find the number of dirty pages is too high for your use case, you can tweak the decay time setting: http://jemalloc.net/jemalloc.3.html#opt.dirty_decay_ms
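
For reference, a minimal sketch of lowering the dirty-page decay time at runtime, assuming the same jemalloc_-prefixed build as above; the same value can be set at startup with "dirty_decay_ms:<ms>" in the option string (the jemalloc 5 default is 10000 ms):

```c
/* Sketch: shorten the dirty-page decay time so freed pages are purged
 * (returned to the OS) sooner.  Prefixed symbol names assumed as above. */
#include <sys/types.h>
#include <jemalloc/jemalloc.h>

void set_default_dirty_decay(ssize_t decay_ms) {
    /* Applies to arenas created after this call; existing arenas can be
     * adjusted individually via the "arena.<i>.dirty_decay_ms" mallctl. */
    jemalloc_mallctl("arenas.dirty_decay_ms", NULL, NULL,
                     &decay_ms, sizeof(decay_ms));
}
```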

@interwq (Member) commented Jan 19, 2018

Closing this. Feel free to let us know if you need more info.

@oranagra

Sorry to hog this issue; I have some confusion, the title matches part of my question, and I wanted to avoid creating a new issue.

I have this stats output from jemalloc 5.1.0:

Allocated: 26690095480, active: 31030251520, metadata: 331158040 (n_thp 0), resident: 31361589248, mapped: 31372386304, retained: 16204652544
                            allocated     nmalloc     ndalloc   nrequests
small:                      875473272     8539690     3951389   153442175
large:                    25814622208    14612178    13552674    14612178
total:                    26690095480    23151868    17504063   168054353

I have two things that puzzle me:

  1. I previously understood that the retained memory is a portion of the resident memory that's not really used (not active), i.e. since retained is 16GB, I'd expect mapped and resident to be 16GB bigger than active.
  2. I see there's some 6GB difference between active and allocated (what I'd normally consider external fragmentation), caused by slabs that are partially used, but it appears that all that excess memory is actually inside large bins, which AFAIK are not susceptible to fragmentation.

@interwq can you please help me understand this?

@interwq (Member) commented Jun 13, 2023

@oranagra :

  1. Retained memory isn't part of RSS or mapped. It has all been through madvise(... MADV_DONTNEED), so even though the VM is still retained, no physical memory is backing it. This design makes VM management easier and more efficient.
  2. The difference between active and allocated -- this is exactly external fragmentation, as you pointed out. You can further check the bin stats to see which size classes contribute more. There's a util column in malloc_stats which is the inverse of the fragmentation ratio. Combining the util and allocated values for each size class, you can calculate how many fragmented / wasted bytes come from each size class (see the sketch below). Large size classes are all page-sized, so they don't suffer external fragmentation.
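
To make that calculation concrete, a rough back-of-the-envelope sketch (the 100 MiB / 0.80 figures are purely hypothetical):

```c
/* Rough reading of the bin stats: if util is the fraction of slab
 * regions in use, total slab bytes ~= allocated / util, so the wasted
 * (externally fragmented) bytes are the difference. */
#include <stdio.h>

static double wasted_bytes(double allocated, double util) {
    return allocated * (1.0 - util) / util;
}

int main(void) {
    /* Hypothetical example: a bin with 100 MiB allocated at 0.80 util
     * holds roughly 25 MiB of unusable space. */
    printf("%.0f MiB wasted\n",
           wasted_bytes(100.0 * 1024 * 1024, 0.80) / (1024 * 1024));
    return 0;
}
```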

@oranagra

1. Retained memory isn't part of RSS or mapped. It has all been through madvise(... MADV_DONTNEED), so even though the VM is still retained, no physical memory is backing it. This design makes VM management easier and more efficient.

OK, I probably misunderstood something I read in the past. I thought that when we MADV_DONTNEED, the memory would still initially show as RSS for the process, until the kernel decides to reuse these pages.

2. There's a util column in malloc_stats which is the inverse of the fragmentation ratio. Combining the util and allocated values for each size class, you can calculate how many fragmented / wasted bytes come from each size class. Large size classes are all page-sized, so they don't suffer external fragmentation.

That's exactly what I mean: only small size classes show util and suffer from external fragmentation. But in my example the delta between allocated and active is some 6GB, while the small size classes only occupy 800MB, and if they had an excess of 6GB that would mean very severe fragmentation; I don't see any of that in the util.
Attaching the full stats dump, maybe I'm missing something.

malloc-stats.txt

@interwq (Member) commented Jun 13, 2023

I thought that when we MADV_DONTNEED, the memory would still initially show as RSS for the process, until the kernel decides to reuse these pages.

Oh, that's the MADV_FREE behavior. Those are pages in the "muzzy" state, not retained.

the small size classes only occupy 800MB, and if they had an excess of 6GB that would mean very severe fragmentation; I don't see any of that in the util.

Good point, and that's indeed not coming from small sizes. In fact, the external fragmentation is super low.

One thing that jumps out is that the workload is dominated by sizes around 24K -- this would mean the cache-oblivious feature is causing high overhead; see that option in https://github.com/jemalloc/jemalloc/blob/dev/INSTALL.md
Can you please try building jemalloc with --disable-cache-oblivious? It should save the entire 6GB you are seeing and is worth verifying.

@oranagra

Thanks a lot, that's really helpful.

In fact the external fragmentation is super low

That's Redis with active-defrag enabled, so I'll take it as a compliment 😄

These are the pages under the "muzzy" state; not retained

Is there any way to see how much memory is in the "muzzy" state (something that's updated when the OS reclaims it and reduces RSS)? I.e. so I'll be able to tell how much of the RSS isn't really "pinned".

Can you please try building jemalloc with --disable-cache-oblivious?

Following your observation, I was able to easily reproduce this in the lab and prove that the build flag solved the problem.
I think I'll make that permanent in Redis builds, so I want to be sure I know what I'm giving up.
It implies that I'm giving up some CPU cache efficiency, but I doubt that it has a high impact at these sizes (I understand it only affects large bins, i.e. allocations of 16k and above).
And my understanding is that what it does is add some padding so allocations are page-aligned, but isn't that always the case for large allocations, which don't share pages with other allocations of the same bin?

oranagra added a commit to oranagra/redis that referenced this issue Jun 14, 2023
Apparently for large size classes jemalloc allocates some extra memory in
order to be CPU-cache friendly, but the cost in memory usage is high
(can be up to 25% overhead for allocations of 16kb).
See jemalloc/jemalloc#1098 (comment)

P.S. From Redis's perspective that looks like external fragmentation
(i.e. allocated bytes will be low, and active pages bytes will be large),
which can cause active-defrag to eat CPU cycles in vain.

@oranagra

@interwq can you please respond to my last question (maybe you missed the notification)?
Thanks a lot.

@interwq (Member) commented Jun 16, 2023

@oranagra sorry, I'm on leave this week so my response time might be unpredictable.

Yes, it's safe to add --disable-cache-oblivious. I'd even say it's recommended, unless you have a specific workload pattern and hardware that benefit from this feature -- by default it's enabled and adds address randomization to all large buffers, by over-allocating 1 page per large size class and offsetting into that page to make the starting address of the user buffer randomized. Workloads such as scientific computation often handle multiple big matrices at the same time, and the randomization makes sure that cacheline-level accesses don't suffer bad conflicts (when they all start from page-aligned addresses).

However, the downside is also quite noticeable: as you observed, that extra page per large size class can cause memory overhead, plus the extra TLB entry. The other factor is that hardware in the last few years has started doing the randomization at the hardware level, i.e. the address-to-cacheline mapping isn't a direct mapping anymore. So there's debate about disabling the randomization by default, but we are still hesitant because when it matters, it could matter a lot, and having it enabled by default limits that worst-case behavior, even though it means the majority of workloads suffer a regression.

So in short, please do add that in Redis, as it's safe and offers better performance in most cases.
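
To put numbers on the memory cost described above: a small sketch assuming a 4 KiB page and one extra page per large allocation, which is what the 25% figure for 16kb allocations in the commit message above implies.

```c
/* Relative overhead of one extra 4 KiB page per large allocation. */
#include <stdio.h>
#include <stddef.h>

int main(void) {
    const double page = 4096.0;
    const size_t sizes[] = {16384, 24576, 65536, 1048576};
    for (size_t i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
        /* 16 KiB -> 25.0%, 24 KiB -> 16.7%, 64 KiB -> 6.2%, 1 MiB -> 0.4% */
        printf("%7zu bytes: %.1f%% overhead\n", sizes[i],
               100.0 * page / (double)sizes[i]);
    }
    return 0;
}
```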

@interwq (Member) commented Jun 16, 2023

is there any way to see how much memory is in "muzzy" state

The decay section in malloc_stats has the number of pages under muzzy and dirty. You can also query this mallctl: http://jemalloc.net/jemalloc.3.html#stats.arenas.i.pmuzzy
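
A minimal sketch of summing muzzy pages across all arenas via that mallctl; unprefixed symbol names are assumed here, so adjust them if your jemalloc build uses a symbol prefix:

```c
/* Sketch: total muzzy (MADV_FREE'd, still-resident) pages across arenas. */
#include <stdio.h>
#include <stdint.h>
#include <jemalloc/jemalloc.h>

size_t total_muzzy_pages(void) {
    /* Refresh cached stats first. */
    uint64_t epoch = 1;
    size_t sz = sizeof(epoch);
    mallctl("epoch", &epoch, &sz, &epoch, sizeof(epoch));

    unsigned narenas = 0;
    sz = sizeof(narenas);
    mallctl("arenas.narenas", &narenas, &sz, NULL, 0);

    size_t total = 0;
    for (unsigned i = 0; i < narenas; i++) {
        char name[64];
        size_t pmuzzy = 0;
        sz = sizeof(pmuzzy);
        snprintf(name, sizeof(name), "stats.arenas.%u.pmuzzy", i);
        if (mallctl(name, &pmuzzy, &sz, NULL, 0) == 0) /* skip unused arenas */
            total += pmuzzy;
    }
    return total; /* multiply by the page size to get bytes */
}
```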

@oranagra

Thanks a lot!

oranagra added a commit to redis/redis that referenced this issue Jun 18, 2023

Apparently for large size classes jemalloc allocates some extra
memory (can be up to 25% overhead for allocations of 16kb).
See jemalloc/jemalloc#1098 (comment)

P.S. From Redis's perspective that looks like external fragmentation
(i.e. allocated bytes will be low, and active pages bytes will be large),
which can cause active-defrag to eat CPU cycles in vain.

Some details about this mechanism we disable:
---------------------------------------------------------------
Disabling this mechanism only affects large allocations (above 16kb).
Not only is it not expected to cause any performance regressions,
it's actually recommended, unless you have a specific workload pattern
and hardware that benefit from this feature -- by default it's enabled and
adds address randomization to all large buffers, by over allocating 1 page
per large size class, and offsetting into that page to make the starting
address of the user buffer randomized. Workloads such as scientific
computation often handle multiple big matrices at the same time, and the
randomization makes sure that the cacheline level accesses don't suffer
bad conflicts (when they all start from page-aligned addresses).

However, the downside is also quite noticeable: as you observed, that extra
page per large size can cause memory overhead, plus the extra TLB entry.
The other factor is, hardware in the last few years started doing the
randomization at the hardware level, i.e. the address to cacheline mapping isn't
a direct mapping anymore. So there's debate about disabling the randomization by default,
but we are still hesitant because when it matters, it could matter a lot, and having
it enabled by default limits that worst case behavior, even though it means the
majority of workloads suffer a regression.

So in short, it's safe and offers better performance in most cases.
@azat (Contributor) commented Dec 20, 2023

Just throwing in my two cents here about cache-oblivious allocation.

I was thinking of disabling this optimization for ClickHouse (ClickHouse/ClickHouse#57951), but there were some perf test failures. They weren't significant, but they were stable; usually such things are ignored on changes, however I decided to verify.

And what I found is that it does indeed still make sense; here is a simple repro - https://gist.github.com/azat/2dc33fdadbb2feaf18e9cb591392f6cb

And AFAIU it will always make sense, since the CPU cache is not fully associative...

P.S. I also found this publication - https://www.cs.tau.ac.il/~mad/publications/ismm2011-CIF.pdf - which is worth reading (though I guess the jemalloc developers have seen it).
