Question: Please help interpret stats.resident and stats.retained #1098
Comments
It's no coincidence -- when freed resident memory is returned to the kernel (i.e. by calling
By soft-reloads, do you mean it cleans up, e.g. shuts down child processes? If that's the case, all jemalloc-related metadata in those processes will be gone as well. Currently there is no shared memory management support in jemalloc (though there have been discussions around it); that part could be a bit tricky to use. However, if it's "reloading" within the same process, everything such as arenas and thread caches will remain active and can be reused.
Is that a soft-reload as you mentioned? If so you may want to restart the entire process. Also make sure you use the prefix you built jemalloc with for the conf file name (e.g. without a custom prefix it should be malloc.conf). You can also use an env var to set options, e.g. MALLOC_CONF="retain:false". Note that this option (disabling retain) is not recommended, as it uses munmap (instead of madvise), which can cause a high number of VM mappings in the kernel.
Thanks for your replies @interwq!
Thanks, it helped!
The application server I'm using is single-process and multi-threaded. So a soft-reload means that the process itself persists, some of its threads are terminated, and some new threads are created within the same process. So by saying
you confirmed my expectations of how everything works as a whole. However, I still do not understand why
Nope, to test the Thanks for the support anyway!
re: resident being much higher than active -- this is because stats.active does not include unpurged dirty pages (i.e. pages freed by the application but not yet returned to the OS). See http://jemalloc.net/jemalloc.3.html#stats.active for more details. If you find the number of dirty pages too high for your use case, you can tweak the decay time setting: http://jemalloc.net/jemalloc.3.html#opt.dirty_decay_ms
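To make the relationship between these counters concrete, here is a small sketch. The byte figures are invented for illustration; the ordering invariants follow the stat descriptions in the man page linked above:

```python
# Hypothetical snapshot of jemalloc stats, in bytes; the numbers are
# made up, the ordering invariants are the point.
allocated = 512 * 2**20  # what the application requested and still holds
active    = 600 * 2**20  # pages backing live allocations (>= allocated,
                         # due to fragmentation within size-class bins)
dirty     = 150 * 2**20  # freed pages not yet purged back to the OS
resident  = active + dirty  # roughly what shows up as RSS (ignoring
                            # metadata and muzzy pages for simplicity)
mapped    = 900 * 2**20  # total address space mapped (>= resident)

assert allocated <= active <= resident <= mapped
```

This is why resident can sit well above active under steady load: the gap is largely dirty pages waiting out the decay timer, not leaked memory.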
Closing this. Feel free to let us know if you need more info.
sorry to hog this issue; i have some confusion, the title matches part of my question, and i wanted to avoid creating a new issue. i have this stats output from jemalloc 5.1.0:
I have two things that puzzle me:
@interwq can you please help me understand this?
ok, I probably misunderstood something I read in the past. i thought that when we MADV_DONTNEED, it'll still initially show as RSS for the process, until the kernel decides to re-use these pages.
that's exactly what i mean. only small size classes show
Oh that's the MADV_FREE behavior. These are the pages under the "muzzy" state; not retained.
Good point, and that's indeed not coming from small sizes. In fact the external fragmentation is super low. One thing that jumps out is that the workload is dominated by sizes around 24K -- this would mean that the
Thanks a lot, that's really helpful.
That's Redis with active-defrag enabled, so i'll take it as a compliment 😄
is there any way to see how much memory is in "muzzy" state (something that's updated when the OS reclaims it and reduces RSS)? i.e. so i'll be able to tell how much of the RSS isn't really "pinned"
Following your observation, i was able to easily reproduce this in the lab and prove that the build flag solved the problem.
Apparently for large size classes jemalloc allocates some extra memory in order to be CPU-cache friendly, but the cost in memory usage is high (can be up to 25% overhead for allocations of 16kb). see jemalloc/jemalloc#1098 (comment) p.s. from redis's perspective that looks like external fragmentation (i.e. allocated bytes will be low, and active pages bytes will be high), which can cause active-defrag to eat CPU cycles in vain.
@interwq can you please respond to my last question (maybe you missed the notification)?
@oranagra sorry, I'm on leave this week so my response time might be unpredictable. Yes it's safe to add However the downside is also quite noticeable: as you observed, that extra page per large size class causes memory overhead, plus the extra TLB entry. The other factor is that hardware in the last few years started doing the randomization at the hardware level, i.e. the address-to-cacheline mapping isn't a direct mapping anymore. So there's debate about disabling the randomization by default, but we are still hesitant, because when it matters, it can matter a lot, and having it enabled by default limits that worst-case behavior, even though it means the majority of workloads suffer a regression. So in short, please do add that in Redis, as it's safe and offers better performance in most cases.
The decay section in malloc_stats has the number of pages under muzzy and dirty. You can also query this mallctl: http://jemalloc.net/jemalloc.3.html#stats.arenas.i.pmuzzy
thanks a lot!
Apparently for large size classes jemalloc allocates some extra memory (can be up to 25% overhead for allocations of 16kb). see jemalloc/jemalloc#1098 (comment) p.s. from Redis's perspective that looks like external fragmentation (i.e. allocated bytes will be low, and active pages bytes will be high), which can cause active-defrag to eat CPU cycles in vain. Some details about this mechanism we disable: --------------------------------------------------------------- Disabling this mechanism only affects large allocations (above 16kb). Not only is it not expected to cause any performance regressions, it's actually recommended, unless you have a specific workload pattern and hardware that benefit from this feature. By default it's enabled and adds address randomization to all large buffers, by over-allocating one page per large size class and offsetting into that page to make the starting address of the user buffer randomized. Workloads such as scientific computation often handle multiple big matrices at the same time, and the randomization makes sure that the cacheline-level accesses don't suffer bad conflicts (which would happen if they all started from page-aligned addresses). However the downside is also quite noticeable: the extra page per large size class causes memory overhead, plus the extra TLB entry. The other factor is that hardware in the last few years started doing the randomization at the hardware level, i.e. the address-to-cacheline mapping isn't a direct mapping anymore. So there's debate about disabling the randomization by default, but the developers are still hesitant, because when it matters it can matter a lot, and having it enabled by default limits that worst-case behavior, even though it means the majority of workloads suffer a regression. So in short, it's safe and offers better performance in most cases.
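The "up to 25%" figure mentioned above is just the ratio of the one extra page to the allocation size, so it is largest for the smallest large allocations and shrinks as sizes grow. A quick sketch of the arithmetic (assuming 4 KiB pages):

```python
PAGE = 4096  # assumed page size in bytes

def cache_oblivious_overhead(request_size):
    """Relative memory overhead of the one extra page mapped per
    large allocation for address randomization."""
    return PAGE / request_size

# 16 KiB allocation: one extra 4 KiB page on top of 16 KiB -> 25%
print(f"{cache_oblivious_overhead(16 * 1024):.0%}")    # 25%
# The overhead shrinks for bigger sizes, e.g. a 1 MiB allocation:
print(f"{cache_oblivious_overhead(1024 * 1024):.2%}")  # 0.39%
```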
Just throwing my two cents here about cache-obliviousness. I was thinking of disabling this optimization for ClickHouse (ClickHouse/ClickHouse#57951), but there were some perf test failures; they weren't significant, but they were stable. Usually such things are ignored on changes, however I decided to verify. And what I found is that it indeed still makes sense; here is a simple repro - https://gist.github.com/azat/2dc33fdadbb2feaf18e9cb591392f6cb And AFAIU it will always make sense, because the CPU cache is not fully associative... P.S. I also found this publication - https://www.cs.tau.ac.il/~mad/publications/ismm2011-CIF.pdf - which is worth reading (though I guess the jemalloc developers have seen it).
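The conflict effect described above can be illustrated without jemalloc. In a set-associative cache, when the way size (cache size / associativity) is at least the page size, page-aligned buffers all start in the same cache set, so same-offset accesses across many buffers compete for the same few ways. A minimal sketch with invented cache parameters (1 MiB, 8-way, 64-byte lines) and a direct address-to-set mapping:

```python
LINE = 64    # cache line size in bytes (assumed)
SETS = 2048  # number of cache sets (assumed: 1 MiB 8-way, 64 B lines)
PAGE = 4096  # page size in bytes (assumed)

def cache_set(addr):
    # Set index under a direct address-to-cacheline mapping.
    return (addr // LINE) % SETS

# Eight hypothetical large buffers spaced a multiple of the way size
# apart: page-aligned starts all land in set 0, so accesses to the
# same offset in each buffer collide in one set.
page_aligned = [i * 32 * PAGE for i in range(8)]
print({cache_set(a) for a in page_aligned})  # {0}

# Offsetting each start within one extra page (what the randomization
# does; deterministic offsets here for reproducibility) spreads them out.
spread = [a + i * LINE for i, a in enumerate(page_aligned)]
print(sorted({cache_set(a) for a in spread}))  # [0, 1, 2, 3, 4, 5, 6, 7]
```

This is the worst case the default guards against; hardware-level index hashing on newer CPUs weakens the same conflicts, which is the trade-off debated above.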
Hello,
Thanks for your work on jemalloc. I'm experimenting with using jemalloc for an implementation of a scripting language (Lua) and I cannot figure out how to interpret jemalloc's stats.
My setup:
--disable-cxx --without-export --with-jemalloc-prefix=jemalloc_
During the test, I simulated the workload by blasting real-world traffic (~2000 requests per second) processed by business logic implemented in Lua. On top of that, I soft-reloaded the application server several times: without stopping the process, it gently terminated existing worker threads (the Lua interpreter was eventually completely shut down inside terminated threads, freeing all resources) and created new ones.
I measured the following stats (the chart is attached):
I'm fine with the first three metrics: the server is under constant load, business logic allocates memory which is eventually collected by the garbage collector, and the overall amount of application-allocated memory stays the same. So far, so good. However, here are the things that I do not understand:
I set retain:false in /etc/jemalloc_malloc.conf, but the behaviour of stats.retained did not seem to change. Why?

P.S. Just for added clarity: the "saw" on the chart is the number of seconds passed since the last soft-reload. A drop to zero obviously indicates a reload.