vm.max_map_count growing steadily when vm.overcommit_memory is 2 #1328
Thank you for this great bug report and diagnosis. I don't think the current behavior was really chosen per se; it's just an emergent property of a combination of things we don't test very well. I think this needs some amount of philosophizing. (I think it might have to wait for @interwq before anyone can take a thorough look; we're stretched a little thin for now.) Ironically, one thing that might help is disabling purging settings (i.e. …).
@davidtgoldblatt : thanks for looking into it and for the suggestion of disabling purging. With jemalloc 5.1.0 this indeed makes the number of mappings grow much more slowly, but the overall number still increases over time. It is in the hundreds rather than in the tens of thousands now.
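The exact option string suggested above is elided; as an illustration, in jemalloc 5.x purging can be disabled at runtime through the MALLOC_CONF environment variable by setting the decay times to -1 (these option names come from the jemalloc manual; the binary below is a placeholder):

```shell
# Disable dirty/muzzy page purging entirely (jemalloc 5.x runtime
# options; -1 means "never decay"). Illustrative invocation -- replace
# ./your_app with a binary that links jemalloc.
MALLOC_CONF="dirty_decay_ms:-1,muzzy_decay_ms:-1" ./your_app
```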
Just a side note regarding the kernel, if needed: this behavior could also be observed with kernels 3.10.x and 4.4.x.
For reference, here is a "fix" for 5.1.0:
diff --git a/3rdParty/jemalloc/v5.1.0/src/pages.c b/3rdParty/jemalloc/v5.1.0/src/pages.c
index 26002692d6..3fbad076ad 100644
--- a/3rdParty/jemalloc/v5.1.0/src/pages.c
+++ b/3rdParty/jemalloc/v5.1.0/src/pages.c
@@ -23,7 +23,7 @@ static size_t os_page;
#ifndef _WIN32
# define PAGES_PROT_COMMIT (PROT_READ | PROT_WRITE)
-# define PAGES_PROT_DECOMMIT (PROT_NONE)
+# define PAGES_PROT_DECOMMIT (PROT_READ | PROT_WRITE)
static int mmap_flags;
#endif
static bool os_overcommits;
And for 5.0.1:
diff --git a/3rdParty/jemalloc/v5.0.1/src/pages.c b/3rdParty/jemalloc/v5.0.1/src/pages.c
index fec64dd01d..733652adf3 100644
--- a/3rdParty/jemalloc/v5.0.1/src/pages.c
+++ b/3rdParty/jemalloc/v5.0.1/src/pages.c
@@ -20,7 +20,7 @@ static size_t os_page;
#ifndef _WIN32
# define PAGES_PROT_COMMIT (PROT_READ | PROT_WRITE)
-# define PAGES_PROT_DECOMMIT (PROT_NONE)
+# define PAGES_PROT_DECOMMIT (PROT_READ | PROT_WRITE)
static int mmap_flags;
#endif
static bool os_overcommits;
These "fixes" prevent the endless growth of mappings.
@interwq Hello, is there any chance of getting this fixed for milestone 5.2.0?
@egaudry : there doesn't seem to be a straightforward fix I can think of right now. As David mentioned, the current behavior under no overcommit isn't particularly optimized, since the environment we work with usually has overcommit enabled. I wasn't able to change the overcommit setting on my dev box somehow. Can you help by trying one more thing: running with malloc_conf …?
Do you mean …? I'm not sure about the retain option: as @jsteemann observed, any option that merely reduces the number of mappings would not avoid the issue in the long term (i.e. with numerous allocations and/or a long-living process).
I meant turning off retain is worth trying, since it could affect the number of mappings even in the long term. Plus, I believe the option was designed more with overcommit in mind; it may affect the number of mappings negatively without overcommit.
Thanks, I will have a look (I thought this was a compile-time parameter).
Do I need to rebuild jemalloc?
I should have stated that I was using (and need to stick with) version 4.5.0 here.
I tried …
Jan, you are right: I relaunched my test with the current master branch and I was able to use retain:false at runtime. Unfortunately, it didn't solve (or didn't sufficiently reduce) the large map count issue I observed with one of our test cases.
@egaudry : that is also what I had observed before.
My main concern is that our users are reluctant to make such a change, or are not in a position to request it (e.g. on cluster/centralized computing resources shared by different software), and I cannot reasonably expect them to switch to a more permissive mode.
@egaudry : yes, I understand this.
@interwq Qi, I hope our feedback can help. Please let us know if there is another test we can perform.
@jsteemann @egaudry thanks for all your feedback and help testing the cases! We did discuss this in our last team meeting; however, no straightforward fix came to mind. One thing that can for sure alleviate the issue is using a larger page size, e.g. building jemalloc with … My best suggestion for right now is to combine the following options:
For the long term, it's unclear if we will be focusing on reducing the number of mappings without overcommit. On one hand, it's probably fair to consider this a limitation of the Linux kernel (requiring a max mapping limit / suffering big performance degradation as mappings grow); IIRC FreeBSD doesn't have such issues. On the other hand, we already spent effort to work around this (i.e. the retain feature, but obviously only with overcommit). Let us know if the config above solves it for you, or how far it goes.
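The specific flag and option list recommended above are elided; as an illustration of the larger-page-size approach, jemalloc's configure script accepts --with-lg-page (the value 16, i.e. 64 KiB pages, is just an example, not the team's recommendation):

```shell
# Build jemalloc with a larger (64 KiB) page size, which reduces how
# finely mappings can be split. --with-lg-page takes log2 of the page
# size; 16 here is illustrative.
./configure --with-lg-page=16
make
make install
```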
@interwq Qi, this configuration indeed offers a solution (at least on a specific test case I'm using). I do understand that having vm.overcommit_memory=2 nowadays might not be really relevant, and as such I won't argue further for a fix; I will instead rely on this configuration when needed. The downside of this solution is that for an external user (i.e. one not aware of the jemalloc behavior) looking at VmRSS and VmSize, it will be difficult to understand when memory gets back to the system (because it relies on the muzzy/dirty decay-time behavior of jemalloc 5, but that is out of scope here). Thank you all for your feedback and the solutions offered.
@egaudry : glad it worked for you. Re: the time-based decay, we did observe efficiency wins on the vast majority of our binaries -- given that memory reuse is usually very frequent, in theory we should only start purging memory after the workload is finished or reduced, and time-based decay does that a lot better than the previous ratio-based decay, which assumes a fixed ratio. However, I also understand that memory not returned to the OS immediately may cause some confusion, especially in micro-benchmarks (we got quite a few questions on that front). We had some discussion regarding combining time-based decay with ratio-based; however, the exact approach is still a bit unclear. Please feel free to share your use cases / thoughts, or ask for features there.
Is there an open issue tracking the problem that jemalloc in its default configuration, on a system with overcommit disabled (with vm.overcommit_memory=2), will exhaust the default mapping limit under normal usage patterns? I'm having trouble finding the decay algorithm discussions that @interwq mentioned as being the right place to pursue this further. I also don't understand jemalloc well enough to grasp how even the best decay algorithm would prevent eventually hitting the mapping limit. Eventually unused pages will still decay, and mappings will split, right? If they split more often than they recombine, eventually the limit will be hit. We're shipping jemalloc in the release binaries of our scientific computing project, with the result that on some high-performance computing systems where memory overcommit has been disabled by cluster administrators, our software crashes because it stops being able to allocate memory. Is the recommended solution to add …?
(Sorry, accidentally posted and then deleted a half-done comment.) Reopening, since I don't think we yet have a good general solution to this class of issues, even though the original question seems solved; there's more left to do here. I think it may be the case that recent changes (opt.oversize_threshold) have helped some. Even better would be to turn off retain for oversized allocations, even if it's on for smaller ones (which can't be done as a tuning change, I think; it needs a little bit of extra jemalloc code written). I don't know that there's something that would make us consider this problem "solved". Fundamentally, saying so with confidence would need production performance testing across a range of applications, and I'm not sure that the core dev team has the ability to do that sort of "testing in anger", given the sorts of production systems we touch day-to-day. (E.g. none of us work on HPC scientific computing applications, and so can't form and test guesses on what sorts of configurations work well there.) We'd definitely be receptive to PRs updating configuration settings / tweaking allocation strategies in those cases.
@adamnovak I'm still puzzled by the fact that people tend to believe that disabling overcommit and/or limiting max_map_count is the way to go on computing nodes. I know for a fact that it is pretty difficult to get sysadmins to change those settings (mainly because they have been using them for more than a decade), but limiting virtual memory does not make much sense in the HPC world... If you consider that, for instance, CUDA will allocate virtual memory equal to the physical memory detected on the host when starting, the problem becomes broader too.
@egaudry I can't necessarily explain why someone would want to disable overcommit either. The best I can come up with is that they want jobs to fail fast at allocation time, rather than after wasting a bunch of cluster time filling in the pages that did happen to fit in memory. My project's immediate users are the scientists who sometimes get handed clusters with overcommit off, not the people who decided to adopt that setting, so I need to provide at least passable, if not particularly performant, behavior in that environment. As for limiting max_map_count, the default limit on my workstation is 65530, without me having done anything to reduce it. So I don't think that people are choosing to limit it so much as that they haven't thought to increase it.
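For reference, the limit being discussed is the standard Linux vm.max_map_count sysctl; it can be inspected and raised where administrators allow it (the raised value below is illustrative, not a recommendation from this thread):

```shell
# Show the current per-process mapping limit (default 65530 on most
# distributions).
sysctl vm.max_map_count

# Raise it (requires root; value is illustrative).
sudo sysctl -w vm.max_map_count=1048576
```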
As mentioned, currently there is no plan to focus on the no-overcommit case. @adamnovak: the page size + decay tuning should alleviate the issue. For the decay setting, you can also run the binary with the env var …
…tself > WARNING Memory overcommit must be enabled! Without it, a background save or replication may fail under low memory condition. > Being disabled, it can also cause failures without low memory condition, see jemalloc/jemalloc#1328. > To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
In an application that uses jemalloc statically linked, I am seeing an ever-increasing number of memory mappings in the process, growing steadily towards the vm.max_map_count limit. The overcommit_memory setting value is 2, so no overcommitting. It seems that jemalloc reads the overcommit setting at startup, and later takes this setting's value into account when "returning" memory.

When overcommit_memory is set to 2, it seems to call mmap on the returned range with a protection of PROT_NONE. It seems that this punches holes into existing mappings, so that the kernel will split them and create more of them. This would not be a problem if it happened only rarely, but we have several use cases in which it happens so often that even increasing the value of vm.max_map_count to tens of millions does not help much.

I have created some (contrived) standalone test program which shows the behavior. I hope it is somewhat deterministic so others can reproduce it:
The test program can be compiled and run as follows:
The program allocates memory of pseudo-random sizes and returns some of the memory. It does so with a few parallel threads. Each thread will not exceed a certain size of allocated memory, so it should not leak.
Each thread is writing out some values to std::cout. The only interesting figure to look at is the "mappings" value reported, e.g. … That "mappings" value is calculated as the number of lines in /proc/self/maps, which is not 100% accurate but should be a good-enough approximation.

The problem is that when overcommit_memory is set to 2, the number of mappings will grow crazily, both with jemalloc 5.0.1 and jemalloc 5.1.0.

A "fix" for the problem is to apply the following patch:
This makes the test program run with a very low number of memory mappings. It is obviously not a good fix, because it will leave the memory around with read & write access allowed. So please consider it just a demo.
I think it would be good to make jemalloc more usable with an overcommit_memory setting value of 2. Right now, it is kind of risky to use it, because applications may too quickly hit the default vm.max_map_count value of 65K. And even increasing that setting does not help much, because the number of mappings can grow greatly over time, which means long-running server processes can hit the threshold easily, even if it has been increased.

I guess the current implementation is as it is for a reason, so I guess you will be pretty reluctant to change it. However, it would be good to suggest how to avoid that behavior on systems that don't use overcommit and where vm settings cannot be adjusted. Can an option be added to jemalloc to adjust the behavior on commit in this case, when explicitly configured as such? I think this would help plenty of users, as I have seen several issues in this repository that may have the same root cause. The last one I checked was #1324.
Thanks!
(btw., although I think it does not make any difference: the above was tried on Linux kernel 4.15, both on bare metal and on an Azure cloud instance; the compilers in use were g++-7.3.0 and g++-5.4.0)