vm.max_map_count growing steadily when vm.overcommit_memory is 2 #1328
Thank you for this great bug report and diagnosis. I don't think the current behavior was really chosen per se; it's just an emergent property of a combination of things we don't test very well. I think this needs some amount of philosophizing. (I think it might have to wait for @interwq before anyone can take a thorough look; we're stretched a little thin for now.) Ironically, one thing that might help is disabling purging settings (i.e. …).
@davidtgoldblatt : thanks for looking into it and for the suggestion of disabling purging. With jemalloc 5.1.0 this indeed makes the number of mappings grow much more slowly, but the overall number still increases over time. It is in the hundreds rather than in the tens of thousands now.
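The exact option string suggested above is elided; as an illustration, in jemalloc 5.x purging can be disabled at runtime through the MALLOC_CONF environment variable by setting the decay times to -1 (these option names come from the jemalloc manual; the binary below is a placeholder):

```shell
# Disable dirty/muzzy page purging entirely (jemalloc 5.x runtime
# options; -1 means "never decay"). Illustrative invocation -- replace
# ./your_app with a binary that links jemalloc.
MALLOC_CONF="dirty_decay_ms:-1,muzzy_decay_ms:-1" ./your_app
```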
Just a side note regarding the kernel, if needed: this behavior could also be observed with kernels 3.10.x and 4.4.x.
For reference, here is a "fix" for 5.1.0:
diff --git a/3rdParty/jemalloc/v5.1.0/src/pages.c b/3rdParty/jemalloc/v5.1.0/src/pages.c
index 26002692d6..3fbad076ad 100644
--- a/3rdParty/jemalloc/v5.1.0/src/pages.c
+++ b/3rdParty/jemalloc/v5.1.0/src/pages.c
@@ -23,7 +23,7 @@ static size_t os_page;
#ifndef _WIN32
# define PAGES_PROT_COMMIT (PROT_READ | PROT_WRITE)
-# define PAGES_PROT_DECOMMIT (PROT_NONE)
+# define PAGES_PROT_DECOMMIT (PROT_READ | PROT_WRITE)
static int mmap_flags;
#endif
static bool os_overcommits;
And for 5.0.1:
diff --git a/3rdParty/jemalloc/v5.0.1/src/pages.c b/3rdParty/jemalloc/v5.0.1/src/pages.c
index fec64dd01d..733652adf3 100644
--- a/3rdParty/jemalloc/v5.0.1/src/pages.c
+++ b/3rdParty/jemalloc/v5.0.1/src/pages.c
@@ -20,7 +20,7 @@ static size_t os_page;
#ifndef _WIN32
# define PAGES_PROT_COMMIT (PROT_READ | PROT_WRITE)
-# define PAGES_PROT_DECOMMIT (PROT_NONE)
+# define PAGES_PROT_DECOMMIT (PROT_READ | PROT_WRITE)
static int mmap_flags;
#endif
static bool os_overcommits;
These "fixes" prevent the endless growth of mappings.
@interwq Hello, is there any chance of getting this fixed for milestone 5.2.0?
@egaudry : there doesn't seem to be a straightforward fix I can think of right now. As David mentioned, the current behavior under no overcommit isn't particularly optimized, since the environment we work with usually has overcommit enabled. I wasn't able to change the overcommit setting on my dev box somehow. Can you help by trying one more thing: running with malloc_conf …?
Do you mean …? I'm not sure about the retain option: as @jsteemann observed, any option that merely reduces the number of mappings would not avoid the issue in the long term (i.e. with numerous allocations and/or a long-living process).
I meant turning off retain is worth trying, since it could affect the number of mappings even in the long term. Plus, I believe the option was designed more with overcommit in mind; it may affect the number of mappings negatively without overcommit.
Thanks, I will have a look (I thought this was a compile-time parameter).
Do I need to rebuild jemalloc?
I should have stated that I was using (and need to stick with) version 4.5.0 here.
I tried …
Jan, you are right: I relaunched my test with the current master branch and I was able to use retain:false at runtime. Unfortunately, it didn't solve (or didn't sufficiently reduce) the large map count issue I observed with one of our test cases.
@egaudry : that is also what I had observed before.
My main concern is that our users are reluctant to make such a change, or are not in a position to request it (e.g. on cluster/centralized computing resources shared by different software), and I cannot reasonably expect them to switch to a more permissive mode.
@egaudry : yes, I understand this.
@interwq Qi, I hope our feedback can help. Please let us know if there is another test we can perform.
@jsteemann @egaudry thanks for all your feedback and help testing the cases! We did discuss this in our last team meeting; however, no straightforward fix came to mind. One thing that can for sure alleviate the issue is using a larger page size, e.g. building jemalloc with … My best suggestion for right now is to combine the following options:
For the long term, it's unclear if we will be focusing on reducing the number of mappings without overcommit. On one hand, it's probably fair to consider this a limitation of the Linux kernel (requiring a max mapping limit / suffering big performance degradation as mappings grow); IIRC FreeBSD doesn't have such issues. On the other hand, we already spent effort to work around this (i.e. the retain feature, but obviously only with overcommit). Let us know if the config above solves it for you, or how far it goes.
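The specific flag and option list recommended above are elided; as an illustration of the larger-page-size approach, jemalloc's configure script accepts --with-lg-page (the value 16, i.e. 64 KiB pages, is just an example, not the team's recommendation):

```shell
# Build jemalloc with a larger (64 KiB) page size, which reduces how
# finely mappings can be split. --with-lg-page takes log2 of the page
# size; 16 here is illustrative.
./configure --with-lg-page=16
make
make install
```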
@interwq Qi, this configuration indeed offers a solution (at least on a specific test case I'm using). I do understand that having vm.overcommit_memory=2 nowadays might not be really relevant, and as such I won't argue further for a fix; I will instead rely on this configuration when needed. The downside of this solution is that for an external user (i.e. one not aware of the jemalloc behavior) looking at VmRSS and VmSize, it will be difficult to understand when memory gets back to the system (because it relies on the muzzy/dirty decay-time behavior of jemalloc 5, but that is out of scope here). Thank you all for your feedback and the solutions offered.
@egaudry : glad it worked for you. Re: the time-based decay, we did observe efficiency wins on the vast majority of our binaries -- given that memory reuse is usually very frequent, in theory we should only start purging memory after the workload is finished or reduced, and time-based decay does that a lot better than the previous ratio-based decay, which assumes a fixed ratio. However, I also understand that memory not returned to the OS immediately may cause some confusion, especially in micro-benchmarks (we got quite a few questions on that front). We had some discussion regarding combining time-based decay with ratio-based; however, the exact approach is still a bit unclear. Please feel free to share your use cases / thoughts, or ask for features there.
Is there an open issue tracking the problem that jemalloc in its default configuration, on a system with overcommit disabled (with vm.overcommit_memory=2), will exhaust the default mapping limit under normal usage patterns? I'm having trouble finding the decay algorithm discussions that @interwq mentioned as being the right place to pursue this further. I also don't understand jemalloc well enough to grasp how even the best decay algorithm would prevent eventually hitting the mapping limit. Eventually unused pages will still decay, and mappings will split, right? If they split more often than they recombine, eventually the limit will be hit. We're shipping jemalloc in the release binaries of our scientific computing project, with the result that on some high-performance computing systems where memory overcommit has been disabled by cluster administrators, our software crashes because it stops being able to allocate memory. Is the recommended solution to add …?
(Sorry, accidentally posted and then deleted a half-done comment.) Reopening, since I don't think we yet have a good general solution to this class of issues, even though the original question seems solved; there's more left to do here. I think it may be the case that recent changes (opt.oversize_threshold) have helped some. Even better would be to turn off retain for oversized allocations, even if it's on for smaller ones (which can't be done as a tuning change, I think; it needs a little bit of extra jemalloc code written). I don't know that there's something that would make us consider this problem "solved". Fundamentally, saying so with confidence would need production performance testing across a range of applications, and I'm not sure that the core dev team has the ability to do that sort of "testing in anger", given the sorts of production systems we touch day-to-day. (E.g. none of us work on HPC scientific computing applications, and so can't form and test guesses on what sorts of configurations work well there.) We'd definitely be receptive to PRs updating configuration settings / tweaking allocation strategies in those cases.
@adamnovak I'm still puzzled by the fact that people tend to believe that disabling overcommit and/or limiting max_map_count is the way to go on computing nodes. I know for a fact that it is pretty difficult to get sysadmins to change those settings (mainly because they have been using them for more than a decade), but limiting virtual memory does not make much sense in the HPC world... If you consider that, for instance, CUDA will allocate virtual memory equal to the physical memory detected on the host when starting, the problem becomes broader too.
@egaudry I can't necessarily explain why someone would want to disable overcommit either. The best I can come up with is that they want jobs to fail fast at allocation time, rather than after wasting a bunch of cluster time filling in the pages that did happen to fit in memory. My project's immediate users are the scientists who sometimes get handed clusters with overcommit off, not the people who decided to adopt that setting, so I need to provide at least passable, if not particularly performant, behavior in that environment. As for limiting max_map_count, the default limit on my workstation is 65530, without me having done anything to reduce it. So I don't think that people are choosing to limit it so much as that they haven't thought to increase it.
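For reference, the limit being discussed is the standard Linux vm.max_map_count sysctl; it can be inspected and raised where administrators allow it (the raised value below is illustrative, not a recommendation from this thread):

```shell
# Show the current per-process mapping limit (default 65530 on most
# distributions).
sysctl vm.max_map_count

# Raise it (requires root; value is illustrative).
sudo sysctl -w vm.max_map_count=1048576
```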
As mentioned, currently there is no plan to focus on the no-overcommit case. @adamnovak: the page size + decay tuning should alleviate the issue. For the decay setting, you can also run the binary with the env var …
…tself > WARNING Memory overcommit must be enabled! Without it, a background save or replication may fail under low memory condition. > Being disabled, it can also cause failures without low memory condition, see jemalloc/jemalloc#1328. > To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
In an application that uses jemalloc statically linked, I am seeing an ever-increasing number of memory mappings in the process, growing steadily towards the vm.max_map_count limit. The overcommit_memory setting value is 2, so no overcommitting. It seems that jemalloc reads the overcommit setting at startup, and later takes this setting's value into account when "returning" memory.

When overcommit_memory is set to 2, it seems to call mmap on the returned range with a protection of PROT_NONE. It seems that this punches holes into existing mappings, so that the kernel will split them and create more of them. This would not be a problem if it happened only rarely, but we have several use cases in which it happens so often that even increasing the value of vm.max_map_count to tens of millions does not help much.

I have created some (contrived) standalone test program which shows the behavior. I hope it is somewhat deterministic so others can reproduce it:
The test program can be compiled and run as follows:
The program allocates memory of pseudo-random sizes and returns some of the memory. It does so with a few parallel threads. Each thread will not exceed a certain size of allocated memory, so it should not leak.
Each thread is writing out some values to std::cout. The only interesting figure to look at is the "mappings" value reported, e.g. … That "mappings" value is calculated as the number of lines in /proc/self/maps, which is not 100% accurate but should be a good-enough approximation.

The problem is that when overcommit_memory is set to 2, the number of mappings will grow crazily, both with jemalloc 5.0.1 and jemalloc 5.1.0.

A "fix" for the problem is to apply the following patch:
This makes the test program run with a very low number of memory mappings. It is obviously not a good fix, because it will leave the memory around with read & write access allowed. So please consider it just a demo.
I think it would be good to make jemalloc more usable with an overcommit_memory setting value of 2. Right now, it is kind of risky to use it, because applications may too quickly hit the default vm.max_map_count value of 65K. And even increasing that setting does not help much, because the number of mappings can grow greatly over time, which means long-running server processes can hit the threshold easily, even if it has been increased.

I guess the current implementation is as it is for a reason, so I guess you will be pretty reluctant to change it. However, it would be good to suggest how to avoid that behavior on systems that don't use overcommit and where vm settings cannot be adjusted. Can an option be added to jemalloc to adjust the behavior on commit in this case, when explicitly configured as such? I think this would help plenty of users, as I have seen several issues in this repository that may have the same root cause. The last one I checked was #1324.
Thanks!
(btw., although I think it does not make any difference: the above was tried on Linux kernel 4.15, both on bare metal and on an Azure cloud instance; the compilers in use were g++-7.3.0 and g++-5.4.0)