Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tuning excessive memory retention #351

Open
hp48gx opened this issue Jan 25, 2021 · 3 comments
Open

Tuning excessive memory retention #351

hp48gx opened this issue Jan 25, 2021 · 3 comments

Comments

@hp48gx
Copy link

hp48gx commented Jan 25, 2021

Hi,

just want to share some information about an experiment we did, that looks a bit suspicious.

We have a binary that starts doing nothing, periodically it picks a random thread from a pool of 20, this thread iterates through a directory full of fairly large xml files (kept fixed, say, 10 files, 50 MB each); each file is parsed using pugixml (a DOM parser, that basically copies the entire file in memory in a big string), and then all the data is dropped (for the purpose of this experiment, we just return true/false if the file was valid xml).
In particular, everything happens within a single thread: the parser is created, triggered and destroyed every time; nothing is passed to another thread. Also, there are no leaks, as we also run it under valgrind (without mimalloc).

What we noticed is that whenever a new thread in the pool picks up the task, memory grows significantly (say 150MB), and never decreases; then, when the same thread runs again, memory consumption is roughly flat.
These numbers would be explained if mimalloc does not return the memory back to the OS.

Here's what we tried so far:

  1. mi_collect true/false has no effect (we run it in each thread, just before stats_merge)
  2. starting with MIMALLOC_PAGE_RESET=1 has no effect on RSS, but it's visible in the stats:
heap stats:     peak      total      freed       unit      count  
  reserved:   512.0 mb   512.0 mb    12.0 kb       1 b              not all freed!
 committed:   690.0 mb   690.0 mb    12.0 kb       1 b              not all freed!
     reset:    89.2 mb   351.0 mb   461.6 mb       1 b              ok

   process: user: 32.399 s, system: 13.636 s, faults: 5, reclaims: 179973, rss: 737.7 mb
heap stats:     peak      total      freed       unit      count  
  reserved:   512.0 mb   512.0 mb     8.0 kb       1 b              not all freed!
 committed:   687.5 mb   687.5 mb     8.0 kb       1 b              not all freed!
     reset:       0 b        0 b        0 b        1 b              ok

   process: user: 27.284 s, system: 10.617 s, faults: 3, reclaims: 177145, rss: 726.8 mb
  1. we tried using a different heap for the xml parser (pugixml supports passing two pointers to "malloc" and "free"). that was by far the worse solution. the heap grows as much as 4GB during parsing, and when we destroy it, about 25% is kept by the process.

  2. we tried destroying the threads after parsing. basically no effect, memory usage is slightly higher during parsing, and eventually no memory is released to the OS.

Is there anything we should tune?
thanks in advance for the hints

@daanx
Copy link
Collaborator

daanx commented Jan 26, 2021

Thank you for the report; (if you can create a binary that I could run to repro locally that would be great, but I can see that usually not so easy).

  • Is this on Windows ? (this is important for commit/reset statistics vs. Linux). Is this the latest mimalloc master branch?

Generally, mimalloc is not eager to share memory between threads as that generally is not good for performance so the behavior is not entirely unexpected (moreover, mimalloc will reset memory so the physical memory can be reused by other threads anyways). Nevertheless, it seems as if the memory stays in rss without ever being "reset"; here are some things to try:

  • On some OS's, even when the memory is "reset" it will still show in the rss as it is reset lazily by the OS (but it is actually available to other processes). You could set both MIMALLOC_PAGE_RESET=1 as well as MIMALLOC_RESET_DECOMMITS=1 to force more aggressive reset that is reflected in the rss for sure. (Also set MIMALLOC_ABANDONED_PAGE_RESET=1 -- I don't think it helps but just to make sure.).

  • At the end of the parsing etc. call mi_collect(false); mi_collect(true) (twice) -- this should really purge everything. You seem to imply it does not do this though, which makes me suspect that some memory is not freed ... (.. but you checked with valgrind ... hmm, I wonder if perhaps the library uses a free call that is not intercepted by mimalloc? )

  • If the previous things do not change things, it would be great if you could try the dev-slice branch -- this is currently the candidate for mimalloc 1.8 and does decommit on a finer grained basis. Try it first with stock settings, and then with MIMALLOC_RESET_DELAY=1 or even MIMALLOC_RESET_DELAY=0 (these settings are too aggressive for general use but it would let me understand what is happening with your workload).

Hope there is something we can do here.

@hp48gx
Copy link
Author

hp48gx commented Jan 26, 2021

Thanks for your quick reply. I have a question, first, and some more data below.

Is there a way to create a custom heap, use it, and on destruction force returning all its memory to the OS?
We didn't check the code of mi_heap_destroy, but it sounds achievable.
This would actually be super useful, because in the real binary, we have different thread pools doing different things, and generally we are very satisfied by default mimalloc settings (so, in particular, making mimalloc too aggressive may not be a good solution). It's just this thread pool that is doing weird things, because it has a different allocation pattern (i.e. periodical large spikes of allocation, which are quickly freed).

if you can create a binary that I could run to repro locally that would be great, but I can see that usually not so easy

I'll try to make one. Possibly it's enough to copy files in memory, without even parsing them.

  • Is this on Windows ? (this is important for commit/reset statistics vs. Linux).

no, but we see an identical behaviour on macosx, linux and freebsd.

Is this the latest mimalloc master branch?

no, it's release 1.6.3, I believe.
we do occasionally check code changes, but we didn't remember anything that seems relevant (this is not a proof, of course...).
do you think using master would help?

mimalloc will reset memory so the physical memory can be reused by other threads anyways

sorry for the dumb question... so reset means "available to any thread"?

Nevertheless, it seems as if the memory stays in rss without ever being "reset"

without being ever returned to the OS.
we also checked that the memory in RSS of this process cannot be reclaimed by the OS: we created another binary that allocates a linked list of N strings of length M, so it basically consumes a given fixed amount of memory, and when we exhaust the RAM, this process is killed, but the OS does not touch the other (swapfile off, no need to say).

could set both MIMALLOC_PAGE_RESET=1 as well as MIMALLOC_RESET_DECOMMITS=1

Thanks, we will try.

  • At the end of the parsing etc. call mi_collect(false); mi_collect(true) (twice) -- this should really purge everything.
    You seem to imply it does not do this though, which makes me suspect that some memory is not freed ...
    (.. but you checked with valgrind ... hmm, I wonder if perhaps the library uses a free call that is not intercepted by mimalloc? )

we tried calling either mi_collect(true) or mi_collect(false); none had any effect (as visible from top and ps aux).
we never tried calling them both.

I'm totally confident that pugixml is doing the right thing, because it was my prime suspect, and we intensively debugged it.
this thing, as I may have mentioned, takes two pointers to functions "malloc" and "free" and it uses them absolutely everywhere.
By default, of course, they are initialised with the real malloc and free.
We also tried registering a custom "malloc" that adds all pointers in a global std::set, a custom "free" that removes pointers from the set, and we have an assert that ensures that the set is empty every time it should be.

  • If the previous things do not change things, it would be great if you could try the dev-slice branch -- this is currently the candidate for mimalloc 1.8 and does decommit on a finer grained basis. Try it first with stock settings, and then with MIMALLOC_RESET_DELAY=1 or even MIMALLOC_RESET_DELAY=0 (these settings are too aggressive for general use but it would let me understand what is happening with your workload).

thanks, I'll try and let you know.

@hp48gx
Copy link
Author

hp48gx commented Jan 26, 2021

Ok, here's a minimal example, which possibly reproduces this beheviour.
It's pretty stupid code: a thread just copies in memory all the files in a directory and prints something (just to touch the memory).

#include <iostream>
#include <vector>
#include <string>
#include <fstream>
#include <thread>
#include <functional>
#include <unistd.h>
#include <mimalloc.h>
#include <cstdlib>

// you may need to change this line
#define PRINT_MEM     system( ("top -b -n 1 -p " + std::to_string(getpid())).c_str())

#define KEYPRESS  std::cout << "\n Press a key."; std::cin >> c

int main(int argc, char* argv[])
{
    char c;
    PRINT_MEM;
    KEYPRESS;

  std::thread t([&] {
    std::vector<std::string> v;

    for (int i = 1; i < argc; ++i)
    {
      auto file = argv[i];
      std::ifstream inf(file);
      inf.seekg(0, std::ios::end);
      auto len = static_cast<size_t>(inf.tellg());
      inf.seekg(0);

      std::string s(len, '0');
      inf.read(s.data(), len);
      std::cout << argv[i] << ':' << std::hash<std::string> {}(s) << std::endl;
      v.push_back(std::move(s));
    }

    PRINT_MEM;
  });

  t.join();

  t = std::thread(); // redundant, just wipe out the thread completely...

  std::cout << "end of thread.";
  KEYPRESS;
  PRINT_MEM;
  KEYPRESS;

  mi_collect(false);
  mi_collect(true);
  PRINT_MEM;
  return 0;
}

We compiled with mimalloc 1.6.7 in linux and put ~ 580MB in ./tmp.
Output: (redacted to emphasise the relevant parts)

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 6770 mycroft   20   0  347196  21808  18616 S   0.0   0.5   0:00.03 test_alloc // before starting
 6770 mycroft   20   0  879684 591764  18616 S   0.0  14.7   0:00.90 test_alloc // vector<string> fully allocated
 6770 mycroft   20   0  879680 592032  18676 S   0.0  14.7   0:00.92 test_alloc // thread died, nothing allocated
 6770 mycroft   20   0  355392 181996  18804 S   0.0   4.5   0:00.97 test_alloc // after mi_collect

So, RSS (RES) is 580MB. After the thread is gone, it's still 580MB.
Only after mi_collect, RSS shrinks, but it only goes down partially (with mimalloc 1.6.3, mi_collect had no effect).
All the other options instead seem to have no impact (MIMALLOC_PAGE_RESET=1 MIMALLOC_RESET_DECOMMITS=1 MIMALLOC_RESET_DELAY=1)
Also, mi_collect(false) does nothing (which is somewhat surprising).

In freebsd the behaviour is almost identical, except that mi_collect brings RSS down to something like 220MB.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants