This repository has been archived by the owner on May 23, 2024. It is now read-only.

memory usage significantly increased with memkind #708

Open
adrianjhpc opened this issue Sep 21, 2021 · 10 comments
@adrianjhpc

When using memkind but targeting DRAM, the amount of memory used by my C++ application more than doubles (judging by the usage information in /proc/$PID/status): I use ~2GB per process without memkind but over 4GB per process when memkind is linked. Is this to do with the allocator being used, and is it tunable? It significantly interferes with my ability to differentiate between memory and NVRAM usage when testing NVRAM allocation.

@bratpiorka
Collaborator

Memkind uses jemalloc 5.2.1 as its heap allocator. It is highly configurable - see http://jemalloc.net/jemalloc.3.html#tuning. Memkind uses the jemk_mallctl call to tune some jemalloc options.

To get information about allocations you can use the memkind_stats_print() function.
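
For a quick look at allocator-level statistics you can also go through the jemk_-prefixed mallctl that memkind bundles. A minimal sketch is below - it assumes jemalloc's standard "stats.*" mallctl names and a statistics-enabled build, so treat it as an illustration rather than part of the memkind API:

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* jemalloc's mallctl, exported under the jemk_ prefix by the bundled jemalloc */
extern int jemk_mallctl(const char *name, void *oldp, size_t *oldlenp,
                        void *newp, size_t newlen);

static void print_allocator_stats(void)
{
    uint64_t epoch = 1;
    size_t sz = sizeof(epoch);

    /* refresh jemalloc's cached statistics before reading them */
    jemk_mallctl("epoch", &epoch, &sz, &epoch, sz);

    size_t allocated = 0, resident = 0;
    sz = sizeof(size_t);
    jemk_mallctl("stats.allocated", &allocated, &sz, NULL, 0);
    jemk_mallctl("stats.resident", &resident, &sz, NULL, 0);
    printf("allocated: %zu bytes, resident: %zu bytes\n", allocated, resident);
}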

As for your problem - what kind of memory are you using? What is your allocation pattern? In a recent commit we added an optimization to the FS_DAX type that helps in scenarios where you have many small allocations - see c3fb4e4

Also, to debug your problem, you can try using jemalloc directly instead of memkind. Just compile your application without memkind and use LD_PRELOAD to substitute jemalloc for the standard malloc: LD_PRELOAD=memkind/jemalloc/lib/libjemalloc.so /path/to/app

@bratpiorka bratpiorka self-assigned this Sep 22, 2021
@bratpiorka
Collaborator

@adrianjhpc Hi Adrian, did you manage to find anything?

@adrianjhpc
Author

Hi @bratpiorka, thanks for the replies. Running with jemalloc and investigating my code let me identify some usage issues I had.

@adrianjhpc
Author

OK, I've now got a standalone benchmark that highlights the issue I'm seeing here, so I'm re-opening this - I hope that's OK.

@adrianjhpc adrianjhpc reopened this Oct 5, 2021
@adrianjhpc
Author

adrianjhpc commented Oct 5, 2021

The benchmark is this:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <limits.h>
#include <float.h>
#include <string.h>
#include <sys/sysinfo.h>
#if defined(__aarch64__)
#include <sys/syscall.h>
#endif
#include <memkind.h>


unsigned long get_processor_and_core(int *chip, int *core);

int main(int argc, char **argv){

  double *data;
  int N = 0;
  int memkind = 0;
  int err;
  size_t pmem_size;
  char filename[1000];
  struct memkind *pmem_space = NULL;
  int socket, core;

  MPI_Init(&argc, &argv);

  if(argc == 3){
    N = atoi(argv[1]);
    memkind = atoi(argv[2]);
  }

  if(N <= 0){
    N = 100000;
  }

  if(memkind != 1 && memkind != 0 ){
    memkind = 0;
  }

  printf("Using memkind %d (0 for false, 1 for true)\n", memkind);

  if(memkind == 0){

    data = malloc(sizeof(double)*N);

    for(int i=0; i<N; i++){
      data[i] = (float)i+(2.0/1.0/(i+1.0));
    }

    printf("No memkind %.10lf\n", data[N-1]);

    free(data);

  }

  if(memkind == 1){

    get_processor_and_core(&socket, &core);

    pmem_size = (long long)sizeof(double)*N;

    strcpy(filename,"/mnt/pmem_fsdax");
    sprintf(filename+strlen(filename), "%d", socket);
    err = memkind_create_pmem(filename, 0, &pmem_space);
    if (err) {
      fprintf(stderr, "Unable to create pmem partition %d\n", err);
      MPI_Abort(MPI_COMM_WORLD, err);
    }
    data = (double *)memkind_malloc(pmem_space, pmem_size);

    for(int i=0; i<N; i++){
      data[i] = (float)i+(2.0/1.0/(i+1.0));
    }

    printf("Memkind %.10lf\n", data[N-1]);

    memkind_free(pmem_space, data);
    memkind_destroy_kind(pmem_space);
  }

  MPI_Finalize();

  return 0;

}


#if defined(__aarch64__)
// TODO: This might be general enough to provide the functionality for any system
// regardless of processor type given we aren't worried about thread/process migration.
// Test on Intel systems and see if we can get rid of the architecture specificity
// of the code.
unsigned long get_processor_and_core(int *chip, int *core){
  return syscall(SYS_getcpu, core, chip, NULL);
}
// TODO: Add in AMD function
#else
// If we're not on an ARM processor assume we're on an intel processor and use the
// rdtscp instruction.
unsigned long get_processor_and_core(int *chip, int *core){
  unsigned long a,d,c;
  __asm__ volatile("rdtscp" : "=a" (a), "=d" (d), "=c" (c));
  *chip = (c & 0xFFF000)>>12;
  *core = c & 0xFFF;
  return ((unsigned long)a) | (((unsigned long)d) << 32);
}
#endif
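
It builds with something along these lines (the compiler wrapper, optimisation flags and library paths are placeholders for whatever your MPI and memkind installs use):

mpicc -O2 -o test_malloc_size test_malloc_size.c -lmemkind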

@adrianjhpc
Author

adrianjhpc commented Oct 5, 2021

The benchmark is run like this:

mpirun -n 2 ./test_malloc_size 1000 0 to run without memkind, i.e. just use normal malloc and DRAM

or like this:

mpirun -n 2 ./test_malloc_size 1000 1 to run with memkind on Optane DIMMs.

I have a small library that collects VmHWM from /proc/%d/status when the processes are running and reports it at the end of the run.
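
(The library itself isn't included here; at its core it just parses the VmHWM line from the status file. A minimal sketch of that part, not the actual library:)

#include <stdio.h>
#include <string.h>

/* Minimal sketch: return VmHWM (peak resident set size) in kB for the
 * current process, or -1 on error. */
static long read_vmhwm_kb(void)
{
  FILE *f = fopen("/proc/self/status", "r");
  char line[256];
  long kb = -1;

  if (f == NULL)
    return -1;

  while (fgets(line, sizeof(line), f)) {
    if (strncmp(line, "VmHWM:", 6) == 0) {
      sscanf(line + 6, "%ld", &kb);
      break;
    }
  }

  fclose(f);
  return kb;
}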

Running the smallest array possible I get these results:

[adrianj@nextgenio-cn01 memkind]$ mpirun -n 2 ./test_malloc_size 1 0
Using memkind 0 (0 for false, 1 for true)
No memkind 2.0000000000
Using memkind 0 (0 for false, 1 for true)
No memkind 2.0000000000
process max 410MB min 405MB
node max 816MB min 816MB avg 816MB
[adrianj@nextgenio-cn01 memkind]$ mpirun -n 2 ./test_malloc_size 1 1
Using memkind 1 (0 for false, 1 for true)
Using memkind 1 (0 for false, 1 for true)
Memkind 2.0000000000
Memkind 2.0000000000
process max 942MB min 935MB
node max 1878MB min 1878MB avg 1878MB

The above output shows that the memkind-enabled run uses more than twice the volatile memory of the non-memkind run.

I've also tried using a standalone version of jemalloc with the non-memkind run, i.e.:

[adrianj@nextgenio-cn01 memkind]$ LD_PRELOAD="/home/nx01/nx01/adrianj/jemalloc/5.2.1/lib/libjemalloc.so" mpirun -n 2 ./test_malloc_size 1 0
Using memkind 0 (0 for false, 1 for true)
No memkind 2.0000000000
Using memkind 0 (0 for false, 1 for true)
No memkind 2.0000000000
process max 429MB min 422MB
node max 851MB min 851MB avg 851MB

But as you can see above, it doesn't change the underlying memory requirements.

If I scale up the array size, I can see memkind clearly is working, i.e.:

[adrianj@nextgenio-cn01 memkind]$ mpirun -n 2 ./test_malloc_size 1000000000 0
Using memkind 0 (0 for false, 1 for true)
Using memkind 0 (0 for false, 1 for true)
No memkind 1000000000.0000000000
No memkind 1000000000.0000000000
process max 8039MB min 8036MB
node max 16076MB min 16076MB avg 16076MB
[adrianj@nextgenio-cn01 memkind]$ mpirun -n 2 ./test_malloc_size 1000000000 1
Using memkind 1 (0 for false, 1 for true)
Using memkind 1 (0 for false, 1 for true)
Memkind 1000000000.0000000000
Memkind 1000000000.0000000000
process max 942MB min 941MB
node max 1884MB min 1884MB avg 1884MB

But is there any way to get the base memory usage down with memkind? It makes it hard to track the actual memory consumption over time for benchmarking when using memkind to offload data from DRAM to NVRAM.

@adrianjhpc
Author

adrianjhpc commented Oct 5, 2021

By the way, I'm trying to build against master to see if that changes the memory usage.

But the build fails like this:

[ ! -e jemalloc/configure ] && (cd jemalloc && autoconf) || exit 0
[ ! -e jemalloc/lib/libjemalloc_pic.a ] && (cd jemalloc && ./configure --enable-autogen  --without-export --with-version=5.2.1-0-gea6b3e973b477b8061e0076bb257dbd7f3faa756 --disable-fill  --disable-initial-exec-tls --with-jemalloc-prefix=jemk_ --with-malloc-conf="narenas:256,lg_tcache_max:12"  && make) || exit 0
  CC       src/heap_manager.lo
src/heap_manager.c(32): error: identifier "jemk_malloc_usable_size" is undefined
      .heap_manager_malloc_usable_size = jemk_malloc_usable_size,
                                         ^

Let me know if you want me to open a separate issue for this, or if I'm just being stupid.

@bratpiorka
Collaborator

Give me some time to look at this.
For building problems - please open a separate issue.

@kilobyte
Contributor

kilobyte commented Nov 2, 2021

I see that there's indeed a base memory cost of ~0.5GB per process when using any kind other than DRAM, but it doesn't grow when adding huge allocations. Thus, a process that allocates a few KB wastes that 0.5GB, while a process that allocates hundreds of GB also only needs that extra 0.5GB.

What's your use case? Do you have plenty of small tasks, or a single big one? If the latter, fixing this issue might not be so urgent.

@adrianjhpc
Copy link
Author

I'm doing a range of benchmarking at the moment, mainly on parallel programs, so there are as many processes as there are cores on a node. However, if I know the overhead I can factor it out of the memory calculations, so I can work around the issue. In reality it's only a big issue for things that don't use much memory, but as you can imagine, on a 48-core system 24GB is a considerable base overhead.
