This repository has been archived by the owner on May 23, 2024. It is now read-only.

memory usage significantly increased with memkind #708

Open
adrianjhpc opened this issue Sep 21, 2021 · 10 comments
@adrianjhpc

When using memkind but targeting DRAM, the amount of memory used by my C++ application more than doubles (judging by the usage information in /proc/$PID/status): I use ~2GB per process without memkind but over 4GB per process when memkind is linked. Is this to do with the allocator being used, and is it tunable? It significantly interferes with my ability to differentiate between memory and NVRAM usage when testing NVRAM allocation.

@bratpiorka
Collaborator

Memkind uses jemalloc 5.2.1 as its heap allocator. It is highly configurable - see http://jemalloc.net/jemalloc.3.html#tuning. Memkind uses the jemk_mallctl call to tune some jemalloc options.

To get information about allocations you can use the memkind_stats_print() function.
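
For a quick look at allocator-level statistics you can also go through the jemk_-prefixed mallctl that memkind bundles. A minimal sketch is below - it assumes jemalloc's standard "stats.*" mallctl names and a statistics-enabled build, so treat it as an illustration rather than part of the memkind API:

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* jemalloc's mallctl, exported under the jemk_ prefix by the bundled jemalloc */
extern int jemk_mallctl(const char *name, void *oldp, size_t *oldlenp,
                        void *newp, size_t newlen);

static void print_allocator_stats(void)
{
    uint64_t epoch = 1;
    size_t sz = sizeof(epoch);

    /* refresh jemalloc's cached statistics before reading them */
    jemk_mallctl("epoch", &epoch, &sz, &epoch, sz);

    size_t allocated = 0, resident = 0;
    sz = sizeof(size_t);
    jemk_mallctl("stats.allocated", &allocated, &sz, NULL, 0);
    jemk_mallctl("stats.resident", &resident, &sz, NULL, 0);
    printf("allocated: %zu bytes, resident: %zu bytes\n", allocated, resident);
}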

As for your problem - what kind of memory are you using? What is your allocation pattern? In a recent commit we added an optimization to the FS_DAX type that helps in scenarios where you have many small allocations - see c3fb4e4

Also, to debug your problem, you can try using jemalloc directly instead of memkind. Just compile your application without memkind and use LD_PRELOAD to substitute jemalloc for the standard malloc: LD_PRELOAD=memkind/jemalloc/lib/libjemalloc.so /path/to/app

@bratpiorka bratpiorka self-assigned this Sep 22, 2021
@bratpiorka
Collaborator

@adrianjhpc Hi Adrian, did you manage to find anything?

@adrianjhpc
Author

Hi @bratpiorka, thanks for the replies. Running with jemalloc and investigating my code let me identify some usage issues I had.

@adrianjhpc
Author

OK, I've now got a standalone benchmark that highlights the issue I'm seeing here, so I'm re-opening this - I hope that's OK.

@adrianjhpc adrianjhpc reopened this Oct 5, 2021
@adrianjhpc
Author

adrianjhpc commented Oct 5, 2021

The benchmark is this:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <limits.h>
#include <float.h>
#include <string.h>
#include <sys/sysinfo.h>
#if defined(__aarch64__)
#include <sys/syscall.h>
#endif
#include <memkind.h>


unsigned long get_processor_and_core(int *chip, int *core);

int main(int argc, char **argv){

  double *data;
  int N = 0;
  int memkind = 0;
  int err;
  size_t pmem_size;
  char filename[1000];
  struct memkind *pmem_space = NULL;
  int socket, core;

  MPI_Init(&argc, &argv);

  if(argc == 3){
    N = atoi(argv[1]);
    memkind = atoi(argv[2]);
  }

  if(N <= 0){
    N = 100000;
  }

  if(memkind != 1 && memkind != 0 ){
    memkind = 0;
  }

  printf("Using memkind %d (0 for false, 1 for true)\n", memkind);

  if(memkind == 0){

    data = malloc(sizeof(double)*N);

    for(int i=0; i<N; i++){
      data[i] = (float)i+(2.0/1.0/(i+1.0));
    }

    printf("No memkind %.10lf\n", data[N-1]);

    free(data);

  }

  if(memkind == 1){

    get_processor_and_core(&socket, &core);

    pmem_size = (long long)sizeof(double)*N;

    strcpy(filename,"/mnt/pmem_fsdax");
    sprintf(filename+strlen(filename), "%d", socket);
    err = memkind_create_pmem(filename, 0, &pmem_space);
    if (err) {
      fprintf(stderr, "Unable to create pmem partition %d\n", err);
      MPI_Abort(MPI_COMM_WORLD, err);
    }
    data = (double *)memkind_malloc(pmem_space, pmem_size);

    for(int i=0; i<N; i++){
      data[i] = (float)i+(2.0/1.0/(i+1.0));
    }

    printf("Memkind %.10lf\n", data[N-1]);

    memkind_free(pmem_space, data);
    memkind_destroy_kind(pmem_space);
  }

  MPI_Finalize();

  return 0;

}


#if defined(__aarch64__)
// TODO: This might be general enough to provide the functionality for any system
// regardless of processor type given we aren't worried about thread/process migration.
// Test on Intel systems and see if we can get rid of the architecture specificity
// of the code.
unsigned long get_processor_and_core(int *chip, int *core){
  return syscall(SYS_getcpu, core, chip, NULL);
}
// TODO: Add in AMD function
#else
// If we're not on an ARM processor assume we're on an intel processor and use the
// rdtscp instruction.
unsigned long get_processor_and_core(int *chip, int *core){
  unsigned long a,d,c;
  __asm__ volatile("rdtscp" : "=a" (a), "=d" (d), "=c" (c));
  *chip = (c & 0xFFF000)>>12;
  *core = c & 0xFFF;
  return ((unsigned long)a) | (((unsigned long)d) << 32);
}
#endif
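
It builds with something along these lines (the compiler wrapper, optimisation flags and library paths are placeholders for whatever your MPI and memkind installs use):

mpicc -O2 -o test_malloc_size test_malloc_size.c -lmemkind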

@adrianjhpc
Author

adrianjhpc commented Oct 5, 2021

The benchmark is run like this:

mpirun -n 2 ./test_malloc_size 1000 0 to run without memkind, i.e. just use normal malloc and DRAM

or like this:

mpirun -n 2 ./test_malloc_size 1000 1 to run with memkind on Optane DIMMs.

I have a small library that collects VmHWM from /proc/%d/status when the processes are running and reports it at the end of the run.
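
(The library itself isn't included here; at its core it just parses the VmHWM line from the status file. A minimal sketch of that part, not the actual library:)

#include <stdio.h>
#include <string.h>

/* Minimal sketch: return VmHWM (peak resident set size) in kB for the
 * current process, or -1 on error. */
static long read_vmhwm_kb(void)
{
  FILE *f = fopen("/proc/self/status", "r");
  char line[256];
  long kb = -1;

  if (f == NULL)
    return -1;

  while (fgets(line, sizeof(line), f)) {
    if (strncmp(line, "VmHWM:", 6) == 0) {
      sscanf(line + 6, "%ld", &kb);
      break;
    }
  }

  fclose(f);
  return kb;
}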

Running the smallest array possible I get these results:

[adrianj@nextgenio-cn01 memkind]$ mpirun -n 2 ./test_malloc_size 1 0
Using memkind 0 (0 for false, 1 for true)
No memkind 2.0000000000
Using memkind 0 (0 for false, 1 for true)
No memkind 2.0000000000
process max 410MB min 405MB
node max 816MB min 816MB avg 816MB
[adrianj@nextgenio-cn01 memkind]$ mpirun -n 2 ./test_malloc_size 1 1
Using memkind 1 (0 for false, 1 for true)
Using memkind 1 (0 for false, 1 for true)
Memkind 2.0000000000
Memkind 2.0000000000
process max 942MB min 935MB
node max 1878MB min 1878MB avg 1878MB

The above output shows that the memkind-enabled run uses more than twice the volatile memory of the non-memkind run.

I've also tried using a standalone version of jemalloc with the non-memkind run, i.e.:

[adrianj@nextgenio-cn01 memkind]$ LD_PRELOAD="/home/nx01/nx01/adrianj/jemalloc/5.2.1/lib/libjemalloc.so" mpirun -n 2 ./test_malloc_size 1 0
Using memkind 0 (0 for false, 1 for true)
No memkind 2.0000000000
Using memkind 0 (0 for false, 1 for true)
No memkind 2.0000000000
process max 429MB min 422MB
node max 851MB min 851MB avg 851MB

But as you can see above, it doesn't change the underlying memory requirements.

If I scale up the array size, I can see memkind clearly is working, i.e.:

[adrianj@nextgenio-cn01 memkind]$ mpirun -n 2 ./test_malloc_size 1000000000 0
Using memkind 0 (0 for false, 1 for true)
Using memkind 0 (0 for false, 1 for true)
No memkind 1000000000.0000000000
No memkind 1000000000.0000000000
process max 8039MB min 8036MB
node max 16076MB min 16076MB avg 16076MB
[adrianj@nextgenio-cn01 memkind]$ mpirun -n 2 ./test_malloc_size 1000000000 1
Using memkind 1 (0 for false, 1 for true)
Using memkind 1 (0 for false, 1 for true)
Memkind 1000000000.0000000000
Memkind 1000000000.0000000000
process max 942MB min 941MB
node max 1884MB min 1884MB avg 1884MB

But is there any way to get the base memory usage down with memkind? It makes it hard to track the actual memory consumption over time for benchmarking when using memkind to offload data from DRAM to NVRAM.

@adrianjhpc
Author

adrianjhpc commented Oct 5, 2021

By the way, I'm trying to build against master to see if that changes the memory usage.

But the build fails like this:

[ ! -e jemalloc/configure ] && (cd jemalloc && autoconf) || exit 0
[ ! -e jemalloc/lib/libjemalloc_pic.a ] && (cd jemalloc && ./configure --enable-autogen  --without-export --with-version=5.2.1-0-gea6b3e973b477b8061e0076bb257dbd7f3faa756 --disable-fill  --disable-initial-exec-tls --with-jemalloc-prefix=jemk_ --with-malloc-conf="narenas:256,lg_tcache_max:12"  && make) || exit 0
  CC       src/heap_manager.lo
src/heap_manager.c(32): error: identifier "jemk_malloc_usable_size" is undefined
      .heap_manager_malloc_usable_size = jemk_malloc_usable_size,
                                         ^

Let me know if you want me to open a separate issue for this, or if I'm just being stupid.

@bratpiorka
Collaborator

Give me some time to look at this.
For building problems - please open a separate issue.

@kilobyte
Contributor

kilobyte commented Nov 2, 2021

I see that there's indeed a base memory cost of ~0.5GB per process when using any kind other than DRAM, but it doesn't grow when adding huge allocations. Thus, a process that allocates a few KB wastes that 0.5GB, while a process that allocates hundreds of GB also only needs that extra 0.5GB.

What's your use case? Do you have plenty of small tasks, or a single big one? If the latter, fixing this issue might not be so urgent.

@adrianjhpc
Copy link
Author

I'm doing a range of benchmarking at the moment, mainly on parallel programs, so there are as many processes as there are cores on a node. However, if I know the overhead I can factor it out of the memory calculations, so I can work around the issue. In reality it's only a big issue for things that don't use much memory, but as you can imagine, on a 48-core system 24GB is a considerable base overhead.
