
Benchmark Results


Reproducing Benchmark Results

This document refers to our research paper "DynaSOAr: A Parallel Memory Allocator for Object-oriented Programming on GPUs with Efficient Memory Access".

On this page, we explain how to reproduce a subset of the results in the benchmark section of our paper. We ran all experiments on an NVIDIA Titan Xp GPU (12 GB memory). Different GPUs may have different performance characteristics, but we expect similar results on other GPUs (e.g., which allocator is the fastest for a certain benchmark).

An NVIDIA GPU with a minimum compute capability of 5.0 is required to run the benchmarks. Some benchmarks require a lot of memory, so we recommend running them on a GPU with at least 8 GB of memory; otherwise, you may have to change the problem size. The default settings of our benchmark scripts assume a GPU with at least 12 GB of memory. You may have to decrease the heap size of custom allocators on smaller GPUs (default heap size: 8 GB, but the parallel do-all implementation also requires some memory), e.g., for 4 GB: -s 4294967296.
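For example, to build the n-body benchmark with a 4 GiB heap for one of the custom allocators, the -s value from above can be expressed as shell arithmetic. This is just a sketch combining the -a and -s flags that are used later on this page:

$ build_scripts/build_nbody.sh -a mallocmc -s $((4 * 1024 * 1024 * 1024))   # same as -s 4294967296
$ bin/nbody_dynasoar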

Overall Benchmark Running Time

If no problem size/heap size is specified, the build scripts compile our benchmark applications with the default sizes that were used for the benchmarks in our paper.

$ build_scripts/build_nbody.sh
$ bin/nbody_dynasoar
62208209, 327237         # 62 seconds

$ bin/nbody_baseline_aos
184639539                # 184 seconds

$ bin/nbody_baseline_soa
87614381                 # 87 seconds

$ build_scripts/build_nbody.sh -a mallocmc
$ bin/nbody_dynasoar
162730113, 187784        # 162 seconds

$ build_scripts/build_nbody.sh -a halloc
$ bin/nbody_dynasoar
181976756, 185857        # 181 seconds

$ build_scripts/build_nbody.sh -a bitmap
$ bin/nbody_dynasoar
149374736, 503076        # 149 seconds

$ build_scripts/build_nbody.sh -a cuda
$ bin/nbody_dynasoar
194476443, 190327        # 194 seconds

The following image shows the overall running time of all benchmarks (Figure 10 in the paper). The running times of nbody above are very close to the numbers reported in the paper; e.g., the red bar (DynaSOAr) is at around 62 seconds. We provide benchmark scripts for every application, so readers should be able to reproduce the remaining results in the same way.

Benchmark Running Time

Space Efficiency

Our numbers for space efficiency (Figure 11) take a bit more time to reproduce. To determine how much memory an allocator needs, we run the same problem size with smaller and smaller heap sizes until the program crashes. This process could be automated with shell scripts, but such scripts would run for multiple hours; we obtained the results of Figure 11 manually, so we do not provide an automated shell script at the moment.

Here is an example for wa-tor. First, we determine the minimum heap size when using DynaSOAr. We compile the program in debug mode and try heap sizes 17*64*64*64 and 16*64*64*64. (We could find these heap sizes with a binary search.) We can see that the benchmark finishes successfully with the first heap size but crashes with the second one. This means that the minimum heap size is somewhere in-between.

$ build_scripts/build_wa_tor.sh -d -m "17*64*64*64"
$ bin/wator_dynasoar_no_cell
(lots of output, but runs through without crash)

$ build_scripts/build_wa_tor.sh -d -m "16*64*64*64"
$ bin/wator_dynasoar_no_cell
(...)
./bitmap/bitmap.h:171: SizeT Bitmap<SizeT, N, ContainerT, ScanType>::deallocate_seed(int) [with SizeT = int, N = 65536, ContainerT = unsigned long long, ScanType = 1]: block: [1769,0,0], thread: [253,0,0] Assertion `success` failed.
GPUassert: device-side assert triggered ./allocator/soa_executor.h 135
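The binary search mentioned above could be scripted roughly as follows. This is only a minimal sketch: it assumes that a crashed run (device-side assert or out of memory) makes the binary exit with a non-zero status and that the bracket 1..32 contains the minimum.

# Bisect the minimum DynaSOAr heap capacity for wa-tor (in multiples of 64*64*64 objects).
# Assumption: a crashed run exits with a non-zero status.
lo=1    # assumed too small (run crashes)
hi=32   # assumed large enough (run succeeds)
while (( hi - lo > 1 ))
do
  mid=$(( (lo + hi) / 2 ))
  build_scripts/build_wa_tor.sh -d -m "${mid}*64*64*64"
  if bin/wator_dynasoar_no_cell
  then
    hi=${mid}   # run succeeded: the minimum capacity is at most mid
  else
    lo=${mid}   # run crashed: the minimum capacity is larger than mid
  fi
done
echo "Minimum heap capacity: ${hi}*64*64*64 objects"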

We assume from now on that the minimum heap size for wa-tor is 17*64*64*64. (It is in fact a bit lower, which penalizes DynaSOAr slightly, but that is acceptable.) From the debug output below we can see that DynaSOAr's total memory usage is about 261 MiB (273887391 bytes) with this configuration.

┌───────────────────────────────────────────────────────────────────────┐
│ Smallest block type:                                            4Fish │
│ Max. #objects:            4456448                                     │
│ Block size:                  3904 bytes                               │
│ #Blocks:                    69632                                     │
│ #Bitmap levels:                 3                                     │
│ Data buffer size:     000259.250000 MB                                │
│ Allocator overead:    000001.949371 MB + block overhead               │
│ Total memory usage:   000261.199371 MB                                │
└───────────────────────────────────────────────────────────────────────┘
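The "Total memory usage" line is the number we pass (in bytes) to the -s flag in the next step. A quick way to convert it, assuming the reported MB are 1024-based (MiB):

$ awk 'BEGIN { printf "%d\n", 261.199371 * 1024 * 1024 }'
273887391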

Now let's build the benchmark with mallocMC and give the allocator 273887391 bytes. Figure 11 shows that the maximum problem size with mallocMC is somewhere between 20% and 40% of the maximum problem size with DynaSOAr (given the same amount of heap memory). So let's try to run the program with a 40% problem size by modifying the size in the y direction. Note that this benchmark has two size parameters (-x and -y). The default y value is 2560, so let's try with 0.4 * 2560 = 1024. As the output below shows, the allocator runs out of memory and the program crashes.

$ build_scripts/build_wa_tor.sh -d -m "17*64*64*64" -s 273887391 -y 1024 -a mallocmc
$ bin/wator_dynasoar_no_cell
(...)
example/configuration/mallocmc/allocator_config.h:36: T *AllocatorState<AllocatorT>::allocate_new() [with T = Fish, AllocatorT = SoaAllocatorAdapter<AllocatorState, 4456448U, Agent, Fish, Shark>]: block: [1588,0,0], thread: [157,0,0] Assertion `ptr != nullptr` failed.
GPUassert: device-side assert triggered example/configuration/mallocmc/../allocator_interface_adapter.h 270

Now let's try with a 20% problem size, i.e., y = 512. This time, the program runs through successfully.

$ build_scripts/build_wa_tor.sh -d -m "17*64*64*64" -s 273887391 -y 512 -a mallocmc
$ bin/wator_dynasoar_no_cell
(...)
13004013, 10276546

We can see from this experiment that the maximum problem size of wa-tor with mallocMC is somewhere between 20% and 40% of the maximum problem size of wa-tor with DynaSOAr. This is consistent with the ~35% reported in Figure 11 in the paper.

Space Efficiency

Heap Utilization

We now measure how much heap memory allocators can utilize under simplified conditions: Only objects of the same size (64 bytes) are allocated. We can measure this with the Linux Scalability benchmark (Figure 14a).

This benchmark spawns 16384 GPU threads, each of which allocates the same number of objects. The number of allocations per thread can be changed with the -x command line parameter.

Assuming a heap size of 1 GiB, the heap can fit 1024 * 1024 * 1024 / 64 = 16777216 objects, i.e., 16777216 / 16384 = 1024 objects per thread.

First, we run the benchmark with DynaSOAr. We have to set the number of objects to a value that is a bit smaller than 16777216 = 64 * 64 * 64 * 64, because some memory is used by other internal data structures. So let's run the benchmark with an object capacity of 62 * 64 * 64 * 64. This equates to 62 * 64 * 64 * 64 / 16384 = 992 objects per thread.
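The same arithmetic as shell one-liners, for readers who want to try other heap sizes or capacities (16384 is the number of GPU threads spawned by the benchmark):

$ echo $(( 1024 * 1024 * 1024 / 64 ))     # objects that fit into a 1 GiB heap
16777216
$ echo $(( 62 * 64 * 64 * 64 / 16384 ))   # objects per thread at a capacity of 62*64*64*64
992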

$ build_scripts/build_linux_scalability.sh -d -m "62*64*64*64" -x 992
$ bin/linux_scalability_dynasoar
Current Device Number: 0
  Device name: TITAN Xp
  Memory Clock Rate (KHz): 5705000
  Memory Bus Width (bits): 384
  Peak Memory Bandwidth (GB/s): 547.680000
  Total global memory: 12771.721216 MB
  Available (free) global memory: 12189.761536 MB

┌───────────────────────────────────────────────────────────────────────┐
│ Smallest block type:                                     10DummyClass │
│ Max. #objects:           16252928                                     │
│ Block size:                  4160 bytes                               │
│ #Blocks:                   253952                                     │
│ #Bitmap levels:                 3                                     │
│ Data buffer size:     001007.500000 MB                                │
│ Allocator overead:    000003.044731 MB + block overhead               │
│ Total memory usage:   001010.544739 MB                                │
└───────────────────────────────────────────────────────────────────────┘
(...)
526202, 0

We can see from this benchmark that DynaSOAr can utilize 100% of all object slots (under such simplified conditions), i.e., all 992 object slots per thread. This is consistent with Figure 14a: the red line stops at x = 992.

Now let us run the same benchmark with mallocMC. We set the heap size to 1 GiB (1073741824 bytes) and try two problem sizes: 500 and 550 allocations per thread.

$ build_scripts/build_linux_scalability.sh -d -s 1073741824 -x 500 -a mallocmc
$ bin/linux_scalability_dynasoar
(...)
884339, 0

$ build_scripts/build_linux_scalability.sh -d -s 1073741824 -x 550 -a mallocmc
$ bin/linux_scalability_dynasoar
(...)
example/configuration/mallocmc/allocator_config.h:36: T *AllocatorState<AllocatorT>::allocate_new() [with T = DummyClass, AllocatorT = SoaAllocatorAdapter<AllocatorState, 16515072U, DummyClass>]: block: [22,0,0], thread: [223,0,0] Assertion `ptr != nullptr` failed.
GPUassert: device-side assert triggered example/linux-scalability/dynasoar/linux_scalability.cu 53

The first problem size runs through while the second one leads to an out-of-memory error. This shows that mallocMC can allocate between 500 and 550 objects per thread, given the same heap size (actually, a few megabytes more) as DynaSOAr. This is consistent with Figure 14a: The blue line stops at x = 512.
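To narrow the limit down further, one could sweep the -x value between the two problem sizes, e.g. as follows. This is only a sketch; it assumes that an out-of-memory crash makes the binary exit with a non-zero status.

for (( x = 500; x <= 550; x += 10 ))
do
  build_scripts/build_linux_scalability.sh -d -s 1073741824 -x ${x} -a mallocmc
  bin/linux_scalability_dynasoar || break   # stop at the first crashing problem size
done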

Linux Scalability Running Time

Note that for comparing benchmark running times, the programs should not be built in debug mode (omit the -d parameter).

Now let us examine how well DynaSOAr scales with an increasing number of allocations per thread, i.e., we would like to recreate the red line in Figure 14a.

for (( i = 1; i <= 992; i += 32 ))  # Figure 14a: i++
do
  build_scripts/build_linux_scalability.sh -m "62*64*64*64" -x ${i}
  bin/linux_scalability_dynasoar
done

If we plot the results (the first value of each run), e.g., using Excel or LibreOffice, we should see a graph similar to the one in Figure 14a.
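To avoid copying the numbers by hand, the first value of every run can also be written to a CSV file directly, e.g. with the following sketch. It assumes that the timing line is the last line of the benchmark output (as in the runs shown above); the output file name is just an example.

for (( i = 1; i <= 992; i += 32 ))
do
  build_scripts/build_linux_scalability.sh -m "62*64*64*64" -x ${i}
  # Append "<allocations per thread>,<running time>" to linux_scalability.csv.
  echo "${i},$(bin/linux_scalability_dynasoar | tail -n 1 | cut -d',' -f1)" >> linux_scalability.csv
done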

Fragmentation

In the paper, fragmentation is defined as the ratio of allocated but unused object slots among all object slots. We report the fragmentation rate per iteration for wa-tor in Figure 12d. The data for this graph can be reproduced if wa-tor is built with the fragmentation output flag (-o).

Fragmentation

$ build_scripts/build_wa_tor.sh -o
$ bin/wator_dynasoar_no_cell

The program generates one line per iteration with four values:

  1. Number of allocated object slots (Fish)
  2. Number of allocated object slots (Shark)
  3. Number of used object slots (Fish)
  4. Number of used object slots (Shark)

This should look somewhat similar to:

1310912,1309020,1310845,1308939
1587072,1606860,828800,1580927
1581248,1606860,544736,1121226
1577344,1604340,390391,1075630
1573376,1604340,281454,1118328
1563648,1604340,213230,1211307
1529408,1604340,179994,1042255
1448064,1602480,141384,939040
1356608,1602480,122275,828875
1237696,1602480,104602,794406
1144192,1602480,99640,546411
1059648,1598220,99911,525722
966784,1596840,96894,490541
900928,1596600,97634,430984
...

This data is in CSV format and can be imported into Excel or LibreOffice to produce a line chart.
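With the definition from above, the per-iteration fragmentation rate can also be computed directly from this output, e.g. with awk. This is a sketch; it assumes the output was redirected to a (hypothetical) file frag.csv and skips any line that does not have exactly four fields (such as a final timing line).

$ bin/wator_dynasoar_no_cell > frag.csv
$ awk -F, 'NF == 4 { printf "%f\n", 1.0 - ($3 + $4) / ($1 + $2) }' frag.csv

Here, ($3 + $4) is the total number of used object slots and ($1 + $2) the total number of allocated object slots, so the expression gives the fraction of allocated but unused object slots per iteration.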