CUDA Dynamic Memory Allocator for SOA Data Layout
Clone or download
Latest commit cdcf6a9 Jan 16, 2019
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
allocator Update. Jan 16, 2019
bitmap More debugging. Jan 16, 2019
example More debugging. Jan 16, 2019
lib Implement CUDA allocator interface. Dec 12, 2018
microbench Cleanup. Sep 20, 2018
util More debugging. Jan 16, 2019
.gitignore Add benchmark script. Jan 2, 2019
.gitmodules Add Halloc. Jan 4, 2019
CMakeLists.txt Update. Jan 5, 2019
README.md Update readme. Dec 10, 2018
fast_build.sh Update. Dec 12, 2018
run_benchmarks.py Update. Jan 5, 2019

README.md

SoaAlloc: A CUDA Framework for Single-Method Multiple-Objects Applications

SMMO (Single-Method Multiple-Objects) is a wide-spread pattern of parallel, object-oriented, high-performance code. It is OOP-speech for SIMD (Single-Instruction Multiple-Data) and means that a method should be executed for multiple or all objects of a type. As an example, an nbody simulation consists of n body objects, for each of which a move method for computing the next position of a body should be executed. SoaAlloc is a CUDA framework (C++ template library) that facilitates the development of such programs. The three main features of SoaAlloc are:

  • SOA Data Layout: Objects are stored in the SIMD-friendly Structure of Arrays data layout.
  • Dynamic Memory Management on Device: New objects can be created at any time in the CUDA kernel and existing objects can be destructed (malloc/free).
  • Parallel Enumeration: SoaAlloc provides an efficient way to run a member function (method) for all objects of a type in parallel.

Documentation/Papers

Prerequisites

Tested with CUDA Toolkit 9.1 on a Nvidia Titan Xp machine (Ubuntu 16.04.1). A device with a minimum compute capability of 6.x is required. CMake version 3.2 or higher is required for building the examples.

# Build types: Debug, Release
cmake -DCMAKE_BUILD_TYPE=Debug .
make

# Examples are located in example directory.
bin/nbody_soa

API Overview

All classes/structs that should be managed by SoaAlloc must inherit from SoaBase<AllocatorT>, where AllocatorT is the fully configured typed of the allocator. The first template argument to SoaAllocator is the maximum number of objects that can exist within the allocator at any given time; this number determines the memory usage of the allocator. The following arguments are all classes/structs that are managed by SoaAlloc.

SoaAlloc has a host side API (AllocatorHandle<AllocatorT>) and a device side API (AllocatorT). The following functionality is provided with those APIs.

  • AllocatorHandle::AllocatorHandle(): The constructor allocated all necessary memory on GPU.
  • AllocatorHandle::device_pointer(): Returns a pointer to the device allocator handle (AllocatorT*).
  • AllocatorHandle::parallel_do<C, &C::foo>(): Runs the member function C::foo() in parallel for all objects of type C that were created with the allocator. Internally, this will launch a CUDA kernel. This function returns when the CUDA kernel has finished processing all objects.
  • AllocatorT::make_new<C>(/*args*/): Creates a new object of type C, where C must be managed by the allocator. Returns a pointer to the new objects. This is similar to C++ new.
  • AllocatorT:free<C>(/*args*/): Deletes an existing object of type C that was created with the allocator. This is similar to C++ delete.
  • AllocatorT::device_do<C>(&C::foo /*, args*/): Runs C::foo(/*args*/) for all objects of type C that were created with the allocator. Note that this does not spawn a new CUDA kernel; execution is sequential.

API Example

This example does not compute anything meaningful and is only meant to show the API. Take a look at the code in the example directory for more interesting examples.

#include "allocator/soa_allocator.h"
#include "allocator/soa_base.h"
#include "allocator/allocator_handle.h"

// Pre-declare all classes.
class Foo;
class Bar;

// Declare allocator type. First argument is max. number of objects that can be created.
using AllocatorT = SoaAllocator<64*64*64*64, Foo, Bar>;

// Allocator handles.
__device__ AllocatorT* device_allocator;
AllocatorHandle<AllocatorT>* allocator_handle;

class Foo : public SoaBase<AllocatorT> {
 public:
  // Pre-declare types of all fields.
  using FieldTypes = std::tuple<float, int, char>;
  
  // Declare fields.
  SoaField<Foo, 0> field1_;  // float
  SoaField<Foo, 1> field2_;  // int
  SoaField<Foo, 2> field3_;  // char
  
  __device__ Foo(float f1, int f2, char f3) : field1_(f1), field2_(f2), field3_(f3) {}
 
  __device__ void qux() {
    field1_ = field2_ + field3_;
  }

  __device__ void baz() {
    // Run in Bar::foo(42) sequentially for all objects of type Bar. Note that
    // Foo::baz may run in parallel though.
    device_allocator->template device_do<Bar>(&Bar::foo, 42);
  }
};

class Bar : public SoaBase<AllocatorT> { /* ... */ };

__global__ void create_objects() {
  device_allocator->make_new<Foo>(1.0f, threadIdx.x, 2);
  // Delete objects with: device_allocator->free<Foo>(ptr)
}

int main(int argc, char** argv) {
  // Optional, for debugging.
  AllocatorT::DBG_print_stats();
  
  // Create new allocator.
  allocator_handle = new AllocatorHandle<AllocatorT>();
  AllocatorT* dev_ptr = allocator_handle->device_pointer();
  cudaMemcpyToSymbol(device_allocator, &dev_ptr, sizeof(AllocatorT*), 0,
                     cudaMemcpyHostToDevice);

  // Create 2048 objects.
  create_objects<<<32, 64>>>();
  cudaDeviceSynchronize();

  // Call Foo::qux on all 2048 objects.
  allocator_handle->parallel_do<Foo, &Foo::qux>();
}