Allow MemoryPool::allocate() to be called from multiple threads per warp. #487

gmackey · 2016-10-20T18:28:56Z

Currently MemoryPool::allocate() can only be called by one thread per warp. The memory pool was designed like this because it satisfied the use cases Kokkos had for the memory pool (task-DAG parallelism and dynamically growing View) and this limitation made it easier to code.

There are two issues that would need to be addressed to remove this limitation.

The first issue is that the current method for swapping out the active superblock is not thread-safe between threads in the same warp due to the lock required to protect the swapping. Deadlock will happen if multiple threads from the same warp try to swap out the active superblock.

The second issue is that threads from the same warp would get the same starting location to search for an empty slot in the active superblock. We currently use the value of a clock register, and since the threads in a warp execute in lockstep, they would get the same value. This would cause all kinds of memory contention on the Boolean atomics used to reserve an empty slot. The way the starting location is determined would need to be improved.

There are two options to remove this limitation.

The first option is to coalesce the memory requests from all the threads in a warp to a single request, do a single allocation from the pool for the warp, and divide up the single allocation between the threads to fulfill their requests. Since the memory pool can't assume that all the threads will call deallocate at the same time, it would have to keep a count of the number of smaller blocks allocated in the large block. Calls to deallocate would decrement the count, and the large block would be freed when the count reaches 0. This would avoid both issues.

The second option is to deal with each issue independently. The method for swapping out the active superblock could be refactored such that only a single thread per warp (for each block size) tries to swap out the active superblock. Halloc does this. We would still need a new scheme for choosing a different starting location for each thread. A simple option would be to increment the starting location for each successive thread in a warp requesting an allocation, but I'm not sure if this would perform well once the superblock starts to fill up.

race conditions by condensing state representation to a single integer and simplifying algorithm. Addresses issues kokkos#320 , kokkos#487 , kokkos#452 Using power-of-two Kokkos::Impl::concurrent_bitset size to streamline implementation and align with MemoryPool needs. Unit testing over a range of superblocks the following sequence: 1) allocate N of varying size 2) deallocate N/3 of these 3) reallocation deallocated 4) concurrently deallocate and allocate N/3 of these

race conditions by condensing state representation to a single integer and simplifying algorithm. Addresses issues kokkos#320 , kokkos#487 , kokkos#452 Using power-of-two Kokkos::Impl::concurrent_bitset size to streamline implementation and align with MemoryPool needs. Unit testing over a range of superblocks the following sequence: 1) allocate N of varying size 2) deallocate N/3 of these 3) reallocation deallocated 4) concurrently deallocate and allocate N/3 of these Add performance test for memory pool. Add performance enhancement note for multiple hints per block size.

race conditions by condensing state representation to a single integer and simplifying algorithm. Addresses issues kokkos#320 , kokkos#487 , kokkos#452 Creating power-of-two Kokkos::Impl::concurrent_bitset size to streamline implementation and align with MemoryPool needs. Unit testing over a range of superblocks the following sequence: 1) allocate N of varying size 2) deallocate N/3 of these 3) reallocation deallocated 4) concurrently deallocate and allocate N/3 of these Add performance test for memory pool. Add performance enhancement note for multiple hints per block size.

enable stealing of empty superblocks among block sizes. Expand block size superblock hint array to "N" values per block size to provide space for TBD superblock search optimizations. Construct memory pool with min block, max block, and superblock size and introduce performance optimizations related to max vs. min block size. Issues kokkos#487, kokkos#320, kokkos#738, kokkos#215

hcedwar · 2017-04-28T21:59:11Z

Redesigned memory pool is tested with every thread in a warp allocating/deallocating.

gmackey added Enhancement Improve existing capability; will potentially require voting Feature Request Create new capability; will potentially require voting labels Oct 20, 2016

gmackey added this to the Backlog milestone Oct 20, 2016

gmackey self-assigned this Oct 20, 2016

hcedwar added this to Backlog in On-node Task DAG Apr 19, 2017

hcedwar moved this from Feature Backlog to In Progress in On-node Task DAG Apr 25, 2017

hcedwar added the InDevelop label Apr 28, 2017

hcedwar moved this from In Progress to In Develop in On-node Task DAG May 12, 2017

crtrott closed this as completed May 27, 2017

hcedwar moved this from In Develop to Done in On-node Task DAG Jun 1, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow MemoryPool::allocate() to be called from multiple threads per warp. #487

Allow MemoryPool::allocate() to be called from multiple threads per warp. #487

gmackey commented Oct 20, 2016

hcedwar commented Apr 28, 2017

Allow MemoryPool::allocate() to be called from multiple threads per warp. #487

Allow MemoryPool::allocate() to be called from multiple threads per warp. #487

Comments

gmackey commented Oct 20, 2016

hcedwar commented Apr 28, 2017