-
Notifications
You must be signed in to change notification settings - Fork 407
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow MemoryPool::allocate() to be called from multiple threads per warp. #487
Labels
Enhancement
Improve existing capability; will potentially require voting
Feature Request
Create new capability; will potentially require voting
Projects
Milestone
Comments
gmackey
added
Enhancement
Improve existing capability; will potentially require voting
Feature Request
Create new capability; will potentially require voting
labels
Oct 20, 2016
hcedwar
added a commit
to hcedwar/kokkos
that referenced
this issue
Apr 18, 2017
race conditions by condensing state representation to a single integer and simplifying algorithm. Addresses issues kokkos#320 , kokkos#487 , kokkos#452 Using power-of-two Kokkos::Impl::concurrent_bitset size to streamline implementation and align with MemoryPool needs. Unit testing over a range of superblocks the following sequence: 1) allocate N of varying size 2) deallocate N/3 of these 3) reallocation deallocated 4) concurrently deallocate and allocate N/3 of these
hcedwar
added a commit
to hcedwar/kokkos
that referenced
this issue
Apr 19, 2017
race conditions by condensing state representation to a single integer and simplifying algorithm. Addresses issues kokkos#320 , kokkos#487 , kokkos#452 Using power-of-two Kokkos::Impl::concurrent_bitset size to streamline implementation and align with MemoryPool needs. Unit testing over a range of superblocks the following sequence: 1) allocate N of varying size 2) deallocate N/3 of these 3) reallocation deallocated 4) concurrently deallocate and allocate N/3 of these Add performance test for memory pool. Add performance enhancement note for multiple hints per block size.
hcedwar
added a commit
to hcedwar/kokkos
that referenced
this issue
Apr 19, 2017
race conditions by condensing state representation to a single integer and simplifying algorithm. Addresses issues kokkos#320 , kokkos#487 , kokkos#452 Using power-of-two Kokkos::Impl::concurrent_bitset size to streamline implementation and align with MemoryPool needs. Unit testing over a range of superblocks the following sequence: 1) allocate N of varying size 2) deallocate N/3 of these 3) reallocation deallocated 4) concurrently deallocate and allocate N/3 of these Add performance test for memory pool. Add performance enhancement note for multiple hints per block size.
hcedwar
added a commit
to hcedwar/kokkos
that referenced
this issue
Apr 19, 2017
race conditions by condensing state representation to a single integer and simplifying algorithm. Addresses issues kokkos#320 , kokkos#487 , kokkos#452 Using power-of-two Kokkos::Impl::concurrent_bitset size to streamline implementation and align with MemoryPool needs. Unit testing over a range of superblocks the following sequence: 1) allocate N of varying size 2) deallocate N/3 of these 3) reallocation deallocated 4) concurrently deallocate and allocate N/3 of these Add performance test for memory pool. Add performance enhancement note for multiple hints per block size.
hcedwar
added a commit
to hcedwar/kokkos
that referenced
this issue
Apr 19, 2017
race conditions by condensing state representation to a single integer and simplifying algorithm. Addresses issues kokkos#320 , kokkos#487 , kokkos#452 Creating power-of-two Kokkos::Impl::concurrent_bitset size to streamline implementation and align with MemoryPool needs. Unit testing over a range of superblocks the following sequence: 1) allocate N of varying size 2) deallocate N/3 of these 3) reallocation deallocated 4) concurrently deallocate and allocate N/3 of these Add performance test for memory pool. Add performance enhancement note for multiple hints per block size.
hcedwar
added a commit
to hcedwar/kokkos
that referenced
this issue
Apr 23, 2017
enable stealing of empty superblocks among block sizes. Expand block size superblock hint array to "N" values per block size to provide space for TBD superblock search optimizations. Construct memory pool with min block, max block, and superblock size and introduce performance optimizations related to max vs. min block size. Issues kokkos#487, kokkos#320, kokkos#738, kokkos#215
Redesigned memory pool is tested with every thread in a warp allocating/deallocating. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
Enhancement
Improve existing capability; will potentially require voting
Feature Request
Create new capability; will potentially require voting
Currently MemoryPool::allocate() can only be called by one thread per warp. The memory pool was designed like this because it satisfied the use cases Kokkos had for the memory pool (task-DAG parallelism and dynamically growing View) and this limitation made it easier to code.
There are two issues that would need to be addressed to remove this limitation.
The first issue is that the current method for swapping out the active superblock is not thread-safe between threads in the same warp due to the lock required to protect the swapping. Deadlock will happen if multiple threads from the same warp try to swap out the active superblock.
The second issue is that threads from the same warp would get the same starting location to search for an empty slot in the active superblock. We currently use the value of a clock register, and since the threads in a warp execute in lockstep, they would get the same value. This would cause all kinds of memory contention on the Boolean atomics used to reserve an empty slot. The way the starting location is determined would need to be improved.
There are two options to remove this limitation.
The first option is to coalesce the memory requests from all the threads in a warp to a single request, do a single allocation from the pool for the warp, and divide up the single allocation between the threads to fulfill their requests. Since the memory pool can't assume that all the threads will call deallocate at the same time, it would have to keep a count of the number of smaller blocks allocated in the large block. Calls to deallocate would decrement the count, and the large block would be freed when the count reaches 0. This would avoid both issues.
The second option is to deal with each issue independently. The method for swapping out the active superblock could be refactored such that only a single thread per warp (for each block size) tries to swap out the active superblock. Halloc does this. We would still need a new scheme for choosing a different starting location for each thread. A simple option would be to increment the starting location for each successive thread in a warp requesting an allocation, but I'm not sure if this would perform well once the superblock starts to fill up.
The text was updated successfully, but these errors were encountered: