Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow MemoryPool::allocate() to be called from multiple threads per warp. #487

Closed
gmackey opened this issue Oct 20, 2016 · 1 comment
Closed
Assignees
Labels
Enhancement Improve existing capability; will potentially require voting Feature Request Create new capability; will potentially require voting
Milestone

Comments

@gmackey
Copy link
Contributor

gmackey commented Oct 20, 2016

Currently MemoryPool::allocate() can only be called by one thread per warp. The memory pool was designed like this because it satisfied the use cases Kokkos had for the memory pool (task-DAG parallelism and dynamically growing View) and this limitation made it easier to code.

There are two issues that would need to be addressed to remove this limitation.

The first issue is that the current method for swapping out the active superblock is not thread-safe between threads in the same warp due to the lock required to protect the swapping. Deadlock will happen if multiple threads from the same warp try to swap out the active superblock.

The second issue is that threads from the same warp would get the same starting location to search for an empty slot in the active superblock. We currently use the value of a clock register, and since the threads in a warp execute in lockstep, they would get the same value. This would cause all kinds of memory contention on the Boolean atomics used to reserve an empty slot. The way the starting location is determined would need to be improved.

There are two options to remove this limitation.

The first option is to coalesce the memory requests from all the threads in a warp to a single request, do a single allocation from the pool for the warp, and divide up the single allocation between the threads to fulfill their requests. Since the memory pool can't assume that all the threads will call deallocate at the same time, it would have to keep a count of the number of smaller blocks allocated in the large block. Calls to deallocate would decrement the count, and the large block would be freed when the count reaches 0. This would avoid both issues.

The second option is to deal with each issue independently. The method for swapping out the active superblock could be refactored such that only a single thread per warp (for each block size) tries to swap out the active superblock. Halloc does this. We would still need a new scheme for choosing a different starting location for each thread. A simple option would be to increment the starting location for each successive thread in a warp requesting an allocation, but I'm not sure if this would perform well once the superblock starts to fill up.

@gmackey gmackey added Enhancement Improve existing capability; will potentially require voting Feature Request Create new capability; will potentially require voting labels Oct 20, 2016
@gmackey gmackey added this to the Backlog milestone Oct 20, 2016
@gmackey gmackey self-assigned this Oct 20, 2016
hcedwar added a commit to hcedwar/kokkos that referenced this issue Apr 18, 2017
race conditions by condensing state representation
to a single integer and simplifying algorithm.
Addresses issues kokkos#320 , kokkos#487 , kokkos#452

Using power-of-two Kokkos::Impl::concurrent_bitset size to streamline
implementation and align with MemoryPool needs.

Unit testing over a range of superblocks the following sequence:
  1) allocate N of varying size
  2) deallocate N/3 of these
  3) reallocation deallocated
  4) concurrently deallocate and allocate N/3 of these
hcedwar added a commit to hcedwar/kokkos that referenced this issue Apr 19, 2017
race conditions by condensing state representation
to a single integer and simplifying algorithm.
Addresses issues kokkos#320 , kokkos#487 , kokkos#452

Using power-of-two Kokkos::Impl::concurrent_bitset size to streamline
implementation and align with MemoryPool needs.

Unit testing over a range of superblocks the following sequence:
  1) allocate N of varying size
  2) deallocate N/3 of these
  3) reallocation deallocated
  4) concurrently deallocate and allocate N/3 of these

Add performance test for memory pool.
Add performance enhancement note for multiple hints per block size.
hcedwar added a commit to hcedwar/kokkos that referenced this issue Apr 19, 2017
race conditions by condensing state representation
to a single integer and simplifying algorithm.
Addresses issues kokkos#320 , kokkos#487 , kokkos#452

Using power-of-two Kokkos::Impl::concurrent_bitset size to streamline
implementation and align with MemoryPool needs.

Unit testing over a range of superblocks the following sequence:
  1) allocate N of varying size
  2) deallocate N/3 of these
  3) reallocation deallocated
  4) concurrently deallocate and allocate N/3 of these

Add performance test for memory pool.
Add performance enhancement note for multiple hints per block size.
hcedwar added a commit to hcedwar/kokkos that referenced this issue Apr 19, 2017
race conditions by condensing state representation
to a single integer and simplifying algorithm.
Addresses issues kokkos#320 , kokkos#487 , kokkos#452

Using power-of-two Kokkos::Impl::concurrent_bitset size to streamline
implementation and align with MemoryPool needs.

Unit testing over a range of superblocks the following sequence:
  1) allocate N of varying size
  2) deallocate N/3 of these
  3) reallocation deallocated
  4) concurrently deallocate and allocate N/3 of these

Add performance test for memory pool.
Add performance enhancement note for multiple hints per block size.
hcedwar added a commit to hcedwar/kokkos that referenced this issue Apr 19, 2017
race conditions by condensing state representation
to a single integer and simplifying algorithm.
Addresses issues kokkos#320 , kokkos#487 , kokkos#452

Creating power-of-two Kokkos::Impl::concurrent_bitset size to streamline
implementation and align with MemoryPool needs.

Unit testing over a range of superblocks the following sequence:
  1) allocate N of varying size
  2) deallocate N/3 of these
  3) reallocation deallocated
  4) concurrently deallocate and allocate N/3 of these

Add performance test for memory pool.
Add performance enhancement note for multiple hints per block size.
@hcedwar hcedwar added this to Backlog in On-node Task DAG Apr 19, 2017
hcedwar added a commit to hcedwar/kokkos that referenced this issue Apr 23, 2017
enable stealing of empty superblocks among block sizes.

Expand block size superblock hint array to "N" values per block size
to provide space for TBD superblock search optimizations.

Construct memory pool with min block, max block, and superblock size
and introduce performance optimizations related to max vs. min block size.

Issues kokkos#487, kokkos#320, kokkos#738, kokkos#215
@hcedwar hcedwar moved this from Feature Backlog to In Progress in On-node Task DAG Apr 25, 2017
@hcedwar
Copy link
Contributor

hcedwar commented Apr 28, 2017

Redesigned memory pool is tested with every thread in a warp allocating/deallocating.

@hcedwar hcedwar moved this from In Progress to In Develop in On-node Task DAG May 12, 2017
@crtrott crtrott closed this as completed May 27, 2017
@hcedwar hcedwar moved this from In Develop to Done in On-node Task DAG Jun 1, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Improve existing capability; will potentially require voting Feature Request Create new capability; will potentially require voting
Projects
No open projects
Development

No branches or pull requests

3 participants