
mempool: lock-free bucket-based memory pools for threads and tasks #183

Merged 7 commits into pmodels:master on Jun 5, 2020



@shintaro-iwasaki shintaro-iwasaki commented May 29, 2020


The current memory pools for threads and tasks use the following algorithms:

Thread pool

  • [Return] The thread pool returns elements to the global pool one by one when the local pool is full (i.e., holds more than ABT_MEM_MAX_NUM_STACKS elements).
  • [Allocate] The thread pool takes all elements from the global pool when the local pool is empty.
    • It can take more than ABT_MEM_MAX_NUM_STACKS elements at once.

Task pool

  • [Return] The task pool returns an element individually if that task was not created on this execution stream.
    • Locally created tasks are cached in the local pool (with no upper limit).
  • [Allocate] The task pool scans all the pages created locally by this execution stream; if no free task is found (e.g., one returned in the [Return] step), it allocates a new page and creates headers.
    • As the number of pages grows, so does the allocation time.

Both have several issues:

  • The memory pool for threads and the one for tasks use completely different algorithms for no particular reason, so developers must maintain two separate algorithms.
  • Neither pool imposes a maximum number of elements in a local cache, which can lead to a significant memory footprint.
    • For example, even if at most "N" threads/tasks exist at any given point, the consumed memory can be "N * # of ESs" in the worst case (since there is no upper limit).
  • Both algorithms use a naive stack implementation, which suffers from the ABA problem (Naive lock-free stack in mempool causes an ABA problem #178).
  • The interface is not generic, so another hard-coded memory pool would be needed for other descriptors.


This PR creates a generic lock-free bucket-based memory pool that lets users set a strict capacity for every local pool. The new implementation has the following merits:

  • The ABA issue is solved in a lock-free manner by using a pointer+tag parallel LIFO algorithm (if the architecture supports 16-byte CAS).
    • As far as I tried, the following environments are supported:
      • Typical Intel x86/64 with ICC17+, GCC4.8+, and Clang3.9+
      • POWER8/9 with XLC16 and GCC4.8+. No Clang (including a pretty new one) since it does not recognize 128-bit LL/SC inline assembly.
      • 64-bit ARM with GCC4.8+. No Clang (including a pretty new one) since it does not recognize 128-bit LL/SC inline assembly.
    • If not supported, it falls back to a spinlock-based implementation (and thus not lock-free).
  • All data transfer between a local pool and a global pool is per-bucket (multiple elements rather than a single element), which reduces the overheads.
    • This transfer is also lock-free.
  • Each local pool has a strict upper limit on its capacity, so if the program uses at most "N" elements, Argobots consumes at most "N + local-pool capacity * # of ESs" elements.
    • This guarantee is not completely strict because per-page memory allocation is performed in a lock-free manner (i.e., not serialized), so the total memory consumption can slightly exceed this bound in the worst case.
  • Threads and tasks use the same pool implementation, so developers need to optimize only one implementation if necessary.
  • The interface is generic, so it is easy to create another memory pool.
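
As a concrete illustration of the per-bucket flow above, here is a minimal single-threaded sketch. All names and sizes are hypothetical stand-ins, not the Argobots implementation, and the malloc-backed global pool is a stub for the real lock-free global bucket LIFO:

```c
#include <stddef.h>
#include <stdlib.h>

#define BUCKET_SIZE 64        /* elements moved per global-pool access */
#define LOCAL_NUM_BUCKETS 2   /* strict capacity: full buckets cached locally */

typedef struct elem { struct elem *next; } elem;

typedef struct {
    elem *buckets[LOCAL_NUM_BUCKETS]; /* full buckets cached locally */
    int num_buckets;
    elem *cur;                        /* partially used bucket (a list) */
    int cur_len;                      /* elements currently in cur */
} local_pool;

/* Stub: carve a fresh bucket from the heap.  The real implementation
 * would first try to pop a bucket from the lock-free global LIFO and
 * allocate a new page only when that is empty. */
static elem *global_pool_take_bucket(void)
{
    elem *head = NULL;
    for (int i = 0; i < BUCKET_SIZE; i++) {
        elem *e = malloc(sizeof(elem));
        e->next = head;
        head = e;
    }
    return head;
}

static void global_pool_return_bucket(elem *b)
{
    while (b) { elem *n = b->next; free(b); b = n; }
}

static elem *pool_alloc(local_pool *p)
{
    if (!p->cur) {
        /* Local pool empty: fetch one whole bucket, not one element. */
        p->cur = (p->num_buckets > 0) ? p->buckets[--p->num_buckets]
                                      : global_pool_take_bucket();
        p->cur_len = BUCKET_SIZE;
    }
    elem *e = p->cur;
    p->cur = e->next;
    p->cur_len--;
    return e;
}

static void pool_free(local_pool *p, elem *e)
{
    e->next = p->cur;
    p->cur = e;
    if (++p->cur_len == BUCKET_SIZE) {
        /* cur became a full bucket; cache it or return it wholesale,
         * keeping the local pool's capacity strictly bounded. */
        if (p->num_buckets < LOCAL_NUM_BUCKETS)
            p->buckets[p->num_buckets++] = p->cur;
        else
            global_pool_return_bucket(p->cur);
        p->cur = NULL;
        p->cur_len = 0;
    }
}
```

Because every interaction with the global pool moves BUCKET_SIZE elements at once, the number of global-pool accesses drops by that factor, and the local cache can never hold more than (LOCAL_NUM_BUCKETS + 1) buckets.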


The new algorithm is not always faster than the existing implementations. Because of differences in memory access order and cache access patterns, I observed up to a 60% slowdown (and up to a 2500% speedup) with fork-join microbenchmarks. I set the local pool capacity arbitrarily, and this tuning can also negatively affect performance in some cases. For now, because of the ABA problem (#178), I do not think the current implementation is better than this one, but if your application's performance is noticeably changed by this PR, please let me/us know so that we can fix it.

128-bit atomic CAS is useful to implement a lock-free LIFO, but most compiler
intrinsics do not support 128-bit atomic operations.  This patch implements
128-bit atomic compare-and-swap.  Since some CPU models do not support the
necessary instructions, availability is detected at configure time.

Currently, x86/64, 64-bit ARM, and POWER 8 and 9 are supported.

Note that, even if the CPU supports them, some compilers fail to recognize them
in inline assembly code: for example, Clang 9 and older do not recognize 128-bit
LL/SC instructions on ARM and POWER.
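
When configure finds no usable native 16-byte instruction, the PR falls back to a lock-based path. The following is a hedged sketch of what such a spinlock-based 128-bit CAS fallback can look like; the type, function, and lock names are hypothetical, not the patch's actual code (the native path would use cmpxchg16b on x86/64 or 128-bit LL/SC on ARM/POWER instead):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdatomic.h>

/* A 128-bit value represented as two 64-bit halves. */
typedef struct { uint64_t lo, hi; } uint128_pair;

/* One global spinlock serializes all emulated 128-bit CAS operations,
 * so this path is blocking, not lock-free. */
static atomic_flag g_cas128_lock = ATOMIC_FLAG_INIT;

static bool cas128_fallback(uint128_pair *ptr, uint128_pair *expected,
                            uint128_pair desired)
{
    bool success;
    while (atomic_flag_test_and_set_explicit(&g_cas128_lock,
                                             memory_order_acquire))
        ;  /* spin until the lock is acquired */
    success = (ptr->lo == expected->lo && ptr->hi == expected->hi);
    if (success)
        *ptr = desired;
    else
        *expected = *ptr;  /* report the observed value, like a real CAS */
    atomic_flag_clear_explicit(&g_cas128_lock, memory_order_release);
    return success;
}
```

This preserves the CAS contract (atomic compare, conditional store, and the observed value written back on failure), which lets the rest of the code stay identical whichever path configure selects.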

Atomic tagged pointer operations (void * + size_t) are implemented primarily
for lock-free LIFO.  This requires a special instruction (i.e., 128-bit atomic
CAS on a 64-bit OS), so not all environments support it.  If this atomic type
is supported, ABTD_ATOMIC_SUPPORT_TAGGED_PTR is defined.

Note that this atomic type is special and therefore provides only weak CAS and
non-atomic acquire/release/relaxed load and store.

ABTI_sync_lifo is a scalable LIFO implementation that does not have the ABA
problem.  If atomic tagged pointer operations are supported (i.e., most x86/64
with ICC, GCC, and Clang, 64-bit ARM with GCC, and POWER 8 and 9 with XLC and
GCC), push and pop are lock-free.  If not, it falls back on a spinlock-based
blocking implementation.
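
To illustrate why a tag defeats the ABA problem, here is a portable sketch in the spirit of the pointer+tag LIFO. To stay runnable without a 16-byte CAS, it packs a 32-bit element index and a 32-bit tag into a single 64-bit word; the real ABTI_sync_lifo pairs a full pointer with a tag and uses a 128-bit CAS. All names are illustrative:

```c
#include <stdint.h>
#include <stdatomic.h>

#define NIL 0xffffffffu  /* index meaning "empty" */

typedef struct { uint32_t next; } lifo_elem;

typedef struct {
    lifo_elem *elems;       /* element storage, addressed by index */
    _Atomic uint64_t head;  /* (tag << 32) | index */
} lifo;

static inline uint64_t pack(uint32_t tag, uint32_t idx)
{
    return ((uint64_t)tag << 32) | idx;
}

static void lifo_push(lifo *l, uint32_t idx)
{
    uint64_t old = atomic_load(&l->head);
    do {
        l->elems[idx].next = (uint32_t)old;      /* link to old top */
    } while (!atomic_compare_exchange_weak(&l->head, &old,
                 pack((uint32_t)(old >> 32), idx)));  /* tag unchanged */
}

static uint32_t lifo_pop(lifo *l)
{
    uint64_t old = atomic_load(&l->head);
    uint32_t idx;
    do {
        idx = (uint32_t)old;
        if (idx == NIL)
            return NIL;
        /* Bumping the tag on every pop means a stale head value never
         * compares equal again, even if the same element is popped and
         * pushed back in between: no ABA. */
    } while (!atomic_compare_exchange_weak(&l->head, &old,
                 pack((uint32_t)(old >> 32) + 1, l->elems[idx].next)));
    return idx;
}
```

A plain pointer-only head would let a racing pop-then-push of the same element make an outdated CAS succeed and corrupt the list; the tag turns that race into a harmless CAS failure and retry.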

ABTI_mem_pool is a generic memory pool implementation.  The basic algorithm is
similar to the current one: it first accesses a local memory pool, and then a
global memory pool if the local one is empty or full.  The advantages of the
new algorithm are as follows:
- Generic: the implementation takes a segment size as a runtime argument.
- Per-bucket operation: when accessing a global memory pool, multiple segments
  grouped in a "bucket" are moved rather than a single segment, which reduces
  the number of global pool accesses.  The number of local buckets and the
  bucket size are constant, so the local pool does not keep too many segments.
- Lock-free: the entire push-pop operations are lock-free (on most CPUs), with
  no ABA problem.

The current memory pool implementations should be replaced by this.

This patch introduces new memory pools for threads and tasks, which replace
the existing ones.
@shintaro-iwasaki shintaro-iwasaki linked an issue Jun 5, 2020 that may be closed by this pull request

I finally confirmed that this PR works on POWER 9 and 64-bit ARM, in addition to x86/64, with various compilers including GCC 4.8, 6.5, 8.3, and 9.2, Clang 3.9, 7.0, 9.0, and 10.0, and architecture-specific compilers such as ICC 18, 19, and 20 and XLC 16.

Note that this change might cause some performance regressions; please tell us if you encounter any.

@shintaro-iwasaki shintaro-iwasaki merged commit 1735db1 into pmodels:master Jun 5, 2020

Successfully merging this pull request may close these issues.

Naive lock-free stack in mempool causes an ABA problem