efficient synchronization of external threads using futex #306
Merged
ABT_CONFIG_USE_LINUX_FUTEX is declared if futex is available on that system.
--enable-wait-policy=<active|passive|default|auto> changes the wait policy of synchronization objects. This information can be retrieved by ABT_info functions.
The current spinlock implementation relies only on ABTD_atomic features and therefore can be classified as ABTD. To introduce ABTD_futex in the next commit, which uses this spinlock, this patch moves the spinlock implementation from ABTI to ABTD.
ABTD_futex_multiple is an internal synchronization object for external threads (Pthreads). Unlike spinlock, external threads that block on ABTD_futex_multiple will suspend, so waiting external threads do not burn cores. Since futex is a Linux-specific feature, ABTD_futex_multiple has a POSIX version that implements the same feature with pthread_mutex_t and pthread_cond_t. To cover both futex and Pthreads, ABTD_futex_multiple does not support all the futex operations. ABTD_futex_multiple supports only wait, timedwait, and broadcast.
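As a rough illustration of the wait, timedwait, and broadcast trio on the futex side, the sketch below keeps all state in a 4-byte generation counter. The names demo_futex_multiple, demo_wait, and demo_broadcast are hypothetical and are not the Argobots API; the real ABTD_futex_multiple may be organized differently (for example, in how it cooperates with the spinlock). It is Linux only and assumes the caller samples the counter before deciding to sleep.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <limits.h>
#include <linux/futex.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical stand-in for ABTD_futex_multiple: valid when zero-filled. */
typedef struct {
    _Atomic uint32_t gen; /* bumped on every broadcast */
} demo_futex_multiple;

static long demo_futex(_Atomic uint32_t *addr, int op, uint32_t val,
                       const struct timespec *rel)
{
    return syscall(SYS_futex, addr, op, val, rel, NULL, 0);
}

/* Sleep until a broadcast happens after `seen` was sampled.
 * timeout_ms < 0 waits forever. Returns 0 on broadcast, 1 on timeout. */
static int demo_wait(demo_futex_multiple *f, uint32_t seen, long timeout_ms)
{
    struct timespec start, now;
    clock_gettime(CLOCK_MONOTONIC, &start);
    while (atomic_load(&f->gen) == seen) {
        struct timespec rel, *prel = NULL;
        if (timeout_ms >= 0) {
            clock_gettime(CLOCK_MONOTONIC, &now);
            long long left = timeout_ms * 1000000LL -
                             ((now.tv_sec - start.tv_sec) * 1000000000LL +
                              (now.tv_nsec - start.tv_nsec));
            if (left <= 0)
                return 1;
            rel.tv_sec = (time_t)(left / 1000000000LL);
            rel.tv_nsec = (long)(left % 1000000000LL);
            prel = &rel;
        }
        /* The kernel blocks only if gen still equals seen. EAGAIN, EINTR,
         * and ETIMEDOUT are all handled by re-checking the loop condition. */
        demo_futex(&f->gen, FUTEX_WAIT_PRIVATE, seen, prel);
    }
    return 0;
}

/* Wake every thread currently sleeping on the object. */
static void demo_broadcast(demo_futex_multiple *f)
{
    atomic_fetch_add(&f->gen, 1);
    demo_futex(&f->gen, FUTEX_WAKE_PRIVATE, INT_MAX, NULL);
}

typedef struct {
    demo_futex_multiple *f;
    uint32_t seen;
} demo_arg;

static void *demo_waiter(void *p)
{
    demo_arg *a = p;
    return (void *)(intptr_t)demo_wait(a->f, a->seen, -1);
}

/* Returns 1 if a timed wait expires and a broadcast then wakes 4 waiters. */
static int demo_run(void)
{
    demo_futex_multiple f = { 0 };
    demo_arg arg = { &f, atomic_load(&f.gen) };
    pthread_t th[4];
    for (int i = 0; i < 4; i++) {
        int rc = pthread_create(&th[i], NULL, demo_waiter, &arg);
        assert(rc == 0);
    }
    int timed_out = demo_wait(&f, arg.seen, 20); /* nobody broadcasts yet */
    demo_broadcast(&f);
    int woken = 0;
    for (int i = 0; i < 4; i++) {
        void *r;
        pthread_join(th[i], &r);
        woken += ((intptr_t)r == 0);
    }
    return timed_out == 1 && woken == 4;
}
```

Because the waiter passes the value it sampled as the expected futex value, a broadcast that lands between sampling and sleeping is not lost: the kernel refuses to block and the loop exits.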
blocking is not used. This patch removes it to simplify introduction of ABTD_futex_multiple.
This patch refactors ABTI_waitlist_wait_timedout_and_unlock() to minimize the change in the next commit. This patch does not change the logic of the existing algorithm.
Argobots synchronization objects such as ABT_cond and ABT_barrier internally use ABTI_waitlist. This patch introduces ABTD_futex_multiple for ABTI_waitlist operations so that external threads and tasklets can wait for those objects without spinning on cores.
test:argobots/all |
To check synchronization objects on external threads, this patch adds six tests: ext_thread_barrier, ext_thread_cond, ext_thread_eventual, ext_thread_future, ext_thread_mutex, and ext_thread_rwlock. These tests run while signals are being delivered, because signals cause spurious wakeups and can break a futex-based implementation.
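The kind of failure these tests guard against can be sketched as follows. This is not the test code from the patch; the names (sig_waiter, sig_demo_run, and so on) are hypothetical. A waiter sleeps on a futex word while another thread sends it SIGUSR1 repeatedly; each signal can make the futex call return early with EINTR, so the waiter must re-check its condition rather than treat the return as a wakeup. Linux only.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/futex.h>
#include <pthread.h>
#include <signal.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

static _Atomic uint32_t sig_flag; /* 0: keep waiting, 1: condition satisfied */
static atomic_int sig_handled;    /* number of handler invocations */
static atomic_int sig_eintr;      /* number of interrupted futex waits */
static atomic_int sig_done;       /* set when the waiter leaves its loop */

static void sig_handler(int signo)
{
    (void)signo;
    atomic_fetch_add(&sig_handled, 1);
}

static void *sig_waiter(void *arg)
{
    (void)arg;
    /* Re-check the condition after every return from the futex call. */
    while (atomic_load(&sig_flag) == 0) {
        long rc = syscall(SYS_futex, &sig_flag, FUTEX_WAIT_PRIVATE, 0,
                          NULL, NULL, 0);
        if (rc == -1 && errno == EINTR)
            atomic_fetch_add(&sig_eintr, 1);
    }
    atomic_store(&sig_done, 1);
    return NULL;
}

/* Returns 1 if the waiter survived the signals and woke only on the flag. */
static int sig_demo_run(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = sig_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0; /* no SA_RESTART: blocked system calls return EINTR */
    sigaction(SIGUSR1, &sa, NULL);

    pthread_t th;
    int rc = pthread_create(&th, NULL, sig_waiter, NULL);
    assert(rc == 0);

    struct timespec one_ms = { 0, 1000000 };
    for (int i = 0; i < 50; i++) {
        pthread_kill(th, SIGUSR1);
        nanosleep(&one_ms, NULL);
    }
    int finished_early = atomic_load(&sig_done);

    /* Now satisfy the condition and wake the waiter for real. */
    atomic_store(&sig_flag, 1);
    syscall(SYS_futex, &sig_flag, FUTEX_WAKE_PRIVATE, 1, NULL, NULL, 0);
    pthread_join(th, NULL);

    return !finished_early && atomic_load(&sig_handled) > 0;
}
```

If the `while` loop were replaced by a single futex call, the first signal would let the waiter proceed before the condition holds, which is the bug class the signal-heavy tests are meant to expose.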
e5de9ff to
25c08f3
Compare
test:argobots/osx |
ABTD_futex_single is an internal synchronization object for external threads (Pthreads). Unlike spinlock, external threads that block on ABTD_futex_single will suspend, so waiting external threads do not burn cores. ABTD_futex_single has a POSIX version as a fallback. ABTD_futex_single supports only suspend and resume.
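One possible shape for a POSIX fallback with only suspend and resume is sketched below. It is an assumption for illustration, not the actual ABTD_futex_single code: the shared object is a single pointer that is valid when zero-filled, and the pthread_mutex_t and pthread_cond_t live on the waiter's stack only while it sleeps. All names (demo_futex_single, demo_suspend, demo_resume) are hypothetical, and the sketch assumes at most one suspend and one resume per object.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for ABTD_futex_single (POSIX flavor). */
typedef struct {
    _Atomic(void *) state; /* NULL, a waiter record, or DEMO_RESUMED */
} demo_futex_single;

typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t cond;
    int resumed;
} demo_waiter_rec;

static char demo_resumed_tag; /* its address marks "resume already happened" */
#define DEMO_RESUMED ((void *)&demo_resumed_tag)

static void demo_suspend(demo_futex_single *f)
{
    demo_waiter_rec w;
    pthread_mutex_init(&w.mutex, NULL);
    pthread_cond_init(&w.cond, NULL);
    w.resumed = 0;

    void *expected = NULL;
    pthread_mutex_lock(&w.mutex);
    /* Publish the stack record; fails if resume has already run. */
    if (atomic_compare_exchange_strong(&f->state, &expected, (void *)&w)) {
        while (!w.resumed) /* loop filters spurious wakeups */
            pthread_cond_wait(&w.cond, &w.mutex);
    }
    pthread_mutex_unlock(&w.mutex);
    pthread_cond_destroy(&w.cond);
    pthread_mutex_destroy(&w.mutex);
}

static void demo_resume(demo_futex_single *f)
{
    void *prev = atomic_exchange(&f->state, DEMO_RESUMED);
    if (prev != NULL && prev != DEMO_RESUMED) {
        demo_waiter_rec *w = prev;
        /* Signal while holding the mutex so the record stays alive. */
        pthread_mutex_lock(&w->mutex);
        w->resumed = 1;
        pthread_cond_signal(&w->cond);
        pthread_mutex_unlock(&w->mutex);
    }
}

static void *demo_thread(void *p)
{
    demo_suspend(p);
    return NULL;
}

/* Exercises both orderings; returns 1 if the object is pointer-sized. */
static int demo_single_run(void)
{
    demo_futex_single early = { 0 }, late = { 0 };

    demo_resume(&early);  /* resume first ... */
    demo_suspend(&early); /* ... so suspend returns immediately */

    pthread_t th;
    int rc = pthread_create(&th, NULL, demo_thread, &late);
    assert(rc == 0);
    demo_resume(&late);   /* may run before or after the thread sleeps */
    pthread_join(th, NULL);

    return sizeof(demo_futex_single) == sizeof(void *);
}
```

The waiter takes the mutex before publishing its record, so a resumer that finds the pointer cannot set the flag and signal until the waiter is actually inside pthread_cond_wait().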
This patch refactors ABTI_ythread_context_switch_to_child_internal() and ythread_terminate() to prepare for the next commit. This commit does not change the existing join request handling.
When an external thread or a tasklet joins yieldable threads, it will wait on ABTD_futex_single. This improves CPU utilization.
A new test, ext_thread_join, checks ULT join and execution stream join on external threads, which internally uses ABTD_futex_single.
New ext_thread tests create Pthreads on the primary execution stream. On Linux, a newly created Pthread inherits its parent's affinity setting. Therefore, when affinity is enabled and no specific affinity is set (i.e., when ABT_SET_AFFINITY is not set), the primary execution stream and the external Pthreads are pinned to a single core. This significantly degrades performance because the primary execution stream does not sleep (while Pthreads internally sleep on a futex or pthread_cond_t). This patch relaxes the affinity setting of the primary execution stream to avoid this oversubscription.
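The inheritance behavior described above can be observed with a short Linux-only sketch. It is not taken from the patch; aff_child and aff_demo_run are hypothetical names. The parent pins itself to one core, creates a Pthread, and the child reports the affinity mask it started with.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <sched.h>

/* The child records the affinity mask it was created with. */
static void *aff_child(void *out)
{
    pthread_getaffinity_np(pthread_self(), sizeof(cpu_set_t),
                           (cpu_set_t *)out);
    return NULL;
}

/* Returns 1 if the child inherited the parent's single-core mask. */
static int aff_demo_run(void)
{
    cpu_set_t saved, pinned, child;
    int rc = pthread_getaffinity_np(pthread_self(), sizeof(saved), &saved);
    assert(rc == 0);

    /* Pick the first core this thread is currently allowed to use. */
    int cpu = -1;
    for (int i = 0; i < CPU_SETSIZE; i++) {
        if (CPU_ISSET(i, &saved)) {
            cpu = i;
            break;
        }
    }
    assert(cpu >= 0);

    /* Pin the parent to that single core, as a pinned execution stream
     * would be. */
    CPU_ZERO(&pinned);
    CPU_SET(cpu, &pinned);
    rc = pthread_setaffinity_np(pthread_self(), sizeof(pinned), &pinned);
    assert(rc == 0);

    CPU_ZERO(&child);
    pthread_t th;
    rc = pthread_create(&th, NULL, aff_child, &child);
    assert(rc == 0);
    pthread_join(th, NULL);

    /* Relax the parent's affinity again by restoring the original mask. */
    pthread_setaffinity_np(pthread_self(), sizeof(saved), &saved);

    return CPU_EQUAL(&child, &pinned);
}
```

Restoring the wider mask before creating further threads is analogous in spirit to relaxing the affinity of the primary execution stream, although how Argobots does this internally is not shown here.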
25c08f3 to
7daee58
Compare
test:argobots/osx |
test:argobots/freebsd |
We understand that the current synchronization mechanism has room for performance improvement. Please let us know of any use cases so that we can optimize for them.
Pull Request Description
Problem
Most Argobots functions now support external threads; Pthreads can call Argobots synchronization functions such as ABT_mutex_lock(), ABT_barrier_wait(), and ABT_thread_join() together with ULTs. However, the current implementation of Argobots uses spin-wait (possibly with pause or sched_yield()), which burns cores if synchronization conditions are not satisfied soon. This can cause catastrophic performance degradation when Pthreads call such synchronization functions, for example, when users do not know who will call Argobots functions. This issue can arise particularly when Argobots is used as a building block of another runtime system that is expected to be called by other high-level programs.
Solution
This change affects only cases where Pthreads (neither ULTs nor tasklets!) directly call Argobots synchronization functions.
Argobots provides a futex-based suspension configuration that affects all the synchronization operations. Currently, unless the user passes --enable-wait-policy=active, Argobots makes the underlying external thread block on a futex (as most POSIX synchronization operations do internally). The blocked external thread does not consume CPU resources in a busy loop, which alleviates thread oversubscription. This does not change ULT performance: if you call such synchronization operations on ULTs (the typical use case), the behavior is unchanged and the overhead of these operations does not increase.
Developer Notes
This synchronization mechanism is somewhat complicated. Some notes:
futex is basically Linux only. As a fallback, Argobots has a POSIX-based implementation (pthread_mutex_t and pthread_cond_t) for other UNIX systems.
Since we should not allocate large memory space just for this additional mechanism that not many programs use, ABTD_futex_xxx_t should be small and simple. Currently, its size is 4 or 8 bytes. This structure can be initialized by zero clear no matter whether POSIX or futex is used (note: pthread_mutex_t or pthread_cond_t can be 32 bytes or more (implementation dependent)). This is also necessary for this feature to coexist with the static initializers of ABT_mutex and ABT_cond.
The synchronization semantics of some operations differ. Specifically, ABT_cond_wait() does not allow spurious wakeup while pthread_cond_wait() allows it. This causes a semantics mismatch and adds some overhead.
External signals can disrupt system calls (i.e., futex or pthread_cond_wait). Seven tests are newly added to check if external threads work while receiving a lot of signals (ext_thread_join, ext_thread_mutex, ...).
Performance
The following shows the performance of synchronization objects of the following:
pthread_mutex_lock() etc.)
on those machines:
All use Pthreads for underlying threads.
(The benchmark code is collapsed)
(The experimental setting is collapsed)
Little contention:
Mutex: 1 thread (Skylake/ARM64/POWER9)
Cond: 2 threads (Skylake/ARM64/POWER9)
Barrier: 2 threads (Skylake/ARM64/POWER9)
High contention (not oversubscribed):
Mutex/Cond/Barrier: 28 threads (Skylake/ARM64), 40 threads (POWER9)
High contention (oversubscribed):
Mutex/Cond/Barrier: 256 threads (Skylake/ARM64), 320 threads (POWER9)
Results are the average of six executions. Each execution repeats these operations many times (around 100000 or more). Argobots is compiled with --enable-perf-opt.
Please note that this result is meant to help "understand" the performance. This PR's change is qualitative, so poor microbenchmark performance does not lessen the value of this PR much. This microbenchmark does not represent a Pthreads+Argobots use case well: the purpose of this new mechanism is to put external threads to sleep while they wait for others' work that is moderately coarse-grained, and in that case this PR works very well whether POSIX or futex is used. I also note that semantics (spurious wakeup or not), provided features (Pthreads only or ULT + Pthreads), and scheduling fairness (Argobots synchronization objects might not be as fair as those of Pthreads) are different, so this comparison is not fair.
We can observe the following:
The performance of ABT_barrier and ABT_cond degrades significantly when the wait policy of Argobots is active (because Argobots spins on cores).
There is much room for improvement, so if further performance optimization is necessary, please let us know.
Known issues
ABT_cond_signal()).
ABT_mutex_lock_active() and ABT_mutex_lock_passive()). This is future work (if there is such a demand).
Checklist
module: short description and follows good practice