
key: extend get/set_specific interface #201

Merged 7 commits on Jun 16, 2020

This addresses #198: it would be nice to have TLS functions whose values can be accessed not only by the owner (i.e., the calling work unit itself) but also by other parallel units.


This PR implements the following functions:

```c
int ABT_thread_set_specific(ABT_thread thread, ABT_key key, void *value);
int ABT_thread_get_specific(ABT_thread thread, ABT_key key, void **value);
int ABT_task_set_specific(ABT_task task, ABT_key key, void *value);
int ABT_task_get_specific(ABT_task task, ABT_key key, void **value);
```

These functions take a work unit handle, so they can access another work unit's TLS.

Performance impact

This change degrades the performance of ABT_key access because atomic operations are newly added to some paths. In Argobots, a key table and its elements are created lazily, so synchronization is necessary when two parallel units try to allocate a key table or insert a new element at the same time (e.g., two parallel `ABT_xxx_set_specific()` calls).

The following shows the overheads (in nanoseconds) of ABT_key operations on several CPUs: a 56-core Skylake machine (2-socket Intel Xeon Platinum 8180), a 64-core Intel Knights Landing 7210, and a Summit-like POWER9. I also measured the performance on a 64-bit ARM CPU. Lower is better.

The blue bars show performance before the optimization of #196, the orange bars show #196, the gray bars show #197, and the yellow bars show this PR. In each benchmark, every ES forks and joins many ULTs (256 for 1 ES and 4096 for more ESs) in total, while each ULT accesses its own TLS several times (so there is no contention). Each value is the average of five measurements.


The result shows that, yes, this new synchronization adds certain overheads (10%-80%), but I believe this cost is acceptable considering the original performance of ABT_key operations. This feature is not only useful for #199 but also contributes to #198. The performance on 64-bit ARM was similar to that of the other three CPUs.

Note that the performance is degraded basically only on the first entry creation; subsequent TLS accesses with the same ABT_key do not get costlier.

This change is necessary to allow non-ABT_key functions to manipulate work-unit-specific data efficiently. Previously, only the currently running thread could modify its WU-specific value, as in Pthreads. The new interface allows users to access other work units' WU-specific values. This is useful for passing data across threads and tasklets, although it adds a slight cost to avoid data races.

This is one of two changes that are necessary to allow parallel set_specific().

This change prepares for the next patch. Note that this change does not affect performance since TLS entry order does not have locality. Users should use a larger key table (ABT_KEY_TABLE_SIZE) if they are worried about performance.

This is the last missing piece required to make set_specific() operations thread-safe. It adds a small overhead on creation, but does not incur extra overheads (at least on machines that use a strong memory model) once the element is created.