key: extend get/set_specific interface #201
Merged
Problems
This addresses #198; it would be nice to have TLS functions that allow a work unit's TLS to be accessed not only by its owner (self) but also by other parallel units.
Solutions
This PR implements the following functions:
These functions take a work unit handle, so they can access another work unit's TLS.
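To illustrate what a handle-based interface means, here is a minimal toy model in C. It is not the Argobots API: `toy_unit`, `toy_unit_set_specific()`, `toy_unit_get_specific()`, and the fixed-size slot array are all hypothetical names and simplifications made up for this sketch. The only point is that the target work unit is passed explicitly, so the caller does not have to be the owner.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_KEYS 8 /* hypothetical fixed key count, for illustration only */

/* Toy stand-in for a work unit; NOT the Argobots type. */
typedef struct {
    void *tls[TOY_MAX_KEYS]; /* one slot per key, NULL if never set */
} toy_unit;

/* Store `value` under `key` in the TLS of the unit named by `unit`.
 * Returns 0 on success, -1 on an invalid argument. */
static int toy_unit_set_specific(toy_unit *unit, int key, void *value)
{
    if (unit == NULL || key < 0 || key >= TOY_MAX_KEYS)
        return -1;
    unit->tls[key] = value;
    return 0;
}

/* Read the value stored under `key` for `unit` into `*value`.
 * `*value` is NULL if the key was never set for that unit. */
static int toy_unit_get_specific(const toy_unit *unit, int key, void **value)
{
    if (unit == NULL || value == NULL || key < 0 || key >= TOY_MAX_KEYS)
        return -1;
    *value = unit->tls[key];
    return 0;
}

/* Demo: the caller is neither `a` nor `b`, yet it can tag `b` through its
 * handle. Returns 1 if `b` holds the tag and `a` is left untouched. */
static int toy_demo(void)
{
    static int tag = 42;
    toy_unit a = {{NULL}}, b = {{NULL}};
    void *from_a = &tag, *from_b = NULL;

    if (toy_unit_set_specific(&b, 3, &tag) != 0)
        return 0;
    if (toy_unit_get_specific(&b, 3, &from_b) != 0)
        return 0;
    if (toy_unit_get_specific(&a, 3, &from_a) != 0)
        return 0;
    return from_b == &tag && from_a == NULL;
}
```

This toy version has no synchronization at all, so it is only safe when a single thread touches a given unit; once other parallel units can write, the table has to be protected, which is where the overhead discussed below comes from.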
Performance impact
It degrades the performance of ABT_key access because atomic operations are newly added in some paths. In Argobots, a key table and its elements are lazily created, so synchronization is necessary if two parallel units try to allocate a key table or insert a new element at the same time (i.e., two parallel ABT_xxx_set_specific() calls).

The following are the overheads (in nanoseconds) of ABT_key operations on several CPUs: 56-core Skylake (2-socket Intel Xeon Platinum 8180 Processor), 64-core Intel Knights Landing 7210, and Summit-like POWER9. I also measured the performance on a 64-bit ARM CPU. Lower is better.

The blue bar is before the optimization of #196, the orange bar is #196, the gray bar is #197, and the yellow bar is this PR. Each ES forks and joins many ULTs (256 in total for 1 ES and 4096 in total for more ESs), and each ULT accesses its own TLS several times (so there is no contention). Each result is the average of five measurements.
The results show that this new synchronization does add overhead (10% - 80%), but I believe this cost is acceptable considering the original performance of ABT_key operations. Specifically, this feature is useful for #199 and also contributes to #198. The performance on 64-bit ARM was similar to that on the other three CPUs.

Note that the performance is degraded only on the first entry creation; subsequent TLS accesses with the same ABT_key do not get costlier.
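The lazy-creation race described above can be sketched with C11 atomics. This is an assumption about one reasonable way to do it, not the code in this PR: the names (`sketch_unit`, `sketch_ktable`, and the `sketch_*` functions) are hypothetical, and the key table is reduced to a fixed array instead of a real hash table. It shows a key table being published with a compare-and-swap so that two parallel setters agree on one table.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

#define SKETCH_MAX_KEYS 8 /* hypothetical fixed key count */

/* Simplified key table: one atomic slot per key. */
typedef struct {
    _Atomic(void *) slots[SKETCH_MAX_KEYS];
} sketch_ktable;

/* Simplified work unit: the key table pointer stays NULL until first use. */
typedef struct {
    _Atomic(sketch_ktable *) ktable;
} sketch_unit;

/* Return the unit's key table, creating it on first use. */
static sketch_ktable *sketch_get_or_create_ktable(sketch_unit *unit)
{
    /* Fast path: the table already exists, so one atomic load suffices. */
    sketch_ktable *t = atomic_load_explicit(&unit->ktable, memory_order_acquire);
    if (t != NULL)
        return t;

    /* Slow path: build a table privately, then try to publish it. */
    sketch_ktable *fresh = malloc(sizeof(*fresh));
    if (fresh == NULL)
        return NULL;
    for (int i = 0; i < SKETCH_MAX_KEYS; i++)
        atomic_init(&fresh->slots[i], NULL);

    sketch_ktable *expected = NULL;
    if (atomic_compare_exchange_strong_explicit(&unit->ktable, &expected, fresh,
                                                memory_order_acq_rel,
                                                memory_order_acquire))
        return fresh; /* we won the race; our table is now visible */

    /* Another parallel unit published first: drop ours and use theirs. */
    free(fresh);
    return expected;
}

/* Set `value` under `key` for `unit`; may be called by any parallel unit. */
static int sketch_set_specific(sketch_unit *unit, int key, void *value)
{
    if (unit == NULL || key < 0 || key >= SKETCH_MAX_KEYS)
        return -1;
    sketch_ktable *t = sketch_get_or_create_ktable(unit);
    if (t == NULL)
        return -1;
    atomic_store_explicit(&t->slots[key], value, memory_order_release);
    return 0;
}

/* Get the value under `key` for `unit`; NULL if no table or no value yet. */
static void *sketch_get_specific(sketch_unit *unit, int key)
{
    if (unit == NULL || key < 0 || key >= SKETCH_MAX_KEYS)
        return NULL;
    sketch_ktable *t = atomic_load_explicit(&unit->ktable, memory_order_acquire);
    if (t == NULL)
        return NULL;
    return atomic_load_explicit(&t->slots[key], memory_order_acquire);
}

/* Single-threaded demo: returns 1 if the table is created exactly once
 * and the stored value can be read back. */
static int sketch_demo(void)
{
    static int value = 7;
    sketch_unit u;
    atomic_init(&u.ktable, NULL);

    if (sketch_get_specific(&u, 2) != NULL)
        return 0; /* nothing is set before the first write */
    if (sketch_set_specific(&u, 2, &value) != 0)
        return 0;
    sketch_ktable *first = atomic_load(&u.ktable);
    if (sketch_set_specific(&u, 5, &value) != 0)
        return 0;
    sketch_ktable *second = atomic_load(&u.ktable);

    int ok = first != NULL && first == second &&
             sketch_get_specific(&u, 2) == &value;
    free(first);
    return ok;
}
```

In this sketch the allocation and the compare-and-swap run only on the first write to a unit; later accesses take the fast path, which is consistent with the observation that the cost is concentrated in the first entry creation.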