New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding a new algorithm for allgather and allgatherv #9422
Conversation
It is based on the paper "Sparbit: a new logarithmic-cost and data locality-aware MPI Allgather algorithm" by Wilton Loch and Guilherme Koslovski, accepted in the 2021 SBAC-PAD and currently available at: https://arxiv.org/abs/2109.08751. I have included the math.h lib in both the allgather and allgatherv files, as the algorithm needs the ceil and log functions. I have naturally not changed the default algoritm selection rules inside the tuned component. Signed-off-by: Wilton Jaciel Loch (wiltonloch <wilton.loch@gmail.com>)
Can one of the admins verify this patch? |
Ok to test. |
Thanks for the contribution! |
@bosilca do you want to take a look before we merge? It looks good to me. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This algorithm is very similar to recursive doubling or Bruck, with few exceptions:
- not limited to power of 2 participants.
- it generates more send/recv because for each exchange it follows the original scount and rcount of the collective.
It is not clear to me where the benefit described in the paper comes from. The communications scheme saves the data reordering at the end of the collective (which exists in most of the other algorithms), in exchange of a large number of send and receive operations simultaneously. So basically, we exchange local memory bandwidth for network noise and matching costs.
Second, it does not solve the issue with the mapping, it just does the communications in a different order, which makes it possibly faster when processes are mapped on a specific way and if the communicator participants is perfectly balanced across the nodes.
Hello and thank you for the review. The main source of the benefits is indeed the inverted order of the communications, which under the sequential mapping makes the heaviest exchanges more local, in opposition to Bruck or Recursive Doubling under the same mapping. On the flip side this implies on non-contiguos memory organization, which we have chosen to deal with by employing multiple sends simultaneously. If the sequential mapping is employed the benefits are not necessarily dependent on a homogeneous balance of processes over the topology, but will definitely vary according to the distribution. Nonetheless, Sparbit with sequential mapping has in general much more stable runs than Bruck with cyclic as the number of processes vary. Also, further (unpublished) analysis have shown that for the tested environments the algorithm is able to reduce the communication time by an average of 20% on almost half of the tests, thus on such cases finishing the exchanges faster than any currently available algorithm on either sequential or cyclic mappings. Thanks once again for the attention and reviews. |
@bosilca can you please merge when ready? |
@wiltonloch did you want this for v5.0.0? We're targeting a release in early November. This would get it out into the world instead of master only - the next master based release is probably ~two years down the road. If you want to see it in production faster than that, could you please Based on my reading these new algorithms are the added at lowest priority, is that accurate? |
Thank you very much for the tip! I've created a PR for the inclusion of the algorithm on v5.0. |
It is based on the paper "Sparbit: a new logarithmic-cost and data
locality-aware MPI Allgather algorithm" by Wilton Loch and Guilherme
Koslovski, accepted in the 2021 SBAC-PAD and currently available at:
https://arxiv.org/abs/2109.08751.
I have included the math.h lib in both the allgather and allgatherv
files, as the algorithm needs the ceil and log functions.
I have naturally not changed the default algoritm selection rules inside
the tuned component.
Signed-off-by: Wilton Jaciel Loch (wiltonloch wilton.loch@gmail.com)