Adding a new algorithm for allgather and allgatherv #9422

wiltonloch · 2021-09-24T18:37:35Z

It is based on the paper "Sparbit: a new logarithmic-cost and data
locality-aware MPI Allgather algorithm" by Wilton Loch and Guilherme
Koslovski, accepted in the 2021 SBAC-PAD and currently available at:
https://arxiv.org/abs/2109.08751.

I have included the math.h lib in both the allgather and allgatherv
files, as the algorithm needs the ceil and log functions.

I have naturally not changed the default algoritm selection rules inside
the tuned component.

Signed-off-by: Wilton Jaciel Loch (wiltonloch wilton.loch@gmail.com)

It is based on the paper "Sparbit: a new logarithmic-cost and data locality-aware MPI Allgather algorithm" by Wilton Loch and Guilherme Koslovski, accepted in the 2021 SBAC-PAD and currently available at: https://arxiv.org/abs/2109.08751. I have included the math.h lib in both the allgather and allgatherv files, as the algorithm needs the ceil and log functions. I have naturally not changed the default algoritm selection rules inside the tuned component. Signed-off-by: Wilton Jaciel Loch (wiltonloch <wilton.loch@gmail.com>)

ompiteam-bot · 2021-09-24T18:37:37Z

Can one of the admins verify this patch?

bosilca · 2021-09-24T20:38:38Z

Ok to test.

gpaulsen · 2021-09-25T20:45:14Z

Thanks for the contribution!

gpaulsen · 2021-09-27T13:55:42Z

@bosilca do you want to take a look before we merge? It looks good to me.

bosilca

This algorithm is very similar to recursive doubling or Bruck, with few exceptions:

not limited to power of 2 participants.
it generates more send/recv because for each exchange it follows the original scount and rcount of the collective.

It is not clear to me where the benefit described in the paper comes from. The communications scheme saves the data reordering at the end of the collective (which exists in most of the other algorithms), in exchange of a large number of send and receive operations simultaneously. So basically, we exchange local memory bandwidth for network noise and matching costs.

Second, it does not solve the issue with the mapping, it just does the communications in a different order, which makes it possibly faster when processes are mapped on a specific way and if the communicator participants is perfectly balanced across the nodes.

wiltonloch · 2021-09-27T19:56:47Z

Hello and thank you for the review.

The main source of the benefits is indeed the inverted order of the communications, which under the sequential mapping makes the heaviest exchanges more local, in opposition to Bruck or Recursive Doubling under the same mapping. On the flip side this implies on non-contiguos memory organization, which we have chosen to deal with by employing multiple sends simultaneously.

If the sequential mapping is employed the benefits are not necessarily dependent on a homogeneous balance of processes over the topology, but will definitely vary according to the distribution. Nonetheless, Sparbit with sequential mapping has in general much more stable runs than Bruck with cyclic as the number of processes vary.

Also, further (unpublished) analysis have shown that for the tested environments the algorithm is able to reduce the communication time by an average of 20% on almost half of the tests, thus on such cases finishing the exchanges faster than any currently available algorithm on either sequential or cyclic mappings.

Thanks once again for the attention and reviews.

gpaulsen · 2021-09-28T15:22:31Z

@bosilca can you please merge when ready?

awlauria · 2021-10-01T13:56:14Z

@wiltonloch did you want this for v5.0.0? We're targeting a release in early November. This would get it out into the world instead of master only - the next master based release is probably ~two years down the road. If you want to see it in production faster than that, could you please cherry-pick -x back to the v5.0.0 branch asap?

Based on my reading these new algorithms are the added at lowest priority, is that accurate?

wiltonloch · 2021-10-01T22:45:05Z

Thank you very much for the tip!

I've created a PR for the inclusion of the algorithm on v5.0.
Regarding the priority issue, I don't think that its position on the switch selection necessarily reflects a lower priority (at least as far as my understanding of the code goes, which might as well be wrong). Either way, as I have not modified the tuned selection rules, the algorithm will not be used by default in any case. It will, however, still be available for execution upon manual selection or by OTPO custom rules. Also, in the future it may be included on the default rules if it performs well on the tests.

gpaulsen self-requested a review September 25, 2021 20:44

gpaulsen approved these changes Sep 25, 2021

View reviewed changes

gpaulsen added the NEWS label Sep 25, 2021

bosilca approved these changes Sep 27, 2021

View reviewed changes

bosilca merged commit 4828663 into open-mpi:master Sep 28, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adding a new algorithm for allgather and allgatherv #9422

Adding a new algorithm for allgather and allgatherv #9422

wiltonloch commented Sep 24, 2021

ompiteam-bot commented Sep 24, 2021

bosilca commented Sep 24, 2021

gpaulsen commented Sep 25, 2021

gpaulsen commented Sep 27, 2021

bosilca left a comment

wiltonloch commented Sep 27, 2021

gpaulsen commented Sep 28, 2021

awlauria commented Oct 1, 2021 •

edited

wiltonloch commented Oct 1, 2021

Adding a new algorithm for allgather and allgatherv #9422

Adding a new algorithm for allgather and allgatherv #9422

Conversation

wiltonloch commented Sep 24, 2021

ompiteam-bot commented Sep 24, 2021

bosilca commented Sep 24, 2021

gpaulsen commented Sep 25, 2021

gpaulsen commented Sep 27, 2021

bosilca left a comment

Choose a reason for hiding this comment

wiltonloch commented Sep 27, 2021

gpaulsen commented Sep 28, 2021

awlauria commented Oct 1, 2021 • edited

wiltonloch commented Oct 1, 2021

awlauria commented Oct 1, 2021 •

edited