Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding a new algorithm for allgather and allgatherv #9422

Merged
merged 1 commit into from Sep 28, 2021
Merged

Adding a new algorithm for allgather and allgatherv #9422

merged 1 commit into from Sep 28, 2021

Conversation

wiltonloch
Copy link
Contributor

It is based on the paper "Sparbit: a new logarithmic-cost and data
locality-aware MPI Allgather algorithm" by Wilton Loch and Guilherme
Koslovski, accepted in the 2021 SBAC-PAD and currently available at:
https://arxiv.org/abs/2109.08751.

I have included the math.h lib in both the allgather and allgatherv
files, as the algorithm needs the ceil and log functions.

I have naturally not changed the default algoritm selection rules inside
the tuned component.

Signed-off-by: Wilton Jaciel Loch (wiltonloch wilton.loch@gmail.com)

It is based on the paper "Sparbit: a new logarithmic-cost and data
locality-aware MPI Allgather algorithm" by Wilton Loch and Guilherme
Koslovski, accepted in the 2021 SBAC-PAD and currently available at:
https://arxiv.org/abs/2109.08751.

I have included the math.h lib in both the allgather and allgatherv
files, as the algorithm needs the ceil and log functions.

I have naturally not changed the default algoritm selection rules inside
the tuned component.

Signed-off-by: Wilton Jaciel Loch (wiltonloch <wilton.loch@gmail.com>)
@ompiteam-bot
Copy link

Can one of the admins verify this patch?

@bosilca
Copy link
Member

bosilca commented Sep 24, 2021

Ok to test.

@gpaulsen gpaulsen self-requested a review September 25, 2021 20:44
@gpaulsen
Copy link
Member

Thanks for the contribution!

@gpaulsen gpaulsen added the NEWS label Sep 25, 2021
@gpaulsen
Copy link
Member

@bosilca do you want to take a look before we merge? It looks good to me.

Copy link
Member

@bosilca bosilca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This algorithm is very similar to recursive doubling or Bruck, with few exceptions:

  • not limited to power of 2 participants.
  • it generates more send/recv because for each exchange it follows the original scount and rcount of the collective.

It is not clear to me where the benefit described in the paper comes from. The communications scheme saves the data reordering at the end of the collective (which exists in most of the other algorithms), in exchange of a large number of send and receive operations simultaneously. So basically, we exchange local memory bandwidth for network noise and matching costs.

Second, it does not solve the issue with the mapping, it just does the communications in a different order, which makes it possibly faster when processes are mapped on a specific way and if the communicator participants is perfectly balanced across the nodes.

@wiltonloch
Copy link
Contributor Author

Hello and thank you for the review.

The main source of the benefits is indeed the inverted order of the communications, which under the sequential mapping makes the heaviest exchanges more local, in opposition to Bruck or Recursive Doubling under the same mapping. On the flip side this implies on non-contiguos memory organization, which we have chosen to deal with by employing multiple sends simultaneously.

If the sequential mapping is employed the benefits are not necessarily dependent on a homogeneous balance of processes over the topology, but will definitely vary according to the distribution. Nonetheless, Sparbit with sequential mapping has in general much more stable runs than Bruck with cyclic as the number of processes vary.

Also, further (unpublished) analysis have shown that for the tested environments the algorithm is able to reduce the communication time by an average of 20% on almost half of the tests, thus on such cases finishing the exchanges faster than any currently available algorithm on either sequential or cyclic mappings.

Thanks once again for the attention and reviews.

@gpaulsen
Copy link
Member

@bosilca can you please merge when ready?

@bosilca bosilca merged commit 4828663 into open-mpi:master Sep 28, 2021
@awlauria
Copy link
Contributor

awlauria commented Oct 1, 2021

@wiltonloch did you want this for v5.0.0? We're targeting a release in early November. This would get it out into the world instead of master only - the next master based release is probably ~two years down the road. If you want to see it in production faster than that, could you please cherry-pick -x back to the v5.0.0 branch asap?

Based on my reading these new algorithms are the added at lowest priority, is that accurate?

@wiltonloch
Copy link
Contributor Author

Thank you very much for the tip!

I've created a PR for the inclusion of the algorithm on v5.0.
Regarding the priority issue, I don't think that its position on the switch selection necessarily reflects a lower priority (at least as far as my understanding of the code goes, which might as well be wrong). Either way, as I have not modified the tuned selection rules, the algorithm will not be used by default in any case. It will, however, still be available for execution upon manual selection or by OTPO custom rules. Also, in the future it may be included on the default rules if it performs well on the tests.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants