Implement hierarchical MPI_Gatherv and MPI_Scatterv #12376
Conversation
Force-pushed 2ca243b to 1fa5e1b
Force-pushed 3fc83b3 to 7f67b7a
Force-pushed cae601f to c7d8a77
Force-pushed 0d28109 to c7d8a77
Edit: Moved OMB MPI_Scatterv comparison to PR description
@devreal I updated the OMB metrics in the description. Please take a look.
Any idea what is going on at 4k procs on scatterv? It's significantly slower than the linear version, while for smaller proc numbers performance is similar or better...
We should also think about segmenting for larger messages. Especially gatherv seems to be doing worse at larger messages and smaller proc counts.
Force-pushed c2d84f8 to 6829cec
@devreal lol You caught me. I flipped the table - the worse result was the current scatterv. Hierarchical scatterv was faster. Also, I had some bugs in the previous commit. They should be fixed now, and I have updated the scatterv numbers.
That is correct and unfortunate - we profiled the algorithm and identified two performance bottlenecks:
I don't see how segmentation actually helps with reducing the gatherv/scatterv latency, and it's very complicated for the *v collectives since not all processes need multiple segments, which requires per-peer bookkeeping on node leaders. The unfortunate thing is that we cannot simply switch to a faster algorithm for large messages, since only the root knows the message sizes. I'm afraid it's a trade-off for the application - it must choose one specific algorithm for all message sizes.
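For reference, the standard MPI_Gatherv prototype shows why only the root can make a size-based choice: the per-rank counts and displacements are significant only at the root.

```c
/* MPI_Gatherv as defined by the MPI standard: recvcounts[] and displs[]
 * are significant only at root, so non-root ranks cannot see the total
 * message size without an extra communication step. */
int MPI_Gatherv(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
                void *recvbuf, const int recvcounts[], const int displs[],
                MPI_Datatype recvtype, int root, MPI_Comm comm);
```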
Sorry, some more comments...
@devreal Coming back to disabling HAN Gatherv/Scatterv: currently the user has to use a dynamic rule file. It would be easier IMO to provide an MCA param, e.g.
Why do you want to disable HAN for Gatherv/Scatterv? The numbers look good to me. I understand we don't have a way for fine-grained selection of collective implementations, but maybe that's a broader decision than this PR?
@devreal Yes, I brought up the topic, but we should address it separately.
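For context, the component-level workaround available today is to drop HAN's priority so another coll component wins selection. This is a hedged example: coll_han_priority is an existing MCA parameter, but a gatherv/scatterv-specific switch like the one suggested above does not exist in this PR, and the application binary name is only illustrative.

```sh
# Take HAN out of the selection by lowering its priority below the other
# coll components. This is component-wide, not per-collective.
mpirun --mca coll_han_priority 0 -np 128 ./my_app
```

Note that this disables HAN for every collective, which is exactly why a finer-grained switch was brought up.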
Thanks, LGTM
Relax the function requirement to allow null low/up_rank output pointers, and rename the arguments because the function works for non-root ranks as well. Signed-off-by: Wenduo Wang <wenduwan@amazon.com>
Add gatherv implementation to optimize large-scale communications across multiple nodes with multiple processes per node, by avoiding high-incast traffic on the root process. Because *V collectives do not have equal datatype/count on every process, they do not natively support message-size-based tuning without an additional global communication. Similar to gather and allgather, the hierarchical gatherv requires a temporary buffer and memory copies to handle out-of-order data or non-contiguous placement in the output buffer, which results in worse performance for large messages compared to the linear implementation. Signed-off-by: Wenduo Wang <wenduwan@amazon.com>
Add scatterv implementation to optimize large-scale communications across multiple nodes with multiple processes per node, by avoiding high-incast traffic on the root process. Because *V collectives do not have equal datatype/count on every process, they do not natively support message-size-based tuning without an additional global communication. Similar to scatter, the hierarchical scatterv requires a temporary buffer and memory copies to handle out-of-order data or non-contiguous placement in the send buffer, which results in worse performance for large messages compared to the linear implementation. Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
Summary
Add gatherv and scatterv implementations to optimize large-scale communications across multiple nodes with multiple processes per node, by avoiding high-incast traffic on the root process.
Because *V collectives do not have equal datatype/count on every process, they do not natively support message-size-based tuning without an additional global communication.
Similar to gather and allgather, the hierarchical gatherv requires a temporary buffer and memory copies to handle out-of-order data or non-contiguous placement in the output buffer, which results in worse performance for large messages compared to the linear implementation.
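To make the two-level structure concrete, here is a minimal sketch for a contiguous int payload, assuming root == 0 and node-contiguous rank placement (the --rank-by slot happy path, so the packed intermediate buffer is already in global rank order). The function name hier_gatherv_int and all helpers are illustrative only; this is not the coll/han code from this PR.

```c
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* Two-level gatherv sketch: intra-node gatherv to a node leader, then an
 * inter-node gatherv of the packed node blocks to the root, then one copy
 * at the root into the user-specified rcounts/displs layout.
 * Non-root ranks may pass NULL for rbuf, rcounts and displs. */
static int hier_gatherv_int(const int *sbuf, int scount,
                            int *rbuf, const int *rcounts, const int *displs,
                            MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* Split into per-node communicators and a communicator of node leaders. */
    MPI_Comm node_comm, leader_comm = MPI_COMM_NULL;
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, rank, MPI_INFO_NULL, &node_comm);
    int node_rank, node_size;
    MPI_Comm_rank(node_comm, &node_rank);
    MPI_Comm_size(node_comm, &node_size);
    int is_leader = (node_rank == 0);
    MPI_Comm_split(comm, is_leader ? 0 : MPI_UNDEFINED, rank, &leader_comm);

    /* Stage 1: the node leader learns each local count, then gathers the
     * local data into one packed per-node buffer. */
    int *node_counts = is_leader ? malloc(node_size * sizeof(int)) : NULL;
    int *node_displs = is_leader ? malloc(node_size * sizeof(int)) : NULL;
    MPI_Gather(&scount, 1, MPI_INT, node_counts, 1, MPI_INT, 0, node_comm);
    int node_total = 0;
    int *node_buf = NULL;
    if (is_leader) {
        for (int i = 0; i < node_size; i++) {
            node_displs[i] = node_total;
            node_total += node_counts[i];
        }
        node_buf = malloc((node_total ? node_total : 1) * sizeof(int));
    }
    MPI_Gatherv(sbuf, scount, MPI_INT,
                node_buf, node_counts, node_displs, MPI_INT, 0, node_comm);

    /* Stage 2: leaders send their packed node blocks to the root's leader
     * (leader rank 0 == global rank 0 under the stated placement). */
    int *tmp = NULL;
    if (is_leader) {
        int nleaders, lrank;
        MPI_Comm_size(leader_comm, &nleaders);
        MPI_Comm_rank(leader_comm, &lrank);
        int *blk_counts = (lrank == 0) ? malloc(nleaders * sizeof(int)) : NULL;
        int *blk_displs = (lrank == 0) ? malloc(nleaders * sizeof(int)) : NULL;
        MPI_Gather(&node_total, 1, MPI_INT, blk_counts, 1, MPI_INT, 0, leader_comm);
        if (lrank == 0) {
            int total = 0;
            for (int i = 0; i < nleaders; i++) {
                blk_displs[i] = total;
                total += blk_counts[i];
            }
            tmp = malloc((total ? total : 1) * sizeof(int));
        }
        MPI_Gatherv(node_buf, node_total, MPI_INT,
                    tmp, blk_counts, blk_displs, MPI_INT, 0, leader_comm);
        free(blk_counts);
        free(blk_displs);
        MPI_Comm_free(&leader_comm);
    }

    /* Stage 3: the root copies each rank's block from the packed temporary
     * buffer into the final rcounts/displs layout.  This extra copy is the
     * large-message cost mentioned above; with --rank-by node the data would
     * additionally arrive out of global-rank order and need reordering. */
    if (rank == 0) {
        int off = 0;
        for (int i = 0; i < size; i++) {
            memcpy(rbuf + displs[i], tmp + off, rcounts[i] * sizeof(int));
            off += rcounts[i];
        }
    }

    free(tmp);
    free(node_buf);
    free(node_counts);
    free(node_displs);
    MPI_Comm_free(&node_comm);
    return MPI_SUCCESS;
}
```

The PR's actual implementation additionally has to handle arbitrary datatypes, non-zero roots and reordered ranks, which is where the extra bookkeeping and memory copies described above come from.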
OMB comparison between linear and hierarchical MPI_Gatherv

Happy Path Performance without reordering (--rank-by slot)

2 nodes x 64ppn = 128 procs
Default linear gatherv
Hierarchical gatherv

16 nodes x 64ppn = 1024 procs
Default linear gatherv
Hierarchical gatherv

64 nodes x 64ppn = 4096 procs (reduced message size due to memory limit)
Default linear gatherv
Hierarchical gatherv

Worst Case Performance with Reordering (--rank-by node)

Only showing results for hierarchical gatherv - the default linear algorithm is not affected by rank-by.

2 nodes x 64ppn = 128 procs
16 nodes x 64ppn = 1024 procs
64 nodes x 64ppn = 4096 procs
OMB comparison between linear and hierarchical MPI_Scatterv

Happy Path Performance without reordering (--rank-by slot)

2 nodes x 64ppn = 128 procs
Default linear scatterv
Hierarchical scatterv

16 nodes x 64ppn = 1024 procs
Default linear scatterv
Hierarchical scatterv

64 nodes x 64ppn = 4096 procs (reduced message size due to memory limit)
Default linear scatterv
Hierarchical scatterv

Worst Case Performance with Reordering (--rank-by node)

Only showing results for hierarchical scatterv - the default linear algorithm is not affected by rank-by.

2 nodes x 64ppn = 128 procs
16 nodes x 64ppn = 1024 procs
64 nodes x 64ppn = 4096 procs
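For reproducibility, here is a hedged example of how numbers like the above are typically collected with the OSU Micro-Benchmarks; the exact binaries, process counts and flags are assumptions and are not taken from this PR.

```sh
# Happy path: ranks packed node-contiguously (64 ranks per node).
mpirun -np 128 --map-by ppr:64:node --rank-by slot ./osu_gatherv
# Worst case for the hierarchical algorithms: ranks interleaved across nodes.
mpirun -np 128 --map-by ppr:64:node --rank-by node ./osu_scatterv
```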