devreal commented Jan 31, 2023

Increase the segment sizes for bcast, reduce, and allreduce to 512k. On modern machines, larger segments seem to be more efficient because they reduce the overhead of segmenting (fewer messages, better chance of saturating the network).
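To make the "fewer messages" argument concrete, here is a minimal sketch that counts how many segments a single message is split into at the old and new sizes. The helper `segment_count` is hypothetical and not part of Open MPI. The 4 MiB message size is chosen only for illustration, and the sketch assumes "64k" and "512k" mean 64 KiB and 512 KiB.

```python
KIB = 1024


def segment_count(message_bytes: int, segment_bytes: int) -> int:
    """Return how many segments are needed to cover a message.

    Hypothetical helper for illustration only; this is plain ceiling
    division, not code taken from Open MPI.
    """
    if message_bytes < 0 or segment_bytes <= 0:
        raise ValueError("message size must be >= 0 and segment size > 0")
    return -(-message_bytes // segment_bytes)  # ceiling division


# Illustrative message size (assumption): 4 MiB.
message = 4 * 1024 * KIB

old_segments = segment_count(message, 64 * KIB)   # assumed meaning of "64k"
new_segments = segment_count(message, 512 * KIB)  # assumed meaning of "512k"

print(f"64k segments:  {old_segments}")
print(f"512k segments: {new_segments}")
```

Under these assumptions the message is split into 64 segments at the old size and 8 at the new one, so each pipeline stage pays the per-segment cost eight times less often. This only counts segments; it does not model latency or bandwidth.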

Example for increased segment sizes on Hawk (64 core AMD EPYC Rome, ConnectX-6):

Reduce with 64k (current segment size)
Screen Shot 2023-01-31 at 09 59 11

Reduce with 512k (new segment size)
Screen Shot 2023-01-31 at 09 59 42

Note the lower latency on the right side of the plots. The change in segment size yields an improvement of about 10x for han over tuned. There is no data for han over sm because sm crashes at this segment size.

Increase segment sizes for bcast, reduce, and allreduce to 512k.
On modern machines, higher segment sizes seem to be more efficient
as they reduce the overhead of segmenting.

Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
A larger segment size helps reduce the overhead of segmenting.
The 512k size matches the segment size used by coll/han.

Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>

devreal commented Feb 28, 2023

Some additional measurements on Hawk for allreduce, bcast, and reduce with 4 MB operations on 32 nodes with 64 processes per node. Higher segment sizes are clearly favorable for HAN. I also tried to set the segment size for coll/tuned, but that mechanism seems to be broken.

allreduce_64_osu_han_segsize_32_64_1971862 hawk-pbs5
bcast_64_osu_han_segsize_32_64_1971862 hawk-pbs5
reduce_64_osu_han_segsize_32_64_1971862 hawk-pbs5
