
Implement hierarchical MPI_Gatherv and MPI_Scatterv #12376

Merged: 3 commits merged into open-mpi:main from the han_gatherv branch on Mar 23, 2024

Conversation

@wenduwan (Contributor) commented Feb 26, 2024

Summary

Add gatherv and scatterv implementations to optimize large-scale communication across multiple nodes with multiple processes per node, by avoiding high-incast traffic on the root process.

Because *V collectives do not have equal datatype/count on every process, they do not natively support message-size-based tuning without an additional global communication.

Similar to gather and allgather, the hierarchical gatherv requires a temporary buffer and an extra memory copy to handle out-of-order data or non-contiguous placement in the output buffer, which results in worse performance for large messages compared to the linear implementation.
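For orientation, here is a minimal, self-contained sketch of the two-level idea (intra-node gatherv into a node leader's temporary buffer, inter-node gatherv of the packed blocks to the root, then a copy into the user's receive buffer). This is not the coll/han code: it assumes contiguous MPI_CHAR data, root rank 0, and a node-by-node rank mapping (the happy path without reordering), and the function name two_level_gatherv_sketch is made up for illustration.

```c
/* A minimal two-level gatherv sketch (NOT the coll/han implementation).
 * Assumptions: contiguous MPI_CHAR data, root is rank 0 of comm, and ranks
 * are mapped node by node in comm order. The real code handles arbitrary
 * datatypes, roots and rank-to-node mappings. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

void two_level_gatherv_sketch(const char *sbuf, int scount, char *rbuf,
                              const int *rcounts, const int *displs,
                              MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* Split into one communicator per node and one communicator of leaders. */
    MPI_Comm node_comm, leader_comm = MPI_COMM_NULL;
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, rank, MPI_INFO_NULL,
                        &node_comm);
    int node_rank, node_size;
    MPI_Comm_rank(node_comm, &node_rank);
    MPI_Comm_size(node_comm, &node_size);
    MPI_Comm_split(comm, (0 == node_rank) ? 0 : MPI_UNDEFINED, rank,
                   &leader_comm);

    /* Step 1: the node leader learns how much each local process sends. */
    int *ncounts = NULL, *ndispls = NULL, ntotal = 0;
    char *nbuf = NULL;
    if (0 == node_rank) {
        ncounts = malloc(node_size * sizeof(int));
        ndispls = malloc(node_size * sizeof(int));
    }
    MPI_Gather(&scount, 1, MPI_INT, ncounts, 1, MPI_INT, 0, node_comm);
    if (0 == node_rank) {
        for (int i = 0; i < node_size; i++) {
            ndispls[i] = ntotal;
            ntotal += ncounts[i];
        }
        nbuf = malloc(ntotal);        /* leader's temporary buffer */
    }

    /* Step 2: intra-node gatherv into the leader's temporary buffer. */
    MPI_Gatherv(sbuf, scount, MPI_CHAR, nbuf, ncounts, ndispls, MPI_CHAR,
                0, node_comm);

    /* Step 3: inter-node gatherv from the leaders to the root. */
    if (0 == node_rank) {
        int nleaders;
        MPI_Comm_size(leader_comm, &nleaders);
        int *lcounts = NULL, *ldispls = NULL, total = 0;
        char *tmp = NULL;
        if (0 == rank) {
            lcounts = malloc(nleaders * sizeof(int));
            ldispls = malloc(nleaders * sizeof(int));
        }
        MPI_Gather(&ntotal, 1, MPI_INT, lcounts, 1, MPI_INT, 0, leader_comm);
        if (0 == rank) {
            for (int i = 0; i < nleaders; i++) {
                ldispls[i] = total;
                total += lcounts[i];
            }
            tmp = malloc(total);      /* root's temporary buffer */
        }
        MPI_Gatherv(nbuf, ntotal, MPI_CHAR, tmp, lcounts, ldispls, MPI_CHAR,
                    0, leader_comm);

        /* Step 4: the root copies blocks from the temporary buffer to the
         * user-supplied displacements - the extra memcpy that hurts large
         * messages. */
        if (0 == rank) {
            size_t off = 0;
            for (int r = 0; r < size; r++) {
                memcpy(rbuf + displs[r], tmp + off, (size_t)rcounts[r]);
                off += (size_t)rcounts[r];
            }
            free(tmp); free(lcounts); free(ldispls);
        }
        free(nbuf); free(ncounts); free(ndispls);
        MPI_Comm_free(&leader_comm);
    }
    MPI_Comm_free(&node_comm);
}
```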

OMB comparison between linear and hierarchical MPI_Gatherv

Happy Path Performance without reordering (--rank-by slot)
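The exact benchmark command line is not included in the PR; a plausible invocation for the 128-process case, assuming the OSU osu_gatherv binary with full (-f) output and HAN selection via the component priority, could look like the following (flag values are assumptions, not taken from the PR):

```
# Linear (default) gatherv: exclude the HAN component.
mpirun -n 128 --map-by ppr:64:node --rank-by slot --mca coll ^han ./osu_gatherv -f

# Hierarchical (HAN) gatherv: raise the HAN priority so it is selected.
mpirun -n 128 --map-by ppr:64:node --rank-by slot --mca coll_han_priority 100 ./osu_gatherv -f
```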

2 nodes x 64ppn = 128 procs

Default linear gatherv

[1,0]<stdout>: # OSU MPI Gatherv Latency Test v7.2
[1,0]<stdout>: # Datatype: MPI_CHAR.
[1,0]<stdout>: # Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
[1,0]<stdout>: 1                       4.64              0.35             49.63        1000
[1,0]<stdout>: 2                       4.67              0.35             47.83        1000
[1,0]<stdout>: 4                       4.40              0.35             46.51        1000
[1,0]<stdout>: 8                       4.48              0.34             47.04        1000
[1,0]<stdout>: 16                      4.46              0.35             47.13        1000
[1,0]<stdout>: 32                      4.50              0.35             47.31        1000
[1,0]<stdout>: 64                      5.84              0.35             53.00        1000
[1,0]<stdout>: 128                     5.60              0.35             52.76        1000
[1,0]<stdout>: 256                    12.01              0.35             77.85        1000
[1,0]<stdout>: 512                    11.66              0.35             76.06        1000
[1,0]<stdout>: 1024                   11.49              0.37             83.19        1000
[1,0]<stdout>: 2048                   11.35              0.38             88.96        1000
[1,0]<stdout>: 4096                   11.13              0.41             95.71        1000
[1,0]<stdout>: 8192                   60.45             13.57            216.82        1000
[1,0]<stdout>: 16384                  78.74             12.01            312.69         100
[1,0]<stdout>: 32768                 116.05             20.48            489.80         100
[1,0]<stdout>: 65536                 189.58             38.27            857.93         100
[1,0]<stdout>: 131072                742.51             48.57           1577.27         100
[1,0]<stdout>: 262144               1448.12             77.61           3137.05         100
[1,0]<stdout>: 524288               3041.88             78.65           6115.82         100
[1,0]<stdout>: 1048576              5046.68           1098.15           8445.72         100

real	0m7.592s
user	0m0.032s
sys	0m0.048s

Hierarchical gatherv

[1,0]<stdout>: # OSU MPI Gatherv Latency Test v7.2
[1,0]<stdout>: # Datatype: MPI_CHAR.
[1,0]<stdout>: # Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
[1,0]<stdout>: 1                      13.24              0.62             57.08        1000
[1,0]<stdout>: 2                      13.58              0.55             59.21        1000
[1,0]<stdout>: 4                      13.63              0.53             59.58        1000
[1,0]<stdout>: 8                      13.65              0.53             59.65        1000
[1,0]<stdout>: 16                     13.63              0.54             60.32        1000
[1,0]<stdout>: 32                     13.62              0.53             60.47        1000
[1,0]<stdout>: 64                     13.69              0.54             61.90        1000
[1,0]<stdout>: 128                    13.47              0.60             61.72        1000
[1,0]<stdout>: 256                    30.31              0.86            119.91        1000
[1,0]<stdout>: 512                    29.73              0.80            121.20        1000
[1,0]<stdout>: 1024                   29.74              0.84            125.64        1000
[1,0]<stdout>: 2048                   30.49              0.83            162.71        1000
[1,0]<stdout>: 4096                   28.90              0.79            174.88        1000
[1,0]<stdout>: 8192                   85.67             25.98            282.71        1000
[1,0]<stdout>: 16384                 104.62             25.79            323.59         100
[1,0]<stdout>: 32768                 144.11             29.70            452.66         100
[1,0]<stdout>: 65536                 229.81             33.26            721.29         100
[1,0]<stdout>: 131072                383.63             40.88           1195.43         100
[1,0]<stdout>: 262144                751.58             63.67           2270.66         100
[1,0]<stdout>: 524288               2977.58            282.49          13139.84         100
[1,0]<stdout>: 1048576              5969.17            548.00          25890.52         100

real	0m10.551s
user	0m0.038s
sys	0m0.037s

16 nodes x 64ppn = 1024 procs

Default linear gatherv

[1,0]<stdout>: # OSU MPI Gatherv Latency Test v7.2
[1,0]<stdout>: # Datatype: MPI_CHAR.
[1,0]<stdout>: # Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
[1,0]<stdout>: 1                       1.68              0.34            634.42        1000
[1,0]<stdout>: 2                       1.69              0.34            638.99        1000
[1,0]<stdout>: 4                       1.69              0.34            642.46        1000
[1,0]<stdout>: 8                       1.70              0.25            652.73        1000
[1,0]<stdout>: 16                      1.71              0.32            653.11        1000
[1,0]<stdout>: 32                      1.72              0.26            654.13        1000
[1,0]<stdout>: 64                      1.76              0.27            679.29        1000
[1,0]<stdout>: 128                     1.78              0.25            681.59        1000
[1,0]<stdout>: 256                     1.97              0.34            693.36        1000
[1,0]<stdout>: 512                     2.02              0.34            717.60        1000
[1,0]<stdout>: 1024                    2.12              0.35            776.07        1000
[1,0]<stdout>: 2048                    2.25              0.37            898.00        1000
[1,0]<stdout>: 4096                    2.45              0.40           1054.73        1000
[1,0]<stdout>: 8192                  176.12             31.85           1413.68        1000
[1,0]<stdout>: 16384                 359.66             36.26           2521.24         100
[1,0]<stdout>: 32768                 722.06             40.71           4523.63         100
[1,0]<stdout>: 65536                1425.91             45.81           8647.08         100
[1,0]<stdout>: 131072               4968.47            185.44          15497.74         100
[1,0]<stdout>: 262144              16032.77            211.21          30089.29         100
[1,0]<stdout>: 524288              39176.47            272.97          58606.06         100

real	0m38.337s
user	0m0.079s
sys	0m0.088s

Hierarchical gatherv

[1,0]<stdout>: # OSU MPI Gatherv Latency Test v7.2
[1,0]<stdout>: # Datatype: MPI_CHAR.
[1,0]<stdout>: # Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
[1,0]<stdout>: 1                      13.10              0.59             66.92        1000
[1,0]<stdout>: 2                      13.32              0.56             65.37        1000
[1,0]<stdout>: 4                      13.34              0.54             65.17        1000
[1,0]<stdout>: 8                      13.37              0.53             69.54        1000
[1,0]<stdout>: 16                     13.34              0.54             67.34        1000
[1,0]<stdout>: 32                     13.35              0.52             66.23        1000
[1,0]<stdout>: 64                     13.37              0.52             68.53        1000
[1,0]<stdout>: 128                    13.73              0.55             67.83        1000
[1,0]<stdout>: 256                    39.21              0.67            206.27        1000
[1,0]<stdout>: 512                    37.51              0.65            172.15        1000
[1,0]<stdout>: 1024                   37.38              0.70            216.41        1000
[1,0]<stdout>: 2048                   38.55              0.74            303.33        1000
[1,0]<stdout>: 4096                   35.55              0.77            480.86        1000
[1,0]<stdout>: 8192                  100.68             25.96            866.76        1000
[1,0]<stdout>: 16384                 125.71             24.17            933.73         100
[1,0]<stdout>: 32768                 177.80             26.00           2026.15         100
[1,0]<stdout>: 65536                 290.06             33.58           3702.86         100
[1,0]<stdout>: 131072                479.79             41.67           6243.50         100
[1,0]<stdout>: 262144                924.70             63.57          12132.51         100
[1,0]<stdout>: 524288               4508.58            278.82          32352.78         100

real	0m22.046s
user	0m0.163s
sys	0m0.130s

64 nodes x 64ppn = 4096 procs (reduced message size due to memory limit)

Default linear gatherv

[1,0]<stdout>: # OSU MPI Gatherv Latency Test v7.2
[1,0]<stdout>: # Datatype: MPI_CHAR.
[1,0]<stdout>: # Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
[1,0]<stdout>: 1                       1.57              0.34           3921.16        1000
[1,0]<stdout>: 2                       1.59              0.34           3937.05        1000
[1,0]<stdout>: 4                       1.59              0.33           3973.91        1000
[1,0]<stdout>: 8                       1.58              0.33           3920.90        1000
[1,0]<stdout>: 16                      1.58              0.33           3923.34        1000
[1,0]<stdout>: 32                      1.57              0.33           3879.77        1000
[1,0]<stdout>: 64                      1.57              0.33           3869.90        1000
[1,0]<stdout>: 128                     1.57              0.33           3892.23        1000
[1,0]<stdout>: 256                     1.57              0.33           3676.15        1000
[1,0]<stdout>: 512                     1.55              0.34           3527.66        1000
[1,0]<stdout>: 1024                    1.61              0.35           3735.54        1000

real	1m18.041s
user	0m0.688s
sys	0m0.827s

Hierarchical gatherv

[1,0]<stdout>: # OSU MPI Gatherv Latency Test v7.2
[1,0]<stdout>: # Datatype: MPI_CHAR.
[1,0]<stdout>: # Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
[1,0]<stdout>: 1                      12.96              0.67            122.70        1000
[1,0]<stdout>: 2                      13.28              0.55            138.74        1000
[1,0]<stdout>: 4                      13.30              0.50            119.33        1000
[1,0]<stdout>: 8                      13.33              0.50            124.42        1000
[1,0]<stdout>: 16                     13.31              0.49            124.79        1000
[1,0]<stdout>: 32                     13.31              0.49            121.69        1000
[1,0]<stdout>: 64                     13.33              0.49            124.90        1000
[1,0]<stdout>: 128                    13.69              0.49            120.31        1000
[1,0]<stdout>: 256                    38.85              0.63            238.08        1000
[1,0]<stdout>: 512                    37.89              0.64            336.00        1000
[1,0]<stdout>: 1024                   38.44              0.68            525.95        1000

real	0m32.456s
user	0m0.790s
sys	0m0.636s

Worst Case Performance with Reordering (--rank-by node)

Only results for the hierarchical gatherv are shown; the default linear algorithm is not affected by --rank-by.

2 nodes x 64ppn = 128 procs

[1,0]<stdout>: # OSU MPI Gatherv Latency Test v7.2
[1,0]<stdout>: # Datatype: MPI_CHAR.
[1,0]<stdout>: # Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
[1,0]<stdout>: 1                      12.93              1.04             57.49        1000
[1,0]<stdout>: 2                      13.34              0.59             59.37        1000
[1,0]<stdout>: 4                      13.36              0.51             59.89        1000
[1,0]<stdout>: 8                      13.43              0.53             60.05        1000
[1,0]<stdout>: 16                     13.37              0.51             59.77        1000
[1,0]<stdout>: 32                     13.39              0.51             60.57        1000
[1,0]<stdout>: 64                     13.43              0.53             64.19        1000
[1,0]<stdout>: 128                    13.09              0.53             61.37        1000
[1,0]<stdout>: 256                    36.45              0.76            142.11        1000
[1,0]<stdout>: 512                    35.59              0.66            142.69        1000
[1,0]<stdout>: 1024                   34.87              0.74            143.69        1000
[1,0]<stdout>: 2048                   35.39              0.75            182.90        1000
[1,0]<stdout>: 4096                   32.51              0.77            194.39        1000
[1,0]<stdout>: 8192                   87.38             11.71            307.09        1000
[1,0]<stdout>: 16384                 109.99             11.96            358.53         100
[1,0]<stdout>: 32768                 150.94             21.37            527.55         100
[1,0]<stdout>: 65536                 238.63             18.98            884.33         100
[1,0]<stdout>: 131072                400.55             25.91           1519.31         100
[1,0]<stdout>: 262144                773.05             52.27           2917.65         100
[1,0]<stdout>: 524288               3143.42             75.64          18756.92         100
[1,0]<stdout>: 1048576              6288.95            142.21          37001.20         100

real	0m13.099s
user	0m0.037s
sys	0m0.041s

16 nodes x 64ppn = 1024 procs

[1,0]<stdout>: # OSU MPI Gatherv Latency Test v7.2
[1,0]<stdout>: # Datatype: MPI_CHAR.
[1,0]<stdout>: # Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
[1,0]<stdout>: 1                      12.97              0.99             85.45        1000
[1,0]<stdout>: 2                      13.21              0.62             85.09        1000
[1,0]<stdout>: 4                      13.22              0.53             84.87        1000
[1,0]<stdout>: 8                      13.24              0.55             86.50        1000
[1,0]<stdout>: 16                     13.22              0.52             84.07        1000
[1,0]<stdout>: 32                     13.25              0.53             85.32        1000
[1,0]<stdout>: 64                     13.25              0.52             90.10        1000
[1,0]<stdout>: 128                    13.57              0.55             69.28        1000
[1,0]<stdout>: 256                    37.74              0.63            173.56        1000
[1,0]<stdout>: 512                    36.96              0.66            190.24        1000
[1,0]<stdout>: 1024                   36.54              0.69            243.03        1000
[1,0]<stdout>: 2048                   37.76              0.73            377.21        1000
[1,0]<stdout>: 4096                   35.28              0.77            595.56        1000
[1,0]<stdout>: 8192                  100.17             25.75           1109.30        1000
[1,0]<stdout>: 16384                 127.45             26.63           1518.66         100
[1,0]<stdout>: 32768                 181.39             29.25           2992.01         100
[1,0]<stdout>: 65536                 566.74             25.20          23168.03         100
[1,0]<stdout>: 131072                931.37             46.47          39313.10         100
[1,0]<stdout>: 262144               1732.46             76.02          71912.28         100
[1,0]<stdout>: 524288               5978.15            161.02         144052.29         100

real	0m49.669s
user	0m0.100s
sys	0m0.199s

64 nodes x 64ppn = 4096 procs

[1,0]<stdout>: # OSU MPI Gatherv Latency Test v7.2
[1,0]<stdout>: # Datatype: MPI_CHAR.
[1,0]<stdout>: # Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
[1,0]<stdout>: 1                      12.88              0.94            167.05        1000
[1,0]<stdout>: 2                      13.15              0.55            167.64        1000
[1,0]<stdout>: 4                      13.17              0.49            170.75        1000
[1,0]<stdout>: 8                      13.18              0.49            167.91        1000
[1,0]<stdout>: 16                     13.18              0.49            163.97        1000
[1,0]<stdout>: 32                     13.19              0.49            182.50        1000
[1,0]<stdout>: 64                     13.21              0.49            169.98        1000
[1,0]<stdout>: 128                    13.59              0.49            188.04        1000
[1,0]<stdout>: 256                    40.82              0.64            299.74        1000
[1,0]<stdout>: 512                    40.03              0.66            426.60        1000
[1,0]<stdout>: 1024                   40.59              0.67            663.56        1000

real	0m33.257s
user	0m0.843s
sys	0m0.751s

OMB comparison between linear and hierarchical MPI_Scatterv

Happy Path Performance without reordering (--rank-by slot)

2 nodes x 64ppn = 128 procs

Default linear scatterv

[1,0]<stdout>: # OSU MPI Scatterv Latency Test v7.2
[1,0]<stdout>: # Datatype: MPI_CHAR.
[1,0]<stdout>: # Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
[1,0]<stdout>: 1                      16.52              1.35             62.98        1000
[1,0]<stdout>: 2                      17.07              1.36             63.90        1000
[1,0]<stdout>: 4                      17.11              1.34             63.99        1000
[1,0]<stdout>: 8                      17.21              1.26             64.46        1000
[1,0]<stdout>: 16                     17.26              1.28             64.58        1000
[1,0]<stdout>: 32                     17.20              1.31             64.48        1000
[1,0]<stdout>: 64                     17.23              1.25             64.44        1000
[1,0]<stdout>: 128                    17.74              1.36             65.36        1000
[1,0]<stdout>: 256                    21.51              1.60             70.10        1000
[1,0]<stdout>: 512                    22.20              1.67             71.19        1000
[1,0]<stdout>: 1024                   26.92              1.84             78.22        1000
[1,0]<stdout>: 2048                   30.35              1.95             84.08        1000
[1,0]<stdout>: 4096                   36.27              2.58             96.20        1000
[1,0]<stdout>: 8192                  531.94              3.30           1825.26        1000
[1,0]<stdout>: 16384                 570.35              3.26           1908.61         100
[1,0]<stdout>: 32768                 644.08              4.70           2055.98         100
[1,0]<stdout>: 65536                 793.32              7.12           2341.72         100
[1,0]<stdout>: 131072               1530.23             15.22           4681.64         100
[1,0]<stdout>: 262144               2040.22             27.88           5643.87         100
[1,0]<stdout>: 524288               3752.15             55.37           9855.75         100
[1,0]<stdout>: 1048576              5812.79            121.96          12268.16         100

real	0m13.233s
user	0m0.000s
sys	0m0.082s

Hierarchical scatterv

[1,0]<stdout>: # OSU MPI Scatterv Latency Test v7.2
[1,0]<stdout>: # Datatype: MPI_CHAR.
[1,0]<stdout>: # Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
[1,0]<stdout>: 1                      23.93              1.86             46.99        1000
[1,0]<stdout>: 2                      23.98              1.95             47.00        1000
[1,0]<stdout>: 4                      24.08              1.94             47.24        1000
[1,0]<stdout>: 8                      24.21              2.00             47.43        1000
[1,0]<stdout>: 16                     24.29              1.90             47.70        1000
[1,0]<stdout>: 32                     24.36              1.94             47.80        1000
[1,0]<stdout>: 64                     25.11              2.02             49.11        1000
[1,0]<stdout>: 128                    25.88              2.12             50.39        1000
[1,0]<stdout>: 256                    29.04              2.28             56.47        1000
[1,0]<stdout>: 512                    30.98              2.50             60.44        1000
[1,0]<stdout>: 1024                   35.34              2.85             68.55        1000
[1,0]<stdout>: 2048                   55.07              2.73            107.68        1000
[1,0]<stdout>: 4096                   64.59              3.29            126.34        1000
[1,0]<stdout>: 8192                  159.80              4.10            335.16        1000
[1,0]<stdout>: 16384                 181.21              4.63            409.44         100
[1,0]<stdout>: 32768                 265.88              6.04            598.07         100
[1,0]<stdout>: 65536                 441.75              9.02            989.17         100
[1,0]<stdout>: 131072                720.13             15.06           1594.35         100
[1,0]<stdout>: 262144               1272.70             29.57           2743.94         100
[1,0]<stdout>: 524288               6852.35             63.08          15249.06         100
[1,0]<stdout>: 1048576             13692.94            138.20          30069.45         100

real	0m10.032s
user	0m0.049s
sys	0m0.097s

16 nodes x 64ppn = 1024 procs

Default linear scatterv

[1,0]<stdout>: # OSU MPI Scatterv Latency Test v7.2
[1,0]<stdout>: # Datatype: MPI_CHAR.
[1,0]<stdout>: # Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
[1,0]<stdout>: 1                     159.78              1.85            345.43        1000
[1,0]<stdout>: 2                     158.27              1.29            343.23        1000
[1,0]<stdout>: 4                     158.60              1.82            342.69        1000
[1,0]<stdout>: 8                     157.31              1.26            339.72        1000
[1,0]<stdout>: 16                    157.46              1.77            340.10        1000
[1,0]<stdout>: 32                    157.55              1.26            340.09        1000
[1,0]<stdout>: 64                    158.66              1.76            341.39        1000
[1,0]<stdout>: 128                   159.87              1.31            344.11        1000
[1,0]<stdout>: 256                   168.50              2.04            354.58        1000
[1,0]<stdout>: 512                   173.21              1.63            363.31        1000
[1,0]<stdout>: 1024                  185.34              2.26            384.19        1000
[1,0]<stdout>: 2048                  208.39              1.90            429.41        1000
[1,0]<stdout>: 4096                  245.61              2.77            501.90        1000
[1,0]<stdout>: 8192                11961.05              3.12          25562.80        1000
[1,0]<stdout>: 16384               12324.61              4.83          26333.23         100
[1,0]<stdout>: 32768               12902.40              5.46          27472.94         100
[1,0]<stdout>: 65536               13983.43              9.07          29607.63         100
[1,0]<stdout>: 131072              29112.35             14.61          61728.65         100
[1,0]<stdout>: 262144              32925.09             27.35          69120.39         100
[1,0]<stdout>: 524288              55311.94             55.04         115649.66         100

real	1m27.318s
user	0m0.053s
sys	0m0.115s

Hierarchical scatterv

[1,0]<stdout>: # OSU MPI Scatterv Latency Test v7.2
[1,0]<stdout>: # Datatype: MPI_CHAR.
[1,0]<stdout>: # Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
[1,0]<stdout>: 1                      29.80             12.80             46.78        1000
[1,0]<stdout>: 2                      29.73             12.76             46.74        1000
[1,0]<stdout>: 4                      29.69             12.78             46.60        1000
[1,0]<stdout>: 8                      30.00             12.90             46.58        1000
[1,0]<stdout>: 16                     30.83             13.20             47.40        1000
[1,0]<stdout>: 32                     30.69             13.41             47.83        1000
[1,0]<stdout>: 64                     31.86             14.10             49.73        1000
[1,0]<stdout>: 128                    34.52             15.39             50.35        1000
[1,0]<stdout>: 256                    40.02             15.11             62.27        1000
[1,0]<stdout>: 512                    49.78             16.71             77.03        1000
[1,0]<stdout>: 1024                   69.12             20.27            111.07        1000
[1,0]<stdout>: 2048                  134.34             20.38            189.22        1000
[1,0]<stdout>: 4096                  205.15             21.96            284.62        1000
[1,0]<stdout>: 8192                  433.22             23.15            611.64        1000
[1,0]<stdout>: 16384                 764.50             17.42           1055.88         100
[1,0]<stdout>: 32768                1590.97             19.54           2028.55         100
[1,0]<stdout>: 65536                3008.15             23.42           3724.62         100
[1,0]<stdout>: 131072               5573.11             30.70           6614.90         100
[1,0]<stdout>: 262144              10475.43             43.00          12152.31         100
[1,0]<stdout>: 524288              27654.88             79.65          33890.64         100

real	0m13.903s
user	0m0.133s
sys	0m0.245s

64 nodes x 64ppn = 4096 procs (reduced message size due to memory limit)

Default linear scatterv

[1,0]<stdout>: # OSU MPI Scatterv Latency Test v7.2
[1,0]<stdout>: # Datatype: MPI_CHAR.
[1,0]<stdout>: # Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
[1,0]<stdout>: 1                      63.43             10.58             92.32        1000
[1,0]<stdout>: 2                      63.92             10.69             92.41        1000
[1,0]<stdout>: 4                      63.29             10.61             92.22        1000
[1,0]<stdout>: 8                      63.53             10.59             93.07        1000
[1,0]<stdout>: 16                     63.92             10.76             93.06        1000
[1,0]<stdout>: 32                     63.88             10.61             94.49        1000
[1,0]<stdout>: 64                     65.36             10.81             98.71        1000
[1,0]<stdout>: 128                   835.28             10.80           1700.16        1000
[1,0]<stdout>: 256                   872.38             10.83           1762.42        1000
[1,0]<stdout>: 512                   911.33             10.94           1839.79        1000
[1,0]<stdout>: 1024                  982.67             10.97           1966.01        1000

real	0m39.731s
user	0m0.654s
sys	0m0.684s

Hierarchical scatterv

[1,0]<stdout>: # OSU MPI Scatterv Latency Test v7.2
[1,0]<stdout>: # Datatype: MPI_CHAR.
[1,0]<stdout>: # Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
[1,0]<stdout>: 1                      50.61             32.31             72.00        1000
[1,0]<stdout>: 2                      51.36             33.09             72.31        1000
[1,0]<stdout>: 4                      51.88             32.99             78.24        1000
[1,0]<stdout>: 8                      52.24             33.25             73.50        1000
[1,0]<stdout>: 16                     52.06             33.23             74.43        1000
[1,0]<stdout>: 32                     53.08             33.29             76.45        1000
[1,0]<stdout>: 64                     55.50             34.96             81.43        1000
[1,0]<stdout>: 128                    59.38             33.88             97.11        1000
[1,0]<stdout>: 256                    66.71             36.11             99.07        1000
[1,0]<stdout>: 512                    85.26             38.64            141.51        1000
[1,0]<stdout>: 1024                  140.49             46.03            228.36        1000

real	0m12.118s
user	0m0.821s
sys	0m0.877s

Worst Case Performance with Reordering (--rank-by node)

Only results for the hierarchical scatterv are shown; the default linear algorithm is not affected by --rank-by.

2 nodes x 64ppn = 128 procs

[1,0]<stdout>: # OSU MPI Scatterv Latency Test v7.2
[1,0]<stdout>: # Datatype: MPI_CHAR.
[1,0]<stdout>: # Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
[1,0]<stdout>: 1                      25.60              4.03             48.02        1000
[1,0]<stdout>: 2                      26.05              4.05             49.00        1000
[1,0]<stdout>: 4                      25.94              4.04             48.79        1000
[1,0]<stdout>: 8                      26.08              3.99             49.09        1000
[1,0]<stdout>: 16                     26.15              4.05             49.15        1000
[1,0]<stdout>: 32                     26.53              4.21             49.79        1000
[1,0]<stdout>: 64                     27.12              4.29             50.76        1000
[1,0]<stdout>: 128                    27.88              4.44             51.96        1000
[1,0]<stdout>: 256                    32.47              4.81             60.71        1000
[1,0]<stdout>: 512                    33.36              5.52             62.07        1000
[1,0]<stdout>: 1024                   39.12              6.80             72.22        1000
[1,0]<stdout>: 2048                   61.11              8.60            113.52        1000
[1,0]<stdout>: 4096                   76.31             13.76            137.90        1000
[1,0]<stdout>: 8192                  181.19             22.81            357.19        1000
[1,0]<stdout>: 16384                 217.13             43.21            437.48         100
[1,0]<stdout>: 32768                 336.13             78.06            671.74         100
[1,0]<stdout>: 65536                 580.73            148.29           1142.75         100
[1,0]<stdout>: 131072                986.48            289.44           1857.83         100
[1,0]<stdout>: 262144               1878.63            616.36           3279.75         100
[1,0]<stdout>: 524288              18502.51          11494.99          27117.56         100
[1,0]<stdout>: 1048576             36537.57          22721.70          53158.69         100

real	0m13.962s
user	0m0.057s
sys	0m0.088s

16 nodes x 64ppn = 1024 procs

[1,0]<stdout>: # OSU MPI Scatterv Latency Test v7.2
[1,0]<stdout>: # Datatype: MPI_CHAR.
[1,0]<stdout>: # Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
[1,0]<stdout>: 1                      56.74             40.34             74.19        1000
[1,0]<stdout>: 2                      56.53             40.20             74.12        1000
[1,0]<stdout>: 4                      56.78             40.15             73.90        1000
[1,0]<stdout>: 8                      56.73             40.36             74.07        1000
[1,0]<stdout>: 16                     57.96             40.96             75.21        1000
[1,0]<stdout>: 32                     58.12             41.36             75.79        1000
[1,0]<stdout>: 64                     59.50             42.23             77.72        1000
[1,0]<stdout>: 128                    63.65             44.93             80.03        1000
[1,0]<stdout>: 256                    71.07             47.87             90.31        1000
[1,0]<stdout>: 512                    95.40             65.06            120.45        1000
[1,0]<stdout>: 1024                  139.92             96.81            181.46        1000
[1,0]<stdout>: 2048                  250.04            129.94            310.99        1000
[1,0]<stdout>: 4096                  383.59            191.28            457.32        1000
[1,0]<stdout>: 8192                  737.86            326.46            925.51        1000
[1,0]<stdout>: 16384                1464.46            668.41           1716.35         100
[1,0]<stdout>: 32768                2933.77           1370.01           3353.69         100
[1,0]<stdout>: 65536               26660.53          24117.55          29337.44         100
[1,0]<stdout>: 131072              46952.71          43746.51          50632.84         100
[1,0]<stdout>: 262144              88646.68          81590.93          94318.28         100
[1,0]<stdout>: 524288             181359.35         163479.02         190249.45         100

real	0m49.024s
user	0m0.158s
sys	0m0.247s

64 nodes x 64ppn = 4096 procs

[1,0]<stdout>: # OSU MPI Scatterv Latency Test v7.2
[1,0]<stdout>: # Datatype: MPI_CHAR.
[1,0]<stdout>: # Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
[1,0]<stdout>: 1                     167.34            147.75            190.17        1000
[1,0]<stdout>: 2                     167.45            148.12            190.65        1000
[1,0]<stdout>: 4                     168.58            149.19            191.40        1000
[1,0]<stdout>: 8                     167.53            147.38            200.54        1000
[1,0]<stdout>: 16                    170.57            150.43            196.34        1000
[1,0]<stdout>: 32                    178.27            152.16            202.29        1000
[1,0]<stdout>: 64                    174.22            151.39            207.85        1000
[1,0]<stdout>: 128                   185.17            158.47            234.55        1000
[1,0]<stdout>: 256                   219.19            188.50            278.50        1000
[1,0]<stdout>: 512                   304.37            260.04            367.82        1000
[1,0]<stdout>: 1024                  439.06            347.00            528.86        1000

real	0m14.463s
user	0m1.003s
sys	0m0.958s

@wenduwan wenduwan self-assigned this Feb 28, 2024
@wenduwan wenduwan force-pushed the han_gatherv branch 4 times, most recently from cae601f to c7d8a77, on February 29, 2024 15:23
@wenduwan wenduwan changed the title from "Implement hierarchical MPI_Gatherv" to "Implement hierarchical MPI_Gatherv and MPI_Scatterv" on Feb 29, 2024
@jiaxiyan (Contributor) commented Mar 1, 2024

Edit: Moved OMB MPI_Scatterv comparison to PR description

Review threads:
ompi/mca/coll/han/coll_han_gatherv.c (resolved)
ompi/mca/coll/han/coll_han_gatherv.c (outdated, resolved)
ompi/mca/coll/han/coll_han_scatterv.c (outdated, resolved)
ompi/mca/coll/han/coll_han_scatterv.c (outdated, resolved)
@wenduwan (Contributor Author)

@devreal I updated the OMB metrics in the description. Please take a look.

@wenduwan wenduwan requested a review from devreal March 13, 2024 23:23
@devreal (Contributor) commented Mar 14, 2024

Any idea what is going on at 4k procs on scatterv? It's significantly slower than the linear version, while for smaller proc numbers performance is similar or better...

(Quoting the 64-node scatterv table from the PR description above.)

We should also think about segmenting for larger messages. Especially gatherv seems to do worse at larger messages and smaller proc counts.

@wenduwan wenduwan force-pushed the han_gatherv branch 2 times, most recently from c2d84f8 to 6829cec, on March 14, 2024 17:22
@wenduwan (Contributor Author) commented Mar 14, 2024

Any idea what is going on at 4k procs on scatterv

@devreal lol You caught me. I flipped the tables - the worse result was actually the current (linear) scatterv; the hierarchical scatterv was faster.

Also, I had some bugs in the previous commit. They should be fixed now, and I have updated the scatterv numbers.

segmenting for larger messages

That is correct and unfortunate - we profiled the algorithm and identified two performance bottlenecks:

  • Node leaders need to allocate a (different) temporary buffer (a receive buffer for gatherv, a send buffer for scatterv). Using a temporary buffer incurs a cost in the network layer; for example, on EFA the buffer needs to be registered, pinned and eventually deregistered. This can only be mitigated with a reusable buffer pool, i.e. register once and reuse forever (see the sketch below).
  • Related to the above, the temporary buffers require additional memory copies into/out of the user sbuf/rbuf on node leaders. This cost is largely unavoidable without a very fast memcpy mechanism.

I don't see how segmentation actually helps reduce the gatherv/scatterv latency, and it is very complicated for the *v collectives since not all processes need multiple segments, which would require per-peer bookkeeping on node leaders.

The unfortunate thing is that we cannot simply switch to a faster algorithm for large messages, since only the root knows the message sizes. I'm afraid it is a trade-off for the application - it must choose one specific algorithm for all message sizes.
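To illustrate the "reusable buffer pool" mitigation mentioned in the first bullet above, here is a minimal, hypothetical grow-only cache for a node leader's temporary buffer. It is not part of this PR, and placeholder_register/placeholder_deregister are stand-ins for whatever registration the transport (e.g. EFA) actually requires.

```c
/* Hypothetical grow-only temporary-buffer cache for a node leader.
 * The register/deregister calls are placeholders for transport-specific
 * memory registration (e.g. on EFA). */
#include <stdlib.h>

struct tmp_buf_cache {
    void  *buf;        /* registered buffer, reused across collective calls */
    size_t capacity;   /* current registered size in bytes */
};

static void *placeholder_register(void *p, size_t len)   { (void)len; return p; }
static void  placeholder_deregister(void *p, size_t len) { (void)p; (void)len; }

/* Return a buffer of at least `needed` bytes, re-registering only when the
 * cached buffer has to grow; later calls with smaller sizes reuse it. */
void *tmp_buf_get(struct tmp_buf_cache *cache, size_t needed)
{
    if (cache->capacity < needed) {
        if (cache->buf) {
            placeholder_deregister(cache->buf, cache->capacity);
            free(cache->buf);
        }
        cache->buf = malloc(needed);
        if (cache->buf) {
            placeholder_register(cache->buf, needed);
            cache->capacity = needed;
        } else {
            cache->capacity = 0;
        }
    }
    return cache->buf;
}
```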

@devreal (Contributor) left a comment

Sorry, some more comments...

Review threads:
ompi/mca/coll/han/coll_han_scatterv.c (outdated, resolved)
ompi/mca/coll/han/coll_han_scatterv.c (outdated, resolved)
ompi/mca/coll/han/coll_han_scatterv.c (resolved)
@wenduwan (Contributor Author)

@devreal Coming to think about disabling HAN Gatherv/Scatterv: currently the user has to use a dynamic rule file. It would be easier, IMO, to provide an MCA parameter, e.g. coll_han_disable_<coll>. Do you think that is a reasonable addition?
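For reference, the coarse-grained workaround available today is to exclude the whole HAN component; the per-collective parameter mentioned above is only a proposal at this point. The command lines below are illustrative assumptions (including the ./my_app name), not taken from the PR:

```
# Existing mechanism: exclude the HAN component entirely.
mpirun --mca coll ^han ./my_app

# Proposed (hypothetical) per-collective switch discussed above:
# mpirun --mca coll_han_disable_gatherv 1 ./my_app
```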

@devreal (Contributor) commented Mar 15, 2024

Why do you want to disable HAN for Gatherv/Scatterv? The numbers look good to me. I understand we don't have a way for fine-grained selection of collective implementations, but maybe that's a broader decision than this PR?

@wenduwan (Contributor Author)

@devreal Yes I brought up the topic but we should address it separately.

@devreal (Contributor) left a comment

Thanks, LGTM

@wenduwan (Contributor Author)

@devreal Thank you for your kind review! We appreciate your feedback a lot.

@bosilca It would be great to hear your opinions too. On the other hand, we would love to get the change merged and tested as part of the nightly tarballs in case there is a bug.

I will hold the PR until 3/22.

wenduwan and others added 3 commits March 22, 2024 13:34
Relax the function requirement to allow null low/up_rank output
pointers, and rename the arguments because the function works for
non-root ranks as well.

Signed-off-by: Wenduo Wang <wenduwan@amazon.com>
Add gatherv implementation to optimize large-scale communications on
multiple nodes and multiple processes per node, by avoiding high-incast
traffic on the root process.

Because *V collectives do not have equal datatype/count on every
process, they do not natively support message-size based tuning without
an additional global communication.

Similar to gather and allgather, the hierarchical gatherv requires a
temporary buffer and memory copy to handle out-of-order data, or
non-contiguous placement on the output buffer, which results in worse
performance for large messages compared to the linear implementation.

Signed-off-by: Wenduo Wang <wenduwan@amazon.com>
Add scatterv implementation to optimize large-scale communications on
multiple nodes and multiple processes per node, by avoiding high-incast
traffic on the root process.

Because *V collectives do not have equal datatype/count on every
process, they do not natively support message-size based tuning without
an additional global communication.

Similar to scatter, the hierarchical scatterv requires a
temporary buffer and memory copy to handle out-of-order data, or
non-contiguous placement on the send buffer, which results in worse
performance for large messages compared to the linear implementation.

Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
@wenduwan wenduwan merged commit 984944d into open-mpi:main Mar 23, 2024
11 checks passed