
Implement hierarchical MPI_Gatherv and MPI_Scatterv #12376

Merged
merged 3 commits from han_gatherv into open-mpi:main
Mar 23, 2024

Conversation

wenduwan
Contributor

@wenduwan wenduwan commented Feb 26, 2024

Summary

Add hierarchical gatherv and scatterv implementations to optimize large-scale communication across multiple nodes with multiple processes per node, by avoiding high-incast traffic on the root process.

Because the *V collectives do not use the same datatype/count on every process, they do not natively support message-size-based tuning without an additional global communication.

Similar to gather and allgather, the hierarchical gatherv requires a temporary buffer and an extra memory copy to handle out-of-order data or non-contiguous placement in the output buffer, which results in worse performance for large messages compared to the linear implementation.
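
For illustration, the sketch below shows the two-level structure in plain MPI. It is a conceptual outline of the approach described above, not the coll/han code, and it makes several simplifying assumptions: the root is rank 0 and is a node leader, ranks are placed node-by-node (--rank-by slot), and the datatype is contiguous.

```c
/*
 * Conceptual two-level gatherv sketch -- an illustration only, NOT the
 * coll/han implementation.  Simplifying assumptions:
 *   - the root of the operation is rank 0 and is also a node leader,
 *   - ranks are placed node-by-node (--rank-by slot), so per-node blocks
 *     concatenate in global rank order,
 *   - the datatype is contiguous, so byte copies by extent are safe.
 */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

int two_level_gatherv(const void *sbuf, int scount, MPI_Datatype dtype,
                      void *rbuf, const int *rcounts, const int *displs,
                      MPI_Comm comm)
{
    int rank, size, node_rank, node_size;
    MPI_Aint lb, extent;
    MPI_Comm node_comm, leader_comm = MPI_COMM_NULL;

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    MPI_Type_get_extent(dtype, &lb, &extent);

    /* Level 1: one communicator per node (shared-memory domain). */
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, rank, MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);
    MPI_Comm_size(node_comm, &node_size);

    /* Level 2: node leaders (local rank 0) form their own communicator. */
    MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED, rank, &leader_comm);

    /* Each leader learns its local ranks' counts, then gathers their data
     * into a temporary node buffer (this is the extra buffer/copy cost). */
    int *lcounts = NULL, *ldispls = NULL, node_total = 0;
    char *node_buf = NULL;
    if (0 == node_rank) {
        lcounts = malloc(node_size * sizeof(int));
        ldispls = malloc(node_size * sizeof(int));
    }
    MPI_Gather(&scount, 1, MPI_INT, lcounts, 1, MPI_INT, 0, node_comm);
    if (0 == node_rank) {
        for (int i = 0; i < node_size; ++i) {
            ldispls[i] = node_total;
            node_total += lcounts[i];
        }
        node_buf = malloc((size_t)node_total * extent);
    }
    MPI_Gatherv(sbuf, scount, dtype, node_buf, lcounts, ldispls, dtype, 0, node_comm);

    /* Each node now sends ONE consolidated block to the root, so the root
     * sees incast proportional to the node count, not the process count. */
    if (0 == node_rank) {
        int nleaders, lrank, total = 0;
        int *ncounts = NULL, *ndispls = NULL;
        char *gathered = NULL;

        MPI_Comm_size(leader_comm, &nleaders);
        MPI_Comm_rank(leader_comm, &lrank);
        if (0 == lrank) {
            ncounts = malloc(nleaders * sizeof(int));
            ndispls = malloc(nleaders * sizeof(int));
        }
        MPI_Gather(&node_total, 1, MPI_INT, ncounts, 1, MPI_INT, 0, leader_comm);
        if (0 == lrank) {
            for (int i = 0; i < nleaders; ++i) {
                ndispls[i] = total;
                total += ncounts[i];
            }
            gathered = malloc((size_t)total * extent);
        }
        MPI_Gatherv(node_buf, node_total, dtype, gathered, ncounts, ndispls,
                    dtype, 0, leader_comm);

        /* Root copy-out: under --rank-by slot the concatenation is already in
         * global rank order; place each block at the caller's displacement. */
        if (0 == lrank) {
            size_t off = 0;
            for (int r = 0; r < size; ++r) {
                memcpy((char *)rbuf + (size_t)displs[r] * extent,
                       gathered + off, (size_t)rcounts[r] * extent);
                off += (size_t)rcounts[r] * extent;
            }
            free(gathered); free(ncounts); free(ndispls);
        }
        free(node_buf); free(lcounts); free(ldispls);
        MPI_Comm_free(&leader_comm);
    }
    MPI_Comm_free(&node_comm);
    return MPI_SUCCESS;
}
```

When ranks are reordered (e.g. --rank-by node), the per-node blocks no longer concatenate in global rank order, which is exactly the out-of-order case the temporary buffer and reorder copy have to handle.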

OMB comparison between linear and hierarchical MPI_Gatherv

Happy Path Performance without reordering (--rank-by slot)

2 nodes x 64ppn = 128 procs

Default linear gatherv

[1,0]<stdout>: # OSU MPI Gatherv Latency Test v7.2
[1,0]<stdout>: # Datatype: MPI_CHAR.
[1,0]<stdout>: # Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
[1,0]<stdout>: 1                       4.64              0.35             49.63        1000
[1,0]<stdout>: 2                       4.67              0.35             47.83        1000
[1,0]<stdout>: 4                       4.40              0.35             46.51        1000
[1,0]<stdout>: 8                       4.48              0.34             47.04        1000
[1,0]<stdout>: 16                      4.46              0.35             47.13        1000
[1,0]<stdout>: 32                      4.50              0.35             47.31        1000
[1,0]<stdout>: 64                      5.84              0.35             53.00        1000
[1,0]<stdout>: 128                     5.60              0.35             52.76        1000
[1,0]<stdout>: 256                    12.01              0.35             77.85        1000
[1,0]<stdout>: 512                    11.66              0.35             76.06        1000
[1,0]<stdout>: 1024                   11.49              0.37             83.19        1000
[1,0]<stdout>: 2048                   11.35              0.38             88.96        1000
[1,0]<stdout>: 4096                   11.13              0.41             95.71        1000
[1,0]<stdout>: 8192                   60.45             13.57            216.82        1000
[1,0]<stdout>: 16384                  78.74             12.01            312.69         100
[1,0]<stdout>: 32768                 116.05             20.48            489.80         100
[1,0]<stdout>: 65536                 189.58             38.27            857.93         100
[1,0]<stdout>: 131072                742.51             48.57           1577.27         100
[1,0]<stdout>: 262144               1448.12             77.61           3137.05         100
[1,0]<stdout>: 524288               3041.88             78.65           6115.82         100
[1,0]<stdout>: 1048576              5046.68           1098.15           8445.72         100

real	0m7.592s
user	0m0.032s
sys	0m0.048s

Hierarchical gatherv

[1,0]<stdout>: # OSU MPI Gatherv Latency Test v7.2
[1,0]<stdout>: # Datatype: MPI_CHAR.
[1,0]<stdout>: # Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
[1,0]<stdout>: 1                      13.24              0.62             57.08        1000
[1,0]<stdout>: 2                      13.58              0.55             59.21        1000
[1,0]<stdout>: 4                      13.63              0.53             59.58        1000
[1,0]<stdout>: 8                      13.65              0.53             59.65        1000
[1,0]<stdout>: 16                     13.63              0.54             60.32        1000
[1,0]<stdout>: 32                     13.62              0.53             60.47        1000
[1,0]<stdout>: 64                     13.69              0.54             61.90        1000
[1,0]<stdout>: 128                    13.47              0.60             61.72        1000
[1,0]<stdout>: 256                    30.31              0.86            119.91        1000
[1,0]<stdout>: 512                    29.73              0.80            121.20        1000
[1,0]<stdout>: 1024                   29.74              0.84            125.64        1000
[1,0]<stdout>: 2048                   30.49              0.83            162.71        1000
[1,0]<stdout>: 4096                   28.90              0.79            174.88        1000
[1,0]<stdout>: 8192                   85.67             25.98            282.71        1000
[1,0]<stdout>: 16384                 104.62             25.79            323.59         100
[1,0]<stdout>: 32768                 144.11             29.70            452.66         100
[1,0]<stdout>: 65536                 229.81             33.26            721.29         100
[1,0]<stdout>: 131072                383.63             40.88           1195.43         100
[1,0]<stdout>: 262144                751.58             63.67           2270.66         100
[1,0]<stdout>: 524288               2977.58            282.49          13139.84         100
[1,0]<stdout>: 1048576              5969.17            548.00          25890.52         100

real	0m10.551s
user	0m0.038s
sys	0m0.037s

16 nodes x 64ppn = 1024 procs

Default linear gatherv

[1,0]<stdout>: # OSU MPI Gatherv Latency Test v7.2
[1,0]<stdout>: # Datatype: MPI_CHAR.
[1,0]<stdout>: # Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
[1,0]<stdout>: 1                       1.68              0.34            634.42        1000
[1,0]<stdout>: 2                       1.69              0.34            638.99        1000
[1,0]<stdout>: 4                       1.69              0.34            642.46        1000
[1,0]<stdout>: 8                       1.70              0.25            652.73        1000
[1,0]<stdout>: 16                      1.71              0.32            653.11        1000
[1,0]<stdout>: 32                      1.72              0.26            654.13        1000
[1,0]<stdout>: 64                      1.76              0.27            679.29        1000
[1,0]<stdout>: 128                     1.78              0.25            681.59        1000
[1,0]<stdout>: 256                     1.97              0.34            693.36        1000
[1,0]<stdout>: 512                     2.02              0.34            717.60        1000
[1,0]<stdout>: 1024                    2.12              0.35            776.07        1000
[1,0]<stdout>: 2048                    2.25              0.37            898.00        1000
[1,0]<stdout>: 4096                    2.45              0.40           1054.73        1000
[1,0]<stdout>: 8192                  176.12             31.85           1413.68        1000
[1,0]<stdout>: 16384                 359.66             36.26           2521.24         100
[1,0]<stdout>: 32768                 722.06             40.71           4523.63         100
[1,0]<stdout>: 65536                1425.91             45.81           8647.08         100
[1,0]<stdout>: 131072               4968.47            185.44          15497.74         100
[1,0]<stdout>: 262144              16032.77            211.21          30089.29         100
[1,0]<stdout>: 524288              39176.47            272.97          58606.06         100

real	0m38.337s
user	0m0.079s
sys	0m0.088s

Hierarchical gatherv

[1,0]<stdout>: # OSU MPI Gatherv Latency Test v7.2
[1,0]<stdout>: # Datatype: MPI_CHAR.
[1,0]<stdout>: # Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
[1,0]<stdout>: 1                      13.10              0.59             66.92        1000
[1,0]<stdout>: 2                      13.32              0.56             65.37        1000
[1,0]<stdout>: 4                      13.34              0.54             65.17        1000
[1,0]<stdout>: 8                      13.37              0.53             69.54        1000
[1,0]<stdout>: 16                     13.34              0.54             67.34        1000
[1,0]<stdout>: 32                     13.35              0.52             66.23        1000
[1,0]<stdout>: 64                     13.37              0.52             68.53        1000
[1,0]<stdout>: 128                    13.73              0.55             67.83        1000
[1,0]<stdout>: 256                    39.21              0.67            206.27        1000
[1,0]<stdout>: 512                    37.51              0.65            172.15        1000
[1,0]<stdout>: 1024                   37.38              0.70            216.41        1000
[1,0]<stdout>: 2048                   38.55              0.74            303.33        1000
[1,0]<stdout>: 4096                   35.55              0.77            480.86        1000
[1,0]<stdout>: 8192                  100.68             25.96            866.76        1000
[1,0]<stdout>: 16384                 125.71             24.17            933.73         100
[1,0]<stdout>: 32768                 177.80             26.00           2026.15         100
[1,0]<stdout>: 65536                 290.06             33.58           3702.86         100
[1,0]<stdout>: 131072                479.79             41.67           6243.50         100
[1,0]<stdout>: 262144                924.70             63.57          12132.51         100
[1,0]<stdout>: 524288               4508.58            278.82          32352.78         100

real	0m22.046s
user	0m0.163s
sys	0m0.130s

64 nodes x 64ppn = 4096 procs (reduced message size due to memory limit)

Default linear gatherv

[1,0]<stdout>: # OSU MPI Gatherv Latency Test v7.2
[1,0]<stdout>: # Datatype: MPI_CHAR.
[1,0]<stdout>: # Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
[1,0]<stdout>: 1                       1.57              0.34           3921.16        1000
[1,0]<stdout>: 2                       1.59              0.34           3937.05        1000
[1,0]<stdout>: 4                       1.59              0.33           3973.91        1000
[1,0]<stdout>: 8                       1.58              0.33           3920.90        1000
[1,0]<stdout>: 16                      1.58              0.33           3923.34        1000
[1,0]<stdout>: 32                      1.57              0.33           3879.77        1000
[1,0]<stdout>: 64                      1.57              0.33           3869.90        1000
[1,0]<stdout>: 128                     1.57              0.33           3892.23        1000
[1,0]<stdout>: 256                     1.57              0.33           3676.15        1000
[1,0]<stdout>: 512                     1.55              0.34           3527.66        1000
[1,0]<stdout>: 1024                    1.61              0.35           3735.54        1000

real	1m18.041s
user	0m0.688s
sys	0m0.827s

Hierarchical gatherv

[1,0]<stdout>: # OSU MPI Gatherv Latency Test v7.2
[1,0]<stdout>: # Datatype: MPI_CHAR.
[1,0]<stdout>: # Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
[1,0]<stdout>: 1                      12.96              0.67            122.70        1000
[1,0]<stdout>: 2                      13.28              0.55            138.74        1000
[1,0]<stdout>: 4                      13.30              0.50            119.33        1000
[1,0]<stdout>: 8                      13.33              0.50            124.42        1000
[1,0]<stdout>: 16                     13.31              0.49            124.79        1000
[1,0]<stdout>: 32                     13.31              0.49            121.69        1000
[1,0]<stdout>: 64                     13.33              0.49            124.90        1000
[1,0]<stdout>: 128                    13.69              0.49            120.31        1000
[1,0]<stdout>: 256                    38.85              0.63            238.08        1000
[1,0]<stdout>: 512                    37.89              0.64            336.00        1000
[1,0]<stdout>: 1024                   38.44              0.68            525.95        1000

real	0m32.456s
user	0m0.790s
sys	0m0.636s

Worst Case Performance with Reordering (--rank-by node)

Only the hierarchical gatherv results are shown - the default linear algorithm is not affected by --rank-by.

2 nodes x 64ppn = 128 procs

[1,0]<stdout>: # OSU MPI Gatherv Latency Test v7.2
[1,0]<stdout>: # Datatype: MPI_CHAR.
[1,0]<stdout>: # Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
[1,0]<stdout>: 1                      12.93              1.04             57.49        1000
[1,0]<stdout>: 2                      13.34              0.59             59.37        1000
[1,0]<stdout>: 4                      13.36              0.51             59.89        1000
[1,0]<stdout>: 8                      13.43              0.53             60.05        1000
[1,0]<stdout>: 16                     13.37              0.51             59.77        1000
[1,0]<stdout>: 32                     13.39              0.51             60.57        1000
[1,0]<stdout>: 64                     13.43              0.53             64.19        1000
[1,0]<stdout>: 128                    13.09              0.53             61.37        1000
[1,0]<stdout>: 256                    36.45              0.76            142.11        1000
[1,0]<stdout>: 512                    35.59              0.66            142.69        1000
[1,0]<stdout>: 1024                   34.87              0.74            143.69        1000
[1,0]<stdout>: 2048                   35.39              0.75            182.90        1000
[1,0]<stdout>: 4096                   32.51              0.77            194.39        1000
[1,0]<stdout>: 8192                   87.38             11.71            307.09        1000
[1,0]<stdout>: 16384                 109.99             11.96            358.53         100
[1,0]<stdout>: 32768                 150.94             21.37            527.55         100
[1,0]<stdout>: 65536                 238.63             18.98            884.33         100
[1,0]<stdout>: 131072                400.55             25.91           1519.31         100
[1,0]<stdout>: 262144                773.05             52.27           2917.65         100
[1,0]<stdout>: 524288               3143.42             75.64          18756.92         100
[1,0]<stdout>: 1048576              6288.95            142.21          37001.20         100

real	0m13.099s
user	0m0.037s
sys	0m0.041s

16 nodes x 64ppn = 1024 procs

[1,0]<stdout>: # OSU MPI Gatherv Latency Test v7.2
[1,0]<stdout>: # Datatype: MPI_CHAR.
[1,0]<stdout>: # Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
[1,0]<stdout>: 1                      12.97              0.99             85.45        1000
[1,0]<stdout>: 2                      13.21              0.62             85.09        1000
[1,0]<stdout>: 4                      13.22              0.53             84.87        1000
[1,0]<stdout>: 8                      13.24              0.55             86.50        1000
[1,0]<stdout>: 16                     13.22              0.52             84.07        1000
[1,0]<stdout>: 32                     13.25              0.53             85.32        1000
[1,0]<stdout>: 64                     13.25              0.52             90.10        1000
[1,0]<stdout>: 128                    13.57              0.55             69.28        1000
[1,0]<stdout>: 256                    37.74              0.63            173.56        1000
[1,0]<stdout>: 512                    36.96              0.66            190.24        1000
[1,0]<stdout>: 1024                   36.54              0.69            243.03        1000
[1,0]<stdout>: 2048                   37.76              0.73            377.21        1000
[1,0]<stdout>: 4096                   35.28              0.77            595.56        1000
[1,0]<stdout>: 8192                  100.17             25.75           1109.30        1000
[1,0]<stdout>: 16384                 127.45             26.63           1518.66         100
[1,0]<stdout>: 32768                 181.39             29.25           2992.01         100
[1,0]<stdout>: 65536                 566.74             25.20          23168.03         100
[1,0]<stdout>: 131072                931.37             46.47          39313.10         100
[1,0]<stdout>: 262144               1732.46             76.02          71912.28         100
[1,0]<stdout>: 524288               5978.15            161.02         144052.29         100

real	0m49.669s
user	0m0.100s
sys	0m0.199s

64 nodes x 64ppn = 4096 procs

[1,0]<stdout>: # OSU MPI Gatherv Latency Test v7.2
[1,0]<stdout>: # Datatype: MPI_CHAR.
[1,0]<stdout>: # Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
[1,0]<stdout>: 1                      12.88              0.94            167.05        1000
[1,0]<stdout>: 2                      13.15              0.55            167.64        1000
[1,0]<stdout>: 4                      13.17              0.49            170.75        1000
[1,0]<stdout>: 8                      13.18              0.49            167.91        1000
[1,0]<stdout>: 16                     13.18              0.49            163.97        1000
[1,0]<stdout>: 32                     13.19              0.49            182.50        1000
[1,0]<stdout>: 64                     13.21              0.49            169.98        1000
[1,0]<stdout>: 128                    13.59              0.49            188.04        1000
[1,0]<stdout>: 256                    40.82              0.64            299.74        1000
[1,0]<stdout>: 512                    40.03              0.66            426.60        1000
[1,0]<stdout>: 1024                   40.59              0.67            663.56        1000

real	0m33.257s
user	0m0.843s
sys	0m0.751s

OMB comparison between linear and hierarchical MPI_Scatterv

Happy Path Performance without reordering (--rank-by slot)

2 nodes x 64ppn = 128 procs

Default linear scatterv

[1,0]<stdout>: # OSU MPI Scatterv Latency Test v7.2
[1,0]<stdout>: # Datatype: MPI_CHAR.
[1,0]<stdout>: # Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
[1,0]<stdout>: 1                      16.52              1.35             62.98        1000
[1,0]<stdout>: 2                      17.07              1.36             63.90        1000
[1,0]<stdout>: 4                      17.11              1.34             63.99        1000
[1,0]<stdout>: 8                      17.21              1.26             64.46        1000
[1,0]<stdout>: 16                     17.26              1.28             64.58        1000
[1,0]<stdout>: 32                     17.20              1.31             64.48        1000
[1,0]<stdout>: 64                     17.23              1.25             64.44        1000
[1,0]<stdout>: 128                    17.74              1.36             65.36        1000
[1,0]<stdout>: 256                    21.51              1.60             70.10        1000
[1,0]<stdout>: 512                    22.20              1.67             71.19        1000
[1,0]<stdout>: 1024                   26.92              1.84             78.22        1000
[1,0]<stdout>: 2048                   30.35              1.95             84.08        1000
[1,0]<stdout>: 4096                   36.27              2.58             96.20        1000
[1,0]<stdout>: 8192                  531.94              3.30           1825.26        1000
[1,0]<stdout>: 16384                 570.35              3.26           1908.61         100
[1,0]<stdout>: 32768                 644.08              4.70           2055.98         100
[1,0]<stdout>: 65536                 793.32              7.12           2341.72         100
[1,0]<stdout>: 131072               1530.23             15.22           4681.64         100
[1,0]<stdout>: 262144               2040.22             27.88           5643.87         100
[1,0]<stdout>: 524288               3752.15             55.37           9855.75         100
[1,0]<stdout>: 1048576              5812.79            121.96          12268.16         100

real	0m13.233s
user	0m0.000s
sys	0m0.082s

Hierarchical scatterv

[1,0]<stdout>: # OSU MPI Scatterv Latency Test v7.2
[1,0]<stdout>: # Datatype: MPI_CHAR.
[1,0]<stdout>: # Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
[1,0]<stdout>: 1                      23.93              1.86             46.99        1000
[1,0]<stdout>: 2                      23.98              1.95             47.00        1000
[1,0]<stdout>: 4                      24.08              1.94             47.24        1000
[1,0]<stdout>: 8                      24.21              2.00             47.43        1000
[1,0]<stdout>: 16                     24.29              1.90             47.70        1000
[1,0]<stdout>: 32                     24.36              1.94             47.80        1000
[1,0]<stdout>: 64                     25.11              2.02             49.11        1000
[1,0]<stdout>: 128                    25.88              2.12             50.39        1000
[1,0]<stdout>: 256                    29.04              2.28             56.47        1000
[1,0]<stdout>: 512                    30.98              2.50             60.44        1000
[1,0]<stdout>: 1024                   35.34              2.85             68.55        1000
[1,0]<stdout>: 2048                   55.07              2.73            107.68        1000
[1,0]<stdout>: 4096                   64.59              3.29            126.34        1000
[1,0]<stdout>: 8192                  159.80              4.10            335.16        1000
[1,0]<stdout>: 16384                 181.21              4.63            409.44         100
[1,0]<stdout>: 32768                 265.88              6.04            598.07         100
[1,0]<stdout>: 65536                 441.75              9.02            989.17         100
[1,0]<stdout>: 131072                720.13             15.06           1594.35         100
[1,0]<stdout>: 262144               1272.70             29.57           2743.94         100
[1,0]<stdout>: 524288               6852.35             63.08          15249.06         100
[1,0]<stdout>: 1048576             13692.94            138.20          30069.45         100

real	0m10.032s
user	0m0.049s
sys	0m0.097s

16 nodes x 64ppn = 1024 procs

Default linear scatterv

[1,0]<stdout>: # OSU MPI Scatterv Latency Test v7.2
[1,0]<stdout>: # Datatype: MPI_CHAR.
[1,0]<stdout>: # Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
[1,0]<stdout>: 1                     159.78              1.85            345.43        1000
[1,0]<stdout>: 2                     158.27              1.29            343.23        1000
[1,0]<stdout>: 4                     158.60              1.82            342.69        1000
[1,0]<stdout>: 8                     157.31              1.26            339.72        1000
[1,0]<stdout>: 16                    157.46              1.77            340.10        1000
[1,0]<stdout>: 32                    157.55              1.26            340.09        1000
[1,0]<stdout>: 64                    158.66              1.76            341.39        1000
[1,0]<stdout>: 128                   159.87              1.31            344.11        1000
[1,0]<stdout>: 256                   168.50              2.04            354.58        1000
[1,0]<stdout>: 512                   173.21              1.63            363.31        1000
[1,0]<stdout>: 1024                  185.34              2.26            384.19        1000
[1,0]<stdout>: 2048                  208.39              1.90            429.41        1000
[1,0]<stdout>: 4096                  245.61              2.77            501.90        1000
[1,0]<stdout>: 8192                11961.05              3.12          25562.80        1000
[1,0]<stdout>: 16384               12324.61              4.83          26333.23         100
[1,0]<stdout>: 32768               12902.40              5.46          27472.94         100
[1,0]<stdout>: 65536               13983.43              9.07          29607.63         100
[1,0]<stdout>: 131072              29112.35             14.61          61728.65         100
[1,0]<stdout>: 262144              32925.09             27.35          69120.39         100
[1,0]<stdout>: 524288              55311.94             55.04         115649.66         100

real	1m27.318s
user	0m0.053s
sys	0m0.115s

Hierarchical scatterv

[1,0]<stdout>: # OSU MPI Scatterv Latency Test v7.2
[1,0]<stdout>: # Datatype: MPI_CHAR.
[1,0]<stdout>: # Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
[1,0]<stdout>: 1                      29.80             12.80             46.78        1000
[1,0]<stdout>: 2                      29.73             12.76             46.74        1000
[1,0]<stdout>: 4                      29.69             12.78             46.60        1000
[1,0]<stdout>: 8                      30.00             12.90             46.58        1000
[1,0]<stdout>: 16                     30.83             13.20             47.40        1000
[1,0]<stdout>: 32                     30.69             13.41             47.83        1000
[1,0]<stdout>: 64                     31.86             14.10             49.73        1000
[1,0]<stdout>: 128                    34.52             15.39             50.35        1000
[1,0]<stdout>: 256                    40.02             15.11             62.27        1000
[1,0]<stdout>: 512                    49.78             16.71             77.03        1000
[1,0]<stdout>: 1024                   69.12             20.27            111.07        1000
[1,0]<stdout>: 2048                  134.34             20.38            189.22        1000
[1,0]<stdout>: 4096                  205.15             21.96            284.62        1000
[1,0]<stdout>: 8192                  433.22             23.15            611.64        1000
[1,0]<stdout>: 16384                 764.50             17.42           1055.88         100
[1,0]<stdout>: 32768                1590.97             19.54           2028.55         100
[1,0]<stdout>: 65536                3008.15             23.42           3724.62         100
[1,0]<stdout>: 131072               5573.11             30.70           6614.90         100
[1,0]<stdout>: 262144              10475.43             43.00          12152.31         100
[1,0]<stdout>: 524288              27654.88             79.65          33890.64         100

real	0m13.903s
user	0m0.133s
sys	0m0.245s

64 nodes x 64ppn = 4096 procs (reduced message size due to memory limit)

Default linear scatterv

[1,0]<stdout>: # OSU MPI Scatterv Latency Test v7.2
[1,0]<stdout>: # Datatype: MPI_CHAR.
[1,0]<stdout>: # Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
[1,0]<stdout>: 1                      63.43             10.58             92.32        1000
[1,0]<stdout>: 2                      63.92             10.69             92.41        1000
[1,0]<stdout>: 4                      63.29             10.61             92.22        1000
[1,0]<stdout>: 8                      63.53             10.59             93.07        1000
[1,0]<stdout>: 16                     63.92             10.76             93.06        1000
[1,0]<stdout>: 32                     63.88             10.61             94.49        1000
[1,0]<stdout>: 64                     65.36             10.81             98.71        1000
[1,0]<stdout>: 128                   835.28             10.80           1700.16        1000
[1,0]<stdout>: 256                   872.38             10.83           1762.42        1000
[1,0]<stdout>: 512                   911.33             10.94           1839.79        1000
[1,0]<stdout>: 1024                  982.67             10.97           1966.01        1000

real	0m39.731s
user	0m0.654s
sys	0m0.684s

Hierarchical scatterv

[1,0]<stdout>: # OSU MPI Scatterv Latency Test v7.2
[1,0]<stdout>: # Datatype: MPI_CHAR.
[1,0]<stdout>: # Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
[1,0]<stdout>: 1                      50.61             32.31             72.00        1000
[1,0]<stdout>: 2                      51.36             33.09             72.31        1000
[1,0]<stdout>: 4                      51.88             32.99             78.24        1000
[1,0]<stdout>: 8                      52.24             33.25             73.50        1000
[1,0]<stdout>: 16                     52.06             33.23             74.43        1000
[1,0]<stdout>: 32                     53.08             33.29             76.45        1000
[1,0]<stdout>: 64                     55.50             34.96             81.43        1000
[1,0]<stdout>: 128                    59.38             33.88             97.11        1000
[1,0]<stdout>: 256                    66.71             36.11             99.07        1000
[1,0]<stdout>: 512                    85.26             38.64            141.51        1000
[1,0]<stdout>: 1024                  140.49             46.03            228.36        1000

real	0m12.118s
user	0m0.821s
sys	0m0.877s

Worst Case Performance with Reordering (--rank-by node)

Only the hierarchical scatterv results are shown - the default linear algorithm is not affected by --rank-by.

2 nodes x 64ppn = 128 procs

[1,0]<stdout>: # OSU MPI Scatterv Latency Test v7.2
[1,0]<stdout>: # Datatype: MPI_CHAR.
[1,0]<stdout>: # Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
[1,0]<stdout>: 1                      25.60              4.03             48.02        1000
[1,0]<stdout>: 2                      26.05              4.05             49.00        1000
[1,0]<stdout>: 4                      25.94              4.04             48.79        1000
[1,0]<stdout>: 8                      26.08              3.99             49.09        1000
[1,0]<stdout>: 16                     26.15              4.05             49.15        1000
[1,0]<stdout>: 32                     26.53              4.21             49.79        1000
[1,0]<stdout>: 64                     27.12              4.29             50.76        1000
[1,0]<stdout>: 128                    27.88              4.44             51.96        1000
[1,0]<stdout>: 256                    32.47              4.81             60.71        1000
[1,0]<stdout>: 512                    33.36              5.52             62.07        1000
[1,0]<stdout>: 1024                   39.12              6.80             72.22        1000
[1,0]<stdout>: 2048                   61.11              8.60            113.52        1000
[1,0]<stdout>: 4096                   76.31             13.76            137.90        1000
[1,0]<stdout>: 8192                  181.19             22.81            357.19        1000
[1,0]<stdout>: 16384                 217.13             43.21            437.48         100
[1,0]<stdout>: 32768                 336.13             78.06            671.74         100
[1,0]<stdout>: 65536                 580.73            148.29           1142.75         100
[1,0]<stdout>: 131072                986.48            289.44           1857.83         100
[1,0]<stdout>: 262144               1878.63            616.36           3279.75         100
[1,0]<stdout>: 524288              18502.51          11494.99          27117.56         100
[1,0]<stdout>: 1048576             36537.57          22721.70          53158.69         100

real	0m13.962s
user	0m0.057s
sys	0m0.088s

16 nodes x 64ppn = 1024 procs

[1,0]<stdout>: # OSU MPI Scatterv Latency Test v7.2
[1,0]<stdout>: # Datatype: MPI_CHAR.
[1,0]<stdout>: # Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
[1,0]<stdout>: 1                      56.74             40.34             74.19        1000
[1,0]<stdout>: 2                      56.53             40.20             74.12        1000
[1,0]<stdout>: 4                      56.78             40.15             73.90        1000
[1,0]<stdout>: 8                      56.73             40.36             74.07        1000
[1,0]<stdout>: 16                     57.96             40.96             75.21        1000
[1,0]<stdout>: 32                     58.12             41.36             75.79        1000
[1,0]<stdout>: 64                     59.50             42.23             77.72        1000
[1,0]<stdout>: 128                    63.65             44.93             80.03        1000
[1,0]<stdout>: 256                    71.07             47.87             90.31        1000
[1,0]<stdout>: 512                    95.40             65.06            120.45        1000
[1,0]<stdout>: 1024                  139.92             96.81            181.46        1000
[1,0]<stdout>: 2048                  250.04            129.94            310.99        1000
[1,0]<stdout>: 4096                  383.59            191.28            457.32        1000
[1,0]<stdout>: 8192                  737.86            326.46            925.51        1000
[1,0]<stdout>: 16384                1464.46            668.41           1716.35         100
[1,0]<stdout>: 32768                2933.77           1370.01           3353.69         100
[1,0]<stdout>: 65536               26660.53          24117.55          29337.44         100
[1,0]<stdout>: 131072              46952.71          43746.51          50632.84         100
[1,0]<stdout>: 262144              88646.68          81590.93          94318.28         100
[1,0]<stdout>: 524288             181359.35         163479.02         190249.45         100

real	0m49.024s
user	0m0.158s
sys	0m0.247s

64 nodes x 64ppn = 4096 procs

[1,0]<stdout>: # OSU MPI Scatterv Latency Test v7.2
[1,0]<stdout>: # Datatype: MPI_CHAR.
[1,0]<stdout>: # Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
[1,0]<stdout>: 1                     167.34            147.75            190.17        1000
[1,0]<stdout>: 2                     167.45            148.12            190.65        1000
[1,0]<stdout>: 4                     168.58            149.19            191.40        1000
[1,0]<stdout>: 8                     167.53            147.38            200.54        1000
[1,0]<stdout>: 16                    170.57            150.43            196.34        1000
[1,0]<stdout>: 32                    178.27            152.16            202.29        1000
[1,0]<stdout>: 64                    174.22            151.39            207.85        1000
[1,0]<stdout>: 128                   185.17            158.47            234.55        1000
[1,0]<stdout>: 256                   219.19            188.50            278.50        1000
[1,0]<stdout>: 512                   304.37            260.04            367.82        1000
[1,0]<stdout>: 1024                  439.06            347.00            528.86        1000

real	0m14.463s
user	0m1.003s
sys	0m0.958s

@wenduwan wenduwan self-assigned this Feb 28, 2024
@wenduwan wenduwan force-pushed the han_gatherv branch 4 times, most recently from cae601f to c7d8a77, February 29, 2024 15:23
@wenduwan wenduwan changed the title from "Implement hierarchical MPI_Gatherv" to "Implement hierarchical MPI_Gatherv and MPI_Scatterv" Feb 29, 2024
@jiaxiyan
Contributor

jiaxiyan commented Mar 1, 2024

Edit: Moved OMB MPI_Scatterv comparison to PR description

Review threads (resolved):
ompi/mca/coll/han/coll_han_gatherv.c
ompi/mca/coll/han/coll_han_gatherv.c (outdated)
ompi/mca/coll/han/coll_han_scatterv.c (outdated)
ompi/mca/coll/han/coll_han_scatterv.c (outdated)
@wenduwan
Contributor Author

@devreal I updated the OMB metrics in the description. Please take a look.

@wenduwan wenduwan requested a review from devreal March 13, 2024 23:23
@devreal
Contributor

devreal commented Mar 14, 2024

Any idea what is going on at 4k procs on scatterv? It's significantly slower than the linear version, while for smaller proc numbers performance is similar or better...

[1,0]<stdout>: # OSU MPI Scatterv Latency Test v7.2
[1,0]<stdout>: # Datatype: MPI_CHAR.
[1,0]<stdout>: # Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
[1,0]<stdout>: 1                      63.43             10.58             92.32        1000
[1,0]<stdout>: 2                      63.92             10.69             92.41        1000
[1,0]<stdout>: 4                      63.29             10.61             92.22        1000
[1,0]<stdout>: 8                      63.53             10.59             93.07        1000
[1,0]<stdout>: 16                     63.92             10.76             93.06        1000
[1,0]<stdout>: 32                     63.88             10.61             94.49        1000
[1,0]<stdout>: 64                     65.36             10.81             98.71        1000
[1,0]<stdout>: 128                   835.28             10.80           1700.16        1000
[1,0]<stdout>: 256                   872.38             10.83           1762.42        1000
[1,0]<stdout>: 512                   911.33             10.94           1839.79        1000
[1,0]<stdout>: 1024                  982.67             10.97           1966.01        1000

real	0m39.731s
user	0m0.654s
sys	0m0.684s

We should also think about segmenting for larger messages. Especially gatherv seems to be doing worse at larger messages and smaller proc counts.

@wenduwan wenduwan force-pushed the han_gatherv branch 2 times, most recently from c2d84f8 to 6829cec, March 14, 2024 17:22
@wenduwan
Contributor Author

wenduwan commented Mar 14, 2024

Any idea what is going on at 4k procs on scatterv

@devreal lol You caught me. I flipped the table - the worse result was the current scatterv. Hierarchical scatterv was faster.

Also, I had some bugs in the previous commit. They should be fixed now, and I updated the scatterv numbers.

segmenting for larger messages

That is correct and unfortunate - we profiled the algorithm and identified two performance bottlenecks:

  • Node leaders need to allocate a (different) temporary buffer (the receive buffer for gatherv, the send buffer for scatterv). Using a temporary buffer incurs a cost in the network layer; for example, on EFA the buffer needs to be registered, pinned and eventually deregistered. This can only be mitigated with a reusable buffer pool, i.e. register once and use forever (see the sketch after this list).
  • Related to the above, the temporary buffers require additional memory copy-in/copy-out from/to sbuf/rbuf on the node leaders. This cost is largely unavoidable without a very fast memcpy mechanism.
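
As a rough illustration of the buffer-pool idea in the first bullet (a hedged sketch, not Open MPI code; the transport_register/transport_deregister hooks are hypothetical placeholders for whatever registration the network layer actually requires):

```c
/* Grow-only temporary buffer cache: keep one allocation alive across
 * collective calls so any registration/pinning cost is paid once,
 * not on every call. */
#include <stdlib.h>

struct tmp_buf_cache {
    void  *ptr;        /* currently cached allocation, NULL initially */
    size_t capacity;   /* its size in bytes */
};

static void *cache_get(struct tmp_buf_cache *c, size_t needed)
{
    if (needed > c->capacity) {
        /* transport_deregister(c->ptr);        hypothetical hook */
        free(c->ptr);
        c->ptr = malloc(needed);
        c->capacity = (c->ptr != NULL) ? needed : 0;
        /* transport_register(c->ptr, needed);  hypothetical hook */
    }
    return c->ptr;  /* reused as-is when it is already large enough */
}
```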

I don't see how segmentation actually helps reduce the gatherv/scatterv latency, and it is quite complicated for the *V collectives: not all processes need multiple segments, so it would require per-peer bookkeeping on the node leaders.

The unfortunate thing is that we cannot simply switch to a faster algorithm for large messages, since only the root knows the message sizes. I'm afraid it's a trade-off for the application - it must choose one specific algorithm for all message sizes.

Contributor

@devreal devreal left a comment


Sorry, some more comments...

Review threads (resolved):
ompi/mca/coll/han/coll_han_scatterv.c (outdated)
ompi/mca/coll/han/coll_han_scatterv.c (outdated)
ompi/mca/coll/han/coll_han_scatterv.c
@wenduwan
Contributor Author

@devreal Coming back to disabling HAN Gatherv/Scatterv: currently the user has to use a dynamic rule file. It would be easier IMO to provide an MCA parameter, e.g. coll_han_disable_<coll>. Do you think that's a reasonable addition?

@devreal
Contributor

devreal commented Mar 15, 2024

Why do you want to disable HAN for Gatherv/Scatterv? The numbers look good to me. I understand we don't have a way for fine-grained selection of collective implementations, but maybe that's a broader decision than this PR?

@wenduwan
Contributor Author

@devreal Yes I brought up the topic but we should address it separately.

Contributor

@devreal devreal left a comment


Thanks, LGTM

@wenduwan
Contributor Author

@devreal Thank you for your kind review! We appreciate your feedback a lot.

@bosilca It would be great to hear your opinion too. At the same time, we would love to get the change merged and tested as part of the nightly tarballs in case there is a bug.

I will hold the PR until 3/22.

wenduwan and others added 3 commits March 22, 2024 13:34
Relax the function requirement to allow null low/up_rank output
pointers, and rename the arguments because the function works for
non-root ranks as well.

Signed-off-by: Wenduo Wang <wenduwan@amazon.com>
Add gatherv implementation to optimize large-scale communications on
multiple nodes and multiple processes per node, by avoiding high-incast
traffic on the root process.

Because *V collectives do not have equal datatype/count on every
process, they do not natively support message-size based tuning without
an additional global communication.

Similar to gather and allgather, the hierarchical gatherv requires a
temporary buffer and memory copy to handle out-of-order data, or
non-contiguous placement on the output buffer, which results in worse
performance for large messages compared to the linear implementation.

Signed-off-by: Wenduo Wang <wenduwan@amazon.com>
Add scatterv implementation to optimize large-scale communications on
multiple nodes and multiple processes per node, by avoiding high-incast
traffic on the root process.

Because *V collectives do not have equal datatype/count on every
process, they do not natively support message-size based tuning without
an additional global communication.

Similar to scatter, the hierarchical scatterv requires a
temporary buffer and memory copy to handle out-of-order data, or
non-contiguous placement on the send buffer, which results in worse
performance for large messages compared to the linear implementation.

Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
@wenduwan wenduwan merged commit 984944d into open-mpi:main Mar 23, 2024
11 checks passed