
How to define and measure the performance of MPI_Bcast exactly? #11022

@razor1991

Description


MPI_Bcast shows different performance with different MPI_Barrier algorithms when measured via osu_bcast.

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

Open MPI v4.1.x with UCX v1.10.1

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

git clone
./configure --prefix=/path/to/mpi --with-ucx=/path/to/ucx --enable-mpi1-compatibility --with-platform=contrib/platform/mellanox/optimized

Please describe the system on which you are running

  • Operating system/version: CentOS 7.6
  • Computer hardware: 2 × Kunpeng 920 (128 cores per node)
  • Network type: Mellanox ConnectX-5 (CX5)

Details of the problem

I ran the osu_bcast test on 8 nodes with 128 ppn, and the reported latency is much better than on 4 nodes for small message sizes.
Intuitively, 8 nodes should perform worse than 4 nodes.

8-node test result:

mpirun -np 1024 -N 128 --hostfile ./hf8 --mca btl ^openib -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_TLS=sm,rc osu_bcast -f

# OSU MPI Broadcast Latency Test v5.9
# Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
1                       1.70              0.93              2.55        1000
2                       1.60              0.88              2.51        1000
4                       1.60              0.86              2.53        1000
8                       1.63              0.90              2.49        1000
16                      1.63              0.86              2.52        1000
32                      2.23              1.40              3.04        1000
64                      2.57              1.47              3.86        1000
128                     2.74              1.41              3.86        1000
256                     4.14              2.04              5.93        1000
512                     4.75              2.47              6.73        1000
1024                    7.41              4.15             10.42        1000
2048                    9.06              5.12             12.36        1000
4096                   13.68              7.09             19.08        1000

4-node test result:

mpirun -np 512 -N 128 --hostfile ./hf4 --mca btl ^openib -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_TLS=sm,rc osu_bcast -f

# OSU MPI Broadcast Latency Test v5.9
# Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
1                       7.64              2.48             15.57        1000
2                       7.74              2.50             15.78        1000
4                       7.62              2.51             15.78        1000
8                       7.61              2.51             15.49        1000
16                      7.66              2.43             15.61        1000
32                      8.18              2.61             16.52        1000
64                      9.19              2.89             17.88        1000
128                     9.32              3.20             17.94        1000
256                    12.02              3.73             25.10        1000
512                    12.83              4.28             26.23        1000
1024                   14.80              4.92             29.42        1000
2048                   16.38              5.26             33.53        1000
4096                   19.58              7.11             38.04        1000

I read the source code of coll/tuned and found that MPI_Barrier uses algorithm 4 (bruck) at 512 procs and algorithm 6 (tree) at 1024 procs. So I forced the MPI_Barrier algorithm from 6 to 4 at 1024 procs; the result is below.
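
For context on why the barrier choice matters at all: as I understand it, osu_bcast times each MPI_Bcast call on every rank and separates iterations with an MPI_Barrier. The barrier is outside the timed region, but its exit skew decides how tightly ranks are re-synchronized before the next broadcast starts. The following is a simplified sketch of that kind of timing loop, written by me for illustration only (not the actual OSU source; warm-up iterations and option handling are omitted):

/* Simplified sketch of an osu_bcast-style timing loop (illustration only,
 * not the actual OSU source). Each rank times its own MPI_Bcast call; the
 * barrier between iterations is NOT timed, but how skewed ranks are when
 * they leave it affects when each rank enters the next broadcast. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int iterations = 1000;
    const int size = 1024;            /* example message size in bytes */
    int rank, nprocs;
    double timer = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    char *buf = malloc(size);

    for (int i = 0; i < iterations; i++) {
        double t_start = MPI_Wtime();
        MPI_Bcast(buf, size, MPI_CHAR, 0, MPI_COMM_WORLD);
        double t_stop = MPI_Wtime();
        timer += t_stop - t_start;

        /* Re-synchronize before the next iteration; the algorithm used
         * here (bruck, tree, ...) changes the exit skew across ranks. */
        MPI_Barrier(MPI_COMM_WORLD);
    }

    double local_lat_us = (timer * 1e6) / iterations;
    double min_lat, max_lat, sum_lat;
    MPI_Reduce(&local_lat_us, &min_lat, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
    MPI_Reduce(&local_lat_us, &max_lat, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    MPI_Reduce(&local_lat_us, &sum_lat, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        printf("avg %.2f us, min %.2f us, max %.2f us\n",
               sum_lat / nprocs, min_lat, max_lat);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}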

8-node test result with MPI_Barrier forced to algorithm 4:

mpirun -np 1024 -N 128 --hostfile ./hf8 --mca btl ^openib -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_TLS=sm,rc -mca coll_tuned_use_dynamic_rules true -mca coll_tuned_barrier_algorithm 4 osu_bcast -f

# OSU MPI Broadcast Latency Test v5.9
# Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
1                       8.41              2.53             16.74        1000
2                       8.53              2.66             16.99        1000
4                       8.48              2.63             16.96        1000
8                       8.39              2.55             16.75        1000
16                      8.47              2.53             16.65        1000
32                      9.11              3.01             17.86        1000
64                     10.18              3.25             20.58        1000
128                    10.42              3.35             20.40        1000
256                    13.03              3.66             27.95        1000
512                    13.87              4.20             29.90        1000
1024                   16.49              5.23             34.55        1000
2048                   25.56              7.90             50.83        1000
4096                   22.57              7.53             44.77        1000

So, I have some questions:

  1. Why does a different MPI_Barrier algorithm affect the latency reported by osu_bcast?
  2. How should the performance of MPI_Bcast be defined and measured exactly? (See the sketch below for one candidate I am considering.)
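
Regarding question 2, one candidate definition I have been considering is: start all ranks from a barrier, time the broadcast on each rank, and take, per iteration, the time until the slowest rank has the data. A rough sketch of my own (not an established definition; it assumes MPI is already initialized by the caller):

/* Rough sketch (my own): per-iteration bcast latency defined as the time
 * until the slowest rank has received the data, starting from a barrier.
 * Assumes MPI_Init has already been called. */
#include <mpi.h>

double timed_bcast_max_us(void *buf, int count, int iterations)
{
    double total = 0.0;
    for (int i = 0; i < iterations; i++) {
        MPI_Barrier(MPI_COMM_WORLD);          /* common start point */
        double t0 = MPI_Wtime();
        MPI_Bcast(buf, count, MPI_CHAR, 0, MPI_COMM_WORLD);
        double elapsed = MPI_Wtime() - t0;

        /* time of the slowest rank in this iteration */
        double iter_max;
        MPI_Allreduce(&elapsed, &iter_max, 1, MPI_DOUBLE, MPI_MAX,
                      MPI_COMM_WORLD);
        total += iter_max;
    }
    return (total * 1e6) / iterations;        /* average max-latency in us */
}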
