MPI_Bcast shows different performance under different MPI_Barrier algorithms via osu_bcast.
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
Open MPI v4.1.x with UCX v1.10.1
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
git clone
./configure --prefix=/path/to/mpi --with-ucx=/path/to/ucx --enable-mpi1-compatibility --with-platform=contrib/platform/mellanox/optimized
Please describe the system on which you are running
- Operating system/version: CentOS 7.6
- Computer hardware: 2 × Kunpeng 920 per node (128 cores per node)
- Network type: Mellanox ConnectX-5
Details of the problem
I ran the osu_bcast test on 8 nodes with 128 ppn, and the reported latency for small messages is much better than with 4 nodes.
Intuitively, the performance on 8 nodes should be worse than on 4 nodes.
8-node test result:
mpirun -np 1024 -N 128 --hostfile ./hf8 --mca btl ^openib -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_TLS=sm,rc osu_bcast -f
# OSU MPI Broadcast Latency Test v5.9
# Size Avg Latency(us) Min Latency(us) Max Latency(us) Iterations
1 1.70 0.93 2.55 1000
2 1.60 0.88 2.51 1000
4 1.60 0.86 2.53 1000
8 1.63 0.90 2.49 1000
16 1.63 0.86 2.52 1000
32 2.23 1.40 3.04 1000
64 2.57 1.47 3.86 1000
128 2.74 1.41 3.86 1000
256 4.14 2.04 5.93 1000
512 4.75 2.47 6.73 1000
1024 7.41 4.15 10.42 1000
2048 9.06 5.12 12.36 1000
4096 13.68 7.09 19.08 1000
4-node test result:
mpirun -np 512 -N 128 --hostfile ./hf4 --mca btl ^openib -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_TLS=sm,rc osu_bcast -f
# OSU MPI Broadcast Latency Test v5.9
# Size Avg Latency(us) Min Latency(us) Max Latency(us) Iterations
1 7.64 2.48 15.57 1000
2 7.74 2.50 15.78 1000
4 7.62 2.51 15.78 1000
8 7.61 2.51 15.49 1000
16 7.66 2.43 15.61 1000
32 8.18 2.61 16.52 1000
64 9.19 2.89 17.88 1000
128 9.32 3.20 17.94 1000
256 12.02 3.73 25.10 1000
512 12.83 4.28 26.23 1000
1024 14.80 4.92 29.42 1000
2048 16.38 5.26 33.53 1000
4096 19.58 7.11 38.04 1000
I read the source code of coll/tuned and found that MPI_Barrier uses algorithm 4 (bruck) with 512 processes and algorithm 6 (tree) with 1024 processes. So I forced the MPI_Barrier algorithm from 6 to 4 for the 1024-process run; the result is below.
8-node test result with MPI_Barrier forced to algorithm 4:
mpirun -np 1024 -N 128 --hostfile ./hf8 --mca btl ^openib -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_TLS=sm,rc -mca coll_tuned_use_dynamic_rules true -mca coll_tuned_barrier_algorithm 4 osu_bcast -f
# OSU MPI Broadcast Latency Test v5.9
# Size Avg Latency(us) Min Latency(us) Max Latency(us) Iterations
1 8.41 2.53 16.74 1000
2 8.53 2.66 16.99 1000
4 8.48 2.63 16.96 1000
8 8.39 2.55 16.75 1000
16 8.47 2.53 16.65 1000
32 9.11 3.01 17.86 1000
64 10.18 3.25 20.58 1000
128 10.42 3.35 20.40 1000
256 13.03 3.66 27.95 1000
512 13.87 4.20 29.90 1000
1024 16.49 5.23 34.55 1000
2048 25.56 7.90 50.83 1000
4096 22.57 7.53 44.77 1000
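With the bruck barrier forced at 1024 processes, the numbers look much more like the 4-node run, so I suspect the barrier that osu_bcast places between iterations is what changes the reported latency: if the barrier lets ranks exit at very different times, a late rank may find the broadcast data already delivered and its MPI_Bcast returns almost immediately. Below is a minimal sketch of the kind of per-iteration timing loop I believe osu_bcast uses (my understanding of the general structure, not the literal OSU source):

```c
/* Simplified sketch of an osu_bcast-style measurement loop (assumption:
 * this mirrors the general structure of the benchmark, not its exact
 * source).  Each rank times its own MPI_Bcast call, and an MPI_Barrier
 * separates iterations, so barrier exit skew changes when non-root ranks
 * enter the next broadcast and how long they appear to wait inside it. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int iterations = 1000, size = 4096;
    int rank, nprocs;
    double t_start, local = 0.0, avg, lmin, lmax;
    char *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    buf = malloc(size);

    for (int i = 0; i < iterations; i++) {
        t_start = MPI_Wtime();
        MPI_Bcast(buf, size, MPI_CHAR, 0, MPI_COMM_WORLD);
        local += MPI_Wtime() - t_start;   /* per-rank time for this bcast      */
        MPI_Barrier(MPI_COMM_WORLD);      /* the barrier whose algorithm matters */
    }
    local = local * 1e6 / iterations;     /* per-rank average latency in us    */

    /* min/avg/max across ranks, as printed by osu_bcast -f */
    MPI_Reduce(&local, &lmin, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
    MPI_Reduce(&local, &lmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    MPI_Reduce(&local, &avg,  1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("avg %.2f  min %.2f  max %.2f (us)\n", avg / nprocs, lmin, lmax);

    free(buf);
    MPI_Finalize();
    return 0;
}
```

In this structure, the per-rank time depends on when each rank enters MPI_Bcast relative to the root, which is exactly what the barrier algorithm controls.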
So, I have some questions.
- Why does the choice of MPI_Barrier algorithm affect the latency reported by osu_bcast?
- How should the performance of MPI_Bcast be defined and measured exactly?
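For the second question, one possibility I have considered (only a sketch under my own assumptions, not a claim about how the benchmark should behave) is to define a broadcast as complete when the last rank has received the data, start every iteration from a common barrier, and take the maximum per-iteration time across ranks:

```c
/* Hedged sketch of an alternative measurement: start all ranks from a
 * common barrier, time the bcast on every rank, and take the maximum
 * elapsed time per iteration, approximating "time until the last rank
 * has the data".  Still biased by barrier exit skew, but it no longer
 * rewards ranks that enter the broadcast late. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int iterations = 1000, size = 4096;
    int rank;
    double elapsed, iter_max, total = 0.0;
    char *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    buf = malloc(size);

    for (int i = 0; i < iterations; i++) {
        MPI_Barrier(MPI_COMM_WORLD);             /* common starting point      */
        double t_start = MPI_Wtime();
        MPI_Bcast(buf, size, MPI_CHAR, 0, MPI_COMM_WORLD);
        elapsed = MPI_Wtime() - t_start;
        /* slowest rank this iteration ~ time until everyone has the data */
        MPI_Reduce(&elapsed, &iter_max, 1, MPI_DOUBLE, MPI_MAX, 0,
                   MPI_COMM_WORLD);
        if (rank == 0)
            total += iter_max;
    }
    if (rank == 0)
        printf("mean completion time: %.2f us\n", total * 1e6 / iterations);

    free(buf);
    MPI_Finalize();
    return 0;
}
```

I would expect a number defined this way to be less sensitive to the barrier algorithm, but I am not sure whether this is the right way to think about MPI_Bcast performance.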