MPI_Bcast shows different performance under different MPI_Barrier algorithms via osu_bcast.
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
Open MPI v4.1.x with UCX v1.10.1
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
git clone
./configure --prefix=/path/to/mpi --with-ucx=/path/to/ucx --enable-mpi1-compatibility --with-platform=contrib/platform/mellanox/optimized
Please describe the system on which you are running
- Operating system/version: CentOS 7.6
- Computer hardware: 2 × Kunpeng 920 per node (128 cores per node)
- Network type: Mellanox ConnectX-5
Details of the problem
I ran the osu_bcast test on 8 nodes with 128 ppn, and the reported latency for small messages is much better than with 4 nodes.
Intuitively, the performance on 8 nodes should be worse than on 4 nodes.
8-node test result:
mpirun -np 1024 -N 128 --hostfile ./hf8 --mca btl ^openib -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_TLS=sm,rc osu_bcast -f
# OSU MPI Broadcast Latency Test v5.9
# Size Avg Latency(us) Min Latency(us) Max Latency(us) Iterations
1 1.70 0.93 2.55 1000
2 1.60 0.88 2.51 1000
4 1.60 0.86 2.53 1000
8 1.63 0.90 2.49 1000
16 1.63 0.86 2.52 1000
32 2.23 1.40 3.04 1000
64 2.57 1.47 3.86 1000
128 2.74 1.41 3.86 1000
256 4.14 2.04 5.93 1000
512 4.75 2.47 6.73 1000
1024 7.41 4.15 10.42 1000
2048 9.06 5.12 12.36 1000
4096 13.68 7.09 19.08 1000
4-node test result:
mpirun -np 512 -N 128 --hostfile ./hf4 --mca btl ^openib -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_TLS=sm,rc osu_bcast -f
# OSU MPI Broadcast Latency Test v5.9
# Size Avg Latency(us) Min Latency(us) Max Latency(us) Iterations
1 7.64 2.48 15.57 1000
2 7.74 2.50 15.78 1000
4 7.62 2.51 15.78 1000
8 7.61 2.51 15.49 1000
16 7.66 2.43 15.61 1000
32 8.18 2.61 16.52 1000
64 9.19 2.89 17.88 1000
128 9.32 3.20 17.94 1000
256 12.02 3.73 25.10 1000
512 12.83 4.28 26.23 1000
1024 14.80 4.92 29.42 1000
2048 16.38 5.26 33.53 1000
4096 19.58 7.11 38.04 1000
I read the source code of coll/tuned and found that MPI_Barrier uses algorithm 4 (bruck) with 512 processes and algorithm 6 (tree) with 1024 processes. So I forced the MPI_Barrier algorithm from 6 to 4 for the 1024-process run; the result is below.
8-node test result with MPI_Barrier forced to algorithm 4:
mpirun -np 1024 -N 128 --hostfile ./hf8 --mca btl ^openib -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_TLS=sm,rc -mca coll_tuned_use_dynamic_rules true -mca coll_tuned_barrier_algorithm 4 osu_bcast -f
# OSU MPI Broadcast Latency Test v5.9
# Size Avg Latency(us) Min Latency(us) Max Latency(us) Iterations
1 8.41 2.53 16.74 1000
2 8.53 2.66 16.99 1000
4 8.48 2.63 16.96 1000
8 8.39 2.55 16.75 1000
16 8.47 2.53 16.65 1000
32 9.11 3.01 17.86 1000
64 10.18 3.25 20.58 1000
128 10.42 3.35 20.40 1000
256 13.03 3.66 27.95 1000
512 13.87 4.20 29.90 1000
1024 16.49 5.23 34.55 1000
2048 25.56 7.90 50.83 1000
4096 22.57 7.53 44.77 1000
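With the bruck barrier forced at 1024 processes, the numbers look much more like the 4-node run, so I suspect the barrier that osu_bcast places between iterations is what changes the reported latency: if the barrier lets ranks exit at very different times, a late rank may find the broadcast data already delivered and its MPI_Bcast returns almost immediately. Below is a minimal sketch of the kind of per-iteration timing loop I believe osu_bcast uses (my understanding of the general structure, not the literal OSU source):

```c
/* Simplified sketch of an osu_bcast-style measurement loop (assumption:
 * this mirrors the general structure of the benchmark, not its exact
 * source).  Each rank times its own MPI_Bcast call, and an MPI_Barrier
 * separates iterations, so barrier exit skew changes when non-root ranks
 * enter the next broadcast and how long they appear to wait inside it. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int iterations = 1000, size = 4096;
    int rank, nprocs;
    double t_start, local = 0.0, avg, lmin, lmax;
    char *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    buf = malloc(size);

    for (int i = 0; i < iterations; i++) {
        t_start = MPI_Wtime();
        MPI_Bcast(buf, size, MPI_CHAR, 0, MPI_COMM_WORLD);
        local += MPI_Wtime() - t_start;   /* per-rank time for this bcast      */
        MPI_Barrier(MPI_COMM_WORLD);      /* the barrier whose algorithm matters */
    }
    local = local * 1e6 / iterations;     /* per-rank average latency in us    */

    /* min/avg/max across ranks, as printed by osu_bcast -f */
    MPI_Reduce(&local, &lmin, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
    MPI_Reduce(&local, &lmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    MPI_Reduce(&local, &avg,  1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("avg %.2f  min %.2f  max %.2f (us)\n", avg / nprocs, lmin, lmax);

    free(buf);
    MPI_Finalize();
    return 0;
}
```

In this structure, the per-rank time depends on when each rank enters MPI_Bcast relative to the root, which is exactly what the barrier algorithm controls.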
So, I have some questions.
- Why does the choice of MPI_Barrier algorithm affect the latency reported by osu_bcast?
- How should the performance of MPI_Bcast be defined and measured exactly?
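For the second question, one possibility I have considered (only a sketch under my own assumptions, not a claim about how the benchmark should behave) is to define a broadcast as complete when the last rank has received the data, start every iteration from a common barrier, and take the maximum per-iteration time across ranks:

```c
/* Hedged sketch of an alternative measurement: start all ranks from a
 * common barrier, time the bcast on every rank, and take the maximum
 * elapsed time per iteration, approximating "time until the last rank
 * has the data".  Still biased by barrier exit skew, but it no longer
 * rewards ranks that enter the broadcast late. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int iterations = 1000, size = 4096;
    int rank;
    double elapsed, iter_max, total = 0.0;
    char *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    buf = malloc(size);

    for (int i = 0; i < iterations; i++) {
        MPI_Barrier(MPI_COMM_WORLD);             /* common starting point      */
        double t_start = MPI_Wtime();
        MPI_Bcast(buf, size, MPI_CHAR, 0, MPI_COMM_WORLD);
        elapsed = MPI_Wtime() - t_start;
        /* slowest rank this iteration ~ time until everyone has the data */
        MPI_Reduce(&elapsed, &iter_max, 1, MPI_DOUBLE, MPI_MAX, 0,
                   MPI_COMM_WORLD);
        if (rank == 0)
            total += iter_max;
    }
    if (rank == 0)
        printf("mean completion time: %.2f us\n", total * 1e6 / iterations);

    free(buf);
    MPI_Finalize();
    return 0;
}
```

I would expect a number defined this way to be less sensitive to the barrier algorithm, but I am not sure whether this is the right way to think about MPI_Bcast performance.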