Error when using cuda-aware-mpi-example: bandwidth is wrong #41
Comments
@jirikraus can you take a look at this issue?
Thanks for making me aware, Mark. I would have missed this. I need to wrap up a few other things and will take a look at this later.
I found that the reason is the local domain size. Using the same hardware setup (4 nodes, 1 A100 GPU per node), with a local domain size of 4096 the reported bandwidth is around 800 GB/s, but with a local domain size of 20480 it is around 2.4 TB/s. Is there a problem with the bandwidth calculation?
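For reference, below is a minimal sketch of the kind of arithmetic such Jacobi benchmarks typically use to derive a "device bandwidth" figure, plugging in the numbers from the normal-MPI log further down in this issue. The 64 bytes-per-update constant is an assumption chosen so the result lands near the reported 5.03 TB/s; it is not taken from the repository's code.

```c
/* Rough sketch (not the repository's exact code) of how a Jacobi benchmark
 * can derive a nominal "device bandwidth" figure. The bytes-per-update
 * constant below is an ASSUMPTION for illustration only. */
#include <stdio.h>

int main(void)
{
    const double nx = 20480.0, ny = 20480.0;  /* local domain size per process */
    const double nprocs = 2.0;                /* number of processes (GPUs)    */
    const double iters = 2000.0;              /* Jacobi iterations             */
    const double runtime_s = 21.3250;         /* total Jacobi run time         */
    const double bytes_per_update = 64.0;     /* ASSUMED nominal traffic/point */

    double updates = nx * ny * nprocs * iters;            /* total point updates */
    double glups = updates / runtime_s / 1e9;              /* ~78.7 GLU/s total   */
    double bw_tbs = updates * bytes_per_update
                    / runtime_s / 1e12;                    /* nominal TB/s total  */

    printf("Lattice updates: %.2f GLU/s (total)\n", glups);
    printf("Nominal device bandwidth: %.2f TB/s (total)\n", bw_tbs);
    return 0;
}
```

With these inputs the sketch roughly reproduces the log's 78.66 GLU/s and 5.03 TB/s. Because the figure is a fixed nominal byte count per lattice update divided by run time, it can exceed the GPU's physical 1,555 GB/s peak whenever the assumed per-update traffic is larger than what actually reaches DRAM (for example because neighbouring points are reused from cache); whether that is the cause of the discrepancy here is only a guess.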
Hi Mountain-ql, sorry for following up late. I did not have the time to dive deep into this yet. I agree that something is off with the bandwidth calculation. Regarding the performance difference between CUDA-aware MPI and regular MPI, can you provide a few more details on your system? Which exact MPI are you using (exact version and how it was built), and can you share the output of nvidia-smi topo -m on the system you are running on?
Sorry for the late reply. Here is the output of nvidia-smi topo -m:
[nvidia-smi topo -m table: columns GPU0, GPU1, mlx5_0, mlx5_1, CPU Affinity, NUMA Affinity; matrix values not preserved. Legend: X = Self]
Thanks. Can you attach the output of ucx_info -b?
Sorry for the late reply! Here is the output of "ucx_info -b":
Thanks. I can't spot anything wrong in your software setup. As the performance difference between CUDA-aware MPI and regular MPI on a single node is about 2x, and CUDA-aware MPI is faster for 2 processes on two nodes, I suspect there is an issue with the GPU affinity handling, i.e. how each MPI rank selects its GPU on the node.
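For illustration, here is a minimal sketch of local-rank-based GPU binding, assuming the launcher exports OMPI_COMM_WORLD_LOCAL_RANK (Open MPI's mpirun) or SLURM_LOCALID (srun). The actual example reads whichever variable name is configured via ENV_LOCAL_RANK, so this is only an outline of the idea, not the repository's code.

```c
/* Minimal sketch of local-rank based GPU binding. Assumes the launcher
 * exports OMPI_COMM_WORLD_LOCAL_RANK or SLURM_LOCALID; not the example's
 * actual binding code. Build with nvcc or link against the CUDA runtime. */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void)
{
    const char *rank_str = getenv("OMPI_COMM_WORLD_LOCAL_RANK");
    if (rank_str == NULL)
        rank_str = getenv("SLURM_LOCALID");   /* fall back for srun launches */

    int local_rank = rank_str ? atoi(rank_str) : 0;

    int dev_count = 0;
    cudaGetDeviceCount(&dev_count);

    /* Bind each local rank to a distinct GPU on the node. */
    if (dev_count > 0)
        cudaSetDevice(local_rank % dev_count);

    int dev = -1;
    cudaGetDevice(&dev);
    printf("Local rank %d is using GPU %d of %d\n", local_rank, dev, dev_count);
    return 0;
}
```

Note that if neither variable is set, this sketch silently falls back to device 0 for every rank; whether the repository's code behaves the same way under a given launcher would need checking.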
Thanks a lot!!
Thanks for the feedback. Closing this as it does not seem to be an issue with the code. |
I tried to run jacobi_cuda_aware_mpi and jacobi_cuda_normal_mpi on an HPC cluster, using 2 A100 GPUs with 40 GB memory each. The maximum GPU memory bandwidth of an A100 is 1,555 GB/s, but in the benchmark I got 2.52 TB/s. Also, when I used GPUs on the same node, the CUDA-aware run was slower than the normal MPI one...
This is the normal MPI result, from 2 NVIDIA A100s on the same node:
Topology size: 2 x 1
Local domain size (current node): 20480 x 20480
Global domain size (all nodes): 40960 x 20480
normal-ID= 0
normal-ID= 1
Starting Jacobi run with 2 processes using "A100-SXM4-40GB" GPUs (ECC enabled: 2 / 2):
Iteration: 0 - Residue: 0.250000
Iteration: 100 - Residue: 0.002397
Iteration: 200 - Residue: 0.001204
Iteration: 300 - Residue: 0.000804
Iteration: 400 - Residue: 0.000603
Iteration: 500 - Residue: 0.000483
Iteration: 600 - Residue: 0.000403
Iteration: 700 - Residue: 0.000345
Iteration: 800 - Residue: 0.000302
Iteration: 900 - Residue: 0.000269
Iteration: 1000 - Residue: 0.000242
Iteration: 1100 - Residue: 0.000220
Iteration: 1200 - Residue: 0.000201
Iteration: 1300 - Residue: 0.000186
Iteration: 1400 - Residue: 0.000173
Iteration: 1500 - Residue: 0.000161
Iteration: 1600 - Residue: 0.000151
Iteration: 1700 - Residue: 0.000142
Iteration: 1800 - Residue: 0.000134
Iteration: 1900 - Residue: 0.000127
Stopped after 2000 iterations with residue 0.000121
Total Jacobi run time: 21.3250 sec.
Average per-process communication time: 0.2794 sec.
Measured lattice updates: 78.66 GLU/s (total), 39.33 GLU/s (per process)
Measured FLOPS: 393.31 GFLOPS (total), 196.66 GFLOPS (per process)
Measured device bandwidth: 5.03 TB/s (total), 2.52 TB/s (per process)
This is the CUDA-aware MPI result, from 2 NVIDIA A100s on the same node:
Topology size: 2 x 1
Local domain size (current node): 20480 x 20480
Global domain size (all nodes): 40960 x 20480
Starting Jacobi run with 2 processes using "A100-SXM4-40GB" GPUs (ECC enabled: 2 / 2):
Iteration: 0 - Residue: 0.250000
Iteration: 100 - Residue: 0.002397
Iteration: 200 - Residue: 0.001204
Iteration: 300 - Residue: 0.000804
Iteration: 400 - Residue: 0.000603
Iteration: 500 - Residue: 0.000483
Iteration: 600 - Residue: 0.000403
Iteration: 700 - Residue: 0.000345
Iteration: 800 - Residue: 0.000302
Iteration: 900 - Residue: 0.000269
Iteration: 1000 - Residue: 0.000242
Iteration: 1100 - Residue: 0.000220
Iteration: 1200 - Residue: 0.000201
Iteration: 1300 - Residue: 0.000186
Iteration: 1400 - Residue: 0.000173
Iteration: 1500 - Residue: 0.000161
Iteration: 1600 - Residue: 0.000151
Iteration: 1700 - Residue: 0.000142
Iteration: 1800 - Residue: 0.000134
Iteration: 1900 - Residue: 0.000127
Stopped after 2000 iterations with residue 0.000121
Total Jacobi run time: 51.8048 sec.
Average per-process communication time: 4.4083 sec.
Measured lattice updates: 32.38 GLU/s (total), 16.19 GLU/s (per process)
Measured FLOPS: 161.90 GFLOPS (total), 80.95 GFLOPS (per process)
Measured device bandwidth: 2.07 TB/s (total), 1.04 TB/s (per process)
I ran both on the same node with the same GPUs. Because I am submitting through sbatch, I changed the flag "ENV_LOCAL_RANK" to "SLURM_LOCALID"; I also tried "OMPI_COMM_WORLD_LOCAL_RANK" since I am using Open MPI. Either way, the CUDA-aware MPI result was much slower than the normal one when both GPUs are on the same node (but when each GPU is on a different node, CUDA-aware MPI is a little faster than the normal one). Maybe I didn't activate CUDA-aware support?
Does anyone have an idea about this? Thanks a lot!
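One quick way to check whether CUDA-aware support is actually available is the compile-time/run-time query that Open MPI exposes through mpi-ext.h. The following is a minimal sketch based on the pattern documented by Open MPI; it assumes an Open MPI build (other MPI implementations may not provide these symbols).

```c
/* Minimal check for CUDA-aware support in Open MPI.
 * Assumes Open MPI; other MPI implementations may not provide mpi-ext.h
 * or the MPIX_* symbols used below. Build with: mpicc check_cuda_aware.c */
#include <stdio.h>
#include <mpi.h>
#include <mpi-ext.h>   /* Open MPI extensions, including the CUDA-aware macros */

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

#if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
    printf("Compile time: this Open MPI was built with CUDA-aware support.\n");
    /* The feature can still be disabled at run time, so query it too. */
    if (MPIX_Query_cuda_support())
        printf("Run time: CUDA-aware support is enabled.\n");
    else
        printf("Run time: CUDA-aware support is disabled.\n");
#elif defined(MPIX_CUDA_AWARE_SUPPORT)
    printf("Compile time: this Open MPI was built WITHOUT CUDA-aware support.\n");
#else
    printf("Cannot determine CUDA-aware support from this MPI's headers.\n");
#endif

    MPI_Finalize();
    return 0;
}
```

Alternatively, for Open MPI, `ompi_info --parsable --all | grep mpi_built_with_cuda_support:value` should report the same compile-time information without writing any code.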