Error when using cuda-aware-mpi-example: bandwidth was wrong #41

Closed
Mountain-ql opened this issue Feb 8, 2022 · 10 comments

@Mountain-ql

I tried to run jacobi_cuda_aware_mpi and jacobi_cuda_normal_mpi on an HPC cluster, using two A100 GPUs with 40 GB of memory each as devices. The maximum GPU memory bandwidth is 1,555 GB/s, but in the benchmark I got 2.52 TB/s. Also, when I used the GPUs on the same node, the bandwidth of the CUDA-aware run was lower than that of the normal one...

This is the normal MPI result, from 2 NVIDIA A100s on the same node:
Topology size: 2 x 1
Local domain size (current node): 20480 x 20480
Global domain size (all nodes): 40960 x 20480
normal-ID= 0
normal-ID= 1
Starting Jacobi run with 2 processes using "A100-SXM4-40GB" GPUs (ECC enabled: 2 / 2):
Iteration: 0 - Residue: 0.250000
Iteration: 100 - Residue: 0.002397
Iteration: 200 - Residue: 0.001204
Iteration: 300 - Residue: 0.000804
Iteration: 400 - Residue: 0.000603
Iteration: 500 - Residue: 0.000483
Iteration: 600 - Residue: 0.000403
Iteration: 700 - Residue: 0.000345
Iteration: 800 - Residue: 0.000302
Iteration: 900 - Residue: 0.000269
Iteration: 1000 - Residue: 0.000242
Iteration: 1100 - Residue: 0.000220
Iteration: 1200 - Residue: 0.000201
Iteration: 1300 - Residue: 0.000186
Iteration: 1400 - Residue: 0.000173
Iteration: 1500 - Residue: 0.000161
Iteration: 1600 - Residue: 0.000151
Iteration: 1700 - Residue: 0.000142
Iteration: 1800 - Residue: 0.000134
Iteration: 1900 - Residue: 0.000127
Stopped after 2000 iterations with residue 0.000121
Total Jacobi run time: 21.3250 sec.
Average per-process communication time: 0.2794 sec.
Measured lattice updates: 78.66 GLU/s (total), 39.33 GLU/s (per process)
Measured FLOPS: 393.31 GFLOPS (total), 196.66 GFLOPS (per process)
Measured device bandwidth: 5.03 TB/s (total), 2.52 TB/s (per process)

This is the CUDA-aware MPI result, from 2 NVIDIA A100s on the same node:
Topology size: 2 x 1
Local domain size (current node): 20480 x 20480
Global domain size (all nodes): 40960 x 20480
Starting Jacobi run with 2 processes using "A100-SXM4-40GB" GPUs (ECC enabled: 2 / 2):
Iteration: 0 - Residue: 0.250000
Iteration: 100 - Residue: 0.002397
Iteration: 200 - Residue: 0.001204
Iteration: 300 - Residue: 0.000804
Iteration: 400 - Residue: 0.000603
Iteration: 500 - Residue: 0.000483
Iteration: 600 - Residue: 0.000403
Iteration: 700 - Residue: 0.000345
Iteration: 800 - Residue: 0.000302
Iteration: 900 - Residue: 0.000269
Iteration: 1000 - Residue: 0.000242
Iteration: 1100 - Residue: 0.000220
Iteration: 1200 - Residue: 0.000201
Iteration: 1300 - Residue: 0.000186
Iteration: 1400 - Residue: 0.000173
Iteration: 1500 - Residue: 0.000161
Iteration: 1600 - Residue: 0.000151
Iteration: 1700 - Residue: 0.000142
Iteration: 1800 - Residue: 0.000134
Iteration: 1900 - Residue: 0.000127
Stopped after 2000 iterations with residue 0.000121
Total Jacobi run time: 51.8048 sec.
Average per-process communication time: 4.4083 sec.
Measured lattice updates: 32.38 GLU/s (total), 16.19 GLU/s (per process)
Measured FLOPS: 161.90 GFLOPS (total), 80.95 GFLOPS (per process)
Measured device bandwidth: 2.07 TB/s (total), 1.04 TB/s (per process)

I ran them on the same node with the same GPUs. Because I submit jobs through sbatch, I changed the flag "ENV_LOCAL_RANK" to "SLURM_LOCALID"; I also tried "OMPI_COMM_WORLD_LOCAL_RANK" since I use OpenMPI. In both cases the CUDA-aware MPI result was much slower than the normal one when the GPUs are on the same node (but if each GPU is on a different node, CUDA-aware MPI is a little faster than the normal one). Maybe I didn't activate CUDA-aware support?
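
The device-selection pattern I am relying on looks roughly like this (a minimal sketch on my side, not the exact code from the sample; ENV_LOCAL_RANK is the compile-time macro naming the environment variable, everything else is illustrative):

```c
#include <stdlib.h>
#include <cuda_runtime.h>

/* ENV_LOCAL_RANK names the environment variable holding the node-local
 * rank, e.g. "SLURM_LOCALID" or "OMPI_COMM_WORLD_LOCAL_RANK". */
#ifndef ENV_LOCAL_RANK
#define ENV_LOCAL_RANK "OMPI_COMM_WORLD_LOCAL_RANK"
#endif

/* Pick the GPU from the node-local rank, before MPI_Init, so that a
 * CUDA-aware MPI library binds to the right device. */
static void set_device_from_local_rank(void)
{
    int num_devices = 0;
    cudaGetDeviceCount(&num_devices);

    const char *local_rank_str = getenv(ENV_LOCAL_RANK);
    int local_rank = local_rank_str ? atoi(local_rank_str) : 0;

    cudaSetDevice(local_rank % num_devices);
}
```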

Does anyone have an idea about this? Thanks a lot!

@harrism
Member

harrism commented Feb 8, 2022

@jirikraus can you take a look at this issue?

@jirikraus
Member

Thanks for making me aware, Mark. I would have missed this. I need to wrap up a few other things and will take a look at this later.

@Mountain-ql
Author

I found that the reason is the local domain size. With the same hardware layout (4 nodes, 1 A100 GPU per node): when the local domain size is 4096, the bandwidth is around 800 GB/s, but when the local domain size is 20480, the bandwidth is around 2.4 TB/s. Is there a problem with the bandwidth calculation?

@jirikraus
Member

Hi Mountain-ql, sorry for following up late. I have not had the time to dig into this yet. I agree that something is off with the bandwidth calculation. Regarding the performance difference between CUDA-aware MPI and regular MPI, can you provide a few more details on your system? Which exact MPI are you using (exact version and how it was built), and what is the output of nvidia-smi topo -m on the system you are running on?

@Mountain-ql
Author

Sorry for the late reply.
The MPI I used was OpenMPI/4.0.5. It is a preinstalled module on the HPC system, so I don't know how it was built.
The output of "nvidia-smi topo -m" is:

GPU0 GPU1 mlx5_0 mlx5_1 CPU Affinity NUMA Affinity
GPU0 X NV12 SYS SYS 0 0-7
GPU1 NV12 X SYS SYS 0 0-7
mlx5_0 SYS SYS X SYS
mlx5_1 SYS SYS SYS X

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

@jirikraus
Member

Thanks. Can you attach the output of ompi_info -c and ucx_info -b? That will provide the missing information about the MPI you are using.

@Mountain-ql
Author

Sorry for the late reply!
Here is the output of "ompi_info -c":
Configured by: hpcglrun
Configured on: Wed Feb 17 12:42:06 CET 2021
Configure host: taurusi6395.taurus.hrsk.tu-dresden.de
Configure command line: '--prefix=/sw/installed/OpenMPI/4.0.5-gcccuda-2020b'
'--build=x86_64-pc-linux-gnu'
'--host=x86_64-pc-linux-gnu' '--with-slurm'
'--with-pmi=/usr' '--with-pmi-libdir=/usr/lib64'
'--with-knem=/opt/knem-1.1.3.90mlnx1'
'--enable-mpirun-prefix-by-default'
'--enable-shared'
'--with-cuda=/sw/installed/CUDAcore/11.1.1'
'--with-hwloc=/sw/installed/hwloc/2.2.0-GCCcore-10.2.0'
'--with-libevent=/sw/installed/libevent/2.1.12-GCCcore-10.2.0'
'--with-ofi=/sw/installed/libfabric/1.11.0-GCCcore-10.2.0'
'--with-pmix=/sw/installed/PMIx/3.1.5-GCCcore-10.2.0'
'--with-ucx=/sw/installed/UCX/1.9.0-GCCcore-10.2.0-CUDA-11.1.1'
'--without-verbs'
Built by: hpcglrun
Built on: Wed Feb 17 12:50:42 CET 2021
Built host: taurusi6395.taurus.hrsk.tu-dresden.de
C bindings: yes
C++ bindings: no
Fort mpif.h: yes (all)
Fort use mpi: yes (full: ignore TKR)
Fort use mpi size: deprecated-ompi-info-value
Fort use mpi_f08: yes
Fort mpi_f08 compliance: The mpi_f08 module is available, but due to
limitations in the gfortran compiler and/or Open
MPI, does not support the following: array
subsections, direct passthru (where possible) to
underlying Open MPI's C functionality
Fort mpi_f08 subarrays: no
Java bindings: no
Wrapper compiler rpath: runpath
C compiler: gcc
C compiler absolute: /sw/installed/GCCcore/10.2.0/bin/gcc
C compiler family name: GNU
C compiler version: 10.2.0
C char size: 1
C bool size: 1
C short size: 2
C int size: 4
C long size: 8
C float size: 4
C double size: 8
C pointer size: 8
C char align: 1
C bool align: skipped
C int align: 4
C float align: 4
C double align: 8
C++ compiler: g++
C++ compiler absolute: /sw/installed/GCCcore/10.2.0/bin/g++
Fort compiler: gfortran
Fort compiler abs: /sw/installed/GCCcore/10.2.0/bin/gfortran
Fort ignore TKR: yes (!GCC$ ATTRIBUTES NO_ARG_CHECK ::)
Fort 08 assumed shape: yes
Fort optional args: yes
Fort INTERFACE: yes
Fort ISO_FORTRAN_ENV: yes
Fort STORAGE_SIZE: yes
Fort BIND(C) (all): yes
Fort ISO_C_BINDING: yes
Fort SUBROUTINE BIND(C): yes
Fort TYPE,BIND(C): yes
Fort T,BIND(C,name="a"): yes
Fort PRIVATE: yes
Fort PROTECTED: yes
Fort ABSTRACT: yes
Fort ASYNCHRONOUS: yes
Fort PROCEDURE: yes
Fort USE...ONLY: yes
Fort C_FUNLOC: yes
Fort f08 using wrappers: yes
Fort MPI_SIZEOF: yes
Fort integer size: 4
Fort logical size: 4
Fort logical value true: 1
Fort have integer1: yes
Fort have integer2: yes
Fort have integer4: yes
Fort have integer8: yes
Fort have integer16: no
Fort have real4: yes
Fort have real8: yes
Fort have real16: yes
Fort have complex8: yes
Fort have complex16: yes
Fort have complex32: yes
Fort integer1 size: 1
Fort integer2 size: 2
Fort integer4 size: 4
Fort integer8 size: 8
Fort integer16 size: -1
Fort real size: 4
Fort real4 size: 4
Fort real8 size: 8
Fort real16 size: 16
Fort dbl prec size: 8
Fort cplx size: 8
Fort dbl cplx size: 16
Fort cplx8 size: 8
Fort cplx16 size: 16
Fort cplx32 size: 32
Fort integer align: 4
Fort integer1 align: 1
Fort integer2 align: 2
Fort integer4 align: 4
Fort integer8 align: 8
Fort integer16 align: -1
Fort real align: 4
Fort real4 align: 4
Fort real8 align: 8
Fort real16 align: 16
Fort dbl prec align: 8
Fort cplx align: 4
Fort dbl cplx align: 8
Fort cplx8 align: 4
Fort cplx16 align: 8
Fort cplx32 align: 16
C profiling: yes
C++ profiling: no
Fort mpif.h profiling: yes
Fort use mpi profiling: yes
Fort use mpi_f08 prof: yes
C++ exceptions: no
Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes,
OMPI progress: no, ORTE progress: yes, Event lib:
yes)
Sparse Groups: no
Build CFLAGS: -DNDEBUG -O3 -march=native -fno-math-errno
-finline-functions -fno-strict-aliasing
Build CXXFLAGS: -DNDEBUG -O3 -march=native -fno-math-errno
-finline-functions
Build FCFLAGS: -O3 -march=native -fno-math-errno
Build LDFLAGS: -L/sw/installed/PMIx/3.1.5-GCCcore-10.2.0/lib64
-L/sw/installed/PMIx/3.1.5-GCCcore-10.2.0/lib
-L/sw/installed/libfabric/1.11.0-GCCcore-10.2.0/lib64
-L/sw/installed/libfabric/1.11.0-GCCcore-10.2.0/lib
-L/sw/installed/UCX/1.9.0-GCCcore-10.2.0-CUDA-11.1.1/lib64
-L/sw/installed/UCX/1.9.0-GCCcore-10.2.0-CUDA-11.1.1/lib
-L/sw/installed/libevent/2.1.12-GCCcore-10.2.0/lib64
-L/sw/installed/libevent/2.1.12-GCCcore-10.2.0/lib
-L/sw/installed/hwloc/2.2.0-GCCcore-10.2.0/lib64
-L/sw/installed/hwloc/2.2.0-GCCcore-10.2.0/lib
-L/sw/installed/zlib/1.2.11-GCCcore-10.2.0/lib64
-L/sw/installed/zlib/1.2.11-GCCcore-10.2.0/lib
-L/sw/installed/GCCcore/10.2.0/lib64
-L/sw/installed/GCCcore/10.2.0/lib
-L/sw/installed/CUDAcore/11.1.1/lib64
-L/sw/installed/hwloc/2.2.0-GCCcore-10.2.0/lib
-L/sw/installed/libevent/2.1.12-GCCcore-10.2.0/lib64
Build LIBS: -lutil -lm -lrt -lcudart -lpthread -lz -lhwloc
-levent_core -levent_pthreads
Wrapper extra CFLAGS:
Wrapper extra CXXFLAGS:
Wrapper extra FCFLAGS: -I${libdir}
Wrapper extra LDFLAGS: -L/sw/installed/hwloc/2.2.0-GCCcore-10.2.0/lib
-L/sw/installed/libevent/2.1.12-GCCcore-10.2.0/lib64
-Wl,-rpath
-Wl,/sw/installed/hwloc/2.2.0-GCCcore-10.2.0/lib
-Wl,-rpath
-Wl,/sw/installed/libevent/2.1.12-GCCcore-10.2.0/lib64
-Wl,-rpath -Wl,@{libdir} -Wl,--enable-new-dtags
Wrapper extra LIBS: -lhwloc -ldl -levent_core -levent_pthreads -lutil
-lm -lrt -lcudart -lpthread -lz
Internal debug support: no
MPI interface warnings: yes
MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
dl support: yes
Heterogeneous support: no
mpirun default --prefix: yes
MPI_WTIME support: native
Symbol vis. support: yes
Host topology support: yes
IPv6 support: no
MPI1 compatibility: no
MPI extensions: affinity, cuda, pcollreq
FT Checkpoint support: no (checkpoint thread: no)
C/R Enabled Debugging: no
MPI_MAX_PROCESSOR_NAME: 256
MPI_MAX_ERROR_STRING: 256
MPI_MAX_OBJECT_NAME: 64
MPI_MAX_INFO_KEY: 36
MPI_MAX_INFO_VAL: 256
MPI_MAX_PORT_NAME: 1024
MPI_MAX_DATAREP_STRING: 128

Here is the output of "ucx_info -b":
#define UCX_CONFIG_H
#define ENABLE_BUILTIN_MEMCPY 1
#define ENABLE_DEBUG_DATA 0
#define ENABLE_MT 1
#define ENABLE_PARAMS_CHECK 0
#define ENABLE_SYMBOL_OVERRIDE 1
#define HAVE_1_ARG_BFD_SECTION_SIZE 1
#define HAVE_ALLOCA 1
#define HAVE_ALLOCA_H 1
#define HAVE_ATTRIBUTE_NOOPTIMIZE 1
#define HAVE_CLEARENV 1
#define HAVE_CPLUS_DEMANGLE 1
#define HAVE_CPU_SET_T 1
#define HAVE_CUDA 1
#define HAVE_CUDA_H 1
#define HAVE_CUDA_RUNTIME_H 1
#define HAVE_DC_EXP 1
#define HAVE_DECL_ASPRINTF 1
#define HAVE_DECL_BASENAME 1
#define HAVE_DECL_BFD_GET_SECTION_FLAGS 0
#define HAVE_DECL_BFD_GET_SECTION_VMA 0
#define HAVE_DECL_BFD_SECTION_FLAGS 1
#define HAVE_DECL_BFD_SECTION_VMA 1
#define HAVE_DECL_CPU_ISSET 1
#define HAVE_DECL_CPU_ZERO 1
#define HAVE_DECL_ETHTOOL_CMD_SPEED 1
#define HAVE_DECL_FMEMOPEN 1
#define HAVE_DECL_F_SETOWN_EX 1
#define HAVE_DECL_GDR_COPY_TO_MAPPING 1
#define HAVE_DECL_IBV_ACCESS_ON_DEMAND 1
#define HAVE_DECL_IBV_ACCESS_RELAXED_ORDERING 0
#define HAVE_DECL_IBV_ADVISE_MR 0
#define HAVE_DECL_IBV_ALLOC_DM 0
#define HAVE_DECL_IBV_ALLOC_TD 0
#define HAVE_DECL_IBV_CMD_MODIFY_QP 1
#define HAVE_DECL_IBV_CREATE_CQ_ATTR_IGNORE_OVERRUN 0
#define HAVE_DECL_IBV_CREATE_QP_EX 1
#define HAVE_DECL_IBV_CREATE_SRQ 1
#define HAVE_DECL_IBV_CREATE_SRQ_EX 1
#define HAVE_DECL_IBV_EVENT_GID_CHANGE 1
#define HAVE_DECL_IBV_EVENT_TYPE_STR 1
#define HAVE_DECL_IBV_EXP_ACCESS_ALLOCATE_MR 1
#define HAVE_DECL_IBV_EXP_ACCESS_ON_DEMAND 1
#define HAVE_DECL_IBV_EXP_ALLOC_DM 1
#define HAVE_DECL_IBV_EXP_ATOMIC_HCA_REPLY_BE 1
#define HAVE_DECL_IBV_EXP_CQ_IGNORE_OVERRUN 1
#define HAVE_DECL_IBV_EXP_CQ_MODERATION 1
#define HAVE_DECL_IBV_EXP_CREATE_QP 1
#define HAVE_DECL_IBV_EXP_CREATE_RES_DOMAIN 1
#define HAVE_DECL_IBV_EXP_CREATE_SRQ 1
#define HAVE_DECL_IBV_EXP_DCT_OOO_RW_DATA_PLACEMENT 1
#define HAVE_DECL_IBV_EXP_DESTROY_RES_DOMAIN 1
#define HAVE_DECL_IBV_EXP_DEVICE_ATTR_PCI_ATOMIC_CAPS 1
#define HAVE_DECL_IBV_EXP_DEVICE_ATTR_RESERVED_2 1
#define HAVE_DECL_IBV_EXP_DEVICE_DC_TRANSPORT 1
#define HAVE_DECL_IBV_EXP_DEVICE_MR_ALLOCATE 1
#define HAVE_DECL_IBV_EXP_MR_FIXED_BUFFER_SIZE 1
#define HAVE_DECL_IBV_EXP_MR_INDIRECT_KLMS 1
#define HAVE_DECL_IBV_EXP_ODP_SUPPORT_IMPLICIT 1
#define HAVE_DECL_IBV_EXP_POST_SEND 1
#define HAVE_DECL_IBV_EXP_PREFETCH_MR 1
#define HAVE_DECL_IBV_EXP_PREFETCH_WRITE_ACCESS 1
#define HAVE_DECL_IBV_EXP_QPT_DC_INI 1
#define HAVE_DECL_IBV_EXP_QP_CREATE_UMR 1
#define HAVE_DECL_IBV_EXP_QP_INIT_ATTR_ATOMICS_ARG 1
#define HAVE_DECL_IBV_EXP_QP_INIT_ATTR_RES_DOMAIN 1
#define HAVE_DECL_IBV_EXP_QP_OOO_RW_DATA_PLACEMENT 1
#define HAVE_DECL_IBV_EXP_QUERY_DEVICE 1
#define HAVE_DECL_IBV_EXP_QUERY_GID_ATTR 1
#define HAVE_DECL_IBV_EXP_REG_MR 1
#define HAVE_DECL_IBV_EXP_RES_DOMAIN_THREAD_MODEL 1
#define HAVE_DECL_IBV_EXP_SEND_EXT_ATOMIC_INLINE 1
#define HAVE_DECL_IBV_EXP_SETENV 1
#define HAVE_DECL_IBV_EXP_WR_EXT_MASKED_ATOMIC_CMP_AND_SWP 1
#define HAVE_DECL_IBV_EXP_WR_EXT_MASKED_ATOMIC_FETCH_AND_ADD 1
#define HAVE_DECL_IBV_EXP_WR_NOP 1
#define HAVE_DECL_IBV_GET_ASYNC_EVENT 1
#define HAVE_DECL_IBV_GET_DEVICE_NAME 1
#define HAVE_DECL_IBV_LINK_LAYER_ETHERNET 1
#define HAVE_DECL_IBV_LINK_LAYER_INFINIBAND 1
#define HAVE_DECL_IBV_MLX5_EXP_GET_CQ_INFO 1
#define HAVE_DECL_IBV_MLX5_EXP_GET_QP_INFO 1
#define HAVE_DECL_IBV_MLX5_EXP_GET_SRQ_INFO 1
#define HAVE_DECL_IBV_MLX5_EXP_UPDATE_CQ_CI 1
#define HAVE_DECL_IBV_ODP_SUPPORT_IMPLICIT 0
#define HAVE_DECL_IBV_QPF_GRH_REQUIRED 0
#define HAVE_DECL_IBV_QUERY_DEVICE_EX 1
#define HAVE_DECL_IBV_QUERY_GID 1
#define HAVE_DECL_IBV_WC_STATUS_STR 1
#define HAVE_DECL_MADV_FREE 0
#define HAVE_DECL_MADV_REMOVE 1
#define HAVE_DECL_MLX5DV_CQ_INIT_ATTR_MASK_CQE_SIZE 0
#define HAVE_DECL_MLX5DV_CREATE_QP 0
#define HAVE_DECL_MLX5DV_DCTYPE_DCT 0
#define HAVE_DECL_MLX5DV_DEVX_SUBSCRIBE_DEVX_EVENT 0
#define HAVE_DECL_MLX5DV_INIT_OBJ 1
#define HAVE_DECL_MLX5DV_IS_SUPPORTED 0
#define HAVE_DECL_MLX5DV_OBJ_AH 0
#define HAVE_DECL_MLX5DV_QP_CREATE_ALLOW_SCATTER_TO_CQE 0
#define HAVE_DECL_MLX5_WQE_CTRL_SOLICITED 1
#define HAVE_DECL_POSIX_MADV_DONTNEED 1
#define HAVE_DECL_PR_SET_PTRACER 1
#define HAVE_DECL_RDMA_ESTABLISH 1
#define HAVE_DECL_RDMA_INIT_QP_ATTR 1
#define HAVE_DECL_SPEED_UNKNOWN 1
#define HAVE_DECL_STRERROR_R 1
#define HAVE_DECL_SYS_BRK 1
#define HAVE_DECL_SYS_IPC 0
#define HAVE_DECL_SYS_MADVISE 1
#define HAVE_DECL_SYS_MMAP 1
#define HAVE_DECL_SYS_MREMAP 1
#define HAVE_DECL_SYS_MUNMAP 1
#define HAVE_DECL_SYS_SHMAT 1
#define HAVE_DECL_SYS_SHMDT 1
#define HAVE_DECL___PPC_GET_TIMEBASE_FREQ 0
#define HAVE_DETAILED_BACKTRACE 1
#define HAVE_DLFCN_H 1
#define HAVE_EXP_UMR 1
#define HAVE_EXP_UMR_KSM 1
#define HAVE_GDRAPI_H 1
#define HAVE_HW_TIMER 1
#define HAVE_IB 1
#define HAVE_IBV_DM 1
#define HAVE_IBV_EXP_DM 1
#define HAVE_IBV_EXP_QP_CREATE_UMR 1
#define HAVE_IBV_EXP_RES_DOMAIN 1
#define HAVE_IB_EXT_ATOMICS 1
#define HAVE_IN6_ADDR_S6_ADDR32 1
#define HAVE_INFINIBAND_MLX5DV_H 1
#define HAVE_INFINIBAND_MLX5_HW_H 1
#define HAVE_INTTYPES_H 1
#define HAVE_IP_IP_DST 1
#define HAVE_LIBGEN_H 1
#define HAVE_LIBRT 1
#define HAVE_LINUX_FUTEX_H 1
#define HAVE_LINUX_IP_H 1
#define HAVE_LINUX_MMAN_H 1
#define HAVE_MALLOC_GET_STATE 1
#define HAVE_MALLOC_H 1
#define HAVE_MALLOC_HOOK 1
#define HAVE_MALLOC_SET_STATE 1
#define HAVE_MALLOC_TRIM 1
#define HAVE_MASKED_ATOMICS_ENDIANNESS 1
#define HAVE_MEMALIGN 1
#define HAVE_MEMORY_H 1
#define HAVE_MLX5_HW 1
#define HAVE_MLX5_HW_UD 1
#define HAVE_MREMAP 1
#define HAVE_NETINET_IP_H 1
#define HAVE_NET_ETHERNET_H 1
#define HAVE_NUMA 1
#define HAVE_NUMAIF_H 1
#define HAVE_NUMA_H 1
#define HAVE_ODP 1
#define HAVE_ODP_IMPLICIT 1
#define HAVE_POSIX_MEMALIGN 1
#define HAVE_PREFETCH 1
#define HAVE_RDMACM_QP_LESS 1
#define HAVE_SCHED_GETAFFINITY 1
#define HAVE_SCHED_SETAFFINITY 1
#define HAVE_SIGACTION_SA_RESTORER 1
#define HAVE_SIGEVENT_SIGEV_UN_TID 1
#define HAVE_SIGHANDLER_T 1
#define HAVE_STDINT_H 1
#define HAVE_STDLIB_H 1
#define HAVE_STRERROR_R 1
#define HAVE_STRINGS_H 1
#define HAVE_STRING_H 1
#define HAVE_STRUCT_BITMASK 1
#define HAVE_STRUCT_DL_PHDR_INFO 1
#define HAVE_STRUCT_IBV_ASYNC_EVENT_ELEMENT_DCT 1
#define HAVE_STRUCT_IBV_EXP_CREATE_SRQ_ATTR_DC_OFFLOAD_PARAMS 1
#define HAVE_STRUCT_IBV_EXP_DEVICE_ATTR_EXP_DEVICE_CAP_FLAGS 1
#define HAVE_STRUCT_IBV_EXP_DEVICE_ATTR_ODP_CAPS 1
#define HAVE_STRUCT_IBV_EXP_DEVICE_ATTR_ODP_CAPS_PER_TRANSPORT_CAPS_DC_ODP_CAPS 1
#define HAVE_STRUCT_IBV_EXP_DEVICE_ATTR_ODP_MR_MAX_SIZE 1
#define HAVE_STRUCT_IBV_EXP_QP_INIT_ATTR_MAX_INL_RECV 1
#define HAVE_STRUCT_IBV_MLX5_QP_INFO_BF_NEED_LOCK 1
#define HAVE_STRUCT_MLX5DV_CQ_CQ_UAR 1
#define HAVE_STRUCT_MLX5_AH_IBV_AH 1
#define HAVE_STRUCT_MLX5_CQE64_IB_STRIDE_INDEX 1
#define HAVE_STRUCT_MLX5_GRH_AV_RMAC 1
#define HAVE_STRUCT_MLX5_SRQ_CMD_QP 1
#define HAVE_STRUCT_MLX5_WQE_AV_BASE 1
#define HAVE_SYS_EPOLL_H 1
#define HAVE_SYS_EVENTFD_H 1
#define HAVE_SYS_STAT_H 1
#define HAVE_SYS_TYPES_H 1
#define HAVE_SYS_UIO_H 1
#define HAVE_TL_DC 1
#define HAVE_TL_RC 1
#define HAVE_TL_UD 1
#define HAVE_UCM_PTMALLOC286 1
#define HAVE_UNISTD_H 1
#define HAVE_VERBS_EXP_H 1
#define HAVE___CLEAR_CACHE 1
#define HAVE___CURBRK 1
#define HAVE___SIGHANDLER_T 1
#define IBV_HW_TM 1
#define LT_OBJDIR ".libs/"
#define NVALGRIND 1
#define PACKAGE "ucx"
#define PACKAGE_BUGREPORT ""
#define PACKAGE_NAME "ucx"
#define PACKAGE_STRING "ucx 1.9"
#define PACKAGE_TARNAME "ucx"
#define PACKAGE_URL ""
#define PACKAGE_VERSION "1.9"
#define STDC_HEADERS 1
#define STRERROR_R_CHAR_P 1
#define UCM_BISTRO_HOOKS 1
#define UCS_MAX_LOG_LEVEL UCS_LOG_LEVEL_INFO
#define UCT_UD_EP_DEBUG_HOOKS 0
#define UCX_CONFIGURE_FLAGS "--disable-logging --disable-debug --disable-assertions --disable-params-check --prefix=/sw/installed/UCX/1.9.0-GCCcore-10.2.0-CUDA-11.1.1 --enable-optimizations --enable-cma --enable-mt --with-verbs --without-java --disable-doxygen-doc --with-cuda=/sw/installed/CUDAcore/11.1.1 --with-gdrcopy=/sw/installed/GDRCopy/2.1-GCCcore-10.2.0-CUDA-11.1.1"
#define UCX_MODULE_SUBDIR "ucx"
#define VERSION "1.9"
#define restrict __restrict
#define test_MODULES ":module"
#define ucm_MODULES ":cuda"
#define uct_MODULES ":cuda:ib:rdmacm:cma"
#define uct_cuda_MODULES ":gdrcopy"
#define uct_ib_MODULES ":cm"
#define uct_rocm_MODULES ""
#define ucx_perftest_MODULES ":cuda"

@jirikraus
Member

Thanks. I can't spot anything wrong in your software setup. Since the performance difference between CUDA-aware MPI and regular MPI on a single node is about 2x, and CUDA-aware MPI is faster for 2 processes on two nodes, I suspect an issue with the GPU affinity handling, i.e. ENV_LOCAL_RANK defined the wrong way (but you seem to have that right) or CUDA_VISIBLE_DEVICES being set in a funky way on the system you are using. As this code has not been updated for quite some time, can you try https://github.com/NVIDIA/multi-gpu-programming-models (also a Jacobi solver, but a simpler code that I regularly use in tutorials)?
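
One quick way to check the affinity is to print which physical GPU each rank ends up on; a minimal sketch along these lines (not part of the sample, names are illustrative):

```c
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <cuda_runtime.h>

/* Print, per MPI rank, the selected CUDA device and its PCI bus ID so a
 * wrong affinity (e.g. two ranks sharing GPU 0) is easy to spot. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int device = 0;
    char bus_id[64] = {0};
    cudaGetDevice(&device);
    cudaDeviceGetPCIBusId(bus_id, (int)sizeof(bus_id), device);

    const char *visible = getenv("CUDA_VISIBLE_DEVICES");
    printf("rank %d: device %d (PCI %s), CUDA_VISIBLE_DEVICES=%s\n",
           rank, device, bus_id, visible ? visible : "(unset)");

    MPI_Finalize();
    return 0;
}
```
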
I also checked the math for the bandwidth: the formula used does not consider caches (see https://github.com/NVIDIA-developer-blog/code-samples/blob/master/posts/cuda-aware-mpi-example/src/Host.c#L291), which explains why you are seeing memory bandwidths that are too high.
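
To illustrate with the numbers above (a back-of-the-envelope sketch on my part, not the exact formula from Host.c): 5.03 TB/s divided by 78.66 GLU/s works out to roughly 64 bytes charged per lattice update, i.e. every stencil access is counted as DRAM traffic. With the L2 cache serving most of the redundant neighbor reads, the real DRAM traffic per update is closer to one 8-byte read plus one 8-byte write, so the reported figure can land well above the 1.55 TB/s hardware peak even though the kernel runs below it.

```c
#include <stdio.h>

/* Back-of-the-envelope check of the reported numbers; the byte counts
 * are assumptions, not the exact accounting used in Host.c. */
int main(void)
{
    const double updates_per_s = 78.66e9;  /* reported total lattice updates/s */
    const double reported_bw   = 5.03e12;  /* reported total device bandwidth, B/s */

    /* Bytes the benchmark charges per lattice update (~64 B). */
    double charged_bytes = reported_bw / updates_per_s;

    /* Assuming caching reduces real DRAM traffic to roughly one 8-byte
     * read plus one 8-byte write per update (~16 B). */
    double realistic_bw = 16.0 * updates_per_s;

    printf("charged bytes per update: %.1f\n", charged_bytes);
    printf("estimated real bandwidth: %.2f TB/s (total)\n", realistic_bw / 1e12);
    return 0;
}
```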

@Mountain-ql
Author

Thanks a lot!!

@jirikraus
Member

Thanks for the feedback. Closing this as it does not seem to be an issue with the code.
