CUDA GPUDirect blocks on message size greater than GPUDirect RDMA limit with openib btl #3972

Closed
Felandric opened this issue Jul 27, 2017 · 13 comments

Felandric commented Jul 27, 2017

Background information

When CUDA GPUDirect is used to Send and Recv directly from GPU buffers with message sizes greater than the GPUDirect RDMA limit, Open MPI simply hangs instead of falling back to the expected behavior of staging the buffers through host memory.

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

v3.0.x

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Git clone.
Commands used:

./autogen.pl

./configure --prefix=/labhome/ompi/install --with-knem=/path/to/knem --with-mxm=/path/to/mxm --with-slurm --with-pmi --with-platform=contrib/platform/mellanox/optimized --with-hcoll=/path/to/hcoll --with-ucx=/path/to/ucx --with-cuda

make all install

Please describe the system on which you are running

  • Operating system/version: Red Hat Enterprise Linux Server 7.2 (Maipo) (3.10.0-327.el7.x86_64)
    CUDA distribution: 8.0
    CUDA GPUDirect Driver: Mellanox OFED GPUDirect RDMA

  • Computer hardware: 2 nodes, 48 cores/node, 2 GPUs per node
    CPUs: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
    GPUs: NVIDIA Tesla P100-PCIE-16GB

  • Network type:
    Infiniband


Details of the problem

GPUDirect is intended to improve latency and bandwidth by allowing RDMA transfers directly from GPU to GPU, bypassing the CPU. However, there is a message size limit on these RDMA transfers; above this limit, Open MPI is expected to stage the buffers through host memory with cudaMemcpy.
The limit can be changed with the parameter "-mca btl_openib_cuda_rdma_limit x", where x is the total message size in bytes.
Unfortunately, rather than behaving as expected, the program simply hangs in MPI_Recv.
According to this, the hang may occur because Open MPI uses blocking copies. The option "-mca mpi_common_cuda_cumemcpy_async 1" is meant to instruct Open MPI to use non-blocking, asynchronous copies when staging buffers through the host.
However, enabling this option changes nothing, and according to this, the option has been enabled by default since Open MPI v1.10.
I tried the same test with v1.10 and it completed successfully.
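
To confirm what a given build actually uses, these parameter defaults can be inspected with ompi_info. A minimal check, assuming the ompi_info from the installation under test is first in PATH (the output format varies between Open MPI versions):

$ ompi_info --all | grep -E "cumemcpy_async|cuda_rdma_limit"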

Below is gpudirect_bug.c, which replicates the behavior. argv[1] is taken as the number of floating-point elements in the send and receive buffers, so the actual array size is that multiplied by sizeof(float).
The btl_openib_cuda_rdma_limit parameter is a limit on the total message size; in my case an extra 56 bytes appeared to be used as a header, so a limit of 1000 bytes is actually a limit of 944 bytes of user data. This discrepancy is a minor inconvenience, but it should probably be corrected so that the parameter matches the user message length.

On the command line below I passed 1000, so the array was 4000 bytes; with the RDMA limit set to 1000 bytes the message was above the limit and the program hung. Message sizes below the limit behave as expected.
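
As a concrete sketch of that size accounting (the 56-byte header is simply the value I observed on this setup, not a documented constant; the program is illustrative only):

#include <stdio.h>

/* Illustrates the observed accounting: btl_openib_cuda_rdma_limit applies to
 * payload plus header, so the largest payload that still takes the GPUDirect
 * RDMA path is limit - header. */
int main(void)
{
        size_t rdma_limit = 1000;               /* -mca btl_openib_cuda_rdma_limit 1000 */
        size_t header = 56;                     /* observed per-message header */
        size_t payload = 1000 * sizeof(float);  /* ./gpu_test 1000 */

        printf("Effective GPUDirect RDMA payload limit: %zu bytes\n", rdma_limit - header);
        printf("Payload of %zu bytes -> %s\n", payload,
               payload <= rdma_limit - header ? "GPUDirect RDMA" : "host staging expected");
        return 0;
}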

gpudirect_bug.c:

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <cuda_runtime_api.h>
#include <mpi.h>
int main(int argc, char** argv)
{
        MPI_Init(NULL,NULL);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Status st;
        float *dev_buf;
        size_t elements = atoi(argv[1]);          /* number of float elements, from argv[1] */
        size_t arraysize = elements * sizeof(float);
        if (rank == 0) {
                printf("Array size: %zu\n", arraysize);
        }

        cudaMalloc((void**)&dev_buf, arraysize);  /* buffer is allocated in GPU memory */

        MPI_Barrier(MPI_COMM_WORLD);
        /* Ping-pong the device buffer between ranks 0 and 1. */
        if (rank == 0) {
                MPI_Send(dev_buf, elements, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(dev_buf, elements, MPI_FLOAT, 1, 0, MPI_COMM_WORLD, &st);
        } else if (rank == 1) {
                MPI_Recv(dev_buf, elements, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &st);
                MPI_Send(dev_buf, elements, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
        }
        MPI_Barrier(MPI_COMM_WORLD);

        cudaFree(dev_buf);
        MPI_Finalize();
        return 0;
}

Compilation:

mpicc -o gpu_test gpudirect_bug.c -lcudart -g

Command line:

mpirun -np 2 -bind-to core -mca pml ob1 -mca btl openib,self --map-by node -H vulcan03,vulcan02 --display-map -mca coll '^hcoll' -mca mtl '^mxm' -mca btl_openib_want_cuda_gdr 1 -mca btl_openib_cuda_rdma_limit 1000 gpu_test 1000

Stacktrace:

== vulcan02 == 24222
Thread 5 (Thread 0x7ffff42ff700 (LWP 24223)):
#0  0x00007ffff70b3d13 in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00007ffff6670b53 in epoll_dispatch (base=0x663840, tv=<optimized out>) at epoll.c:407
#2  0x00007ffff66745a0 in opal_libevent2022_event_base_loop (base=0x663840, flags=flags@entry=1) at event.c:1630
#3  0x00007ffff663445e in progress_engine (obj=<optimized out>) at runtime/opal_progress_threads.c:105
#4  0x00007ffff7384dc5 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007ffff70b373d in clone () from /usr/lib64/libc.so.6
Thread 4 (Thread 0x7ffff2c23700 (LWP 24224)):
#0  0x00007ffff70b3d13 in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00007ffff6670b53 in epoll_dispatch (base=0x68e5d0, tv=<optimized out>) at epoll.c:407
#2  0x00007ffff66745a0 in opal_libevent2022_event_base_loop (base=0x68e5d0, flags=flags@entry=1) at event.c:1630
#3  0x00007ffff32a321e in progress_engine (obj=<optimized out>) at runtime/pmix_progress_threads.c:109
#4  0x00007ffff7384dc5 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007ffff70b373d in clone () from /usr/lib64/libc.so.6
Thread 3 (Thread 0x7fffda73d700 (LWP 24240)):
#0  0x00007ffff70b4bcf in accept4 () from /usr/lib64/libc.so.6
#1  0x00007fffe8a28676 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#2  0x00007fffe8a1be9d in ?? () from /usr/lib64/nvidia/libcuda.so.1
#3  0x00007fffe8a29068 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#4  0x00007ffff7384dc5 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007ffff70b373d in clone () from /usr/lib64/libc.so.6
Thread 2 (Thread 0x7fffd9f3c700 (LWP 24241)):
#0  0x00007ffff70a8dfd in poll () from /usr/lib64/libc.so.6
#1  0x00007fffe8a277b3 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#2  0x00007fffe8a8af56 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#3  0x00007fffe8a29068 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#4  0x00007ffff7384dc5 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007ffff70b373d in clone () from /usr/lib64/libc.so.6
Thread 1 (Thread 0x7ffff7fad740 (LWP 24222)):
#0  0x00007fffe9491abe in progress_one_device (device=0x7ae600) at btl_openib_component.c:3689
#1  btl_openib_component_progress () at btl_openib_component.c:3765
#2  0x00007ffff662e574 in opal_progress () at runtime/opal_progress.c:222
#3  0x00007fffe81fc425 in ompi_request_wait_completion (req=<optimized out>) at ../../../../ompi/request/request.h:392
#4  mca_pml_ob1_recv (addr=0x10316000000, count=<optimized out>, datatype=0x6022e0 <ompi_mpi_float>, src=<optimized out>, tag=<optimized out>, comm=<optimized out>, status=0x7fffffffd580) at pml_ob1_irecv.c:135
#5  0x00007ffff75ff41c in PMPI_Recv (buf=<optimized out>, count=<optimized out>, type=<optimized out>, source=<optimized out>, tag=<optimized out>, comm=0x6020e0 <ompi_mpi_comm_world>, status=0x7fffffffd580) at precv.c:79
#6  0x00000000004011ac in main (argc=2, argv=0x7fffffffd6e8) at /labhome/avivkh/with_hpl/a.c:122
== vulcan03 == 9658
Thread 5 (Thread 0x7ffff42ff700 (LWP 9661)):
#0  0x00007ffff70b3d13 in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00007ffff6670b53 in epoll_dispatch (base=0x663860, tv=<optimized out>) at epoll.c:407
#2  0x00007ffff66745a0 in opal_libevent2022_event_base_loop (base=0x663860, flags=flags@entry=1) at event.c:1630
#3  0x00007ffff663445e in progress_engine (obj=<optimized out>) at runtime/opal_progress_threads.c:105
#4  0x00007ffff7384dc5 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007ffff70b373d in clone () from /usr/lib64/libc.so.6
Thread 4 (Thread 0x7ffff2c23700 (LWP 9662)):
#0  0x00007ffff70b3d13 in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00007ffff6670b53 in epoll_dispatch (base=0x68e5f0, tv=<optimized out>) at epoll.c:407
#2  0x00007ffff66745a0 in opal_libevent2022_event_base_loop (base=0x68e5f0, flags=flags@entry=1) at event.c:1630
#3  0x00007ffff32a321e in progress_engine (obj=<optimized out>) at runtime/pmix_progress_threads.c:109
#4  0x00007ffff7384dc5 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007ffff70b373d in clone () from /usr/lib64/libc.so.6
Thread 3 (Thread 0x7fffda73d700 (LWP 9675)):
#0  0x00007ffff70b4bcf in accept4 () from /usr/lib64/libc.so.6
#1  0x00007fffe8a28676 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#2  0x00007fffe8a1be9d in ?? () from /usr/lib64/nvidia/libcuda.so.1
#3  0x00007fffe8a29068 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#4  0x00007ffff7384dc5 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007ffff70b373d in clone () from /usr/lib64/libc.so.6
Thread 2 (Thread 0x7fffd9f3c700 (LWP 9676)):
#0  0x00007ffff70a8dfd in poll () from /usr/lib64/libc.so.6
#1  0x00007fffe8a277b3 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#2  0x00007fffe8a8af56 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#3  0x00007fffe8a29068 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#4  0x00007ffff7384dc5 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007ffff70b373d in clone () from /usr/lib64/libc.so.6
Thread 1 (Thread 0x7ffff7fad740 (LWP 9658)):
#0  0x00007ffff7389210 in pthread_spin_lock () from /usr/lib64/libpthread.so.0
#1  0x00007fffeaf01b24 in mlx5_poll_cq_1 () from /usr/lib64/libmlx5-rdmav2.so
#2  0x00007fffe9490e83 in ibv_poll_cq (wc=0x7fffffffa7e0, num_entries=<optimized out>, cq=<optimized out>) at /usr/include/infiniband/verbs.h:1458
#3  poll_device (device=device@entry=0x7a2f00, count=count@entry=0) at btl_openib_component.c:3608
#4  0x00007fffe9491c22 in progress_one_device (device=0x7a2f00) at btl_openib_component.c:3741
#5  btl_openib_component_progress () at btl_openib_component.c:3765
#6  0x00007ffff662e574 in opal_progress () at runtime/opal_progress.c:222
#7  0x00007fffe81fc425 in ompi_request_wait_completion (req=<optimized out>) at ../../../../ompi/request/request.h:392
#8  mca_pml_ob1_recv (addr=0x10316000000, count=<optimized out>, datatype=0x6022e0 <ompi_mpi_float>, src=<optimized out>, tag=<optimized out>, comm=<optimized out>, status=0x7fffffffdcd0) at pml_ob1_irecv.c:135
#9  0x00007ffff75ff41c in PMPI_Recv (buf=<optimized out>, count=<optimized out>, type=<optimized out>, source=<optimized out>, tag=<optimized out>, comm=0x6020e0 <ompi_mpi_comm_world>, status=0x7fffffffdcd0) at precv.c:79
#10 0x00000000004012e6 in main (argc=2, argv=0x7fffffffde38) at /labhome/avivkh/with_hpl/a.c:138
@alinask added the bug label Jul 27, 2017
@alinask added this to the v3.0.1 milestone Jul 27, 2017
@tgpfeiffer

Are there any updates on this issue? Open MPI 3.0 hangs for me with CUDA + IB where 2.1 works fine, so maybe it's the same problem.

@bwbarrett
Member

@jladd-mlnx, what's the recommended path for GPUDirect on Mellanox hardware in v3.0.x? Is it still OpenIB (and we should fix this issue) or is it UCX and we should figure out how to encourage a move to UCX?

@Akshay-Venkatesh
Contributor

@Felandric I was made aware of this bug about 20 mins ago. I've missed my assignment notification that happened a year ago. My apologies.

@bwbarrett I'm able to verify that the issue doesn't occur with UCX CUDA support:

int main(int argc, char** argv)
{
    MPI_Init(NULL,NULL);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Status st;
    float *dev_buf;
    size_t elements = atoi(argv[1]);
    size_t arraysize = elements * sizeof(float);
    if (rank == 0) {
        printf("Array size: %llu\n", arraysize);
    }

    cudaMalloc((void**)&dev_buf, arraysize);

    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 0) {
        MPI_Send(dev_buf, elements, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(dev_buf, elements, MPI_FLOAT, 1, 0, MPI_COMM_WORLD, &st);
    } else if (rank == 1) {
        MPI_Recv(dev_buf, elements, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &st);
        MPI_Send(dev_buf, elements, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
    }
    MPI_Barrier(MPI_COMM_WORLD);

    cudaFree(dev_buf);
    printf("%d: Test complete\n", rank);
    MPI_Finalize();
    return 0;
}

$ mpicc -o gpu_test gpudirect_bug.c -lcudart -g
$ mpirun -np 2 --hostfile $PWD/hostfile --mca pml ucx --mca btl ^vader,openib,sm,tcp -x TLS=rc_x,sm,cuda_copy,gdr_copy,cuda_ipc -x LD_LIBRARY_PATH ./gpu_test 1000
Array size: 4000
0: Test complete
1: Test complete

@Akshay-Venkatesh
Contributor

@Felandric
Trying with ob1 pml...

Hang case:
=========
$ mpirun -np 2 --hostfile $PWD/hostfile --mca btl vader,self,smcuda,openib  --mca btl_openib_warn_default_gid_prefix 0  --mca btl_smcuda_use_cuda_ipc_same_gpu 1  --mca btl_openib_want_cuda_gdr 1  --mca btl_openib_cuda_async_recv true --mca btl_smcuda_use_cuda_ipc 1 --mca btl_openib_allow_ib true --mca btl_openib_cuda_rdma_limit 1000 -x LD_LIBRARY_PATH ./gpu_test 1000
Array size: 4000

No-Hang case:
============
$ mpirun -np 2 --hostfile $PWD/hostfile --mca btl vader,self,smcuda,openib  --mca btl_openib_warn_default_gid_prefix 0  --mca btl_smcuda_use_cuda_ipc_same_gpu 1  --mca btl_openib_want_cuda_gdr 1  --mca btl_openib_cuda_async_recv false --mca btl_smcuda_use_cuda_ipc 1 --mca btl_openib_allow_ib true --mca btl_openib_cuda_rdma_limit 1000 -x LD_LIBRARY_PATH ./gpu_test 1000
Array size: 4000
0: Test complete
1: Test complete

So using --mca btl_openib_cuda_async_recv false does prevent the hang for me in this two-node test.
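
For anyone who cannot pass that flag on every mpirun invocation, the same default can also be set through Open MPI's MCA parameter file; a minimal sketch, assuming the per-user location $HOME/.openmpi/mca-params.conf:

# $HOME/.openmpi/mca-params.conf
# Work around the GPUDirect hang above the RDMA limit by disabling the
# asynchronous receive path in the openib btl.
btl_openib_cuda_async_recv = false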

@Akshay-Venkatesh
Contributor

@hppritcha I thought a decision was made a few months ago that --mca btl_openib_cuda_async_recv false would be the default, so that users wouldn't have to set it explicitly should they end up using the ob1 pml for CUDA-aware MPI. Is this still in the pipeline?

@jsquyres
Member

@Akshay-Venkatesh Are you referring to #4650?

Recall that openib is not the default for IB networks in the upcoming v4.0.0 and will likely go away in future releases.

@Akshay-Venkatesh
Contributor

@jsquyres I was indeed referring to that. I agree that this is of little consequence from v4.0.0 onwards, but I think btl_openib_cuda_async_recv should default to false for those who are still working with openib/smcuda. As the experiment above shows, using UCX solves the problem, but there are people who are still migrating, and fixing the default would be useful to them.

@jsquyres
Member

@hppritcha @bwbarrett Opinions for v2.x / v3.0.x / v3.1.x?

We're not going to do a new v2.x release for this, but we could commit it so that it's at least there (e.g., if anyone uses the nightly tarball).

@Akshay-Venkatesh
Contributor

@jsquyres This approach suffices for the time being. Admins usually install an Open MPI package on a cluster, so at least internally we can recommend picking up the nightly tarball.

@jsquyres
Member

Per 2018-09-18 webex, @Akshay-Venkatesh is going to make a PR for master + release branches to change the MCA var default value as described in #3972 (comment).

@jsquyres
Member

BTW, also per 2018-09-18 webex, it turns out that we are going to do another v2.1.x release (sigh) because of #5696. So I'm going to add the v2.x label to this issue, too.

bwbarrett added a commit to bwbarrett/ompi that referenced this issue Sep 19, 2018
Disable async receive for CUDA under OpenIB.  While a performance
optimization, it also causes incorrect results for transfers
larger than the GPUDirect RDMA limit.  This change has been validated
and approved by Akshay.

References open-mpi#3972

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
bwbarrett added a commit that referenced this issue Sep 20, 2018
Disable async receive for CUDA under OpenIB.  While a performance
optimization, it also causes incorrect results for transfers
larger than the GPUDirect RDMA limit.  This change has been validated
and approved by Akshay.

References #3972

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
bwbarrett added a commit to bwbarrett/ompi that referenced this issue Sep 20, 2018
Disable async receive for CUDA under OpenIB.  While a performance
optimization, it also causes incorrect results for transfers
larger than the GPUDirect RDMA limit.  This change has been validated
and approved by Akshay.

References open-mpi#3972

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
(cherry picked from commit 9344afd)
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
@bwbarrett
Member

Workaround is in master; pull requests for v2.1.x, v3.0.x, v3.1.x, and v4.0.x have been opened.

@bwbarrett
Member

Pull requests all merged. Closing.
