CUDA GPUDirect blocks on message size greater than GPUDirect RDMA limit with openib btl #3972

Closed
Felandric opened this issue Jul 27, 2017 · 13 comments

Felandric commented Jul 27, 2017

Background information

When CUDA GPUDirect is used to Send and Recv directly from GPU buffers with message sizes greater than the GPUDirect RDMA limit, Open MPI simply hangs instead of falling back to the expected behavior of staging the buffers through host memory.

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

v3.0.x

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Git clone.
Commands used:

./autogen.pl

./configure --prefix=/labhome/ompi/install --with-knem=/path/to/knem --with-mxm=/path/to/mxm --with-slurm --with-pmi --with-platform=contrib/platform/mellanox/optimized --with-hcoll=/path/to/hcoll --with-ucx=/path/to/ucx --with-cuda

make all install

Please describe the system on which you are running

  • Operating system/version: Red Hat Enterprise Linux Server 7.2 (Maipo) (3.10.0-327.el7.x86_64)
    CUDA distribution: 8.0
    CUDA GPUDirect Driver: Mellanox OFED GPUDirect RDMA

  • Computer hardware: 2 nodes, 48 cores/node, 2 GPUs per node
    CPUs: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
    GPUs: NVIDIA Tesla P100-PCIE-16GB

  • Network type:
    Infiniband


Details of the problem

GPUDirect is intended to improve latency and bandwidth by allowing RDMA transfers directly from GPU to GPU, bypassing the CPU. However, there is a message size limit on these RDMA transfers; above this limit, Open MPI is expected to stage the buffers through host memory with cudaMemcpy.
The limit can be changed with the parameter "-mca btl_openib_cuda_rdma_limit x", where x is the total message size in bytes.
Unfortunately, rather than behaving as expected, the program simply hangs in MPI_Recv.
According to this, the hang may occur because Open MPI uses blocking copies. The option "-mca mpi_common_cuda_cumemcpy_async 1" is meant to instruct Open MPI to use non-blocking, asynchronous copies when staging buffers through the host.
However, enabling this option changes nothing, and according to this, the option has been enabled by default since Open MPI v1.10.
I tried the same test with v1.10 and it completed successfully.
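
To confirm what a given build actually uses, these parameter defaults can be inspected with ompi_info. A minimal check, assuming the ompi_info from the installation under test is first in PATH (the output format varies between Open MPI versions):

$ ompi_info --all | grep -E "cumemcpy_async|cuda_rdma_limit"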

Below is gpudirect_bug.c, which replicates the behavior. argv[1] is taken as the number of floating-point elements in the send and receive buffers, so the actual array size is that multiplied by sizeof(float).
The btl_openib_cuda_rdma_limit parameter is a limit on the total message size; in my case an extra 56 bytes appeared to be used as a header, so a limit of 1000 bytes is actually a limit of 944 bytes of user data. This discrepancy is a minor inconvenience, but it should probably be corrected so that the parameter matches the user message length.

On the command line below I passed 1000, so the array was 4000 bytes; with the RDMA limit set to 1000 bytes the message was above the limit and the program hung. Message sizes below the limit behave as expected.
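
As a concrete sketch of that size accounting (the 56-byte header is simply the value I observed on this setup, not a documented constant; the program is illustrative only):

#include <stdio.h>

/* Illustrates the observed accounting: btl_openib_cuda_rdma_limit applies to
 * payload plus header, so the largest payload that still takes the GPUDirect
 * RDMA path is limit - header. */
int main(void)
{
        size_t rdma_limit = 1000;               /* -mca btl_openib_cuda_rdma_limit 1000 */
        size_t header = 56;                     /* observed per-message header */
        size_t payload = 1000 * sizeof(float);  /* ./gpu_test 1000 */

        printf("Effective GPUDirect RDMA payload limit: %zu bytes\n", rdma_limit - header);
        printf("Payload of %zu bytes -> %s\n", payload,
               payload <= rdma_limit - header ? "GPUDirect RDMA" : "host staging expected");
        return 0;
}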

gpudirect_bug.c:

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <cuda_runtime_api.h>
#include <mpi.h>
int main(int argc, char** argv)
{
        MPI_Init(NULL,NULL);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Status st;
        float *dev_buf;
        size_t elements = atoi(argv[1]);          /* number of float elements, from argv[1] */
        size_t arraysize = elements * sizeof(float);
        if (rank == 0) {
                printf("Array size: %zu\n", arraysize);
        }

        cudaMalloc((void**)&dev_buf, arraysize);  /* buffer is allocated in GPU memory */

        MPI_Barrier(MPI_COMM_WORLD);
        /* Ping-pong the device buffer between ranks 0 and 1. */
        if (rank == 0) {
                MPI_Send(dev_buf, elements, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(dev_buf, elements, MPI_FLOAT, 1, 0, MPI_COMM_WORLD, &st);
        } else if (rank == 1) {
                MPI_Recv(dev_buf, elements, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &st);
                MPI_Send(dev_buf, elements, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
        }
        MPI_Barrier(MPI_COMM_WORLD);

        cudaFree(dev_buf);
        MPI_Finalize();
        return 0;
}

Compilation:

mpicc -o gpu_test gpudirect_bug.c -lcudart -g

Command line:

mpirun -np 2 -bind-to core -mca pml ob1 -mca btl openib,self --map-by node -H vulcan03,vulcan02 --display-map -mca coll '^hcoll' -mca mtl '^mxm' -mca btl_openib_want_cuda_gdr 1 -mca btl_openib_cuda_rdma_limit 1000 gpu_test 1000

Stacktrace:

== vulcan02 == 24222
Thread 5 (Thread 0x7ffff42ff700 (LWP 24223)):
#0  0x00007ffff70b3d13 in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00007ffff6670b53 in epoll_dispatch (base=0x663840, tv=<optimized out>) at epoll.c:407
#2  0x00007ffff66745a0 in opal_libevent2022_event_base_loop (base=0x663840, flags=flags@entry=1) at event.c:1630
#3  0x00007ffff663445e in progress_engine (obj=<optimized out>) at runtime/opal_progress_threads.c:105
#4  0x00007ffff7384dc5 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007ffff70b373d in clone () from /usr/lib64/libc.so.6
Thread 4 (Thread 0x7ffff2c23700 (LWP 24224)):
#0  0x00007ffff70b3d13 in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00007ffff6670b53 in epoll_dispatch (base=0x68e5d0, tv=<optimized out>) at epoll.c:407
#2  0x00007ffff66745a0 in opal_libevent2022_event_base_loop (base=0x68e5d0, flags=flags@entry=1) at event.c:1630
#3  0x00007ffff32a321e in progress_engine (obj=<optimized out>) at runtime/pmix_progress_threads.c:109
#4  0x00007ffff7384dc5 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007ffff70b373d in clone () from /usr/lib64/libc.so.6
Thread 3 (Thread 0x7fffda73d700 (LWP 24240)):
#0  0x00007ffff70b4bcf in accept4 () from /usr/lib64/libc.so.6
#1  0x00007fffe8a28676 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#2  0x00007fffe8a1be9d in ?? () from /usr/lib64/nvidia/libcuda.so.1
#3  0x00007fffe8a29068 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#4  0x00007ffff7384dc5 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007ffff70b373d in clone () from /usr/lib64/libc.so.6
Thread 2 (Thread 0x7fffd9f3c700 (LWP 24241)):
#0  0x00007ffff70a8dfd in poll () from /usr/lib64/libc.so.6
#1  0x00007fffe8a277b3 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#2  0x00007fffe8a8af56 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#3  0x00007fffe8a29068 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#4  0x00007ffff7384dc5 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007ffff70b373d in clone () from /usr/lib64/libc.so.6
Thread 1 (Thread 0x7ffff7fad740 (LWP 24222)):
#0  0x00007fffe9491abe in progress_one_device (device=0x7ae600) at btl_openib_component.c:3689
#1  btl_openib_component_progress () at btl_openib_component.c:3765
#2  0x00007ffff662e574 in opal_progress () at runtime/opal_progress.c:222
#3  0x00007fffe81fc425 in ompi_request_wait_completion (req=<optimized out>) at ../../../../ompi/request/request.h:392
#4  mca_pml_ob1_recv (addr=0x10316000000, count=<optimized out>, datatype=0x6022e0 <ompi_mpi_float>, src=<optimized out>, tag=<optimized out>, comm=<optimized out>, status=0x7fffffffd580) at pml_ob1_irecv.c:135
#5  0x00007ffff75ff41c in PMPI_Recv (buf=<optimized out>, count=<optimized out>, type=<optimized out>, source=<optimized out>, tag=<optimized out>, comm=0x6020e0 <ompi_mpi_comm_world>, status=0x7fffffffd580) at precv.c:79
#6  0x00000000004011ac in main (argc=2, argv=0x7fffffffd6e8) at /labhome/avivkh/with_hpl/a.c:122
== vulcan03 == 9658
Thread 5 (Thread 0x7ffff42ff700 (LWP 9661)):
#0  0x00007ffff70b3d13 in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00007ffff6670b53 in epoll_dispatch (base=0x663860, tv=<optimized out>) at epoll.c:407
#2  0x00007ffff66745a0 in opal_libevent2022_event_base_loop (base=0x663860, flags=flags@entry=1) at event.c:1630
#3  0x00007ffff663445e in progress_engine (obj=<optimized out>) at runtime/opal_progress_threads.c:105
#4  0x00007ffff7384dc5 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007ffff70b373d in clone () from /usr/lib64/libc.so.6
Thread 4 (Thread 0x7ffff2c23700 (LWP 9662)):
#0  0x00007ffff70b3d13 in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00007ffff6670b53 in epoll_dispatch (base=0x68e5f0, tv=<optimized out>) at epoll.c:407
#2  0x00007ffff66745a0 in opal_libevent2022_event_base_loop (base=0x68e5f0, flags=flags@entry=1) at event.c:1630
#3  0x00007ffff32a321e in progress_engine (obj=<optimized out>) at runtime/pmix_progress_threads.c:109
#4  0x00007ffff7384dc5 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007ffff70b373d in clone () from /usr/lib64/libc.so.6
Thread 3 (Thread 0x7fffda73d700 (LWP 9675)):
#0  0x00007ffff70b4bcf in accept4 () from /usr/lib64/libc.so.6
#1  0x00007fffe8a28676 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#2  0x00007fffe8a1be9d in ?? () from /usr/lib64/nvidia/libcuda.so.1
#3  0x00007fffe8a29068 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#4  0x00007ffff7384dc5 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007ffff70b373d in clone () from /usr/lib64/libc.so.6
Thread 2 (Thread 0x7fffd9f3c700 (LWP 9676)):
#0  0x00007ffff70a8dfd in poll () from /usr/lib64/libc.so.6
#1  0x00007fffe8a277b3 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#2  0x00007fffe8a8af56 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#3  0x00007fffe8a29068 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#4  0x00007ffff7384dc5 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007ffff70b373d in clone () from /usr/lib64/libc.so.6
Thread 1 (Thread 0x7ffff7fad740 (LWP 9658)):
#0  0x00007ffff7389210 in pthread_spin_lock () from /usr/lib64/libpthread.so.0
#1  0x00007fffeaf01b24 in mlx5_poll_cq_1 () from /usr/lib64/libmlx5-rdmav2.so
#2  0x00007fffe9490e83 in ibv_poll_cq (wc=0x7fffffffa7e0, num_entries=<optimized out>, cq=<optimized out>) at /usr/include/infiniband/verbs.h:1458
#3  poll_device (device=device@entry=0x7a2f00, count=count@entry=0) at btl_openib_component.c:3608
#4  0x00007fffe9491c22 in progress_one_device (device=0x7a2f00) at btl_openib_component.c:3741
#5  btl_openib_component_progress () at btl_openib_component.c:3765
#6  0x00007ffff662e574 in opal_progress () at runtime/opal_progress.c:222
#7  0x00007fffe81fc425 in ompi_request_wait_completion (req=<optimized out>) at ../../../../ompi/request/request.h:392
#8  mca_pml_ob1_recv (addr=0x10316000000, count=<optimized out>, datatype=0x6022e0 <ompi_mpi_float>, src=<optimized out>, tag=<optimized out>, comm=<optimized out>, status=0x7fffffffdcd0) at pml_ob1_irecv.c:135
#9  0x00007ffff75ff41c in PMPI_Recv (buf=<optimized out>, count=<optimized out>, type=<optimized out>, source=<optimized out>, tag=<optimized out>, comm=0x6020e0 <ompi_mpi_comm_world>, status=0x7fffffffdcd0) at precv.c:79
#10 0x00000000004012e6 in main (argc=2, argv=0x7fffffffde38) at /labhome/avivkh/with_hpl/a.c:138
@alinask added the bug label Jul 27, 2017
@alinask added this to the v3.0.1 milestone Jul 27, 2017
@tgpfeiffer

Are there any updates on this issue? Open MPI 3.0 hangs for me with CUDA + IB where 2.1 works fine, so maybe it's the same problem.

@bwbarrett
Member

@jladd-mlnx, what's the recommended path for GPUDirect on Mellanox hardware in v3.0.x? Is it still OpenIB (and we should fix this issue) or is it UCX and we should figure out how to encourage a move to UCX?

@Akshay-Venkatesh
Contributor

@Felandric I was made aware of this bug about 20 mins ago. I've missed my assignment notification that happened a year ago. My apologies.

@bwbarrett I'm able to verify that the issue doesn't occur with UCX CUDA support:

int main(int argc, char** argv)
{
    MPI_Init(NULL,NULL);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Status st;
    float *dev_buf;
    size_t elements = atoi(argv[1]);
    size_t arraysize = elements * sizeof(float);
    if (rank == 0) {
        printf("Array size: %llu\n", arraysize);
    }

    cudaMalloc((void**)&dev_buf, arraysize);

    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 0) {
        MPI_Send(dev_buf, elements, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(dev_buf, elements, MPI_FLOAT, 1, 0, MPI_COMM_WORLD, &st);
    } else if (rank == 1) {
        MPI_Recv(dev_buf, elements, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &st);
        MPI_Send(dev_buf, elements, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
    }
    MPI_Barrier(MPI_COMM_WORLD);

    cudaFree(dev_buf);
    printf("%d: Test complete\n", rank);
    MPI_Finalize();
    return 0;
}

$ mpicc -o gpu_test gpudirect_bug.c -lcudart -g
$ mpirun -np 2 --hostfile $PWD/hostfile --mca pml ucx --mca btl ^vader,openib,sm,tcp -x TLS=rc_x,sm,cuda_copy,gdr_copy,cuda_ipc -x LD_LIBRARY_PATH ./gpu_test 1000
Array size: 4000
0: Test complete
1: Test complete

@Akshay-Venkatesh
Contributor

@Felandric
Trying with ob1 pml...

Hang case:
=========
$ mpirun -np 2 --hostfile $PWD/hostfile --mca btl vader,self,smcuda,openib  --mca btl_openib_warn_default_gid_prefix 0  --mca btl_smcuda_use_cuda_ipc_same_gpu 1  --mca btl_openib_want_cuda_gdr 1  --mca btl_openib_cuda_async_recv true --mca btl_smcuda_use_cuda_ipc 1 --mca btl_openib_allow_ib true --mca btl_openib_cuda_rdma_limit 1000 -x LD_LIBRARY_PATH ./gpu_test 1000
Array size: 4000

No-Hang case:
============
$ mpirun -np 2 --hostfile $PWD/hostfile --mca btl vader,self,smcuda,openib  --mca btl_openib_warn_default_gid_prefix 0  --mca btl_smcuda_use_cuda_ipc_same_gpu 1  --mca btl_openib_want_cuda_gdr 1  --mca btl_openib_cuda_async_recv false --mca btl_smcuda_use_cuda_ipc 1 --mca btl_openib_allow_ib true --mca btl_openib_cuda_rdma_limit 1000 -x LD_LIBRARY_PATH ./gpu_test 1000
Array size: 4000
0: Test complete
1: Test complete

So using --mca btl_openib_cuda_async_recv false does prevent the hang for me in this two-node test.
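
For anyone who cannot pass that flag on every mpirun invocation, the same default can also be set through Open MPI's MCA parameter file; a minimal sketch, assuming the per-user location $HOME/.openmpi/mca-params.conf:

# $HOME/.openmpi/mca-params.conf
# Work around the GPUDirect hang above the RDMA limit by disabling the
# asynchronous receive path in the openib btl.
btl_openib_cuda_async_recv = false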

@Akshay-Venkatesh
Contributor

@hppritcha I thought a decision was made a few months ago that --mca btl_openib_cuda_async_recv false would be the default, so that users wouldn't have to set it explicitly should they end up using the ob1 pml for CUDA-aware MPI. Is this still in the pipeline?

@jsquyres
Member

@Akshay-Venkatesh Are you referring to #4650?

Recall that openib is not the default for IB networks in the upcoming v4.0.0 and will likely go away in future releases.

@Akshay-Venkatesh
Contributor

@jsquyres I was indeed referring to that. I agree that this is of little consequence from v4.0.0 onwards, but I think btl_openib_cuda_async_recv should default to false for those who are still working with openib/smcuda. As the experiment above shows, using UCX solves the problem, but there are people who are still migrating, and fixing the default would be useful to them.

@jsquyres
Member

@hppritcha @bwbarrett Opinions for v2.x / v3.0.x / v3.1.x?

We're not going to do a new v2.x release for this, but we could commit it so that it's at least there (e.g., if anyone uses the nightly tarball).

@Akshay-Venkatesh
Contributor

@jsquyres This approach suffices for the time being. Admins usually install an Open MPI package on a cluster, so at least internally we can recommend picking up the nightly tarball.

@jsquyres
Member

Per 2018-09-18 webex, @Akshay-Venkatesh is going to make a PR for master + release branches to change the MCA var default value as described in #3972 (comment).

@jsquyres
Member

BTW, also per 2018-09-18 webex, it turns out that we are going to do another v2.1.x release (sigh) because of #5696. So I'm going to add the v2.x label to this issue, too.

bwbarrett added a commit to bwbarrett/ompi that referenced this issue Sep 19, 2018
Disable async receive for CUDA under OpenIB.  While a performance
optimization, it also causes incorrect results for transfers
larger than the GPUDirect RDMA limit.  This change has been validated
and approved by Akshay.

References open-mpi#3972

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
bwbarrett added a commit that referenced this issue Sep 20, 2018
Disable async receive for CUDA under OpenIB.  While a performance
optimization, it also causes incorrect results for transfers
larger than the GPUDirect RDMA limit.  This change has been validated
and approved by Akshay.

References #3972

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
bwbarrett added a commit to bwbarrett/ompi that referenced this issue Sep 20, 2018
Disable async receive for CUDA under OpenIB.  While a performance
optimization, it also causes incorrect results for transfers
larger than the GPUDirect RDMA limit.  This change has been validated
and approved by Akshay.

References open-mpi#3972

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
(cherry picked from commit 9344afd)
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
@bwbarrett
Member

Workaround is in master; pull requests for v2.1.x, v3.0.x, v3.1.x, and v4.0.x have been opened.

@bwbarrett
Member

Pull requests all merged. Closing.
