
OpenMPI hangs on MPI_Test with InfiniBand and high message rate #4863

Closed

Noxoomo opened this issue Feb 25, 2018 · 16 comments

@Noxoomo

Noxoomo commented Feb 25, 2018

Background information

I have a GPU-based application that uses MPI to transfer messages and data between several nodes.
It works fine on 1GB/s networks, but deadlocks when I switch to InfiniBand.

System description and OpenMPI details

I've reproduced the issue on two clusters with InfiniBand networks. I used several OpenMPI versions built from source for all runs (see details below).

The first cluster consists of dual-socket Intel servers with Ubuntu 12.04, NVIDIA GPUs, and Mellanox Technologies MT27500 Family [ConnectX-3] adapters for InfiniBand. On this cluster I tried OpenMPI 2.1.2 and OpenMPI 3.0.

The second cluster has dual-socket Intel servers with Ubuntu 16.04, NVIDIA GPUs, and Mellanox Technologies MT27700 Family [ConnectX-4] adapters for InfiniBand. This cluster is IPv6-only, so OpenMPI was built from master with several patches to fix IPv6 compatibility.

All MPI builds included CUDA support.
I can provide more information on request.


Details of the problem

My application sends a lot of small messages. All sends and receives are asynchronous (MPI_Isend, MPI_Irecv), and I call MPI_Test in a loop to check whether an operation has completed. On InfiniBand networks MPI_Test never "returns true" for an MPI_Isend whose message has already been received by the other host (communications in my application usually use unique tags, so I logged every receive and send request and every successful MPI_Test call; from the logs I saw that some MPI_Isend was received, but the sender was never notified about it). As a result, my application waits forever.

I checked the same code with MVAPICH2 and everything works fine.

My application is complex, so I've reproduced the same issue with much simpler code, attached below:


#include <mpi.h>

#include <iostream>
#include <memory>
#include <vector>

#define MPI_SAFE_CALL(cmd)                                                    \
   {                                                                          \
        int mpiErrNo = (cmd);                                                 \
        if (MPI_SUCCESS != mpiErrNo) {                                        \
            char msg[MPI_MAX_ERROR_STRING];                                   \
            int len;                                                          \
            MPI_Error_string(mpiErrNo, msg, &len);                            \
            std::cout << "MPI failed with error code :" << mpiErrNo           \
                                << " " << msg << std::endl;                   \
            MPI_Abort(MPI_COMM_WORLD, mpiErrNo);                              \
        }                                                                     \
    }


class TMpiRequest {
public:
    bool IsComplete() const {
        if (!Flag) {
            MPI_SAFE_CALL(MPI_Test(Request.get(), &Flag, &Status));
        }
        return static_cast<bool>(Flag);
    }


    TMpiRequest(TMpiRequest&& other) {
        if (this != &other) {
            this->Flag = other.Flag;
            this->Request.swap(other.Request);
            this->Status = other.Status;
            other.Clear();
        }
    }

    TMpiRequest& operator=(TMpiRequest&& other) {
        if (this != &other) {
            this->Flag = other.Flag;
            this->Request.swap(other.Request);
            this->Status = other.Status;
            other.Clear();
        }
        return *this;
    }

    TMpiRequest() {
    }

    ~TMpiRequest() {
    }

    TMpiRequest(std::unique_ptr<MPI_Request>&& request)
        : Flag(0)
        , Request(std::move(request)) {
        IsComplete();
    }

    void Clear() {
        Flag = -1;
        Request = nullptr;
    }

private:
    mutable int Flag = -1;
    std::unique_ptr<MPI_Request> Request;
    mutable MPI_Status Status;
};


TMpiRequest ReadAsync(char* data, int dataSize, int sourceRank, int tag) {
    std::unique_ptr<MPI_Request> request(new MPI_Request);
    MPI_SAFE_CALL(MPI_Irecv(data, dataSize, MPI_CHAR, sourceRank, tag, MPI_COMM_WORLD, request.get()));
    return {std::move(request)};
}


TMpiRequest WriteAsync(const char* data, int dataSize, int destRank, int tag) {
    std::unique_ptr<MPI_Request> request(new MPI_Request);
    MPI_SAFE_CALL(MPI_Issend(data, dataSize, MPI_CHAR, destRank, tag, MPI_COMM_WORLD, request.get()));
    return {std::move(request)};
}

int main(int argc, char** argv) {
    int providedLevel;
    int hostCount;
    int hostId;
    int threadLevel = MPI_THREAD_MULTIPLE;

    MPI_SAFE_CALL(MPI_Init_thread(&argc, &argv, threadLevel, &providedLevel));
    MPI_SAFE_CALL(MPI_Comm_size(MPI_COMM_WORLD, &hostCount));
    MPI_SAFE_CALL(MPI_Comm_rank(MPI_COMM_WORLD, &hostId));

    int tag = 0;
    const int otherRank = hostId == 0 ? 1 : 0;

    for (int i = 0; i < 100000; ++i) {
        std::vector<TMpiRequest> requests;
        const int batchSize = 256;//16 + random.NextUniformL() % 256;
        std::vector<std::vector<char> > data;

        for (int j = 0; j < batchSize; ++j) {
            ++tag;
            const bool isWrite = (2 * j < batchSize) == hostId;
            const int size = 127;//random.NextUniformL() % 4096;
            data.push_back(std::vector<char>());
            data.back().resize(size);
            if (isWrite) {
                requests.push_back(WriteAsync(data.back().data(), size, otherRank, tag));
            } else {
                requests.push_back(ReadAsync(data.back().data(), size, otherRank, tag));
            }
        }
        std::cout << "Send batch # " << i << " of size " << batchSize << std::endl;
        while (requests.size()) {
            std::vector<TMpiRequest> pending;
            for (auto& request : requests) {
                if (!request.IsComplete()) {
                    pending.push_back(std::move(request));
                }
            }
            requests.swap(pending);
        }
        std::cout << "Wait complete batch done " << batchSize << std::endl;
    }

    MPI_SAFE_CALL(MPI_Finalize());
}

Backtrace:

Thread 3 (Thread 0x7f285d984700 (LWP 424323)):
#0 0x00007f28612bfa13 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f2860961438 in epoll_dispatch (base=0x107402400, tv=) at epoll.c:407
#2 0x00007f28609647ff in opal_libevent2022_event_base_loop (base=0x107402400, flags=1) at event.c:1630
#3 0x00007f285ee7089e in progress_engine () from /usr/local/openmpi-git/lib/openmpi/mca_pmix_pmix3x.so
#4 0x00007f28615896ba in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#5 0x00007f28612bf41d in clone () from /lib/x86_64-linux-gnu/libc.so.6
Thread 2 (Thread 0x7f285f8c4700 (LWP 424322)):
#0 0x00007f28612b374d in poll () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f286096e0b8 in poll (__timeout=, __nfds=4, __fds=0x107102d00) at /usr/include/x86_64-linux-gnu/bits/poll2.h:46
#2 poll_dispatch (base=0x107400000, tv=) at poll.c:165
#3 0x00007f28609647ff in opal_libevent2022_event_base_loop (base=0x107400000, flags=1) at event.c:1630
#4 0x00007f286091d8fe in progress_engine () from /usr/local/openmpi-git/lib/libopen-pal.so.0
#5 0x00007f28615896ba in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#6 0x00007f28612bf41d in clone () from /lib/x86_64-linux-gnu/libc.so.6
Thread 1 (Thread 0x7f28620b7740 (LWP 424321)):
#0 0x00007f286158d4ee in pthread_mutex_unlock () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007f28589d1c42 in btl_openib_component_progress () from /usr/local/openmpi-git/lib/openmpi/mca_btl_openib.so
#2 0x00007f2860917d1c in opal_progress () from /usr/local/openmpi-git/lib/libopen-pal.so.0
#3 0x00007f28617e5c83 in ompi_request_default_test () from /usr/local/mpi/lib/libmpi.so.0
#4 0x00007f2861826fb9 in PMPI_Test () from /usr/local/mpi/lib/libmpi.so.0
#5 0x0000000000203068 in TMpiRequest::IsComplete (this=0x107021880) at /place/noxoomo/mini-arcadia/junk/noxoomo/failed_mpi/main.cpp:31
#6 0x0000000000202038 in main (argc=1, argv=0x7ffdb5c75f28) at /place/noxoomo/mini-arcadia/junk/noxoomo/failed_mpi/main.cpp:133

@bosilca
Member

bosilca commented Feb 26, 2018

I think this is related to issue #4795. PR #4852, once it is ready for prime time, might be the solution.

@jladd-mlnx
Member

@Noxoomo could you include your command line, please? How many threads?

@Noxoomo
Author

Noxoomo commented Feb 26, 2018

For the simplified example I just run /usr/local/mpi/bin/mpirun --prefix /usr/local/mpi/ --bind-to none --host host1,host2 binary
So 2 threads in total, one on each host.

In my application I use 1 thread per GPU device + 1 thread on each machine to route commands (I tried to serialize all calls to MPI functions via a global mutex, but this only reduced the frequency of the freezes). The issue was reproduced even with 1 active device per machine (so 2 hosts, each with 2 working threads).
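
For reference, the "global mutex" serialization was roughly of the shape sketched below (a minimal illustration with made-up wrapper names, not the actual application code): every MPI call is funneled through a single lock.

#include <mpi.h>
#include <mutex>

// Minimal sketch of the "global mutex" serialization (wrapper names are
// illustrative, not the real application code): every MPI call takes one lock.
static std::mutex GlobalMpiLock;

inline int LockedIsend(const void* buf, int count, MPI_Datatype type,
                       int dest, int tag, MPI_Comm comm, MPI_Request* req) {
    std::lock_guard<std::mutex> guard(GlobalMpiLock);
    return MPI_Isend(buf, count, type, dest, tag, comm, req);
}

inline int LockedTest(MPI_Request* req, int* flag, MPI_Status* status) {
    std::lock_guard<std::mutex> guard(GlobalMpiLock);
    return MPI_Test(req, flag, status);
}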

@thananon
Member

This might be because the openib BTL doesn't handle high injection rates very well.

@Noxoomo Can you try this?

  1. Try increasing the number of logical InfiniBand modules:
    /usr/local/mpi/bin/mpirun --prefix /usr/local/mpi/ --bind-to none --map-by node -mca btl_openib_btls_per_lid 8 --host host1,host2 binary
  2. Try running TCP over IB:
    /usr/local/mpi/bin/mpirun --prefix /usr/local/mpi/ --bind-to none --map-by node -mca btl tcp,vader,self --host host1,host2 binary

@Noxoomo
Author

Noxoomo commented Feb 27, 2018

  1. Try increasing the number of logical InfiniBand modules:
    /usr/local/mpi/bin/mpirun --prefix /usr/local/mpi/ --bind-to none --map-by node -mca btl_openib_btls_per_lid 8 --host host1,host2 binary

This does not help; it still hangs. The simple example from the first post hangs the same way.

  2. Try running TCP over IB:
    /usr/local/mpi/bin/mpirun --prefix /usr/local/mpi/ --bind-to none --map-by node -mca btl tcp,vader,self --host host1,host2 binary

This one does not hang with the example from the first post, but it might only work for the particular constants chosen for this test, so I'll check it in more detail later.

UPD: I built the application used for this check from the wrong revision; I will provide results about it later.

@jladd-mlnx
Member

@bureddy Could you check and see if this reproduces with the UCX GPU code? @Noxoomo Just FYI, the OpenIB BTL is being deprecated and will be replaced by the UCX PML.

@bureddy
Member

bureddy commented Feb 28, 2018

@jladd-mlnx the attached example code does not have any GPU-specific buffers.
I could not reproduce the same hang with either the OpenIB BTL or the UCX PML on my setup.

@Noxoomo can you give it a try with the UCX PML?

@Noxoomo
Author

Noxoomo commented Feb 28, 2018

@bureddy, @jladd-mlnx
Thanks for the suggestion, I'll check the UCX PML later this week.

Today I built OpenMPI from a fresh master with the latest commits (merged PR #4852) and I currently can't reproduce the hang. I'll run more stress tests later and will write about the results.

@Noxoomo
Author

Noxoomo commented Mar 3, 2018

I ran more tests with merged PR #4852:

  1. I can't reproduce hangs with the attached example now.

  2. I still have freezes in my application, but they have become much rarer: I rewrote the application logic so that all MPI calls are made by one single thread (see the sketch at the end of this comment). With this approach my application hung after 10 hours (more tests are still running).

  3. I checked UCX, but so far I have only been able to build the simple example, and this example works fine.

I currently can't run my application with OpenMPI + UCX; I get a crash with this stack trace:
Larger alignment are not guaranteed with this implementation
[zergling24:30065] *** Process received signal ***
[zergling24:30065] Signal: Aborted (6)
[zergling24:30065] Signal code: (-6)
[zergling24:30065] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0)[0x7f9333811cb0]
[zergling24:30065] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35)[0x7f9333477035]
[zergling24:30065] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x17b)[0x7f933347a79b]
[zergling24:30065] [ 3] /home/noxoomo/catboost[0x9228bd]
[zergling24:30065] [ 4] /home/noxoomo/catboost(memalign+0x52)[0x8f8572]
[zergling24:30065] [ 5] /usr/local/ucx/lib/libuct.so.0(uct_mem_alloc+0x144)[0x7f933013e3a4]

I assume this is caused by some compatibility issue with compilers and/or system libraries, but I haven't figured out what's going wrong yet.
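
For reference, the "single MPI thread" rewrite mentioned in item 2 follows roughly the sketch below (a minimal illustration with made-up names, not the actual application code): worker threads only enqueue tasks, and one dedicated thread executes them, so it is the only thread that ever calls MPI.

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

// Minimal sketch of the "single MPI thread" design (names are illustrative,
// not the actual application code). Worker threads enqueue tasks; only the
// dedicated thread runs them, so only it ever touches MPI.
class TMpiTaskThread {
public:
    TMpiTaskThread()
        : Worker([this] { Run(); }) {
    }

    ~TMpiTaskThread() {
        {
            std::lock_guard<std::mutex> guard(Lock);
            Stop = true;
        }
        CondVar.notify_one();
        Worker.join();
    }

    // Called from any thread, e.g. Enqueue([=] { /* MPI_Isend(...) */ });
    void Enqueue(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> guard(Lock);
            Tasks.push(std::move(task));
        }
        CondVar.notify_one();
    }

private:
    void Run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> guard(Lock);
                CondVar.wait(guard, [this] { return Stop || !Tasks.empty(); });
                if (Tasks.empty()) {
                    return; // Stop was requested and nothing is left to run.
                }
                task = std::move(Tasks.front());
                Tasks.pop();
            }
            task(); // The task body is the only place MPI functions are called.
        }
    }

    std::mutex Lock;
    std::condition_variable CondVar;
    std::queue<std::function<void()>> Tasks;
    bool Stop = false;
    std::thread Worker;
};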

@thananon
Member

thananon commented Mar 5, 2018

Thank you very much for reporting back.

I still have freezes in my application, but they have become much rarer: I rewrote the application logic so that all MPI calls are made by one single thread. With this approach my application hung after 10 hours (more tests are still running)

You should not have to move all comms to a single thread. This might still be the same problem that #4852 tries to fix. We would like to understand more about this hang, but it seems very difficult to track down (after 10 hours of running). When it hangs, can you get a stack trace for us?

@Noxoomo
Author

Noxoomo commented Mar 5, 2018

You should not have to move all comms to a single thread.

AFAIK, OpenMPI + UCX doesn't support MPI_THREAD_MULTIPLE (at least with the default build instructions).

BTW, a separate thread for all MPI calls fits my application design well; I just didn't have time to implement it before.
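
For what it's worth, with one dedicated communication thread the application only needs MPI_THREAD_FUNNELED, and the provided level can be checked explicitly at startup. A minimal sketch (the error handling is just illustrative):

#include <mpi.h>

#include <cstdio>

int main(int argc, char** argv) {
    // With a single dedicated MPI thread, MPI_THREAD_FUNNELED is sufficient.
    // Abort early if the library provides less than what was requested
    // (the MPI thread-level constants are ordered, so "<" is a valid check).
    int provided = MPI_THREAD_SINGLE;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided < MPI_THREAD_FUNNELED) {
        std::fprintf(stderr, "MPI provided thread level %d, need at least %d\n",
                     provided, MPI_THREAD_FUNNELED);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    // ... start the dedicated communication thread and the rest of the app ...
    MPI_Finalize();
    return 0;
}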

@Noxoomo
Author

Noxoomo commented Mar 5, 2018

We would like to understand more about this hang, but it seems very difficult to track down (after 10 hours of running). When it hangs, can you get a stack trace for us?

OK, I'll provide a stack trace on the next run. It'll be at the end of the week; GPUs are a scarce resource and I can't take them for too much time during working days.

@Noxoomo
Author

Noxoomo commented Mar 5, 2018

Just one more thing: I ran benchmarks with merged PR #4852 before I changed the logic to route comms through a single thread; there the hangs were also rare (occurring after several hours).

@Noxoomo
Author

Noxoomo commented Mar 20, 2018

@thananon

Just for your information: I haven't forgotten about the issue, but I currently can't use the machines for long enough to reproduce the problems.

@thananon
Member

Thank you for the update.

@thananon
Member

Closing due to no update. Can be reopened.
