
OpenMPI hangs on MPI_Test with InfiniBand and high message rate #4863

Closed

Noxoomo opened this issue Feb 25, 2018 · 16 comments

@Noxoomo

Noxoomo commented Feb 25, 2018

Background information

I have a GPU-based application that uses MPI to transfer messages and data between several nodes.
It works fine on 1GB/s networks, but deadlocks when I switch to InfiniBand.

System description and OpenMPI details

I've reproduced the issue on two clusters with InfiniBand networks. I used several OpenMPI versions built from source for all runs (see details below).

The first cluster consists of dual-socket Intel servers with Ubuntu 12.04, NVIDIA GPUs, and Mellanox Technologies MT27500 Family [ConnectX-3] adapters for InfiniBand. On this cluster I tried OpenMPI 2.1.2 and OpenMPI 3.0.

The second cluster has dual-socket Intel servers with Ubuntu 16.04, NVIDIA GPUs, and Mellanox Technologies MT27700 Family [ConnectX-4] adapters for InfiniBand. This cluster is IPv6-only, so OpenMPI was built from master with several patches to fix IPv6 compatibility.

All MPI builds included CUDA support.
I can provide more information on request.


Details of the problem

My application sends a lot of small messages. All sends and receives are asynchronous (MPI_Isend, MPI_Irecv), and I call MPI_Test in a loop to check whether an operation has completed. On InfiniBand networks MPI_Test never "returns true" for an MPI_Isend whose message has already been received by the other host (communications in my application usually use unique tags, so I logged every receive and send request and every successful MPI_Test call; from the logs I saw that some MPI_Isend was received, but the sender was never notified about it). As a result, my application waits forever.

I checked the same code with MVAPICH2 and everything works fine.

My application is complex, so I've reproduced the same issue with much simpler code, attached below:


#include <mpi.h>

#include <iostream>
#include <memory>
#include <vector>

#define MPI_SAFE_CALL(cmd)                                                    \
   {                                                                          \
        int mpiErrNo = (cmd);                                                 \
        if (MPI_SUCCESS != mpiErrNo) {                                        \
            char msg[MPI_MAX_ERROR_STRING];                                   \
            int len;                                                          \
            MPI_Error_string(mpiErrNo, msg, &len);                            \
            std::cout << "MPI failed with error code :" << mpiErrNo           \
                                << " " << msg << std::endl;                   \
            MPI_Abort(MPI_COMM_WORLD, mpiErrNo);                              \
        }                                                                     \
    }


class TMpiRequest {
public:
    bool IsComplete() const {
        if (!Flag) {
            MPI_SAFE_CALL(MPI_Test(Request.get(), &Flag, &Status));
        }
        return static_cast<bool>(Flag);
    }


    TMpiRequest(TMpiRequest&& other) {
        if (this != &other) {
            this->Flag = other.Flag;
            this->Request.swap(other.Request);
            this->Status = other.Status;
            other.Clear();
        }
    }

    TMpiRequest& operator=(TMpiRequest&& other) {
        if (this != &other) {
            this->Flag = other.Flag;
            this->Request.swap(other.Request);
            this->Status = other.Status;
            other.Clear();
        }
        return *this;
    }

    TMpiRequest() {
    }

    ~TMpiRequest() {
    }

    TMpiRequest(std::unique_ptr<MPI_Request>&& request)
        : Flag(0)
        , Request(std::move(request)) {
        IsComplete();
    }

    void Clear() {
        Flag = -1;
        Request = nullptr;
    }

private:
    mutable int Flag = -1;
    std::unique_ptr<MPI_Request> Request;
    mutable MPI_Status Status;
};


TMpiRequest ReadAsync(char* data, int dataSize, int sourceRank, int tag) {
    std::unique_ptr<MPI_Request> request(new MPI_Request);
    MPI_SAFE_CALL(MPI_Irecv(data, dataSize, MPI_CHAR, sourceRank, tag, MPI_COMM_WORLD, request.get()));
    return {std::move(request)};
}


TMpiRequest WriteAsync(const char* data, int dataSize, int destRank, int tag) {
    std::unique_ptr<MPI_Request> request(new MPI_Request);
    MPI_SAFE_CALL(MPI_Issend(data, dataSize, MPI_CHAR, destRank, tag, MPI_COMM_WORLD, request.get()));
    return {std::move(request)};
}

int main(int argc, char** argv) {
    int providedLevel;
    int hostCount;
    int hostId;
    int threadLevel = MPI_THREAD_MULTIPLE;

    MPI_SAFE_CALL(MPI_Init_thread(&argc, &argv, threadLevel, &providedLevel));
    MPI_SAFE_CALL(MPI_Comm_size(MPI_COMM_WORLD, &hostCount));
    MPI_SAFE_CALL(MPI_Comm_rank(MPI_COMM_WORLD, &hostId));

    int tag = 0;
    const int otherRank = hostId == 0 ? 1 : 0;

    for (int i = 0; i < 100000; ++i) {
        std::vector<TMpiRequest> requests;
        const int batchSize = 256;//16 + random.NextUniformL() % 256;
        std::vector<std::vector<char> > data;

        for (int j = 0; j < batchSize; ++j) {
            ++tag;
            const bool isWrite = (2 * j < batchSize) == hostId;
            const int size = 127;//random.NextUniformL() % 4096;
            data.push_back(std::vector<char>());
            data.back().resize(size);
            if (isWrite) {
                requests.push_back(WriteAsync(data.back().data(), size, otherRank, tag));
            } else {
                requests.push_back(ReadAsync(data.back().data(), size, otherRank, tag));
            }
        }
        std::cout << "Send batch # " << i << " of size " << batchSize << std::endl;
        while (requests.size()) {
            std::vector<TMpiRequest> pending;
            for (auto& request : requests) {
                if (!request.IsComplete()) {
                    pending.push_back(std::move(request));
                }
            }
            requests.swap(pending);
        }
        std::cout << "Wait complete batch done " << batchSize << std::endl;
    }

    MPI_SAFE_CALL(MPI_Finalize());
}

Backtrace:

Thread 3 (Thread 0x7f285d984700 (LWP 424323)):
#0 0x00007f28612bfa13 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f2860961438 in epoll_dispatch (base=0x107402400, tv=) at epoll.c:407
#2 0x00007f28609647ff in opal_libevent2022_event_base_loop (base=0x107402400, flags=1) at event.c:1630
#3 0x00007f285ee7089e in progress_engine () from /usr/local/openmpi-git/lib/openmpi/mca_pmix_pmix3x.so
#4 0x00007f28615896ba in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#5 0x00007f28612bf41d in clone () from /lib/x86_64-linux-gnu/libc.so.6
Thread 2 (Thread 0x7f285f8c4700 (LWP 424322)):
#0 0x00007f28612b374d in poll () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f286096e0b8 in poll (__timeout=, __nfds=4, __fds=0x107102d00) at /usr/include/x86_64-linux-gnu/bits/poll2.h:46
#2 poll_dispatch (base=0x107400000, tv=) at poll.c:165
#3 0x00007f28609647ff in opal_libevent2022_event_base_loop (base=0x107400000, flags=1) at event.c:1630
#4 0x00007f286091d8fe in progress_engine () from /usr/local/openmpi-git/lib/libopen-pal.so.0
#5 0x00007f28615896ba in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#6 0x00007f28612bf41d in clone () from /lib/x86_64-linux-gnu/libc.so.6
Thread 1 (Thread 0x7f28620b7740 (LWP 424321)):
#0 0x00007f286158d4ee in pthread_mutex_unlock () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007f28589d1c42 in btl_openib_component_progress () from /usr/local/openmpi-git/lib/openmpi/mca_btl_openib.so
#2 0x00007f2860917d1c in opal_progress () from /usr/local/openmpi-git/lib/libopen-pal.so.0
#3 0x00007f28617e5c83 in ompi_request_default_test () from /usr/local/mpi/lib/libmpi.so.0
#4 0x00007f2861826fb9 in PMPI_Test () from /usr/local/mpi/lib/libmpi.so.0
#5 0x0000000000203068 in TMpiRequest::IsComplete (this=0x107021880) at /place/noxoomo/mini-arcadia/junk/noxoomo/failed_mpi/main.cpp:31
#6 0x0000000000202038 in main (argc=1, argv=0x7ffdb5c75f28) at /place/noxoomo/mini-arcadia/junk/noxoomo/failed_mpi/main.cpp:133

@bosilca
Member

bosilca commented Feb 26, 2018

I think this is related to issue #4795. PR #4852, once it is ready for prime time, might be the solution.

@jladd-mlnx
Member

@Noxoomo could you include your command line, please? How many threads?

@Noxoomo
Author

Noxoomo commented Feb 26, 2018

For the simplified example I just run /usr/local/mpi/bin/mpirun --prefix /usr/local/mpi/ --bind-to none --host host1,host2 binary
So 2 threads in total, one on each host.

In my application I use 1 thread per GPU device + 1 thread on each machine to route commands (I tried to serialize all calls to MPI functions via a global mutex, but this only reduced the frequency of the freezes). The issue was reproduced even with 1 active device per machine (so 2 hosts, each with 2 working threads).
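
For reference, the "global mutex" serialization was roughly of the shape sketched below (a minimal illustration with made-up wrapper names, not the actual application code): every MPI call is funneled through a single lock.

#include <mpi.h>
#include <mutex>

// Minimal sketch of the "global mutex" serialization (wrapper names are
// illustrative, not the real application code): every MPI call takes one lock.
static std::mutex GlobalMpiLock;

inline int LockedIsend(const void* buf, int count, MPI_Datatype type,
                       int dest, int tag, MPI_Comm comm, MPI_Request* req) {
    std::lock_guard<std::mutex> guard(GlobalMpiLock);
    return MPI_Isend(buf, count, type, dest, tag, comm, req);
}

inline int LockedTest(MPI_Request* req, int* flag, MPI_Status* status) {
    std::lock_guard<std::mutex> guard(GlobalMpiLock);
    return MPI_Test(req, flag, status);
}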

@thananon
Member

This might be because the openib BTL doesn't handle high injection rates very well.

@Noxoomo Can you try this?

  1. Try increasing the number of logical InfiniBand modules:
    /usr/local/mpi/bin/mpirun --prefix /usr/local/mpi/ --bind-to none --map-by node -mca btl_openib_btls_per_lid 8 --host host1,host2 binary
  2. Try running TCP over IB:
    /usr/local/mpi/bin/mpirun --prefix /usr/local/mpi/ --bind-to none --map-by node -mca btl tcp,vader,self --host host1,host2 binary

@Noxoomo
Author

Noxoomo commented Feb 27, 2018

  1. Try increasing the number of logical InfiniBand modules:
    /usr/local/mpi/bin/mpirun --prefix /usr/local/mpi/ --bind-to none --map-by node -mca btl_openib_btls_per_lid 8 --host host1,host2 binary

This does not help; it still hangs. The simple example from the first post hangs the same way.

  2. Try running TCP over IB:
    /usr/local/mpi/bin/mpirun --prefix /usr/local/mpi/ --bind-to none --map-by node -mca btl tcp,vader,self --host host1,host2 binary

This one does not hang with the example from the first post, but it might only work for the particular constants chosen for this test, so I'll check it in more detail later.

UPD: I built the application used for this check from the wrong revision; I will provide results about it later.

@jladd-mlnx
Member

@bureddy Could you check and see if this reproduces with the UCX GPU code? @Noxoomo Just FYI, the OpenIB BTL is being deprecated and will be replaced by the UCX PML.

@bureddy
Member

bureddy commented Feb 28, 2018

@jladd-mlnx the attached example code does not have any GPU-specific buffers.
I could not reproduce the same hang with either the OpenIB BTL or the UCX PML on my setup.

@Noxoomo can you give it a try with the UCX PML?

@Noxoomo
Author

Noxoomo commented Feb 28, 2018

@bureddy, @jladd-mlnx
Thanks for the suggestion, I'll check the UCX PML later this week.

Today I built OpenMPI from a fresh master with the latest commits (merged PR #4852) and I currently can't reproduce the hang. I'll run more stress tests later and will write about the results.

@Noxoomo
Author

Noxoomo commented Mar 3, 2018

I ran more tests with merged PR #4852:

  1. I can't reproduce hangs with the attached example now.

  2. I still have freezes in my application, but they have become much rarer: I rewrote the application logic so that all MPI calls are made by one single thread (see the sketch at the end of this comment). With this approach my application hung after 10 hours (more tests are still running).

  3. I checked UCX, but so far I have only been able to build the simple example, and this example works fine.

I currently can't run my application with OpenMPI + UCX; I get a crash with this stack trace:
Larger alignment are not guaranteed with this implementation
[zergling24:30065] *** Process received signal ***
[zergling24:30065] Signal: Aborted (6)
[zergling24:30065] Signal code: (-6)
[zergling24:30065] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0)[0x7f9333811cb0]
[zergling24:30065] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35)[0x7f9333477035]
[zergling24:30065] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x17b)[0x7f933347a79b]
[zergling24:30065] [ 3] /home/noxoomo/catboost[0x9228bd]
[zergling24:30065] [ 4] /home/noxoomo/catboost(memalign+0x52)[0x8f8572]
[zergling24:30065] [ 5] /usr/local/ucx/lib/libuct.so.0(uct_mem_alloc+0x144)[0x7f933013e3a4]

I assume this is caused by some compatibility issue with compilers and/or system libraries, but I haven't figured out what's going wrong yet.
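
For reference, the "single MPI thread" rewrite mentioned in item 2 follows roughly the sketch below (a minimal illustration with made-up names, not the actual application code): worker threads only enqueue tasks, and one dedicated thread executes them, so it is the only thread that ever calls MPI.

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

// Minimal sketch of the "single MPI thread" design (names are illustrative,
// not the actual application code). Worker threads enqueue tasks; only the
// dedicated thread runs them, so only it ever touches MPI.
class TMpiTaskThread {
public:
    TMpiTaskThread()
        : Worker([this] { Run(); }) {
    }

    ~TMpiTaskThread() {
        {
            std::lock_guard<std::mutex> guard(Lock);
            Stop = true;
        }
        CondVar.notify_one();
        Worker.join();
    }

    // Called from any thread, e.g. Enqueue([=] { /* MPI_Isend(...) */ });
    void Enqueue(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> guard(Lock);
            Tasks.push(std::move(task));
        }
        CondVar.notify_one();
    }

private:
    void Run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> guard(Lock);
                CondVar.wait(guard, [this] { return Stop || !Tasks.empty(); });
                if (Tasks.empty()) {
                    return; // Stop was requested and nothing is left to run.
                }
                task = std::move(Tasks.front());
                Tasks.pop();
            }
            task(); // The task body is the only place MPI functions are called.
        }
    }

    std::mutex Lock;
    std::condition_variable CondVar;
    std::queue<std::function<void()>> Tasks;
    bool Stop = false;
    std::thread Worker;
};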

@thananon
Member

thananon commented Mar 5, 2018

Thank you very much for reporting back.

I still have freezes in my application, but they have become much rarer: I rewrote the application logic so that all MPI calls are made by one single thread. With this approach my application hung after 10 hours (more tests are still running)

You should not have to move all comms to a single thread. This might still be the same problem that #4852 tries to fix. We would like to understand more about this hang, but it seems very difficult to track down (after 10 hours of running). When it hangs, can you get a stack trace for us?

@Noxoomo
Author

Noxoomo commented Mar 5, 2018

You should not have to move all comms to a single thread.

AFAIK, OpenMPI + UCX doesn't support MPI_THREAD_MULTIPLE (at least with the default build instructions).

BTW, a separate thread for all MPI calls fits my application design well; I just didn't have time to implement it before.
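
For what it's worth, with one dedicated communication thread the application only needs MPI_THREAD_FUNNELED, and the provided level can be checked explicitly at startup. A minimal sketch (the error handling is just illustrative):

#include <mpi.h>

#include <cstdio>

int main(int argc, char** argv) {
    // With a single dedicated MPI thread, MPI_THREAD_FUNNELED is sufficient.
    // Abort early if the library provides less than what was requested
    // (the MPI thread-level constants are ordered, so "<" is a valid check).
    int provided = MPI_THREAD_SINGLE;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided < MPI_THREAD_FUNNELED) {
        std::fprintf(stderr, "MPI provided thread level %d, need at least %d\n",
                     provided, MPI_THREAD_FUNNELED);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    // ... start the dedicated communication thread and the rest of the app ...
    MPI_Finalize();
    return 0;
}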

@Noxoomo
Author

Noxoomo commented Mar 5, 2018

We would like to understand more about this hang, but it seems very difficult to track down (after 10 hours of running). When it hangs, can you get a stack trace for us?

OK, I'll provide a stack trace on the next run. It'll be at the end of the week; GPUs are a scarce resource and I can't take them for too much time during working days.

@Noxoomo
Author

Noxoomo commented Mar 5, 2018

Just one more thing: I ran benchmarks with merged PR #4852 before I changed the logic to route comms through a single thread; there the hangs were also rare (occurring after several hours).

@Noxoomo
Author

Noxoomo commented Mar 20, 2018

@thananon

Just for your information: I haven't forgotten about the issue, but I currently can't use the machines for long enough to reproduce the problems.

@thananon
Member

Thank you for the update.

@thananon
Member

Closing due to no update. Can be reopened.
