mpi4py: Regression in alltoall with intercommunicators #12367

Closed · dalcinl opened this issue Feb 25, 2024 · 42 comments

@dalcinl
Contributor

dalcinl commented Feb 25, 2024

https://github.com/mpi4py/mpi4py-testing/actions/runs/8034284372/job/21945704643

testAlltoallInter (test_util_pkl5.TestMPIWorld.testAlltoallInter) ... python: pml_ob1_comm.c:51: mca_pml_ob1_comm_proc_destruct: Assertion `NULL == proc->frags_cant_match' failed.

BTW, I've submitted #12357 to re-enable multi-process runs of the mpi4py tests. I've just rebased that PR, but now the tests may fail in the same way I'm reporting here.

@jsquyres
Member

The specific error is:

testAlltoallInter (test_util_pkl5.TestMPIWorld.testAlltoallInter) ... python: pml_ob1_comm.c:51: mca_pml_ob1_comm_proc_destruct: Assertion `NULL == proc->frags_cant_match' failed.
[fv-az979-137:145684] *** Process received signal ***
[fv-az979-137:145684] Signal: Aborted (6)
[fv-az979-137:145684] Signal code:  (-6)

@jsquyres jsquyres added the bug label Feb 25, 2024
@dalcinl
Contributor Author

dalcinl commented Feb 25, 2024

The specific error is:

What do you mean? That's exactly what I reported in the description, right below the link to the full logs. Am I missing something?

@jsquyres
Member

That's exactly what I reported in the description, right below the link to the full logs. Am I missing something?

Errr... no, sorry. You didn't report the version of Open MPI, and I had to dig into the CI link to find it; I added the bug label and then copy-and-pasted the specific error out of the logs, mostly on auto-pilot (my eyes skipped right over the long line in your description and I missed it).

@hppritcha
Member

@dalcinl could you remind me which arg to main.py controls the set of tests which are run?

@hppritcha
Member

Okay, I figured that out. I'm not able to reproduce this problem running standalone.

@bosilca
Member

bosilca commented Mar 4, 2024

This error is only possible when out-of-sequence messages are left in a process's queue when the communicator is freed or when MPI_Finalize is called. This means there are messages that are unmatched and out-of-sequence, kind of suggesting an incorrect MPI application.

@dalcinl
Contributor Author

dalcinl commented Mar 6, 2024

This failing test is doing a series of short-message Python (pickle-based) comm.alltoall(...) calls on an intercomm in rapid succession. The test basically reads:

comm = make_intercomm(MPI.COMM_WORLD)
size = comm.Get_remote_size()
for smess in messages:
    rmess = comm.alltoall(None)
    assert rmess == [None]*size
    rmess = comm.alltoall([smess]*size)
    assert rmess == [smess]*size
    rmess = comm.alltoall(iter([smess]*size))
    assert rmess == [smess]*size
comm.Free()

The algorithm has not changed in 15+ years: 1) MPI_Alltoall of the counts, and 2) MPI_Alltoallv of the (pickled) byte streams.
https://github.com/mpi4py/mpi4py/blob/master/src/mpi4py/MPI/msgpickle.pxi#L908
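
For readers not familiar with mpi4py internals, a rough C sketch of that two-phase pattern looks like the following. This is an illustration of the communication pattern only, not mpi4py's actual code; buffer management is simplified.

#include <stdlib.h>
#include <string.h>
#include <mpi.h>

/* Phase 1: exchange the per-peer byte counts with MPI_Alltoall.
 * Phase 2: exchange the (pickled) byte streams with MPI_Alltoallv.
 * For an intercommunicator, the counts/displacements arrays are sized
 * by the remote group. */
static void alltoall_pickled(MPI_Comm comm, char *const sparts[], const int scounts[])
{
    int i, rsize, stotal = 0, rtotal = 0;
    MPI_Comm_remote_size(comm, &rsize);    /* MPI_Comm_size() for an intracomm */

    int *rcounts = malloc(rsize * sizeof(int));
    int *sdispls = malloc(rsize * sizeof(int));
    int *rdispls = malloc(rsize * sizeof(int));

    /* 1) sizes */
    MPI_Alltoall(scounts, 1, MPI_INT, rcounts, 1, MPI_INT, comm);

    for (i = 0; i < rsize; i++) {
        sdispls[i] = stotal; stotal += scounts[i];
        rdispls[i] = rtotal; rtotal += rcounts[i];
    }
    char *sendbuf = malloc(stotal > 0 ? stotal : 1);
    char *recvbuf = malloc(rtotal > 0 ? rtotal : 1);
    for (i = 0; i < rsize; i++)
        memcpy(sendbuf + sdispls[i], sparts[i], scounts[i]);

    /* 2) byte streams */
    MPI_Alltoallv(sendbuf, scounts, sdispls, MPI_BYTE,
                  recvbuf, rcounts, rdispls, MPI_BYTE, comm);

    /* ... unpickle recvbuf[rdispls[i] .. rdispls[i]+rcounts[i]] for each peer i ... */
    free(sendbuf); free(recvbuf);
    free(rcounts); free(sdispls); free(rdispls);
}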

This test has been working with MPICH/Intel MPI/Microsoft MPI on Linux/macOS/Windows. It never gave a problem with Open MPI v4.1.x, v5.0.x, or, until recently, the main branch. Moreover, the issue is not easy to reproduce, and the test even passes most of the time in GitHub Actions.

While I understand it is tempting to assume that the problem is on the application (that is, mpi4py) side, IMHO all the evidence points to the backend. A sequence of MPI_Alltoall+MPI_Alltoallv calls occurring simultaneously at all processes should not end up with unmatched messages, unless the MPI implementation is internally messing things up.

I'm indeed freeing the intercommunicator (with Free()) after the many rounds of MPI_Alltoall+MPI_Alltoallv calls. Maybe I should use Disconnect() instead; then, rather than an assertion failure from unmatched messages, we would end up in a deadlock.
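
For reference (these are plain MPI semantics, not something specific to this bug), the difference between the two cleanup calls is roughly the sketch below; one or the other would be used, not both.

/* MPI_Comm_free() only marks the communicator for deallocation and may
 * return while traffic using it is still pending, so an unmatched
 * out-of-sequence fragment surfaces later, e.g. as the destruct-time
 * assertion reported above. */
MPI_Comm_free(&intercomm);

/* MPI_Comm_disconnect() instead waits until all pending communication
 * on the communicator has completed before deallocating it, so the same
 * implementation-side bug would show up as a hang inside this call. */
MPI_Comm_disconnect(&intercomm);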

@ggouaillardet
Contributor

@dalcinl is there a command to run only the test_util_pkl5.TestMPIWorld.testAlltoallInter test?

also, what if you mpirun --mca coll ^han ...

@dalcinl
Contributor Author

dalcinl commented Mar 6, 2024

@dalcinl is there a command to run only the test_util_pkl5.TestMPIWorld.testAlltoallInter test?

The main entry point to run everything is test/main.py, but every test/test_*.py script can be run individually. In turn, there are a few different ways to run a specific test (all following Python's standard unittest infrastructure), but the most obvious/easy one probably is:

mpiexec -n 5 python test/main.py -v -k test_util_pkl5.TestMPIWorld.testAlltoallInter

PS: Run python test/main.py --help for all the options...

also, what if you mpirun --mca coll ^han ...

I'll try locally, but given that I cannot reproduce the issue that way, I'm not sure such an attempt will be really relevant.

@dalcinl
Contributor Author

dalcinl commented Mar 6, 2024

OK, now I can reproduce. I have a machine with 128 virtual cores.

If I run with 127 mpi4py processes it seems to work most of the time:

mpiexec -n 127 python test/main.py -v -k test_util_pkl5.TestMPIWorld.testAlltoallInter

but as soon as I ask for 128 MPI processes, the assertion triggers:

mpiexec -n 128 python test/main.py -v -k test_util_pkl5.TestMPIWorld.testAlltoallInter
...
[31@kw61149] numpy 1.26.4 (/home/dalcinl/.local/lib/python3.12/site-packages/numpy)
[31@kw61149] MPI 3.1 (Open MPI 5.1.0)
[31@kw61149] mpi4py 4.0.0.dev0 (/home/dalcinl/Devel/mpi4py/build/lib.linux-x86_64-cpython-312/mpi4py)
testAlltoallInter (test_util_pkl5.TestMPIWorld.testAlltoallInter) ... python: ../../../../../ompi/ompi/mca/pml/ob1/pml_ob1_comm.c:51: mca_pml_ob1_comm_proc_destruct: Assertion `NULL == proc->frags_cant_match' failed.
[kw61149:3686873] *** Process received signal ***
[kw61149:3686873] Signal: Aborted (6)
[kw61149:3686873] Signal code:  (-6)
[kw61149:3686873] [ 0] /lib64/libc.so.6(+0x3e9a0)[0x7f7eca2a39a0]
[kw61149:3686873] [ 1] /lib64/libc.so.6(+0x90834)[0x7f7eca2f5834]
[kw61149:3686873] [ 2] /lib64/libc.so.6(raise+0x1e)[0x7f7eca2a38ee]
[kw61149:3686873] [ 3] /lib64/libc.so.6(abort+0xdf)[0x7f7eca28b8ff]
[kw61149:3686873] [ 4] /lib64/libc.so.6(+0x2681b)[0x7f7eca28b81b]
[kw61149:3686873] [ 5] /lib64/libc.so.6(+0x36c57)[0x7f7eca29bc57]
[kw61149:3686873] [ 6] /home/devel/mpi/openmpi/main/lib/libmpi.so.0(+0x377734)[0x7f7ebb177734]
[kw61149:3686873] [ 7] /home/devel/mpi/openmpi/main/lib/libmpi.so.0(+0x3771b5)[0x7f7ebb1771b5]
[kw61149:3686873] [ 8] /home/devel/mpi/openmpi/main/lib/libmpi.so.0(+0x377ce5)[0x7f7ebb177ce5]
[kw61149:3686873] [ 9] /home/devel/mpi/openmpi/main/lib/libmpi.so.0(+0x3736a7)[0x7f7ebb1736a7]
[kw61149:3686873] [10] /home/devel/mpi/openmpi/main/lib/libmpi.so.0(mca_pml_ob1_del_comm+0xe1)[0x7f7ebb17584d]
[kw61149:3686873] [11] /home/devel/mpi/openmpi/main/lib/libmpi.so.0(+0x6345b)[0x7f7ebae6345b]
[kw61149:3686873] [12] /home/devel/mpi/openmpi/main/lib/libmpi.so.0(+0x6440c)[0x7f7ebae6440c]
[kw61149:3686873] [13] /home/devel/mpi/openmpi/main/lib/libmpi.so.0(ompi_comm_free+0x48e)[0x7f7ebae6a4fc]
[kw61149:3686873] [14] /home/devel/mpi/openmpi/main/lib/libmpi.so.0(PMPI_Comm_free+0x11d)[0x7f7ebaedf8cc]
[kw61149:3686873] [15] /home/dalcinl/Devel/mpi4py/build/lib.linux-x86_64-cpython-312/mpi4py/MPI.cpython-312-x86_64-linux-gnu.so(+0x14ae82)[0x7f7ebb54ae82]
[kw61149:3686873] [16] /home/dalcinl/Devel/mpi4py/build/lib.linux-x86_64-cpython-312/mpi4py/MPI.cpython-312-x86_64-linux-gnu.so(+0x14ae29)[0x7f7ebb54ae29]
[kw61149:3686873] [17] /home/dalcinl/Devel/mpi4py/build/lib.linux-x86_64-cpython-312/mpi4py/MPI.cpython-312-x86_64-linux-gnu.so(+0x338511)[0x7f7ebb738511]
[kw61149:3686873] [18] /lib64/libpython3.12.so.1.0(PyObject_Vectorcall+0x5c)[0x7f7ec9e0ae9c]
[kw61149:3686873] [19] /lib64/libpython3.12.so.1.0(+0x10fb3a)[0x7f7ec9d0fb3a]
[kw61149:3686873] [20] /lib64/libpython3.12.so.1.0(+0x234b7d)[0x7f7ec9e34b7d]
[kw61149:3686873] [21] /lib64/libpython3.12.so.1.0(+0x1105fb)[0x7f7ec9d105fb]
[kw61149:3686873] [22] /lib64/libpython3.12.so.1.0(_PyObject_FastCallDictTstate+0xa6)[0x7f7ec9df9b66]
[kw61149:3686873] [23] /lib64/libpython3.12.so.1.0(_PyObject_Call_Prepend+0x73)[0x7f7ec9e1e4d3]
[kw61149:3686873] [24] /lib64/libpython3.12.so.1.0(+0x2cb485)[0x7f7ec9ecb485]
[kw61149:3686873] [25] /lib64/libpython3.12.so.1.0(_PyObject_MakeTpCall+0x76)[0x7f7ec9df7466]
[kw61149:3686873] [26] /lib64/libpython3.12.so.1.0(+0x10fb3a)[0x7f7ec9d0fb3a]
[kw61149:3686873] [27] /lib64/libpython3.12.so.1.0(+0x234b7d)[0x7f7ec9e34b7d]
[kw61149:3686873] [28] /lib64/libpython3.12.so.1.0(+0x1105fb)[0x7f7ec9d105fb]
[kw61149:3686873] [29] /lib64/libpython3.12.so.1.0(_PyObject_FastCallDictTstate+0xa6)[0x7f7ec9df9b66]
[kw61149:3686873] *** End of error message ***
--------------------------------------------------------------------------
    This help section is empty because PRRTE was built without Sphinx.
--------------------------------------------------------------------------

The --mca ^han option is not working on my side, most likely because my option syntax is incorrect?

mpiexec -n 2 --mca ^han python test/main.py -v -k test_util_pkl5.TestMPIWorld.testAlltoallInter
--------------------------------------------------------------------------
    This help section is empty because PRRTE was built without Sphinx.
--------------------------------------------------------------------------

@ggouaillardet
Contributor

My bad, the correct syntax is:

mpirun --mca coll ^han ...

@dalcinl
Contributor Author

dalcinl commented Mar 6, 2024

mpirun --mca coll ^han ...

I figured it out, eventually... it does not fix the issue.

@ggouaillardet
Contributor

ggouaillardet commented Mar 6, 2024

I confirm I was able to reproduce the issue.

The inline reproducer below can be used to demonstrate it. In my environment, it crashes with 13 MPI tasks:

$ mpirun -np 13 -oversubscribe --mca btl tcp,self --mca pml ob1 ./alltoallinter
#include <mpi.h>
#include <stdio.h>

MPI_Comm intercomm;

int main(int argc, char *argv[]) {
    int rank, size, lrank, lsize, rsize;
    MPI_Comm local;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_split(MPI_COMM_WORLD, (rank<size/2)?0:1, 0, &local);
    MPI_Comm_rank(local, &lrank);
    MPI_Comm_size(local, &lsize);
    printf("I am %d: %d / %d\n", rank, lrank, lsize);
    MPI_Intercomm_create(local, 0, MPI_COMM_WORLD, (rank<size/2)?size/2:0, 0, &intercomm);
    MPI_Comm_remote_size(intercomm, &rsize);
    size_t sbuf[size], rbuf[size];
    MPI_Request reqs[size*2];
    MPI_Request *req = reqs;
    int i;
    for (i=0; i<rsize; i++, req++) {
        MPI_Irecv(rbuf+i, 1, MPI_COUNT, i, 0, intercomm, req);
    }
    for (i=0; i<rsize; i++, req++) {
        MPI_Isend(sbuf+i, 1, MPI_COUNT, i, 0, intercomm, req);
    }
    MPI_Waitall(rsize*2, reqs, MPI_STATUSES_IGNORE);
    
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Comm_free(&intercomm);
    MPI_Comm_free(&local);
    MPI_Finalize();
    return 0;
}

@hppritcha
Member

I still can't reproduce with either the above C code or the mpi4py test at 128 ranks.
I notice, however, that when using the UCX PML it emits these warnings:

[1709738640.598286] [st-master:755717:0]       tag_match.c:62   UCX  WARN  unexpected tag-receive descriptor 0x3fd8dfc0 was not matched
[1709738640.598305] [st-master:755717:0]       tag_match.c:62   UCX  WARN  unexpected tag-receive descriptor 0x3fd8df00 was not matched
[1709738640.598314] [st-master:755717:0]       tag_match.c:62   UCX  WARN  unexpected tag-receive descriptor 0x3fd8de40 was not matched
[1709738640.598321] [st-master:755717:0]       tag_match.c:62   UCX  WARN  unexpected tag-receive descriptor 0x3fd8dd80 was not matched
[1709738640.598326] [st-master:755717:0]       tag_match.c:62   UCX  WARN  unexpected tag-receive descriptor 0x3fd8dcc0 was not matched
[1709738640.598331] [st-master:755717:0]       tag_match.c:62   UCX  WARN  unexpected tag-receive descriptor 0x3fd8dc00 was not matched
[1709738640.598336] [st-master:755717:0]       tag_match.c:62   UCX  WARN  unexpected tag-receive descriptor 0x3fd8db40 was not matched
[1709738640.598341] [st-master:755717:0]       tag_match.c:62   UCX  WARN  unexpected tag-receive descriptor 0x3fd8da80 was not matched
[1709738640.598347] [st-master:755717:0]       tag_match.c:62   UCX  WARN  unexpected tag-receive descriptor 0x3fd8d9c0 was not matched
[1709738640.598414] [st-master:755723:0]       tag_match.c:62   UCX  WARN  unexpected tag-receive descriptor 0x2edfdfc0 was not matched
[1709738640.598431] [st-master:755723:0]       tag_match.c:62   UCX  WARN  unexpected tag-receive descriptor 0x2edfdf00 was not matched
[1709738640.598440] [st-master:755723:0]       tag_match.c:62   UCX  WARN  unexpected tag-receive descriptor 0x2edfde40 was not matched
[1709738640.598353] [st-master:755717:0]       tag_match.c:62   UCX  WARN  unexpected tag-receive descriptor 0x3fd8d900 was not matched

@wenduwan
Contributor

wenduwan commented Mar 6, 2024

Is this affecting the main branch only, or also 5.0?

@hppritcha
Member

I have not checked the 5.0.x branch, @wenduwan. UCX seems to be complaining that a receive that had been posted was never matched by the time the app called MPI_Finalize.

@hppritcha
Member

I'm not sure what @ggouaillardet's code snippet is supposed to be demonstrating. It's only doing a bunch of isends.

@ggouaillardet
Contributor

The program crashes in my environment (most of the time) with 13 tasks. My box only has 4 cores, so maybe oversubscription helps expose the issue.
I will complete the git bisect tomorrow.
Note that if I perform the same communications on MPI_COMM_WORLD, there is no issue.
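
For completeness, the "same communications on MPI_COMM_WORLD" variant is just the reproducer's exchange with the intercommunicator replaced by MPI_COMM_WORLD, roughly as sketched below (same caveats as the reproducer).

/* Same Irecv/Isend exchange as in the reproducer above, but on
 * MPI_COMM_WORLD; as noted, this intracommunicator variant does not
 * trigger the assertion. */
int i, size;
MPI_Comm_size(MPI_COMM_WORLD, &size);
size_t sbuf[size], rbuf[size];
MPI_Request reqs[2*size], *req = reqs;
for (i = 0; i < size; i++, req++) {
    MPI_Irecv(rbuf+i, 1, MPI_COUNT, i, 0, MPI_COMM_WORLD, req);
}
for (i = 0; i < size; i++, req++) {
    MPI_Isend(sbuf+i, 1, MPI_COUNT, i, 0, MPI_COMM_WORLD, req);
}
MPI_Waitall(2*size, reqs, MPI_STATUSES_IGNORE);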

@hppritcha
Member

Your example is just doing isends. Where are the matching receives?

@dalcinl
Contributor Author

dalcinl commented Mar 6, 2024

@wenduwan I have not seen the issue with v5.0.x on GHA so far. I tried locally with 128 and 256 MPI processes many times, and the test runs to completion without any issue.

@hppritcha I'm using the following configure options; note --enable-debug and --enable-mem-debug, plus no external libs:

options=(
    --prefix=/home/devel/mpi/openmpi/$branch
    --without-ofi
    --without-ucx
    --without-psm2
    --without-cuda
    --without-rocm
    --with-pmix=internal
    --with-prrte=internal
    --with-libevent=internal
    --with-hwloc=internal
    --enable-debug
    --enable-mem-debug
)

@hppritcha
Member

Yep, I'm using --enable-debug. I've not used --enable-mem-debug, but I'll add that.

@dalcinl
Contributor Author

dalcinl commented Mar 6, 2024

BTW, just to reinforce my point that I don't believe this issue is mpi4py's fault: I ran 128 and 256 processes with v4.1.x (same configure options as above), and all seems to be good.

@ggouaillardet
Contributor

Oops. The first call should be MPI_Irecv(); I will double check that. This was supposed to be an MPI_Alltoall() that can also be used to demonstrate the issue.

@bosilca
Member

bosilca commented Mar 6, 2024

I confirm, ob1 seems to be broken (independent of the BTL used). I altered @ggouaillardet's code above such that on deadlock one can call mca_pml_ob1_dump(intercomm, 1) in the debugger to get the state of the communications on the intercomm. There are unmatched receives, as if the sends were either not sent or were put into a local queue (such as the cant_match queue) and never extracted.

@dalcinl
Contributor Author

dalcinl commented Mar 6, 2024

In case it is useful, the first time I noticed the error was about 2 weeks ago while running commit cb00772.

@bosilca
Member

bosilca commented Mar 6, 2024

I don't think that particular commit can have any impact (none of our machines would be using the usnic BTL), but clearly we need to do some digging to find what broke the intercomm support or the PML.

@dalcinl
Contributor Author

dalcinl commented Mar 6, 2024

I don't think that particular commit can have any impact

Definitely. My whole point was to speed up the bisection a bit. The faulty commit is hopefully not much older than the one I reported above.

@ggouaillardet
Contributor

ggouaillardet commented Mar 7, 2024

FWIW, git bisect points to 23df181.

It seems some fixes were applied later, but they did not fix the issue we are discussing now.

@ggouaillardet
Contributor

Here is the reproducer I am now using:

I noted that if I uncomment the MPI_Barrier() after MPI_Intercomm_create(), the program succeeds.

#include <stdio.h>

#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, size, lrank, lsize, rsize;
    MPI_Comm local, inter;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_split(MPI_COMM_WORLD, (rank<size/2)?0:1, 0, &local);
    MPI_Comm_rank(local, &lrank);
    MPI_Comm_size(local, &lsize);
    printf("I am %d: %d / %d\n", rank, lrank, lsize);
    MPI_Intercomm_create(local, 0, MPI_COMM_WORLD, (rank<size/2)?size/2:0, 0, &inter);
    // MPI_Barrier(MPI_COMM_WORLD);
    MPI_Comm_remote_size(inter, &rsize);
    size_t sbuf[size], rbuf[size];
    MPI_Request reqs[size*2];
    MPI_Request *req = reqs;
    int i;
    for (i=0; i<rsize; i++, req++) {
        MPI_Irecv(rbuf+i, 1, MPI_COUNT, i, 0, inter, req);
    }
    for (i=0; i<rsize; i++, req++) {
        MPI_Isend(sbuf+i, 1, MPI_COUNT, i, 0, inter, req);
    }
    MPI_Waitall(rsize*2, reqs, MPI_STATUSES_IGNORE);
    
    MPI_Comm_free(&inter);
    MPI_Comm_free(&local);
    MPI_Finalize();
    return 0;
}

@ggouaillardet
Contributor

The difference between now and before is that there is no longer an allreduce on intercommunicators.
Based on the observation that MPI_Barrier() is enough to avoid the problem, a workaround is to re-introduce a (dummy) allreduce on intercommunicators. I do not understand exactly what is going wrong, but I guess allreduce() has a barrier() semantic that is somehow needed.
(The change in libnbc is to avoid a warning on malloc(0).)

diff --git a/ompi/communicator/comm_cid.c b/ompi/communicator/comm_cid.c
index be022b8..fa427c6 100644
--- a/ompi/communicator/comm_cid.c
+++ b/ompi/communicator/comm_cid.c
@@ -917,14 +917,14 @@ int ompi_comm_activate_nb (ompi_communicator_t **newcomm, ompi_communicator_t *c
          */
         ret = context->iallreduce_fn (&context->local_peers, &context->max_local_peers, 1, MPI_MAX, context,
                                       &subreq);
-        if (OMPI_SUCCESS != ret) {
-            ompi_comm_request_return (request);
-            return ret;
-        }
-        ompi_comm_request_schedule_append (request, ompi_comm_activate_nb_complete, &subreq, 1);
     } else {
-        ompi_comm_request_schedule_append (request, ompi_comm_activate_nb_complete, NULL, 0);
+        ret = context->iallreduce_fn (NULL, NULL, 0, MPI_MAX, context, &subreq);
+    }
+    if (OMPI_SUCCESS != ret) {
+        ompi_comm_request_return (request);
+        return ret;
     }
+    ompi_comm_request_schedule_append (request, ompi_comm_activate_nb_complete, &subreq, 1);
 
     ompi_comm_request_start (request);
 
diff --git a/ompi/mca/coll/libnbc/nbc_ireduce.c b/ompi/mca/coll/libnbc/nbc_ireduce.c
index 1f7464e..3052f2a 100644
--- a/ompi/mca/coll/libnbc/nbc_ireduce.c
+++ b/ompi/mca/coll/libnbc/nbc_ireduce.c
@@ -67,7 +67,7 @@ static int nbc_reduce_init(const void* sendbuf, void* recvbuf, int count, MPI_Da
   MPI_Aint ext;
   NBC_Schedule *schedule;
   char *redbuf=NULL, inplace;
-  void *tmpbuf;
+  void *tmpbuf = NULL;
   char tmpredbuf = 0;
   enum { NBC_RED_BINOMIAL, NBC_RED_CHAIN, NBC_RED_REDSCAT_GATHER} alg;
   ompi_coll_libnbc_module_t *libnbc_module = (ompi_coll_libnbc_module_t*) module;
@@ -126,25 +126,27 @@ static int nbc_reduce_init(const void* sendbuf, void* recvbuf, int count, MPI_Da
   }
 
   /* allocate temporary buffers */
-  if (alg == NBC_RED_REDSCAT_GATHER || alg == NBC_RED_BINOMIAL) {
-    if (rank == root) {
-      /* root reduces in receive buffer */
-      tmpbuf = malloc(span);
-      redbuf = recvbuf;
+  if (0 < span) {
+    if (alg == NBC_RED_REDSCAT_GATHER || alg == NBC_RED_BINOMIAL) {
+      if (rank == root) {
+        /* root reduces in receive buffer */
+        tmpbuf = malloc(span);
+        redbuf = recvbuf;
+      } else {
+        /* recvbuf may not be valid on non-root nodes */
+        ptrdiff_t span_align = OPAL_ALIGN(span, datatype->super.align, ptrdiff_t);
+        tmpbuf = malloc(span_align + span);
+        redbuf = (char *)span_align - gap;
+        tmpredbuf = 1;
+      }
     } else {
-      /* recvbuf may not be valid on non-root nodes */
-      ptrdiff_t span_align = OPAL_ALIGN(span, datatype->super.align, ptrdiff_t);
-      tmpbuf = malloc(span_align + span);
-      redbuf = (char *)span_align - gap;
-      tmpredbuf = 1;
+      tmpbuf = malloc (span);
+      segsize = 16384/2;
     }
-  } else {
-    tmpbuf = malloc (span);
-    segsize = 16384/2;
-  }
 
-  if (OPAL_UNLIKELY(NULL == tmpbuf)) {
-    return OMPI_ERR_OUT_OF_RESOURCE;
+    if (OPAL_UNLIKELY(0 < span && NULL == tmpbuf)) {
+      return OMPI_ERR_OUT_OF_RESOURCE;
+    }
   }
 
 #ifdef NBC_CACHE_SCHEDULE
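
At the application level, the same observation translates into the kind of workaround sketched below (this only masks the race rather than fixing it; remote_leader and tag stand for the values used in the reproducer).

/* Workaround sketch: synchronize the parent communicator right after
 * MPI_Intercomm_create() so that no rank can inject traffic on the new
 * context id before every peer has finished activating the
 * intercommunicator. */
MPI_Intercomm_create(local, 0, MPI_COMM_WORLD, remote_leader, tag, &inter);
MPI_Barrier(MPI_COMM_WORLD);   /* avoids the early-arrival problem */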

@dalcinl
Contributor Author

dalcinl commented Mar 8, 2024

@ggouaillardet Oh boy... this is the damn comm_cid stuff that has been causing grief elsewhere. Maybe that's the root of the problem: without the barrier, the whole thing is broken for intercomms. BTW, maybe a better fix would be to actually use an ibarrier?

@bosilca
Member

bosilca commented Mar 8, 2024

Let's not get ahead of the solution here. The comm_cid selection algorithm is one of the most critical pieces of communicator management; it has hundreds of thousands of tests run on it every night, and it performs exactly how it was designed.

What the last reduce (which in the case of the intercomm would give an answer we would not be able to use anyway) provides is extra synchronization preventing processes from sending messages too early. However, since this reduce is applied on the parent communicator, it does indeed behave as a barrier for the creation of the new intercomm, and as such it delays the arrival of messages until the local parts of the communicators are up and running (this is also confirmed by the fact that your error is related to the frags_cant_match list). We had protection against this, but for some reason it does not seem to work correctly.

@hppritcha
Member

I was finally able to reproduce (sometimes) the mpi4py alltoall hang (although in a different place - test_cco_obj_inter) by setting up a VM with 4 cores and running oversubscribed by 1.

@hppritcha
Member

If I revert commit 23df181 I'm not seeing a hang when running mpirun -np 5 --map-by :OVERSUBSCRIBE ./main.py in a loop.

@hppritcha
Member

hppritcha commented Apr 6, 2024

I am able to reproduce regularly with @ggouaillardet's test case and am looking into this.

@hppritcha hppritcha self-assigned this Apr 6, 2024
@hppritcha
Member

Okay, I figured out the problem here. There's either a design flaw in the way CIDs are being allocated, or else we really do need a barrier-like operation on the parent communicator used to create the new intercomm.

What I determined is that here in the OB1 PML a bogus communicator is being returned (because the parent comm was put as a placeholder in the ompi_mpi_communicators array).

Here's the code snippet in mca_pml_ob1_recv_frag_callback_match:

    /* communicator pointer */
    comm_ptr = ompi_comm_lookup(hdr->hdr_ctx);
    if(OPAL_UNLIKELY(NULL == comm_ptr)) {
        /* This is a special case. A message for a not yet existing
         * communicator can happens. Instead of doing a matching we
         * will temporarily add it the a pending queue in the PML.
         * Later on, when the communicator is completely instantiated,
         * this pending queue will be searched and all matching fragments
         * moved to the right communicator.
         */
        append_frag_to_list( &mca_pml_ob1.non_existing_communicator_pending,
                             btl, hdr, segments, num_segments, NULL );
        return;
    }
    comm = (mca_pml_ob1_comm_t *)comm_ptr->c_pml_comm;

The problem is the ompi_comm_lookup. The way the vector of communicator handles is being used in the cid alloc algorithm, the parent communicator is used as a "placeholder" in this array to help deal with the case of multiple threads trying to get a cid. BUT this communicator is not valid for use with the ctx in the hdr packet. So rather than the packet being routed on to the non_existing_communicator_pending queue, it gets put on the wrong mca_pml_ob1_comm_proc_t struct.
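
To make the failure mode concrete, here is a hypothetical variant of the snippet above in which a placeholder entry is treated the same as a missing communicator; the OMPI_COMM_IS_PLACEHOLDER() helper is invented purely for illustration and is not actual Open MPI code.

    /* Hypothetical illustration only: if the cid slot still holds a
     * placeholder rather than a fully activated communicator, park the
     * fragment on the pending queue exactly like the NULL case, instead
     * of matching it against the wrong communicator. */
    comm_ptr = ompi_comm_lookup(hdr->hdr_ctx);
    if(OPAL_UNLIKELY(NULL == comm_ptr || OMPI_COMM_IS_PLACEHOLDER(comm_ptr))) {
        append_frag_to_list( &mca_pml_ob1.non_existing_communicator_pending,
                             btl, hdr, segments, num_segments, NULL );
        return;
    }
    comm = (mca_pml_ob1_comm_t *)comm_ptr->c_pml_comm;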

Now, in comm_cid.c on the 4.1.x branch, here's the original code we had related to this:


int ompi_comm_activate_nb (ompi_communicator_t **newcomm, ompi_communicator_t *comm,
                           ompi_communicator_t *bridgecomm, const void *arg0,
                           const void *arg1, bool send_first, int mode, ompi_request_t **req)
{
    ompi_comm_cid_context_t *context;
    ompi_comm_request_t *request;
    ompi_request_t *subreq;
    int ret = 0;

    context = mca_comm_cid_context_alloc (*newcomm, comm, bridgecomm, arg0, arg1, "activate",
                                          send_first, mode);
    if (NULL == context) {
        return OMPI_ERR_OUT_OF_RESOURCE;
    }

    /* keep track of the pointer so it can be set to MPI_COMM_NULL on failure */
    context->newcommp = newcomm;
                                              
    request = ompi_comm_request_get ();       
    if (NULL == request) {
        OBJ_RELEASE(context);
        return OMPI_ERR_OUT_OF_RESOURCE;
    }

    request->context = &context->super;
    
    if (MPI_UNDEFINED != (*newcomm)->c_local_group->grp_my_rank) {
        /* Initialize the PML stuff in the newcomm  */
        if ( OMPI_SUCCESS != (ret = MCA_PML_CALL(add_comm(*newcomm))) ) {
            OBJ_RELEASE(*newcomm);
            OBJ_RELEASE(context);
            *newcomm = MPI_COMM_NULL;
            return ret;
        }
        OMPI_COMM_SET_PML_ADDED(*newcomm);
    }

    /* Step 1: the barrier, after which it is allowed to
     * send messages over the new communicator
     */
    ret = context->allreduce_fn (&context->ok, &context->ok, 1, MPI_MIN, context,
                                 &subreq);
    if (OMPI_SUCCESS != ret) {
        ompi_comm_request_return (request);
        return ret;
    }

    ompi_comm_request_schedule_append (request, ompi_comm_activate_nb_complete, &subreq, 1);
    ompi_comm_request_start (request);

    *req = &request->super;

    return OMPI_SUCCESS;
}

but now on main and 5.0.x it looks like this:

int ompi_comm_activate_nb (ompi_communicator_t **newcomm, ompi_communicator_t *comm,
                           ompi_communicator_t *bridgecomm, const void *arg0,
                           const void *arg1, bool send_first, int mode, ompi_request_t **req)
{
    ompi_comm_cid_context_t *context;
    ompi_comm_request_t *request;
    ompi_request_t *subreq;
    int ret = 0;

    /* the caller should not pass NULL for comm (it may be the same as *newcomm) */
    assert (NULL != comm);
    context = mca_comm_cid_context_alloc (*newcomm, comm, bridgecomm, arg0, arg1, "activate",
                                          send_first, mode);
    if (NULL == context) {
        return OMPI_ERR_OUT_OF_RESOURCE;
    }

    /* keep track of the pointer so it can be set to MPI_COMM_NULL on failure */
    context->newcommp = newcomm;

    request = ompi_comm_request_get ();
    if (NULL == request) {
        OBJ_RELEASE(context);
        return OMPI_ERR_OUT_OF_RESOURCE;
    }

    request->context = &context->super;

    if (MPI_UNDEFINED != (*newcomm)->c_local_group->grp_my_rank) {
        /* Initialize the PML stuff in the newcomm  */
        if ( OMPI_SUCCESS != (ret = MCA_PML_CALL(add_comm(*newcomm))) ) {
            OBJ_RELEASE(*newcomm);
            OBJ_RELEASE(context);
            *newcomm = MPI_COMM_NULL;
            return ret;
        }
        OMPI_COMM_SET_PML_ADDED(*newcomm);
    }

    if (OMPI_COMM_IS_INTRA(*newcomm)) {
        /**
         * The communicator's disjointness is inferred from max_local_peers.
         * Note: MPI_IN_PLACE cannot be used here because the parent could be an
         * inter-communicator
         */
        ret = context->iallreduce_fn (&context->local_peers, &context->max_local_peers, 1, MPI_MAX, context,
                                      &subreq);
        if (OMPI_SUCCESS != ret) {
            ompi_comm_request_return (request);
            return ret;
        }
        ompi_comm_request_schedule_append (request, ompi_comm_activate_nb_complete, &subreq, 1);
    } else {
        ompi_comm_request_schedule_append (request, ompi_comm_activate_nb_complete, NULL, 0);
    }

    ompi_comm_request_start (request);

    *req = &request->super;

    return ret;
}

It's unfortunate that the useful comment about the barrier was removed; it would have saved a week or so of debugging.
I think @bosilca was wrong in his assertion that a barrier is not needed in the cid alloc algorithm.

@wenduwan
Contributor

@hppritcha A minor correction - the barrier still exists on v5.0.x - it's only removed in main.

@hppritcha
Member

Ah, good point @wenduwan.

@bosilca
Member

bosilca commented Apr 11, 2024

The existence of the code protecting against early arrival of messages proves that I am correct in my assessment that a barrier is, or at least was at some point, not necessary. Indeed, if the barrier were necessary there would be no need for non_existing_communicator_pending. Which means that the rework of the communicator creation added the need for the barrier as a "cheaper" way of avoiding other issues. I say "cheaper" because we now need about 4 to 5 collectives in order to create a communicator, doubling the number of collectives needed previously.

Going back to your analysis that we put the parent communicator in the ompi_mpi_communicators array: I don't see any obvious reason to do so, or more precisely no valid reason to have to put the parent there. If instead we put the newly created communicator, i.e. context->newcomm instead of context->comm, we should be good, because if the cid allreduce fails to produce a valid cid nobody will send a message with that cid, and if it succeeded anywhere (which would justify receiving an early message) it will be added to the correct communicator.

hppritcha added a commit to hppritcha/ompi that referenced this issue Apr 11, 2024
A barrier was removed from the cid allocation algorithm
for intercommunicators that leads to random hangs when
doing MPI_Intercomm_create or moral equivalent.

See issue open-mpi#12367

the cid allocation algorithm and comm from cid lookup
was modified to use a pointer to ompi_mpi_comm_null as the placeholder
in the list of pointers to ompi commucators.

A convenience debug function was added to comm_cid.c to dump
out this table of communicators.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
wenduwan pushed a commit to hppritcha/ompi that referenced this issue Apr 11, 2024
A barrier was removed from the cid allocation algorithm
for intercommunicators that leads to random hangs when
doing MPI_Intercomm_create or moral equivalent.

See issue open-mpi#12367

the cid allocation algorithm and comm from cid lookup
was modified to use a pointer to ompi_mpi_comm_null as the placeholder
in the list of pointers to ompi commucators.

A convenience debug function was added to comm_cid.c to dump
out this table of communicators.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
hppritcha added a commit to hppritcha/ompi that referenced this issue Apr 12, 2024
A barrier was removed from the cid allocation algorithm
for intercommunicators that leads to random hangs when
doing MPI_Intercomm_create or moral equivalent.

See issue open-mpi#12367

the cid allocation algorithm and comm from cid lookup
was modified to use a pointer to ompi_mpi_comm_null as the placeholder
in the list of pointers to ompi commucators.

A convenience debug function was added to comm_cid.c to dump
out this table of communicators.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
hppritcha added a commit to hppritcha/ompi that referenced this issue Apr 12, 2024
A barrier was removed from the cid allocation algorithm
for intercommunicators that leads to random hangs when
doing MPI_Intercomm_create or moral equivalent.

See issue open-mpi#12367

the cid allocation algorithm and comm from cid lookup
was modified to use OMPI_COMM_SENTINEL as the placeholder
in the list of pointers to ompi commucators.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
@dalcinl
Contributor Author

dalcinl commented Apr 15, 2024

Guys, may I gently request adding the barrier back until a definitive and more performant solution is found? Otherwise, this issue keeps causing trouble when testing unrelated developments like #12384.

hppritcha added a commit to hppritcha/ompi that referenced this issue Apr 15, 2024
A barrier was removed from the cid allocation algorithm
for intercommunicators that leads to random hangs when
doing MPI_Intercomm_create or moral equivalent.

See issue open-mpi#12367

the cid allocation algorithm and comm from cid lookup
was modified to use OMPI_COMM_SENTINEL as the placeholder
in the list of pointers to ompi commucators.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
hppritcha added a commit to hppritcha/ompi that referenced this issue Apr 16, 2024
A barrier was removed from the cid allocation algorithm
for intercommunicators that leads to random hangs when
doing MPI_Intercomm_create or moral equivalent.

See issue open-mpi#12367

the cid allocation algorithm and comm from cid lookup
was modified to use OMPI_COMM_SENTINEL as the placeholder
in the list of pointers to ompi commucators.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
@hppritcha
Member

Fixed via #12465.

@dalcinl
Contributor Author

dalcinl commented Apr 18, 2024

Thanks, @hppritcha !!!

hppritcha added a commit to hppritcha/ompi that referenced this issue Apr 24, 2024
A barrier was removed from the cid allocation algorithm
for intercommunicators that leads to random hangs when
doing MPI_Intercomm_create or moral equivalent.

See issue open-mpi#12367

the cid allocation algorithm and comm from cid lookup
was modified to use OMPI_COMM_SENTINEL as the placeholder
in the list of pointers to ompi commucators.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
(cherry picked from commit 15a3d26)